# Project 5 – Ensemble Machine Learning (Wine Quality)
### Blessing Aganaga

This project uses the red wine quality dataset from the UCI Machine Learning Repository.
Ensemble methods include Random Forest, AdaBoost, Gradient Boosting, Voting, and Bagging.
This notebook trains two of them, Random Forest and Gradient Boosting, and compares how well
each generalizes to unseen data.

The target variable is simplified into **low**, **medium**, and **high** quality categories.


In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## Section 1: Load and Inspect the Data

In this section, I load the wine quality dataset and display the first few rows.  
The dataset contains 1,599 wines, each with 11 physicochemical features and a quality rating on a 0 to 10 scale.


In [23]:
df = pd.read_csv("winequality-red.csv", sep=";")
df.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


## Section 2: Prepare the Data

I convert the original wine quality score into two new columns:

- `quality_label` → text labels (low, medium, high)
- `quality_numeric` → numeric categories (0, 1, 2)

These labels make it easier to train classification models.


In [24]:
# Create text labels for wine quality
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

df["quality_label"] = df["quality"].apply(quality_to_label)

# Create numeric labels for modeling (0 = low, 1 = medium, 2 = high)
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

df["quality_numeric"] = df["quality"].apply(quality_to_number)

df[["quality", "quality_label", "quality_numeric"]].head()


Unnamed: 0,quality,quality_label,quality_numeric
0,5,medium,1
1,5,medium,1
2,5,medium,1
3,6,medium,1
4,5,medium,1
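
The two helper functions above apply the same thresholds twice. As a possible alternative, the sketch below does the binning once with `pd.cut` and derives the numeric codes from the resulting categorical column. The small `scores` series is made-up illustration data, not rows from the dataset, and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores for illustration only (not taken from the dataset)
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf] -> high
labels = pd.cut(
    scores,
    bins=[-np.inf, 4, 6, np.inf],
    labels=["low", "medium", "high"],
)

# Codes of the ordered categorical give 0 = low, 1 = medium, 2 = high
numeric = labels.cat.codes

demo = pd.DataFrame(
    {"quality": scores, "quality_label": labels, "quality_numeric": numeric}
)
print(demo)
print(demo["quality_label"].value_counts())
```

Printing `value_counts()` on the real `quality_label` column at this point would also show how imbalanced the three classes are before any model is trained.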


## Section 3: Feature Selection

I use all 11 physicochemical measurements as input features (X).  
The `quality_numeric` column is used as the target (y).  
I drop `quality` and `quality_label` from X as well, because both encode the target directly and would leak it into the features.


In [25]:
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])
y = df["quality_numeric"]

X.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


## Section 4: Train/Test Split

I split the data into 80% training and 20% testing using `train_test_split` with
`stratify=y`. Stratification keeps the proportions of the low, medium, and high quality
classes as close as possible in the training and testing sets.

This matters because the classes are imbalanced: without stratification, a random split
could leave the test set with too few wines from the rarest class, which would make the
evaluation unreliable.


In [26]:
from sklearn.model_selection import train_test_split

# Stratified split so class ratios stay the same in train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training data shape:", X_train.shape)
print("Test data shape:", X_test.shape)


Training data shape: (1279, 11)
Test data shape: (320, 11)
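
To confirm that stratification worked, the class shares in each split can be compared side by side. The sketch below uses a synthetic, imbalanced label series (an assumption for illustration only); in this notebook the same check would be run on `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels, for illustration only: 5% low, 80% medium, 15% high
y_demo = pd.Series([0] * 50 + [1] * 800 + [2] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions per split; with stratify they should be nearly identical
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
print(shares)
```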


## Section 5: Ensemble Model Evaluation

In this section, I evaluate different ensemble models using accuracy, F1 score, and
the confusion matrix.

I use a helper function `evaluate_model()` to train each model and store the results
in a list for easy comparison later. This keeps the notebook clean and consistent.

I will evaluate **two chosen models** for the project, but I may test more for comparison.


In [27]:
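from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    """Fit `model`, print its test confusion matrix and train/test metrics, and append them to `results`."""
    # Fit on the training split only; the test split is used purely for evaluation
    model.fit(X_train, y_train)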

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )


## Section 6: Model Results Comparison

In this section, I evaluate the performance of my selected ensemble models and summarize
their metrics in a results table. I include gap values between train and test accuracy and
F1 score to check for overfitting. The table is sorted by Test Accuracy to make it easier
to identify the best-performing model.


In [28]:
results = []


In [29]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


In [30]:
# Model 1: Random Forest (200, max_depth=10)
evaluate_model(
    "Random Forest (200, max_depth=10)",
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    X_train, y_train, X_test, y_test, results
)

# Model 2: Gradient Boosting (100)
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train, y_train, X_test, y_test, results
)



Random Forest (200, max_depth=10) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 255   9]
 [  0  16  27]]
Train Accuracy: 0.9758, Test Accuracy: 0.8812
Train F1 Score: 0.9745, Test F1 Score: 0.8596

Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411


In [31]:
results_df = pd.DataFrame(results)

results_df["Acc Gap (Train-Test)"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap (Train-Test)"] = results_df["Train F1"] - results_df["Test F1"]

results_sorted = results_df.sort_values(by="Test Accuracy", ascending=False)
results_sorted


Unnamed: 0,Model,Train Accuracy,Test Accuracy,Train F1,Test F1,Acc Gap (Train-Test),F1 Gap (Train-Test)
0,"Random Forest (200, max_depth=10)",0.975762,0.88125,0.974482,0.859643,0.094512,0.114839
1,Gradient Boosting (100),0.960125,0.85625,0.95841,0.841106,0.103875,0.117304


## Section 7: Conclusions & Insights

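In this project, I evaluated two ensemble models, Random Forest (200, max_depth=10) and
Gradient Boosting (100), to classify red wine quality into low, medium, and high categories.
Both models beat the 0.825 test accuracy that always predicting "medium" would achieve
(264 of the 320 test wines are medium), but neither correctly identified a single low-quality wine.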

### Random Forest (200, max_depth=10)
- **Test Accuracy:** 0.8812  
- **Test F1 Score:** 0.8596  
- **Train Accuracy:** 0.9758  
- **Train F1 Score:** 0.9745  
- **Gap (train minus test):** About 0.09 for accuracy, 0.11 for F1  

Random Forest achieved the **highest test accuracy** of the two models. However, the gap of
about 0.09 between train and test accuracy suggests **some overfitting**. This makes sense
because the forest averages 200 trees that can each grow to depth 10, which is deep enough to
fit the training data closely.

The confusion matrix shows that the model performs very well on the "medium" class (255 of
264 test wines correct) and only moderately on "high" (27 of 43 correct). It never predicts
"low": all 13 low-quality test wines are classified as medium. The "low" class has the fewest
samples in the dataset, so the model sees very few examples of it during training.
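
The per-class picture can be read directly off the confusion matrices printed in Section 6. The sketch below copies those two matrices (rows are true classes, columns are predicted classes, in the order low, medium, high) and computes recall per class, plus the accuracy of always predicting the majority class. The helper name `per_class_recall` is my own, not part of scikit-learn.

```python
import numpy as np

# Test-set confusion matrices copied from the Section 6 output
# (rows = true class, columns = predicted class; order: low, medium, high)
cm_rf = np.array([[0, 13, 0], [0, 255, 9], [0, 16, 27]])
cm_gb = np.array([[0, 13, 0], [3, 247, 14], [0, 16, 27]])

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)

for name, cm in [("Random Forest", cm_rf), ("Gradient Boosting", cm_gb)]:
    recall = per_class_recall(cm)
    accuracy = np.trace(cm) / cm.sum()
    print(f"{name}: accuracy={accuracy:.4f}, recall(low, medium, high)={recall.round(3)}")

# Baseline: always predict the most common true class ("medium")
baseline_acc = cm_rf.sum(axis=1).max() / cm_rf.sum()
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

Recall for "low" is 0 for both models, while the majority-class baseline already reaches 0.825 accuracy. Accuracy alone therefore hides how poorly the rare class is handled, which is why a macro-averaged metric may be a useful addition to the weighted F1 used here.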

### Gradient Boosting (100)
- **Test Accuracy:** 0.8562  
- **Test F1 Score:** 0.8411  
- **Train Accuracy:** 0.9601  
- **Train F1 Score:** 0.9584  
- **Gap (train minus test):** About 0.10 for accuracy, 0.12 for F1  

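Gradient Boosting scored slightly lower on both test accuracy and test F1. Its train-test
gaps (about 0.10 for accuracy and 0.12 for F1) are slightly larger than Random Forest's, so
at these settings it overfits at least as much. It is the only model that predicts the
"low" class at all, but all 3 of those predictions are wrong (the wines are actually medium),
and it still misses all 13 true low-quality wines. This model may generalize better with
parameter tuning.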

### Which Model Performed Best?
Based on **test accuracy**, the **Random Forest (200, max_depth=10)** was the best performer.
It also had the higher test F1 and the smaller train-test gap. The margin is modest, though:
0.8812 vs. 0.8562 on 320 test wines is 282 vs. 274 correct predictions, so Gradient Boosting
might close the gap with additional tuning.

### What I Learned
- Ensembles combine many trees: bagging methods such as Random Forest mainly reduce variance,
  while boosting mainly reduces bias. I did not train a single-tree baseline here, so the size
  of that benefit on this dataset is untested.
- The dataset is imbalanced, especially for the "low" class (13 of the 320 test wines), which
  helps explain why both models struggle with it.
- Overfitting is a real concern even for ensembles and must be monitored using train-test gaps.

### Next Steps (If Improving the Model)
If this were a real ML competition, I would try:
- Hyperparameter tuning using GridSearchCV for both models  
- Trying XGBoost or LightGBM for even stronger boosting performance  
- SMOTE or class weighting to handle the minority "low" class  
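- Feature scaling only if I add non-tree models, since tree-based ensembles split on thresholds and are largely insensitive to it  
- Tuning depth, learning rate, and number of estimators together for Gradient Boosting, since adding depth or trees alone could widen the train-test gap  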
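
As a rough illustration of the first and third ideas, the sketch below runs a small `GridSearchCV` over a Random Forest and includes `class_weight="balanced"` as one of the options. It uses synthetic data from `make_classification` as a stand-in for the wine features, and the grid values are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic 3-class imbalanced data standing in for the wine features (illustration only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_classes=3,
    weights=[0.05, 0.80, 0.15], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Small illustrative grid; the values are assumptions, not recommendations
param_grid = {
    "max_depth": [5, 10],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1 gives the rare class equal weight
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print("Best CV macro F1:", round(search.best_score_, 4))
# GridSearchCV.score uses the same scorer as the search, so this is macro F1 too
print("Test macro F1:", round(search.score(X_te, y_te), 4))
```

Scoring on macro F1 rather than accuracy is a deliberate choice in this sketch: with a dominant "medium" class, accuracy can improve even when the "low" class is never predicted.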

Overall, Random Forest is the winning model for this project based on the current settings,
but Gradient Boosting shows promise with additional tuning.
