# Project 5 — Ensemble Models (Wine Quality & Spiral)
**Author:** Womenker Karto  
**Date:** 2025-11-21

**Overview:**  
This notebook implements and compares ensemble classifiers on the UCI Wine Quality (red) dataset. I convert the quality score to three classes (low / medium / high) and evaluate several ensemble methods. I also load a secondary `spiral.csv`, the dataset used in the lab example, for optional exploration and visualization.

## Imports

In [38]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from typing import List, Dict

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)
import warnings
warnings.filterwarnings("ignore")
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## Section 1 — Load and Inspect the Data
Load `winequality-red.csv` and `spiral.csv`.

In [39]:
df = pd.read_csv("winequality-red.csv", sep=";")
print("Wine dataset shape:", df.shape)
display(df.head())
display(df.describe().T)


Wine dataset shape: (1599, 12)


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1599.0 | 8.319637 | 1.741096 | 4.6 | 7.1 | 7.9 | 9.2 | 15.9 |
| volatile acidity | 1599.0 | 0.527821 | 0.17906 | 0.12 | 0.39 | 0.52 | 0.64 | 1.58 |
| citric acid | 1599.0 | 0.270976 | 0.194801 | 0.0 | 0.09 | 0.26 | 0.42 | 1.0 |
| residual sugar | 1599.0 | 2.538806 | 1.409928 | 0.9 | 1.9 | 2.2 | 2.6 | 15.5 |
| chlorides | 1599.0 | 0.087467 | 0.047065 | 0.012 | 0.07 | 0.079 | 0.09 | 0.611 |
| free sulfur dioxide | 1599.0 | 15.874922 | 10.460157 | 1.0 | 7.0 | 14.0 | 21.0 | 72.0 |
| total sulfur dioxide | 1599.0 | 46.467792 | 32.895324 | 6.0 | 22.0 | 38.0 | 62.0 | 289.0 |
| density | 1599.0 | 0.996747 | 0.001887 | 0.99007 | 0.9956 | 0.99675 | 0.997835 | 1.00369 |
| pH | 1599.0 | 3.311113 | 0.154386 | 2.74 | 3.21 | 3.31 | 3.4 | 4.01 |
| sulphates | 1599.0 | 0.658149 | 0.169507 | 0.33 | 0.55 | 0.62 | 0.73 | 2.0 |


In [40]:
spiral = pd.read_csv("spiral.csv", sep="\t")
print("Spiral dataset shape:", spiral.shape)
display(spiral.head())


Spiral dataset shape: (798, 3)


| | A | B | Class |
|---|---|---|---|
| 0 | 3.91203 | -1.108531 | 0 |
| 1 | 2.663918 | 2.714674 | 0 |
| 2 | 0.481765 | 0.088643 | 0 |
| 3 | -0.839247 | 3.163379 | 0 |
| 4 | 3.915366 | -0.939925 | 0 |


## Section 2 — Prepare the Data
Create categorical labels (low/medium/high) and a numeric target.

In [41]:
def quality_to_label(q: int) -> str:
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

def quality_to_number(q: int) -> int:
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

df["quality_label"] = df["quality"].apply(quality_to_label)
df["quality_numeric"] = df["quality"].apply(quality_to_number)

print("Class distribution (labels):")
display(df["quality_label"].value_counts())
print("\nNumeric distribution:")
display(df["quality_numeric"].value_counts())

Class distribution (labels):


quality_label
medium    1319
high       217
low         63
Name: count, dtype: int64


Numeric distribution:


quality_numeric
1    1319
2     217
0      63
Name: count, dtype: int64
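
The two mapping functions above apply the same thresholds twice. As a minimal sketch of an alternative, the binning can be written once with `pd.cut`, assuming right-inclusive edges at 4 and 6. The sample scores below are hypothetical stand-ins for `df["quality"]`.

```python
import pandas as pd

# Hypothetical sample of quality scores; the notebook itself uses df["quality"].
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf] -> high.
bins = [-float("inf"), 4, 6, float("inf")]
labels = ["low", "medium", "high"]

quality_label = pd.cut(quality, bins=bins, labels=labels)
# .cat.codes numbers the categories in the order of `labels` (0, 1, 2),
# which matches what quality_to_number returns.
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({
    "quality": quality,
    "label": quality_label,
    "numeric": quality_numeric,
}))
```

Keeping the thresholds in one `bins` list means the string labels and the numeric target cannot drift apart if the cut points are changed later.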

## Section 3 — Feature Selection and Justification
Use the 11 physicochemical features as input X and `quality_numeric` as target y.

In [42]:
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])
y = df["quality_numeric"]
print("Feature set shape:", X.shape)
print("Target value counts:")
display(y.value_counts())

Feature set shape: (1599, 11)
Target value counts:


quality_numeric
1    1319
2     217
0      63
Name: count, dtype: int64

## Section 4 — Split the Data into Train and Test
We use a stratified split so that the train and test sets keep the same (imbalanced) class proportions as the full dataset.

In [43]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

Train shape: (1279, 11) Test shape: (320, 11)
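
To illustrate what `stratify=y` does, the sketch below repeats the split on synthetic labels that have the same class counts as the notebook (63 / 1319 / 217) and compares class shares before and after. The feature column and the `class_shares` helper are placeholders introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported in Section 2.
y_demo = np.repeat([0, 1, 2], [63, 1319, 217])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples in each of the three classes."""
    return np.bincount(labels, minlength=3) / len(labels)

print("full :", class_shares(y_demo).round(3))
print("train:", class_shares(y_tr_demo).round(3))
print("test :", class_shares(y_te_demo).round(3))
```

The three rows should agree closely. Without `stratify`, the 63-sample low class could end up noticeably over- or under-represented in a 320-row test set.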


## Section 5 — Evaluate Model Performance
We run a set of ensemble models and record train and test accuracy, train and test weighted F1, and the train-test gap for each metric.

In [44]:
def evaluate_model(name: str, model, X_tr, y_tr, X_te, y_te, results: List[Dict], use_scaled: bool = False,
                   scaler: StandardScaler = None):
    # Optionally scale inputs (SVM, MLP, and other scale-sensitive models)
    if use_scaled:
        if scaler is None:
            # Fail loudly instead of silently training on unscaled data
            raise ValueError("use_scaled=True requires a fitted scaler")
        X_tr_used = scaler.transform(X_tr)
        X_te_used = scaler.transform(X_te)
    else:
        X_tr_used = X_tr
        X_te_used = X_te

    model.fit(X_tr_used, y_tr)
    y_train_pred = model.predict(X_tr_used)
    y_test_pred = model.predict(X_te_used)

    train_acc = accuracy_score(y_tr, y_train_pred)
    test_acc = accuracy_score(y_te, y_test_pred)
    train_f1 = f1_score(y_tr, y_train_pred, average="weighted")
    test_f1 = f1_score(y_te, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_te, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append({
        "Model": name,
        "Train Accuracy": train_acc,
        "Test Accuracy": test_acc,
        "Train F1": train_f1,
        "Test F1": test_f1,
        "Train-Test Acc Gap": round(train_acc - test_acc, 4),
        "Train-Test F1 Gap": round(train_f1 - test_f1, 4),
    })

In [45]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
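
Passing a pre-fitted scaler into `evaluate_model` scales the inputs for every member of an ensemble at once. One possible alternative, sketched here on synthetic data rather than the wine features, is to attach a `StandardScaler` only to the scale-sensitive members with `make_pipeline`. The dataset and the `voting_demo` name are assumptions made for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=42
)

voting_demo = VotingClassifier(
    estimators=[
        # Tree ensembles are insensitive to feature scale, so no scaler here.
        ("RF", RandomForestClassifier(n_estimators=50, random_state=42)),
        # Gradient- and distance-based members each get their own scaler.
        ("LR", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("KNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="soft",
)
voting_demo.fit(X_demo, y_demo)
print(voting_demo.predict(X_demo[:5]))
```

With this layout each scaler is fitted inside `fit`, on the training data only, so the same object can be handed to cross-validation without leaking test statistics.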

In [46]:
# Run the chosen ensemble models (I run all 9 for comparison)
results = []

# 1. Random Forest (100)
evaluate_model(
    "Random Forest (100)",
    RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 2. Random Forest (200, max_depth=10)
evaluate_model(
    "Random Forest (200, max_depth=10)",
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 3. AdaBoost (100)
evaluate_model(
    "AdaBoost (100)",
    AdaBoostClassifier(n_estimators=100, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 4. AdaBoost (200, lr=0.5)
evaluate_model(
    "AdaBoost (200, lr=0.5)",
    AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 5. Gradient Boosting (100)
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 6. Voting (DT + SVM + NN) - use scaled inputs for SVM and NN
voting1 = VotingClassifier(
    estimators=[
        ("DT", DecisionTreeClassifier(random_state=RANDOM_STATE)),
        ("SVM", SVC(probability=True, random_state=RANDOM_STATE)),
        ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=RANDOM_STATE)),
    ],
    voting="soft",
)
# We will train voting1 on scaled arrays
evaluate_model("Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results, use_scaled=True, scaler=scaler)

# 7. Voting (RF + LR + KNN) - trained on unscaled inputs, although LR and KNN are scale-sensitive
voting2 = VotingClassifier(
    estimators=[
        ("RF", RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)),
        ("LR", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)),
        ("KNN", KNeighborsClassifier()),
    ],
    voting="soft",
)
evaluate_model("Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results)

# 8. Bagging (DT, 100)
evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results
)

# 9. MLP Classifier (use scaled)
evaluate_model(
    "MLP Classifier",
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=RANDOM_STATE),
    X_train, y_train, X_test, y_test, results, use_scaled=True, scaler=scaler
)



Random Forest (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 256   8]
 [  0  15  28]]
Train Accuracy: 1.0000, Test Accuracy: 0.8875
Train F1 Score: 1.0000, Test F1 Score: 0.8661

Random Forest (200, max_depth=10) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 255   9]
 [  0  16  27]]
Train Accuracy: 0.9758, Test Accuracy: 0.8812
Train F1 Score: 0.9745, Test F1 Score: 0.8596

AdaBoost (100) Results
Confusion Matrix (Test):
[[  2  11   0]
 [  8 214  42]
 [  0  15  28]]
Train Accuracy: 0.7834, Test Accuracy: 0.7625
Train F1 Score: 0.7958, Test F1 Score: 0.7743

AdaBoost (200, lr=0.5) Results
Confusion Matrix (Test):
[[  1  12   0]
 [  7 228  29]
 [  0  18  25]]
Train Accuracy: 0.8038, Test Accuracy: 0.7937
Train F1 Score: 0.8071, Test F1 Score: 0.7938

Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411

Voting (DT + SVM + NN) 
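
Weighted F1 is dominated by the medium class, so it can hide a class that is never predicted correctly. As a sketch, the Random Forest (100) confusion matrix printed above can be expanded back into label vectors to compare weighted F1 with macro F1 and per-class recall. The class order (0 = low, 1 = medium, 2 = high) follows `quality_to_number`.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Test confusion matrix reported for Random Forest (100);
# rows are true classes, columns are predicted classes.
cm = np.array([[0, 13, 0],
               [0, 256, 8],
               [0, 15, 28]])

# Rebuild label vectors that reproduce the matrix exactly.
y_true, y_pred = [], []
for true_cls in range(3):
    for pred_cls in range(3):
        count = int(cm[true_cls, pred_cls])
        y_true += [true_cls] * count
        y_pred += [pred_cls] * count

per_class_recall = recall_score(y_true, y_pred, average=None, zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("Recall per class:", per_class_recall.round(3))
print(f"Weighted F1: {weighted_f1:.4f}  Macro F1: {macro_f1:.4f}")
```

The weighted F1 reproduces the 0.8661 in the output above, while the macro F1 comes out near 0.55 because the low class, with zero correct predictions, contributes an F1 of zero.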

## Section 6 — Compare Results
Create a DataFrame from `results`, compute gaps (already included), sort by Test Accuracy, and save.

In [47]:
results_df = pd.DataFrame(results)
results_df_sorted = results_df.sort_values(by="Test Accuracy", ascending=False).reset_index(drop=True)
display(results_df_sorted)
results_df_sorted.to_csv("ensemble_results_summary.csv", index=False)
print("Saved ensemble_results_summary.csv")

| | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Train-Test Acc Gap | Train-Test F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest (100) | 1.0 | 0.8875 | 1.0 | 0.866056 | 0.1125 | 0.1339 |
| 1 | Bagging (DT, 100) | 1.0 | 0.884375 | 1.0 | 0.865452 | 0.1156 | 0.1345 |
| 2 | Random Forest (200, max_depth=10) | 0.975762 | 0.88125 | 0.974482 | 0.859643 | 0.0945 | 0.1148 |
| 3 | Voting (DT + SVM + NN) | 0.96638 | 0.871875 | 0.964747 | 0.854168 | 0.0945 | 0.1106 |
| 4 | Voting (RF + LR + KNN) | 0.918686 | 0.859375 | 0.901189 | 0.828047 | 0.0593 | 0.0731 |
| 5 | Gradient Boosting (100) | 0.960125 | 0.85625 | 0.95841 | 0.841106 | 0.1039 | 0.1173 |
| 6 | MLP Classifier | 0.971071 | 0.85 | 0.969991 | 0.845761 | 0.1211 | 0.1242 |
| 7 | AdaBoost (200, lr=0.5) | 0.803753 | 0.79375 | 0.807132 | 0.793824 | 0.01 | 0.0133 |
| 8 | AdaBoost (100) | 0.783425 | 0.7625 | 0.795827 | 0.774253 | 0.0209 | 0.0216 |


Saved ensemble_results_summary.csv
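
The table is easier to read next to a trivial reference point. The sketch below scores a majority-class baseline with `DummyClassifier`, using label vectors rebuilt from counts already reported in this notebook: test counts are the confusion-matrix row sums (13 / 264 / 43), and train counts are the full counts minus those. The zero-valued feature arrays are placeholders, since the baseline ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Label vectors rebuilt from the class counts reported in the notebook.
y_train_demo = np.repeat([0, 1, 2], [50, 1055, 174])  # 1279 training rows
y_test_demo = np.repeat([0, 1, 2], [13, 264, 43])     # 320 test rows

# DummyClassifier ignores the features, so a column of zeros is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent training class ("medium", encoded as 1).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
y_base = baseline.predict(X_test_demo)

base_acc = accuracy_score(y_test_demo, y_base)
base_f1 = f1_score(y_test_demo, y_base, average="weighted", zero_division=0)
print(f"Baseline accuracy: {base_acc:.4f}, weighted F1: {base_f1:.4f}")
```

This baseline reaches 0.8250 accuracy and a weighted F1 of about 0.746, so a model's test accuracy in the table is best judged by how far it sits above 0.825 rather than above zero.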


## Section 7 — Conclusions and Insights

- **Overall Model Performance:** 
    After evaluating nine models (eight ensembles and one standalone MLP) on the Wine Quality dataset, several clear patterns emerged. The strongest performers by test accuracy and weighted test F1 score were:
    - **Random Forest (100)** – Test Accuracy: 0.8875 | Test F1: 0.8661
    - **Bagging (DT, 100)** – Test Accuracy: 0.8844 | Test F1: 0.8655
    - **Random Forest (200, max_depth=10)** – Test Accuracy: 0.8812 | Test F1: 0.8596

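    All three are bagged decision-tree ensembles. Their test accuracy sits roughly 6 points above the 0.825 that always predicting "medium" would score on this test set (264 of 320 samples), so the margin over a trivial baseline is modest. The confusion matrices for both Random Forests also show that none of the 13 low-quality test wines were classified correctly.

    The top performer overall was **Random Forest (100)**, which achieved the highest test accuracy and F1 score. Its training accuracy of 1.0000 against 0.8875 on the test set (a gap of 0.1125) indicates overfitting. Capping tree depth in the second Random Forest narrowed the gap to 0.0945 at a cost of about 0.006 in test accuracy.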

- **Observations on Model Types:** 
    - **Ensemble tree-based methods (Random Forest & Bagging)** performed best overall. This aligns with expectations, as these methods capture non-linear relationships and reduce variance through aggregation.
    - **Boosting models (AdaBoost)** showed the lowest test accuracy (0.7625 and 0.7938), below the 0.825 majority-class baseline, with train-test gaps of only 0.01 to 0.02. A small gap combined with low training accuracy points to underfitting rather than good generalization. Gradient Boosting, also a boosting method, did better (0.8562 test accuracy) but with a gap of 0.1039.
    - **Voting classifiers** fell in between: DT + SVM + NN reached 0.8719 test accuracy (gap 0.0945), and RF + LR + KNN reached 0.8594 with the smallest accuracy gap outside AdaBoost (0.0593).
    - **MLP Classifier** reached 0.8500 test accuracy with the largest train-test accuracy gap of any model (0.1211), a sign of overfitting.