# Lab 5: Ensemble Machine Learning – Wine Dataset
**Author:** Trent Rueckert

**Date:** April 11, 2025

**Objective:** Use different ensemble machine learning models and evaluate their performance at predicting wine quality.

## Introduction
In this notebook, I will analyze the UCI Wine Quality Dataset to predict wine quality. The steps are: load and explore the data, check for missing values, engineer a categorical target, and train classification models on the selected features. I will use different ensemble machine learning models and evaluate their accuracy and F1 scores.

## Imports
Import the necessary libraries with the code below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1. Load and Inspect the Data

In [2]:
# Load the dataset (download from UCI and save in the same folder)
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

# The dataset includes 11 physicochemical input variables (features):
# ---------------------------------------------------------------
# - fixed acidity          mostly tartaric acid
# - volatile acidity       mostly acetic acid (vinegar)
# - citric acid            can add freshness and flavor
# - residual sugar         remaining sugar after fermentation
# - chlorides              salt content
# - free sulfur dioxide    protects wine from microbes
# - total sulfur dioxide   sum of free and bound forms
# - density                related to sugar content
# - pH                     acidity level (lower = more acidic)
# - sulphates              antioxidant and microbial stabilizer
# - alcohol                % alcohol by volume

# The target variable is:
# - quality (integer score from 0 to 10, rated by wine tasters)

# We will simplify this target into three categories:
#   - low (3–4), medium (5–6), high (7–8) to make classification feasible.
#   - we will also make this numeric (we want both for clarity)
# The dataset contains 1599 samples and 12 columns (11 features + target).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [3]:
# Check for missing values
print('Missing Values:')
print(df.isnull().sum(), '\n')

Missing Values:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64 



## Section 2. Prepare the Data
This section includes cleaning, feature engineering, encoding, and helper functions.

In [5]:
# Define helper function that:

# Takes one input, the quality (which we will temporarily name q while in the function)
# And returns a string of the quality label (low, medium, high)
# This function will be used to create the quality_label column
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"


# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)


# Then, create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2


df["quality_numeric"] = df["quality"].apply(quality_to_number)

print(df['quality_numeric'].value_counts(normalize=True))

quality_numeric
1    0.824891
2    0.135710
0    0.039400
Name: proportion, dtype: float64


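In the prior cell, we convert the quality scores from the 0–10 scale to the labels "low", "medium", and "high", and then encode these categories as 0, 1, and 2. This simplifies the classification task. Note that the classes are heavily imbalanced: about 82% of the wines fall into the medium category.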
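
As an alternative to the two helper functions, the same binning can be sketched with `pd.cut`. This is only an illustration on a toy series (the values below are made up and stand in for `df["quality"]`); the bin edges are chosen to match the `q <= 4` and `q <= 6` thresholds used above.

```python
import numpy as np
import pandas as pd

# Toy quality scores standing in for df["quality"] (illustrative values only)
quality = pd.Series([3, 4, 5, 6, 7, 8])

# Right-inclusive bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf) -> high
bins = [-np.inf, 4, 6, np.inf]
quality_label = pd.cut(quality, bins=bins, labels=["low", "medium", "high"])

# The categorical codes give the matching numeric target:
# 0 = low, 1 = medium, 2 = high
quality_numeric = quality_label.cat.codes

print(
    pd.DataFrame(
        {"quality": quality, "label": quality_label, "numeric": quality_numeric}
    )
)
```

Because the labels are an ordered categorical, the numeric codes follow the low/medium/high order without a second mapping function.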

## Section 3. Feature Selection and Justification

In [8]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' (dropped from the input array)
# Target: quality_numeric (the numeric column we just created)
X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

The three dropped columns are the original target and the two columns derived from it, so keeping them as inputs would leak the answer to the model. The target is the numeric column we created in Section 2 by binning the quality scores. We will use all 11 remaining features because we believe that they all affect quality.

## Section 4. Split the Data into Train and Test

In [9]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
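
To see what `stratify=y` does, the sketch below runs the same kind of split on a synthetic target whose class proportions roughly mirror the ones printed in Section 2. The names `X_demo`, `y_demo`, and `proportions` are hypothetical and the data is generated only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (about 4% / 82% / 14%), for illustration only
rng = np.random.default_rng(42)
y_demo = pd.Series(np.repeat([0, 1, 2], [40, 820, 140]))
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

# Same split settings as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class proportions in the full data, the train set, and the test set
proportions = pd.DataFrame(
    {
        "full": y_demo.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()
print(proportions)
```

With stratification, the three columns should be nearly identical, so the rare low-quality class is represented in both splits.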

## Section 5. Evaluate Model Performance (Choose 2)
1. Random Forest (100)
    * A strong baseline model using 100 decision trees.
2. Random Forest (200, max_depth=10)
    * Adds more trees, but limits tree depth to reduce overfitting.
3. AdaBoost (100)
    * Boosting method that focuses on correcting previous errors.
4. AdaBoost (200, lr=0.5)
    * More iterations and a lower learning rate for better generalization.
5. Gradient Boosting (100)
    * Boosting approach in which each new tree is fit to the gradient of the loss of the current ensemble.
6. Voting (DT + SVM + NN)
    * Combines diverse models by averaging their predicted class probabilities (soft voting).
7. Voting (RF + LR + KNN)
    * Another mix of different model types.
8. Bagging (DT, 100)
    * Builds many trees in parallel on different bootstrap samples.
9. MLP Classifier
    * A basic neural network with one hidden layer.
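
The soft voting used by the two voting ensembles can be sketched with plain NumPy. The probabilities below are made up for illustration (they are not outputs of the models in this lab); the point is only to show how averaging probabilities can differ from a simple majority vote.

```python
import numpy as np

# Hypothetical class probabilities from three models for two samples
# (columns: low, medium, high). The numbers are invented for illustration.
proba_dt = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
proba_svm = np.array([[0.1, 0.7, 0.2], [0.1, 0.6, 0.3]])
proba_nn = np.array([[0.2, 0.5, 0.3], [0.0, 0.7, 0.3]])
all_proba = [proba_dt, proba_svm, proba_nn]

# Soft voting: average the probabilities, then pick the most likely class
avg_proba = np.mean(all_proba, axis=0)
soft_vote = avg_proba.argmax(axis=1)

# Hard voting: each model votes for its own top class, majority wins
votes = np.array([p.argmax(axis=1) for p in all_proba])  # shape (3 models, 2 samples)
hard_vote = [int(np.bincount(col, minlength=3).argmax()) for col in votes.T]

print("Soft vote:", soft_vote.tolist())
print("Hard vote:", hard_vote)
```

For the second sample, two models lean toward medium, but the very confident decision tree pulls the averaged probability toward high, so the soft and hard votes disagree.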

## Gradient Boosting (100) and Bagging (DT, 100)

In [10]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )
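
The helper above reports the weighted F1 score. As a small sketch of why that choice matters on imbalanced classes, the toy example below compares `average="weighted"` with `average="macro"` for a classifier that always predicts the majority class. The labels are invented for illustration, and `zero_division=0` is set so that classes with no predictions score 0 without a warning.

```python
from sklearn.metrics import f1_score

# Toy labels: 8 medium (1), 1 low (0), 1 high (2); illustrative only
y_true = [1] * 8 + [0] + [2]
# A classifier that always predicts the majority class
y_pred = [1] * 10

# Weighted F1 weights each class by its support, so the majority class dominates
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# Macro F1 treats every class equally, so the two missed classes pull it down
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Weighted F1: {weighted_f1:.4f}")
print(f"Macro F1:    {macro_f1:.4f}")
```

A high weighted F1 can therefore coexist with a model that never predicts the minority classes, which is worth keeping in mind when reading the results below.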

In [11]:
# Here's how to create the different types of ensemble models listed above.
# You don't need to run all of them yourself: choose 2 (we have a whole team working on this).

results = []

# 1. Random Forest
# evaluate_model(
    # "Random Forest (100)",
    # RandomForestClassifier(n_estimators=100, random_state=42),
    # X_train,
    # y_train,
    # X_test,
    # y_test,
    # results,
# )

# 2. Random Forest (200, max depth=10) 
# evaluate_model(
    # "Random Forest (200, max_depth=10)",
    # RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    # X_train,
    # y_train,
    # X_test,
    # y_test,
    # results,
# )

# 3. AdaBoost 
# evaluate_model(
    # "AdaBoost (100)",
    # AdaBoostClassifier(n_estimators=100, random_state=42),
    # X_train,
    # y_train,
    # X_test,
    # y_test,
    # results,
# )

# 4. AdaBoost (200, lr=0.5) 
# evaluate_model(
    # "AdaBoost (200, lr=0.5)",
    # AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42),
    # X_train,
    # y_train,
    # X_test,
    # y_test,
    # results,
# )

# 5. Gradient Boosting
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 6. Voting Classifier (DT, SVM, NN) 
# voting1 = VotingClassifier(
    # estimators=[
        # ("DT", DecisionTreeClassifier()),
        # ("SVM", SVC(probability=True)),
        # ("NN", MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)),
    # ],
    # voting="soft",
# )
# evaluate_model(
    # "Voting (DT + SVM + NN)", voting1, X_train, y_train, X_test, y_test, results
# )

# 7. Voting Classifier (RF, LR, KNN) 
# voting2 = VotingClassifier(
    # estimators=[
        # ("RF", RandomForestClassifier(n_estimators=100)),
        # ("LR", LogisticRegression(max_iter=1000)),
        # ("KNN", KNeighborsClassifier()),
    # ],
    # voting="soft",
# )
# evaluate_model(
    # "Voting (RF + LR + KNN)", voting2, X_train, y_train, X_test, y_test, results
# )

# 8. Bagging 
evaluate_model(
    "Bagging (DT, 100)",
    BaggingClassifier(
        estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# 9. MLP Classifier 
# evaluate_model(
    # "MLP Classifier",
    # MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    # X_train,
    # y_train,
    # X_test,
    # y_test,
    # results,
# )


Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411

Bagging (DT, 100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 252  12]
 [  0  12  31]]
Train Accuracy: 1.0000, Test Accuracy: 0.8844
Train F1 Score: 1.0000, Test F1 Score: 0.8655


## Section 6. Compare Results

In [13]:
# Create a table of results 
results_df = pd.DataFrame(results)

results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap"] = results_df["Train F1"] - results_df["Test F1"]

results_df = results_df.sort_values(by="Test Accuracy", ascending=False)

print("\nSummary of All Models:")
display(results_df)


Summary of All Models:


,Model,Train Accuracy,Test Accuracy,Train F1,Test F1,Accuracy Gap,F1 Gap
1,"Bagging (DT, 100)",1.0,0.884375,1.0,0.865452,0.115625,0.134548
0,Gradient Boosting (100),0.960125,0.85625,0.95841,0.841106,0.103875,0.117304


## Section 7. Conclusions and Insights
### My Results - Gradient Boosting (100)
* **Train Accuracy:** 96.01%

    The model performs very well on the training data.
* **Test Accuracy:** 85.63%
    
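    The model generalizes reasonably well to the unseen data, although always predicting the majority class (medium, about 82% of the samples) would already reach roughly 82% accuracy.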
* **Accuracy Gap:** 10.39%

    There are some signs of overfitting, as the model performs noticeably better on the training data than on the test data.
* **Train F1:** 95.84%

    The model performs excellently on the training data.
* **Test F1:** 84.11%

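    The weighted F1 score is close to the test accuracy, but the confusion matrix shows that none of the 13 low-quality test wines were classified correctly, so the score mostly reflects performance on the medium class.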
* **F1 Gap:** 11.73%

    There is some overfitting, shown by the drop in F1 score from the training data to the test data.

### My Results - Bagging (DT, 100)
* **Train Accuracy:** 100%

    The model performs perfectly on the training data, likely overfitting.
* **Test Accuracy:** 88.44%
    
    The model has a higher test accuracy than gradient boosting, so the perfect fit to the training data has not hurt its performance on unseen data here.
* **Accuracy Gap:** 11.56%

    The model is overfitting, as it performs much better on the training data than the test data.
* **Train F1:** 100%

    The model is a perfect predictor of the training data, which strongly suggests overfitting.
* **Test F1:** 86.55%

    The weighted F1 score on the unseen data is higher than for gradient boosting. As with gradient boosting, however, none of the 13 low-quality test wines were classified correctly.
* **F1 Gap:** 13.45%

    There is some overfitting, shown by the drop in F1 score from the training data to the test data. The drop is larger than for gradient boosting.

### Comparing the Models
* Bagging (DT, 100) has the higher test accuracy and F1 score, but its perfect training scores and larger gaps are signs of overfitting.
* Gradient Boosting (100) has slightly smaller gaps between training and test performance (10.39% vs. 11.56% for accuracy), despite the lower test performance.
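
To look beyond the overall scores, per-class recall can be computed directly from the two confusion matrices printed in Section 5. This is a small sketch using only those printed numbers; rows are the true classes (low, medium, high), following the convention of scikit-learn's `confusion_matrix`. The helper name `per_class_recall` is hypothetical.

```python
import numpy as np

# Test confusion matrices copied from the Section 5 output
# (rows: true low/medium/high, columns: predicted low/medium/high)
cm_gb = np.array([[0, 13, 0], [3, 247, 14], [0, 16, 27]])
cm_bag = np.array([[0, 13, 0], [0, 252, 12], [0, 12, 31]])


def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return cm.diagonal() / cm.sum(axis=1)


summary = {}
for name, cm in [("Gradient Boosting (100)", cm_gb), ("Bagging (DT, 100)", cm_bag)]:
    recall = per_class_recall(cm)
    accuracy = cm.diagonal().sum() / cm.sum()
    # Accuracy of always predicting the most common true class
    baseline = cm.sum(axis=1).max() / cm.sum()
    summary[name] = {"recall": recall, "accuracy": accuracy, "baseline": baseline}
    print(f"{name}: recall per class = {np.round(recall, 3)}, "
          f"accuracy = {accuracy:.4f}, majority baseline = {baseline:.4f}")
```

Both models have a recall of 0 on the low class, and the majority-class baseline on this test set is 82.5%, which puts the test accuracies of 85.6% and 88.4% in context.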

## Comparing Other Classmates' Results
**Binyam Ware**
(https://github.com/bware7/applied-ml-binware/blob/main/lab05/ml05_binware.ipynb)
* Random Forest (100):
  * Train Accuracy: 100%
  * Test Accuracy: 88.75%
  * Accuracy Gap: 11.25%
  * Train F1: 100%
  * Test F1: 86.61%
  * F1 Gap: 13.39%
* Voting (DT+SVM+NN):
  * Train Accuracy: 92.26%
  * Test Accuracy: 86.56%
  * Accuracy Gap: 5.70%
  * Train F1: 90.61%
  * Test F1: 84.34%
  * F1 Gap: 6.26%

**Craig Wilcox**
(https://github.com/s256657/applied-ml-craigwilcox/blob/main/lab05/ensemble_craigwilcox.ipynb)
* Random Forest (200, max_depth=10)
  * Train Accuracy: 97.58%
  * Test Accuracy: 88.13%
  * Accuracy Gap: 9.45%
  * Train F1: 97.45%
  * Test F1: 85.96%
  * F1 Gap: 11.49%

**Justin Schroder**
(https://github.com/SchroderJ-pixel/applied-ml-justin/blob/main/lab05/ensemble-justin.ipynb)
* AdaBoost (100)
  * Train Accuracy: 83.42%
  * Test Accuracy: 82.50%
  * Accuracy Gap: 0.92%
  * Train F1: 82.09%
  * Test F1: 81.58%
  * F1 Gap: 0.51%

**Ryan Krabbe**
(https://github.com/ryankrabbe/applied-ml-krabbe/blob/main/lab05/ensemble-krabbe.ipynb)
* Voting (RF + LR + KNN)
  * Train Accuracy: 91.32%
  * Test Accuracy: 85.31%
  * Accuracy Gap: 6.01%
  * Train F1: 89.29%
  * Test F1: 82.10%
  * F1 Gap: 7.19%

**Brett Neely**
(https://github.com/bncodes19/applied-ml-bneely/blob/main/lab05/ensemble-neely.ipynb)
* MLP Classifier
  * Train Accuracy: 85.14%
  * Test Accuracy: 84.38%
  * Accuracy Gap: 0.76%
  * Train F1: 81.41%
  * Test F1: 80.73%
  * F1 Gap: 0.68%

**Kersha Broussard**
(https://github.com/kersha0530/ml-05/blob/main/ensemble-kbroussard.ipynb)
* AdaBoost (200, lr=0.5)
  * Train Accuracy: 83.97%
  * Test Accuracy: 85.62%
  * Accuracy Gap: -1.65% (test accuracy is higher than train accuracy)
  * Train F1: 82.00%
  * Test F1: 83.00%
  * F1 Gap: -1.00% (test F1 is higher than train F1)

### Best Models
I think the best models are Voting (DT+SVM+NN) and Random Forest (200, max_depth=10). Both reach high test accuracy with smaller accuracy and F1 gaps than the other models with comparable test scores, which suggests less overfitting. Random Forest works well because it balances bias and variance.
Voting (DT + SVM + NN) works well because it combines models that make different kinds of errors, reducing the chance of a consistent mistake.
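
The comparison above can be reproduced by collecting the accuracies reported in this notebook into one table. This is a sketch that only re-enters the percentages listed above (no models are retrained); the gap is computed as train minus test, matching Section 6.

```python
import pandas as pd

# Train/test accuracy (%) as reported above for each model
reported = [
    ("Gradient Boosting (100)", 96.01, 85.63),
    ("Bagging (DT, 100)", 100.00, 88.44),
    ("Random Forest (100)", 100.00, 88.75),
    ("Voting (DT + SVM + NN)", 92.26, 86.56),
    ("Random Forest (200, max_depth=10)", 97.58, 88.13),
    ("AdaBoost (100)", 83.42, 82.50),
    ("Voting (RF + LR + KNN)", 91.32, 85.31),
    ("MLP Classifier", 85.14, 84.38),
    ("AdaBoost (200, lr=0.5)", 83.97, 85.62),
]

comparison = pd.DataFrame(reported, columns=["Model", "Train Accuracy", "Test Accuracy"])
# Positive gap: better on train than test; negative gap: better on test
comparison["Accuracy Gap"] = (
    comparison["Train Accuracy"] - comparison["Test Accuracy"]
).round(2)

# Rank by test accuracy, then by smaller gap as a tie-breaker
comparison = comparison.sort_values(
    by=["Test Accuracy", "Accuracy Gap"], ascending=[False, True]
).reset_index(drop=True)
print(comparison.to_string(index=False))
```

Sorting this way makes the trade-off visible: the models with the highest test accuracy also tend to have the largest gaps, while the AdaBoost and MLP models have small gaps but lower test accuracy.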

### Next Steps
If I were in a competition to build the best predictor, I would evaluate all nine models on this dataset, research how each model works and how the dataset behaves, and then choose the model that performs best.