# Project 5: Ensemble Models on Wine Quality

**Name:** Saratchandra Golla    
**Date:** November 15, 2025

**Introduction:**   
This project uses ensemble machine learning models to classify the quality of red wine from the physicochemical properties recorded in the UCI Wine Quality Dataset. Ensemble methods combine multiple models and can improve predictive performance by reducing overfitting and improving generalization. The goal is to compare the performance of selected ensemble models and determine the best approach for this multi-class classification problem.

## Imports

We import the libraries needed for data loading, preprocessing, model building (including ensemble methods), train/test splitting, and evaluation.

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Ensemble Models
from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
# Base Estimators for Voting and Bagging
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Utilities and Metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

# Set random seed for reproducibility
RANDOM_STATE = 42

## Section 1. Load and Inspect the Data
We load the Red Wine Quality Dataset, which contains 11 physicochemical features and a quality target variable. The CSV file uses a semicolon (;) as its field separator, so we pass sep=";" when reading it.

In [18]:
# Load the wine quality dataset
try:
    df = pd.read_csv("winequality-red.csv", sep=";")
except FileNotFoundError:
    print("Error: 'winequality-red.csv' not found. Please ensure the file is in the same directory.")
    df = None # Handle case where file is missing

if df is not None:
    print("--- Wine Quality Dataset Info ---")
    df.info()
    print("\n--- Wine Quality Dataset Head ---")
    print(df.head())
    print(f"\nDataset loaded with {df.shape[0]} samples and {df.shape[1]} columns (11 features + quality).")

--- Wine Quality Dataset Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

--- Wine Quality Dataset Head ---
   fixed acidity  volatile acidity  citric acid  residual
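
As a small illustration of why the separator matters, the sketch below parses a two-row, semicolon-delimited sample twice: once with the pandas default (comma) and once with sep=";". The sample string and its values are made up for the demonstration and are not taken from the real file.

```python
import io

import pandas as pd

# Illustrative sample in the same semicolon-delimited layout (values are invented)
sample = "fixed acidity;volatile acidity;quality\n7.4;0.70;5\n7.8;0.88;5\n"

# Default separator is a comma, so each line is read as a single field
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";" the three columns are split correctly
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)          # (2, 1)
print(right.shape)          # (2, 3)
print(list(right.columns))  # ['fixed acidity', 'volatile acidity', 'quality']
```

If the loaded DataFrame shows a single column whose name contains semicolons, the separator was most likely left at its default.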

## Section 2. Prepare the Data
The original quality target variable is an integer score from 0 to 10. For a more practical classification task, we simplify this into three categorical levels: low (quality $\le 4$), medium ($5 \le$ quality $\le 6$), and high (quality $\ge 7$).

We create two new columns:
1. **quality_label:** a categorical string (low, medium, high)
2. **quality_numeric:** a numeric encoding of the label ($0=$ low, $1=$ medium, $2=$ high)

In [19]:
if df is not None:
    # 1. Helper function to map quality score to a string label
    def quality_to_label(q):
        """Returns the quality label (low, medium, high) based on the quality score q."""
        if q <= 4:
            return "low"
        elif q <= 6:
            return "medium"
        else: # q >= 7
            return "high"
    
    # Apply the function to create the new label column
    df["quality_label"] = df["quality"].apply(quality_to_label)

    # 2. Helper function to map quality score to a numeric target
    def quality_to_number(q):
        """Returns the numeric quality target (0=low, 1=medium, 2=high)."""
        if q <= 4:
            return 0
        elif q <= 6:
            return 1
        else: # q >= 7
            return 2
            
    # Apply the function to create the numeric target column
    df["quality_numeric"] = df["quality"].apply(quality_to_number)
    
    print("\n--- Data Preparation Complete ---")
    print(df[['quality', 'quality_label', 'quality_numeric']].head())
    print("\nDistribution of the new target variable:")
    print(df['quality_label'].value_counts())


--- Data Preparation Complete ---
   quality quality_label  quality_numeric
0        5        medium                1
1        5        medium                1
2        5        medium                1
3        6        medium                1
4        5        medium                1

Distribution of the new target variable:
quality_label
medium    1319
high       217
low         63
Name: count, dtype: int64
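
The same three-way binning can also be expressed without per-row helper functions. The following is a minimal sketch, assuming pandas' pd.cut with right-inclusive bin edges; the scores series is a made-up example rather than the project data, and this is an alternative formulation, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical quality scores covering each category
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Right-inclusive bins: (-inf, 4] -> low, (4, 6] -> medium, (6, inf] -> high
labels = pd.cut(
    scores,
    bins=[-float("inf"), 4, 6, float("inf")],
    labels=["low", "medium", "high"],
)

# Category codes follow the label order: 0=low, 1=medium, 2=high
codes = labels.cat.codes

print(labels.tolist())  # ['low', 'low', 'medium', 'medium', 'high', 'high']
print(codes.tolist())   # [0, 0, 1, 1, 2, 2]
```

Because both columns derive from a single set of bin edges, the label and the numeric encoding cannot drift apart if the thresholds are later changed.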


## Section 3. Feature Selection and Justification
We define the feature set (X) as all 11 physicochemical columns and the target variable (y) as the newly created quality_numeric column. The original quality column and the two derived columns (quality_label and quality_numeric) are excluded from the input features: each is either the target itself or a direct transformation of it, so including any of them would cause data leakage.

In [20]:
if df is not None:
    # Features (X): all physicochemical columns
    X = df.drop(columns=["quality", "quality_label", "quality_numeric"])

    # Target (y): the new numeric quality category
    y = df["quality_numeric"]

    print("\n--- Feature and Target Shape ---")
    print(f"Features (X) shape: {X.shape}")
    print(f"Target (y) shape: {y.shape}")


--- Feature and Target Shape ---
Features (X) shape: (1599, 11)
Target (y) shape: (1599,)
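
A quick guard against accidental leakage is to check that none of the target-derived columns survive in the feature matrix. The sketch below does this on a small made-up frame; the names toy, X_toy, y_toy, and target_derived are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy frame with two features plus the three target-related columns (values are invented)
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 12.5, 11.0],
        "pH": [3.51, 3.20, 3.30, 3.10],
        "quality": [5, 4, 7, 6],
        "quality_label": ["medium", "low", "high", "medium"],
        "quality_numeric": [1, 0, 2, 1],
    }
)

# Columns that are the target or a direct transformation of it
target_derived = ["quality", "quality_label", "quality_numeric"]

X_toy = toy.drop(columns=target_derived)
y_toy = toy["quality_numeric"]

# Any overlap here would mean the model could read the answer from its inputs
leaked = set(target_derived) & set(X_toy.columns)
assert not leaked, f"Target-derived columns in features: {leaked}"

print(list(X_toy.columns))  # ['alcohol', 'pH']
```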
