# Ryan's Lab 4 Project
**Author:** Ryan Krabbe  
**Date:** 4/08/2025

**Objective:** Use the red wine quality dataset to classify wines into low, medium, and high `quality` categories using ensemble classification models.


## Imports
In the code cell below, import the necessary Python libraries for this notebook.  

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import (
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
    VotingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)

## Section 1. Load and Inspect the Data

In [3]:
# Load the dataset
df = pd.read_csv("winequality-red.csv", sep=";")

# Display structure and first few rows
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## Section 2. Prepare the Data

In [None]:
# Transform the target variable, quality, into three categories: low (3–4), medium (5–6), high (7–8), to make classification feasible
def quality_to_label(q):
    if q <= 4:
        return "low"
    elif q <= 6:
        return "medium"
    else:
        return "high"

# Call the apply() method on the quality column to create the new quality_label column
df["quality_label"] = df["quality"].apply(quality_to_label)

# Create a numeric column for modeling: 0 = low, 1 = medium, 2 = high
def quality_to_number(q):
    if q <= 4:
        return 0
    elif q <= 6:
        return 1
    else:
        return 2

# Call the apply() method on the quality column to create the new quality_numeric column    
df["quality_numeric"] = df["quality"].apply(quality_to_number)
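
The two `apply()` calls above bin `quality` at the same thresholds. As a hedged alternative, the same mapping could be sketched with `pd.cut`, which produces the label and the numeric code in one pass. The `quality` Series below is a hypothetical sample standing in for `df["quality"]`, and the infinite outer bin edges are chosen here to mirror the open-ended `<=` and `else` branches of the functions above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of quality scores; the notebook itself uses df["quality"].
quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins are right-inclusive: (-inf, 4] -> low, (4, 6] -> medium, (6, inf) -> high.
quality_label = pd.cut(
    quality,
    bins=[-np.inf, 4, 6, np.inf],
    labels=["low", "medium", "high"],
)

# The categorical's integer codes follow label order: 0 = low, 1 = medium, 2 = high.
quality_numeric = quality_label.cat.codes

print(pd.DataFrame({"quality": quality, "label": quality_label, "numeric": quality_numeric}))
```

Keeping the thresholds in a single `bins` list means the label and numeric columns cannot drift apart if the cut points are changed later.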

## Section 3. Feature Selection and Justification

In [5]:
# Define input features (X) and target (y)
# Features: all columns except 'quality', 'quality_label', and 'quality_numeric' - drop these from the input array

X = df.drop(columns=["quality", "quality_label", "quality_numeric"])  # Features
y = df["quality_numeric"]  # Target

## Section 4. Split the Data into Train and Test

In [6]:
# Train/test split (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
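
To see what `stratify=y` buys, the class proportions in each split can be compared with those in the full data. The sketch below uses a hypothetical imbalanced target (`y_demo`) and a dummy feature frame (`X_demo`) rather than the wine data, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target standing in for y (0 = low, 1 = medium, 2 = high).
y_demo = pd.Series([0] * 10 + [1] * 160 + [2] * 30)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare class proportions in the full data, the training split, and the test split.
proportions = pd.DataFrame(
    {
        "full": y_demo.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
).sort_index()
print(proportions)
```

Running the same comparison on `y`, `y_train`, and `y_test` would confirm that the rare low-quality class is represented in both splits.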

## Section 5. Evaluate Model Performance

In [7]:
# Helper function to train and evaluate models
def evaluate_model(name, model, X_train, y_train, X_test, y_test, results):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_f1 = f1_score(y_train, y_train_pred, average="weighted")
    test_f1 = f1_score(y_test, y_test_pred, average="weighted")

    print(f"\n{name} Results")
    print("Confusion Matrix (Test):")
    print(confusion_matrix(y_test, y_test_pred))
    print(f"Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")
    print(f"Train F1 Score: {train_f1:.4f}, Test F1 Score: {test_f1:.4f}")

    results.append(
        {
            "Model": name,
            "Train Accuracy": train_acc,
            "Test Accuracy": test_acc,
            "Train F1": train_f1,
            "Test F1": test_f1,
        }
    )
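
The helper reports F1 with `average="weighted"`, which weights each class's F1 by its number of true samples. On an imbalanced target this can mask a class that is never predicted. The sketch below compares weighted and macro averaging on hypothetical toy labels (not notebook data) to show the effect.

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels: class 1 dominates and class 0 is never predicted.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1]

# zero_division=0 scores a class with no predictions as 0 instead of warning.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)

print(f"Weighted F1: {weighted_f1:.3f}")
print(f"Macro F1:    {macro_f1:.3f}")
print("Per-class F1:", per_class_f1.round(3))
```

Macro F1 treats every class equally, so it drops sharply when a small class is missed, while weighted F1 stays closer to the majority-class score.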

### Choose 2 models to run. I chose Gradient Boosting & Voting (RF + LR + KNN)

In [21]:
results = []

# Evaluate Gradient Boosting model with 100 estimators (boosting stages)
evaluate_model(
    "Gradient Boosting (100)",
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)

# Define Voting (RF + LR + KNN) model
voting2 = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('lr', LogisticRegression(max_iter=2000)),
        ('knn', KNeighborsClassifier())
    ],
    voting='soft'
)

# Evaluate model
evaluate_model(
    "Voting (RF + LR + KNN)",
    voting2,
    X_train,
    y_train,
    X_test,
    y_test,
    results,
)


Gradient Boosting (100) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  3 247  14]
 [  0  16  27]]
Train Accuracy: 0.9601, Test Accuracy: 0.8562
Train F1 Score: 0.9584, Test F1 Score: 0.8411

Voting (RF + LR + KNN) Results
Confusion Matrix (Test):
[[  0  13   0]
 [  0 257   7]
 [  0  27  16]]
Train Accuracy: 0.9132, Test Accuracy: 0.8531
Train F1 Score: 0.8929, Test F1 Score: 0.8210
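
The confusion matrices above can also be read per class. The sketch below copies the two test-set matrices from the output and computes recall for each class, assuming scikit-learn's convention that rows are true labels and columns are predicted labels (0 = low, 1 = medium, 2 = high). The helper name `per_class_recall` is introduced here for illustration.

```python
import numpy as np

# Test-set confusion matrices copied from the output above
# (rows = true class, columns = predicted class).
cm_gb = np.array([[0, 13, 0], [3, 247, 14], [0, 16, 27]])
cm_voting = np.array([[0, 13, 0], [0, 257, 7], [0, 27, 16]])


def per_class_recall(cm):
    """Recall per class: diagonal (correct predictions) divided by each row total."""
    return cm.diagonal() / cm.sum(axis=1)


for name, cm in [("Gradient Boosting", cm_gb), ("Voting", cm_voting)]:
    recall = per_class_recall(cm)
    accuracy = cm.trace() / cm.sum()
    print(f"{name}: accuracy={accuracy:.4f}, recall low/medium/high={np.round(recall, 3)}")
```

Read this way, neither model correctly identifies any of the 13 low-quality test wines, and Gradient Boosting recovers more of the high-quality wines (27 of 43) than the Voting model does (16 of 43).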


## Section 6. Compare Results

In [23]:
# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Add gap columns based on actual keys
results_df["Accuracy Gap"] = results_df["Train Accuracy"] - results_df["Test Accuracy"]
results_df["F1 Gap"] = results_df["Train F1"] - results_df["Test F1"]

# Sort by test accuracy
results_df_sorted = results_df.sort_values(by="Test Accuracy", ascending=False)

# Display sorted results
print("\nSummary of All Models (Sorted by Test Accuracy, with Gaps):")
display(results_df_sorted)


Summary of All Models (Sorted by Test Accuracy, with Gaps):


|   | Model | Train Accuracy | Test Accuracy | Train F1 | Test F1 | Accuracy Gap | F1 Gap |
|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting (100) | 0.960125 | 0.85625 | 0.95841 | 0.841106 | 0.103875 | 0.117304 |
| 1 | Voting (RF + LR + KNN) | 0.913213 | 0.853125 | 0.892945 | 0.821034 | 0.060088 | 0.071911 |


## Section 7. Conclusion and Insights

### Comparing the Gradient Boosting and Voting (RF + LR + KNN) Models

**Gradient Boosting**
- Test Accuracy: 85.6%
- Test F1: 84.1%

**Conclusion**: The Gradient Boosting model performed slightly better than the Voting model on both Test Accuracy and Test F1. It fit the training data closely but showed signs of overfitting: its train accuracy was 96.0% versus 85.6% on the test set, a gap of about 10 percentage points.

**Voting (RF + LR + KNN)**
- Test Accuracy: 85.3%
- Test F1: 82.1%

**Conclusion**: Although the Voting model's scores were lower than the Gradient Boosting model's, it still performed well. Adding the gap columns showed where this model may have the edge: its train-test gaps are smaller (6.0 points for accuracy and 7.2 for F1, versus 10.4 and 11.7 for Gradient Boosting), which suggests it overfits less and may generalize better to unseen data. While the Gradient Boosting model's raw scores were better, the Voting model was more consistent between the training and test data.
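
A single 80/20 split gives only one estimate of how each model behaves on unseen data. One way to probe the consistency claim further would be stratified k-fold cross-validation, sketched below. This is an illustration under assumptions: it runs on synthetic three-class data from `make_classification` (named `X_demo` and `y_demo` here), not on the wine dataset, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for X and y (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=42
)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "Voting (RF + LR + KNN)": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
            ("lr", LogisticRegression(max_iter=2000)),
            ("knn", KNeighborsClassifier()),
        ],
        voting="soft",
    ),
}

# Five stratified folds keep class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean weighted F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On the real data, a lower standard deviation across folds would support the reading that a model is more consistent, while a single split cannot separate that from chance.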