# Comparing Models: Logistic Regression vs 1-Layer vs 3-Layer Neural Networks vs Random Forest
**Dataset**: Wine Quality (UCI)  
**Dataset Details**:  
- **Source**: UCI Machine Learning Repository (red wine variant)  
- **Samples**: 1,599 red wines with 11 physicochemical features  
- **Features**: Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol  
- **Target**: Binary quality label derived from the 0–10 quality score (0 = low quality, score ≤ 5; 1 = high quality, score > 5)  

Physicochemical properties may interact nonlinearly to determine quality. Logistic regression models the log-odds of the positive class as a linear function of the features, whereas neural networks and random forests can capture nonlinear relationships and feature interactions.
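
To make the linear-versus-nonlinear distinction concrete, here is a minimal sketch on a made-up XOR-style toy set (not the wine data), where the label depends only on the interaction of two features. The coarse grid search over weights is a crude stand-in for fitting a linear classifier, and all names are illustrative.

```python
from itertools import product

# XOR-style toy data: the label is 1 only when the two features differ
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

def linear_accuracy(w1, w2, b):
    """Accuracy of the linear rule 'predict 1 if w1*x1 + w2*x2 + b > 0'."""
    preds = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in points]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

# Try every weight/bias combination on a coarse grid from -2.0 to 2.0
grid = [i / 4 for i in range(-8, 9)]
best_linear = max(linear_accuracy(w1, w2, b)
                  for w1, w2, b in product(grid, repeat=3))

def interaction_rule(x1, x2):
    """A rule that uses the interaction between the two features."""
    return 1 if x1 != x2 else 0

rule_accuracy = sum(interaction_rule(*p) == t
                    for p, t in zip(points, labels)) / len(labels)

print(f"Best linear rule : {best_linear:.2f}")
print(f"Interaction rule : {rule_accuracy:.2f}")
```

No linear boundary classifies all four points correctly, while a rule that looks at both features jointly does. Whether the wine features behave this way is an empirical question that the experiments below probe.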

Work through the notebook cell by cell, executing each cell where code is already written.


In [None]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning

# Suppress only ConvergenceWarnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## 1️⃣ Load Wine Quality Dataset

- Load dataset from UCI URL
- Inspect class distribution
- Separate features (X) and labels (y)
- Show `df.head()`, shapes, and correlation matrix


In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, delimiter=';')
X = df.drop('quality', axis=1).values
y = (df['quality'] > 5).astype(int).values  # Binary: 1 if quality > 5, else 0

print(df.head())
print(f"Features shape: {X.shape}")
print("Label distribution:", np.unique(y, return_counts=True))

# Correlation matrix
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
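
When inspecting the class distribution, it helps to know the accuracy of a classifier that always predicts the most frequent class, since that is the floor every model should beat. The sketch below uses hypothetical labels rather than the wine data, and `majority_baseline` is an illustrative helper, not part of any library.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

toy_labels = [1, 1, 1, 0, 0]  # hypothetical labels, not the wine data
print(f"Majority-class baseline: {majority_baseline(toy_labels):.2f}")
```

In this notebook, `majority_baseline(y)` would give the corresponding reference value for the binarized wine labels.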

## 2️⃣ Train/Test Split & Normalization

- Split data into 70% train / 30% test
- Standardize features using the mean and standard deviation of the training set only

In [None]:
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
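
The scaler is fitted on the training split and then applied to both splits, so no test-set statistics leak into preprocessing. A minimal sketch of the same arithmetic for a single feature, using made-up values (not taken from the wine data):

```python
from statistics import mean, pstdev

# Made-up values for one feature (not taken from the wine data)
train_values = [1.0, 2.0, 3.0, 4.0]
test_values = [2.5, 5.0]

# Statistics are computed from the training split only
mu = mean(train_values)
sigma = pstdev(train_values)  # population std (ddof=0), as StandardScaler uses

# The same training-derived mu and sigma are applied to both splits
train_z = [(v - mu) / sigma for v in train_values]
test_z = [(v - mu) / sigma for v in test_values]

print("train:", [round(z, 3) for z in train_z])
print("test :", [round(z, 3) for z in test_z])
```

The standardized training values have mean 0 and unit variance by construction; the test values generally do not, and that is expected.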

## 3️⃣ Model Definitions

- Initialize Logistic Regression
- Initialize 1-layer MLP (32 neurons)
- Initialize 3-layer MLP (128,64,32)
- Initialize Random Forest (100 trees, out-of-bag scoring enabled)

In [None]:
# Logistic
log_reg = LogisticRegression(
    tol=1e-4, max_iter=10000, random_state=42)

# 1-layer NN
mlp_1layer = MLPClassifier(
    hidden_layer_sizes=(32,), activation='relu',
    solver='adam', tol=1e-4, max_iter=10000,
    random_state=42)

# 3-layer NN
mlp_3layer = MLPClassifier(
    hidden_layer_sizes=(128,64,32), activation='relu',
    solver='adam', tol=1e-4, max_iter=10000,
    random_state=42)

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100, oob_score=True,
    random_state=42)
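
One way to compare the capacity of the first three models is to count their trainable parameters (weights plus biases). The sketch below assumes fully connected layers, the 11 input features of this dataset, and a single output unit for the binary target; `count_parameters` is an illustrative helper, not a scikit-learn function.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Weights plus biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

n_features = 11  # physicochemical features in the wine data

params_logistic = count_parameters(n_features, ())           # no hidden layer
params_mlp1 = count_parameters(n_features, (32,))
params_mlp3 = count_parameters(n_features, (128, 64, 32))

print(f"Logistic regression: {params_logistic}")
print(f"MLP 1-layer        : {params_mlp1}")
print(f"MLP 3-layer        : {params_mlp3}")
```

The 3-layer network has far more parameters than there are training rows, which is worth keeping in mind when comparing its test accuracy with the simpler models.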

## 4️⃣ Train & Evaluate Once

- Fit each model on training data
- Predict on test set
- Compute and print test accuracy (and OOB for RF)

In [None]:
# Fit
log_reg.fit(X_train, y_train)
mlp_1layer.fit(X_train, y_train)
mlp_3layer.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Predict
y_log = log_reg.predict(X_test)
y_mlp1 = mlp_1layer.predict(X_test)
y_mlp3 = mlp_3layer.predict(X_test)
y_rf   = rf.predict(X_test)

# Accuracies
acc_log  = accuracy_score(y_test, y_log)
acc_mlp1 = accuracy_score(y_test, y_mlp1)
acc_mlp3 = accuracy_score(y_test, y_mlp3)
acc_rf   = accuracy_score(y_test, y_rf)

print(f"Logistic Test Acc       : {acc_log:.3f}")
print(f"MLP 1-layer Test Acc    : {acc_mlp1:.3f}")
print(f"MLP 3-layer Test Acc    : {acc_mlp3:.3f}")
print(f"Random Forest Test Acc  : {acc_rf:.3f}")
print(f"Random Forest OOB Score : {rf.oob_score_:.3f}")
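
The OOB score is available because each tree is trained on a bootstrap sample and never sees some of the training rows. Assuming each bootstrap sample draws as many rows as the training set, with replacement, the expected fraction of rows a given tree leaves out can be sketched as follows (`expected_oob_fraction` is an illustrative name):

```python
import math

def expected_oob_fraction(n_samples):
    """Probability that a given row is absent from one bootstrap sample
    of n_samples draws with replacement."""
    return (1 - 1 / n_samples) ** n_samples

for n in (10, 100, 1000):
    print(f"n = {n:>4}: {expected_oob_fraction(n):.4f}")

print(f"limit 1/e: {math.exp(-1):.4f}")
```

Roughly a third of the training rows are out-of-bag for any one tree, so the OOB score acts as a built-in validation estimate that does not touch the test set.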

## 5️⃣ Feature Importances (Random Forest)
Inspect the top-5 features by mean decrease in impurity from the RF model trained above. Feature importance quantifies how much each input feature contributes to a model’s predictions. For tree-based models such as random forests, one common approach is impurity-based importance (a.k.a. “Gini importance” or “Mean Decrease in Impurity”):

- Every time a tree node splits on feature j, the impurity (Gini or entropy for classification, variance for regression) is reduced.
- Those impurity reductions, each weighted by the fraction of samples reaching the node, are summed over all nodes in all trees where j is used, then normalized so that the importances sum to 1.
- Features with large total impurity reduction are deemed more “important.”

We use the RandomForestClassifier’s built-in attribute `feature_importances_` to measure this.
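
The impurity reduction of a single split can be sketched in a few lines. This illustrates the quantity described above on made-up binary labels, assuming Gini impurity; it is not scikit-learn's internal code, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity reduction of one split.

    n_total is the number of samples at the root, so the reduction is
    scaled by the fraction of samples that reach this node.
    """
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

parent = [0, 0, 0, 0, 1, 1, 1, 1]  # made-up labels at a root node

# A weak split leaves both children mixed
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1], n_total=8)
# A perfect split separates the classes completely
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1], n_total=8)

print(f"weak split   : {weak:.3f}")
print(f"perfect split: {perfect:.3f}")
```

A feature that repeatedly produces splits like the second one accumulates a large total reduction and therefore a high importance.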

In [None]:
# Use the random forest `rf` fitted in section 4
importances = rf.feature_importances_
feat_names = df.columns[:-1]
imp_df = pd.DataFrame({
    'feature': feat_names,
    'importance': importances
}).sort_values('importance', ascending=False)
print(imp_df.head(5))
sns.barplot(x='importance', y='feature', data=imp_df.head(5))
plt.title('Top-5 Feature Importances (RF)')
plt.show()

## 6️⃣ Mean Performance Over 10 Experiments

- Repeat split, train, eval with seeds 0–9
- Report mean test accuracy for all four models

In [None]:
acc_log_list  = []
acc_mlp1_list = []
acc_mlp3_list = []
acc_rf_list   = []

for seed in range(10):
    # split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    # scale
    scaler = StandardScaler().fit(X_tr)
    X_tr = scaler.transform(X_tr)
    X_te = scaler.transform(X_te)

    # models
    lr = LogisticRegression(max_iter=5000, tol=1e-4, random_state=seed)
    m1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=100000,
                      tol=1e-4, random_state=seed)
    m3 = MLPClassifier(hidden_layer_sizes=(128,64,32), max_iter=100000,
                      tol=1e-4, random_state=seed)
    rf_ = RandomForestClassifier(
        n_estimators=100, oob_score=False,
        random_state=seed)

    # fit
    lr.fit(X_tr, y_tr)
    m1.fit(X_tr, y_tr)
    m3.fit(X_tr, y_tr)
    rf_.fit(X_tr, y_tr)

    # predict
    a_lr = accuracy_score(y_te, lr.predict(X_te))
    a_m1 = accuracy_score(y_te, m1.predict(X_te))
    a_m3 = accuracy_score(y_te, m3.predict(X_te))
    a_rf = accuracy_score(y_te, rf_.predict(X_te))

    acc_log_list.append(a_lr)
    acc_mlp1_list.append(a_m1)
    acc_mlp3_list.append(a_m3)
    acc_rf_list.append(a_rf)

print("\nMean Accuracies over 10 runs:")
print(f"Logistic    : {np.mean(acc_log_list):.3f}")
print(f"MLP 1-layer : {np.mean(acc_mlp1_list):.3f}")
print(f"MLP 3-layer : {np.mean(acc_mlp3_list):.3f}")
print(f"RandomForest: {np.mean(acc_rf_list):.3f}")
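
A mean over 10 runs hides how much the accuracy varies from split to split. One possible extension is to report the standard deviation alongside the mean; the sketch below uses placeholder numbers that are not results from this notebook, and `summarize` is an illustrative helper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    return mean(accuracies), stdev(accuracies)

placeholder = [0.70, 0.72, 0.74]  # made-up numbers, NOT results from this notebook
m, s = summarize(placeholder)
print(f"{m:.3f} ± {s:.3f}")
```

Calling `summarize(acc_rf_list)` (and likewise for the other three lists) would show whether the gaps between models are large relative to the run-to-run spread.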

## 7️⃣ ROC and AUC

- Plot ROC Curves & compute AUC for each classifier

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# Dictionary of fitted models
models = {
    'Logistic Regression': log_reg,
    'MLP 1-layer': mlp_1layer,
    'MLP 3-layer': mlp_3layer,
    'Random Forest': rf
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    # get probability estimates for the positive class
    y_proba = model.predict_proba(X_test)[:, 1]
    # compute ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    # compute AUC
    auc = roc_auc_score(y_test, y_proba)
    # plot
    plt.plot(fpr, tpr, lw=2, label=f"{name} (AUC = {auc:.3f})")

# plot the random-chance diagonal
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Chance')

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves for Wine Quality Classifiers")
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
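
AUC has a direct interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch of that pairwise definition on made-up scores (the function name is illustrative, and the quadratic loop is only suitable for small inputs):

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]            # made-up labels
toy_scores = [0.1, 0.4, 0.35, 0.8]   # made-up predicted probabilities
print(f"AUC = {auc_by_ranking(toy_labels, toy_scores):.2f}")
```

Three of the four positive-negative pairs are ordered correctly here. Unlike accuracy, this measure does not depend on the 0.5 decision threshold used by `predict`.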