#### Classification comparison
Anna Chen

#### Build Models with SciKit-Learn
Compare the K-Nearest Neighbors, Gaussian Naïve Bayes, and Decision Tree classifiers using the Wine Quality 
dataset, linked here: https://archive.ics.uci.edu/dataset/186/wine+quality.

1. Preprocess the data and binarize the target (quality) attribute: good wine has a quality greater than or equal to 6 and bad wine is less than or equal to 5. 
2. Split the dataset into train-test-validation or split train-test with cross-validation. 
3. Perform a thorough hyperparameter search on each model. 
4. Compare the three models based on their performance with the testing set. 
    - Each model should be assessed with a confusion matrix, accuracy, precision, recall, and F1-score.

#### Load and process the data

In [1]:
# import the needed libraries

# For data processing and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# For splitting data and training models
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [2]:
# load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=';')
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [3]:
# Check for missing values 
missing_values = df.isnull().sum() 
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [4]:
# At a glance, there aren't any missing values.

In [5]:
# Check for duplicated rows
duplicates = df[df.duplicated()]
print("Duplicated Rows:\n", duplicates)

Duplicated Rows:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
4               7.4             0.700         0.00            1.90      0.076   
11              7.5             0.500         0.36            6.10      0.071   
27              7.9             0.430         0.21            1.60      0.106   
40              7.3             0.450         0.36            5.90      0.074   
65              7.2             0.725         0.05            4.65      0.086   
...             ...               ...          ...             ...        ...   
1563            7.2             0.695         0.13            2.00      0.076   
1564            7.2             0.695         0.13            2.00      0.076   
1567            7.2             0.695         0.13            2.00      0.076   
1581            6.2             0.560         0.09            1.70      0.053   
1596            6.3             0.510         0.13            2.30      0.076   

      fre

In [6]:
# Two different wines are unlikely to share exactly the same feature values and quality by chance.
# Only 240 rows are exact duplicates, so removing them doesn't affect the size of the dataset much.
# Drop duplicate rows
df = df.drop_duplicates()
print(df.shape)

(1359, 12)


In [7]:
# Binarize the target attribute
# quality greater than or equal to 6: good wine (class 1)
# quality less than or equal to 5: bad wine (class 0)
df.loc[:, 'quality'] = np.where(df['quality'] >= 6, 1, 0)  # 1 for good, 0 for bad
df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,0
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,0
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,0
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,1
5,7.4,0.660,0.00,1.8,0.075,13.0,40.0,0.99780,3.51,0.56,9.4,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1593,6.8,0.620,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,1
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,0
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,1
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,0


Note:
According to Copilot (Microsoft):
- KNN: Works well with continuous data; can use categorical data with encoding.
- Gaussian Naïve Bayes: Best suited for continuous numerical data.
- Decision Trees: Flexible and can handle both continuous and categorical data.

So aside from binarizing the target "quality", I will leave the features as is.

In [8]:
# Split features and target
# X holds the features, y holds the target "quality"
X = df.drop('quality', axis=1)
y = df['quality']

In [9]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#### Split the data

In [10]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
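
One caveat about the cells above: the scaler is fit on the full dataset before the split, so statistics from the test rows influence the scaling. The sketch below shows one possible way to avoid that, by splitting first and putting the scaler inside a `Pipeline` so it is refit within each cross-validation fold. The synthetic data, the names `X_demo`, `y_demo`, `pipe`, and `search`, and the use of `stratify` are my assumptions for illustration; this is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and binarized target (assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is refit on the training portion of every CV fold,
# so neither the validation folds nor the test set affect the scaling.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
search = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
search.fit(X_tr, y_tr)

# Accuracy of the refit best pipeline on the held-out test rows.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 2))
```

With real data, the same pattern would apply to the other two classifiers by swapping the last pipeline step and its parameter grid.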

#### Train the model
For each model, perform hyperparameter tuning using grid search or random search with cross-validation.

#### KNN
- n_neighbors (K): 3, 5, 7, and 9 are common options. With two classes and uniform weights, an odd K avoids a tie when voting.
- weights: The list ['uniform', 'distance'] specifies how to weigh the contributions of the neighbors.
    - uniform: every neighbor's vote counts equally.
    - distance: closer neighbors get a higher weight (the inverse of their distance).
- GridSearchCV is for hyperparameter tuning.
    - It tests every combination of n_neighbors and weights to find the best one.
- cv=5: Specifies 5-fold cross-validation, meaning the training data is split into 5 parts and each part is used once to validate a model trained on the other four.
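
To make the difference between the two `weights` options concrete, here is a small sketch on toy data that I made up for illustration (three one-dimensional training points and one query point; none of it comes from the wine dataset). With K=3 all three points vote, so the two weighting schemes can disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (assumption): one class-1 point close to the query,
# two class-0 points farther away.
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([1, 0, 0])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# uniform: two votes for class 0 against one vote for class 1.
uniform_pred = uniform_knn.predict(query)[0]
# distance: class 1 gets 1/0.5 = 2.0, class 0 gets 1/1.5 + 1/1.6 (about 1.29).
distance_pred = distance_knn.predict(query)[0]

print(uniform_pred, distance_pred)  # prints: 0 1
```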

In [11]:
# K-Nearest Neighbors
knn_params = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
knn = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5)
knn.fit(X_train, y_train)

#### Naïve Bayes
Here, Gaussian Naive Bayes is used. This works best with continuous data.
##### Hyperparameters:
- var_smoothing: The fraction of the largest feature variance that is added to every variance. It avoids division by values close to zero and helps numerical stability.
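
As a rough illustration of what `var_smoothing` does, the sketch below fits `GaussianNB` on synthetic data (an assumption, not the wine data) and compares the fitted `epsilon_` attribute, the absolute value added to the variances, with `var_smoothing` times the largest feature variance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-feature data (assumption); the second feature has the larger variance.
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=[1.0, 3.0], size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = GaussianNB(var_smoothing=0.1).fit(X_demo, y_demo)

# epsilon_ is the absolute amount added to every per-class variance.
expected_eps = 0.1 * X_demo.var(axis=0).max()
print(model.epsilon_, expected_eps)
```

A larger `var_smoothing` therefore widens every class distribution, which smooths the decision boundary.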

In [12]:
# Create the GaussianNB model
gnb = GaussianNB()

# Define the parameter grid for var_smoothing
param_grid = {
    'var_smoothing': np.logspace(0, -9, num=100)  # Try values from 1 to 1e-9
}

# Setup GridSearchCV
grid_search_gnb = GridSearchCV(estimator=gnb, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search_gnb.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search_gnb.best_params_
best_score = grid_search_gnb.best_score_

print(f"Best parameters: {best_params}")
print(f"Best score: {best_score}")


Best parameters: {'var_smoothing': np.float64(0.1)}
Best score: 0.7285756563649431


#### Decision Tree
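##### Hyperparameters:
- max_depth: The maximum depth of the tree (None lets the tree grow until the leaves are pure or too small to split).
- min_samples_split: The minimum number of samples a node needs before it can be split.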
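
The effect of these two settings can be seen by comparing an unrestricted tree with a restricted one. This is a minimal sketch on synthetic data (an assumption, not the wine data); the names `full_tree` and `capped_tree` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (assumption) so the unrestricted tree grows deep.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] * X_demo[:, 1] + 0.3 * rng.normal(size=300) > 0).astype(int)

# No limits: the tree keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Limited depth, and nodes with fewer than 10 samples are not split.
capped_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
).fit(X_demo, y_demo)

print("full tree:  ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("capped tree:", capped_tree.get_depth(), "levels,", capped_tree.get_n_leaves(), "leaves")
```

A shallower tree is less likely to memorize noise in the training data, which is why both settings are worth searching over.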

In [13]:
# Decision Tree
dt_params = {'max_depth': [3, 5, 10, None], 'min_samples_split': [2, 5, 10]}
dt = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5)
dt.fit(X_train, y_train)

#### Evaluate Models
Evaluation metrics:
- Accuracy
    - (Number of correctly predicted instances) / (Total number of instances)
    - Example: An accuracy of 0.75 means that 75% of the total predictions made by the model are correct.
- Precision
    - (Number of true positive predictions) / (Number of all positive predictions)
    - Example: A precision of 0.75 means that 75% of the instances predicted as positive by the model are actually positive.
- Recall
    - (Number of true positive predictions) / (Number of all actual positive instances, i.e. true positives + false negatives)
- F1-Score
    - The harmonic mean of precision and recall, providing a balance between the two. 
- Confusion Matrix
    - Rows: Represent the actual classes.
    - Columns: Represent the predicted classes. 
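
As a worked example of these definitions, the sketch below recomputes the four scores by hand from the KNN confusion matrix reported in the results of this notebook (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# KNN confusion matrix as reported in this notebook's results.
cm = np.array([[98, 37],
               [30, 107]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")  # prints: 0.75 0.74 0.78 0.76
```

The values match what `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` report for KNN.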

In [14]:
# A function to help evaluate the model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    return acc, precision, recall, f1, cm

In [15]:
# The classification models and their estimators
# Use the best estimators found by GridSearchCV for each model in the evaluation loop.
models = {
    'KNN': knn.best_estimator_,
    'Gaussian Naive Bayes': grid_search_gnb.best_estimator_,
    'Decision Tree': dt.best_estimator_,
}

# Store the test-set evaluation results of each tuned model
results = {}
for name, model in models.items():
    acc, prec, rec, f1, cm = evaluate_model(model, X_test, y_test)
    results[name] = {'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1-Score': f1, 'Confusion Matrix': cm}

In [16]:
# Define class names based on unique labels in y_test 
class_names = np.unique(y_test)
# Ensures that class_names contains all unique labels from the test set.
# For the wine review: 1 for good, 0 for bad

# Print the results
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"Accuracy: {metrics['Accuracy']:.2f}")
    print(f"Precision: {metrics['Precision']:.2f}")
    print(f"Recall: {metrics['Recall']:.2f}")
    print(f"F1-Score: {metrics['F1-Score']:.2f}")

    # Convert confusion matrix to DataFrame for better display
    cm_df = pd.DataFrame(metrics['Confusion Matrix'], index=class_names, columns=class_names)
    print("Confusion Matrix:")
    print(cm_df)

    print("\n-----------------\n")

Model: KNN
Accuracy: 0.75
Precision: 0.74
Recall: 0.78
F1-Score: 0.76
Confusion Matrix:
    0    1
0  98   37
1  30  107

-----------------

Model: Gaussian Naive Bayes
Accuracy: 0.74
Precision: 0.72
Recall: 0.79
F1-Score: 0.75
Confusion Matrix:
    0    1
0  93   42
1  29  108

-----------------

Model: Decision Tree
Accuracy: 0.73
Precision: 0.70
Recall: 0.82
F1-Score: 0.75
Confusion Matrix:
    0    1
0  87   48
1  25  112

-----------------



#### Comparison according to the result
All three classification models have similar accuracy and precision. The Decision Tree has a slightly lower accuracy of 0.73 and a precision of 0.70, while the other two models score 0.74 to 0.75 in accuracy and 0.72 to 0.74 in precision. However, the Decision Tree achieves a slightly higher recall of 0.82 compared to 0.79 for Gaussian Naive Bayes and 0.78 for KNN. Despite these differences, the F1-score is about the same across all three models at 0.75 to 0.76.

In terms of the confusion matrix, all three models exhibit similar performance in correctly classifying both classes, though the Decision Tree tends to overclassify the positive class slightly more than the others. These metrics suggest that the models may need further hyperparameter tuning or additional training data: an accuracy of around 0.75 is not that good, and the models are hard to apply elsewhere when roughly one out of four predictions is wrong.
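
The two observations above can be checked directly from the reported confusion matrices. The sketch below counts how many test wines each model predicts as positive (the column for class 1) and computes the share of wrong predictions; the dictionary name `summary` is just for illustration.

```python
import numpy as np

# Confusion matrices as reported above (rows: actual, columns: predicted).
matrices = {
    "KNN": np.array([[98, 37], [30, 107]]),
    "Gaussian Naive Bayes": np.array([[93, 42], [29, 108]]),
    "Decision Tree": np.array([[87, 48], [25, 112]]),
}

summary = {}
for name, cm in matrices.items():
    summary[name] = {
        "predicted_positive": int(cm[:, 1].sum()),  # FP + TP
        "actual_positive": int(cm[1, :].sum()),     # FN + TP
        "error_rate": float(1 - np.trace(cm) / cm.sum()),
    }

for name, row in summary.items():
    print(f"{name}: {row['predicted_positive']} predicted positive "
          f"vs {row['actual_positive']} actual, error rate {row['error_rate']:.2f}")
```

The Decision Tree predicts 160 positives against 137 actual positives, which is the over-prediction noted above, and every model is wrong on roughly a quarter of the test set.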

#### AI usage
The code portion was generated by AI tools such as ChatGPT and Copilot (Microsoft). These tools assisted in creating a base code and providing insights, while I handled troubleshooting, breaking the program into sections, adding documentation, and asking targeted questions to refine the solution. This collaborative approach helped save time and improve the quality of the work.