# Wine Quality Analysis Exercise Solution

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).

## Overview

This module builds predictive models that classify wine quality (low, medium, or high) from the wines' other features, following the standard machine learning classification pipeline.

## Learning Objectives

- Build and evaluate predictive models for wine quality classification
- Apply and compare different machine learning algorithms:
  - Decision Trees
  - Random Forests
  - Extreme Gradient Boosting
- Interpret model results using:
  - Feature importance analysis
  - ROC curves
  - Decision surfaces
  - SHAP dependence plots

### Tasks to complete

- Train and evaluate models using:
  - Decision Trees
  - Random Forests
  - XGBoost
- Generate model interpretations and visualizations
- Compare model performances

## Prerequisites

- Python programming environment
- Basic understanding of statistical and machine learning concepts
- Familiarity with common ML libraries

## Get Started

- In the SageMaker notebook instance, select the kernel "conda_tensorflow2_p310".

In [None]:
%pip install xgboost graphviz shap lime

In [None]:
# Import necessary dependencies
import os
import warnings
from collections import Counter

import lime
import lime.lime_tabular
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from graphviz import Source
from IPython.display import Image
from numpy import interp
from sklearn import metrics, tree
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.tree import DecisionTreeClassifier

# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Set matplotlib inline
%matplotlib inline

## Load and merge datasets


In [None]:
# Define data paths
white_wine_path = "../../Data/winequality-white.csv"
red_wine_path = "../../Data/winequality-red.csv"

# Load data
try:
    white_wine = pd.read_csv(white_wine_path, sep=";")
    red_wine = pd.read_csv(red_wine_path, sep=";")
except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    # Re-raise to stop this cell; calling exit() would shut down the notebook kernel
    raise

# Preprocessing functions
def preprocess_wine_data(wine_df, color):
    wine_df["wine_type"] = color
    wine_df["quality_label"] = wine_df["quality"].apply(
        lambda v: "low" if v <= 5 else "medium" if v <= 7 else "high"
    )
    wine_df["quality_label"] = pd.Categorical(
        wine_df["quality_label"], categories=["low", "medium", "high"]
    )
    return wine_df

# Preprocess data
red_wine = preprocess_wine_data(red_wine, "red")
white_wine = preprocess_wine_data(white_wine, "white")

# Combine datasets
wines = pd.concat([red_wine, white_wine], ignore_index=True)
wines = wines.sample(frac=1, random_state=42).reset_index(drop=True)

### Understand dataset features and values


In [None]:
print(white_wine.shape, red_wine.shape)
print(wines.info())

We have 4,898 white wine samples and 1,599 red wine samples. The merged dataset contains 6,497 samples in total, and the output of `wines.info()` shows which attributes are numeric and which are categorical.
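
The `quality_label` column created in the loading cell above comes from two thresholds on the integer quality score. As a small illustration, the sketch below applies the same thresholds to a handful of made-up scores; `quality_to_label` and `sample_scores` are names introduced here for illustration only and are not part of the notebook.

```python
from collections import Counter


def quality_to_label(score):
    """Map a quality score to the three classes used in this notebook."""
    if score <= 5:
        return "low"
    if score <= 7:
        return "medium"
    return "high"


# Hypothetical scores, chosen only to exercise each threshold
sample_scores = [3, 5, 6, 7, 8, 9]
sample_labels = [quality_to_label(s) for s in sample_scores]
label_counts = Counter(sample_labels)

print(sample_labels)
print(label_counts)
```

Because "high" only covers scores above 7, it is natural for that class to be much smaller than the other two in the real data.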


In [None]:
# Let’s take a peek at our dataset to see some sample data points.
wines.head()

## Utility functions for model evaluation


In [None]:
def get_metrics(true_labels, predicted_labels):
    print(
        "Accuracy:", np.round(metrics.accuracy_score(true_labels, predicted_labels), 4)
    )
    print(
        "Precision:",
        np.round(
            metrics.precision_score(true_labels, predicted_labels, average="weighted"),
            4,
        ),
    )
    print(
        "Recall:",
        np.round(
            metrics.recall_score(true_labels, predicted_labels, average="weighted"), 4
        ),
    )
    print(
        "F1 Score:",
        np.round(
            metrics.f1_score(true_labels, predicted_labels, average="weighted"), 4
        ),
    )


def display_classification_report(true_labels, predicted_labels, classes=None): # classes can be None
    report = metrics.classification_report(y_true=true_labels, y_pred=predicted_labels, labels=classes)
    print(report)


def display_confusion_matrix(true_labels, predicted_labels, classes=[1, 0]):
    total_classes = len(classes)
    level_labels = [total_classes * [0], list(range(total_classes))]
    cm = metrics.confusion_matrix(
        y_true=true_labels, y_pred=predicted_labels, labels=classes
    )
    cm_frame = pd.DataFrame(
        data=cm,
        columns=pd.MultiIndex(levels=[["Predicted:"], classes], codes=level_labels),
        index=pd.MultiIndex(levels=[["Actual:"], classes], codes=level_labels),
    )
    print(cm_frame)


def display_model_performance_metrics(true_labels, predicted_labels, classes=[1, 0]):
    print("Model Performance metrics:")
    print("-" * 30)
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print("\nModel Classification report:")
    print("-" * 30)
    display_classification_report(
        true_labels=true_labels, predicted_labels=predicted_labels, classes=classes
    )
    print("\nPrediction Confusion Matrix:")
    print("-" * 30)
    display_confusion_matrix(
        true_labels=true_labels, predicted_labels=predicted_labels, classes=classes
    )

def plot_model_roc_curve(
    clf, features, true_labels, label_encoder=None, class_names=None
):
    ## Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    if hasattr(clf, "classes_"):
        class_labels = clf.classes_
    elif label_encoder:
        class_labels = label_encoder.classes_
    elif class_names:
        class_labels = class_names
    else:
        raise ValueError(
            "Unable to derive prediction classes, please specify class_names!"
        )
    n_classes = len(np.unique(true_labels)) # directly get the number of classes
    y_test = label_binarize(true_labels, classes=np.unique(true_labels)) # binarize based on actual classes
    if n_classes == 2:
        if hasattr(clf, "predict_proba"):
            prob = clf.predict_proba(features)
            y_score = prob[:, prob.shape[1] - 1]
        elif hasattr(clf, "decision_function"):
            # For binary problems decision_function returns a 1-D array of scores
            y_score = clf.decision_function(features)
        else:
            raise AttributeError(
                "Estimator doesn't have a probability or confidence scoring system!"
            )

        fpr, tpr, _ = roc_curve(y_test, y_score)
        roc_auc = auc(fpr, tpr)
        plt.plot(
            fpr,
            tpr,
            label="ROC curve (area = {0:0.2f})" "".format(roc_auc),
            linewidth=2.5,
        )

    elif n_classes > 2:
        if hasattr(clf, "predict_proba"):
            y_score = clf.predict_proba(features)
        elif hasattr(clf, "decision_function"):
            y_score = clf.decision_function(features)
        else:
            raise AttributeError(
                "Estimator doesn't have a probability or confidence scoring system!"
            )

        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])

        ## Compute micro-average ROC curve and ROC area
        fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
        roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

        ## Compute macro-average ROC curve and ROC area
        # First aggregate all false positive rates
        all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
        # Then interpolate all ROC curves at these points
        mean_tpr = np.zeros_like(all_fpr)
        for i in range(n_classes):
            mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])  # Use numpy.interp instead of scipy.interp
        # Finally average it and compute AUC
        mean_tpr /= n_classes
        fpr["macro"] = all_fpr
        tpr["macro"] = mean_tpr
        roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

        ## Plot ROC curves
        plt.figure(figsize=(6, 4))
        plt.plot(
            fpr["micro"],
            tpr["micro"],
            label="micro-average ROC curve (area = {0:0.2f})" "".format(
                roc_auc["micro"]
            ),
            linewidth=3,
        )

        plt.plot(
            fpr["macro"],
            tpr["macro"],
            label="macro-average ROC curve (area = {0:0.2f})" "".format(
                roc_auc["macro"]
            ),
            linewidth=3,
        )

        for i, label in enumerate(class_labels):
            plt.plot(
                fpr[i],
                tpr[i],
                label="ROC curve of class {0} (area = {1:0.2f})" "".format(
                    label, roc_auc[i]
                ),
                linewidth=2,
                linestyle=":",
            )
    else:
        raise ValueError("Number of classes should be at least 2")

    plt.plot([0, 1], [0, 1], "k--")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend(loc="lower right")
    plt.show()




def plot_model_decision_surface(
    clf,
    train_features,
    train_labels,
    plot_step=0.02,
    cmap=plt.cm.RdYlBu,
    markers=None,
    alphas=None,
    colors=None,
):
    if train_features.shape[1] != 2:
        raise ValueError("train_features should have exactly 2 columns!")

    x_min, x_max = (
        train_features[:, 0].min() - plot_step,
        train_features[:, 0].max() + plot_step,
    )
    y_min, y_max = (
        train_features[:, 1].min() - plot_step,
        train_features[:, 1].max() + plot_step,
    )
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
    )

    clf_est = clone(clf)
    clf_est.fit(train_features, train_labels)
    if hasattr(clf_est, "predict_proba"):
        Z = clf_est.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    else:
        Z = clf_est.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=cmap)

    le = LabelEncoder()
    y_enc = le.fit_transform(train_labels)
    n_classes = len(le.classes_)
    plot_colors = "".join(colors) if colors else [None] * n_classes
    label_names = le.classes_
    markers = markers if markers else [None] * n_classes
    alphas = alphas if alphas else [None] * n_classes
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_enc == i)
        plt.scatter(
            train_features[idx, 0],
            train_features[idx, 1],
            c=color,
            label=label_names[i],
            cmap=cmap,
            edgecolors="black",
            marker=markers[i],
            alpha=alphas[i],
        )
    plt.legend()
    plt.show()

## Predicting Wine Quality

We will predict the wine quality ratings based on other features.

### Prepare features

#### Feature selection

We first select the features, separate out the class labels, and prepare the train and test datasets. Variables use the prefix **wqp\_** (short for wine quality prediction) so that they are easy to identify.


In [None]:
# Prepare features and labels
wqp_features = wines.iloc[:, :-3]
wqp_class_labels = wines["quality_label"]
wqp_feature_names = list(wqp_features.columns)

# Train-test split
wqp_train_X, wqp_test_X, wqp_train_y, wqp_test_y = train_test_split(
    wqp_features, wqp_class_labels, test_size=0.3, random_state=42, stratify=wqp_class_labels
)

print(Counter(wqp_train_y), Counter(wqp_test_y))
print("Features:", wqp_feature_names)

The counts show the number of wine samples per class in the train and test sets, followed by the feature names that make up our feature set.
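
The split above passes `stratify=wqp_class_labels`, which keeps the class proportions roughly equal in the train and test sets. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical and exist only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels
toy_labels = ["low"] * 40 + ["medium"] * 50 + ["high"] * 10
train_idx, test_idx = stratified_split(toy_labels)
train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)

print(train_counts)
print(test_counts)
```

Without stratification, a small class such as "high" could by chance end up over- or under-represented in the test set, which would make the per-class metrics harder to trust.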


#### Feature Scaling

We use a standard scaler here. It rescales each feature to zero mean and unit variance, using statistics learned from the training set only.
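
As a rough sketch of what standardization does, the example below learns the mean and the population standard deviation from made-up training values and then applies them to both the training and the test values. The helper names and the numbers are assumptions for illustration; in the notebook itself `StandardScaler` does this for every feature column.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical alcohol values
train_alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
test_alcohol = [11.0, 14.0]

mean, std = fit_standardizer(train_alcohol)
train_scaled = standardize(train_alcohol, mean, std)
# The test set reuses the training statistics; it is never used to fit them
test_scaled = standardize(test_alcohol, mean, std)

print(train_scaled)
print(test_scaled)
```

Fitting the statistics on the training data only avoids leaking information about the test set into the model.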


In [None]:
# Standardize features
wqp_ss = StandardScaler()
wqp_train_SX = wqp_ss.fit_transform(wqp_train_X)
wqp_test_SX = wqp_ss.transform(wqp_test_X)

### Train, Predict & Evaluate Model using Decision Tree

The main advantage of decision tree based models is interpretability: it is quite easy to understand the decision rules that led to a specific prediction. Other advantages include the ability to handle both categorical and numeric data with ease, as well as multi-class classification problems. Trees can even be visualized to understand and interpret the decision rules better.


#### Train the model using DecisionTreeClassifier


In [None]:
# --------------------------
# Decision Tree
# --------------------------
print("\nTraining Decision Tree...")
wqp_dt = DecisionTreeClassifier(random_state=42)
wqp_dt.fit(wqp_train_SX, wqp_train_y)

#### Evaluate model performance


In [None]:
# Evaluate Decision Tree
wqp_dt_predictions = wqp_dt.predict(wqp_test_SX)
print("\nDecision Tree Performance:")
display_model_performance_metrics(wqp_test_y, wqp_dt_predictions, wqp_dt.classes_)

We get an overall F1 Score and model accuracy of approximately 72%.
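
The overall precision, recall, and F1 score reported by `get_metrics` use `average="weighted"`, i.e. each class's score is weighted by the number of true samples in that class. The sketch below illustrates this for the F1 score on made-up labels; `per_class_f1` and `weighted_f1` are hypothetical helpers written for this example, not the scikit-learn functions.

```python
from collections import Counter


def per_class_f1(true_labels, predicted_labels):
    """Compute the F1 score of each class from true and predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_f1(true_labels, predicted_labels):
    """Average the per-class F1 scores, weighted by class support."""
    support = Counter(true_labels)
    scores = per_class_f1(true_labels, predicted_labels)
    return sum(scores[c] * n for c, n in support.items()) / len(true_labels)


# Hypothetical labels: the single "high" sample is misclassified
y_true = ["low", "low", "low", "medium", "medium", "high"]
y_pred = ["low", "low", "medium", "medium", "medium", "medium"]

f1_scores = per_class_f1(y_true, y_pred)
overall_f1 = weighted_f1(y_true, y_pred)
print(f1_scores)
print(round(overall_f1, 4))
```

Because the weights follow class size, a poorly predicted small class lowers the weighted score only a little, which is why the per-class report is worth reading alongside it.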

Looking at the class-based statistics, we can see that recall for the high quality wine samples is poor, since many of them have been misclassified as medium or low quality. This is expected, because the training set contains few high quality samples, as the training sample sizes printed earlier show. Considering the low and high quality samples, we should at least try to prevent the model from predicting a low quality wine as high, and similarly from predicting a high quality wine as low.
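
One way to check both points is to compute recall per class and to count the "severe" mistakes, where a low quality wine is predicted as high or the other way round. The sketch below does this on made-up labels; the function names and the data are assumptions for illustration and are not part of the notebook's utility functions.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    support = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {c: hits[c] / support[c] for c in support}


def count_severe_errors(true_labels, predicted_labels):
    """Count low-as-high and high-as-low predictions."""
    severe = {("low", "high"), ("high", "low")}
    return sum((t, p) in severe for t, p in zip(true_labels, predicted_labels))


# Hypothetical labels with few "high" samples
toy_true = ["low"] * 4 + ["medium"] * 4 + ["high"] * 2
toy_pred = [
    "low", "low", "low", "medium",
    "medium", "medium", "medium", "low",
    "medium", "low",
]

recalls = per_class_recall(toy_true, toy_pred)
severe_errors = count_severe_errors(toy_true, toy_pred)
print(recalls)
print(severe_errors)
```

With the real predictions, the same two functions could be called with `wqp_test_y` and `wqp_dt_predictions`.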


#### Model Interpretation


##### Visualize Feature Importances from Decision Tree Model


In [None]:
wqp_dt_feature_importances = wqp_dt.feature_importances_
wqp_dt_feature_names, wqp_dt_feature_scores = zip(
    *sorted(zip(wqp_feature_names, wqp_dt_feature_importances), key=lambda x: x[1])
)
y_position = list(range(len(wqp_dt_feature_names)))
plt.barh(y_position, wqp_dt_feature_scores, height=0.6, align="center")
plt.yticks(y_position, wqp_dt_feature_names)
plt.xlabel("Relative Importance Score")
plt.ylabel("Feature")
t = plt.title("Feature Importances for Decision Tree")

The most important features have clearly changed compared with our previous model. _Alcohol_ and _volatile acidity_ occupy the top two ranks, and _total sulfur dioxide_ appears to be one of the most important features for classifying both wine type and quality.


##### Visualize the Decision Tree


In [None]:
# Visualize Decision Tree
graph = Source(tree.export_graphviz(
    wqp_dt,
    out_file=None,
    feature_names=wqp_feature_names,
    class_names=wqp_dt.classes_,
    filled=True,
    rounded=True,
    special_characters=True,
    max_depth=3
))
png_data = graph.pipe(format='png')
Image(png_data)

Our decision tree model has a huge number of nodes and branches, so we visualized the tree only to a maximum depth of three.

You can start reading the decision rules from the figure. The starting split is determined by the rule alcohol <= -0.128 (the threshold is on the standardized feature scale), and each yes/no branch leads to further decision nodes as we descend the tree. The class variable is what we are trying to predict, i.e. wine quality being low, medium, or high, and value gives the number of samples of each class present in the current decision node.

The gini value is the criterion used to measure the quality of the split at each decision node. Best splits can be determined by metrics like Gini impurity (also called the Gini index) or information gain. Gini impurity is a metric that helps in minimizing the probability of misclassification.
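
To make the criterion concrete, Gini impurity for a node is one minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its children. The sketch below is a minimal illustration on made-up labels; the function names are hypothetical and this is not the scikit-learn implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical node with two equally frequent classes
parent = ["low"] * 4 + ["medium"] * 4
left_child = ["low"] * 3 + ["medium"] * 1
right_child = ["low"] * 1 + ["medium"] * 3

parent_gini = gini_impurity(parent)
split_gini = weighted_split_impurity(left_child, right_child)
print(parent_gini, split_gini)
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent; a node that contains a single class has impurity zero.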


### Train, Predict & Evaluate Model using Random Forests

In the random
forest model, each base learner is a decision tree model trained on a bootstrap sample of the training data.
Besides this, when we want to split a decision node in the tree, the split is chosen from a random subset of all
the features instead of taking the best split from all the features.
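
The two sources of randomness described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not how `RandomForestClassifier` is implemented internally: `bootstrap_indices` draws row indices with replacement for one tree, and `random_feature_subset` picks the candidate features for one split, using the square root of the feature count as with `max_features="sqrt"`. Both helper names are hypothetical.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct candidate features for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)
n_samples = 10   # hypothetical, tiny training set
n_features = 11  # number of feature columns used in this notebook

sample_idx = bootstrap_indices(n_samples, rng)
candidate_features = random_feature_subset(n_features, rng)
print(sample_idx)
print(candidate_features)
```

Because rows are drawn with replacement, some rows typically appear more than once in a bootstrap sample and others not at all, so each tree sees a slightly different dataset.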


#### Train the model using RandomForestClassifier


In [None]:
# --------------------------
# Random Forest
# --------------------------
print("\nTraining Random Forest...")
wqp_rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=42)
wqp_rf.fit(wqp_train_SX, wqp_train_y)

#### Evaluate model performance


In [None]:
# Evaluate Random Forest
wqp_rf_predictions = wqp_rf.predict(wqp_test_SX)
print("\nRandom Forest Performance:")
display_model_performance_metrics(wqp_test_y, wqp_rf_predictions, wqp_rf.classes_)

The model prediction results on the test dataset show an overall F1 score and model accuracy of approximately 80%. This is a clear improvement over the roughly 72% we obtained with a single decision tree, which indicates that ensemble learning works better here.


#### Hyperparameter tuning with Grid Search & Cross Validation

Another way to improve on this result is model tuning, specifically tuning the model's hyperparameters.

Hyperparameters, also known as meta-parameters, are set before the model training process starts. They are not learned from the underlying data on which the model is trained. Instead, they usually represent high-level concepts or knobs that can be used to tweak the model during training to improve its performance. Our random forest model has several hyperparameters, as shown below.


In [None]:
print(wqp_rf.get_params())

##### Get the best hyperparameter values using Grid Search

TODO: this fit gives quite a few warnings/errors.


In [None]:
# Optimized Grid Search for Random Forest
param_grid = {
    "n_estimators": [100, 200, 500], # Removed 300
    "max_features": ["sqrt", "log2"], # Removed None, often not beneficial
    "min_samples_split": [2, 5, 10],   # Added regularization parameters
    "min_samples_leaf": [1, 2, 4]
}

wqp_clf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1) # n_jobs for parallel processing
wqp_clf.fit(wqp_train_SX, wqp_train_y)
print(wqp_clf.best_params_)

In our run, the search selected 500 estimators and `sqrt` for the maximum features, so the number of features considered at each split is the square root of the total number of features.


#### View grid search results


In [None]:
results = wqp_clf.cv_results_
for param, score_mean, score_sd in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(param, round(score_mean, 4), round(score_sd, 4))

The output lists each hyperparameter combination in the grid with its mean accuracy and standard deviation across the cross-validation folds.
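
To get a feel for how much work the random forest grid search does, the sketch below enumerates every combination of the grid used above with `itertools.product` and multiplies by the five cross-validation folds. It only counts candidates and does not train anything; the variable names other than `param_grid` are introduced here for illustration.

```python
from itertools import product

# Same grid as the random forest search above
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

names = list(param_grid)
# One dict per hyperparameter combination
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
total_cv_fits = len(candidates) * n_folds

print(len(candidates), total_cv_fits)
print(candidates[0])
```

Every value added to one of the lists multiplies the number of candidates, which is why grid searches can become slow quickly.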


### Train, Predict & Evaluate Random Forest Model with tuned hyperparameters


In [None]:
# Refit with the best hyperparameters found by the random forest grid search above
wqp_rf = RandomForestClassifier(random_state=42, **wqp_clf.best_params_)
wqp_rf.fit(wqp_train_SX, wqp_train_y)

wqp_rf_predictions = wqp_rf.predict(wqp_test_SX)
display_model_performance_metrics(
    true_labels=wqp_test_y, predicted_labels=wqp_rf_predictions, classes=wqp_rf.classes_
)

With the tuned hyperparameters, the overall F1 score and model accuracy on the test dataset improve slightly over the initial random forest model.


## Train, Predict & Evaluate Model using Extreme Gradient Boosting

Boosting is another way of building ensemble models. A very popular method is XGBoost, which stands for Extreme Gradient Boosting. It is a variant of the Gradient Boosting Machines (GBM) model. It is extremely popular in the data science community owing to its strong performance in several data science challenges and competitions, especially on Kaggle.
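
The core idea of gradient boosting can be sketched without any library: start from a constant prediction, then repeatedly fit a small model to the current residuals and add a shrunken version of it to the ensemble. The example below is a deliberately simplified illustration, assuming squared-error loss, a single feature, and decision stumps as base learners. It is not XGBoost, which adds regularization and second-order gradient information among other things. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    # Skipping the largest value keeps the right side non-empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return final predictions and the training error after each round."""
    predictions = [sum(ys) / len(ys)] * len(xs)  # constant starting model
    history = [mean_squared_error(ys, predictions)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Add a shrunken correction from the new stump
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        history.append(mean_squared_error(ys, predictions))
    return predictions, history


# Hypothetical one-dimensional data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0]
predictions, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The `learning_rate` here plays the same shrinkage role as the `learning_rate` parameter passed to `XGBClassifier` later in this section: smaller values make each round's correction more cautious.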


### Encode class labels


In [None]:
# --------------------------
# XGBoost
# --------------------------
print("\nTraining XGBoost...")
label_encoder = LabelEncoder()
wqp_train_y_encoded = label_encoder.fit_transform(wqp_train_y)
wqp_test_y_encoded = label_encoder.transform(wqp_test_y)


### Train the model


In [None]:
wqp_xgb = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=10,
    learning_rate=0.3,
    random_state=42,
    eval_metric='mlogloss'
)
wqp_xgb.fit(wqp_train_SX, wqp_train_y_encoded)


### Predict and Evaluate Model


In [None]:
# Evaluate XGBoost
wqp_xgb_predictions = wqp_xgb.predict(wqp_test_SX)
wqp_xgb_predictions_decoded = label_encoder.inverse_transform(wqp_xgb_predictions)
print("\nXGBoost Performance:")
display_model_performance_metrics(wqp_test_y, wqp_xgb_predictions_decoded, label_encoder.classes_)


### Tuning hyperparameters

#### Get the best hyperparameter values


In [None]:

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 7, 10], # Removed 15
    "learning_rate": [0.1, 0.3], # Added 0.1
    "gamma": [0, 0.1, 0.2],       # Added regularization
    "subsample": [0.8, 1.0],     # Added subsampling
    "colsample_bytree": [0.8, 1.0] # Added column sampling
}

wqp_clf = GridSearchCV(
    xgb.XGBClassifier(random_state=42, eval_metric="mlogloss"),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)
wqp_clf.fit(wqp_train_SX, wqp_train_y_encoded)
print(wqp_clf.best_params_)

#### View grid search results


In [None]:
results = wqp_clf.cv_results_
for param, score_mean, score_sd in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(param, round(score_mean, 4), round(score_sd, 4))

### Train, Predict & Evaluate Extreme Gradient Boosted Model with tuned hyperparameters


In [None]:
# Train the XGBoost model with the best hyperparameters found by the grid search above
wqp_xgb_model = xgb.XGBClassifier(
    random_state=42, eval_metric="mlogloss", **wqp_clf.best_params_
)
wqp_xgb_model.fit(wqp_train_SX, wqp_train_y_encoded)  # Use encoded labels here

# Predict on the test set
wqp_xgb_predictions = wqp_xgb_model.predict(wqp_test_SX)

# Decode the predictions back to string labels
wqp_xgb_predictions_decoded = label_encoder.inverse_transform(wqp_xgb_predictions)

# Display model performance metrics
display_model_performance_metrics(
    true_labels=wqp_test_y,
    predicted_labels=wqp_xgb_predictions_decoded,  # Use decoded predictions
    classes=label_encoder.classes_,
)


The model prediction results on the test dataset show an overall F1 score and model accuracy of approximately 78%. Although random forests perform slightly better, XGBoost clearly outperforms a basic model like a single decision tree.


## Model Interpretation


### Comparative analysis of Model Feature importances


In [None]:
# SHAP Explanations for XGBoost
print("\nGenerating SHAP explanations for XGBoost...")
wqp_feature_names = np.array(wqp_feature_names)  # Ensure it's a NumPy array

explainer_xgb = shap.TreeExplainer(wqp_xgb)
shap_values_xgb = explainer_xgb(wqp_train_SX)  # Use explainer directly

# SHAP Summary Plot for XGBoost
shap.summary_plot(shap_values_xgb, wqp_train_SX, feature_names=wqp_feature_names, class_names=label_encoder.classes_)


### Key Observations in the Plot
- Fixed acidity interacts with volatile acidity and citric acid, as seen in their wider spreads. 
- Volatile acidity and citric acid have strong interactions, indicated by denser, spread-out points.
- Color gradients suggest a non-linear relationship, meaning changes in one feature affect predictions differently based on the value of the other feature.

In [None]:
# Ensure feature names are a NumPy array
wqp_feature_names = np.array(wqp_feature_names)

# Check feature name validity
assert "alcohol" in wqp_feature_names, "Feature 'alcohol' not found in feature names!"

# For multi-class models, select one class index. LabelEncoder sorts the labels,
# so index 0 corresponds to label_encoder.classes_[0] ("high" here).
if len(shap_values_xgb.values.shape) == 3:  # Multi-class case
    shap_values_xgb_class = shap_values_xgb.values[..., 0]  # Select one class
else:
    shap_values_xgb_class = shap_values_xgb.values  # Single output case

# SHAP dependence plot for the alcohol feature
shap.dependence_plot("alcohol", shap_values_xgb_class, wqp_train_SX, feature_names=wqp_feature_names)



### Key Observations in the Plot
- Clear Positive Relationship
    - As alcohol increases, its SHAP value increases.
    - This suggests the model associates higher alcohol with higher predictions (e.g., better wine quality).
- Grouped & Stepped Pattern
    - Some areas show distinct clusters, possibly due to how the data is encoded or specific decision splits in the model.
- Interaction with Total Sulfur Dioxide
    - At lower alcohol values, high sulfur dioxide (red points) leads to lower SHAP values.
    - At higher alcohol values, high sulfur dioxide slightly increases SHAP values.
    - This suggests sulfur dioxide modifies the impact of alcohol on predictions.

### Conclusion

- Alcohol is a strong predictor in the model.
- The interaction with total sulfur dioxide matters, especially at lower alcohol levels.
- To explore further, consider plotting SHAP interaction values specifically for alcohol and total sulfur dioxide.

### Visualize Model ROC Curve


In [None]:
plot_model_roc_curve(wqp_rf, wqp_test_SX, wqp_test_y)

The AUC is pretty good based on what we see. The dotted lines indicate the per-class ROC curves and
the lines in bold are the macro and micro-average ROC curves.
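
For a multi-class model, each per-class curve treats one class as positive and the rest as negative. The sketch below illustrates that one-vs-rest idea on made-up probabilities and computes each class's AUC as the fraction of positive/negative pairs that are ranked correctly, which equals the area under the ROC curve. The mean of these values is a common macro summary. It is closely related to, but not computed in the same way as, the macro-average curve in `plot_model_roc_curve`. All names and numbers are assumptions for illustration.

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_labels, probabilities, classes):
    """Return per-class AUCs and their unweighted mean."""
    aucs = {}
    for k, c in enumerate(classes):
        y_bin = [1 if t == c else 0 for t in true_labels]
        aucs[c] = binary_auc(y_bin, [row[k] for row in probabilities])
    return aucs, sum(aucs.values()) / len(aucs)


# Hypothetical predicted probabilities; columns follow the order of `classes`
classes = ["high", "low", "medium"]
true_labels = ["low", "low", "medium", "medium", "high"]
probabilities = [
    [0.1, 0.7, 0.2],
    [0.1, 0.4, 0.5],
    [0.2, 0.3, 0.5],
    [0.3, 0.1, 0.6],
    [0.5, 0.2, 0.3],
]

aucs, macro_auc = one_vs_rest_auc(true_labels, probabilities, classes)
print(aucs)
print(round(macro_auc, 4))
```

An AUC of 1.0 for a class means every sample of that class received a higher score for it than any sample of the other classes; 0.5 corresponds to random ranking.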


### Visualize Model Decision Surface


In [None]:
feature_indices = [
    i
    for i, feature in enumerate(wqp_feature_names)
    if feature in ["alcohol", "volatile acidity"]
]
plot_model_decision_surface(
    clf=wqp_rf,
    train_features=wqp_train_SX[:, feature_indices],
    train_labels=wqp_train_y,
    plot_step=0.02,
    cmap=plt.cm.RdYlBu,
    markers=[",", "d", "+"],
    alphas=[1.0, 0.8, 0.5],
    colors=["r", "b", "y"],
)

### Interpreting Model Predictions


In [None]:
# --------------------------
# LIME Explanations
# --------------------------
print("\nGenerating LIME explanations...")
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=wqp_train_SX,
    feature_names=wqp_feature_names,
    class_names=label_encoder.classes_,
    discretize_continuous=True,
    verbose=True,
    mode='classification'
)

In [None]:
# Explain instance for Random Forest
def explain_instance_rf(instance_index):
    exp = lime_explainer.explain_instance(
        wqp_test_SX[instance_index],
        wqp_rf.predict_proba,
        num_features=5,
        top_labels=1
    )
    return exp.show_in_notebook()
# Explain instance for XGBoost
def explain_instance_xgb(instance_index):
    exp = lime_explainer.explain_instance(
        wqp_test_SX[instance_index],
        wqp_xgb.predict_proba,
        num_features=5,
        top_labels=1
    )
    return exp.show_in_notebook()

In [None]:
# Explain sample predictions
print("\nLIME Explanation for Random Forest (Low Quality Wine):")
explain_instance_rf(10)   # Low quality wine


1. Prediction Probabilities (Left)
- The model predicted this wine sample as medium quality with a probability of 0.82 (82%).
- The probabilities for other classes:
    - Low quality: 18%
    - High quality: 1%
- This suggests that the model is fairly confident in predicting this sample as "medium."


2. Feature Contributions (Middle)
- The bar chart in the middle explains which features influenced the prediction.
- Green bars (right side): Features that support the prediction (i.e., medium quality).
- Cyan bar (left side): Features that contradict the prediction.

  Breakdown of Feature Contributions
    - Alcohol (-0.58) → Negative contribution: Since alcohol is low, the model sees it as evidence against "medium quality."
    - Sulphates (0.54) → Positive contribution: Higher sulphate levels push the prediction towards "medium quality."
    - Volatile Acidity (-0.48) → Positive contribution: A moderate value of volatile acidity helps predict "medium quality."
    - Free Sulfur Dioxide (1.62) → Positive contribution:  A high level supports "medium quality."
    - pH (-0.84) → Negative contribution: A lower pH slightly contradicts the "medium" classification.

Conclusion
- The model predicts medium quality wine with high confidence.
- Key factors supporting "medium" prediction:
    - Higher sulphates
    - Higher free sulfur dioxide
    - Moderate volatile acidity
- Key factor against "medium" prediction:
    - Low alcohol content

If you want to further validate these insights, you could compare this with SHAP values to see if both methods agree on important features!

In [None]:
print("\nLIME Explanation for XGBoost (High Quality Wine):")
explain_instance_xgb(747)  # High quality wine

1. Prediction Probabilities (Left)
- The model is 99% confident that this wine is medium quality (green bar).
- The probabilities for other classes:
    - Low quality: 1%
    - High quality: 0%
- Despite analyzing a high-quality wine, the model strongly favors the "medium" classification.

2. Feature Contributions (Middle)
- The bar chart explains which features influenced the prediction.
- Green bars (right side): Features that support the "medium quality" prediction.
- Blue bars (left side): Features that contradict "medium quality" and could suggest another classification.

  Breakdown of Feature Contributions
    - Alcohol (1.26) → Positive contribution (0.32): A higher alcohol content increases the likelihood of being classified as medium quality.
    - Residual Sugar (1.96) → Positive contribution (0.10): More residual sugar also supports the "medium" prediction.
    - Chlorides (-0.96) → Positive contribution (0.08): A low chloride level pushes the prediction toward medium quality.
    - Volatile Acidity (0.79) → Negative contribution (0.26): Higher volatile acidity contradicts "medium" and leans toward another classification (possibly low quality).
    - Density (0.17) → Negative contribution (0.08): A moderate density is not fully aligned with the "medium" classification.

Conclusion
- The model strongly predicts medium quality despite the wine being labeled as high quality.
- Key factors supporting "medium" prediction:
    - High alcohol
    - High residual sugar
    - Low chlorides
- Key factors contradicting "medium" prediction:
    - High volatile acidity
    - Moderate density
      
This suggests the model may be biased toward predicting medium quality, even for wines that should be high quality. You might want to analyze SHAP values or retrain the model with different feature weighting to improve its ability to distinguish high-quality wines.

## Conclusion

Through this module, we covered:

- Building and evaluating multiple classification models
- Applying different machine learning algorithms and comparing their performance
- Using model interpretation techniques including:
  - Feature importance analysis
  - ROC curves and performance metrics
  - Decision surface visualization
  - SHAP dependence plots
- Understanding how different features influence model predictions

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
