# Model Tuning, Interpretation and Deployment

## Overview

This tutorial covers essential aspects of the machine learning workflow after initial model building, focusing on model interpretation techniques, tuning approaches, and deployment considerations. We'll explore how to make machine learning models more interpretable, optimize their performance through tuning, and prepare them for real-world deployment.

## Learning Objectives

- Understand the importance of model interpretation in machine learning
  - Learn how interpretability benefits analytics teams and stakeholders
  - Recognize how interpretability bridges technical and business understanding
- Master techniques for model tuning and optimization
- Learn best practices for model deployment
- Develop skills to explain model decisions to non-technical stakeholders

### Tasks to complete

- Implement model interpretation techniques
- Perform model tuning exercises
- Practice model deployment steps
- Create interpretability visualizations

## Prerequisites

- A working Python environment and familiarity with Python
- Basic understanding of machine learning concepts
- Familiarity with pandas and numpy libraries
- Knowledge of basic statistical concepts


## Get Started

- Please select the kernel "conda_tensorflow2_p310" in your SageMaker notebook instance.


### Import libraries


In [None]:
%pip install lime shap

In [None]:
import warnings
import joblib
import numpy as np
import pandas as pd
import scipy
import shap  # Import SHAP for explanation
from lime.lime_tabular import LimeTabularExplainer  # Import LIME for explanation
from sklearn import linear_model, metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Model Tuning, Interpretation and Deployment

Adapted from Dipanjan Sarkar et al. 2018. [Practical Machine Learning with Python](https://link.springer.com/book/10.1007/978-1-4842-3207-1).


In this tutorial, we will learn:

- How to tune the hyperparameters of Machine Learning algorithms
- How to interpret models using open source frameworks
- How to persist and deploy the developed models


## Model tuning

Model tuning is one of the most important concepts in Machine Learning, and it requires some knowledge
of the underlying math and logic of the algorithm in focus. In this tutorial, we will look more closely at the
model we are targeting and at the knobs (hyperparameters) that can be tuned to extract the best performance
from it. This process of iterative experimentation with the dataset, model parameters, and features is the
very core of model tuning.
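
As a quick illustration of what these "knobs" look like in practice, the sketch below lists a few hyperparameters of scikit-learn's `SVC` together with their default values. The choice of the three names shown is ours, picked because they are the ones tuned later in this tutorial.

```python
from sklearn.svm import SVC

# Every scikit-learn estimator exposes its hyperparameters through get_params()
default_params = SVC().get_params()

# Print the knobs tuned later in this tutorial, with their default values
for name in ("C", "kernel", "gamma"):
    print(f"{name} = {default_params[name]!r}")

print("Total number of hyperparameters:", len(default_params))
```

Any key returned by `get_params()` can be used as a key in a search grid.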


### Build and Evaluate Default Model

We will use the Wisconsin Breast Cancer Dataset as an example. We first split the breast cancer dataset variables X and y into train and test datasets and build an SVM model with default parameters. Then we evaluate its performance on the test dataset.


In [None]:
# Load Wisconsin Breast Cancer Dataset

# load data
bc = load_breast_cancer()

X = bc.data
y = bc.target

print(X.shape, bc.feature_names)

In [None]:
# Utility functions for model evaluation

# Get model performance evaluation metrics
def get_metrics(true_labels, predicted_labels):
    print(
        "Accuracy:", np.round(metrics.accuracy_score(true_labels, predicted_labels), 4)
    )
    print(
        "Precision:",
        np.round(
            metrics.precision_score(true_labels, predicted_labels, average="weighted"),
            4,
        ),
    )
    print(
        "Recall:",
        np.round(
            metrics.recall_score(true_labels, predicted_labels, average="weighted"), 4
        ),
    )
    print(
        "F1 Score:",
        np.round(
            metrics.f1_score(true_labels, predicted_labels, average="weighted"), 4
        ),
    )


# Show the classification report
def display_classification_report(true_labels, predicted_labels, classes=[1, 0]):
    # Build a text report showing the main classification metrics
    # The reported averages include macro average (averaging the unweighted
    # mean per label), weighted average (averaging the support-weighted mean
    # per label), and sample average (only for multilabel classification).
    # Micro average (averaging the total true positives, false negatives and
    # false positives) is only shown for multi-label or multi-class
    # with a subset of classes, because it corresponds to accuracy
    # otherwise and would be the same for all metrics.
    report = metrics.classification_report(
        y_true=true_labels, y_pred=predicted_labels, labels=classes
    )
    print(report)


# Show the confusion matrix
def display_confusion_matrix(true_labels, predicted_labels, classes=[1, 0]):
    total_classes = len(classes)
    level_labels = [total_classes * [0], list(range(total_classes))]
    # Compute confusion matrix to evaluate the accuracy of a classification.
    cm = metrics.confusion_matrix(
        y_true=true_labels, y_pred=predicted_labels, labels=classes
    )
    cm_frame = pd.DataFrame(
        data=cm,
        columns=pd.MultiIndex(levels=[["Predicted:"], classes], codes=level_labels),
        index=pd.MultiIndex(levels=[["Actual:"], classes], codes=level_labels),
    )
    print(cm_frame)


# Show the model performance metrics
def display_model_performance_metrics(true_labels, predicted_labels, classes=[1, 0]):
    print("Model Performance metrics:")
    print("-" * 30)
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print("\nModel Classification report:")
    print("-" * 30)
    display_classification_report(
        true_labels=true_labels, predicted_labels=predicted_labels, classes=classes
    )
    print("\nPrediction Confusion Matrix:")
    print("-" * 30)
    display_confusion_matrix(
        true_labels=true_labels, predicted_labels=predicted_labels, classes=classes
    )

In [None]:
# prepare datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# build default SVM model
# C-Support Vector Classification
def_svc = SVC(random_state=42)
def_svc.fit(X_train, y_train)

# predict and evaluate performance
def_y_pred = def_svc.predict(X_test)
print("Default Model Stats:")
display_model_performance_metrics(
    true_labels=y_test, predicted_labels=def_y_pred, classes=[0, 1]
)

### Tune Model with Grid Search

Since we have chosen an SVM model, we specify some hyperparameters specific
to it: the regularization parameter C (which controls the trade-off between a wide margin and misclassified
training points), the kernel function (used for transforming data into a higher dimensional feature space) and
gamma (which determines the influence a single training data point has; it is ignored by the linear kernel).
There are many other hyperparameters to tune, which you can check out [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for further details.

We will build a grid by supplying some pre-set values. The next choice is the score or metric we want
to maximize; here we have chosen to maximize the accuracy of the model. Once that is done, we will use
five-fold cross-validation to build multiple models over this grid and evaluate them to get the best model.

(This step may take a few minutes to complete.)


In [None]:
# setting the parameter grid
grid_parameters = {
    "kernel": ["linear", "rbf"],
    "gamma": [1e-3, 1e-4],
    "C": [1, 10, 50, 100],
}

# perform hyperparameter tuning
print("# Tuning hyper-parameters for accuracy\n")

# Exhaustive search over specified parameter values for an estimator.
clf = GridSearchCV(SVC(random_state=42), grid_parameters, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)

# view accuracy scores for all the models
print("Grid scores for all the models based on CV:\n")
means = clf.cv_results_["mean_test_score"]
stds = clf.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, clf.cv_results_["params"]):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))

# check out best model performance
print("\nBest parameters set found on development set:", clf.best_params_)
print("Best model validation accuracy:", clf.best_score_)

We can see that the best model parameters were selected
based on cross-validation accuracy, and we get a strong validation accuracy of about 96%.
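
To see why the grid search takes a while, it can help to count the model fits it performs. The sketch below reuses the grid values from this tutorial with scikit-learn's `ParameterGrid`; the helper set `distinct_models` is our own illustration of the fact that `gamma` has no effect on the linear kernel.

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as in this tutorial
grid_parameters = {
    "kernel": ["linear", "rbf"],
    "gamma": [1e-3, 1e-4],
    "C": [1, 10, 50, 100],
}
cv_folds = 5

candidates = list(ParameterGrid(grid_parameters))
n_candidates = len(candidates)      # 2 kernels * 2 gammas * 4 C values
n_fits = n_candidates * cv_folds    # one fit per candidate per fold (plus one final refit)

# gamma is ignored by the linear kernel, so some linear candidates are duplicates
distinct_models = {
    (p["kernel"], p["C"], None if p["kernel"] == "linear" else p["gamma"])
    for p in candidates
}

print("Candidate settings:", n_candidates)
print("Cross-validation fits:", n_fits)
print("Distinct models:", len(distinct_models))
```

Passing a list of dictionaries as the grid (one per kernel) is one way to avoid fitting the duplicated linear candidates.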


### Evaluate Grid Search Tuned Model

Let’s take this
optimized and tuned model and put it to the test on our test data!


In [None]:
gs_best = clf.best_estimator_
tuned_y_pred = gs_best.predict(X_test)

print("\n\nTuned Model Stats:")
display_model_performance_metrics(
    true_labels=y_test, predicted_labels=tuned_y_pred, classes=[0, 1]
)

Our model gives an overall F1 Score and model
accuracy of 97% on the test dataset too. This should give you a clear indication of
the power of hyperparameter tuning! This scheme can be extended to different models and their
respective hyperparameters. We can also play around with the evaluation measure we want to optimize.
The scikit-learn framework provides us with different scoring values that we can optimize. Some of them are
`adjusted_rand_score`, `average_precision`, `f1`, `recall`, and so on.


### Tune Model with Randomized Search

Grid search suffers from some major shortcomings, the most important one being
the limitation of manually specifying the grid. This brings a human element into a process that could benefit
from a purely automatic mechanism.

Randomized parameter search is a modification of the traditional grid search. It accepts lists of
values as in a normal grid search, but it can also take distributions as input. For example, consider
the parameter gamma, whose values we supplied explicitly in the last section; instead, we can supply a
distribution from which to sample gamma.

The efficacy of randomized parameter search is based on the
result (shown both empirically and mathematically) that hyperparameter optimization problems normally
have low effective dimensionality: a few parameters affect the result much more than the others.

We control the number
of times we want to do the random parameter sampling by specifying the number of iterations we want to
run (n_iter). Normally, a higher number of iterations means a more granular parameter search but a higher
computation time.

(This step may take a few minutes to complete.)


In [None]:
# replace the gamma and C values with a distribution (exponential distribution)
param_grid = {
    "C": scipy.stats.expon(scale=10),
    "gamma": scipy.stats.expon(scale=0.1),
    "kernel": ["rbf", "linear"],
}

# Randomized search on hyperparameters.
random_search = RandomizedSearchCV(
    SVC(random_state=42),
    param_distributions=param_grid,
    n_iter=50,
    cv=5,
    random_state=42,  # make the sampled candidates reproducible
)
random_search.fit(X_train, y_train)
print("Randomized search scores for all the sampled models based on CV:\n")
means = random_search.cv_results_["mean_test_score"]
stds = random_search.cv_results_["std_test_score"]
for mean, std, params in zip(means, stds, random_search.cv_results_["params"]):
    print("%0.5f (+/-%0.05f) for %r" % (mean, std * 2, params))
print("\nBest parameters set found on development set:", random_search.best_params_)
print("Best model validation accuracy:", random_search.best_score_)

### Evaluate Randomized Search Tuned Model

Get the best model, predict, and evaluate its performance.


In [None]:
rs_best = random_search.best_estimator_
rs_y_pred = rs_best.predict(X_test)
get_metrics(true_labels=y_test, predicted_labels=rs_y_pred)

We draw the values of the parameters C and gamma from exponential distributions,
and we control the number of sampled candidates with the parameter n_iter. While the overall
model performance is similar to that of grid search, the intent is to be aware of the different strategies for model tuning.
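
To get a feel for what sampling from these distributions produces, the sketch below draws a handful of candidates with scikit-learn's `ParameterSampler`, using the same distributions as the search. The sample size of 5 and the seed are arbitrary choices for illustration.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "C": scipy.stats.expon(scale=10),       # mean of 10, always positive
    "gamma": scipy.stats.expon(scale=0.1),  # mean of 0.1, always positive
    "kernel": ["rbf", "linear"],            # lists are sampled uniformly
}

# Draw 5 candidate settings; random_state makes the draw repeatable
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for s in samples:
    print(f"C={s['C']:.3f}  gamma={s['gamma']:.4f}  kernel={s['kernel']}")
```

Each printed row corresponds to one candidate model that a randomized search would evaluate with cross-validation.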


## Model Interpretation

The ability to interpret Machine Learning models in an easy-to-understand way benefits not only analytics teams but also key stakeholders who need an explanation of how the models really work.

Some Machine Learning models use interpretable algorithms; for example, a decision tree gives you
the importance of all the variables as an output. Unfortunately, this
can’t be said for a lot of models, especially those that have no notion of variable importance.
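
As a small illustration of a model with a built-in notion of variable importance, the sketch below fits a shallow decision tree on the breast cancer data and prints its top features. The depth of 3 and the choice to fit on the full dataset are arbitrary simplifications for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()

# Fit a shallow tree; max_depth=3 is an arbitrary choice for illustration
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(bc.data, bc.target)

# feature_importances_ holds one impurity-based score per feature, summing to 1
importances = tree.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{bc.feature_names[i]:<25s} {importances[i]:.3f}")
```

An `SVC` with an RBF kernel, by contrast, exposes no comparable attribute, which is where model-agnostic tools become useful.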

Because the decision policies learned by complex models are hard to understand,
predictive models are still often viewed as black boxes. Model
interpretation can help a data scientist and an end user in a variety of ways.

- It helps bridge the gap that often exists between the technology teams and the business. For example,
  it can help identify the reason why a particular prediction is being made, which the end user can then
  verify against domain knowledge by leveraging that easy-to-understand interpretation.
- It can also help data scientists understand the interactions among features, which can lead to
  better feature engineering and enhanced performance.
- It can also help in model comparisons and in explaining the results better to business stakeholders.


### Understanding LIME and SHAP

LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) are two powerful tools designed to explain machine learning model predictions in an interpretable manner, making complex models more transparent.

LIME works by approximating a machine learning model with a simple, interpretable model (such as a linear regression) around the instance being predicted. It perturbs the input data, generates predictions, and learns a locally faithful model that explains the specific prediction for that instance. This method provides insights into how a model makes decisions for individual instances.

SHAP, on the other hand, is grounded in game theory and computes the contribution of each feature to the model’s prediction by distributing the prediction's total impact fairly across all features. SHAP values are additive, and the method provides both local and global interpretability, making it highly effective for understanding the overall impact of features on model predictions.

While LIME focuses on local explanations, SHAP excels in providing consistent global feature importance, making both techniques complementary for enhancing model transparency and trustworthiness.
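
To make the "fair distribution" and additivity ideas concrete, here is a minimal sketch that computes exact Shapley values for a tiny hypothetical model by brute force. The function `model`, the helper `shapley_values`, and the all-zero baseline are assumptions made for this illustration; the SHAP library uses more efficient approximations and a background dataset rather than a single baseline point.

```python
from itertools import permutations
from math import factorial

import numpy as np


def model(x):
    # Hypothetical toy model with an interaction between features 1 and 2
    return 2.0 * x[0] + x[1] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = np.zeros(n)
    for order in permutations(range(n)):
        current = baseline.copy()
        prev = f(current)
        for i in order:
            current[i] = x[i]     # switch feature i from its baseline to its actual value
            new = f(current)
            phi[i] += new - prev  # marginal contribution of feature i in this ordering
            prev = new
    return phi / factorial(n)


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(model, x, baseline)

print("Shapley values:", phi)
print("Sum of Shapley values:", phi.sum())
print("f(x) - f(baseline):", model(x) - model(baseline))
```

The values sum to the difference between the prediction and the baseline prediction, which is the additivity property mentioned above; the two interacting features share the interaction term equally.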


In [None]:
# we will use a logistic regression model to do classification.

warnings.filterwarnings("ignore")

logistic = linear_model.LogisticRegression()
logistic.fit(X_train, y_train)

In [None]:
# SHAP explanation with the model-agnostic KernelExplainer
background = shap.kmeans(X_train, 50)  # Summarize the background data with 50 k-means centroids
explainer = shap.KernelExplainer(logistic.predict_proba, background)
shap_values = explainer.shap_values(X_test)  # One set of SHAP values per class; this can be slow

In [None]:
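# Plot the SHAP summary plot for class 1 (benign)
# predict_proba has one column per class, so SHAP returns a list (older releases) or a 3-D array (newer releases)
class1_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(class1_shap, X_test, feature_names=bc.feature_names)

### Key Takeaways from the Plot
- Each point is one test sample. Its horizontal position is the feature's SHAP value for class 1 (benign), so points to the right of 0 push the prediction toward benign and points to the left push it toward malignant.
- Features are sorted from top to bottom by mean absolute SHAP value, so the features at the top have the largest overall influence on the model output.
- Color encodes the feature value (red is high, blue is low). If the red points of a feature sit on one side of 0 and the blue points on the other, that feature has a consistent directional effect on the prediction.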

### Explaining Predictions

Let’s try to interpret some actual predictions now. We will explain two data points, one not
having cancer (label 1, benign) and one having cancer (label 0, malignant), and try to interpret the prediction-making process.


In [None]:
# LIME explanation for tabular data (class "0" is malignant, class "1" is benign)
lime_explainer = LimeTabularExplainer(
    X_train,
    feature_names=bc.feature_names,
    discretize_continuous=True,
    class_names=["0", "1"],
)

In [None]:
# Generate explanation for an individual prediction (LIME)
lime_exp = lime_explainer.explain_instance(X_test[0], logistic.predict_proba)
lime_exp.show_in_notebook()

### Key Takeaways from the LIME Explanation above
1. Prediction Probabilities
    - The model predicts 87% (0.87) probability for class 1, which is benign in this dataset.
    - The probability for class 0 (malignant) is only 13% (0.13).
    - This means the model strongly believes the tumor is benign.

2. Feature Contributions
    - Features supporting class 1 (benign) are in orange.
    - Features supporting class 0 (malignant) are in blue.
    - The most important benign indicators (orange) are:
        - Worst perimeter (96.05)
        - Worst area (677.90)
        - Mean area (481.90)
        - Area error (30.29)
    - The most important malignant indicators (blue) are:
        - Worst radius (14.97)
        - Mean perimeter (81.09)
        - Mean radius (12.47)

3. Feature Value Ranges
    - The middle section lists the discretized feature ranges that LIME uses for its local explanation (e.g., “84.54 < worst perimeter <= 97.75”). These are quartile bins computed from the training data, not splits learned by the model.
    - For this sample, the bins containing worst perimeter, worst area, and mean area push the prediction most strongly toward benign.

#### Final Interpretation

Even though some features (blue) support the malignant classification, the dominant benign-supporting features outweigh them. Therefore, the model predicts benign (1) with high confidence (87%).


Let’s run a similar interpretation on a data point with malignant
cancer.


In [None]:
lime_exp = lime_explainer.explain_instance(X_test[1], logistic.predict_proba)
lime_exp.show_in_notebook()

### Key Takeaways from the LIME Explanation above
1. Prediction Probabilities
    - The model is almost certain this is a malignant tumor (0.00 probability for class 1, benign).
    - The blue bar (0) is fully filled because the predicted probability for class 0 (malignant) rounds to 1.00.
2. Feature Contributions
    - Blue bars (supporting the malignant classification):
        - Worst perimeter (165.90)
        - Mean area (1130.00)
        - Worst area (1866.00)
        - Worst texture (26.58)
    - Orange bars (supporting the benign classification):
        - Mean perimeter (123.60)
        - Worst radius (24.86)
        - Mean radius (18.94)
        - Area error (96.05)
    - Since the blue features dominate, the model predicts malignant (0) with near-full confidence.
3. Feature Ranges & Feature Importance
    - The middle section shows the discretized feature ranges used by LIME, not splits learned by the model.
        - For example, “mean perimeter > 105.62” slightly supports benign (orange).
        - However, features like “worst perimeter > 125.30” strongly push toward malignant.

#### Final Interpretation

Even though a few features (like mean perimeter and worst radius) slightly support benign, the strong malignant indicators (worst perimeter, mean area, worst area) outweigh them completely.
The model is highly confident this tumor is malignant (probability close to 100%).


## Model Deployment

The final piece of the Machine Learning modeling puzzle is that of
deploying the model in production so that we actually start using it.


### Persist model to disk

For persisting our model to disk, we can leverage libraries like pickle or joblib; joblib is installed
as a dependency of scikit-learn. This allows us to deploy and use the model in the future, without having to retrain it
each time we want to use it.


In [None]:
joblib.dump(logistic, "lr_model.pkl")

### Load model from disk

Whenever we load this file into memory again,
we get the logistic regression model object back.


In [None]:
lr = joblib.load("lr_model.pkl")
lr

### Predict with loaded model

We can now use this lr object, which is our model loaded from the disk, and make predictions.


In [None]:
print("True value: ", y_test[10:11])
print("Predicted value: ", lr.predict(X_test[10:11]))
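
Before relying on a persisted model, it can be worth checking that it behaves identically after a save and load round trip. The sketch below is self-contained and hedged: the temporary path, the `bundle` dictionary layout, and the idea of storing the scikit-learn version next to the model are our own illustrative choices, motivated by the fact that pickled scikit-learn models are not guaranteed to load correctly under a different library version.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr_model.pkl")  # hypothetical location
    # Store the library version next to the model so the serving side can check it
    joblib.dump({"model": model, "sklearn_version": sklearn.__version__}, path)
    bundle = joblib.load(path)

restored = bundle["model"]
same_predictions = np.array_equal(model.predict(X), restored.predict(X))

print("Saved with scikit-learn", bundle["sklearn_version"])
print("Predictions identical after reload:", same_predictions)
```

In a deployment script, the stored version string could be compared against `sklearn.__version__` before serving predictions.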

## Conclusion

Through this tutorial, you have gained practical experience in making machine learning models more transparent and interpretable, optimizing their performance through proper tuning, and preparing them for real-world deployment. These skills are essential for bridging the gap between technical implementation and business understanding in machine learning projects.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.
