# Week 17, Lecture 02: CodeAlong


## **Part 2: Explaining Models with Model Explainers**

### Lesson Objectives

- By the end of this lesson, students will be able to:
    - Load variables and models from a joblib file into a new notebook.
    - Apply permutation importance.
    - Apply SHAP analysis.
    - Visualize global and local explanations.


### Continuing with Life Expectancy Prediction

> Task Inspired by: https://medium.com/@shanzehhaji/using-a-linear-regression-model-to-predict-life-expectancy-de3aef66ac21

- Kaggle Dataset on Life Expectancy:
    - https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

In [None]:
## Our standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as miss

## Preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

## Models & evaluation metrics
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


## setting random state for reproducibility
SEED = 321
np.random.seed(SEED)
## Matplotlib style
fav_style = ('ggplot','tableau-colorblind10')
fav_context  ={'context':'notebook', 'font_scale':1.1}
plt.style.use(fav_style)
sns.set_context(**fav_context)
plt.rcParams['savefig.transparent'] = False
plt.rcParams['savefig.bbox'] = 'tight'


import joblib, os

In [None]:
## Importing Custom Functions
import sys,os
# sys.path.append(os.path.abspath("../"))
%load_ext autoreload
%autoreload 2
from CODE import data_enrichment as de

### Functionized Code From Part 1

In [None]:
def evaluate_regression(model, X_train,y_train, X_test, y_test,for_slides=True): 
    """Evaluates a scikit-learn regression model using R-squared, MAE, and RMSE.
    If for_slides is True, the results are displayed as a styled DataFrame;
    otherwise they are printed as one line per split."""
    
    ## Training Data
    y_pred_train = model.predict(X_train)
    r2_train = metrics.r2_score(y_train, y_pred_train)
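    ## np.sqrt of the MSE works across scikit-learn versions
    ## (the squared=False argument was removed in scikit-learn 1.6)
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))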
    mae_train = metrics.mean_absolute_error(y_train, y_pred_train)
    

    ## Test Data
    y_pred_test = model.predict(X_test)
    r2_test = metrics.r2_score(y_test, y_pred_test)
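    ## np.sqrt of the MSE works across scikit-learn versions
    ## (the squared=False argument was removed in scikit-learn 1.6)
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))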
    mae_test = metrics.mean_absolute_error(y_test, y_pred_test)
    
    if for_slides:
        df_version =[['Split','R^2','MAE','RMSE']]
        df_version.append(['Train',r2_train, mae_train, rmse_train])
        df_version.append(['Test',r2_test, mae_test, rmse_test])
        df_results = pd.DataFrame(df_version[1:], columns=df_version[0])
        df_results = df_results.round(2)
        display(df_results.style.hide(axis='index').format(precision=2, thousands=','))
        
    else: 
        print(f"Training Data:\tR^2 = {r2_train:,.2f}\tRMSE = {rmse_train:,.2f}\tMAE = {mae_train:,.2f}")
        print(f"Test Data:\tR^2 = {r2_test:,.2f}\tRMSE = {rmse_test:,.2f}\tMAE = {mae_test:,.2f}")

def get_coefficients(lin_reg):
    coeffs = pd.Series(lin_reg.coef_, index= lin_reg.feature_names_in_)
    coeffs.loc['intercept'] = lin_reg.intercept_
    return coeffs

def plot_coefficients(coeffs, sort_values=True, top_n=None, figsize=(6,4),
                     title="Linear Regression Coefficients", xlabel='Coefficient'):
    """Plots a Series of coefficients as a horizontal bar chart, with options to sort
    and to keep only the top_n coefficients (by absolute value)"""
        
    if top_n is not None:
        ## keep the top_n coefficients with the largest absolute values
        top_idx = coeffs.abs().sort_values(ascending=False).head(top_n).index
        coeffs = coeffs.loc[top_idx]
        
    if sort_values:
        coeffs = coeffs.sort_values()

        
        
    ax = coeffs.plot(kind='barh', figsize=figsize)
    ax.axvline(0, color='k')
    ax.set(xlabel=xlabel, title=title);
    plt.show()
    return ax


def get_importances(rf_reg):
    importances = pd.Series(rf_reg.feature_importances_, index= rf_reg.feature_names_in_)
    return importances


def plot_importances(importances, sort_values=True, top_n=None, figsize=(6,4),
                     title="Feature Importance", xlabel='Importance'):
    if sort_values:
        importances = importances.sort_values()
        
    if top_n is not None:
        importances = importances.tail(top_n)
        
        
    ax = importances.plot(kind='barh', figsize=figsize)
    ax.axvline(0, color='k')
    ax.set(xlabel=xlabel, title=title);
    plt.show()
    return ax

## 🕹️ Loading Objects from a Joblib File

In [None]:
## Load the joblib file stored in the models folder
fname = "Models/wk1-lect01-codealong.joblib"


# Preview the contents of the loaded joblib objects


In [None]:
## Saving the loaded objects as separate variables
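
A minimal sketch of the load-and-unpack pattern, assuming the joblib file holds a dictionary. The keys and values below are hypothetical stand-ins (the real file contains whatever was saved in Part 1), and a temporary folder is used so the example does not touch the `Models` folder.

```python
import os
import tempfile

import joblib

# Hypothetical contents; the real export from Part 1 may use different keys
export_demo = {"X_train": [[1, 2], [3, 4]], "y_train": [10, 20]}

with tempfile.TemporaryDirectory() as folder:
    fname_demo = os.path.join(folder, "demo.joblib")
    joblib.dump(export_demo, fname_demo)   # save the dictionary to disk
    loaded_demo = joblib.load(fname_demo)  # load it back as a new object

# Preview what was stored, then save each item as its own variable
loaded_keys = list(loaded_demo.keys())
X_train_demo = loaded_demo["X_train"]
y_train_demo = loaded_demo["y_train"]
print(loaded_keys)
```

Previewing the keys first makes it clear which objects are available before slicing them out.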



> Let's evaluate our models to prove they saved correctly.

In [None]:
## Use our evaluate_regression function to evaluate the linear regression


> ***Q: What happened?***

In [None]:
## let's check X_train


> **Q: What is missing/wrong?**
....

...


### Re-Creating X_train_df & X_test_df

In [None]:
## Get feature names from already-fit preprocessor



## Use the preprocessor to transform X_train into X_train_df


## Use the preprocessor to transform X_test into X_test_df 


### Evaluating Our LinearRegression

In [None]:
## Use our evaluate_regression function to evaluate the linear regression
evaluate_regression(lin_reg, X_train_df, y_train, X_test_df, y_test)

In [None]:
## Setting float format for readability
pd.set_option('display.float_format',lambda x: f"{x:,.2f}")

In [None]:
## Get the coefficients from the lin reg
coeffs = get_coefficients(lin_reg)
coeffs

In [None]:
## plot the coefficients
plot_coefficients(coeffs)

### Evaluating Our Random Forest

In [None]:
## evaluate the random forest
evaluate_regression(rf_reg,X_train_df,y_train, X_test_df, y_test)

## extract and plot the feature importances
importances = get_importances(rf_reg)
plot_importances(importances)

## 🕹️  Permutation Importance

In [None]:
from sklearn.inspection import permutation_importance

### RandomForest Permutation Importance

>  Apply permutation importance to the random forest

In [None]:
## run permutation_importance on the rf using the test data and random_state=SEED


In [None]:
## save the average importances as a Series


In [None]:
# Use our plot_importances function, but change title to "Permutation Importance"


In [None]:
# Compare to the random forest feature importance
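
A minimal sketch of the permutation importance workflow on synthetic data, assuming a random forest regressor. The feature names (`signal`, `noise_1`, `noise_2`), the toy target, and the `_demo` variables are hypothetical and only stand in for the life expectancy data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

SEED = 321
rng = np.random.default_rng(SEED)

# Synthetic data: only "signal" actually drives the target
X_demo = pd.DataFrame({"signal": rng.normal(size=200),
                       "noise_1": rng.normal(size=200),
                       "noise_2": rng.normal(size=200)})
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=SEED)
rf_demo = RandomForestRegressor(n_estimators=50, random_state=SEED).fit(X_tr, y_tr)

# Shuffle each feature several times and measure the drop in the test score
result_demo = permutation_importance(rf_demo, X_te, y_te,
                                     n_repeats=5, random_state=SEED)

# Save the average importances as a Series labeled with the feature names
perm_importances_demo = pd.Series(result_demo["importances_mean"],
                                  index=X_te.columns,
                                  name="permutation importance")
print(perm_importances_demo.sort_values(ascending=False))
```

Because the result is a Series indexed by feature name, it can be passed to a plotting helper such as `plot_importances` in the same way as the built-in importances.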


> Permutation Importance Can Be Applied to ANY Model!

### LinearRegression Permutation Importance

In [None]:
## run permutation_importance on the lin_reg using the test data and random_state=SEED


In [None]:
## Make into a series called perm_importances_linreg


In [None]:
# final_plot_df = pd.concat([X_train_df, y_train], axis=1)
# corr = final_plot_df.corr()
# corr

In [None]:
# corr['Life expectancy'].sort_values(ascending=False).to_frame().style.bar()

# 🕹️ Global Model Explanations

## Applying SHAP

In [1]:
# Import and init shap


In [None]:
# Take a sample of the training data (nsamples=500, use random_state=SEED)
X_shap = None
y_shap = None


In [None]:
# Instantiate a Model Explainer with the model


## Get shap values from the explainer


In [None]:
## create a summary plot (bar version)


In [None]:
## create a summary plot (dot/default version)


In [None]:
## Create an explainer for the lin_reg


## get shap values for linreg



In [None]:
## create a summary plot (bar version)


In [None]:
## create a summary plot (dot/default version)


> So why is our LinReg predicting a high life expectancy when infant deaths are high?

## Local Explanations

In [None]:
## Making a version of the shap vars with a 0-based integer index
# so that it matches the index for the shap_values
X_shap_local = None
y_shap_local = None


### Finding a Meaningful Example to Explain

- Let's find the example with the most infant deaths.

In [None]:
# what is the max/range of infant deaths


In [None]:
## saving the index of the most deaths
idx_high_deaths = None
idx_high_deaths
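
A minimal sketch of why the index is reset before picking an example. SHAP values are stored by position (0, 1, 2, ...), while a sampled DataFrame keeps its original row labels. The toy values, the column name `infant deaths`, and the `_demo` variables are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample that kept its original (non-sequential) row labels
X_shap_demo = pd.DataFrame({"infant deaths": [3, 1800, 45],
                            "Alcohol": [9.1, 0.5, 4.2]},
                           index=[812, 57, 1406])
y_shap_demo = pd.Series([81.0, 52.5, 68.3], index=[812, 57, 1406],
                        name="Life expectancy")

# Reset to a 0-based integer index so labels match array positions
X_local_demo = X_shap_demo.reset_index(drop=True)
y_local_demo = y_shap_demo.reset_index(drop=True)

# idxmax now returns a position that can also index the SHAP values array
idx_demo = X_local_demo["infant deaths"].idxmax()
print(idx_demo, X_local_demo.loc[idx_demo].to_dict(), y_local_demo.loc[idx_demo])
```

Without the reset, `idxmax` would return the label 57 here, which does not correspond to a row position in the SHAP values.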

In [None]:
# checking the feature values for selected example


In [None]:
## what was the actual life expectancy?


## Shap Force Plots

### Force Plot - Linear Reg explanation

In [None]:
## plotting example force plot for most inf.deaths (from linreg)




### Force Plot - RandomForest explanation

In [None]:
## plotting example force plot for most inf.deaths (from rf)




> ***Q: What do you notice when comparing the lin reg and rf reg force plots?***

# LIME

In [None]:
from lime.lime_tabular import LimeTabularExplainer

## Create a lime explainer with X_shap_local with regression as the mode


In [None]:
## Use the lime explainer to explain the selected example used above 



___
# APPENDIX

### Global Force Plots

In [None]:
## Assumes the lin_reg SHAP values were saved as shap_values_linreg (hypothetical name)
shap.force_plot(explainer_linreg.expected_value, shap_values_linreg.values, X_shap_local)

In [None]:
shap.force_plot(explainer.expected_value,shap_values.values,X_shap_local)