# Modeling Explainability (Assignment 9)

## Student: Rodolfo Lerma

In this assignment, you will use the `LIME` library to produce local explanations, using surrogate models to explain the results of Random Forest Classifier models.

In [None]:
# pip install plotly

In [None]:
# !pip install lime

[BOOK for this](https://github.com/ajaymache/machine-learning-yearning)

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import warnings
np.random.seed(0)
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly
import matplotlib
matplotlib.rcParams['figure.figsize'] = [10, 7]

# warnings.filterwarnings('ignore')
# LIME is used below to build the explainers, so these imports must be active
import lime
import lime.lime_tabular

In [None]:
df_data = pd.read_csv('./music.xls')
df_data.head()

In [None]:
df_data.shape

The dataframe contains 35 features/columns and 10,000 datapoints/rows.

In [None]:
df_data.dtypes

Apart from `artist.id` & `song.id` all the other columns are either *float64* or *int64*.
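A quick way to confirm a statement like this without reading the full `dtypes` listing is to ask pandas for the non-numeric columns directly. The sketch below uses a tiny made-up frame (the values are placeholders, not the music data); on the real data the same call would be applied to `df_data`.

```python
import pandas as pd

# Toy frame standing in for df_data; values are invented for illustration.
toy = pd.DataFrame({
    "artist.id": ["AR1", "AR2"],
    "song.id": ["SO1", "SO2"],
    "song.tempo": [120.5, 98.0],
    "song.mode": [1, 0],
})

# Columns whose dtype is not numeric (here, the two id columns).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```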

In [None]:
df_data.isnull().sum()

### Question 1.1: Create the target of popular artists where artist familiarity is greater than 0.8 and artist hotttness is greater than 0.6.

In [None]:
df_data['class'] = np.where((df_data['artist.familiarity'] > 0.8) & (df_data['artist.hotttnesss'] > 0.6), 1, 0)

In [None]:
#Use Groupby on class and count artist.id
artists = df_data.groupby(['class']).count()['artist.id']
artists

In [None]:
import seaborn as sns
sns.catplot(x="class", kind="count", palette="ch:.1", data=df_data)

Looking at the distribution of the `class` column, we can see there is a class imbalance present. For this reason I will test 2 models, one with the class imbalance present and another without.
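One possible way to build the balanced variant is to downsample the majority class so both classes have the same number of rows. This is only a sketch under assumptions: the toy frame and its 90/10 split are invented, and downsampling is one option among several, not necessarily the approach used for the final models.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_data: 90 rows of class 0, 10 rows of class 1.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "class": [0] * 90 + [1] * 10,
})

# Size of the smallest class.
n_minority = toy["class"].value_counts().min()

# Draw n_minority rows from each class, without replacement.
balanced = pd.concat(
    [grp.sample(n=n_minority, random_state=0) for _, grp in toy.groupby("class")]
).reset_index(drop=True)

print(balanced["class"].value_counts().to_dict())
```

An alternative that keeps every row is to pass `class_weight="balanced"` to `RandomForestClassifier`, which reweights the classes during training instead of discarding data.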

In [None]:
def hist_plot(df,a):
    df[a].hist(figsize = (5,3))
    plt.xlabel(a, fontsize = 15)
    plt.ylabel('Frequency',fontsize = 15)
    plt.tick_params(axis="x", labelsize=10)
    plt.tick_params(axis="y", labelsize=10)
    plt.grid(True)
    plt.show()

In [None]:
vars_keep = ['song.bars_confidence', 
             'song.bars_start', 
             'song.beats_confidence', 
             'song.beats_start', 
             'song.duration', 
             'song.end_of_fade_in', 
             'song.hotttnesss', 
             'song.key_confidence', 
             'song.loudness', 
             'song.mode', 
             'song.mode_confidence', 
             'song.start_of_fade_out', 
             'song.tatums_confidence', 
             'song.tatums_start', 
             'song.tempo', 
             'song.time_signature', 
             'song.time_signature_confidence']

In [None]:
df_data[vars_keep].dtypes

In [None]:
for i in vars_keep:
    hist_plot(df_data, i)

### Question 1.2: Train a Random Forest Classifier with 100 estimators considering these variables:
* vars_keep = ['song.bars_confidence', 'song.bars_start', 'song.beats_confidence', 'song.beats_start', 'song.duration', 'song.end_of_fade_in', 'song.hotttnesss', 'song.key_confidence', 'song.loudness', 'song.mode', 'song.mode_confidence', 'song.start_of_fade_out', 'song.tatums_confidence', 'song.tatums_start', 'song.tempo', 'song.time_signature', 'song.time_signature_confidence']


In [None]:
X = df_data[vars_keep]
y = df_data['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print('Training set size - X_train: {} '.format(X_train.shape))
print('Test set size - X_test: {} '.format(X_test.shape))

In [None]:
#RandomForestClassifier n_estimators = 100, oob_score = True, random_state = 123456
rf = RandomForestClassifier(n_estimators = 100, oob_score = True, random_state = 123456)
rf.fit(X_train, Y_train)

In [None]:
def performance_values(X_test, y_test, trained_model):
    # Imported locally so the helper is self-contained
    from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

    # Accuracy from the hard class predictions
    y_pred = trained_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # ROC curve and AUC from the predicted probability of class 1
    preds = trained_model.predict_proba(X_test)[:, 1]
    fpr, tpr, threshold = roc_curve(y_test, preds)
    roc_auc = roc_auc_score(y_test, preds)
    return (fpr, tpr, accuracy, roc_auc)

In [None]:
def ROC_curve(fpr,tpr,roc_auc):
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [None]:
fpr, tpr, rf_acc, rf_roc_auc = performance_values(X_test, Y_test, rf)
print("Accuracy: {}".format(rf_acc))
print("AUC: {}".format(rf_roc_auc))

In [None]:
ROC_curve(fpr,tpr,rf_roc_auc)

### Question 2.1: Initialize the LIME explainer. You need to set the following arguments: feature_names, class_names, verbose, discretize_continuous, and mode. Note that the order of the entries in class_names is important.

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,  # LIME documents training_data as a NumPy array
                                                   feature_names = vars_keep, 
                                                   class_names = [0,1],
                                                   verbose = True,
                                                   mode = "classification",
                                                   discretize_continuous=False)

For this assignment, you need to visit the [documentation](https://github.com/marcotcr/lime) for `LIME` and find out how you can pass an instance to get a local explanation and produce some visualizations. Since we are using `LimeTabularExplainer`, you can focus on that in the documentation (example notebooks are provided).

### Question 2.2: Choose an instance from the test data, and obtain explanations for it. The explanations should include no more than 5 features (the top 5).

In [None]:
instance_num = np.random.randint(0, X_test.shape[0])

local_exp = explainer.explain_instance(X_test.iloc[instance_num], rf.predict_proba, num_features = 5)

### Question 2.3: Produce a feature importance plot for the explanation. HINT: `LIME` has a method for this. You only need to call it. <span style="color:red; float:right">; # you need the semicolon, otherwise two duplicate plots are produced</span>

In [None]:
fig = local_exp.as_pyplot_figure()

In [None]:
#Show local_exp data frame as a list
local_exp.as_list()

### Question 3: Call the `show_in_notebook` method to show a summary of the explanation. Set show_table = True, show_all = True

In [None]:
local_exp.show_in_notebook(show_table=True, show_all=False)

In [None]:
local_exp.show_in_notebook(show_table=True, show_all=True)

### Question 4: Interpret the results shown by calling `show_in_notebook`. Confirm that the predicted probability shown on the left matches the predicted probability we get by calling the model directly on the instance.

These are the input values for this particular entry:

In [None]:
X_test.iloc[instance_num]

In [None]:
example = rf.predict_proba(X_test)
example[instance_num]

In [None]:
prediction = rf.predict(X_test)
prediction[instance_num]

As we can see, the probabilities for both classes match the ones shown in the LIME output above.
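The check above predicts on the whole test set and then indexes one row. The model can also be called on the single instance directly, as long as the input keeps its 2-D shape. The sketch below illustrates this on synthetic data (the data, the model size, and the row index `i` are all assumptions made for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the music features and target.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

i = 3  # arbitrary row index

# Slicing with i:i+1 keeps the 2-D shape (1, n_features) that predict_proba expects.
single = model.predict_proba(X_toy[i:i + 1])[0]

# Same row taken from a prediction over the full data set.
batch = model.predict_proba(X_toy)[i]

print(single, batch)
```

With a DataFrame, the equivalent single-row selection would be `X_test.iloc[[instance_num]]` (note the double brackets, which return a one-row DataFrame rather than a Series).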

### [Bonus] Question 5: Repeat the above steps with a Support Vector Machine Classifier. What conclusions do you draw about model explainability?

In [None]:
# C must be strictly positive for SVC, so 0 is left out of the grid
parameters = {"C": [0.2, 0.4, 0.6, 0.8, 1.0],
              "kernel": ['linear', 'poly', 'rbf', 'sigmoid'],
              "probability": [True]}

svm_base = SVC()

svm_grid = GridSearchCV(estimator = svm_base, param_grid = parameters,
                        cv = 10, verbose = 2, n_jobs = -1)

svm_grid.fit(X_train, Y_train)

In [None]:
#Hyperparameters Best Parameters
best_params = svm_grid.best_params_
print("Best Hyperparameters: {}".format(best_params))

In [None]:
fpr, tpr, svm_acc, svm_roc_auc = performance_values(X_test, Y_test, svm_grid)
print("Accuracy: {}".format(svm_acc))
print("AUC: {}".format(svm_roc_auc))

In [None]:
ROC_curve(fpr,tpr,svm_roc_auc)

#### LIME 

In [None]:
explainer_svm = lime.lime_tabular.LimeTabularExplainer(X_train.values,  # LIME documents training_data as a NumPy array
                                                       feature_names = vars_keep, 
                                                       class_names = [0,1],
                                                       verbose = True,
                                                       mode = "classification",
                                                       discretize_continuous=False)

In [None]:
# Use the fitted grid search object; it delegates predict_proba to the best SVC found
local_exp_svm = explainer_svm.explain_instance(X_test.iloc[instance_num], svm_grid.predict_proba, num_features = 5)

In [None]:
fig_svm = local_exp_svm.as_pyplot_figure()

In [None]:
local_exp_svm.as_list()

In [None]:
local_exp_svm.show_in_notebook(show_table=True, show_all=False)

### Question 6: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: 
- What was your incoming experience with this model, if any? 
- What steps did you take, and what obstacles did you encounter?
- How do you link this exercise to real-world machine learning problem-solving?
- What steps were missing? 
- What else do you need to learn?

As discussed in class, Explainability & Interpretability are becoming more and more important as the use and impact of Machine Learning grows in our daily lives. In my experience working with thermodynamic data from commercial engines to build Machine Learning models that predict behavior and/or potential safety issues (such as engine failures), a simpler model that offered better Explainability & Interpretability was always preferred, even if that meant lower performance values. This is because having a clear understanding of the variables at play, and of how they interact to produce the final prediction, was more important than the prediction itself, especially since almost all the issues were related to safety.

This assignment was really interesting, as I had never worked with LIME before. It helps explain a complex model for which it would otherwise be very difficult (if not impossible) to say clearly why it predicts a particular result.

[Article](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f)

**Accounting for the context of the problem.**

In most problems, you are working with a dataset that is only a rough representation of the problem you are trying to solve and a machine learning model can typically not capture the full complexity of the real-life task. An interpretable model helps you to understand and account for the factors that are (not) included in the model and account for the context of the problem when taking actions based on model predictions.

**Improving generalisation and performance.**

A high interpretability typically leads to a model that generalises better. Interpretability is not about understanding every single detail of the model for all of the data points. The combination of solid data, model and problem understanding is necessary to have a solution that performs better.

**Ethical and legal reasons.**

In industries like finance and healthcare it is essential to audit the decision process and ensure it is e.g. not discriminatory or violating any laws. With the rise of data and privacy protection regulation like GDPR, interpretability becomes even more essential. In addition, in medical applications or self-driving cars, a single incorrect prediction can have a significant impact and being able to ‘verify’ the model is critical. Therefore the system should be able to explain how it reached a given recommendation.


LIME is a great tool to explain what machine learning classifiers (or models) are doing. It is model-agnostic, leverages simple and understandable ideas, and does not require a lot of effort to run. As always, even when using LIME, it is still important to correctly interpret the output.
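To make the "simple and understandable ideas" concrete, here is a rough sketch of a local surrogate in the spirit of LIME: perturb the instance, query the black-box model, weight the perturbed samples by how close they are to the instance, and fit a weighted linear model. This is a simplified illustration and not the `lime` library's actual implementation; the synthetic data, the Gaussian perturbation, the kernel width, and the use of `Ridge` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in data and a black-box model to explain.
X_toy, y_toy = make_classification(n_samples=500, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=0)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

instance = X_toy[0]
scale = X_toy.std(axis=0)

# 1. Perturb: sample points around the instance, scaled per feature.
samples = instance + rng.normal(size=(1000, X_toy.shape[1])) * scale

# 2. Query the black box for the class-1 probability of each sample.
probs = black_box.predict_proba(samples)[:, 1]

# 3. Weight samples by proximity to the instance (exponential kernel).
distances = np.linalg.norm((samples - instance) / scale, axis=1)
kernel_width = 0.75 * np.sqrt(X_toy.shape[1])  # assumed width; a tunable choice
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. Fit a weighted linear surrogate; its coefficients act as the local explanation.
surrogate = Ridge(alpha=1.0).fit(samples, probs, sample_weight=weights)

# Features ordered by the magnitude of their local effect.
top = np.argsort(-np.abs(surrogate.coef_))
for idx in top:
    print(f"feature_{idx}: {surrogate.coef_[idx]:+.3f}")
```

The coefficients only describe the model's behavior near the chosen instance, which is why the same feature can get a different sign or size for another row.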