# Learning Activity: Interpretable machine learning

by Sam Edeh  
October 2020


This notebook models loan data and attempts to interpret or explain the model's behavior.

To interpret means to explain or to present in understandable terms. In the context of ML systems, 
interpretability is the ability to explain or to present in understandable terms to a human ([Finale Doshi-Velez and Been Kim, 2017](https://arxiv.org/abs/1702.08608)).

In [None]:
# data analysis and manipulation tool
import pandas as pd

# machine learning libraries
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# graphical tools
import seaborn
from matplotlib import pyplot

# machine learning interpretability tools
from interpret.blackbox import LimeTabular
from interpret import show

## Load data

In [None]:
# pre-cleaned data
url = 'https://raw.githubusercontent.com/sedeh/Datasets/main/loan_data_25mb.csv'
df = pd.read_csv(url)

In [None]:
df.shape

On a small machine, we may need to work with only a sample of the available data. The next cell keeps every row by default (`n = len(df)`); uncomment `n = 10000` to use a smaller sample instead.

In [None]:
# n = 10000
n = len(df)

In [None]:
sample = df.sample(n=n, random_state=1)

## Explore data

In [None]:
sample.head()

In [None]:
sample = sample.drop('loan_id', axis=1)

In [None]:
sample.describe()

In [None]:
pyplot.figure(figsize=(15, 10))
seaborn.heatmap(sample.corr(), annot=True)
pyplot.show()

## Select target

In [None]:
y = sample['interest_rate']

## Select features

In [None]:
features = sample.columns.to_list()
features.remove('interest_rate')
X = sample[features]

## Split data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)

In [None]:
# Convert the DataFrames to NumPy arrays; the model is fit and queried on these arrays
X_test_matrix = X_test.values
X_train_matrix = X_train.values

## Fit model

In [None]:
model = XGBRegressor()
model.fit(X_train_matrix, y_train)


## Model feature importance provides a first level of transparency

In [None]:
feature_importances = {'features':  X_train.columns, 
                       'importance': model.feature_importances_}

In [None]:
feature_importances_df = pd.DataFrame(feature_importances, columns=['features', 'importance'])
feature_importances_df = feature_importances_df.sort_values('importance', ascending=False)
feature_importances_df

In [None]:
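ax = seaborn.barplot(x="features", y="importance", data=feature_importances_df)
# Rotate the feature names so they stay readable; do not overwrite ax with the tick objects
pyplot.xticks(rotation=90)
pyplot.show()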

## Predict

In [None]:
# Predict on the NumPy array, matching the format the model was fit on
y_pred = model.predict(X_test_matrix)

## Evaluation metric provides a second level of transparency

In [None]:
mean_absolute_error(y_test, y_pred)

In [None]:
y_test.describe()
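
The mean absolute error is easier to judge next to the spread of the target, which is why `y_test.describe()` follows it. As a rough sketch of what the metric measures, the cell below computes it by hand and compares it with a naive baseline that always predicts the mean. The function name `mean_absolute_error_manual` and the toy interest rates are hypothetical and are not taken from the loan data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical interest rates in percent, for illustration only
y_true_toy = [10.0, 12.0, 14.0, 16.0]
y_pred_toy = [11.0, 12.0, 13.0, 18.0]

model_mae = mean_absolute_error_manual(y_true_toy, y_pred_toy)

# Naive baseline: predict the same mean rate for every loan.
# (With real data, the mean would normally come from the training set.)
mean_rate = sum(y_true_toy) / len(y_true_toy)
baseline_mae = mean_absolute_error_manual(y_true_toy, [mean_rate] * len(y_true_toy))

print(f"model MAE: {model_mae:.2f}, baseline MAE: {baseline_mae:.2f}")
```

A model whose error is not clearly below this kind of baseline is adding little beyond the average rate.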

## Local explanations provide a third level of transparency

In [None]:
# Blackbox explainers need a predict function, and optionally a dataset
lime = LimeTabular(predict_fn=model.predict, data=X_train, random_state=1)

# Pick the instances to explain, optionally pass in labels if you have them
lime_local = lime.explain_local(X_test[:5], y_test[:5], name='LIME')

show(lime_local)
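
To give some intuition for what a local explanation is, the cell below sketches the general idea behind LIME on a single feature: perturb the input around one instance, weight the perturbed points by their closeness to that instance, and fit a weighted linear model to the black-box predictions. This is a simplified illustration and not the implementation used by `LimeTabular`; the function `local_linear_surrogate`, the stand-in `black_box` model, and all parameter values are assumptions made for the example.

```python
import math
import random


def black_box(x):
    """Hypothetical stand-in for a model's predict function on one feature."""
    return x ** 2


def local_linear_surrogate(predict_fn, x0, n_samples=1000, scale=0.5,
                           kernel_width=0.5, seed=1):
    """Fit a proximity-weighted straight line to predict_fn around x0."""
    rng = random.Random(seed)

    # 1. Perturb: sample points in the neighbourhood of the instance
    xs = [rng.gauss(x0, scale) for _ in range(n_samples)]
    ys = [predict_fn(x) for x in xs]

    # 2. Weight: points closer to x0 count more (Gaussian kernel)
    ws = [math.exp(-((x - x0) ** 2) / (2 * kernel_width ** 2)) for x in xs]

    # 3. Fit: closed-form weighted least squares for a single feature
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = local_linear_surrogate(black_box, x0=3.0)
print(f"local slope near x0=3: {slope:.2f}")
```

The slope of the fitted line plays the role of the per-feature weight in a LIME explanation: it describes how the prediction responds to the feature near this one instance, not across the whole dataset.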