# Training an interpretable model
In this notebook, we're going to train an interpretable model.

We'll cover the following topics in this notebook:

* [Loading and preprocessing the data](#loading-and-preprocessing-the-data)
* [Training the model](#training-the-model)
* [Interpreting the model](#interpreting-the-model)

## Loading and preprocessing the data
First, we're going to load and preprocess the data for our model. We'll perform the following steps:

* First, we load the dataset and split it into a training set and a validation set.
* Next, we collect the input variables for the model to train on.
* After that, we collect the output variable for the model to predict.

### Loading and splitting the dataset
We load the training set from `data/processed/train.csv` and set aside 30% of the data for validation.
We use the remaining 70% to train the model.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [5]:
df = pd.read_csv('../data/processed/train.csv')
df_train, df_val = train_test_split(df, test_size=0.3)
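
The split above is random, so each run of the notebook produces a different training and validation set. As a minimal sketch of one way to make it repeatable, the example below fixes `random_state` and stratifies on the target column. The toy DataFrame, the `feature_a` column, and the seed value are assumptions for illustration only, not part of the original dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical values).
# Only the target column name is taken from the notebook.
df_demo = pd.DataFrame({
    'feature_a': range(100),
    'default.payment.next.month': [0] * 80 + [1] * 20,
})

# random_state makes the shuffle repeatable; stratify keeps the share of
# defaults the same in the training and validation sets.
df_train_demo, df_val_demo = train_test_split(
    df_demo,
    test_size=0.3,
    random_state=42,
    stratify=df_demo['default.payment.next.month'],
)

print(len(df_train_demo), len(df_val_demo))
```

Stratifying is optional, but it can help when one class is much rarer than the other, because a purely random split may leave the validation set with too few positive examples.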

### Extracting features for training
After loading and splitting the dataset, we extract the features for training.
We already know that we shouldn't use the `SEX` column because it's a protected attribute, so we drop it.

Also, note that we're dropping the `default.payment.next.month` column from the feature set as we don't want our predicted variable to be part of the input variables for the model.

In [6]:
# Drop the protected attribute and the target from the input features.
x_train = df_train.drop(['SEX', 'default.payment.next.month'], axis=1)
x_val = df_val.drop(['SEX', 'default.payment.next.month'], axis=1)

### Extracting the output variable
Once we have the features for training, we extract the output variable that we want to predict.

In [7]:
y_train = df_train['default.payment.next.month']
y_val = df_val['default.payment.next.month']
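
The feature and target extraction steps can also be combined into one small helper, so the training and validation sets are always processed the same way. This is a sketch only: `split_features_target`, the `TARGET` and `PROTECTED` constants, and the toy data are hypothetical names introduced here for illustration.

```python
import pandas as pd

TARGET = 'default.payment.next.month'
PROTECTED = ['SEX']  # protected attributes to exclude from the features


def split_features_target(df, target=TARGET, protected=PROTECTED):
    """Return (features, target), excluding the target and protected columns."""
    features = df.drop(columns=[target, *protected])
    return features, df[target]


# Toy data with made-up values; 'feature_a' is a hypothetical column.
df_demo = pd.DataFrame({
    'SEX': [1, 2, 2, 1],
    'feature_a': [10, 20, 30, 40],
    TARGET: [0, 1, 0, 1],
})

x_demo, y_demo = split_features_target(df_demo)
print(list(x_demo.columns))
```

Keeping the list of excluded columns in one place reduces the risk of dropping a column from the training features but not from the validation features.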

## Training the model
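Now that we have the data ready for training, let's train the model. We use a pipeline that first applies PCA to the input features and then fits a random forest classifier with 100 trees.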

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [9]:
pca = PCA()
classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1)

model = Pipeline([('pca', pca), ('clf', classifier)])
model.fit(x_train, y_train)

Pipeline(steps=[('pca', PCA()), ('clf', RandomForestClassifier(n_jobs=-1))])
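
To see what the PCA step contributes, the fitted pipeline can be inspected through `named_steps`. The sketch below assumes synthetic data from `make_classification` in place of the real dataset and fixes `random_state` so the result is repeatable. The step names match the pipeline above, but everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real training data (assumption).
x_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

demo_model = Pipeline([
    ('pca', PCA()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=0)),
])
demo_model.fit(x_demo, y_demo)

# Share of the total variance captured by each principal component.
variance_ratio = demo_model.named_steps['pca'].explained_variance_ratio_
print(variance_ratio.round(3))
```

Because `PCA()` keeps all components by default, the ratios sum to roughly 1. Note that the classifier then works on principal components rather than on the original columns.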

## Interpreting the model
After training, we check that the model performs as expected.
We use two performance measures:

* Accuracy
* Receiver operating characteristic (ROC) curve

In [10]:
import matplotlib.pyplot as plt

In [11]:
model.score(x_val, y_val)

0.8529629629629629
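
The ROC curve is the second measure listed above. A minimal sketch of how it could be computed with scikit-learn follows, assuming synthetic data and a freshly fitted pipeline in place of the notebook's `model`, `x_val`, and `y_val`. The names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real dataset (assumption).
x_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
x_train_demo, x_val_demo, y_train_demo, y_val_demo = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=0
)

demo_model = Pipeline([
    ('pca', PCA()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=0)),
])
demo_model.fit(x_train_demo, y_train_demo)

# The ROC curve needs scores, not hard labels: use the probability
# of the positive class (column 1 of predict_proba).
scores = demo_model.predict_proba(x_val_demo)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val_demo, scores)
auc = roc_auc_score(y_val_demo, scores)

print(f'ROC AUC: {auc:.3f}')
```

The `fpr` and `tpr` arrays can then be drawn with `plt.plot(fpr, tpr)`. An area under the curve close to 1 indicates that the model ranks positive cases above negative ones, while 0.5 corresponds to random guessing.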

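Next, we use LIME to explain the model's predictions for the first five samples of the validation set.

In [14]: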
from interpret import show
from interpret.blackbox import LimeTabular

explainer = LimeTabular(predict_fn=model.predict_proba, data=x_train, random_state=1337)
explanation = explainer.explain_local(x_val[:5], y_val[:5])

In [15]:
show(explanation)