# Uncover model structure

In this notebook we're going to explore methods to uncover the structure of your machine learning model. We'll cover the following topics:

* Calculating feature importance
* Creating a model profile

If you haven't read the README.md file yet, we encourage you to read it first. It introduces the use case and the dataset used in this notebook and the other notebooks in the project.

Let's start by preparing everything we need for this notebook.

## Preparation steps

Before we can use an explainer on our model, we need to load a test dataset and the model itself. The model for this demo is based on the [UCI Credit card defaulters dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). You can learn more about training the model in `tasks/train-model.py`. For more details on the data preparation you can check out the `tasks/prepare-data.py` and `tasks/split-data.py` scripts.

Let's start by loading up the necessary Python packages:

In [1]:
import joblib
import pandas as pd

After loading the libraries for the model, let's load the model and some test data.
You can use `df_test.head()` to see what's in the dataset if you want.

In [2]:
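# Load the trained classifier from disk.
model = joblib.load('../models/classifier.bin')

# Draw a random sample of 385 rows from the test set.
# Pass random_state=<int> to sample() if you need the same rows on every run.
df_test = pd.read_csv('../data/processed/test.csv').sample(385)

# Separate the features from the target column.
x_test = df_test.drop('LABEL', axis=1)
y_test = df_test['LABEL']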

With the dataset and model ready to go, let's begin calculating feature importance information for the model.

## Calculating feature importance

One method to get more insight into the behavior of your model is to use a [permutation feature importance](https://scikit-learn.org/stable/modules/permutation_importance.html) algorithm. This algorithm shuffles the values of one feature at a time and measures how much the model's performance drops as a result. The larger the drop, the more the model relies on that feature. You can use this knowledge to focus the model on the most important features and remove features that contribute little, which simplifies the model and may reduce its error. Feature importance can also guide work on the dataset: improving the data quality of the most important features is likely to have the largest effect.
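To make the idea concrete, here is a minimal sketch of permutation feature importance in plain Python. It is not how Dalex or scikit-learn implement it. The names `accuracy`, `permutation_importance`, and `toy_model` are hypothetical, the data is synthetic, and accuracy stands in for whatever loss function a real library would use.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle one column; leave all other columns untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Synthetic data (an assumption for illustration): the label depends only
# on the first feature; the second feature is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.randint(-1, 3), data_rng.random()] for _ in range(200)]
labels = [int(row[0] >= 2) for row in rows]


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at the first feature."""
    return int(row[0] >= 2)


importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Shuffling the first column breaks the link between that feature and the label, so accuracy drops noticeably. Shuffling the second column changes nothing, because the toy model never reads it, so its importance is zero.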

In Dalex, you can use the `model_parts` method to get the feature importance of your model:

In [3]:
import dalex as dx

In [4]:
explainer = dx.Explainer(model, x_test, y_test, model_type='classifier', label='LABEL')

Preparation of a new explainer is initiated

  -> data              : 385 rows 29 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 385 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : LABEL
  -> predict function  : <function yhat_proba_default at 0x000002E3613BFCA0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.233, max = 0.89
  -> model type        : classifier will be used
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.81, mean = -0.00922, max = 0.97
  -> model_info        : package sklearn

A new explainer has been created!


In [5]:
explainer.model_parts(type='variable_importance').plot()

The explainer shows the features in order of importance. The key feature for this model is the `PAY_1` feature. This feature indicates whether someone paid on time in the first month or later. A value of -1 means the payment was on time; higher values indicate a delay in paying the bill for the first month. As this is a model that predicts defaulters, it is plausible that the payment status for the first month is a good indicator.

Let's continue exploring the structure of the model by diving into the features a bit more. 

## Creating a model profile

You can learn more about features by profiling them at the model level. A model profile shows how the model's prediction changes, on average, across the different values of a feature.
We're using the [accumulated local effects](https://arxiv.org/abs/1612.08468) algorithm to compute the profile. The model profile allows you to ask the question: "What happens if I change the value of this one feature?":

In [6]:
explainer.model_profile(type='accumulated').plot(variables=['PAY_1'])

Calculating ceteris paribus: 100%|██████████| 29/29 [00:06<00:00,  4.18it/s]
Calculating accumulated dependency: 100%|██████████| 29/29 [00:02<00:00, 10.42it/s]


The model profile shows that the model does indeed depend a lot on the value of the `PAY_1` feature. So much so that if the payment delay is larger than one month, the model is very likely to predict a default.

Feel free to try out other variables from the dataset to see more profiles. You can leave the `variables=` argument out of the `plot` call to show all variables.
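If you're curious what an accumulated local effects profile computes, the following is a simplified sketch for a single numeric feature. It is an illustration only and not Dalex's implementation. The names `ale_profile` and `toy_model` are hypothetical, the data is synthetic, and the sketch assumes the feature has at least two distinct values.

```python
import random


def ale_profile(predict, rows, col, n_bins=5):
    """Return (edges, ale): centered first-order ALE of feature `col` at each bin edge."""
    values = sorted(row[col] for row in rows)
    last = len(values) - 1
    # Bin edges at evenly spaced quantiles of the observed values (duplicates dropped).
    edges = sorted({values[round(i * last / n_bins)] for i in range(n_bins + 1)})

    accumulated = [0.0]
    counts = []
    for k in range(len(edges) - 1):
        lower, upper = edges[k], edges[k + 1]
        # Rows whose value lies in (lower, upper]; the first bin also includes its lower edge.
        in_bin = [
            r for r in rows
            if lower < r[col] <= upper or (k == 0 and r[col] == lower)
        ]
        # Local effect: average change in prediction when only this feature
        # moves from the lower edge to the upper edge of the bin.
        diffs = [
            predict(r[:col] + [upper] + r[col + 1:])
            - predict(r[:col] + [lower] + r[col + 1:])
            for r in in_bin
        ]
        accumulated.append(accumulated[-1] + sum(diffs) / len(diffs))
        counts.append(len(in_bin))

    # Center the curve so that its count-weighted average over the data is zero.
    mean_effect = sum(
        n * (accumulated[k] + accumulated[k + 1]) / 2 for k, n in enumerate(counts)
    ) / sum(counts)
    return edges, [a - mean_effect for a in accumulated]


# Synthetic data (an assumption for illustration): two correlated features.
data_rng = random.Random(0)
rows = []
for _ in range(200):
    x0 = data_rng.uniform(0, 4)
    x1 = x0 + data_rng.gauss(0, 0.5)  # x1 is correlated with x0
    rows.append([x0, x1])


def toy_model(row):
    """Stand-in for a trained model: a simple linear function of both features."""
    return 3 * row[0] + row[1]


edges, ale = ale_profile(toy_model, rows, col=0)
for edge, effect in zip(edges, ale):
    print(f"{edge:.2f} -> {effect:+.2f}")
```

Because the toy model is linear, the profile for the first feature rises with a slope equal to that feature's coefficient. Each local effect is computed only from rows that actually fall in the bin, so the model is never evaluated far away from the observed data.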

## Summary

In this notebook we've explored how to use explainers to uncover the structure of a machine learning model. We've used permutation feature importance to get a better understanding of how important each feature is for the outcome of the model. We then used an accumulated local effects plot to better understand the impact of the `PAY_1` feature on the outcome of the model.

In the next notebook `debu-model-predictions.ipynb` we're going to dive into prediction-level explanations. The prediction-level explanations will tell us more about the behavior of the model when we present it with a single data sample.