# Interpreting Models

You can use Azure Machine Learning to interpret a model by using an *explainer* that quantifies how much each feature contributes to the predicted label. There are many common explainers, each suited to different kinds of modeling algorithm, but the basic approach to using them is the same.


## Explain a Model

Let's start with a model that is trained outside of Azure Machine Learning. Run the cell below to train a decision tree classification model.

In [16]:
import shutil
import pandas as pd
import numpy as np
from azureml import core
from azureml.core import Environment, Experiment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.estimator import Estimator
from azureml.widgets import RunDetails
from azureml.contrib.interpret.explanation.explanation_client import (
    ExplanationClient,
)
from sklearn import model_selection, tree, metrics
import interpret_community
import os


In [2]:
print('Loading Data...')
data: pd.DataFrame = pd.read_csv('data/diabetes.csv')

features = [
    'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
    'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age'
]
labels = ['not-diabetic', 'diabetic']
X = data[features].to_numpy()
y = data['Diabetic'].to_numpy()

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.30, random_state=0
)

print('Training a decision tree model')
model = tree.DecisionTreeClassifier().fit(X_train, y_train)

y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)

y_scores = model.predict_proba(X_test)
auc = metrics.roc_auc_score(y_test, y_scores[:, 1])
print('AUC:', auc)

print('Model trained.')


Loading Data...
Training a decision tree model
Accuracy: 0.89
AUC: 0.8775909249216378
Model trained.
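
To make the two metrics above concrete, here is a minimal pure-Python sketch of what accuracy and AUC measure. The helper names and the six-row toy data are hypothetical and are not taken from the diabetes dataset; in the cell above the same quantities come from `np.average` and `metrics.roc_auc_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive case scores above a random negative one.

    Ties count as half a win, which matters for models such as decision
    trees that often emit identical scores for many rows.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for the positive class.
toy_y_true = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
toy_y_pred = [int(s >= 0.5) for s in toy_scores]

toy_accuracy = accuracy(toy_y_true, toy_y_pred)
toy_auc = roc_auc(toy_y_true, toy_scores)
print('Accuracy:', toy_accuracy)
print('AUC:', toy_auc)
```

Accuracy depends on the 0.5 decision threshold, whereas AUC only depends on how the scores rank positive cases relative to negative ones.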


The training process generated some model evaluation metrics based on a held-back test dataset, so you have an idea of how accurately it predicts; but how do the features in the data influence the prediction?

### Get an Explainer for our Model

Let's get a suitable explainer for the model from the Azure ML interpretability library you installed earlier. There are many kinds of explainer. In this example you'll use a *Tabular Explainer*, which is a "black box" explainer that can be used to explain many kinds of model by invoking an appropriate [SHAP](https://github.com/slundberg/shap) model explainer.

In [3]:
tab_explainer = interpret_community.TabularExplainer(
    model,
    X_train,
    features=features,
    classes=labels
)
print(tab_explainer, "ready!")


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


TabularExplainer ready!
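
SHAP explainers attribute a prediction to individual features using Shapley values. The following is a minimal sketch of that idea, not the library's implementation: it computes exact Shapley values for a hypothetical three-feature `toy_model` by enumerating every feature coalition, where features outside a coalition take their value from a hypothetical `baseline` row. Real explainers use much faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition come from x, the rest from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Weight of each coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                contribution += weight * (
                    value(members | {i}) - value(members)
                )
        phi.append(contribution)
    return phi


def toy_model(row):
    # Hypothetical model: two additive terms plus one interaction term.
    a, b, c = row
    return 2 * a + b + a * c


x = [1, 2, 4]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print('Shapley values:', [round(v, 6) for v in phi])
print('Sum:', round(sum(phi), 6))
print('f(x) - f(baseline):', toy_model(x) - toy_model(baseline))
```

The values sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term `a * c` is split evenly between the two features involved.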


### Get Global Feature Importance

The first thing to do is try to explain the model by evaluating the overall *feature importance* - in other words, quantifying the extent to which each feature influences the prediction based on the whole training dataset.

In [4]:
global_tab_explanation = tab_explainer.explain_global(X_train)

global_tab_feature_importance = (
    global_tab_explanation.get_feature_importance_dict()
)
for feature, importance in global_tab_feature_importance.items():
    print(f'{feature}: {importance}')


Pregnancies: 0.21785922910201155
Age: 0.10598837969058147
BMI: 0.09417181850965754
SerumInsulin: 0.06888700310619172
PlasmaGlucose: 0.04956340844806506
TricepsThickness: 0.021350502953945423
DiastolicBloodPressure: 0.01709250842302002
DiabetesPedigree: 0.012945884096178856


The feature importance is ranked, with the most important feature listed first.
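
A common convention in SHAP-based tooling is to derive a global importance from per-observation (local) values by averaging their absolute values for each feature and then sorting. The sketch below illustrates that convention with hypothetical local values for three features; whether the explainer used here aggregates in exactly this way is an assumption.

```python
# Hypothetical local importance values, one dict per observation.
local_values = [
    {'Pregnancies': -0.26, 'Age': 0.23, 'BMI': 0.01},
    {'Pregnancies': 0.03, 'Age': 0.02, 'BMI': 0.35},
    {'Pregnancies': 0.25, 'Age': -0.11, 'BMI': -0.04},
]


def global_importance(rows):
    """Mean absolute local value per feature, ranked from high to low."""
    feature_names = rows[0].keys()
    means = {
        name: sum(abs(row[name]) for row in rows) / len(rows)
        for name in feature_names
    }
    return dict(sorted(means.items(), key=lambda kv: kv[1], reverse=True))


ranked = global_importance(local_values)
for name, importance in ranked.items():
    print(f'{name}: {importance:.4f}')
```

Taking absolute values means a feature counts as important whether it pushes predictions toward or away from a label, which is why global values are never negative.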

### Get Local Feature Importance

So you have an overall view, but what about explaining individual observations? Let's generate *local* explanations for individual predictions, quantifying the extent to which each feature influenced the decision to predict each of the possible label values. In this case, it's a binary model, so there are two possible labels (not-diabetic and diabetic), and you can quantify the influence of each feature for each of these label values for individual observations in a dataset. You'll just evaluate the first two cases in the test dataset.

In [5]:
X_explain = X_test[0:2]

predictions = model.predict(X_explain)

local_tab_explanation = tab_explainer.explain_local(X_explain)

local_tab_features = local_tab_explanation.get_ranked_local_names()
local_tab_importance = local_tab_explanation.get_ranked_local_values()

# Walk through each class label, then each observation, then each feature
for class_idx, class_name in enumerate(labels):
    print('Support for', class_name)
    class_features = local_tab_features[class_idx]
    for obs_idx, feature_list in enumerate(class_features):
        print('\tObservation', obs_idx + 1)
        total_support = 0
        for feat_idx, feature in enumerate(feature_list):
            importance = local_tab_importance[class_idx][obs_idx][feat_idx]
            print(f'\t\t{feature}: {importance}')
            total_support += importance
        print(
            '\t\t ----------\n\t\t Total:', total_support,
            'Prediction:', labels[predictions[obs_idx]]
        )


Support for not-diabetic
	Observation 1
		SerumInsulin: 0.38226306716502617
		Age: 0.23401419123401868
		TricepsThickness: 0.02372568084381155
		BMI: 0.009781979685997106
		DiabetesPedigree: 0.00032437561716795965
		DiastolicBloodPressure: -0.015072666151411911
		PlasmaGlucose: -0.04457745516406177
		Pregnancies: -0.2570306018019769
		 ----------
		 Total: 0.33342857142857085 Prediction: not-diabetic
	Observation 2
		BMI: 0.3455558113786633
		Pregnancies: 0.02847656984423532
		Age: 0.016679974987571556
		PlasmaGlucose: 0.010502289531511659
		DiabetesPedigree: 0.003690753722564006
		DiastolicBloodPressure: 0.0014083273816485585
		TricepsThickness: -0.025738110934097516
		SerumInsulin: -0.04714704448352626
		 ----------
		 Total: 0.33342857142857063 Prediction: not-diabetic
Support for diabetic
	Observation 1
		Pregnancies: 0.2570306018019765
		PlasmaGlucose: 0.04457745516406178
		DiastolicBloodPressure: 0.015072666151411934
		DiabetesPedigree: -0.0003243756171679545
		BMI: -0.0097819796
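
Two patterns in the output above are worth checking by hand: the total for an observation is the sum of its per-feature values, and for this binary model the values supporting one label mirror those supporting the other with the sign flipped. The sketch below verifies both using numbers copied from the output for Observation 1; only the four "diabetic" values that are printed in full are compared.

```python
# Values copied from "Support for not-diabetic", Observation 1.
not_diabetic_obs1 = {
    'SerumInsulin': 0.38226306716502617,
    'Age': 0.23401419123401868,
    'TricepsThickness': 0.02372568084381155,
    'BMI': 0.009781979685997106,
    'DiabetesPedigree': 0.00032437561716795965,
    'DiastolicBloodPressure': -0.015072666151411911,
    'PlasmaGlucose': -0.04457745516406177,
    'Pregnancies': -0.2570306018019769,
}

# The first four values printed under "Support for diabetic", Observation 1.
diabetic_obs1_partial = {
    'Pregnancies': 0.2570306018019765,
    'PlasmaGlucose': 0.04457745516406178,
    'DiastolicBloodPressure': 0.015072666151411934,
    'DiabetesPedigree': -0.0003243756171679545,
}

# The per-feature values add up to the reported total.
total = sum(not_diabetic_obs1.values())
print('Total support for not-diabetic:', round(total, 6))

# Each "diabetic" value is the negative of the "not-diabetic" one.
mirrored = {
    feature: abs(value + not_diabetic_obs1[feature]) < 1e-12
    for feature, value in diabetic_obs1_partial.items()
}
print('Mirrored with opposite sign:', mirrored)
```

Under SHAP's additivity property, that total can be read as the gap between the probability the model predicts for this observation and the model's average prediction for that label. This reading is an assumption based on how SHAP values are defined, not something the output itself states.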

## Adding Explainability to Azure ML Model Training Experiments

As you've seen, you can generate explanations for models trained outside of Azure ML; but when you use experiments to train models in your Azure ML workspace, you can generate model explanations and log them.

### Connect to Your Workspace

To run an experiment, you need to connect to your workspace using the Azure ML SDK.

> **Note**: You may be prompted to authenticate. Just copy the code and click the link provided to sign in to your Azure subscription, and then return to this notebook.

In [8]:
ws = core.Workspace.from_config()
print(f'Ready to use Azure ML {core.VERSION} to work with {ws.name}')


Ready to use Azure ML 1.12.0 to work with workspace


### Train and Explain a Model using an Experiment

OK, let's create an experiment and put the files it needs in a local folder - in this case we'll just use the same CSV file of diabetes data to train the model.

In [11]:
experiment_folder = 'diabetes-train-and-explain'
os.makedirs(experiment_folder, exist_ok=True)

shutil.copy(
    'data/diabetes.csv', os.path.join(experiment_folder, "diabetes.csv")
)


'diabetes-train-and-explain/diabetes.csv'

Now we'll create a training script that looks similar to any other Azure ML training script, except that it includes the following features:

- The same libraries to generate model explanations we used before are imported and used to generate a global explanation
- The **ExplanationClient** class is used to upload the explanation to the experiment output

In [12]:
%%writefile $experiment_folder/diabetes_training.py
import os
import pandas as pd
import numpy as np
import joblib
from azureml.contrib.interpret.explanation.explanation_client import (
    ExplanationClient,
)
from interpret_community import TabularExplainer
from sklearn import model_selection, metrics

from azureml import core

from sklearn.tree import DecisionTreeClassifier

run = core.run.Run.get_context()

print("Loading Data...")
data = pd.read_csv('diabetes.csv')

features = [
    'Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
    'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age',
]
labels = ['not-diabetic', 'diabetic']

X = data[features].to_numpy()
y = data['Diabetic'].to_numpy()

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.30, random_state=0
)

print('Training a decision tree model')
model: DecisionTreeClassifier = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', acc)

y_scores = model.predict_proba(X_test)
auc = metrics.roc_auc_score(y_test, y_scores[:, 1])
run.log('AUC', auc)

os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes.pkl')

explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)

explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(
    explanation, comment='Tabular Explanation'
)

run.complete()


Writing diabetes-train-and-explain/diabetes_training.py


Now you can run the experiment, using an estimator to run the training script. Note that the **azureml-interpret** package is included in the training environment so the script can create a **TabularExplainer**, and the **azureml-contrib-interpret** package is included so the script can use the **ExplanationClient** class.

In [15]:
env = Environment('diabetes-interpret-env')
env.python.user_managed_dependencies = False
env.docker.enabled = True

packages = CondaDependencies.create(
    conda_packages=['scikit-learn', 'pandas'],
    pip_packages=[
        'azureml-defaults', 'azureml-interpret', 'azureml-contrib-interpret',
    ]
)

env.python.conda_dependencies = packages

estimator = Estimator(
    experiment_folder,
    compute_target='local',
    environment_definition=env,
    entry_script='diabetes_training.py'
)

experiment = Experiment(ws, 'diabetes_train_and_explain')
run = experiment.submit(estimator)
RunDetails(run).show()
run.wait_for_completion()


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes_train_and_explain_1598294829_0b4d793a',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2020-08-24T18:51:18.117484Z',
 'endTimeUtc': '2020-08-24T18:51:25.326183Z',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': '7ca3d1bb-fc3e-4987-be39-cc55ef804d22',
  'azureml.git.repository_uri': 'https://github.com/susumuasaga/mslearn-aml-labs',
  'mlflow.source.git.repoURL': 'https://github.com/susumuasaga/mslearn-aml-labs',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '3c7ba4c1aa5e32db1bef7a95b42a0d109e7b2760',
  'mlflow.source.git.commit': '3c7ba4c1aa5e32db1bef7a95b42a0d109e7b2760',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'scriptType': None,
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'loca

## Retrieve the Feature Importance Values

With the experiment run completed, you can use the **ExplanationClient** class to retrieve the feature importance from the explanation registered for the run.

In [17]:
client = ExplanationClient.from_run(run)
engineered_explanations = client.download_model_explanation()
feature_importances = engineered_explanations.get_feature_importance_dict()

print('Feature\tImportance')
for key, value in feature_importances.items():
    print(f'{key}\t{value}')


Feature	Importance
Pregnancies	0.22222578576717011
Age	0.10486983748522587
BMI	0.09623613192635179
SerumInsulin	0.06965552107721515
PlasmaGlucose	0.050188321430330016
TricepsThickness	0.02233690774872693
DiastolicBloodPressure	0.016299865650036834
DiabetesPedigree	0.015240766647596868
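
The values retrieved from the run are close to, but not identical to, the global importances computed locally earlier. One difference visible in the code is that the training script calls `explain_global` on the test split, whereas the earlier cell used the training split; whether that fully accounts for the gap is an assumption. The sketch below compares the two sets of values, copied from the outputs in this notebook.

```python
# Global importances printed by the local explainer (explained on X_train).
local_run = {
    'Pregnancies': 0.21785922910201155,
    'Age': 0.10598837969058147,
    'BMI': 0.09417181850965754,
    'SerumInsulin': 0.06888700310619172,
    'PlasmaGlucose': 0.04956340844806506,
    'TricepsThickness': 0.021350502953945423,
    'DiastolicBloodPressure': 0.01709250842302002,
    'DiabetesPedigree': 0.012945884096178856,
}

# Global importances downloaded from the experiment run (explained on X_test).
experiment_run = {
    'Pregnancies': 0.22222578576717011,
    'Age': 0.10486983748522587,
    'BMI': 0.09623613192635179,
    'SerumInsulin': 0.06965552107721515,
    'PlasmaGlucose': 0.050188321430330016,
    'TricepsThickness': 0.02233690774872693,
    'DiastolicBloodPressure': 0.016299865650036834,
    'DiabetesPedigree': 0.015240766647596868,
}


def ranking(importances):
    """Feature names ordered from most to least important."""
    return sorted(importances, key=importances.get, reverse=True)


same_order = ranking(local_run) == ranking(experiment_run)
max_gap = max(abs(local_run[f] - experiment_run[f]) for f in local_run)
print('Same ranking:', same_order)
print('Largest absolute difference:', round(max_gap, 6))
```

Comparing the ranking as well as the raw values is a quick way to confirm that two explanations tell the same story even when the numbers differ slightly.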


## View the Model Explanation in Azure Machine Learning studio

You can also click the link in the Run Details widget to see the run in Azure Machine Learning studio, and view the **Explanations** tab. Then:

1. Select the **Tabular Explanation** explainer.
2. View the **Global Importance** chart, which shows the overall global feature importance.
3. View the **Summary Importance** chart, which shows each data point from the test data in a *swarm*, *violin*, or *box* plot.
4. Select an individual point to see the **Local Feature Importance** for the individual prediction for the selected data point.


**More Information**: For more information about using explainers in Azure ML, see [the documentation](https://docs.microsoft.com/azure/machine-learning/how-to-machine-learning-interpretability). 