# Interpreting Models

You can use Azure Machine Learning to interpret a model by using an *explainer* that quantifies how much each feature contributes to the predicted label. There are many common explainers, each suited to different kinds of modeling algorithm, but the basic approach to using them is the same.

Let's start by ensuring that you have the latest version of the Azure ML SDK installed.

In [1]:
!pip install --upgrade azureml-sdk[notebooks,automl,explain]

Requirement already up-to-date: azureml-sdk[automl,explain,notebooks] in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.5.0)


## Explain a Model

Let's start with a model that is trained outside of Azure Machine Learning. Run the cell below to train a decision tree classification model.

In [2]:
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('data/diabetes.csv')

# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
labels = ['not-diabetic', 'diabetic']
X, y = data[features].values, data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))

print('Model trained.')

Loading Data...
Training a decision tree model
Accuracy: 0.8876666666666667
AUC: 0.8731504054928105
Model trained.


The training process generated some model evaluation metrics based on a hold-back validation dataset, so you have an idea of how accurately it predicts; but how do the features in the data influence the prediction?

### Install the Azure ML Interpretability Library
To find out, first install the Azure ML Interpretability library. You can use this to interpret many typical kinds of model, even if they haven't been trained in an Azure ML experiment or registered in an Azure ML workspace.

In [3]:
!pip install --upgrade azureml-interpret

Requirement already up-to-date: azureml-interpret in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.5.0)


### Get an Explainer for our Model

Now that you have the library installed, let's get a suitable explainer for the model. There are many kinds of explainer. In this example you'll use a *Tabular Explainer*, which is a "black box" explainer that can be used to explain many kinds of model by invoking an appropriate [SHAP](https://github.com/slundberg/shap) model explainer.

In [4]:
from interpret.ext.blackbox import TabularExplainer

# The "features" and "classes" arguments are optional
tab_explainer = TabularExplainer(model, 
                             X_train, 
                             features=features, 
                             classes=labels)
print(tab_explainer, "ready!")

TabularExplainer ready!


Setting feature_perturbation = "tree_path_dependent" because no background data was given.
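
SHAP explainers are built on Shapley values, which split the difference between a prediction and a baseline prediction across the features. To build some intuition, here is a minimal brute-force sketch using only the standard library. The `toy_model` function, the observation, and the baseline values are all hypothetical, and this is not how **TabularExplainer** computes its values internally (it delegates to optimized SHAP explainers); it only illustrates the underlying idea.

```python
from itertools import permutations

def toy_model(x):
    """Hypothetical scoring function standing in for a trained model."""
    glucose, bmi, age = x
    return 0.02 * glucose + 0.01 * bmi + 0.005 * age + 0.00001 * glucose * bmi

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features that have not yet been "revealed" keep their baseline value.
    Averaging each feature's marginal contribution over every possible
    ordering of the features gives its Shapley value.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            new_output = model(current)
            totals[i] += new_output - previous_output
            previous_output = new_output
    return [t / len(orderings) for t in totals]

x = [150.0, 32.0, 45.0]         # hypothetical observation
baseline = [100.0, 25.0, 30.0]  # hypothetical background values

phi = shapley_values(toy_model, x, baseline)
for name, value in zip(["PlasmaGlucose", "BMI", "Age"], phi):
    print(name, ":", round(value, 4))
print("Sum of contributions:", round(sum(phi), 4))
print("Prediction - baseline:", round(toy_model(x) - toy_model(baseline), 4))
```

The last two printed values agree: the per-feature contributions always add up to the gap between the prediction and the baseline. Brute force is only practical for a handful of features, which is why real explainers rely on approximations or model-specific algorithms.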


### Get Global Feature Importance

The first thing to do is try to explain the model by evaluating the overall *feature importance* - in other words, quantifying the extent to which each feature influences the prediction based on the whole training dataset.

In [5]:
# you can use the training data or the test data here
global_tab_explanation = tab_explainer.explain_global(X_train)

# Get the top features by importance
global_tab_feature_importance = global_tab_explanation.get_feature_importance_dict()
for feature, importance in global_tab_feature_importance.items():
    print(feature,":", importance)

Pregnancies : 0.21806172702688592
Age : 0.10533980231801054
BMI : 0.09509198075591635
SerumInsulin : 0.06689060616225376
PlasmaGlucose : 0.049688206336490084
TricepsThickness : 0.022089639552305182
DiastolicBloodPressure : 0.016887187217064795
DiabetesPedigree : 0.013604338101275072


The feature importance is ranked, with the most important feature listed first.
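
A common convention for SHAP-based explainers is to derive a global importance score from the per-observation (local) contributions by averaging their absolute values. The sketch below assumes that convention and uses small hypothetical numbers rather than values produced by the explainer above.

```python
# Hypothetical local importance values: one row per observation,
# one column per feature (contributions can be positive or negative).
feature_names = ["Pregnancies", "Age", "BMI"]
local_values = [
    [0.25, -0.10, 0.05],
    [-0.15, 0.20, 0.10],
    [0.20, 0.00, -0.12],
]

# Mean absolute contribution per feature across all observations.
n_obs = len(local_values)
global_importance = {
    name: sum(abs(row[j]) for row in local_values) / n_obs
    for j, name in enumerate(feature_names)
}

# Rank from most to least important.
ranked = dict(sorted(global_importance.items(), key=lambda kv: kv[1], reverse=True))
for feature, importance in ranked.items():
    print(feature, ":", round(importance, 4))
```

Taking absolute values matters: a feature that pushes some predictions strongly up and others strongly down would otherwise average out to something close to zero and look unimportant.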

### Get Local Feature Importance

So you have an overall view, but what about explaining individual observations? Let's generate *local* explanations for individual predictions, quantifying the extent to which each feature influenced the decision to predict each of the possible label values. In this case it's a binary model, so there are two possible labels (not-diabetic and diabetic), and you can quantify the influence of each feature on each label for individual observations in a dataset. Here you'll evaluate just the first two cases in the test dataset.

In [6]:
# Get the observations we want to explain (the first two)
X_explain = X_test[0:2]

# Get predictions
predictions = model.predict(X_explain)

# Get local explanations
local_tab_explanation = tab_explainer.explain_local(X_explain)

# Get feature names and importance for each possible label
local_tab_features = local_tab_explanation.get_ranked_local_names()
local_tab_importance = local_tab_explanation.get_ranked_local_values()

for l in range(len(local_tab_features)):
    print('Support for', labels[l])
    label = local_tab_features[l]
    for o in range(len(label)):
        print("\tObservation", o + 1)
        feature_list = label[o]
        total_support = 0
        for f in range(len(feature_list)):
            print("\t\t", feature_list[f], ':', local_tab_importance[l][o][f])
            total_support += local_tab_importance[l][o][f]
        print("\t\t ----------\n\t\t Total:", total_support, "Prediction:", labels[predictions[o]])



Support for not-diabetic
	Observation 1
		 SerumInsulin : 0.3588980112757934
		 Age : 0.2460533448962042
		 TricepsThickness : 0.026715187972958618
		 BMI : 0.013590133009282707
		 DiabetesPedigree : 2.8565518216583565e-05
		 DiastolicBloodPressure : -0.015862127825512162
		 PlasmaGlucose : -0.0394931929954951
		 Pregnancies : -0.25650135042287736
		 ----------
		 Total: 0.33342857142857085 Prediction: not-diabetic
	Observation 2
		 BMI : 0.3605349183615101
		 DiabetesPedigree : 0.03323198603288227
		 Pregnancies : 0.016694856768813975
		 Age : 0.01329940156039576
		 PlasmaGlucose : 0.003740518532332468
		 DiastolicBloodPressure : -0.00973146020278818
		 TricepsThickness : -0.02919016667307313
		 SerumInsulin : -0.05515148295150268
		 ----------
		 Total: 0.3334285714285706 Prediction: not-diabetic
Support for diabetic
	Observation 1
		 Pregnancies : 0.25650135042287714
		 PlasmaGlucose : 0.03949319299549512
		 DiastolicBloodPressure : 0.015862127825512187
		 DiabetesPedigree : -2.8565
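
To make the local output easier to read, the sketch below takes the values printed for the first observation under "Support for not-diabetic" and separates the features that push toward that label from those that push away from it. The variable names are illustrative only. Note also that, in the part of the output that is visible, the values under "Support for diabetic" appear to mirror these with the sign flipped (for example, Pregnancies at about 0.2565 rather than -0.2565), which is what you would expect for a binary model.

```python
# Values copied from "Support for not-diabetic", Observation 1, in the output above.
ranked_names = ["SerumInsulin", "Age", "TricepsThickness", "BMI",
                "DiabetesPedigree", "DiastolicBloodPressure",
                "PlasmaGlucose", "Pregnancies"]
ranked_values = [0.3588980112757934, 0.2460533448962042, 0.026715187972958618,
                 0.013590133009282707, 2.8565518216583565e-05,
                 -0.015862127825512162, -0.0394931929954951,
                 -0.25650135042287736]

# Pair each feature with its contribution for easier lookup.
support = dict(zip(ranked_names, ranked_values))

# Split the features by the direction of their influence on this label.
pushes_toward = {f: v for f, v in support.items() if v > 0}
pushes_away = {f: v for f, v in support.items() if v < 0}

print("Toward not-diabetic:", list(pushes_toward))
print("Away from not-diabetic:", list(pushes_away))
print("Net support:", round(sum(support.values()), 4))
```

The net support matches the "Total" line printed for this observation: the positive contributions from SerumInsulin and Age outweigh the negative contribution from Pregnancies, which is consistent with the not-diabetic prediction.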

## Adding Explainability to Azure ML Model Training Experiments

As you've seen, you can generate explanations for models trained outside of Azure ML; but when you use experiments to train models in your Azure ML workspace, you can generate model explanations and log them.

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [7]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.5.0 to work with azure-ml-demo
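
If `Workspace.from_config()` cannot find a configuration file, it helps to know what it expects: a `config.json` file containing the subscription ID, resource group, and workspace name. The sketch below prints the expected shape using placeholder values only; it assumes you will substitute your own details and save the result as `config.json` alongside the notebook or in an `.azureml` folder.

```python
import json

# Placeholder values only; substitute the details of your own workspace.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Show the JSON text that the config file would contain.
config_text = json.dumps(workspace_config, indent=4)
print(config_text)
```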


### Train and Explain a Model using an Experiment

OK, let's create an experiment and put the files it needs in a local folder - in this case we'll just use the same CSV file of diabetes data to train the model.

In [8]:
import os, shutil
from azureml.core import Experiment

# Create a folder for the experiment files
experiment_folder = 'diabetes_train_and_explain'
os.makedirs(experiment_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(experiment_folder, "diabetes.csv"))

'diabetes_train_and_explain/diabetes.csv'

Now we'll create a training script that looks similar to any other Azure ML training script, except that it includes the following features:

- The same libraries we used before are imported and used to generate a global explanation
- The **ExplanationClient** class is used to upload the explanation to the experiment output

In [9]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Import Azure ML run library
from azureml.core.run import Run

# Import libraries for model explanation
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient
from interpret.ext.blackbox import TabularExplainer

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('diabetes.csv')

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
labels = ['not-diabetic', 'diabetic']

# Separate features and labels
X, y = data[features].values, data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# Note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes.pkl')

# Get explanation
explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)

# Get an Explanation Client and upload the explanation
explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(explanation, comment='Tabular Explanation')

# Complete the run
run.complete()

Writing diabetes_train_and_explain/diabetes_training.py


Now you can run the experiment, using an estimator to run the training script. Note that the **azureml-interpret** library is included in the training environment so the script can create a **TabularExplainer**, and the **azureml-contrib-interpret** package is included so the script can use the **ExplanationClient** class.

In [10]:
from azureml.train.estimator import Estimator
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.widgets import RunDetails

# Create a Python environment for the experiment
env = Environment('diabetes-interpret-env')
env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
env.docker.enabled = True # Use a docker container

# Create a set of package dependencies (including the azureml-contrib-interpret package)
packages = CondaDependencies.create(conda_packages=['scikit-learn','pandas'],
                                    pip_packages=['azureml-defaults','azureml-interpret','azureml-contrib-interpret'])

# Add the dependencies to the environment
env.python.conda_dependencies = packages

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
              compute_target = 'local', # Use local compute
              environment_definition = env,
              entry_script='diabetes_training.py')

# Run the experiment
experiment = Experiment(workspace = ws, name = 'diabetes_train_and_explain')
run = experiment.submit(config=estimator)
RunDetails(run).show()
run.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes_train_and_explain_1589816842_2361e23e',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2020-05-18T15:53:27.69825Z',
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'ad8c7823-1dc1-4251-bfca-b6573e25f96c',
  'azureml.git.repository_uri': 'https://github.com/shaikh-rashid/DP100.git',
  'mlflow.source.git.repoURL': 'https://github.com/shaikh-rashid/DP100.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': '46f2d801178a96e5935eab54add5366265ad790b',
  'mlflow.source.git.commit': '46f2d801178a96e5935eab54add5366265ad790b',
  'azureml.git.dirty': 'True',
  'model_type': 'classification',
  'explainer': 'tabular'},
 'inputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'local',
  'dataReferences': {},
  'data': {},

## Retrieve the Feature Importance Values

With the experiment run completed, you can use the **ExplanationClient** class to retrieve the feature importance from the explanation uploaded for the run.

In [11]:
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

# Get the feature explanations
client = ExplanationClient.from_run(run)
engineered_explanations = client.download_model_explanation()
feature_importances = engineered_explanations.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)

Feature	Importance
Pregnancies 	 0.22049884867286124
Age 	 0.10434542909233196
BMI 	 0.09869942434800999
SerumInsulin 	 0.06960860799173321
PlasmaGlucose 	 0.04836219464568636
TricepsThickness 	 0.02242271361083492
DiastolicBloodPressure 	 0.015483749144572794
DiabetesPedigree 	 0.013382666697861915
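
These values are close to, but not identical to, the ones produced earlier in the notebook. Two differences in the code probably account for this: the training script calls `explain_global` on the test data rather than the training data, and the decision tree is retrained without a fixed `random_state`. The sketch below compares the two sets of printed values directly; the dictionary names are illustrative.

```python
# Values copied from the two feature importance outputs in this notebook.
notebook_importance = {
    "Pregnancies": 0.21806172702688592,
    "Age": 0.10533980231801054,
    "BMI": 0.09509198075591635,
    "SerumInsulin": 0.06689060616225376,
    "PlasmaGlucose": 0.049688206336490084,
    "TricepsThickness": 0.022089639552305182,
    "DiastolicBloodPressure": 0.016887187217064795,
    "DiabetesPedigree": 0.013604338101275072,
}
experiment_importance = {
    "Pregnancies": 0.22049884867286124,
    "Age": 0.10434542909233196,
    "BMI": 0.09869942434800999,
    "SerumInsulin": 0.06960860799173321,
    "PlasmaGlucose": 0.04836219464568636,
    "TricepsThickness": 0.02242271361083492,
    "DiastolicBloodPressure": 0.015483749144572794,
    "DiabetesPedigree": 0.013382666697861915,
}

# Both dictionaries are already ordered from most to least important.
same_order = list(notebook_importance) == list(experiment_importance)
print("Same ranking:", same_order)

# Show how much each importance value shifted between the two runs.
for feature in notebook_importance:
    diff = experiment_importance[feature] - notebook_importance[feature]
    print(f"{feature:<24}{diff:+.4f}")
```

The ranking is the same in both cases and each value shifts by less than 0.01, so the two explanations tell the same overall story.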


## View the Model Explanation in Azure Machine Learning studio

You can also click the link in the Run Details widget to see the run in Azure Machine Learning studio, and view the **Explanations** tab. Then:

1. Select the **Tabular Explanation** explainer.
2. View the **Global Importance** chart, which shows the overall global feature importance.
3. View the **Summary Importance** chart, which shows each data point from the test data in a *swarm*, *violin*, or *box* plot.
4. Select an individual point to see the **Local Feature Importance** for the individual prediction for the selected data point.


**More Information**: For more information about using explainers in Azure ML, see [the documentation](https://docs.microsoft.com/azure/machine-learning/how-to-machine-learning-interpretability). 