# Understanding the automated ML generated model using model explainability 
In this notebook, you will retrieve the best model from the automated machine learning experiment you performed previously. Then you will use the model interpretability features of the Azure Machine Learning Python SDK to identify which features had the most impact on the prediction.

**Please be sure you have completed Exercise 1 before continuing**

Begin by running the following cell to ensure your environment has the required modules installed and updated.

In [None]:
!pip install azureml-interpret

Next, run the following cell to import all the modules used in this notebook.

In [None]:
import pandas as pd
pd.options.display.float_format = '{:.10g}'.format

import azureml
from azureml.core import Run
from azureml.core import Workspace
from azureml.core import Model
from azureml.core import Experiment

from azureml.train.automl.run import AutoMLRun

from azureml.train.automl.automl_explain_utilities import AutoMLExplainerSetupClass, automl_setup_model_explanations
from interpret_community.mimic.models import LGBMExplainableModel
from azureml.interpret.mimic_wrapper import MimicWrapper
from azureml.contrib.interpret.visualize import ExplanationDashboard
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

# Verify AML SDK Installed
# view version history at https://pypi.org/project/azureml-sdk/#history 
print("SDK Version:", azureml.core.VERSION)

### Configure access to your Azure Machine Learning Workspace
To begin, you will need to provide the following information about your Azure Subscription.

**If you are using your own Azure subscription, please provide names for subscription_id, resource_group, workspace_name and workspace_region to use.** Note that the workspace needs to be of type [Machine Learning Workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace).

**If an environment is provided to you, be sure to replace XXXXX in the values below with your unique identifier.**

In the following cell, be sure to set the values for `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` as directed by the comments (*these values can be acquired from the Azure Portal*).

To get these values, do the following:
1. Navigate to the Azure Portal and log in with the credentials provided.
2. From the left-hand menu, under Favorites, select `Resource Groups`.
3. In the list, select the resource group with a name similar to `tech_immersion_XXXXX`.
4. From the Overview tab, capture the desired values.

Execute the following cell by selecting the `>|Run` button in the command bar above.

In [None]:
#Provide the Subscription ID of your existing Azure subscription
subscription_id = "" # <- needs to be the subscription with the resource group

#Provide values for the existing Resource Group 
resource_group = "tech_immersion_XXXXX" # <- replace XXXXX with your unique identifier

#Provide the Workspace Name and Azure Region of the Azure Machine Learning Workspace
workspace_name = "tech_immersion_aml_XXXXX" # <- replace XXXXX with your unique identifier (should be lowercase)
workspace_region = "eastus2" # <- region of your resource group

#Provide the name of the Experiment you used with Automated Machine Learning
experiment_name = 'automl-regression'

# the train data is available here
train_data_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                  'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/'
                  'training-formatted.csv')

# this is the URL to the CSV file containing a small set of test data
test_data_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                  'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/'
                  'fleet-formatted.csv')
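Before connecting, it can help to confirm that no setting is still empty or still contains the `XXXXX` placeholder. The following is a minimal sketch, not part of the lab itself: `find_unset_settings` is a hypothetical helper, and the sample values (including `12345`) are made up for illustration.

```python
def find_unset_settings(settings):
    """Return the names of settings that are empty or still contain the XXXXX placeholder."""
    return [name for name, value in settings.items()
            if not value.strip() or "XXXXX" in value]

# Illustrative values only; in the notebook you would pass your own variables
settings = {
    "subscription_id": "",
    "resource_group": "tech_immersion_XXXXX",
    "workspace_name": "tech_immersion_aml_12345",
    "workspace_region": "eastus2",
}

unset = find_unset_settings(settings)
print("Settings that still need values:", unset)
```

In the notebook, you could build the dictionary from `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` and stop if the returned list is not empty.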


## Connect to the Azure Machine Learning Workspace

Run the following cell to connect to the Azure Machine Learning **Workspace**.

**Important Note**: You will be prompted to log in through the text that is output below the cell. Be sure to navigate to the URL displayed and enter the code that is provided. Once you have entered the code, return to this notebook and wait for the output to read `Workspace Provisioning complete`.

In [None]:
# By using the exist_ok param, if the workspace already exists we get a reference to the existing workspace
ws = Workspace.create(
    name = workspace_name,
    subscription_id = subscription_id,
    resource_group = resource_group, 
    location = workspace_region,
    exist_ok = True)

print("Workspace Provisioning complete.")

## Find the run ID of your Automated ML experiment in the Azure Machine Learning studio

In the following cell, be sure to set the value for `run_id` as directed by the comments (*this value can be acquired from the Azure Machine Learning studio*).
To get this value, do the following:
1. Navigate to your Azure Machine Learning workspace in the Azure Portal and log in with the credentials provided.
2. From the left navigation bar, select `Overview` and then select `Launch the Azure Machine Learning studio`.
3. From the left navigation bar, select `Experiments` and then identify the first run in the `automl-regression` experiment at the bottom of the run list. This run should have `Run 1` in the `Run` column and `automl` in the `Run type` column.
4. Select the `Run 1` link to open the run details screen, where you can capture the `Run ID` value. It should be an identifier starting with `AutoML_`.

In [None]:
#Provide the Run Id of the automl type run in your experiment 
run_id = 'AutoML_....'
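A quick check can catch a run ID that was left as the placeholder or copied from the wrong run type. This is only a sketch based on the `AutoML_` prefix described in the steps above: `looks_like_automl_run_id` is a hypothetical helper, and the second sample ID is invented.

```python
def looks_like_automl_run_id(run_id):
    """Return True if the value has the AutoML_ prefix and is not the unedited placeholder."""
    prefix = "AutoML_"
    return (run_id.startswith(prefix)
            and len(run_id) > len(prefix)
            and "..." not in run_id)

# The unedited placeholder is rejected; a made-up ID with the right prefix is accepted
print(looks_like_automl_run_id("AutoML_...."))
print(looks_like_automl_run_id("AutoML_example-run-id"))
```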

# Get the best model trained with automated machine learning

Retrieve the Run from the Experiment and then get the underlying AutoMLRun to get at the best model and child run objects:

In [None]:
existing_experiment = Experiment(ws,experiment_name)

automl_run = AutoMLRun(existing_experiment, run_id)
automl_run

Retrieve the best run and best model from the automated machine learning run by executing the following cell:

In [None]:
import azureml.automl
best_run, best_model = automl_run.get_output()

## Load the train and test data

Model interpretability works by passing the training and test data through the trained model and measuring how much each feature's values influence the predictions.

Load the training and test data by running the following cell.

In [None]:
# load the original training data
train_data = pd.read_csv(train_data_url)
X_train = train_data.iloc[:,1:74]
y_train = train_data.iloc[:,0].values.flatten()

# load some test vehicle data that the model has not seen
X_test = pd.read_csv(test_data_url)
X_test = X_test.drop(columns=["Car_ID", "Battery_Age"])
X_test.rename(columns={'Twelve_hourly_temperature_forecast_for_next_31_days_reversed': 'Twelve_hourly_temperature_history_for_last_31_days_before_death_last_recording_first'}, inplace=True)
X_test
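The drop and rename steps above exist so that the test data has the same columns as the training features. One way to verify that alignment is to compare the two column lists. The sketch below assumes nothing about the real dataset beyond the column names that appear in the cell above: `compare_columns` is a hypothetical helper and `Trip_Length_Mean` is an invented column name used only for illustration.

```python
def compare_columns(train_columns, test_columns):
    """Return (missing, unexpected): columns absent from the test data,
    and test columns that the training data does not have."""
    train_set, test_set = set(train_columns), set(test_columns)
    missing = [c for c in train_columns if c not in test_set]
    unexpected = [c for c in test_columns if c not in train_set]
    return missing, unexpected

history = ("Twelve_hourly_temperature_history_for_last_31_days_"
           "before_death_last_recording_first")
forecast = "Twelve_hourly_temperature_forecast_for_next_31_days_reversed"

# Column lists as they might look before the drop and rename steps
train_columns = ["Trip_Length_Mean", history]
test_columns = ["Car_ID", "Battery_Age", "Trip_Length_Mean", forecast]

missing, unexpected = compare_columns(train_columns, test_columns)
print("Missing from test data:", missing)
print("Unexpected in test data:", unexpected)
```

In the notebook, calling it with `list(X_train.columns)` and `list(X_test.columns)` after the cleanup should return two empty lists if the columns line up.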


# Get the explanations for each model produced by the Automated ML experiment

For automated machine learning models, you can use `ExplanationClient` to examine the features that were most impactful to the model.

The best run already has explanations computed, so we only need to download them. For all other models, we need to calculate the explanations on the spot. The `LGBMExplainableModel` is used to mimic the behavior of each trained model.
Run the following cell to perform the evaluation.

In [None]:
def get_explanation_dataframe(model_name, is_best, feature_dict):
    # Build one row per feature; only the importance column is numeric,
    # so cast that column instead of forcing dtype=float on the whole frame
    df = pd.DataFrame(list(feature_dict.items()),
                      columns=['FeatureName', 'FeatureImportance'])
    df['FeatureImportance'] = df['FeatureImportance'].astype(float)
    df['ModelName'] = model_name
    df['IsBest'] = is_best

    return df

explanation_df = pd.DataFrame(columns=['ModelName', 'IsBest', 'FeatureName', 'FeatureImportance'])

for child_run in list(automl_run.get_children()):
    grand_child_runs = list(child_run.get_children())
    
    needs_explanation = True
    if len(grand_child_runs) > 0:
        
        #attempt to find an explainability run
        
        explain_runs = list(filter(lambda x: x.type == 'automl.model_explain', grand_child_runs))
        if len(explain_runs) > 0:
            needs_explanation = False
            
    if needs_explanation:
        print('Creating explanation for model {}...'.format(child_run.id))

        # Run properties are stored as strings; convert before passing to get_output
        iteration = int(child_run.properties['iteration'])
        iteration_run, iteration_model = automl_run.get_output(iteration=iteration)
        
        automl_explainer_setup_obj = automl_setup_model_explanations(iteration_model, X=X_train, 
                                                             X_test=X_test, y=y_train, 
                                                             task='regression')
        
        explainer = MimicWrapper(ws, automl_explainer_setup_obj.automl_estimator, LGBMExplainableModel, 
                         init_dataset=automl_explainer_setup_obj.X_transform, run=automl_run,
                         features=automl_explainer_setup_obj.engineered_feature_names, 
                         feature_maps=[automl_explainer_setup_obj.feature_map],
                         classes=automl_explainer_setup_obj.classes)
        
        raw_explanations = explainer.explain(['local', 'global'], get_raw=True, 
                                     raw_feature_names=automl_explainer_setup_obj.raw_feature_names,
                                     eval_dataset=automl_explainer_setup_obj.X_test_transform)

        feature_dict = raw_explanations.get_feature_importance_dict()
        features_df = get_explanation_dataframe(child_run.id, False, feature_dict)
         
    else:
        print('Model {} already has an explanation.'.format(child_run.id))
        
        client = ExplanationClient.from_run(child_run)
        engineered_explanations = client.download_model_explanation(raw=False)
        feature_dict = engineered_explanations.get_feature_importance_dict()
        
        features_df = get_explanation_dataframe(child_run.id, True, feature_dict)

    explanation_df = pd.concat([explanation_df, features_df])
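The idea behind a mimic explainer, as used in the cell above, is to train a simple, interpretable model on the predictions of the opaque model and then read feature importance from the simple model. The sketch below illustrates that idea only. It does not reproduce how `MimicWrapper` or `LGBMExplainableModel` work internally: it swaps in a plain least-squares linear surrogate instead of LightGBM, and `black_box`, the synthetic data, and all helper names are assumptions made for illustration.

```python
import random

def black_box(row):
    # Stand-in for an opaque trained model (hypothetical, for illustration only)
    return 3.0 * row[0] + 0.5 * row[1] + 0.0 * row[2]

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear_surrogate(rows, predictions):
    """Least-squares fit of predictions ~ intercept + sum(coef_j * x_j)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * p for d, p in zip(design, predictions)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def surrogate_importance(rows, coefs):
    """Mean absolute contribution of each feature in the surrogate model."""
    n = len(rows)
    importances = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / n
        importances.append(
            sum(abs(coefs[j + 1] * (r[j] - mean_j)) for r in rows) / n)
    return importances

rng = random.Random(0)
rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
predictions = [black_box(r) for r in rows]   # query the opaque model
coefs = fit_linear_surrogate(rows, predictions)  # [intercept, coef_0, coef_1, coef_2]
importances = surrogate_importance(rows, coefs)
print("Surrogate importances:", importances)
```

Because the stand-in model here is exactly linear, the surrogate recovers its coefficients; with a real model the surrogate is only an approximation, which is why a more flexible surrogate such as a gradient-boosted tree model is a common choice.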

Run the following cell to render the feature importance of the `best model` using the features Pandas DataFrame created above. Which feature had the greatest importance globally on the model?

In [None]:
# View the feature importances for the best model
explanation_df[explanation_df.IsBest == True].head(10)
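To answer the question above programmatically, the feature importance dictionary can be ranked directly. This is a small sketch with invented feature names and values, not output from the experiment; `top_features` is a hypothetical helper.

```python
def top_features(feature_dict, n=3):
    """Return the n (feature, importance) pairs with the largest importance."""
    return sorted(feature_dict.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up feature names and values, for illustration only
example_importances = {
    "feature_a": 0.12,
    "feature_b": 0.87,
    "feature_c": 0.05,
    "feature_d": 0.40,
}

ranked = top_features(example_importances, n=2)
print(ranked)
```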


# Display the explanation dashboard for the best model

The following command must be run to activate the dashboard in Jupyter notebooks.

Note: If the dashboard does not display, you might need to refresh the content in your browser.

In [None]:
!jupyter nbextension enable --py widgetsnbextension

In [None]:
automl_explainer_setup_obj = automl_setup_model_explanations(best_model, X=X_train, 
                                                             X_test=X_test, y=y_train, 
                                                             task='regression')
        
explainer = MimicWrapper(ws, automl_explainer_setup_obj.automl_estimator, LGBMExplainableModel, 
                 init_dataset=automl_explainer_setup_obj.X_transform, run=automl_run,
                 features=automl_explainer_setup_obj.engineered_feature_names, 
                 feature_maps=[automl_explainer_setup_obj.feature_map],
                 classes=automl_explainer_setup_obj.classes)

raw_explanations = explainer.explain(['local', 'global'], get_raw=True, 
                             raw_feature_names=automl_explainer_setup_obj.raw_feature_names,
                             eval_dataset=automl_explainer_setup_obj.X_test_transform)

ExplanationDashboard(raw_explanations, automl_explainer_setup_obj.automl_pipeline, automl_explainer_setup_obj.X_test_raw)
        