# Predicting NYC Taxi Fares And Model Explainer

In this quickstart, we use a subset of the NYC Taxi & Limousine Commission green taxi trip records available from [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/). The data is enriched with holiday and weather data. We use data transformations and the GradientBoostingRegressor algorithm from the scikit-learn library to train a regression model that predicts taxi fares in New York City from input features such as number of passengers, trip distance, datetime, holiday information, and weather information.

The primary goal of this quickstart is to explain the predictions made by our trained model with the various [Azure Model Interpretability](https://docs.microsoft.com/en-us/azure/machine-learning/service/machine-learning-interpretability-explainability) packages of the Azure Machine Learning Python SDK.

### Install required libraries

After you install these libraries, it is recommended that you **restart** the notebook kernel from the **Kernel** menu above. After restarting the kernel, start from the **Azure Machine Learning and Model Interpretability SDK-specific Imports** section.

You can ignore any incompatibility errors. Please run the cell below only once.

In [None]:
!pip install --upgrade interpret-community
!pip install flask-cors

### Azure Machine Learning and Model Interpretability SDK-specific Imports

Remember to restart the kernel before proceeding.

Run the following cell to import the modules used in this notebook.

In [None]:
import os
import numpy as np
import pandas as pd
import pickle
import sklearn
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import math

print("pandas version: {} numpy version: {}".format(pd.__version__, np.__version__))

sklearn_version = sklearn.__version__
print('The scikit-learn version is {}.'.format(sklearn_version))

import azureml
from azureml.core import Workspace, Experiment, Run
from azureml.core.model import Model

from interpret.ext.blackbox import TabularExplainer
from azureml.interpret.scoring.scoring_explainer import TreeScoringExplainer, save

print('The azureml.core version is {}.'.format(azureml.core.VERSION))

### Setup
To begin, you will need to provide the following information about your Azure Subscription.

In the following cell, be sure to set the values for `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` as directed by the comments (*these values can be acquired from the Azure Portal*).

You can get all of these values from the lab guide on the right:
1. In the tabs at the top of the lab guide, select `Environment Details`.
2. Copy the values from SubscriptionID, ResourceGroup, WorkspaceName and WorkspaceRegion and paste them as the values in the cell below.

Execute the following cell by selecting the `>|Run` button in the command bar above.

In [None]:
#Provide the Subscription ID of your existing Azure subscription
subscription_id = "" # <- needs to be the subscription within the Azure resource group for this lesson

#Provide values for the existing Resource Group 
resource_group = "" # <- enter the name of your Azure Resource Group

#Provide the Workspace Name and Azure Region of the Azure Machine Learning Workspace
workspace_name = "" # <- enter the name of the Azure Machine Learning workspace
workspace_region = "eastus" # <- region of your Azure Machine Learning workspace 

experiment_name = "lab-explainability"
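
An empty value in the cell above only surfaces as an error later, when the workspace call runs, so it can help to check the settings first. The following is a minimal sketch and not part of the Azure Machine Learning SDK; the helper name `find_missing_settings` and the sample values are hypothetical.

```python
def find_missing_settings(settings):
    """Return the names of settings that are empty or whitespace-only."""
    return [name for name, value in settings.items() if not str(value).strip()]


# Hypothetical placeholder values; substitute the ones from your environment.
example_settings = {
    "subscription_id": "",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
    "workspace_region": "eastus",
}

missing = find_missing_settings(example_settings)
if missing:
    print("Please fill in:", ", ".join(missing))
```

In the notebook, you could pass the real `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` variables in the same dictionary shape before connecting to the workspace.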

### Create and connect to an Azure Machine Learning Workspace

Run the following cell to connect to your existing Azure Machine Learning **Workspace** and save the configuration to disk (next to the Jupyter notebook). 

**Important Note**: You may be prompted to log in by the text that is output below the cell. If you are, be sure to navigate to the URL displayed and enter the code that is provided. Once you have entered the code, return to this notebook and wait for the output to read `Workspace configuration succeeded`.

In [None]:
ws = Workspace.create(
    name = workspace_name,
    subscription_id = subscription_id,
    resource_group = resource_group, 
    location = workspace_region,
    exist_ok = True)

ws.write_config()
print('Workspace configuration succeeded')

### Train the Model

Run the following cell to download the dataset, split the data into training and test sets, and create a pipeline that cleans and standardizes the data and then trains the model.

NOTE: Do not get too concerned about the details of the following code. If you take anything away from this cell, it should be that a model has been trained and stored in the variable `clf`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.metrics import mean_squared_error

data_url = ('https://introtomlsampledata.blob.core.windows.net/data/nyc-taxi/nyc-taxi-sample-data.csv')

df = pd.read_csv(data_url)
x_df = df.drop(['totalAmount'], axis=1)
y_df = df['totalAmount']

X_train, X_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=0)

categorical = ['normalizeHolidayName', 'isPaidTimeOff']
numerical = ['vendorID', 'passengerCount', 'tripDistance', 'hour_of_day', 'day_of_week', 
             'day_of_month', 'month_num', 'snowDepth', 'precipTime', 'precipDepth', 'temperature']

numeric_transformations = [([f], Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])) for f in numerical]
    
categorical_transformations = [([f], OneHotEncoder(handle_unknown='ignore', sparse=False)) for f in categorical]

transformations = numeric_transformations + categorical_transformations

clf = Pipeline(steps=[('preprocessor', DataFrameMapper(transformations)),
                      ('regressor', GradientBoostingRegressor())])

clf.fit(X_train, y_train)

y_predict = clf.predict(X_test)
y_actual = y_test.values.flatten().tolist()
rmse = math.sqrt(mean_squared_error(y_actual, y_predict))
print('The RMSE score on test data for GradientBoostingRegressor: ', rmse)
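
The cell above reports the root mean squared error (RMSE): the square root of the average squared difference between actual and predicted fares, expressed in the same units as the fare. As a small worked sketch with made-up numbers (the `rmse` helper below is illustrative and is not what the notebook calls):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares in dollars, for illustration only.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]

# Squared errors are 4, 4 and 9, so the mean squared error is 17 / 3.
example_rmse = rmse(actual_fares, predicted_fares)
print(round(example_rmse, 3))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.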

## Global Explanation Using a Meta Explainer (TabularExplainer)

**Global Model Explanation** is a holistic understanding of how the model makes decisions. It provides you with insights on what features are most important and their relative strengths in making model predictions.

To initialize an explainer object, you need to pass your model and some training data to the explainer's constructor.

*Note that you can pass in your feature transformation pipeline to the explainer to receive explanations in terms of the raw features before the transformation (rather than engineered features).*

In [None]:
# "features" and "classes" fields are optional
trained_gradient_boosting_regressor = clf.steps[-1][1]
tabular_explainer = TabularExplainer(trained_gradient_boosting_regressor, 
                                     initialization_examples=X_train, 
                                     features=X_train.columns,  
                                     transformations=transformations)

[TabularExplainer](https://docs.microsoft.com/en-us/python/api/azureml-explain-model/azureml.explain.model.tabularexplainer?view=azure-ml-py) uses one of three explainers (TreeExplainer, DeepExplainer, or KernelExplainer) and automatically selects the most appropriate one for our use case.

You can learn more about the underlying model explainers at [Azure Model Interpretability](https://docs.microsoft.com/en-us/azure/machine-learning/service/machine-learning-interpretability-explainability).
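
To make the earlier note about raw versus engineered features concrete: one-hot encoding turns a single raw column such as `normalizeHolidayName` into several engineered columns, so an explanation in raw-feature terms has to fold those columns back together. The sketch below assumes the engineered contributions are simply summed per raw feature; the importance values, the engineered column names, and the mapping are all made up for illustration, and the library's internal mechanics may differ.

```python
from collections import defaultdict

# Made-up importance values for engineered (post-transformation) columns.
engineered_importance = {
    "tripDistance": 4.0,
    "normalizeHolidayName_None": 0.5,
    "normalizeHolidayName_Memorial Day": 0.25,
    "isPaidTimeOff_False": 0.125,
    "isPaidTimeOff_True": 0.125,
}

# Hypothetical mapping from each engineered column to its raw source column.
engineered_to_raw = {
    "tripDistance": "tripDistance",
    "normalizeHolidayName_None": "normalizeHolidayName",
    "normalizeHolidayName_Memorial Day": "normalizeHolidayName",
    "isPaidTimeOff_False": "isPaidTimeOff",
    "isPaidTimeOff_True": "isPaidTimeOff",
}

# Sum the engineered contributions that belong to the same raw feature.
raw_importance = defaultdict(float)
for column, value in engineered_importance.items():
    raw_importance[engineered_to_raw[column]] += value

print(dict(raw_importance))
```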

### Get the global feature importance values

Run the cell below and observe the sorted global feature importance. You will note that `tripDistance` is the most important feature in predicting the taxi fares, followed by `hour_of_day` and `day_of_week`.

In [None]:
# You can use the training data or the test data here
global_explanation = tabular_explainer.explain_global(X_test)

# Sorted feature importance values and feature names
sorted_global_importance_values = global_explanation.get_ranked_global_values()
sorted_global_importance_names = global_explanation.get_ranked_global_names()
dict(zip(sorted_global_importance_names, sorted_global_importance_values))
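
One common way to arrive at a global ranking like this is to average the absolute per-prediction (local) importance values for each feature. The sketch below assumes that aggregation and uses made-up numbers; `mean_absolute_importance` is a hypothetical helper, not a function of the SDK.

```python
# Made-up local importance values: one row per trip, one column per feature.
feature_names = ["tripDistance", "hour_of_day", "day_of_week"]
local_values = [
    [4.0, -0.5, 0.25],
    [-2.0, 1.5, -0.25],
]


def mean_absolute_importance(rows):
    """Average the absolute value of each column across all rows."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(abs(row[j]) for row in rows) / n_rows for j in range(n_cols)]


global_values = mean_absolute_importance(local_values)

# Rank features from most to least important.
ranked = sorted(zip(feature_names, global_values),
                key=lambda pair: pair[1], reverse=True)
print(dict(ranked))
```

Taking absolute values first keeps positive and negative contributions from cancelling out, so a feature that pushes some fares up and others down still counts as influential.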

## Visualizing the Global Explanation

Run the following cell to create a dashboard that enables you to explore the data and visualize the explanation.

In the Dashboard that is displayed, try answering the following questions:

1. Select the `Data Exploration` tab, then set the `X value` to `tripDistance` and the `Y value` to `PredictedY` (this is the predicted fare amount). What happens to the predicted fare as the trip distance increases?
2. Select the `Global Importance` tab. Drag the slider under Top K Features so its value is set to `3`. What are the top 3 most important features? Which feature has the highest feature importance (and is therefore the most important feature)?

In [None]:
from interpret_community.widget import ExplanationDashboard

ExplanationDashboard(global_explanation, model=clf, datasetX=X_test)

## Local Explanation

You can also use the [TabularExplainer](https://docs.microsoft.com/en-us/python/api/azureml-explain-model/azureml.explain.model.tabularexplainer?view=azure-ml-py) to explain a single prediction. A local explanation focuses on one instance, examines the model's prediction for that input, and shows which features drove it.

We will create two sample inputs to explain the individual predictions.

- **Data 1**
 - 4 Passengers at 3:00PM, Friday July 5th, temperature 80F, travelling 10 miles

- **Data 2**
 - 1 Passenger at 6:00AM, Monday January 20th, rainy, temperature 35F, travelling 5 miles
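
The date and time in each description map onto the `hour_of_day`, `day_of_week`, `day_of_month` and `month_num` columns. The sketch below shows one way to derive them, assuming `day_of_week` uses Monday as 0 (which matches the values in the cell below). The years are assumptions chosen so that the weekdays line up with the descriptions, and `datetime_features` is a hypothetical helper.

```python
from datetime import datetime


def datetime_features(pickup):
    """Derive the datetime-based feature columns from a pickup timestamp."""
    return {
        "hour_of_day": pickup.hour,
        "day_of_week": pickup.weekday(),  # Monday is 0, Sunday is 6
        "day_of_month": pickup.day,
        "month_num": pickup.month,
    }


# Assumed years: July 5th falls on a Friday in 2019,
# and January 20th falls on a Monday in 2020.
data_1 = datetime_features(datetime(2019, 7, 5, 15, 0))
data_2 = datetime_features(datetime(2020, 1, 20, 6, 0))

print(data_1)
print(data_2)
```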

In [None]:
# Create the test dataset
columns = ['vendorID', 'passengerCount', 'tripDistance', 'hour_of_day', 'day_of_week', 'day_of_month', 
           'month_num', 'normalizeHolidayName', 'isPaidTimeOff', 'snowDepth', 'precipTime', 
           'precipDepth', 'temperature']

data = [[1, 4, 10, 15, 4, 5, 7, 'None', False, 0, 0.0, 0.0, 80], 
        [1, 1, 5, 6, 0, 20, 1, 'Martin Luther King, Jr. Day', True, 0, 2.0, 3.0, 35]]

data_df = pd.DataFrame(data, columns = columns)

In [None]:
# explain the test data
local_explanation = tabular_explainer.explain_local(data_df)

# sorted feature importance values and feature names
sorted_local_importance_names = local_explanation.get_ranked_local_names()
sorted_local_importance_values = local_explanation.get_ranked_local_values()

# package the results in a DataFrame for easy viewing
results = pd.DataFrame([sorted_local_importance_names[0][0:5], sorted_local_importance_values[0][0:5], 
                        sorted_local_importance_names[1][0:5], sorted_local_importance_values[1][0:5]], 
                       columns = ['1st', '2nd', '3rd', '4th', '5th'], 
                       index = ['Data 1', '', 'Data 2', ''])
print('Top 5 Local Feature Importance')
results

As we saw in the Global Explanation, **tripDistance** is the most important global feature. Other than `tripDistance`, the rest of the top 5 important features were different for the two samples.

- Data 1: Passenger count 4 and 3:00 PM on Friday were also important features in the prediction.
- Data 2: The weather-related features (rainy, temperature 35F), day of the week (Monday) and month (January) were also important.
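
The per-sample comparison above amounts to pairing each feature name with its local importance value and keeping the highest-ranked few. A minimal sketch with made-up values follows; it assumes ranking by descending value, which may differ from the ordering the library applies, and `top_k_features` is a hypothetical helper.

```python
def top_k_features(names, values, k=5):
    """Pair names with values and return the k highest-valued pairs."""
    ranked = sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up local importance values for a single prediction.
names = ["tripDistance", "passengerCount", "hour_of_day", "temperature"]
values = [6.0, 0.5, 0.75, -0.25]

top_3 = top_k_features(names, values, k=3)
print(top_3)
```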

## You're Done!
Congratulations, you have finished this lab.