Do you want to use complex machine learning models and still be able to explain them to the business? Machine learning explainability comes to your rescue. This kernel works through two case studies:
* New York Home Mortgage data
* Housing prices in Melbourne

The kernel draws generously on the wonderful `Machine Learning Explainability` course on Kaggle Learn by Dan Becker.

The techniques used are **Permutation Importance, Partial Dependence Plots and SHAP analysis**. Read on for more details.

<hr/>

# Chapter 1 : New York Home Mortgage Analysis

<hr/>

# Introduction

The Home Mortgage Disclosure Act (HMDA) requires many financial institutions to maintain, report, and publicly disclose information about mortgages. This dataset covers all mortgage decisions made in 2015 for the state of New York.

Before we dive into solving the problem, let us first understand the business behind this dataset.

Each year thousands of banks and other financial institutions report data about mortgages to the public, thanks to the Home Mortgage Disclosure Act, or **“HMDA”** for short. These public data are important because they:

* Help show whether lenders are serving the housing needs of their communities;
* Give public officials information that helps them make decisions and policies; and
* Shed light on lending patterns that could be discriminatory

## Loan Origination Journey

Meet Emily. She wants to buy a home but doesn’t have the money to pay for it in cash, so she applies for a loan at her bank. She tells the bank about her finances, the house she wants to buy, and other information the bank needs to decide whether or not to lend to her, and on what terms. The bank reviews Emily’s application, decides that she meets their criteria, and approves her. Once all the papers are signed, Emily closes the loan… or in mortgage-speak, the loan is **“originated.”**

Therefore the last stage of the loan is **Loan Origination.**

The data provided can be grouped into the following subjects:

* **Location** describes the state, metro area and census tract of the property.

* **Property Type** describes the property type and occupancy of the property. Property type values include One-to-four family dwelling, Manufactured housing and Multifamily dwelling. Occupancy answers the question “Will the owner use the property as their primary residence?” Its values include Owner occupied as principal dwelling, Not owner occupied as principal dwelling and Not applicable.

* **Loan** describes the action taken on the loan, the purpose of the loan, the type of the loan and the loan’s lien status.

* **Lender** describes the lender associated with the loan and the federal agency associated with the loan.

* **Applicant** describes the demographic information for the applicants and the co-applicants: applicant sex, co-applicant sex, applicant race and ethnicity, and co-applicant race and ethnicity.

# Load Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import roc_curve, auc  #Metrics

#ML Libraries
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import xgboost

#eli5 
import eli5
from eli5.sklearn import PermutationImportance
from eli5 import show_prediction

#partial dependencies
from pdpbox import pdp, get_dataset, info_plots

#shap analysis
import shap

#LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Read the data

In [None]:
fillColor = "#FFA07A"
fillColor2 = "#F1C40F"
loans = pd.read_csv('../input/ny-home-mortgage/ny_hmda_2015.csv')

In [None]:
loans.head()

For the modelling we include all the numeric features. Each categorical feature in the dataset already has a corresponding numeric code column, so the text columns can be dropped.

In [None]:
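# Keep only the numeric columns
cols = [f_ for f_ in loans.columns if loans[f_].dtype != 'object']

# Drop the target and the columns that would leak the decision or merely identify the record
list_to_remove = ['action_taken','purchaser_type',
                  'denial_reason_1','denial_reason_2','denial_reason_3','sequence_number']

# A list comprehension (rather than a set difference) keeps the column order stable
features = [f_ for f_ in cols if f_ not in list_to_remove]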

X = loans[features]
y = loans['action_taken']

We define a function in which we mark the **Loans which are originated** as 1 and the **Loans which are NOT originated** as 0

In [None]:
def change_action_taken(y):
    if ( y == 1):
        return 1
    else:
        return 0

In [None]:
y = loans['action_taken'].apply(change_action_taken)

X = X.fillna(0)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Modelling

In [None]:
from lightgbm import LGBMClassifier

In [None]:
first_model = LGBMClassifier(random_state=1).fit(train_X, train_y)

In [None]:
predictions =  first_model.predict_proba(val_X)

In [None]:
fpr, tpr, thresholds = roc_curve(val_y, predictions[:,1])

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

In [None]:
auc(fpr, tpr)

# Permutation Importance

One of the most basic questions we might ask of a model is `What features have the biggest impact on predictions?`
This concept is called **feature importance**. Permutation Importance is one of the techniques used to measure feature importance.
The procedure is as follows:

1. Get a trained model.
2. Shuffle the values in a single column and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That **performance deterioration measures the importance of the variable you just shuffled.**
3. Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
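The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how eli5 implements it: `toy_model`, the synthetic `rows` and the helper names are assumptions invented for this sketch, and accuracy is used as the score. Each column is shuffled several times so that we get both an average drop and a spread.

```python
import random
from statistics import mean, stdev


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=1):
    """Return {column index: (mean drop, std of drop)} in accuracy."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    result = {}
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Copy the rows with only this column replaced; the originals stay intact.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        result[col] = (mean(drops), stdev(drops))
    return result


# Hypothetical toy data: the label depends only on column 0; column 1 is noise.
data_rng = random.Random(0)
rows = [(data_rng.random(), data_rng.random()) for _ in range(500)]
labels = [int(r[0] > 0.5) for r in rows]


def toy_model(row):
    """Stand-in for a trained classifier that happens to use column 0 only."""
    return int(row[0] > 0.5)


importances = permutation_importance(toy_model, rows, labels)
for col, (avg, spread) in importances.items():
    print(f"column {col}: {avg:.3f} ± {spread:.3f}")
```

Shuffling column 0 should hurt accuracy badly, while shuffling column 1 changes nothing because the toy model never looks at it.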

In [None]:
perm = PermutationImportance(first_model, random_state=1).fit(val_X, val_y)

eli5.show_weights(perm, feature_names = val_X.columns.tolist())

**Explanation**

The first number in each row shows how much model performance decreased with a random shuffling (in this case, using "accuracy" as the performance metric).

Like most things in data science, there is some randomness in the exact performance change from shuffling a column. We measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles. The number after the ± measures how performance varied from one reshuffling to the next.

We will occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on shuffled data to be more accurate. This is more common with small datasets, because there is more room for luck/chance.

In our example, the most important feature was **Lien Status**, followed by **Loan Purpose** and **Applicant Income**.

In [None]:
eli5.explain_weights_df(perm, feature_names=features).head(10)

We can also show the contribution of each feature to a single prediction:

In [None]:
show_prediction(first_model, val_X.iloc[0,:],show_feature_values=True)

# Partial Plots

While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions.

Like permutation importance, partial dependence plots are calculated after a model has been fit. The model is fit on real data that has not been artificially manipulated in any way.             

We use the fitted model to predict our outcome (the probability that the **loan is originated**), but we repeatedly alter the **value of one variable to make a series of predictions.**
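A minimal sketch of that calculation, in plain Python, may help. It is not pdpbox's implementation: `toy_predict`, the four sample `rows` and the `grid` of values are hypothetical and chosen only to keep the arithmetic easy to follow. For every grid value we overwrite one column in all rows, predict, and average the predictions.

```python
from statistics import mean


def partial_dependence(predict, rows, col, grid):
    """For each grid value, overwrite column `col` in every row and average the predictions."""
    curve = []
    for value in grid:
        altered = [r[:col] + (value,) + r[col + 1:] for r in rows]
        curve.append(mean(predict(r) for r in altered))
    return curve


def toy_predict(row):
    """Hypothetical stand-in for predict_proba: rises with column 0 up to 100, then flattens."""
    income, other = row
    return min(income, 100) / 200 + 0.1 * other


rows = [(30, 0), (80, 1), (150, 1), (400, 0)]
grid = [0, 50, 100, 200]
curve = partial_dependence(toy_predict, rows, col=0, grid=grid)
for value, avg in zip(grid, curve):
    print(value, round(avg, 3))
```

The resulting curve rises with the first column and then goes flat once the toy model stops reacting to it. A partial dependence plot displays exactly this kind of shape.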

## Loan Purpose Feature Analysis

In [None]:
feat_name = 'loan_purpose'
pdp_dist = pdp.pdp_isolate(model=first_model, dataset=val_X, model_features=features, feature=feat_name)

In [None]:
fig, axes = pdp.pdp_plot(pdp_isolate_out=pdp_dist,
                         feature_name=feat_name)

**Interpretation of Partial Dependence Plots**

In [None]:
loans.groupby(['loan_purpose','loan_purpose_name']).loan_type.count()

A few items are worth pointing out as you interpret this plot:

* The y axis shows the change in the prediction relative to the prediction at the baseline (leftmost) value.
* The blue shaded area indicates the level of confidence.

From this particular graph, we see that the loan purpose of **Home purchase** has the highest probability of loan origination.



## Applicant Income Feature Analysis

In [None]:
feat_name = 'applicant_income_000s'
pdp_dist = pdp.pdp_isolate(model=first_model, dataset=val_X, model_features=features, feature=feat_name)

In [None]:
fig, axes = pdp.pdp_plot(pdp_isolate_out=pdp_dist,
                         feature_name=feat_name)

In [None]:
loans[feat_name].describe()

We see that
* Very low applicant income decreases the probability of loan origination
* Very high applicant income decreases the probability of loan origination

The graph above does not show much detail, so we plot another graph restricted to applicants with an income below 500K.

In [None]:
val_X_modified =  val_X[val_X['applicant_income_000s'] <  500]
feat_name = 'applicant_income_000s'
pdp_dist = pdp.pdp_isolate(model=first_model, dataset=val_X_modified, model_features=features, feature=feat_name)

In [None]:
fig, axes = pdp.pdp_plot(pdp_isolate_out=pdp_dist,
                         feature_name=feat_name)

From the graph above, it is evident that the probability of loan origination increases until the applicant income reaches around 100K.

# SHAP Analysis

In [None]:
val_X_small = val_X[0:1000]

explainer = shap.TreeExplainer(first_model)
data_for_prediction_array = val_X_small.iloc[0,:].values.reshape(1, -1)

first_model.predict_proba(data_for_prediction_array)

The model estimates that this loan has a 42% probability of being originated.

## SHAP Tree Explainer

In [None]:
%time shap_values = explainer.shap_values(val_X_small)

# visualize the first prediction's explanation
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0,:], val_X_small.iloc[0,:],link = 'logit')

Feature values causing increased predictions are in pink, and their visual size shows the magnitude of the feature's effect. Feature values decreasing the prediction are in blue. The biggest impact comes from `Lien Status = 1`, followed by `Loan Purpose = 3`.
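To make these per-feature contributions more concrete, here is a brute-force Shapley value calculation on a tiny made-up model. This is a sketch under simplifying assumptions: `toy_model`, `x` and `baseline` are hypothetical, an "absent" feature is simply set to its baseline value, and every subset of features is enumerated. `shap.TreeExplainer` uses a much faster tree-specific algorithm and works with a background distribution rather than a single baseline row, so treat this as an illustration of the definition only.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with 'absent' features set to their baseline value."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical scoring function with an interaction between features 0 and 1."""
    a, b, c = features
    return 2 * a + b + a * b - 3 * c


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline. The force plot draws the same property: the base value plus all the pink and blue segments gives the model output.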

## SHAP Summary Plot

In [None]:
# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(val_X_small)

shap.summary_plot(shap_values, val_X_small)

The figure provides the following information:

* The higher the Lien Status, the lower the probability of loan origination

* The higher the Rate Spread, the higher the probability of loan origination

## SHAP Dependence Contribution Plots

In [None]:
shap.dependence_plot("lien_status", shap_values, val_X_small,interaction_index="applicant_income_000s")

In [None]:
loans.groupby(['lien_status','lien_status_name']).applicant_income_000s.count()

From the plot, it is evident that the lien status of **Not applicable** is associated with low applicant income

In [None]:
shap.dependence_plot("loan_purpose", shap_values, val_X_small,interaction_index="applicant_income_000s")

In [None]:
loans.groupby(['loan_purpose','loan_purpose_name']).applicant_income_000s.count()

From the plot, it is evident that very high applicant income is associated with the loan purpose of Home purchase. The loan purposes of Home improvement and Refinancing are not associated with high applicant income.

<hr/>

# Chapter 2 : Melbourne Housing Prices

<hr/>

In this chapter, we predict house prices in Melbourne and focus on the **Shapley Analysis**.

In [None]:
houses = pd.read_csv('../input/melbourne-housing-market/Melbourne_housing_FULL.csv')
houses_2 = houses.copy()
houses.head() 

Remove the features `Address` and `Method`

In [None]:
def extract_features(houses):
    cols_to_remove = ['Address','Method']
    cols = [f_ for f_ in houses.columns if houses[f_].dtype == 'object']
    features= list(set(cols).difference(set(cols_to_remove)))

    cols = [f_ for f_ in houses.columns if houses[f_].dtype != 'object']
    features_all = features + cols
    return features,features_all

features,features_all = extract_features(houses)

Convert the categorical features into numeric values

In [None]:
def label_encode_dataset(features,houses):
    for c in features:
        le = LabelEncoder()
        le.fit(houses[c].astype(str))
        houses[c] = le.transform(houses[c].astype(str))
    return houses

houses = label_encode_dataset(features,houses)
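What the encoder does can be sketched without scikit-learn. The helper names and the small `suburbs` list below are hypothetical. The only behaviour borrowed from `LabelEncoder` is that the distinct values are sorted and numbered from 0.

```python
def fit_label_encoder(values):
    """Map each distinct string to an integer, in sorted order (as sklearn's LabelEncoder does)."""
    classes = sorted(set(str(v) for v in values))
    return {label: code for code, label in enumerate(classes)}


def transform(mapping, values):
    """Replace every value by its integer code."""
    return [mapping[str(v)] for v in values]


# Hypothetical suburb column.
suburbs = ["Richmond", "Malvern", "Abbotsford", "Malvern"]

full_mapping = fit_label_encoder(suburbs)
print(transform(full_mapping, suburbs))

# Fitting a new encoder on a subset produces different codes for the same label.
subset = [s for s in suburbs if s == "Malvern"]
subset_mapping = fit_label_encoder(subset)
print(full_mapping["Malvern"], subset_mapping["Malvern"])
```

Because the codes depend on which values the encoder has seen, data passed to a trained model should be encoded with the same mapping that was used for training.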

In [None]:
houses_all = houses[features_all]
houses_all = houses_all.fillna(0)

# Modelling

Modelling using XGBoost

In [None]:
import xgboost
from xgboost import XGBRegressor
cols_to_remove = ['Price','Date']
features2= list(set(features_all).difference(set(cols_to_remove)))

In [None]:
def create_model(features2,houses_all):
    
    X  = houses_all[features2]
    y = houses_all['Price']

    params = {}
    params["objective"] = "reg:linear"
    params["eta"] = 0.01
    params["min_child_weight"] = 10
    params["subsample"] = 0.8
    params["colsample_bytree"] = 0.8
    params["scale_pos_weight"] = 1.0
    params["silent"] = 1
    params["max_depth"] = 7
    params["nthread"] = 4

    plst = list(params.items())
    num_rounds=20000 

    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    xgtrain = xgboost.DMatrix(train_X, label = train_y,feature_names = features2)
    xgval = xgboost.DMatrix(val_X,label = val_y,feature_names = features2)

    # define a watch list to observe the change in error of training and holdout data
    watchlist  = [ (xgtrain,'train'),(xgval,'eval')]
    model = xgboost.train(plst, 
                      xgtrain, 
                      num_rounds,
                      watchlist,
                      early_stopping_rounds=50,verbose_eval=False)  
    return (model,train_X, val_X, train_y, val_y)

model,train_X, val_X, train_y, val_y = create_model(features2,houses_all)

# SHAP analysis

# SHAP Tree explainer

In [None]:
# load JS visualization code to notebook
shap.initjs()

# visualize the first prediction's explanation
explainer = shap.TreeExplainer(model)

shap_values2 = explainer.shap_values(val_X)
shap.force_plot(explainer.expected_value, shap_values2[0,:], val_X.iloc[0,:])

Feature values causing increased predictions are in pink, and their visual size shows the magnitude of the feature's effect. Feature values decreasing the prediction are in blue. The biggest impact comes from `Rooms = 4`, `Postcode` and `Suburb`. This is understandable, since the price in Melbourne depends on the number of rooms as well as the suburb. Suburbs like South Yarra and Prahran have houses with high prices.

# SHAP Summary Plot

In [None]:
# summarize the effects of all the features
shap.summary_plot(shap_values2, val_X)

The dominant features affecting the model are **Distance, Type, Postcode, Rooms and Landsize**.

This plot is made of many dots. Each dot has three characteristics:

* Vertical location shows what feature it is depicting
* Color shows whether that feature was high or low for that row of the dataset
* Horizontal location shows whether the effect of that value caused a higher or lower prediction

From the plot, it is evident that:
* High values of Rooms, Landsize and BuildingArea increase the predicted price of the house
* High values of Distance reduce the predicted price of the house
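As far as we understand, the summary plot orders the features by the overall magnitude of their SHAP values. The sketch below reproduces that ranking idea with the mean absolute contribution per feature. The numbers in `shap_rows` are hypothetical and are not taken from the model above, and `rank_by_mean_abs` is a helper written for this illustration, not part of the shap library.

```python
def rank_by_mean_abs(shap_rows, feature_names):
    """Order features by the mean absolute value of their per-row contributions."""
    n_rows = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values (rows = houses, columns = features).
feature_names = ["Distance", "Rooms", "Landsize"]
shap_rows = [
    [-120.0, 80.0, 10.0],
    [150.0, -40.0, -5.0],
    [-90.0, 60.0, 15.0],
]

for name, score in rank_by_mean_abs(shap_rows, feature_names):
    print(name, score)
```

Taking absolute values matters here: a feature that pushes some predictions up and others down would otherwise cancel out and look unimportant.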

# SHAP Dependence Contribution Plots

In [None]:
# make plot.
shap.dependence_plot('Distance', shap_values2, val_X, interaction_index="Postcode")

We will start by focusing on the shape, and come back to color in a minute. Each dot represents a row of the data. The horizontal location is the actual value from the dataset, and the vertical location shows what having that value did to the prediction. The fact that this slopes **downward** says that the greater the Distance, the **lower the model's predicted price for the house**.

The color represents the Postcode. The lower-priced houses fall in the postcode range 3175 to 3200.

In [None]:
# make plot.
shap.dependence_plot('Postcode', shap_values2, val_X, interaction_index="Distance")

Within the same postcode range, around 3000 to 3200, the predicted price of the house still varies, so there are other factors affecting the price.

In [None]:
# make plot.
shap.dependence_plot('Rooms', shap_values2, val_X, interaction_index="Distance")

As the number of rooms increases, the model's predicted price of the house increases.

# Analysis of Malvern Houses

I lived in the suburb of Malvern in Melbourne for quite some time, so I am interested to see how the house prices there vary.

In [None]:
houses_Malvern = houses_2[houses_2.Suburb == 'Malvern']

In [None]:
houses_Malvern.head()

In [None]:
# Reuse the rows that were label-encoded together with the full dataset, so that
# the Malvern features carry the same integer codes the model was trained on.
# (Fitting fresh encoders on the Malvern subset alone would assign different codes.)
malvern_mask = houses_2.Suburb == 'Malvern'

houses_Malvern2 = houses_all[malvern_mask].copy()

# Same columns, in the same order, as the training data
houses_Malvern = houses_Malvern2[features2]

## SHAP analysis

In [None]:
# load JS visualization code to notebook
shap.initjs()

# visualize the first prediction's explanation
explainer = shap.TreeExplainer(model)

shap_values2 = explainer.shap_values(houses_Malvern)
# show the feature values of the Malvern row being explained (not a row from val_X)
shap.force_plot(explainer.expected_value, shap_values2[0,:], houses_Malvern.iloc[0,:])

Feature values causing increased predictions are in pink, and their visual size shows the magnitude of the feature's effect. Feature values decreasing the prediction are in blue. The biggest impact comes from `Distance, Type and Suburb`. 