# IFN619. Week 11: Explainability

In this tutorial, you will implement predictive models for house prices and for textual data. You will also apply LIME to gain insight into the predictions of your models.

## Prediction using Regression Models: The Boston House Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

The data was originally published by Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.


There are 14 attributes in each case of the dataset. They are:


| Feature | Description                                                         |
|---------|---------------------------------------------------------------------|
| CRIM    | per capita crime rate by town                                       |
| ZN      | proportion of residential land zoned for lots over 25,000 sq.ft.    |
| INDUS   | proportion of non-retail business acres per town                    |
| CHAS    | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| NOX     | nitric oxides concentration (parts per 10 million)                  |
| RM      | average number of rooms per dwelling                                |
| AGE     | proportion of owner-occupied units built prior to 1940              |
| DIS     | weighted distances to five Boston employment centres                |
| RAD     | index of accessibility to radial highways                           |
| TAX     | full-value property-tax rate per USD 10,000                         |
| PTRATIO | pupil/teacher ratio by town                                         |
| B       | 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town     |
| LSTAT   | percentage of lower status of the population                        |
| MEDV    | Median value of owner-occupied homes in USD 1000's                  |



Your task is to apply a model that predicts house prices and to inspect the predictions using LIME.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lime
from lime import lime_tabular

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler,MinMaxScaler

# keras / deep learning libraries
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import model_from_json
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.utils import plot_model

# callbacks
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau

from sklearn.metrics import mean_squared_error, r2_score

from matplotlib import pyplot

sns.set()

%matplotlib inline

In [None]:
# load dataset
data = pd.read_csv("data/HousingData.csv")
data

Check if there are missing values in your dataset

In [None]:
# your code here
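One possible way to approach this check is to count the null entries per column with pandas. The sketch below uses a small made-up frame (`demo`, an assumption for illustration) rather than the housing data, so the values are not real.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the housing data
demo = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.2],
    "AGE": [65.2, 78.9, np.nan, np.nan],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the frame
has_missing = bool(demo.isnull().values.any())
print(has_missing)
```

The same two calls should work unchanged on `data` once the CSV is loaded.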



Fill in the missing values of each feature with the mean of that feature.

In [None]:
# your code goes here

# the variable AGE should be an integer: after filling, convert this column to an integer type
# your code goes here


# rename column MEDV to PRICE (the variable that we want to predict)
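A minimal sketch of these three steps on a made-up frame is shown below. The frame `demo` and its values are assumptions for illustration, and rounding `AGE` before the cast is one choice among several (truncating would also give integers).

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the housing data
demo = pd.DataFrame({
    "AGE": [65.2, np.nan, 61.1, 45.8],
    "RM": [6.5, 6.4, np.nan, 7.0],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# replace each missing value with the mean of its own column
demo = demo.fillna(demo.mean())

# AGE has no missing values any more, so it can be rounded and cast to int
demo["AGE"] = demo["AGE"].round().astype(int)

# rename the target column
demo = demo.rename(columns={"MEDV": "PRICE"})
print(demo)
```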



Let's take a look at the variables that are relevant for our prediction: PRICE.

Look at the correlation matrix and choose which variables to keep for the model.

In [None]:
# your code goes here
correlation_matrix = 
correlation_matrix

In [None]:
# plot correlation matrix: 
# your code goes here


In [None]:
# discard the features whose absolute correlation with PRICE is below 0.3
# your code goes here
data_selection = 
data_selection
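The correlation-based selection can be sketched as follows. The data here are synthetic (`RM`, `NOISE` and `PRICE` are generated only for illustration), and the 0.3 threshold is the one stated in the exercise.

```python
import numpy as np
import pandas as pd

# synthetic data: PRICE depends on RM, while NOISE is unrelated to it
rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.0, 0.7, n)
noise_feature = rng.normal(0.0, 1.0, n)
price = 9.0 * rm + rng.normal(0.0, 2.0, n)
demo = pd.DataFrame({"RM": rm, "NOISE": noise_feature, "PRICE": price})

# pairwise Pearson correlations between all columns
correlation_matrix = demo.corr()

# keep the columns whose absolute correlation with PRICE is at least 0.3
# (PRICE itself has correlation 1.0, so it is kept as well)
corr_with_price = correlation_matrix["PRICE"]
keep = corr_with_price[corr_with_price.abs() >= 0.3].index
demo_selection = demo[keep]
print(list(demo_selection.columns))
```

A heatmap such as `sns.heatmap(correlation_matrix, annot=True)` is one common way to plot the matrix.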

### Start the Machine Learning Pipeline

In [None]:
# separate your data into features and class variable
# your code goes here

feature_names = 

class_var = 

X = 
y = 

In [None]:
# normalise the data values in X using the MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# separate your data into training (70%), validation (15%) and test (15%) sets
# your code goes here
X_train, X_test, Y_train, Y_test = 
X_val, X_test, Y_val, Y_test = 
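A 70/15/15 split can be obtained by calling `train_test_split` twice: first hold out 30% of the data, then halve the held-out part. The sketch below uses made-up arrays, and both the variable names and `random_state=42` are assumptions chosen for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 samples with 2 features each
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.arange(100, dtype=float)

# first split: 70% for training, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# second split: halve the held-out part into validation and test (15% each)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```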


## Choosing a predictive model

In this unit, we focused on neural networks; however, there are plenty of other models that we can use.
In this tutorial, we will present two types of predictive models for regression: neural networks and XGBoost (based on trees and boosting theory).

### Neural Networks

In your studio sessions, we presented many neural networks, but all of them for classification.
For regression, the model is very similar. The main difference is that we need to adjust the loss function and choose, for instance, the mean squared error. Let's see how to do that.

In [None]:
model_nn = Sequential()

model_nn.add(Dense(10, input_dim = X_train.shape[1], activation = "tanh"))
model_nn.add(Dense(7, activation = "relu"))
model_nn.add(Dense(3, activation = "relu"))
model_nn.add(Dense(1, activation = "relu"))

# selecting the loss function with mean squared error 
model_nn.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_squared_error'])


In [None]:
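# training model (this will take a while)
# NOTE: callbacks_list is not defined earlier in this notebook; the early-stopping
# setup below is an assumed minimal choice, adjust it as needed
callbacks_list = [EarlyStopping(monitor="val_loss", patience=50, restore_best_weights=True)]
history = model_nn.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=400, 
                       batch_size = 64, callbacks = callbacks_list, verbose=0)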


In [None]:
# plot training history
pyplot.plot(history.history['mean_squared_error'], label='train')
pyplot.plot(history.history['val_mean_squared_error'], label='val')
pyplot.ylabel('mean_squared_error', fontsize=12)
pyplot.xlabel('epochs', fontsize=12)
pyplot.legend()
pyplot.show()

In [None]:
# evaluate the model (remember that in regression, an MSE closer to zero is better)
_, train_mse_nn = model_nn.evaluate(X_train, Y_train, verbose=0)
_, test_mse_nn = model_nn.evaluate(X_test, Y_test, verbose=0)
_, val_mse_nn = model_nn.evaluate(X_val, Y_val, verbose=0)
print('Train: %.3f, Test: %.3f, Validation: %.3f' % (train_mse_nn, test_mse_nn, val_mse_nn))


In [None]:
# computing the Root Mean Squared Error (the square root of the MSE)
preds = model_nn.predict(X_test)
nn_rmse = np.sqrt(mean_squared_error(Y_test, preds))
print("Root Mean Squared Error: " + str(nn_rmse))
# computing r^2 score
nn_r2 =  r2_score(Y_test, preds)
print("R^2 score: " + str(nn_r2))
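To make the metrics concrete, here is a small worked example on made-up numbers (the arrays are assumptions for illustration). Taking `np.sqrt` of the mean squared error gives the root mean squared error (RMSE), which is expressed in the units of the target, while R² is a score where values closer to 1 are better.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# hypothetical true prices and predictions (in USD 1000's)
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([22.0, 24.0, 27.0, 35.0])

# errors are -2, 1, 3, 0, so the squared errors are 4, 1, 9, 0
mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```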

### XGBoost (eXtreme Gradient Boosting)

XGBoost combines trees with boosting theory.

Boosting:
Kearns and Valiant were the first to pose the question of whether a "weak" learning algorithm, one that performs just slightly better than random guessing, can be "boosted" into an arbitrarily accurate "strong" learner. The general idea of boosting is to improve a single weak model by combining it with a number of other weak models in order to generate a collectively strong model. The models are built in sequence, and each new model concentrates on the training examples that the models built so far predict poorly.

In XGBoost, the weak models are decision trees, and one tree is learned at each iteration. Early boosting algorithms such as AdaBoost pass the misclassified training examples to the next iteration with higher weights, so that the new model pays "higher attention" to them. Gradient boosting, which XGBoost implements, instead fits each new tree to the errors (the gradients of the loss) left by the current ensemble. The final prediction is the sum of the outputs of all trained trees (the ensemble model), each scaled by the learning rate. XGBoost is considered a black box because a prediction combines the outputs of many different trees and cannot be read off a single one (more info: https://medium.com/geekculture/xgboost-versus-random-forest-898e42870f30).
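The idea of building trees in sequence can be sketched from scratch for the squared-error loss. This is a simplified illustration of gradient boosting, not XGBoost's actual implementation (which also uses regularisation and second-order gradient information). The data, `learning_rate`, `n_trees` and the helper `boosted_predict` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic 1-D regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, 300)

learning_rate = 0.1
n_trees = 50

# start from a constant prediction: the mean of the target
base = y_demo.mean()
prediction = np.full_like(y_demo, base)
trees = []
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_trees):
    # for squared error, the residuals are the negative gradient of the loss
    residuals = y_demo - prediction
    # each weak learner is a shallow tree fitted to the current residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # add a scaled-down correction to the running prediction
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

def boosted_predict(X_new):
    """Ensemble prediction: base value plus the scaled sum of all tree outputs."""
    return base + learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting:", mse_history[-1])
```

Each tree on its own is a weak model, but the training error shrinks as their scaled corrections accumulate.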



<img src="https://www.researchgate.net/publication/348025909/figure/fig2/AS:1020217916416002@1620250314481/Simplified-structure-of-XGBoost.ppm">

In [None]:
!pip install xgboost

In [None]:
from sklearn.metrics import mean_squared_log_error
import xgboost as xgb

model_xgb = xgb.XGBRegressor(
    objective="reg:squarederror",
    random_state=101,
    n_estimators=1000,
    eval_metric="rmse",
    early_stopping_rounds=300,
    tree_method="hist",  # enable histogram binning in XGB
)

In [None]:
model_xgb.fit(
    X_train,
    Y_train,
    eval_set=[(X_val, Y_val)],
    verbose=False,  # Disable logs
)

In [None]:
# computing the Root Mean Squared Error (the square root of the MSE)
preds = model_xgb.predict(X_test)
xgb_rmse = np.sqrt(mean_squared_error(Y_test, preds))
print("Root Mean Squared Error: " + str(xgb_rmse))
# computing r^2 score
xgb_r2 =  r2_score(Y_test, preds)
print("R^2 score: " + str(xgb_r2))

In [None]:
# you can also check the feature importance for XGBoost
from xgboost import plot_importance

booster = model_xgb.get_booster()

# plot feature importance
# Get the importance dictionary (by gain) from the booster
importance = booster.get_score(importance_type="gain")

# round the gain values to two decimals for display
for key in importance.keys():
    importance[key] = round(importance[key],2)

# provide the importance dictionary to the plotting function
ax = plot_importance(importance, max_num_features=5, importance_type='gain', show_values=True)

# print the names of the five most important features
# (the booster labels features f0, f1, ... in the column order of X)
top_features = sorted(importance, key=importance.get, reverse=True)[:5]
for key in top_features:
    print(key + ": " + X.columns.tolist()[int(key[1:])])


## Generating explanations with LIME

Choose different instances of the test set and generate LIME Explanations for both the NN model and the XGB model. Discuss the explanations generated for the different models.

For the XGB model, compare the explanations generated with the model's overall feature importance.

In [None]:
# your code goes here


### Generating explanations for Neural Nets

In [None]:
# select a datapoint from the testset
# your code goes here


In [None]:
# your code goes here
# explain instance


In [None]:
# Show the predictions


### Generating explanations for XGBoost

In [None]:
# select a datapoint from the testset
# your code goes here


In [None]:
# your code goes here

# explain instance


In [None]:
# Show the predictions
