# Explainable AI in Decision Trees Using Tree Interpreter

## Context: Understanding the Problem Statement (Problem Scoping)

This notebook uses the California housing dataset to predict house prices from the input features.

The data comes from the 1990 California census. We use it to study how the different attributes affect the predicted house price.

### Import the useful Packages & Libraries

In [None]:
!pip install treeinterpreter

In [2]:
#  we can import the California Housing dataset directly from the sklearn library

from treeinterpreter import treeinterpreter as ti
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

treeinterpreter: A Python library for interpreting the predictions of tree-based machine learning models, such as decision trees and random forests. It breaks each prediction down into a bias term plus the contribution of each feature, which helps you understand how the model arrives at its predictions. Reference: https://pypi.org/project/treeinterpreter/

pandas: A data manipulation and analysis library for Python. It provides data structures such as DataFrames and Series for working with structured data efficiently. In this notebook it holds the dataset and the contribution tables. Reference: https://pandas.pydata.org/docs/

numpy: The fundamental package for scientific computing with Python. It provides large, multi-dimensional arrays and matrices, along with a wide variety of mathematical functions that operate on them. Reference: https://numpy.org/doc/

scikit-learn: A popular machine learning library for Python. It provides a wide range of algorithms and tools for tasks such as classification, regression, and clustering. Reference: https://scikit-learn.org/stable/index.html

## Dataset:  Data Acquisition
Source - https://www.kaggle.com/datasets/camnugent/california-housing-prices
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the sklearn.datasets.fetch_california_housing function.

##### Load/Read the Dataset

In [3]:
# Load the California housing dataset.
# Print its description, which explains the individual features of the dataset.
# Show the first few samples of the dataset.
calif_housing = fetch_california_housing()

for line in calif_housing.DESCR.split("\n")[5:22]:
    print(line)

calif_housing_df = pd.DataFrame(data=calif_housing.data, columns=calif_housing.feature_names)
calif_housing_df["Price($)"] = calif_housing.target

calif_housing_df.head()

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price($)
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [4]:
X_calif, Y_calif = calif_housing.data, calif_housing.target

print("Dataset Size : ", X_calif.shape, Y_calif.shape)

X_train_calif, X_test_calif, Y_train_calif, Y_test_calif = train_test_split(X_calif, Y_calif,
                                                                            train_size=0.8,
                                                                            test_size=0.2,
                                                                            random_state=123)

print("Train/Test Size : ", X_train_calif.shape, X_test_calif.shape, Y_train_calif.shape, Y_test_calif.shape)

Dataset Size :  (20640, 8) (20640,)
Train/Test Size :  (16512, 8) (4128, 8) (16512,) (4128,)


In [5]:
print("dimension of housing data: {}".format(X_calif.shape))  # shape of the feature matrix (rows, features)

dimension of housing data: (20640, 8)


# Building a Model (Modeling)


Classification and regression are two fundamental tasks in supervised machine learning. They involve predicting an output based on input features, but they are used for different types of problems and have distinct objectives:

Classification:

- Objective: In classification, the goal is to assign input data points to predefined categories or classes. It's used when the output variable is categorical in nature.
- Output: The output of a classification model is a discrete label or class. For example, it could be used to predict whether an email is spam or not (binary classification) or to classify images of animals into different species (multi-class classification).
- Examples: Logistic Regression, Decision Trees, Support Vector Machines, and Neural Networks are common algorithms used for classification tasks.
- Evaluation: Classification models are evaluated using metrics such as accuracy, precision, recall, F1-score, and confusion matrix.

Regression:

- Objective: In regression, the goal is to predict a continuous numeric value or quantity. It's used when the output variable is continuous and can take any value within a range.
- Output: The output of a regression model is a numeric value. For example, it could be used to predict house prices based on features like size, location, and number of bedrooms.
- Examples: Linear Regression, Decision Tree Regression, Random Forest Regression, and Gradient Boosting Regression are common algorithms used for regression tasks.
- Evaluation: Regression models are evaluated using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) score.

Here's a simple example to illustrate the difference:

Classification: Given a dataset of emails, classify each email as either "spam" or "not spam." The output is a discrete label, either "spam" (class 1) or "not spam" (class 0).

Regression: Given a dataset of houses with features like size, location, and number of bedrooms, predict the price of each house. The output is a continuous numeric value (e.g., $300,000).

Classification deals with categorizing data into predefined classes, while regression deals with predicting numeric values. The choice between classification and regression depends on the nature of the problem and the type of output variable you are trying to predict. For more details watch - https://www.youtube.com/watch?v=xYJEM2C0G5Q
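To make the regression metrics mentioned above concrete, here is a minimal sketch that computes MAE, MSE, and RMSE by hand. The helper name `regression_metrics` and the toy values are made up for illustration; in practice `sklearn.metrics` offers ready-made equivalents.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length sequences of numbers."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Hypothetical house values in units of $100,000 (not taken from the dataset).
y_true = [1.5, 2.0, 3.5, 4.0]
y_pred = [1.0, 2.5, 3.0, 5.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MSE and RMSE penalize the single large error (1.0) more heavily than MAE does, which is why RMSE comes out larger than MAE here.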

In [6]:
from sklearn.tree import DecisionTreeRegressor

dtree_reg = DecisionTreeRegressor(max_depth=10)
dtree_reg.fit(X_train_calif, Y_train_calif)

Steps in a decision tree regressor:

1. The algorithm starts with all the training data at the root of the tree.
2. It evaluates all possible splits on all features to find the split that minimizes the variance of the target variable in the resulting child nodes.
3. The algorithm splits the data according to the best split found in step 2 and repeats the process for each child node.
4. The process continues until the maximum depth is reached or the nodes meet other stopping criteria (e.g., minimum samples per leaf).
5. Each leaf node of the tree represents a prediction, which is the mean of the target values for the samples in that node.
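The split search described above can be sketched for a single feature as follows. This is a simplified illustration, not scikit-learn's implementation: the helper names `variance` and `best_split` and the toy data are assumptions, and a real tree repeats the search over every feature and every node.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(feature, target):
    """Return (threshold, weighted_child_variance) for the best split on one feature."""
    pairs = sorted(zip(feature, target))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        # Child variances weighted by the share of samples in each child.
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical toy data: low-income districts are cheap, high-income ones expensive.
median_income = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
house_value = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

threshold, score = best_split(median_income, house_value)
print(f"best threshold: {threshold}, weighted variance: {score:.4f}")
```

On this toy data the best threshold falls between the two income groups, because that split leaves both children with nearly constant target values.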


Watch the video - https://www.youtube.com/watch?v=UhY5vPfQIrA

The cell above creates an instance of DecisionTreeRegressor with the maximum depth of the tree set to 10. The maximum depth is the maximum number of levels the decision tree can have, and limiting it can help prevent overfitting. After training the model, you can use it to make predictions on new data with the predict method. The model traverses the tree based on the feature values of the new data until it reaches a leaf node, and the prediction is the mean of the training target values in that leaf node.

In [7]:
print("Test  R^2 Score : %.2f"%dtree_reg.score(X_test_calif, Y_test_calif))
print("Train R^2 Score : %.2f"%dtree_reg.score(X_train_calif, Y_train_calif))

Test  R^2 Score : 0.69
Train R^2 Score : 0.83


The R^2 score is at most 1. A score of 1 indicates that the model perfectly explains the variance in the target variable, a score of 0 indicates that the model does no better than always predicting the mean of the target, and a negative score means it does worse than that baseline. A high R^2 score on the training dataset and a much lower R^2 score on the test dataset can indicate overfitting, meaning that the model has learned the training data too well and does not generalize well to new, unseen data. Conversely, a low R^2 score on both the training and test datasets may indicate that the model is underfitting and is not complex enough to capture the patterns in the data.
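As a small illustration of how the score behaves, the sketch below computes R^2 from its usual definition, 1 - SS_res / SS_tot. The function name `r2_manual` and the toy numbers are made up for this example. It also shows that the score can drop below 0 when predictions are worse than simply predicting the mean.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_manual(y_true, y_true)                 # predictions match exactly
r2_mean = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_bad = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])       # predictions are reversed

print(r2_perfect, r2_mean, r2_bad)
```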

https://www.youtube.com/watch?v=Q-TtIPF0fCU

Test R^2 Score (0.69): The decision tree regressor explains 69% of the variance in the target variable on the test data. This means that the model has a reasonably good fit to the test data, as it's able to explain a significant portion of the variability in the target variable. However, there is still 31% of the variance that the model can't explain.

Train R^2 Score (0.83): The decision tree regressor explains 83% of the variance in the target variable on the training data. This indicates that the model fits the training data quite well, capturing most of the patterns in the data.

There is a notable gap between the training and test R^2 scores (0.83 vs. 0.69). This indicates that the model might be overfitting to some extent. Overfitting occurs when a model learns the training data too well, including its noise and outliers, and consequently performs less well on new, unseen data.

# Tree Interpreter

In [8]:
preds, bias, contributions = ti.predict(dtree_reg, X_test_calif)
preds.shape, bias.shape, contributions.shape

((4128, 1), (4128,), (4128, 8))

preds: This is an array containing the predictions of the decision tree regressor for each instance in the test dataset.

bias: This is the bias term of the model: the value at the root node of the tree, i.e. the mean target value over the training set. It is the baseline prediction the model starts from before considering the features of any specific instance. The same bias term is repeated for each instance in the test set, which is why its shape is (4128,).

contributions: This array contains the contributions of each feature to the prediction for each instance in the test dataset. The contributions represent how much each feature moves the prediction away from the bias term.
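The idea behind this decomposition can be sketched on a tiny hand-built tree. While walking from the root to a leaf, every split changes the running node value, and that change is credited to the feature used in the split. The nested-dict tree, its node values, and the `explain` helper below are hypothetical and only illustrate the principle; this is not treeinterpreter's actual code.

```python
# Hypothetical two-level tree; "value" is the mean training target at each node.
tree = {
    "value": 2.0, "feature": "MedInc", "threshold": 5.0,
    "left": {
        "value": 1.5, "feature": "AveOccup", "threshold": 3.0,
        "left": {"value": 1.8},
        "right": {"value": 1.2},
    },
    "right": {"value": 3.5},
}

def explain(tree, sample):
    """Return (prediction, bias, contributions) for one sample."""
    bias = tree["value"]  # root value: the prediction before any split
    contributions = {}
    node = tree
    while "feature" in node:  # leaves have no "feature" key
        name = node["feature"]
        child = node["left"] if sample[name] <= node["threshold"] else node["right"]
        # Credit the change in node value to the feature that caused the split.
        contributions[name] = contributions.get(name, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], bias, contributions

sample = {"MedInc": 3.0, "AveOccup": 4.0}
prediction, bias_value, contribs = explain(tree, sample)

print("prediction   :", prediction)
print("bias         :", bias_value)
print("contributions:", contribs)
```

By construction, the bias plus the sum of the contributions equals the leaf value, and features that never appear on the path get no contribution at all.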


In [9]:
print("Bias For Sample 0                        : %.2f"%bias[0])
print("Contributions For Sample 0               : %s"%contributions[0])
print("Prediction Based on Bias & Contributions : %.2f"%(bias[0] + contributions[0].sum()))
print("Actual Target Value                      : %.2f"%Y_test_calif[0])
print("Target Value As Per Treeinterpreter      : %.2f"%preds[0][0])

Bias For Sample 0                        : 2.07
Contributions For Sample 0               : [-0.16431123  0.          0.         -0.23541604  0.         -0.22254362
  0.04525048  0.10894851]
Prediction Based on Bias & Contributions : 1.60
Actual Target Value                      : 1.52
Target Value As Per Treeinterpreter      : 1.60


Bias for sample 0 is 2.07. This is the mean target value of the training set (the value at the root of the tree), which serves as a baseline prediction before considering the features of the specific instance.

Contributions for sample 0: This is an array of values that represent the contributions of each feature to the prediction for the first instance in the test dataset. Each value in the array corresponds to a feature in the dataset, and the value represents how much that feature moves the prediction away from the bias term.

Prediction based on bias & contributions:
The prediction for the first instance in the test dataset is calculated by adding the bias term and the sum of the feature contributions:
2.07 + (-0.16431123 + 0 + 0 - 0.23541604 + 0 - 0.22254362 + 0.04525048 + 0.10894851) = 1.60.

Actual target value: The actual target value for the first instance in the test dataset is 1.52. This is the true value that the model is trying to predict.

Target value as per Tree Interpreter: The prediction for the first instance in the test dataset, provided by the Tree Interpreter package, is 1.60. This prediction should match the one calculated by adding the bias and contributions.
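The arithmetic above can be re-checked in a few lines of plain Python using the numbers printed for sample 0. The base value 2.069687 is the six-decimal figure shown in this notebook's contributions table; the variable names are chosen only for this sketch.

```python
import math

# Values copied from the notebook output for sample 0.
bias_sample0 = 2.069687
contributions_sample0 = [-0.16431123, 0.0, 0.0, -0.23541604,
                         0.0, -0.22254362, 0.04525048, 0.10894851]

# Prediction = bias + sum of the per-feature contributions.
reconstructed = bias_sample0 + math.fsum(contributions_sample0)
print(f"Reconstructed prediction: {reconstructed:.2f}")
```

Inside the notebook, the same identity can presumably be checked for all test rows at once with `np.allclose(preds[:, 0], bias + contributions.sum(axis=1))`.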


## Inference :

By examining the bias and contributions, we can see that the model starts with a baseline prediction of 2.07 and then adjusts this prediction based on the features of the specific instance. In this case, the features move the prediction down from the bias term to 1.60. This prediction is relatively close to the actual target value of 1.52, which indicates that the model is performing well on this instance.

# Random sample from the test dataset

Rerun the cells below to see how the feature contributions and the resulting prediction vary from one data point to another.

In [13]:
import random

# randrange returns a valid index in [0, len - 1]; randint(1, len) never picks
# index 0 and can return len itself, which raises an IndexError.
random_sample = random.randrange(len(X_test_calif))
print("Selected Sample     : %d"%random_sample)
print("Actual Target Value : %.2f"%Y_test_calif[random_sample])
print("Predicted Value     : %.2f"%preds[random_sample][0])

def create_contributions_df(contributions, bias, sample_idx, feature_names):
    """Build a table of the base value, the per-feature contributions, and their sum."""
    contribs = contributions[sample_idx].tolist()
    contribs.insert(0, bias[sample_idx])  # the base (bias) value goes first
    contribs = np.array(contribs)
    contrib_df = pd.DataFrame(data=contribs, index=["Base"] + list(feature_names), columns=["Contributions"])
    prediction = contrib_df.Contributions.sum()  # base + all contributions
    contrib_df.loc["Prediction"] = prediction
    return contrib_df

contrib_df = create_contributions_df(contributions, bias, random_sample, calif_housing.feature_names)
contrib_df

Selected Sample     : 1860
Actual Target Value : 1.90
Predicted Value     : 1.58


Unnamed: 0,Contributions
Base,2.069687
MedInc,-0.502729
HouseAge,0.0
AveRooms,-0.167583
AveBedrms,0.0
Population,0.0
AveOccup,-0.162494
Latitude,0.205155
Longitude,0.135811
Prediction,1.577846


The table displays the contribution of each feature to the prediction for the randomly selected instance. By examining the contributions, you can gain insight into how each feature influences the prediction and understand the reasoning behind the model's decision.

The table lists the bias term, the contribution of each feature, and the final prediction for the randomly selected instance. In the run shown above (sample 1860), the model arrives at a prediction of 1.58. Apart from the base value (2.07), MedInc has the largest contribution (-0.50), followed by Latitude (+0.21). HouseAge, AveBedrms, and Population contribute nothing for this instance. Rerunning the cell selects a different sample, so these numbers will change.
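To find the features with the most significant impact, the contributions can be ranked by absolute value. The sketch below does this in plain Python with the values from the table for sample 1860; the dictionary is typed in by hand for illustration.

```python
# Contributions copied from the table for sample 1860.
sample_contributions = {
    "MedInc": -0.502729, "HouseAge": 0.0, "AveRooms": -0.167583,
    "AveBedrms": 0.0, "Population": 0.0, "AveOccup": -0.162494,
    "Latitude": 0.205155, "Longitude": 0.135811,
}

# Sort by magnitude so that strong negative effects rank as high as positive ones.
ranked = sorted(sample_contributions.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    if value != 0.0:  # skip features that were not used on this decision path
        print(f"{name:<10} {value:+.3f}")
```

Sorting by magnitude rather than by signed value matters here, because the strongest effect for this sample (MedInc) pushes the prediction down.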

# Waterfall chart



The waterfall chart below visualizes the contributions of the bias term and of each feature to the final prediction. It shows how each feature moves the prediction up or down and how the contributions add up to the final prediction value.
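A waterfall chart is essentially a plot of running totals: each bar starts where the previous one ended. The sketch below computes those start and end points in plain Python for the sample 1860 values shown earlier. The `steps` list is typed in by hand, and the code only illustrates the arithmetic behind the chart, not how Plotly draws it.

```python
from itertools import accumulate

# Base value followed by the non-zero contributions for sample 1860.
steps = [
    ("Base", 2.069687), ("MedInc", -0.502729), ("AveRooms", -0.167583),
    ("AveOccup", -0.162494), ("Latitude", 0.205155), ("Longitude", 0.135811),
]

# Running total after each step; the last entry is the final prediction.
running_totals = list(accumulate(value for _, value in steps))

for (name, value), end in zip(steps, running_totals):
    start = end - value  # each bar starts where the previous one ended
    print(f"{name:<10} from {start:.3f} to {end:.3f}")

final_prediction = running_totals[-1]
print(f"Final prediction: {final_prediction:.2f}")
```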


In [14]:
import plotly.graph_objects as go

def create_waterfall_chart(contrib_df, prediction):
    fig = go.Figure(go.Waterfall(
        name = "Prediction", #orientation = "h",
        measure = ["relative"] * (len(contrib_df)-1) + ["total"],
        x = contrib_df.index,
        y = contrib_df.Contributions,
        connector = {"mode":"between", "line":{"width":4, "color":"rgb(0, 0, 0)", "dash":"solid"}}
    ))

    fig.update_layout(title = "Prediction : %s"%prediction)

    return fig

# Select the prediction by row and column label instead of by position.
create_waterfall_chart(contrib_df, contrib_df.loc["Prediction", "Contributions"])

## Conclusion:

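For the sample shown above (sample 1860), the model arrives at a prediction of 1.58.

The Base value and MedInc make the most significant contributions to this prediction. The result will differ for other randomly selected samples.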
