# CrashDS

#### Module 3 : Classification Tree

Dataset from ISLR by *James et al.* : `Heart.csv`         
Source: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html     

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.    
You may install any library using `conda install <library>`.    
Most of the libraries come by default with the Anaconda platform.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

We will also need the essential Python libraries for (basic) Machine Learning.      
Scikit-Learn (`sklearn`) will be our de-facto Machine Learning library in Python.   

> `DecisionTreeClassifier` model from `sklearn.tree` : Our main model for Classification   
> `plot_tree` function from `sklearn.tree` : Function to clearly visualize a Classification Tree   
> `train_test_split` function from `sklearn.model_selection` : Random Train-Test splits     
> `confusion_matrix` metric from `sklearn.metrics` : Primary performance metric for us 

In [None]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

---

## Case Study : Personal Parameters vs Heart Disease


### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Load the CSV file and check the format
heartData = pd.read_csv('Heart.csv')
heartData.head()

Check the vital statistics of the dataset using the `type()` function and the `shape` attribute.     
Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
print("Data type : ", type(heartData))
print("Data dims : ", heartData.shape)
heartData.info()

### Format the Dataset

Drop the `Unnamed: 0` column as it contributes nothing to the problem.   
Drop the rows where values are missing in any column using `dropna()`.    
You may instead choose the `fillna()` method to fill in missing values. 

Convert the columns of type `object` to categorical data (factor) format.   
Convert the non-obvious *categorical* columns to `category` format as well.    
You may use `nunique()` method on each column to identify *categoricals*.   

Check the format and vital statistics of the modified dataframe.     

In [None]:
# Drop the first column (axis = 1) by its name
heartData = heartData.drop('Unnamed: 0', axis = 1)

# Drop the rows with `NA` values
heartData = heartData.dropna()

# Convert the Categoricals to appropriate type
heartData["ChestPain"] = heartData["ChestPain"].astype('category')
heartData["Thal"] = heartData["Thal"].astype('category')
heartData["AHD"] = heartData["AHD"].astype('category')
heartData["Sex"] = heartData["Sex"].astype('category')
heartData["Fbs"] = heartData["Fbs"].astype('category')
heartData["RestECG"] = heartData["RestECG"].astype('category')
heartData["ExAng"] = heartData["ExAng"].astype('category')
heartData["Ca"] = heartData["Ca"].astype('category')
heartData["Slope"] = heartData["Slope"].astype('category')

# Check the modified dataset
heartData.info()
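
The `nunique()` and `fillna()` suggestions above can be tried out on a small example. The sketch below uses a tiny made-up frame (`demo`, not the real `Heart.csv`), and the cut-off `max_levels = 3` is an arbitrary assumption for illustration, not a rule from the dataset.

```python
import pandas as pd

# Tiny made-up frame standing in for heartData (values are illustrative only)
demo = pd.DataFrame({
    "Age":  [63, 67, 41, 56, 62, 57],
    "Sex":  [1, 1, 0, 1, 0, 1],
    "Chol": [233, 286, 204, 236, 268, 354],
    "Thal": ["fixed", "normal", "normal", None, "reversable", "normal"],
})

# Count the distinct values in each column (missing values are ignored)
print(demo.nunique())

# Treat columns with few distinct values as categorical
# (the cut-off is a hypothetical choice; inspect the counts before trusting it)
max_levels = 3
cat_cols = [col for col in demo.columns if demo[col].nunique() <= max_levels]

# Fill missing values with the most frequent level instead of dropping the row,
# then convert the column to the category type
for col in cat_cols:
    demo[col] = demo[col].fillna(demo[col].mode()[0]).astype("category")

print(cat_cols)
print(demo.dtypes)
```

Filling with the most frequent level keeps every row, but it may distort the class balance of that column, so `dropna()` remains the safer default when only a few rows are affected.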

---

## Uni-Variate Classification : Predicting AHD using Chol

We take `AHD` as our target variable for the Uni-Variate Classification.    
We will start by setting up a Uni-Variate Classification Tree problem.   

Response Variable : **AHD**     
Predictor Feature : **Chol**    

Check the mutual relationship between the variables to start with.

In [None]:
# Boxplot of numeric variable against categorical variable
f = plt.figure(figsize=(16, 4))
sb.boxplot(x = "Chol", y = "AHD", data = heartData)

### Preparing the Dataset

Extract the Response and Predictor variables as two individual Pandas `DataFrame` objects.

In [None]:
# Extract Response and Predictors
y = pd.DataFrame(heartData["AHD"])
X = pd.DataFrame(heartData[["Chol"]])

Split the dataset randomly into Train and Test datasets using `train_test_split`.

In [None]:
# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
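
Because the split is random, the accuracy figures change on every run. The sketch below, on made-up data, shows two optional arguments of `train_test_split` that may help: `random_state` to make the split repeatable and `stratify` to keep the class proportions similar in both sets. The names `X_demo`, `y_demo` and the seed values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with 60 "No" and 40 "Yes" responses
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"Chol": rng.integers(150, 350, size=100)})
y_demo = pd.DataFrame({"AHD": ["No"] * 60 + ["Yes"] * 40})

# random_state fixes the shuffle; stratify preserves the No/Yes ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo["AHD"]
)

print("Train size :", len(X_tr), " Test size :", len(X_te))
print("Share of Yes in train :", (y_tr["AHD"] == "Yes").mean())
print("Share of Yes in test  :", (y_te["AHD"] == "Yes").mean())
```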

### Fitting the Classification Model

`DecisionTreeClassifier` is a class for the classification model in `sklearn`.     
We need to create an object of the `DecisionTreeClassifier` class, as follows.

In [None]:
# Create a Decision Tree Classifier object
dectree = DecisionTreeClassifier(max_depth = 2)

Train the Classification Tree model using the Train Set `X_train` and `y_train`.   

In [None]:
# Train the Classification Tree model
dectree.fit(X_train, y_train)

You have *trained* the model. Now it's time to visualize the Tree.    

In [None]:
# Visualize the Classification Tree model
f = plt.figure(figsize=(16, 8))
plot_tree(dectree, 
          feature_names = list(X_train.columns),  # a plain list works across sklearn versions
          class_names = ["No", "Yes"], 
          filled = True,
          rounded = True)
plt.show()
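
Each node in the plotted tree reports a `gini` value, since Gini impurity is the default splitting criterion of `DecisionTreeClassifier`. The sketch below works through that calculation by hand for one hypothetical node and one hypothetical split; the class counts are made up and the helper `gini_impurity` is not part of `sklearn`.

```python
def gini_impurity(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical node holding 60 "No" and 40 "Yes" samples
parent = [60, 40]

# Hypothetical split of that node into two children
left, right = [45, 10], [15, 30]

# Impurity after the split is the size-weighted average of the children
n = sum(parent)
weighted = (sum(left) / n) * gini_impurity(left) \
         + (sum(right) / n) * gini_impurity(right)

print("Parent impurity   :", round(gini_impurity(parent), 4))
print("Weighted children :", round(weighted, 4))
```

A split is worthwhile when the weighted impurity of the children is lower than that of the parent; the tree picks the threshold that lowers it the most.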

### Goodness of Fit of the Model

Check how good the predictions are on the Train Set.    
Metrics : Classification Accuracy and Confusion Matrix.

In [None]:
# Classification Accuracy
print("Classification Accuracy \t:", dectree.score(X_train, y_train))

# Confusion Matrix
y_train_pred = dectree.predict(X_train)
y_labels = ['No', 'Yes']

ax = plt.subplot()
sb.heatmap(confusion_matrix(y_train, y_train_pred, labels = y_labels), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('Actual labels')
ax.xaxis.set_ticklabels(y_labels)
ax.yaxis.set_ticklabels(y_labels)
ax.set_ylim(len(y_labels), 0)  # temporary fix for heatmap
plt.show()

Check how good the predictions are on the Test Set.   

In [None]:
# Classification Accuracy
print("Classification Accuracy \t:", dectree.score(X_test, y_test))

# Confusion Matrix
y_test_pred = dectree.predict(X_test)
y_labels = ['No', 'Yes']

ax = plt.subplot()
sb.heatmap(confusion_matrix(y_test, y_test_pred, labels = y_labels), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('Actual labels')
ax.xaxis.set_ticklabels(y_labels)
ax.yaxis.set_ticklabels(y_labels)
ax.set_ylim(len(y_labels), 0)  # temporary fix for heatmap
plt.show()
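
The four cells of the confusion matrix can also be turned into rates. The sketch below uses made-up label vectors (`y_true`, `y_hat`) to show how accuracy, true positive rate and false positive rate follow from the matrix, treating `Yes` as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels, for illustration only
y_true = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_hat  = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Rows are actual labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # fraction of all predictions that are correct
tpr = tp / (tp + fn)              # share of actual "Yes" cases that were caught
fpr = fp / (fp + tn)              # share of actual "No" cases flagged as "Yes"

print(cm)
print("Accuracy :", accuracy)
print("TPR      :", tpr)
print("FPR      :", fpr)
```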

---

## Classification Tree : Generic Function

Let us write a generic function to fit and evaluate a Classification Tree, as before.      
Our Predictor variable(s) will be $X$ and the Response variable will be $Y$.   

> Train data : (`X_train`, `y_train`)    
> Test data : (`X_test`, `y_test`)

In [None]:
def modelDecisionTree(X_train, y_train, X_test, y_test, tree_depth):
    '''
        Function to fit a Classification Tree on X_train, y_train,
        and test out the performance of the model on X_test, y_test.
    '''    
    dectree = DecisionTreeClassifier(max_depth = tree_depth)  # create the decision tree object
    dectree.fit(X_train, y_train)                             # train the decision tree model

    # Predict Response corresponding to Predictors
    y_train_pred = dectree.predict(X_train)
    y_test_pred = dectree.predict(X_test)

    # Visualize the Classification Tree model
    f = plt.figure(figsize=(16, 8))
    plot_tree(dectree, 
          feature_names = list(X_train.columns),  # a plain list works across sklearn versions
          class_names = ["No", "Yes"], 
          filled = True,
          rounded = True)
    plt.show()

    # Check the Goodness of Fit (on Train Data)
    print("Goodness of Fit of Model \tTrain Dataset")
    print("Classification Accuracy \t:", dectree.score(X_train, y_train))
    print()

    # Check the Goodness of Fit (on Test Data)
    print("Goodness of Fit of Model \tTest Dataset")
    print("Classification Accuracy \t:", dectree.score(X_test, y_test))
    print()
    
    # Confusion Matrix
    y_labels = ['No', 'Yes']
    f, axes = plt.subplots(1, 2, figsize=(16, 6))
    sb.heatmap(confusion_matrix(y_train, y_train_pred, labels = y_labels),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
    sb.heatmap(confusion_matrix(y_test, y_test_pred, labels = y_labels), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
                
    axes[0].set_xlabel('Predicted labels')
    axes[0].set_ylabel('Actual labels')
    axes[0].xaxis.set_ticklabels(y_labels)
    axes[0].yaxis.set_ticklabels(y_labels)
    axes[1].set_xlabel('Predicted labels')
    axes[1].set_ylabel('Actual labels')
    axes[1].xaxis.set_ticklabels(y_labels)
    axes[1].yaxis.set_ticklabels(y_labels)

    axes[0].set_ylim(len(y_labels), 0)  # temporary fix for heatmap
    axes[1].set_ylim(len(y_labels), 0)  # temporary fix for heatmap
    
    plt.show()

Try out the Generic Function to model Classification Tree on `AHD` against `RestBP`.

In [None]:
# Specify the Predictors and Response
response = "AHD"
predictors = ["RestBP"]

# Extract Response and Predictors
y = pd.DataFrame(heartData[response])
X = pd.DataFrame(heartData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Model Classification Tree with Train-Test
modelDecisionTree(X_train, y_train, X_test, y_test, 2)

Try out the Generic Function to model Classification Tree on `AHD` against `ChestPain`.     
However, `ChestPain` is a categorical variable whose labels are strings, which `DecisionTreeClassifier` does not accept as input.    
Hence, we preprocess the variable by encoding its labels as integers.

In [None]:
# Pre-process the Categorical Predictor(s)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(heartData["ChestPain"])
heartData["encChestPain"] = le.transform(heartData["ChestPain"])

# Specify the Predictors and Response
response = "AHD"
predictors = ["encChestPain"]

# Extract Response and Predictors
y = pd.DataFrame(heartData[response])
X = pd.DataFrame(heartData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Model Classification Tree with Train-Test
modelDecisionTree(X_train, y_train, X_test, y_test, 2)
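
It helps to see which integer `LabelEncoder` assigns to which label, because the tree splits on those integers as if they were ordered. The sketch below uses a short made-up series (`chest_pain`) and also shows `pd.get_dummies` as one possible alternative that avoids the implied ordering; the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Short made-up sample of chest pain labels
chest_pain = pd.Series(
    ["typical", "asymptomatic", "nonanginal", "nontypical", "asymptomatic"],
    name="ChestPain",
)

# LabelEncoder sorts the labels and numbers them 0, 1, 2, ...
le_demo = LabelEncoder()
codes = le_demo.fit_transform(chest_pain)
mapping = dict(zip(le_demo.classes_, range(len(le_demo.classes_))))

print("Label to code :", mapping)
print("Encoded       :", list(codes))
print("Decoded       :", list(le_demo.inverse_transform(codes)))

# Alternative: one indicator column per label, with no ordering implied
one_hot = pd.get_dummies(chest_pain, prefix="ChestPain")
print(one_hot)
```

With the integer codes, a split such as `encChestPain <= 1.5` groups labels by alphabetical position rather than by any clinical meaning, which is worth keeping in mind when reading the plotted tree.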

---

## Multi-Variate Classification Tree

Let us set up a Multi-Variate Classification problem.   

Response Variable : **AHD**     
Predictor Features : **Chol, RestBP, ChestPain, Thal**       

Fortunately, our generic Classification Tree function works in this case as well.    
However, we still need to encode categorical predictors before fitting the tree.

In [None]:
# Pre-process the Categorical Predictor(s)
from sklearn.preprocessing import LabelEncoder
leCP = LabelEncoder()
leCP.fit(heartData["ChestPain"])
heartData["encChestPain"] = leCP.transform(heartData["ChestPain"])

leTH = LabelEncoder()
leTH.fit(heartData["Thal"])
heartData["encThal"] = leTH.transform(heartData["Thal"])


# Specify the Predictors and Response
response = "AHD"
predictors = ["Chol", "RestBP", "encChestPain", "encThal"]

# Extract Response and Predictors
y = pd.DataFrame(heartData[response])
X = pd.DataFrame(heartData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Model Classification Tree with Train-Test
modelDecisionTree(X_train, y_train, X_test, y_test, 2)

---

## Prediction using a Classification Tree Model

Once we have trained a Classification Tree Model, we may use it to predict the Response.   

In [None]:
# Pre-process the Categorical Predictor(s)
from sklearn.preprocessing import LabelEncoder
leCP = LabelEncoder()
leCP.fit(heartData["ChestPain"])
heartData["encChestPain"] = leCP.transform(heartData["ChestPain"])

leTH = LabelEncoder()
leTH.fit(heartData["Thal"])
heartData["encThal"] = leTH.transform(heartData["Thal"])


# Specify the Predictors and Response
response = "AHD"
predictors = ["Chol", "RestBP", "encChestPain", "encThal"]

# Extract Response and Predictors
y = pd.DataFrame(heartData[response])
X = pd.DataFrame(heartData[predictors])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Model Classification Tree with Train-Test
dectree = DecisionTreeClassifier(max_depth = 2)
dectree.fit(X_train, y_train)

In [None]:
# Predict Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

Let's predict the value of Response for a few specific Data Points -- using the Classification Tree derived above.   

In [None]:
# Extract random Data Points for Prediction
heartData_pred = heartData.sample(5)
heartData_pred

In [None]:
# Extract Predictors for Prediction
X_pred = pd.DataFrame(heartData_pred[predictors])

# Predict Response corresponding to Predictors
y_pred = dectree.predict(X_pred)
y_pred

### Prediction Accuracy

Let us check the errors in the Predicted values, compared to the Actuals.

In [None]:
# Summarize the Actuals and Predictions
y_pred = pd.DataFrame(y_pred, columns = ["Predicted"], index = heartData_pred.index)
heartData_acc = pd.concat([heartData_pred[response], y_pred], axis = 1)

y_correct = (heartData_acc[response] == heartData_acc["Predicted"])
y_correct = pd.DataFrame(list(y_correct), columns = ["Correct"], index = heartData_pred.index)
heartData_acc = pd.concat([heartData_acc, y_correct], axis = 1)

heartData_acc

### Prediction Probability

In case of any Classification Model, we should check the Class Probabilities along with the final Class Predictions.

In [None]:
# Extract Predictors for Prediction
X_pred = pd.DataFrame(heartData_pred[predictors])

# Predict Response Probabilities corresponding to Predictors
y_prob = dectree.predict_proba(X_pred)
y_prob

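By default, the predicted class is the one with the highest predicted probability; with two classes, this is equivalent to thresholding the probability of `Yes` at 0.5. The `Confidence` column below is the predicted probability of the `Yes` class.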

In [None]:
# Summarize the Probabilities with the Predictions
y_prob = pd.DataFrame(list(y_prob[:,1]), columns = ["Confidence"], index = heartData_pred.index)
heartData_conf = pd.concat([heartData_acc, y_prob], axis = 1)

heartData_conf
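
The link between `predict_proba` and `predict` can be checked directly. The sketch below, on a made-up one-predictor dataset, confirms that the default prediction is the class with the highest probability and then applies a stricter, arbitrarily chosen threshold of 0.7 by hand. The names `X_demo`, `y_demo`, `tree` and `threshold` are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a single numeric predictor and a No/Yes response
X_demo = np.array([[180], [200], [210], [220], [240],
                   [260], [280], [300], [320], [340]])
y_demo = np.array(["No", "No", "No", "Yes", "No",
                   "Yes", "Yes", "No", "Yes", "Yes"])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Columns of predict_proba follow the order of tree.classes_
print("Class order :", tree.classes_)
proba = tree.predict_proba(X_demo)

# The default prediction is the class with the highest probability
default_pred = tree.classes_[np.argmax(proba, axis=1)]
print("Matches predict() :", (default_pred == tree.predict(X_demo)).all())

# A stricter rule: call "Yes" only when its probability reaches the threshold
threshold = 0.7  # hypothetical value, chosen for illustration
strict_pred = np.where(proba[:, 1] >= threshold, "Yes", "No")
print("Strict predictions :", strict_pred)
```

Raising the threshold typically trades missed `Yes` cases for fewer false alarms; which direction is preferable depends on the cost of each kind of error.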

*NOTE : You can always go back and try fitting a model with more predictors to check the difference.*