# **Complex prediction**: Decision trees (SOLUTIONS)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

We apply a classification tree and a random forest model to the Boston dataset to predict whether crime per capita is high or low. We create a binary outcome feature, `CRIM_BIN`, that equals 1 if the crime rate is at or above its median and 0 if it is below its median.

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It is based on the Harrison and Rubinfeld (1978) data, and a similar dataset is available [here](https://nowosad.github.io/spData/reference/boston.html).

--------

## Part 0: Setup

In [None]:
# Import packages

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

from graphviz import Source
import matplotlib.pyplot as plt


In [None]:
# Define constant(s)

SEED = 17

# **MAIN EXERCISE**

## Part 1: Load data and create the binary outcome

In the first part, we load the `housing.csv` dataset. This dataset includes the following columns:

- `CRIM`: per capita crime rate
- `ZN`: proportions of residential land zoned for lots over 25000 sq. ft per town (constant for all Boston tracts)
- `INDUS`: proportions of non-retail business acres per town (constant for all Boston tracts)
- `CHAS`: 1 if tract borders the Charles River; 0 otherwise
- `NOX`: nitric oxides concentration (parts per 10 million) per town
- `RM`: average numbers of rooms per dwelling
- `AGE`: proportions of owner-occupied units built prior to 1940
- `DIS`: weighted distances to five Boston employment centres
- `RAD`: index of accessibility to radial highways per town (constant for all Boston tracts)
- `TAX`: full-value property-tax rate per USD 10,000 per town (constant for all Boston tracts)
- `PTRATIO`: pupil-teacher ratios per town (constant for all Boston tracts)
- `B`: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- `LSTAT`: percentage values of lower status population
- `MEDV`: median values of owner-occupied housing in USD 1000


**Q 1**: Load the data. What shape does it have?

In [None]:
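# Load the data set (columns are separated by whitespace)
# sep=r'\s+' is equivalent to delim_whitespace=True, which is deprecated in recent pandas versions
df = pd.read_csv('data/housing.csv', sep=r'\s+')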
df.head()

In [None]:
df.shape

**Q 2**: Create the binary feature `CRIM_BIN` that equals 1 if `CRIM` is at or above its median and 0 if `CRIM` is below its median. What percentage are 1s? Is the target variable balanced?

In [None]:
# Compute median of CRIM (only this column is needed, so take its median directly)
crim_med = df['CRIM'].median()

# OPTION 1
CRIM_BIN = (df['CRIM'] >= crim_med).astype(int)

# OPTION 2
# # extract values of CRIM into list
# CRIM = df['CRIM'].values.tolist()
# # compute values of CRIM_BIN
# CRIM_BIN = []
# for crim_value in CRIM:
#     CRIM_BIN_value = int(crim_value >= crim_med)
#     CRIM_BIN.append(CRIM_BIN_value)
    
# Add CRIM_BIN column to dataframe
df['CRIM_BIN'] = CRIM_BIN

# Display head
df.head()
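
The question about the share of 1s can be checked directly on the new column. The following is a minimal sketch on hypothetical toy values (the `crim` series stands in for `df['CRIM']` and is not taken from `housing.csv`); it relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Hypothetical toy crime rates standing in for df['CRIM']
crim = pd.Series([0.02, 0.05, 0.10, 0.30, 1.50, 4.00, 9.00])

# Same rule as above: 1 if at or above the median, 0 otherwise
crim_med = crim.median()
crim_bin = (crim >= crim_med).astype(int)

# Relative frequency of each class
class_shares = crim_bin.value_counts(normalize=True).sort_index()

# The mean of a 0/1 column is the share of 1s
share_of_ones = crim_bin.mean()

print(class_shares)
print('Share of 1s: {:.1%}'.format(share_of_ones))
```

Because the rule uses `>=`, observations exactly at the median are assigned to class 1, so the share of 1s can sit slightly above 50%. On the real data, the same two calls would be applied to `df['CRIM_BIN']`.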

## Part 2: Split data into train/test sets and look at the descriptive statistics

Before modeling the data, we perform the usual train/test split and look at how the descriptive statistics between the two sets compare.

**Q 1**: Divide the data randomly into a training set (80%) and a testing set (20%) using a seed (we defined the seed as a constant at the very top of the notebook). The seed ensures that the random process returns the same results when run multiple times. Next, split the training and testing data into the explanatory variables and the outcome variable. How can you ensure that samples are randomly assigned to the training or testing set?

In [None]:
# Randomly split data into train set (80%) and test set (20%)
df_train, df_test = train_test_split(df, train_size = 0.8, test_size = 0.2, random_state = SEED)


In [None]:
# For both train and test data, extract CRIM_BIN column and combine relevant explanatory variables

y_col   = 'CRIM_BIN'
X_cols = ['ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']

# Select the outcome column and the explanatory columns for the classifier model
y_train = df_train[y_col]
X_train = df_train[X_cols]

y_test = df_test[y_col]
X_test = df_test[X_cols]

**Q 2**: Look at the descriptive statistics for train/test sets. Are the distributions similar? What can we do if the distributions of the outcome variable (`CRIM_BIN`) are different? 

In [None]:
# Compute descriptive statistics for the training set

df_train.describe().T

In [None]:
df_test.describe().T

If the distributions for the outcome variable (`CRIM_BIN`) are different, we can stratify according to this variable. More specifically, we can set the `stratify` parameter of the `train_test_split()` function.
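
As an illustration of the `stratify` parameter, here is a minimal sketch on a hypothetical, deliberately imbalanced toy data frame (the `toy` frame and its column `x` are assumptions for demonstration only). Passing the outcome column to `stratify` keeps the class shares of the train and test sets aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 17

# Hypothetical imbalanced toy data: 30% ones, 70% zeros
toy = pd.DataFrame({
    'x': range(100),
    'CRIM_BIN': [1] * 30 + [0] * 70,
})

# Stratify on the outcome so both sets keep the same class shares
toy_train, toy_test = train_test_split(
    toy,
    train_size=0.8,
    test_size=0.2,
    random_state=SEED,
    stratify=toy['CRIM_BIN'],
)

train_share = toy_train['CRIM_BIN'].mean()
test_share = toy_test['CRIM_BIN'].mean()
print('Share of 1s in train:', train_share)
print('Share of 1s in test: ', test_share)
```

In the notebook, the equivalent change would be to add `stratify = df['CRIM_BIN']` to the existing `train_test_split()` call.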

## Part 3: Fit a decision tree classifier

A decision tree classifier is a simple, non-linear tree model. You can find the sklearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). This model represents our baseline.

**Q 1**: Fit a decision tree and tune the `max_depth` parameter. Hint: use `sklearn.model_selection.GridSearchCV()` for parameter tuning and tune `max_depth` over `range(1, 14)`, i.e. depths 1 to 13. What's the optimal tree depth?

In [None]:
# Tune max_depth parameter over range(1, 14), i.e. depths 1 to 13 (the end point is excluded)

tuned_parameters = [{'max_depth': range(1,14)}]

clf = GridSearchCV(DecisionTreeClassifier(random_state = SEED), tuned_parameters, cv = 5, scoring='accuracy')
clf.fit(X_train, y_train)

# Look at the best parameters
clf.best_params_

In [None]:
# Extract the optimal tree depth

best_tree_depth = clf.best_params_['max_depth']
best_tree_depth
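
Beyond the single best depth, it can help to see how the cross-validated accuracy varies across all candidate depths. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train` (an assumption for self-containment) and reads the per-candidate scores from the fitted search object's `cv_results_` attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

SEED = 17

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=SEED)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    [{'max_depth': range(1, 6)}],
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# One row per candidate depth, best-ranked candidates first
cv_table = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')

print(cv_table)
```

If several depths score almost identically, the shallower tree is usually the easier one to interpret; the same table can be built from `clf.cv_results_` in the notebook.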

**Q 2**: Assess the classifier on the test set. What accuracy do you achieve?

In [None]:
# Assessing best performing classifier tree on test set

clf_tree = DecisionTreeClassifier(max_depth = best_tree_depth, random_state=SEED)
clf_tree.fit(X_train, y_train)
y_pred = clf_tree.predict(X_test)

acc_tree = accuracy_score(y_test, y_pred)
print('Accuracy: ' + str(acc_tree))

# **ADVANCED EXERCISE**

*Optional.* If time permits and you feel comfortable with Python, continue with the advanced parts of this exercise below.

## Part 4: Plot the tree partitioning and feature importance

This part tells us how the tree classifier partitions the feature space. In other words, we see which features are most informative (i.e. split at the root) and at what values.

**Q 1**: Plot the tree partitioning. What's the most informative feature?

Hint 1: use the `export_graphviz()` function from `sklearn.tree` to export the tree in Graphviz (DOT) format.

Hint 2: use the `Source` class from the `graphviz` package to create a graph that can be displayed in the notebook.

In [None]:
# Plot best performing classification tree using graphviz
#     this may not work on all computers - requires graphviz to be installed
#     install graphviz on a Mac computer by running:  brew install -v graphviz
#     install graphviz on Linux computer by running:  sudo apt-get install graphviz
Source(export_graphviz(clf_tree, out_file=None, feature_names=X_cols, max_depth=2))
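
If graphviz is not installed, a plain-text view of the tree is a possible fallback. The sketch below uses `export_text()` from `sklearn.tree` on a small tree fitted to synthetic data (the data and the feature names `f0` to `f3` are hypothetical), and reads the root split directly from the fitted tree's `tree_` attribute, where node 0 is the root.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

SEED = 17

# Synthetic stand-in for the training data, with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=SEED
)
demo_names = ['f0', 'f1', 'f2', 'f3']

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)
demo_tree.fit(X_demo, y_demo)

# Text rendering of the first levels of the tree
tree_text = export_text(demo_tree, feature_names=demo_names, max_depth=2)
print(tree_text)

# Node 0 is the root: its feature index and threshold define the first split
root_feature = demo_names[demo_tree.tree_.feature[0]]
root_threshold = demo_tree.tree_.threshold[0]
print('Root split: {} <= {:.3f}'.format(root_feature, root_threshold))
```

In the notebook, the same calls would take `clf_tree` and `X_cols` in place of the demo objects.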

**Q 2**: Look at the feature importance in a bar chart. Hint: use the `.feature_importances_` attribute of the fitted sklearn model.

In [None]:
# Extract and plot importance of explanatory features

feat_names = ['ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
feat_importance = clf_tree.feature_importances_.tolist()

plt.bar(list(range(1, len(feat_names)+1)), feat_importance, tick_label = feat_names, align = 'center')
plt.xticks(rotation='vertical')
plt.title('Importance of Features for Classification Tree')
plt.xlabel('explanatory features')
plt.ylabel('importance')
plt.show()
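
To read off the ranking without a plot, the importances can be put into a labeled, sorted pandas Series. This is a minimal sketch on synthetic data with hypothetical feature names; for a fitted decision tree, the values in `feature_importances_` are normalized to sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

SEED = 17

# Synthetic stand-in for the training data, with hypothetical feature names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=SEED
)
demo_names = ['f0', 'f1', 'f2', 'f3']

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)
demo_tree.fit(X_demo, y_demo)

# Label each importance with its feature name and sort from most to least important
importance = pd.Series(demo_tree.feature_importances_, index=demo_names)
importance = importance.sort_values(ascending=False)
print(importance)
```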

Answer: NOX (nitric oxides concentration) is the best predictor. Does this make sense?

## Part 5: Compute the ROC curve

In this part, we compute the false positive rate, true positive rate and thresholds defining the ROC curve.

**Q 1**: What are the false positive and true positive rates? What's the area under the curve (AUC)?

Hint: generate predictions using the `predict_proba()` function in `sklearn`. We need probabilistic instead of binary predictions to compute the ROC thresholds.

In [None]:
y_pred_proba = clf_tree.predict_proba(X_test)[:,1]

# Compute false positive rate, true positive rate and thresholds defining ROC curve
# (note: these values define the points at which the ROC curve has a kink)
fpr_tree, tpr_tree, thresholds_tree = roc_curve(y_test, y_pred_proba, pos_label = 1)

print('False positive rates: {}\n'.format(fpr_tree))
print('True positive rates: {}\n'.format(tpr_tree))
print('Thresholds: {}\n'.format(thresholds_tree))

# Accuracy 
print('Accuracy:'.ljust(25) + str(acc_tree))

# Compute and show area under the ROC curve
roc_auc_tree = auc(fpr_tree, tpr_tree)
print('Area under curve (AUC):'.ljust(25) + str(roc_auc_tree))
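
To make the rates concrete, the sketch below computes a single ROC point by hand: for a given threshold, predict 1 whenever the predicted probability is at or above it, then count the outcomes. The labels and scores are hypothetical toy values, and `roc_point` is a helper defined here for illustration, not an sklearn function.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])


def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when predicting 1 for scores >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_hat == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_hat == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_hat == 0) & (y_true == 0))  # true negatives
    fpr = fp / (fp + tn)  # share of actual 0s predicted as 1
    tpr = tp / (tp + fn)  # share of actual 1s predicted as 1
    return fpr, tpr


fpr_05, tpr_05 = roc_point(y_true, y_score, 0.5)
fpr_03, tpr_03 = roc_point(y_true, y_score, 0.3)
print('Threshold 0.5: FPR = {:.3f}, TPR = {:.3f}'.format(fpr_05, tpr_05))
print('Threshold 0.3: FPR = {:.3f}, TPR = {:.3f}'.format(fpr_03, tpr_03))
```

Lowering the threshold from 0.5 to 0.3 catches the remaining positive but also lets one negative through, which is the trade-off the ROC curve traces. A single decision tree produces only as many distinct probabilities as it has leaves, so its ROC curve typically has few kinks.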

**Q 2**: Plot the ROC curve. Remember: the ROC curve has the false positive rate on the x-axis and the true positive rate on the y-axis.

In [None]:
# Plot the ROC curve

plt.plot(fpr_tree, tpr_tree, lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.title('ROC curve for Classification Tree')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

## Part 6: Compare classification tree to random forest model

Random forests are frequently applied in business data science applications, so let's see how one performs on this example.

**Q 1**: Fit a random forest model. What's the optimal number of trees/estimators?

In [None]:
# Tune n_estimators parameter in the range(50, 251, 25)

tuned_parameters = [{'n_estimators': range(50, 251, 25)}]

clf_forest = GridSearchCV(RandomForestClassifier(random_state = SEED), tuned_parameters, cv = 5, scoring = 'accuracy')
clf_forest.fit(X_train, y_train)

# Look at the best parameters
clf_forest.best_params_

In [None]:
# Extract the optimal number of trees

best_n_estimators = clf_forest.best_params_['n_estimators']
best_n_estimators

**Q 2**: Assess the model on the test set. What accuracy do you achieve?

In [None]:
# Assessing best performing random forest on test set (the decision tree baseline AUC is approx. 0.94)

# Fit with the best number of estimators
clf_randomForest = RandomForestClassifier(n_estimators = best_n_estimators, random_state=SEED)
clf_randomForest.fit(X_train, y_train)

# Compute accuracy on test set 
y_pred = clf_randomForest.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred)
print('Accuracy:'.ljust(25) + str(acc_rf))

# AUC ROC (use separate variable names so the decision tree's ROC values are not overwritten)
y_predProba = clf_randomForest.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_predProba, pos_label = 1)
roc_auc_rf = auc(fpr_rf, tpr_rf)
print('Area under curve (AUC):'.ljust(25) + str(roc_auc_rf))

## **SUMMARY OF ACCURACY AND AUC VALUES**

In [None]:
width   = 35
models  = ['Decision Tree ACC', 'Random Forest ACC', 'Decision Tree AUC', 'Random Forest AUC']
results = [acc_tree, acc_rf, roc_auc_tree, roc_auc_rf]
print('', '=' * width, '\n', 'Summary of ACC and AUC Scores'.center(width), '\n', '=' * width)  
for i in range(len(models)):
    if i == 2: print()
    print(models[i].center(width-8), '{0:.4f}'.format(results[i]))