# **Complex prediction**: Decision trees

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

We apply a classification tree and a random forest model to the Boston dataset to predict crime per capita. We create a binary outcome feature, `CRIM_BIN`, that equals 1 if the crime rate is at or above its median and 0 if it is below its median.

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It is based on the Harrison and Rubinfeld (1978) data, and a similar dataset is available [here](https://nowosad.github.io/spData/reference/boston.html).

--------

## Part 0: Setup

In [None]:
# Import packages

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

from graphviz import Source
import matplotlib.pyplot as plt


In [None]:
# Define constant(s)

SEED = 17

# **MAIN EXERCISE**

## Part 1: Load data and create train/test sets

In the first part, we load the `housing.csv` dataset. This dataset includes the following columns:

- `CRIM`: per capita crime rate
- `ZN`: proportions of residential land zoned for lots over 25000 sq. ft per town (constant for all Boston tracts)
- `INDUS`: proportions of non-retail business acres per town (constant for all Boston tracts)
- `CHAS`: levels 1 if tract borders Charles River; 0 otherwise
- `NOX`: nitric oxides concentration (parts per 10 million) per town
- `RM`: average numbers of rooms per dwelling
- `AGE`: proportions of owner-occupied units built prior to 1940
- `DIS`: weighted distances to five Boston employment centres
- `RAD`: index of accessibility to radial highways per town (constant for all Boston tracts)
- `TAX`: full-value property-tax rate per USD 10,000 per town (constant for all Boston tracts)
- `PTRATIO`: pupil-teacher ratios per town (constant for all Boston tracts)
- `B`: proportion of blacks
- `LSTAT`: percentage values of lower status population
- `MEDV`: median values of owner-occupied housing in USD 1000


**Q 1**: Load the data. What shape does it have?

**Q 2**: Create the binary feature `CRIM_BIN` that equals 1 if `CRIM` is at or above its median and 0 if `CRIM` is below its median. What % are 1s? Is the target variable balanced?
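
One possible way to build the binary target is sketched below. The tiny `df` is made-up stand-in data so the snippet runs on its own; in the notebook you would load the real file instead (for example with `pd.read_csv("housing.csv")`, where the exact path is an assumption).

```python
import pandas as pd

# Made-up stand-in for the real data (illustration only).
# In the notebook: df = pd.read_csv("housing.csv")  # path is an assumption
df = pd.DataFrame({"CRIM": [0.02, 0.10, 0.25, 3.50, 8.90, 0.25]})

# Shape is (number of rows, number of columns)
print(df.shape)

# 1 if CRIM is at or above its median, 0 otherwise
median_crim = df["CRIM"].median()
df["CRIM_BIN"] = (df["CRIM"] >= median_crim).astype(int)

# The mean of a 0/1 column is the share of 1s
share_ones = df["CRIM_BIN"].mean()
print(f"Share of 1s: {share_ones:.1%}")
```

Because the rule uses "at or above", rows tied at the median are all labeled 1, so the share of 1s can exceed 50% when ties exist, as it does in this toy data.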

## Part 2: Split data into train/test sets and look at the descriptive statistics

Before modeling the data, we perform the usual train/test split and look at how the descriptive statistics between the two sets compare.

**Q 1**: Randomly divide the data into a training set (80%) and a testing set (20%) using a seed (we defined the seed as a constant at the very top of the notebook). The seed ensures that the random process returns the same results when run multiple times. Next, split the training and testing data into the explanatory variables and the outcome variable. How can you ensure that samples are randomly assigned to the training or testing set?

**Q 2**: Look at the descriptive statistics for train/test sets. Are the distributions similar? What can we do if the distributions of the outcome variable (`CRIM_BIN`) are different? 
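
A minimal sketch of a seeded, shuffled split follows. The data frame is randomly generated stand-in data (only the column names are borrowed from the list above), and the use of `stratify=y` is one option, not necessarily the intended answer, for keeping the share of 1s similar in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 17

# Randomly generated stand-in data (illustration only, not the Boston data)
rng = np.random.default_rng(SEED)
df = pd.DataFrame({
    "RM": rng.normal(6.0, 0.7, size=200),
    "AGE": rng.uniform(0, 100, size=200),
    "CRIM": rng.exponential(3.0, size=200),
})
df["CRIM_BIN"] = (df["CRIM"] >= df["CRIM"].median()).astype(int)

# Drop CRIM from the features: CRIM_BIN is derived from it,
# so keeping it would leak the answer into the model.
X = df.drop(columns=["CRIM", "CRIM_BIN"])
y = df["CRIM_BIN"]

# shuffle=True (the default) assigns rows at random;
# stratify=y keeps the class proportions equal across the two sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=SEED
)

# Compare descriptive statistics of the two sets
print(X_train.describe())
print(X_test.describe())
print(y_train.mean(), y_test.mean())
```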

## Part 3: Fit a decision tree classifier

A decision tree classifier is a simple, non-linear tree model. You can find the sklearn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). This model serves as our baseline.

**Q 1**: Fit a decision tree and tune the `max_depth` parameter. Hint: use `sklearn.model_selection.GridSearchCV()` for parameter tuning and search over `max_depth` values from 1 to 14 (inclusive). What is the optimal tree depth?

**Q 2**: Assess the classifier on the test set. What accuracy do you achieve?
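
The tuning and evaluation steps could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the housing features, and the choices of 5-fold cross-validation and accuracy as the scoring metric are assumptions, not requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 17

# Synthetic stand-in data (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Candidate depths 1..14 (range's upper bound is exclusive)
param_grid = {"max_depth": list(range(1, 15))}

# cv=5 and scoring="accuracy" are assumptions made for this sketch
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_depth = grid.best_params_["max_depth"]
print("Best max_depth:", best_depth)

# GridSearchCV refits the best model on the full training set by default,
# so grid.predict() uses the tuned tree.
test_accuracy = accuracy_score(y_test, grid.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```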

# **ADVANCED EXERCISE**

*Optional.* If time permits and you feel comfortable with Python, continue with the advanced parts of this exercise below.

## Part 4: Plot the tree partitioning and feature importance

This part tells us how the tree classifier partitions the feature space. In other words, we see which features are most informative (i.e. split at the root) and at what values.

**Q 1**: Plot the tree partitioning. What's the most informative feature?

Hint 1: use the `export_graphviz()` function from `sklearn.tree` to export the fitted tree in DOT format.

Hint 2: use the `Source` class from the `graphviz` package to create a graph from the DOT output that can be displayed in the notebook.
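
The two hints can be combined along the lines of the sketch below. It fits a shallow tree on synthetic stand-in data, and the `feature_names` are hypothetical placeholders. The rendering step is left as a comment so the snippet runs without the `graphviz` package installed.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

SEED = 17

# Synthetic stand-in data with placeholder feature names (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=SEED).fit(X, y)

# out_file=None makes export_graphviz return the DOT description as a string
dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=feature_names,
    class_names=["0", "1"],
    filled=True,
    rounded=True,
)

# The root node is node 0; tree_.feature holds each node's split feature index
root_feature = feature_names[tree.tree_.feature[0]]
print("Feature used at the root split:", root_feature)

# In a notebook with graphviz installed, display the tree with:
#   from graphviz import Source
#   Source(dot_data)
```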

**Q 2**: Plot the feature importances as a bar chart. Hint: use the `.feature_importances_` attribute of the fitted sklearn model.
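
As a rough sketch, the importances can be collected into a labeled, sorted pandas Series before plotting. The data is synthetic and the feature names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

SEED = 17

# Synthetic stand-in data with placeholder feature names (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=SEED).fit(X, y)

# feature_importances_ is an array attribute, one value per feature,
# normalized so that the values sum to 1
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

In a notebook, `importances.plot.bar()` (which relies on matplotlib) would draw the chart.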

## Part 5: Compute the ROC curve

In this part, we compute the false positive rate, the true positive rate, and the thresholds that define the ROC curve.

**Q 1**: What are the false positive and true positive rates? What's the area under the curve (AUC)?

Hint: generate predictions using the classifier's `predict_proba()` method in `sklearn`. We need probabilistic instead of binary predictions to compute the ROC thresholds.

**Q 2**: Plot the ROC curve. Remember: the ROC curve has the false positive rate on the x-axis and the true positive rate on the y-axis.
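
The ROC quantities could be computed roughly as follows, again on synthetic stand-in data with an arbitrarily chosen tree depth. The key step is taking the second column of `predict_proba()`, which holds the predicted probability of class 1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 17

# Synthetic stand-in data (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# max_depth=4 is an arbitrary choice for this sketch
tree = DecisionTreeClassifier(max_depth=4, random_state=SEED)
tree.fit(X_train, y_train)

# Column 1 of predict_proba is the predicted probability of class 1
proba_pos = tree.predict_proba(X_test)[:, 1]

# One (fpr, tpr) point per threshold
fpr, tpr, thresholds = roc_curve(y_test, proba_pos)
roc_auc = auc(fpr, tpr)
print(f"AUC: {roc_auc:.3f}")
```

To draw the curve, pass the arrays to matplotlib, for example `plt.plot(fpr, tpr)`, and add the diagonal `plt.plot([0, 1], [0, 1], linestyle="--")` as the reference line of a random classifier.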

## Part 6: Compare classification tree to random forest model

Random forests are frequently applied in business data science, so let's see how they perform on this example.

**Q 1**: Fit a random forest model. What's the optimal number of trees/estimators?

**Q 2**: Assess the model on the test set. What accuracy do you achieve?
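
A random forest can be tuned with the same grid-search pattern used for the single tree. In the sketch below the data is synthetic, and both the candidate values for `n_estimators` and the 3-fold cross-validation are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

SEED = 17

# Synthetic stand-in data (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Candidate forest sizes are an assumption made for this sketch
param_grid = {"n_estimators": [10, 50, 100]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_n = grid.best_params_["n_estimators"]
print("Best n_estimators:", best_n)

# Evaluate the refitted best forest on the held-out test set
rf_test_accuracy = accuracy_score(y_test, grid.predict(X_test))
print(f"Test accuracy: {rf_test_accuracy:.3f}")
```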

## **SUMMARY OF ACCURACY AND AUC VALUES**