# Exercise - Regularizing a Decision Tree Model


In this exercise, you will train a base Decision Tree classification model and investigate the effects of its hyperparameters on the bias/variance trade-off in order to address overfitting/underfitting.


In [15]:
# DO NOT MODIFY - imports
import pandas as pd
import numpy as np

## 1. Preparation and Set-up

Let's say we want to classify the direction of 1-day price movements of [E-mini S&P 500](https://www.cmegroup.com/markets/equities/sp/e-mini-sandp500.html) futures contracts. Execute the cell below to load approximately 24 years of data, including some technical analysis features, into the `df` DataFrame.


In [16]:
# DO NOT MODIFY - load data
df = pd.read_csv("data.csv")
df.shape

(6007, 18)

In [17]:
# DO NOT MODIFY - data preview
df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume | EMA10 | EMA30 | ATR | ADX | RSI | MACD | MACDsignal | ClgtEMA10 | EMA10gtEMA30 | MACDSIGgtMACD | target_cls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-11-16 | 1396.75 | 1402.75 | 1376.5 | 1379.25 | 1379.25 | 180156 | 1396.300225 | 1402.770723 | 31.179556 | 26.944748 | 41.653459 | -4.726233 | 1.37898 | -1 | -1 | 1 | 0 |
| 1 | 2000-11-17 | 1378.75 | 1393.5 | 1360.25 | 1370.5 | 1370.5 | 90660 | 1391.609275 | 1400.688741 | 31.327445 | 27.342044 | 40.051128 | -6.469625 | -0.190741 | -1 | -1 | 1 | 0 |
| 2 | 2000-11-20 | 1369.75 | 1373.5 | 1345.25 | 1347.75 | 1347.75 | 82907 | 1383.634861 | 1397.273338 | 31.107627 | 28.056237 | 36.156663 | -9.576616 | -2.067916 | -1 | -1 | 1 | 1 |
| 3 | 2000-11-21 | 1348.25 | 1362.75 | 1336.5 | 1356.0 | 1356.0 | 82365 | 1378.610341 | 1394.610542 | 30.760654 | 28.912117 | 38.492375 | -11.243614 | -3.903056 | -1 | -1 | 1 | 0 |
| 4 | 2000-11-22 | 1356.0 | 1361.0 | 1322.0 | 1323.0 | 1323.0 | 76824 | 1368.49937 | 1389.990507 | 31.349179 | 30.012468 | 33.251961 | -15.054013 | -6.133247 | -1 | -1 | 1 | 1 |


We will be using the following columns for features (`X`) and the `target_cls` column as the classification target (`y`):


In [13]:
# DO NOT MODIFY - features and target definition
X = df[["ATR", "ADX", "RSI", "ClgtEMA10", "EMA10gtEMA30", "MACDSIGgtMACD"]]
y = df["target_cls"]

Run the cell below to split the data 70/30:


In [18]:
# DO NOT MODIFY - train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=52, shuffle=False
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4204, 6), (1803, 6), (4204,), (1803,))
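
Because `shuffle=False`, the split above is chronological: the earliest 70% of rows go to the training set and the most recent 30% to the test set, so the model is never trained on data that comes after its test period. The sketch below illustrates this on a made-up toy frame (the `toy` DataFrame and its dates are hypothetical, not the exercise data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 consecutive business days (NOT the exercise data)
toy = pd.DataFrame({
    "Date": pd.date_range("2000-11-16", periods=10, freq="B"),
    "feature": range(10),
})

# With shuffle=False the rows are cut in their existing order
toy_train, toy_test = train_test_split(toy, test_size=0.3, shuffle=False)

# Every training date precedes every test date
print(toy_train["Date"].max() < toy_test["Date"].min())
print(len(toy_train), len(toy_test))
```

For time-ordered market data this is usually the safer choice, since a shuffled split would mix future observations into the training set.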

## 2. Baseline Score, "Base" Model and Over-/Underfitting Diagnosis

Run the cell below to find the distribution of the target class in the training set. Note which class (`0` or `1`) is the majority class.

In [86]:
# DO NOT MODIFY - class distribution on the training set
print("Target class distribution on `y_train`:")
y_train.value_counts() / len(y_train)   # Expected output: Target class `1` is the majority class with a 53.59% share

Target class distribution on `y_train`:


target_cls
1    0.535918
0    0.464082
Name: count, dtype: float64

If, instead of training a machine learning model, you always predicted the majority class above, what accuracy score would you achieve on the test set?

In [88]:
# FILL IN - What accuracy score would you get if you predicted the majority class (from `y_train`) for all samples in the test set?
print("Target class distribution on `y_test`:")
y_test.value_counts()[1] / len(y_test)

Target class distribution on `y_test`:


np.float64(0.5463117027176927)
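
As a cross-check, the same kind of majority-class baseline can be obtained with scikit-learn's `DummyClassifier`. This is a minimal sketch on randomly generated labels (`y_tr`, `y_te`, `X_tr`, and `X_te` are hypothetical stand-ins, not the exercise data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train / y_test (NOT the exercise data)
rng = np.random.default_rng(0)
y_tr = rng.integers(0, 2, size=200)
y_te = rng.integers(0, 2, size=100)
X_tr = np.zeros((200, 1))  # DummyClassifier ignores the feature values
X_te = np.zeros((100, 1))

# "most_frequent" always predicts the majority class seen during fit
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline = accuracy_score(y_te, dummy.predict(X_te))

# The same number by hand: share of the training majority class among the test labels
majority = np.bincount(y_tr).argmax()
manual = (y_te == majority).mean()
print(baseline, manual)
```

Deriving the majority class from the training labels, rather than hard-coding `1`, keeps the baseline valid if the class balance changes.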

The accuracy computed above (about 54.6%) provides a "baseline score" against which to measure our efforts. Now train a basic `DecisionTreeClassifier` with the default hyperparameter values:


In [79]:
# DO NOT MODIFY - import
from sklearn.tree import DecisionTreeClassifier

# FILL IN - Instantiate and train (fit) a DecisionTreeClassifier with the default hyperparameters and random_state=52
clf = DecisionTreeClassifier(random_state=52)
clf.fit(X_train, y_train)

We will now use this base tree model to get predictions on the train/test sets. Run the cells below and compare the test accuracy result to the baseline score.

In [93]:
# DO NOT MODIFY
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

from sklearn.metrics import accuracy_score

print("Train accuracy:", accuracy_score(y_train, y_pred_train))
print("Test accuracy:", accuracy_score(y_test, y_pred_test))

Train accuracy: 1.0
Test accuracy: 0.47088186356073214


Given the above train/test accuracy scores, is the base model overfitting, underfitting, or neither?

In [None]:
# 1. The model is overfitting
# 2. The model is underfitting
# 3. The model is well-fitted
# FILL IN - Choose the correct answer
answer = 1
# The model is severely overfitting: the training accuracy is 100%, while the test accuracy is even lower than the baseline score.
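
One way to see why an unconstrained tree reaches 100% training accuracy is to inspect how large it grows. The sketch below uses synthetic, deliberately noisy data (`X_demo` and `y_demo` are hypothetical stand-ins for the exercise data) together with the `get_depth()` and `get_n_leaves()` methods of the fitted classifier:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical stand-in, NOT the exercise data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=52,
)

full_tree = DecisionTreeClassifier(random_state=52).fit(X_demo, y_demo)

# With default settings the tree keeps splitting until every leaf is pure,
# so it effectively memorizes the training set, noise included
print("Depth:", full_tree.get_depth())
print("Leaves:", full_tree.get_n_leaves())
print("Train accuracy:", full_tree.score(X_demo, y_demo))
```

A very deep tree with a large number of leaves relative to the sample size is a typical sign that the model has memorized the training data.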

## 3. Regularizing the Decision Tree Model

The process of reducing the complexity of a tree-based model, known as "pruning", helps alleviate overfitting. There are two types of pruning:
- In the first type, often called "pre-pruning", we prevent the model from becoming too complex in the first place by constraining its hyperparameters. This is the method we will use today.
- In the second type, known as "backward pruning" (or "post-pruning"), we allow the unconstrained tree to reach its maximum complexity (at which point it will almost certainly overfit) and then use methods like Cost Complexity Pruning to pare it down. (If interested, see the documentation for the `cost_complexity_pruning_path()` method of the classifier.)
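
For the curious, the second type of pruning can be sketched as follows. This is only an illustration on synthetic data (`X_demo` and `y_demo` are hypothetical), using `cost_complexity_pruning_path()` to obtain candidate `ccp_alpha` values and then refitting with a few of them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical stand-in, NOT the exercise data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=52,
)

# Effective alphas at which successive subtrees of the full tree get pruned away
path = DecisionTreeClassifier(random_state=52).cost_complexity_pruning_path(X_demo, y_demo)
# Guard against tiny negative values caused by floating-point round-off
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit with a handful of alphas: a larger ccp_alpha prunes more aggressively
step = max(1, len(ccp_alphas) // 5)
leaf_counts = []
for alpha in ccp_alphas[::step]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=52).fit(X_demo, y_demo)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.5f}  leaves={leaf_counts[-1]}")
```

In practice the value of `ccp_alpha` would be chosen on a validation set rather than picked by hand.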

Our base `DecisionTreeClassifier` from earlier had no limit imposed on the maximum depth it was allowed to reach. Re-instantiate and re-fit it, this time limiting its maximum depth to 10. Keep `random_state=52`.

In [128]:
# FILL IN - Re-train the model with its maximum depth hyperparameter set to 10 and random_state=52
clf = DecisionTreeClassifier(max_depth=10, random_state=52)
clf.fit(X_train, y_train)

Inspect the new training and test accuracy scores. Is the model doing better than before in terms of over-/under-fitting?

In [129]:
# FILL IN - Print out the train and test accuracy scores - They should be much closer to each other now indicating reduced overfitting
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

print("Train accuracy:", accuracy_score(y_train, y_pred_train))
print("Test accuracy:", accuracy_score(y_test, y_pred_test))

Train accuracy: 0.579686013320647
Test accuracy: 0.540765391014975


Absolute performance is not the focus of this exercise, but feel free to compare the test accuracy to our baseline score to see whether the pruned model was able to eke out a few more basis points of performance.
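
To get a feel for how the depth limit moves the model along the bias/variance trade-off, one option is to sweep several `max_depth` values and watch the train/test gap. A minimal sketch on synthetic data (all names below are hypothetical and the depth grid is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical stand-in, NOT the exercise data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=52,
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)

results = {}
for depth in [1, 2, 5, 10, None]:  # None means no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=52).fit(Xtr, ytr)
    results[depth] = (model.score(Xtr, ytr), model.score(Xte, yte))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth}: train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```

A shallow tree tends to show a small gap (higher bias, lower variance), while the unlimited tree fits the training set perfectly and typically shows the largest gap.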

Alternatively, we could have pruned the tree by increasing the minimum number of samples required at each leaf node. Re-train and re-evaluate the classifier with the minimum number of samples per leaf set to `100` (from the default value of `1`) to see if this helps reduce variance. **NOTE:** Do not set the maximum depth hyperparameter. We want to inspect the isolated effect of one hyperparameter at a time.

In [109]:
# FILL IN - Re-train the model with its minimum samples per leaf hyperparameter set to 100 and random_state=52
clf = DecisionTreeClassifier(min_samples_leaf=100, random_state=52)
clf.fit(X_train, y_train)

# FILL IN - Print out the train and test accuracy scores - Again we should see a reduction in overfitting compared to the base model
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

print("Train accuracy:", accuracy_score(y_train, y_pred_train))
print("Test accuracy:", accuracy_score(y_test, y_pred_test))

Train accuracy: 0.5768315889628924
Test accuracy: 0.5002773155851359
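
To confirm what `min_samples_leaf` actually constrains, the fitted tree's internals can be inspected through its `tree_` attribute. The sketch below, on synthetic data (`X_demo`, `y_demo`, and `leafy` are hypothetical names), checks that no leaf holds fewer than 100 training samples:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical stand-in, NOT the exercise data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=52,
)

leafy = DecisionTreeClassifier(min_samples_leaf=100, random_state=52).fit(X_demo, y_demo)

tree = leafy.tree_
is_leaf = tree.children_left == -1          # leaf nodes have no children
leaf_sizes = tree.n_node_samples[is_leaf]   # training samples that land in each leaf

print("Leaves:", leafy.get_n_leaves())
print("Smallest leaf:", leaf_sizes.min())
```

With 1,000 samples and at least 100 samples per leaf, the tree can have at most 10 leaves, which caps its complexity directly.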


Another major hyperparameter that can help reduce the complexity of the decision tree is the minimum number of samples a decision node must contain before it is allowed to split. Change this hyperparameter's value to `500` (from its default value of `2`). As before, inspect its _isolated_ effect on bias/variance by comparing the train/test accuracy scores to those of the base decision tree.

In [117]:
# FILL IN - Re-train the model with its minimum samples per split hyperparameter set to 500 and random_state=52
clf = DecisionTreeClassifier(min_samples_split=500, random_state=52)
clf.fit(X_train, y_train)

# FILL IN - Print out the train and test accuracy scores - Again we should see a reduction in overfitting compared to the base model
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

print("Train accuracy:", accuracy_score(y_train, y_pred_train))
print("Test accuracy:", accuracy_score(y_test, y_pred_test))

Train accuracy: 0.582778306374881
Test accuracy: 0.5041597337770383
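
The three experiments above can also be run side by side in one loop, which makes the isolated effect of each hyperparameter easier to compare. This is a sketch on synthetic data (the `configs` dictionary and all variable names are hypothetical; the hyperparameter values mirror the ones used in this exercise):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (hypothetical stand-in, NOT the exercise data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=2, n_redundant=0,
    flip_y=0.3, random_state=52,
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)

# One hyperparameter changed at a time, relative to the unconstrained base model
configs = {
    "base": {},
    "max_depth=10": {"max_depth": 10},
    "min_samples_leaf=100": {"min_samples_leaf": 100},
    "min_samples_split=500": {"min_samples_split": 500},
}

summary = {}
for name, params in configs.items():
    model = DecisionTreeClassifier(random_state=52, **params).fit(Xtr, ytr)
    summary[name] = {
        "train": model.score(Xtr, ytr),
        "test": model.score(Xte, yte),
        "leaves": model.get_n_leaves(),
    }
    row = summary[name]
    print(f"{name:<22} train={row['train']:.3f} test={row['test']:.3f} leaves={row['leaves']}")
```

Each constrained model ends up with far fewer leaves than the base model, which is the mechanism behind the smaller train/test gap.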
