<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-gradient-boosting-with-xgboost/01-decision-tree-in-depth/01_decision_tree_in_depth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Decision tree in depth

XGBoost is an ensemble method, meaning that it combines many machine
learning models that work together. The individual models that make up the
ensemble in XGBoost are called base learners.

Decision trees, the most commonly used XGBoost base learners, are unique in the
machine learning landscape. Instead of multiplying column values by numeric weights,
as in linear regression and logistic regression,
decision trees split the data by asking questions about the columns.
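The contrast can be sketched in a few lines of plain Python. The column names, weights, bias, and thresholds below are made up for illustration; they are not learned from the census data.

```python
# Hypothetical feature values for one person (names are illustrative only).
sample = {"age": 39, "education_num": 13, "hours_per_week": 40}


def linear_score(row):
    """Linear-model style: multiply each column by a weight and add them up."""
    weights = {"age": 0.02, "education_num": 0.10, "hours_per_week": 0.01}
    bias = -2.0
    return bias + sum(weights[name] * row[name] for name in weights)


def tree_predict(row):
    """Tree style: ask yes/no questions about columns until a leaf is reached."""
    if row["education_num"] <= 12:
        return 0  # leaf
    if row["hours_per_week"] <= 35:
        return 0  # leaf
    return 1  # leaf


print(linear_score(sample))
print(tree_predict(sample))
```

The linear function always uses every column, while the tree only looks at the columns that appear on the path it follows.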

A decision tree can create thousands of branches until it uniquely maps each sample to the
correct target in the training set. This means that the tree can reach 100% accuracy on the
training set. Such a model, however, will not generalize well to new data.
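A minimal sketch of this behaviour, assuming synthetic data from `make_classification` with 10% of the labels flipped at random (the dataset and variable names here are illustrative and are not part of the census example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (an assumption made for illustration).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=2
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=2
)

# No depth limit: the tree keeps splitting until every training leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=2).fit(Xd_train, yd_train)

train_acc = deep_tree.score(Xd_train, yd_train)
test_acc = deep_tree.score(Xd_test, yd_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Because the tree also memorises the flipped labels, the training accuracy is perfect, while the accuracy on the held-out rows is typically noticeably lower.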

Decision trees are prone to overfitting. In other words, decision trees can map too
closely to the training data, a problem explored later in this chapter in terms of variance
and bias.

Hyperparameter tuning is one way to reduce overfitting. Another
is to aggregate the predictions of many trees, a strategy that Random Forests and
XGBoost employ.
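As a rough illustration of aggregation, the sketch below combines hypothetical predictions from five trees by majority vote. This is a simplification in the spirit of bagging; XGBoost combines its trees additively rather than by voting, so treat this only as an intuition for why many trees can be steadier than one.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees (rows) for three samples (columns).
tree_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]

# Regroup so that each tuple holds all tree votes for one sample.
votes_per_sample = list(zip(*tree_predictions))
ensemble = [majority_vote(votes) for votes in votes_per_sample]
print(ensemble)  # [1, 0, 1]
```

No single tree agrees with the ensemble on every sample in this toy input, yet the individual mistakes are outvoted.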


## Setup

In [21]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV


import warnings
warnings.filterwarnings('ignore')

In [None]:
!wget https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/raw/master/Chapter02/census_cleaned.csv
!wget https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/raw/master/Chapter02/bike_rentals_cleaned.csv

## Exploring decision trees

Decision trees work by splitting the data into branches. The branches are followed down
to leaves, where predictions are made.
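To see branches and leaves concretely, a fitted tree can be printed as text with `sklearn.tree.export_text`. The six-row dataset and the feature names below are hypothetical and chosen only to keep the printed tree small.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny hypothetical dataset: [age, hours_per_week] -> label.
X_toy = [[25, 20], [30, 45], [45, 50], [50, 30], [23, 40], [60, 60]]
y_toy = [0, 1, 1, 0, 0, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=2).fit(X_toy, y_toy)

# Each "|---" line with a comparison is a branch; "class: ..." lines are leaves.
tree_text = export_text(toy_tree, feature_names=["age", "hours_per_week"])
print(tree_text)
```

Reading the output from top to bottom follows the same path a sample takes: answer the question at each branch, move down one level, and stop at a `class:` line.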

In [3]:
df_census = pd.read_csv("census_cleaned.csv")
df_census.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,38,215646,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,53,234721,7,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# declare your predictor and target columns, X and y
X = df_census.iloc[:, :-1]
y = df_census.iloc[:, -1]

In [5]:
# split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=2)

In [6]:
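# fit a decision tree classifier on the training split
clf = DecisionTreeClassifier(random_state=2)
clf.fit(x_train, y_train)

# score the predictions on the held-out test split (y_true first, then y_pred)
y_pred = clf.predict(x_test)
accuracy_score(y_test, y_pred)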

0.8131679154894976

## Tuning hyperparameters

The best way to learn about hyperparameters is through experimentation. Although there
are theories behind the range of hyperparameters chosen, results trump theory. Different
datasets see improvements with different hyperparameter values.

Before selecting hyperparameters, let's start by finding a baseline score.

In [8]:
df_bikes = pd.read_csv("bike_rentals_cleaned.csv")
df_bikes.head()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1349
3,4,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,1562
4,5,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,1600


In [9]:
x_bikes = df_bikes.iloc[:, :-1] # all rows, every column except the last
y_bikes = df_bikes.iloc[:, -1]  # all rows, last column only (the target, cnt)

In [11]:
reg_tree = DecisionTreeRegressor(random_state=2) 

In [12]:
# 5-fold cross-validation; scikit-learn negates MSE so that higher scores are better
scores = cross_val_score(reg_tree, x_bikes, y_bikes, scoring="neg_mean_squared_error", cv=5)

In [15]:
# The scores are negated MSE values, so flip the sign before taking the square root
rmse = np.sqrt(-scores)
print(f"RMSE mean: {rmse.mean():.2f}")

RMSE mean: 1233.36
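The sign handling can be checked by hand. The sketch below uses made-up fold scores (not the bike rental results) in the form that `scoring="neg_mean_squared_error"` returns them.

```python
import math

# Hypothetical per-fold scores, negated MSE as scikit-learn reports them.
neg_mse_scores = [-4.0, -9.0, -16.0]

# Negate to recover each fold's MSE, then take the square root for RMSE.
rmse_per_fold = [math.sqrt(-score) for score in neg_mse_scores]
rmse_mean = sum(rmse_per_fold) / len(rmse_per_fold)

print(rmse_per_fold)  # [2.0, 3.0, 4.0]
print(rmse_mean)      # 3.0
```

Note that this averages the per-fold RMSE values, as the cell above does, which is not the same as taking the root of the mean MSE.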


In [16]:
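# Caution: x_train and y_train are the census split from above, not bike data;
# x_bikes and y_bikes would need their own train_test_split first.
reg_tree.fit(x_train, y_train)
y_pred = reg_tree.predict(x_train)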

In [20]:
# RMSE on the same rows the tree was trained on, not on held-out data
reg_mse = mean_squared_error(y_train, y_pred)
reg_rmse = np.sqrt(reg_mse)
reg_rmse

0.0

In [22]:
# GridSearchCV scores every max_depth candidate with 5-fold CV and keeps the best one
params = {"max_depth": [None, 2, 3, 4, 6, 8, 10, 20]}

grid_reg = GridSearchCV(reg_tree, params, scoring="neg_mean_squared_error", cv=5, n_jobs=-1)
grid_reg.fit(x_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(random_state=2), n_jobs=-1,
             param_grid={'max_depth': [None, 2, 3, 4, 6, 8, 10, 20]},
             scoring='neg_mean_squared_error')
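What the grid search does internally can be approximated with an explicit loop. The sketch below uses synthetic data from `make_regression` as a stand-in (an assumption, not the notebook's data) and compares a hand-written search against `GridSearchCV` over the same `max_depth` candidates.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for a real dataset (an assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=2
)
depths = [None, 2, 3, 4, 6, 8, 10, 20]

# Manual search: score every candidate with 5-fold CV and keep the best mean.
mean_scores = {}
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=2)
    fold_scores = cross_val_score(
        model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=5
    )
    mean_scores[depth] = fold_scores.mean()
manual_best = max(mean_scores, key=mean_scores.get)

# The same search expressed through GridSearchCV.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=2),
    {"max_depth": depths},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_demo, y_demo)

print(manual_best, grid.best_params_["max_depth"])
print(mean_scores[manual_best], grid.best_score_)
```

Beyond the loop, `GridSearchCV` also refits the winning configuration on all of the data passed to `fit` (the default `refit=True`), which is why `best_estimator_` can be used for prediction directly.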

In [23]:
best_params = grid_reg.best_params_
print(f"Best params: {best_params}")

Best params: {'max_depth': 8}


In [24]:
# best_score_ is the mean cross-validated negated MSE of the best candidate
best_score = np.sqrt(-grid_reg.best_score_)
print(f"Training score: {best_score:.3f}")

Training score: 0.318


In [25]:
best_model = grid_reg.best_estimator_
y_pred = best_model.predict(x_test)
rmse_test = mean_squared_error(y_test, y_pred) ** 0.5

print(f"Test score: {rmse_test:.3f}")

Test score: 0.323
