# Gradient Boosting Trees From Scratch
I will begin by looking at a GB Tree for regression. I will use the data from the autotrader task, which has already been split, cleaned and encoded, with the aim of predicting the price of a car. As a base, I will use sklearn's decision tree algorithm rather than implementing my own tree algorithm.

In [27]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [6]:
# Load in data from autotrader data cleaner
X_train = pd.read_csv("data/X_train.csv")
X_test = pd.read_csv("data/X_test.csv")
y_train = pd.read_csv("data/y_train.csv")
y_test = pd.read_csv("data/y_test.csv")

In [9]:
print(f"Training data size: {len(X_train)}")
print(f"Testing data size: {len(X_test)}")
X_train.head()

Training data size: 156170
Testing data size: 66930


Unnamed: 0,has_service_history,is_imported,has_website,has_trim,is_ulez,is_convertible,known_reg_plate,is_private_plate,vehicle_age,mileage_deviation_encoded,...,model_col_6,model_col_7,model_col_8,model_col_9,model_col_10,model_col_11,model_col_12,model_col_13,model_col_14,model_col_15
0,1,0,1,1,0,1,1,0,56.0,0,...,0,0,0,0,0,1,0,0,0,0
1,1,0,1,1,0,0,1,0,12.0,-1,...,0,0,0,1,0,0,0,0,0,0
2,1,1,1,1,0,0,1,0,15.0,-1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,1,1,0,1,0,1.0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,1,1,1,0,1,0,1000.0,0,...,0,0,0,0,0,0,1,0,0,0


## Notes on Tree-based Algorithms

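Gradient Boosted Trees is another ensemble method, like a Random Forest. This means several tree models are trained and all contribute to a final prediction. The key difference is in how the trees are trained: in a Random Forest they are built independently of one another (so they can be trained in parallel), whereas in gradient boosting they are built sequentially, each one correcting the errors of those before it.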

### Random Forests

In a Random Forest (RF), the first step is to produce a bootstrapped dataset (sampling with replacement). We then build a decision tree using this bootstrapped dataset - crucially, we consider only a random subset of the features at each split rather than the entire feature space. These two steps are repeated to produce a 'forest' of these random trees.
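
The two steps above can be sketched roughly as follows. This is a minimal illustration rather than a full RF implementation: the data is synthetic (not the autotrader set), and names such as `forest` and `n_trees` are made up for the example. The random feature subset is delegated to sklearn's `max_features` option.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for illustration only)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

n_trees = 10
forest = []
for i in range(n_trees):
    # Step 1: bootstrap - draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: grow a tree that only considers a random subset of
    # features at each split (here sqrt(n_features))
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Fraction of distinct rows in the last bootstrap sample (roughly 63% on average)
unique_fraction = len(np.unique(idx)) / len(X)
```

Because sampling is done with replacement, each bootstrap sample leaves out roughly a third of the rows, which is what makes an out-of-bag estimate possible.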

**Random Forests on New Data**

When a new piece of data is passed through an RF, each tree provides an answer: in classification this could be yes/no; in regression, a continuous value such as price. The final prediction is the mode of these answers (classification) or their mean (regression). Bootstrapping the data and then aggregating the trees' predictions in this way is called *bagging* (bootstrap aggregating). An out-of-bag score can also be calculated. The out-of-bag dataset for a tree consists of the examples that were not selected when bootstrapping that tree's data, so it acts as a validation set, much like a held-out fold in cross-validation. In classification, the proportion of out-of-bag samples that are *incorrectly* classified is known as the out-of-bag error.
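
As a rough sketch of the aggregation and out-of-bag ideas for regression (synthetic data again; `pred_sum`, `pred_count` and `oob_mse` are hypothetical names, and mean squared error is just one possible out-of-bag metric):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption for illustration only)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

n, n_trees = len(X), 25
pred_sum = np.zeros(n)    # running sum of out-of-bag predictions per row
pred_count = np.zeros(n)  # number of trees for which each row was out-of-bag
trees = []

for i in range(n_trees):
    idx = rng.integers(0, n, size=n)   # bootstrap sample
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                   # rows never drawn are out-of-bag
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)
    if oob.any():
        pred_sum[oob] += tree.predict(X[oob])
        pred_count[oob] += 1

# Regression: the forest's answer is the mean of the individual trees' answers
x_new = X[:1]
forest_prediction = np.mean([t.predict(x_new)[0] for t in trees])

# Out-of-bag estimate: each row is scored only by trees that never saw it
seen = pred_count > 0
oob_pred = pred_sum[seen] / pred_count[seen]
oob_mse = np.mean((y[seen] - oob_pred) ** 2)
```

For classification, the mean would be replaced by a majority vote and the squared error by the misclassification rate.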

### Regression Trees
The root of a regression tree is the feature threshold that minimises the sum of the squared residuals. Each resulting node is split in the same way until it can no longer be split. To prevent overfitting, the algorithm typically has a `min_samples_split` hyperparameter, which sets the minimum number of observations a node must contain before it can be split, so that very small nodes are not split any further.
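
To make the splitting rule concrete, here is a small sketch of the threshold search for a single feature. The helper `best_split` and the toy arrays are hypothetical, and a real tree would repeat this search over every feature and every node:

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, ssr) minimising the sum of squared residuals for one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_ssr = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        # Candidate threshold: midpoint between neighbouring values
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # Each side predicts its own mean; add up the squared residuals
        ssr = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ssr < best_ssr:
            best_threshold, best_ssr = threshold, ssr
    return best_threshold, best_ssr

# Toy data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
threshold, ssr = best_split(x, y)
print(threshold)  # 6.5
```

The chosen threshold falls at the midpoint of the jump, because splitting there leaves the smallest residuals on both sides.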

# <a href="https://jerryfriedman.su.domains/ftp/trebst.pdf">Gradient Boosted Trees</a>
## Regression
Gradient Boost (GB) starts with a single leaf as the initial guess; for a continuous target, this is the mean of the target. From this initial guess, GB builds trees whose size (number of leaves) is limited, typically to between 8 and 32. Each successive tree is fitted to the errors of the trees before it and, once the maximum number of trees has been built, predictions are made from the initial guess plus the scaled sum of all the trees' outputs. Every tree is scaled by the same constant, the learning rate $\eta$.
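
The whole procedure can be sketched in a few lines. This is a minimal illustration under assumptions, not a definitive implementation: it uses synthetic data, assumes a squared-error loss, and the values of `learning_rate`, `n_trees` and `max_leaves` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in data (an assumption for illustration only)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

learning_rate = 0.1  # eta: the same scaling for every tree
n_trees = 50
max_leaves = 8       # keep each tree small

# Start from a single leaf: the mean of the target
initial_guess = y.mean()
prediction = np.full(len(y), initial_guess)
mse_before = np.mean((y - prediction) ** 2)

trees = []
for m in range(n_trees):
    residuals = y - prediction  # errors of the model so far
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=m)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # take a small step
    trees.append(tree)

mse_after = np.mean((y - prediction) ** 2)

def predict(X_new):
    """Initial guess plus the scaled contribution of every tree."""
    out = np.full(len(X_new), initial_guess)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out
```

A small learning rate means each tree only nudges the prediction, so more trees are needed, but the ensemble tends to generalise better than one that takes large steps.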

**AdaBoost** is a similar boosting algorithm. In AdaBoost, however, the trees are usually 'stumps' (a root with a single left and right leaf). In addition, each stump is scaled individually according to how well it performed, rather than every tree being scaled by a fixed $\eta$.
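
As an illustration of per-stump scaling, the sketch below uses the weight from the classic binary-classification form of AdaBoost, $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$, where $\epsilon$ is the stump's weighted error rate. The function name `amount_of_say` is made up for this example, and regression variants of AdaBoost use a different formula.

```python
import math

def amount_of_say(weighted_error, eps=1e-10):
    """Weight of a stump in classic binary AdaBoost, given its weighted error rate."""
    # Clamp the error to avoid log(0) or division by zero for perfect/useless stumps
    err = min(max(weighted_error, eps), 1 - eps)
    return 0.5 * math.log((1 - err) / err)

good_stump = amount_of_say(0.1)   # low error  -> large positive weight
weak_stump = amount_of_say(0.3)   # higher error -> smaller weight
coin_flip = amount_of_say(0.5)    # no better than chance -> weight of zero
```

Compare this with gradient boosting, where every tree is multiplied by the same $\eta$ regardless of how well it fits.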



In [20]:
initial_guess = y_train['price'].mean()
print(f"""
Step 1 - Initial Prediction
In our case, this should be the mean price of all the examples

Our initial guess is therefore £{initial_guess} -> (£{initial_guess:.2f} to 2 d.p)
""")


Step 1 - Initial Prediction
In our case, this should be the mean price of all the examples

Our initial guess is therefore £16879.324050713967 -> (£16879.32 to 2 d.p)



In [25]:
print(f"""
Step 2 - Build a tree based on the previous tree's errors

As our previous guess was {initial_guess:.2f}, we can calculate the error as: (observed price - predicted price)
This difference is saved as a pseudo-residual
""")
y_train['residual'] = y_train['price'] - initial_guess
y_train.head(10)


Step 2 - Build a tree based on the previous tree's errors

As our previous guess was 16879.32, we can calculate the error as: (observed price - predicted price)
This difference is saved as a pseudo-residual



Unnamed: 0,price,residual
0,13500,-3379.324051
1,6500,-10379.324051
2,14990,-1889.324051
3,45000,28120.675949
4,21685,4805.675949
5,1999,-14880.324051
6,10790,-6089.324051
7,26550,9670.675949
8,66070,49190.675949
9,17890,1010.675949
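
The name 'pseudo-residual' comes from the gradient view of boosting. As a short worked derivation, assuming a squared-error loss (the notebook does not state its loss explicitly, but this choice is consistent with using the mean as the initial guess), and writing $F(x_i)$ for the current prediction for example $i$:

$$
L(y_i, F(x_i)) = \tfrac{1}{2}\,\bigl(y_i - F(x_i)\bigr)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)
$$

So for this loss the negative gradient is exactly the residual (observed price - predicted price). The same loss explains the initial guess, since the constant $\gamma$ that minimises $\sum_i \tfrac{1}{2}(y_i - \gamma)^2$ is the mean of the $y_i$.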


In [29]:
print(f"""
Step 3 - Build a tree based on all the features to predict the residuals
""")
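# Cap the tree size, since GB trees are kept small (8 leaves is an illustrative choice, not a tuned value)
dt1 = DecisionTreeRegressor(max_leaf_nodes=8, random_state=1337).fit(X_train, y_train['residual'])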


Step 3 - Build a tree based on all the features to predict the residuals

