 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Gradient boosting detailed explanation

* let's walk through, in detail, how gradient boosting works

* our goal in this example is to predict whether a person likes the movie "The Conjuring"
    <br>
    
    * we are going to demonstrate on a toy dataset that I created, just to keep things simple

In [1]:
# Import the packages we will use
# in this example

import pandas as pd
import math

In [2]:
# Create some example data
# Store it in a DataFrame

df = pd.DataFrame({
    "Likes horror movies": ["Yes", "Yes", "No", "Yes", "No", "No"],
    "Optimal movie length":[60, 90, 60, 120, 90, 60],
    "Age": [15, 90, 45, 20, 32, 14],
    "Likes 'The Conjuring'": ["Yes", "Yes", "No", "No", "Yes", "Yes"]    
})

In [3]:
# Display the DataFrame

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring'
0,Yes,60,15,Yes
1,Yes,90,90,Yes
2,No,60,45,No
3,Yes,120,20,No
4,No,90,32,Yes
5,No,60,14,Yes


<br><br>

## First step

* in gradient boosting, we start with just one leaf and then add predictors one at a time
    <br>
    
    * the predictors we use in gradient boosting are decision trees

* create a leaf with an initial prediction for every person
    <br>
    
    * the first leaf we create contains the **log(odds)** prediction
    * **log(odds)** plays a very important role in Logistic Regression
        <br>
        
        * the odds are defined as **probability_of_success/probability_of_failure**
        * log(odds) is just the logarithm of the odds
    

* we have two classes: "Yes" and "No"
    <br>
    
    * "Yes" represents success, while "No" represents failure
    * the odds in our case will be equal to 4/2 ---> we have 4 "Yes" values, and 2 "No" values
    * that means: **log(odds) = log(4/2)** 

In [4]:
# Calculate the initial log(odds)

init_log_odds = math.log(4/2)

init_log_odds

0.6931471805599453

* now that we have the initial log(odds) value, we need to convert it to something we can use to calculate the residuals

* that something is a probability value

* the easiest way to do that ---> **Logistic function**

**Equation for calculating the probability from log(odds)**

### $ p = \frac{e^{\log(\text{odds})}}{1+e^{\log(\text{odds})}} $

In [5]:
# Calculate the initial probability
# Since we are just starting we assign this probability 
# to all examples, i.e. to all rows in our DataFrame

init_prob = (math.e**(init_log_odds))  /  (1 + math.e**(init_log_odds))

init_prob

0.6666666666666666
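
The two steps above (class counts to log(odds), then log(odds) to probability) can be wrapped in small helper functions. This is only a sketch: the function names are made up for illustration, and the labels are written as 1 for "Yes" and 0 for "No". It also shows that the initial probability is simply the share of "Yes" answers in the data.

```python
import math

def initial_log_odds(labels):
    """Log of (number of 1s / number of 0s); assumes both classes are present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.log(positives / negatives)

def log_odds_to_probability(log_odds):
    """Logistic function: e^x / (1 + e^x)."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# "Yes" -> 1, "No" -> 0, in the same order as the rows of the DataFrame
labels = [1, 1, 0, 0, 1, 1]

log_odds = initial_log_odds(labels)
probability = log_odds_to_probability(log_odds)

print(log_odds)                   # initial log(odds)
print(probability)                # initial probability
print(sum(labels) / len(labels))  # share of "Yes" answers, for comparison
```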

* this probability is initially assigned to everyone
    <br>
    
    * we don't have a decision tree, just one leaf, so we predict the same result for every row
    * since the probability is larger than 0.5, we can classify everyone in the dataset as people who will like 'The Conjuring' (https://en.wikipedia.org/wiki/The_Conjuring)

* because the prediction will be wrong for two people we can't leave it at that, but we instead calculate pseudo-residuals
    <br>
    
    * ***`pseudo_residual = observed_value - predicted_value`***
    <br>
    * the **observed value** for a given person (row in the DataFrame) will be 1 if they liked the movie (100% probability they like the movie) and 0 if they don't (0% probability they would enjoy it)
    <br>
    * the **predicted_value** at this point is just the initial probability we assigned to everyone
        <br>
        
        * **init_prob = 0.6666666666666666**

* since our probability is a float, and the values in our "Likes 'The Conjuring'" column are strings, we need to binary encode the data stored in that column ("Yes" becomes 1, "No" becomes 0)

In [6]:
# Binary encode data so that we can calculate residuals
# Yes to 1
# No to 0

df["Likes 'The Conjuring'"] = df["Likes 'The Conjuring'"].map({"Yes":1, "No":0})

In [7]:
# Calculate the pseudo-residuals.
# Essentially, the difference between the real probability value and 
# the probability we calculated

df["residuals"] =  df["Likes 'The Conjuring'"] - init_prob

In [8]:
# Display dataframe with residuals

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.333333
1,Yes,90,90,1,0.333333
2,No,60,45,0,-0.666667
3,Yes,120,20,0,-0.666667
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


## Second step: build first tree

* create a tree based on the values we just calculated 

* essentially, go over each row in the DataFrame, get the corresponding residuals, and include those values in the correct leaves, based on the selected split    

* typically we also limit the number of leaves; in this case let's limit it to 3 (internal nodes are purple, and each leaf is orange)

    <br>
    
    * in practice we use more leaves (e.g. a typical value is 8), but since we are working with a very small DataFrame and only want to demonstrate how the procedure works, we keep things simple in this example

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_first_tree.png?v=1" width="800"/>


Here's the DataFrame again, to follow the tree above more easily:

In [9]:
df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.333333
1,Yes,90,90,1,0.333333
2,No,60,45,0,-0.666667
3,Yes,120,20,0,-0.666667
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


### Problem: leaves contain residuals

* we can't compare a residual with the values in our **Likes 'The Conjuring'** column; we can only compare probabilities with probabilities

* the procedure of getting back to probability values from residuals is actually pretty complex (and uses some pretty complex math), so we are not going to go into that in detail

* a simplified version would be that we can use the following equation to calculate the log(odds) value for some leaf:
    <br>
        
               sum_of_residuals / [sum of (previous_probability  * (1 - previous_probability))]

* once we have the log(odds) value, we can easily convert it to a probability value

#### Calculating log(odds) for first tree

* the equation we will use is the one we mentioned above

* the sum of residuals will be the sum of all values that are in a leaf

* since this is the first tree we are building, the previous probability will be the same in all cases, and it will be the one we got by converting the original log odds value
    <br>
    
    * rounded, the original probability is equal to ***0.666667***

* we are going to round the values to a few decimal places to keep things simple; in practice the values are not rounded but used as is

<br><br>

In [10]:
# Calculate first leaf value
# by plugging in values in 
# the equation we defined earlier

first_calculation = -0.666667 / ((0.666667 )*(1-0.666667))

first_calculation

-3.0000030000030002

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_first_tree_first_calculation.png" width="600"/>


<br><br>

In [11]:
# Calculate second leaf value

second_calculation = (0.333333 + (-0.666667)) / ((0.666667)*(1-0.666667) + (0.666667)*(1-0.666667))

second_calculation 

-0.7500018750013125

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_first_tree_second_calculation.png" width="600"/>


<br><br>

In [12]:
# Calculate third leaf value

third_calculation = (0.333333 + 0.333333 + 0.333333) / ((0.666667)*(1-0.666667) + (0.666667)*(1-0.666667) + (0.666667)*(1-0.666667))

third_calculation

1.4999992500003752

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_first_tree_third_calculation.png" width="600"/>
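
The three hand calculations above can be reproduced with one small helper. This is a minimal sketch, assuming the leaf memberships shown in the tree image (one "No" row in the left leaf, one "Yes" and one "No" row in the middle leaf, three "Yes" rows in the right leaf); the function name `leaf_log_odds` is made up for illustration. It uses the unrounded initial probability 2/3, so the results come out as -3, -0.75 and 1.5 up to floating-point error. The small deviations in the cells above come from rounding the inputs to six decimals.

```python
def leaf_log_odds(residuals, previous_probabilities):
    """sum of residuals / sum of p * (1 - p) over the rows in one leaf."""
    numerator = sum(residuals)
    denominator = sum(p * (1 - p) for p in previous_probabilities)
    return numerator / denominator

prev = 2 / 3             # initial probability, assigned to every row
yes_residual = 1 - prev  # residual of a row whose label is 1
no_residual = 0 - prev   # residual of a row whose label is 0

left_leaf = leaf_log_odds([no_residual], [prev])
middle_leaf = leaf_log_odds([yes_residual, no_residual], [prev, prev])
right_leaf = leaf_log_odds([yes_residual] * 3, [prev] * 3)

print(left_leaf, middle_leaf, right_leaf)
```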


<br><br><br><br>

## Third step: update values

* to get the new log(odds) value for a row of our DataFrame, we take the initial log(odds) value and add the value of the leaf that the row lands in, multiplied by a learning rate
    <br>
    
    * the learning rate defines how fast the model is "learning"
    * I use a large one for the purposes of demonstration (learning rate = 0.5); typically it is 0.1 or something similar
    * larger rates lead to bigger initial increases in accuracy, but can quickly lead to overfitting

* after we calculate the new log odds for a row, we can convert it to a probability and calculate a new residual to replace the old one

<img src="https://edlitera-images.s3.amazonaws.com/full_first_calculation.png" width="1200"/>


* to calculate the new log(odds) value, just follow the tree; afterwards you can convert it to a new probability using the same Logistic function we used when first converting log(odds) to probability

### Calculate new log(odds) values

In [13]:
# Display first row of DataFrame

df.iloc[0:1]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.333333


**How to update log(odds):**

* since the person doesn't prefer movies that are 120 minutes in length we don't go to the leaf that is to the left, but instead move to the right


* because the person is not older than 35, we move to the leaf that is furthest to the right (which means that we will use 1.49, the third leaf value, 1.4999992, truncated to two decimals)


* **final update: original log odds + learning rate * furthest right leaf value**

In [14]:
# Update log odds value

learning_rate = 0.5

updated_log_odds_value_first_row = init_log_odds + learning_rate * 1.49

updated_log_odds_value_first_row 

1.4381471805599453

### Calculate new probability and update residual for first row

In [15]:
# Calculate new probability for first row

new_probability_first_row = (math.e**(updated_log_odds_value_first_row))  /  (1 + math.e**(updated_log_odds_value_first_row))

new_probability_first_row

0.8081675677459842

* as you can see the probability is bigger 
    <br>
    
    * since the value in the **Likes 'The Conjuring'** column is 1 ("Yes"), the probability value we are trying to achieve for the first row is 1
    * **essentially, WE ARE TRYING TO MINIMIZE THE RESIDUALS !**

In [16]:
# Calculate new residual

df.loc[0, "residuals"] = df.loc[0, "Likes 'The Conjuring'"] - new_probability_first_row

In [17]:
# Display DataFrame
# with the updated first residual

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.333333
2,No,60,45,0,-0.666667
3,Yes,120,20,0,-0.666667
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


**Our residual value decreased from 0.333333 to 0.191832 - our model is learning !**
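
How far the probability moves in one step depends on the learning rate. The sketch below repeats the first-row update for a few learning rates; the set of rates is a hypothetical choice made only for comparison, and the leaf value 1.49 is the one used in the cells above.

```python
import math

def log_odds_to_probability(log_odds):
    """Logistic function used throughout this walkthrough."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

init_log_odds = math.log(4 / 2)
leaf_value = 1.49  # value of the rightmost leaf, as used for the first row

# Hypothetical learning rates, chosen only to compare step sizes
probabilities = {}
for learning_rate in (0.1, 0.5, 1.0):
    updated_log_odds = init_log_odds + learning_rate * leaf_value
    probabilities[learning_rate] = log_odds_to_probability(updated_log_odds)
    print(learning_rate, round(probabilities[learning_rate], 6))
```

With a small rate the prediction barely moves away from the initial 0.666667, which is why small rates usually need more trees.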

<br><br><br><br>

### We must update all the residuals in the same way as we did for the first row !

**Second row**

In [18]:
# Display second row of DataFrame

df.iloc[1:2]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
1,Yes,90,90,1,0.333333


In [19]:
# Update log odds value

learning_rate = 0.5

updated_log_odds_value_second_row = init_log_odds + learning_rate*(-0.75)

updated_log_odds_value_second_row 

0.3181471805599453

In [20]:
# Calculate new probability for second row

new_probability_second_row = (math.e**(updated_log_odds_value_second_row))  /  (1 + math.e**(updated_log_odds_value_second_row))

new_probability_second_row

0.578872639607127

In [21]:
# Calculate new residual

df.loc[1, "residuals"] = df.loc[1, "Likes 'The Conjuring'"] - new_probability_second_row

In [22]:
# Display DataFrame 
# with second residual updated

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.421127
2,No,60,45,0,-0.666667
3,Yes,120,20,0,-0.666667
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


* **as you can see, a model can improve the prediction in one case but make it worse in another case**
    <br>
    
    * the residual in the first row became smaller, but in this second row it became bigger
    * however, as long as most of the predictions improve we say that the model is learning

**Third row**

In [23]:
# Display third row of DataFrame

df.iloc[2:3]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
2,No,60,45,0,-0.666667


* goes to the same place in the tree as the second row ---> **the new probability that we will use to calculate the new residual is the same as the one for the second row because they are located in the same leaf**

In [24]:
# Calculate new probability for the third row

new_probability_third_row = new_probability_second_row

new_probability_third_row

0.578872639607127

In [25]:
# Calculate new residual

df.loc[2, "residuals"] = df.loc[2, "Likes 'The Conjuring'"] - new_probability_third_row

In [26]:
# Display DataFrame 
# with third residual updated

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.421127
2,No,60,45,0,-0.578873
3,Yes,120,20,0,-0.666667
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


**VERY IMPORTANT**

* when we say we are trying to minimize the residual values **we are talking about them becoming as close to 0 as possible**
* if a residual is negative, we actually want its value to increase until it becomes close to zero

**Fourth row**

In [27]:
# Display fourth row of DataFrame

df.iloc[3:4]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
3,Yes,120,20,0,-0.666667


In [28]:
# Update log odds value

learning_rate = 0.5

updated_log_odds_value_fourth_row = init_log_odds + learning_rate * (-3)

updated_log_odds_value_fourth_row 

-0.8068528194400547

In [29]:
# Calculate new probability for the fourth row

new_probability_fourth_row = (math.e**(updated_log_odds_value_fourth_row))  /  (1 + math.e**(updated_log_odds_value_fourth_row))

new_probability_fourth_row

0.3085615459637724

In [30]:
# Calculate new residual

df.loc[3, "residuals"] = df.loc[3, "Likes 'The Conjuring'"] - new_probability_fourth_row

In [31]:
# Display DataFrame 
# with fourth residual updated

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.421127
2,No,60,45,0,-0.578873
3,Yes,120,20,0,-0.308562
4,No,90,32,1,0.333333
5,No,60,14,1,0.333333


**Fifth row**

In [32]:
# Display fifth row of DataFrame

df.iloc[4:5]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
4,No,90,32,1,0.333333


* goes to the same place in the tree as the first row ---> **the new probability that we will use to calculate the new residual is the same as the one for the first row because they are located in the same leaf**

In [33]:
# Calculate new probability for the fifth row

new_probability_fifth_row = new_probability_first_row

new_probability_fifth_row

0.8081675677459842

In [34]:
# Calculate new residual

df.loc[4, "residuals"] = df.loc[4, "Likes 'The Conjuring'"] - new_probability_fifth_row

In [35]:
# Display DataFrame 
# with fifth residual updated

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.421127
2,No,60,45,0,-0.578873
3,Yes,120,20,0,-0.308562
4,No,90,32,1,0.191832
5,No,60,14,1,0.333333


**Sixth row**

In [36]:
# Display sixth row of DataFrame

df.iloc[5:]

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
5,No,60,14,1,0.333333


* goes to the same place in the tree as the first row ---> **the new probability that we will use to calculate the new residual is the same as the one for the first row because they are located in the same leaf**

In [37]:
# Calculate new probability for the sixth row

new_probability_sixth_row = new_probability_first_row

new_probability_sixth_row

0.8081675677459842

In [38]:
# Calculate new residual

df.loc[5, "residuals"] = df.loc[5, "Likes 'The Conjuring'"] - new_probability_sixth_row

In [39]:
# Display DataFrame 
# with sixth residual updated

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',residuals
0,Yes,60,15,1,0.191832
1,Yes,90,90,1,0.421127
2,No,60,45,0,-0.578873
3,Yes,120,20,0,-0.308562
4,No,90,32,1,0.191832
5,No,60,14,1,0.191832
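
With all six residuals updated, we can check whether the first tree moved the residuals closer to zero overall. This is a rough sketch that compares absolute residuals before and after, using the rounded values from the table above; it is only a quick sanity check, not the loss function that gradient boosting optimizes.

```python
labels = [1, 1, 0, 0, 1, 1]
initial_probability = 2 / 3

# Residuals before the first tree: label minus the initial probability
residuals_before = [label - initial_probability for label in labels]

# Residuals after the first tree, copied from the table above
residuals_after = [0.191832, 0.421127, -0.578873, -0.308562, 0.191832, 0.191832]

total_before = sum(abs(r) for r in residuals_before)
total_after = sum(abs(r) for r in residuals_after)

# True where a row's residual moved closer to zero
improved = [abs(after) < abs(before)
            for before, after in zip(residuals_before, residuals_after)]

print(round(total_before, 6), round(total_after, 6))
print(improved)
```

Five of the six residuals move closer to zero; only the second row gets worse, as noted earlier.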


**Add a column to the DataFrame that represents the probabilities that we used to calculate the new residuals**

* this is important for later, when we create the next tree
    <br>
    
    * in the next tree we build, we will need to plug these probabilities into the simplified equation that converts leaf residual values to log(odds) values

In [40]:
# Store all probabilities 
# in a list

list_of_new_probabilites = [
    new_probability_first_row,
    new_probability_second_row,
    new_probability_third_row,
    new_probability_fourth_row,
    new_probability_fifth_row,
    new_probability_sixth_row,   
]

In [41]:
# Create a Pandas Series
# out of the new probabilities

new_probabilites_series = pd.Series(list_of_new_probabilites)

In [42]:
# Add that Series as a column
# to the DataFrame

df["new probabilites"] = new_probabilites_series

In [43]:
# Reorder the columns in the DataFrame
# Not necessary, but makes it easier to analyze
# the DataFrame when we start creating 
# the second tree

df = df[["Likes horror movies", "Optimal movie length", "Age", 
         "Likes 'The Conjuring'", "new probabilites", "residuals"]]

In [44]:
# Display modified DataFrame
# with the new probabilities
# and new residuals

df

Unnamed: 0,Likes horror movies,Optimal movie length,Age,Likes 'The Conjuring',new probabilites,residuals
0,Yes,60,15,1,0.808168,0.191832
1,Yes,90,90,1,0.578873,0.421127
2,No,60,45,0,0.578873,-0.578873
3,Yes,120,20,0,0.308562,-0.308562
4,No,90,32,1,0.808168,0.191832
5,No,60,14,1,0.808168,0.191832
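
The row-by-row updates above can be condensed into a single loop. This is a minimal sketch using plain Python lists instead of the DataFrame so that it has no dependencies. The split rules in `first_tree_leaf` are an assumption inferred from the walkthrough text (optimal length of 120 minutes first, then age over 35); the actual tree is only shown in the image.

```python
import math

# Toy data from the walkthrough: (optimal movie length, age, label)
rows = [
    (60, 15, 1),
    (90, 90, 1),
    (60, 45, 0),
    (120, 20, 0),
    (90, 32, 1),
    (60, 14, 1),
]

# Leaf values of the first tree, as used in the cells above
LEAF_LEFT, LEAF_MIDDLE, LEAF_RIGHT = -3.0, -0.75, 1.49

def first_tree_leaf(length, age):
    """Assumed splits: length of 120 goes left, then age over 35 goes to the middle."""
    if length == 120:
        return LEAF_LEFT
    if age > 35:
        return LEAF_MIDDLE
    return LEAF_RIGHT

def log_odds_to_probability(log_odds):
    return math.exp(log_odds) / (1 + math.exp(log_odds))

init_log_odds = math.log(4 / 2)
learning_rate = 0.5

new_probabilities = []
new_residuals = []
for length, age, label in rows:
    log_odds = init_log_odds + learning_rate * first_tree_leaf(length, age)
    probability = log_odds_to_probability(log_odds)
    new_probabilities.append(probability)
    new_residuals.append(label - probability)

for probability, residual in zip(new_probabilities, new_residuals):
    print(f"{probability:.6f}  {residual:.6f}")
```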


<br>
<br>
<br>
<br>

## Fourth step: build second tree

* built the same way as the first tree with one key difference: this time the updated residuals are the values located inside the leaves of the tree

* this is how the second tree "learns" from the mistakes of the first tree

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_second_tree.png" width="600"/>


* afterwards, we need to calculate the log(odds) values for each leaf, using the formula that we used before when we were calculating log(odds) leaf values for the first tree
    <br>
    
    * this time the last predicted probability is not the same for every leaf, because we have predicted different probabilities for different leaves in the previous tree
    * what were our "new probabilities" for the previous tree are now the previous probability values for the second tree
    * once again, the equation we use is:
    
             sum of residuals / [sum of (previous probability  * (1 - previous probability))]


### Calculate leaf values for the second tree

* once again we will use rounded values for the sake of making the process easy to follow

**First calculation**

In [45]:
# Do first calculation

first_calculation_second_tree = 0.421127 / ((0.578873)*(1-0.578873))
first_calculation_second_tree

1.7274946318104316

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_second_tree_first_calculation.png" width="600"/>


**Second calculation**

In [46]:
# Do second calculation

second_calculation_second_tree = (-0.578873) / ((0.578873)*(1-0.578873))
second_calculation_second_tree 

-2.3745805897033434

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_second_tree_second_calculation.png" width="600"/>


**Third calculation**

In [47]:
# Do third calculation

third_calculation_second_tree = (0.191832*3 + (-0.308562)) /((0.808168*(1-0.808168))*3 + (0.308562*(1-0.308562)))
third_calculation_second_tree 

0.3934474400228691

<img src="https://edlitera-images.s3.amazonaws.com/boosting_detailed_explanation_second_tree_third_calculation.png" width="600"/>
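
The same leaf formula also works when every row carries its own previous probability. The sketch below recomputes the three leaf values of the second tree from (residual, previous probability) pairs. The leaf memberships are taken from the three calculations above, and the helper name is made up for illustration.

```python
def leaf_log_odds(leaf_rows):
    """leaf_rows: list of (residual, previous_probability) pairs in one leaf."""
    numerator = sum(residual for residual, _ in leaf_rows)
    denominator = sum(p * (1 - p) for _, p in leaf_rows)
    return numerator / denominator

# (residual, previous probability) pairs, rounded as in the table above
second_tree_leaves = {
    "first": [(0.421127, 0.578873)],
    "second": [(-0.578873, 0.578873)],
    "third": [(0.191832, 0.808168)] * 3 + [(-0.308562, 0.308562)],
}

second_tree_values = {
    name: leaf_log_odds(leaf_rows)
    for name, leaf_rows in second_tree_leaves.items()
}

for name, value in second_tree_values.items():
    print(name, round(value, 6))
```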


## Fifth step: update values again

* repeat the same process we had before when updating values

* this time however to calculate the new probability we go through two trees instead of one

<img src="https://edlitera-images.s3.amazonaws.com/full_second_calculation.png" width="1200"/>


* for each row, we follow both trees to calculate the new log(odds) value, convert it to a new probability prediction, and then compare it to the real value (1 or 0) to get the newest residual

* now we have a new version of the DataFrame that contains the latest predictions and the latest residuals

* this process repeats until the training is finished
    <br>
    
    * e.g. until we reach a certain number of iterations (trees)
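
As a rough sketch of this second update, the code below adds the contributions of both trees for every row. The leaf each row lands in is taken from the calculations earlier in this notebook, and the leaf values are the rounded ones used there. The variable names are made up for illustration.

```python
import math

def log_odds_to_probability(log_odds):
    return math.exp(log_odds) / (1 + math.exp(log_odds))

init_log_odds = math.log(4 / 2)
learning_rate = 0.5

labels = [1, 1, 0, 0, 1, 1]

# Leaf value each row lands in, one list per tree
tree_1_values = [1.49, -0.75, -0.75, -3.0, 1.49, 1.49]
tree_2_values = [0.393447, 1.727495, -2.374581, 0.393447, 0.393447, 0.393447]

probabilities = []
residuals = []
for label, leaf_1, leaf_2 in zip(labels, tree_1_values, tree_2_values):
    # Every tree contributes its leaf value, scaled by the learning rate
    log_odds = init_log_odds + learning_rate * leaf_1 + learning_rate * leaf_2
    probability = log_odds_to_probability(log_odds)
    probabilities.append(probability)
    residuals.append(label - probability)

for label, probability, residual in zip(labels, probabilities, residuals):
    print(label, round(probability, 6), round(residual, 6))
```

Printing the label next to the probability makes it easy to see which rows moved closer to their target and which moved away.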
