# Welcome to Week 9! 
This week, you will explore gradient boost, including learning rate, number of estimators, tree depth, and regularization.

## Learning Objectives
At the end of this week, you should be able to: 
- Explain how gradient boost works. 
- Construct a gradient-boosting regression model by iteratively fitting weak learners to residuals. 
- Compare the impact of different learning rates on model performance and complexity. 
- Evaluate how tree depth and number of estimators impact the risks of underfitting or overfitting. 
- Apply regularization techniques in gradient boosting, including limiting tree complexity, setting sample thresholds, and subsampling. 

## 9.1 Lesson: Gradient Boost
**Gradient boost** is related to random forest, but it builds its trees in a different way. 
- Gradient boost and random forest are both ensemble learning methods using decision trees, but they work differently. 
    - Random forest builds many independent trees in parallel and averages their results, making it robust and less prone to overfitting. 
    - Gradient boost builds trees sequentially, each one correcting the errors of the last. 

Let’s start by understanding gradient boost in the case of regression: 
- The idea is that we begin by building a simple decision tree, perhaps just a stump (a decision tree with only one split), that partly explains the data. 
- This simple decision tree or stump is called a “weak learner.” (Of course, any tree, or even a random forest, should only partly explain the data. If a random forest explains the training data exactly, that is probably a bad sign: it likely means we’ve overfit.) 
- Now, the weak learner makes certain predictions for each sample that are not exactly correct.  

To illustrate this: 
A real estate company could use gradient boosting to predict house prices based on features like square footage, location, and number of bedrooms. For three recent listings, let’s say the actual sale prices (in hundreds of thousands) were 1.8, 1.3, and 2.6, respectively:

In [24]:
import numpy as np
import pandas as pd
sample_1, sample_2, sample_3 = 1.8, 1.3, 2.6

Our first prediction would generally be a single value across the board, which in this case might be the average of the three values, or 1.9. 

In [25]:
prediction_1 = np.mean([sample_1, sample_2, sample_3])
print(f"Initial prediction: {prediction_1}")

Initial prediction: 1.9000000000000001


In that case, the residual (what remains to be explained) is whatever’s left, e.g., 
- $1.8 - 1.9 = -0.1$ for the first sample, and 
- $1.3 - 1.9 = -0.6$ for the second sample. 

So the initial residuals are: 

In [26]:
sample_1_residual = sample_1 - prediction_1
sample_2_residual = sample_2 - prediction_1
sample_3_residual = sample_3 - prediction_1
print(f"Residuals: {sample_1_residual}, {sample_2_residual}, {sample_3_residual}")

Residuals: -0.10000000000000009, -0.6000000000000001, 0.7


Now, we’re going to build our decision stump. We are trying to predict the residuals; suppose the predictions are: 

In [27]:
sample_1_prediction = -0.4
sample_2_prediction = -0.4
sample_3_prediction = 0.9

As with any decision tree, we’d have to base this one on a feature that gives us this stump. 

Maybe sample 1 and 2 are labeled “orange,” and sample 3 is labeled “blue,” which enables us to assign the same value to 1 and 2 and a different value to sample 3. 
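The values $-0.4$ and $0.9$ above are illustrative. As a point of comparison, a stump trained to minimize squared error predicts the mean residual within each leaf. The sketch below assumes the hypothetical orange/blue split just described and computes those leaf means with plain NumPy; the names `colors`, `leaf_values`, and `stump_predictions` are introduced here for illustration only.

```python
import numpy as np

# Residuals from the walkthrough: actual price minus the initial prediction of 1.9.
residuals = np.array([1.8, 1.3, 2.6]) - 1.9

# Hypothetical feature: samples 1 and 2 are "orange", sample 3 is "blue".
colors = np.array(["orange", "orange", "blue"])

# A least-squares stump that splits on color predicts the mean residual in each leaf.
leaf_values = {c: residuals[colors == c].mean() for c in ("orange", "blue")}
stump_predictions = np.array([leaf_values[c] for c in colors])

print(f"Leaf values: {leaf_values}")
print(f"Stump predictions: {stump_predictions}")
```

Under these assumptions the orange leaf predicts about $-0.35$ and the blue leaf about $0.7$. The walkthrough continues with the rounder illustrative values.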

Then the total predictions so far are: 

In [28]:
sample_1_total = prediction_1 + sample_1_prediction
sample_2_total = prediction_1 + sample_2_prediction
sample_3_total = prediction_1 + sample_3_prediction

print(f"Sample 1 total: {sample_1_total}, Sample 2 total: {sample_2_total}, Sample 3 total: {sample_3_total}")

Sample 1 total: 1.5, Sample 2 total: 1.5, Sample 3 total: 2.8000000000000003


Alternatively, we could apply a *learning rate* (LR) of $0.5$, meaning that we scale our predictions of the residuals (to $0.5 \times (-0.4) = -0.2$ and $0.5 \times 0.9 = 0.45$) before adding them:

In [29]:
learning_rate = 0.5

# Scale each residual prediction by the learning rate.
sample_1_scaled = learning_rate * sample_1_prediction
sample_2_scaled = learning_rate * sample_2_prediction
sample_3_scaled = learning_rate * sample_3_prediction
print(f"Sample 1 scaled prediction: {sample_1_scaled}\nSample 2 scaled prediction: {sample_2_scaled}\nSample 3 scaled prediction: {sample_3_scaled}")

# Add the scaled predictions to the initial prediction to get the new totals.
sample_1_total_lr = prediction_1 + sample_1_scaled
sample_2_total_lr = prediction_1 + sample_2_scaled
sample_3_total_lr = prediction_1 + sample_3_scaled
print(f"Sample 1 total with learning rate: {sample_1_total_lr}\nSample 2 total with learning rate: {sample_2_total_lr}\nSample 3 total with learning rate: {sample_3_total_lr}")

Sample 1 scaled prediction: -0.2
Sample 2 scaled prediction: -0.2
Sample 3 scaled prediction: 0.45
Sample 1 total with learning rate: 1.7000000000000002
Sample 2 total with learning rate: 1.7000000000000002
Sample 3 total with learning rate: 2.35


This leads to new residuals:

In [30]:
sample_1_residual_2 = sample_1 - sample_1_total_lr
sample_2_residual_2 = sample_2 - sample_2_total_lr
sample_3_residual_2 = sample_3 - sample_3_total_lr
print(f"New residuals: {sample_1_residual_2}, {sample_2_residual_2}, {sample_3_residual_2}")

New residuals: 0.09999999999999987, -0.40000000000000013, 0.25


Our next job is to build a second weak learner (a decision stump) to fit this new set of residuals, and so on.
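In symbols, the procedure so far can be summarized as follows. The notation is introduced here for illustration and does not come from the walkthrough: $y_i$ is the actual value for sample $i$, $F_m$ is the combined prediction after $m$ weak learners, $h_m$ is the $m$-th weak learner (fit to the residuals $r^{(m)}$), and $\nu$ is the learning rate.

$$
F_0(x_i) = \frac{1}{n}\sum_{j=1}^{n} y_j, \qquad
r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad
F_m(x_i) = F_{m-1}(x_i) + \nu \, h_m(x_i)
$$

With the numbers above, $F_0 = 1.9$, $\nu = 0.5$, and $h_1$ predicts $-0.4$, $-0.4$, $0.9$, so $F_1$ is $1.7$, $1.7$, $2.35$ and the next residuals $r^{(2)}$ are $0.1$, $-0.4$, $0.25$.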

The model iteratively improves its predictions by minimizing the error between predicted and actual values. This allows our real estate company to price future house listings more accurately.

That's gradient boost: We can keep building more stumps to fit the residuals more and more precisely. A typical gradient boost might fit 50 to 500 stumps in this way to achieve a high degree of accuracy. Gradient boost is one of the most powerful models available, achieving good accuracy with a relatively small total number of nodes. 
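The whole loop can be sketched in a few lines. This is a minimal illustration, not a production implementation: the toy data (`x`, `y`) is hypothetical, the helpers `fit_stump`, `predict_stump`, and `predict` are names made up for this example, and the stump is found by a brute-force search over split thresholds on a single numeric feature.

```python
import numpy as np

# Hypothetical toy data (not from the lesson): square footage in thousands
# and sale price in hundreds of thousands.
x = np.array([1.0, 1.2, 1.5, 1.8, 2.1, 2.4])
y = np.array([1.3, 1.5, 1.8, 2.0, 2.4, 2.6])

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    xs = np.unique(x)  # sorted unique feature values
    for threshold in (xs[:-1] + xs[1:]) / 2:  # midpoints between neighbors
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

learning_rate = 0.5
n_estimators = 20

# Start from a single constant prediction: the mean of the targets.
initial_prediction = y.mean()
current = np.full(len(y), initial_prediction)

stumps = []
for _ in range(n_estimators):
    residuals = y - current                 # what remains to be explained
    stump = fit_stump(x, residuals)         # weak learner fit to the residuals
    current = current + learning_rate * predict_stump(stump, x)
    stumps.append(stump)

def predict(x_new):
    """Initial prediction plus the scaled contribution of every stump."""
    x_new = np.asarray(x_new, dtype=float)
    total = np.full(len(x_new), initial_prediction)
    for stump in stumps:
        total = total + learning_rate * predict_stump(stump, x_new)
    return total

mse_start = np.mean((y - initial_prediction) ** 2)
mse_end = np.mean((y - current) ** 2)
print(f"Training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

Calling `predict([1.6])` would return the boosted estimate for a hypothetical 1,600-square-foot listing. The training error falls with each round because every stump is fit to whatever the previous rounds left unexplained.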

### Learning Rate and Number of Estimators
If the learning rate is set to a small value, gradient boost moves toward a solution gradually, taking many small steps to get there. Training takes longer and the resulting model is more complex (it contains more trees), but it may be more accurate because each tree makes only a small correction, letting the ensemble fit the data in finer increments. 

Small learning rates require a large number of trees (i.e., a large number of weak learners) to reach a given level of fit, while high learning rates result in faster training with a smaller number of weak learners.
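The trade-off between learning rate and number of trees can be checked empirically. The sketch below assumes scikit-learn's `GradientBoostingRegressor` and a synthetic dataset invented for this example; `trees_needed` and `target_mse` are hypothetical names. It counts how many trees each learning rate needs before the training error drops below a fixed target, using `staged_predict` to get the prediction after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data for illustration only: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Arbitrary target: training MSE at or below 25% of the variance of y.
target_mse = 0.25 * np.var(y)

def trees_needed(learning_rate, max_trees=500):
    """Number of trees required to reach target_mse on the training data."""
    model = GradientBoostingRegressor(
        learning_rate=learning_rate,
        n_estimators=max_trees,
        max_depth=2,
        random_state=0,
    )
    model.fit(X, y)
    # staged_predict yields the ensemble's prediction after 1, 2, 3, ... trees.
    for n_trees, pred in enumerate(model.staged_predict(X), start=1):
        if np.mean((y - pred) ** 2) <= target_mse:
            return n_trees
    return None  # target not reached within max_trees

results = {lr: trees_needed(lr) for lr in (0.05, 0.5)}
print(results)
```

On this data the smaller learning rate should need noticeably more trees to reach the same training error. Note that this measures fit to the training data only; whether the slower model generalizes better has to be judged on held-out data.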

### Tree Depth
We can also choose the tree depth of each weak learner. Very deep trees tend to overfit, while shallow trees are more likely to underfit, as they have a harder time picking up on complex relationships between features. A tree is "deep" if it has many layers, allowing it to fit data in a complex way.
- For instance, if there are 1,000 samples, then in principle they could be fit perfectly by a tree with $10$ layers (up to $2^{10} = 1{,}024$ leaves).
- The problem is that if we fit the training data perfectly, we are likely overfitting, meaning that we fit this particular dataset exactly but cannot fit other data from the same population. On the other hand, underfitting means we do not fit even the training data well.
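One way to see the effect of depth is to compare training and test scores across several depths. The sketch below again assumes scikit-learn and a synthetic noisy dataset made up for this example; the depths tried (1, 3, 8) and the other settings are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for max_depth in (1, 3, 8):
    model = GradientBoostingRegressor(
        max_depth=max_depth,
        n_estimators=100,
        learning_rate=0.1,
        random_state=0,
    )
    model.fit(X_train, y_train)
    # score() returns R^2; store (training R^2, test R^2) for each depth.
    scores[max_depth] = (model.score(X_train, y_train), model.score(X_test, y_test))

for max_depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={max_depth}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

A large gap between training and test $R^2$ at the deepest setting would suggest overfitting, while low scores on both would suggest underfitting. The exact numbers depend on the data and the random seed.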

### Regularization
Gradient boost can be regularized by (1) penalizing complexity (the number of leaves), (2) setting a minimum number of samples per leaf, or (3) using a subsample when training a given weak learner (e.g., if there are 1,000 samples, use a smaller number than 1,000 to train each weak learner).
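As a rough illustration, the three options map onto parameters of scikit-learn's `GradientBoostingRegressor`. This is a sketch under assumptions: the data is synthetic, the parameter values are arbitrary, and `max_leaf_nodes` is a hard cap on the number of leaves rather than a penalty, so it only stands in for option (1).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data for illustration only: 1,000 samples of a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)

model = GradientBoostingRegressor(
    max_leaf_nodes=4,      # (1) limit complexity: at most 4 leaves per tree
    min_samples_leaf=20,   # (2) every leaf must contain at least 20 samples
    subsample=0.8,         # (3) each tree is trained on a random 80% of the samples
    n_estimators=100,
    learning_rate=0.1,
    random_state=0,
)
model.fit(X, y)

# estimators_ holds the fitted trees, one per boosting round for regression.
leaf_counts = [tree.get_n_leaves() for tree in model.estimators_[:, 0]]
print(f"Largest number of leaves in any tree: {max(leaf_counts)}")
```

With `subsample=0.8`, each weak learner here sees 800 of the 1,000 samples, which matches the idea in (3) of training on fewer samples than the full dataset.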

### Think About It:
- How is gradient boost similar to random forest?
- How is gradient boost different from random forest? 
- Does a small learning rate generally lead to higher or lower accuracy? Why? 