___

# Machine Learning in Geosciences
Department of Applied Geoinformatics and Cartography, Charles University

Lukas Brodsky lukas.brodsky@natur.cuni.cz


## Exercise: Boosting Early Stopping technique

This notebook is dedicated to early stopping in boosting models.

**Objective**:
Understand how the training and testing errors of a gradient boosting model evolve as trees are added, and implement early stopping on a small synthetic dataset to decide when to stop adding them.

Tasks: 
1. Implement the gradient boosting algorithm using the scikit-learn `GradientBoostingRegressor` class.
2. Train the boosting model with 1 to 199 estimators and measure the training and testing error at each step.
3. Implement the **early stopping** procedure and plot the model performance, marking the number of trees at which boosting stopped.

In [None]:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt

In [None]:
np.random.seed(42)

In [None]:
# Data 
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [None]:
plt.plot(X, y, 'b.')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=49)

### Gradient Boosting Model - error evolution

**To implement:** 

`
Algorithm GradientBoosting:
    Initialize gbrt as GradientBoostingRegressor with:
        max_depth = 2
        warm_start = True
        random_state = 42

    Initialize train_err as an empty list
    Initialize test_err as an empty list

    For n_estimators from 1 to 199 do:
        Set gbrt.n_estimators to n_estimators
        Train gbrt using X_train and y_train

        // Compute training error
        Predict y_train_pred using gbrt on X_train
        Compute train_error as mean squared error between y_train and y_train_pred
        Append train_error to train_err list

        // Compute test error
        Predict y_pred using gbrt on X_test
        Compute test_error as mean squared error between y_test and y_pred
        Append test_error to test_err list
End Algorithm
`
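
The pseudocode above can be sketched as follows. This is one possible solution, not the reference one: it regenerates the same synthetic data so that it runs on its own, and the name `best_n` is an addition that does not appear in the exercise. Because `warm_start=True`, each call to `fit` keeps the trees already trained and only adds the missing ones.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cells above
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

train_err = []
test_err = []

for n_estimators in range(1, 200):
    # With warm_start=True, fit() only trains the newly requested trees
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)

    # Training error
    y_train_pred = gbrt.predict(X_train)
    train_err.append(mean_squared_error(y_train, y_train_pred))

    # Testing error
    y_pred = gbrt.predict(X_test)
    test_err.append(mean_squared_error(y_test, y_pred))

# best_n (an assumed helper name): number of trees with the lowest testing error
best_n = int(np.argmin(test_err)) + 1
print(best_n, min(test_err))
```

Both lists end up with 199 entries, which matches the `range(1, 200)` x-axis used in the plotting cell.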

In [None]:
# Measure the training and testing error of the boosting model for 1 to 199 estimators
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

pass  # TODO: fill the lists train_err and test_err following the algorithm above

In [None]:
# Plot the evolution of the training and testing errors
n_trees = list(range(1, 200))
plt.plot(n_trees, train_err, 'g-', label='Training error')
plt.plot(n_trees, test_err, 'b-', label='Testing error')
plt.ylim(0.0, 0.008)
plt.xlabel('Number of trees')
plt.ylabel('Mean squared error')
plt.title('Training and testing error')
plt.legend()

## When to stop the model learning?


### Implement an early stopping procedure with `error_going_up` = 5 to find the best model

Stop training once the testing error has failed to improve on its minimum **five times in a row** during the iterative learning.

**Early stopping:**

`
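Algorithm EarlyStopping:
    Initialize min_val_error to infinity
    Initialize error_going_up to 0

    For n_estimators from 1 to 199 do:
        Set gbrt.n_estimators to n_estimators
        Train gbrt using X_train and y_train
        Compute val_error as mean squared error between y_test and the prediction on X_test

        If val_error < min_val_error Then:
            Set min_val_error to val_error
            Set error_going_up to 0
        Else:
            Increment error_going_up by 1
            If error_going_up equals 5 Then:
                Break the loop
End Algorithm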
`
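
A minimal sketch of the early stopping loop, assuming the same data, split and model settings as earlier in the notebook. The names `patience` and `best_n` are illustrative additions, not part of the exercise. Note that the counter is reset whenever a new minimum is reached, so the loop stops after five consecutive iterations without improvement.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cells above
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
patience = 5  # assumed name for the "five times" threshold

for n_estimators in range(1, 200):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_test)
    val_error = mean_squared_error(y_test, y_pred)

    if val_error < min_val_error:
        # New best testing error: remember it and reset the counter
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == patience:
            break  # early stopping

# The lowest error was reached error_going_up iterations before the last fit
best_n = gbrt.n_estimators - error_going_up
print(gbrt.n_estimators, best_n, min_val_error)
```

After the loop, `gbrt.n_estimators` is the point at which training stopped, which is what the `STOP` line in the plot marks. The model with the lowest testing error has `error_going_up` fewer trees than that.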

In [None]:
# Run the early stopping algorithm
# Hints:
#   val_error = mean_squared_error(y_test, y_pred)
#   min_val_error = float("inf")
#   error_going_up = 0

pass  # TODO: implement the loop from the EarlyStopping algorithm above

In [None]:
print(gbrt.n_estimators)

In [None]:
# Plot the evolution of the training and testing errors with the stopping point
n_trees = list(range(1, 200))
plt.plot(n_trees, train_err, 'g-', label='Training error')
plt.plot(n_trees, test_err, 'b-', label='Testing error')
plt.axvline(gbrt.n_estimators, color='red', label='STOP')
plt.ylim(0.0, 0.008)
plt.xlabel('Number of trees')
plt.ylabel('Mean squared error')
plt.title('Training and testing error')
plt.legend()

In [None]:
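# Data for plotting the model: a regular grid over the feature range
# Note: predict() expects a 2-D array, so X_sim needs reshaping to (-1, 1) first
X_sim = np.linspace(-0.5, 0.5, 100)
pass  # TODO: predict the model output on the reshaped X_sim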

In [None]:
# Model prediction plot 
pass 
# plot data 
# plot model 
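
One way to draw the prediction plot is sketched below. To keep the example self-contained it fits a stand-in model with an arbitrary `n_estimators=50` on the full data. In the notebook you would instead reuse the `gbrt` model left by the early stopping loop. The names `model` and `y_sim` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# Same synthetic data as in the cells above
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Stand-in model; n_estimators=50 is an arbitrary choice for illustration
model = GradientBoostingRegressor(max_depth=2, n_estimators=50, random_state=42)
model.fit(X, y)

# predict() expects shape (n_samples, n_features), hence the reshape
X_sim = np.linspace(-0.5, 0.5, 100).reshape(-1, 1)
y_sim = model.predict(X_sim)

plt.plot(X, y, 'b.', label='Data')                      # plot data
plt.plot(X_sim, y_sim, 'r-', label='Model prediction')  # plot model
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
```

Because the base learners are shallow trees, the predicted curve is piecewise constant rather than a smooth parabola.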