# Unit 4 Gradient Boosting in Machine Learning

Hello there! Today, we're going to explore Gradient Boosting, a powerful technique that improves the accuracy of machine learning models. Our goal is to understand what **Gradient Boosting** is, how it works, and how to use it with a real example in Python. By the end, you'll know how to implement **Gradient Boosting** and apply it to a dataset.

## What is Gradient Boosting?

**Gradient Boosting** is an ensemble technique that combines multiple weak learners, usually decision trees, to form a stronger, more accurate model. Unlike Bagging and **Random Forests**, which create models independently, **Gradient Boosting** builds models sequentially. Each new model aims to correct errors made by the previous ones.

Imagine baking a cake. The first cake might not be perfect — maybe too dry or not sweet enough. The next time, you make changes to improve it based on previous errors. Over time, you get closer to perfection. This is how **Gradient Boosting** works.

Here's a step-by-step explanation of **Gradient Boosting**:

1.  **Start with an initial model:** This is a deliberately simple model, such as a constant prediction or a single shallow decision tree.
2.  **Calculate errors:** Find out where the current model makes mistakes.
3.  **Build the next model:** Create a new model that focuses on correcting those errors. The errors are measured through the gradient of the loss function, which gives the method its name.
4.  **Combine models:** Add the new model to the existing ones to create a stronger model.
5.  **Repeat:** Continue this process until the desired accuracy is achieved or a specified number of models is built.

Consider tuning a musical instrument. Initially, it may be out of tune. By fine-tuning each string separately, you reduce the overall error (or off-tune sound) until the instrument sounds perfect.
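The steps above can be sketched by hand for a small regression problem. This is a minimal illustration, not how `scikit-learn` implements the technique internally. The toy data, the squared-error loss, the constant initial model, and the names `learning_rate`, `n_stages`, and `mse_history` are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only, not part of the lesson's dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # assumed value; shrinks each tree's contribution
n_stages = 50        # assumed number of boosting rounds

# Step 1: start with a very simple model (here, the mean of the targets)
prediction = np.full_like(y_toy, y_toy.mean())
trees = []
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_stages):
    # Step 2: errors of the current ensemble; for squared error,
    # the residuals equal the negative gradient of the loss
    residuals = y_toy - prediction

    # Step 3: fit a small tree that predicts those residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)

    # Step 4: add the new tree's (shrunken) output to the ensemble
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

# Step 5 is the loop itself: repeat for a fixed number of stages
print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_stages} stages: {mse_history[-1]:.4f}")
```

Each round lowers the training error a little, which mirrors the "improve on the last attempt" idea from the analogies.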

### Gradient Boosting vs. AdaBoost

**Gradient Boosting** and **AdaBoost** are both boosting techniques, but they differ in how each new weak learner is trained.

  * **AdaBoost:** Each subsequent model focuses more on the instances that previous models misclassified. It assigns weights to instances, increasing weights for those that are hard to classify.
  * **Gradient Boosting:** Each subsequent model is fit to the negative gradient of a loss function with respect to the current ensemble's predictions. For squared-error loss, that negative gradient is simply the residual error. The new learner therefore points in the direction that reduces the error of the whole ensemble.
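To make the link between gradients and residual errors concrete, here is a small numerical check. It assumes a squared-error loss of the form `0.5 * (target - pred) ** 2`. The arrays `y` and `F` are made-up targets and ensemble predictions, and `squared_loss` is a hypothetical helper defined only for this example.

```python
import numpy as np

# Made-up targets and current ensemble predictions (illustrative values)
y = np.array([3.0, -1.0, 2.5])
F = np.array([2.0, 0.0, 2.0])

def squared_loss(pred, target):
    """Per-sample squared-error loss (assumed form for this sketch)."""
    return 0.5 * (target - pred) ** 2

# Approximate the gradient of the loss with respect to each prediction
eps = 1e-6
numeric_grad = (squared_loss(F + eps, y) - squared_loss(F - eps, y)) / (2 * eps)

# The residuals are what the next weak learner would be trained to predict
residuals = y - F

print("Negative gradient:", -numeric_grad)
print("Residuals:        ", residuals)
print("Match:", np.allclose(-numeric_grad, residuals))
```

For other losses the negative gradient is no longer the plain residual, but the next learner is still trained to predict it.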

## Loading and Preparing the Dataset

Before we dive into coding, let's understand why datasets are crucial. A good dataset allows us to train and test our machine learning models effectively. We'll use the `load_digits` function from `scikit-learn`, which provides a real-world dataset of 1,797 grayscale images of handwritten digits (0 to 9), each 8x8 pixels and flattened into 64 features.

```python
from sklearn.datasets import load_digits

# Load real dataset
X, y = load_digits(return_X_y=True)
```

We need to split this dataset into training and testing sets to evaluate our model properly. Here's how we do it in Python:

```python
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
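A quick sanity check after splitting is to look at the shapes of the resulting arrays. The sketch below repeats the loading and splitting steps so that it runs on its own; the printed labels are only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each row is one image; each column is one pixel of the flattened 8x8 image
print("Full dataset:", X.shape)
print("Training set:", X_train.shape)
print("Test set:    ", X_test.shape)

# The labels are the digits the images represent
print("Classes:", sorted(set(y.tolist())))
```

With `test_size=0.2`, roughly one fifth of the samples end up in the test set and the rest are used for training.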

While this task is usually solved using deep learning, namely Convolutional Neural Networks, we can also approach it using simpler classifiers, including `GradientBoostingClassifier` and `AdaBoostClassifier`.

## Training and Testing a Gradient Boosting Classifier

Now let's train a Gradient Boosting model using `GradientBoostingClassifier` from `scikit-learn`. Here's the basic code to train our model:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Train a gradient boosting classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)

# Evaluate the model
y_pred = gb_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test data: {accuracy:.2f}")  # Accuracy on test data: 0.97
```

In this code:

  * `GradientBoostingClassifier` is the model we're using.
  * `n_estimators=100` means we'll build 100 weak learners (decision trees).
  * `random_state=42` ensures reproducibility.
  * `fit(X_train, y_train)` trains the model on our training data.
  * `predict(X_test)` generates predictions on the test data.

Lastly, we calculate the accuracy using the `accuracy_score` function.

The `n_estimators` parameter is crucial because it determines the number of boosting stages, or how many times we refine our model. If set too low, our model might not be accurate enough (underfitting). If set too high, our model might become too complex (overfitting).
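One way to watch this effect without retraining is `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below is an illustration under assumptions: it uses 30 stages to keep the run short, and the variable names `staged_clf`, `train_acc`, and `test_acc` are chosen for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit once; 30 stages is an arbitrary choice to keep the example fast
staged_clf = GradientBoostingClassifier(n_estimators=30, random_state=42)
staged_clf.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators stages
train_acc = [accuracy_score(y_train, p) for p in staged_clf.staged_predict(X_train)]
test_acc = [accuracy_score(y_test, p) for p in staged_clf.staged_predict(X_test)]

for stage in (1, 5, 10, 20, 30):
    print(f"{stage:>2} stages: train={train_acc[stage - 1]:.2f}, test={test_acc[stage - 1]:.2f}")
```

A large gap between training and test accuracy at later stages is a hint of overfitting, while low accuracy on both at early stages points to underfitting.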

## Comparing Gradient Boosting to AdaBoost and RandomForest

Let's compare **GradientBoosting**, **AdaBoost**, and **RandomForest** classifiers using the same dataset and parameters:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Train an AdaBoost classifier
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42, algorithm='SAMME')
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print(f"Accuracy for AdaBoost on test data: {accuracy_ada:.2f}")  # Accuracy for AdaBoost on test data: 0.83

# Train a Random Forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy for RandomForest on test data: {accuracy_rf:.2f}")  # Accuracy for RandomForest on test data: 0.97
```

We train all models on the same data and compare their accuracies on the testing set. Main conclusions:

  * The `GradientBoostingClassifier` outperforms the `AdaBoostClassifier` on this dataset.
  * The `GradientBoostingClassifier` shows comparable performance to the `RandomForestClassifier`.
  * The `GradientBoostingClassifier` is a candidate to be the main model for this task.
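A single train/test split can make such a comparison somewhat noisy. As a hedged sketch, the same three model types can be compared with cross-validation through `cross_val_score`. The settings `n_estimators=20` and `cv=3` are assumptions chosen to keep the run short, so the numbers will not match the ones reported above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Smaller ensembles than in the lesson, purely to keep this example quick
models = {
    "GradientBoosting": GradientBoostingClassifier(n_estimators=20, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # Each model is trained and scored on 3 different train/validation splits
    scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```

Looking at the mean and spread across folds gives a better sense of whether a difference between two models is consistent.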

## Lesson Summary

Well done! In this lesson, we learned about **Gradient Boosting**. We covered what it is, how it works, and how to implement it in Python using a real dataset.

To recap:

  * **Gradient Boosting** builds models sequentially to correct errors from previous models.
  * We used the `load_digits` dataset to train and test our Gradient Boosting model.
  * The `GradientBoostingClassifier` from `scikit-learn` allowed us to easily implement this technique.
  * We compared **Gradient Boosting** with **AdaBoost** and **RandomForest** to see how they perform on the same dataset.

Now that we've covered the theory, it's time for some hands-on practice. You'll apply these concepts to new datasets and fine-tune model parameters to see how **Gradient Boosting** can improve model performance.

Ready to get started? Let's move to the practice session!

## Adjust Gradient Boosting Estimators

In this task, you'll change the number of weak learners in the Gradient Boosting model from 5 to 25. This small tweak demonstrates how the number of boosting stages impacts model performance.

Let's see the difference!


```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load and split dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=5, random_state=42)
gb_clf.fit(X_train, y_train)

# Model evaluation
accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print(f"Gradient Boosting accuracy: {accuracy:.2f}")

```

Here is the solution, with `n_estimators` changed from 5 to 25:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load and split dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting model with n_estimators changed from 5 to 25
gb_clf = GradientBoostingClassifier(n_estimators=25, random_state=42)
gb_clf.fit(X_train, y_train)

# Model evaluation
accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print(f"Gradient Boosting accuracy: {accuracy:.2f}")

```

## Complete the Gradient Boosting Setup for Digit Classification

Great job, fellow Space Explorer! Let's level up the challenge.

Complete the missing pieces of the code to train a Gradient Boosting classifier on the `load_digits` dataset and evaluate its performance.


```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
X, y = load_digits(return_X_y=True)

# TODO: Split dataset and train Gradient Boosting classifier

# Evaluate model performance with accuracy
accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print(f"Gradient Boosting Accuracy: {accuracy:.2f}")

```


Here is one possible solution:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
X, y = load_digits(return_X_y=True)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Gradient Boosting classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42) # Using 100 estimators as a common default
gb_clf.fit(X_train, y_train)

# Evaluate model performance with accuracy
accuracy = accuracy_score(y_test, gb_clf.predict(X_test))
print(f"Gradient Boosting Accuracy: {accuracy:.2f}")

```

## Gradient Boosting vs. AdaBoost on Synthetic Data

Great!

Now, let's compare our two boosting models.

Complete the missing parts to train an AdaBoost and a Gradient Boosting classifier and evaluate their accuracy using the given synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Train an AdaBoost classifier with 50 estimators, fit to training data, and evaluate accuracy

# TODO: Train a Gradient Boosting classifier with 50 estimators, fit to training data, and evaluate accuracy

# Print accuracies
print(f"Accuracy for AdaBoost: {accuracy_ada:.2f}")
print(f"Accuracy for Gradient Boosting: {accuracy_gb:.2f}")

```


Here is one possible solution:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an AdaBoost classifier with 50 estimators, fit to training data, and evaluate accuracy
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)

# Train a Gradient Boosting classifier with 50 estimators, fit to training data, and evaluate accuracy
gb_clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)

# Print accuracies
print(f"Accuracy for AdaBoost: {accuracy_ada:.2f}")
print(f"Accuracy for Gradient Boosting: {accuracy_gb:.2f}")

```

## Comparing Model Efficiency

Every explorer needs the right tools to succeed. In this task, we create a small synthetic dataset and train Gradient Boosting, Random Forest, and AdaBoost models on it. Your goal is to compare the time it takes to train each model. Remember, good predictions are not the only factor when choosing the right model; we also care about efficiency!

The first one is done for you; use it as an example.

Embark on your journey, Space Voyager!

```python
import numpy as np
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier

# Create a small synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Train and time the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
start_time = time.time()
gb_clf.fit(X, y)
gb_time = time.time() - start_time

# TODO: Train and time the Random Forest model

# TODO: Train and time the AdaBoost model

print(f"Gradient Boosting training time: {gb_time:.2f} seconds")
print(f"Random Forest training time: {rf_time:.2f} seconds")
print(f"AdaBoost training time: {ab_time:.2f} seconds")

```


Here is one possible solution:

```python
import numpy as np
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier

# Create a small synthetic dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Train and time the Gradient Boosting model
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
start_time = time.time()
gb_clf.fit(X, y)
gb_time = time.time() - start_time

# Train and time the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
start_time = time.time()
rf_clf.fit(X, y)
rf_time = time.time() - start_time

# Train and time the AdaBoost model
ab_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
start_time = time.time()
ab_clf.fit(X, y)
ab_time = time.time() - start_time

print(f"Gradient Boosting training time: {gb_time:.2f} seconds")
print(f"Random Forest training time: {rf_time:.2f} seconds")
print(f"AdaBoost training time: {ab_time:.2f} seconds")

```
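The repeated start/stop pattern can also be wrapped in a small helper. The sketch below is one possible way to do it. `time_fit` is a hypothetical helper written for this example, and it uses `time.perf_counter`, a standard-library clock intended for measuring short durations. The smaller dataset and `n_estimators=20` are assumptions to keep the run brief.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier

def time_fit(model, X, y):
    """Hypothetical helper: return the seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

# Smaller settings than in the task, chosen only to keep the example quick
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=42),
}

# Time each model's training on the same data
timings = {name: time_fit(model, X, y) for name, model in models.items()}

for name, seconds in timings.items():
    print(f"{name} training time: {seconds:.2f} seconds")
```

Timings vary between machines and between runs, so it is worth repeating the measurement a few times before drawing conclusions.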

## Gradient Boosting with Varying Estimators

Hey there, Space Pioneer! You're doing great so far.

Let’s tweak the model a bit. Add the missing pieces to train and evaluate the `GradientBoostingClassifier`.

Good luck, and may the force of data be with you!


```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load the digits dataset
X, y = load_digits(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of n_estimators to try
n_estimators_list = list(range(1, 31, 7))  # Trying fewer values for simplicity
accuracies = []

# Train and test models with different n_estimators
for n in n_estimators_list:
    # TODO: define the GradientBoostingClassifier with the given n_estimators (n)
    # TODO: make predictions on the test set and calculate the accuracy
    # TODO: append the obtained accuracy to the accuracies list

# Plot the results
plt.plot(n_estimators_list, accuracies, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Gradient Boosting Accuracy vs Number of Estimators')
plt.show()

```

Here is one possible solution:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Load the digits dataset
X, y = load_digits(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# List of n_estimators to try
n_estimators_list = list(range(1, 31, 7))  # Trying fewer values for simplicity
accuracies = []

# Train and test models with different n_estimators
for n in n_estimators_list:
    # define the GradientBoostingClassifier with the given n_estimators (n)
    gb_clf = GradientBoostingClassifier(n_estimators=n, random_state=42)
    gb_clf.fit(X_train, y_train)

    # make predictions on the test set and calculate the accuracy
    y_pred = gb_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # append the obtained accuracy to the accuracies list
    accuracies.append(accuracy)

# Plot the results
plt.plot(n_estimators_list, accuracies, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Gradient Boosting Accuracy vs Number of Estimators')
plt.show()

```