# Unit 5 Stacking in Machine Learning

Hey there! 😊 In this lesson, we'll explore an exciting machine learning technique called **"Stacking."** You might wonder why we’re learning about stacking and how it helps us make better predictions. Well, imagine asking several experts for their opinions and combining them to make a decision. By the end of this lesson, you'll know how to implement and use stacking to boost model performance!

-----

## Introduction to Stacking

Let's dive into **stacking**. Stacking is an ensemble technique combining multiple models (**base models**) to produce a final prediction using another model (**meta-model**). Think of base models as chefs, and the meta-model as a food critic who tastes all the dishes and decides the final rating.

### How Does Stacking Work?

  * **Training Base Models:** We train multiple base models on the same dataset. Each model brings its own strengths and captures different aspects of the data.
  * **Generating Meta-Data:** Using the base models' predictions, we generate a new dataset (**meta-data**) whose features are the predictions of all base models. To keep the meta-model from learning on predictions the base models made for data they had already seen, these predictions are typically produced with cross-validation, which is what scikit-learn's `StackingClassifier` does internally.
  * **Training Meta-Model:** We train a meta-model on this meta-data. It learns how best to combine the base models' predictions into the final prediction.
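
The three steps above can be sketched by hand to make the flow concrete. This is a minimal illustration, not how the rest of the lesson builds its model: the choice of `RandomForestClassifier` and `KNeighborsClassifier` as base models, the use of `cross_val_predict` with 5 folds, and the variable names `meta_train` and `meta_test` are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: pick the base models (illustrative choices)
base_models = [
    RandomForestClassifier(n_estimators=20, random_state=42),
    KNeighborsClassifier(),
]

# Step 2: build the meta-data from out-of-fold class probabilities,
# so every training row is predicted by a model that never saw it
meta_train = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

# Refit each base model on the full training set to produce test meta-data
meta_test = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test)
    for model in base_models
])

# Step 3: train the meta-model on the meta-data
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_train, y_train)

manual_accuracy = meta_model.score(meta_test, y_test)
print(f"Manual stacking accuracy: {manual_accuracy:.2f}")
```

Each base model contributes one probability column per class, so with two base models and ten digit classes the meta-data has 20 columns.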

### Why Stacking?

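  * **Improved Accuracy:** Different models tend to make different mistakes. A meta-model that learns when to trust each one can capture more patterns and reduce the overall error.
  * **Reduced Overfitting:** Combining models with different biases and variances can make the ensemble less prone to overfitting than any single base model, although this is not guaranteed.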

-----

## Loading and Splitting the Dataset

To get hands-on with stacking, we need data. We'll again use the `digits` dataset from `scikit-learn`, loaded with `load_digits`. It contains 8×8 grayscale images of handwritten digits, and the task is to predict which digit (0 to 9) each image shows.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load digit dataset
X, y = load_digits(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
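
A quick sanity check of the split can help before training anything. The snippet below is a small sketch that repeats the loading code and prints the resulting shapes.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X.shape)        # (1797, 64): each 8x8 image is flattened into 64 features
print(X_train.shape)  # (1437, 64)
print(X_test.shape)   # (360, 64)
print(len(set(y)))    # 10 distinct digit classes
```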

-----

## Defining Base and Meta Models

We define our base models (chefs) and meta-model (food critic). Base models do the heavy lifting, while the meta-model combines predictions for the final output. We use `RandomForestClassifier` and `GradientBoostingClassifier` as base models and `LogisticRegression` as our meta-model:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier

# Defining the base and meta models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42))
]

stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
```

The `estimators` list has tuples with each base model's name and the actual model. `StackingClassifier` combines these estimators with the `final_estimator` (meta-model).
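
To see what the fitted ensemble holds, you can inspect it after training. The sketch below rebuilds the same stack (with `max_iter=1000` on the meta-model, an assumption added here only to avoid convergence warnings) and looks at the fitted base models, the fitted meta-model, and the meta-data that `transform` returns.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42)),
]
stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
)
stack_clf.fit(X_train, y_train)

# Names of the fitted base models, in the order they were given
print(list(stack_clf.named_estimators_))          # ['rf', 'gb']

# The fitted meta-model
print(type(stack_clf.final_estimator_).__name__)  # LogisticRegression

# The meta-data the meta-model sees: one probability per class per base model
meta_features = stack_clf.transform(X_test)
print(meta_features.shape)                        # (360, 20)
```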

For base models, avoid using near-identical models, or the same model with similar hyperparameter settings. Base models that make the same mistakes give the meta-model little new information, so the ensemble is unlikely to gain much over a single model.
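
One rough way to judge whether two candidate base models are different enough is to check how often they agree on held-out data. The sketch below is only an illustration: the pairing of `RandomForestClassifier` with `KNeighborsClassifier` and the two summary numbers (`agreement`, `both_wrong`) are assumptions chosen for this example, not a standard scikit-learn diagnostic.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)

rf_pred = rf.predict(X_test)
knn_pred = knn.predict(X_test)

# Fraction of test samples where both models predict the same label
agreement = np.mean(rf_pred == knn_pred)

# Fraction of test samples that both models get wrong;
# the meta-model has little chance of fixing these
both_wrong = np.mean((rf_pred != y_test) & (knn_pred != y_test))

print(f"Agreement: {agreement:.2f}, both wrong: {both_wrong:.2f}")
```

If two models agree on almost every sample, including the ones they get wrong, stacking them is unlikely to help much.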

-----

## Calculating Accuracy on Testing Data

After training, we can evaluate our model's performance by calculating the accuracy on the testing data:

```python
from sklearn.metrics import accuracy_score

# Predict and calculate accuracy
y_pred = stack_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")
# Stacking Classifier Accuracy: 0.97
```

-----

## Comparison with GradientBoostingClassifier

Finally, let's see how our stacking classifier stacks up against a single `GradientBoostingClassifier`:

```python
# Training GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb_clf.fit(X_train, y_train)

# Predict and calculate accuracy for GradientBoostingClassifier
y_pred_gb = gb_clf.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"GradientBoosting Classifier Accuracy: {accuracy_gb:.2f}")
# GradientBoosting Classifier Accuracy: 0.96
```

We see that the performance of the `GradientBoostingClassifier` is comparable to the performance of our new `StackingClassifier`. But how do we choose one in this case? Well, this is what the next course of this path is all about!

Here is the short answer for now:

  * First, we tune each model's hyperparameters; one of the models may pull ahead once both are tuned.
  * Second, we use a more robust validation technique than a single `train_test_split`.
  * Finally, if the two models still perform on the same level, we can consider other factors, such as training time and interpretability.
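
As a preview of the validation point, here is a hedged sketch that compares the two models with k-fold cross-validation instead of a single split, and reports the average fit time alongside accuracy. The use of `cross_validate`, the choice of 3 folds (picked here only to keep the run short), and the `summary` dictionary are assumptions for this example; the next course may take a different approach.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_digits(return_X_y=True)

candidates = {
    "stacking": StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
            ("gb", GradientBoostingClassifier(n_estimators=20, random_state=42)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,  # folds used internally to build the meta-data
    ),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, model in candidates.items():
    # Every sample is used for testing exactly once across the folds
    cv_result = cross_validate(model, X, y, cv=3, scoring="accuracy")
    summary[name] = {
        "mean_accuracy": cv_result["test_score"].mean(),
        "std_accuracy": cv_result["test_score"].std(),
        "mean_fit_time": cv_result["fit_time"].mean(),
    }
    print(
        f"{name}: accuracy {summary[name]['mean_accuracy']:.2f} "
        f"(+/- {summary[name]['std_accuracy']:.2f}), "
        f"fit time {summary[name]['mean_fit_time']:.1f}s"
    )
```

The spread across folds gives a sense of whether a small difference in accuracy between the two models is meaningful or just noise from one particular split.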

-----

## Lesson Summary

Great job! 🎉 Let's recap:

  * **Stacking:** Combines multiple models (base models) using another model (meta-model) for a final prediction.
  * **How It Works:** Includes training base models, generating meta-data, and training a meta-model.
  * **Loading and Splitting Data:** Used the `digits` dataset and split it into training and testing sets.
  * **Model Setup:** Defined base models (`RandomForestClassifier` and `GradientBoostingClassifier`) and a meta-model (`LogisticRegression`).
  * **Training and Evaluating:** Implemented and trained the `StackingClassifier` and calculated its accuracy.
  * **Comparison:** Compared the accuracy of the stacking classifier to that of a single `GradientBoostingClassifier`.

Awesome, you made it! Now it's time to practice. In the upcoming session, you’ll build, train, and evaluate stacking models. Let's get started! 🚀

## Change the Base Models in the Stacking Classifier

Hey Space Voyager!

You’ve learned how to set up stacking with different base models. Now, change the set of base models to see how a different ensemble performs. Replace KNeighborsClassifier and DecisionTreeClassifier with RandomForestClassifier and SVC respectively. Make sure to modify the estimators list accordingly in the stacking classifier.

Let's code!

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining base and meta models
estimators = [
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(random_state=42))
]
stack_clf = StackingClassifier(estimators=estimators, final_estimator=GradientBoostingClassifier(n_estimators=100, random_state=42))
stack_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = stack_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

```

Sure, I'll help you with that! You want to change the base models in your `StackingClassifier` from `KNeighborsClassifier` and `DecisionTreeClassifier` to `RandomForestClassifier` and `SVC`, respectively. Also, you want to ensure the meta-model remains `GradientBoostingClassifier`.

Here's the modified code:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression # Not used for meta-model anymore, but good to keep if you might switch back
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining base and meta models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)), # Changed from KNeighborsClassifier
    ('svc', SVC(random_state=42, probability=True)) # Changed from DecisionTreeClassifier; probability=True lets stacking use predict_proba (without it, decision_function is used instead)
]
# The meta-model is already GradientBoostingClassifier as requested
stack_clf = StackingClassifier(estimators=estimators, final_estimator=GradientBoostingClassifier(n_estimators=100, random_state=42))
stack_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = stack_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

```
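
A note on `probability=True`: it is optional for stacking. With the default `stack_method='auto'`, `StackingClassifier` tries `predict_proba` first and falls back to `decision_function` when a base model does not expose probabilities. The sketch below shows this with a plain `SVC()`; it swaps in `LogisticRegression` as the meta-model purely to keep the example light, which is an assumption of this example rather than part of the exercise.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
        ('svc', SVC()),  # probability is left at its default (False)
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack_clf.fit(X_train, y_train)

# The method each base model uses to produce its meta-features
print(stack_clf.stack_method_)  # ['predict_proba', 'decision_function']

accuracy = stack_clf.score(X_test, y_test)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")
```

Setting `probability=True` makes `SVC` run an extra internal cross-validation to calibrate probabilities, so leaving it off can shorten training.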

## Change the Meta-Model in Stacking Classifier

Nice work so far, Space Wanderer! Now, let's change our meta-model. Modify the provided code to use LogisticRegression instead of KNeighborsClassifier as the meta-model in the stacking ensemble. This will provide a different approach to the decision-making process.

Let's see if it improves the predictions!

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the base and meta models
base_models = [
    ('lr', LogisticRegression(max_iter=250, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42))
]
meta_model = KNeighborsClassifier()
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model)
stacking_model.fit(X_train, y_train)

# Predict and calculate accuracy
predictions = stacking_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

```

Alright, Space Wanderer! Let's swap out that meta-model. You want to change the `final_estimator` (meta-model) of your `StackingClassifier` from `KNeighborsClassifier` to `LogisticRegression`.

Here's the modified code:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier # Still imported, but no longer used for meta_model
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the base and meta models
base_models = [
    ('lr', LogisticRegression(max_iter=250, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42))
]
# Change the meta_model to LogisticRegression
meta_model = LogisticRegression(max_iter=250, random_state=42) # Using LogisticRegression as the meta-model
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model)
stacking_model.fit(X_train, y_train)

# Predict and calculate accuracy
predictions = stacking_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Stacking Classifier Accuracy: {accuracy:.2f}")

```

## Complete the Stacking Classifier

Awesome work so far, Space Wanderer!

Now let's see if you can fill in the missing pieces! Complete the code to define our base models using the RandomForestClassifier and GradientBoostingClassifier, then create the StackingClassifier with LogisticRegression as the meta-model. Each model has a unique strength. Finally, calculate the accuracy of the stacking model on the test set.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Define the RandomForest and GradientBoosting classifiers
# Hint: use a relatively small number of estimators (e.g. 20) to make the code run faster
base_models = [
    ('random_forest', ___),
    ('gradient_boosting', ___)
]
meta_model = LogisticRegression()

# TODO: Create the StackingClassifier using base models and LogisticRegression as the meta model

# Training the Stacking Model
stacking_model.fit(X_train, y_train)

# TODO: Calculate the accuracy on the test set and print it

```

Hey Space Wanderer! Let's get those missing pieces filled in and complete your Stacking Classifier.

Here's the completed code:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score

# Load digit dataset
X, y = load_digits(return_X_y=True)

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Define the RandomForest and GradientBoosting classifiers
# Hint: use a relatively small number of estimators (e.g. 20) to make the code run faster
base_models = [
    ('random_forest', RandomForestClassifier(n_estimators=20, random_state=42)), # Completed: RandomForestClassifier
    ('gradient_boosting', GradientBoostingClassifier(n_estimators=20, random_state=42)) # Completed: GradientBoostingClassifier
]
meta_model = LogisticRegression(random_state=42) # Added random_state for reproducibility, though not strictly required by problem.

# TODO: Create the StackingClassifier using base models and LogisticRegression as the meta model
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model) # Completed: StackingClassifier creation

# Training the Stacking Model
stacking_model.fit(X_train, y_train)

# TODO: Calculate the accuracy on the test set and print it
predictions = stacking_model.predict(X_test) # Completed: Make predictions
accuracy = accuracy_score(y_test, predictions) # Completed: Calculate accuracy
print(f"Stacking Classifier Accuracy: {accuracy:.2f}") # Completed: Print accuracy

```

## Iterate Through Meta-Models

Complete the code to iterate through possible meta-models for the stacking classifier and print the accuracy for each one.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset and split it
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Define the base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=12, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=6, random_state=42))
]

# List of possible meta models
meta_models = [
    LogisticRegression(max_iter=100),
    RidgeClassifier(),
    DecisionTreeClassifier()
]

# Iterate through meta models and print accuracy
for meta_model in meta_models:
    # TODO: train StackingClassifier with estimators=estimators and final_estimator=meta_model
    # TODO: make predictions and calculate the accuracy. Put the accuracy into the accuracy variable
    print(f'Meta Model: {meta_model.__class__.__name__}, Accuracy: {accuracy:.2f}')

```

Here's the completed code:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset and split it
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Define the base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=12, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=6, random_state=42))
]

# List of possible meta models
meta_models = [
    LogisticRegression(max_iter=100, random_state=42), # Added random_state for reproducibility
    RidgeClassifier(random_state=42), # Added random_state for reproducibility
    DecisionTreeClassifier(random_state=42) # Added random_state for reproducibility
]

# Iterate through meta models and print accuracy
for meta_model in meta_models:
    # TODO: train StackingClassifier with estimators=estimators and final_estimator=meta_model
    stacking_model = StackingClassifier(estimators=estimators, final_estimator=meta_model)
    stacking_model.fit(X_train, y_train)

    # TODO: make predictions and calculate the accuracy. Put the accuracy into the accuracy variable
    predictions = stacking_model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    
    print(f'Meta Model: {meta_model.__class__.__name__}, Accuracy: {accuracy:.2f}')

```