# **Ensemble Learning**

Why use one machine learning model to solve a problem when you can use many at the same time? Ensemble learning is the process of building a complex machine learning model by combining multiple base estimators as building blocks. The main goals for employing ensemble methods are to end up with a resultant machine learning model that is more performant and more robust than its components.

In ensembling, the base models that are chosen often underachieve on their own and are typically referred to as weak learners. Weak learners are preferred because they keep the ensemble computationally efficient: combining strong learners doesn’t necessarily make the resultant ensembled model more performant, so one might as well choose learners that cost less computationally.

What makes a model a “weak learner”? There are many ways in which a base model can be considered weak. For example, it can have high bias or high variance. The nature of the weakness of the base model is typically taken into consideration and is a design choice when determining the best ensembling method to use.

The basic components of ensemble learning are:
- Base models that are weak learners
- An ensembling method that combines the base models to improve performance and robustness

It is possible to have an ensemble model that performs worse than any one of its contributing base estimators. To circumvent this outcome it is important that the base estimators are uncorrelated and independent. Having a higher diversity among the trained base estimators leads to a stronger ensemble model.
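One way to see why correlation matters is a standard variance identity. As a simplifying assumption (not stated in the text above), suppose each of the $n$ base estimators $f_i$ has prediction variance $\sigma^2$ and every pair of estimators has correlation $\rho$. The variance of their average is then:

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} f_i(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
$$

The second term shrinks as more estimators are added, but the first does not. With $\rho = 1$ averaging gives no reduction at all, while with $\rho = 0$ the variance falls to $\sigma^2 / n$.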

Three common ensembling methods in machine learning that will be covered briefly in this article and later on in separate modules are Bagging, Boosting, and Stacking.

![image](images/bagging_vs_boosting.png)

## **Bagging**

As the name suggests, Bootstrap AGGregatING (also known as Bagging) is an ensemble method that combines the concepts of bootstrapping and aggregation. Bagging can be used for both classification and regression problems. Bagging methods use weak learners as base models that are complex and tend to suffer from high variance. Their weakness as models is due to the fact that they are built with only a subset of the available features and on a subset of the training data due to bootstrapping.

If the full dataset is represented by the larger cookie in the figure above, each of the candies atop the cookie represents an individual training data instance. The smaller cookies in the Bagging panel represent a bootstrapped sample of the full training dataset. Bootstrapping refers to the method of sampling data with replacement.

Bagging is a learning technique that is done in parallel. Each of the base models is trained independently of the others. Additionally, each base model is trained using only a subset of the original features. Even though the base models tend to be complex, they overfit to both a subset of the available training data and a subset of the available features. This allows them to be diverse from one another, often leading to a very strong ensemble model when aggregated. In the Bagging panel of the figure, the base models are decision trees that are relatively large and overfit to the bootstrapped subset of data provided to each of them.

Once each of the base models is trained, the method for ensembling tends to be a simple aggregation technique over each of the models; a majority vote for classification problems and averaging for regression problems.
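The two aggregation rules can be sketched in a few lines of NumPy. The prediction arrays below are made-up values for illustration, and `majority_vote` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical predictions from 5 base classifiers (rows) on 4 samples (columns).
class_predictions = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
])


def majority_vote(predictions):
    """Return the most frequent label in each column (one column per sample)."""
    voted = []
    for column in predictions.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)


# Hypothetical predictions from 3 base regressors (rows) on 2 samples (columns).
regression_predictions = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 14.0],
])

ensemble_labels = majority_vote(class_predictions)      # classification: vote
ensemble_values = regression_predictions.mean(axis=0)   # regression: average

print(ensemble_labels)
print(ensemble_values)
```

Note that ties in the vote are broken here by taking the smallest label, which is an arbitrary choice made for this sketch.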

A common implementation of a bagging algorithm that uses decision trees as its base model is the Random Forest.

### **Random Forests**

We’ve seen that decision trees can be powerful supervised machine learning models. However, they’re not without their weaknesses — decision trees are often prone to overfitting. We’ve discussed some strategies to minimize this problem, like pruning, but sometimes that isn’t enough. We need to find another way to generalize our trees. This is where the concept of a random forest comes in handy.

A random forest is an ensemble machine learning technique. A random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports its classification and the random forest returns the most popular classification. It’s like every tree gets a vote, and the most popular classification wins. Some of the trees in the random forest may be overfit, but by making the prediction based on a large number of trees, overfitting will have less of an impact.

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic: given a training set, the same tree will be made every time. To make a random forest, we use a technique called bagging, which is short for bootstrap aggregating.

#### **Random Samples**

How it works is as follows: every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had 1000 rows in it, we could make a decision tree by picking 100 of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.

In bootstrapping, we’re doing this process with replacement. Picture putting all 1000 rows in a bag and reaching in and grabbing one row at random. After writing down what row we picked, we put that row back in our bag. This means that when we’re picking our 100 random rows, we could pick the same row more than once. In fact, it’s very unlikely, but all 100 randomly picked rows could be the same row! Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from 1000 rows to 100. We can pick 1000 rows at random, and because we can get the same row more than once, we’ll still end up with a data set that differs from the original.
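A minimal sketch of one bootstrapped sample, assuming a 1000-row training set identified only by its row indices. The seed and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

n_rows = 1000
row_indices = np.arange(n_rows)

# Draw n_rows indices with replacement: this is one bootstrapped sample.
bootstrap_indices = rng.choice(row_indices, size=n_rows, replace=True)

# Count how many distinct original rows made it into the sample.
n_unique = np.unique(bootstrap_indices).size
fraction_unique = n_unique / n_rows

print(f"Rows drawn: {bootstrap_indices.size}")
print(f"Distinct rows: {n_unique} ({fraction_unique:.1%})")
```

A bootstrapped sample of the same size as the original typically contains roughly 63% of the distinct original rows (about $1 - 1/e$), with the remainder made up of repeats. This is why each tree still sees a different data set.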

Then after many trees have been made, the results are “aggregated” together. In the case of a classification task, often the aggregation is taking the majority vote of the individual classifiers. For regression tasks, often the aggregation is the average of the individual regressors.

#### **Random Feature Selection**

In addition to using bootstrapped samples of our dataset, we can continue to add variety to the ways our trees are created by randomly selecting the features that are used.

When we use a decision tree, all the features are used and the split is chosen as the one that increases the information gain the most. While it may seem counter-intuitive, selecting a random subset of features can help in the performance of an ensemble model. In the following example, we will use a random selection of features prior to model building to add additional variance to the individual trees. While an individual tree may perform worse, sometimes the increases in variance can help model performance of the ensemble model as a whole.
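A minimal sketch of fitting a single tree on a random subset of features. It uses a synthetic dataset from `make_classification` as a stand-in (an assumption made for this sketch, not the car dataset used later), and the number of selected features is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this article).
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

rng = np.random.default_rng(0)
n_selected = 6  # illustrative choice

# Pick a random subset of column indices, without replacement.
feature_idx = rng.choice(X_train.shape[1], size=n_selected, replace=False)

# Fit one tree on only the selected columns.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train[:, feature_idx], y_train)

# The same columns must be selected again at prediction time.
subset_accuracy = tree.score(X_test[:, feature_idx], y_test)
print(f"Features used: {sorted(feature_idx.tolist())}")
print(f"Accuracy with {n_selected} random features: {subset_accuracy:.3f}")
```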

The two steps we walked through above created trees on bootstrapped samples and randomly selected features. These can be combined and implemented at the same time! Combining them adds further variation among the base learners of the ensemble model. Because the trees are less correlated with one another, aggregating their predictions reduces the variance of the ensemble, usually at the cost of a small increase in bias, which in turn improves its ability to generalize to new and unseen data. Rather than re-doing this process manually, we will use `scikit-learn`’s bagging implementation, `BaggingClassifier()`, to do so.

#### **sklearn.ensemble.BaggingClassifier**

Much like other models we have used in scikit-learn, we instantiate an instance of `BaggingClassifier()` and specify the parameters. The first parameter, `estimator` (named `base_estimator` in scikit-learn versions before 1.2), refers to the machine learning model that is being bagged. In the case of random forests, the base estimator would be a decision tree. We are going to use a decision tree classifier with a `max_depth` of 5; this will be instantiated with `BaggingClassifier(DecisionTreeClassifier(max_depth=5))`.

After the model has been defined, methods `.fit()`, `.predict()`, `.score()` can be used as expected. Additional hyperparameters specific to bagging include the number of estimators (`n_estimators`) we want to use and the maximum number of features we’d like to keep (`max_features`).

Note: While we have focused on decision tree classifiers (as this is the base learner for a random forest classifier), this procedure of bagging is not specific to decision trees, and in fact can be used for any base classifier or regression model. The scikit-learn implementation is generalizable and can be used for other base models!

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
    names=["buying", "maint", "doors", "persons", "lug_boot", "safety", "accep"],
)
df["accep"] = ~(df["accep"] == "unacc")  # 1 is acceptable, 0 if not acceptable
X = pd.get_dummies(df.iloc[:, 0:6], drop_first=True)
y = df["accep"]
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

# 1. Bagging classifier with 10 Decision Tree base estimators
bag_dt = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5), n_estimators=10)
bag_dt.fit(x_train, y_train)

print("Accuracy score of Bagged Classifier, 10 estimators:")
bag_accuracy = bag_dt.score(x_test, y_test)
print(bag_accuracy)

# 2. Set `max_features` to 10.
bag_dt_10 = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5), n_estimators=10, max_features=10
)
bag_dt_10.fit(x_train, y_train)

print("Accuracy score of Bagged Classifier, 10 estimators, 10 max features:")
bag_accuracy_10 = bag_dt_10.score(x_test, y_test)
print(bag_accuracy_10)

# 3. Change base estimator to Logistic Regression
bag_lr = BaggingClassifier(estimator=LogisticRegression(), n_estimators=10, max_features=10)
bag_lr.fit(x_train, y_train)

print("Accuracy score of Logistic Regression, 10 estimators:")
bag_accuracy_lr = bag_lr.score(x_test, y_test)
print(bag_accuracy_lr)
```

```
Accuracy score of Bagged Classifier, 10 estimators:
0.8912037037037037
Accuracy score of Bagged Classifier, 10 estimators, 10 max features:
0.9074074074074074
Accuracy score of Logistic Regression, 10 estimators:
0.875
```


#### **sklearn.ensemble.RandomForestClassifier**

The random forest algorithm has a slightly different way of randomly choosing features. Rather than choosing a single random set at the onset, each split chooses a different random set.

For example, when finding which feature to split the data on the first time, we might randomly choose to only consider the price of the car, the number of doors, and the safety rating. After splitting the data on the best feature from that subset, we’ll likely want to split again. For this next split, we’ll randomly select three features again to consider. This time those features might be the cost of maintenance, the number of doors, and the size of the trunk. We’ll continue this process until the tree is complete.

One question to consider is how to choose the number of features to randomly select. Why did we choose 3 in this example? A good rule of thumb is to select as many features as the square root of the total number of features. Our car dataset doesn’t have a lot of features, so in this example, it’s difficult to follow this rule. But if we had a dataset with 25 features, we’d want to randomly select 5 features to consider at every split point.
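The rule of thumb is simple arithmetic. The helper below is a hypothetical function written for this sketch; rounding down and enforcing a minimum of one feature are assumptions about how to handle non-square counts.

```python
import math


def features_per_split(n_features):
    """Square-root rule of thumb, rounded down, with a minimum of 1 feature."""
    return max(1, int(math.sqrt(n_features)))


for n in (6, 15, 25, 100):
    print(n, "->", features_per_split(n))
```

scikit-learn’s `RandomForestClassifier` exposes this choice through its `max_features` parameter, where the string `"sqrt"` requests the square-root rule.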

You now have the ability to make a random forest using your own decision trees. However, scikit-learn has a `RandomForestClassifier()` class that will do all of this work for you! `RandomForestClassifier` is in the `sklearn.ensemble` module.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

# 1. Create a Random Forest Classifier and print its parameters
rf = RandomForestClassifier()
print("Random Forest parameters:")
rf_params = rf.get_params()
print(rf_params)

# 2. Fit the Random Forest Classifier to training data and calculate accuracy score on the test data
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
print("Test set accuracy:")
rf_accuracy = rf.score(x_test, y_test)
print(rf_accuracy)
```

```
Random Forest parameters:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Test set accuracy:
0.9537037037037037
```


#### **sklearn.ensemble.RandomForestRegressor**

Just like in decision trees, we can use random forests for regression as well! It is important to know when to use regression or classification; this usually comes down to what type of variable your target is. Now, instead of a classification task, we will use `scikit-learn`’s `RandomForestRegressor()` to carry out a regression task.
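A minimal sketch of such a regression task. The data comes from `make_regression`, a synthetic dataset chosen only for illustration (an assumption; the original text does not name a regression dataset), and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption made for this sketch).
X_reg, y_reg = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, random_state=0, test_size=0.25
)

# Each tree predicts a number; the forest averages those predictions.
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
rf_reg.fit(X_reg_train, y_reg_train)

# For regressors, .score() returns the coefficient of determination (R^2).
r_squared = rf_reg.score(X_reg_test, y_reg_test)
print(f"Test set R^2: {r_squared:.3f}")
```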

### **Extra Reading**

https://medium.com/data-science/random-forest-explained-a-visual-guide-with-code-examples-9f736a6e1b3c


## **Boosting**

Boosting is an ensemble learning technique where the weak learners are too simple and tend to suffer from high bias. In the Boosting panel of the figure above, the base models are decision trees with only one level, a decision stump. Decision stumps can only make a decision based off of one feature at a time, causing them to underfit the data substantially.

Boosting is a sequential learning technique where each of the base models builds off the previous model. Each subsequent model aims to improve the performance of the final ensembled model by attempting to fix the errors in the previous stage.

In the Boosting panel of the figure, you may notice that some of the candies atop the cookies are larger than the others. These particular training instances were misclassified by the previous decision stump and are therefore given more weight by the next decision stump. This is one method in which boosting methods may learn from their mistakes.

There are two important decisions that need to be made to perform boosted ensembling:
- Sequential Fitting Method
- Aggregation Method

Two common implementations of the boosting algorithm are Adaptive Boosting and Gradient Boosting.

### **Adaptive Boosting**

Adaptive Boosting (or AdaBoost) is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees.

For AdaBoost, the Sequential Fitting Method is accomplished by updating the weight attached to each of the training dataset observations as we proceed from one base model to the next. The Aggregation Method is a weighted sum of those base models where the model weight is dependent on the error of that particular estimator.

The training of an AdaBoost model is the process of determining the training dataset observation weights at each step as well as the final weight for each base model for aggregation.

![image](images/adaptive_boosting.png)

We already said that the base models for boosting are supposed to be very simple and tend to underfit. That is correct, and for this reason we use the simplest version of a decision tree, known as a decision stump. A decision stump only makes a single decision, so the resultant estimator only has two leaf nodes.

Taking a look at the Result of the 1st Base Model, we see that the decision boundary, that is the border between the lighter green and lighter red regions, does a decent job of separating the green circles from the red triangles. However we do notice that there are two red triangles in the light green region. This indicates that they have been classified incorrectly by the decision stump.

Each of the base models will contribute a different amount to the final ensemble model. The influence that a particular base model contributes is going to be dependent on the number of errors it makes, or for regression, the magnitude of the errors it makes. We do not want a decision stump that does a terrible job of classifying the data to have the same say as a decision stump that does a great job. Once we are able to evaluate the Result of the 1st Base Model, we can Weight the Model and assign it a value, here indicated by alpha_1.

To prepare for the next stage of the sequential learning process, we need to Reweight the Data. The instances of the training data that were classified incorrectly by the 1st Base Model, the two red triangles in the middle right, are given a larger weight than the other data instances, indicated by their larger size. By assigning those misclassified points a larger weight, we are asking the 2nd Base Model to give them preferential treatment during the Model Fitting.

Taking a look at the Result of the 2nd Base Model, we see that is exactly what happens. The two larger red triangles are classified correctly by the 2nd Base Model. Once again we assign the base model a weight, alpha_2, determined by the errors it makes (fewer errors earn a larger weight), and prepare for the next stage of the sequential learning by reweighting the training data. The instances that were incorrectly classified by the 2nd Base Model, the two green circles in the center right, are given a larger weight.

Once we have reached the predefined number of estimators for our AdaBoost model, the base models are ready to aggregate. In this example we have chosen n_estimators = 3. The influence of each base model in the final ensemble model will be proportional to the alpha it was assigned during the training process.
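One round of the weighting steps described above can be sketched by hand. This follows the classic discrete AdaBoost formulation for binary labels encoded as -1 and +1, which is an assumption made for this sketch and not necessarily what scikit-learn computes internally. The labels and stump predictions are made-up values.

```python
import numpy as np

# Hypothetical toy labels and stump predictions, encoded as -1 / +1.
y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, -1])
stump_pred = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, 1])  # two mistakes

n = len(y_true)
weights = np.full(n, 1 / n)  # every observation starts with equal weight

# Weighted error rate of this stump.
misclassified = stump_pred != y_true
error = weights[misclassified].sum()

# Weight the Model: a lower error gives a larger alpha.
alpha = 0.5 * np.log((1 - error) / error)

# Reweight the Data: mistakes grow, correct points shrink, then renormalize.
new_weights = weights * np.exp(-alpha * y_true * stump_pred)
new_weights /= new_weights.sum()

print(f"error = {error:.2f}, alpha = {alpha:.3f}")
print("weight of a misclassified point:", round(new_weights[misclassified][0], 4))
print("weight of a correct point:", round(new_weights[~misclassified][0], 4))
```

After the update, the two misclassified points together carry half of the total weight, so the next stump cannot ignore them.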

#### **sklearn.ensemble.AdaBoostClassifier**

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load dataset to a pandas DataFrame
path_to_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
column_names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "accep"]
df = pd.read_csv(path_to_data, names=column_names)

target_column = "accep"
raw_feature_columns = [col for col in column_names if col != target_column]

# Create dummy variables from the feature columns
X = pd.get_dummies(df[raw_feature_columns], drop_first=True)

# Convert target column to binary variable; 0 if 'unacc', 1 otherwise
df[target_column] = np.where(df[target_column] == "unacc", 0, 1)
y = df[target_column]

# Split the full dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3)

# 1. Create a decision stump base model using the Decision Tree Classifier and print its parameters
decision_stump = DecisionTreeClassifier(max_depth=1)
print(decision_stump.get_params())

# 2. Create an Adaptive Boost Classifier and print its parameters
ada_classifier = AdaBoostClassifier(estimator=decision_stump, n_estimators=5)
print(ada_classifier.get_params())

# 3. Fit the Adaptive Boost Classifier to the training data and get the list of predictions
ada_classifier.fit(X_train, y_train)
y_pred = ada_classifier.predict(X_test)
```

```
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 1, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': None, 'splitter': 'best'}
{'algorithm': 'deprecated', 'estimator__ccp_alpha': 0.0, 'estimator__class_weight': None, 'estimator__criterion': 'gini', 'estimator__max_depth': 1, 'estimator__max_features': None, 'estimator__max_leaf_nodes': None, 'estimator__min_impurity_decrease': 0.0, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 2, 'estimator__min_weight_fraction_leaf': 0.0, 'estimator__monotonic_cst': None, 'estimator__random_state': None, 'estimator__splitter': 'best', 'estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 1.0, 'n_estimators': 5, 'random_state': None}
```


### **Gradient Boosting**

Gradient Boosting is a sequential ensembling method that can be used for both classification and regression. It can use any base machine learning model, though it is most commonly used with decision trees, known as Gradient Boosted Trees.

For Gradient Boost, the Sequential Fitting Method is accomplished by fitting a base model to the negative gradient of the error in the previous stage. The Aggregation Method is a weighted sum of those base models where the model weight is constant.

![image](images/gradient_boosting.png)

After the 1st Base Model is fit, its errors (residuals), $h_1$, are calculated on the training data. The errors will be greater for the training data instances where the model did not do as good a job with its prediction and lower for the instances where the model fit the data well.

In the next stage of the sequential learning process, we fit the 2nd Base Model. Here is where the interesting part comes in. Instead of fitting the model to the target values y_actual as we are typically used to doing in machine learning, we actually fit the model on the errors of the previous stage, in this case $h_1$. The 2nd Base Model is literally learning from the mistakes of the 1st Base Model through those residuals that were calculated.

The results of the 2nd Base Model, which was tasked with fitting the errors of the 1st Base Model, are multiplied by a constant learning rate, alpha, and added to the results of the 1st Base Model to give us a set of updated predictions, $y_2$.

The subsequent stages repeat the same steps. At stage N, the base model is fit on the errors calculated at the previous stage $h_{(N-1)}$. The new model that is fit is multiplied by the constant learning rate alpha and added to the predictions of the previous stage.

Once we have reached the predefined number of estimators for our Gradient Boosting model or the residual errors are not changing between iterations, the model will stop training and we end up with the resultant ensemble model.
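The stage-by-stage procedure can be sketched by hand for a regression problem. The sketch assumes a squared-error loss, for which the negative gradient is simply the residual, and it starts from a constant prediction (the mean of the targets) rather than a fitted first model; both are assumptions made here. The data is synthetic and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression data (an assumption made for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y_actual = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

alpha = 0.1        # constant learning rate
n_estimators = 50  # predefined number of stages

# Stage 0: start from a constant prediction (the mean of the targets).
prediction = np.full_like(y_actual, y_actual.mean())
initial_mse = np.mean((y_actual - prediction) ** 2)

base_models = []
for stage in range(n_estimators):
    residuals = y_actual - prediction  # errors of the previous stage
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)             # fit the errors, not y_actual
    prediction = prediction + alpha * tree.predict(X)
    base_models.append(tree)

final_mse = np.mean((y_actual - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_estimators} stages: {final_mse:.4f}")
```

To predict on new data, the same sum is replayed: the initial constant plus `alpha` times the prediction of every stored tree.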

#### **sklearn.ensemble.GradientBoostingClassifier**

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load dataset to a pandas DataFrame
path_to_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
column_names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "accep"]

df = pd.read_csv(path_to_data, names=column_names)
target_column = "accep"
raw_feature_columns = [col for col in column_names if col != target_column]

# Create dummy variables from the feature columns
X = pd.get_dummies(df[raw_feature_columns], drop_first=True)

# Convert target column to binary variable; 0 if 'unacc', 1 otherwise
df[target_column] = np.where(df[target_column] == "unacc", 0, 1)
y = df[target_column]

# Split the full dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.3)

# 1. Create a Gradient Boosting Classifier and print its parameters
grad_classifier = GradientBoostingClassifier(n_estimators=15)

print(grad_classifier.get_params())

# 2. Fit the Gradient Boosted Trees Classifier to the training data and get the list of predictions
grad_classifier.fit(X_train, y_train)
y_pred = grad_classifier.predict(X_test)
```

```
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.1, 'loss': 'log_loss', 'max_depth': 3, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 15, 'n_iter_no_change': None, 'random_state': None, 'subsample': 1.0, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}
```


### **Extreme Gradient Boosting (XGBoost)**

XGBoost (Extreme Gradient Boosting) is an optimized version of Gradient Boosting that is faster and more efficient than traditional gradient boosting algorithms. It uses second-order derivatives to approximate loss functions, improving optimization. Key features include:
- Regularization (L1 & L2) to prevent overfitting.
- Tree Pruning to prevent unnecessary splits.
- Parallel & Distributed Computing for scalability.
- Handling Missing Values Automatically.
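
The second-order approximation mentioned above can be sketched as follows. The notation follows the common presentation of XGBoost and is not defined elsewhere in this article: at boosting round $t$, $f_t$ is the new tree, $l$ is the loss, $\hat{y}_i^{(t-1)}$ is the prediction from the previous round, and $\Omega$ is the regularization term.

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)
$$

$$
g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}
$$

Here $h_i$ is a second derivative and is unrelated to the residuals written as $h_1$ in the Gradient Boosting section. Plain gradient boosting uses only the first-order information $g_i$.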

#### **xgboost.XGBRegressor**

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load dataset (its target is a binary 0/1 label, treated as a numeric value by the regressor below)
data = datasets.load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define XGBoost model
xgb_model = xgb.XGBRegressor(
    objective="reg:squarederror", n_estimators=100, learning_rate=0.1, max_depth=4
)

# Train model
xgb_model.fit(X_train, y_train)

# Predict
y_pred = xgb_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"XGBoost MSE: {mse:.4f}")
```

```
XGBoost MSE: 0.0432
```


#### **Extra Reading**

https://medium.com/sfu-cspmp/xgboost-a-deep-dive-into-boosting-f06c9c41349

https://medium.com/@prathameshsonawane/xgboost-how-does-this-work-e1cae7c5b6cb

https://towardsdatascience.com/xgboost-the-definitive-guide-part-1-cc24d2dcd87a/

https://python.plainenglish.io/mastering-xgboost-a-beginners-guide-to-boosting-classification-performance-8a6de637a293

https://python.plainenglish.io/anomaly-detection-end-to-end-real-life-bank-card-fraud-detection-with-xgboost-2a343f761fa9

https://medium.com/data-science/visualizing-xgboost-parameters-a-data-scientists-guide-to-better-models-38757486b813

### **LightGBM**

LightGBM is a gradient boosting framework optimized for speed and efficiency. Instead of growing trees level-wise (XGBoost’s default), it grows trees leaf-wise, which often allows for faster training and better accuracy. Key features include:
- Histogram-based learning (reduces computation time).
- Leaf-wise tree growth (improves accuracy but may overfit).
- Handles large datasets efficiently.
- Built-in categorical feature handling.


#### **lightgbm.train**

```python
import lightgbm as lgb
from sklearn.datasets import load_diabetes  # Regression example
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load dataset
data = load_diabetes()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create dataset for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Define parameters
params = {"objective": "regression", "metric": "mse", "learning_rate": 0.1, "num_leaves": 31}

# Train model
lgb_model = lgb.train(params, train_data, num_boost_round=100, valid_sets=[test_data])

# Predict
y_pred = lgb_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"LightGBM MSE: {mse:.4f}")
```

```
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000081 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 595
[LightGBM] [Info] Number of data points in the train set: 353, number of used features: 10
[LightGBM] [Info] Start training from score 153.736544
LightGBM MSE: 3203.0861
```


#### **Extra Reading**

https://medium.com/@turkishtechnology/light-gbm-light-and-powerful-gradient-boost-algorithm-eaa1e804eca8

https://medium.com/@mohtasim.hossain2000/mastering-lightgbm-an-in-depth-guide-to-efficient-gradient-boosting-8bfeff15ee17

https://towardsdatascience.com/a-quick-guide-to-lightgbm-library-ef5385db8d10/

https://towardsdatascience.com/lightgbm-the-fastest-option-of-gradient-boosting-1fb0c40948a3/


### **CatBoost**

CatBoost is designed specifically for categorical data and automatically handles categorical variables without requiring one-hot encoding. It introduces several key innovations that distinguish it from other gradient boosting methods:

1. Handling Categorical Features: CatBoost converts categorical values into numerical values using a novel algorithm that takes into account the target variable for ordering categorical levels. This process, known as “target statistics,” helps in reducing overfitting and provides a more accurate representation of categorical data.

2. Ordered Boosting: One of the core innovations of CatBoost is its ordered boosting mechanism. Traditional gradient boosting methods can suffer from prediction shift due to the overlap between the training data for the base models and the data used to calculate the gradients. CatBoost addresses this by introducing a random permutation of the dataset in each iteration and using only the data before each example in the permutation for training. This approach reduces overfitting and improves model robustness.

3. Symmetric Trees: CatBoost builds balanced trees, also known as symmetric trees, as its base predictors. Unlike traditional gradient boosting methods that build trees leaf-wise or depth-wise, CatBoost’s symmetric trees ensure that all nodes at the same level use the same split condition. This leads to faster execution and reduces the likelihood of overfitting.
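
The idea behind the target statistics and the ordering in points 1 and 2 can be sketched as follows. This is a simplified illustration, not CatBoost’s actual implementation: the function name, the `prior` and `prior_weight` parameters, and the toy data are all hypothetical. Each row is encoded using only the targets of rows that come before it in a random permutation, so a row never sees its own label.

```python
import numpy as np


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each categorical value from the targets of earlier rows only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of rows seen so far, per category
    encoded = np.empty(len(categories), dtype=float)

    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of the targets that came earlier in the permutation.
        encoded[idx] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now reveal this row's own target to later rows.
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded


# Hypothetical toy data: one categorical feature and a binary target.
colors = np.array(["red", "blue", "red", "red", "blue", "green"])
labels = np.array([1, 0, 1, 0, 1, 0])

encoded_colors = ordered_target_statistic(colors, labels, prior=labels.mean())
print(encoded_colors)
```

A category that appears only once, such as `"green"` here, has no earlier rows to draw on and falls back to the prior.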

#### **catboost.CatBoostClassifier**

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=3, verbose=200)

# Fit model
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Make predictions
predictions = model.predict(X_test)
```

- `iterations (num_boost_round)`: The number of gradient-boosted trees to construct. A higher value generally leads to better performance but increases training time and risks overfitting.
- `learning_rate`: The step size used in each gradient boosting iteration. Lower learning rates slow down learning for a more gradual approach (possibly preventing overfitting), while higher values converge faster (but risk overshooting the optimal solution).
- `depth (max_depth)`: The maximum depth of each decision tree. Deeper trees allow for more complex interactions, but increase overfitting risk. Experiment with different depths to strike a balance.
- `l2_leaf_reg (reg_lambda)`: L2 regularization coefficient applied to leaf values. Larger values impose heavier regularization to prevent overfitting, potentially at the cost of some accuracy.

#### **catboost.CatBoostRegressor**

```python
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostRegressor
model = CatBoostRegressor(iterations=1000, learning_rate=0.1, depth=3, verbose=200)

# Fit model
model.fit(X_train, y_train, eval_set=(X_test, y_test))

# Make predictions
predictions = model.predict(X_test)
```

#### **Extra Reading**

https://medium.com/we-talk-data/what-is-catboost-a-guide-to-boosting-techniques-f370a41f989d

https://ai.plainenglish.io/understand-catboost-intuition-and-training-process-e0bf258065f2

## **Stacking (Meta-Ensembling)**
Stacking is an extremely flexible ensembling technique where a final model is trained to learn how to best combine a set of base models to make strong predictions. In contrast to bagging and boosting, the base models in stacking do not need to be the same type of learning algorithm.

While bagging and boosting are built with base models that are weak learners, that does not necessarily have to be the case for stacking. A stacking algorithm can be used to combine decently performing learners as well. Unlike bagging, there are no subsampling processes used. The stacking model effectively uses a full training set.

Consider the scenario in which we are handling a classification problem where we are exploring different models. When using a decision tree we find that our model has poor generalization performance as a result of overfitting. Additionally, when using a logistic regression classification model, we discover our model parameters are hard to tune due to highly-correlated features. We might discover that while each model has a unique advantage over the others, each may also have a distinct drawback such as poor generalization accuracy or the inability to predict a specific class. As it turns out, there’s no rule stating we can’t take the best of both worlds!

Stacking can be thought of as a democracy of machine learning models, where different models are trained and subsequently cast their vote through their predictions. A majority-rules approach can be used for determining the final model prediction if we weighted each estimator equally. In practice, each base estimator may need to be weighted differently, so we have a later-stage model to learn how to appropriately weigh the predictions of all the prior base estimators.
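
The difference between an equal-weight vote and a weighted vote can be sketched in a few lines of plain Python. The function names and the weights below are hypothetical; in a real stacking model the weighting is learned by the later-stage model rather than set by hand.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most estimators."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Three hypothetical estimators vote on one sample
predictions = [1, 0, 0]
weights = [0.7, 0.2, 0.1]  # the first estimator is trusted the most

print(majority_vote(predictions))           # 0
print(weighted_vote(predictions, weights))  # 1
```

The two schemes disagree here: two weaker estimators outvote the first one under majority rule, but the weighted vote sides with the more trusted estimator.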

### **Training Base Estimators**

We can select from combinations of different base estimators, such as a logistic regression model in combination with a decision tree. We could additionally select models of the same learning algorithms, but with different parameters, such as multiple decision trees with varying depths. The number of estimators is arbitrary, so it’s good practice to explore how different combinations behave.
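
As a small sketch of the "same algorithm, different parameters" option, the snippet below builds several decision trees that differ only in depth. The dictionary name `depth_varied_estimators` and the chosen depths are illustrative assumptions, not part of the original setup.

```python
from sklearn.tree import DecisionTreeClassifier

# One decision tree per depth, keyed by a descriptive name
depth_varied_estimators = {
    f"tree_depth_{depth}": DecisionTreeClassifier(max_depth=depth, random_state=42)
    for depth in (2, 4, 8)
}

# scikit-learn stacking estimators expect a list of (name, estimator) pairs
estimator_list = list(depth_varied_estimators.items())
print([name for name, _ in estimator_list])
```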

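This introduces a problem, however. If the base estimators generate features for the final model by predicting on the same data they were trained on, those predictions will look more accurate than they would on unseen data. This puts our model at risk of overfitting. To avoid this, we use K-Fold Cross Validation as described next.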

![image](images/stacking.png)

To assemble our ensemble, we’ll store the base estimators in a dictionary named `level_0_estimators`. Our final estimator, `level_1_estimator`, will be a Random Forest. Notice also how we prepare names for the new features (columns) that will be added to our training dataset in `level_0_columns`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rand_state = 42

level_0_estimators = dict()
level_0_estimators["logreg"] = LogisticRegression(random_state=rand_state)
level_0_estimators["forest"] = RandomForestClassifier(random_state=rand_state)

level_0_columns = [f"{name}_prediction" for name in level_0_estimators.keys()]

level_1_estimator = RandomForestClassifier(random_state=rand_state)
```

### **K-Fold Cross Validation**

Consider 10 segments (or folds) and a stacking model that uses a logistic regression model and a decision tree model. Each estimator can be trained using data from 9 of the segments, and make predictions on the excluded 10th segment. We then append the predictions as new features to that 10th segment. Now 1/10th of the training data has two new features: one is the prediction made by the logistic regression model and the other is the prediction made by the decision tree model.

We want to do the same with the other 9 segments, so we rotate the excluded segment and repeat this process until all training data points are augmented with new features. The end result is a prediction made on each training sample, without having seen the sample during the training process.
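
The rotation described above can be written out by hand. The following is a minimal sketch on synthetic data (`make_classification` stands in for a real dataset, and all variable names are hypothetical); it shows the mechanics only and is not how scikit-learn implements stacking internally.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_estimators = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
demo_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One column of out-of-fold predictions per base estimator
oof_predictions = np.full((X_demo.shape[0], len(demo_estimators)), np.nan)

for train_idx, holdout_idx in demo_kfold.split(X_demo, y_demo):
    for col, estimator in enumerate(demo_estimators.values()):
        # Train on 9 folds, predict only on the held-out fold
        fitted = clone(estimator).fit(X_demo[train_idx], y_demo[train_idx])
        oof_predictions[holdout_idx, col] = fitted.predict(X_demo[holdout_idx])

print(oof_predictions.shape)  # (200, 2)
```

After the loop every row has been filled exactly once, by models that never saw that row during training.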


Handling our k-fold cross-validation is fairly straightforward using `sklearn.model_selection.StratifiedKFold` from the scikit-learn library. The resulting `kfold` object is later passed to the `StackingClassifier` through its `cv` parameter.

```python
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=rand_state)
```


### **Feature Augmentation**

In our stacking setup, the base estimators need to be trained to make predictions on our training data. The prediction of each estimator will be appended to the corresponding data sample as a new feature. We thus augment the training data set with this additional information. The augmented training set is used by our later-stage stacking model to make the final prediction.

![image](images/feature_augmentation.png)

In summary, suppose our training dataset has 10,000 samples and 10 features, and we select 3 base estimators. Using k-fold cross-validation, we would obtain a prediction from each base estimator for every training sample, so each sample will have 3 predictions. These 3 predictions are appended to the pre-existing 10 features. This leaves us with 10,000 training samples with 13 features each.


| Sample | Logistic Regression prediction | Decision Tree prediction | pH | Conductivity | Turbidity | Potability |
| :----- | :----------------------------- | :----------------------- | :------- | :------------ | :-------- | :---------- |
| 1 | 0 | 1 | 8.316766 | 363.266516 | 4.628771 | 1 |
| 2 | 0 | 0 | 9.092223 | 398.410813 | 4.075075 | 0 |
| 3 | 1 | 1 | 5.584087 | 280.467916 | 2.559708 | 1 |
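
The shape arithmetic from the summary can be checked with a short NumPy sketch. The arrays are zero-filled placeholders (no real data), and the variable names are hypothetical.

```python
import numpy as np

n_samples, n_features, n_base_estimators = 10_000, 10, 3

original_features = np.zeros((n_samples, n_features))
base_predictions = np.zeros((n_samples, n_base_estimators))

# Append one prediction column per base estimator to the original features
augmented = np.hstack([original_features, base_predictions])
print(augmented.shape)  # (10000, 13)
```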

### **sklearn.ensemble.StackingClassifier**
With our augmented training set, we’re ready to prepare the final model that will make the official prediction from our stacking setup.

This later-stage model is a learning algorithm that we select just like the base estimators. We may even reuse an algorithm from an earlier stage here. The purpose of this model is to learn the proper weighting of the earlier estimators, given that our training samples now include a prediction from each estimator. Some estimators may perform better than others, so our overall model should account for this.

The only difference in the training process is that the model will be designed to accept samples of the augmented size rather than the given size. This means that if the given data set has n features and we use m base estimators, this model will require n + m features on each sample.

And that’s it! Once the later-stage model is trained, we feed testing data samples into the base estimators, which will append their predictions to the data sample. This sample is then given to the later-stage model for our final prediction.
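
The full flow, from out-of-fold training features to a final prediction on test data, can be sketched by hand as follows. This is a simplified illustration on synthetic data using `cross_val_predict`; the variable names are hypothetical and the later-stage model is a logistic regression chosen only for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Training: out-of-fold predictions become m extra columns (n + m features)
oof_columns = [cross_val_predict(model, X_tr, y_tr, cv=5) for model in base_models]
meta_train = np.column_stack([X_tr] + oof_columns)
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# Inference: base models (refit on all training data) extend each test sample
for model in base_models:
    model.fit(X_tr, y_tr)
test_columns = [model.predict(X_te) for model in base_models]
meta_test = np.column_stack([X_te] + test_columns)

final_predictions = meta_model.predict(meta_test)
print(meta_train.shape, meta_test.shape)  # (225, 8) (75, 8)
```

Note that the later-stage model sees the same number of columns at training and inference time: the 6 original features plus one prediction per base estimator.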


```python
import pandas as pd
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
    estimators=list(level_0_estimators.items()),
    final_estimator=level_1_estimator,
    passthrough=True,  # keep the original features alongside the predictions
    cv=kfold,
    stack_method="predict_proba",
)

# Assumes X_train is a pandas DataFrame, so that X_train.columns is available
df = pd.DataFrame(
    stacking_clf.fit_transform(X_train, y_train),
    columns=level_0_columns + list(X_train.columns),
)
```

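Calling `fit_transform` on our classifier handles a lot of the heavy lifting. It trains our level-0 base estimators, makes cross-validated predictions on the training set to train the `level_1_estimator`, and returns the training set augmented with a prediction from each base estimator. Note that, in scikit-learn, the predictions returned by `fit_transform` come from the base estimators after they have been refit on the full training set, while the out-of-fold predictions are used internally to train the final estimator. Let’s see how the resulting training dataset looks: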


| Sample | logreg\_prediction | forest\_prediction | pH | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic\_carbon | Trihalomethanes | Turbidity |
| :----- | :----------------- | :----------------- | :------- | :------- | :---------- | :---------- | :---------- | :------------ | :------------- | :--------------- | :-------- |
| 0 | 0.417408 | 0.17 | 9.927024 | 208.490738 | 19666.992792 | 8.008618 | 340.237824 | 482.842435 | 11.360427 | 85.829113 | 4.051733 |
| 1 | 0.379019 | 0.26 | 8.769676 | 215.368742 | 13969.438863 | 7.548543 | 322.799070 | 369.016667 | 18.919188 | 54.755214 | 3.776718 |
| 2 | 0.409582 | 0.87 | 8.077261 | 125.302719 | 23931.282833 | 8.773162 | 317.693331 | 398.328789 | 15.279583 | 62.668356 | 4.279871 |
| 3 | 0.407893 | 0.73 | 9.739562 | 166.948864 | 13623.160063 | 7.235922 | 385.059134 | 369.591289 | 12.322604 | 68.505852 | 2.568080 |
| 4 | 0.399452 | 0.04 | 5.343075 | 211.662091 | 45166.912141 | 6.651801 | 279.767500 | 485.959717 | 19.682337 | 70.546862 | 4.240032 |



Finally, with our full model trained, we can make predictions. Let’s compare how our stacking classifier performs against a lone logistic regression model and a lone random forest, the same algorithms we used as base estimators!

```python
from sklearn.metrics import accuracy_score

# Accuracy of the full stacking model on the held-out test set
y_test_pred = stacking_clf.predict(X_test)
stacking_accuracy = accuracy_score(y_test, y_test_pred)

# Baseline 1: logistic regression on its own
vanilla_logistic_regression = LogisticRegression(random_state=rand_state).fit(X_train, y_train)
lr_accuracy = accuracy_score(y_test, vanilla_logistic_regression.predict(X_test))

# Baseline 2: random forest on its own
vanilla_random_forest = RandomForestClassifier(random_state=rand_state).fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, vanilla_random_forest.predict(X_test))

print(f'Stacking accuracy: {stacking_accuracy:.4f}')
print(f'Logistic Regression accuracy: {lr_accuracy:.4f}')
print(f'Random Forest accuracy: {rf_accuracy:.4f}')
```

### **Limitations**

Stacking is very powerful in that we remove the occasionally difficult choice of which learning algorithm to use for our problem. Depending on the use case, this benefit does come with some limitations worth noting:
- Because we use an arbitrary number of learning algorithms, training an entire stacking model is computationally expensive. The same is true at inference time in deployment, since every base estimator must make a prediction before the final model can.
- Such a large model with many parameters needs a large amount of data for proper training. Small datasets won’t see significant gains with stacking.
- Stacking models typically yield only marginal gains over the best single estimator used for the same problem. When successful, a stacked model may reduce error by 2% or less.

Stacking offers some creativity and open-endedness in how we want to build our model. A vanilla stacking model may have one tier of models to contribute to the final prediction. We could alternatively construct a multi-tier approach in which one level of models feeds into a later stage of models before making a final prediction.
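
One way to sketch a multi-tier setup in scikit-learn is to pass a `StackingClassifier` as the `final_estimator` of another `StackingClassifier`. The example below is a minimal illustration on synthetic data; the choice of estimators, the small forest size, and `cv=3` are assumptions made to keep it quick to run, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Second tier: a stacking model that consumes the first tier's predictions
second_tier = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# First tier: base estimators whose predictions feed the second tier
multi_tier_clf = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("shallow_tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
    ],
    final_estimator=second_tier,
    cv=3,
)

multi_tier_clf.fit(X_demo, y_demo)
multi_tier_predictions = multi_tier_clf.predict(X_demo)
print(multi_tier_predictions.shape)  # (200,)
```

Each extra tier multiplies the number of cross-validated fits, so the computational cost noted in the limitations above grows quickly with depth.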

# **Extra Reading**

https://medium.com/@hassaanidrees7/gradient-boosting-vs-random-forest-which-ensemble-method-should-you-use-9f2ee294d9c6

https://blog.gopenai.com/how-bayesian-ml-model-tuning-avoids-overfitting-hyperoptimized-gradient-boosting-hgboost-in-9ef68f719b07

https://pub.towardsai.net/bagging-vs-boosting-the-power-of-ensemble-methods-in-machine-learning-6404e33524e6

https://pub.towardsai.net/reinforcement-learning-enhanced-gradient-boosting-machines-77457e8cb4d9

https://medium.com/code-applied/time-series-secrets-prediction-model-with-lightgbm-572a689e974f

https://medium.com/@vikashsinghy2k/xgboost-for-regression-in-python-build-train-and-evaluate-your-model-19eec4ac2f74

https://medium.com/chat-gpt-now-writes-all-my-articles/a-new-gradient-boosting-method-from-standford-research-ngboost-python-code-included-875004ef3f47

https://medium.com/@kylejones_47003/boosting-stacking-and-bagging-for-ensemble-models-for-time-series-analysis-with-python-d74ab9026782

