### **Bagging (Bootstrap Aggregating) 🛍️**  
Bagging is an **ensemble learning** technique that **reduces variance**, and with it the tendency to overfit, by training multiple models on bootstrap samples of the training data (random samples drawn with replacement) and combining their predictions.
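
To make the variance claim concrete, here is a standard result, stated under a simplifying assumption that the notebook itself does not make: the $B$ models' predictions $\hat f_b(x)$ each have variance $\sigma^2$ and every pair has correlation $\rho$.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as $B$ grows while the first does not, which suggests that bagging helps most when the individual models are high-variance and only weakly correlated.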

### **How It Works ⚙️**  
1️⃣ **Bootstrap Sampling** → Random samples, typically the same size as the training set, are drawn with replacement.  
2️⃣ **Train Models** → An independent model (often a decision tree) is trained on each sample.  
3️⃣ **Aggregate Predictions** →  
   - **Classification** → Majority voting 🗳️  
   - **Regression** → Averaging 📊  
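
The three steps above can be sketched by hand. This is a minimal illustration rather than how scikit-learn implements bagging: the helper names `fit_bagged_trees` and `predict_majority` are hypothetical, the data is synthetic, and class labels are assumed to be integers `0..K-1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_models=25, seed=0):
    """Steps 1 and 2: draw a bootstrap sample per model and fit a tree on it."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    models = []
    for _ in range(n_models):
        # Bootstrap sample: n_rows row indices drawn with replacement
        idx = rng.integers(0, n_rows, size=n_rows)
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_majority(models, X):
    """Step 3 (classification): majority vote; assumes integer labels 0..K-1."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # (n_models, n_rows)
    n_classes = int(votes.max()) + 1
    # Count the votes per class for every row -> shape (n_classes, n_rows)
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


# Synthetic binary data (illustrative only, not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit, X_new, y_new = X[:200], y[:200], X[200:], y[200:]

models = fit_bagged_trees(X_fit, y_fit)
preds = predict_majority(models, X_new)
accuracy = float((preds == y_new).mean())
print(f"held-out accuracy: {accuracy:.3f}")
```

For regression, the vote would be replaced by `np.mean` over the stacked predictions. An odd `n_models` avoids ties in the binary case.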

### **Example Models Using Bagging**  
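✅ **Random Forest** (Bagging + Decision Trees 🌲, with a random subset of features considered at each split)  
✅ **BaggingClassifier** (scikit-learn's general-purpose wrapper; accepts any base estimator)  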
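
As a rough side-by-side of the two options, the sketch below fits both on synthetic data (an assumption; it is not the notebook's dataset). The variable names and settings are illustrative, and the scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Bagging with an explicit base model (passed positionally as the first argument)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random forest: bagged trees plus a random subset of features at each split
forest = RandomForestClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagged trees", bagged_trees), ("random forest", forest)]:
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```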

🚀 **Often improves accuracy & stability over a single model!**

In [None]:
import warnings
import pandas as pd 
import numpy as np 

# models 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier 
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier 

# model selection 
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV 

# Bagging 
from sklearn.ensemble import BaggingClassifier

# metric 
from sklearn.metrics import accuracy_score, confusion_matrix

# visualization 
import matplotlib.pyplot as plt 
import seaborn as sns

warnings.filterwarnings('ignore')

In [3]:
# import dataset 

df = pd.read_csv('./preprocessedData/cleanData.csv')

In [4]:
independent = df.drop(columns = 'target')

# Select the target as a 1-D Series; a one-column DataFrame triggers a column_or_1d warning in fit()
dependent = df['target']

In [5]:
x_train, x_test, y_train, y_test = train_test_split(independent, dependent, test_size = 0.3, stratify = dependent, random_state = 42)

### Bagging over AdaBoost

In [6]:
# Random forest hyperparameter grid

rf_params = {
    "n_estimators": [100, 200, 500],  
    "max_depth": [10, 20, 30], 
    "min_samples_split": [2, 5, 10, 20, 30], 
    "max_features": ['sqrt', 'log2'], 
    "min_samples_leaf": [1, 2, 5], 
    "bootstrap": [True, False] 
}

In [7]:
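# 5-fold grid search over rf_params; keep the best estimator (refit on the full training set)
rf_search = GridSearchCV(estimator = RandomForestClassifier(), param_grid = rf_params, cv = 5, n_jobs = -1)
rf_model = rf_search.fit(x_train, y_train).best_estimator_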

In [8]:
rf_model

In [9]:
ada_boost = AdaBoostClassifier(
    estimator = rf_model, # `estimator` replaces `base_estimator`, deprecated since scikit-learn 1.2
    n_estimators = 50,
    learning_rate = 1.0,
    random_state = 42
)

In [10]:
# Bagging: 10 AdaBoost ensembles, each fit on a bootstrap sample

bagging_model = BaggingClassifier(estimator=ada_boost, 
                                  n_estimators=10, 
                                  random_state=42)

In [11]:
bagging_model.fit(x_train, y_train)

In [12]:
y_pred = bagging_model.predict(x_test)

In [13]:
accuracy_score(y_test, y_pred)

0.8583333333333333

In [14]:
# Training-set predictions, kept separate from the test predictions in y_pred
y_train_pred = bagging_model.predict(x_train)

In [15]:
accuracy_score(y_train, y_train_pred)

0.975
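
The gap between training accuracy (0.975) and test accuracy (about 0.858) suggests the ensemble still overfits somewhat. One way to estimate generalization without touching the test set is the out-of-bag (OOB) score, which `BaggingClassifier` exposes through `oob_score=True` and the fitted `oob_score_` attribute. The sketch below uses synthetic data and a plain decision tree as the base model (both assumptions, chosen to keep it fast), not the notebook's AdaBoost setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

oob_model = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    oob_score=True,  # score each row using only the models that did not train on it
    random_state=0,
)
oob_model.fit(X, y)

train_acc = oob_model.score(X, y)  # optimistic: every row was seen by most models
oob_acc = oob_model.oob_score_     # closer to a held-out estimate
print(f"train accuracy: {train_acc:.3f}, OOB accuracy: {oob_acc:.3f}")
```

OOB scoring requires `bootstrap=True` (the default) and enough estimators that every row is left out of at least one sample.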

### Bagging over Gradient Boosting

In [16]:
# Gradient boosting base model

gb_model = GradientBoostingClassifier(
    n_estimators=100,  
    learning_rate=0.1,  
    max_depth=3, 
    min_samples_split=2,  
    min_samples_leaf=1,  
    random_state=42
)

In [17]:
# Bagging: 10 gradient-boosting models, each fit on a bootstrap sample

bagging_model = BaggingClassifier(estimator=gb_model, 
                                  n_estimators=10, 
                                  random_state=42)

In [18]:
bagging_model.fit(x_train, y_train)

In [19]:
y_pred = bagging_model.predict(x_test)

In [20]:
accuracy_score(y_test, y_pred)

0.85

In [21]:
# Training-set predictions, kept separate from the test predictions in y_pred
y_train_pred = bagging_model.predict(x_train)

In [22]:
accuracy_score(y_train, y_train_pred)

0.9857142857142858

### Bagging over XGBoost

In [None]:
xgb_model = XGBClassifier(
    n_estimators=100, 
    learning_rate=0.1, 
    max_depth=3,  
    min_child_weight=1,  
    subsample=0.8,  
    colsample_bytree=0.8,
    random_state=42,
    eval_metric="logloss"  
)

In [None]:
# Bagging: 15 XGBoost models, each fit on a bootstrap sample

bagging_model = BaggingClassifier(estimator=xgb_model, 
                                  n_estimators=15, 
                                  random_state=42)

In [None]:
bagging_model.fit(x_train, y_train)

In [None]:
y_pred = bagging_model.predict(x_test)
accuracy_score(y_test, y_pred)

0.85

In [None]:
# Training-set predictions, kept separate from the test predictions in y_pred
y_train_pred = bagging_model.predict(x_train)
accuracy_score(y_train, y_train_pred)

0.9535714285714286
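
The fit, predict, and score pattern is repeated above for each base model. As a sketch, it could be wrapped in a small helper. `train_test_accuracy` is a hypothetical name, the data is synthetic, and the estimator counts are reduced for speed, so the printed numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_test_accuracy(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return (
        accuracy_score(y_tr, model.predict(X_tr)),
        accuracy_score(y_te, model.predict(X_te)),
    )


# Synthetic data (illustrative only, not the notebook's dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
split = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

base_models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=20, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=20, random_state=42),
}

results = {}
for name, base in base_models.items():
    # Base model passed positionally as BaggingClassifier's first argument
    bagged = BaggingClassifier(base, n_estimators=5, random_state=42)
    results[name] = train_test_accuracy(bagged, *split)
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```

Reporting both numbers side by side makes the train/test gap for each base model easier to compare.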