Ensemble Techniques

1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.

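   Ensemble Learning is a technique in machine learning where we combine multiple models (called “base learners”, or “weak learners” when each one is only slightly better than chance) to build a more accurate and robust final model. The key idea is that different models make different errors, so combining their predictions lets many of those errors cancel out.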
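
The effect of combining models can be illustrated with a small calculation. The sketch below assumes a binary classification task, an odd number of models, and models whose errors are fully independent. Real models never fully satisfy the independence assumption, so the numbers are an upper bound on the benefit rather than a prediction. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is correct with the same probability p_correct and
    that the models' errors are independent (an idealization).
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each individual model is right only 60% of the time.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises steadily as more models are added, even though no single model improves.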

Common Ensemble Methods:

   Bagging (Bootstrap Aggregating)

   Boosting

   Stacking

2. What is the difference between Bagging and Boosting?

   **Bagging (Bootstrap Aggregating):**

   Trains multiple models independently and in parallel on different random subsets of the training data.

   Uses sampling with replacement to create these subsets.

   The main goal is to reduce variance and prevent overfitting.

   Final prediction is made by averaging (for regression) or majority voting (for classification).

   Example algorithm: Random Forest.


   **Boosting:**

   Trains models sequentially, where each new model tries to fix the errors made by the previous models.

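   Uses the entire dataset in every round, but changes what each new model focuses on: AdaBoost gives more weight to misclassified samples, while Gradient Boosting fits each new model to the remaining errors (residuals) of the current ensemble.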

   The main goal is to reduce bias and improve accuracy.

   Final prediction is made by combining models using a weighted sum, giving more importance to better models.

   Example algorithms: AdaBoost, Gradient Boosting, XGBoost.
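
The contrast above can be tried out with scikit-learn. This is a minimal sketch on synthetic data, not a benchmark: the dataset from `make_classification` and all parameter values are assumptions chosen for illustration. Both ensembles use their default base estimators (a full decision tree for `BaggingClassifier`, a depth-1 decision stump for `AdaBoostClassifier`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Bagging: 50 full decision trees, each fitted independently on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: 50 decision stumps, fitted one after another on reweighted data.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
boosting_scores = cross_val_score(boosting, X, y, cv=5)

print("Bagging  mean CV accuracy:", bagging_scores.mean())
print("Boosting mean CV accuracy:", boosting_scores.mean())
```

Which of the two scores higher depends on the data; the point of the sketch is only that bagging starts from strong, high-variance trees while boosting starts from weak, high-bias stumps.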

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
   
   **Bootstrap Sampling:**

   Bootstrap sampling is a technique where we create multiple random samples from the original dataset by sampling with replacement.

   **Role in Bagging:**

   Each model (like each decision tree in a Random Forest) is trained on a different bootstrap sample of the data.

   Because every model sees a slightly different dataset, they learn different patterns and make different errors.

   When we combine all the models’ predictions (by averaging or majority vote), the overall result becomes more stable and accurate.
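
Bootstrap sampling itself takes only a few lines of NumPy. The sketch below uses a toy dataset of 10 row indices (an assumption for illustration) and draws three bootstrap samples of the same size, so that repeated and missing rows are easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # toy dataset: 10 rows, identified by their index

# Draw 3 bootstrap samples, each the same size as the data, with replacement.
bootstrap_samples = [
    rng.choice(data, size=data.size, replace=True) for _ in range(3)
]

for i, sample in enumerate(bootstrap_samples, start=1):
    unique_rows = np.unique(sample).size
    print(f"sample {i}: {np.sort(sample)}  distinct rows: {unique_rows}")
```

Each sample typically repeats some rows and leaves others out, which is what gives every model in the ensemble a slightly different view of the data.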

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

   **Out-of-Bag (OOB) Samples:**

   In Bagging (as in Random Forest), each model is trained on a bootstrap sample: a random sample drawn with replacement from the original dataset.
   As a result, each model’s sample contains only about 63% of the distinct data points. The remaining 37% or so, which that model never saw during training, are its Out-of-Bag (OOB) samples.

   **OOB Score:**

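   The OOB Score is computed by predicting each training point using only the models whose bootstrap sample did not contain it, then scoring those predictions (accuracy for classification) over the whole training set.

   It gives a validation-style estimate of the model’s performance without a separate hold-out set, similar to what you’d get from cross-validation.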
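
The sketch below checks both points numerically: where the 63% figure comes from, and how the OOB score compares with accuracy on a held-out test set. It assumes the Breast Cancer dataset and a 70/30 split purely for illustration; `oob_score=True` and the `oob_score_` attribute are part of scikit-learn's `RandomForestClassifier`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Chance that a given row appears at least once in a bootstrap sample of size n.
# As n grows this approaches 1 - 1/e, which is about 0.632.
n = len(X_train)
expected_in_bag = 1 - (1 - 1 / n) ** n

# oob_score=True scores each training row using only the trees that never saw it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)

print("Expected in-bag fraction:", round(expected_in_bag, 3))
print("OOB accuracy:            ", round(oob_accuracy, 3))
print("Held-out test accuracy:  ", round(test_accuracy, 3))
```

The OOB accuracy is obtained from the training data alone, so it can be compared against the test accuracy as a sanity check.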

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

   **In a Single Decision Tree:**

   Feature importance is based on how much a feature reduces impurity (like Gini impurity or entropy) when it’s used to split the data.

   The more a feature helps in correctly classifying data (i.e., reduces impurity), the higher its importance score.

   Since the tree is built on one dataset, the importance values can be unstable: small changes in the data can produce a different tree and very different importance scores.


   **In a Random Forest:**

   A Random Forest builds many decision trees on different bootstrap samples of data and averages their results.

   Feature importance is calculated by averaging the importance of each feature across all trees.

   This process gives a more stable and reliable estimate of which features are truly important, although impurity-based importances can still favor features with many distinct values.

   It reduces the effect of randomness or overfitting that might happen in a single tree.
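
The comparison can be made concrete by looking at how much the importance of a feature varies from tree to tree inside a forest. This is a minimal sketch assuming the Breast Cancer dataset; the variable names `per_tree` and `spread` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One row of importances per tree in the forest: shape (n_trees, n_features).
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)  # how much each feature's importance varies across trees

# Show the forest's top 5 features next to the single tree's scores.
top = np.argsort(forest.feature_importances_)[::-1][:5]
for i in top:
    print(
        f"{data.feature_names[i]:<22}"
        f" single tree: {tree.feature_importances_[i]:.3f}"
        f"  forest: {forest.feature_importances_[i]:.3f}"
        f"  (std across trees: {spread[i]:.3f})"
    )
```

A large standard deviation across trees shows how much a single tree's importance score depends on the particular sample it was trained on; the forest's averaged value smooths that out.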

6. Write a Python program to:

● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.


In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the data into a DataFrame so feature names stay attached to the columns
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Fit a Random Forest; the fixed seed makes the importances reproducible
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Impurity-based importances (one per feature), sorted from most to least important
importances = model.feature_importances_
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

print("Top 5 Most Important Features:")
print(feature_importance.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


7. Write a Python program to:

● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: a single decision tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_acc = accuracy_score(y_test, dt_pred)

# Bagging: 50 decision trees, each trained on a bootstrap sample.
# The `estimator` argument requires scikit-learn 1.2+ (older versions call it `base_estimator`).
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)
bag_acc = accuracy_score(y_test, bag_pred)

print("Accuracy of Single Decision Tree:", dt_acc)
print("Accuracy of Bagging Classifier:", bag_acc)


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


8. Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

# Keep the test set out of the search so the final accuracy is an honest estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate values: 3 x 4 = 12 combinations
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [2, 4, 6, None]
}

# 5-fold cross-validation on the training set for every combination
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': 2, 'n_estimators': 150}
Final Accuracy: 1.0


9. Write a Python program to:

● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset

● Compare their Mean Squared Errors (MSE)


In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# The dataset is downloaded on first use
data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging: 50 regression trees, each trained on a bootstrap sample
bag_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
bag_reg.fit(X_train, y_train)
bag_pred = bag_reg.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

# Random Forest: with the regressor's default max_features=1.0, every split
# considers all features, so this is close to plain bagging of trees
rf_reg = RandomForestRegressor(n_estimators=50, random_state=42)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Mean Squared Error (Bagging Regressor):", bag_mse)
print("Mean Squared Error (Random Forest Regressor):", rf_mse)


Mean Squared Error (Bagging Regressor): 0.25787382250585034
Mean Squared Error (Random Forest Regressor): 0.25772464361712627


10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.

You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.

   **Answer**

   **1. Choosing between Bagging and Boosting:**

If the model is overfitting or data is noisy, use Bagging (e.g., Random Forest) to reduce variance and make the model stable.

If the model is underfitting and we need higher accuracy, use Boosting (e.g., XGBoost, AdaBoost) to reduce bias and improve performance.

**2. Handling Overfitting:**

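Use cross-validation to detect overfitting, and constrain tree complexity with regularization parameters like max_depth, min_samples_leaf, etc.

In Boosting, use a small learning rate and early stopping, which also limits the number of trees (n_estimators).

In Bagging, adding more trees generally does not cause overfitting, so control tree depth instead. In both cases, remove irrelevant features.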
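
The boosting controls mentioned above map onto scikit-learn parameters roughly as follows. This is a sketch under stated assumptions: the imbalanced synthetic dataset stands in for real loan data, and every parameter value is a placeholder that would need tuning. `validation_fraction`, `n_iter_no_change`, and the fitted attribute `n_estimators_` belong to `GradientBoostingClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # small steps, so each tree changes the model slightly
    max_depth=3,              # shallow trees limit model complexity
    min_samples_leaf=20,      # avoid leaves fitted to a handful of customers
    validation_fraction=0.1,  # internal hold-out used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

trees_used = gb.n_estimators_  # number of trees actually fitted
print("Trees used:", trees_used, "of a maximum of 500")
print("Test accuracy:", gb.score(X_test, y_test))
```

If `trees_used` is well below the maximum, early stopping has cut training short because the validation loss stopped improving.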

**3. Selecting Base Models:**

Use Decision Trees as base learners for both Bagging and Boosting.

Try multiple ensemble models like Random Forest, XGBoost, or CatBoost and choose the one giving the best cross-validated score on a metric that suits the problem (for example, AUC-ROC).

**4. Evaluating Performance using Cross-Validation:**

Use Stratified K-Fold Cross-Validation to keep the class balance of defaulters and non-defaulters.

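Measure performance using Precision, Recall, F1-Score, and AUC-ROC in addition to Accuracy. When defaulters are a small minority, accuracy alone is misleading, because a model that predicts "no default" for everyone still scores highly.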
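
A cross-validated comparison along these lines might look like the sketch below. The data is synthetic (about 10% positives, an assumption standing in for real default rates), the two models use mostly default settings, and the `results` dictionary is just an illustrative way to collect the scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the share of defaulters the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "precision", "recall", "f1"]

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key "test_<metric>".
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Looking at recall and precision side by side makes the trade-off explicit: missing a defaulter and rejecting a good customer carry different costs, and the model choice should reflect that.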


**5. How Ensemble Learning Improves Decision-Making:**

Combines multiple models → gives more accurate and stable predictions.

Reduces variance (mainly through Bagging) or bias (mainly through Boosting), leading to better generalization on new customers.

Helps financial institutions detect high-risk customers more reliably, reducing loan losses.

