# **PW SKILLS**
## ENSEMBLE LEARNING ASSIGNMENT

## **Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.**

Answer:
Ensemble Learning is a powerful technique in machine learning where multiple models (often called “weak learners”) are combined to produce a stronger predictive model. The central idea is that instead of relying on a single model, combining the predictions of multiple models helps reduce errors, improve accuracy, and generalize better to unseen data.

The logic behind ensemble learning is similar to the concept of “wisdom of the crowd.” Just as collective decisions made by a group are often more reliable than an individual’s judgment, the combined predictions of several models usually perform better than a single model.

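Key advantages of ensemble learning include:

- **Reduction of variance** – Averaging many models, as in bagging, smooths out the fluctuations of individual models, which reduces overfitting and makes predictions more stable.
- **Reduction of bias** – Combining weak learners sequentially, as in boosting, lowers the overall bias and improves accuracy.
- **Improved generalization** – Ensembles usually generalize better to unseen test data than their individual members.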

Common ensemble techniques are Bagging (Bootstrap Aggregating), Boosting, and Stacking. For example, Random Forest uses bagging with decision trees, while AdaBoost and XGBoost use boosting techniques.

Thus, the key idea of ensemble learning is: “A group of weak learners, when combined in the right way, can form a strong learner that typically outperforms each of its individual members.”
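
The “wisdom of the crowd” effect can be illustrated with a small simulation. This is only a sketch under a strong assumption: the models make errors independently of each other, which real models trained on the same data rarely do, so the gain in practice is usually smaller. The ensemble size, per-model accuracy, and trial count are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25        # hypothetical ensemble size
p_correct = 0.6      # hypothetical accuracy of each individual model
n_trials = 100_000   # number of simulated prediction tasks

# Each row is one prediction task; each column is one model.
# True means that model predicted the correct label (independent errors assumed).
correct = rng.random((n_trials, n_models)) < p_correct

# The ensemble is right when more than half of the models are right.
majority_correct = correct.sum(axis=1) > n_models / 2

single_acc = correct[:, 0].mean()
ensemble_acc = majority_correct.mean()

print(f"Single model accuracy:  {single_acc:.3f}")
print(f"Majority-vote accuracy: {ensemble_acc:.3f}")
```

Under these assumptions the majority vote is clearly more accurate than any one model, even though every member is only slightly better than random guessing.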

## **Question 2: What is the difference between Bagging and Boosting?**

Answer:
Bagging and Boosting are two major ensemble learning techniques, but they differ in approach and objectives:

**Definition:**

- Bagging (Bootstrap Aggregating): Creates multiple subsets of the dataset using bootstrap sampling, trains a separate model on each subset, and combines their predictions (e.g., Random Forest).
- Boosting: Builds models sequentially, where each new model focuses on correcting the errors of the previous ones (e.g., AdaBoost, XGBoost).

**Training Approach:**

- Bagging trains models independently, so they can be trained in parallel.
- Boosting trains models sequentially.

**Error Handling:**

- Bagging reduces variance (helps avoid overfitting).
- Boosting reduces bias (helps improve weak learners).

**Weights on Data:**

- Bagging treats all observations equally.
- Boosting emphasizes hard examples: AdaBoost assigns higher weights to misclassified samples, while gradient boosting methods such as XGBoost fit each new model to the errors that remain.

**Performance:**

- Bagging is more stable and less prone to overfitting.
- Boosting often gives higher accuracy but risks overfitting if not tuned properly.

**Example:**

- Random Forest → Bagging technique.
- AdaBoost/XGBoost → Boosting techniques.

In short, Bagging focuses on stability and variance reduction, while Boosting focuses on accuracy and bias reduction.
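
To make the idea of reweighting misclassified samples concrete, here is a minimal sketch of a single AdaBoost-style weight update on a hypothetical toy example. The labels and the weak learner's predictions are invented for illustration, and the update shown is the classic binary AdaBoost rule, not the exact code path of any particular library.

```python
import numpy as np

# Hypothetical toy setup: 10 samples with labels in {-1, +1}.
y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
# Predictions of one weak learner; it gets samples 0, 5 and 6 wrong.
y_pred = np.array([-1, 1, 1, 1, 1, 1, 1, -1, -1, -1])

weights = np.full(len(y_true), 1 / len(y_true))  # start with equal weights
misclassified = y_true != y_pred

# Weighted error of the weak learner and its vote strength (alpha).
error = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase weights of misclassified samples, decrease the rest, renormalize.
weights = weights * np.exp(-alpha * y_true * y_pred)
weights /= weights.sum()

print("Weighted error:", round(error, 3))
print("Learner weight (alpha):", round(alpha, 3))
print("Weight of a misclassified sample:", round(weights[0], 4))
print("Weight of a correctly classified sample:", round(weights[1], 4))
```

After the update, the misclassified samples together carry half of the total weight, so the next weak learner cannot do well without handling them. Bagging has no such step: every bootstrap sample is drawn with equal probability for each observation.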

## **Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

Answer:
Bootstrap Sampling is a statistical technique where multiple subsets of data are created from the original dataset by sampling with replacement. This means the same data point can appear multiple times in one subset, while some data points may not appear at all.

In Bagging methods (like Random Forest):

- Each base learner (e.g., a decision tree) is trained on a different bootstrap sample of the dataset.
- Because each learner sees slightly different data, the models become diverse.
- The final prediction is obtained by aggregating the learners' outputs (majority vote for classification or averaging for regression).

Why it is important in Random Forest:

- Ensures diversity among decision trees.
- Reduces correlation between models.
- Leads to better generalization and reduced overfitting.

For example, in Random Forest, if we create 100 trees, each tree is trained on a bootstrap sample of the dataset. This makes the ensemble stronger and more robust compared to a single decision tree.
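
Sampling with replacement can be sketched in a few lines of NumPy. The dataset here is just a hypothetical array of 10 row indices; a real bagging implementation would use the drawn indices to select training rows for one base learner.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
indices = np.arange(n_samples)

# Draw n_samples indices WITH replacement: this is one bootstrap sample.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Indices that were never drawn are left out of this sample.
left_out = np.setdiff1d(indices, bootstrap_idx)

print("Original indices:", indices)
print("Bootstrap sample:", np.sort(bootstrap_idx))
print("Left out:        ", left_out)
```

The bootstrap sample has the same size as the original data, but some indices typically repeat and others are missing. Repeating the draw with a different seed gives each tree its own slightly different training set.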

## **Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?**

Answer:
When bootstrap sampling is performed in Bagging methods, some data points are left out of the training subset. These are called Out-of-Bag (OOB) samples.

Key points:

- On average, about 36.8% of the data (a fraction of roughly 1/e) is not included in a given bootstrap sample.
- These OOB samples can act as a validation set for the model.
- Each data point is predicted using only the base learners (e.g., trees in a Random Forest) whose bootstrap samples did not contain it.

OOB Score:

- It is the performance of the ensemble evaluated using only these OOB predictions.
- Provides a nearly unbiased estimate of generalization performance without needing a separate validation/test set.

Advantages:

- Saves computation time and data (no need for separate cross-validation).
- Gives a reliable performance measure during model training.

For example, in a Random Forest classifier, the OOB score is often used as a quick evaluation metric to judge accuracy.
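
As a rough sketch, the snippet below computes the expected left-out fraction for a bootstrap sample and then reads the OOB score that scikit-learn reports when `oob_score=True` is set. The choice of the Breast Cancer dataset and of 200 trees is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n = X.shape[0]

# Probability that a given row is never picked in n draws with replacement.
# This approaches 1/e (about 0.368) as n grows.
left_out_fraction = (1 - 1 / n) ** n

# oob_score=True asks scikit-learn to score each row using only the trees
# that did not see that row during training.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print(f"Expected left-out fraction: {left_out_fraction:.3f}")
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

No train/test split is needed here: the OOB accuracy is computed on the training data, but every prediction comes from trees that never saw the row being predicted.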

## **Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.**

Answer:
In a Single Decision Tree:

- Feature importance is calculated based on how much a feature decreases impurity (e.g., Gini impurity or entropy) when used for splitting.
- The importance value depends heavily on the structure of that specific tree.
- It can be misleading, as one strong feature may dominate the tree structure, while others may be ignored.

In a Random Forest (Ensemble of Trees):

- Feature importance is averaged across multiple decision trees.
- Each feature’s importance is measured by its average contribution to reducing impurity across all trees.
- Provides a more stable and reliable ranking compared to a single tree.
- Reduces the variance of the estimate, since many trees grown on different samples and feature subsets contribute to the calculation. Impurity-based importance can still favor features with many distinct values.

Comparison:

- Single Tree → Importance may be unstable and highly dataset-dependent.
- Random Forest → More robust, generalized, and reliable measure of importance.

Thus, Random Forest provides a better and more balanced estimate of feature importance than a single Decision Tree.
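
The difference can be seen by fitting both models on the same data and comparing their `feature_importances_`. This is a minimal sketch using the Breast Cancer dataset as an assumed example; the exact numbers depend on the random seed and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# A single tree only credits the features it happened to split on.
print("Features with non-zero importance (tree):  ", int((tree_imp > 0).sum()))
print("Features with non-zero importance (forest):", int((forest_imp > 0).sum()))

# Importance of every feature in every individual tree of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
max_std = per_tree.std(axis=0).max()
print(f"Largest std of a feature's importance across trees: {max_std:.3f}")
```

The single tree assigns zero importance to every feature it never split on, while the forest spreads importance over more features. The per-tree spread shows how much any one tree's ranking can differ from the averaged one.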

## **Question 6: Python program – Random Forest on Breast Cancer dataset & top 5 important features**

Answer:
Random Forest is an ensemble of decision trees that ranks features by how much each one reduces impurity, averaged across all trees. Using the Breast Cancer dataset, we can identify which features are most important for predicting whether a tumor is malignant or benign.

```python
# Question 6
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print top 5 important features
print("Top 5 Important Features:")
print(feature_importance_df.head(5))
```

Output:

```
Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850
```


## **Question 7: Python program – Bagging Classifier vs Decision Tree on Iris dataset**

Answer:
Bagging can improve model performance by combining multiple Decision Trees trained on bootstrapped samples. We compare its accuracy with that of a single Decision Tree on a held-out test set.

```python
# Question 7
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_acc = accuracy_score(y_test, y_pred_dt)

# Bagging with Decision Trees
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, y_pred_bag)

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)
```

Output:

```
Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
```
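
Both models reach 1.0 on this particular split, so a single train/test split cannot separate them on a dataset as small and easy as Iris. One possible way to get a more informative comparison is to average over several folds. The sketch below assumes 10-fold cross-validation, which is a choice made here and not part of the original question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

dt = DecisionTreeClassifier(random_state=42)
# The base estimator is passed positionally so the call does not depend on
# the keyword name, which changed between scikit-learn versions.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# For classifiers, an integer cv uses stratified folds.
dt_scores = cross_val_score(dt, X, y, cv=10)
bag_scores = cross_val_score(bagging, X, y, cv=10)

print(f"Decision Tree CV accuracy: {dt_scores.mean():.3f} +/- {dt_scores.std():.3f}")
print(f"Bagging CV accuracy:       {bag_scores.mean():.3f} +/- {bag_scores.std():.3f}")
```

On a dataset this simple the two scores are expected to stay close. The benefit of bagging tends to show more clearly on noisier data where a single deep tree overfits.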


## **Question 8: Python program – Random Forest with GridSearchCV for hyperparameter tuning**

Answer:
Random Forest performance depends on hyperparameters like max_depth and n_estimators. GridSearchCV evaluates every combination in a parameter grid with cross-validation and keeps the best-scoring one.

```python
# Question 8
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    test_size=0.3, random_state=42)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None]
}

# Grid Search
grid = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best model
best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)
final_acc = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", final_acc)
```

Output:

```
Best Parameters: {'max_depth': 10, 'n_estimators': 200}
Final Accuracy: 0.9707602339181286
```


## **Question 9: Python program – Bagging Regressor vs Random Forest Regressor (California Housing dataset)**

Answer:
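For regression, ensemble methods like Bagging and Random Forest reduce prediction errors. We compare their Mean Squared Error (MSE) on a held-out test set. Note that the two models below use different numbers of trees (50 vs. 100), so the comparison is not strictly like-for-like.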

```python
# Question 9
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    test_size=0.3, random_state=42)

# Bagging Regressor (default base estimator is a decision tree)
bag_reg = BaggingRegressor(n_estimators=50, random_state=42)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Bagging Regressor MSE:", mse_bag)
print("Random Forest Regressor MSE:", mse_rf)
```

Output:

```
Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395
```


## **Question 10: Real-world case – Ensemble Learning for Loan Default Prediction**

Answer:
In a financial institution, predicting loan default is crucial. Using ensemble methods improves reliability and reduces risk.

Step-by-Step Approach:

1. **Choice between Bagging or Boosting:**
   - If the data is noisy and individual models tend to overfit (high variance) → Bagging (Random Forest).
   - If accuracy and handling complex relationships matter → Boosting (XGBoost).
2. **Handling Overfitting:**
   - Use techniques like limiting tree depth (max_depth), early stopping, and regularization.
   - Use cross-validation to tune parameters.
3. **Selecting Base Models:**
   - Decision Trees are common base learners.
   - Logistic Regression and Gradient Boosting can also be combined in stacking.
4. **Performance Evaluation:**
   - Use k-fold cross-validation to measure accuracy, precision, recall, and ROC-AUC. Defaults are usually the minority class, so accuracy alone can be misleading.
   - OOB score can be used in Random Forest.
5. **Justification of Ensemble Learning in Banking:**
   - Reduces risk of wrong predictions.
   - Improves decision-making by combining strengths of multiple models.
   - Gives more stable predictions in loan approval systems. Fairness still has to be checked separately, since combining models does not by itself remove bias present in the data.

```python
# Question 10
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Simulated dataset (replace with real banking data)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient Boosting (Boosting method)
gb = GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)

print("Loan Default Prediction Accuracy:", accuracy_score(y_test, y_pred))
```

Output:

```
Loan Default Prediction Accuracy: 0.8833333333333333
```
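
The evaluation step described in the approach (k-fold cross-validation with accuracy, precision, recall, and ROC-AUC) could be sketched as follows. The data is again simulated; the 90/10 class balance, the 5 folds, and the model settings are assumptions chosen to mimic a setting where defaults are rare, not values from a real loan portfolio.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Simulated, imbalanced stand-in for loan data: roughly 10% positives ("default").
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

# Stratified folds keep the share of defaults similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "roc_auc"]
scores = cross_validate(gb, X, y, cv=cv, scoring=metrics)

for metric in metrics:
    values = scores[f"test_{metric}"]
    print(f"{metric:>9}: {values.mean():.3f} +/- {values.std():.3f}")
```

With rare defaults, a model can reach high accuracy while missing many actual defaulters, so recall and ROC-AUC are usually the more informative numbers to report alongside accuracy.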
