#Ensemble Learning | Assignment

#Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.

Answer:  

Ensemble Learning is a machine learning technique in which multiple individual models, called base learners, are combined to build a single, more powerful predictive model. Instead of relying on one model, ensemble learning aggregates the predictions of several models to improve overall performance.


The key idea behind ensemble learning is that a group of diverse models can correct each other’s errors. When the models make different mistakes, combining their predictions (by majority voting for classification or averaging for regression) cancels many of those mistakes out. As a result, an ensemble typically achieves higher accuracy and better generalization, and is often less prone to overfitting, than any of its individual models.
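As a minimal sketch of the aggregation step, the snippet below applies majority voting and averaging to hard-coded predictions. The three models and all of their outputs are hypothetical values chosen for illustration, with each classifier making one mistake on a different sample.

```python
import numpy as np

# Hypothetical true labels and predictions from three base classifiers
# (made-up values: each model is wrong on exactly one, different sample)
y_true = np.array([1, 0, 1, 1, 0])
preds = np.array([
    [0, 0, 1, 1, 0],  # model A, wrong on sample 0
    [1, 1, 1, 1, 0],  # model B, wrong on sample 1
    [1, 0, 1, 1, 1],  # model C, wrong on sample 4
])

# Majority vote for binary labels: predict 1 when more than half the models say 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)

individual_acc = (preds == y_true).mean(axis=1)
ensemble_acc = (majority == y_true).mean()

# Averaging for regression: hypothetical predictions from three regressors
reg_preds = np.array([
    [10.0, 20.0],
    [12.0, 18.0],
    [14.0, 22.0],
])
averaged = reg_preds.mean(axis=0)

print("Individual accuracies:", individual_acc)
print("Majority-vote accuracy:", ensemble_acc)
print("Averaged regression predictions:", averaged)
```

Because no two models err on the same sample here, the vote recovers every label. Real models are rarely this complementary, so the gain is usually smaller.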

#Question 2: What is the difference between Bagging and Boosting?

Answer: Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques, but they differ in how models are trained and combined.

Bagging trains multiple models independently on different bootstrap samples drawn randomly from the training dataset. All models are given equal importance, and their predictions are combined using averaging (for regression) or majority voting (for classification). Bagging mainly helps in reducing variance and overfitting.

Boosting, on the other hand, trains models sequentially. Each new model focuses on the examples that the previous models handled poorly: AdaBoost increases the weights of misclassified samples, while gradient boosting methods such as XGBoost fit each new model to the errors that remain. Final predictions are made using a weighted combination of all models. Boosting mainly helps in reducing bias and improving accuracy.


| Aspect           | Bagging                  | Boosting           |
| ---------------- | ------------------------ | ------------------ |
| Training style   | Independent              | Sequential         |
| Data sampling    | Random bootstrap samples | Reweighted samples |
| Model importance | Equal                    | Weighted           |
| Main goal        | Reduce variance          | Reduce bias        |
| Example          | Random Forest            | AdaBoost, XGBoost  |


#Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

Answer:

Bootstrap sampling is a resampling technique in which multiple training datasets are created by randomly sampling data points from the original dataset with replacement. As a result, some data points may appear multiple times in a sample, while others may not appear at all.

In bagging methods like Random Forest, each decision tree is trained on its own bootstrap sample of the data. This introduces diversity among the trees and helps in lowering variance. Random Forest additionally considers only a random subset of features at each split, which further reduces the correlation between trees. By combining predictions from many independently trained trees, Random Forest achieves better accuracy and robustness compared to a single decision tree.
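The effect of sampling with replacement can be checked numerically. The sketch below assumes a hypothetical dataset of 10,000 rows identified only by their indices. For large n, the chance that a given row is never drawn is (1 - 1/n)^n ≈ 1/e ≈ 0.368, so roughly 63.2% of the distinct rows should appear in each bootstrap sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000  # hypothetical dataset size
indices = np.arange(n_samples)

# A bootstrap sample has the same size as the original and is drawn with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Count how often each row was drawn
counts = np.bincount(bootstrap_idx, minlength=n_samples)
unique_fraction = np.count_nonzero(counts) / n_samples
left_out_fraction = 1 - unique_fraction

print(f"Rows appearing at least once: {unique_fraction:.3f}")
print(f"Rows never drawn:             {left_out_fraction:.3f}")
print(f"Most copies of a single row:  {counts.max()}")
```

Repeating this draw once per tree gives every tree a slightly different view of the data, which is the source of diversity in bagging.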

#Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Answer:
Out-of-Bag (OOB) samples are the data points that are not selected during bootstrap sampling when training individual models in bagging-based ensemble methods such as Random Forest. For a dataset of n points, the probability that a given point is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368, so on average roughly one-third of the original data is left out of each bootstrap sample and becomes OOB data for that model.

The OOB score is an internal validation method that evaluates an ensemble without a separate validation set. Each data point is predicted using only the models for which it was an OOB sample, and the aggregated predictions are compared with the true labels. Because no model is scored on data it was trained on, the OOB score provides a nearly unbiased estimate of the model’s generalization performance.
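A minimal sketch of OOB evaluation with scikit-learn follows. The dataset, the split, and `n_estimators=200` are assumptions made for illustration. The held-out test set is included only to show how close the OOB estimate comes to a conventional evaluation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point
rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

With enough trees, every training point is out-of-bag for some of them, so the OOB score comes at almost no extra cost compared with k-fold cross-validation.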

#Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
Answer:

Feature importance analysis helps identify which input features contribute most to a model’s predictions. The way feature importance is calculated and interpreted differs between a single Decision Tree and a Random Forest.

In a single Decision Tree, feature importance is based on how much a feature reduces impurity (such as Gini Impurity or Entropy) across the splits where it is used. Because only one tree is fitted and its splits are chosen greedily, the importance values can be unstable: a small change in the data may change which features are split on, and features that are never used receive an importance of zero.

In a Random Forest, feature importance is calculated by averaging the impurity reduction across all trees in the forest. Because Random Forest uses multiple trees trained on different bootstrap samples and random feature subsets, the resulting feature importance scores are more stable, reliable, and less prone to overfitting.


| Aspect                    | Decision Tree           | Random Forest              |
| ------------------------- | ----------------------- | -------------------------- |
| Number of models          | Single tree             | Multiple trees             |
| Stability                 | Low (sensitive to data) | High (averaged over trees) |
| Overfitting               | More likely             | Less likely                |
| Reliability of importance | Lower                   | Higher                     |
| Generalization            | Weaker                  | Stronger                   |
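The comparison above can be made concrete with a short sketch on the Breast Cancer dataset. The dataset and hyperparameters are illustrative assumptions. The code counts how many features each model assigns a non-zero importance and measures how much each feature's importance varies across the individual trees of the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Features with non-zero impurity-based importance in each model
n_used_tree = int(np.sum(tree.feature_importances_ > 0))
n_used_forest = int(np.sum(forest.feature_importances_ > 0))

# Importance of every feature in every individual tree of the forest
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])
importance_std = per_tree.std(axis=0)

print(f"Features used by the single tree: {n_used_tree} of {X.shape[1]}")
print(f"Features used by the forest:      {n_used_forest} of {X.shape[1]}")

# Top 5 forest features with their tree-to-tree spread
for i in np.argsort(forest.feature_importances_)[::-1][:5]:
    print(f"{names[i]:<25} mean={forest.feature_importances_[i]:.3f} "
          f"std={importance_std[i]:.3f}")
```

The per-tree standard deviation shows how much any single tree's ranking can differ from the forest average, which is the instability that averaging smooths out.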


#Question 6: Write a Python program to:

● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()

● Train a Random Forest Classifier

● Print the top 5 most important features based on feature importance scores.

(Include your Python code and output in the code box below.)

In [1]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Create DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort and select top 5 features
top_5_features = feature_importance_df.sort_values(
    by='Importance', ascending=False
).head(5)

# Print top 5 important features
print("Top 5 Most Important Features:")
print(top_5_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


#Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset

● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)

Answer:

In [2]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Train Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_pred = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_pred)

# Print accuracies
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


#Question 8: Write a Python program to:

● Train a Random Forest Classifier

● Tune hyperparameters max_depth and n_estimators using GridSearchCV

● Print the best parameters and final accuracy
(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10]
}

# GridSearchCV with 3-fold cross-validation (a small cv keeps the search fast)
grid = GridSearchCV(
    rf,
    param_grid,
    cv=3,
    scoring='accuracy'
)

# Train
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Prediction
y_pred = best_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 50}
Final Accuracy: 0.9649122807017544


#Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset

● Compare their Mean Squared Errors (MSE)
(Include your Python code and output in the code box below.)


Answer:

In [8]:
# The `estimator=` argument used below requires scikit-learn 1.2 or newer
%pip install -U scikit-learn




In [11]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)

# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)

# Calculate Mean Squared Error
bagging_mse = mean_squared_error(y_test, bagging_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25592438609899626
Random Forest Regressor MSE: 0.2553684927247781


#Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:

● Choose between Bagging or Boosting

● Handle overfitting

● Select base models

● Evaluate performance using cross-validation

● Justify how ensemble learning improves decision-making in this real-world
context.

Answer:

**1. Choosing between Bagging or Boosting**

To predict loan default, I would first analyze the data complexity and error patterns.

Bagging is useful when the base model has high variance and tends to overfit, such as decision trees.

Boosting is preferred when the goal is to improve accuracy by focusing on difficult and misclassified cases.

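Since loan default prediction is a high-risk, imbalanced problem where misclassifying defaulters is costly, Boosting would be my first choice because it reduces bias and tends to improve predictive performance. I would still compare it against a Bagging baseline such as Random Forest, since Boosting is more sensitive to noisy labels and outliers.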

**2. Handling Overfitting**

Overfitting can be controlled by:

● Limiting tree depth and number of estimators

● Using regularization parameters (learning rate, subsampling)

● Applying early stopping (in boosting methods)

● Validating model performance using cross-validation

These techniques help ensure that the model generalizes well to unseen customer data.
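The controls listed above can be sketched with scikit-learn's `GradientBoostingClassifier`. No real loan data is available here, so the example assumes a synthetic, imbalanced dataset from `make_classification` as a stand-in (about 10% positive "default" cases). All hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults"
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may fit fewer trees
    learning_rate=0.05,       # shrinkage: each tree contributes a small step
    max_depth=3,              # shallow trees limit model complexity
    subsample=0.8,            # each tree sees a random 80% of the training rows
    validation_fraction=0.1,  # share of training data held out to monitor progress
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=42,
)
gb.fit(X_train, y_train)

n_trees_fitted = gb.n_estimators_
test_accuracy = gb.score(X_test, y_test)

print("Trees actually fitted:", n_trees_fitted)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Accuracy is printed only as a quick sanity check. On imbalanced data it should be read alongside metrics such as recall and ROC-AUC.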

**3. Selecting Base Models**

Decision Trees are chosen as base models because:

● They capture non-linear relationships in financial data

● They handle feature interactions effectively

● They are easy to interpret individually, which is important in financial institutions

Shallow decision trees act as weak learners that can be combined effectively in ensemble methods.

**4. Evaluating Performance using Cross-Validation**

Cross-validation is used to assess model stability and robustness.

● Stratified K-fold cross-validation keeps the default rate similar in every fold and checks that the model performs consistently across different subsets of data.

● Evaluation metrics such as ROC-AUC, Recall, and Precision are used instead of accuracy to handle class imbalance.

This provides a reliable estimate of real-world performance.
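A hedged sketch of this evaluation setup is shown below. It again assumes a synthetic imbalanced dataset in place of real loan records, and a default `GradientBoostingClassifier` as the model under test. The fold count of 5 is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% of rows are "defaults"
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

# Stratified folds preserve the class ratio in every train/validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(random_state=42)

metrics = ["roc_auc", "recall", "precision"]
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Report the mean and spread of each metric across the folds
for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric:<10} {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large spread between folds would suggest that the model's performance depends heavily on which customers it was trained on, which is worth investigating before deployment.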

**5. Business Justification of Ensemble Learning**

Ensemble learning improves decision-making by:

● Increasing prediction accuracy for loan default

● Reducing financial risk by correctly identifying high-risk customers

● Supporting fair and data-driven loan approval decisions

● Enhancing trust and compliance through stable and reliable predictions

By combining multiple models, ensemble techniques deliver more robust and reliable results than single models in real-world financial applications.


