---

# Assignment Code: DA-AG-014


---


# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

- Answer:

Ensemble learning is a machine learning technique that combines the predictions of multiple models to create a stronger overall model. Instead of relying on a single learner, which may be weak on its own, ensemble methods bring together several learners to improve accuracy and robustness.

Key Idea:

The central idea is that a group of models working together usually performs better than any single one of them, provided their errors are not strongly correlated.

Each model (called a base learner) may make some mistakes, but when we combine their results through methods like bagging, boosting, or stacking, the errors are reduced.

It works on the “wisdom of the crowd” principle: asking many people for their opinion usually gives a more reliable answer than asking just one person.
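The "wisdom of the crowd" effect can be checked with a little arithmetic. The sketch below assumes a binary task and models whose errors are fully independent, each correct with probability 0.6. Both are simplifying assumptions (real base learners are correlated, so the gain is smaller), and the function name is made up for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Odd ensemble sizes avoid ties in the vote
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the majority-vote accuracy rises with the number of models, which is the intuition behind combining base learners.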



---

# Question 2: What is the difference between Bagging and Boosting?

- Answer:

Bagging and Boosting are two popular ensemble learning techniques, but they work differently.

Bagging (Bootstrap Aggregating):

Multiple models are trained independently of one another (so they can be trained in parallel) on different bootstrap samples of the data, i.e. random samples drawn with replacement.

Each model gives a prediction, and the final output is decided by majority vote (classification) or average (regression).

Example: Random Forest.

Goal: Reduce variance and avoid overfitting.

Boosting:

Models are trained sequentially, where each new model focuses on correcting the mistakes made by the previous ones.

The final prediction is a weighted combination of all models.

Example: AdaBoost, Gradient Boosting, XGBoost.

Goal: Reduce bias and improve accuracy.

In short:

Bagging → Parallel training, reduces variance.

Boosting → Sequential training, reduces bias.
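As a rough illustration of the two styles, the sketch below fits a bagging ensemble and a boosting ensemble on the Breast Cancer dataset with scikit-learn. The dataset, the ensemble size of 50 and the seed are arbitrary choices made here, and both classes are used with their default base learners, so the scores show the mechanics rather than a fair benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: 50 full decision trees (the default base learner), each fit
# independently on its own bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: 50 depth-1 trees (the default base learner), fit one after
# another, each reweighting the samples the previous ones got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```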


---

# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?



- Answer:

Bootstrap sampling is a technique where we create new datasets by randomly selecting samples with replacement from the original dataset. This means the same data point can appear more than once in a sample, while some points may not appear at all.

Role in Bagging (e.g., Random Forest):

In Bagging, multiple models are trained on different bootstrap samples of the data.

Since each sample is slightly different, each model learns slightly different patterns.

When we combine their predictions (by voting or averaging), the overall model becomes more stable and less prone to overfitting.
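A bootstrap sample is easy to draw by hand. The minimal sketch below uses only the standard library; the ten-point "dataset" and the seed are placeholders chosen for illustration.

```python
import random

data = list(range(10))   # ten training points, identified by index
rng = random.Random(0)   # fixed seed, chosen arbitrarily

# Draw len(data) points with replacement: one bootstrap sample
sample = rng.choices(data, k=len(data))
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:", sorted(sample))
print("never drawn:     ", left_out)
```

The sample has the same size as the original data, but some indices repeat and others are missing. The missing ones are what a bagging model never sees during training.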



---

# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

- Out-of-Bag (OOB) Samples

In bagging methods (like Random Forest), each tree is trained on a bootstrap sample (random sample with replacement) from the dataset.

A bootstrap sample has the same size as the original dataset but contains duplicates. On average, about 63% of the distinct training points appear in a given bootstrap sample, leaving around 37% of the data unused for that tree.

The unused data points for a given tree are called Out-of-Bag (OOB) samples.
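The 63% / 37% split follows from a short probability argument: one draw misses a given point with probability 1 - 1/n, so all n draws miss it with probability (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows. The sketch below evaluates this for a few sample sizes; the function name is made up for illustration.

```python
import math


def expected_in_bag_fraction(n: int) -> float:
    """Expected fraction of distinct points in a bootstrap sample of size n."""
    # Each point is missed by one draw with probability (1 - 1/n),
    # and by all n independent draws with probability (1 - 1/n) ** n.
    return 1 - (1 - 1 / n) ** n


for n in (10, 100, 1000, 100000):
    print(n, round(expected_in_bag_fraction(n), 4))

print("limit 1 - 1/e =", round(1 - math.exp(-1), 4))
```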

OOB Score

OOB samples act like a validation set for each tree.

For every data point, we can aggregate the predictions (majority vote for classification, average for regression) from only those trees for which the point was OOB.

The OOB score is the accuracy (or other performance metric) calculated using these OOB predictions.

Role in Model Evaluation

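The OOB score works like a built-in validation estimate, so a separate validation set is often unnecessary.

It usually gives a reasonable estimate of generalization error, although it can be slightly pessimistic because each point is scored by only the subset of trees that did not train on it.

It saves computation because no extra data split or k-fold cross-validation is required.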
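In scikit-learn the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below compares it with accuracy on a held-out split of the Breast Cancer dataset; the dataset, split size, number of trees and seed are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# oob_score=True asks the forest to score every training point using only
# the trees that did not see it (bootstrap=True is the default).
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_acc = forest.oob_score_
test_acc = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

If the two numbers are close, the OOB score is doing its job as a stand-in for a validation set.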



---

# Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

- Feature Importance in a Single Decision Tree vs. Random Forest

1. In a Single Decision Tree

How it works:

Feature importance is measured by how much each feature reduces impurity (e.g., Gini Index or Entropy) when it is used for splitting.

Each split contributes to reducing impurity. The total importance of a feature is the sum of the impurity reductions, each weighted by the share of samples reaching the node, over all nodes where that feature is used.

This importance is then normalized (so that all features sum to 1).
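The contribution of a single split can be worked out by hand. The sketch below computes the Gini impurity of a parent node and its two children, then the sample-weighted impurity decrease that would be credited to the splitting feature. The toy labels and helper names are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Sample-weighted impurity reduction credited to the splitting feature."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


parent = [0, 0, 0, 0, 1, 1, 1, 1]             # 4 samples of each class
left, right = [0, 0, 0, 1], [0, 1, 1, 1]      # an imperfect split

decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print("Gini(parent):", gini(parent))
print("Impurity decrease:", decrease)
```

Summing such decreases over every node that splits on a feature, then dividing by the total over all features, gives the normalized importance described above.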

Limitations:

Unstable: A small change in the dataset can lead to a very different tree, changing which features appear important.

Biased toward high-cardinality features: Features with many unique values (e.g., ID numbers) may appear artificially important.

2. In a Random Forest

How it works:

Feature importance is computed across all trees in the forest.

For each feature, its contribution to impurity reduction is averaged across trees.

This results in more stable and reliable estimates of importance compared to a single tree.

Feature importance for a Random Forest can also be estimated with permutation importance, a model-agnostic method that measures how much prediction accuracy decreases when a feature’s values are randomly shuffled.

Advantages:

More robust and stable (less variance due to ensemble averaging).

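Less sensitive to the quirks of any one tree, although impurity-based importance remains biased toward high-cardinality features even in a forest; permutation importance on held-out data is less affected.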

Provides a clearer ranking of features’ contributions to predictions.
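The two ways of measuring importance can be put side by side. The sketch below uses `sklearn.inspection.permutation_importance` on a held-out split of the Breast Cancer dataset; the split, seed and `n_repeats=10` are arbitrary choices made here, and the two rankings need not agree.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importance: computed from the training data during fitting.
impurity_rank = forest.feature_importances_.argsort()[::-1][:5]

# Permutation importance: drop in test accuracy when one column is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
permutation_rank = perm.importances_mean.argsort()[::-1][:5]

print("Top 5 by impurity:   ", list(data.feature_names[impurity_rank]))
print("Top 5 by permutation:", list(data.feature_names[permutation_rank]))
```

On this dataset many features are strongly correlated, so shuffling one of them may cost little accuracy; that is a known caveat of permutation importance rather than a sign that the feature is useless.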



---


# Question 6: Write a Python program to:
● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [1]:
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importance scores
importances = clf.feature_importances_

# Create DataFrame for feature importance
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
})

# Sort and display top 5
top_5 = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

print("Top 5 Important Features in Breast Cancer Dataset:")
print(top_5)


Top 5 Important Features in Breast Cancer Dataset:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


---

# Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree


In [4]:
# Bagging Classifier vs Single Decision Tree on Iris (version-robust)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn import __version__ as skver

print("scikit-learn version:", skver)

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 1) Single Decision Tree
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_tree = dtree.predict(X_test)
tree_acc = accuracy_score(y_test, y_pred_tree)

# 2) Bagging with Decision Trees (handles both old/new sklearn APIs)
base_tree = DecisionTreeClassifier(random_state=42)
try:
    # New API (sklearn >= 1.2)
    bagging = BaggingClassifier(
        estimator=base_tree,
        n_estimators=50,
        random_state=42,
        n_jobs=-1
    )
except TypeError:
    # Old API (sklearn <= 1.1)
    bagging = BaggingClassifier(
        base_estimator=base_tree,
        n_estimators=50,
        random_state=42,
        n_jobs=-1
    )

bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bag_acc = accuracy_score(y_test, y_pred_bag)

print("Accuracy of Single Decision Tree:", round(tree_acc, 4))
print("Accuracy of Bagging Classifier:", round(bag_acc, 4))



scikit-learn version: 1.6.1
Accuracy of Single Decision Tree: 0.9333
Accuracy of Bagging Classifier: 0.9333
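The two accuracies above are identical, which is not surprising: with 45 test samples, each misclassified flower moves accuracy by roughly 2.2 percentage points. A cross-validated comparison gives a less coarse picture. The sketch below is one way to do that; the 10-fold setup and seeds are assumptions made here, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    "Single Decision Tree": DecisionTreeClassifier(random_state=42),
    # BaggingClassifier uses a decision tree as its default base learner,
    # so no estimator argument is needed here.
    "Bagging (50 trees)": BaggingClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {cv_scores[name].mean():.4f} +/- {cv_scores[name].std():.4f}")
```

Iris is small and easy, so the gap between the two models may still be within noise.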


---

# Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "n_estimators": [50, 100, 150, 200]
}

# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Best model
best_rf = grid_search.best_estimator_

# Predictions
y_pred = best_rf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Final Accuracy on Test Set:", accuracy)


Best Parameters: {'max_depth': 2, 'n_estimators': 100}
Final Accuracy on Test Set: 0.9
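Beyond the single best setting, `GridSearchCV` keeps the score of every combination in `cv_results_`, which helps judge whether the winner is clearly better or just ahead by noise. The sketch below uses a deliberately smaller grid than the one above so that it runs quickly; the grid values are chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A deliberately small grid so the sketch runs quickly
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, None], "n_estimators": [50, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# One row per parameter combination, best mean CV accuracy first
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_n_estimators", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)

print(results.to_string(index=False))
print("Best CV accuracy:", round(search.best_score_, 4))
```

When several rows sit within one standard deviation of each other, the simpler or cheaper setting is usually a reasonable pick.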


---

# Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
● Compare their Mean Squared Errors (MSE)


In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

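# Bagging Regressor (default n_estimators=10; the Random Forest below
# defaults to 100 trees, so this comparison is not like-for-like)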
bagging_reg = BaggingRegressor(random_state=42)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# Random Forest Regressor
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Results
print("Mean Squared Error - Bagging Regressor:", mse_bagging)
print("Mean Squared Error - Random Forest Regressor:", mse_rf)

if mse_rf < mse_bagging:
    print("✅ Random Forest Regressor performs better.")
else:
    print("✅ Bagging Regressor performs better.")


Mean Squared Error - Bagging Regressor: 0.2824242776841025
Mean Squared Error - Random Forest Regressor: 0.2553684927247781
✅ Random Forest Regressor performs better.


---

# Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world context.

- Theory Part

Bagging Regressor

Bagging (Bootstrap Aggregating) is an ensemble learning technique.

It trains multiple base learners (e.g., Decision Trees) on different bootstrap samples of the dataset.

Predictions are then averaged (for regression) to reduce variance and improve accuracy.

Key idea: averaging many high-variance learners, each trained on a different bootstrap sample, gives a more stable predictor.

Random Forest Regressor

Random Forest is an extension of Bagging applied to decision trees.

It also trains multiple decision trees but with an extra step:

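At each split, it considers only a random subset of features (not all). In scikit-learn this is controlled by max_features; RandomForestRegressor defaults to max_features=1.0 (all features) in releases 1.1 and later, so a smaller value has to be set explicitly to get this effect.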

This reduces correlation between trees and improves generalization.

It is one of the most widely used ensemble methods for both regression and classification.
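The effect of per-split feature subsetting can be seen by varying `max_features`. The sketch below uses synthetic data from `make_friedman1` as a stand-in (an assumption made here so the example needs no download); with `max_features=1.0` the forest behaves like plain bagging of trees, while smaller values decorrelate the trees. Which setting wins depends on the data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in (10 features, 5 informative)
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mse_by_setting = {}
for max_features in (1.0, 0.5, "sqrt"):
    # 1.0 = every feature at every split (plain bagging of trees);
    # 0.5 and "sqrt" restrict each split to a random subset.
    model = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=0
    )
    model.fit(X_train, y_train)
    mse_by_setting[max_features] = mean_squared_error(y_test, model.predict(X_test))
    print(f"max_features={max_features!r}: test MSE = {mse_by_setting[max_features]:.3f}")
```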

In [8]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Bagging Regressor ---
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),   # 'estimator' in scikit-learn >= 1.2; older releases use 'base_estimator'
    n_estimators=100,
    random_state=42
)
bagging_model.fit(X_train, y_train)
y_pred_bag = bagging_model.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# --- Random Forest Regressor ---
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error (Bagging):", mse_bag)
print("Mean Squared Error (Random Forest):", mse_rf)


Mean Squared Error (Bagging): 0.25592438609899626
Mean Squared Error (Random Forest): 0.2553684927247781
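For the loan-default scenario in the question, the "choose between Bagging or Boosting" and "evaluate using cross-validation" steps could be sketched as follows. No loan dataset is given, so the data here is a hypothetical stand-in generated with `make_classification` (about 10% positives to mimic rare defaults), and the hyperparameters are illustrative starting points rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for loan data: 20 features, about 10% defaults
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

candidates = {
    # Bagging-style: deep trees, variance reduced by averaging
    "Random Forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42
    ),
    # Boosting: shallow trees added sequentially; a small learning rate
    # and row subsampling act as regularisation against overfitting
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, max_depth=3,
        subsample=0.8, random_state=42,
    ),
}

# Stratified folds keep the default rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc = {}
for name, model in candidates.items():
    auc[name] = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: ROC AUC = {auc[name].mean():.3f} +/- {auc[name].std():.3f}")
```

ROC AUC is used instead of accuracy because with rare defaults a model that always predicts "no default" would already score about 90% accuracy. On real loan data the choice between the two families would rest on these cross-validated scores together with the cost of missed defaults.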


---