In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

1.  What is Ensemble Learning in machine learning? Explain the key idea 
behind it.

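Ans- Ensemble learning is a machine learning technique in which several models (base learners) are trained on the same problem and their predictions are combined, for example by majority vote for classification or by averaging for regression. The key idea is that the errors of the individual models partly cancel out when they are combined, so the ensemble is usually more accurate and more robust than any single model.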
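
As a rough illustration of combining models, the sketch below averages the predicted probabilities of three different classifiers with scikit-learn's `VotingClassifier` and compares cross-validated accuracy. The choice of members, the Breast Cancer dataset, and the settings are assumptions made for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three different base learners (an arbitrary choice for illustration).
members = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
]

# Soft voting averages the class probabilities predicted by the members.
ensemble = VotingClassifier(estimators=members, voting="soft")

# Mean 5-fold accuracy of each member and of the combined model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in members}
scores["ensemble"] = cross_val_score(ensemble, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name:>8}: {score:.3f}")
```

The ensemble is not guaranteed to beat every member on every dataset; the benefit depends on the members making different kinds of errors.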

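2. What is the difference between Bagging and Boosting?

Ans- In bagging, each model is trained independently (so the models can be built in parallel) on its own bootstrap sample of the training data, and the predictions are combined by voting or averaging. This mainly reduces variance. In boosting, the models are built sequentially: each new model concentrates on the errors of the models before it, either by giving more weight to misclassified samples (AdaBoost) or by fitting the residual errors (gradient boosting). This mainly reduces bias.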

3.  What is bootstrap sampling and what role does it play in Bagging methods 
like Random Forest?

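Ans- Bootstrap sampling means drawing a sample of the same size as the training set by picking rows at random with replacement, so some rows appear more than once and others not at all. It is the key component of Bagging (Bootstrap Aggregating): each model is trained on a different bootstrap sample, and the results are combined for better accuracy. Random Forest uses bootstrapping to train many decision trees on different samples, which makes the trees less correlated, and then aggregates their predictions.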
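
The effect of sampling with replacement can be checked with a small NumPy sketch. For a sample of size n drawn from n rows, the expected fraction of distinct rows is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) for large n. The sample size and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of rows in the (hypothetical) training set

# One bootstrap sample: n row indices drawn with replacement.
indices = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample,
# and the fraction left out of it.
unique_fraction = np.unique(indices).size / n
left_out_fraction = 1 - unique_fraction

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Rows left out of the sample:           {left_out_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:             {1 - np.exp(-1):.3f}")
```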

4.  What are Out-of-Bag (OOB) samples and how is OOB score used to 
evaluate ensemble models? 

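Ans- Out-of-Bag (OOB) samples are the training rows that were not drawn into the bootstrap sample of a given tree (roughly 37% of the rows for each tree). Because that tree never saw them, they act as validation data for it. To compute the OOB score, each training row is predicted using only the trees for which it was out-of-bag, and these predictions are scored against the true labels. This gives an estimate of generalization performance without a separate validation set.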
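
In scikit-learn the OOB estimate is available by passing `oob_score=True` to the forest. The sketch below assumes the Breast Cancer dataset and 200 trees, both chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest score every training row using only
# the trees whose bootstrap sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

# Accuracy estimated from out-of-bag predictions (no separate validation set).
print(f"OOB score: {rf.oob_score_:.3f}")

# Per-row class probabilities from the out-of-bag trees.
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

With very few trees some rows may never be out-of-bag, so the estimate is more reliable when `n_estimators` is reasonably large.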

5. Compare feature importance analysis in a single Decision Tree vs. a 
Random Forest.

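Ans- In a single Decision Tree, the importance of a feature is the total impurity reduction from the splits that use it, computed from one tree trained on the whole data. This is unstable: a small change in the data can change the splits, and among correlated features only the one chosen for a split gets the credit. In a Random Forest, the same impurity-based importance is computed in every tree, each built on a different bootstrap sample with a random subset of features considered at each split, and then averaged over all trees. The averaged importances are more stable and robust.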
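
The difference can be seen by fitting both models on the same data and comparing `feature_importances_`. This is a sketch on the Breast Cancer dataset with default settings. The per-tree standard deviation is computed here only to show how much individual trees disagree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Spread of each feature's importance across the individual trees of the forest.
forest_std = np.std([t.feature_importances_ for t in forest.estimators_], axis=0)

# A single tree typically uses only some of the features; the forest spreads
# importance over more of them.
nonzero_tree = int(np.count_nonzero(tree_imp))
nonzero_forest = int(np.count_nonzero(forest_imp))
print(f"Features with non-zero importance: tree={nonzero_tree}, forest={nonzero_forest}")

# Top 5 features according to the forest, with the single tree's value alongside.
for i in np.argsort(forest_imp)[::-1][:5]:
    print(f"{data.feature_names[i]:<22} tree={tree_imp[i]:.3f} "
          f"forest={forest_imp[i]:.3f} +/- {forest_std[i]:.3f}")
```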

6.  Write a Python program to: 
● Load the Breast Cancer dataset using 
sklearn.datasets.load_breast_cancer() 
● Train a Random Forest Classifier 
● Print the top 5 most important features based on feature importance scores. 
(Include your Python code and output in the code box below.)

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

importances = clf.feature_importances_

feature_importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

top5 = feature_importance_df.sort_values(by="Importance", ascending=False).head(5)
top5.reset_index(inplace=True, drop=True)

print("Top 5 most important features:")
print(top5)

Top 5 most important features:
                Feature  Importance
0            worst area    0.139357
1  worst concave points    0.132225
2   mean concave points    0.107046
3          worst radius    0.082848
4       worst perimeter    0.080850


7.  Write a Python program to: 
● Train a Bagging Classifier using Decision Trees on the Iris dataset 
● Evaluate its accuracy and compare with a single Decision Tree

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# Results
print("Accuracy of Single Decision Tree:", acc_dt)
print("Accuracy of Bagging Classifier:", acc_bag)


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


8. Write a Python program to: 
● Train a Random Forest Classifier 
● Tune hyperparameters max_depth and n_estimators using GridSearchCV 
● Print the best parameters and final accuracy 

In [14]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier()
param_grid = {
    'n_estimators': list(range(100, 201, 25)),
    'max_depth': list(range(1, 10, 2)) + [None]
}

rf_grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    verbose=3,
    scoring='accuracy',
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

print("Best Parameters: ", rf_grid.best_params_)
print("Best Score: ", rf_grid.best_score_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Parameters:  {'max_depth': None, 'n_estimators': 200}
Best Score:  0.959746835443038
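
Note that `best_score_` is the mean cross-validated accuracy on the training folds, not accuracy on the held-out test set. A sketch of reporting the final test accuracy follows. It uses a smaller grid and a fixed `random_state`, both assumptions made to keep the example quick and reproducible, so its numbers will not match the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reduced grid for illustration only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ has been refit on all of
# X_train using the best parameters, so it can be scored on the test set.
test_acc = accuracy_score(y_test, grid.best_estimator_.predict(X_test))

print("Best Parameters:", grid.best_params_)
print(f"Mean CV accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy:    {test_acc:.3f}")
```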


9. Write a Python program to: 
● Train a Bagging Regressor and a Random Forest Regressor on the California 
Housing dataset 
● Compare their Mean Squared Errors (MSE)

In [16]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
dt = DecisionTreeRegressor()
bag = BaggingRegressor(
    estimator=dt,
    random_state=42,
    verbose=3
)

bag.fit(X_train, y_train)
y_pred_bag = bag.predict(X_test)

rf = RandomForestRegressor(
    n_estimators=10,
    random_state=42,
    verbose=3
)

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print(f"Bagging Model MSE: {mean_squared_error(y_test, y_pred_bag)}")
print(f"Random Forest Model MSE: {mean_squared_error(y_test, y_pred_rf)}")

Building estimator 1 of 10 for this parallel run (total 10)...
Building estimator 2 of 10 for this parallel run (total 10)...
Building estimator 3 of 10 for this parallel run (total 10)...
Building estimator 4 of 10 for this parallel run (total 10)...
Building estimator 5 of 10 for this parallel run (total 10)...
Building estimator 6 of 10 for this parallel run (total 10)...
Building estimator 7 of 10 for this parallel run (total 10)...
Building estimator 8 of 10 for this parallel run (total 10)...
Building estimator 9 of 10 for this parallel run (total 10)...
Building estimator 10 of 10 for this parallel run (total 10)...


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.9s finished
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10
Bagging Model MSE: 0.2824242776841025
Random Forest Model MSE: 0.28422271730492676


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    2.9s finished
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished
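
The two MSE values are nearly identical, which is expected: with its default `max_features=1.0`, `RandomForestRegressor` considers every feature at each split, so it behaves much like bagged decision trees. The sketch below varies `max_features` to show where a random forest starts to differ. It uses the small bundled Diabetes dataset instead of California Housing so that it runs without a download, and the values tried are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

mse_by_max_features = {}
for max_features in (1.0, 0.5, "sqrt"):
    rf = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=42
    )
    # scikit-learn reports negated MSE so that higher is better; flip the sign back.
    neg_mse = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error")
    mse_by_max_features[max_features] = -neg_mse.mean()

for max_features, mse in mse_by_max_features.items():
    print(f"max_features={max_features!s:<5} mean CV MSE: {mse:.1f}")
```

Which setting gives the lowest error depends on the dataset, so `max_features` is worth including in a hyperparameter search.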


10.  You are working as a data scientist at a financial institution to predict loan 
default. You have access to customer demographic and transaction history data. 
You decide to use ensemble techniques to increase model performance. 
Explain your step-by-step approach to: 
● Choose between Bagging or Boosting 
● Handle overfitting 
● Select base models 
● Evaluate performance using cross-validation 
● Justify how ensemble learning improves decision-making in this real-world 
context.

### Loan Default Prediction using Ensemble Learning

**Step 1: Choose between Bagging or Boosting**
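- **Bagging (Random Forest):** Reduces variance; a stable baseline that needs little tuning and tolerates noisy data well.
- **Boosting (XGBoost/LightGBM):** Reduces bias by sequentially correcting the errors of weak learners; often more accurate on tabular data, but more sensitive to noise and hyperparameters.
- **Choice:** Start with a Random Forest baseline and move to boosting only if it scores better under cross-validation.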

**Step 2: Handle Overfitting**
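- Use cross-validation (Stratified K-Fold) to detect overfitting: a large gap between training and validation scores is the warning sign.
- Apply regularization: limit max_depth and raise min_samples_split for the trees; use a small learning rate with early stopping for boosting.
- Handle class imbalance (SMOTE, class weights), applying any resampling to the training folds only so that the validation data stays untouched.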

**Step 3: Select Base Models**
- Decision Trees → common base learner (deep trees for bagging, shallow trees for boosting).
- Random Forest for baseline.
- XGBoost/LightGBM for final tuned model.

**Step 4: Evaluate Performance**
- Metrics: AUC-ROC, Precision, Recall, F1-score. Accuracy alone is misleading because defaults are rare.
- Use Stratified K-Fold CV so that every fold keeps the same default rate as the full dataset.

**Step 5: Why Ensemble Learning Helps**
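- Combines multiple models → bagging reduces variance, boosting reduces bias.
- Captures complex non-linear patterns in customer data.
- Provides robust and accurate results; feature importances help explain which factors drive the predictions, although an ensemble is harder to interpret than a single tree.
- Improves decision-making: fewer false approvals (missed defaults) and lower financial risk.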
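
The evaluation step can be sketched as follows. The data is synthetic (`make_classification` with roughly 10% positives standing in for defaults), and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost/LightGBM so that the example has no extra dependencies. All hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for loan data: about 10% "default" cases.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    # Bagging baseline; class_weight="balanced" compensates for the rare class.
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    # Boosting candidate with shallow trees and a moderate learning rate.
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

# Stratified folds keep the default rate the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    summary = ", ".join(f"{m}={v:.3f}" for m, v in results[name].items())
    print(f"{name}: {summary}")
```

On real loan data the comparison would be run on the actual features, and the decision threshold would be chosen from the relative cost of a missed default versus a rejected good customer.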
