Assignment Code: DA-AG-014



1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.

- Ensemble Learning in machine learning is a powerful technique where multiple models (called “weak learners” or “base models”) are combined to create a single stronger model (called an “ensemble”). The idea is that while an individual model may have limitations or errors, combining several models can reduce errors and improve accuracy, robustness, and generalization.

-  Key Idea Behind Ensemble Learning

 - The key idea is based on the principle that:

 - “A group of weak learners, when combined appropriately, can perform better than a single strong learner.”

 - Each model may capture different patterns or make different types of mistakes.

 - By aggregating their predictions (through averaging, voting, or weighted combinations), the overall error is reduced.

 - This works because diversity among models reduces the risk of all of them failing on the same data points.
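The effect of aggregation can be checked with a small calculation. The sketch below assumes an idealized setting that is stronger than anything stated above: every model is correct with the same probability and the models' errors are fully independent. Under that assumption the accuracy of a majority vote is a binomial tail probability. The function name `majority_vote_accuracy` is illustrative only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is right with probability p_correct and that
    errors are independent (an idealization; real models are correlated).
    """
    needed = n_models // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Odd ensemble sizes avoid tied votes.
for n in (1, 3, 25, 101):
    print(f"{n:>3} models -> vote accuracy {majority_vote_accuracy(n, 0.6):.4f}")
```

With three independent models that are each 60% accurate, the vote is right about 64.8% of the time, and the figure keeps rising as models are added. Real models make correlated errors, so the gain in practice is smaller, which is why diversity among the models matters.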

- Why Ensemble Works

 - Reduces Bias – combining models (like in boosting) can make the overall model more flexible and closer to the true relationship.

 - Reduces Variance – averaging results of many models (like in bagging) stabilizes predictions and reduces overfitting.

 - Improves Generalization – ensemble models usually perform better on unseen data compared to a single learner.

-  Common Ensemble Methods

 - Bagging (Bootstrap Aggregating):

 - Trains multiple models on different random samples of the training data.

 - Example: Random Forest (ensemble of decision trees).

 - Boosting:

 - Trains models sequentially, where each new model focuses on correcting the mistakes of the previous ones.

 - Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM.

 - Stacking:

 - Combines predictions of multiple models using a meta-model (a model that learns how to best combine outputs).


2. What is the difference between Bagging and Boosting?

- Bagging (Bootstrap Aggregating)

- Main Idea:

 - Train multiple models independently on different random subsets of data (using sampling with replacement).

 - Final prediction is made by majority voting (classification) or averaging (regression).

- Key Points:

 - Parallel training: Models are trained independently at the same time.

 - Data sampling: Each model gets a random bootstrapped dataset (some observations may repeat, some may be left out).

 - Reduces variance: Helps prevent overfitting by stabilizing predictions.

 - Base learners used: Usually fully grown decision trees, which have low bias but high variance.

 - Example: Random Forest.

-  Boosting

- Main Idea:

 - Train models sequentially, where each new model tries to correct the errors (misclassified points) of the previous models.

 - Final prediction is made by a weighted combination of all models.

- Key Points:

 - Sequential training: Later models depend on the errors of earlier models.

 - Focus on mistakes: Misclassified data points get higher weights so the next model focuses on them.

 - Primarily reduces bias (and often variance as well): Produces a strong learner from weak learners.

 - Weak learners used: Usually shallow decision trees (a depth-1 tree is called a stump).

 - Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost.
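The "focus on mistakes" step can be made concrete with one round of the classic AdaBoost weight update for binary classification. This is a minimal sketch under stated assumptions: the sample weights sum to 1, the weighted error lies strictly between 0 and 1, and the helper name `adaboost_reweight` is hypothetical. Gradient boosting methods work differently (they fit each new model to the gradient of the loss rather than reweighting samples).

```python
import math


def adaboost_reweight(weights, is_misclassified):
    """One round of the classic AdaBoost sample-weight update.

    weights          : current sample weights, assumed to sum to 1
    is_misclassified : booleans, True where the current weak learner erred
    Returns (alpha, new_weights); alpha is the learner's vote weight.
    """
    # Weighted error of the current weak learner (must be in (0, 1)).
    err = sum(w for w, miss in zip(weights, is_misclassified) if miss)
    alpha = 0.5 * math.log((1 - err) / err)

    # Raise the weight of mistakes, lower the weight of correct points.
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, is_misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5
missed = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, missed)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

With five equally weighted points and one mistake, the weighted error is 0.2, so alpha = 0.5 · ln(4) ≈ 0.693. After normalization the misclassified point carries weight 0.5 and each correctly classified point 0.125, so the next weak learner is pushed toward the point that was missed.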


3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?


- Bootstrap Sampling

 - Definition:
Bootstrap sampling is a random sampling technique with replacement, used to generate multiple datasets from the original training set.

 - If you have a dataset with N samples, you create a new dataset by randomly picking N samples from it, with replacement.

 - “With replacement” means the same data point can appear multiple times in the sample, while some points may be left out.

 -  Example:
Original dataset = {1, 2, 3, 4, 5}
Bootstrap sample (size 5) could be = {2, 5, 2, 1, 4}
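The example above can be reproduced with the standard library. This is a small sketch; the helper name `bootstrap_sample` and the seed are arbitrary choices, and the exact sample drawn depends on the seed.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)  # fixed seed so the run is repeatable
original = [1, 2, 3, 4, 5]
sample = bootstrap_sample(original, rng)

# Points that were never drawn for this particular sample.
left_out = sorted(set(original) - set(sample))

print("Bootstrap sample:", sample)
print("Points left out :", left_out)
```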

 - Role of Bootstrap Sampling in Bagging

- Bagging = Bootstrap Aggregating.
The bootstrap sampling step is what creates diversity among the base learners.

 - Here’s how it works in methods like Random Forest:

- Generate bootstrap datasets

 - From the original training data, create multiple random samples using bootstrap sampling.

 - Each base learner (e.g., a decision tree) gets a different dataset.

- Train base learners independently

 - Since the datasets are different, each tree learns slightly different patterns.

 - This ensures diversity (not all trees overfit in the same way).

- Aggregate predictions

 - In classification → take a majority vote.

 - In regression → take the average.

- Reduce variance

 - Individual decision trees are high variance models (they can overfit).

 - By averaging multiple diverse trees trained on bootstrap samples, Bagging reduces variance and improves generalization.

  Example with Random Forest

 - Suppose you have 1,000 training samples.

 - You want to build 100 trees.

- For each tree:

 - Randomly sample 1,000 records with replacement (so some appear multiple times, others may not appear at all).

 - Grow a decision tree on this dataset.

 - Final prediction = average of all 100 trees (regression) or majority vote (classification).

 - This is why Random Forest is powerful:

 - Bootstrap sampling → brings diversity.

 - Aggregation → cancels out noise/errors of individual trees.
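The three steps (bootstrap, train independently, aggregate) can be written out by hand. The sketch below is plain bagging of decision trees on the Iris dataset, not a full Random Forest: a real Random Forest also considers only a random subset of features at each split. The number of trees, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample = row indices drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree on that sample, independently of the others.
    tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: aggregate by majority vote over the trees' predictions.
all_votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
majority = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_votes.T]
)

manual_bagging_accuracy = float((majority == y_test).mean())
print("Manual bagging accuracy:", round(manual_bagging_accuracy, 3))
```

For regression, the vote in step 3 would be replaced by the mean of the trees' predictions.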



4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?


- Out-of-Bag (OOB) Samples

 - When we do bootstrap sampling in Bagging/Random Forest:

 - For a dataset with N samples, we draw N samples with replacement to create a bootstrap dataset.

 - On average, each bootstrap sample contains about 63.2% of the distinct original data points: the chance that a given point is never picked in N draws is (1 − 1/N)^N, which approaches 1/e ≈ 36.8% as N grows.

 - The remaining ~37% of data points are not included in that bootstrap sample.

 - These left-out points are called Out-of-Bag (OOB) samples.
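The 63% / 37% split follows from the probability that a given point is missed by all N draws, (1 − 1/N)^N, which tends to 1/e. The short sketch below checks the formula against a simulation; the function names and seed are illustrative.

```python
import math
import random


def oob_probability(n: int) -> float:
    """Chance that one specific point is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


def simulated_oob_fraction(n: int, rng: random.Random) -> float:
    """Fraction of n points left out of a single simulated bootstrap sample."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


rng = random.Random(0)
for n in (10, 100, 10_000):
    print(
        f"N={n:>6}  formula={oob_probability(n):.4f}  "
        f"simulated={simulated_oob_fraction(n, rng):.4f}"
    )
print("Limit 1/e =", round(math.exp(-1), 4))
```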

- How OOB Samples Are Used

 - For each tree in the ensemble, the data points that were not used to train that tree (OOB samples) act like a validation/test set.

 - The trained tree can be tested on its OOB samples to check prediction accuracy.

 - Since each data point is likely to be OOB for several trees (not all), we can aggregate the predictions from those trees for that point.

- OOB Score

 - The OOB score is an internal validation accuracy estimate for Bagging/Random Forest models.

- How it’s computed:

 - For each data point in the dataset:

 - Collect predictions only from the trees where this point was OOB.

 - Aggregate those predictions (majority vote for classification, average for regression).

 - Compare the aggregated prediction with the actual value.

 - Compute accuracy (or error) across all data points.

 - This accuracy is called the OOB Score.

- Advantages of OOB Score

 - No need for separate validation set → makes efficient use of the data.

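 - Built-in, approximately unbiased error estimate (often slightly pessimistic, because each point is scored by only a subset of the trees) → especially useful when the dataset is small.

 - Avoids the optimism of training accuracy → each point is scored only by trees that never saw it, which gives a realistic performance measure during training.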

- Example (Random Forest)

 - Suppose you build 500 trees on a dataset of 10,000 samples.

 - Each tree is trained on a bootstrap dataset of 10,000 draws, which contains ≈ 6,300 distinct samples.

 - The remaining ≈ 3,700 samples are OOB for that tree.

 - Each sample is likely to be OOB for about 37% of the trees (roughly 185 of the 500).

 - You use those predictions to calculate the OOB score (say, 92%).

 - This OOB score serves as a reliable estimate of test accuracy without needing cross-validation.
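In scikit-learn the OOB score is available by passing `oob_score=True` to the forest and reading `oob_score_` after fitting. The sketch below compares it with accuracy on a held-out split of the Breast Cancer dataset; the split size, number of trees, and seed are arbitrary choices for illustration, and the two numbers will usually be close but not identical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# oob_score=True asks the forest to score each training point using
# only the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_              # computed from training data only
test_accuracy = rf.score(X_test, y_test)  # computed from the held-out split

print("OOB accuracy :", round(oob_accuracy, 3))
print("Test accuracy:", round(test_accuracy, 3))
```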

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.


- Feature Importance in a Single Decision Tree

 - In a Decision Tree, feature importance is calculated based on how much each feature contributes to reducing impurity in the splits.

- Impurity Measures (depending on task):

 - Classification → Gini Index or Entropy.

 - Regression → Variance Reduction (Mean Squared Error).

- Process:

 - Every time the tree splits on a feature, it reduces impurity.

 - The reduction is attributed to that feature.

 - Sum up all impurity reductions for each feature across the tree.

 - Normalize to get a percentage score.
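The impurity reduction credited to a feature at one split can be computed by hand. The sketch below uses Gini impurity and weights the decrease by the share of all training samples that reach the node, which is how impurity-based importance is commonly defined. The labels are made-up toy data and the function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the node's share of samples."""
    n_node = len(parent)
    children = (len(left) / n_node) * gini(left) + (len(right) / n_node) * gini(right)
    return (n_node / n_total) * (gini(parent) - children)


# Toy root node with 8 samples, split into two children of 4.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print("Gini(parent):", gini(parent))
print("Gini(left)  :", gini(left))
print("Decrease    :", decrease)
```

Here the parent has Gini 0.5 and each child 0.375, so the split is credited with a decrease of 0.125. Summing such decreases per feature over every split in the tree, then normalizing, gives the importance scores.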

- Limitation:

 - A single tree is unstable — small changes in data can drastically change which features appear at the top and how much importance they get.

 - Can be biased toward features with more categories (categorical) or continuous features with many split points.

- Feature Importance in a Random Forest

 - Random Forest is an ensemble of many trees (each trained on bootstrap samples + random subset of features).

- Process:

 - Compute feature importance individually in each tree (same way as above: impurity reduction).

 - Average (or sum) the importance scores across all trees.

 - Normalize scores so they add up to 1 (or 100%).

- Advantages over single tree:

 - More stable (less sensitive to random fluctuations in training data).

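 - Less dominated by a single feature, because feature subsampling gives weaker or correlated features a chance to be used in splits. Note that impurity-based importance can still favor high-cardinality features, even in a forest.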

 - Provides a more robust, reliable measure of which features matter most overall.
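One visible consequence of the difference is how many features receive any importance at all. The sketch below fits a single tree and a forest on the Breast Cancer dataset and counts the features with non-zero importance; the seeds and model settings are arbitrary, and the exact counts depend on them.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A single tree only credits the features it happened to split on;
# a forest averages over many trees that each saw different data and features.
tree_used = int(np.count_nonzero(tree.feature_importances_))
forest_used = int(np.count_nonzero(forest.feature_importances_))

print("Features with non-zero importance (single tree):", tree_used)
print("Features with non-zero importance (forest)     :", forest_used)
print("Top feature (tree)  :", data.feature_names[tree.feature_importances_.argmax()])
print("Top feature (forest):", data.feature_names[forest.feature_importances_.argmax()])
```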


### 6. Load the Breast Cancer dataset using
### sklearn.datasets.load_breast_cancer()
### ● Train a Random Forest Classifier
### ● Print the top 5 most important features based on feature importance scores.

In [8]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance
importances = rf.feature_importances_

# Create dataframe for feature importance
feat_imp = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
})

# Sort by importance
feat_imp = feat_imp.sort_values(by="Importance", ascending=False)

# Print top 5
print("Top 5 Important Features:")
print(feat_imp.head(5))


Top 5 Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


### 7. Write a Python program to:
### ● Train a Bagging Classifier using Decision Trees on the Iris dataset
### ● Evaluate its accuracy and compare with a single Decision Tree

In [7]:
# Importing libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

# Train a Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,        # number of trees
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
acc_bag = accuracy_score(y_test, y_pred_bag)

# Print results
print("Accuracy of Single Decision Tree: {:.2f}".format(acc_dt))
print("Accuracy of Bagging Classifier:   {:.2f}".format(acc_bag))

Accuracy of Single Decision Tree: 1.00
Accuracy of Bagging Classifier:   1.00


### 8. Write a Python program to:
### ● Train a Random Forest Classifier
### ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
### ● Print the best parameters and final accuracy

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model
rf = RandomForestClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],   # number of trees
    'max_depth': [2, 4, 6, None]      # depth of trees
}

# GridSearchCV
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Evaluate final model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Final Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'n_estimators': 50}
Final Accuracy: 1.0


### 9. Write a Python program to:
### ● Train a Bagging Regressor and a Random Forest Regressor on the California
### Housing dataset
### ● Compare their Mean Squared Errors (MSE)

In [5]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
y_pred_bagging = bagging_reg.predict(X_test)

# Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

# Compute MSE
mse_bagging = mean_squared_error(y_test, y_pred_bagging)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print("Mean Squared Error (Bagging Regressor):", mse_bagging)
print("Mean Squared Error (Random Forest Regressor):", mse_rf)

Mean Squared Error (Bagging Regressor): 0.25592438609899626
Mean Squared Error (Random Forest Regressor): 0.2553684927247781


### 10. You are working as a data scientist at a financial institution to predict loan
### default. You have access to customer demographic and transaction history data.
### You decide to use ensemble techniques to increase model performance.
### Explain your step-by-step approach to:
### ● Choose between Bagging or Boosting
### ● Handle overfitting
### ● Select base models
### ● Evaluate performance using cross-validation
### ● Justify how ensemble learning improves decision-making in this real-world
### context.

- Step-by-Step Approach for Loan Default Prediction using Ensemble Learning
1. Choosing Between Bagging or Boosting

- Bagging (Bootstrap Aggregating):

 - Works well when the base model (e.g., Decision Trees) has high variance.

 - It reduces variance by training multiple models on bootstrapped samples and averaging results.

 - Example: Random Forest.

- Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost, LightGBM):

 - Sequentially builds models where each new model corrects errors of the previous one.

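 - Primarily reduces bias (and often variance). Many implementations also support class weighting (e.g., scale_pos_weight in XGBoost and LightGBM), which helps with imbalanced targets like loan defaults.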

- Choice in this case:

 Since loan default prediction is a highly imbalanced, complex problem, Boosting (XGBoost/LightGBM) is generally preferred because it can handle:

 - Non-linear relationships

 - Imbalance via weighted loss functions

 - Better performance in financial risk prediction tasks

 - Final Decision: Start with Boosting (XGBoost/LightGBM) but also compare with Bagging (Random Forest) as a baseline.

2. Handling Overfitting

- Overfitting is a major concern in financial data. Techniques include:

 - Regularization in Boosting models (e.g., max_depth, learning_rate, min_child_weight, reg_lambda)

 - Early stopping based on a validation metric (prefer AUC over accuracy, which is misleading on imbalanced data)

 - Cross-validation to tune hyperparameters

 - Feature selection using importance ranking or domain knowledge (remove redundant/irrelevant features)

 - Bagging methods (like Random Forest) naturally reduce overfitting by averaging across multiple trees.
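Early stopping and the regularization knobs above can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a stand-in for XGBoost/LightGBM. Everything about the data is an assumption: it is a synthetic, imbalanced dataset from `make_classification`, not real loan data. Note that this estimator stops on its internal validation loss rather than on AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,        # shrinkage: smaller steps, less overfitting
    max_depth=3,              # shallow trees act as weak learners
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

trees_used = gb.n_estimators_  # rounds actually fitted
print("Boosting rounds used:", trees_used, "of", gb.n_estimators)
```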

3. Selecting Base Models

 - Decision Trees → Most common base learners for both Bagging & Boosting.

 - Logistic Regression → Useful as a baseline, especially when interpretability is key.

 - Neural Networks → Could be used in advanced stacking ensembles but risk higher overfitting.

 - Final Choice:

  For Bagging: Decision Trees (Random Forest)

   For Boosting: Decision Trees (XGBoost/LightGBM/CatBoost)

4. Evaluating Performance using Cross-Validation

Use Stratified K-Fold Cross-Validation so that every fold preserves the default/non-default ratio.

- Key metrics:

 - ROC-AUC Score → captures tradeoff between sensitivity and specificity

 - Precision-Recall AUC → more informative in imbalanced data (loan default prediction)

 - Confusion Matrix → check false negatives (with default as the positive class, granting a loan to a defaulter is riskier than rejecting a good customer)

 - Perform GridSearchCV or RandomizedSearchCV for hyperparameter tuning.
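The evaluation plan above can be sketched as follows. The data is again a synthetic, imbalanced stand-in generated with `make_classification`, the two models and their settings are illustrative, and `average_precision` is used as the usual scikit-learn summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% positives ("defaults").
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.9, 0.1], random_state=0,
)

# Stratification keeps the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Random Forest (bagging)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=["roc_auc", "average_precision"]
    )
    results[name] = {
        "roc_auc": scores["test_roc_auc"].mean(),
        "pr_auc": scores["test_average_precision"].mean(),
    }
    print(
        f"{name}: ROC-AUC={results[name]['roc_auc']:.3f}  "
        f"PR-AUC={results[name]['pr_auc']:.3f}"
    )
```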

5. Justification: How Ensemble Learning Improves Decision-Making

- Financial institutions require high accuracy + low false negatives (taking default as the positive class)

 - False Negative → Bank thinks customer will repay, but customer defaults (high financial risk)

 - False Positive → Bank rejects a good customer (loss of business, but safer)
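Because the two error types carry different costs, the decision threshold on the predicted default probability can be chosen to minimize expected cost rather than left at 0.5. The sketch below uses made-up labels, made-up probabilities, and a hypothetical 5:1 cost ratio between a false negative and a false positive; none of these numbers come from real lending data.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn), treating label 1 (default) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def expected_cost(y_true, y_prob, threshold, cost_fn=5.0, cost_fp=1.0):
    """Total cost under hypothetical per-error costs (missed default = 5x)."""
    _, fp, _, fn = confusion_counts(y_true, y_prob, threshold)
    return cost_fn * fn + cost_fp * fp


# Toy data: 1 = defaulted, probabilities are illustrative model outputs.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

candidates = (0.3, 0.5, 0.7)
costs = {t: expected_cost(y_true, y_prob, t) for t in candidates}
best_threshold = min(costs, key=costs.get)

print("Cost per threshold:", costs)
print("Lowest-cost threshold:", best_threshold)
```

On this toy data the default threshold of 0.5 misses one defaulter and rejects one good customer (cost 6), while lowering the threshold to 0.3 catches every defaulter at the price of two rejected good customers (cost 2).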

- Bagging helps:

 - Reduces variance → more stable predictions

- Boosting helps:

 - Reduces both bias & variance

 - Focuses on hard-to-classify customers (edge cases like borderline defaults)

- Overall benefits:

 - More robust and accurate credit risk assessment

 - Supports data-driven decision-making in lending

 - Reduces Non-Performing Assets (NPAs)

 - Increases trust in automated loan approval systems