#1.  What is Ensemble Learning in machine learning? Explain the key idea behind it.

1. Key Idea Behind Ensemble Learning:-

“Many weak or diverse learners together form a strong learner.”

Ensemble learning trains several models on the same task and combines their predictions into a single output. Instead of relying on one model, it aggregates predictions from multiple models to:

Reduce errors

Improve generalization

Increase stability and accuracy

2. Why Ensemble Learning Works:-

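Prediction error is usually broken into three components:

Bias → Error from overly simple models

Variance → Error from overly complex, unstable models

Noise → Random fluctuations in data (irreducible; no model can remove it)

Ensembles target the first two: averaging many models (bagging) mainly lowers variance, while adding models sequentially (boosting) mainly lowers bias. Noise sets a floor on the achievable error.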

3. How Models Are Combined:-

Classification → Majority voting / weighted voting

Regression → Averaging predictions
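
The combination rules above can be sketched with NumPy. This is a minimal illustration, not a library implementation: the prediction arrays, the weights, and the helper names `majority_vote` and `weighted_vote` are all made up for the example.

```python
import numpy as np

# Hypothetical class predictions (labels 0/1) from three classifiers for five samples.
clf_preds = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
])

def majority_vote(preds):
    """Return the most frequent label per sample (one column per sample)."""
    return np.array([np.bincount(col).argmax() for col in preds.T])

def weighted_vote(preds, weights, n_classes):
    """Add each model's weight to the class it voted for, then pick the heaviest class."""
    n_samples = preds.shape[1]
    scores = np.zeros((n_classes, n_samples))
    for model_preds, w in zip(preds, weights):
        scores[model_preds, np.arange(n_samples)] += w
    return scores.argmax(axis=0)

hard = majority_vote(clf_preds)
weighted = weighted_vote(clf_preds, weights=[0.2, 0.2, 0.6], n_classes=2)

# Hypothetical regression predictions from three regressors for two samples.
reg_preds = np.array([[2.0, 3.0], [4.0, 5.0], [6.0, 1.0]])
averaged = reg_preds.mean(axis=0)

print("Majority vote:", hard)
print("Weighted vote:", weighted)
print("Averaged regression output:", averaged)
```

With a weight of 0.6, the third model outvotes the other two combined, so the weighted result follows its predictions wherever the models disagree.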

4. Advantages of Ensemble Learning:-

Higher accuracy

Better generalization

More robust to overfitting

Works well with complex datasets

5. Limitations:-

Increased computational cost

Less interpretability

More complex to implement and tune

#2. What is the difference between Bagging and Boosting?

| Aspect          | Bagging (Bootstrap Aggregating)      | Boosting                                  |
| --------------- | ------------------------------------ | ----------------------------------------- |
| Training Style  | Parallel, independent models         | Sequential, dependent models              |
| Data Sampling   | Random sampling **with replacement** | Re-weighting data (focus on hard samples) |
| Focus           | Reduces **variance**                 | Reduces **bias**                          |
| Handling Errors | All samples treated equally          | Misclassified samples get higher weight   |
| Overfitting     | Helps prevent overfitting            | Can overfit on noisy data                 |
| Model Weighting | Equal weight to all models           | Weighted models                           |
| Speed           | Faster (parallelizable)              | Slower (sequential)                       |
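
The contrast in the table can be tried out with a small sketch. It assumes a synthetic dataset from `make_classification` (not part of the original question) and uses scikit-learn's `BaggingClassifier` and `AdaBoostClassifier` with their default tree-based estimators. AdaBoost is only one boosting algorithm, picked here for illustration, and the relative scores will differ from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only as a stand-in for a real problem.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Bagging: independent full-depth trees, each fit on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: decision stumps fit one after another on re-weighted data.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold accuracy = {scores[name]:.3f}")
```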




#3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

1. What is Bootstrap Sampling:-

From a dataset of size N, we draw N samples with replacement

Some original samples may appear multiple times

Some samples may not appear at all

2. Role of Bootstrap Sampling in Bagging:-

Bagging = Bootstrap + Aggregation

How it Works:

Generate multiple bootstrap samples from the original dataset

Train a separate model (e.g., decision tree) on each sample

Combine predictions:

Classification → Majority voting

Regression → Averaging

3. Bootstrap Sampling in Random Forest:-

Random Forest uses two sources of randomness:

Bootstrap sampling of data

Random subset of features at each split

This double randomness:

Increases model diversity

Reduces correlation between trees

Improves generalization
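
One standard way to see why lower correlation between trees helps is the variance of an average. The formula below is a textbook result, stated under the simplifying assumption that all $B$ trees $T_b$ have the same variance $\sigma^2$ and the same pairwise correlation $\rho$ (symbols introduced here, not in the original answer):

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

Adding more trees shrinks only the second term. The first term shrinks only when the trees are less correlated, which is what the random feature subset at each split aims for.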

4. Why Bootstrap Sampling Improves Performance:-

Reduces variance of unstable models (like decision trees)

Makes models more robust to noise

Reduces overfitting

5. Simple Example:-

Original data:
[A, B, C, D, E]

Bootstrap sample:
[B, C, C, E, A]

(D is out-of-bag)
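
The same idea can be sketched with NumPy. The seed is arbitrary, so the sample drawn will generally differ from the hand-written one above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
data = np.array(["A", "B", "C", "D", "E"])

# Draw N indices with replacement from 0..N-1
idx = rng.integers(0, len(data), size=len(data))
bootstrap_sample = data[idx]

# Out-of-bag items are those whose index was never drawn
oob_idx = np.setdiff1d(np.arange(len(data)), idx)
oob_items = data[oob_idx]

print("Bootstrap sample:", bootstrap_sample)
print("Out-of-bag items:", oob_items)
```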

#4. What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

1. What are Out-of-Bag (OOB) Samples:-

In bootstrap sampling:

From a dataset of size N, we sample N points with replacement

On average, about 63.2% of the distinct original points appear in a given bootstrap sample

The remaining ~36.8% are left out of that sample; these are its Out-of-Bag samples

These OOB samples were never seen by a given tree during training.
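
The 63.2% / 36.8% split follows from a short calculation. Each of the $N$ draws misses a particular point with probability $1 - 1/N$, and the draws are independent, so:

$$P(\text{point is OOB}) = \left(1 - \frac{1}{N}\right)^{N} \xrightarrow{\,N \to \infty\,} e^{-1} \approx 0.368$$

The expected in-bag fraction is therefore about $1 - 0.368 = 0.632$. These are large-$N$ averages; any single bootstrap sample will deviate somewhat.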

2. How OOB Samples are Used:-

For each data point:

Identify all trees where this point was OOB

Use those trees to make a prediction for that point

Aggregate predictions:

Classification → Majority vote

Regression → Average prediction

3. What is OOB Score:-

The OOB score is the overall accuracy (classification) or R² / MSE (regression) computed using OOB predictions for all samples.

$$\text{OOB Score} = \frac{\text{Correct OOB predictions}}{\text{Total samples}}$$

4. Why OOB Score is Useful:-

No need for a separate validation set (a final held-out test set is still advisable)

Efficient use of data

Provides a nearly unbiased, often slightly conservative, estimate of generalization performance

Especially useful when data is limited
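
In scikit-learn the OOB estimate is available by passing `oob_score=True`. The sketch below uses the Breast Cancer dataset as an assumed example; the hyperparameter values are illustrative rather than tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest score each sample using only
# the trees that did not see it during training.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

oob_accuracy = rf.oob_score_  # accuracy for classifiers
print("OOB accuracy:", oob_accuracy)

# Per-sample class probabilities aggregated over the OOB trees
print("OOB decision function shape:", rf.oob_decision_function_.shape)
```

No train-test split is made here because the OOB predictions already come from trees that never saw the corresponding sample.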




#5. Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

| Aspect              | Single Decision Tree            | Random Forest        |
| ------------------- | ------------------------------- | -------------------- |
| Number of Models    | One tree                        | Many trees           |
| Stability           | High variance, unstable         | Low variance, stable |
| Sensitivity to Data | Very sensitive to small changes | Much less sensitive  |
| Overfitting         | High risk                       | Reduced risk         |
| Feature Bias        | Can strongly favor one feature  | Bias averaged out    |
| Reliability         | Lower                           | Higher               |
| Interpretability    | Very interpretable              | Less interpretable   |
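
The comparison can be made concrete with a short sketch, assuming the Breast Cancer dataset as an example. It fits one tree and one forest, then reports how many features each assigns non-zero importance and how much the forest's importances vary across its individual trees. Variable names such as `forest_std` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Spread of each feature's importance across the forest's individual trees
per_tree = np.array([est.feature_importances_ for est in forest.estimators_])
forest_std = per_tree.std(axis=0)

print("Features with non-zero importance (tree):  ", np.count_nonzero(tree_imp))
print("Features with non-zero importance (forest):", np.count_nonzero(forest_imp))

# Top 5 features by forest importance, with the single tree's value alongside
for i in np.argsort(forest_imp)[::-1][:5]:
    print(f"{data.feature_names[i]:<22} tree={tree_imp[i]:.3f} "
          f"forest={forest_imp[i]:.3f} (std {forest_std[i]:.3f})")
```

A single tree only credits the features it happened to split on, so correlated alternatives can end up with zero importance. The forest spreads credit across them, and the per-tree standard deviation gives a rough sense of how stable each score is.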




In [1]:
#6. Write a Python program to:
#   Load the Breast Cancer dataset using
#   sklearn.datasets.load_breast_cancer()
#   Train a Random Forest Classifier
#   Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create DataFrame for easy viewing
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance (descending)
top_5_features = feature_importance_df.sort_values(
    by='Importance', ascending=False
).head(5)

# Print top 5 important features
print("Top 5 Most Important Features:")
print(top_5_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [2]:
#7. Write a Python program to:
#   Train a Bagging Classifier using Decision Trees on the Iris dataset
#   Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Train Bagging Classifier with Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Print accuracies
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


In [3]:
#8.  Write a Python program to:
#    Train a Random Forest Classifier
#    Tune hyperparameters max_depth and n_estimators using GridSearchCV
#    Print the best parameters and final accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Random Forest model
rf = RandomForestClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit GridSearch
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Predictions on test set
y_pred = best_model.predict(X_test)

# Final accuracy
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", grid_search.best_params_)
print("Final Test Accuracy:", final_accuracy)


Best Hyperparameters: {'max_depth': None, 'n_estimators': 200}
Final Test Accuracy: 0.9707602339181286


In [4]:
#9. Write a Python program to:
#   Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
#   Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -----------------------------
# Bagging Regressor
# -----------------------------
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# -----------------------------
# Random Forest Regressor
# -----------------------------
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.2568358813508342
Random Forest Regressor MSE: 0.25650512920799395


#10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to: ● Choose between Bagging or Boosting ● Handle overfitting ● Select base models ● Evaluate performance using cross-validation ● Justify how ensemble learning improves decision-making in this real-world context.

1. Choosing Between Bagging and Boosting:-

Step 1: Understand the data & business risk:

Loan default data is usually:

Imbalanced (fewer defaults than non-defaults)

Noisy (missing values, reporting errors)

High-stakes (false negatives, i.e. missed defaulters, are costly)

Decision:-

Start with Bagging (e.g., Random Forest)

Robust to noise

Reduces variance

Then try Boosting (e.g., Gradient Boosting / XGBoost)

Focuses on hard-to-classify defaulters

Often delivers higher predictive power

2. Handling Overfitting (Critical in Finance):-

Train-test split with time awareness:-

Ensure no data leakage (use past data to predict future defaults)

Regularization:-

Limit tree depth (max_depth)

Use min_samples_leaf

Ensemble controls:-

Bagging → More trees, shallower depth

Boosting → Lower learning rate, early stopping

Feature selection:-

Remove highly correlated or unstable features

Out-of-Bag (OOB) / Cross-validation

Continuous performance monitoring
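
Several of these controls can be sketched together with scikit-learn's `GradientBoostingClassifier`. The data is a synthetic stand-in generated by `make_classification` (about 10% positives, a little label noise), not real loan data, and every hyperparameter value is an illustrative assumption. The split here is random for brevity; with real loan records a time-ordered split would be used, as noted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~10% "defaults" and 2% flipped labels as noise
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], flip_y=0.02, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # lower learning rate
    max_depth=3,              # limit tree depth
    min_samples_leaf=20,      # require a minimum leaf size
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)

rounds_used = gb.n_estimators_  # rounds actually fitted before stopping
test_accuracy = gb.score(X_test, y_test)
print("Boosting rounds used:", rounds_used)
print("Test accuracy:", test_accuracy)
```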

3. Selecting Base Models:-

Why decision trees:-

Handle non-linear relationships

Work with mixed data types

Capture interaction between demographic & transaction features

Base model choices:-

Bagging:

Decision Trees (high variance → bagging benefits)

Boosting:

Shallow decision trees (decision stumps or trees only a few levels deep)

4. Evaluating Performance Using Cross-Validation:-

Use Stratified K-Fold CV:-

Maintains default / non-default ratio

Evaluate using business-relevant metrics:-

AUC-ROC → Ranking risk

Precision-Recall (PR-AUC) → Default detection under class imbalance

F1-score → Balance errors

Compare:-

Single model vs Bagging vs Boosting

Check stability across folds:-

Consistent performance = trustworthy model
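
The evaluation steps above can be sketched with `StratifiedKFold` and `cross_validate`. The dataset is again a synthetic, imbalanced stand-in rather than real loan data, and the three models use mostly default settings, so the numbers only illustrate the workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 10% positives ("defaults")
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Stratification keeps the default / non-default ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["roc_auc", "average_precision", "f1"]

summary = {}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, scoring=metrics)
    # Mean shows performance; standard deviation shows stability across folds
    summary[name] = {
        m: (res[f"test_{m}"].mean(), res[f"test_{m}"].std()) for m in metrics
    }
    line = ", ".join(f"{m}={mu:.3f}±{sd:.3f}" for m, (mu, sd) in summary[name].items())
    print(f"{name}: {line}")
```

Here `average_precision` stands in for the area under the precision-recall curve. A small standard deviation across folds is the stability signal described above.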

5. How Ensemble Learning Improves Decision-Making:-

Business Impact:-

More accurate risk assessment

Reduced false negatives (missed defaulters)

Better loan pricing and approval decisions

Technical Benefits:-

Combines multiple perspectives of data

Reduces model bias and variance

Handles complex customer behavior patterns