1: What is Ensemble Learning in Machine Learning? Explain the key idea behind it.
-> Ensemble Learning is a machine learning technique in which multiple models (called base learners) are trained on the same problem and their predictions are combined, with the goal of achieving higher accuracy and better robustness than any single model.

Key Idea:

Instead of relying on one model, ensemble learning aggregates predictions from several models so that individual weaknesses are compensated by others.
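
The effect of aggregation can be illustrated with a small calculation. This is a minimal sketch, assuming hypothetical classifiers that are each correct 70% of the time and make their errors independently (an idealization, since real models are usually correlated). The function name `majority_vote_accuracy` is illustrative only.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is correct with probability p, independently of the
    others, and that n_models is odd so that ties cannot occur.
    """
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


single = 0.70
print(round(majority_vote_accuracy(single, 1), 3))  # 0.7
print(round(majority_vote_accuracy(single, 3), 3))  # 0.784
print(majority_vote_accuracy(single, 11))           # higher still
```

Under the independence assumption the vote of three such models is right more often than any one of them. The gain shrinks as the models' errors become more correlated.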

2: What is the difference between Bagging and Boosting?
-> | Aspect          | Bagging                                | Boosting                                                       |
| --------------- | -------------------------------------- | -------------------------------------------------------------- |
| Name            | Short for Bootstrap Aggregating        | Not an acronym; weak learners are "boosted" into a strong one  |
| Training        | Parallel (models are independent)      | Sequential (each model depends on the previous ones)           |
| Data sampling   | Random sampling with replacement       | Reweighted samples (AdaBoost) or residuals (Gradient Boosting) |
| Focus           | Reduce variance                        | Reduce bias                                                    |
| Handling errors | All samples and models weighted equally | Misclassified samples get higher weight                        |
| Example         | Random Forest                          | AdaBoost, Gradient Boosting                                    |

Conclusion:

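Bagging helps most when the base models overfit (low bias, high variance), for example fully grown decision trees.

Boosting helps most when the base models underfit (high bias, low variance), for example shallow trees or stumps.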

3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
-> Bootstrap sampling is a technique where multiple datasets are created by randomly sampling the original dataset with replacement.

Role in Bagging:

- Each model is trained on a different bootstrap sample, usually the same size as the original dataset

- Some data points may appear multiple times in a sample

- Some may not appear at all

In Random Forest:

- Each decision tree is trained on a different bootstrap sample

- Each split also considers only a random subset of the features

- Together these produce diverse trees and reduce the correlation among them

End Result:

Improved stability and reduced variance of the final model.
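
Sampling with replacement can be shown on a toy dataset. This is a minimal sketch using only the standard library; the helper `bootstrap_sample` and the ten-point dataset are illustrative assumptions.

```python
import random


def bootstrap_sample(indices, rng):
    """Draw len(indices) items from indices uniformly, with replacement."""
    return [rng.choice(indices) for _ in indices]


rng = random.Random(42)  # fixed seed so the run is repeatable
original = list(range(10))

sample = bootstrap_sample(original, rng)
in_bag = set(sample)
out_of_bag = sorted(set(original) - in_bag)

print("bootstrap sample:", sample)
print("distinct points drawn:", len(in_bag))
print("never drawn (out-of-bag):", out_of_bag)
```

The sample has the same length as the original data, but repeated draws leave some points out, which is what gives each model a slightly different view of the data.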


4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
-> Out-of-Bag (OOB) samples are the data points not selected during bootstrap sampling for a particular model.

Key Points:

- On average, about 36.8% (roughly 1/e) of the training points are OOB for each tree

- These samples act as validation data for the trees that did not see them
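
The 36.8% figure follows from the sampling scheme: a given point is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e as n grows. A short numerical check (the function name `oob_probability` is illustrative):

```python
import math


def oob_probability(n: int) -> float:
    """Chance that one given point is missed by all n draws with replacement."""
    return (1 - 1 / n) ** n


for n in (10, 100, 10_000):
    print(n, round(oob_probability(n), 4))

print("limit 1/e:", round(math.exp(-1), 4))  # 0.3679
```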

OOB Score Usage:

- Each sample is predicted using only the trees for which it was OOB

- Accuracy is calculated from these predictions, without a separate test set

Benefits:

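- Often removes the need for a separate validation split or cross-validation when a quick estimate is enough

- Saves computation, since the estimate comes from a single training run

- Provides an approximately unbiased (often slightly conservative) performance estimate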
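
In scikit-learn the OOB estimate is available through the `oob_score` option of the forest estimators. A minimal sketch, assuming the breast cancer dataset and 200 trees as illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks scikit-learn to score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,
    bootstrap=True,  # OOB scoring requires bootstrap sampling (the default)
    random_state=42,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```

No train/test split is made here: the whole dataset is used for fitting, and `oob_score_` still reflects predictions on points each tree did not train on.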

5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
-> | Aspect          | Decision Tree      | Random Forest       |
| --------------- | ------------------ | ------------------- |
| Stability           | Unstable (high variance)                 | Stable                          |
| Sensitivity         | Sensitive to noise                       | Robust                          |
| Importance estimate | From one tree; can reflect overfit splits | Averaged over many trees        |
| Reliability         | Low                                      | High                            |
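
The stability row can be probed empirically by refitting both models on bootstrap resamples and measuring how much each feature's importance moves. This is a rough sketch under assumed settings (ten resamples, 100 trees); the helper `importance_spread` is hypothetical, and one would typically expect the forest's spread to be the smaller of the two.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)


def importance_spread(make_model, n_repeats=10):
    """Refit on bootstrap resamples; return per-feature std of importances."""
    runs = []
    for seed in range(n_repeats):
        X_b, y_b = resample(X, y, random_state=seed)
        runs.append(make_model().fit(X_b, y_b).feature_importances_)
    return np.std(runs, axis=0)


tree_spread = importance_spread(lambda: DecisionTreeClassifier(random_state=0))
forest_spread = importance_spread(
    lambda: RandomForestClassifier(n_estimators=100, random_state=0)
)

print("mean std of importances, single tree:", tree_spread.mean())
print("mean std of importances, forest:     ", forest_spread.mean())
```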



In [1]:
# 6: Random Forest on Breast Cancer Dataset (Feature Importance)

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importance
importance = pd.Series(rf.feature_importances_, index=data.feature_names)
top5 = importance.sort_values(ascending=False).head(5)

print(top5)


worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [2]:
# 7: Bagging Classifier vs Single Decision Tree (Iris Dataset)

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging Classifier
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


In [3]:
# 8: Random Forest with GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10]
}

rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'max_depth': 5, 'n_estimators': 100}
Best Accuracy: 0.9596180717279925


In [4]:
# 9: Bagging Regressor vs Random Forest Regressor

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
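# Note: with scikit-learn's default max_features=1.0, RandomForestRegressor
# considers all features at every split, so it behaves like bagged trees.
# It also uses 100 trees versus 50 here, so the comparison is indicative only.
rf = RandomForestRegressor(n_estimators=100, random_state=42)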

bag.fit(X_train, y_train)
rf.fit(X_train, y_train)

print("Bagging MSE:", mean_squared_error(y_test, bag.predict(X_test)))
print("Random Forest MSE:", mean_squared_error(y_test, rf.predict(X_test)))


Bagging MSE: 0.2582487035129332
Random Forest MSE: 0.25424371393528344


10: Ensemble Learning for Loan Default Prediction
-> Step-by-Step Approach:
1. Choosing Bagging or Boosting

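- Noisy data or high-variance base models → Bagging (averaging dampens the noise)

- Complex patterns that simple models underfit → Boosting

- For loan default → Boosting is usually preferred (better bias handling), provided overfitting is controlled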

2. Handling Overfitting

- Cross-validation to detect overfitting

- Early stopping on a validation set

- Regularization (e.g. a smaller learning rate, row subsampling)

- Limiting tree depth
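
The controls listed above map onto parameters of scikit-learn's `GradientBoostingClassifier`. The sketch below is illustrative only: the data is synthetic (a stand-in for loan records, with roughly 10% positives), and all parameter values are assumptions rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for loan data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,       # shrinkage, a form of regularization
    max_depth=3,              # limit tree depth
    subsample=0.8,            # row subsampling (stochastic boosting)
    validation_fraction=0.1,  # held-out share used to monitor the loss
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)

# Cross-validated ROC-AUC as a check that the constrained model generalizes.
auc = cross_val_score(gb, X, y, cv=5, scoring="roc_auc")
print("mean CV ROC-AUC:", auc.mean())

gb.fit(X, y)
print("trees actually fitted:", gb.n_estimators_)
```

If early stopping triggers, `n_estimators_` is smaller than the requested 500, which shows how many boosting rounds the validation loss supported.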

3. Selecting Base Models

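- Shallow Decision Trees (the usual base learner; interpretable)

- Logistic Regression (baseline for comparison)

- Gradient Boosted Trees (the ensemble built from those trees, chosen for performance)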

4. Performance Evaluation

- K-Fold Cross-Validation (stratified, so each fold keeps the share of defaults)

- ROC-AUC

- Precision-Recall

- Confusion Matrix
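
These metrics can be computed with `sklearn.metrics`. The labels and scores below are hypothetical values made up for illustration (1 = default), and the 0.5 decision threshold is likewise an assumption.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.65, 0.8, 0.9])

# Turn probabilities into class predictions at an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(cm)                                    # [[4 2] [1 3]]
print("precision:", precision)               # 0.6
print("recall:", recall)                     # 0.75
print("ROC-AUC:", round(roc_auc, 3))         # 0.833
print("average precision:", avg_precision)   # area under the PR curve
```

ROC-AUC and average precision use the raw scores, so they do not depend on the threshold, while the confusion matrix, precision, and recall do.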

5. Why Ensemble Improves Decisions

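- Fewer misclassified applicants, which reduces financial risk

- More stable predictions than a single model

- Copes with class imbalance when combined with class weights or resampling

- Consistent, auditable predictions help build regulatory trust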

Final Impact:

A well-validated ensemble supports more accurate, robust, and consistent loan approval decisions, which helps reduce default risk.