###  Question 1
**What is Ensemble Learning in machine learning? Explain the key idea behind it.**

**Answer:**

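Ensemble learning combines the predictions of multiple models (base learners) into a single, stronger predictive model.

The key idea: individual models make different errors, so aggregating them (by voting, averaging, or weighting) cancels some of those errors. Depending on the method, this mainly reduces variance (Bagging) or bias (Boosting), so the ensemble usually outperforms any single base learner.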

**Common methods:** Bagging, Boosting, Stacking.
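
As a rough illustration of the key idea, the sketch below computes the accuracy of a majority vote over several classifiers. It assumes (this is not stated in the answer above) a binary task where each model is correct with probability `p` and the models' errors are independent. Real base learners are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a strict majority of n_models voters is correct.

    Assumptions: binary task, odd n_models (no ties), independent errors.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is only 60% accurate in this toy setting.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions, the vote becomes more accurate as models are added, as long as each model is better than chance.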

###  Question 2
**What is the difference between Bagging and Boosting?**

| Aspect | Bagging | Boosting |
|--------|----------|-----------|
| Type | Parallel | Sequential |
| Goal | Reduce variance | Primarily reduce bias (often variance too) |
| Sampling | Bootstrap samples (random, with replacement) | Reweighted data (more weight on previous errors) |
| Dependence | Independent models | Each model depends on previous |
| Example | Random Forest | AdaBoost, XGBoost |
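
To make the "focus on errors" row concrete, here is a minimal sketch of one reweighting round in discrete AdaBoost for binary labels. It is an illustration under stated assumptions, not the exact procedure of any particular library: the helper name `adaboost_reweight` is hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the discrete AdaBoost weight update (binary labels).

    weights       -- current sample weights, summing to 1
    misclassified -- booleans, True where the current weak learner was wrong
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
    updated = [
        w * math.exp(alpha if wrong else -alpha)  # raise errors, lower hits
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalize to sum to 1


weights = [0.2] * 5  # five samples, uniform start
alpha, new_weights = adaboost_reweight(
    weights, [False, False, False, False, True]
)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right. Bagging has no such step: every bootstrap sample is drawn independently of earlier models.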

###  Question 3
**What is bootstrap sampling and its role in Bagging methods like Random Forest?**

**Answer:**
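Bootstrap sampling means randomly drawing n rows with replacement from a training set of n rows, so some rows repeat and others are left out (on average about 63% of the distinct rows appear in each sample).

Each tree trains on a different bootstrap sample → trees are more diverse → averaging them reduces variance → the forest overfits less than a single deep tree.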
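
The sampling step can be sketched with the standard library alone. The helper name `bootstrap_indices` and the sample size are arbitrary choices for illustration. The sketch draws one bootstrap sample of row indices and measures how many distinct rows it contains.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list:
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)

unique_fraction = len(set(sample)) / n  # share of distinct rows drawn
print(f"distinct rows in the bootstrap sample: {unique_fraction:.3f}")
```

For large n the expected share of distinct rows is 1 − (1 − 1/n)^n ≈ 1 − 1/e ≈ 0.632, which is why each tree sees a noticeably different slice of the data.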

###  Question 4
**What are Out-of-Bag (OOB) samples and how is OOB score used?**

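OOB samples are the training rows that were not drawn into a given tree's bootstrap sample (roughly 37% of the rows per tree).
Each row is predicted using only the trees that never saw it, which acts as a built-in validation set.
The resulting OOB score is an approximately unbiased estimate of generalization accuracy, obtained without setting aside a separate validation set.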
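
In scikit-learn, the OOB estimate is available by passing `oob_score=True` and reading the fitted attribute `oob_score_`. The sketch below is one way to use it; the dataset and `n_estimators=200` are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each row using only the trees whose bootstrap
# sample did not contain that row (requires bootstrap=True, the default).
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf.fit(X, y)

oob_accuracy = rf.oob_score_  # accuracy estimated from OOB predictions
print("OOB accuracy:", oob_accuracy)
```

With too few trees, some rows may never be out-of-bag, so the estimate is more reliable with a larger forest.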

###  Question 5
**Compare feature importance in Decision Tree vs Random Forest**

| Aspect | Decision Tree | Random Forest |
|---------|----------------|----------------|
| Basis | Impurity decrease in a single tree | Impurity decrease averaged across many trees |
| Stability | Sensitive to small changes in the data | Stable |
| Variance of the estimate | High | Low |

###  Question 6
**Write a Python program to:**

1. Load the Breast Cancer dataset
2. Train a Random Forest Classifier
3. Print the top 5 most important features

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the dataset into a DataFrame so feature names stay attached
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Fit a Random Forest with a fixed seed for reproducibility
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Rank the impurity-based importances and show the top 5
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))
```

```
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64
```


### Question 7
**Write a Python program to:**

1. Train a Bagging Classifier using Decision Trees on the Iris dataset
2. Evaluate its accuracy and compare with a single Decision Tree

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Hold out 30% of the data for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline: a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))

# Bagging: 50 trees, each trained on its own bootstrap sample
# (the `estimator` keyword requires scikit-learn >= 1.2)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
acc_bag = accuracy_score(y_test, bag.predict(X_test))

print('Decision Tree Accuracy:', acc_dt)
print('Bagging Classifier Accuracy:', acc_bag)
```

```
Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
```

Both models classify this small test split perfectly, so Iris does not separate them; a harder dataset or cross-validation would be needed to see a difference.


### Question 8
**Write a Python program to:**

1. Train a Random Forest Classifier
2. Tune max_depth and n_estimators using GridSearchCV
3. Print best parameters and final accuracy

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Candidate values for the two hyperparameters being tuned
param_grid = {'n_estimators': [50, 100, 150], 'max_depth': [3, 5, 7, None]}
rf = RandomForestClassifier(random_state=42)

# 5-fold cross-validated search over every combination in the grid
grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X, y)
print('Best Params:', grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print('Best Accuracy:', grid.best_score_)
```

```
Best Params: {'max_depth': 3, 'n_estimators': 50}
Best Accuracy: 0.9666666666666668
```
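
The score printed above is `grid.best_score_`, the mean cross-validated accuracy on the data used for tuning. If a "final accuracy" on unseen data is wanted, one option is to split off a test set first and tune only on the training part. The sketch below assumes a 70/30 stratified split and a smaller grid to keep the run short; neither choice comes from the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# Keep a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Reduced grid (an assumption for speed, not the grid used above)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best model on all of X_train by default,
# so grid.predict uses that refitted model.
test_accuracy = accuracy_score(y_test, grid.predict(X_test))
print('Best Params:', grid.best_params_)
print('Held-out Test Accuracy:', test_accuracy)
```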


### Question 9
**Write a Python program to:**

1. Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
2. Compare their Mean Squared Errors (MSE)

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Download (on first use) and split the dataset, holding out 30% for testing
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Note: BaggingRegressor defaults to 10 trees and RandomForestRegressor to 100,
# so part of the MSE gap below comes from ensemble size.
bag = BaggingRegressor(random_state=42)
bag.fit(X_train, y_train)
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

print('Bagging Regressor MSE:', mean_squared_error(y_test, bag.predict(X_test)))
print('Random Forest Regressor MSE:', mean_squared_error(y_test, rf.predict(X_test)))
```

```
Bagging Regressor MSE: 0.28623579601385674
Random Forest Regressor MSE: 0.25650512920799395
```


###  Question 10
**Case Study: Loan Default Prediction**

**Approach:**
1. Use Boosting (e.g., XGBoost): it concentrates on hard-to-classify cases, and class imbalance can be handled with class weights (e.g., `scale_pos_weight`).
2. Prevent overfitting via regularization, early stopping, and CV.
3. Use Decision Trees as base models.
4. Evaluate with Stratified K-Fold and metrics like F1, ROC-AUC.
5. Ensemble improves accuracy and reduces risk in loan approval decisions.

**Answer:**

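**Step 1: Choosing Between Bagging and Boosting**

Use Boosting (e.g., XGBoost): each new tree concentrates on the cases earlier trees misclassified, which helps with the rare default class. Boosting does not correct class imbalance by itself, so combine it with class weighting (e.g., `scale_pos_weight` in XGBoost) or resampling.

**Step 2: Handle Overfitting**

Apply regularization: lower the `learning_rate` and limit `max_depth`.

Use early stopping on a validation set.

Use cross-validation to monitor performance.

**Step 3: Select Base Models**

Use shallow Decision Trees as base learners: each tree is easy to inspect, captures non-linear interactions, and is the standard weak learner in Boosting libraries.

**Step 4: Evaluate Performance**

Use Stratified K-Fold CV so every fold keeps the same default rate.

Evaluate using Precision, Recall, F1-Score, and ROC-AUC rather than plain accuracy, which is misleading on imbalanced data.

**Step 5: Why Ensemble Learning Helps**

Combines multiple weak learners → stronger prediction.

Copes with class imbalance better than a single tree when combined with class weighting.

Provides more stable, reliable risk predictions.

Supports better loan approval decisions and can reduce default losses.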
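
The five steps can be sketched end to end as follows. This is only an illustration under several assumptions: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to avoid an extra dependency, the data is synthetic (`make_classification` with roughly 10% positives playing the role of defaults), and the hyperparameter values are placeholders rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in for loan data: about 10% of rows are "defaults" (class 1)
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)

# Step 4: stratified folds keep the default rate the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores, auc_scores = [], []

for train_idx, test_idx in skf.split(X, y):
    # Steps 1-3: boosted shallow trees with a small learning rate and
    # early stopping on an internal 10% validation split
    model = GradientBoostingClassifier(
        learning_rate=0.1,
        max_depth=3,
        n_estimators=300,
        validation_fraction=0.1,
        n_iter_no_change=10,
        random_state=42,
    )
    # Class imbalance: weight each row inversely to its class frequency
    sample_weight = compute_sample_weight(class_weight="balanced", y=y[train_idx])
    model.fit(X[train_idx], y[train_idx], sample_weight=sample_weight)

    predictions = model.predict(X[test_idx])
    default_probability = model.predict_proba(X[test_idx])[:, 1]
    f1_scores.append(f1_score(y[test_idx], predictions))
    auc_scores.append(roc_auc_score(y[test_idx], default_probability))

print("Mean F1:", sum(f1_scores) / len(f1_scores))
print("Mean ROC-AUC:", sum(auc_scores) / len(auc_scores))
```

Reporting the per-fold spread alongside the means gives a sense of how stable the risk predictions are before they are used in approval decisions.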