**1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.**

--> Ensemble Learning in machine learning is a technique where multiple models (called base learners) are combined to solve the same problem and improve overall performance.

**Key Idea**

The main idea is:

**A group of weak or diverse models can work together to produce a stronger, more accurate model.**

Instead of relying on a single model, ensemble methods reduce errors by:

* Averaging predictions (reduces variance)

* Correcting mistakes of previous models (reduces bias)

* Combining strengths of different models
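The effect of combining weak models can be illustrated with a small calculation. The sketch below makes an idealized assumption that is not in the answer above: every model is correct with the same probability `p`, and the models' errors are independent. Real models are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n_models independent voters is
    correct, when each voter is correct with probability p (use odd n_models)."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is only 60% accurate (assumed value for illustration).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble accuracy rises as more models vote, which is the intuition behind "many weak models make a strong one".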

**2. What is the difference between Bagging and Boosting?**

--> **Difference Between Bagging and Boosting**

**Bagging:**

* Models are trained independently of one another, each on its own bootstrap sample, so training can run in parallel.

* Reduces variance.

* Example: Random Forest.

**Boosting:**

* Models are trained sequentially, each correcting previous errors.

* Reduces bias.

* Example: AdaBoost.

**3. What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?**

--> Bootstrap sampling is a technique where multiple datasets are created by randomly sampling with replacement from the original dataset.

**Role in Bagging**

* Each model is trained on a different bootstrap sample.

* This creates diversity among models.

* Their predictions are then averaged (or majority voted).
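Sampling with replacement can be sketched in a few lines of plain Python. The helper name `bootstrap_indices` and the dataset size are assumptions made for this illustration. For a sample the same size as the original data, the expected share of distinct points is about 1 - 1/e ≈ 63.2%, so roughly a third of the points are left out of each sample.

```python
import random


def bootstrap_indices(n: int, rng: random.Random) -> list[int]:
    """Draw n row indices uniformly at random WITH replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(42)
n = 10_000  # assumed dataset size, for illustration only

sample = bootstrap_indices(n, rng)
in_bag = set(sample)                 # distinct rows that were drawn
left_out = set(range(n)) - in_bag    # rows never drawn in this sample

unique_fraction = len(in_bag) / n
print(f"distinct rows in the sample: {unique_fraction:.1%}")
print(f"rows left out:               {len(left_out) / n:.1%}")
```

Because each model sees a different mix of repeated and missing rows, the fitted models differ from one another, which is the source of diversity in bagging.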

**4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?**

--> Out-of-Bag (OOB) samples are the data points not selected in a bootstrap sample when training models in bagging methods like Random Forest.

**How OOB Score is Used**

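* Each training point is predicted using only the models whose bootstrap sample did not contain it.

* These predictions are combined per point (majority vote for classification, average for regression).

* The accuracy of the combined predictions (R² for regression) over the training points is called the OOB score. It works as a built-in validation estimate, so no separate hold-out set is required.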
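In scikit-learn, the OOB estimate can be requested when the forest is built. The sketch below uses the Breast Cancer dataset and 200 trees as illustrative choices; `oob_score=True` requires bootstrap sampling, which is the default for `RandomForestClassifier`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training point using only
# the trees that did not see that point during training.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")

# Per-sample class probabilities, built from out-of-bag trees only
print(forest.oob_decision_function_.shape)
```

No train/test split is needed here, because each point is scored only by trees that never trained on it.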

**5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.**

--> **Feature Importance: Single Tree vs. Random Forest**

**Single Decision Tree:**

* Importance is based on how much each feature reduces impurity.

* Can be unstable (high variance).

* Sensitive to small data changes.

**Random Forest:**

* Averages feature importance over many trees.

* More stable and reliable.

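* Less affected by the overfitting of any single tree, although impurity-based scores can still favor features with many distinct values.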
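One way to probe the stability claim is to refit each model on several resampled copies of the data and measure how much the importance scores move. The sketch below is only an illustration: the helper `importance_spread`, the five resampling rounds, and the use of the mean per-feature standard deviation as a "spread" measure are all assumptions made here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)


def importance_spread(make_model, n_rounds=5):
    """Refit on n_rounds bootstrap resamples and return the mean
    per-feature standard deviation of the importance scores."""
    rows = []
    for seed in range(n_rounds):
        Xb, yb = resample(X, y, random_state=seed)
        rows.append(make_model(seed).fit(Xb, yb).feature_importances_)
    return float(np.std(rows, axis=0).mean())


tree_spread = importance_spread(lambda s: DecisionTreeClassifier(random_state=s))
forest_spread = importance_spread(
    lambda s: RandomForestClassifier(n_estimators=100, random_state=s)
)

print(f"single tree spread:   {tree_spread:.4f}")
print(f"random forest spread: {forest_spread:.4f}")
```

A smaller spread means the ranking changes less when the data changes. Strongly correlated features can still swap places even in a forest.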

In [2]:
# 6. Write a Python program to:
# Load the Breast Cancer dataset using
# sklearn.datasets.load_breast_cancer()
# Train a Random Forest Classifier
# Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)

top5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Important Features:")
print(top5)

Top 5 Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [3]:
# 7. Write a Python program to:
# Train a Bagging Classifier using Decision Trees on the Iris dataset
# Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Classifier Accuracy:", bag_acc)

Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0


In [4]:
# 8. Write a Python program to:
# Train a Random Forest Classifier
# Tune hyperparameters max_depth and n_estimators using GridSearchCV
# Print the best parameters and final accuracy


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10]
}

grid = GridSearchCV(rf, param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Final Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy: 0.9707602339181286


In [5]:
# 9. Write a Python program to:
# Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
# Compare their Mean Squared Errors (MSE)


from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bag.fit(X_train, y_train)
bag_pred = bag.predict(X_test)
bag_mse = mean_squared_error(y_test, bag_pred)

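# Note: with the default max_features=1.0, every split considers all features,
# so this forest behaves almost like the bagged trees above and the two MSE
# values end up very close. A smaller max_features would decorrelate the trees.
rf = RandomForestRegressor(
    n_estimators=50,
    random_state=42
)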
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

print("Bagging Regressor MSE:", bag_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25772464361712627


**10. You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data.**

**You decide to use ensemble techniques to increase model performance.**

**Explain your step-by-step approach to:**
* **Choose between Bagging or Boosting**
* **Handle overfitting**
* **Select base models**
* **Evaluate performance using cross-validation**
* **Justify how ensemble learning improves decision-making in this real-world
context.**


--> **Choose Between Bagging or Boosting**

* Bagging (e.g., Random Forest) suits high-variance base models and noisy data, because averaging many independently trained models reduces variance.

* Boosting (e.g., XGBoost, LightGBM, AdaBoost) suits cases where a single model underfits complex patterns, because each new model focuses on the errors of the previous ones and so reduces bias.

* Step: Start by exploring the class distribution and how much a single tree's score varies across folds. On tabular loan-default data, Boosting often performs better, but it is more sensitive to label noise, and class imbalance still needs explicit handling (class weights or resampling). Compare both with cross-validation before committing.
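The choice can be checked empirically rather than decided up front. The sketch below compares one bagging model and one boosting model with cross-validated ROC-AUC. No loan data is available here, so a synthetic imbalanced dataset from `make_classification` (about 10% positives) stands in for it. The dataset parameters and model settings are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for loan data: roughly 10% "default" cases
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Mean ROC-AUC over the same folds for both candidates
auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, value in auc.items():
    print(f"{name:30s} ROC-AUC = {value:.3f}")
```

Using the same folds for both models keeps the comparison fair. Which model wins depends on the data.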

**Handle Overfitting**

* Regularization: Use hyperparameters like max_depth, min_samples_leaf (for trees) to avoid overly complex models.

* Early stopping: Especially useful in boosting (XGBoost/LightGBM); stop adding trees when performance on a held-out validation set stops improving.

* Feature selection: Remove redundant or irrelevant features.

* Ensemble methods: Bagging itself reduces overfitting by averaging predictions; boosting can overfit, so the learning rate, tree depth, and number of rounds need careful tuning.
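Regularization and early stopping can be combined in one model. The sketch below uses scikit-learn's `GradientBoostingClassifier` instead of XGBoost or LightGBM so that it needs no extra packages. The synthetic data and all hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data used only to make the example runnable
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper limit on boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees limit model complexity
    min_samples_leaf=5,       # avoid leaves fitted to a handful of rows
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X, y)

print("Boosting rounds allowed:", model.n_estimators)
print("Boosting rounds used:   ", model.n_estimators_)
```

If `n_estimators_` is well below `n_estimators`, the validation score stopped improving before the limit was reached.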

**Select Base Models**

Common choices for structured financial data:

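* Decision Trees – easy to interpret, need no feature scaling, and work with numerical and (encoded) categorical features.

* Logistic Regression – useful as a simple, interpretable baseline and as the final model in a stacking ensemble.

* Gradient Boosting Trees – an ensemble rather than a base model, but a strong default choice for tabular data.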

* Step: Use trees for both bagging and boosting; optionally try hybrid ensembles (stacking) combining trees + logistic regression for interpretability.
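A stacking ensemble of this kind could look like the sketch below. It combines a Random Forest and a Gradient Boosting model and lets Logistic Regression weigh their out-of-fold predictions. The synthetic imbalanced data and every parameter value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 10% "default" cases
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    # The final model sees only the base models' out-of-fold predictions
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"Stacked model ROC-AUC: {test_auc:.3f}")
```

The coefficients in `stack.final_estimator_.coef_` show how much weight the final model gives to each base model.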

**Evaluate Performance Using Cross-Validation**

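* K-Fold CV: Split data into k folds (e.g., 5 or 10) to get reliable performance estimates. Use stratified folds so that every fold keeps the same default rate.

* Metrics for loan default: Use ROC-AUC together with precision, recall, and F1-score for the default class. Accuracy alone is misleading on imbalanced data, because predicting "no default" for every customer already scores high.

* Hyperparameter tuning: Combine CV with GridSearchCV or RandomizedSearchCV to select the best parameters, and keep a final test set that the search never sees.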
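The evaluation steps above can be sketched with `cross_validate`, which reports several metrics in one pass. The data is again a synthetic imbalanced stand-in for loan records, and the metric list and model settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan data: roughly 10% "default" cases
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default rate the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["roc_auc", "average_precision", "precision", "recall", "f1"]

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
results = cross_validate(model, X, y, cv=cv, scoring=metrics)

# Report mean and spread across folds for each metric
for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:18s} {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

The spread across folds is as informative as the mean: a large spread suggests the estimate is unreliable or the model is unstable.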

**Justify How Ensemble Learning Improves Decision-Making**

* Reduced risk of wrong decisions: Ensemble predictions are typically more robust and accurate than a single model's, which reduces both false alarms (reliable customers flagged as likely defaulters) and missed defaults.

* Captures complex patterns: Boosting identifies subtle relationships in customer behavior that a single model may miss.

* Stability: Bagging reduces variance, so decisions are consistent across different data samples.

* Business impact: Higher prediction accuracy means better credit risk assessment, lowering financial losses and improving regulatory compliance.