# Assignment start

QUES.1

ANS..Ensemble Learning is a machine learning technique that combines multiple models (called base learners or weak learners) to build a stronger and more accurate predictive model.

The key idea is that, instead of relying on a single model, several models working together can reduce bias and variance and so improve generalization.

Examples: Bagging, Boosting, Stacking.

Benefits: Higher accuracy, robustness to noise, better stability.
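As a rough illustration of combining several models, the sketch below builds a hard-voting ensemble with scikit-learn. The choice of the Iris data, the three base learners, and the variable names (`X_vote`, `voter`, `vote_acc`) are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and split (not part of the assignment)
X_vote, y_vote = load_iris(return_X_y=True)
Xv_train, Xv_test, yv_train, yv_test = train_test_split(
    X_vote, y_vote, test_size=0.3, random_state=0, stratify=y_vote
)

# Three different base learners; each casts one vote per sample
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # predict the majority class label
)
voter.fit(Xv_train, yv_train)

vote_acc = voter.score(Xv_test, yv_test)
print("Voting ensemble accuracy:", vote_acc)
```

Hard voting is only one way to combine models; bagging, boosting, and stacking differ in how the base learners are trained and how their outputs are merged.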

QUES.2

ANS..
| Aspect    | Bagging                                                         | Boosting                                                            |
| --------- | --------------------------------------------------------------- | ------------------------------------------------------------------- |
| Full form | Bootstrap Aggregating                                           | No expansion; the name refers to "boosting" weak learners into a strong one |
| Training  | Models trained in parallel on random subsets (with replacement) | Models trained sequentially, each correcting errors of the previous |
| Focus     | Reduces variance                                                | Reduces bias                                                        |
| Example   | Random Forest                                                   | AdaBoost, Gradient Boosting, XGBoost                                |
| Weighting | Equal weight to all models and all training instances           | Models weighted by performance; misclassified instances get higher weight (e.g. AdaBoost) |
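The contrast in the table can be sketched by fitting one bagging and one boosting ensemble on the same data. This is a minimal example under assumed settings: the breast cancer data, 50 estimators, and the names `bagging_cv` and `boosting_cv` are illustrative, and the scores depend on the scikit-learn version and the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cmp, y_cmp = load_breast_cancer(return_X_y=True)

# Bagging: fully grown (high-variance) trees, each fit independently
# on its own bootstrap sample
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
)

# Boosting: depth-1 stumps (high-bias), fit one after another on
# reweighted data so later stumps focus on earlier mistakes
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)

bagging_cv = cross_val_score(bagging, X_cmp, y_cmp, cv=5).mean()
boosting_cv = cross_val_score(boosting, X_cmp, y_cmp, cv=5).mean()

print("Bagging  mean CV accuracy:", bagging_cv)
print("Boosting mean CV accuracy:", boosting_cv)
```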


QUES.3

ANS..Bootstrap sampling: Random sampling with replacement from the dataset to create multiple training subsets.

In Bagging, bootstrap sampling ensures each base learner is trained on slightly different data, which introduces diversity among the models.

Averaging these diverse models reduces variance and helps limit overfitting. Random Forest uses bootstrap samples to train decision trees independently.
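Bootstrap sampling itself takes only a few lines of NumPy. The sketch below draws three bootstrap samples from a toy dataset; the dataset and the names `dataset` and `bootstrap_samples` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
dataset = np.arange(10)  # toy dataset: the integers 0..9

bootstrap_samples = []
for i in range(3):
    # Draw len(dataset) indices uniformly WITH replacement
    idx = rng.integers(0, len(dataset), size=len(dataset))
    sample = dataset[idx]
    bootstrap_samples.append(sample)

    # Points that were never drawn are "out-of-bag" for this sample
    left_out = np.setdiff1d(dataset, sample)
    print(f"Sample {i}: {sample}  left out: {left_out}")
```

Each sample has the same size as the original data, but some points repeat and others are missing, which is what makes the base learners differ from one another.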

QUES.4

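ANS..OOB (out-of-bag) samples: Data points not selected in a given bootstrap sample. This is roughly 37% of the data for each base learner, because the chance that a point is never drawn in n draws is (1 - 1/n)^n ≈ 1/e.

These can act like a built-in validation set.

OOB score: The prediction accuracy computed on OOB samples, where each data point is predicted only by the base learners that did not see it during training. It helps evaluate the ensemble without needing an explicit validation set.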
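The 37% figure and the OOB score can both be checked with a short sketch. The sample size `n`, the breast cancer data, and the names `analytic`, `empirical`, and `rf_oob` are assumptions for this example; `oob_score=True` and the `oob_score_` attribute are standard scikit-learn options for `RandomForestClassifier`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fraction of points expected to be left out of one bootstrap sample
n = 10_000
analytic = (1 - 1 / n) ** n  # approaches 1/e as n grows

rng = np.random.default_rng(0)
idx = rng.integers(0, n, size=n)  # one bootstrap sample of indices
empirical = 1 - len(np.unique(idx)) / n

print("Analytic OOB fraction :", analytic)
print("Empirical OOB fraction:", empirical)

# OOB score: each sample is scored only by trees that never saw it
X_oob, y_oob = load_breast_cancer(return_X_y=True)
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
rf_oob.fit(X_oob, y_oob)
print("OOB accuracy:", rf_oob.oob_score_)
```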

QUES.5

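ANS..Decision Tree: Feature importance is the total reduction in impurity (Gini impurity or entropy, i.e. information gain) contributed by the splits on each feature. Because it comes from a single tree, it is unstable and tends to favour whichever dominant feature happens to be chosen for the early splits.

Random Forest: Aggregates importance across many trees, each trained on a different bootstrap sample, making it more stable, less biased, and better at capturing complex relationships.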
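One way to see the difference is to put the two sets of importances side by side. The sketch below is illustrative only: the breast cancer data, the column names, and the use of the standard deviation across trees as a stability measure are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()
single_tree = DecisionTreeClassifier(random_state=0).fit(bc.data, bc.target)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    bc.data, bc.target
)

comparison = pd.DataFrame(
    {
        "tree": single_tree.feature_importances_,
        "forest": forest.feature_importances_,
        # Spread of each feature's importance across the individual trees
        "forest_std": np.std(
            [t.feature_importances_ for t in forest.estimators_], axis=0
        ),
    },
    index=bc.feature_names,
)

print(comparison.sort_values("forest", ascending=False).head(5))
print(
    "Features given zero importance by the single tree:",
    int((comparison["tree"] == 0).sum()),
)
```

A single tree typically concentrates its importance on the few features it happened to split on, while the forest spreads it over more features.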

QUES.6

ANS..

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importance
importances = pd.Series(rf.feature_importances_, index=data.feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print("Top 5 Important Features:")
print(top5)


Top 5 Important Features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


QUES.7

ANS..

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))

# Bagging with Decision Trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
bag_acc = accuracy_score(y_test, bag.predict(X_test))

print("Decision Tree Accuracy:", dt_acc)
print("Bagging Accuracy:", bag_acc)


Decision Tree Accuracy: 1.0
Bagging Accuracy: 1.0


QUES.8

ANS..

In [3]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None]
}

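# X, y are the breast cancer features and labels loaded in the first cell;
# best_score_ below is the mean 5-fold cross-validated accuracy
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)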

print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)


Best Parameters: {'max_depth': 5, 'n_estimators': 100}
Best Accuracy: 0.9596180717279925


QUES.9

ANS..

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)

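# Bagging Regressor (the default base estimator is a decision tree).
# Note: 50 estimators here versus 100 in the forest below, so the
# comparison is not strictly like-for-like.
bag_reg = BaggingRegressor(n_estimators=50, random_state=42)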
bag_reg.fit(X_train, y_train)
bag_pred = bag_reg.predict(X_test)

# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)

print("Bagging Regressor MSE:", mean_squared_error(y_test, bag_pred))
print("Random Forest MSE:", mean_squared_error(y_test, rf_pred))


Bagging Regressor MSE: 0.2572988359842641
Random Forest MSE: 0.2553684927247781


QUES.10

ANS..Choice of Method:

Bagging (e.g. Random Forest) when the base model has high variance and stable predictions are needed.

Boosting (e.g. XGBoost or LightGBM) when the goal is to reduce bias and capture complex patterns.

Handling Overfitting:

Use cross-validation and regularization (a lower learning_rate in boosting, a limit on max_depth).

Use early stopping for boosting.
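Early stopping can be sketched with scikit-learn's `GradientBoostingClassifier`, which holds out part of the training data and stops adding trees once the validation score stops improving. The synthetic data and the specific values below (500 trees, patience of 10, 20% validation split) are assumptions for illustration; XGBoost and LightGBM offer their own early-stopping options, which are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a loan dataset (illustrative only)
X_es, y_es = make_classification(
    n_samples=2000, n_features=20, random_state=0
)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,        # shrinkage: smaller values regularize more
    max_depth=3,              # shallow trees limit model complexity
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_es, y_es)

print("Trees actually fitted:", gb.n_estimators_)
```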

Selecting Base Models:

Decision Trees are the most common choice.

Logistic Regression, SVMs, or Neural Nets can also serve as base models in stacking.
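A stacking setup with mixed base models might look like the sketch below. The breast cancer data stands in for real loan data, and the particular base models, the logistic-regression meta-learner, and the variable names are assumptions of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_st, y_st = load_breast_cancer(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_st, y_st, test_size=0.3, random_state=0, stratify=y_st
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # SVMs are scale-sensitive, so scale features inside a pipeline
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ],
    # The meta-learner is trained on out-of-fold base-model predictions
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(Xs_train, ys_train)

stack_acc = stack.score(Xs_test, ys_test)
print("Stacking accuracy:", stack_acc)
```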

Performance Evaluation:

Use k-fold cross-validation with ROC-AUC and precision-recall metrics, since loan data is typically imbalanced.
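The evaluation approach can be sketched with stratified k-fold cross-validation and two imbalance-aware metrics. The synthetic data with a 10% positive class is an assumed stand-in for loan defaults, and `average_precision` is used here as a single-number summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: about 10% positives ("defaults")
X_loan, y_loan = make_classification(
    n_samples=3000, n_features=15, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the class ratio the same in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
loan_model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

loan_scores = cross_validate(
    loan_model,
    X_loan,
    y_loan,
    cv=folds,
    scoring=["roc_auc", "average_precision"],
)

print("Mean ROC-AUC          :", loan_scores["test_roc_auc"].mean())
print("Mean average precision:", loan_scores["test_average_precision"].mean())
```

Plain accuracy is left out on purpose: with about 10% positives, a model that never predicts a default would already score around 90%.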

Why Ensemble Helps in Real World:

Loan default prediction is high-stakes: even a small error rate is costly.

Ensembles can improve accuracy and reduce false negatives (missed defaults).

More reliable predictions lead to better lending decisions and reduced financial risk.

# ASSIGNMENT COMPLETE..