In [None]:
# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
# Ans 1: Ensemble learning is a machine learning technique that combines the predictions of multiple models into a single, usually more accurate, predictive model.
# The key idea is to aggregate the outputs of several individual models (called base learners, or "weak learners" in boosting) so that their
#  individual errors partly cancel out. This reduces variance, bias, or both, and typically gives a stronger, more robust model than any of its members alone.
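# A minimal sketch of why aggregation can help, under a strong simplifying assumption: a binary task where every model is
# correct with the same probability and the models make errors independently. Real models are correlated, so actual gains
# are smaller. The function name majority_vote_accuracy is illustrative only.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # smallest number of votes that forms a strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Each individual model is right 60% of the time (assumed value).
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

# Under these assumptions the vote becomes more accurate as models are added; with correlated errors the curve flattens much earlier.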

# Question 2: What is the difference between Bagging and Boosting?
# Ans 2: Bagging
# Process: Creates multiple models in parallel, each trained on a different random subset of the data (bootstrap samples).
# Goal: Primarily reduces the variance of a model, which helps in preventing overfitting.
# Model weighting: Each model is given equal importance during the final prediction.
# Dependency: Models are independent and do not learn from each other.
# Use case: Best for unstable models that have high variance and low bias.
# Example: Random Forest is a well-known bagging-based algorithm; it also samples a random subset of features at each split.
# Boosting
# Process: Builds models sequentially, with each new model trying to correct the errors made by the previous ones.
# Goal: Primarily reduces bias, and also variance, by focusing on difficult-to-classify examples.
# Model weighting: Models can carry different weights. In AdaBoost, more accurate models get a larger say in the final prediction; in Gradient Boosting, each tree's contribution is scaled by a learning rate.
# Dependency: Each new model is dependent on the results of the previous models.
# Use case: Best when the base model underfits (high bias); it tends to be more sensitive to noisy labels and outliers than bagging.
# Examples: AdaBoost and Gradient Boosting are common boosting algorithms.
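# To make the sequential reweighting concrete, here is a sketch of one round of the classic binary AdaBoost update.
# It assumes labels and predictions in {-1, +1}, sample weights that sum to 1, and a weighted error strictly between 0 and 1.
# The helper name adaboost_round and the toy data are illustrative, not taken from any library.

```python
import math


def adaboost_round(labels, predictions, weights):
    """Return (alpha, new_weights) after one AdaBoost round."""
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for y, p, w in zip(labels, predictions, weights) if y != p)
    # Model weight: the lower the weighted error, the larger the model's say.
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (y * p == -1) are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha * y * p) for y, p, w in zip(labels, predictions, weights)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalize so the weights sum to 1


labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, 1, 1]  # the fourth sample is misclassified
weights = [0.2] * 5             # start from uniform weights

alpha, new_weights = adaboost_round(labels, predictions, weights)
print(f"alpha = {alpha:.4f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

# The misclassified sample ends up with the largest weight, so the next model in the sequence concentrates on it.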

# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
# Ans 3: Resampling with replacement: Bootstrap sampling creates new datasets by repeatedly drawing data points at random from the original dataset, usually until each new dataset is the same size as the original.
# Each time a data point is selected, it is put back into the pool, meaning it can be selected multiple times in a single bootstrap sample.
# Creates diverse datasets: Due to "sampling with replacement," each bootstrap sample is different from the others. Some data points may appear
# multiple times, while others may be omitted entirely.
# Simulates multiple experiments: It gives the effect of having multiple, independent training sets without having to collect new data.
# Role in Bagging and Random Forest
# Generates diverse base models: In a Random Forest, each decision tree is trained on a different bootstrap sample. This ensures that each
# tree learns from a slightly different perspective of the data.
# Reduces overfitting: By training on different subsets, the models learn different patterns and nuances in the data, making the ensemble
#  less likely to overfit to the noise of any single training set.
# Combines predictions: After the models are trained, their predictions are combined to form a more robust and reliable final prediction.
#  For classification, this is typically done through a "majority vote," and for regression, it's done by averaging the predictions.
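# A small sketch of bootstrap sampling using only the standard library. The sample size and seed are arbitrary choices.
# For a large dataset, a bootstrap sample of the same size is expected to contain about 1 - 1/e (roughly 63.2%) of the
# distinct original points; the rest are left out of that sample.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n_samples = 10_000

sample = bootstrap_indices(n_samples, rng)
in_bag = set(sample)                         # distinct rows that were drawn at least once
left_out = set(range(n_samples)) - in_bag    # rows never drawn into this sample

unique_fraction = len(in_bag) / n_samples
print(f"distinct rows in sample: {unique_fraction:.3f}")
print(f"rows left out:           {len(left_out) / n_samples:.3f}")
```

# Repeating the draw with a different seed gives a different in-bag set, which is what makes the trained models differ from one another.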

# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
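# Ans 4: Out-of-Bag (OOB) samples are the training points that a given model in a bagging ensemble never saw, because they were not drawn into
# that model's bootstrap sample. To compute the OOB score, each training point is predicted using only the models for which it was out-of-bag,
# and those predictions are scored against the true labels. The result is a nearly unbiased estimate of performance on unseen data, similar to a
# cross-validation score, without requiring a separate validation set.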
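# As an illustration, scikit-learn's RandomForestClassifier can report an OOB score when oob_score=True is set.
# The dataset, n_estimators, and seed below are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

oob_accuracy = forest.oob_score_  # accuracy for classifiers
print(f"OOB accuracy: {oob_accuracy:.4f}")
```

# No train/test split is needed here, because every row is evaluated only by trees that never trained on it.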

# Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
# Ans 5: Decision Tree

# Feature importance is calculated within a single tree.

# Importance is based on decrease in impurity (e.g., Gini or Entropy) caused by each feature’s splits.

# Highly sensitive to small changes in data — can change feature rankings easily.

# May be biased toward features with many categories or continuous values.

# Easier to interpret — you can clearly trace how each feature affects the output.

# Can overfit to noise in the training data.

# Reflects only that one fitted tree, so the ranking describes a single model rather than the data in general.

# Random Forest

# Feature importance is calculated across many decision trees in the forest.

# Each tree’s feature importance is computed individually, then averaged over all trees.

# More stable and reliable, since averaging reduces variance.

# Less affected by chance splits in any single tree, although impurity-based importance can still favor features with many distinct values.

# Harder to interpret because it’s an ensemble of many trees.

# Reduces overfitting, giving better generalization performance.

# Gives an aggregate view of feature importance, averaged over all trees in the forest.
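# The sketch below illustrates the averaging step with scikit-learn: it collects the impurity-based importances of each
# tree in a fitted forest, averages them, and compares the result with the forest's own feature_importances_.
# The standard deviation across trees gives a rough sense of how much a single tree's ranking can vary.
# Dataset and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

# One row of importances per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_importance = per_tree.mean(axis=0)
spread = per_tree.std(axis=0)  # variation of each feature's importance across trees

# Show the five features with the highest averaged importance.
top = np.argsort(mean_importance)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<22} mean={mean_importance[i]:.3f}  std across trees={spread[i]:.3f}")

# The forest-level importances should agree with the per-tree average.
matches_forest = np.allclose(mean_importance, forest.feature_importances_)
print("matches forest.feature_importances_:", matches_forest)
```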

In [1]:
# Question 6: Write a Python program to:
# ● Load the Breast Cancer dataset using
# sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.


from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

importances = model.feature_importances_

feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

print("Top 5 Most Important Features:")
print(top_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [5]:
# Question 7: Write a Python program to:
# ● Train a Bagging Classifier using Decision Trees on the Iris dataset
# ● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bag_model.fit(X_train, y_train)
bag_pred = bag_model.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_pred)

print(f"Single Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy:   {bag_accuracy:.4f}")


Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy:   1.0000


In [3]:
# Question 8: Write a Python program to:
# ● Train a Random Forest Classifier
# ● Tune hyperparameters max_depth and n_estimators using GridSearchCV
# ● Print the best parameters and final accuracy

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [10, 50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print("Best Hyperparameters:")
print(best_params)
print("\nFinal Test Accuracy:")
print(f"{final_accuracy:.4f}")


Best Hyperparameters:
{'max_depth': None, 'n_estimators': 100}

Final Test Accuracy:
1.0000


In [6]:
# Question 9: Write a Python program to:
# ● Train a Bagging Regressor and a Random Forest Regressor on the California
# Housing dataset
# ● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_regressor.fit(X_train, y_train)
bagging_pred = bagging_regressor.predict(X_test)

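# Note: this forest uses 100 trees while the bagging model above uses 50, so the MSE gap
# reflects ensemble size as well as the algorithm. With scikit-learn's default max_features
# for regressors, every split considers all features, so the forest behaves much like bagged trees.
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)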
rf_regressor.fit(X_train, y_train)
rf_pred = rf_regressor.predict(X_test)

bagging_mse = mean_squared_error(y_test, bagging_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

print(f"Bagging Regressor MSE:       {bagging_mse:.4f}")
print(f"Random Forest Regressor MSE: {rf_mse:.4f}")


Bagging Regressor MSE:       0.2579
Random Forest Regressor MSE: 0.2565


In [None]:
# Question 10: You are working as a data scientist at a financial institution to predict loan
# default. You have access to customer demographic and transaction history data.
# You decide to use ensemble techniques to increase model performance.
# Explain your step-by-step approach to:
# ● Choose between Bagging or Boosting
# ● Handle overfitting
# ● Select base models
# ● Evaluate performance using cross-validation
# ● Justify how ensemble learning improves decision-making in this real-world
# context.

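# Ans 10:
# Step 0 - Prepare the data: explore it, remove leakage (features that would not be known at decision time), and handle class imbalance, since defaults are usually rare.
# Step 1 - Choose between Bagging and Boosting: use Bagging (Random Forest) when stability matters most and the base model overfits; use Boosting (XGBoost/LightGBM) when higher accuracy is needed and the base model underfits.
# Step 2 - Handle overfitting: apply regularization, early stopping, and limited tree depth, and check each choice with cross-validation.
# Step 3 - Select base models: Decision Trees are the usual base learners for both approaches (shallow trees for Boosting, deeper ones for Bagging); Logistic Regression is a useful simple baseline to compare against.
# Step 4 - Evaluate with cross-validation: use Stratified K-Fold to preserve the default rate in each fold, or a time-based split when records are ordered in time. Report ROC-AUC, Precision, and Recall, and check calibration so that predicted probabilities are reliable.
# Step 5 - Justify the ensemble: combining multiple models reduces bias and variance, increases accuracy, and gives more reliable risk predictions. This helps identify likely defaulters early and improves financial decision-making.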
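# The evaluation step can be sketched as follows. This is not real loan data: make_classification generates a synthetic,
# imbalanced stand-in (about 10% positives), and the two models and their hyperparameters are assumptions chosen only to
# show stratified cross-validation with ROC-AUC, precision, and recall for a bagging-style and a boosting-style ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for loan records; class 1 plays the role of "default".
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=42,
)

models = {
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=100, max_depth=8, class_weight="balanced", random_state=42
    ),
    "boosting (gradient boosting)": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
    ),
}

# Stratified folds keep the share of defaults roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ("roc_auc", "precision", "recall")

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=list(metrics))
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

# On real, time-ordered loan records a time-based splitter would replace StratifiedKFold, so that each fold is evaluated only on later data.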