In [None]:
#What is Ensemble Learning in machine learning? Explain the key idea behind it.

'''
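->Ensemble learning is a technique in which multiple models (often called "learners" or "base models") are combined to solve a problem, typically to achieve better predictive performance than any single model. The key idea is that the errors of individual models partly cancel out when their predictions are aggregated, provided the models are reasonably accurate and do not all make the same mistakes.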

Common Ensemble Methods:
Bagging (Bootstrap Aggregating):
Trains multiple versions of the same model on different random subsets of the data.
Final prediction: average (regression) or majority vote (classification).
Example: Random Forest.

Boosting:
Builds models sequentially, where each new model focuses on correcting the errors of the previous ones.
Final prediction: weighted sum (regression) or weighted vote (classification).
Example: AdaBoost, Gradient Boosting, XGBoost.

Stacking (Stacked Generalization):
Trains multiple diverse models and then combines their outputs using a meta-model.
More flexible but more complex to train.
'''
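
The key idea can be checked with a small simulation. This is a minimal sketch, not part of the original answer: it assumes a binary task, 11 hypothetical models that are each right 70% of the time, and fully independent errors, which real ensembles only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 10_000, 11, 0.7

# Each row is one simulated model; True means that model classified the sample correctly.
# Assumption: the models err independently of one another.
correct = rng.random((n_models, n_samples)) < p_correct

single_accuracy = correct[0].mean()

# In a binary task the majority vote is right whenever more than half the models are right.
ensemble_accuracy = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority-vote accuracy: {ensemble_accuracy:.3f}")
```

The gain shrinks as the models' errors become more correlated, which is why bagging and random feature selection try to make the base models differ.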

In [None]:
#What is the difference between Bagging and Boosting?

'''
->| Feature               | **Bagging**                                                                | **Boosting**                                                                   |
| --------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| 🔄 **Model Training** | Models are trained **independently** in **parallel**                       | Models are trained **sequentially**, each correcting the previous              |
| 🗂️ **Data Sampling** | Uses **random subsets** of the data (with replacement — bootstrap samples) | Uses **all data**, but gives more weight to previously misclassified points    |
| 🧠 **Focus**          | Reduces **variance** (helps with overfitting)                              | Reduces **bias** (helps with underfitting)                                     |
| ⚖️ **Weighting**      | All models have **equal weight** in the final prediction                   | Models are **weighted** by accuracy (more accurate models have more influence) |
| ⚙️ **Complexity**     | Easier to parallelize, often faster                                        | More complex and harder to parallelize (sequential training)                   |
'''

In [None]:
#What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

'''
->Bootstrap sampling is a statistical technique used to create multiple datasets by randomly sampling with replacement from an original dataset. It's a key component in Bagging methods like Random Forest.

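Role in Bagging (e.g., Random Forest):
Multiple bootstrap samples, each the same size as the original training set, are drawn from the training data.
A separate model (e.g., a decision tree) is trained on each bootstrap sample.
Predictions from all models are combined:
Averaged for regression tasks.
Majority vote for classification tasks.

Because each model sees a slightly different dataset, the models make different errors, and aggregating them reduces variance. Random Forest additionally considers only a random subset of features at each split, which decorrelates the trees further.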
'''
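
A bootstrap sample is easy to draw by hand. The sketch below uses NumPy on a hypothetical dataset of 1,000 row indices (the size and seed are arbitrary assumptions) and checks how many distinct rows end up in the sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
indices = np.arange(n)  # stand-in for the row indices of a training set

# One bootstrap sample: n draws with replacement, so some rows repeat and others are left out.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

unique_fraction = np.unique(bootstrap_idx).size / n
print(f"Fraction of distinct original rows in the sample: {unique_fraction:.3f}")

# The expected fraction is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) for large n.
print(f"Theoretical expectation: {1 - (1 - 1 / n) ** n:.3f}")
```

The rows that are not drawn (roughly a third) are what a bagging method can later use as out-of-bag data for that model.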

In [None]:
#What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

'''
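->Out-of-Bag (OOB) samples are the training points that a given bootstrap sample happens to leave out. Because bootstrap sampling draws with replacement, each model's sample omits roughly 37% of the training points, and those points are OOB for that model. The OOB score uses them to evaluate Bagging methods like Random Forest without needing a separate validation set or cross-validation.

How Is the OOB Score Used?
Train each model (e.g., decision tree in Random Forest) on its bootstrap sample.
For every data point, collect predictions from the models that did NOT see that point (i.e., for which it was an OOB sample).
Aggregate those OOB predictions (e.g., majority vote or average).
Compare the aggregated OOB predictions with the true labels to compute the OOB score (in scikit-learn, accuracy for classifiers and R^2 for regressors).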

Why OOB Evaluation Is Useful:
Acts like built-in cross-validation.
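No separate validation set is needed during training, although a final held-out test set is still advisable.
Gives a reasonable, often slightly pessimistic, estimate of performance on unseen data, because each point is scored only by models that never trained on it.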
'''
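
In scikit-learn the OOB estimate is available by passing `oob_score=True`. The following is a small sketch; the dataset and the 200 trees are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_oob, y_oob = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_oob, y_oob)

print(f"OOB accuracy: {rf_oob.oob_score_:.4f}")

# Per-sample class probabilities aggregated over the OOB trees only.
print("OOB decision function shape:", rf_oob.oob_decision_function_.shape)
```

With only a handful of trees, some points may never be out-of-bag, so the estimate is more trustworthy with a larger `n_estimators`.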

In [None]:
#Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

'''
->1. Decision Tree Feature Importance:
How it's computed:
A single decision tree calculates feature importance based on the reduction in impurity (like Gini impurity or entropy) each time a feature is used to split the data.

Pros:
Easy to interpret.
Fast to compute.

Cons:
High variance — sensitive to training data.
Can be biased toward features with more levels (e.g., categorical variables with many categories).

2. Random Forest Feature Importance:
How it's computed:
Aggregates feature importances across all trees in the forest.
Each tree computes its own importance (based on impurity reduction), and the values are averaged and normalized.

Optional advanced method:
Permutation Importance: Shuffle each feature and observe how model performance changes — a drop indicates that the feature was important.

Pros:
More stable and generalizable due to averaging.
Reduces the effect of overfitting.

Cons:
Less interpretable than a single tree.
Still somewhat biased toward features with many unique values (unless using permutation importance).
'''
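
The comparison can be made concrete with a short experiment. This is a sketch under assumptions (breast cancer data, a default 75/25 stratified split, 100 trees, 5 permutation repeats), not a definitive benchmark.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data_fi = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data_fi.data, data_fi.target, random_state=0, stratify=data_fi.target
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: a single tree typically credits only the few features it split on,
# while the forest averages over many trees and spreads credit more evenly.
print("Features with non-zero importance (tree):  ", np.count_nonzero(tree.feature_importances_))
print("Features with non-zero importance (forest):", np.count_nonzero(forest.feature_importances_))

# Permutation importance on held-out data: shuffle one feature at a time and measure the score drop.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
top5 = np.argsort(perm.importances_mean)[::-1][:5]
for i in top5:
    print(f"{data_fi.feature_names[i]:<25} {perm.importances_mean[i]:.4f}")
```

When features are strongly correlated, as several are in this dataset, permutation importance can look small for each of them individually, so the two rankings need not agree.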

In [3]:
#Write a Python program to:● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()● Train a Random Forest Classifier● Print the top 5 most important features based on feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importances = rf.feature_importances_

# Create a DataFrame for easy viewing
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort by importance and get top 5
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

# Print the top 5 features
print("Top 5 Most Important Features:")
print(top_features)


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [3]:
#Write a Python program to:● Train a Bagging Classifier using Decision Trees on the Iris dataset● Evaluate its accuracy and compare with a single Decision Tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a single Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_preds = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# Train a Bagging Classifier with Decision Trees as base estimators
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging_model.fit(X_train, y_train)
bagging_preds = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_preds)

# Print results
print("Accuracy of single Decision Tree: {:.2f}%".format(dt_accuracy * 100))
print("Accuracy of Bagging Classifier: {:.2f}%".format(bagging_accuracy * 100))
print("Accuracy Improvement: {:.2f}%".format((bagging_accuracy - dt_accuracy) * 100))



Accuracy of single Decision Tree: 100.00%
Accuracy of Bagging Classifier: 100.00%
Accuracy Improvement: 0.00%


In [11]:
#Write a Python program to:● Train a Random Forest Classifier● Tune hyperparameters max_depth and n_estimators using GridSearchCV● Print the best parameters and final accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset (replace with your own if needed)
data = load_iris()
X, y = data.data, data.target

# Split data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Define the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid to search
param_grid = {
    'n_estimators': [10, 50, 100, 200],    # Number of trees
    'max_depth': [None, 5, 10, 20]          # Max depth of each tree
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,           # 5-fold cross-validation
    n_jobs=-1,      # Use all processors
    verbose=1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters:", grid_search.best_params_)

# Predict on test set with best estimator
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

# Final accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Final Accuracy: {accuracy:.4f}")


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Parameters: {'max_depth': None, 'n_estimators': 10}
Final Accuracy: 0.9111


In [7]:
#Write a Python program to:● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset● Compare their Mean Squared Errors (MSE)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Bagging Regressor with Decision Trees
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_preds = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_preds)

# Train a Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=50,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_preds = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_preds)

# Print results
print("Mean Squared Error of Bagging Regressor: {:.4f}".format(bagging_mse))
print("Mean Squared Error of Random Forest Regressor: {:.4f}".format(rf_mse))

# Optional: Difference
print("MSE Difference (Bagging - RF): {:.4f}".format(bagging_mse - rf_mse))


Mean Squared Error of Bagging Regressor: 0.2573
Mean Squared Error of Random Forest Regressor: 0.2573
MSE Difference (Bagging - RF): 0.0000
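
The two MSEs agree to four decimals. A likely reason is that recent scikit-learn versions default `RandomForestRegressor` to `max_features=1.0`, so every split considers all features and the forest behaves much like bagged decision trees. The sketch below varies `max_features` to make the forest's extra randomness visible. It uses `load_diabetes` purely as an assumption for convenience, since it ships with scikit-learn and needs no download; it is not the dataset from the cell above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

mse_by_max_features = {}
for max_features in (1.0, 1 / 3):
    # max_features=1.0: all features at every split (bagging-like behaviour).
    # max_features=1/3: a random third of the features at every split.
    forest_d = RandomForestRegressor(n_estimators=100, max_features=max_features, random_state=42)
    forest_d.fit(Xd_train, yd_train)
    mse_by_max_features[max_features] = mean_squared_error(yd_test, forest_d.predict(Xd_test))

for max_features, mse in mse_by_max_features.items():
    print(f"max_features={max_features:.2f}: MSE = {mse:.1f}")
```

Which setting wins depends on the dataset, so `max_features` is worth including in a hyperparameter search when comparing a forest against plain bagging.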


In [None]:
#You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:● Choose between Bagging or Boosting● Handle overfitting● Select base models● Evaluate performance using cross-validation● Justify how ensemble learning improves decision-making in this real-world context.

'''
->1. Choosing Between Bagging and Boosting
Bagging (e.g., Random Forest):
Good if your base models are high variance (like deep decision trees).
Works by training many independent models on random subsets of data, then averaging/voting.

Boosting (e.g., Gradient Boosting, XGBoost, LightGBM):
Focuses on sequentially correcting mistakes from previous models.
Can achieve higher predictive accuracy but is more prone to overfitting if not tuned well.
Often a strong choice for complex tabular data such as loan records, though class imbalance (few defaults) still needs explicit handling, e.g., class weights or resampling.

2. Handling Overfitting
Use cross-validation (e.g., stratified K-fold) to monitor generalization.

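For Bagging:
Limit tree depth.
Use more trees to reduce variance.

For Boosting:
Lower the learning rate and use early stopping to cap the number of trees.
Keep trees shallow and subsample rows or features.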

3. Selecting Base Models
Typically, decision trees are used as base learners because:
They capture nonlinear relationships and interactions well.
Easy to ensemble and interpret.

4. Evaluating Performance Using Cross-Validation
Use Stratified K-Fold Cross-Validation to maintain the proportion of default vs. non-default cases in each fold.
Evaluate multiple metrics, including:
AUC-ROC (discrimination ability)
Precision-Recall AUC (especially if classes are imbalanced)

5. How Ensemble Learning Improves Decision-Making in This Context
Higher accuracy and robustness: Ensemble models combine multiple weak learners to reduce errors from individual models, improving prediction reliability — critical in high-stakes loan decisions.
Capturing complex patterns: Ensembles can model nonlinearities and interactions in customer demographics and transaction histories that single models may miss.
Reducing risk of costly mistakes: Better prediction reduces false negatives (missing defaults) and false positives (denying credit to good customers), optimizing risk and customer satisfaction.
'''
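
The evaluation step can be sketched as follows. No loan data is available here, so the example assumes a synthetic dataset with roughly 10% positives as a stand-in for defaults; the two candidate models and their settings are likewise illustrative. Average precision is used as a summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumption: synthetic, imbalanced data standing in for real loan records.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified folds keep the default/non-default ratio the same in every fold.
cv_loan = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Random Forest (bagging)": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

cv_results = {}
for name, model in candidates.items():
    scores = cross_validate(
        model, X_syn, y_syn, cv=cv_loan, scoring=["roc_auc", "average_precision"]
    )
    cv_results[name] = {
        "roc_auc": scores["test_roc_auc"].mean(),
        "average_precision": scores["test_average_precision"].mean(),
    }
    print(f"{name}: ROC AUC = {cv_results[name]['roc_auc']:.3f}, "
          f"average precision = {cv_results[name]['average_precision']:.3f}")
```

On real loan data the same loop would run on the actual features, and the choice between the two families would follow whichever holds up better across folds on the metric that matches the cost of missed defaults.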
