Ensemble Learning Assignment

Q1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.


-> Ensemble learning is a machine learning paradigm where multiple models (often called base learners or "weak learners") are trained and combined to solve the same problem. The primary goal is to produce a single, stronger predictive model that is more accurate, reliable, and robust than any individual constituent model.


Key Idea: The Wisdom of the Crowd

The fundamental concept behind ensemble learning is the "Wisdom of the Crowd". Just as a panel of experts often reaches a more accurate diagnosis than a single doctor by pooling diverse perspectives, ensemble methods aggregate the outputs of multiple algorithms to cancel out individual errors and biases.
To be effective, the models within an ensemble must be diverse, meaning they should make different types of mistakes on the same data.
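
The benefit of combining diverse models can be illustrated with a small calculation. The sketch below assumes an idealised setting that real ensembles only approximate: a binary problem where every model is correct with the same probability p and the models err independently of each other. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes each model is correct with probability p, independently of the
    others, on a binary problem; n_models should be odd to avoid ties.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Each individual model is only 60% accurate
for n in (1, 5, 25, 101):
    print(f"{n:>3} models -> ensemble accuracy {majority_vote_accuracy(n, 0.6):.3f}")
```

Under these assumptions the ensemble accuracy rises with the number of models. When models make the same mistakes (no diversity), the independence assumption fails and the gain shrinks accordingly.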



Q2: What is the difference between Bagging and Boosting?


-> Bagging (e.g., Random Forest)
Goal: Reduce variance, prevent overfitting.
Method: Train many base models (like decision trees) independently and in parallel on different bootstrapped samples (sampling with replacement) of the data.

Model Weight: All models have equal weight in the final decision (majority vote or average).
Data Handling: Treats all data points equally in each subset.
Pros: Good for unstable models, parallelizable, robust to outliers.


Boosting (e.g., AdaBoost, XGBoost)
Goal: Reduce bias, improve accuracy.
Method: Train models sequentially; each new model focuses on misclassified or difficult examples from the previous one.
Model Weight: Models are weighted by performance; better models get more say.
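Data Handling: Focuses each new model on the current errors. AdaBoost assigns higher weights to misclassified data points, while gradient boosting methods such as XGBoost fit each new model to the remaining errors (loss gradients) of the ensemble so far.
Pros: High accuracy, reduces bias effectively.
Cons: Sequential training is slower and more sensitive to outliers/noise.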
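
To make the boosting side concrete, here is a minimal sketch of one reweighting round in the style of classic discrete AdaBoost for binary labels. It is an illustration under that assumption only; gradient boosting libraries such as XGBoost use a different update, and the function name adaboost_round and the toy inputs are hypothetical.

```python
import math

def adaboost_round(weights, is_correct):
    """One AdaBoost-style reweighting step for a binary classifier.

    weights: current sample weights (summing to 1).
    is_correct: whether the current weak learner classified each sample correctly.
    Returns (alpha, new_weights): the learner's vote weight and the
    renormalised sample weights.
    """
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)  # assumes 0 < err < 1
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Five equally weighted samples; the weak learner gets the last one wrong
weights = [0.2] * 5
is_correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, is_correct)
print(f"alpha = {alpha:.3f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update, the misclassified sample carries half of the total weight, so the next learner is pushed to get it right. A learner with a lower weighted error would receive a larger alpha, which is the "better models get more say" rule. Bagging has neither step: samples and models are all weighted equally.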



Q3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?


-> Bootstrap sampling is a statistical resampling technique that involves creating multiple new datasets by randomly selecting observations from an original dataset with replacement.

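In methods like Random Forest, bootstrap sampling serves as the primary engine for creating model diversity and improving overall performance:
Diversity: Each tree is trained on a different bootstrap sample, so the trees make different errors.
Variance Reduction: Aggregating the predictions of these differing trees (majority vote or average) cancels out much of the individual trees' variance.
Built-in Validation: The observations left out of a given tree's sample (its out-of-bag samples) can be used to evaluate the model without a separate validation set.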
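
As a minimal sketch of the sampling step itself, the snippet below draws a few bootstrap samples from a toy dataset of ten integers using only the standard library. The helper name bootstrap_sample and the toy data are assumptions for illustration; scikit-learn performs the equivalent step internally.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(42)  # fixed seed so the run is repeatable
data = list(range(10))
for i in range(3):
    sample = bootstrap_sample(data, rng)
    left_out = sorted(set(data) - set(sample))
    print(f"sample {i}: {sorted(sample)}  left out: {left_out}")
```

Each sample has the same size as the original data, but some items typically appear more than once while others are left out, which is why every tree in a bagged ensemble sees a slightly different dataset.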


Q4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?


-> In ensemble models using bagging (like Random Forest), Out-of-Bag (OOB) samples are the data points that are not selected during the bootstrap sampling process for a specific base learner.
Because bootstrap sampling involves picking samples with replacement, roughly 36.8% of the original data points are left out of each individual tree's training set. These leftover points serve as a "natural" test set for that specific tree.
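
The 36.8% figure follows from a short calculation: a given point is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick numerical check (the function name prob_left_out is made up for this sketch):

```python
import math

def prob_left_out(n: int) -> float:
    """Chance that a given point is missed by all n draws of a bootstrap sample."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 100000):
    print(f"n = {n:>6}: {prob_left_out(n):.4f}")
print(f"limit 1/e: {math.exp(-1):.4f}")
```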

How the OOB Score is Calculated
The OOB score is an internal performance metric that estimates the model's generalization accuracy without needing a separate validation set.

1. Individual Prediction: For every record in the original training set, find all the trees that did not use that record during their training (i.e., the trees for which it was "out-of-bag").
2. Aggregation: Gather the predictions from only those specific trees and aggregate them (by majority vote for classification or averaging for regression).
3. Final Score: Compare these aggregated "out-of-bag predictions" to the actual true labels for every record. The overall accuracy of these predictions is the OOB Score.
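
The procedure above can be sketched in plain Python. This is an illustration under stated assumptions, not how any library implements it: a 1-nearest-neighbour rule stands in for a decision tree to keep the code short, the one-dimensional toy data is made up, and the names fit_1nn and oob_score are hypothetical.

```python
import random
from collections import Counter

def fit_1nn(xs, ys):
    """Return a 1-nearest-neighbour predictor (a stand-in for a decision tree)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def oob_score(xs, ys, n_estimators=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]  # OOB votes collected per record
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (indices)
        model = fit_1nn([xs[i] for i in idx], [ys[i] for i in idx])
        for i in set(range(n)) - set(idx):          # records this learner never saw
            votes[i][model(xs[i])] += 1
    # Score only records that were out-of-bag for at least one learner
    scored = [(v.most_common(1)[0][0], ys[i]) for i, v in enumerate(votes) if v]
    return sum(pred == true for pred, true in scored) / len(scored)

# Toy data: class 0 clusters near 0, class 1 near 5 (made up for illustration)
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(30)] + [rng.gauss(5, 1) for _ in range(30)]
ys = [0] * 30 + [1] * 30
score = oob_score(xs, ys)
print(f"OOB score: {score:.3f}")
```

In scikit-learn the same estimate is available by passing oob_score=True to RandomForestClassifier and reading the fitted model's oob_score_ attribute.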



Q5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.


-> 1. Calculation Mechanism
Decision Tree:
Importance is calculated directly based on the reduction in a criterion (like Gini impurity or Mean Squared Error) at each split where the feature is used. The total reduction is summed for that feature and normalized across the tree.
Random Forest:
Importance is an aggregated average. The impurity reduction for each feature is computed in every individual tree in the forest, and the per-tree scores are then averaged.
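
The quantity being summed can be shown for a single split. The sketch below computes the Gini impurity decrease for one hypothetical node; the class counts are invented for illustration and the helper names gini and impurity_decrease are not library functions. In a full tree, each decrease is additionally weighted by the share of training samples that reach the node before being summed per feature and normalized.

```python
def gini(counts):
    """Gini impurity for a list of per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Weighted impurity decrease from splitting `parent` into `left` and `right`."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Hypothetical node: 5 samples of each class, split into (4, 1) and (1, 4)
decrease = impurity_decrease([5, 5], [4, 1], [1, 4])
print(f"Gini decrease for this split: {decrease:.2f}")  # prints 0.18
```

The parent impurity is 0.5 and each child has impurity 0.32, so the split is credited with a decrease of 0.18 for the feature it uses.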

2. Stability and Reliability
Decision Tree (Low Stability):
A single tree is highly sensitive to small changes in data. It may choose a feature for a top-level split simply by chance or because of a slight quirk in the training set, leading to erratic and potentially misleading importance scores.
Random Forest (High Stability):
By averaging scores across hundreds of decorrelated trees, each trained on different data subsets and feature subsets, Random Forest smooths out individual tree "noise". This provides a much more robust and trustworthy estimate of a feature's true predictive power.

3. Handling Correlated Features

Decision Tree (Winner-Take-All):
If two features are highly correlated, a Decision Tree will likely pick one for the split and ignore the other. The first feature gets all the "importance" credit, while the second gets almost none, even if they are equally useful.
Random Forest (Shared Credit):
 Because different trees use different random subsets of features, both correlated features will eventually be selected across the forest. This results in the importance being split between them, giving a more balanced view of the data.

4. Visualization and Interpretation
Decision Tree:
Highly interpretable and easy to visualize. You can see exactly which feature was used at the "root" (the first and usually most influential split) and follow the logic down to the leaves.
Random Forest:
Considered a "black box". While you get a list of importance scores, you cannot easily visualize the collective decision-making path of 500+ different trees.

Common Bias: High Cardinality
Both models share a common weakness: their impurity-based importance scores are biased toward high-cardinality features (features with many unique values, like IDs or continuous numerical data). These features offer more potential split points, giving them a mathematical advantage to be selected more often, which can artificially inflate their importance scores.





Q6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.




In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train a Random Forest Classifier
# random_state fixes the random seed so the results are reproducible
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# 3. Print the top 5 most important features
# Feature importances are stored in the 'feature_importances_' attribute
importances = rf_model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': importances
})

# Sort by importance in descending order and select top 5
top_5_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(5)

print("Top 5 Most Important Features:")
print(top_5_features.to_string(index=False))


Top 5 Most Important Features:
             Feature  Importance
          worst area    0.139357
worst concave points    0.132225
 mean concave points    0.107046
        worst radius    0.082848
     worst perimeter    0.080850


Q7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [2]:
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
# Using a fixed random state for reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 1. Train and evaluate a single Decision Tree ---
# Base model (often referred to as a "weak learner")
single_tree_model = DecisionTreeClassifier(random_state=42)
single_tree_model.fit(X_train, y_train)

# Predictions and accuracy
single_tree_predictions = single_tree_model.predict(X_test)
single_tree_accuracy = accuracy_score(y_test, single_tree_predictions)
print(f"Single Decision Tree Accuracy: {single_tree_accuracy:.4f}")


# --- 2. Train and evaluate a Bagging Classifier ---
# The base estimator (the `estimator` parameter) is a Decision Tree. We use 10 trees (n_estimators=10)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42,
    bootstrap=True,     # Draw training samples with replacement (bootstrap sampling)
    max_samples=1.0,    # Each bootstrap sample is as large as the training set (default)
    max_features=1.0    # Every base estimator sees all features (default)
)
bagging_model.fit(X_train, y_train)

# Predictions and accuracy
bagging_predictions = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)
print(f"Bagging Classifier Accuracy:   {bagging_accuracy:.4f}")

print("\nComparison Summary:")
if bagging_accuracy > single_tree_accuracy:
    print("The Bagging Classifier performed better than the single Decision Tree.")
elif bagging_accuracy < single_tree_accuracy:
    print("The single Decision Tree performed better (this can happen depending on the data split/random state).")
else:
    print("Both models achieved the same accuracy.")




Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy:   1.0000

Comparison Summary:
Both models achieved the same accuracy.


Q8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Define the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# 3. Define the hyperparameter grid to search
# 'n_estimators': number of trees
# 'max_depth': maximum depth of each tree
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

# 4. Initialize GridSearchCV
# cv=5 means 5-fold cross-validation
# n_jobs=-1 uses all available CPU cores for faster processing
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# 5. Run the grid search on the training data
grid_search.fit(X_train, y_train)

# 6. Extract results
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Final evaluation on the unseen test set
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print the final results
print(f"Best Parameters Found: {best_params}")
print(f"Final Accuracy on Test Set: {final_accuracy:.4f}")


Best Parameters Found: {'max_depth': None, 'n_estimators': 200}
Final Accuracy on Test Set: 0.9649


Q9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train Bagging Regressor
# We use a DecisionTreeRegressor as the base estimator
bagging_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_model.fit(X_train, y_train)
y_pred_bagging = bagging_model.predict(X_test)
mse_bagging = mean_squared_error(y_test, y_pred_bagging)

# 3. Train Random Forest Regressor
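# Random Forest is bagging of decision trees plus random feature selection at each split.
# Note: scikit-learn's RandomForestRegressor defaults to max_features=1.0 (all features),
# so with default settings it behaves much like the bagged trees above.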
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# 4. Compare Results
print(f"Bagging Regressor MSE:      {mse_bagging:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")

if mse_rf < mse_bagging:
    print("\nRandom Forest performed better (lower MSE).")
else:
    print("\nBagging performed better or equal (lower/equal MSE).")




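HTTPError: HTTP Error 403: Forbidden

(The run stopped at fetch_california_housing(), which downloads the dataset on first use. The download was refused, so no MSE values were produced.)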