# Ensemble Learning

Question 1:  What is Ensemble Learning in machine learning? Explain the key idea behind it.
 -  Ensemble learning is a machine learning paradigm in which multiple models (often called base learners, or "weak learners" when each is only slightly better than chance) are trained on the same problem and their predictions are combined to get better results than any one of them alone.
The key idea is that a group of individually modest models can outperform a single model, provided their errors are not strongly correlated. When the predictions are aggregated (e.g., through voting or averaging), the errors of individual models tend to cancel each other out, so the combined model is more robust and usually less prone to overfitting.
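
The error-cancelling effect can be illustrated with a small simulation. This is a minimal sketch under a strong assumption that real ensembles only approximate: every model is independently correct with the same probability. The function name `majority_vote_accuracy` and the values 0.6 and 25 are chosen purely for illustration.

```python
import random

def majority_vote_accuracy(n_models, p_correct, n_trials=20000, seed=0):
    """Estimate how often a majority vote of independent binary classifiers is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1, 0.6)
ensemble = majority_vote_accuracy(25, 0.6)
print(f"single model: {single:.3f}, 25-model majority vote: {ensemble:.3f}")
```

With correlated models the gain shrinks, which is why bagging and Random Forest deliberately inject randomness to make the base models differ.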


Question 2: What is the difference between Bagging and Boosting?
 - Bagging: Bagging trains the individual models independently (so they can run in parallel), each on a different random subset of the training data drawn with replacement (bootstrapping). The final prediction is typically an average (regression) or a majority vote (classification) of the individual models. This primarily reduces variance and helps prevent overfitting.

 - Boosting: In contrast, boosting trains the models in sequence, with each new model trying to correct the errors made by the previous ones. AdaBoost, for example, gives misclassified instances higher weight in subsequent training steps, while gradient boosting fits each new model to the remaining residual errors. This primarily reduces bias and combines weak learners into a strong learner.
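
To make the reweighting step concrete, here is a sketch of a single round of the binary AdaBoost weight update. AdaBoost is only one boosting algorithm, picked here as an assumption for illustration, and the five toy points and their misclassification pattern are hypothetical.

```python
import math

# Toy setup: 5 training points with equal weights. The current weak learner
# is assumed to misclassify points 1 and 3 (hypothetical values).
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
misclassified = [False, True, False, True, False]

# Weighted error of the current learner.
error = sum(w for w, m in zip(weights, misclassified) if m)

# The learner's say in the final weighted vote (binary AdaBoost formula).
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight the mistakes, down-weight the correct points, then renormalize.
updated = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

print(f"error={error:.2f}, alpha={alpha:.4f}")
print([round(w, 4) for w in updated])
```

After the update the two misclassified points carry weight 0.25 each and the others 1/6 each, so the next learner sees the earlier mistakes as half of the total weight.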


Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
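 - Bootstrap sampling is sampling with replacement from the original dataset, typically drawing as many points as the dataset contains, so some points appear more than once and others are left out.
In bagging methods like Random Forest, bootstrap sampling is used to create multiple diverse versions of the original training data. Each bootstrap sample is used to train a separate base model (e.g., a decision tree). Because every model sees a slightly different dataset, the models make partly different errors, and aggregating them reduces the variance of the overall ensemble model.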
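
As a quick illustration, the sketch below draws one bootstrap sample over row indices and measures how much of the original data it covers. The dataset size `n = 10000` and the seed are arbitrary choices for the example; only the standard library is used.

```python
import random

rng = random.Random(42)
n = 10000
indices = list(range(n))

# One bootstrap sample: draw n indices with replacement.
bootstrap = rng.choices(indices, k=n)

# Fraction of the original points that appear at least once in the sample.
unique_fraction = len(set(bootstrap)) / n
print(f"unique fraction: {unique_fraction:.3f}")
```

For large `n` the expected fraction approaches 1 - 1/e, roughly 63%, so about a third of the data is left out of each sample.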


Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
 - Out-of-Bag (OOB) samples are data points from the original dataset that were not included in a specific bootstrap sample.
Because each base model in a bagging ensemble never saw its own OOB samples during training, those samples can serve as a validation set without the need for a separate, dedicated validation set. The OOB score (e.g., accuracy or R²) is calculated by predicting each data point using only the trees that did not use that data point in their training, aggregating those predictions (majority vote or average), and scoring the result against the true targets. This provides an internal estimate of the ensemble's generalization performance that is usually close to what cross-validation would give.
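
In scikit-learn this estimate is available by passing `oob_score=True` to the forest. The sketch below assumes the Breast Cancer dataset and 200 trees, both chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap sample did not contain that sample.
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X, y)

oob_accuracy = rf_oob.oob_score_
print(f"OOB accuracy estimate: {oob_accuracy:.4f}")
```

No train/test split is needed here, because every prediction that enters the score comes from trees that never trained on the point being predicted.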


Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
 - Feature importance in a single Decision Tree is often calculated based on how much each feature reduces impurity (like Gini impurity or entropy) when it is used to split the data.
In a Random Forest, feature importance is calculated by averaging the impurity reduction contributions of each feature across all the individual decision trees in the forest. This ensemble approach provides a more stable estimate of feature importance than a single tree, whose importances depend heavily on the particular structure grown from one training sample: a correlated feature that narrowly loses a split can end up with zero importance. Impurity-based importances still tend to favor features with many distinct values, in a forest as much as in a single tree.
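
The averaging can be checked directly in scikit-learn. This sketch assumes the Breast Cancer dataset and default-style settings. It compares a single tree with a forest and confirms that the forest's `feature_importances_` match the mean of the per-tree importances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average the impurity-based importances of the individual trees.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

# Count features that received exactly zero importance in each model.
unused_by_tree = int((tree.feature_importances_ == 0).sum())
unused_by_forest = int((forest.feature_importances_ == 0).sum())

print("forest importances equal per-tree mean:",
      np.allclose(forest.feature_importances_, averaged))
print(f"features with zero importance: tree={unused_by_tree}, forest={unused_by_forest}")
```

A single tree tends to concentrate importance on the few features it happened to split on, while the forest spreads it across correlated alternatives.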




In [1]:
"""Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores."""

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
feature_names = cancer.feature_names

# Train a Random Forest Classifier
# A specific random state is used for reproducibility of the output
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

In [2]:
# Get feature importances
importances = rf.feature_importances_

# Create a pandas Series for easy sorting
feature_importances_series = pd.Series(importances, index=feature_names)

# Sort and get the top 5
top_5_features = feature_importances_series.nlargest(5)
print("Top 5 most important features:")
print(top_5_features)

Top 5 most important features:
worst area              0.139357
worst concave points    0.132225
mean concave points     0.107046
worst radius            0.082848
worst perimeter         0.080850
dtype: float64


In [3]:
"""Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree """
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
y_pred_single = single_tree.predict(X_test)
accuracy_single = accuracy_score(y_test, y_pred_single)
print(f"Accuracy of a single Decision Tree: {accuracy_single:.4f}")

# Train a Bagging Classifier using Decision T…

Accuracy of a single Decision Tree: 1.0000
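
The bagging half of this cell is cut off above. The following is a minimal sketch of how that comparison could be completed, not the original code: the choice of 50 estimators and the variable names `bagging` and `accuracy_bagging` are assumptions. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as the single-tree experiment.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bag 50 decision trees, each trained on its own bootstrap sample
# (50 is an assumed value, not taken from the original cell).
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    random_state=42,
)
bagging.fit(X_train, y_train)

accuracy_bagging = accuracy_score(y_test, bagging.predict(X_test))
print(f"Accuracy of the Bagging Classifier: {accuracy_bagging:.4f}")
```

Since the single tree already reaches 1.0000 on this split, bagging has no room to improve the test accuracy here. A harder dataset or repeated cross-validation would show the variance reduction more clearly.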


In [4]:
"""Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy """
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# Load the dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [None, 5, 10, 15],
    'n_estimators': [10, 50, 100, 200]
}

# Initialize the Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


In [5]:
# Print the best parameters and best score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate on the test set
best_rf = grid_search.best_estimator_
test_accuracy = best_rf.score(X_test, y_test)
print(f"Final accuracy on the test set: {test_accuracy:.4f}")

Best parameters found: {'max_depth': None, 'n_estimators': 50}
Best cross-validation score: 0.9623
Final accuracy on the test set: 0.9708


In [6]:
"""Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE) """
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data
y = california_housing.target

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Train models ---

# 1. Bagging Regressor
# The base estimator for BaggingRegressor is a DecisionTreeRegressor by default,
# and the default ensemble size is n_estimators=10
bag_reg = BaggingRegressor(random_state=42)
bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# 2. Random Forest Regressor
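# Random Forest is an extension of bagging with feature randomness (max_features);
# its default ensemble size is n_estimators=100, so the two models below
# differ in the number of trees as well as in the algorithm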
rnd_forest_reg = RandomForestRegressor(random_state=42)
rnd_forest_reg.fit(X_train, y_train)
y_pred_rnd = rnd_forest_reg.predict(X_test)
mse_rnd = mean_squared_error(y_test, y_pred_rnd)

# --- Compare Mean Squared Errors (MSE) ---

print(f"Bagging Regressor MSE: {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rnd:.4f}")

Bagging Regressor MSE: 0.2824
Random Forest Regressor MSE: 0.2554
