Ensemble Learning | Assignment


Question 1: What is Ensemble Learning in machine learning? Explain the key idea
behind it.


Ans: Ensemble learning is a machine learning technique that combines predictions from multiple individual models to improve overall performance. The key idea behind it is that a group of "weak learners," when combined, can form a "strong learner" that's more accurate and robust than any single model alone.

Key Idea
The core principle of ensemble learning is to leverage the "wisdom of the crowd." By training multiple diverse models on the same problem, the ensemble can reduce errors and biases that might be present in a single model. This is because different models often make different types of errors. When their predictions are combined, the correct predictions tend to reinforce each other, while the incorrect ones are more likely to be canceled out.

Think of it like this: If you ask a single expert for their opinion, you'll get one perspective. But if you ask a panel of experts with different backgrounds, their combined opinion is often more reliable and well-rounded. Ensemble methods work in a similar way, bringing together different perspectives from various models to arrive at a better final decision.
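
The "wisdom of the crowd" effect can be checked with a short calculation. The sketch below rests on an idealized assumption that the text above does not state: every model is correct with the same probability, and the models make their errors independently of one another. The function name `majority_vote_accuracy` is illustrative and does not come from any library.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes an odd number of voters, each correct with probability p_correct.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Each individual model is right 60% of the time.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the majority is right more often than any single voter as long as each voter beats chance, and the gain grows with the number of voters. Real models are correlated, so the improvement in practice is smaller.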

Common Ensemble Techniques
Two of the most widely used families of ensemble methods are:

Bagging (Bootstrap Aggregating): This technique trains multiple models independently on bootstrap samples of the training data, that is, random samples drawn with replacement. The predictions of these models are then averaged (for regression) or voted on (for classification) to get the final result. A popular example is the Random Forest algorithm, which builds an ensemble of decision trees and additionally considers only a random subset of features at each split.

Boosting: This method trains models sequentially. Each new model focuses on correcting the errors made by the previous models in the sequence. This iterative process allows the ensemble to progressively improve its performance. Examples include AdaBoost and Gradient Boosting Machines (GBM).

Bagging and Boosting are both powerful, but they approach the problem from different angles. Bagging focuses on reducing variance by averaging predictions from many diverse models, while Boosting focuses on reducing bias by iteratively correcting the mistakes of previous models.









Question 2: What is the difference between Bagging and Boosting?
Ans: Bagging and Boosting are both ensemble learning methods that combine multiple models to improve performance, but they differ fundamentally in their approach. The key distinction lies in how the models are built and combined.

Bagging
Bagging, short for Bootstrap Aggregating, trains multiple independent models in parallel on different random subsets of the training data.

Training: It uses a technique called bootstrapping to create multiple datasets by sampling with replacement from the original data. Each model is then trained on a different one of these bootstrap samples, completely independent of the others.


Focus: The primary goal is to reduce variance. By training many diverse models and averaging or voting on their predictions, the ensemble becomes more stable and less sensitive to fluctuations in the training data. This makes it particularly effective at combating overfitting. A classic example is the Random Forest algorithm, which uses bagging with decision trees.



Process: Parallel training. Because no model depends on another, all models can be trained at the same time.

Final Prediction: Predictions from all models are given equal weight and combined through a simple average (for regression) or a majority vote (for classification).

Boosting
Boosting trains a series of models sequentially, where each new model is designed to correct the errors of its predecessor.

Training: It starts with a base model. After this model makes predictions, the algorithm identifies the data points that were misclassified or had large errors. In AdaBoost, these difficult data points receive a higher weight, so the next model in the sequence focuses more on getting them right; in gradient boosting, the next model is instead fit to the remaining errors (residuals) of the current ensemble. This process is repeated iteratively.




Focus: The main objective is to reduce bias. By focusing on the "weak spots" of previous models, boosting creates a strong learner from a series of weak ones, progressively improving the model's accuracy.


Process: Sequential training. Each model is built based on the performance of the previous one.


Final Prediction: Predictions are combined using a weighted average or vote, where more accurate models in the sequence are given more influence over the final result. Popular boosting algorithms include AdaBoost and Gradient Boosting Machines (GBM).
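
The reweighting idea described under "Training" can be made concrete with one round of the classic binary AdaBoost update. This is a minimal sketch, not a full boosting implementation: the function name `adaboost_round` and the toy weights are made up for illustration, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math


def adaboost_round(weights, is_wrong):
    """One reweighting round of binary AdaBoost.

    weights  : current sample weights, assumed to sum to 1
    is_wrong : True where the current model misclassified the sample
    Returns (alpha, new_weights); alpha is the model's weight in the final vote.
    """
    # Weighted error of the current model.
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    # More accurate models (smaller error) receive a larger vote weight.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weight of mistakes, lower the weight of correct predictions.
    raised = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, is_wrong)
    ]
    total = sum(raised)
    return alpha, [w / total for w in raised]


weights = [0.2] * 5                              # five samples, equal weight
is_wrong = [False, False, True, False, False]    # the third one was misclassified
alpha, new_weights = adaboost_round(weights, is_wrong)
print(f"alpha = {alpha:.4f}")
print([round(w, 3) for w in new_weights])
```

In this toy round the misclassified sample's weight rises from 0.2 to 0.5, so the next model is pushed to get it right, while `alpha` is the influence this model would receive in the final weighted vote.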


Question 3: What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?

Ans: Bootstrap sampling is a resampling technique where you create new datasets by sampling with replacement from the original dataset. This means that when an item from the original dataset is chosen for a new sample, it is "replaced" and can be chosen again. Each new sample usually has the same size as the original dataset, but because of the repeated draws it contains duplicates and leaves some original points out.
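
As a small illustration, a bootstrap sample can be drawn with Python's standard library alone. The helper name `bootstrap_sample` and the ten-element toy dataset are assumptions made for this sketch.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(42)      # fixed seed only so the run is repeatable
data = list(range(10))       # toy dataset: the "observations" 0..9
sample = bootstrap_sample(data, rng)
left_out = sorted(set(data) - set(sample))

print("bootstrap sample:", sample)
print("never drawn:     ", left_out)
```

The sample has the same length as the original data, some values appear more than once, and the values that were never drawn are exactly the points that would be left out for a model trained on this sample.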



Role in Bagging and Random Forest
Bootstrap sampling is the foundation of Bagging, which stands for Bootstrap Aggregating.

The role of bootstrap sampling is to introduce diversity among the models in the ensemble. By training each individual model on a different bootstrap sample, you ensure that the models are exposed to slightly different versions of the data. This makes them less correlated with each other. If all models were trained on the exact same data, they would likely make the same errors, and combining their predictions wouldn't be as effective.

In a Random Forest, bootstrap sampling is a crucial first step:

Multiple samples are created: The algorithm generates many bootstrap samples from the original dataset. Each sample is a unique combination of data points, with some points appearing multiple times and others not appearing at all.

A tree is grown for each sample: A decision tree is trained on each of these bootstrap samples.

Variance is reduced: Because the trees are trained on different data subsets, their predictions are less correlated. When their predictions are aggregated (by averaging or voting), the effect of a single model's high variance is mitigated, leading to a more stable and accurate final prediction.
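
Why lower correlation matters can be seen from the standard formula for the variance of an average of identically distributed predictions. The sketch below assumes every model has the same variance `sigma2` and every pair of models has the same correlation `rho`; both are simplifications, and the function name is illustrative.

```python
def variance_of_average(sigma2: float, rho: float, n_models: int) -> float:
    """Variance of the mean of n_models predictions.

    Each prediction has variance sigma2 and every pair has correlation rho:
        Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n_models
    """
    return rho * sigma2 + (1 - rho) * sigma2 / n_models


# 100 models, each with unit variance, at different correlation levels.
for rho in (1.0, 0.5, 0.1):
    print(rho, variance_of_average(sigma2=1.0, rho=rho, n_models=100))
```

The second term shrinks as more models are added, but the first term does not. With perfectly correlated models averaging gains nothing, which is why bootstrap sampling (and, in a Random Forest, random feature selection) is used to push the correlation down.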

Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?

Ans: When a model is built using bootstrap sampling, some data points from the original training set are not included in the bootstrap sample for a given tree. On average, about 37% of the original data points are left out for each individual tree. These left-out data points are called Out-of-Bag (OOB) samples for that specific tree. They are essentially "unseen" data for that particular model.




For example, if you have a dataset of 100 observations and you create a bootstrap sample for one decision tree, that sample will contain 100 draws but, on average, only about 63 distinct observations. The remaining roughly 37 observations are the OOB samples for that tree.
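
The 37% figure follows from a short probability argument: with n draws with replacement from n points, a given point is missed on every draw with probability (1 - 1/n)^n, which approaches 1/e as n grows. The sketch below simply evaluates that expression; the function name is illustrative.

```python
import math


def expected_oob_fraction(n: int) -> float:
    """Probability that a given point is never picked in n draws
    with replacement from n points."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000, 100000):
    print(n, round(expected_oob_fraction(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```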

How is OOB Score Used?
The OOB score provides an approximately unbiased estimate of the ensemble model's performance. Instead of using a separate validation set (which reduces the amount of data available for training), the OOB samples serve as an internal test set.


Here's how it works:

Prediction: For each data point in the original training set, we collect predictions from only the trees for which that data point was an OOB sample.

Aggregation: These predictions are then aggregated. For a classification problem, this is usually a majority vote, and for a regression problem, it's an average.


Error Calculation: The aggregated prediction for each data point is compared to its true label to calculate an error.

Final Score: This process is done for all data points in the original dataset, and the errors are aggregated to produce a final OOB score (e.g., accuracy for classification or mean squared error for regression).

This OOB score is generally considered a reliable estimate of the model's generalization performance. It's computationally efficient because the evaluation comes almost for free during training, which often removes the need for a separate cross-validation run.
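
In scikit-learn, the procedure above is available through the `oob_score` option of the forest estimators. The following is a minimal sketch on the Breast Cancer dataset; the choice of dataset, 200 trees, and the seed are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks scikit-learn to score each training point using only
# the trees whose bootstrap sample did not contain that point.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

# For a classifier, oob_score_ is the accuracy of the aggregated OOB predictions.
print(f"OOB accuracy: {forest.oob_score_:.4f}")
```

No train/test split is needed here: every observation is evaluated only by trees that never saw it during training.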

Question 5: Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.

Ans:
Feature Importance in a Single Decision Tree
A single Decision Tree calculates feature importance based on how much a feature reduces impurity (e.g., Gini impurity or entropy) or error (e.g., mean squared error) when it's used to split a node.

Calculation: The importance score for a feature is the sum of the impurity reductions it provides across all the splits in which it is used, with each reduction weighted by the fraction of training samples that reach the node. Features used for splits near the root, where many samples are affected, and that lead to a significant decrease in impurity will have a higher importance score.

Reliability: The results from a single decision tree are often unstable and unreliable. A minor change in the training data can drastically alter the tree's structure and, consequently, its feature importance scores. This is because a single tree can easily overfit to the training data, leading to a biased view of which features are truly important. For example, if two features are highly correlated, the tree might only use one of them for a split and give it all the importance, while the other equally predictive feature gets a score of zero.
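
The impurity reduction used in the calculation step can be worked through by hand for a single split. This is a sketch with made-up labels and illustrative function names; a full tree importance would sum such reductions over every split on a feature, each weighted by the share of training samples reaching that node.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = ["A"] * 5 + ["B"] * 5    # perfectly mixed node
left = ["A"] * 4 + ["B"] * 1      # mostly class A after the split
right = ["A"] * 1 + ["B"] * 4     # mostly class B after the split

print(f"parent Gini:       {gini(parent):.2f}")
print(f"impurity decrease: {impurity_decrease(parent, left, right):.2f}")
```

Here the parent has Gini impurity 0.50 and each child 0.32, so the split is credited with a reduction of 0.18 for the feature that produced it.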



Feature Importance in a Random Forest
A Random Forest provides a much more robust and reliable measure of feature importance by leveraging the collective wisdom of all the trees in the ensemble.

Calculation: The feature importance for a Random Forest is the average of the feature importance scores from all the individual decision trees within the forest. For each tree, the importance is calculated the same way as a single decision tree (based on impurity reduction). These scores are then averaged across all trees.


Reliability: This averaging process is the key to its strength. Because each tree in the Random Forest is built on a different random subset of the data and a random subset of features, they are all diverse and less correlated. This helps to smooth out the noise and biases of any single tree. The final importance score is a more stable estimate of a feature's true predictive power, as a feature that appears important by chance in one tree is unlikely to appear important in many others. This makes it a much better indicator for feature selection and model interpretation.
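
The averaging described above can be checked directly against scikit-learn. The sketch below assumes the Breast Cancer dataset and default tree settings, and it relies on the behaviour of recent scikit-learn versions, where a forest's `feature_importances_` is the normalized mean of its trees' importances.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importance scores of every individual tree in the forest, one row per tree.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_importance = per_tree.mean(axis=0)

print("forest importances equal the per-tree mean:",
      np.allclose(forest.feature_importances_, mean_importance))
print("features with zero importance, single tree:",
      int((tree.feature_importances_ == 0).sum()))
print("features with zero importance, forest:     ",
      int((forest.feature_importances_ == 0).sum()))
```

Comparing the two zero-importance counts shows how many features a single tree ignores entirely versus how many the forest ignores, which is one practical symptom of the instability discussed above.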

Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.

In [None]:
#Ans:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load the Breast Cancer dataset
# The dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
# It describes characteristics of the cell nuclei present in the image.
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

print("Dataset loaded successfully.")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# Train a Random Forest Classifier
# n_estimators: The number of trees in the forest. A higher number generally improves performance
# but increases computation time.
# random_state: Seeds the bootstrap sampling and the random choice of features considered at each split,
# which makes the results reproducible.
print("Training Random Forest Classifier...")
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X, y)
print("Random Forest Classifier trained.\n")

# Get feature importances
# The feature_importances_ attribute returns an array where each value represents the importance
# of the corresponding feature. The sum of all importances is 1.
feature_importances = rf_classifier.feature_importances_

# Create a Pandas Series for better visualization and sorting
# We map the importance scores to their respective feature names.
features_df = pd.Series(feature_importances, index=X.columns)

# Sort features by importance in descending order
# This allows us to easily identify the most important features.
sorted_features = features_df.sort_values(ascending=False)

# Print the top 5 most important features
print("Top 5 Most Important Features:")
for rank, (feature, importance) in enumerate(sorted_features.head(5).items(), start=1):
    print(f"{rank}. {feature}: {importance:.4f}")



Question 7: Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree
(Include your Python code and output in the code box below.)

In [None]:
#Ans:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
# The Iris dataset is a classic and very easy-to-use dataset for classification.
# It contains measurements of iris flowers (sepal length, sepal width, petal length, petal width)
# and their corresponding species (Setosa, Versicolor, Virginica).
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("Iris dataset loaded successfully.")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# Split the data into training and testing sets
# We use a test size of 30% and a random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples\n")

# 1. Train a single Decision Tree Classifier
print("Training a single Decision Tree Classifier...")
# A simple Decision Tree can be prone to overfitting, especially with complex data.
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)
single_tree_predictions = single_tree.predict(X_test)
single_tree_accuracy = accuracy_score(y_test, single_tree_predictions)
print(f"Accuracy of a single Decision Tree: {single_tree_accuracy:.4f}\n")

# 2. Train a Bagging Classifier using Decision Trees
print("Training a Bagging Classifier (ensemble of Decision Trees)...")
# estimator: The individual model that will be trained (here, a Decision Tree).
#            This parameter was named base_estimator before scikit-learn 1.2.
# n_estimators: The number of base estimators (trees) in the ensemble.
# max_samples: The number (int) or fraction (float) of samples to draw from X to train each base estimator.
# bootstrap: Whether samples are drawn with replacement (True for Bagging).
# random_state: For reproducibility.
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=0.8, # Draw 80% of the training set size for each base estimator
    bootstrap=True,
    random_state=42,
    n_jobs=-1 # Use all available CPU cores for parallel training
)
bagging_classifier.fit(X_train, y_train)
bagging_predictions = bagging_classifier.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)
print(f"Accuracy of Bagging Classifier: {bagging_accuracy:.4f}\n")

# Compare the accuracies
print("--- Comparison ---")
print(f"Single Decision Tree Accuracy: {single_tree_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

if bagging_accuracy > single_tree_accuracy:
    print("\nThe Bagging Classifier performed better than the single Decision Tree on this test split, which is consistent with bagging's variance reduction.")
elif bagging_accuracy < single_tree_accuracy:
    print("\nThe single Decision Tree performed better on this test split. This can happen on small, easy datasets such as Iris, although Bagging typically offers more robust performance.")
else:
    print("\nBoth models achieved the same accuracy on this test split.")



Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

In [None]:
#Ans:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
# The Iris dataset is a classic and very easy-to-use dataset for classification.
# It contains measurements of iris flowers (sepal length, sepal width, petal length, petal width)
# and their corresponding species (Setosa, Versicolor, Virginica).
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("Iris dataset loaded successfully.")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}\n")

# Split the data into training and testing sets
# We use a test size of 30% and a random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples\n")

# Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
print("--- Random Forest with Hyperparameter Tuning (GridSearchCV) ---")
print("Training Random Forest Classifier for hyperparameter tuning...")

# Define the parameter grid for GridSearchCV
# max_depth: The maximum depth of the tree. Limiting depth helps prevent overfitting.
# n_estimators: The number of trees in the forest. More trees generally lead to better performance.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20]
}

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
# estimator: The model to tune (RandomForestClassifier).
# param_grid: The dictionary of hyperparameters to search.
# cv: Number of folds for cross-validation.
# scoring: The metric to optimize (e.g., 'accuracy' for classification).
# n_jobs: Number of CPU cores to use (-1 means all available cores).
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best parameters found: {grid_search.best_params_}")

# Print the best cross-validation score (accuracy)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate the best estimator on the test set
# The best_estimator_ attribute gives the model trained with the best parameters.
best_rf_model = grid_search.best_estimator_
final_rf_predictions = best_rf_model.predict(X_test)
final_rf_accuracy = accuracy_score(y_test, final_rf_predictions)
print(f"Final Random Forest accuracy on test set (with best parameters): {final_rf_accuracy:.4f}")



Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)

In [None]:
#Ans:
import pandas as pd
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error

# Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
print("--- California Housing Dataset: Regression Comparison ---")

# Load the California Housing dataset
# This dataset contains information about housing prices in California.
california_housing = fetch_california_housing()
X_housing = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y_housing = california_housing.target

print("California Housing dataset loaded successfully.")
print(f"Number of features: {X_housing.shape[1]}")
print(f"Number of samples: {X_housing.shape[0]}\n")

# Split the California Housing data into training and testing sets
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(X_housing, y_housing, test_size=0.3, random_state=42)

print(f"Housing Training set size: {X_train_housing.shape[0]} samples")
print(f"Housing Test set size: {X_test_housing.shape[0]} samples\n")

# Train a Bagging Regressor
print("Training a Bagging Regressor...")
# estimator: When not specified, the base model defaults to a DecisionTreeRegressor.
bagging_regressor = BaggingRegressor(n_estimators=100, random_state=42, n_jobs=-1)
bagging_regressor.fit(X_train_housing, y_train_housing)
bagging_regressor_predictions = bagging_regressor.predict(X_test_housing)
bagging_regressor_mse = mean_squared_error(y_test_housing, bagging_regressor_predictions)
print(f"Mean Squared Error (MSE) of Bagging Regressor: {bagging_regressor_mse:.4f}\n")

# Train a Random Forest Regressor
print("Training a Random Forest Regressor...")
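# Random Forest Regressor is an ensemble of Decision Tree Regressors.
# Note: with scikit-learn's default settings for regression, every split considers all features,
# so this model behaves much like the bagged trees above; a smaller max_features decorrelates the trees.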
random_forest_regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
random_forest_regressor.fit(X_train_housing, y_train_housing)
random_forest_regressor_predictions = random_forest_regressor.predict(X_test_housing)
random_forest_regressor_mse = mean_squared_error(y_test_housing, random_forest_regressor_predictions)
print(f"Mean Squared Error (MSE) of Random Forest Regressor: {random_forest_regressor_mse:.4f}\n")

# Compare the MSEs
print("--- Regression MSE Comparison ---")
print(f"Bagging Regressor MSE: {bagging_regressor_mse:.4f}")
print(f"Random Forest Regressor MSE: {random_forest_regressor_mse:.4f}")

if random_forest_regressor_mse < bagging_regressor_mse:
    print("\nThe Random Forest Regressor performed better (lower MSE) than the Bagging Regressor.")
elif random_forest_regressor_mse > bagging_regressor_mse:
    print("\nThe Bagging Regressor performed better (lower MSE) than the Random Forest Regressor.")
else:
    print("\nBoth regressors achieved the same MSE.")



Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.


In [None]:
#Ans:
import pandas as pd
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error

# Load the Iris dataset
# The Iris dataset is a classic and very easy-to-use dataset for classification.
# It contains measurements of iris flowers (sepal length, sepal width, petal length, petal width)
# and their corresponding species (Setosa, Versicolor, Virginica).
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
y_iris = iris.target

print("Iris dataset loaded successfully.")
print(f"Number of features: {X_iris.shape[1]}")
print(f"Number of samples: {X_iris.shape[0]}\n")

# Split the Iris data into training and testing sets
# We use a test size of 30% and a random state for reproducibility.
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.3, random_state=42)

print(f"Iris Training set size: {X_train_iris.shape[0]} samples")
print(f"Iris Test set size: {X_test_iris.shape[0]} samples\n")

# 1. Train a single Decision Tree Classifier on Iris
print("Training a single Decision Tree Classifier on Iris dataset...")
# A simple Decision Tree can be prone to overfitting, especially with complex data.
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train_iris, y_train_iris)
single_tree_predictions = single_tree.predict(X_test_iris)
single_tree_accuracy = accuracy_score(y_test_iris, single_tree_predictions)
print(f"Accuracy of a single Decision Tree: {single_tree_accuracy:.4f}\n")

# 2. Train a Bagging Classifier using Decision Trees on Iris
print("Training a Bagging Classifier (ensemble of Decision Trees) on Iris dataset...")
# estimator: The individual model that will be trained (here, a Decision Tree).
#            This parameter was called base_estimator before scikit-learn 1.2.
# n_estimators: The number of base estimators (trees) in the ensemble.
# max_samples: The number (int) or fraction (float) of samples to draw from X to train each base estimator.
# bootstrap: Whether samples are drawn with replacement (True for Bagging).
# random_state: For reproducibility.
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=0.8, # Each tree gets a bootstrap sample 80% the size of the training set
    bootstrap=True,
    random_state=42,
    n_jobs=-1 # Use all available CPU cores for parallel training
)
bagging_classifier.fit(X_train_iris, y_train_iris)
bagging_predictions = bagging_classifier.predict(X_test_iris)
bagging_accuracy = accuracy_score(y_test_iris, bagging_predictions)
print(f"Accuracy of Bagging Classifier: {bagging_accuracy:.4f}\n")

# Compare the accuracies (Iris dataset)
print("--- Iris Dataset Comparison ---")
print(f"Single Decision Tree Accuracy: {single_tree_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

if bagging_accuracy > single_tree_accuracy:
    print("The Bagging Classifier outperformed the single Decision Tree on this test split, which is consistent with bagging reducing the variance of an individual tree.")
elif bagging_accuracy < single_tree_accuracy:
    print("The single Decision Tree outperformed the Bagging Classifier on this test split. This can happen on a small, easy dataset such as Iris; bagging helps most when a single tree overfits.")
else:
    print("Both models achieved the same accuracy on this test split.")
print()


# 3. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV on Iris
print("--- Random Forest with Hyperparameter Tuning (GridSearchCV) on Iris ---")
print("Training Random Forest Classifier for hyperparameter tuning...")

# Define the parameter grid for GridSearchCV
# max_depth: The maximum depth of each tree (None means no limit). Limiting depth helps prevent overfitting.
# n_estimators: The number of trees in the forest. More trees usually stabilize predictions, with diminishing returns and higher training cost.
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20]
}

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
# estimator: The model to tune (RandomForestClassifier).
# param_grid: The dictionary of hyperparameters to search.
# cv: Number of folds for cross-validation.
# scoring: The metric to optimize (e.g., 'accuracy' for classification).
# n_jobs: Number of CPU cores to use (-1 means all available cores).
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train_iris, y_train_iris)

# Print the best parameters found by GridSearchCV
print(f"Best parameters found: {grid_search.best_params_}")

# Print the best cross-validation score (accuracy)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate the best estimator on the test set
# The best_estimator_ attribute gives the model trained with the best parameters.
best_rf_model = grid_search.best_estimator_
final_rf_predictions = best_rf_model.predict(X_test_iris)
final_rf_accuracy = accuracy_score(y_test_iris, final_rf_predictions)
print(f"Final Random Forest accuracy on test set (with best parameters): {final_rf_accuracy:.4f}\n")


# 4. Train Bagging Regressor and Random Forest Regressor on California Housing dataset
print("--- California Housing Dataset: Regression Comparison ---")

# Load the California Housing dataset
# This dataset contains information about housing prices in California.
california_housing = fetch_california_housing()
X_housing = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y_housing = california_housing.target

print("California Housing dataset loaded successfully.")
print(f"Number of features: {X_housing.shape[1]}")
print(f"Number of samples: {X_housing.shape[0]}\n")

# Split the California Housing data into training and testing sets
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(X_housing, y_housing, test_size=0.3, random_state=42)

print(f"Housing Training set size: {X_train_housing.shape[0]} samples")
print(f"Housing Test set size: {X_test_housing.shape[0]} samples\n")

# Train a Bagging Regressor
print("Training a Bagging Regressor...")
# No base estimator is passed, so BaggingRegressor uses its default, a DecisionTreeRegressor.
bagging_regressor = BaggingRegressor(n_estimators=100, random_state=42, n_jobs=-1)
bagging_regressor.fit(X_train_housing, y_train_housing)
bagging_regressor_predictions = bagging_regressor.predict(X_test_housing)
bagging_regressor_mse = mean_squared_error(y_test_housing, bagging_regressor_predictions)
print(f"Mean Squared Error (MSE) of Bagging Regressor: {bagging_regressor_mse:.4f}\n")

# Train a Random Forest Regressor
print("Training a Random Forest Regressor...")
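# Random Forest Regressor is an ensemble of Decision Tree Regressors.
# With its default max_features setting it considers all features at every split,
# so it behaves much like the Bagging Regressor above; expect the two MSEs to be close.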
random_forest_regressor = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
random_forest_regressor.fit(X_train_housing, y_train_housing)
random_forest_regressor_predictions = random_forest_regressor.predict(X_test_housing)
random_forest_regressor_mse = mean_squared_error(y_test_housing, random_forest_regressor_predictions)
print(f"Mean Squared Error (MSE) of Random Forest Regressor: {random_forest_regressor_mse:.4f}\n")

# Compare the MSEs
print("--- Regression MSE Comparison ---")
print(f"Bagging Regressor MSE: {bagging_regressor_mse:.4f}")
print(f"Random Forest Regressor MSE: {random_forest_regressor_mse:.4f}")

if random_forest_regressor_mse < bagging_regressor_mse:
    print("\nThe Random Forest Regressor performed better (lower MSE) than the Bagging Regressor.")
elif random_forest_regressor_mse > bagging_regressor_mse:
    print("\nThe Bagging Regressor performed better (lower MSE) than the Random Forest Regressor.")
else:
    print("\nBoth regressors achieved the same MSE.")
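
The regression comparison above rests on a single 70/30 split and reports raw MSE. A minimal sketch of a cross-validated variant follows, reporting RMSE so the error is in the same units as the target. It uses synthetic data from make_regression as a stand-in so that it runs without downloading anything. The synthetic dataset, the 50-tree setting, and the max_features=1/3 variant are assumptions for illustration and are not part of the original answer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the housing data).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=5, noise=10.0, random_state=42
)

# Shuffled 5-fold cross-validation with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

regressors = {
    "Bagging": BaggingRegressor(n_estimators=50, random_state=42),
    # All features are candidates at every split: close to plain bagging of trees.
    "Random Forest (all features)": RandomForestRegressor(
        n_estimators=50, max_features=1.0, random_state=42
    ),
    # Only about a third of the features are candidates at each split,
    # which decorrelates the trees.
    "Random Forest (1/3 of features)": RandomForestRegressor(
        n_estimators=50, max_features=1 / 3, random_state=42
    ),
}

rmse_results = {}
for name, reg in regressors.items():
    # scikit-learn scorers are maximized, so RMSE is returned negated.
    neg_rmse = cross_val_score(
        reg, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_results[name] = -neg_rmse.mean()
    print(f"{name}: mean RMSE over 5 folds = {rmse_results[name]:.4f}")
```

Whether restricting max_features helps depends on the data, so the three numbers are best read as a comparison on this particular synthetic problem rather than a general ranking.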


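
The question asks for performance to be evaluated using cross-validation, while the Iris comparison above relies on one train/test split. The sketch below shows one way this could be done with cross_val_score and stratified 5-fold splitting. The choice of three models, 100 trees, and 5 folds is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)

# Stratified folds keep the three species balanced in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Single Decision Tree": DecisionTreeClassifier(random_state=42),
    # The base estimator is passed positionally so the call works whether the
    # installed scikit-learn names the parameter estimator or base_estimator.
    "Bagging (100 trees)": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=100, random_state=42
    ),
    "Random Forest (100 trees)": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
}

cv_results = {}
for name, model in models.items():
    # One accuracy score per fold; the spread shows how stable the model is.
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.4f} (std = {scores.std():.4f})")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between models is larger than the fold-to-fold noise, which matters on a dataset as small as Iris.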