1. What is Ensemble Learning in machine learning? Explain the key idea
behind it.
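   - Ensemble Learning is a machine learning technique in which multiple models (called base learners, or weak learners in the boosting setting) are trained and their predictions are combined, for example by voting or averaging, into a single, more powerful model. The key idea is that the errors of individual models partly cancel out when the models are reasonably accurate and make different mistakes, so the ensemble typically achieves better predictive performance, higher stability, and improved generalization than any individual model alone.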
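
The error-cancelling idea can be illustrated with a small simulation. This is only a sketch under strong assumptions that do not come from the notebook: a binary task, 25 hypothetical models that are each right 60% of the time, and fully independent errors (real models are correlated, so real gains are smaller).

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 10_000   # number of test examples (hypothetical)
n_models = 25        # odd, so a binary majority vote never ties
p_correct = 0.6      # assumed accuracy of every individual model

# correct[i, j] is True when model j classifies example i correctly.
# Assumption: the models err independently of one another.
correct = rng.random((n_samples, n_models)) < p_correct

# Mean accuracy of the individual models.
individual_accuracy = correct.mean(axis=0).mean()

# On a binary task the majority vote is right when more than half
# of the models are right.
ensemble_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Average individual accuracy: {individual_accuracy:.3f}")
print(f"Majority-vote accuracy:      {ensemble_accuracy:.3f}")
```

The vote is noticeably more accurate than a typical member, which is the effect bagging and boosting try to obtain with real, imperfectly independent models.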

2. What is the difference between Bagging and Boosting?
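  - Bagging and Boosting are two fundamental ensemble learning techniques. They differ in how the models are trained, how they are combined, and which kind of error they mainly address.
  Bagging trains the base models independently (possibly in parallel), each on a bootstrap sample of the training data, and combines them by majority vote or averaging. It mainly reduces variance. Random Forest is an example.
  Boosting trains the base models sequentially, each one concentrating on the examples its predecessors handled poorly, and combines them with a weighted vote or sum. It mainly reduces bias. AdaBoost and Gradient Boosting are examples.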
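
A minimal sketch of the two approaches side by side, assuming scikit-learn is available. The synthetic dataset from `make_classification` and all hyperparameters are illustrative choices, not part of the assignment; deep trees are used for bagging and depth-1 stumps for boosting because those are the usual high-variance and high-bias base learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Bagging: fully grown trees, each fit independently on a bootstrap sample.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
)

# Boosting: decision stumps fit one after another, each reweighting
# the examples the previous stumps misclassified.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)

bagging_acc = bagging.fit(X_tr, y_tr).score(X_te, y_te)
boosting_acc = boosting.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Bagging (deep trees) accuracy: {bagging_acc:.4f}")
print(f"AdaBoost (stumps) accuracy:    {boosting_acc:.4f}")
```

Which one scores higher depends on the data, so the numbers from this sketch should not be read as a general ranking.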

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
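  - Bootstrap sampling is a resampling technique that creates multiple datasets from the original training data by randomly sampling data points with replacement.
  Each bootstrap sample has the same size as the original dataset.
  Some observations may appear multiple times, while others may not appear at all (on average, roughly 63% of the distinct observations end up in a given sample).
  In Bagging methods such as Random Forest, every tree is trained on its own bootstrap sample, so the trees differ from one another and averaging their predictions reduces variance.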
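
The sampling step can be sketched with NumPy alone. The dataset size of 10,000 and the seed are arbitrary choices for illustration; the fraction of distinct observations in a sample approaches 1 - 1/e, about 0.632, as the dataset grows.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000                 # size of the (hypothetical) training set
indices = np.arange(n)     # stand-in for the row indices of the data

# Draw n indices with replacement: one bootstrap sample.
bootstrap_idx = rng.choice(indices, size=n, replace=True)

# Fraction of distinct original rows that made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / n

# Rows that were never drawn are the out-of-bag rows for this sample.
oob_idx = np.setdiff1d(indices, bootstrap_idx)

print(f"Sample size:              {bootstrap_idx.size}")
print(f"Distinct rows in sample:  {unique_fraction:.3f}")
print(f"Out-of-bag rows:          {oob_idx.size}")
```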

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
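  - In bagging-based ensemble methods (such as Random Forest), each base model (e.g., a decision tree) is trained on a bootstrap sample, a dataset created by sampling with replacement from the original training set. The observations that are left out of a given model's bootstrap sample (roughly 37% of the training set on average) are that model's Out-of-Bag (OOB) samples. To compute the OOB score, each training observation is predicted using only the models that did not see it during training, and these predictions are scored against the true targets (in scikit-learn, accuracy for classifiers and R² for regressors). The result is an estimate of generalization performance that needs no separate validation set.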
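
In scikit-learn the OOB estimate is requested with `oob_score=True`. The sketch below uses the bundled Breast Cancer dataset and 200 trees purely as an example configuration; note that the model is fit on all the data and no test split is needed to obtain the estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_oob, y_oob = load_breast_cancer(return_X_y=True)

# oob_score=True makes the forest score every training row using only
# the trees whose bootstrap sample did not contain that row.
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
rf_oob.fit(X_oob, y_oob)

# For a classifier, oob_score_ is the OOB accuracy.
print(f"OOB accuracy estimate: {rf_oob.oob_score_:.4f}")

# Per-row OOB class probabilities are also available.
print(f"OOB decision function shape: {rf_oob.oob_decision_function_.shape}")
```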

5. Compare feature importance analysis in a single Decision Tree vs. a
Random Forest.
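  - Feature importance analysis helps us understand which input features contribute most to a model’s predictions. While both Decision Trees and Random Forests provide feature importance, the meaning, stability, and reliability differ significantly. In a single Decision Tree, a feature's importance is the total impurity reduction produced by the splits on that feature. It describes one particular tree, so it can change substantially after small changes in the training data, and any feature the tree never splits on gets an importance of zero. In a Random Forest, the importances are averaged over many trees grown on different bootstrap samples and random feature subsets, which makes them more stable and spreads the credit across correlated features. Both are impurity-based and can therefore favor features with many distinct values.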
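
A small sketch of this contrast, assuming scikit-learn and the bundled Breast Cancer dataset. It counts how many features receive non-zero importance in one tree versus a forest, and uses the spread of importances across the forest's trees as a rough indicator of how much a single tree's ranking can vary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_fi, y_fi = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X_fi, y_fi)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fi, y_fi)

tree_imp = tree.feature_importances_
forest_imp = forest.feature_importances_

# Importances of each individual tree inside the forest; their standard
# deviation shows how much a single tree's view of a feature can vary.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

n_used_tree = int((tree_imp > 0).sum())
n_used_forest = int((forest_imp > 0).sum())

print(f"Features with non-zero importance (single tree): {n_used_tree}")
print(f"Features with non-zero importance (forest):      {n_used_forest}")

# Top 5 forest features with the per-tree spread of their importance.
for i in np.argsort(forest_imp)[::-1][:5]:
    print(f"{data.feature_names[i]:<25} {forest_imp[i]:.3f} +/- {spread[i]:.3f}")
```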

6. Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.


In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Breast Cancer dataset
bc_data = load_breast_cancer()
X = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)
y = bc_data.target

print("Dataset loaded successfully.")
print(f"Shape of features (X): {X.shape}")
print(f"Shape of target (y): {y.shape}")
print("First 5 rows of features:")
display(X.head())

Dataset loaded successfully.
Shape of features (X): (569, 30)
Shape of target (y): (569,)
First 5 rows of features:


index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Now, let's split the data into training and testing sets and train a Random Forest Classifier.

In [2]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

print("Random Forest Classifier trained successfully.")
print(f"Training accuracy: {rf_classifier.score(X_train, y_train):.4f}")
print(f"Test accuracy: {rf_classifier.score(X_test, y_test):.4f}")

Random Forest Classifier trained successfully.
Training accuracy: 1.0000
Test accuracy: 0.9649


Finally, we will extract and print the top 5 most important features based on the trained model's feature importance scores.

In [3]:
# Get feature importances from the trained classifier
feature_importances = rf_classifier.feature_importances_

# Create a pandas Series for easy sorting and mapping to feature names
feature_importance_series = pd.Series(feature_importances, index=X.columns)

# Sort the features by importance in descending order
sorted_features = feature_importance_series.sort_values(ascending=False)

# Print the top 5 most important features
print("\nTop 5 most important features:")
display(sorted_features.head(5))


Top 5 most important features:


feature,importance
worst area,0.153892
worst concave points,0.144663
mean concave points,0.10621
worst radius,0.077987
mean concavity,0.068001


7. Write a Python program to:
● Train a Bagging Classifier using Decision Trees on the Iris dataset
● Evaluate its accuracy and compare with a single Decision Tree

In [4]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("Iris dataset loaded successfully.")
print(f"Shape of features (X): {X.shape}")
print(f"Shape of target (y): {y.shape}")
print("First 5 rows of features:")
display(X.head())

Iris dataset loaded successfully.
Shape of features (X): (150, 4)
Shape of target (y): (150,)
First 5 rows of features:


index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Next, we'll split the data into training and testing sets, and then train a single Decision Tree Classifier.

In [5]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a single Decision Tree Classifier
single_tree_classifier = DecisionTreeClassifier(random_state=42)
single_tree_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy
single_tree_predictions = single_tree_classifier.predict(X_test)
single_tree_accuracy = accuracy_score(y_test, single_tree_predictions)

print(f"Single Decision Tree Classifier Accuracy: {single_tree_accuracy:.4f}")

Single Decision Tree Classifier Accuracy: 1.0000


Now, let's train a Bagging Classifier using Decision Trees as base estimators and evaluate its performance.

In [9]:
# Train a Bagging Classifier with Decision Trees as base estimators
bagging_classifier = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=100, # Number of base estimators (trees)
    random_state=42
)
bagging_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy
bagging_predictions = bagging_classifier.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_predictions)

print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

Bagging Classifier Accuracy: 1.0000


Finally, let's compare the accuracies of the single Decision Tree and the Bagging Classifier.

In [8]:
print("\n--- Comparison ---")
print(f"Single Decision Tree Accuracy: {single_tree_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

if bagging_accuracy > single_tree_accuracy:
    print("The Bagging Classifier performed better than the single Decision Tree.")
elif bagging_accuracy < single_tree_accuracy:
    print("The single Decision Tree performed better than the Bagging Classifier.")
else:
    print("Both classifiers achieved the same accuracy.")


--- Comparison ---
Single Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy: 1.0000
Both classifiers achieved the same accuracy.
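
Both models scoring 1.0000 on a single 45-sample test split says little about which one generalizes better. One way to get a less split-dependent comparison is cross-validation. The sketch below is an addition to the assignment, not part of it, and the choice of 10 folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    random_state=42,
)

# 10-fold stratified cross-validation (accuracy is the default score
# for classifiers).
tree_cv = cross_val_score(single_tree, X_iris, y_iris, cv=10)
bag_cv = cross_val_score(bagged_trees, X_iris, y_iris, cv=10)

print(f"Single tree: {tree_cv.mean():.4f} +/- {tree_cv.std():.4f}")
print(f"Bagging:     {bag_cv.mean():.4f} +/- {bag_cv.std():.4f}")
```

On a dataset as small and clean as Iris the two means may still be very close, which is itself informative: bagging helps most when the base model has high variance.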


8. Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy

### Hyperparameter Tuning for Random Forest Classifier using GridSearchCV

We will perform hyperparameter tuning for a Random Forest Classifier on the Breast Cancer dataset. We'll tune `max_depth` and `n_estimators` using `GridSearchCV`, which tries every combination in the grid and keeps the one with the best cross-validated accuracy on the training set.

In [10]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
bc_data = load_breast_cancer()
X_bc = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)
y_bc = bc_data.target

# Split data into training and testing sets
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)

print("Breast Cancer Dataset loaded and split successfully.")
print(f"Training features shape: {X_train_bc.shape}")
print(f"Test features shape: {X_test_bc.shape}")

Breast Cancer Dataset loaded and split successfully.
Training features shape: (455, 30)
Test features shape: (114, 30)


Now, let's define the parameter grid and set up GridSearchCV to find the best hyperparameters for our Random Forest Classifier.

In [11]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30]  # Maximum depth of each tree (None = no depth limit)
}

# Initialize a Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')

# Fit GridSearchCV to the training data
grid_search.fit(X_train_bc, y_train_bc)

print("GridSearchCV completed.")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV completed.
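
The "12 candidates, totalling 60 fits" line in the log follows directly from the grid: 3 values of `n_estimators` times 4 values of `max_depth`, each evaluated on 5 folds. A quick way to check this is to enumerate the grid with `ParameterGrid`, as sketched here (the grid is copied from the cell above so the snippet runs on its own).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
}

# ParameterGrid enumerates every combination GridSearchCV will try.
candidates = list(ParameterGrid(param_grid))
n_folds = 5
total_fits = len(candidates) * n_folds

print(f"Candidates: {len(candidates)}")
print(f"Cross-validation fits: {total_fits}")
for params in candidates[:3]:
    print(params)
```

With the default `refit=True`, `GridSearchCV` then trains the best combination once more on the whole training set, which is the model exposed as `best_estimator_`.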


Finally, we'll print the best parameters found by GridSearchCV and evaluate the model's accuracy using these parameters on the test set.

In [13]:
# Print the best parameters found by GridSearchCV
print(f"\nBest parameters found: {grid_search.best_params_}")

# Get the best estimator (model) from GridSearchCV
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred_bc = best_rf_model.predict(X_test_bc)

# Evaluate the accuracy of the best model
final_accuracy = accuracy_score(y_test_bc, y_pred_bc)

print(f"Final model accuracy with best parameters: {final_accuracy:.4f}")


Best parameters found: {'max_depth': None, 'n_estimators': 200}
Final model accuracy with best parameters: 0.9649


9. Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)

In [14]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
housing_data = fetch_california_housing(as_frame=True)
X_housing = housing_data.data
y_housing = housing_data.target

print("California Housing Dataset loaded successfully.")
print(f"Shape of features (X_housing): {X_housing.shape}")
print(f"Shape of target (y_housing): {y_housing.shape}")
print("First 5 rows of features:")
display(X_housing.head())

California Housing Dataset loaded successfully.
Shape of features (X_housing): (20640, 8)
Shape of target (y_housing): (20640,)
First 5 rows of features:


index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


Next, we'll split the data into training and testing sets and then train both the Bagging Regressor and the Random Forest Regressor.

In [15]:
# Split data into training and testing sets
X_train_housing, X_test_housing, y_train_housing, y_test_housing = train_test_split(X_housing, y_housing, test_size=0.2, random_state=42)

# 1. Train a Bagging Regressor
bagging_regressor = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=42), # Base estimator
    n_estimators=100, # Number of base estimators
    random_state=42,
    n_jobs=-1 # Use all available cores
)
bagging_regressor.fit(X_train_housing, y_train_housing)

# 2. Train a Random Forest Regressor
random_forest_regressor = RandomForestRegressor(
    n_estimators=100, # Number of trees in the forest
    random_state=42,
    n_jobs=-1 # Use all available cores
)
random_forest_regressor.fit(X_train_housing, y_train_housing)

print("Bagging Regressor and Random Forest Regressor trained successfully.")

Bagging Regressor and Random Forest Regressor trained successfully.


Finally, we'll make predictions with both models and compare their Mean Squared Errors (MSE).

In [16]:
# Make predictions on the test set
bagging_predictions_housing = bagging_regressor.predict(X_test_housing)
rf_predictions_housing = random_forest_regressor.predict(X_test_housing)

# Calculate Mean Squared Error for both models
mse_bagging = mean_squared_error(y_test_housing, bagging_predictions_housing)
mse_random_forest = mean_squared_error(y_test_housing, rf_predictions_housing)

print(f"\nMean Squared Error (Bagging Regressor): {mse_bagging:.4f}")
print(f"Mean Squared Error (Random Forest Regressor): {mse_random_forest:.4f}")

print("\n--- Comparison ---")
if mse_bagging < mse_random_forest:
    print("The Bagging Regressor achieved a lower Mean Squared Error.")
elif mse_bagging > mse_random_forest:
    print("The Random Forest Regressor achieved a lower Mean Squared Error.")
else:
    print("Both regressors achieved the same Mean Squared Error.")


Mean Squared Error (Bagging Regressor): 0.2559
Mean Squared Error (Random Forest Regressor): 0.2554

--- Comparison ---
The Random Forest Regressor achieved a lower Mean Squared Error.
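
The two MSEs are nearly identical, and there is a likely reason: in scikit-learn, `RandomForestRegressor` considers all features at every split by default (`max_features=1.0`), which makes it behave essentially like bagged decision trees. The small gap is then probably down to how the two classes use their random seeds rather than to a real modelling difference. The sketch below shows how one might turn on per-split feature subsampling. It uses synthetic data from `make_regression` (an assumption, chosen to avoid downloading the housing data), so its numbers say nothing about California Housing.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with 8 features (illustrative only).
X_syn, y_syn = make_regression(
    n_samples=2000, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

mse_by_max_features = {}
for max_features in (1.0, 0.5):
    # 1.0 -> all features at each split (equivalent to bagged trees);
    # 0.5 -> a random half of the features at each split.
    model = RandomForestRegressor(
        n_estimators=100, max_features=max_features, random_state=0
    )
    model.fit(X_tr, y_tr)
    mse_by_max_features[max_features] = mean_squared_error(
        y_te, model.predict(X_te)
    )

for max_features, mse in mse_by_max_features.items():
    print(f"max_features={max_features}: MSE = {mse:.2f}")
```

Whether a smaller `max_features` helps depends on the dataset, so it is usually treated as one more hyperparameter to tune.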
