# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.
-  Ensemble Learning is a machine learning technique in which multiple individual models, called base learners,
are combined to build a stronger and more accurate model.

The key idea behind ensemble learning is that a group of weak or moderately accurate models can work together
to produce better predictions than any single model alone, provided their errors are not strongly correlated.
By combining their outputs through methods such as voting, averaging, or weighting, ensemble models reduce the
errors that come from high bias or high variance and are less sensitive to noise in the training data.

Common ensemble methods include Bagging (with Random Forest as a well-known variant) and Boosting.
When the base learners make sufficiently different errors, ensemble learning typically improves model performance,
robustness, and generalization on unseen data.
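
The effect of majority voting can be illustrated with a small calculation. The sketch below assumes an idealized setting that the text above does not state: a binary task, base learners that are each correct with probability 0.6, and fully independent errors. Real base learners are correlated, so actual gains are smaller. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, gives the right answer."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


# Odd ensemble sizes avoid ties in a binary vote.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the accuracy of the vote rises with the number of models, which is the intuition behind combining many weak learners.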


# Question 2: What is the difference between Bagging and Boosting?
-  Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning techniques used to improve model performance,
but they differ in how models are trained and combined.

Bagging trains multiple models independently on different random subsets of the training data created using bootstrap sampling.
All models are given equal importance, and their predictions are combined using averaging (for regression) or majority voting (for classification).
Bagging mainly helps in reducing variance and is effective for high-variance models such as Decision Trees.

Boosting trains models sequentially, where each new model focuses more on the instances that previous models got wrong.
How the models are combined depends on the algorithm: AdaBoost weights each model's vote by its performance,
while gradient boosting fits each new model to the remaining errors of the current ensemble and sums the scaled predictions.
Boosting mainly helps in reducing bias and can convert weak learners into strong learners.

In summary, Bagging reduces variance by aggregating models that are trained independently (and can therefore be trained in parallel),
while Boosting reduces bias by sequentially correcting errors.
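
The "focus on misclassified instances" step can be sketched for one boosting round. This example assumes the discrete AdaBoost update rule, which is only one boosting algorithm and is not named in the answer above. The labels and predictions are made-up toy values.

```python
import numpy as np

# Hypothetical toy labels and one weak learner's predictions, in {-1, +1}.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the second instance is misclassified

weights = np.full(len(y_true), 1 / len(y_true))  # start from uniform weights
miss = y_pred != y_true

err = weights[miss].sum()              # weighted error of this learner
alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote weight

# Raise the weights of misclassified instances, lower the rest, renormalize.
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights /= weights.sum()

print("alpha:", alpha)
print("updated weights:", weights)
```

After the update, the single misclassified instance carries half of the total weight, so the next learner is pushed to get it right. Bagging has no such step: every model sees an independent bootstrap sample and gets an equal vote.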


# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?
-  Bootstrap sampling is a statistical resampling technique in which multiple training datasets are generated
from the original dataset by randomly selecting data points with replacement.
Because sampling is done with replacement, some observations may appear multiple times in a bootstrap sample,
while some may not appear at all.
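
A minimal sketch of drawing one bootstrap sample with NumPy follows. The dataset is represented only by row indices, and the size `n = 10` and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
n = 10
indices = np.arange(n)          # stand-in for the row indices of a dataset

# One bootstrap sample: n draws with replacement from the n rows.
sample = rng.choice(indices, size=n, replace=True)
left_out = np.setdiff1d(indices, sample)  # rows that were never drawn

print("bootstrap sample:", np.sort(sample))
print("left out        :", left_out)
```

The sample has the same size as the original data, but repeated indices take the place of the rows listed as left out.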

In Bagging methods such as Random Forest, bootstrap sampling plays a crucial role in creating diversity among models.
Each decision tree in a Random Forest is trained on a different bootstrap sample of the data,
which ensures that the trees are not identical. Random Forest adds a second source of diversity by
considering only a random subset of features at each split.

This diversity reduces the variance of the model and helps prevent overfitting.
By aggregating predictions from many independently trained trees,
Random Forest achieves better accuracy, stability, and generalization compared to a single decision tree.


# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?
-  Out-of-Bag (OOB) samples are the data points that are not selected during bootstrap sampling
when training individual models in bagging-based ensemble methods such as Random Forest.
Because each bootstrap sample draws n points with replacement from a dataset of n points, on average about 63%
(1 - 1/e) of the distinct data points end up in a given model's training sample,
while the remaining 37% or so are Out-of-Bag samples for that model.
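
The 63% figure comes from a short calculation: a given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e as n grows. The sketch below evaluates this numerically; the function name is hypothetical.

```python
import math


def expected_in_bag_fraction(n: int) -> float:
    """Expected fraction of distinct rows that appear in a bootstrap
    sample of size n drawn from n rows."""
    return 1 - (1 - 1 / n) ** n


for n in (10, 100, 10_000):
    print(n, round(expected_in_bag_fraction(n), 4))

print("limit 1 - 1/e:", round(1 - math.exp(-1), 4))
```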

The OOB score is used as an internal validation method to evaluate the performance of ensemble models.
For each data point, predictions are made using only the models for which that data point was not included
in the training sample, and these predictions are then aggregated.

The OOB score provides a nearly unbiased (if anything, slightly pessimistic) estimate of model accuracy without requiring a separate validation dataset.
This makes it an efficient and reliable evaluation technique, especially for large ensemble models like Random Forest.
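
In scikit-learn, this estimate is available by passing `oob_score=True` to a bagging-based estimator. The sketch below uses the Breast Cancer dataset and 200 trees; both are illustrative choices rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True makes scikit-learn score each row using only the trees
# whose bootstrap sample did not contain that row.
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf_oob.fit(X, y)

print("OOB accuracy estimate:", rf_oob.oob_score_)
```

No train/test split is needed here because every row is scored only by trees that never saw it during training.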


# Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.
- Feature importance analysis measures how much each feature contributes to the predictive power of a model.

In a single Decision Tree, feature importance is calculated based on how much each feature reduces impurity
(e.g., Gini or Entropy) at each split. The importance is derived from the single tree structure, so it can be
biased toward features with more levels or continuous values, and it may vary significantly if the tree changes.

In a Random Forest, feature importance is averaged across all the trees in the ensemble.
Each tree is trained on a different bootstrap sample with random feature selection at each split,
so no single split structure dominates and the averaged scores are more stable than those of one tree.
Averaging does not remove the impurity-based bias toward high-cardinality features, but it does make the
importance scores far less sensitive to small changes in the training data.

In summary, while a single Decision Tree can give quick insights into feature importance, Random Forest
provides a more robust and accurate assessment by aggregating multiple trees.
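
The contrast can be inspected directly, since both estimators expose `feature_importances_`. The sketch below fits one tree and one forest on the Breast Cancer dataset (an illustrative choice) and also reports how much each importance varies across the individual trees of the forest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance of every feature in each individual tree of the forest.
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
spread = per_tree.std(axis=0)

print("features with zero importance (tree)  :",
      int((tree.feature_importances_ == 0).sum()))
print("features with zero importance (forest):",
      int((forest.feature_importances_ == 0).sum()))

# Top 3 features according to the forest, with the single tree for comparison.
for i in np.argsort(forest.feature_importances_)[::-1][:3]:
    print(f"{data.feature_names[i]:<22} "
          f"forest={forest.feature_importances_[i]:.3f} "
          f"(std across trees {spread[i]:.3f}) "
          f"tree={tree.feature_importances_[i]:.3f}")
```

A large spread across trees for a feature indicates how unreliable any one tree's importance score would be on its own.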


# Question 6: Write a Python program to:
- Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
- Train a Random Forest Classifier
- Print the top 5 most important features based on feature importance scores

(Include your Python code and output in the code box below.)

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for easy visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print top 5 most important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


# Question 7: Write a Python program to:
- Train a Bagging Classifier using Decision Trees on the Iris dataset
- Evaluate its accuracy and compare with a single Decision Tree

(Include your Python code and output in the code box below.)

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# Train Bagging Classifier using Decision Trees
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner; this parameter was named 'base_estimator' before scikit-learn 1.2
    n_estimators=50, random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
accuracy_bag = accuracy_score(y_test, y_pred_bag)

# Print accuracy results
print("Accuracy of Single Decision Tree:", accuracy_dt)
print("Accuracy of Bagging Classifier:", accuracy_bag)


Accuracy of Single Decision Tree: 1.0
Accuracy of Bagging Classifier: 1.0


# Question 8: Write a Python program to:
- Train a Random Forest Classifier
- Tune hyperparameters max_depth and n_estimators using GridSearchCV
- Print the best parameters and final accuracy

(Include your Python code and output in the code box below.)

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10, 15]
}

# Setup GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

# Train and find best hyperparameters
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_

# Evaluate final model
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Final Accuracy:", final_accuracy)


Best Parameters: {'max_depth': 5, 'n_estimators': 150}
Final Accuracy: 0.9707602339181286


# Question 9: Write a Python program to:
- Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
- Compare their Mean Squared Errors (MSE)

(Include your Python code and output in the code box below.)

# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor (the default base estimator is a decision tree regressor)
bagging = BaggingRegressor(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)

# Train Random Forest Regressor (note: 100 trees here versus 50 for bagging, so the comparison is not like-for-like)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

# Print Mean Squared Errors
print("MSE of Bagging Regressor:", mse_bag)
print("MSE of Random Forest Regressor:", mse_rf)


MSE of Bagging Regressor: 0.25787382250585034
MSE of Random Forest Regressor: 0.25650512920799395
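
The two MSE values are nearly identical for a reason: in recent scikit-learn releases `RandomForestRegressor` defaults to `max_features=1.0`, meaning every split considers all features, which makes it behave much like bagged decision trees. The sketch below makes the feature-subsampling setting explicit. It uses synthetic data from `make_regression` as a stand-in (an assumption, to keep the example quick and offline), so its numbers say nothing about the California Housing results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; sizes and noise level are arbitrary.
X, y = make_regression(
    n_samples=1000, n_features=10, n_informative=5, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same number of trees everywhere so that only feature subsampling differs.
candidates = {
    "Bagging (all features)": BaggingRegressor(
        n_estimators=100, random_state=42
    ),
    "Random Forest (max_features=1.0)": RandomForestRegressor(
        n_estimators=100, max_features=1.0, random_state=42
    ),
    "Random Forest (max_features='sqrt')": RandomForestRegressor(
        n_estimators=100, max_features="sqrt", random_state=42
    ),
}

mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE={mse[name]:.2f}")
```

Whether a smaller `max_features` helps depends on the data, so it is usually treated as a hyperparameter to tune rather than a fixed choice.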
