1. What is Ensemble Learning in machine learning? Explain the key idea behind it.
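- Ensemble Learning is a machine learning paradigm where multiple models (often called "base learners" or "weak learners") are trained to solve the same problem and their predictions are combined (for example by voting or averaging) to yield better results than any single model.
The key idea behind it is Error Reduction:
The fundamental goal of ensemble learning is to reduce the two main sources of error that lead to poor model performance: Bias and Variance. Because different models tend to make different mistakes, combining them lets their individual errors partly cancel out.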
Bias (Underfitting): When a model is too simple to capture the underlying patterns of the data.

Variance (Overfitting): When a model is too sensitive to the specific noise in the training data and fails to generalize to new data.
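
The variance side of this idea can be illustrated with a small simulation. This is a minimal sketch, not part of the original answer. It assumes 25 hypothetical models whose errors are unbiased and independent of each other; real ensemble members are usually correlated, so the gain in practice is smaller. The names `true_value`, `n_models` and `n_trials` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 3.0   # quantity every model tries to predict (hypothetical)
n_models = 25      # number of models in the ensemble (hypothetical)
n_trials = 10_000  # repeated experiments used to estimate the variance

# Each row is one trial; each column is one model's noisy prediction.
# Assumption: errors are independent, with zero mean and unit variance.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

# Spread of one model's predictions vs. spread of the averaged prediction
single_model_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

print("Variance of a single model:", round(single_model_var, 3))
print("Variance of the averaged ensemble:", round(ensemble_var, 3))
```

Under the independence assumption, the variance of the average is roughly the single-model variance divided by `n_models`. Averaging alone leaves the bias unchanged.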

2. What is the difference between Bagging and Boosting?
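- Bagging:- Bagging, short for Bootstrap Aggregating, is an ensemble machine learning technique designed to improve the stability and accuracy of algorithms by reducing variance. The base models are trained independently of each other (so they can be trained in parallel), each on its own bootstrap sample, and their predictions are combined by voting or averaging.
Boosting:- Boosting is an ensemble machine learning technique that combines a series of "weak learners" (models that are only slightly better than random guessing) into a single "strong learner." The models are trained sequentially, each one concentrating on the errors made by the previous ones, so boosting mainly reduces bias.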
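
The contrast can be sketched with scikit-learn's `BaggingClassifier` and `AdaBoostClassifier`. This is an illustrative sketch only: the synthetic dataset from `make_classification`, the split, and the choice of AdaBoost as the boosting method are assumptions made for the example. Both classes are used with their default base models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption made for this example)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: full decision trees (the default base model), each trained
# independently on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: depth-1 trees (the default base model), trained one after another,
# with misclassified samples receiving more weight in later rounds
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print("Bagging accuracy:", bagging_acc)
print("AdaBoost accuracy:", boosting_acc)
```

Which of the two scores higher depends on the data, so the numbers from this sketch should not be read as a general ranking.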

3. What is bootstrap sampling and what role does it play in Bagging methods
like Random Forest?
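- Bootstrap Sampling is a statistical technique that creates new datasets by repeatedly sampling from a single dataset with replacement, usually drawing as many points as the original dataset contains. Because the sampling is done with replacement, some points appear more than once in a sample and others do not appear at all. In statistics it is used to estimate the sampling distribution of a statistic.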
The Role of Bootstrapping in Random Forest:-
1. Decoupling the Trees (Independence)- Each tree in a Random Forest is trained on its own bootstrap sample. Because each tree sees a slightly different version of the data, they develop different "opinions," which makes their errors less correlated.
2. Handling Outliers- If your dataset has a "weird" outlier, bootstrapping ensures that only some trees see that outlier, so its influence on the combined prediction is diluted.
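
Drawing one bootstrap sample takes only a few lines of NumPy. The sketch below assumes a hypothetical dataset of 10,000 rows represented only by their indices. The variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000
indices = np.arange(n_samples)

# Draw a bootstrap sample: same size as the original, with replacement
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Some points are drawn several times; others are never drawn
unique_fraction = np.unique(bootstrap_idx).size / n_samples
left_out_fraction = 1.0 - unique_fraction

print("Fraction of distinct points in the sample:", unique_fraction)
print("Fraction of points left out:", left_out_fraction)
```

For large datasets the fraction of distinct points tends to 1 - 1/e, about 63.2%, which is why each tree ends up seeing a noticeably different view of the data.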

4. What are Out-of-Bag (OOB) samples and how is OOB score used to
evaluate ensemble models?
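- Out-of-Bag (OOB) samples are the data points "left over" during the bootstrap sampling process in Bagging methods like Random Forest. On average, about 36.8% of the training points are left out of any given bootstrap sample. Because they were never shown to a specific tree during its training, they act as a natural, "built-in" test set for that individual tree.
The OOB score is computed by predicting each training point using only the trees that did not see it, aggregating those predictions, and comparing the result with the true label. This gives an estimate of how well the ensemble generalizes without setting aside a separate validation set.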
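
In scikit-learn the OOB estimate is available through the `oob_score` parameter and the `oob_score_` attribute of `RandomForestClassifier`. The sketch below is a minimal illustration. The Breast Cancer dataset and the value `n_estimators=200` are choices made for the example, not part of the answer above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point
rf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
rf.fit(X, y)

# For a classifier, oob_score_ is the accuracy of those OOB predictions
oob_accuracy = rf.oob_score_
print("OOB accuracy estimate:", oob_accuracy)
```

Note that the model is fitted on all of `X` and still yields a held-out style estimate, since no point is scored by a tree that trained on it.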

5.  Compare feature importance analysis in a single Decision Tree vs. a
Random Forest?
- 1. The Single Decision Tree:
In a single tree, feature importance is calculated based on which features are used to split the data at each node.
-> Logic: A feature's importance is the total reduction in Gini impurity or Entropy it produces across all the nodes where it is used, weighted by the number of samples reaching those nodes. The feature chosen at the root is often, but not always, the most important.
-> Stability (Low): If you change your dataset even slightly, the root node might change entirely. This makes the feature importance ranking very volatile; it can flip-flop based on minor noise in the data.

2. The Random Forest:
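-> Logic: Since each tree is built on a different bootstrap sample of the data (Bagging) and considers a different random subset of features at each split, many different variables get a "chance to shine." The forest's importance for a feature is the average of that feature's importance across all the trees.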
-> Stability (High): Because it averages hundreds of trees, the importance scores are much more stable. Adding or removing a few rows of data won't significantly change the rankings.
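
The stability comparison can be checked empirically. The sketch below is an illustration under assumptions chosen for the example (the Breast Cancer dataset, five random 80% subsets of the rows, 50 trees per forest). It refits a single tree and a forest on each subset and measures how much each feature's importance score moves between refits.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

tree_importances = []
forest_importances = []

# Refit both models on several random 80% subsets of the rows
for seed in range(5):
    subset = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    X_sub, y_sub = X[subset], y[subset]

    tree = DecisionTreeClassifier(random_state=seed).fit(X_sub, y_sub)
    forest = RandomForestClassifier(
        n_estimators=50, random_state=seed
    ).fit(X_sub, y_sub)

    tree_importances.append(tree.feature_importances_)
    forest_importances.append(forest.feature_importances_)

tree_importances = np.array(tree_importances)
forest_importances = np.array(forest_importances)

# Standard deviation of each feature's score across refits,
# averaged over all features
tree_spread = tree_importances.std(axis=0).mean()
forest_spread = forest_importances.std(axis=0).mean()

print("Mean spread of importances, single tree:", tree_spread)
print("Mean spread of importances, random forest:", forest_spread)
```

A smaller spread means a more stable ranking. The forest is expected to show the smaller value, though the exact numbers depend on the subsets drawn.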

6.



In [1]:
# Write a Python program to:
# ● Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
# ● Train a Random Forest Classifier
# ● Print the top 5 most important features based on feature importance scores.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd


# Load the dataset and separate the features from the target
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train the Random Forest on the full dataset
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Impurity-based importance of each feature (the scores sum to 1)
importances = rf.feature_importances_

# Pair each feature name with its score and sort from most to least important
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False
)

print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))



Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


In [2]:
#Write a Python program to:
#● Train a Bagging Classifier using Decision Trees on the Iris dataset
#● Evaluate its accuracy and compare with a single Decision Tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score


# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: a single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_predictions = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)

# Bagging: 100 Decision Trees, each trained on its own bootstrap sample
# (the `estimator` argument requires scikit-learn 1.2 or newer)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
bag_predictions = bagging.predict(X_test)
bag_accuracy = accuracy_score(y_test, bag_predictions)

# Compare both models on the same test set
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bag_accuracy)


Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
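
Both models reach 1.0 on this split, which contains only 45 test samples, so a single split cannot separate them. One possible follow-up, sketched below as an assumption and not as part of the original task, is to compare the two models with 10-fold cross-validation using `cross_val_score`. The bagging model relies on its default base model, which is a decision tree.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

dt = DecisionTreeClassifier(random_state=42)
# The default base model of BaggingClassifier is a decision tree
bagging = BaggingClassifier(n_estimators=100, random_state=42)

# 10-fold cross-validation: every sample is used for testing exactly once
dt_scores = cross_val_score(dt, X, y, cv=10)
bag_scores = cross_val_score(bagging, X, y, cv=10)

print("Decision Tree mean CV accuracy:", dt_scores.mean())
print("Bagging mean CV accuracy:", bag_scores.mean())
```

Averaging over ten test folds gives a less noisy comparison than one 45-sample test set, although on a dataset as easy as Iris the two models may still score very close to each other.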


In [3]:
# Write a Python program to:
#● Train a Random Forest Classifier
#● Tune hyperparameters max_depth and n_estimators using GridSearchCV
#● Print the best parameters and final accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


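# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42)

# Candidate values for the two hyperparameters (3 x 3 = 9 combinations)
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# Evaluate every combination with 5-fold cross-validation on the training set
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters
best_rf = grid_search.best_estimator_

# Final accuracy on the held-out test set
y_pred = best_rf.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

print("Best Hyperparameters:", grid_search.best_params_)
print("Final Accuracy:", final_accuracy)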



Best Hyperparameters: {'max_depth': None, 'n_estimators': 100}
Final Accuracy: 1.0


In [4]:
#Write a Python program to:
#● Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
#● Compare their Mean Squared Errors (MSE)

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error


# Load the California Housing dataset (downloaded on first use)
data = fetch_california_housing()
X = data.data
y = data.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Bagging: 100 regression trees, each considering all features at every split
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_predictions = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_predictions)

# Random Forest: 100 regression trees with default settings
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_predictions = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_predictions)

# Lower MSE is better
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.2568358813508342
Random Forest Regressor MSE: 0.25650512920799395
