Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

Answer:
- Ensemble learning is a machine learning technique in which multiple models (called base learners) are combined to solve a problem, typically achieving better performance than any of the individual models alone.

**Key Idea: Wisdom of the Crowd**
 - **Diversity**: Different models (e.g., decision trees, neural networks) learn different patterns from the data.
 - **Compensation**: Errors made by one model are often corrected by others in the ensemble, as their individual predictions balance out.
 - **Aggregation**: A final decision is made by combining their outputs, often through voting (classification) or averaging (regression).

**How it Works (Simplified)**

- Build Multiple Models: Train several base models (e.g., Decision Trees, Support Vector Machines) on the training data.
- Combine Predictions: Aggregate their individual predictions using methods like voting or averaging to get a final prediction that is usually more accurate than any single model's.
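
The aggregation step described above can be sketched in plain Python. This is a minimal illustration that is not tied to any library; the function names `majority_vote` and `average` and the sample predictions are made up for the example.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common class label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Return the mean of the base models' numeric predictions."""
    return mean(predictions)

# Hypothetical outputs from three base models for a single input.
class_votes = ["spam", "ham", "spam"]
regression_outputs = [2.0, 3.0, 4.0]

final_class = majority_vote(class_votes)    # "spam" wins 2 votes to 1
final_value = average(regression_outputs)   # (2.0 + 3.0 + 4.0) / 3 = 3.0
print(final_class, final_value)
```

Voting is used when the base models output class labels, averaging when they output numbers.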

Question 2: What is the difference between Bagging and Boosting?

Answer:
- Bagging and Boosting are two major ensemble learning techniques, but they differ in how they train models and how they reduce errors:

| **Feature**      | **Bagging (Bootstrap Aggregating)**                   | **Boosting**                                                       |
| ---------------- | ----------------------------------------------------- | ------------------------------------------------------------------ |
| **Training**     | Parallel (independent models)                         | Sequential (each model depends on previous)                        |
| **Goal**         | Reduce Variance (avoid overfitting)                   | Reduce Bias (avoid underfitting)                                   |
| **Data**         | Random subsets with replacement (bootstrapping)       | Focuses more on misclassified samples of previous models           |
| **Model Weight** | All models have equal weight                          | Models weighted by performance (better models have more influence) |
| **Approach**     | Aggregate/average predictions of independently trained models | Iteratively combine weak models into a strong model |
| **Use Case**     | Works well for unstable, high-variance models         | Works well for simple, high-bias models                            |
| **Examples**     | Random Forest, Bagged Trees                           | Gradient Boosting, AdaBoost, XGBoost                               |
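
The "Data" and "Model Weight" rows for boosting can be made concrete with a single round of AdaBoost-style reweighting. The sketch below assumes the textbook discrete AdaBoost update for binary labels in {-1, +1}; the function name `adaboost_round` and the toy labels are hypothetical, and other boosting methods (for example gradient boosting) update differently.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost: return (model_weight, new_sample_weights).

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Model weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase the weights of misclassified samples, decrease the rest.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five equally weighted samples; the weak learner gets only the last one wrong.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(f"model weight: {alpha:.4f}")
print("new sample weights:", [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.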

Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

Answer:
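- Bootstrap sampling draws data points from the training set at random, with replacement, to build a new sample that is usually the same size as the original. Bagging (Bootstrap Aggregating) trains each model on a different bootstrap sample, which reduces the variance of the ensemble and its tendency to overfit.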

**Role in Bagging (Random Forest)**
- **Creates Diversity**: Each decision tree in a Random Forest is trained on a different bootstrap sample, making each tree learn slightly different patterns from the data.
- **Reduces Overfitting**: Each tree may still overfit its own bootstrap sample, but the trees overfit in different ways, so aggregating them cancels much of the noise that a single, deep decision tree would fit.
- **Enhances Stability**: Aggregating predictions (majority vote for classification, average for regression) from these diverse, weaker models results in a single, more robust, and accurate final prediction.
- **Out-of-Bag (OOB) Error**: The data points not selected in a specific bootstrap sample (the "out-of-bag" data) can be used to estimate the model's performance without needing a separate validation set.
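
Drawing a bootstrap sample and finding its out-of-bag points can be sketched with the standard library alone. This is an illustration, not how Random Forest implementations do it internally; the helper name `bootstrap_sample` and the dataset size of 10 are assumptions for the example.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(42)
n = 10  # size of a hypothetical training set

# Each "tree" gets its own sample, so each sees a different mix of rows.
for tree_id in range(3):
    in_bag, oob = bootstrap_sample(n, rng)
    print(f"tree {tree_id}: in-bag={sorted(in_bag)} oob={oob}")
```

Repeated indices in the in-bag list are rows drawn more than once; the rows listed under `oob` were never drawn for that tree.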

Question 4: What are Out-of-Bag (OOB) samples and how is the OOB score used to evaluate ensemble models?

Answer:
- Out-of-Bag (OOB) samples are data points left out from the bootstrap sample used to train individual trees in bagging ensembles (like Random Forests).

**What are OOB Samples?**
- Bootstrap Sampling: Bagging methods train multiple models (e.g., decision trees) by sampling the original dataset with replacement to create slightly different training sets (bootstrap samples) for each model.
- Left-Out Data: Because sampling is with replacement, some original data points are never chosen for a specific tree's training set; these are the "out-of-bag" (OOB) samples, making up about 37% of the data on average.
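
The 37% figure comes from the chance that a given point is missed in all n draws of a size-n bootstrap sample, which is (1 - 1/n)^n and approaches 1/e as n grows. A short check, with `oob_probability` as an illustrative helper name:

```python
import math

def oob_probability(n):
    """Probability that a given point is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 100000):
    print(n, round(oob_probability(n), 4))

limit = math.exp(-1)  # about 0.3679
print("limit:", round(limit, 4))
```

The probability rises toward the limit as n increases, so for realistically sized datasets roughly 37% of rows are out-of-bag for any one tree.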

**How the OOB Score Evaluates Models:**

- Individual Tree Prediction: For each OOB sample, identify all the trees in the ensemble that did not see it during their training.
- Majority Vote/Average: Use these specific trees to predict the outcome (class or regression value) for that OOB sample.
- Calculate Error/Score: Compare these predictions to the sample's true label to find the error (OOB Error) or accuracy (OOB Score) for that sample.
- Aggregate: Average the errors or scores across all samples that were out-of-bag for at least one tree to get the final OOB score, an internal validation metric that reflects how well the model generalizes.
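
The four steps above can be traced in a small pure-Python sketch. It is not a Random Forest: the base learner is a one-dimensional decision stump, the data is a made-up, well-separated toy set, and the names `fit_stump` and `oob_score` are hypothetical. The point is only to show which trees are allowed to vote on which samples.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D stump that predicts class 1 when x > threshold, else class 0."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    # Pick the threshold with the fewest training errors on this sample.
    return min(candidates, key=lambda t: sum((x > t) != (y == 1) for x, y in zip(xs, ys)))

def oob_score(xs, ys, n_trees=25, seed=0):
    """Bag n_trees stumps and return accuracy measured on out-of-bag votes only."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]
        threshold = fit_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        # Only samples this stump never saw are allowed to receive its vote.
        for i in set(range(n)) - set(in_bag):
            votes[i][int(xs[i] > threshold)] += 1
    # Majority vote per sample; samples that were never out-of-bag are skipped.
    scored = [(v.most_common(1)[0][0], ys[i]) for i, v in enumerate(votes) if v]
    return sum(pred == true for pred, true in scored) / len(scored)

# Toy data: class 0 clustered near -5, class 1 clustered near +5.
xs = [-6 + 0.1 * i for i in range(20)] + [4 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20

score = oob_score(xs, ys)
print(f"OOB score: {score:.4f}")
```

In scikit-learn the same estimate is available without manual bookkeeping: passing `oob_score=True` to `RandomForestClassifier` exposes it as the `oob_score_` attribute after fitting.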

Question 5: Compare feature importance analysis in a single Decision Tree vs. a Random Forest.

Answer:
**Feature Importance in a Single Decision Tree**

- In a single Decision Tree, feature importance is determined by how much each feature reduces the impurity (e.g., Gini impurity or entropy) of the nodes it splits [1].
  - Calculation: Importance is typically measured as the total reduction in impurity caused by a specific feature across all splits in which it was used [1].
- Characteristics:
  - Instability: The results can be highly unstable and dependent on the specific training data and the tree's structure. A small change in the data can drastically alter which features are selected for the top splits [1].
  - Bias Towards High Cardinality: Decision trees inherently favor continuous or high-cardinality categorical variables, as these features offer more splitting possibilities and thus a greater potential for impurity reduction, which can skew the importance scores [1].
  - Prone to Overfitting: The importance scores are derived from a potentially overfit model, meaning they might reflect the noise in the training data rather than the true underlying relationships [1].
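
The impurity reduction credited to a feature at one split can be worked out by hand. The sketch below uses Gini impurity on made-up class counts; `gini` and `impurity_decrease` are illustrative helper names, not library functions.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Weighted impurity reduction achieved by splitting parent into left and right."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy split: 10 samples (5 per class) divided into two purer children.
decrease = impurity_decrease(parent=[5, 5], left=[4, 1], right=[1, 4])
print(round(decrease, 4))  # 0.5 - 0.32 = 0.18
```

A feature's importance in the whole tree is the sum of such decreases over every node that splits on it, with each node weighted by the share of samples that reach it; the totals are then usually normalized to sum to 1.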

**Feature Importance in a Random Forest**

- A Random Forest is an ensemble of many decision trees, so its feature importance:
  - Averages each feature's impurity reduction across all trees
  - Benefits from the diversity created by bootstrap sampling and random feature selection
  - Produces more reliable and stable rankings than a single tree

| **Feature**           | **Single Decision Tree**                                               | **Random Forest**                                           |
| --------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------- |
| **Calculation Basis** | Total impurity reduction from splits within one tree                   | Average impurity reduction across all trees in the ensemble |
| **Stability**         | Highly unstable; sensitive to data variations                          | Stable and robust due to aggregation                        |
| **Reliability**       | Less reliable; may overfit and capture noise                           | More reliable; provides a global, robust estimate           |
| **Bias**              | Strong bias toward features with many unique values (high cardinality) | Bias is reduced by averaging but not removed; impurity-based scores can still favor high-cardinality features |


- In essence, while both methods use the same underlying metric (impurity reduction), the Random Forest provides a far more robust, stable, and reliable measure of global feature importance by leveraging the "wisdom of the crowd" principle [1].


In [1]:
'''
Question 6: Write a Python program to:
● Load the Breast Cancer dataset using
sklearn.datasets.load_breast_cancer()
● Train a Random Forest Classifier
● Print the top 5 most important features based on feature importance scores.
'''
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

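# Impurity-based importances, normalized so that they sum to 1
importances = model.feature_importances_

# Indices of the five largest importances, in descending order
indices = np.argsort(importances)[::-1][:5]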

print("Top 5 Important Features:\n")
for i in indices:
    print(f"{feature_names[i]}: {importances[i]:.4f}")


Top 5 Important Features:

worst area: 0.1394
worst concave points: 0.1322
mean concave points: 0.1070
worst radius: 0.0828
worst perimeter: 0.0808


In [3]:
'''
Question 7: Write a Python program to:
● Load the Iris dataset using sklearn.datasets.load_iris()
● Train a Bagging Classifier using Decision Trees
● Compare its accuracy with that of a single Decision Tree
'''
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, y_pred_dt)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bag)

print(f"Decision Tree Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

Decision Tree Accuracy: 1.0000
Bagging Classifier Accuracy: 1.0000


In [4]:
'''
Question 8: Write a Python program to:
● Train a Random Forest Classifier
● Tune hyperparameters max_depth and n_estimators using GridSearchCV
● Print the best parameters and final accuracy
'''
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42)

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 3, 5, 7]
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is refit on the full training set (refit=True by default)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Final Test Accuracy: {accuracy:.4f}")


Best Parameters: {'max_depth': None, 'n_estimators': 100}
Final Test Accuracy: 1.0000


In [6]:
'''
Question 9: Write a Python program to:
● Train a Bagging Regressor and a Random Forest Regressor on the California
Housing dataset
● Compare their Mean Squared Errors (MSE)
'''
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # `estimator` replaced `base_estimator` in scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
mse_bag = mean_squared_error(y_test, y_pred_bag)


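# Note: in recent scikit-learn versions RandomForestRegressor defaults to
# max_features=1.0 (all features at every split), so it behaves much like
# bagged trees here, which is why the two MSE values are nearly identical.
rf = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)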
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)

print(f"Bagging Regressor MSE: {mse_bag:.4f}")
print(f"Random Forest Regressor MSE: {mse_rf:.4f}")


Bagging Regressor MSE: 0.2559
Random Forest Regressor MSE: 0.2554


In [9]:
'''
Question 10: You are working as a data scientist at a financial institution to predict loan
default. You have access to customer demographic and transaction history data.
You decide to use ensemble techniques to increase model performance.
Explain your step-by-step approach to:
● Choose between Bagging or Boosting
● Handle overfitting
● Select base models
● Evaluate performance using cross-validation
● Justify how ensemble learning improves decision-making in this real-world
context.
'''
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Synthetic stand-in for loan data: roughly 80% non-default, 20% default,
# with some label noise added through flip_y
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    n_clusters_per_class=2, weights=[0.8,0.2], flip_y=0.05, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)

bagging_acc = accuracy_score(y_test, y_pred_bag)
bagging_auc = roc_auc_score(y_test, bagging.predict_proba(X_test)[:,1])

print("=== Bagging Classifier ===")
print(f"Accuracy: {bagging_acc:.4f}")
print(f"AUC: {bagging_auc:.4f}")
print(classification_report(y_test, y_pred_bag))

gbc = GradientBoostingClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1, 0.2]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(gbc, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)

best_gbc = grid.best_estimator_
y_pred_gbc = best_gbc.predict(X_test)

gbc_acc = accuracy_score(y_test, y_pred_gbc)
gbc_auc = roc_auc_score(y_test, best_gbc.predict_proba(X_test)[:,1])

print("\n=== Gradient Boosting Classifier ===")
print(f"Best Parameters: {grid.best_params_}")
print(f"Accuracy: {gbc_acc:.4f}")
print(f"AUC: {gbc_auc:.4f}")
print(classification_report(y_test, y_pred_gbc))

print("\n=== Comparison ===")
print(f"Bagging Accuracy: {bagging_acc:.4f}, AUC: {bagging_auc:.4f}")
print(f"Boosting Accuracy: {gbc_acc:.4f}, AUC: {gbc_auc:.4f}")


=== Bagging Classifier ===
Accuracy: 0.9033
AUC: 0.9492
              precision    recall  f1-score   support

           0       0.89      1.00      0.94       234
           1       1.00      0.56      0.72        66

    accuracy                           0.90       300
   macro avg       0.94      0.78      0.83       300
weighted avg       0.91      0.90      0.89       300


=== Gradient Boosting Classifier ===
Best Parameters: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 150}
Accuracy: 0.9267
AUC: 0.9573
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       234
           1       0.96      0.70      0.81        66

    accuracy                           0.93       300
   macro avg       0.94      0.84      0.88       300
weighted avg       0.93      0.93      0.92       300


=== Comparison ===
Bagging Accuracy: 0.9033, AUC: 0.9492
Boosting Accuracy: 0.9267, AUC: 0.9573
