# Question 1: What is Ensemble Learning in machine learning? Explain the key idea behind it.

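**Ensemble Learning** is a machine learning approach in which **several models are trained on the same problem and their predictions are combined**, for example by voting or averaging. The key idea is that individual models make partly different errors, so combining them cancels some of those errors and usually gives a final model that is more accurate and more reliable than any single model.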
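The effect of combining predictions can be illustrated with a small simulation. This is only a sketch under a strong assumption: every model is right with the same probability (0.7 here, a made-up value) and the models err independently, which real models rarely do. The function name `majority_vote_accuracy` is hypothetical.

```python
import random


def majority_vote_accuracy(n_models, p_correct, n_trials, seed=0):
    """Estimate how often a majority vote of independent models is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each model is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_models))
        # The ensemble is right when more than half of the models are right.
        if correct_votes > n_models / 2:
            wins += 1
    return wins / n_trials


single = majority_vote_accuracy(n_models=1, p_correct=0.7, n_trials=5000)
ensemble = majority_vote_accuracy(n_models=25, p_correct=0.7, n_trials=5000)
print(f"single model: {single:.3f}")
print(f"25-model majority vote: {ensemble:.3f}")
```

The more correlated the models' errors are, the smaller this gain becomes, which is why ensemble methods deliberately inject diversity (different samples, features, or training order).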


# Question 2: What is the difference between Bagging and Boosting?

### Difference between **Bagging** and **Boosting**

| Basis            | **Bagging (Bootstrap Aggregating)**                | **Boosting**                                                          |
| ---------------- | -------------------------------------------------- | --------------------------------------------------------------------- |
| Main idea        | Trains multiple models **independently**           | Trains models **sequentially**                                        |
| Focus            | Reduces **variance**                               | Mainly reduces **bias** (can also reduce variance)                    |
| Data usage       | Uses **random samples** of data (with replacement) | Uses the **same data**, but gives more weight to misclassified points |
| Model dependency | Models are **not dependent** on each other         | Each model **depends on the previous one**                            |
| Handling errors  | All models are treated **equally**                 | Misclassified data gets **more importance**                           |
| Speed            | Can be trained **in parallel**                     | Training is **slower** (sequential)                                   |
| Example          | Random Forest                                      | AdaBoost, Gradient Boosting                                           |

**In short:**

* **Bagging** builds many independent models and combines their results by averaging or voting.
* **Boosting** builds models step by step, focusing more on correcting previous mistakes.
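The idea of "giving more weight to misclassified points" can be made concrete with one AdaBoost-style reweighting step. This is a minimal sketch for a binary problem, assuming the weighted error is strictly between 0 and 1. The helper name `adaboost_reweight` and the toy weights are illustrative only, and Gradient Boosting uses a different mechanism (fitting residuals).

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style weight update.

    weights: current sample weights (assumed to sum to 1)
    misclassified: booleans, True where the current weak learner was wrong
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # The learner's say in the final vote: larger when its error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Scale up the weights of mistakes and scale down the rest.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    # Renormalize so the weights sum to 1 again.
    total = sum(updated)
    return [w / total for w in updated], alpha


weights = [0.1] * 10
misclassified = [True, True] + [False] * 8
new_weights, alpha = adaboost_reweight(weights, misclassified)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 4) for w in new_weights])
```

With ten equally weighted points and two mistakes, the weighted error is 0.2, so each misclassified point's weight rises from 0.1 to 0.25 and each correctly classified point's weight drops to 0.0625. The next model therefore sees the two hard points as half of the total weight.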


# Question 3: What is bootstrap sampling and what role does it play in Bagging methods like Random Forest?

**Bootstrap sampling** is a resampling technique where **multiple new datasets are created by randomly selecting data points from the original dataset with replacement**.

### Role of Bootstrap Sampling in Bagging (e.g., Random Forest)

In Bagging methods like **Random Forest**, bootstrap sampling:

* Creates **different training datasets** for each model
* Helps models learn **slightly different patterns**
* Reduces **overfitting and variance**
* Improves **overall model stability and accuracy**

**In simple words:**
Bootstrap sampling allows each tree in Random Forest to train on a **different version of the data**, making the combined model stronger and more reliable.
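A bootstrap sample is easy to draw by hand. The sketch below uses only the standard library and row indices instead of real data; the sample size of 10,000 is an arbitrary choice for illustration.

```python
import random

rng = random.Random(42)
n = 10_000
indices = list(range(n))

# Draw n row indices with replacement: this is one bootstrap sample.
bootstrap = rng.choices(indices, k=n)

# Some rows appear several times, others are never drawn.
in_bag = set(bootstrap)
unique_fraction = len(in_bag) / n

print(f"sample size: {len(bootstrap)}")
print(f"distinct original rows in the sample: {unique_fraction:.3f}")
print(f"rows never drawn: {1 - unique_fraction:.3f}")
```

For large `n`, the fraction of distinct rows approaches 1 - 1/e, about 63.2%, so each tree sees roughly two thirds of the original rows, with duplicates filling the rest of its sample.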


# Question 4: What are Out-of-Bag (OOB) samples and how is OOB score used to evaluate ensemble models?

**Out-of-Bag (OOB) samples** are the data points that **are not selected during bootstrap sampling** when training each model in a bagging-based ensemble.

### How OOB Score Is Used

In ensemble models like **Random Forest**:

* Each model is trained on its own bootstrap sample
* The data not used for training that model (OOB samples) act as **test data**
* Predictions on OOB samples are aggregated
* The **OOB score** summarizes performance on these predictions (in scikit-learn, accuracy for classification and R² for regression)

### Why OOB Score Is Useful

* No need for a **separate validation dataset**
* Provides a **reliable estimate** of model performance
* Saves **time and data**

**In short:**
OOB samples work like a built-in test set to evaluate ensemble models efficiently.
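In scikit-learn, the OOB estimate is requested with `oob_score=True` and read from the `oob_score_` attribute after fitting. The sketch below assumes the Breast Cancer dataset and 200 trees purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each row using only the trees whose bootstrap
# sample did not contain that row.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

print("OOB accuracy:", round(rf.oob_score_, 4))
```

Note that the model is fitted on all rows here: no separate train/test split is needed to obtain this estimate, although a held-out test set is still useful for a final check.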


# Question 5: Compare feature importance analysis in a single Decision Tree vs.  Random Forest.

### Feature Importance: **Decision Tree vs. Random Forest**

| Aspect                       | **Single Decision Tree**                                 | **Random Forest**                                |
| ---------------------------- | -------------------------------------------------------- | ------------------------------------------------ |
| How importance is calculated | Based on **impurity reduction** at each split            | **Averaged** impurity reduction across all trees |
| Stability                    | **Unstable** (small data change → big importance change) | **More stable and reliable**                     |
| Overfitting effect           | Can give **misleading importance** due to overfitting    | Reduces overfitting by using many trees          |
| Feature bias                 | Can favor features with more split points                | Bias is **reduced** but not removed by averaging |
| Overall reliability          | **Less reliable**                                        | **More reliable and robust**                     |

### Summary

* A **single Decision Tree** shows feature importance based on one model, so results can be noisy.
* A **Random Forest** combines feature importance from many trees, giving a **more accurate and trustworthy** importance ranking.
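The comparison can be reproduced with a short sketch. It assumes the Breast Cancer dataset and default hyperparameters; the helper `top_features` is a hypothetical name introduced only for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)


def top_features(importances, names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


tree_top = top_features(tree.feature_importances_, data.feature_names)
forest_top = top_features(forest.feature_importances_, data.feature_names)

# Features a model never split on receive an importance of exactly zero.
unused_by_tree = int((tree.feature_importances_ == 0).sum())
unused_by_forest = int((forest.feature_importances_ == 0).sum())

print("Tree top 5:  ", [(name, round(score, 3)) for name, score in tree_top])
print("Forest top 5:", [(name, round(score, 3)) for name, score in forest_top])
print("Zero-importance features (tree, forest):", unused_by_tree, unused_by_forest)
```

A single tree typically concentrates its importance on the few features it happened to split on, while the forest spreads importance across correlated features. Changing `random_state` and rerunning is a quick way to see how much each ranking moves.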


# Question 6: Write a Python program to:
* Load the Breast Cancer dataset using sklearn.datasets.load_breast_cancer()
* Train a Random Forest Classifier
* Print the top 5 most important features based on feature importance scores.

(Include your Python code and output in the code box below.)

In [1]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importance scores
importances = rf.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

# Sort features by importance (descending)
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False
)

# Print top 5 most important features
print("Top 5 Most Important Features:")
print(feature_importance_df.head(5))


Top 5 Most Important Features:
                 Feature  Importance
23            worst area    0.139357
27  worst concave points    0.132225
7    mean concave points    0.107046
20          worst radius    0.082848
22       worst perimeter    0.080850


# Question 7: Write a Python program to:
* Train a Bagging Classifier using Decision Trees on the Iris dataset
* Evaluate its accuracy and compare with a single Decision Tree

(Include your Python code and output in the code box below.)

In [4]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Bagging Classifier with Decision Tree base learners
# (the 'estimator' argument is named 'base_estimator' in scikit-learn < 1.2)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Print results
print("Decision Tree Accuracy:", dt_accuracy)
print("Bagging Classifier Accuracy:", bagging_accuracy)



Decision Tree Accuracy: 1.0
Bagging Classifier Accuracy: 1.0
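Both models reach 1.0 on this single split of 45 test rows, so the split cannot separate them. A cross-validated comparison is one possible way to get a less split-dependent picture. The sketch below assumes 5-fold cross-validation and is not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # The base learner is passed positionally, so the call does not depend
    # on the estimator/base_estimator keyword name.
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the full dataset.
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

On a dataset as small and clean as Iris the two models may still score about the same; the benefit of bagging tends to show up on noisier data where a single tree overfits.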


# Question 8: Write a Python program to:
* Train a Random Forest Classifier
* Tune hyperparameters max_depth and n_estimators using GridSearchCV
* Print the best parameters and final accuracy

(Include your Python code and output in the code box below.)

The program below trains a Random Forest on the Breast Cancer dataset, tunes max_depth and n_estimators using GridSearchCV with 5-fold cross-validation, and prints the best parameters and the accuracy on the held-out test set.

In [5]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Random Forest model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train GridSearch
grid_search.fit(X_train, y_train)

# Best model
best_model = grid_search.best_estimator_

# Predictions and accuracy
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print("Final Accuracy:", final_accuracy)


Best Parameters: {'max_depth': None, 'n_estimators': 200}
Final Accuracy: 0.9707602339181286
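The best parameters alone do not show how close the other candidates were. A fitted `GridSearchCV` exposes every candidate's cross-validated scores in `cv_results_`, as sketched below. The grid here is deliberately smaller than the one above and uses 3 folds so that the example runs quickly; those values are assumptions made for the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Deliberately small grid so the sketch runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# One entry per parameter combination, with mean and spread across folds.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, f"mean={mean:.4f}", f"std={std:.4f}")

print("Best CV accuracy:", round(search.best_score_, 4))
```

When several combinations score within one standard deviation of each other, the smaller or shallower forest is often a reasonable choice.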


# Question 9: Write a Python program to:
* Train a Bagging Regressor and a Random Forest Regressor on the California Housing dataset
* Compare their Mean Squared Errors (MSE)

(Include your Python code and output in the code box below.)

In [6]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Bagging Regressor
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42
)
bagging_reg.fit(X_train, y_train)
bagging_pred = bagging_reg.predict(X_test)
bagging_mse = mean_squared_error(y_test, bagging_pred)

# Train Random Forest Regressor
rf_reg = RandomForestRegressor(
    n_estimators=100,
    random_state=42
)
rf_reg.fit(X_train, y_train)
rf_pred = rf_reg.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print results
print("Bagging Regressor MSE:", bagging_mse)
print("Random Forest Regressor MSE:", rf_mse)


Bagging Regressor MSE: 0.25787382250585034
Random Forest Regressor MSE: 0.25650512920799395
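The two MSE values are nearly identical. One likely reason is that, in recent scikit-learn versions, `RandomForestRegressor` considers all features at every split by default (`max_features=1.0`), which makes it behave much like bagged trees. The sketch below varies `max_features` to show where the two methods actually differ. It uses the bundled Diabetes dataset as a stand-in so that it runs without downloading data; the value 0.33 is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Bundled stand-in dataset (assumption: used only to avoid a download).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Bagging (all features per split)": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=100, random_state=42
    ),
    "Random Forest (max_features=1.0)": RandomForestRegressor(
        n_estimators=100, max_features=1.0, random_state=42
    ),
    # Each split considers only about a third of the features.
    "Random Forest (max_features=0.33)": RandomForestRegressor(
        n_estimators=100, max_features=0.33, random_state=42
    ),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE={mse[name]:.2f}")
```

All three models use the same number of trees here, which keeps the comparison about the feature subsampling rather than the ensemble size.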


# Question 10: You are working as a data scientist at a financial institution to predict loan default. You have access to customer demographic and transaction history data. You decide to use ensemble techniques to increase model performance. Explain your step-by-step approach to:
* Choose between Bagging or Boosting
* Handle overfitting
* Select base models
* Evaluate performance using cross-validation
* Justify how ensemble learning improves decision-making in this real-world context.

In [7]:
# Import required libraries
from sklearn.datasets import load_breast_cancer   # proxy for loan default data
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, roc_auc_score

# Load dataset (simulating loan default data)
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Boosting model
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_model.fit(X_train, y_train)

# Predictions
y_pred = gb_model.predict(X_test)
y_prob = gb_model.predict_proba(X_test)[:, 1]

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

# Cross-validation on the full dataset (the model is refitted on each fold)
cv_scores = cross_val_score(gb_model, X, y, cv=5, scoring='accuracy')

# Print results
print("Test Accuracy:", accuracy)
print("ROC-AUC Score:", roc_auc)
print("Cross-Validation Accuracy Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())


Test Accuracy: 0.9590643274853801
ROC-AUC Score: 0.9951499118165785
Cross-Validation Accuracy Scores: [0.92982456 0.94736842 0.97368421 0.98245614 0.98230088]
Mean CV Accuracy: 0.9631268436578171
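The choice between Bagging and Boosting can also be made empirically by evaluating one candidate of each family under the same cross-validation scheme. The sketch below is one possible way to do this, not a definitive recipe. It assumes a synthetic dataset with roughly 10% positives as a stand-in for loan defaults, stratified folds so every fold keeps that ratio, and ROC-AUC as the metric because accuracy can look high on imbalanced data even when defaults are missed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for loan data (about 10% positives).
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=42,
)

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Bagging (Random Forest)": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    # Shallow trees and a small learning rate limit overfitting in boosting.
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

mean_auc = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: ROC-AUC mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the two families score within the fold-to-fold spread of each other, secondary criteria such as training time, ease of tuning, and how well the predicted probabilities are calibrated can reasonably decide the choice.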
