**Assignment 1 - Theory**

1. **Can we use Bagging for regression problems?**  
   Yes, Bagging can be used for regression problems. A common example is the **Bagging Regressor**, which aggregates predictions from multiple base regressors (e.g., Decision Trees) and averages their outputs to improve stability and accuracy.  

2. **What is the difference between multiple model training and single model training?**  
   - **Single Model Training**: A single model learns from the data and makes predictions independently. It can be prone to overfitting or underfitting.  
   - **Multiple Model Training (Ensemble Learning)**: Multiple models are trained on the data and their predictions are combined (e.g., averaging, voting) to improve performance, robustness, and generalization.  

3. **Explain the concept of feature randomness in Random Forest.**  
   In Random Forest, feature randomness refers to selecting a **random subset of features** at each split in a decision tree. This introduces diversity in the trees and helps reduce overfitting while improving generalization.  
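
As a rough illustration of the per-split feature sampling described above, the sketch below draws a fresh candidate subset for each split using only the standard library. The helper name `candidate_features` and the square-root default are assumptions made for this example; in scikit-learn the corresponding knob is the `max_features` parameter of `RandomForestClassifier`.

```python
import math
import random


def candidate_features(n_features, rng, max_features=None):
    """Pick the random feature subset considered at one tree split (hypothetical helper)."""
    if max_features is None:
        # Assumed default for this sketch: square root of the feature count
        max_features = max(1, int(math.sqrt(n_features)))
    # Sample without replacement, so each feature appears at most once
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(42)
n_features = 30  # e.g. the Breast Cancer dataset has 30 features

# Every split draws a fresh subset, so different splits see different features
for split_id in range(3):
    print(f"split {split_id}: candidates = {candidate_features(n_features, rng)}")
```

Because the best split is searched only among the sampled candidates, a single dominant feature cannot be chosen at every split of every tree, which is what decorrelates the trees.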

4. **What is OOB (Out-of-Bag) Score?**  
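   OOB Score is a performance estimate calculated using the samples that were **not included in the bootstrap sample** for training each tree in a Bagging model. Each sample is scored only by the trees that never saw it during training, so the score acts like a built-in validation estimate without a separate hold-out set, though it can be slightly pessimistic because only a subset of the trees votes on each sample.  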
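
A minimal sketch of obtaining an OOB estimate with scikit-learn, assuming the Breast Cancer dataset and 200 trees as illustrative choices. It relies on the `oob_score=True` constructor argument and the fitted `oob_score_` attribute of `RandomForestClassifier`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True requires bootstrap sampling (the default for random forests)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

# Accuracy estimated only from trees that did not train on each sample
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

Note that no train/test split is needed here: the whole dataset is used for fitting, and the left-out samples of each tree play the role of validation data.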

5. **How can you measure the importance of features in a Random Forest model?**  
   Feature importance can be measured using:  
   1. **Mean Decrease in Impurity (Gini Importance)** – How much a feature reduces impurity across all splits.  
   2. **Permutation Importance** – The decrease in model performance when a feature’s values are randomly shuffled.  
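
Both measures listed above can be compared side by side. The following is a sketch under assumed settings (Breast Cancer data, an 80/20 split, 10 shuffles per feature); it uses the fitted `feature_importances_` attribute for impurity-based importance and `sklearn.inspection.permutation_importance` for the shuffle-based one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# 1. Mean Decrease in Impurity: derived from the training data during fitting
mdi = forest.feature_importances_

# 2. Permutation importance: drop in held-out score when one feature is shuffled
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)

top_mdi = mdi.argsort()[::-1][:5]
top_perm = perm.importances_mean.argsort()[::-1][:5]
print("Top 5 by MDI:        ", [data.feature_names[i] for i in top_mdi])
print("Top 5 by permutation:", [data.feature_names[i] for i in top_perm])
```

The two rankings need not agree: impurity-based importance is computed on training data, while permutation importance here is measured on held-out data and tends to split credit among correlated features.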

6. **Explain the working principle of a Bagging Classifier.**  
   A Bagging Classifier works as follows:  
   1. Create multiple subsets of the training data using **bootstrap sampling**.  
   2. Train a separate model (e.g., Decision Tree) on each subset.  
   3. Aggregate predictions using **majority voting** (for classification).  
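
The three steps above can be sketched by hand with NumPy and a decision tree as the base model. This is an illustrative reimplementation under assumed settings (25 trees, Breast Cancer data), not scikit-learn's internal code; `BaggingClassifier` packages the same idea.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []

for i in range(n_models):
    # Step 1: bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one model per bootstrap sample
    tree = DecisionTreeClassifier(random_state=i)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: majority vote over the 0/1 predictions of all models
votes = np.stack([m.predict(X_test) for m in models])  # shape: (n_models, n_test)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Manual bagging accuracy:", (y_pred == y_test).mean())
```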

7. **How do you evaluate a Bagging Classifier's performance?**  
   A Bagging Classifier's performance can be evaluated using:  
   - **Accuracy** (for classification)  
   - **Confusion Matrix, Precision, Recall, F1-score**  
   - **ROC-AUC Score** (for binary classification)  
   - **Cross-validation** to ensure robustness  

8. **How does a Bagging Regressor work?**  
   A Bagging Regressor works by:  
   1. Generating multiple bootstrap samples from the dataset.  
   2. Training individual regressors (e.g., Decision Trees) on each subset.  
   3. Averaging the predictions from all regressors to reduce variance and improve accuracy.  
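
The same procedure for regression can be written out manually. The sketch below assumes a synthetic `make_regression` dataset and 50 trees, and compares the averaged prediction with a single tree trained on the full training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
trees = []
for i in range(50):
    # Steps 1 and 2: draw a bootstrap sample and fit one regressor on it
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeRegressor(random_state=i).fit(X_train[idx], y_train[idx]))

# Step 3: average the predictions of all regressors
y_pred_bagged = np.mean([t.predict(X_test) for t in trees], axis=0)

# Baseline: one tree trained on the full training set
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
single_mse = mean_squared_error(y_test, single_tree.predict(X_test))
bagged_mse = mean_squared_error(y_test, y_pred_bagged)

print(f"Single tree MSE:  {single_mse:.1f}")
print(f"Bagged trees MSE: {bagged_mse:.1f}")
```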

9. **What is the main advantage of ensemble techniques?**  
   Ensemble techniques improve **accuracy, stability, and robustness** by combining multiple models, reducing overfitting and variance.  

10. **What is the main challenge of ensemble methods?**  
    The main challenges include:  
    - **Computational cost** (training multiple models)  
    - **Complexity** (harder to interpret than single models)  
    - **Risk of overfitting** if not tuned properly  

11. **Explain the key idea behind ensemble techniques.**  
    The key idea is to combine multiple models (often weak learners) into a **stronger model** that typically performs better than any individual model alone, provided the individual models make reasonably uncorrelated errors.  
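
A small calculation makes this concrete. The sketch below assumes an idealized setting in which every model is correct with the same probability and the models err independently of each other (real ensembles only approximate this), and computes the exact accuracy of a majority vote from the binomial distribution. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models, p):
    """Probability that a strict majority of n_models voters is right.

    Assumes each model is correct with probability p, independently of the rest.
    """
    needed = n_models // 2 + 1  # smallest strict majority
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:>3} models at 60% each -> ensemble accuracy {acc:.3f}")
```

The ensemble accuracy grows with the number of voters only because the errors are assumed independent; strongly correlated models would gain far less from voting.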

12. **What is a Random Forest Classifier?**  
    A **Random Forest Classifier** is an ensemble method that builds multiple Decision Trees and aggregates their predictions using **majority voting**. It introduces randomness in feature selection and bootstrap sampling to improve generalization.  

13. **What are the main types of ensemble techniques?**  
    1. **Bagging** – Reduces variance (e.g., Random Forest).  
    2. **Boosting** – Reduces bias (e.g., AdaBoost, XGBoost).  
    3. **Stacking** – Combines multiple models using a meta-learner.  
    4. **Voting/Averaging** – Combines predictions from multiple models.  

14. **What is ensemble learning in machine learning?**  
    Ensemble learning is a technique where multiple models are combined to improve overall accuracy, reduce overfitting, and enhance robustness.  

15. **When should we avoid using ensemble methods?**  
    - When interpretability is critical (e.g., medical applications).  
    - When a single simple model performs well enough.  
    - When computational resources are limited.  

16. **How does Bagging help in reducing overfitting?**  
    Bagging reduces overfitting by training multiple models on different random subsets of data and averaging their predictions, which reduces variance.  
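
The variance reduction can be quantified with a standard identity for the average of identically distributed predictors. The sketch below assumes each of the `n_models` predictors has variance `sigma2` and every pair has the same correlation `rho`; the function name is made up for this example.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of the mean of n_models equally correlated predictors.

    Var(mean) = rho * sigma2 + (1 - rho) * sigma2 / n_models
    """
    return rho * sigma2 + (1 - rho) * sigma2 / n_models


for rho in (0.0, 0.5, 1.0):
    row = ", ".join(f"B={b}: {ensemble_variance(1.0, rho, b):.3f}" for b in (1, 10, 100))
    print(f"rho={rho}: {row}")
```

Adding models shrinks only the second term, so the variance never drops below `rho * sigma2`. This is why bagging benefits from making the base models less correlated, for example through bootstrap sampling and random feature subsets.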

17. **Why is Random Forest better than a single Decision Tree?**  
    - **Less Overfitting** – Averaging multiple trees improves generalization.  
    - **Higher Accuracy** – More stable and robust predictions.  
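    - **More Reliable Feature Importance** – Importances averaged over many trees are more stable than those from a single tree. Some implementations can also handle missing values natively.  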

18. **What is the role of bootstrap sampling in Bagging?**  
    Bootstrap sampling creates diverse training subsets by randomly selecting samples **with replacement**, ensuring different trees learn different patterns and reduce overfitting.  
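
A quick standard-library sketch shows what sampling with replacement does to a training set. The dataset size of 10,000 is an arbitrary choice for illustration.

```python
import random

rng = random.Random(42)
n_samples = 10_000
indices = list(range(n_samples))

# Bootstrap sample: draw n_samples indices *with replacement*
bootstrap = rng.choices(indices, k=n_samples)

in_bag = set(bootstrap)
out_of_bag = set(indices) - in_bag

unique_fraction = len(in_bag) / n_samples
print(f"Unique samples in the bootstrap: {unique_fraction:.1%}")
print(f"Out-of-bag samples:              {len(out_of_bag) / n_samples:.1%}")
# For large n the unique fraction approaches 1 - 1/e (about 63.2%)
```

Each bootstrap sample therefore omits roughly a third of the data and repeats some of the rest, which is what makes the trained models differ from one another.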

19. **What are some real-world applications of ensemble techniques?**  
    - **Finance** (fraud detection, risk assessment)  
    - **Healthcare** (disease prediction, medical diagnosis)  
    - **E-commerce** (recommendation systems)  
    - **Image Recognition** (object detection, facial recognition)  

20. **What is the difference between Bagging and Boosting?**  

| Feature  | Bagging | Boosting |  
|----------|---------|---------|  
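| Goal  | Reduce variance  | Primarily reduce bias (can also reduce variance) |  
| How it Works | Trains multiple models independently and averages (or votes on) their results | Trains models sequentially, each one focusing on the errors of its predecessors (e.g., by giving more weight to misclassified samples) |  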
| Overfitting | Less prone | More prone if not tuned properly |  
| Example Algorithms | Random Forest, Bagging Classifier | AdaBoost, Gradient Boosting, XGBoost |  
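
To illustrate the sequential reweighting in the Boosting column, here is a standard-library sketch of a single weight update in the style of discrete AdaBoost. The function name and the toy inputs are hypothetical, and other boosting methods such as Gradient Boosting fit residuals instead of reweighting samples.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete-AdaBoost weight update for a single boosting round.

    weights: current sample weights (positive numbers)
    misclassified: booleans, True where the current weak learner was wrong
    """
    total = sum(weights)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    if not 0 < err < 1:
        raise ValueError("weighted error must be strictly between 0 and 1")
    alpha = math.log((1 - err) / err)  # this learner's say in the final vote
    # Scale up only the misclassified samples, then renormalize to sum to 1
    updated = [w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, misclassified)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha


weights = [0.2] * 5
misclassified = [False, False, True, False, False]  # hypothetical round: one error
new_weights, alpha = adaboost_reweight(weights, misclassified)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update the misclassified samples carry half of the total weight, so the next weak learner is pushed to get them right. Bagging, by contrast, never changes sample weights between models.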





**Practical Questions**


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import (
    BaggingClassifier, BaggingRegressor, RandomForestClassifier,
    RandomForestRegressor, StackingClassifier,
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import (
    accuracy_score, mean_squared_error, confusion_matrix, roc_auc_score,
    precision_recall_curve, classification_report,
    precision_score, recall_score, f1_score,
)
from sklearn.datasets import load_breast_cancer, make_regression

# Load Breast Cancer dataset for classification
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Generate a synthetic dataset for regression
X_reg, y_reg = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# 31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, None]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

# 32. Train a Bagging Regressor with different numbers of base estimators and compare performance
# Note: `estimator` replaces `base_estimator` (renamed in scikit-learn 1.2, old name removed in 1.4)
for n in [10, 50, 100]:
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n, random_state=42)
    bagging_reg.fit(X_train_reg, y_train_reg)
    y_pred_reg = bagging_reg.predict(X_test_reg)
    print(f"Bagging Regressor ({n} estimators) MSE:", mean_squared_error(y_test_reg, y_pred_reg))

# 33. Train a Random Forest Classifier and analyze misclassified samples
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)  # reused in later questions
rf_clf.fit(X_train, y_train)
misclassified = X_test[y_test != rf_clf.predict(X_test)]
print("Misclassified Samples Count:", len(misclassified))

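# 34. Compare Bagging Classifier with Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)

# Bagged decision trees (reused in questions 38 and 42); n_estimators=50 is an illustrative choice
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging_clf.fit(X_train, y_train)
y_pred_bag = bagging_clf.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred_bag))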

# 35. Train a Random Forest Classifier and visualize confusion matrix
y_pred_rf = rf_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# 36. Train a Stacking Classifier and compare accuracy
# (max_iter raised: unscaled features can exceed LogisticRegression's default iteration limit)
stacking_clf = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=42)),
                ('svm', SVC(probability=True, random_state=42)),
                ('lr', LogisticRegression(max_iter=10000))],
    final_estimator=LogisticRegression(),
)
stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)
print("Stacking Classifier Accuracy:", accuracy_score(y_test, y_pred_stack))

# 37. Print the top 5 most important features from Random Forest
feature_importances = pd.Series(rf_clf.feature_importances_, index=data.feature_names).nlargest(5)
print("Top 5 Features:", feature_importances)

# 38. Evaluate Bagging Classifier using Precision, Recall, and F1-score
print("Classification Report for Bagging Classifier:")
print(classification_report(y_test, y_pred_bag))

# 39. Analyze effect of max_depth on accuracy
for depth in [5, 10, None]:
    rf_depth = RandomForestClassifier(max_depth=depth, random_state=42)
    rf_depth.fit(X_train, y_train)
    print(f"Max Depth {depth} Accuracy:", accuracy_score(y_test, rf_depth.predict(X_test)))

# 40. Compare Bagging Regressor with different base estimators
# Note: `estimator` replaces `base_estimator` (renamed in scikit-learn 1.2, old name removed in 1.4)
dt_bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=42)
knn_bagging = BaggingRegressor(estimator=KNeighborsRegressor(), random_state=42)
dt_bagging.fit(X_train_reg, y_train_reg)
knn_bagging.fit(X_train_reg, y_train_reg)
print("DT Bagging MSE:", mean_squared_error(y_test_reg, dt_bagging.predict(X_test_reg)))
print("KNN Bagging MSE:", mean_squared_error(y_test_reg, knn_bagging.predict(X_test_reg)))

# 41. Evaluate Random Forest Classifier using ROC-AUC Score
y_pred_proba_rf = rf_clf.predict_proba(X_test)[:, 1]
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_proba_rf))

# 42. Evaluate Bagging Classifier using Cross-Validation
cv_scores = cross_val_score(bagging_clf, X, y, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", cv_scores)

# 43. Plot Precision-Recall curve for Random Forest
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba_rf)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()

# 44. Train a Stacking Classifier with Random Forest and Logistic Regression
# (max_iter raised: unscaled features can exceed LogisticRegression's default iteration limit)
stacking_clf2 = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('lr', LogisticRegression(max_iter=10000))],
    final_estimator=LogisticRegression(),
)
stacking_clf2.fit(X_train, y_train)
y_pred_stack2 = stacking_clf2.predict(X_test)
print("Stacking Classifier (RF+LR) Accuracy:", accuracy_score(y_test, y_pred_stack2))

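# 45. Compare Bagging Regressor with and without bootstrap sampling
# With bootstrap=False and the default max_samples=1.0, every tree is trained on the full training set
# Note: `estimator` replaces `base_estimator` (renamed in scikit-learn 1.2, old name removed in 1.4)
for bootstrap_val in [True, False]:
    bagging_reg_bs = BaggingRegressor(estimator=DecisionTreeRegressor(), bootstrap=bootstrap_val, random_state=42)
    bagging_reg_bs.fit(X_train_reg, y_train_reg)
    print(f"Bootstrap {bootstrap_val} MSE:", mean_squared_error(y_test_reg, bagging_reg_bs.predict(X_test_reg)))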
