# Theory questions


**1. Can we use Bagging for regression problems?**  
Yes, Bagging can be used for regression problems. In this case, it combines the predictions of multiple regression models (like Decision Tree Regressors) by averaging their outputs to reduce variance and improve accuracy.

**2. What is the difference between multiple model training and single model training?**  
Single model training involves building one predictive model, while multiple model training involves training several models independently (e.g., in Bagging) or sequentially (e.g., in Boosting) to form a stronger overall predictor.

**3. Explain the concept of feature randomness in Random Forest.**  
In Random Forest, feature randomness means that each decision tree considers only a random subset of the features at every node split, rather than searching all features for the best split. This decorrelates the trees and improves ensemble diversity.
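
As a rough illustration of per-split feature sampling, the sketch below draws a fresh candidate subset for each of three hypothetical splits. The feature count (30) and the square-root subset size are assumptions chosen for the example; in scikit-learn the subset size is controlled by the `max_features` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 30               # assumed feature count for illustration
k = int(np.sqrt(n_features))  # square-root heuristic, often used for classification

# Each split draws its own candidate subset, without replacement.
for split in range(3):
    candidates = rng.choice(n_features, size=k, replace=False)
    print(f"split {split}: candidate features {sorted(candidates.tolist())}")
```

Only the sampled candidates are searched for the best threshold at that split, so two trees grown on similar data can still end up with different structures.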

**4. What is OOB (Out-of-Bag) Score?**  
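The OOB score is an internal validation score available in Bagging-based models. Each model is trained on a bootstrap sample, so about 37% of the training rows are left out of that sample (out-of-bag). Each row is scored using only the models that never saw it during training, which gives an estimate of generalization performance without setting aside a separate validation set.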
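
The size of the out-of-bag set can be checked numerically. The probability that a given row is never drawn in a bootstrap sample of size n is (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sketch below simulates one bootstrap draw; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: n row indices drawn with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are out-of-bag for this model.
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

print(f"Simulated OOB fraction: {oob_fraction:.3f}")
print(f"Theoretical limit 1/e:  {np.exp(-1):.3f}")
```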

**5. How can you measure the importance of features in a Random Forest model?**  
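Feature importance can be measured with Mean Decrease in Impurity (MDI, also called Gini importance), which sums how much each feature reduces impurity across all splits in all trees. An alternative is permutation importance, which measures how much a chosen score drops when a feature's values are randomly shuffled.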
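
A minimal sketch of both approaches on the breast cancer dataset, assuming scikit-learn is installed. MDI is read from `feature_importances_`, and permutation importance comes from `sklearn.inspection.permutation_importance`, computed here on held-out data. The split, seed, and `n_repeats` value are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based importance (MDI), derived from the training data.
mdi = rf.feature_importances_

# Permutation importance: mean drop in accuracy when a feature is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)

top_mdi = data.feature_names[mdi.argsort()[::-1][:5]]
top_perm = data.feature_names[perm.importances_mean.argsort()[::-1][:5]]
print("Top 5 by MDI:        ", list(top_mdi))
print("Top 5 by permutation:", list(top_perm))
```

The two rankings need not agree: MDI is computed from training-set splits, while permutation importance reflects the effect on held-out predictions.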

**6. Explain the working principle of a Bagging Classifier.**  
A Bagging Classifier creates multiple models by training them on different bootstrap samples of the dataset. Their outputs are combined using majority voting for classification tasks to produce the final result.
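
The procedure can be sketched by hand with plain decision trees. This is an illustrative re-implementation, not scikit-learn's internal code: the ensemble size (25) and the seeds are assumptions, and the vote shown here works only for binary 0/1 labels. Note that scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides `predict_proba`, and falls back to voting otherwise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_estimators = 25  # odd, so a binary vote cannot tie

trees = []
for i in range(n_estimators):
    # Bootstrap sample: same size as the training set, drawn with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# votes has shape (n_estimators, n_test_samples).
votes = np.array([t.predict(X_test) for t in trees])

# Majority vote for binary 0/1 labels: predict 1 if more than half the trees say 1.
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (y_pred == y_test).mean()
print("Hand-rolled bagging accuracy:", accuracy)
```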

**7. How do you evaluate a Bagging Classifier’s performance?**  
You can evaluate it using accuracy, precision, recall, F1-score, confusion matrix, ROC-AUC, or OOB score, depending on the nature of the classification problem.

**8. How does a Bagging Regressor work?**  
A Bagging Regressor trains multiple regressors on bootstrap samples (random draws with replacement) of the training data and combines their predictions by averaging the outputs.

**9. What is the main advantage of ensemble techniques?**  
The main advantage is improved performance (higher accuracy and robustness), achieved by reducing variance (Bagging), bias (Boosting), or both.

**10. What is the main challenge of ensemble methods?**  
The main challenges are increased computational cost, reduced interpretability, and the risk of overfitting if not properly managed.

**11. Explain the key idea behind ensemble techniques.**  
Ensemble techniques combine the outputs of multiple models to produce a more accurate and stable prediction than any individual model.

**12. What is a Random Forest Classifier?**  
A Random Forest Classifier is an ensemble of decision trees, where each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. The final class is decided by majority voting across the trees.

**13. What are the main types of ensemble techniques?**  
The main types are Bagging (Bootstrap Aggregating), Boosting, Stacking, and Voting.
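
Bagging and Stacking are exercised in the practical section, so the sketch below covers Voting only. It assumes scikit-learn's `VotingClassifier` with soft voting (averaging predicted probabilities). The choice of base models and the scaling pipeline for logistic regression are illustrative, not prescribed by the answer above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voter = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(random_state=42)),
        # Scaling helps logistic regression converge on this dataset.
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average class probabilities instead of counting hard votes
)
voter.fit(X_train, y_train)
voting_accuracy = voter.score(X_test, y_test)
print("Voting accuracy:", voting_accuracy)
```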

**14. What is ensemble learning in machine learning?**  
Ensemble learning is a method that combines multiple models to improve predictive performance and generalization compared to individual models.

**15. When should we avoid using ensemble methods?**  
Ensemble methods should be avoided when interpretability is crucial, when computational resources are limited, or when a single model performs sufficiently well.

**16. How does Bagging help in reducing overfitting?**  
Bagging reduces overfitting by training multiple models on different random samples, thereby reducing the variance and making the model more generalizable.
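
The variance argument can be made concrete under a simplifying assumption: the $B$ base models $f_1, \dots, f_B$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

The second term shrinks as $B$ grows, while the first term remains. Averaging more models therefore helps only up to the limit set by how correlated they are, which is why methods that decorrelate the models (such as random feature selection) can reduce variance further.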

**17. Why is Random Forest better than a single Decision Tree?**  
Random Forest is better because it reduces overfitting and variance by averaging multiple decision trees trained on different data and feature subsets.

**18. What is the role of bootstrap sampling in Bagging?**  
Bootstrap sampling creates diverse training subsets by sampling with replacement. This diversity among models helps in reducing overfitting and variance.

**19. What are some real-world applications of ensemble techniques?**  
Applications include spam detection, fraud detection, image classification, medical diagnosis, stock price prediction, and recommendation systems.

**20. What is the difference between Bagging and Boosting?**  
Bagging trains models independently (and so can run in parallel) on bootstrap samples, focusing on variance reduction. Boosting trains models sequentially, where each model concentrates on the errors of the previous ones, aiming primarily to reduce bias.
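
A small side-by-side sketch, assuming scikit-learn. `GradientBoostingClassifier` is used here as one representative boosting method; the answer above does not name a specific algorithm. The estimator counts and the use of depth-1 trees for boosting are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent, fully grown trees (the default base estimator).
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: shallow trees added one after another, each fitted to the current errors.
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=42)

bag_scores = cross_val_score(bagging, X, y, cv=5)
boost_scores = cross_val_score(boosting, X, y, cv=5)

print("Bagging mean CV accuracy: ", bag_scores.mean())
print("Boosting mean CV accuracy:", boost_scores.mean())
```

Bagging starts from low-bias, high-variance learners and averages them. Boosting starts from high-bias weak learners and combines them additively.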

---


# Practical questions

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor, StackingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score, classification_report, confusion_matrix, precision_recall_curve, auc
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dataset for classification
X_c, y_c = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y_c, test_size=0.2, random_state=42)

# Dataset for regression
X_r, y_r = load_diabetes(return_X_y=True)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size=0.2, random_state=42)


In [None]:
# Train a Bagging Classifier using Decision Trees
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4
model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
model.fit(X_train_c, y_train_c)
print("Accuracy:", model.score(X_test_c, y_test_c))


In [None]:
# Train a Bagging Regressor and evaluate using MSE
model = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=10, random_state=42)
model.fit(X_train_r, y_train_r)
preds = model.predict(X_test_r)
print("MSE:", mean_squared_error(y_test_r, preds))


In [None]:
# Random Forest Classifier on Breast Cancer dataset
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
importances = rf.feature_importances_
print("Feature Importances:", importances)


In [None]:
# Random Forest Regressor vs Decision Tree Regressor
rf = RandomForestRegressor(random_state=42)
tree = DecisionTreeRegressor(random_state=42)
rf.fit(X_train_r, y_train_r)
tree.fit(X_train_r, y_train_r)
print("RF MSE:", mean_squared_error(y_test_r, rf.predict(X_test_r)))
print("DT MSE:", mean_squared_error(y_test_r, tree.predict(X_test_r)))


In [None]:
# Compute OOB Score
rf = RandomForestClassifier(oob_score=True, random_state=42)
rf.fit(X_train_c, y_train_c)
print("OOB Score:", rf.oob_score_)


In [None]:
# Bagging Classifier using SVM
svm_bag = BaggingClassifier(estimator=SVC(probability=True), n_estimators=10, random_state=42)
svm_bag.fit(X_train_c, y_train_c)
print("Accuracy (SVM Bagging):", svm_bag.score(X_test_c, y_test_c))


In [None]:
# RF with different number of trees
for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train_c, y_train_c)
    print(f"Trees: {n}, Accuracy: {rf.score(X_test_c, y_test_c)}")


In [None]:
# Bagging with Logistic Regression
log_bag = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=10, random_state=42)
log_bag.fit(X_train_c, y_train_c)
probs = log_bag.predict_proba(X_test_c)[:, 1]
print("AUC Score:", roc_auc_score(y_test_c, probs))


In [None]:
# RF Regressor feature importance
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_r, y_train_r)
print("Feature Importances:", rf.feature_importances_)


In [None]:
# Compare accuracy: Bagging vs Random Forest
bag = BaggingClassifier(n_estimators=50, random_state=42)
rf = RandomForestClassifier(n_estimators=50, random_state=42)
bag.fit(X_train_c, y_train_c)
rf.fit(X_train_c, y_train_c)
print("Bagging Accuracy:", bag.score(X_test_c, y_test_c))
print("RF Accuracy:", rf.score(X_test_c, y_test_c))


In [None]:
# GridSearchCV on RF Classifier
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train_c, y_train_c)
print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)


In [None]:
# Bagging Regressor with different base estimators
for base in [DecisionTreeRegressor(), KNeighborsRegressor()]:
    model = BaggingRegressor(estimator=base, n_estimators=10, random_state=42)
    model.fit(X_train_r, y_train_r)
    print(type(base).__name__, "MSE:", mean_squared_error(y_test_r, model.predict(X_test_r)))


In [None]:
# Analyze misclassified samples in RF
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
y_pred = rf.predict(X_test_c)
misclassified = np.where(y_pred != y_test_c)[0]
print("Number of misclassified samples:", len(misclassified))


In [None]:
# Compare Bagging vs Single Decision Tree
bag = BaggingClassifier(random_state=42)
tree = DecisionTreeClassifier(random_state=42)  # fixed seed so the comparison is reproducible
bag.fit(X_train_c, y_train_c)
tree.fit(X_train_c, y_train_c)
print("Bagging Accuracy:", bag.score(X_test_c, y_test_c))
print("Decision Tree Accuracy:", tree.score(X_test_c, y_test_c))


In [None]:
# Confusion Matrix of RF
from sklearn.metrics import ConfusionMatrixDisplay
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
ConfusionMatrixDisplay.from_estimator(rf, X_test_c, y_test_c)
plt.title("Random Forest Confusion Matrix")
plt.show()


In [None]:
# Stacking Classifier: Decision Tree and SVM as base learners, Logistic Regression as the final estimator
estimators = [('dt', DecisionTreeClassifier()), ('svm', SVC(probability=True))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train_c, y_train_c)
print("Stacking Accuracy:", stack.score(X_test_c, y_test_c))


In [None]:
# Top 5 important features in RF
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
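# Label each importance with its feature name so the ranking is readable
feature_names = load_breast_cancer().feature_names
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)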
print("Top 5 important features:\n", importances.head())


In [None]:
# Bagging Classifier Precision, Recall, F1
model = BaggingClassifier(random_state=42)
model.fit(X_train_c, y_train_c)
y_pred = model.predict(X_test_c)
print(classification_report(y_test_c, y_pred))


In [None]:
# Effect of max_depth in RF
for d in [2, 5, 10, None]:
    rf = RandomForestClassifier(max_depth=d, random_state=42)
    rf.fit(X_train_c, y_train_c)
    print(f"Max Depth: {d}, Accuracy: {rf.score(X_test_c, y_test_c)}")


In [None]:
# ROC-AUC Score of RF
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
probs = rf.predict_proba(X_test_c)[:, 1]
print("ROC AUC:", roc_auc_score(y_test_c, probs))


In [None]:
# Evaluate Bagging with cross-validation
bag = BaggingClassifier(random_state=42)
scores = cross_val_score(bag, X_c, y_c, cv=5)
print("Cross-Val Scores:", scores)
print("Mean Accuracy:", scores.mean())


In [None]:
# Precision-Recall Curve for RF
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_c, y_train_c)
probs = rf.predict_proba(X_test_c)[:, 1]
precision, recall, _ = precision_recall_curve(y_test_c, probs)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid()
plt.show()


In [None]:
# Stacking Classifier with RF and Logistic Regression
estimators = [('rf', RandomForestClassifier(random_state=42)), ('lr', LogisticRegression(max_iter=1000))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train_c, y_train_c)
print("Stacked Accuracy:", stack.score(X_test_c, y_test_c))


In [None]:
# Bagging Regressor with different bootstrap sizes
for bs in [0.5, 0.7, 1.0]:
    model = BaggingRegressor(max_samples=bs, random_state=42)
    model.fit(X_train_r, y_train_r)
    mse = mean_squared_error(y_test_r, model.predict(X_test_r))
    print(f"Bootstrap: {bs}, MSE: {mse}")
