<a href="https://colab.research.google.com/github/tgarg535/Machine-Learning/blob/main/Ensemble_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Theoretical Questions**

### **1. Can we use Bagging for regression problems?**

**Yes.** While Bagging (Bootstrap Aggregating) is often discussed in classification, it is highly effective for regression. In a Bagging Regressor, several models (usually Decision Trees) are trained on different subsets of the data, and the final prediction is the **average** of the predictions from all individual models.

### **2. Difference between multiple model training and single model training?**

* **Single Model:** A single algorithm (like one Decision Tree) learns from the entire dataset. It is prone to **high variance** (overfitting) or **high bias** (underfitting).
* **Multiple Model (Ensemble):** Multiple models are trained and their results combined. This approach captures different patterns in the data and typically results in a more **robust and generalized** model.

### **3. Explain the concept of feature randomness in Random Forest.**

In a standard Decision Tree, the algorithm looks at every available feature to find the best split. In **Random Forest**, for each split in each tree, the algorithm only considers a **random subset of features**. This ensures that the trees in the forest are de-correlated; even if one feature is a very strong predictor, other features will still be forced into the models of other trees.
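As a minimal sketch of the per-split sampling described above (not scikit-learn's internal implementation), the snippet below draws a fresh random subset of feature indices for each split. The feature count of 16, the helper name `candidate_features`, and the square-root rule for the subset size are illustrative assumptions; in scikit-learn the subset size is controlled by the `max_features` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 16                            # hypothetical number of input features
max_features = int(np.sqrt(n_features))    # square-root rule (assumption for this sketch)

def candidate_features(rng, n_features, max_features):
    """Draw the feature indices that a single split is allowed to consider."""
    return np.sort(rng.choice(n_features, size=max_features, replace=False))

# Every split gets its own independent draw, so strong features are sometimes excluded
splits = [candidate_features(rng, n_features, max_features) for _ in range(3)]
for i, feats in enumerate(splits):
    print(f"Split {i}: candidate features {feats}")
```

Because a dominant feature is absent from many of these draws, different trees are pushed toward different split structures.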

### **4. What is OOB (Out-of-Bag) Score?**

When performing bootstrap sampling, roughly **37% of the data** (about 1/e) is left out of the sample used to train a particular tree. These unseen rows form that tree's "Out-of-Bag" set. The **OOB Score** is a validation technique in which each training sample is predicted using only the trees that did not see it, and the aggregated predictions are scored against the true labels. It provides a nearly unbiased estimate of the model's generalization accuracy without needing a separate validation set.
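The out-of-bag fraction can be checked numerically, and scikit-learn can report the OOB score directly. The sketch below assumes a simulated dataset size of 10,000 rows and uses the Iris dataset purely as a stand-in; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Part 1: simulate one bootstrap sample and measure how many rows were never drawn
rng = np.random.default_rng(0)
n = 10_000
boot = rng.integers(0, n, size=n)            # sampling with replacement
oob_fraction = 1 - np.unique(boot).size / n
print(f"Out-of-bag fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")

# Part 2: let the forest compute its OOB score during fitting
X_oob, y_oob = load_iris(return_X_y=True)
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_oob.fit(X_oob, y_oob)
print(f"OOB score: {rf_oob.oob_score_:.4f}")
```

Note that `oob_score=True` needs bootstrapping to be enabled, which is the default for `RandomForestClassifier`.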

### **5. How can you measure feature importance in Random Forest?**

Random Forest measures importance by looking at how much the **prediction error increases** when a feature's values are permuted (shuffled) or how much the **Gini impurity/Mean Squared Error decreases** at nodes where that feature is used for splitting.
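Both approaches are available in scikit-learn: `feature_importances_` gives the impurity-based scores, and `sklearn.inspection.permutation_importance` gives the shuffle-based ones. The following is a small sketch on the Iris dataset; the split ratio and seeds are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=0,
    stratify=iris_data.target,
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data, normalized to sum to 1
mdi = forest.feature_importances_

# Permutation importance: drop in held-out score when one feature is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

for name, m, p in zip(iris_data.feature_names, mdi, perm.importances_mean):
    print(f"{name:20s} impurity={m:.3f}  permutation={p:.3f}")
```

Permutation importance is measured on held-out data here, so it reflects what the model relies on when generalizing rather than how often a feature was used for splitting.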

### **6. Explain the working principle of a Bagging Classifier.**

1. **Bootstrapping:** Create multiple random subsets of the training data with replacement.
2. **Parallel Training:** Train a base classifier (e.g., a Decision Tree) on each subset independently.
3. **Aggregating:** For a new data point, each tree provides a class prediction. The final output is determined by **Majority Voting**.
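The three steps above can be sketched by hand with plain decision trees. This is an illustration of hard majority voting under assumed settings (25 trees, Iris data, fixed seeds), not a reproduction of scikit-learn's `BaggingClassifier`, which averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

rng = np.random.default_rng(0)
n_models = 25
models = []
for _ in range(n_models):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one base classifier on that sample, independently of the others
    models.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: collect one vote per model and take the most common class per test row
votes = np.stack([m.predict(X_te) for m in models])        # shape (n_models, n_test)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
manual_accuracy = (majority == y_te).mean()
print(f"Manual bagging accuracy: {manual_accuracy:.4f}")
```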

### **7. How do you evaluate a Bagging Classifier’s performance?**

Performance is evaluated using standard classification metrics: **Accuracy**, **Precision**, **Recall**, **F1-Score**, and the **OOB Score**. For imbalanced datasets, accuracy can be misleading, so the **Area Under the ROC Curve (AUC-ROC)** or precision-recall metrics are preferred.

### **8. How does a Bagging Regressor work?**

It follows the same process as the Bagging Classifier but instead of voting, it **averages** the numerical outputs of all base regressors. This averaging significantly reduces the variance of the final prediction.
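To make the averaging concrete, the sketch below recomputes a `BaggingRegressor` prediction by hand from its fitted base models. The diabetes dataset and the choice of 20 estimators are assumptions made only for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor

X_reg, y_reg = load_diabetes(return_X_y=True)
bag_avg = BaggingRegressor(n_estimators=20, random_state=0).fit(X_reg, y_reg)

# Query each base model with the feature columns it was fitted on,
# which the ensemble records in `estimators_features_`
queries = X_reg[:5]
per_model = np.stack([
    est.predict(queries[:, feats])
    for est, feats in zip(bag_avg.estimators_, bag_avg.estimators_features_)
])                                           # shape (n_estimators, 5)

manual_average = per_model.mean(axis=0)
print("Manual average :", np.round(manual_average, 2))
print("Ensemble output:", np.round(bag_avg.predict(queries), 2))
```

The two printed rows should agree, since the ensemble output is the mean over the rows of `per_model`.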

### **9. What is the main advantage of ensemble techniques?**

The primary advantage is **Generalization**. Ensembles reduce the risk of relying on a single "lucky" or "unlucky" model. By combining multiple perspectives, they provide higher accuracy and are more stable against noise in the data.

### **10. What is the main challenge of ensemble methods?**

* **Computational Cost:** Training multiple models requires more memory and processing power.
* **Interpretability:** It is much harder to explain "why" an ensemble made a decision compared to a single Decision Tree (the "Black Box" problem).

---

### **11. Key idea behind ensemble techniques?**

The core philosophy is **"The Wisdom of the Crowd."** A group of moderately performing, diverse models can collectively outperform a single strong model, because individual errors tend to cancel out as long as the models do not all make the same mistakes.
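A small calculation illustrates the effect under an idealized assumption: every model is correct with the same probability and the models err independently, which real ensembles only approximate. The function name `majority_vote_accuracy` and the value 0.6 are chosen for this sketch.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a majority of n independent voters is correct.

    Assumes n_models is odd and each voter is correct with probability p.
    """
    k_min = n_models // 2 + 1                # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

for n in (1, 11, 101):
    print(f"{n:3d} models at 60% each -> majority correct: {majority_vote_accuracy(n, 0.6):.3f}")
```

The gain shrinks as the models' errors become correlated, which is why diversity among base models matters.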

### **12. What is a Random Forest Classifier?**

It is an ensemble built specifically from **Decision Trees**. It improves upon basic Bagging by introducing **feature randomness** (selecting a random subset of features at each node), which makes the trees more diverse and usually makes the overall forest more accurate.

### **13. Main types of ensemble techniques?**

* **Bagging:** Parallel training (e.g., Random Forest).
* **Boosting:** Sequential training (e.g., AdaBoost, XGBoost).
* **Stacking:** Training a meta-model to combine the outputs of different base models.

### **14. What is ensemble learning in machine learning?**

Ensemble learning is the process by which multiple models, such as classifiers or regressors, are strategically generated and combined to solve a particular computational intelligence problem.

### **15. When should we avoid using ensemble methods?**

Avoid ensembles if:

1. **Interpretability** is the highest priority (e.g., legal or medical explanations).
2. The dataset is extremely small (can lead to overfitting).
3. Computation resources or real-time latency limits are very strict.

### **16. How does Bagging help in reducing overfitting?**

Overfitting happens when a model captures noise as if it were a pattern. Since each tree in Bagging sees a different subset of data, noise in one subset likely won't appear in others. When the models are averaged, the **noise cancels out**, but the true patterns (which exist in all subsets) remain.

### **17. Why is Random Forest better than a single Decision Tree?**

A single tree is highly sensitive to the specific data it is trained on (high variance). Random Forest averages many de-correlated trees, which **reduces the variance** and keeps the model from being overly influenced by outliers or specific data points.
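The variance reduction can be made concrete with a standard result, stated here under the simplifying assumption that the $B$ trees $T_b$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced only for this illustration):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

As $B$ grows, the second term vanishes and only $\rho\,\sigma^2$ remains, so adding trees helps up to a point and lowering the correlation between trees helps beyond it.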

### **18. What is the role of bootstrap sampling in Bagging?**

Bootstrap sampling (sampling with replacement) creates **diversity**. It ensures that each model in the ensemble is trained on a slightly different version of the data, which is the foundation for the ensemble's ability to generalize.

### **19. Real-world applications of ensemble techniques?**

* **Banking:** Credit scoring and fraud detection.
* **Healthcare:** Disease prediction based on multiple symptoms.
* **E-commerce:** Recommendation engines (combining different user-behavior models).
* **Finance:** Stock market trend prediction.

### **20. What is the difference between Bagging and Boosting?**

| Feature | Bagging | Boosting |
| --- | --- | --- |
| **Arrangement** | Parallel | Sequential |
| **Focus** | Reduces Variance | Reduces Bias |
| **Weighting** | All models have equal weight | Models are weighted by performance |
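A rough side-by-side of the two arrangements, using `BaggingClassifier` with fully grown trees and `AdaBoostClassifier` with depth-1 stumps. The breast cancer dataset, 50 estimators, and the seeds are assumptions for this sketch, and the resulting scores say nothing general about which method is better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=0, stratify=y_bc
)

# Bagging: independent, low-bias/high-variance learners whose outputs are averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_tr, y_tr)

# Boosting: weak, high-bias learners added one after another,
# each focusing on the samples the previous ones got wrong
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)
boosting.fit(X_tr, y_tr)

bagging_acc = bagging.score(X_te, y_te)
boosting_acc = boosting.score(X_te, y_te)
print(f"Bagging accuracy : {bagging_acc:.4f}")
print(f"Boosting accuracy: {boosting_acc:.4f}")
```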

---

# **Practical Questions**

### **21. Random Forest Hyperparameter Tuning (GridSearchCV)**

Tuning parameters such as `max_depth` and `min_samples_split` helps control overfitting, while `n_estimators` mainly trades training time for more stable predictions.

```python
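from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: the Iris dataset (loaded later in this notebook) with a 70/30 split
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# Exhaustive search over the grid with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")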

```

### **22. Bagging Regressor with Different Estimator Counts**

Increasing the number of base estimators usually lowers the error and makes it more stable, until the gains plateau.

```python
from sklearn.ensemble import BaggingRegressor
import matplotlib.pyplot as plt

estimators = [10, 50, 100, 200]
scores = []

for n in estimators:
    model = BaggingRegressor(n_estimators=n).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

plt.plot(estimators, scores, marker='o')
plt.title("Estimators vs R-squared Score")
plt.show()

```

### **23. Random Forest: Analyze Misclassified Samples**

Checking which samples the model got wrong can reveal patterns of noise or overlapping classes.

```python
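from sklearn.ensemble import RandomForestClassifier

# Fit the forest that this and the following cells refer to as `rf`
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = rf.predict(X_test)
misclassified_mask = (y_test != y_pred)  # boolean mask over the test rows
print(f"Misclassified samples count: {misclassified_mask.sum()}")
# Inspect up to the first 5 misclassified instances
print(X_test[misclassified_mask][:5])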

```

### **24. Bagging Classifier vs. Single Decision Tree**

This demonstrates the power of aggregation in reducing variance.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

tree = DecisionTreeClassifier().fit(X_train, y_train)
bag = BaggingClassifier(n_estimators=100).fit(X_train, y_train)

print(f"Single Tree Accuracy: {tree.score(X_test, y_test):.4f}")
print(f"Bagging Accuracy: {bag.score(X_test, y_test):.4f}")

```

### **25. Random Forest Confusion Matrix**

The confusion matrix shows exactly where the model is confusing one class for another.

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, rf.predict(X_test))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

```

### **26. Stacking Classifier (Trees, SVM, Logistic Regression)**

Stacking uses a "Meta-Learner" (often Logistic Regression) that is trained on the base models' cross-validated predictions and learns how much weight to give each of them.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

estimators = [
    ('dt', DecisionTreeClassifier()),
    ('svc', SVC(probability=True))
]

stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack_clf.fit(X_train, y_train)
print(f"Stacking Accuracy: {stack_clf.score(X_test, y_test):.4f}")

```

### **27. Top 5 Important Features (Random Forest)**

```python
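import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()  # provides the feature names matching the columns of X
importances = rf.feature_importances_
feature_df = pd.DataFrame({'Feature': iris.feature_names, 'Importance': importances})
# Iris has only 4 features, so head(5) returns all of them, sorted
print(feature_df.sort_values(by='Importance', ascending=False).head(5))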

```

### **28. Bagging: Precision, Recall, and F1-score**

```python
from sklearn.metrics import classification_report
print(classification_report(y_test, bag.predict(X_test)))

```

### **29. Random Forest: Effect of max_depth**

Limiting `max_depth` is one of the main ways to control overfitting in Random Forests, alongside parameters such as `min_samples_leaf` and `max_features`.

```python
depths = [1, 3, 5, 10, None]
for d in depths:
    rf_depth = RandomForestClassifier(max_depth=d).fit(X_train, y_train)
    print(f"Depth {d} Accuracy: {rf_depth.score(X_test, y_test):.4f}")

```

### **30. Bagging Regressor: Tree vs. Neighbors**

You can use non-tree models as base estimators for Bagging.

```python
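from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# `estimator` replaced the `base_estimator` parameter in scikit-learn 1.2
bag_tree = BaggingRegressor(estimator=DecisionTreeRegressor()).fit(X_train, y_train)
bag_knn = BaggingRegressor(estimator=KNeighborsRegressor()).fit(X_train, y_train)

print(f"Bagging (Tree) MSE: {mean_squared_error(y_test, bag_tree.predict(X_test)):.2f}")
print(f"Bagging (KNN) MSE: {mean_squared_error(y_test, bag_knn.predict(X_test)):.2f}")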

```

### **31. Random Forest: ROC-AUC Score**

```python
from sklearn.metrics import roc_auc_score
# Using predict_proba for multi-class ROC-AUC
probs = rf.predict_proba(X_test)
print(f"ROC-AUC Score: {roc_auc_score(y_test, probs, multi_class='ovr'):.4f}")

```

### **32. Bagging: Cross-Validation**

```python
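from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # full dataset; cross_val_score does the splitting
scores = cross_val_score(BaggingClassifier(), X, y, cv=5)
print(f"Mean CV Accuracy: {scores.mean():.4f}")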

```

### **33. Random Forest: Precision-Recall Curve**

The precision-recall curve is useful for imbalanced datasets because it shows the trade-off between precision and recall across decision thresholds.
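A possible way to compute the curve is sketched below. Because `precision_recall_curve` expects a binary target, this example assumes the breast cancer dataset instead of Iris; the `_pr` variable names and seeds are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X_pr, y_pr = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_pr, y_pr, test_size=0.3, random_state=0, stratify=y_pr
)

rf_pr = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pos_scores = rf_pr.predict_proba(X_te)[:, 1]   # probability of the positive class

# One (precision, recall) pair per threshold, plus a final point at recall 0
precision, recall, thresholds = precision_recall_curve(y_te, pos_scores)
avg_precision = average_precision_score(y_te, pos_scores)
print(f"Average precision: {avg_precision:.4f}")
print(f"Curve points: {len(precision)} (thresholds: {len(thresholds)})")
```

The curve itself can then be drawn with `plt.plot(recall, precision)` using the `matplotlib.pyplot` import from the earlier cells.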

### **34. Stacking: Random Forest + Logistic Regression**

```python
estimators = [('rf', RandomForestClassifier(n_estimators=10))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(f"Stacking (RF + LR) Accuracy: {stack.score(X_test, y_test):.4f}")

```

### **35. Bagging: Effect of Bootstrap Levels**

Varying `max_samples` (the fraction of the training data drawn for each base learner) changes how diverse the base learners are.

```python
samples = [0.5, 0.7, 0.9, 1.0]
for s in samples:
    bag_samples = BaggingClassifier(max_samples=s).fit(X_train, y_train)
    print(f"Sample Fraction {s} Accuracy: {bag_samples.score(X_test, y_test):.4f}")

```


### **36. Random Forest Classifier: Training with 5-Fold Cross-Validation**

Cross-validation ensures that the Random Forest's performance is consistent across different subsets of the data, providing a more reliable estimate than a single train-test split.

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X, y, cv=5)

print(f"Cross-Validation Accuracy Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f}")

```

### **37. Bagging Classifier: Impact of `n_estimators` on Accuracy**

This exercise helps identify the point of diminishing returns, where adding more trees no longer improves model performance.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier

n_range = [1, 5, 10, 20, 50, 100]
accuracies = []

for n in n_range:
    clf = BaggingClassifier(n_estimators=n, random_state=42).fit(X_train, y_train)
    accuracies.append(clf.score(X_test, y_test))

plt.plot(n_range, accuracies, marker='o', linestyle='--')
plt.title("Bagging: Impact of n_estimators")
plt.xlabel("Number of Estimators")
plt.ylabel("Accuracy")
plt.show()

```

### **38. Random Forest: Visualizing the First Decision Tree**

A Random Forest is an ensemble of trees. Visualizing an individual tree from the forest helps in understanding the decision-making logic of the base estimators.

```python
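import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

iris = load_iris()  # supplies the feature names used in the plot
rf = RandomForestClassifier(n_estimators=10, max_depth=3).fit(X_train, y_train)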
plt.figure(figsize=(15, 10))
plot_tree(rf.estimators_[0], feature_names=iris.feature_names, filled=True)
plt.title("Visualization of a Single Tree in the Forest")
plt.show()

```

### **39. Bagging Regressor: Prediction Intervals**

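By looking at the predictions of every individual tree in a Bagging ensemble, we can estimate how much the base models disagree on a given input. This spread reflects model variance only, so it is a rough uncertainty indicator rather than a calibrated prediction interval.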

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

bag_reg = BaggingRegressor(n_estimators=100).fit(X_train, y_train)
# Collect predictions from all 100 trees for a single test sample.
# Each tree is queried with the feature columns it was fitted on.
sample = X_test[:1]
sample_preds = np.array([
    tree.predict(sample[:, feats])[0]
    for tree, feats in zip(bag_reg.estimators_, bag_reg.estimators_features_)
])

print(f"Mean Prediction: {np.mean(sample_preds):.2f}")
print(f"Standard Deviation (Uncertainty): {np.std(sample_preds):.2f}")

```

### **40. Comparing Random Forest and Extra Trees Classifier**

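Extra Trees (Extremely Randomized Trees) differ from Random Forest by choosing split thresholds randomly for each candidate feature rather than searching for the best possible split, and by default they train each tree on the full dataset instead of a bootstrap sample. The extra randomness can further reduce variance, usually at the cost of slightly higher bias.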

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
et = ExtraTreesClassifier(n_estimators=100).fit(X_train, y_train)

print(f"Random Forest Score: {rf.score(X_test, y_test):.4f}")
print(f"Extra Trees Score: {et.score(X_test, y_test):.4f}")

```

### **41. Random Forest: Analyzing Feature Correlation**

If two features are highly correlated, Random Forest might split their "importance" score between them. This exercise identifies such relationships before training.

```python
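import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()  # data and matching column names
df = pd.DataFrame(iris.data, columns=iris.feature_names)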
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='RdYlGn')
plt.title("Feature Correlation Heatmap")
plt.show()

```

### **42. Bagging Classifier: Logistic Regression as Base Estimator**

Bagging isn't restricted to trees; using Logistic Regression as a base learner can create a stable, low-variance linear ensemble.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier

# `estimator` replaced the `base_estimator` parameter in scikit-learn 1.2
bag_lr = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=10)
bag_lr.fit(X_train, y_train)
print(f"Bagging (Logistic Regression) Accuracy: {bag_lr.score(X_test, y_test):.4f}")

```

### **43. Random Forest: Balanced Class Weights**

For imbalanced datasets, setting `class_weight='balanced'` adjusts the weights of the classes inversely proportional to their frequencies in the input data.

```python
rf_balanced = RandomForestClassifier(n_estimators=100, class_weight='balanced')
rf_balanced.fit(X_train, y_train)

```
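The "balanced" weights follow the formula `n_samples / (n_classes * np.bincount(y))`. As a quick check, the sketch below compares that formula with scikit-learn's `compute_class_weight` helper on a made-up 90/10 label vector (`y_imb` is a hypothetical example, not data from this notebook).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_imb)

# Manual formula: rarer classes receive proportionally larger weights
manual_weights = len(y_imb) / (len(classes) * np.bincount(y_imb))

# Same computation through scikit-learn's helper
balanced_weights = compute_class_weight("balanced", classes=classes, y=y_imb)

print("Manual  :", np.round(manual_weights, 3))
print("sklearn :", np.round(balanced_weights, 3))
```

Here the minority class ends up weighted nine times more heavily than the majority class, mirroring the 9:1 imbalance.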

### **44. Random Forest Regressor: Evaluating with R² and MAE**

R² measures the share of the target's variance that the model explains, while MAE reports the typical size of the error in the target's own units. Reporting both gives a fuller picture than either metric alone.

```python
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)

print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred):.4f}")

```

### **45. Stacking Regressor: Combining Random Forest and KNN**

Stacking uses a final "meta-regressor" to learn how to best combine the numerical predictions of diverse base models.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

estimators = [
    ('rf', RandomForestRegressor(n_estimators=10)),
    ('knn', KNeighborsRegressor(n_neighbors=5))
]

stack_reg = StackingRegressor(estimators=estimators, final_estimator=Ridge())
stack_reg.fit(X_train, y_train)
print(f"Stacking Regressor R² Score: {stack_reg.score(X_test, y_test):.4f}")

```

---



