# Ensemble Methods

Ensemble methods combine multiple models, called base learners, to produce a single stronger model.

- Different models make different mistakes, and aggregation smooths them out.
- Base learners are usually simple (decision trees, linear models).
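The smoothing effect can be illustrated with a small simulation. This is a minimal sketch under a strong assumption: every base learner is unbiased and its errors are independent of the others. `TRUE_VALUE`, `noisy_prediction`, and `ensemble_prediction` are hypothetical names used only for illustration.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0   # hypothetical quantity every model tries to predict

def noisy_prediction():
    """One base learner: right on average, with unit-variance noise."""
    return TRUE_VALUE + random.gauss(0, 1)

def ensemble_prediction(n_models):
    """Average the predictions of n_models independent base learners."""
    return sum(noisy_prediction() for _ in range(n_models)) / n_models

single = [noisy_prediction() for _ in range(5000)]
ensemble = [ensemble_prediction(25) for _ in range(5000)]

single_var = statistics.variance(single)
ensemble_var = statistics.variance(ensemble)
print(f"Variance of a single model:     {single_var:.3f}")
print(f"Variance of a 25-model average: {ensemble_var:.3f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of models. Real base learners are correlated, so the gain in practice is smaller.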

## Bagging

- Bagging (bootstrap aggregating) focuses on variance reduction.
- Multiple models are trained independently.
- Each model sees a different bootstrap sample (sampling with replacement).
- Predictions are combined by averaging (regression) or majority vote (classification).
- Works best for high-variance models (e.g. decision trees).
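As a rough illustration of the points above, the sketch below compares a single decision tree with a bagged ensemble of trees using scikit-learn's `BaggingClassifier`. The dataset (`load_breast_cancer`), the number of estimators, and the 5-fold setup are assumptions chosen for the example, not part of the original notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single, fully grown tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0)

# 50 trees, each fit on its own bootstrap sample (rows drawn with replacement);
# the base learner is passed as the first positional argument
bagged = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
bagged_acc = cross_val_score(bagged, X, y, cv=5).mean()
print(f"Single tree 5-fold accuracy:  {tree_acc:.3f}")
print(f"Bagged trees 5-fold accuracy: {bagged_acc:.3f}")
```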

## Boosting

- Boosting focuses on bias reduction.
- Models are trained sequentially.
- Each new model focuses on the mistakes of the previous ones.
- Predictions are combined as a weighted sum.
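The sequential idea can be sketched by hand for regression with squared error, where "previous mistakes" are simply the residuals. This is one simple flavour of boosting (gradient boosting with shallow trees), not the only one. The synthetic data, `learning_rate`, and the number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
mse_per_round = []

for _ in range(50):
    residuals = y - prediction                       # mistakes of the current ensemble
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_learner.fit(X, residuals)                   # the new model targets those mistakes
    prediction += learning_rate * weak_learner.predict(X)   # weighted sum of learners
    mse_per_round.append(np.mean((y - prediction) ** 2))

print(f"Training MSE after round 1:  {mse_per_round[0]:.4f}")
print(f"Training MSE after round 50: {mse_per_round[-1]:.4f}")
```

Each shallow tree is a high-bias model on its own. Adding them one after another is what drives the training error down.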

## Stacking

- Combines different types of models.
- Multiple diverse base models are trained in parallel.
- Their predictions (ideally out-of-fold, to avoid leakage) are used as features for a meta-model.
- The meta-model learns how to weight or combine the base predictions.
- Example: logistic regression on top of RF + SVM + GBM.
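The RF + SVM + GBM example could look roughly like this with scikit-learn's `StackingClassifier`. The dataset, the scaling step in front of the SVM, and all hyperparameters are assumptions made for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Diverse base models; the SVM gets a scaler because it is scale-sensitive
base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]

# cv=5: the meta-model is trained on out-of-fold predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_acc = stack.score(X_test, y_test)
print(f"Stacked test accuracy: {stack_acc:.3f}")
```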

## Voting

Voting combines multiple models using simple aggregation rules.

- Hard voting: each model votes for one class, and the majority class wins.
- Soft voting: the predicted probabilities are averaged, and the most likely class wins.
- Models are usually diverse.
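Hard and soft voting can disagree when one model is much more confident than the others. The sketch below uses made-up probabilities for a single sample from three hypothetical models to show this.

```python
import numpy as np

# Hypothetical predicted probabilities for one sample, shape (n_models, n_classes)
probas = np.array([
    [0.10, 0.90],   # model A: confident in class 1
    [0.60, 0.40],   # model B: leans towards class 0
    [0.60, 0.40],   # model C: leans towards class 0
])

# Hard voting: each model casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = int(mean_proba.argmax())

print("Hard vote:", hard_prediction)   # class 0: two votes against one
print("Soft vote:", soft_prediction)   # class 1: mean probability is about 0.57
```

In scikit-learn the same two rules are available through `VotingClassifier(voting="hard")` and `VotingClassifier(voting="soft")`.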

---------------
- High variance, unstable model → **Bagging / Random Forest**
- High bias, underfitting → **Boosting**
- Very different model types → **Stacking**
- Simple ensemble baseline → **Voting**
- Categorical-heavy data → **CatBoost**
- Massive dataset, speed matters → **LightGBM**

# Random Forest

## Bias-Variance Trade-off

Deep decision trees have low bias but very high variance: small changes in the data can produce very different trees. This makes trees unstable and prone to overfitting. How do we keep the low bias of the trees but reduce the variance?

**Train many different trees and average them**
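One way to see the variance reduction is to refit a model on many freshly drawn training sets and measure how much its predictions move at fixed test points. The sketch below does this for a single tree and for a forest. The synthetic sine data, the sample sizes, and the helper `prediction_variance` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)

def prediction_variance(make_model, n_repeats=20):
    """Average variance of predictions at X_test across independent training sets."""
    preds = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)
        X = rng.uniform(-3, 3, size=(100, 1))
        y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=100)
        preds.append(make_model(seed).fit(X, y).predict(X_test))
    return float(np.var(preds, axis=0).mean())

tree_var = prediction_variance(lambda s: DecisionTreeRegressor(random_state=s))
forest_var = prediction_variance(
    lambda s: RandomForestRegressor(n_estimators=50, random_state=s)
)

print(f"Prediction variance, single tree: {tree_var:.4f}")
print(f"Prediction variance, forest:      {forest_var:.4f}")
```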

- Outlier detection using Isolation Forest (an unsupervised method)
- Hyperparameters of `RandomForest`
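For the Isolation Forest item, a minimal sketch with scikit-learn's `IsolationForest` is shown below. The data is synthetic: a Gaussian cluster plus one point placed far away on purpose, so the expected outlier is known in advance.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier = np.array([[10.0, 10.0]])        # deliberately far from the cluster
X = np.vstack([inliers, outlier])         # the outlier is the last row (index 200)

# No labels are used: the forest only measures how easy each point is to isolate
iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iso.score_samples(X)             # lower score = more anomalous
most_anomalous = int(scores.argmin())

print("Index of most anomalous point:", most_anomalous)
print("Label for that point (-1 = outlier):", iso.predict(X[[most_anomalous]])[0])
```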

## Randomness in Random Forest

- Bootstrap sampling → rows are randomly sampled with replacement (row randomness)
- Feature subsampling → only a random subset of features is considered (column randomness)
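Both sources of randomness can be sketched with plain NumPy index sampling. The sizes are arbitrary, and the feature subset is drawn once here for simplicity. In scikit-learn's `RandomForestClassifier`, a fresh feature subset (controlled by `max_features`) is drawn at every split rather than once per tree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 9

# Row randomness: a bootstrap sample is n_samples draws with replacement
row_idx = rng.integers(0, n_samples, size=n_samples)

# Column randomness: a random subset of features, here sqrt(n_features) of them
k = int(np.sqrt(n_features))
col_idx = rng.choice(n_features, size=k, replace=False)

oob_idx = np.setdiff1d(np.arange(n_samples), row_idx)   # rows never drawn

print("Bootstrap rows (duplicates expected):", np.sort(row_idx))
print("Rows never drawn (out-of-bag):", oob_idx)
print("Candidate features:", np.sort(col_idx))
```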

## Out of Bag Score (OOB Score)

- Each tree is trained on a **bootstrap sample** (random sample with replacement) of the data.
- About **63%** of the distinct data points end up in a given tree’s bootstrap sample.
- The remaining **~37%** are **not used** for that tree; these are called **Out-of-Bag (OOB)** samples.
- Each data point is predicted only by the trees that did not see it during training. These predictions are aggregated and compared with the true values; the resulting accuracy is called the OOB accuracy (OOB score).
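The 63% / 37% split follows from a short calculation. As a sketch, assume each tree's bootstrap sample consists of $n$ draws made uniformly with replacement from $n$ data points. The probability that a given point is never drawn is then

$$
P(\text{point } i \text{ is out-of-bag}) = \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368 \quad \text{for large } n
$$

So roughly 36.8% of the points are out-of-bag for any given tree, and the remaining 63.2% appear in its bootstrap sample at least once.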

## Proximity Matrix

Measures how often two data points end up in the same leaf node across all trees in the forest. 

$$
P(i,j) = \frac{\text{number of trees where } i \text{ and } j \text{ fall in the same leaf}}{\text{total number of trees}}
$$

- Gives a data-driven similarity measure between points.
- Missing-value imputation: if a point has missing values, find the points most similar to it (high proximity) and use their known values to fill in the missing ones.
- Outlier detection: an outlier rarely lands in the same leaf as other points, so it has low proximity to everyone.

$$
\text{Outlier Score}(i) = \frac{1}{\sum_j P(i,j)^2}
$$

- Large Score → Isolated → Likely outlier

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# -----------------------------
# Load example dataset
# -----------------------------
data = load_iris()
X, y = data.data, data.target

# Split data (just for demonstration)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------------
# Random Forest with OOB
# -----------------------------
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    oob_score=True,   # enable Out-Of-Bag score
    random_state=42
)

rf.fit(X_train, y_train)

print(f"OOB Score: {rf.oob_score_:.3f}")
print(f"Test Score: {rf.score(X_test, y_test):.3f}")

# -----------------------------
# Proximity Matrix Calculation
# -----------------------------
# Idea: for each tree, get the leaf index of every sample,
# then count how often samples i and j share the same leaf.
n_samples = X_train.shape[0]
proximity = np.zeros((n_samples, n_samples))

for tree in rf.estimators_:
    # Leaf index of each training sample in this tree
    leaf_indices = tree.apply(X_train)
    # Pairwise comparison via broadcasting: entry (i, j) is True when i and j share a leaf
    proximity += leaf_indices[:, None] == leaf_indices[None, :]

# Normalize by number of trees
proximity /= len(rf.estimators_)
print("Proximity Matrix (first 5 samples):")
print(proximity[:5, :5])

# -----------------------------
# Outlier Detection Using Proximity
# -----------------------------
# Outlier Score: 1 / sum_j Proximity(i,j)^2
outlier_scores = 1 / np.sum(proximity**2, axis=1)
print("Outlier Scores (first 10 samples):")
print(outlier_scores[:10])

# Indices of the 3 largest outlier scores (in ascending order of score)
top_outliers = np.argsort(outlier_scores)[-3:]
print("Top 3 Outliers indices:", top_outliers)
```


Output:

```text
OOB Score: 0.917
Test Score: 1.000
Proximity Matrix (first 5 samples):
[[1.   0.79 0.   1.   1.  ]
 [0.79 1.   0.   0.79 0.79]
 [0.   0.   1.   0.   0.  ]
 [1.   0.79 0.   1.   1.  ]
 [1.   0.79 0.   1.   1.  ]]
Outlier Scores (first 10 samples):
[0.02582685 0.03834812 0.03830229 0.02582685 0.02582685 0.07265647
 0.04187447 0.02582685 0.02582685 0.02582685]
Top 3 Outliers indices: [68 59 62]
```
