# Advanced ML Week 5 Notes

## Ensemble Learning and XGBoost

Ensemble learning exists to 
- reduce overfitting
- improve predictive performance

Ensemble learning involves training several base models on the input data and then aggregating their predictions to produce a single final model.

### Bias and Variance Trade-Off

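Bias and variance pull in opposite directions: simpler models tend to have high bias and low variance, while more complex models have low bias and high variance. The optimal model complexity is the one that minimizes the total error, not each term separately.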
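For reference, the total error mentioned above is usually written as the standard decomposition of expected squared error. This is a sketch that assumes a regression setting with $y = f(x) + \varepsilon$, zero-mean noise, and noise variance $\sigma^2$; these symbols are not from the lecture and are introduced here only for illustration.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing model complexity typically lowers the bias term and raises the variance term, so the sum is smallest somewhere in between.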

### Base Models (weak learners)

Base models are the standard ML models (decision tree, SVM, KNN, or even a random forest) that make up an ensemble.

#### Voting and Averaging
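- Hard voting: majority rule over the predicted class labels
- Soft voting: average the predicted class probabilities and pick the most probable class
    - Weighted averaging: give more trusted models a larger weight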
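The difference between the voting schemes can be seen on a single sample. The sketch below uses made-up probabilities from three hypothetical classifiers and made-up weights; none of these numbers come from the lecture.

```python
import numpy as np

# Hypothetical predicted class probabilities from three classifiers
# for one sample, over the classes [0, 1].
probas = np.array([
    [0.45, 0.55],  # model A leans towards class 1
    [0.40, 0.60],  # model B leans towards class 1
    [0.90, 0.10],  # model C is confident in class 0
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_vote = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_vote = probas.mean(axis=0).argmax()

# Weighted soft voting: hypothetical weights that trust model C less.
weights = np.array([0.45, 0.45, 0.10])
weighted_vote = np.average(probas, axis=0, weights=weights).argmax()

print(hard_vote, soft_vote, weighted_vote)
```

Here hard voting picks class 1 (two votes to one), plain soft voting picks class 0 because model C is so confident, and down-weighting model C flips the result back to class 1.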
    
### Practical Considerations
- Overfitting and underfitting
- Model interpretability
- Computational efficiency

## Bagging (Bootstrap Aggregation)
- Create multiple subsets of the data by sampling with replacement (bootstrap samples)
- Train a separate model on each subset
- Aggregate the results
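The three steps above can be sketched by hand. This is a minimal illustration, not how scikit-learn implements bagging; the dataset size, the number of models, and the choice of decision trees are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
rng = np.random.default_rng(1)

n_models = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_models):
    # Step 1: bootstrap sample, drawing n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: train a separate model on that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: aggregate with a majority vote (labels are 0/1 here).
all_preds = np.array([t.predict(X) for t in trees])  # (n_models, n_samples)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("Training accuracy of the bagged vote:", (bagged_pred == y).mean())
```

The accuracy printed here is measured on the training data, so it only shows that the mechanics work; it is not an estimate of generalization.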

### Example 1: Random Forest
- Ensemble of decision trees
- Build each tree on a bootstrap sample of the data, considering only a random subset of features at each split
- Reduced correlation between trees, reduced overfitting
- Improved generalization capability
- For classification, majority voting
- For regression, average predictions from all trees
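A random forest along these lines can be tried with scikit-learn's `RandomForestClassifier`. The hyperparameter values and the 5-fold evaluation below are illustrative assumptions, not settings from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=1)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # size of the random feature subset tried at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=1,
)
rf_scores = cross_val_score(forest, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: %.3f" % rf_scores.mean())
```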

### Example 2: Extra Trees (Extremely randomized trees)
- More randomized than random forest
- Use randomized thresholds in tree-building process
- Improved tree diversity
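- No bootstrap sampling by default: each tree is trained on the whole dataset
- Draw a random threshold for each candidate feature, then keep the best of those random splits
- Useful when faster training is required and variance is an issue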
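A comparable sketch for extra trees uses scikit-learn's `ExtraTreesClassifier`, where `bootstrap=False` is the default and is spelled out here only for emphasis. The dataset and hyperparameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=1)

extra = ExtraTreesClassifier(
    n_estimators=100,
    bootstrap=False,  # every tree is fit on the full training set
    random_state=1,
)
et_scores = cross_val_score(extra, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: %.3f" % et_scores.mean())
```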

### Example 3: Scikit-learn bagging
- Weak learners include not only decision tree but also SVM and KNN

```python
# Bagging example

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
```

```python
# Create synthetic classification dataset
X, y = make_classification(random_state=1)

# Configure the ensemble model
model = BaggingClassifier(n_estimators=50)

# Configure the resampling method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the ensemble on the data using the resampling method
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# Report ensemble performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```

```text
Mean Accuracy: 0.943 (0.080)
```
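To illustrate the point that the weak learner does not have to be a decision tree, the sketch below bags KNN classifiers instead. The number of estimators and the 5-fold evaluation are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(random_state=1)

# The base learner is passed positionally because its keyword name
# differs between scikit-learn versions.
knn_bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=20, random_state=1)

knn_scores = cross_val_score(knn_bagging, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: %.3f" % knn_scores.mean())
```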


## Boosting
- Sequentially training models
- Each model corrects errors of the previous one
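- Later models focus on earlier mistakes, for example by increasing the weights of misclassified samples (AdaBoost) or by fitting residuals (gradient boosting)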

### AdaBoost (Adaptive Boosting)
- Focuses on hard-to-classify samples
- Weight adjustment mechanism: misclassified samples get larger weights in the next round
- Final prediction is a weighted majority vote (classification) or weighted sum (regression) of the weak learners
- Sensitive to noisy data and outliers; performance depends on the choice of weak learner
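One round of the weight adjustment can be worked through numerically. The sketch below follows the classic binary AdaBoost update with labels in {-1, +1}; the labels and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical labels in {-1, +1} and one weak learner's predictions.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # the second sample is misclassified

w = np.full(len(y_true), 1 / len(y_true))  # start with uniform sample weights

miss = y_pred != y_true
err = w[miss].sum()                    # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)  # this learner's weight in the final vote

# Increase the weights of misclassified samples, decrease the rest, renormalize.
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()

print("alpha:", alpha)
print("updated weights:", w)
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.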

### Gradient Boosting
- Optimize a loss function by fitting the residuals of the previous prediction
- Iteratively adds models to reduce error
- Works like gradient descent on the loss: each new model fits the negative gradient of the loss, and a learning rate (shrinkage) acts as regularization
- Flexible: supports various loss functions and weak learners
- Typically high accuracy, especially on structured/tabular data

#### Disadvantages:
- Slower to train
- More parameters to tune
- More complex to implement and optimize
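The residual-fitting idea can be sketched in a few lines for squared-error regression, where the residuals are exactly the negative gradient of the loss. This is a minimal illustration rather than a full implementation; the dataset, tree depth, learning rate, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

learning_rate = 0.1
pred = np.full(len(y), y.mean())  # start from a constant prediction
initial_mse = np.mean((y - pred) ** 2)

trees = []
for _ in range(50):
    residuals = y - pred  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)  # small step towards the residuals
    trees.append(tree)

final_mse = np.mean((y - pred) ** 2)
print("Training MSE before: %.1f, after: %.1f" % (initial_mse, final_mse))
```

To predict on new data, the same sum would be replayed: the initial mean plus `learning_rate` times each stored tree's prediction.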

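### XGBoost (Extreme Gradient Boosting)
- An optimized implementation of gradient boosting that adds regularization, tree pruning, and parallel processing (covered in the XGBoost section at the end of these notes)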

### Comparison
**Strategy:**
- Ada: sequential with weight adjustment
- Gradient: sequential with gradient descent
- XG: optimized gradient boosting

**Weak Learners:**
- Ada: Decision stumps or shallow trees
- Gradient: decision trees (various depths)
- XG: decision trees with optimizations

**Handling Errors:**
- Ada: Adjusts weights of misclassified samples
- Gradient: fits residuals of combined predictions
- XG: fits residuals with regularization

**Training Speed:**
- Ada: Fast
- Gradient: Moderate to slow
- XG: Fast (due to parallel processing)

**Prediction speed:**
- Ada: fast
- Gradient: moderate
- XG: fast

**Regularization:**
- Ada: no
- Gradient: no (basic versions)
- XG: Yes (L1 and L2 regularization)

**Complexity:**
- Ada: simple
- Gradient: moderate
- XG: high

**Overfitting:**
- Ada: Prone on noisy data
- Gradient: Less prone (depends on parameters)
- XG: less prone due to regularization

**Handling Large Data:**
- Ada: moderate
- Gradient: moderate
- XG: Excellent (optimized for large-scale data)

**Tuning Parameters:**
- Ada: few parameters
- Gradient: more parameters
- XG: many parameters

## Stacking (Stacked Generalization)
- Training multiple different models (level-0 models/base learner)
- Use another model (meta-learner) to aggregate predictions
- Meta-learner is trained on the output of the base models

### Stacking Features
- Model diversity: leverages each individual model's strengths
- Meta-learning: learns how to best combine the base models' predictions
- Uses cross-validation to generate the training data for the meta-learner
- No specific algorithm: any combination of models can serve as base learners, and any model can serve as the meta-learner
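These features map onto scikit-learn's `StackingClassifier`, which builds the meta-learner's training data from cross-validated predictions of the base learners. The particular base learners, the logistic regression meta-learner, and the fold counts below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(random_state=1)

# Level-0 models: deliberately different model families for diversity.
base_learners = [
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(random_state=1)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),  # meta-learner
    cv=5,  # inner folds used to create the meta-learner's training data
)

stack_scores = cross_val_score(stack, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: %.3f" % stack_scores.mean())
```

The outer `cross_val_score` evaluates the whole stack, while the inner `cv=5` keeps the meta-learner from being trained on predictions the base models made for data they had already seen.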

### Advantages
- High flexibility
- Better performance
- Customizability

### Disadvantages
- Complexity: complex to implement and tune
- Computational cost: training multiple models and meta-learner
- Risk of overfitting: the stacked model needs to be validated properly

## When to Use Each Boosting
- AdaBoost: for a simple problem that needs a quick, easy-to-implement solution. Be cautious with noisy data.
- Gradient Boosting: when you need a flexible and powerful model that can handle various types of data and loss functions. Be prepared for longer training times and more parameter tuning.
- XGBoost: when you need high performance and efficiency, especially on large datasets. Ideal for competitive machine learning tasks.

## When to Use Each Ensemble
- Bagging: when you want to reduce variance and improve stability, especially with high-variance models (decision trees). Suitable for parallel processing.
- Boosting: when you need to improve accuracy, mainly by reducing bias (and often variance as well). Ideal for challenging datasets and problems where performance is critical.
- Stacking: when you have diverse models. Suitable for complex tasks that can afford the computational cost.

# XGBoost

### What is it?
- Optimized distributed gradient boosting library
- Focus on efficiency and model performance
- Include several enhancements: regularization, tree pruning, parallel processing

### Benefit (Optimization + Efficiency)
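- Regularization (L1 & L2) -> helps prevent overfitting
- Parallel processing -> split finding within each tree is parallelized (trees are still added sequentially)
- Handling missing values -> learns a default branch direction for missing entries
- Tree pruning -> trees are grown up to the max depth, then splits whose gain is too small are pruned back
- High flexibility and portability
- Faster training time and efficient prediction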
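The regularization and pruning benefits can be read off the objective that XGBoost minimizes. The form below is a sketch in notation not used in the lecture: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w_j$ are its leaf weights.

```latex
\text{Obj} = \sum_{i} l\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
```

In the Python package these penalties correspond to the `gamma`, `reg_lambda` (L2), and `reg_alpha` (L1) parameters. A split is kept only if it reduces the loss by more than $\gamma$, which is where the pruning behavior comes from.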

### Disadvantages
- Complexity in both understanding and implementation
- Resource intensive - requiring more memory and processing power