# Introduction
Boosting is a powerful ensemble Machine Learning technique used in both classification and regression tasks. It combines the predictions from multiple weak learners (oftentimes Decision Trees) to create a strong learner with improved performance.

### Core idea
- Boosting iteratively trains weak learners, where each learner focuses on correcting the errors of the previous ones.
- Imagine a group of average students (weak learners) working together to solve a problem. Each student learns from the mistakes of the others, ultimately leading to a better understanding of the problem.

### Boosting algorithm
1. Initialize weights: Each data point in the training set is assigned an equal weight.
2. Train weak learner: A weak learner (e.g., Decision Tree) is trained on the weighted data.
3. Calculate error: The error of the weak learner is calculated as a weighted error, so a mistake on a heavily weighted point counts for more than a mistake on a lightly weighted one.
4. Adjust weights: Weights of the data points are adjusted based on the errors. More weight is given to points that the previous learners got wrong.
5. Repeat: Steps 2 to 4 are repeated for multiple iterations, with each new learner focusing on the most difficult cases from the previous learner.
6. Final prediction: The final prediction is made by combining the predictions from all the weak learners in the ensemble, often using weighted voting (for classification) or averaging (for regression).
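
The steps above can be sketched in plain Python. This is a minimal AdaBoost-style illustration, not a reference implementation: it assumes 1-D inputs, labels in {-1, +1}, and decision stumps as weak learners. The toy data and the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are made up for this example.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump.

    The stump predicts `polarity` when x > threshold and `-polarity` otherwise.
    """
    points = sorted(set(xs))
    # Candidate thresholds: one below every point, plus midpoints between neighbours.
    thresholds = [points[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            # Weighted error: sum of the weights of the misclassified points.
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x > threshold else -polarity) != y
            )
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def stump_predict(stump, x):
    threshold, polarity, _ = stump
    return polarity if x > threshold else -polarity

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                       # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):                     # step 5: repeat
        stump = fit_stump(xs, ys, weights)        # step 2: train on weighted data
        error = max(stump[2], 1e-10)              # step 3: weighted error (clamped to avoid log(0))
        if error >= 0.5:                          # no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - error) / error)  # this learner's vote
        ensemble.append((alpha, stump))
        # Step 4: increase weights of mistakes, decrease weights of correct points.
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Step 6: weighted vote over all weak learners.
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data: the -1 labels sit in the middle, so no single stump can separate them.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]

ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data the best single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all ten correctly.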

### Benefits of Boosting
- Improved accuracy: By combining weaker models, boosting can achieve higher accuracy compared to individual learners.
- Can handle complex problems: Boosting can learn complex relationships in the data that might be challenging for a single model.
- Handles imbalanced data: Some boosting algorithms can effectively handle imbalanced datasets where certain classes have fewer data points.

### Common Boosting algorithms
- AdaBoost (Adaptive Boosting): A popular boosting algorithm that increases the weights of misclassified examples so that later learners concentrate on them.
- Gradient Boosting: A more general framework where each iteration adds a learner that reduces a loss function (e.g., squared error for regression). Common implementations include:
    - XGBoost: A powerful and scalable gradient boosting algorithm known for its performance.
    - LightGBM: Another efficient gradient boosting algorithm with good performance and speed.

### Considerations
- Overfitting: Boosting algorithms can be prone to overfitting if not carefully tuned. Techniques like regularization can be used to mitigate this risk.
- Computational cost: Training a boosted ensemble can be computationally expensive compared to a single model, especially with many iterations.

# Bagging V. Boosting
### Boosting for high bias and low variance models
- Boosting is often used when the base learners (e.g., shallow Decision Trees) tend to have high bias (underfitting) and low variance (low model complexity).
- Bagging would not be ideal in this scenario because averaging mainly reduces variance, so it works best on diverse, high-variance models. Averaging underfitting models won't significantly improve performance, since they tend to make similar errors.

### Additive combining in Boosting
Boosting addresses the high bias by sequentially training models (like Decision Trees) in an "additive" fashion. Each subsequent model "boosts" the overall performance by focusing on the errors of the previous model. Consider the following example:
- Imagine a regression dataset where each prediction can be compared with the actual target value.
- The first model (weak learner) might underfit the data, leading to significant errors (differences between predicted and actual values).
- The second model is trained specifically on these errors, trying to learn from the mistakes of the first model. It essentially adds its corrective predictions to the first model's predictions.
- This process continues iteratively, with each subsequent model focusing on the remaining errors from the previous ensemble.
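
The example above can be written compactly. The notation here is introduced for illustration only: $h_m$ is the prediction of the $m$-th weak learner, $F_m$ is the ensemble after $m$ learners, and $r_i^{(m)}$ is the error left on data point $i$ before learner $m$ is trained.

$$
\begin{aligned}
F_0(x) &= h_0(x) \\
r_i^{(m)} &= y_i - F_{m-1}(x_i) \\
F_m(x) &= F_{m-1}(x) + h_m(x), \quad \text{where } h_m \text{ is fitted to the pairs } \left(x_i, r_i^{(m)}\right)
\end{aligned}
$$

Each $h_m$ only has to model what the current ensemble still gets wrong, which is why simple learners are enough.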

### Comparison with Bagging
- Boosting is a sequential process. Each model builds upon the previous one.
- Bagging, on the other hand, trains models in parallel on different data subsets (bootstrap samples).
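
The contrast can be made concrete with a small pure-Python sketch. It assumes regression stumps as the base learner and a toy target `y = x`, which a single stump underfits. The helper names (`fit_stump`, `bagged_stumps`, `boosted_stumps`) and the data are hypothetical and chosen only for illustration.

```python
import random

def fit_stump(xs, targets):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    points = sorted(set(xs))
    overall = sum(targets) / len(targets)
    if len(points) < 2:                      # degenerate sample: constant model
        return (points[0], overall, overall)
    best = None
    for a, b in zip(points, points[1:]):
        threshold = (a + b) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

def bagged_stumps(xs, ys, n_models, rng):
    # Bagging: every model sees its own bootstrap sample and never looks at the
    # other models, so the loop iterations are independent (parallelisable).
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(stump_predict(m, x) for m in models) / len(models)

def boosted_stumps(xs, ys, n_models):
    # Boosting: each model is fitted to what the previous models left unexplained,
    # so iteration m cannot start before iteration m - 1 has finished.
    models = []
    residuals = list(ys)
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - stump_predict(stump, x) for x, r in zip(xs, residuals)]
    return lambda x: sum(stump_predict(m, x) for m in models)

xs = [i / 19 for i in range(20)]
ys = list(xs)                                # y = x: too smooth for one stump

single = fit_stump(xs, ys)
bagged = bagged_stumps(xs, ys, 20, random.Random(0))
boosted = boosted_stumps(xs, ys, 20)

single_mse = mse([stump_predict(single, x) for x in xs], ys)
bagged_mse = mse([bagged(x) for x in xs], ys)
boosted_mse = mse([boosted(x) for x in xs], ys)
print(single_mse, bagged_mse, boosted_mse)
```

With the same budget of 20 stumps, the boosted ensemble ends with a lower training error than the bagged one on this toy target, because the bagged stumps all underfit in roughly the same way.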

### Addressing bias in Boosting
- Boosting achieves bias reduction by focusing on the errors of previous models. Each iteration aims to learn from the shortcomings of the ensemble so far, gradually reducing the overall bias.
- Additionally, Boosting algorithms often adjust the weights of data points during training. Points that were misclassified by previous models receive higher weights, forcing the next model to pay more attention to those challenging examples.

### Boosting algorithms and considerations
- Common Boosting algorithms like AdaBoost and Gradient Boosting implement these principles in different ways.
- While Boosting offers advantages, it can be computationally expensive due to the sequential nature of training. However, the performance gains can often outweigh the increased training time.

# Additive Combining In Boosting
### Step-by-step breakdown
1. Simple model and residuals:
    - The average or the mean model is the simple starting point ($M_0$).
    - It predicts the average target value ($\hat{y}$), and the residuals ($error_{i0}$) are calculated for each data point as the difference between the actual value ($y_i$) and the average prediction: $error_{i0} = y_i - \hat{y}$.
2. Model on residuals:
    - A second model ($M_1$) is trained on the residuals ($error_{i0}$). $M_1$ aims to fit these errors.
    - In Boosting terminology, $M_1$ is called a weak learner. It typically has high bias (underfitting) and low variance (low complexity), focusing on specific patterns in the errors.
3. Additive prediction: The final prediction for each data point is obtained by adding the predictions from $M_0$ (the average model) and $M_1$ (the error model).
    - Final prediction: $h_0(x_i) + h_1(x_i)$, where $h_k(x_i)$ is the prediction of $M_k$ for $x_i$.
    - This is where the additive aspect comes in. The corrections from each model are progressively added.
4. Iterative process (optional):
    - Boosting algorithms can continue this process by training additional models ($M_2$, $M_3$, etc.) on the remaining errors from the previous ensemble.
    - Each subsequent model focuses on the errors the ensemble has not yet captured effectively.
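
The breakdown above can be traced on four data points. This is a sketch under stated assumptions: the data is invented, and $M_1$ is simplified to a stump with a fixed, hand-picked split (`SPLIT`) instead of a learned one.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean(values):
    return sum(values) / len(values)

def mse(residuals):
    return mean([r * r for r in residuals])

# Step 1: M_0 predicts the mean target; residuals are actual minus predicted.
y_mean = mean(ys)

def h0(x):
    return y_mean

residuals_0 = [y - h0(x) for x, y in zip(xs, ys)]

# Step 2: M_1 is fitted to the residuals. Here it is a stump that predicts the
# mean residual on each side of SPLIT (a hypothetical, hand-picked threshold).
SPLIT = 2.5
left_value = mean([r for x, r in zip(xs, residuals_0) if x <= SPLIT])
right_value = mean([r for x, r in zip(xs, residuals_0) if x > SPLIT])

def h1(x):
    return left_value if x <= SPLIT else right_value

# Step 3: the additive prediction h_0(x_i) + h_1(x_i).
predictions = [h0(x) + h1(x) for x in xs]

# Step 4: what is still unexplained would be the training target for M_2.
residuals_1 = [y - p for y, p in zip(ys, predictions)]

print(residuals_0)                          # [-3.0, -1.0, 1.0, 3.0]
print(predictions)                          # [3.0, 3.0, 7.0, 7.0]
print(mse(residuals_0), mse(residuals_1))   # 5.0 1.0
```

Adding $M_1$'s corrections reduces the mean squared error from 5.0 to 1.0, and the new residuals are what a third model would be trained on.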

### Example
Say that out of 100 data points, 80 have been predicted correctly by $M_0$. $M_1$ tries to learn from the remaining 20 data points. Now say that $M_1$ predicts 16 of those 20 correctly. Assuming $M_1$'s corrections do not disturb the 80 points that $M_0$ already got right, combining $M_0$'s predictions with $M_1$'s corrections gives 80 + 16 = 96 correct predictions out of 100, a better overall result than $M_0$ alone.

### Classification v. regression
- In regression, residuals represent the difference between the actual target value and the predicted value.
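- In classification, there is no numeric residual in the same sense, so boosting algorithms measure errors differently, for example as the difference between the true label and the predicted probability, or through weights on misclassified points.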
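
One common way to unify the two cases, used in gradient boosting, is to treat the "residual" as the negative gradient of the loss with respect to the current prediction. The symbols below are assumptions made for this illustration: $F(x_i)$ is the current ensemble output, $y_i \in \{0, 1\}$ in the classification case, and $p_i = \sigma(F(x_i))$ is the predicted probability.

$$
\begin{aligned}
\text{Regression (squared error):} \quad & L = \tfrac{1}{2}\left(y_i - F(x_i)\right)^2, & -\frac{\partial L}{\partial F(x_i)} &= y_i - F(x_i) \\
\text{Classification (log loss):} \quad & L = -\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right], & -\frac{\partial L}{\partial F(x_i)} &= y_i - p_i
\end{aligned}
$$

In both cases the next weak learner is fitted to "actual minus predicted"; for classification the prediction is a probability rather than a target value.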

### Addressing bias
By iteratively focusing on the errors of previous models, boosting gradually reduces the overall bias. Each model in the ensemble adds its corrective power to improve the final prediction.

### High bias v. high variance
- High bias: High training error and high testing error (model underfits the data).
- High variance: Low training error but high testing error (model overfits the training data).
- Boosting is particularly effective for models with high bias because it sequentially refines the predictions to reduce bias.