<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/Machine_Learning/Tree_based/03-01-03-00-tree-based-models-gradient-boosted-introduction-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

# 3. Gradient Boosted Trees (GBT)

Gradient Boosted Trees (GBT) are a powerful ensemble learning method that builds decision trees sequentially to minimize a loss function using gradient descent. Known for high predictive accuracy, especially with structured data, GBT is widely used in various applications. This section summarizes how GBT works and highlights different models, including XGBoost, LightGBM, and CatBoost, each with unique strengths for specific scenarios.




## Overview

**Gradient Boosted Trees (GBT)** is a powerful machine learning technique used for both regression and classification tasks. It is an ensemble method that builds multiple decision trees sequentially, where each tree corrects the errors of the previous ones by minimizing a loss function using gradient descent. GBT is widely used due to its high accuracy and ability to model complex, non-linear relationships in data. Popular implementations include **XGBoost**, **LightGBM**, and **CatBoost**.

### Key Components

-   `Loss Function`: Guides the optimization (e.g., mean squared error for regression, log-loss for classification, or custom losses).
-   `Learning Rate`: Controls the contribution of each tree, balancing speed and accuracy.
-   `Regularization`: Techniques like tree depth limits, L1/L2 penalties, or shrinkage prevent overfitting.
-   `Gradient Descent`: Updates are made in the direction of the negative gradient of the loss function.

### How Gradient Boosted Trees (GBT) Work

GBT builds an ensemble of decision trees in a **sequential** manner, unlike bagging methods (e.g., Random Forest), which build trees independently. The core idea is to iteratively improve predictions by fitting new trees to the residuals (errors) of the previous model, guided by gradient descent to minimize a specified loss function.

1.  Initialization

-   Start with an initial prediction, typically a constant value (e.g., the mean of the target variable for regression or log-odds for classification).

2.  Compute Residuals

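-   Calculate the residuals (errors) between the current predictions and the actual target values based on a loss function (e.g., mean squared error for regression, log-loss for classification). For a general loss these are the pseudo-residuals, i.e., the negative gradient of the loss with respect to the current predictions; for squared error they reduce to the ordinary residuals.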

3.  Fit a Decision Tree

-   Train a new decision tree to predict the residuals. The tree is typically shallow (e.g., limited depth) to prevent overfitting.

4.  Update Predictions

-   Add the new tree’s predictions to the existing model, scaled by a **learning rate** (a small value, e.g., 0.1, to control the contribution of each tree and ensure stability).

5.  Iterate

-   Repeat steps 2–4 for a specified number of trees (iterations) or until the loss converges.

6.  Final Prediction

-   Combine the predictions from all trees (weighted by the learning rate) to produce the final output:
    -   For `regression`: Sum the initial prediction and all tree outputs.
    -   For `classification`: Combine tree outputs (e.g., log-odds) and apply a transformation (e.g., sigmoid for binary classification).
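
In symbols, the six steps above amount to an additive update. The notation here ($F_m$ for the model after $m$ trees, $h_m$ for the $m$-th tree, $\nu$ for the learning rate, $r_{im}$ for the pseudo-residuals, $M$ for the number of trees) is introduced only for illustration and is not taken from a specific package:

$$
F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c)
$$

$$
r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x),
\qquad m = 1, \dots, M
$$

where each tree $h_m$ is fitted to the pairs $(x_i, r_{im})$. For the squared-error loss $L = \tfrac{1}{2}(y - F)^2$, the pseudo-residual is simply $r_{im} = y_i - F_{m-1}(x_i)$. For binary classification with log-loss, $F_M(x)$ is on the log-odds scale and the predicted probability is

$$
p(x) = \frac{1}{1 + e^{-F_M(x)}}
$$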

Below is a visual representation of the GBT process:

![alt text](http://drive.google.com/uc?export=view&id=1mX4VA1y7gj4dTO8FWcIY3QZGq08brJ_T)
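
The steps above can also be sketched in a few lines of R. This is a minimal illustration for squared-error regression, not how `gbm`, `xgboost`, or the other packages are implemented. It assumes the `rpart` package (shipped with standard R installations) for the individual trees. The simulated data, the variable names, and the helper `predict_gbt()` are hypothetical.

```r
library(rpart)

# Simulated regression data (hypothetical, for illustration only)
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x)

n_trees <- 100        # number of boosting iterations
learning_rate <- 0.1  # shrinkage applied to every tree

# Step 1: initialize with a constant prediction (the mean of the target)
f0 <- mean(y)
pred <- rep(f0, n)

trees <- vector("list", n_trees)
train_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # Step 2: residuals; for squared error these equal the negative gradient
  dat$res <- y - pred

  # Step 3: fit a shallow regression tree to the residuals
  trees[[m]] <- rpart(res ~ x, data = dat,
                      control = rpart.control(maxdepth = 2, cp = 0))

  # Step 4: add the new tree's predictions, scaled by the learning rate
  pred <- pred + learning_rate * predict(trees[[m]], dat)

  # Step 5: track the training loss across iterations
  train_mse[m] <- mean((y - pred)^2)
}

# Step 6: final prediction = initial value + sum of all scaled tree outputs
predict_gbt <- function(newdata) {
  out <- rep(f0, nrow(newdata))
  for (tree in trees) {
    out <- out + learning_rate * predict(tree, newdata)
  }
  out
}

# Training error before boosting and after the last iteration
c(initial = mean((y - f0)^2), final = train_mse[n_trees])
```

Calling `predict_gbt(data.frame(x = c(1, 5, 9)))` returns predictions for new inputs. Lowering `learning_rate` generally requires more trees to reach the same training error, which is the speed/accuracy trade-off noted under Key Components.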



### Advantages

-   High predictive accuracy, often outperforming other algorithms.
-   Captures complex, non-linear relationships and feature interactions.
-   Flexible with various loss functions and regularization techniques.

### Disadvantages

-   Computationally intensive and sensitive to hyperparameter tuning.
-   Less interpretable than single decision trees.
-   Can overfit if not properly regularized.

### Advantages of Bagging Methods (for Comparison)

For contrast, bagging ensembles such as Random Forest and Extra Trees, which train their trees independently rather than sequentially, have the following properties:

-   Reduces overfitting by averaging out noise across multiple trees.
-   Handles high-dimensional data well (especially Random Forest and Extra Trees).
-   Provides feature importance scores (e.g., in Random Forest).
-   Parallelizable, as trees are trained independently.

### Limitations of Bagging Methods (for Comparison)

-   Less interpretable than a single decision tree.
-   Computationally expensive for large datasets or many trees.
-   May not perform as well as boosting methods (e.g., Gradient Boosting, XGBoost) for certain tasks.

## Different Types of GBT Models

Here's a brief overview of XGBoost, LightGBM, CatBoost, AdaBoost, and GBM models, focusing on their key characteristics and use cases:

1.  **GBM (Gradient Boosting Machine)**:

   -   `Overview`: A general gradient boosting framework that builds trees sequentially, minimizing a loss function (e.g., mean squared error) via gradient descent.
   -   `Key Features`: Flexible loss functions, supports regression and classification, and focuses on reducing residuals in each step.
   -   `Use Cases`: Applied in predictive modeling tasks like financial forecasting, medical diagnosis, or customer churn prediction.
   -   `Strengths`: Strong theoretical foundation, customizable, and effective for small-to-medium datasets.
   -   `Weaknesses`: Slower than optimized frameworks like XGBoost or LightGBM; sensitive to hyperparameter settings.
    
2.  **LightGBM (Light Gradient Boosting Machine)**:

   -   `Overview`: A gradient boosting framework by Microsoft, optimized for efficiency and large datasets. It uses histogram-based learning and leaf-wise tree growth.
   -   `Key Features`: Faster training than XGBoost, lower memory usage, and support for categorical features without one-hot encoding.
   -   `Use Cases`: Suitable for large-scale datasets in tasks like fraud detection, recommendation systems, and time-series forecasting.
   -   `Strengths`: High speed, memory efficiency, and good performance on imbalanced data.
   -   `Weaknesses`: May overfit on small datasets; less interpretable due to leaf-wise splitting.
   
3.  **XGBoost (Extreme Gradient Boosting)**:

   -   `Overview`: An optimized gradient boosting framework designed for speed and performance. It uses a regularized objective function to reduce overfitting and parallelizes the split search within each tree (the trees themselves are still built sequentially).
   -   `Key Features`: Handles missing values, supports custom loss functions, and includes L1/L2 regularization. It uses a second-order approximation for optimization.
   -   `Use Cases`: Widely used in structured/tabular data tasks like classification, regression, and ranking problems (e.g., Kaggle competitions).
   -   `Strengths`: High accuracy, scalability, and robust handling of diverse datasets.
   -   `Weaknesses`: Can be computationally intensive and requires careful hyperparameter tuning.
   
4.  **CatBoost (Categorical Boosting)**:

   -   `Overview`: A gradient boosting library by Yandex, designed to handle categorical features natively. It uses ordered boosting to reduce overfitting.
   -   `Key Features`: Automatic categorical feature encoding, robust to noisy data, and symmetric tree structures for faster predictions.
   -   `Use Cases`: Ideal for datasets with many categorical variables, such as customer segmentation or risk modeling.
   -   `Strengths`: Reduces preprocessing needs, strong out-of-box performance, and good handling of categorical data.
   -   `Weaknesses`: Slower training compared to LightGBM; less flexible for custom loss functions.
   
5.  **AdaBoost (Adaptive Boosting)**:

   -   `Overview`: A boosting algorithm that combines weak learners (e.g., shallow decision trees) by focusing on misclassified samples in each iteration.
   -   `Key Features`: Assigns weights to samples, increasing weights for misclassified ones, and combines predictions via weighted voting or averaging.
   -   `Use Cases`: Used in simpler classification tasks or when interpretability is needed, like face detection or text classification.
   -   `Strengths`: Simple, less prone to overfitting than some other boosting methods, and works well with weak models.
   -   `Weaknesses`: Sensitive to noisy data and outliers; less competitive compared to modern gradient boosting methods.
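
To make the sample-reweighting idea behind AdaBoost concrete, here is a minimal base-R sketch of a single boosting round. It assumes the classic discrete AdaBoost formulation with labels coded as -1/+1. The toy labels, the weak learner's predictions `h`, and all variable names are hypothetical, and packages such as `adabag` may differ in detail.

```r
# Toy labels in {-1, +1} and the predictions of one hypothetical weak learner
y <- c(1,  1, -1, -1, 1, -1, 1, -1)
h <- c(1, -1, -1, -1, 1,  1, 1, -1)  # wrong at positions 2 and 6

# Start with uniform sample weights
w <- rep(1 / length(y), length(y))

# Weighted error of the weak learner
miss <- h != y
err <- sum(w[miss])  # 0.25

# Weight of this learner in the final vote (larger when the error is smaller)
alpha <- 0.5 * log((1 - err) / err)

# Increase weights of misclassified samples, decrease the others, renormalize
w_new <- w * exp(-alpha * y * h)
w_new <- w_new / sum(w_new)

round(w_new, 4)
```

After this update the two misclassified samples carry half of the total weight, so the next weak learner is pushed to focus on them. The final classifier is the sign of the `alpha`-weighted sum of all weak learners' predictions.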


Below is a table summarizing the key differences between **XGBoost**, **LightGBM**, **CatBoost**, **AdaBoost**, and **GBM**, based on algorithmic approach, categorical feature handling, speed/scalability, overfitting/robustness, tree structure, ease of use, and use case suitability.

| **Aspect** | **XGBoost** | **LightGBM** | **CatBoost** | **AdaBoost** | **GBM** |
|------------|------------|------------|------------|------------|------------|
| **Algorithmic Approach** | Gradient boosting with second-order approximation, L1/L2 regularization | Histogram-based gradient boosting, leaf-wise tree growth | Ordered boosting, symmetric trees | Weight-based boosting, combines weak learners via voting | Standard gradient boosting, minimizes loss via gradient descent |
| **Categorical Features** | Requires one-hot encoding or preprocessing | Native support via feature value splitting | Native, automatic encoding with target statistics | Requires preprocessing (e.g., one-hot encoding) | Requires preprocessing (e.g., one-hot encoding) |
| **Speed & Scalability** | Fast, parallelized, but memory-intensive | Fastest, low memory via histogram binning, great for large datasets | Slower training, fast predictions, good for medium datasets | Fast for small datasets, scales poorly | Slower, less optimized for large datasets |
| **Overfitting & Robustness** | Regularization reduces overfitting, sensitive to noise without tuning | Leaf-wise growth risks overfitting on small data, needs parameter control | Ordered boosting reduces overfitting, robust to noisy/categorical data | Less overfitting, highly sensitive to outliers/noise | Prone to overfitting without tuning; relies on shrinkage and subsampling rather than explicit L1/L2 penalties |
| **Tree Structure** | Level-wise growth, balanced trees | Leaf-wise growth, deeper asymmetric trees | Symmetric trees for stability and speed | Shallow trees (e.g., stumps) as weak learners | Level-wise growth, balanced trees |
| **Ease of Use & Tuning** | Needs careful tuning (learning rate, max depth) | Less tuning than XGBoost, but leaf-wise needs attention | Minimal tuning, strong out-of-box performance | Simple, few parameters, less competitive | Extensive tuning needed, less user-friendly |
| **Use Case Suitability** | Structured data, Kaggle, ranking tasks | Large-scale, high-dimensional (e.g., fraud detection, recommendations) | Categorical-heavy data (e.g., customer segmentation, risk modeling) | Simple tasks, interpretability (e.g., text classification, face detection) | Custom loss functions, small/medium predictive tasks (e.g., forecasting) |

This table highlights the trade-offs, with **LightGBM** excelling in speed, **CatBoost** in categorical feature handling, **XGBoost** in flexibility, **AdaBoost** in simplicity, and **GBM** in customization.
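
The "second-order approximation" and regularization listed for XGBoost can be written out explicitly. The following is a sketch based on the formulation in the XGBoost paper (Chen and Guestrin, 2016). The symbols are introduced here for illustration: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the current prediction for sample $i$, $f_t$ is the tree added at round $t$, $T$ is its number of leaves, $w_j$ is the weight of leaf $j$, and $I_j$ is the set of samples falling into leaf $j$.

$$
\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

Minimizing this expression for a fixed tree structure gives the leaf weights in closed form:

$$
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
$$

Here $\lambda$ is the L2 penalty and $\gamma$ penalizes the number of leaves. The library additionally offers an L1 penalty on the leaf weights, which is omitted from this sketch for simplicity.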

## R-Packages for GBT Models

Below is a table listing R packages for **XGBoost**, **LightGBM**, **CatBoost**, **AdaBoost**, and **GBM** models, along with their key details. These packages enable the implementation of these boosting algorithms in R for machine learning tasks.

| **Model** | **R Package** | **Description** | **Installation** | **Key Features** |
|---------------|---------------|---------------|---------------|---------------|
| **XGBoost** | `xgboost` | A scalable and efficient implementation of extreme gradient boosting. | `install.packages("xgboost")` | Parallel tree boosting, supports custom loss functions, handles missing values, L1/L2 regularization. |
| **LightGBM** | `lightgbm` | A high-performance gradient boosting framework optimized for speed and memory. | `install.packages("lightgbm")` | Histogram-based learning, leaf-wise tree growth, native categorical feature support, GPU support. |
| **CatBoost** | `catboost` | A gradient boosting library designed for categorical feature handling. | For Linux: `devtools::install_url('https://github.com/catboost/catboost/releases/download/v1.2.2/catboost-R-Linux-1.2.2.tgz')` | Automatic categorical feature encoding, symmetric trees, ordered boosting, GPU support. |
| **AdaBoost** | `adabag` | Implements adaptive boosting for classification tasks. | `install.packages("adabag")` | Focuses on reweighting misclassified samples, supports decision stumps, simple to use. |
| **GBM** | `gbm` | General gradient boosting machine for regression and classification. | `install.packages("gbm")` | Flexible loss functions, supports regression/classification, but slower and less optimized. |
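
As a quick orientation before the dedicated chapters, the following sketch fits a small model with the `gbm` package on R's built-in `mtcars` data. It is only an illustration: the choice of predictors and the hyperparameter values are arbitrary assumptions, not recommendations, and the code is skipped if `gbm` is not installed.

```r
if (requireNamespace("gbm", quietly = TRUE)) {
  set.seed(123)

  # Fit a gradient boosted regression model (squared-error loss)
  fit <- gbm::gbm(
    mpg ~ wt + hp + disp,
    data = mtcars,
    distribution = "gaussian",  # loss function for regression
    n.trees = 200,              # number of boosting iterations
    interaction.depth = 2,      # depth of each tree
    shrinkage = 0.05,           # learning rate
    n.minobsinnode = 3,         # small value because mtcars has only 32 rows
    bag.fraction = 0.8          # fraction of rows sampled for each tree
  )

  # Predictions on the training data using all 200 trees
  pred <- predict(fit, newdata = mtcars, n.trees = 200)
  rmse <- sqrt(mean((mtcars$mpg - pred)^2))
  print(rmse)

  # Relative influence of each predictor
  print(summary(fit, plotit = FALSE))
} else {
  message("Package 'gbm' is not installed; run install.packages(\"gbm\") first.")
}
```

The reported RMSE is a training error and therefore optimistic; a held-out set or cross-validation is needed to choose `n.trees` and `shrinkage` in practice.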

## Summary and Conclusion

Gradient Boosted Trees (GBT) are a powerful ensemble learning technique that builds decision trees sequentially to minimize a loss function using gradient descent. They excel in predictive accuracy, especially for structured data, and are widely used in various applications. This section provided an overview of GBT, how it works, and the different types of GBT models available. Each model has unique features and strengths, making them suitable for different scenarios.

## Additional Resources and Further Reading

Here are some recommended resources and further reading materials for learning about Gradient Boosted Trees (GBT), a powerful ensemble machine learning technique. These include books, academic papers, online courses, and tutorials, with links where available:

### Books

1. **"An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani**
   - **Description**: Provides an accessible introduction to boosting methods, including gradient boosting with trees.
   - **Link**: [Springer](https://link.springer.com/book/10.1007/978-1-4614-7138-7) (free PDF available at [ISL Book](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)).
   - **Relevance**: Chapter 8 (Tree-Based Methods) covers boosting, including gradient-based approaches.

2. **"The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman**
   - **Description**: Offers a deep dive into gradient boosting, including its theoretical foundations and practical implementations with trees.
   - **Link**: [Springer](https://link.springer.com/book/10.1007/978-0-387-84858-7) (free PDF available at [ESL Book](https://hastie.su.domains/ElemStatLearn/)).
   - **Relevance**: Chapter 10 provides detailed insights into gradient boosting and its tree-based variants.

### Academic Papers
3. **"Stochastic Gradient Boosting" by Jerome H. Friedman (1999)**
   - **Description**: Extends gradient boosting by fitting each tree to a random subsample of the training data, which can improve both accuracy and speed.
   - **Link**: [Stanford](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf) (free PDF).
   - **Relevance**: Foundational work on gradient boosting, with applications to trees.

4. **"Greedy Function Approximation: A Gradient Boosting Machine" by Jerome H. Friedman (2001)**
   - **Description**: The seminal paper introducing the gradient boosting machine, with a focus on decision trees as base learners.
   - **Link**: [Annals of Statistics](https://doi.org/10.1214/aos/1013203451).
   - **Relevance**: Details the algorithm behind modern GBT implementations like XGBoost.

### Online Courses and Tutorials
5. **"Machine Learning by Andrew Ng" (Coursera)**
   - **Description**: Covers ensemble methods, including an introduction to boosting techniques that lead to GBT.
   - **Link**: [Coursera](https://www.coursera.org/learn/machine-learning) (free to audit, subscription for certificate).
   - **Relevance**: Week 6 touches on ensemble methods, including boosting.

6. **"Advanced Machine Learning with TensorFlow" (Coursera)**
   - **Description**: Includes sections on gradient boosting with trees, using TensorFlow implementations.
   - **Link**: [Coursera](https://www.coursera.org/learn/advanced-machine-learning-tensorflow) (free to audit, subscription for certificate).
   - **Relevance**: Focuses on practical GBT applications.

### Websites and Documentation
7. **XGBoost Documentation**
   - **Description**: Official documentation for the XGBoost library, a leading implementation of gradient boosted trees.
   - **Link**: [XGBoost](https://xgboost.readthedocs.io/en/stable/) (free).
   - **Relevance**: Comprehensive guide with examples and parameter tuning for GBT.

8. **LightGBM Documentation**
   - **Description**: Documentation for LightGBM, another efficient GBT framework optimized for large datasets.
   - **Link**: [LightGBM](https://lightgbm.readthedocs.io/en/latest/) (free).
   - **Relevance**: Practical resource for implementing GBT with a focus on speed and scalability.

9. **Scikit-Learn Documentation**
   - **Description**: Provides guides and examples for the `GradientBoostingClassifier` and `GradientBoostingRegressor` in Python.
   - **Link**: [Scikit-Learn](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) (free).
   - **Relevance**: Includes theory and code examples for GBT.

### Additional Resources
10. **"Gradient Boosting from Scratch" (Towards Data Science)**
    - **Description**: A blog post explaining the intuition and mechanics of gradient boosted trees with step-by-step examples.
    - **Link**: [Towards Data Science](https://towardsdatascience.com/understanding-gradient-boosting-machines-9be756fe76ab) (free).
    - **Relevance**: Beginner-friendly overview with practical insights.

11. **YouTube: "Gradient Boosting Part 1 (of 4): Regression Main Ideas" by StatQuest with Josh Starmer**
    - **Description**: A video series explaining gradient boosting with trees, starting with regression concepts.
    - **Link**: [YouTube](https://www.youtube.com/watch?v=3CC4N4z3GJc) (free).
    - **Relevance**: Visual and intuitive introduction to GBT.


## Table of Contents


This section of the tutorial is divided into several parts, each focusing on a specific type of gradient boosted tree model. The models covered include:

3.1 [Gradient Boosting Machine (GBM)](03-01-03-02-tree-based-models-gradient-boosted-gbm-r.qmd)

3.2 [Light Gradient Boosting Machine (LightGBM)](03-01-03-02-tree-based-models-gradient-boosted-lightgbm-r.qmd)

3.3 [Extreme Gradient Boosting (XGBoost)](03-01-03-03-tree-based-models-gradient-boosted-xgboost-r.qmd)

3.4 [Categorical Boosting (CatBoost)](03-01-03-04-tree-based-models-gradient-boosted-catboost.qmd)

3.5 [Adaptive Boosting (AdaBoost)](03-01-03-05-tree-based-models-gradient-boosted-adaboost-r.qmd)

3.6 [Gradient Boosted Survival Model](03-01-03-06-tree-based-models-gradient-boosted-survival-model-r.qmd)