# **LightGBM** 💡
LightGBM (Light Gradient Boosting Machine) is an open-source, distributed gradient boosting framework developed by Microsoft and designed for efficient, scalable training on large datasets. It combines histogram-based split finding with leaf-wise tree growth, plus techniques such as Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to train faster and use less memory than boosting implementations that scan every sorted feature value, often with comparable or better accuracy.

![light](https://media.geeksforgeeks.org/wp-content/uploads/20250519151930194905/level_wise_tree_growth_1.webp)

## The working (LightGBM Classification)

* #### Get a base prediction
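Just like XGBoost, we start with a naive guess for all data points, for example a probability of 0.5, which corresponds to a log(odds) of 0. By default, LightGBM instead starts from the average of the labels, as the `BoostFromScore` line in the training log under Implementation shows.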

$$p = \text{Base probability}$$
$$\text{prediction}_{0} = \log(\frac{p}{1-p})$$
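
As a minimal sketch of this step, the snippet below converts a base probability into a log(odds) score. The helper name `base_log_odds` is hypothetical and not part of LightGBM; the second call uses the class counts reported in the training log of the Implementation section (268 positives out of 426 rows).

```python
import math

def base_log_odds(p):
    """Convert a base probability p (0 < p < 1) into a log(odds) score."""
    return math.log(p / (1.0 - p))

# A 50/50 guess corresponds to a log(odds) of 0.
print(base_log_odds(0.5))

# Using the positive-class rate of the training set as the base probability.
print(base_log_odds(268 / 426))
```

The second value should agree with the `initscore` printed by LightGBM in the training log, up to rounding.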

* #### Calculate Gradients and Hessians
In XGBoost, we calculated "residuals". In LightGBM, we formalize this as **Gradients ($g_i$)** (first derivative of the loss) and **Hessians ($h_i$)** (second derivative of the loss).
*Note: For classification, the Gradient is effectively the residual with its sign flipped, since the residual is $r_i = y_i - p_i$.*

$$g_{i} = p_{i} - y_{i} \quad (\text{similar to residual } r_i)$$
$$h_{i} = p_{i} \cdot (1 - p_{i}) \quad (\text{weight of the data point})$$
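
The two formulas above can be sketched in plain Python as follows. The function names `sigmoid` and `logistic_grad_hess` are illustrative only, and the four toy labels are made up for the example.

```python
import math

def sigmoid(v):
    """Map a log(odds) score to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_grad_hess(scores, labels):
    """Return per-point gradients (p - y) and hessians (p * (1 - p))."""
    grads, hess = [], []
    for v, y in zip(scores, labels):
        p = sigmoid(v)
        grads.append(p - y)
        hess.append(p * (1.0 - p))
    return grads, hess

# All points start at a log(odds) of 0, i.e. a probability of 0.5.
scores = [0.0, 0.0, 0.0, 0.0]
labels = [1, 0, 1, 1]
g, h = logistic_grad_hess(scores, labels)
print(g)  # [-0.5, 0.5, -0.5, -0.5]
print(h)  # [0.25, 0.25, 0.25, 0.25]
```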

* #### Apply GOSS (Gradient-based One-Side Sampling)
*This is the unique "Light" step.* Instead of using **all** data points to build the tree, LightGBM assumes data points with **small gradients** (small $|g_i|$) are already well-trained.

1. **Sort** all data points by the absolute value of their gradient, $|g_i|$.
2. **Keep** the top a% of data points (large gradients).
3. **Randomly Sample** b% of the full dataset from the remaining data (small gradients).
4. **Amplify** the sampled small gradients by a constant factor $\frac{1-a}{b}$ (with $a$ and $b$ written as fractions, e.g. 0.2 and 0.1) to keep the math balanced.

$$\text{Data Used} = \text{Top } a\% + \text{Weighted Sample of } b\%$$
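
The four GOSS steps can be sketched as below. This is a simplified illustration, not LightGBM's internal implementation: `goss_sample` is a hypothetical helper, `a` and `b` are fractions of the full dataset, and the gradients are toy values.

```python
import random

def goss_sample(grads, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step."""
    n = len(grads)
    rng = random.Random(seed)
    # 1. Sort indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top_k = int(a * n)
    rand_k = int(b * n)
    # 2. Keep the points with the largest gradients.
    top = order[:top_k]
    # 3. Randomly sample from the remaining small-gradient points.
    rest = order[top_k:]
    sampled = rng.sample(rest, min(rand_k, len(rest)))
    # 4. Amplify the sampled small-gradient points by (1 - a) / b.
    weights = {i: 1.0 for i in top}
    weights.update({i: (1.0 - a) / b for i in sampled})
    return top + sampled, weights

grads = [0.9, -0.8, 0.05, 0.1, -0.02, 0.3, -0.15, 0.04, 0.2, -0.01]
indices, weights = goss_sample(grads)
print(indices)
print(weights)
```

With 10 points, `a=0.2` and `b=0.1`, the two largest-gradient points are always kept with weight 1, and one randomly chosen small-gradient point is kept with a weight of about 8.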

* #### Make the tree (Leaf-wise Growth)
Unlike XGBoost, which by default grows trees "level-wise" (layer by layer), LightGBM grows "leaf-wise". It looks for the **single leaf** whose split yields the largest gain and splits only that one.

We use a histogram-based algorithm to find the split. The decision is based on the **Gain**, calculated using our GOSS-weighted sums.

$$G = \sum g_{i} \quad (\text{Sum of gradients in node})$$
$$H = \sum h_{i} \quad (\text{Sum of hessians in node})$$
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_{Root}^2}{H_{Root} + \lambda} \right]$$

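*Note: The structure is identical to the XGBoost "Similarity" formula, just substituting the sum of residuals with $G$ and the denominator with $H + \lambda$. "Root" here means the node being split, before the split is applied.*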
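
A minimal sketch of the gain formula, assuming the gradient and hessian sums of the two candidate children are already known. The name `split_gain` and the parameter `lam` (standing in for $\lambda$) are illustrative, and the numbers are toy values.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=0.0):
    """Gain of splitting a node into a left and a right child."""
    def score(g, h):
        return g * g / (h + lam)

    # The node being split contains every point of both children.
    g_parent = g_left + g_right
    h_parent = h_left + h_right
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_parent, h_parent))

# Left child: G = -1.5, H = 0.75; right child: G = 0.5, H = 0.25.
print(split_gain(-1.5, 0.75, 0.5, 0.25, lam=0.0))  # 1.5
```

Raising `lam` shrinks all three score terms, so small or noisy splits receive a lower gain.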

* #### Calculate Leaf Output values
Once the tree stops growing (based on `num_leaves` or `max_depth`), we calculate the output value for each leaf.

$$\text{Leaf Value} (w) = - \frac{G}{H + \lambda}$$
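
As a small illustration, the leaf value follows directly from the gradient and hessian sums of the points that land in the leaf. `leaf_value` is a hypothetical helper name, and the inputs are toy numbers.

```python
def leaf_value(g_sum, h_sum, lam=0.0):
    """Output value of a leaf: w = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

# A leaf with G = -1.5 and H = 0.75 (mostly under-predicted positives).
print(leaf_value(-1.5, 0.75, lam=0.0))  # 2.0
print(leaf_value(-1.5, 0.75, lam=1.0))  # smaller, because lambda shrinks the value
```

A negative gradient sum means the current predictions are too low on average, so the leaf pushes the log(odds) up.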

* #### Make Prediction (Update)
We update the log(odds) for each data point by adding the prediction from the new tree, scaled by the learning rate.

$$v_{new} = v_{old} + \eta \cdot w$$
$$\eta = \text{Learning Rate}$$
$$w = \text{Leaf value for the data point}$$

* #### Repeat
Repeat the process:

1. Calculate new $g_i$ and $h_i$ based on new predictions.
2. Resample data using GOSS.
3. Build a new tree leaf-wise.

The final probability is obtained by passing the final sum through the Sigmoid function:

$$p = \sigma(v_{final}) = \frac{1}{1 + e^{-v_{final}}}$$
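
The update rule and the final sigmoid can be sketched together. The three leaf values below are hypothetical stand-ins for the outputs of three successive trees for a single data point; `update_score` is an illustrative helper name.

```python
import math

def sigmoid(v):
    """Map a log(odds) score to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def update_score(v_old, w, eta=0.1):
    """Add one tree's leaf value, scaled by the learning rate."""
    return v_old + eta * w

v = 0.0  # base log(odds)
for w in [2.0, 1.5, 1.0]:  # hypothetical leaf values from three trees
    v = update_score(v, w, eta=0.1)

print(v)           # accumulated log(odds), about 0.45
print(sigmoid(v))  # final probability, a little above 0.6
```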

## The working (LightGBM Regression)

* #### Get a base prediction
For regression, the simplest start is the average (mean) of the target values.

$$p = \text{Average of all target values } (y)$$

* #### Calculate Gradients and Hessians
We calculate the first derivative (Gradient) and second derivative (Hessian) of the Loss Function (usually Mean Squared Error).
*Note: For Squared Error, the Hessian is just 1. This makes the math very clean.*

$$g_{i} = p_{i} - y_{i} \quad (\text{This is effectively the residual})$$
$$h_{i} = 1 \quad (\text{Constant for MSE loss})$$
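
Where these two values come from can be shown in a short derivation. It assumes the per-point squared error is written with a factor of $\frac{1}{2}$, which is a common convention but an assumption here, since the text only names Mean Squared Error.

$$L_i = \frac{1}{2}(y_i - p_i)^2$$
$$g_i = \frac{\partial L_i}{\partial p_i} = -(y_i - p_i) = p_i - y_i$$
$$h_i = \frac{\partial^2 L_i}{\partial p_i^2} = 1$$

Without the $\frac{1}{2}$ factor, both derivatives pick up a factor of 2, which cancels in the ratio $-G/H$ when $\lambda = 0$.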

* #### Apply GOSS (Gradient-based One-Side Sampling)
We sort the data based on how "wrong" our prediction is (the absolute value of the Gradient, $|g_i|$).

1. **Sort** data by error size $|g_i|$.
2. **Keep** the top a% (Large errors).
3. **Sample** b% of the full dataset from the rest (Small errors).
4. **Amplify** the small error samples by $\frac{1-a}{b}$ (with $a$ and $b$ written as fractions).

$$\text{Data Used} = \text{Top } a\% + \text{Weighted Sample of } b\%$$

* #### Make the tree (Leaf-wise Growth)
We find the best split by calculating the **Gain**.
Since $h_i = 1$, the sum of Hessians ($H$) is simply the **count of data points ($n$)** in that node.

$$G = \sum g_{i} \quad (\text{Sum of residuals})$$
$$H = \sum h_{i} = n \quad (\text{Count of samples})$$
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{n_L + \lambda} + \frac{G_R^2}{n_R + \lambda} - \frac{G_{Root}^2}{n_{Root} + \lambda} \right]$$

*Note: With $\lambda = 0$, maximizing this Gain is the same as choosing the split that most reduces the squared error (Variance) of the residuals within the child nodes.*
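
The link between the Gain and variance reduction can be checked numerically. The sketch below uses toy gradients and hypothetical helper names (`sse`, `mse_gain`); with `lam=0.0` the Gain comes out as exactly half the drop in the sum of squared errors.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def mse_gain(g_left, g_right, lam=0.0):
    """Gain for squared-error loss, where H is just the sample count."""
    def score(g):
        return sum(g) ** 2 / (len(g) + lam)
    return 0.5 * (score(g_left) + score(g_right) - score(g_left + g_right))

g_left = [-2.0, -1.0]   # gradients of points sent to the left child
g_right = [1.0, 3.0]    # gradients of points sent to the right child

gain = mse_gain(g_left, g_right)
reduction = sse(g_left + g_right) - sse(g_left) - sse(g_right)
print(gain)       # 6.125
print(reduction)  # 12.25
```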

* #### Calculate Leaf Output values
We calculate the output value for each leaf. This represents the "correction" we want to add to our prediction.

$$\text{Leaf Value } (w) = - \frac{G}{H + \lambda} = - \frac{\sum (p - y)}{n + \lambda}$$

*In simple terms: The Leaf Value is the average of the residuals ($y - p$) in that leaf, shrunk slightly by $\lambda$.*
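
A quick sketch of this interpretation, using made-up predictions and targets. `regression_leaf_value` is a hypothetical helper, not a LightGBM function.

```python
def regression_leaf_value(preds, targets, lam=0.0):
    """Leaf value for squared-error loss: -sum(p - y) / (n + lambda)."""
    g_sum = sum(p - y for p, y in zip(preds, targets))
    return -g_sum / (len(preds) + lam)

preds = [10.0, 10.0, 10.0]
targets = [12.0, 13.0, 14.0]  # residuals y - p are 2, 3 and 4

print(regression_leaf_value(preds, targets, lam=0.0))  # 3.0, the mean residual
print(regression_leaf_value(preds, targets, lam=1.0))  # 2.25, shrunk by lambda
```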

* #### Make Prediction (Update)
We update the prediction for each data point by adding the leaf value, scaled by the learning rate.

$$p_{new} = p_{old} + \eta \cdot w$$
$$\eta = \text{Learning Rate}$$

* #### Repeat
Repeat the process until the number of trees is reached:

1. Calculate new residuals ($g_i$) based on the updated $p$.
2. Resample using GOSS.
3. Build a new tree to predict those residuals.

The final output is simply the sum of the base prediction and all the weighted tree outputs:

$$p_{final} = p_{initial} + \sum_{k=1}^{T} \eta \cdot w_k$$
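
The whole regression loop can be sketched end to end in plain Python. This is a deliberately simplified illustration, not how LightGBM is implemented: it uses a single feature, one split per tree (a "stump"), an exhaustive threshold search instead of histograms, and no GOSS. All function names and the toy data are hypothetical.

```python
def best_stump(xs, grads, lam):
    """Find the threshold with the highest gain, assuming h_i = 1."""
    def score(g_sum, n):
        return g_sum * g_sum / (n + lam)

    total_g, total_n = sum(grads), len(grads)
    best = None
    # Skip the largest value so the right child is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [g for x, g in zip(xs, grads) if x <= threshold]
        right = [g for x, g in zip(xs, grads) if x > threshold]
        gain = 0.5 * (score(sum(left), len(left))
                      + score(sum(right), len(right))
                      - score(total_g, total_n))
        if best is None or gain > best[0]:
            w_left = -sum(left) / (len(left) + lam)
            w_right = -sum(right) / (len(right) + lam)
            best = (gain, threshold, w_left, w_right)
    return best[1:]

def fit_boosted_stumps(xs, ys, n_trees=50, eta=0.1, lam=0.0):
    """Boost stumps on squared-error loss, starting from the mean."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        grads = [p - y for p, y in zip(preds, ys)]
        threshold, w_left, w_right = best_stump(xs, grads, lam)
        # Store leaf values already scaled by the learning rate.
        trees.append((threshold, eta * w_left, eta * w_right))
        preds = [p + trees[-1][1] if x <= threshold else p + trees[-1][2]
                 for p, x in zip(preds, xs)]
    return base, trees

def predict(base, trees, x):
    """Base prediction plus the sum of all scaled leaf values."""
    return base + sum(wl if x <= t else wr for t, wl, wr in trees)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
base, trees = fit_boosted_stumps(xs, ys)
print(predict(base, trees, 1), predict(base, trees, 6))
```

On this toy data the targets jump between `x = 3` and `x = 4`, so the early stumps split there and the predictions for small and large `x` move apart over successive trees.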

## Implementation

In [1]:
# Get the data set

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [2]:
import pandas as pd

X, y = pd.DataFrame(data['data'], columns=data['feature_names']), data['target']
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [3]:
# split into train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [8]:
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier()
lgb_model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 268, number of negative: 158
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001198 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4272
[LightGBM] [Info] Number of data points in the train set: 426, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.629108 -> initscore=0.528392
[LightGBM] [Info] Start training from score 0.528392


In [10]:
y_pred = lgb_model.predict(X_test)
y_pred

array([1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1])

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_pred=y_pred, y_true=y_test))
print(confusion_matrix(y_pred=y_pred, y_true=y_test))
print(classification_report(y_pred=y_pred, y_true=y_test))

0.951048951048951
[[51  3]
 [ 4 85]]
              precision    recall  f1-score   support

           0       0.93      0.94      0.94        54
           1       0.97      0.96      0.96        89

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143
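
As a sanity check, the headline numbers in the report can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = [[51, 3],
      [4, 85]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Class 1: precision uses the predicted-1 column, recall uses the true-1 row.
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])

print(total)                   # 143
print(round(accuracy, 4))      # 0.951
print(round(precision_1, 2))   # 0.97
print(round(recall_1, 2))      # 0.96
```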

