<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This notebook gives a quick example of how to train a LightGBM model in a recommendation scenario.
LightGBM \[1\] is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.

## Global Settings and Imports

In [1]:
import sys, os
sys.path.append("../../")
import lightgbm as lgb
import pandas as pd
import category_encoders as ce
import reco_utils.recommender.lightgbm.lightgbm_utils as lgb_utils

print("System version: {}".format(sys.version))
print("LightGBM version: {}".format(lgb.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
LightGBM version: 2.2.1


### Parameter Setting
Let's first set the main parameters for LightGBM. The task is binary classification, so the objective function is set to binary logloss, and AUC is used as the evaluation metric.

Generally, we can adjust the number of leaves (MAX_LEAF), the minimum number of data in each leaf (MIN_DATA), the maximum number of trees (NUM_OF_TREES), the learning rate of trees (TREE_LEARNING_RATE) and EARLY_STOPPING_ROUNDS (to avoid overfitting) to get better performance.

We can also adjust the other parameters listed below to optimize the results; they are described in detail in [5].
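To make the role of `EARLY_STOPPING_ROUNDS` more concrete, here is a minimal, dependency-free sketch of the stopping rule for a higher-is-better metric such as AUC. The function name and the toy scores are illustrative assumptions, not part of LightGBM's API; LightGBM applies its own implementation internally.

```python
def best_iteration_with_early_stopping(scores, patience):
    """Return (best_round, best_score, last_round) for a higher-is-better metric.

    scores[i] is the validation score after boosting round i + 1.
    Training stops once `patience` rounds pass without a new best score.
    """
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, best_score, round_idx
    # Ran out of rounds before the stopping rule fired.
    return best_round, best_score, len(scores)


# Hypothetical validation AUC values, one per boosting round.
toy_auc = [0.70, 0.74, 0.76, 0.77, 0.765, 0.768, 0.769]
print(best_iteration_with_early_stopping(toy_auc, patience=3))  # (4, 0.77, 7)
```

With a patience of 3, the best score is reached at round 4 and the loop stops at round 7, so the model from round 4 would be kept.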

In [2]:
MAX_LEAF = 128
MIN_DATA = 40
NUM_OF_TREES = 200
TREE_LEARNING_RATE = 0.15
EARLY_STOPPING_ROUNDS = 20
METRIC = "auc"

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'num_class': 1,
    'objective': "binary",
    'metric': METRIC,
    'num_leaves': MAX_LEAF,
    'min_data': MIN_DATA,
    'boost_from_average': True,
    # Set this according to the number of CPU cores available.
    'num_threads': 20,
    'feature_fraction': 0.8,
    'learning_rate': TREE_LEARNING_RATE,
}

## Data Preparation
Here we use CSV as the example input format. Our example data is a sample (about 1 million rows) from the Criteo dataset [2]. The Criteo dataset is a well-known industry benchmark for developing CTR prediction models, and it is frequently adopted as an evaluation dataset in research papers. The original dataset is too large for a lightweight demo, so we sample a small portion of it as a demo dataset. <br>
Specifically, there are 39 feature columns in Criteo: 13 columns are numerical features (I1-I13) and the other 26 columns are categorical features (C1-C26).

First, we prepared three files in the data root directory, cut from the original data: train_file (first 80%), valid_file (middle 10%) and test_file (last 10%). <br>
Notably, because Criteo is a kind of time-series streaming data, which is also very common in recommendation scenarios, we split the data by its order rather than at random.
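As an illustration of the order-preserving 80/10/10 split described above, here is a minimal sketch. The helper name `split_by_order` and the toy rows are assumptions made for this example; the notebook itself loads files that were already split.

```python
def split_by_order(rows, train_frac=0.8, valid_frac=0.1):
    """Split a sequence into train/valid/test parts without shuffling.

    The earliest rows go to train, the next block to valid, and the
    remaining (latest) rows to test, so no future data leaks backwards.
    """
    n = len(rows)
    train_end = int(n * train_frac)
    valid_end = train_end + int(n * valid_frac)
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]


# Toy stand-in for time-ordered samples.
rows = list(range(20))
train, valid, test = split_by_order(rows)
print(len(train), len(valid), len(test))  # 16 2 2
```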

In [3]:
data_path = '../../tests/resources/lightgbm'
train_file = os.path.join(data_path, r'tiny_criteo0.csv')
valid_file = os.path.join(data_path, r'tiny_criteo1.csv')
test_file = os.path.join(data_path, r'tiny_criteo2.csv')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(train_file):
    # to do: upload our test resources.
    download_lgb_resources(r'https://recodatasets.blob.core.windows.net/lightgbm/', data_path, 'resources.zip')

test_data = pd.read_csv(test_file)
display(test_data.head())
del test_data

Unnamed: 0,Id,Label,I1,I2,I3,I4,I5,I6,I7,I8,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,10900000,1,4.0,0,11.0,11.0,64.0,24.0,9.0,46.0,...,27c07bd6,4bcc9449,21ddcdc9,b1252a9d,87508095,,c7dc6720,4f0948e6,e8b83407,61f8e249
1,10900001,0,0.0,474,18.0,18.0,2381.0,25.0,49.0,25.0,...,3486227d,99d2d39e,,,7212cd0a,,423fab69,a98f5ada,,
2,10900002,0,,-1,4.0,2.0,155268.0,,0.0,3.0,...,1e88c74f,157482f0,21ddcdc9,b1252a9d,e2e82c3c,ad3062eb,3a171ecb,32ebc486,001f3601,e539c901
3,10900003,0,,2,20.0,14.0,5175.0,57.0,4.0,27.0,...,07c540c4,395856b0,21ddcdc9,a458ea53,8e4884c0,,32c7478e,b936bfbe,001f3601,3464ae5c
4,10900004,0,1.0,87,105.0,5.0,8.0,5.0,7.0,6.0,...,8efede7f,775e80fe,21ddcdc9,a458ea53,3ee29a07,,423fab69,c83e0347,ea9a246c,2fede552


## Basic Usage
### Ordinal Encoding
Because LightGBM can handle low-frequency features and missing values by itself, for basic usage we only encode the string-like categorical features with an ordinal encoder.
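For readers unfamiliar with ordinal encoding, the idea can be sketched in a few lines of plain Python. The class below is a hypothetical stand-in, not the `category_encoders` implementation; in particular, mapping unseen values to `-1` is an assumption made for this sketch.

```python
class TinyOrdinalEncoder:
    """Map each distinct value to an integer in order of first appearance."""

    def __init__(self, unseen_code=-1):
        self.unseen_code = unseen_code  # code for values not seen during fit
        self.mapping = {}

    def fit_transform(self, values):
        # Learn the mapping from the training values, then encode them.
        for value in values:
            if value not in self.mapping:
                self.mapping[value] = len(self.mapping) + 1
        return [self.mapping[value] for value in values]

    def transform(self, values):
        # Reuse the learned mapping; never extend it on new data.
        return [self.mapping.get(value, self.unseen_code) for value in values]


encoder = TinyOrdinalEncoder()
train_codes = encoder.fit_transform(["27c07bd6", "3486227d", "27c07bd6"])
test_codes = encoder.transform(["3486227d", "1e88c74f"])
print(train_codes)  # [1, 2, 1]
print(test_codes)   # [2, -1]
```

The integer codes carry no numeric meaning by themselves, which is why the columns are also declared as categorical when the LightGBM dataset is built.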

In [4]:
cate_cols = ['C'+str(i) for i in range(1, 27)]
label_col = 'Label'
ord_encoder = ce.ordinal.OrdinalEncoder(cols=cate_cols)

def encode_csv(file_path, encoder, label_col, typ='fit', del_col='Id'):
    print('Processing %s .'% file_path)
    df = pd.read_csv(file_path)
    if typ == 'fit':
        df = encoder.fit_transform(df)
    else:
        df = encoder.transform(df)
    y = df[label_col].values
    del df[label_col]
    del df[del_col]
    return df, y

train_x, train_y = encode_csv(train_file, ord_encoder, label_col)
valid_x, valid_y = encode_csv(valid_file, ord_encoder, label_col, 'transform')
test_x, test_y = encode_csv(test_file, ord_encoder, label_col, 'transform')

print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
      .format(trn_x_shape=train_x.shape,
              trn_y_shape=train_y.shape,
              vld_x_shape=valid_x.shape,
              vld_y_shape=valid_y.shape,
              tst_x_shape=test_x.shape,
              tst_y_shape=test_y.shape,))

Processing ../../tests/resources/lightgbm/tiny_criteo0.csv .
Processing ../../tests/resources/lightgbm/tiny_criteo1.csv .
Processing ../../tests/resources/lightgbm/tiny_criteo2.csv .
Train Data Shape: X: (800000, 39); Y: (800000,).
Valid Data Shape: X: (100000, 39); Y: (100000,).
Test Data Shape: X: (100000, 39); Y: (100000,).



### Create model
When both hyper-parameters and data are ready, we can create a model:

In [5]:
lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params, categorical_feature=cate_cols)
lgb_eval = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
lgb_model = lgb.train(params,
                      lgb_train,
                      num_boost_round=NUM_OF_TREES,
                      verbose_eval=5,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=lgb_eval,
                      categorical_feature=cate_cols)

Training until validation scores don't improve for 20 rounds.
[5]	valid_0's auc: 0.754661
[10]	valid_0's auc: 0.764253
[15]	valid_0's auc: 0.770664
[20]	valid_0's auc: 0.774737
[25]	valid_0's auc: 0.779145
[30]	valid_0's auc: 0.780233
[35]	valid_0's auc: 0.781785
[40]	valid_0's auc: 0.782593
[45]	valid_0's auc: 0.782758
[50]	valid_0's auc: 0.782698
[55]	valid_0's auc: 0.782987
[60]	valid_0's auc: 0.782658
[65]	valid_0's auc: 0.782224
[70]	valid_0's auc: 0.781702
Early stopping, best iteration is:
[51]	valid_0's auc: 0.783064


Now let's see how the model performs:

In [6]:
test_preds = lgb_model.predict(test_x)
print(lgb_utils.cal_metric(test_y.reshape(-1), test_preds, ['auc','logloss']))

{'auc': 0.7818, 'logloss': 0.4617}
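The two metrics reported above can be written out directly from their definitions. The following is a small pure-Python sketch for intuition only; it is not the implementation behind `lgb_utils.cal_metric`, and the quadratic AUC loop is only suitable for tiny inputs.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_prob))  # 0.75
print(round(log_loss(y_true, y_prob), 4))
```

AUC depends only on the ranking of the predictions, while logloss also penalizes poorly calibrated probabilities, so the two can move independently.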


## Optimized Usage
### Label-encoding and Binary-encoding
Next, since LightGBM handles dense numerical features effectively, we convert all the categorical features in the original data into numerical ones by label-encoding [3] and binary-encoding [4]. Because Criteo data is sequential, the label-encoding is performed sample by sample: each sample is encoded in order, using only the information from the samples that precede it (sequential label-encoding and sequential binary-encoding). In addition, we filter the low-frequency categorical features and fill the missing values of the numerical features with the mean of the corresponding columns (see `lgb_utils.NumEncoder`).
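Binary encoding, as mentioned above, writes each integer category code in base 2 and uses one column per bit. The sketch below shows the idea on a few toy codes; the function name and the fixed bit width are assumptions for this example and do not mirror the internals of `lgb_utils.NumEncoder`.

```python
def binary_encode(code, width):
    """Return the `width` lowest bits of a non-negative integer, most significant first."""
    return [(code >> shift) & 1 for shift in range(width - 1, -1, -1)]


codes = [1, 2, 5]
width = max(codes).bit_length()  # 3 bits cover every code up to 7
encoded = [binary_encode(code, width) for code in codes]
print(encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 1]]
```

A column with `k` distinct codes needs only about `log2(k)` binary columns, which keeps the feature matrix far narrower than one-hot encoding would.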

Specifically, in `lgb_utils.NumEncoder`, the main steps are as follows.
* First, we convert the low-frequency categorical features to "LESS" and the missing categorical features to "UNK".
* Second, we replace the missing numerical features with the mean of the corresponding columns.
* Third, the string-like categorical features are ordinal encoded, as in the basic usage example.
* Next, we target encode the categorical features one by one, in sample order. For each sample, we add the label and count information of its former samples to the data as new features. Formally, we add $\frac{\sum\nolimits_{j=1}^n I(x_j=c) \cdot y_j}{\sum\nolimits_{j=1}^n I(x_j=c)}$ as a new label feature, where $c$ is the category to encode in the current sample, $n$ is the number of former samples, $x_j$ and $y_j$ are the category and label of the $j$-th former sample, and $I(\cdot)$ is the indicator function that checks whether $x_j=c$. Meanwhile, we also add $\frac{\sum\nolimits_{j=1}^n I(x_j=c)}{n}$, the frequency of $c$ among the former samples, as a new count feature.
* Finally, based on the ordinal encoding, we add the features produced by binary encoding to the data.
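The sequential target-encoding step in the list above can be sketched as follows. This is a minimal illustration of the two formulas, not the code of `lgb_utils.NumEncoder`; the function name and the `default` value used when a category has no former samples (where the formula is undefined) are assumptions.

```python
from collections import defaultdict


def sequential_target_encode(categories, labels, default=0.0):
    """Compute label and count features from former samples only.

    For sample i, with n = i former samples:
      label feature = mean label of former samples sharing its category
      count feature = share of former samples sharing its category
    """
    label_sum = defaultdict(float)
    count = defaultdict(int)
    label_feature, count_feature = [], []
    for n, (c, y) in enumerate(zip(categories, labels)):
        seen = count[c]
        label_feature.append(label_sum[c] / seen if seen else default)
        count_feature.append(seen / n if n else default)
        # Update the statistics only after emitting the features,
        # so the current label never leaks into its own encoding.
        label_sum[c] += y
        count[c] += 1
    return label_feature, count_feature


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
label_feature, count_feature = sequential_target_encode(cats, ys)
print(label_feature)  # [0.0, 0.0, 1.0, 0.5, 0.0]
print(count_feature)
```

For the third sample, for example, the only former sample with category `"a"` has label 1, so the label feature is 1.0 and the count feature is 1/2.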

Note that the statistics used in the above process are only updated when fitting the training set and remain static when transforming the testing set, because the labels of the test data should be considered unknown.
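A minimal sketch of this fit/transform asymmetry is shown below. The class name and the `default` value for unseen categories are hypothetical; the point is only that `transform` reads the training statistics and never takes labels as input.

```python
from collections import Counter


class FrozenTargetStats:
    """Collect per-category statistics on training data, then apply them unchanged."""

    def fit(self, categories, labels):
        self.label_sum = Counter()
        self.count = Counter()
        for c, y in zip(categories, labels):
            self.label_sum[c] += y
            self.count[c] += 1
        self.n = len(categories)
        return self

    def transform(self, categories, default=0.0):
        # Read-only: no labels are passed in and no statistic is updated.
        label_feature = [
            self.label_sum[c] / self.count[c] if self.count[c] else default
            for c in categories
        ]
        count_feature = [
            self.count[c] / self.n if self.n else default for c in categories
        ]
        return label_feature, count_feature


stats = FrozenTargetStats().fit(["a", "b", "a", "a", "b"], [1, 0, 0, 1, 1])
counts_before = dict(stats.count)
test_label_feature, test_count_feature = stats.transform(["a", "c"])
print(test_label_feature, test_count_feature)
print(dict(stats.count) == counts_before)  # True
```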

In [7]:
cate_cols = ['C'+str(i) for i in range(1, 27)]
nume_cols = ['I'+str(i) for i in range(1, 14)]
label_col = 'Label'
num_encoder = lgb_utils.NumEncoder(cate_cols, nume_cols, label_col)
train_x, train_y = num_encoder.fit_transform(train_file)
valid_x, valid_y = num_encoder.transform(valid_file)
test_x, test_y = num_encoder.transform(test_file)
del num_encoder
print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
      .format(trn_x_shape=train_x.shape,
              trn_y_shape=train_y.shape,
              vld_x_shape=valid_x.shape,
              vld_y_shape=valid_y.shape,
              tst_x_shape=test_x.shape,
              tst_y_shape=test_y.shape,))

----------------------------------------------------------------------
Fitting and Transforming ../../tests/resources/lightgbm/tiny_criteo0.csv .
----------------------------------------------------------------------


2019-03-03 04:30:37,690 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:10<00:00,  2.69it/s]
100%|██████████| 13/13 [00:00<00:00, 99.63it/s]
2019-03-03 04:30:48,445 [INFO] Ordinal encoding cate features
2019-03-03 04:31:03,039 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:41<00:00,  1.57s/it]
2019-03-03 04:31:44,411 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:12<00:00,  3.90it/s]
100%|██████████| 26/26 [00:18<00:00,  1.16it/s]


----------------------------------------------------------------------
Transforming ../../tests/resources/lightgbm/tiny_criteo1.csv .
----------------------------------------------------------------------


2019-03-03 04:32:16,050 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:01<00:00, 24.38it/s]
100%|██████████| 13/13 [00:00<00:00, 1021.77it/s]
2019-03-03 04:32:17,133 [INFO] Ordinal encoding cate features
2019-03-03 04:32:18,804 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:04<00:00,  5.78it/s]
2019-03-03 04:32:23,349 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:05<00:00, 12.91it/s]
100%|██████████| 26/26 [00:03<00:00,  6.86it/s]


----------------------------------------------------------------------
Transforming ../../tests/resources/lightgbm/tiny_criteo2.csv .
----------------------------------------------------------------------


2019-03-03 04:32:32,276 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:01<00:00, 24.41it/s]
100%|██████████| 13/13 [00:00<00:00, 1136.24it/s]
2019-03-03 04:32:33,357 [INFO] Ordinal encoding cate features
2019-03-03 04:32:35,018 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:04<00:00,  5.79it/s]
2019-03-03 04:32:39,535 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:05<00:00, 11.99it/s]
100%|██████████| 26/26 [00:03<00:00,  6.76it/s]


Train Data Shape: X: (800000, 303); Y: (800000, 1).
Valid Data Shape: X: (100000, 303); Y: (100000, 1).
Test Data Shape: X: (100000, 303); Y: (100000, 1).



### Training and Evaluation

In [8]:
lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params)
lgb_eval = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train)
lgb_model = lgb.train(params,
                      lgb_train,
                      num_boost_round=NUM_OF_TREES,
                      verbose_eval=5,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=lgb_eval)

Training until validation scores don't improve for 20 rounds.
[5]	valid_0's auc: 0.762367
[10]	valid_0's auc: 0.767671
[15]	valid_0's auc: 0.771752
[20]	valid_0's auc: 0.774614
[25]	valid_0's auc: 0.776838
[30]	valid_0's auc: 0.778434
[35]	valid_0's auc: 0.77965
[40]	valid_0's auc: 0.78061
[45]	valid_0's auc: 0.781365
[50]	valid_0's auc: 0.782039
[55]	valid_0's auc: 0.782629
[60]	valid_0's auc: 0.783151
[65]	valid_0's auc: 0.783466
[70]	valid_0's auc: 0.783844
[75]	valid_0's auc: 0.784013
[80]	valid_0's auc: 0.784107
[85]	valid_0's auc: 0.784072
[90]	valid_0's auc: 0.784185
[95]	valid_0's auc: 0.784349
[100]	valid_0's auc: 0.784566
[105]	valid_0's auc: 0.784541
[110]	valid_0's auc: 0.784509
[115]	valid_0's auc: 0.784456
Early stopping, best iteration is:
[97]	valid_0's auc: 0.784606


In [9]:
test_preds = lgb_model.predict(test_x)
print(lgb_utils.cal_metric(test_y.reshape(-1), test_preds, ['auc','logloss']))

{'auc': 0.7827, 'logloss': 0.4601}


## Model saving and loading
Now that we have finished the basic training and testing for LightGBM, let's save and reload the model, and then evaluate it again.

In [10]:
save_file = os.path.join(data_path, r'finished.model')
lgb_model.save_model(save_file)
loaded_model = lgb.Booster(model_file=save_file)

# Evaluate the performance again with the reloaded model
test_preds = loaded_model.predict(test_x)
print(lgb_utils.cal_metric(test_y.reshape(-1), test_preds, ['auc','logloss']))

{'auc': 0.7827, 'logloss': 0.4601}


## References
\[1\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.<br>
\[2\] The Criteo datasets: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ .<br>
\[3\] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).<br>
\[4\] Scikit-learn. 2018. categorical_encoding. https://github.com/scikit-learn-contrib/categorical-encoding .<br>
\[5\] The parameters of LightGBM: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst .