<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This notebook gives a quick example of how to train a LightGBM model in a recommendation scenario. 
LightGBM \[1\] is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.

In recommendation scenarios, LightGBM handles dense numerical features effectively. To get the most out of LightGBM, we therefore encode the categorical features in the data as numerical ones before feeding them to the model.

## Global Settings and Imports

In [1]:
import sys, os
sys.path.append("../../")
import papermill as pm
import lightgbm as lgb
import pandas as pd

import reco_utils.recommender.lightgbm.lightgbm_utils as lgb_utils

print("System version: {}".format(sys.version))
print("LightGBM version: {}".format(lgb.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) 
[GCC 7.3.0]
LightGBM version: 2.2.1


## Data Preprocessing
Here we use CSV as the example input format. Our example data is a sample (about 1 million rows) of the Criteo dataset [2]. Criteo is a well-known industry benchmark dataset for developing CTR prediction models, and research papers frequently use it for evaluation. The original dataset is too large for a lightweight demo, so we sample a small portion of it as a demo dataset. <br>
Specifically, Criteo has 39 feature columns: 13 numerical features (I1-I13) and 26 categorical features (C1-C26).

First, we prepare three files in the data root directory, cut from the original data: train_file (first 80\%), valid_file (middle 10\%) and test_file (last 10\%). <br>
Notably, because Criteo is time-ordered streaming data, which is also very common in recommendation scenarios, we split the data by its order rather than at random.
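
The order-preserving 80/10/10 split described above can be sketched as follows. This is a minimal illustration rather than the script used to produce the demo files; the function name `split_by_order` and its parameters are hypothetical, and it assumes the rows are already sorted by time.

```python
def split_by_order(rows, train_frac=0.8, valid_frac=0.1):
    """Split a time-ordered sequence into train/valid/test without shuffling.

    Hypothetical helper: `rows` can be any sliceable sequence (e.g. a list of
    records) that is already in chronological order.
    """
    n = len(rows)
    train_end = int(n * train_frac)
    valid_end = train_end + int(n * valid_frac)
    # Earlier rows go to training, later rows to validation and test,
    # so the model is never evaluated on data older than its training data.
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]


train_rows, valid_rows, test_rows = split_by_order(list(range(10)))
print(len(train_rows), len(valid_rows), len(test_rows))
```

Keeping the original order means validation and test data always come after the training data, which mirrors how a recommender is used in production.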

In [2]:
data_path = '../../tests/resources/lightgbm'
train_file = os.path.join(data_path, r'tiny_criteo0.csv')
valid_file = os.path.join(data_path, r'tiny_criteo1.csv')
test_file = os.path.join(data_path, r'tiny_criteo2.csv')
output_file = os.path.join(data_path, r'output.txt')

if not os.path.exists(train_file):
    # to do: upload our test resources.
    download_lgb_resources(r'https://recodatasets.blob.core.windows.net/lightgbm/', data_path, 'resources.zip')

test_data = pd.read_csv(test_file)
display(test_data.head())
del test_data

Unnamed: 0,Id,Label,I1,I2,I3,I4,I5,I6,I7,I8,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,10900000,1,4.0,0,11.0,11.0,64.0,24.0,9.0,46.0,...,27c07bd6,4bcc9449,21ddcdc9,b1252a9d,87508095,,c7dc6720,4f0948e6,e8b83407,61f8e249
1,10900001,0,0.0,474,18.0,18.0,2381.0,25.0,49.0,25.0,...,3486227d,99d2d39e,,,7212cd0a,,423fab69,a98f5ada,,
2,10900002,0,,-1,4.0,2.0,155268.0,,0.0,3.0,...,1e88c74f,157482f0,21ddcdc9,b1252a9d,e2e82c3c,ad3062eb,3a171ecb,32ebc486,001f3601,e539c901
3,10900003,0,,2,20.0,14.0,5175.0,57.0,4.0,27.0,...,07c540c4,395856b0,21ddcdc9,a458ea53,8e4884c0,,32c7478e,b936bfbe,001f3601,3464ae5c
4,10900004,0,1.0,87,105.0,5.0,8.0,5.0,7.0,6.0,...,8efede7f,775e80fe,21ddcdc9,a458ea53,3ee29a07,,423fab69,c83e0347,ea9a246c,2fede552


Second, we convert the categorical features in the original data into numerical ones, by label-encoding [3] and binary-encoding [4]. Because of the time-series property of Criteo, the label-encoding is applied sequentially: samples are encoded in order, each using only the information from the samples that precede it (see the dynamic target encoding code in `lgb_utils.NumEncoder`).
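
To make the two encodings concrete, here is a minimal sketch of an ordered target encoding and a binary encoding. It is not the code of `lgb_utils.NumEncoder`: the function names, the `prior` and `weight` smoothing parameters, and the bit layout are all assumptions chosen for illustration.

```python
from collections import defaultdict


def ordered_target_encode(categories, labels, prior=0.5, weight=1.0):
    """Encode each sample using only the labels of earlier samples.

    Hypothetical sketch: the encoded value is a smoothed mean of the labels
    seen so far for the same category; `prior` and `weight` are assumed
    smoothing parameters.
    """
    label_sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = []
    for cat, y in zip(categories, labels):
        # Use statistics from preceding samples only, then update them.
        encoded.append((label_sums[cat] + prior * weight) / (counts[cat] + weight))
        label_sums[cat] += y
        counts[cat] += 1
    return encoded


def binary_encode(ordinal_ids, width=None):
    """Turn non-negative integer ids into fixed-width lists of bits (MSB first)."""
    if width is None:
        width = max(1, max(ordinal_ids).bit_length())
    return [[(i >> b) & 1 for b in reversed(range(width))] for i in ordinal_ids]


print(ordered_target_encode(["a", "b", "a", "a"], [1, 0, 0, 1]))
print(binary_encode([0, 1, 2, 5]))
```

Because each value depends only on earlier samples, the label of a sample never leaks into its own features.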

In [3]:
cate_cols = ['C'+str(i) for i in range(1, 27)]
nume_cols = ['I'+str(i) for i in range(1, 14)]
label_col = 'Label'
num_encoder = lgb_utils.NumEncoder(cate_cols, nume_cols, label_col)
train_x, train_y = num_encoder.fit_transform(train_file)
valid_x, valid_y = num_encoder.transform(valid_file)
test_x, test_y = num_encoder.transform(test_file)
del num_encoder
print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
      .format(trn_x_shape=train_x.shape,
              trn_y_shape=train_y.shape,
              vld_x_shape=valid_x.shape,
              vld_y_shape=valid_y.shape,
              tst_x_shape=test_x.shape,
              tst_y_shape=test_y.shape,))

----------------------------------------------------------------------
Fitting and Transforming ../../tests/resources/lightgbm/tiny_criteo0.csv .
----------------------------------------------------------------------


2019-03-01 08:32:33,751 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:10<00:00,  2.74it/s]
100%|██████████| 13/13 [00:01<00:00, 10.78it/s]
2019-03-01 08:32:45,217 [INFO] Ordinal encoding cate features
2019-03-01 08:33:00,647 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:43<00:00,  1.66s/it]
2019-03-01 08:33:44,348 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:12<00:00,  3.88it/s]
100%|██████████| 26/26 [00:18<00:00,  1.13it/s]


----------------------------------------------------------------------
Transforming ../../tests/resources/lightgbm/tiny_criteo1.csv .
----------------------------------------------------------------------


2019-03-01 08:34:15,609 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:01<00:00, 24.45it/s]
100%|██████████| 13/13 [00:00<00:00, 1009.52it/s]
2019-03-01 08:34:16,690 [INFO] Ordinal encoding cate features
2019-03-01 08:34:18,345 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:05<00:00,  5.23it/s]
2019-03-01 08:34:23,356 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:04<00:00, 15.12it/s]
100%|██████████| 26/26 [00:02<00:00,  7.23it/s]


----------------------------------------------------------------------
Transforming ../../tests/resources/lightgbm/tiny_criteo2.csv .
----------------------------------------------------------------------


2019-03-01 08:34:31,263 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:01<00:00, 24.48it/s]
100%|██████████| 13/13 [00:00<00:00, 1107.42it/s]
2019-03-01 08:34:32,341 [INFO] Ordinal encoding cate features
2019-03-01 08:34:34,021 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:04<00:00,  5.33it/s]
2019-03-01 08:34:38,989 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:04<00:00, 15.03it/s]
100%|██████████| 26/26 [00:02<00:00,  7.17it/s]


Train Data Shape: X: (800000, 303); Y: (800000, 1).
Valid Data Shape: X: (100000, 303); Y: (100000, 1).
Test Data Shape: X: (100000, 303); Y: (100000, 1).



## Parameter Setting
With the data prepared, let's set the main parameters for LightGBM. The task is a binary classification, so the objective function is set to binary loss.

Generally, we can adjust the number of leaves (MAX_LEAF), the minimum number of samples in each leaf (MIN_DATA), the number of trees (NUM_OF_TREES), the learning rate of the trees (TREE_LEARNING_RATE) and EARLY_STOPPING_ROUNDS (to avoid overfitting) to get better performance.

We can also tune the other parameters set below to improve the results; they are described in detail in [5].
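
The role of EARLY_STOPPING_ROUNDS can be pictured with the small sketch below. It only illustrates the stopping rule on a list of validation scores and is not LightGBM's implementation; the function name `early_stopping_round` and the example scores are hypothetical.

```python
def early_stopping_round(valid_scores, patience):
    """Return (best_iteration, last_iteration_run), both 1-based.

    Hypothetical sketch: training stops once the validation score (higher is
    better, e.g. AUC) has not improved for `patience` consecutive rounds.
    """
    best_score = float("-inf")
    best_iter = 0
    for i, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_iter, i
    return best_iter, len(valid_scores)


# Made-up validation scores, for illustration only.
print(early_stopping_round([0.70, 0.72, 0.71, 0.715, 0.719], patience=2))
```

A larger patience tolerates longer plateaus at the cost of more training rounds.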

In [4]:
MAX_LEAF = 128
MIN_DATA = 40
NUM_OF_TREES = 200
TREE_LEARNING_RATE = 0.15
EARLY_STOPPING_ROUNDS = 20
METRIC = "auc"

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'num_class': 1,
    'objective': "binary",
    'metric': METRIC,
    'num_leaves': MAX_LEAF,
    'min_data': MIN_DATA,
    'boost_from_average': True,
    'num_threads': 20,
    'feature_fraction': 0.8,
    'bagging_freq': 3,
    'bagging_fraction': 0.9,
    'learning_rate': TREE_LEARNING_RATE,
}

## Create model
When both hyper-parameters and data are ready, we can create a model:

In [5]:
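# Wrap the encoded arrays as LightGBM datasets; labels must be 1-D.
lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params)
# The validation set reuses the feature binning of the training set.
lgb_eval = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train)
# Note: in LightGBM 4.0 and later, the early_stopping_rounds argument was removed;
# pass callbacks=[lgb.early_stopping(EARLY_STOPPING_ROUNDS)] instead.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=NUM_OF_TREES,
                early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                valid_sets=lgb_eval)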

[1]	valid_0's auc: 0.747287
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's auc: 0.754357
[3]	valid_0's auc: 0.758606
[4]	valid_0's auc: 0.760637
[5]	valid_0's auc: 0.762482
[6]	valid_0's auc: 0.763813
[7]	valid_0's auc: 0.76506
[8]	valid_0's auc: 0.765835
[9]	valid_0's auc: 0.766965
[10]	valid_0's auc: 0.767588
[11]	valid_0's auc: 0.768471
[12]	valid_0's auc: 0.769421
[13]	valid_0's auc: 0.770194
[14]	valid_0's auc: 0.7709
[15]	valid_0's auc: 0.771585
[16]	valid_0's auc: 0.772365
[17]	valid_0's auc: 0.773133
[18]	valid_0's auc: 0.773704
[19]	valid_0's auc: 0.774298
[20]	valid_0's auc: 0.774892
[21]	valid_0's auc: 0.775447
[22]	valid_0's auc: 0.775945
[23]	valid_0's auc: 0.776312
[24]	valid_0's auc: 0.776671
[25]	valid_0's auc: 0.776938
[26]	valid_0's auc: 0.777283
[27]	valid_0's auc: 0.777533
[28]	valid_0's auc: 0.777917
[29]	valid_0's auc: 0.778465
[30]	valid_0's auc: 0.778753
[31]	valid_0's auc: 0.779115
[32]	valid_0's auc: 0.779296
[33]	valid_0's auc: 0.

Now let's check the model's performance on the test set:

In [6]:
test_preds = gbm.predict(test_x)
print(lgb_utils.cal_metric(test_y.reshape(-1), test_preds, ['auc','logloss']))

{'auc': 0.7831, 'logloss': 0.4598}
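
For reference, the two metrics reported above can be computed as in the sketch below. This is not the implementation of `lgb_utils.cal_metric`; the function names are hypothetical, and the pairwise AUC is written for clarity rather than speed (it is quadratic in the number of samples).

```python
import math


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Hypothetical sketch: ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def logloss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Toy labels and predictions, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_prob), logloss(y_true, y_prob))
```

AUC depends only on how the predictions are ranked, while logloss also penalizes poorly calibrated probabilities, so the two metrics complement each other.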


## Model saving and loading
Now that basic training and testing are done, let's save and reload the model, and then evaluate it again.

In [7]:
save_file = os.path.join(data_path, r'finished.model')
gbm.save_model(save_file)
gbm = lgb.Booster(model_file=save_file)

# eval the performance again
test_preds = gbm.predict(test_x)
print(lgb_utils.cal_metric(test_y.reshape(-1), test_preds, ['auc','logloss']))

{'auc': 0.7831, 'logloss': 0.4598}


## References
\[1\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.<br>
\[2\] The Criteo datasets: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ .<br>
\[3\] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).<br>
\[4\] Scikit-learn. 2018. categorical_encoding. https://github.com/scikit-learn-contrib/categorical-encoding .<br>
\[5\] The parameters of LightGBM: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst .