# LightGBM Hello World Example

In this notebook, we demonstrate the Microsoft LightGBM framework on a simple data set, covering both the preprocessing steps and the modelling part. You can swap in a different source data set and run the notebook again.

### Prerequisites

You need to install the following packages:

pip install pandas numpy lightgbm graphviz matplotlib

In [1]:
import lightgbm as lgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

To plot the built decision tree, you need to download Graphviz from the following link:

https://graphviz.gitlab.io/_pages/Download/Download_windows.html

Specify the installation path in the following block.

In [2]:
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin' # adjust to the folder where Graphviz is installed
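
Before relying on tree plotting, it can help to confirm that the Graphviz executables are actually reachable. The following is a minimal sketch using only the standard library; it assumes the Graphviz command-line program is named `dot`, and the variable names `dot_path` and `graphviz_available` are made up for illustration.

```python
import shutil

# shutil.which returns the full path of an executable found on PATH, or None.
dot_path = shutil.which("dot")
graphviz_available = dot_path is not None

if graphviz_available:
    print("Graphviz found at:", dot_path)
else:
    print("Graphviz 'dot' was not found on PATH; tree plotting will likely fail")
```

If the check fails, the path appended above probably does not match the actual installation folder.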

In [3]:
is_regression = False # set this to True to run regression instead of classification
plotTree = False # keep this False if you haven't installed graphviz

We are going to work on the play-golf decision data set.

You can find the raw data set here: https://github.com/serengil/decision-trees-for-ml/tree/master/dataset

In [4]:
dataset = pd.read_csv('C:/Users/IS96273/Desktop/decision tree/dataset/golf2.txt')

In [5]:
dataset.head()

Unnamed: 0,Outlook,Temp.,Humidity,Wind,Decision
0,Sunny,85,85,Weak,No
1,Sunny,80,90,Strong,No
2,Overcast,83,78,Weak,Yes
3,Rain,70,96,Weak,Yes
4,Rain,68,80,Weak,Yes


As seen above, the loaded data set includes both numeric and string features.

### Label encoding

LightGBM expects categorical features to be converted to integers before the dataset is constructed. That is why we apply label encoding to the categorical features.

Ref: https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

In [6]:
features = []; categorical_features = []

num_of_rows = dataset.shape[0]
num_of_columns = dataset.shape[1]
num_of_classes = 1 #default value is 1 for regression. we will update this for classification.

print("label encoding procedures:")

for i in range(0, num_of_columns):
    column_name = dataset.columns[i]
    column_type = dataset[column_name].dtypes
    
    if i != num_of_columns - 1: #skip target
        features.append(column_name)
    
    if column_type == 'object':
        print(column_name,": ", end='')
        feature_classes = dataset[column_name].unique()
        #print(feature_classes)
        
        if is_regression == False and i == num_of_columns - 1:
            num_of_classes = len(feature_classes)
        
        for j in range(len(feature_classes)):
            feature_class = feature_classes[j]
            print(feature_class," -> ",j,", ",end='')
                        
            dataset[column_name] = dataset[column_name].replace(feature_class, str(j))
        
        if i != num_of_columns - 1: #skip target
            categorical_features.append(column_name)
        
        print("")

label encoding procedures:
Outlook : Sunny  ->  0 , Overcast  ->  1 , Rain  ->  2 , 
Wind : Weak  ->  0 , Strong  ->  1 , 
Decision : No  ->  0 , Yes  ->  1 , 
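
The same mapping can be obtained more compactly with `pandas.factorize`, which numbers labels in order of first appearance. The sketch below is an alternative, not the method used in this notebook. It rebuilds the first five rows shown earlier as an inline frame (`sample`, `encoded` and `mappings` are illustrative names) and stores the codes as integers rather than strings.

```python
import pandas as pd

# First five rows of the golf data set, typed in by hand for this sketch.
sample = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain"],
    "Temp.": [85, 80, 83, 70, 68],
    "Humidity": [85, 90, 78, 96, 80],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak"],
    "Decision": ["No", "No", "Yes", "Yes", "Yes"],
})

encoded = sample.copy()
mappings = {}

for column in encoded.columns:
    # Only non-numeric (string) columns need label encoding.
    if not pd.api.types.is_numeric_dtype(encoded[column]):
        codes, uniques = pd.factorize(encoded[column])
        encoded[column] = codes
        mappings[column] = {label: code for code, label in enumerate(uniques)}

print(mappings)
print(encoded)
```

Because the codes are real integers here, the resulting frame holds only numeric columns.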


In [7]:
print("num_of_classes: ",num_of_classes)
print("features: ",features)
print("categorical features: ",categorical_features)

num_of_classes:  2
features:  ['Outlook', 'Temp.', 'Humidity', 'Wind']
categorical features:  ['Outlook', 'Wind']


In [8]:
dataset.head()

Unnamed: 0,Outlook,Temp.,Humidity,Wind,Decision
0,0,85,85,0,0
1,0,80,90,1,0
2,1,83,78,0,1
3,2,70,96,0,1
4,2,68,80,0,1


In [9]:
target_name = dataset.columns[num_of_columns - 1] # the target is the last (rightmost) column of the data set

y_train = dataset[target_name].values
x_train = dataset.drop(columns=[target_name]).values

print("input features:\n",x_train)
print("--------------------")
print("output:\n",y_train)

input features:
 [['0' 85 85 '0']
 ['0' 80 90 '1']
 ['1' 83 78 '0']
 ['2' 70 96 '0']
 ['2' 68 80 '0']
 ['2' 65 70 '1']
 ['1' 64 65 '1']
 ['0' 72 95 '0']
 ['0' 69 70 '0']
 ['2' 75 80 '0']
 ['0' 75 70 '1']
 ['1' 72 90 '1']
 ['1' 81 75 '0']
 ['2' 71 80 '1']]
--------------------
output:
 ['0' '0' '1' '1' '1' '0' '1' '0' '1' '1' '1' '1' '1' '0']
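
The quoted values in the printout (`'0'`, `'1'`) show that the encoded columns hold strings, so `.values` yields a mixed object array. If a purely numeric array is preferred, one option is `pandas.to_numeric`. This is a minimal sketch on a hand-typed copy of the first five encoded rows; the names `encoded`, `numeric` and `x_numeric` are illustrative and not part of the notebook.

```python
import pandas as pd

# Encoded feature rows as they appear above: category codes stored as strings.
encoded = pd.DataFrame({
    "Outlook": ["0", "0", "1", "2", "2"],
    "Temp.": [85, 80, 83, 70, 68],
    "Humidity": [85, 90, 78, 96, 80],
    "Wind": ["0", "1", "0", "0", "0"],
})

# Convert every column to a numeric dtype; string digits become integers.
numeric = encoded.apply(pd.to_numeric)
x_numeric = numeric.to_numpy()

print(numeric.dtypes)
print(x_numeric)
```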


Now we create the data set for LightGBM. The categorical features were already transformed to integers in the previous steps, but we still have to declare which features are categorical. Otherwise, a decision node treats the feature as ordinal and checks whether an instance's value is greater or less than some threshold. Suppose a feature holds gender information encoded as -1 for unknown, 0 for man and 1 for woman. A decision node might then test whether gender is greater than -1, which imposes an ordering that has no meaning. That is why specifying categorical features is very important.
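
To make the difference concrete, the sketch below enumerates the ways a single node could partition the three Outlook codes (0 = Sunny, 1 = Overcast, 2 = Rain) under each treatment. It is a simplified illustration of threshold splits versus subset splits, not LightGBM's actual split-finding procedure, and all names in it are made up for this example.

```python
from itertools import combinations

codes = [0, 1, 2]  # Sunny, Overcast, Rain

# Ordinal treatment: "value <= threshold" on one side, "value > threshold" on the other.
threshold_splits = [
    (frozenset(c for c in codes if c <= t), frozenset(c for c in codes if c > t))
    for t in codes[:-1]
]

# Categorical treatment: any non-empty proper subset on one side, the rest on the other.
# Each partition is stored as an unordered pair so mirrored splits count once.
categorical_splits = set()
for size in range(1, len(codes)):
    for left in combinations(codes, size):
        left = frozenset(left)
        right = frozenset(codes) - left
        categorical_splits.add(frozenset([left, right]))

print("threshold splits:", len(threshold_splits))
print("categorical splits:", len(categorical_splits))
```

The partition that isolates Overcast (code 1) from Sunny and Rain appears only in the categorical set, because no single threshold can separate the middle code from both of its neighbours.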

In [10]:
lgb_train = lgb.Dataset(x_train, y_train
    ,feature_name = features
    , categorical_feature = categorical_features
)

In [11]:
params = {
    'task': 'train'
    , 'boosting_type': 'gbdt'
    , 'objective': 'regression' if is_regression == True else 'multiclass'
    , 'num_class': num_of_classes
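    , 'metric': 'rmse' if is_regression == True else 'multi_logloss' # both are built-in LightGBM metric names
    , 'min_data': 1 # alias of min_data_in_leaf; allows leaves with a single row on this tiny data set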
    #, 'learning_rate':0.1
    , 'verbose': -1
}

gbm = lgb.train(params, lgb_train, num_boost_round=50)
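
The conditional expressions in the parameter dictionary can also be written as an explicit branch, which makes it easier to see which keys belong to which task. The helper below is hypothetical (`build_params` is not part of LightGBM or of this notebook) and is only a sketch of that idea. It uses `rmse`, one of LightGBM's built-in metric names, for the regression branch and sets `num_class` only for the multiclass objective.

```python
def build_params(is_regression, num_of_classes):
    """Return a LightGBM parameter dict for regression or multiclass classification."""
    params = {
        "task": "train",
        "boosting_type": "gbdt",
        "min_data": 1,   # alias of min_data_in_leaf
        "verbose": -1,
    }
    if is_regression:
        params["objective"] = "regression"
        params["metric"] = "rmse"
    else:
        params["objective"] = "multiclass"
        params["num_class"] = num_of_classes
        params["metric"] = "multi_logloss"
    return params


classification_params = build_params(False, 2)
regression_params = build_params(True, 1)
print(classification_params)
print(regression_params)
```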



In [12]:
predictions = gbm.predict(x_train)

print(predictions)
# to print only the predicted class index per row:
# for i in predictions:
#     print(np.argmax(i))

[[0.99674896 0.00325104]
 [0.99674896 0.00325104]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.99674896 0.00325104]
 [0.00325104 0.99674896]
 [0.99674896 0.00325104]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.00325104 0.99674896]
 [0.99674896 0.00325104]]


In [13]:
for index, instance in dataset.iterrows():
    actual = instance[target_name]
    
    if is_regression == True:
        prediction = round(predictions[index])
    else: #classification
        prediction = np.argmax(predictions[index])
    
    print((index+1),". actual= ",actual,", prediction= ",prediction)

1 . actual=  0 , prediction=  0
2 . actual=  0 , prediction=  0
3 . actual=  1 , prediction=  1
4 . actual=  1 , prediction=  1
5 . actual=  1 , prediction=  1
6 . actual=  0 , prediction=  0
7 . actual=  1 , prediction=  1
8 . actual=  0 , prediction=  0
9 . actual=  1 , prediction=  1
10 . actual=  1 , prediction=  1
11 . actual=  1 , prediction=  1
12 . actual=  1 , prediction=  1
13 . actual=  1 , prediction=  1
14 . actual=  0 , prediction=  0
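
The row-by-row comparison above can also be done in one step with NumPy. The sketch below copies the first five probability rows and labels from the printed output (the names `probabilities`, `actual`, `predicted` and `accuracy` are illustrative), takes the most probable class per row and computes the share of matches. Note that this is accuracy on the training rows themselves, so it says nothing about how the model generalizes.

```python
import numpy as np

# First five rows of the predicted class probabilities printed above.
probabilities = np.array([
    [0.99674896, 0.00325104],
    [0.99674896, 0.00325104],
    [0.00325104, 0.99674896],
    [0.00325104, 0.99674896],
    [0.00325104, 0.99674896],
])

# The encoded targets are stored as strings in this notebook, so cast them to int.
actual = np.array(["0", "0", "1", "1", "1"]).astype(int)

# argmax along axis 1 picks the most probable class for every row at once.
predicted = np.argmax(probabilities, axis=1)
accuracy = np.mean(predicted == actual)

print("predicted:", predicted)
print("training accuracy:", accuracy)
```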


In [14]:
if plotTree == True:
    
    fig_size = [30, 20]
    plt.rcParams["figure.figsize"] = fig_size

    ax = lgb.plot_tree(gbm)
    plt.show()
    
    #ax = lgb.plot_importance(gbm, max_num_features=10)
    #plt.show()
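
`lgb.plot_tree` draws a single tree, selected through its `tree_index` parameter, and the call above uses the default index. A multiclass model builds one tree per class in every boosting round, so this model holds more trees than rounds. The sketch below assumes the trees are stored round by round, with the classes of one round next to each other; treat that ordering as an assumption to verify against your LightGBM version. The helper `tree_index` is hypothetical.

```python
def tree_index(boosting_round, class_index, num_class):
    """Position of a tree, assuming trees are stored round by round, class by class."""
    return boosting_round * num_class + class_index


num_class = 2          # No / Yes
num_boost_round = 50   # as passed to lgb.train above

total_trees = num_boost_round * num_class
first_round_class_1 = tree_index(0, 1, num_class)
last_round_class_1 = tree_index(num_boost_round - 1, 1, num_class)

print("total trees:", total_trees)
print("class 1, first round:", first_round_class_1)
print("class 1, last round:", last_round_class_1)
```

Under that assumption, a call such as `lgb.plot_tree(gbm, tree_index=first_round_class_1)` would show the first tree built for the "Yes" class.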