# LightGBM for Classification
LightGBM is similar to `XGBoost` and can achieve results just as good.

Links:
+ [analyticsvidhya](https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/)
+ [official doc](https://lightgbm.readthedocs.io/en/latest/index.html)
+ [medium article 1](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc)
+ [medium article 2 - towardsdatascience](https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d)
+ [paper 1](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf)
+ [example classifier](https://www.programcreek.com/python/example/88793/lightgbm.LGBMClassifier)

In [42]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

file_path = "../../files/"

In [6]:
# Importing the dataset
dataset = pd.read_csv(file_path + 'Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [30]:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Set the objective explicitly to binary classification
fit_lgb = lgb.LGBMClassifier(objective='binary')

params = {
    'num_leaves': [10, 30, 50],
    'n_estimators': [100, 300],
    # 'metric': ['l2', 'l1'],
    'learning_rate': [0.01, 0.1],
    'min_child_samples': [20, 40, 60],
    'feature_fraction': [0.8, 0.9],
    'bagging_fraction': [0.8, 0.9],
    'bagging_freq': [1, 3],
}

gbm = GridSearchCV(fit_lgb, params, cv = 5, verbose = 1, n_jobs=-1)
gbm.fit(X_train, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done 340 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 840 tasks      | elapsed:   24.4s
[Parallel(n_jobs=-1)]: Done 1433 out of 1440 | elapsed:   43.9s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:   44.1s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                      colsample_bytree=1.0,
                                      importance_type='split',
                                      learning_rate=0.1, max_depth=-1,
                                      min_child_samples=20,
                                      min_child_weight=0.001,
                                      min_split_gain=0.0, n_estimators=100,
                                      n_jobs=-1, num_leaves=31,
                                      objective='binary', random_state=None,
                                      reg_alpha=0.0, reg_lambda=0.0,
                                      silent...
                                      subsample_for_bin=200000,
                                      subsample_freq=0),
             iid='deprecated', n_jobs=-1,
             param_grid={'bagging_fraction': [0.8, 0.9], 'bagging_freq':
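
The "288 candidates" figure in the log is the product of the number of values tried for each hyperparameter (3 × 2 × 2 × 3 × 2 × 2 × 2), and 5-fold cross-validation multiplies that by 5. A minimal sketch that reproduces both counts with the standard library only; the `params` dict is copied from the cell above, and `n_folds`, `n_candidates`, and `n_fits` are illustrative names:

```python
from math import prod

# Same grid as the one passed to GridSearchCV above
params = {
    'num_leaves': [10, 30, 50],
    'n_estimators': [100, 300],
    'learning_rate': [0.01, 0.1],
    'min_child_samples': [20, 40, 60],
    'feature_fraction': [0.8, 0.9],
    'bagging_fraction': [0.8, 0.9],
    'bagging_freq': [1, 3],
}
n_folds = 5  # cv=5 in the search above

# One candidate per combination of values, one fit per candidate per fold
n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Adding one more two-valued hyperparameter would double both numbers, so the runtime of an exhaustive grid grows quickly with each new entry.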

In [31]:
print('Best parameters found by grid search are:\n', gbm.best_params_)

print('\nBest score: ', gbm.best_score_)



Best parameters found by grid search are:
 {'bagging_fraction': 0.9, 'bagging_freq': 3, 'feature_fraction': 0.8, 'learning_rate': 0.01, 'min_child_samples': 40, 'n_estimators': 100, 'num_leaves': 10}

Best score:  0.9066666666666666


In [33]:
best_params = dict(gbm.best_params_)  # copy, so gbm.best_params_ is not modified below
best_params['boosting_type'] = 'gbdt'
best_params['objective'] = 'binary'
best_params

{'bagging_fraction': 0.9,
 'bagging_freq': 3,
 'feature_fraction': 0.8,
 'learning_rate': 0.01,
 'min_child_samples': 40,
 'n_estimators': 100,
 'num_leaves': 10,
 'boosting_type': 'gbdt',
 'objective': 'binary'}

In [34]:
# https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
gbm_tuned = lgb.LGBMClassifier(
    bagging_fraction = 0.9,
    bagging_freq = 3,
    feature_fraction = 0.8,
    learning_rate = 0.01,
    min_child_samples = 40,
    n_estimators = 100,
    num_leaves = 10,
    objective = 'binary'
)

In [35]:
gbm_tuned.fit(X_train, y_train)

LGBMClassifier(bagging_fraction=0.9, bagging_freq=3, boosting_type='gbdt',
               class_weight=None, colsample_bytree=1.0, feature_fraction=0.8,
               importance_type='split', learning_rate=0.01, max_depth=-1,
               min_child_samples=40, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=10, objective='binary',
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
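
The repr above shows both `bagging_fraction=0.9` and `subsample=1.0`. This is because `bagging_fraction`, `bagging_freq`, and `feature_fraction` are LightGBM's native names, which its parameter documentation lists as aliases of the scikit-learn-style constructor arguments `subsample`, `subsample_freq`, and `colsample_bytree`. One way to avoid carrying two spellings of the same setting is to rename the keys before building the estimator. The sketch below is plain Python; `SKLEARN_NAMES` and `to_sklearn_names` are hypothetical helpers, not part of LightGBM:

```python
# Hypothetical helper (not a LightGBM API): rename LightGBM-native keys to the
# scikit-learn-style argument names of LGBMClassifier.
# Alias pairs as listed in LightGBM's parameter documentation.
SKLEARN_NAMES = {
    'feature_fraction': 'colsample_bytree',
    'bagging_fraction': 'subsample',
    'bagging_freq': 'subsample_freq',
}

def to_sklearn_names(params):
    """Return a copy of params with native LightGBM keys renamed."""
    return {SKLEARN_NAMES.get(key, key): value for key, value in params.items()}

# Best parameters reported by the grid search above
best_params = {
    'bagging_fraction': 0.9,
    'bagging_freq': 3,
    'feature_fraction': 0.8,
    'learning_rate': 0.01,
    'min_child_samples': 40,
    'n_estimators': 100,
    'num_leaves': 10,
}

sk_params = to_sklearn_names(best_params)
print(sk_params)
```

The resulting dict could then be unpacked into the constructor, for example `lgb.LGBMClassifier(objective='binary', **sk_params)`, instead of typing each value by hand.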

In [41]:
# Predicting the Test set results
y_pred = gbm_tuned.predict(X_test)

In [44]:
# Evaluate
from sklearn.metrics import accuracy_score, confusion_matrix
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Applying Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = gbm_tuned, X = X_train, y = y_train, cv = 10)

print("\nCrossValidation : cv = 10")
print("mean", accuracies.mean())
print("std", accuracies.std())

# Classification report
print("\n", classification_report(y_test, y_pred))

Accuracy: 0.94
Confusion Matrix:
 [[64  4]
 [ 2 30]]

CrossValidation : cv = 10
mean 0.9066666666666666
std 0.05925462944877058

               precision    recall  f1-score   support

           0       0.97      0.94      0.96        68
           1       0.88      0.94      0.91        32

    accuracy                           0.94       100
   macro avg       0.93      0.94      0.93       100
weighted avg       0.94      0.94      0.94       100
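
The accuracy and the class-1 row of the report can be derived by hand from the confusion matrix, which helps when reading these outputs. A minimal sketch in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[64, 4],
      [2, 30]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision_1 = tp / (tp + fp)                 # of predicted 1s, how many are truly 1
recall_1 = tp / (tp + fn)                    # of true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

After rounding to two decimals, these values agree with the accuracy line and the class-1 row of the classification report.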



| Model / Metric | Mean | Std. Dev. |
| --- | --- | --- |
| XGBoost | 0.9033333333333333 | 0.06403124237432847 |
| LightGBM | 0.9066666666666666 | 0.05925462944877058 |

## Without Tuning

In [47]:
gbm_not_tuned = lgb.LGBMClassifier()

gbm_not_tuned.fit(X_train, y_train)

y_pred = gbm_not_tuned.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = gbm_not_tuned, X = X_train, y = y_train, cv = 10)

print()
print("CrossValidation : cv = 10")
print("mean", accuracies.mean())
print("std", accuracies.std())

print("\n", classification_report(y_test, y_pred))

Accuracy: 0.91
Confusion Matrix:
 [[64  4]
 [ 5 27]]

CrossValidation : cv = 10
mean 0.89
std 0.06674994798166929

               precision    recall  f1-score   support

           0       0.93      0.94      0.93        68
           1       0.87      0.84      0.86        32

    accuracy                           0.91       100
   macro avg       0.90      0.89      0.90       100
weighted avg       0.91      0.91      0.91       100
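
To put the cross-validation numbers in perspective, the gap between mean accuracies can be compared with the fold-to-fold standard deviation. The sketch below is a rough rule-of-thumb check, not a formal significance test. The figures are copied from the outputs and the comparison table above, and the XGBoost row is assumed to come from the same 10-fold setup:

```python
# Cross-validation accuracy (mean, std) copied from the outputs and table above.
# Assumption: the XGBoost figures were obtained with the same 10-fold setup.
cv_results = {
    'XGBoost': (0.9033333333333333, 0.06403124237432847),
    'LightGBM (tuned)': (0.9066666666666666, 0.05925462944877058),
    'LightGBM (untuned)': (0.89, 0.06674994798166929),
}

# Model with the highest mean accuracy
best_name = max(cv_results, key=lambda name: cv_results[name][0])
best_mean, best_std = cv_results[best_name]

for name, (mean, std) in cv_results.items():
    gap = best_mean - mean
    # A gap smaller than the fold-to-fold std is weak evidence of a real difference
    print(f"{name:20s} mean={mean:.4f} gap_to_best={gap:.4f} "
          f"within_one_std={gap < best_std}")
```

Under this check, every gap is smaller than one standard deviation, so the ranking between the three models should be read with caution on a dataset this small.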

