# LightGBM

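## LightGBM pros and cons

- Training takes far less time than with XGBoost (XGB), and memory usage is also comparatively low.
- Categorical features are converted automatically, and predictive performance is not much different from XGB.
- When the number of samples is small, the model overfits easily.
- What counts as "small" is ambiguous, but the figure attributed to the official LightGBM documentation is 10,000 rows or fewer.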
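
Since small datasets are where overfitting is most likely, one common first step is to constrain tree complexity. The dictionary below is only a sketch: the keys are standard `LGBMClassifier` constructor arguments, but every value is a hypothetical, untuned starting point chosen for illustration.

```python
# Hypothetical starting values for a small dataset; illustrative, not tuned.
small_data_params = {
    "n_estimators": 400,
    "learning_rate": 0.05,
    "num_leaves": 15,         # fewer leaves -> simpler trees
    "max_depth": 5,           # cap on tree depth
    "min_child_samples": 20,  # minimum number of samples required in a leaf
    "reg_lambda": 1.0,        # L2 regularization strength
}

# Keep num_leaves below 2 ** max_depth so that the depth cap stays meaningful.
leaves_fit_depth = small_data_params["num_leaves"] < 2 ** small_data_params["max_depth"]
print(leaves_fit_depth)

# Usage (requires lightgbm): LGBMClassifier(**small_data_params)
```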

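## Characteristics of tree-based algorithms

- Most existing tree-based algorithms use level-wise (balanced) tree splitting, which effectively keeps the tree depth small.
- Because the tree is split while staying as balanced as possible, depth is minimized and the structure is more robust to overfitting.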

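## LightGBM characteristics

- Unlike the tree-splitting method of typical GBM-family models, LightGBM uses leaf-wise tree splitting.
- Rather than balancing the tree, it keeps splitting the leaf node with the largest loss reduction (max delta loss), which produces a deeper, asymmetric tree.
- The design idea behind LightGBM is that, as training is repeated, a tree grown this way can ultimately reduce prediction error loss more than level-wise splitting.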
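
The difference between the two growth strategies can be illustrated with a toy model. The sketch below is not LightGBM's implementation: `split_gain` is a made-up function that assigns a hypothetical loss reduction to each node (left children keep 80% of the parent's gain, right children 20%), and the two functions only decide the order in which leaves are split under the same split budget.

```python
import heapq

def split_gain(path):
    """Hypothetical loss reduction from splitting the node reached by `path`."""
    gain = 10.0
    for step in path:
        gain *= 0.8 if step == "L" else 0.2
    return gain

def grow_level_wise(n_splits):
    """Split leaves breadth-first, filling one level before going deeper."""
    queue, total, depth = [""], 0.0, 0
    for _ in range(n_splits):
        path = queue.pop(0)
        total += split_gain(path)
        depth = max(depth, len(path) + 1)
        queue += [path + "L", path + "R"]
    return total, depth

def grow_leaf_wise(n_splits):
    """Always split the current leaf with the largest gain."""
    heap, total, depth = [(-split_gain(""), "")], 0.0, 0
    for _ in range(n_splits):
        neg_gain, path = heapq.heappop(heap)
        total -= neg_gain
        depth = max(depth, len(path) + 1)
        for child in (path + "L", path + "R"):
            heapq.heappush(heap, (-split_gain(child), child))
    return total, depth

level_total, level_depth = grow_level_wise(3)
leaf_total, leaf_depth = grow_leaf_wise(3)
print(f"level-wise: gain={level_total:.1f}, depth={level_depth}")  # gain=20.0, depth=2
print(f"leaf-wise:  gain={leaf_total:.1f}, depth={leaf_depth}")    # gain=24.4, depth=3
```

With three splits each, the leaf-wise tree removes more of the toy loss but ends up one level deeper, which mirrors the trade-off described above.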

# Required libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
from IPython.display import Image

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

mpl.rc('font', family = 'D2coding')
mpl.rc('axes', unicode_minus = False)

sns.set_style('darkgrid')
plt.rc('figure', figsize = (10, 8))

warnings.filterwarnings('ignore')

# Loading the data and splitting it into training and test sets

In [3]:
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [4]:
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)
print(y.shape)

(569, 30)
(569,)


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 156)

In [7]:
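evals = [(X_test, y_test)]

# Train with early stopping on the evaluation set.
# Note: LightGBM 4.0+ removed early_stopping_rounds and verbose from fit();
# there, pass callbacks=[lightgbm.early_stopping(100), lightgbm.log_evaluation(1)] instead.
lgbm = LGBMClassifier(n_estimators = 4000)
lgbm.fit(X_train, y_train, early_stopping_rounds = 100, eval_metric = 'logloss',
        eval_set = evals, verbose = True)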

[1]	valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 100 rounds
[2]	valid_0's binary_logloss: 0.507451
[3]	valid_0's binary_logloss: 0.458489
[4]	valid_0's binary_logloss: 0.417481
[5]	valid_0's binary_logloss: 0.385507
[6]	valid_0's binary_logloss: 0.355773
[7]	valid_0's binary_logloss: 0.329587
[8]	valid_0's binary_logloss: 0.308478
[9]	valid_0's binary_logloss: 0.285395
[10]	valid_0's binary_logloss: 0.267055
[11]	valid_0's binary_logloss: 0.252013
[12]	valid_0's binary_logloss: 0.237018
[13]	valid_0's binary_logloss: 0.224756
[14]	valid_0's binary_logloss: 0.213383
[15]	valid_0's binary_logloss: 0.203058
[16]	valid_0's binary_logloss: 0.194015
[17]	valid_0's binary_logloss: 0.186412
[18]	valid_0's binary_logloss: 0.179108
[19]	valid_0's binary_logloss: 0.174004
[20]	valid_0's binary_logloss: 0.167155
[21]	valid_0's binary_logloss: 0.162494
[22]	valid_0's binary_logloss: 0.156886
[23]	valid_0's binary_logloss: 0.152855
[24]	valid_0's binary_loglo

LGBMClassifier(n_estimators=4000)
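
The early-stopping rule used in the training call can be sketched in plain Python. This is a simplified illustration rather than LightGBM's own code: the function name and the validation losses are hypothetical, and the rule assumed here is "stop once `patience` rounds have passed without beating the best validation loss".

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round), both 1-based.

    Training stops once `patience` rounds have passed without
    improving on the best validation loss seen so far.
    """
    best_loss, best_round = float("inf"), 0
    for round_no, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_no
        elif round_no - best_round >= patience:
            return best_round, round_no
    # Never triggered: training ran through every round.
    return best_round, len(val_losses)

# Hypothetical validation losses: improve for 4 rounds, then drift upward.
losses = [0.56, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.40]

print(early_stopping_round(losses, patience=2))   # (4, 6)
print(early_stopping_round(losses, patience=10))  # (4, 8)
```

A small patience stops soon after the best round, while a patience larger than the remaining rounds never triggers and training simply runs to the end.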

In [8]:
def get_clf_eval(y_test, pred=None, pred_proba=None):
    # f1_score was missing from the original import list
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix, roc_auc_score)

    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)

    print('Confusion matrix')
    print(confusion)

    print('Accuracy: {0:.4f}, Precision: {1:.4f}, '
          'Recall: {2:.4f}, F1: {3:.4f}, AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, roc_auc_score, f1_score

In [14]:
pred = lgbm.predict(X_test)

In [31]:
# Predicted probability of the positive class (kept 1-D, as sklearn metrics expect)
pred_proba_po = lgbm.predict_proba(X_test)[:, 1]

In [18]:
confusion = confusion_matrix(y_test, pred)
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

print('Confusion matrix')
print(confusion)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, '
      'Recall: {2:.4f}, F1: {3:.4f}'.format(accuracy, precision, recall, f1))

Confusion matrix
[[33  4]
 [ 1 76]]
Accuracy: 0.9561, Precision: 0.9500, Recall: 0.9870, F1: 0.9682


In [19]:
# Accuracy
accuracy_score(y_test, pred)

0.956140350877193

In [22]:
# Precision
precision = precision_score(y_test, pred)
precision

0.95

In [24]:
# Recall
recall = recall_score(y_test, pred)
recall

0.987012987012987

In [25]:
# F1 score
f1 = f1_score(y_test, pred)
f1

0.9681528662420381
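
As a cross-check, the four scores can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes sklearn's layout for `confusion_matrix` (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`).

```python
# Counts taken from the confusion matrix [[33, 4], [1, 76]].
tn, fp = 33, 4
fn, tp = 1, 76

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.9561 0.9500 0.9870 0.9682
```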

In [34]:
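# Include the training set so that both training and validation loss are logged
evals = [(X_train, y_train), (X_test, y_test)]

# Train
lgbm = LGBMClassifier(n_estimators = 4000)
lgbm.fit(X_train, y_train, early_stopping_rounds = 100, eval_metric = 'logloss',
        eval_set = evals, verbose = True)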

[1]	training's binary_logloss: 0.585834	valid_1's binary_logloss: 0.565079
Training until validation scores don't improve for 100 rounds
[2]	training's binary_logloss: 0.522733	valid_1's binary_logloss: 0.507451
[3]	training's binary_logloss: 0.468843	valid_1's binary_logloss: 0.458489
[4]	training's binary_logloss: 0.424905	valid_1's binary_logloss: 0.417481
[5]	training's binary_logloss: 0.386683	valid_1's binary_logloss: 0.385507
[6]	training's binary_logloss: 0.350279	valid_1's binary_logloss: 0.355773
[7]	training's binary_logloss: 0.318714	valid_1's binary_logloss: 0.329587
[8]	training's binary_logloss: 0.293358	valid_1's binary_logloss: 0.308478
[9]	training's binary_logloss: 0.267737	valid_1's binary_logloss: 0.285395
[10]	training's binary_logloss: 0.24655	valid_1's binary_logloss: 0.267055
[11]	training's binary_logloss: 0.228594	valid_1's binary_logloss: 0.252013
[12]	training's binary_logloss: 0.211667	valid_1's binary_logloss: 0.237018
[13]	training's binary_logloss: 0.19

[124]	training's binary_logloss: 9.81886e-05	valid_1's binary_logloss: 0.177149
[125]	training's binary_logloss: 9.06258e-05	valid_1's binary_logloss: 0.179171
[126]	training's binary_logloss: 8.29418e-05	valid_1's binary_logloss: 0.180948
[127]	training's binary_logloss: 7.78771e-05	valid_1's binary_logloss: 0.183861
[128]	training's binary_logloss: 7.20705e-05	valid_1's binary_logloss: 0.187579
[129]	training's binary_logloss: 6.67089e-05	valid_1's binary_logloss: 0.188122
[130]	training's binary_logloss: 6.1535e-05	valid_1's binary_logloss: 0.1857
[131]	training's binary_logloss: 5.71133e-05	valid_1's binary_logloss: 0.187442
[132]	training's binary_logloss: 5.36521e-05	valid_1's binary_logloss: 0.188578
[133]	training's binary_logloss: 5.03417e-05	valid_1's binary_logloss: 0.189729
[134]	training's binary_logloss: 4.75967e-05	valid_1's binary_logloss: 0.187313
[135]	training's binary_logloss: 4.4469e-05	valid_1's binary_logloss: 0.189279
[136]	training's binary_logloss: 4.13393e-05

LGBMClassifier(n_estimators=4000)
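
The tail of this log shows a typical overfitting pattern: training loss keeps shrinking toward zero while validation loss stops improving. The short sketch below makes that visible using a few rows copied from the log excerpt; because the full log is truncated, `best_round` here refers only to these sampled rows, not to the whole run.

```python
# (round, training loss, validation loss) copied from the log excerpt above
log = [
    (124, 9.81886e-05, 0.177149),
    (127, 7.78771e-05, 0.183861),
    (130, 6.1535e-05, 0.1857),
    (133, 5.03417e-05, 0.189729),
    (135, 4.4469e-05, 0.189279),
]

# Training loss falls at every sampled step...
train_falling = all(a[1] > b[1] for a, b in zip(log, log[1:]))
# ...but the lowest validation loss is at the earliest sampled round.
best_round = min(log, key=lambda row: row[2])[0]

for rnd, train, valid in log:
    print(f"round {rnd}: gap = {valid - train:.4f}")
print(train_falling, best_round)
```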

In [27]:
evals = [(X_test, y_test)]

# Train
lgbm = LGBMClassifier(n_estimators = 400)
lgbm.fit(X_train, y_train, early_stopping_rounds = 50, eval_metric = 'logloss',
        eval_set = evals, verbose = True)

[1]	valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 50 rounds
[2]	valid_0's binary_logloss: 0.507451
[3]	valid_0's binary_logloss: 0.458489
[4]	valid_0's binary_logloss: 0.417481
[5]	valid_0's binary_logloss: 0.385507
[6]	valid_0's binary_logloss: 0.355773
[7]	valid_0's binary_logloss: 0.329587
[8]	valid_0's binary_logloss: 0.308478
[9]	valid_0's binary_logloss: 0.285395
[10]	valid_0's binary_logloss: 0.267055
[11]	valid_0's binary_logloss: 0.252013
[12]	valid_0's binary_logloss: 0.237018
[13]	valid_0's binary_logloss: 0.224756
[14]	valid_0's binary_logloss: 0.213383
[15]	valid_0's binary_logloss: 0.203058
[16]	valid_0's binary_logloss: 0.194015
[17]	valid_0's binary_logloss: 0.186412
[18]	valid_0's binary_logloss: 0.179108
[19]	valid_0's binary_logloss: 0.174004
[20]	valid_0's binary_logloss: 0.167155
[21]	valid_0's binary_logloss: 0.162494
[22]	valid_0's binary_logloss: 0.156886
[23]	valid_0's binary_logloss: 0.152855
[24]	valid_0's binary_loglos

LGBMClassifier(n_estimators=400)

In [28]:
evals = [(X_test, y_test)]

# Train
lgbm = LGBMClassifier(n_estimators = 4000)
lgbm.fit(X_train, y_train, early_stopping_rounds = 1000, eval_metric = 'logloss',
        eval_set = evals, verbose = True)

[1]	valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 1000 rounds
[2]	valid_0's binary_logloss: 0.507451
[3]	valid_0's binary_logloss: 0.458489
[4]	valid_0's binary_logloss: 0.417481
[5]	valid_0's binary_logloss: 0.385507
[6]	valid_0's binary_logloss: 0.355773
[7]	valid_0's binary_logloss: 0.329587
[8]	valid_0's binary_logloss: 0.308478
[9]	valid_0's binary_logloss: 0.285395
[10]	valid_0's binary_logloss: 0.267055
[11]	valid_0's binary_logloss: 0.252013
[12]	valid_0's binary_logloss: 0.237018
[13]	valid_0's binary_logloss: 0.224756
[14]	valid_0's binary_logloss: 0.213383
[15]	valid_0's binary_logloss: 0.203058
[16]	valid_0's binary_logloss: 0.194015
[17]	valid_0's binary_logloss: 0.186412
[18]	valid_0's binary_logloss: 0.179108
[19]	valid_0's binary_logloss: 0.174004
[20]	valid_0's binary_logloss: 0.167155
[21]	valid_0's binary_logloss: 0.162494
[22]	valid_0's binary_logloss: 0.156886
[23]	valid_0's binary_logloss: 0.152855
[24]	valid_0's binary_logl

[682]	valid_0's binary_logloss: 0.198244
[683]	valid_0's binary_logloss: 0.198244
[684]	valid_0's binary_logloss: 0.198244
[685]	valid_0's binary_logloss: 0.198244
[686]	valid_0's binary_logloss: 0.198244
[687]	valid_0's binary_logloss: 0.198244
[688]	valid_0's binary_logloss: 0.198244
[689]	valid_0's binary_logloss: 0.198244
[690]	valid_0's binary_logloss: 0.198244
[691]	valid_0's binary_logloss: 0.198244
[692]	valid_0's binary_logloss: 0.198244
[693]	valid_0's binary_logloss: 0.198244
[694]	valid_0's binary_logloss: 0.198244
[695]	valid_0's binary_logloss: 0.198244
[696]	valid_0's binary_logloss: 0.198244
[697]	valid_0's binary_logloss: 0.198244
[698]	valid_0's binary_logloss: 0.198244
[699]	valid_0's binary_logloss: 0.198244
[700]	valid_0's binary_logloss: 0.198244
[701]	valid_0's binary_logloss: 0.198244
[702]	valid_0's binary_logloss: 0.198244
[703]	valid_0's binary_logloss: 0.198244
[704]	valid_0's binary_logloss: 0.198244
[705]	valid_0's binary_logloss: 0.198244
[706]	valid_0's 

LGBMClassifier(n_estimators=4000)

In [32]:
from sklearn.metrics import precision_recall_curve

def precision_recall_curve_plot(y_test, pred_proba_c1):
    # Precision and recall at each candidate decision threshold
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)

    plt.figure(figsize = (8, 6))
    # precisions and recalls have one more element than thresholds, so trim them to match
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle = '--', label = 'precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label = 'recall')

    # Put an x-axis tick every 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))

    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()

In [33]:
precision_recall_curve_plot(y_test, pred_proba_po)
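
The trade-off that the precision/recall plot is meant to show can also be seen on a tiny example. The sketch below is dependency-free and uses hypothetical labels and probabilities, not the breast cancer data; `precision_recall_at` is an illustrative helper, not an sklearn function.

```python
def precision_recall_at(y_true, proba, threshold):
    """Precision and recall when predicting positive for proba >= threshold."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.7, 0.55]

for threshold in (0.3, 0.5, 0.75):
    p, r = precision_recall_at(y_true, proba, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.71, recall=1.00
# threshold=0.5: precision=0.80, recall=0.80
# threshold=0.75: precision=1.00, recall=0.40
```

Raising the threshold makes the classifier more selective, so precision rises while recall falls.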