# Porto Seguro's Safe Driver Prediction - Final Model Building

Random Forest and LightGBM are each tuned with a hyperparameter search, and whichever of the two scores better is used as the final model.

## Load libraries & data

In [7]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns',100)
from numba import jit

from hyperopt import hp, tpe
from hyperopt.fmin import fmin

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score

import xgboost as xgb

import lightgbm as lgbm

In [8]:
df = pd.read_csv("train.csv")

X = df.drop(['id','target'],axis=1)
y = df['target']

The evaluation metric, the Normalized Gini Coefficient, is computed by the functions below.

In [9]:
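def gini(true, pred):
    # Columns: label, prediction, original position (used to break ties)
    g = np.asarray(np.c_[true, pred, np.arange(len(true))], dtype=float)  # np.float was removed in NumPy 1.24
    # Sort by prediction (descending), then by original position
    g = g[np.lexsort((g[:, 2], -1 * g[:, 1]))]
    gs = g[:, 0].cumsum().sum() / g[:, 0].sum()
    gs -= (len(true) + 1) / 2.
    return gs / len(true)

def gini_xgb(pred, true):
    # XGBoost custom metric: `true` is a DMatrix; the score is negated so that lower is better
    true = true.get_label()
    return 'gini', -1.0 * gini(true, pred) / gini(true, true)

def gini_lgb(true, pred):
    # LightGBM custom metric: (name, value, is_higher_better)
    score = gini(true, pred) / gini(true, true)
    return 'gini', score, True

def gini_sklearn(true, pred):
    # Normalized Gini: the model's Gini divided by the Gini of a perfect ranking
    return gini(true, pred) / gini(true, true)

# Newer scikit-learn releases replace needs_proba=True with response_method='predict_proba'
gini_scorer = make_scorer(gini_sklearn, greater_is_better=True, needs_proba=True)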
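
To make the metric concrete, here is a minimal sketch that restates `gini` with plain Python lists and evaluates it on a five-row toy example. The toy labels and predictions, and the helper names `normalized_gini` and `pairwise_auc`, are illustrative assumptions and not part of the notebook. For binary labels with no tied predictions, the normalized Gini should equal `2 * AUC - 1`, which the sketch checks against a brute-force AUC.

```python
def gini(true, pred):
    # Same computation as the NumPy version above, written with plain lists:
    # sort by prediction (descending) and break ties by original position.
    order = sorted(range(len(true)), key=lambda i: (-pred[i], i))
    running, cum_total = 0.0, 0.0
    for i in order:
        running += true[i]      # cumulative count of positives so far
        cum_total += running    # sum of the cumulative sums
    n = len(true)
    return (cum_total / sum(true) - (n + 1) / 2.0) / n


def normalized_gini(true, pred):
    # Divide by the Gini of a perfect ranking (the labels themselves)
    return gini(true, pred) / gini(true, true)


def pairwise_auc(true, pred):
    # Brute-force AUC: share of (positive, negative) pairs ranked correctly
    pos = [p for t, p in zip(true, pred) if t == 1]
    neg = [p for t, p in zip(true, pred) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only
y_true = [0, 0, 1, 0, 1]
y_pred = [0.1, 0.4, 0.35, 0.2, 0.8]

score = normalized_gini(y_true, y_pred)
auc = pairwise_auc(y_true, y_pred)
print(score, 2 * auc - 1)
```

A perfect ranking gives a normalized Gini of 1 and a random ranking gives a value near 0, so the cross-validated scores of roughly 0.20 to 0.27 reported later sit much closer to random than to perfect on this scale.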

# Tuning Random Forest

- The random forest is tuned with Hyperopt.

The following hyperparameters are the most important to tune:
- Number of trees (n_estimators)
- Tree complexity (max_depth)

In [17]:
def objective(params):
    # Hyperopt samples floats, so cast to the integers scikit-learn expects
    params = {'n_estimators': int(params['n_estimators']),
              'max_depth': int(params['max_depth'])}
    clf = RandomForestClassifier(class_weight='balanced', **params)
    score = cross_val_score(clf, X, y, scoring=gini_scorer, cv=StratifiedKFold()).mean()
    print("Gini {:.3f} params {}".format(score, params))
    # NOTE: fmin minimizes the returned value, so returning the Gini itself makes
    # the search favor the lowest Gini; return -score to search for the highest.
    return score

# hp.quniform(label, low, high, q) returns a value like round(uniform(low, high) / q) * q
space = {
    'n_estimators': hp.quniform('n_estimators', 25, 500, 25),
    'max_depth': hp.quniform('max_depth', 1, 10, 1)
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10)

Gini 0.245 params {'n_estimators': 325, 'max_depth': 10}
 10%|████▊                                           | 1/10 [18:47<2:49:03, 1127.10s/it, best loss: 0.2451870161500748]

Gini 0.254 params {'n_estimators': 100, 'max_depth': 8}
 20%|█████████▊                                       | 2/10 [23:30<1:56:32, 874.04s/it, best loss: 0.2451870161500748]

Gini 0.201 params {'n_estimators': 25, 'max_depth': 1}
 30%|██████████████▍                                 | 3/10 [23:49<1:12:01, 617.39s/it, best loss: 0.20143342662413213]

Gini 0.252 params {'n_estimators': 100, 'max_depth': 7}
 40%|████████████████████                              | 4/10 [28:03<50:50, 508.45s/it, best loss: 0.20143342662413213]

Gini 0.244 params {'n_estimators': 225, 'max_depth': 10}
 50%|█████████████████████████                         | 5/10 [40:56<48:59, 587.83s/it, best loss: 0.20143342662413213]

Gini 0.237 params {'n_estimators': 225, 'max_depth': 3}
 60%|██████████████████████████████                    | 6/10 [45:48<33:16, 499.15s/it, best loss: 0.20143342662413213]

Gini 0.248 params {'n_estimators': 250, 'max_depth': 5}
 70%|███████████████████████████████████               | 7/10 [53:53<24:44, 494.89s/it, best loss: 0.20143342662413213]

Gini 0.237 params {'n_estimators': 400, 'max_depth': 3}
 80%|██████████████████████████████████████▍         | 8/10 [1:02:29<16:42, 501.29s/it, best loss: 0.20143342662413213]

Gini 0.237 params {'n_estimators': 25, 'max_depth': 9}
 90%|███████████████████████████████████████████▏    | 9/10 [1:03:51<06:15, 375.29s/it, best loss: 0.20143342662413213]

Gini 0.212 params {'n_estimators': 225, 'max_depth': 1}
100%|███████████████████████████████████████████████| 10/10 [1:06:21<00:00, 307.82s/it, best loss: 0.20143342662413213]


In [18]:
print("Hyperopt estimated optimum {}" .format(best))

Hyperopt estimated optimum {'max_depth': 1.0, 'n_estimators': 25.0}
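
The reported optimum (`max_depth=1`, `n_estimators=25`) is the trial with the lowest Gini in the log, not the highest. `fmin` minimizes whatever the objective returns, and the objective returns the Gini itself. The sketch below replays the ten logged trials with a plain `min` to show what a minimizer picks in each case. The tuples are copied from the log above; the variable names are illustrative.

```python
# (n_estimators, max_depth, mean CV Gini) copied from the tuning log above
trials = [
    (325, 10, 0.245), (100, 8, 0.254), (25, 1, 0.201), (100, 7, 0.252),
    (225, 10, 0.244), (225, 3, 0.237), (250, 5, 0.248), (400, 3, 0.237),
    (25, 9, 0.237), (225, 1, 0.212),
]

# What a minimizer selects when the objective returns the Gini itself
picked_with_raw_score = min(trials, key=lambda t: t[2])

# What it selects when the objective returns the negated Gini
picked_with_negated_score = min(trials, key=lambda t: -t[2])

print("minimizing Gini:  ", picked_with_raw_score)
print("minimizing -Gini: ", picked_with_negated_score)
```

Under this reading, returning `-score` from `objective` would have steered the search toward configurations like `n_estimators=100, max_depth=8`, the highest-scoring trial in this log. A rerun with the negated objective would sample different trials, so the log cannot say what its final optimum would be.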


# Tune LightGBM

In [20]:
def objective(params):
    params = {
        # Hyperopt samples floats; num_leaves must be an integer
        'num_leaves': int(params['num_leaves']),
        # Keep colsample_bytree numeric (rounded) rather than formatting it as a string
        'colsample_bytree': round(params['colsample_bytree'], 3)
    }

    clf = lgbm.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.01,
        **params
    )

    score = cross_val_score(clf, X, y, scoring=gini_scorer, cv=StratifiedKFold()).mean()
    print("Gini: {:.3f} , Params: {}".format(score, params))
    # NOTE: fmin minimizes the returned value; return -score to search for the highest Gini
    return score

space = {
    'num_leaves': hp.quniform('num_leaves', 8, 128, 2),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0)
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10)

  0%|                                                                             | 0/10 [00:00<?, ?it/s, best loss: ?]

KeyboardInterrupt: 

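Running the tuning above gives the following results. Because `fmin` minimizes the value returned by `objective`, and `objective` returns the Gini itself, each reported optimum is the lowest-scoring configuration among those sampled, not the highest. The Random Forest log shows this: 0.201 is the smallest Gini of the ten trials.

**RF**
- Hyperopt estimated optimum {'max_depth': 1.0, 'n_estimators': 25.0}

**LightGBM**
- Hyperopt estimated Optimum {'colsample_bytree': 0.9745801169679305, 'num_leaves': 10.0}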
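
Hyperopt reports every value as a float, while `n_estimators`, `max_depth`, and `num_leaves` have to be integers when the models are built. A minimal sketch of that conversion follows; the helper `to_model_kwargs` and the set `INTEGER_PARAMS` are hypothetical names introduced here, and the dictionaries are the two optima reported above.

```python
# Optima exactly as reported by Hyperopt above
best_rf = {'max_depth': 1.0, 'n_estimators': 25.0}
best_lgbm = {'colsample_bytree': 0.9745801169679305, 'num_leaves': 10.0}

# Hypothetical helper set: parameters that must be passed as integers
INTEGER_PARAMS = {'max_depth', 'n_estimators', 'num_leaves'}


def to_model_kwargs(best):
    """Cast integer-valued parameters and leave the others untouched."""
    return {k: int(v) if k in INTEGER_PARAMS else v for k, v in best.items()}


rf_kwargs = to_model_kwargs(best_rf)
lgbm_kwargs = to_model_kwargs(best_lgbm)
print(rf_kwargs)
print(lgbm_kwargs)
```

The resulting dictionaries could be unpacked into the constructors, for example `RandomForestClassifier(**rf_kwargs)`, instead of retyping the values by hand.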

## Fit the model

In [22]:
rf_model = RandomForestClassifier(
    n_jobs=4,
    class_weight='balanced',
    n_estimators=25,
    max_depth=1
)

lgbm_model = lgbm.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.01,
    num_leaves=10,
    colsample_bytree=0.97458
)

In [23]:
models = [
    ('Random Forest',rf_model),
    ('LightGBM', lgbm_model)
]

for label, model in models:
    scores = cross_val_score(model, X, y, cv=StratifiedKFold(), scoring=gini_scorer)
    print("Gini coefficient: %0.4f (+/- %0.4f) [%s]" %(scores.mean(),scores.std(), label))



Gini coefficient: 0.2049 (+/- 0.0048) [Random Forest]


Gini coefficient: 0.2655 (+/- 0.0019) [XGBoost]


Gini coefficient: 0.2729 (+/- 0.0022) [LightGBM]
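
The stated plan is to keep whichever model scores better. As a small sketch of that final step, the reported means and standard deviations can be compared directly. The numbers are copied from the output above; the variable names are illustrative, and measuring the gap in units of the standard deviation is only a rough sanity check, not a formal test.

```python
# label -> (mean Gini, std across folds), copied from the output above
results = {
    'Random Forest': (0.2049, 0.0048),
    'XGBoost': (0.2655, 0.0019),
    'LightGBM': (0.2729, 0.0022),
}

# Rank the models by mean cross-validated Gini, best first
ranking = sorted(results, key=lambda label: results[label][0], reverse=True)
best_label, runner_up = ranking[0], ranking[1]

# Gap between the top two, expressed in units of the winner's fold-to-fold std
gap = results[best_label][0] - results[runner_up][0]
gap_in_stds = gap / results[best_label][1]

print(best_label, runner_up, round(gap, 4), round(gap_in_stds, 2))
```

On these numbers LightGBM ranks first, which matches the plan of using the better-scoring model as the final one.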
