# Basic Tabular Mining Tips

---
In this short tutorial, we demonstrate how to perform a simple data mining task on the "Kicked!" dataset.

We will cover:

1. Data preprocessing with pandas.
2. Some basic ideas in feature engineering.
3. Basic parameter tuning for tree models.

In [None]:
!pip install lightgbm xgboost catboost category-encoders scikit-learn pandas==1.1.5

In [36]:
!pip install hyperopt

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/


In [9]:
import pandas as pd
import numpy as np

train = pd.read_csv('./final/train_final.csv', engine='python')
test = pd.read_csv('./final/test_final.csv', engine='python')

x_train = train.drop('loan_status', axis=1)
x_test = test.drop('loan_status', axis=1)

y_train = train['loan_status']
y_test = test['loan_status']

In [10]:
TRAIN_IDX = x_train.shape[0]
TEST_IDX = TRAIN_IDX + x_test.shape[0]

In [11]:
x = pd.concat([x_train, x_test], axis=0)
y = pd.concat([y_train, y_test], axis=0)

data = pd.concat([x, y], axis=1)
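
The cells above stack the train and test frames so that preprocessing can be applied once, and keep `TRAIN_IDX` so the two parts can be separated again later. A minimal sketch of that round trip on toy frames follows; the columns `f` and `label` and the `toy_*` names are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical columns).
toy_train = pd.DataFrame({"f": [1.0, 2.0, 3.0], "label": [1, 0, 1]})
toy_test = pd.DataFrame({"f": [4.0, 5.0], "label": [0, 1]})

n_train = toy_train.shape[0]

# Stack rows; each frame keeps its own index, so labels 0 and 1 appear twice.
stacked = pd.concat([toy_train, toy_test], axis=0)

# Positional slicing recovers the two parts regardless of duplicate labels.
train_part = stacked.iloc[:n_train]
test_part = stacked.iloc[n_train:]

print(stacked.index.is_unique)       # False
print(train_part.equals(toy_train))  # True
print(test_part.equals(toy_test))    # True
```

Because the stacked index contains duplicates, positional access with `.iloc` is the safer way to split; passing `ignore_index=True` to `pd.concat` is an alternative if unique labels are needed.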

In [12]:
data.columns.to_list()

['continuous_annual_inc',
 'continuous_annual_inc_joint',
 'continuous_delinq_2yrs',
 'continuous_dti',
 'continuous_dti_joint',
 'continuous_fico_range_high',
 'continuous_fico_range_low',
 'continuous_funded_amnt',
 'continuous_funded_amnt_inv',
 'continuous_inq_last_6mths',
 'continuous_installment',
 'continuous_int_rate',
 'continuous_last_fico_range_high',
 'continuous_last_fico_range_low',
 'continuous_loan_amnt',
 'continuous_mths_since_last_delinq',
 'continuous_mths_since_last_major_derog',
 'continuous_mths_since_last_record',
 'continuous_open_acc',
 'continuous_pub_rec',
 'discrete_addr_state_1_one_hot',
 'discrete_addr_state_2_one_hot',
 'discrete_addr_state_3_one_hot',
 'discrete_addr_state_4_one_hot',
 'discrete_addr_state_5_one_hot',
 'discrete_addr_state_6_one_hot',
 'discrete_addr_state_7_one_hot',
 'discrete_addr_state_8_one_hot',
 'discrete_addr_state_9_one_hot',
 'discrete_addr_state_10_one_hot',
 'discrete_addr_state_11_one_hot',
 'discrete_addr_state_12_one_hot'

## Basic Data Manipulation

Let us see how to do some basic data preprocessing, starting with a look at the target column.

In [28]:
data['loan_status'].unique()

array([1, 0], dtype=int64)

In [29]:
data['loan_status'].value_counts()

1    80014
0    19986
Name: loan_status, dtype: int64
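
The counts above show that the two classes are not balanced. A small sketch, using only the two numbers printed by `value_counts()`, of how to turn them into class shares and into the error rate of a trivial model that always predicts the most frequent class:

```python
# Class counts copied from the value_counts() output above.
counts = {1: 80014, 0: 19986}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Error rate of a trivial model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_error = 1 - shares[majority_label]

print(total)                     # 100000
print(round(shares[1], 5))       # 0.80014
print(round(majority_error, 5))  # 0.19986
```

With an imbalanced target, plain accuracy can look good for an uninformative model, so it helps to keep this kind of baseline in mind when reading error rates.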

## Tree-Based Models

---
In this example, we use LightGBM as the tree model of choice.

In [15]:
train = data.iloc[:TRAIN_IDX, :]
test = data.iloc[TRAIN_IDX:TEST_IDX, :]

In [16]:
import lightgbm as lgb
train_dataset = lgb.Dataset(train.drop(columns='loan_status'), train['loan_status'])
test_dataset = lgb.Dataset(test.drop(columns='loan_status'), test['loan_status'])

In [17]:
param = {'num_leaves': 31, 'objective': 'binary', 'metric':'binary_error'}
num_round = 2000

In [18]:
model = lgb.train(param, train_dataset, num_boost_round=num_round, valid_sets=[train_dataset, test_dataset])

[LightGBM] [Info] Number of positive: 20672, number of negative: 29328
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2607
[LightGBM] [Info] Number of data points in the train set: 50000, number of used features: 141
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.413440 -> initscore=-0.349763
[LightGBM] [Info] Start training from score -0.349763
[1]	training's binary_error: 0.41344	valid_1's binary_error: 0.39348
[2]	training's binary_error: 0.41344	valid_1's binary_error: 0.39348
[3]	training's binary_error: 0.41	valid_1's binary_error: 0.39074
[4]	training's binary_error: 0.39362	valid_1's binary_error: 0.37906
[5]	training's binary_error: 0.38106	valid_1's binary_error: 0.36738
[6]	training's binary_error: 0.37056	valid_1's binary_error: 0.3612
[7]	training's binary_error: 0.363	valid_1's binary_error: 0.3561
[8]	training's binary_error: 0.3588	valid_1's binary_error: 0.35232
...

[143]	training's binary_error: 0.28576	valid_1's binary_error: 0.34028
[144]	training's binary_error: 0.28552	valid_1's binary_error: 0.34028
[145]	training's binary_error: 0.28506	valid_1's binary_error: 0.34014
[146]	training's binary_error: 0.28476	valid_1's binary_error: 0.34032
[147]	training's binary_error: 0.28452	valid_1's binary_error: 0.34022
[148]	training's binary_error: 0.28422	valid_1's binary_error: 0.34028
[149]	training's binary_error: 0.28408	valid_1's binary_error: 0.34004
[150]	training's binary_error: 0.28378	valid_1's binary_error: 0.33972
[151]	training's binary_error: 0.28362	valid_1's binary_error: 0.33974
[152]	training's binary_error: 0.28352	valid_1's binary_error: 0.33954
[153]	training's binary_error: 0.28304	valid_1's binary_error: 0.3392
[154]	training's binary_error: 0.28266	valid_1's binary_error: 0.33946
[155]	training's binary_error: 0.28242	valid_1's binary_error: 0.33962
[156]	training's binary_error: 0.28216	valid_1's binary_error: 0.33962
...

[309]	training's binary_error: 0.24532	valid_1's binary_error: 0.34302
[310]	training's binary_error: 0.24516	valid_1's binary_error: 0.34292
[311]	training's binary_error: 0.24498	valid_1's binary_error: 0.34298
[312]	training's binary_error: 0.24468	valid_1's binary_error: 0.34308
[313]	training's binary_error: 0.2445	valid_1's binary_error: 0.34308
[314]	training's binary_error: 0.24402	valid_1's binary_error: 0.34328
[315]	training's binary_error: 0.24364	valid_1's binary_error: 0.34324
[316]	training's binary_error: 0.24334	valid_1's binary_error: 0.34324
[317]	training's binary_error: 0.24302	valid_1's binary_error: 0.34326
[318]	training's binary_error: 0.24268	valid_1's binary_error: 0.34354
[319]	training's binary_error: 0.24264	valid_1's binary_error: 0.34346
[320]	training's binary_error: 0.24246	valid_1's binary_error: 0.34326
[321]	training's binary_error: 0.24222	valid_1's binary_error: 0.34332
[322]	training's binary_error: 0.24194	valid_1's binary_error: 0.34328
...

[472]	training's binary_error: 0.2124	valid_1's binary_error: 0.3441
[473]	training's binary_error: 0.21232	valid_1's binary_error: 0.34414
[474]	training's binary_error: 0.21234	valid_1's binary_error: 0.34428
[475]	training's binary_error: 0.21224	valid_1's binary_error: 0.34448
[476]	training's binary_error: 0.21166	valid_1's binary_error: 0.34464
[477]	training's binary_error: 0.21158	valid_1's binary_error: 0.3449
[478]	training's binary_error: 0.21142	valid_1's binary_error: 0.34478
[479]	training's binary_error: 0.21126	valid_1's binary_error: 0.34512
[480]	training's binary_error: 0.21114	valid_1's binary_error: 0.34528
[481]	training's binary_error: 0.21074	valid_1's binary_error: 0.34536
[482]	training's binary_error: 0.21062	valid_1's binary_error: 0.34542
[483]	training's binary_error: 0.21062	valid_1's binary_error: 0.34536
[484]	training's binary_error: 0.2104	valid_1's binary_error: 0.3455
[485]	training's binary_error: 0.21022	valid_1's binary_error: 0.34562
...

[640]	training's binary_error: 0.18258	valid_1's binary_error: 0.34712
[641]	training's binary_error: 0.18256	valid_1's binary_error: 0.34712
[642]	training's binary_error: 0.1824	valid_1's binary_error: 0.34708
[643]	training's binary_error: 0.18204	valid_1's binary_error: 0.34718
[644]	training's binary_error: 0.18196	valid_1's binary_error: 0.34714
[645]	training's binary_error: 0.18174	valid_1's binary_error: 0.3472
[646]	training's binary_error: 0.18144	valid_1's binary_error: 0.34714
[647]	training's binary_error: 0.18136	valid_1's binary_error: 0.3473
[648]	training's binary_error: 0.18134	valid_1's binary_error: 0.34738
[649]	training's binary_error: 0.18112	valid_1's binary_error: 0.34738
[650]	training's binary_error: 0.18092	valid_1's binary_error: 0.34718
[651]	training's binary_error: 0.18076	valid_1's binary_error: 0.34734
[652]	training's binary_error: 0.18066	valid_1's binary_error: 0.34728
[653]	training's binary_error: 0.18038	valid_1's binary_error: 0.34728
...

[765]	training's binary_error: 0.16344	valid_1's binary_error: 0.34876
[766]	training's binary_error: 0.16318	valid_1's binary_error: 0.34874
[767]	training's binary_error: 0.16288	valid_1's binary_error: 0.34882
[768]	training's binary_error: 0.16268	valid_1's binary_error: 0.3487
[769]	training's binary_error: 0.16268	valid_1's binary_error: 0.34874
[770]	training's binary_error: 0.16234	valid_1's binary_error: 0.34874
[771]	training's binary_error: 0.16216	valid_1's binary_error: 0.34856
[772]	training's binary_error: 0.16206	valid_1's binary_error: 0.3488
[773]	training's binary_error: 0.16194	valid_1's binary_error: 0.34872
[774]	training's binary_error: 0.16162	valid_1's binary_error: 0.34878
[775]	training's binary_error: 0.1615	valid_1's binary_error: 0.34874
[776]	training's binary_error: 0.16134	valid_1's binary_error: 0.3487
[777]	training's binary_error: 0.16122	valid_1's binary_error: 0.34878
[778]	training's binary_error: 0.16108	valid_1's binary_error: 0.3487
...

[890]	training's binary_error: 0.14584	valid_1's binary_error: 0.35092
[891]	training's binary_error: 0.1457	valid_1's binary_error: 0.35086
[892]	training's binary_error: 0.14566	valid_1's binary_error: 0.35086
[893]	training's binary_error: 0.14552	valid_1's binary_error: 0.35078
[894]	training's binary_error: 0.14536	valid_1's binary_error: 0.3508
[895]	training's binary_error: 0.1452	valid_1's binary_error: 0.3508
[896]	training's binary_error: 0.14504	valid_1's binary_error: 0.35098
[897]	training's binary_error: 0.14496	valid_1's binary_error: 0.35096
[898]	training's binary_error: 0.14476	valid_1's binary_error: 0.35074
[899]	training's binary_error: 0.14444	valid_1's binary_error: 0.35088
[900]	training's binary_error: 0.14406	valid_1's binary_error: 0.35078
[901]	training's binary_error: 0.14386	valid_1's binary_error: 0.35078
[902]	training's binary_error: 0.14384	valid_1's binary_error: 0.3509
[903]	training's binary_error: 0.14374	valid_1's binary_error: 0.3507
...

[1048]	training's binary_error: 0.12704	valid_1's binary_error: 0.35256
[1049]	training's binary_error: 0.1269	valid_1's binary_error: 0.35254
[1050]	training's binary_error: 0.1269	valid_1's binary_error: 0.35258
[1051]	training's binary_error: 0.12682	valid_1's binary_error: 0.35274
[1052]	training's binary_error: 0.1267	valid_1's binary_error: 0.35274
[1053]	training's binary_error: 0.12644	valid_1's binary_error: 0.35268
[1054]	training's binary_error: 0.12638	valid_1's binary_error: 0.3527
[1055]	training's binary_error: 0.12614	valid_1's binary_error: 0.35288
[1056]	training's binary_error: 0.12606	valid_1's binary_error: 0.35286
[1057]	training's binary_error: 0.12604	valid_1's binary_error: 0.35286
[1058]	training's binary_error: 0.1259	valid_1's binary_error: 0.35288
[1059]	training's binary_error: 0.12582	valid_1's binary_error: 0.35286
[1060]	training's binary_error: 0.12564	valid_1's binary_error: 0.35284
[1061]	training's binary_error: 0.12566	valid_1's binary_error: 0.352

[1220]	training's binary_error: 0.10916	valid_1's binary_error: 0.35364
[1221]	training's binary_error: 0.10908	valid_1's binary_error: 0.35362
[1222]	training's binary_error: 0.1089	valid_1's binary_error: 0.35356
[1223]	training's binary_error: 0.10876	valid_1's binary_error: 0.35366
[1224]	training's binary_error: 0.10866	valid_1's binary_error: 0.3537
[1225]	training's binary_error: 0.10848	valid_1's binary_error: 0.35362
[1226]	training's binary_error: 0.1084	valid_1's binary_error: 0.35358
[1227]	training's binary_error: 0.1084	valid_1's binary_error: 0.3536
[1228]	training's binary_error: 0.10814	valid_1's binary_error: 0.35368
[1229]	training's binary_error: 0.10812	valid_1's binary_error: 0.35376
[1230]	training's binary_error: 0.1081	valid_1's binary_error: 0.35392
[1231]	training's binary_error: 0.10802	valid_1's binary_error: 0.3541
[1232]	training's binary_error: 0.1077	valid_1's binary_error: 0.35396
[1233]	training's binary_error: 0.10754	valid_1's binary_error: 0.35392


[1392]	training's binary_error: 0.09456	valid_1's binary_error: 0.3549
[1393]	training's binary_error: 0.09436	valid_1's binary_error: 0.3549
[1394]	training's binary_error: 0.0942	valid_1's binary_error: 0.35496
[1395]	training's binary_error: 0.09416	valid_1's binary_error: 0.3549
[1396]	training's binary_error: 0.09396	valid_1's binary_error: 0.355
[1397]	training's binary_error: 0.09376	valid_1's binary_error: 0.35498
[1398]	training's binary_error: 0.09348	valid_1's binary_error: 0.35492
[1399]	training's binary_error: 0.09342	valid_1's binary_error: 0.35494
[1400]	training's binary_error: 0.09326	valid_1's binary_error: 0.3547
[1401]	training's binary_error: 0.09314	valid_1's binary_error: 0.35492
[1402]	training's binary_error: 0.09308	valid_1's binary_error: 0.35478
[1403]	training's binary_error: 0.09312	valid_1's binary_error: 0.35478
[1404]	training's binary_error: 0.09306	valid_1's binary_error: 0.35478
[1405]	training's binary_error: 0.09296	valid_1's binary_error: 0.35476

[1565]	training's binary_error: 0.08004	valid_1's binary_error: 0.3563
[1566]	training's binary_error: 0.07998	valid_1's binary_error: 0.3563
[1567]	training's binary_error: 0.07978	valid_1's binary_error: 0.35614
[1568]	training's binary_error: 0.07976	valid_1's binary_error: 0.35614
[1569]	training's binary_error: 0.07974	valid_1's binary_error: 0.35608
[1570]	training's binary_error: 0.07954	valid_1's binary_error: 0.35636
[1571]	training's binary_error: 0.0795	valid_1's binary_error: 0.35634
[1572]	training's binary_error: 0.07938	valid_1's binary_error: 0.35638
[1573]	training's binary_error: 0.07922	valid_1's binary_error: 0.35634
[1574]	training's binary_error: 0.0792	valid_1's binary_error: 0.35624
[1575]	training's binary_error: 0.07912	valid_1's binary_error: 0.35634
[1576]	training's binary_error: 0.07912	valid_1's binary_error: 0.35636
[1577]	training's binary_error: 0.07906	valid_1's binary_error: 0.35624
[1578]	training's binary_error: 0.07904	valid_1's binary_error: 0.35

[1736]	training's binary_error: 0.06756	valid_1's binary_error: 0.35672
[1737]	training's binary_error: 0.06756	valid_1's binary_error: 0.3567
[1738]	training's binary_error: 0.06754	valid_1's binary_error: 0.3568
[1739]	training's binary_error: 0.0674	valid_1's binary_error: 0.35686
[1740]	training's binary_error: 0.06728	valid_1's binary_error: 0.3568
[1741]	training's binary_error: 0.0673	valid_1's binary_error: 0.3568
[1742]	training's binary_error: 0.0671	valid_1's binary_error: 0.35674
[1743]	training's binary_error: 0.06696	valid_1's binary_error: 0.35676
[1744]	training's binary_error: 0.06676	valid_1's binary_error: 0.35682
[1745]	training's binary_error: 0.06672	valid_1's binary_error: 0.35684
[1746]	training's binary_error: 0.06658	valid_1's binary_error: 0.35686
[1747]	training's binary_error: 0.06658	valid_1's binary_error: 0.35692
[1748]	training's binary_error: 0.06648	valid_1's binary_error: 0.357
[1749]	training's binary_error: 0.06644	valid_1's binary_error: 0.35696
...

[1921]	training's binary_error: 0.05646	valid_1's binary_error: 0.35776
[1922]	training's binary_error: 0.05642	valid_1's binary_error: 0.3579
[1923]	training's binary_error: 0.05638	valid_1's binary_error: 0.35784
[1924]	training's binary_error: 0.05626	valid_1's binary_error: 0.358
[1925]	training's binary_error: 0.0562	valid_1's binary_error: 0.35802
[1926]	training's binary_error: 0.05614	valid_1's binary_error: 0.3579
[1927]	training's binary_error: 0.05608	valid_1's binary_error: 0.35788
[1928]	training's binary_error: 0.05592	valid_1's binary_error: 0.35798
[1929]	training's binary_error: 0.0559	valid_1's binary_error: 0.35802
[1930]	training's binary_error: 0.05582	valid_1's binary_error: 0.35794
[1931]	training's binary_error: 0.05576	valid_1's binary_error: 0.358
[1932]	training's binary_error: 0.05572	valid_1's binary_error: 0.35796
[1933]	training's binary_error: 0.0557	valid_1's binary_error: 0.35786
[1934]	training's binary_error: 0.05568	valid_1's binary_error: 0.35786
...
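
In the log above, the training error keeps falling while the validation error stops improving and then drifts upward, which is the usual sign of overfitting. One way to pick a stopping point is to scan the log for the round with the lowest validation error. The sketch below does this for a handful of lines copied from the log; the helper name `best_validation_round` is made up for illustration, and because the printed log is truncated, the result only applies to the lines included here.

```python
import re

# A few lines copied from the training log above, plus one truncated fragment.
log_lines = [
    "[149]\ttraining's binary_error: 0.28408\tvalid_1's binary_error: 0.34004",
    "[150]\ttraining's binary_error: 0.28378\tvalid_1's binary_error: 0.33972",
    "[151]\ttraining's binary_error: 0.28362\tvalid_1's binary_error: 0.33974",
    "[152]\ttraining's binary_error: 0.28352\tvalid_1's binary_error: 0.33954",
    "[153]\ttraining's binary_error: 0.28304\tvalid_1's binary_error: 0.3392",
    "[154]\ttraining's binary_error: 0.28266\tvalid_1's binary_error: 0.33946",
    "[157]\tt",
]

pattern = re.compile(r"^\[(\d+)\]\t.*valid_1's binary_error: ([0-9.]+)$")


def best_validation_round(lines):
    """Return (iteration, error) for the lowest validation error in the lines."""
    best = None
    for line in lines:
        match = pattern.match(line)
        if match is None:
            continue  # skip info lines and truncated fragments
        iteration, error = int(match.group(1)), float(match.group(2))
        if best is None or error < best[1]:
            best = (iteration, error)
    return best


print(best_validation_round(log_lines))  # (153, 0.3392)
```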

## A Wrapper

In [19]:
import copy
import io
from contextlib import redirect_stdout
from copy import deepcopy
from dataclasses import dataclass, asdict
from typing import Any

import hyperopt
import lightgbm as lgb
import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

cpu_count = 4    # number of threads handed to LightGBM
use_gpu = False  # set to True only if LightGBM was built with GPU support


@dataclass
class LGBOpt:
    """Hyperopt search space for LightGBM; every field is one hyperparameter."""
    # A single-element hp.choice pins the parameter to a fixed value.
    num_threads: Any = hp.choice('num_threads', [cpu_count])
    num_leaves: Any = hp.choice('num_leaves', [64])
    metric: Any = hp.choice('metric', ['binary_error'])
    num_round: Any = hp.choice('num_round', [1000])
    objective: Any = hp.choice('objective', ['binary'])
    learning_rate: Any = hp.uniform('learning_rate', 0.01, 0.1)
    feature_fraction: Any = hp.uniform('feature_fraction', 0.5, 1.0)
    bagging_fraction: Any = hp.uniform('bagging_fraction', 0.8, 1.0)
    device_type: Any = hp.choice('device_type', ['gpu' if use_gpu else 'cpu'])
    boosting: Any = hp.choice('boosting', ['gbdt', 'dart', 'goss'])
    extra_trees: Any = hp.choice('extra_trees', [False, True])
    # drop_rate and uniform_drop only take effect when boosting is 'dart'.
    drop_rate: Any = hp.uniform('drop_rate', 0, 0.2)
    uniform_drop: Any = hp.choice('uniform_drop', [True, False])
    lambda_l1: Any = hp.uniform('lambda_l1', 0, 10)  # TODO: Check range
    lambda_l2: Any = hp.uniform('lambda_l2', 0, 10)  # TODO: Check range
    min_gain_to_split: Any = hp.uniform('min_gain_to_split', 0, 1)  # TODO: Check range
    # Annotated so that the dataclass treats it as a field and asdict() includes it.
    min_data_in_bin: Any = hp.choice('min_data_in_bin', [3, 5, 10, 15, 20, 50])

    @staticmethod
    def get_common_params():
        """Return a fixed set of reasonable starting parameters (no search)."""
        return {'num_thread': 4, 'num_leaves': 12, 'metric': 'binary', 'objective': 'binary',
                'num_round': 1000, 'learning_rate': 0.01, 'feature_fraction': 0.8, 'bagging_fraction': 0.8}
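
To make the shape of the `LGBOpt` search space more concrete, here is a minimal sketch that draws configurations from a few of the same ranges. It deliberately uses plain random sampling from the standard library instead of hyperopt's TPE, so it only illustrates what the space contains, not how hyperopt explores it; `sample_params` is a hypothetical helper.

```python
import random


def sample_params(rng):
    """Draw one configuration from ranges mirroring part of LGBOpt (illustrative only)."""
    return {
        "learning_rate": rng.uniform(0.01, 0.1),      # like hp.uniform(..., 0.01, 0.1)
        "feature_fraction": rng.uniform(0.5, 1.0),
        "bagging_fraction": rng.uniform(0.8, 1.0),
        "boosting": rng.choice(["gbdt", "dart", "goss"]),  # like hp.choice
        "extra_trees": rng.choice([False, True]),
        "min_data_in_bin": rng.choice([3, 5, 10, 15, 20, 50]),
    }


rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_params(rng))
```

Continuous ranges are sampled uniformly between their bounds, while categorical entries pick one option from a list; a search algorithm such as TPE differs in that it uses earlier results to decide where to sample next.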
    

In [20]:
class FitterBase(object):
    def __init__(self, label, metric, max_eval=100, opt=None):
        self.label = label
        self.metric = metric
        self.opt_params = dict()
        self.max_eval = max_eval
        self.opt = opt

    def get_loss(self, y, y_pred):
        if self.metric == 'error':
            return 1 - accuracy_score(y, y_pred)
        elif self.metric == 'precision':
            return 1 - precision_score(y, y_pred)
        elif self.metric == 'recall':
            return 1 - recall_score(y, y_pred)
        elif self.metric == 'macro_f1':
            return 1 - f1_score(y, y_pred, average='macro')
        elif self.metric == 'micro_f1':
            return 1 - f1_score(y, y_pred, average='micro')
        elif self.metric == 'auc':
            # roc_auc_score expects scores or probabilities, not thresholded 0/1 labels.
            # TODO: Add a warning if y_pred contains only 0s and 1s.
            return 1 - roc_auc_score(y, y_pred)
        else:
            raise NotImplementedError("Metric %r is not implemented yet." % self.metric)
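
Every branch of `get_loss` returns `1 - score`, so that a smaller value is always better for the optimizer. Metrics such as error, precision and recall need hard 0/1 predictions, whereas AUC needs the raw scores. A minimal sketch of the error case with made-up probabilities, assuming a 0.5 decision threshold:

```python
import numpy as np

y_true = np.array([1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])  # made-up predicted probabilities

# Turn probabilities into hard labels with a 0.5 threshold.
y_pred = (y_prob > 0.5).astype(int)

# Fraction of wrong predictions; the same quantity as 1 - accuracy.
error = 1 - np.mean(y_pred == y_true)

print(y_pred.tolist())  # [1, 0, 1, 0]
print(error)            # 0.5
```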


In [21]:
class LGBFitter(FitterBase):
    def __init__(self, label='label', metric='error', opt: LGBOpt = None, max_eval=100):
        super(LGBFitter, self).__init__(label, metric, max_eval)
        if opt is not None:
            self.opt = opt
        else:
            self.opt = LGBOpt()
        self.best_round = None
        self.clf = None

    def train(self, train_df, eval_df, params=None, use_best_eval=True):
        self.best_round = None
        dtrain = lgb.Dataset(train_df.drop(columns=[self.label]), train_df[self.label])
        deval = lgb.Dataset(eval_df.drop(columns=[self.label]), eval_df[self.label])
        evallist = [dtrain, deval]
        if params is None:
            use_params = deepcopy(self.opt_params)
        else:
            use_params = deepcopy(params)

        num_round = use_params.pop('num_round')
        if use_best_eval:
            with io.StringIO() as buf, redirect_stdout(buf):
                self.clf = lgb.train(use_params, dtrain, num_round, valid_sets=evallist)
                output = buf.getvalue().split("\n")
            min_error = np.inf
            min_index = 0
            for idx in range(len(output) - 1):
                if len(output[idx].split("\t")) == 3:
                    temp = float(output[idx].split("\t")[2].split(":")[1])
                    if min_error > temp:
                        min_error = temp
                        min_index = int(output[idx].split("\t")[0][1:-1])
            print("The minimum is attained in round %d" % (min_index + 1))
            self.best_round = min_index + 1
            return output
        else:
            with io.StringIO() as buf, redirect_stdout(buf):
                self.clf = lgb.train(use_params, dtrain, num_round, valid_sets=evallist)
                output = buf.getvalue().split("\n")
            self.best_round = num_round
            return output

    def search(self, train_df, eval_df, use_best_eval=True):
        self.opt_params = dict()

        def train_impl(params):
            self.train(train_df, eval_df, params, use_best_eval)
            if self.metric == 'auc':
                y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
            else:
                y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                           num_iteration=self.best_round) > 0.5).astype(int)
            return self.get_loss(eval_df[self.label], y_pred)

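        space = asdict(self.opt)
        best = fmin(train_impl, space, algo=tpe.suggest, max_evals=self.max_eval)
        # fmin returns hp labels mapped to raw values (list indices for hp.choice);
        # space_eval converts them back into an actual parameter dict.
        self.opt_params = hyperopt.space_eval(space, best)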

    def search_k_fold(self, k_fold, data, use_best_eval=True):
        self.opt_params = dict()

        def train_impl_nfold(params):
            loss = list()
            for train_id, eval_id in k_fold.split(data):
                # KFold.split yields positional indices, so use iloc rather than loc.
                train_df = data.iloc[train_id]
                eval_df = data.iloc[eval_id]
                self.train(train_df, eval_df, params, use_best_eval)
                if self.metric == 'auc':
                    y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
                else:
                    y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                               num_iteration=self.best_round) > 0.5).astype(int)
                loss.append(self.get_loss(eval_df[self.label], y_pred))
            return np.mean(loss)

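        space = asdict(self.opt)
        best = fmin(train_impl_nfold, space, algo=tpe.suggest, max_evals=self.max_eval)
        # fmin returns hp labels mapped to raw values (list indices for hp.choice);
        # space_eval converts them back into an actual parameter dict.
        self.opt_params = hyperopt.space_eval(space, best)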

    def train_k_fold(self, k_fold, train_data, test_data, params=None, drop_test_y=True, use_best_eval=True):
        acc_result = list()
        train_pred = np.empty(train_data.shape[0])  # every entry is overwritten by a fold
        test_pred = np.zeros(test_data.shape[0])    # accumulated with +=, so it must start at zero
        if drop_test_y:
            dtest = test_data.drop(columns=self.label)
        else:
            dtest = test_data

        models = list()
        for train_id, eval_id in k_fold.split(train_data):
            # KFold.split yields positional indices, so use iloc rather than loc.
            train_df = train_data.iloc[train_id]
            eval_df = train_data.iloc[eval_id]
            self.train(train_df, eval_df, params, use_best_eval)
            models.append(copy.deepcopy(self.clf))
            train_pred[eval_id] = self.clf.predict(eval_df.drop(columns=self.label), num_iteration=self.best_round)
            if self.metric == 'auc':
                y_pred = self.clf.predict(eval_df.drop(columns=[self.label]), num_iteration=self.best_round)
            else:
                y_pred = (self.clf.predict(eval_df.drop(columns=[self.label]),
                                           num_iteration=self.best_round) > 0.5).astype(int)
            acc_result.append(self.get_loss(eval_df[self.label], y_pred))
            test_pred += self.clf.predict(dtest, num_iteration=self.best_round)
        test_pred /= k_fold.n_splits
        return train_pred, test_pred, acc_result, models
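
The idea behind `train_k_fold` is that each fold's model predicts the rows it did not train on (out-of-fold predictions), and the test predictions of all fold models are averaged. A minimal sketch of that bookkeeping with NumPy only; the "model" here is a stand-in that predicts the mean label of its training fold, and all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=10).astype(float)  # toy labels
n_test, n_splits = 4, 5

oof_pred = np.empty(len(y_train))  # every slot is overwritten exactly once
test_pred = np.zeros(n_test)       # accumulator, so it starts at zero

# Contiguous folds, similar to an unshuffled K-fold split.
folds = np.array_split(np.arange(len(y_train)), n_splits)
fold_means = []
for eval_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(y_train)), eval_idx)
    # Stand-in "model": predict the mean label of the training fold.
    fold_mean = y_train[train_idx].mean()
    fold_means.append(fold_mean)
    oof_pred[eval_idx] = fold_mean  # predictions for rows this model never saw
    test_pred += fold_mean          # each fold's model scores the full test set

test_pred /= n_splits  # average over the fold models

print(oof_pred)
print(test_pred)
```

The out-of-fold array can be scored against the training labels without leakage, and averaging the fold models' test predictions is a simple form of ensembling.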

In [22]:
fitter = LGBFitter(label='loan_status')

In [23]:
params = {'num_thread': 4, 'num_leaves': 12, 'metric': 'binary', 'objective': 'binary',
          'num_round': 2000, 'learning_rate': 0.02, 'feature_fraction': 0.8, 'bagging_fraction': 0.8}

In [24]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5)

In [25]:
fitter.train_k_fold(kfold, train, test, params = params)

The minimum is attained in round 752
Finished loading model, total used 2000 iterations
The minimum is attained in round 1128
Finished loading model, total used 2000 iterations
The minimum is attained in round 1255
Finished loading model, total used 2000 iterations
The minimum is attained in round 746
Finished loading model, total used 2000 iterations
The minimum is attained in round 1012
Finished loading model, total used 2000 iterations


(array([0.66544515, 0.63186229, 0.34167458, ..., 0.61421409, 0.52737908,
        0.10582775]),
 array([0.32797432, 0.36489331, 0.27675128, ..., 0.32066783, 0.31667225,
        0.41526439]),
 [0.43300000000000005, 0.4312, 0.4272, 0.4302, 0.42290000000000005],
 [<lightgbm.basic.Booster at 0x18a46662a48>,
  <lightgbm.basic.Booster at 0x18a46d24488>,
  <lightgbm.basic.Booster at 0x18a484f2a88>,
  <lightgbm.basic.Booster at 0x18a4642b988>,
  <lightgbm.basic.Booster at 0x18a461f3508>])

In [34]:
# Feature engineering: add the ratio of annual income to funded amount as a new column.
train1 = train.copy()
train1['continuous_inc_funded_ratio'] = train1['continuous_annual_inc'] / train1['continuous_funded_amnt']
test1 = test.copy()
test1['continuous_inc_funded_ratio'] = test1['continuous_annual_inc'] / test1['continuous_funded_amnt']
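
A ratio feature like the one above produces `inf` whenever the denominator is zero. Whether that happens in this dataset is not checked here; the sketch below, on made-up values, shows one defensive option: treat a zero denominator as missing so the ratio becomes `NaN`, which LightGBM handles as a missing value by default.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "continuous_annual_inc": [60000.0, 45000.0, 80000.0],
    "continuous_funded_amnt": [12000.0, 0.0, 20000.0],  # made-up values, one zero
})

# Treat a zero denominator as missing so the ratio becomes NaN instead of inf.
denominator = toy["continuous_funded_amnt"].replace(0, np.nan)
toy["continuous_inc_funded_ratio"] = toy["continuous_annual_inc"] / denominator

print(toy["continuous_inc_funded_ratio"].tolist())  # [5.0, nan, 4.0]
```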

In [35]:
fitter.train_k_fold(kfold, train1, test1, params = params)

The minimum is attained in round 788
Finished loading model, total used 2000 iterations
The minimum is attained in round 875
Finished loading model, total used 2000 iterations
The minimum is attained in round 1330
Finished loading model, total used 2000 iterations
The minimum is attained in round 1138
Finished loading model, total used 2000 iterations
The minimum is attained in round 837
Finished loading model, total used 2000 iterations


(array([0.64132968, 0.59716021, 0.35742535, ..., 0.59169384, 0.52140829,
        0.14728204]),
 array([0.30160404, 0.39559409, 0.26574056, ..., 0.32128021, 0.30451696,
        0.4011293 ]),
 [0.43279999999999996, 0.4302, 0.42589999999999995, 0.4286, 0.4226],
 [<lightgbm.basic.Booster at 0x18a00aa5148>,
  <lightgbm.basic.Booster at 0x18a37af4208>,
  <lightgbm.basic.Booster at 0x18a3affea48>,
  <lightgbm.basic.Booster at 0x18a00aa5488>,
  <lightgbm.basic.Booster at 0x18a484d6a08>])
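
The third element of each `train_k_fold` result is the list of per-fold losses. A quick way to judge whether the new ratio feature helped is to compare the two lists; the numbers below are copied from the two outputs above.

```python
from statistics import mean

# Per-fold losses copied from the two train_k_fold outputs above.
baseline = [0.4330, 0.4312, 0.4272, 0.4302, 0.4229]
with_ratio = [0.4328, 0.4302, 0.4259, 0.4286, 0.4226]

print(round(mean(baseline), 5))    # 0.4289
print(round(mean(with_ratio), 5))  # 0.42802

# Check whether every fold improves when the ratio feature is added.
print(all(b > w for b, w in zip(baseline, with_ratio)))  # True
```

The loss is lower in every fold, but the average gap is under 0.001, so it may well be within run-to-run noise; repeating the comparison with shuffled folds or different seeds would give a firmer answer.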