Kernel

XGBoost

---

The parameters are fairly obvious choices for a first run. He tunes num_boost_round with early_stopping_rounds. The objective is clear, and the other parameters (learning rate, depth) are pretty standard. For the first run you need a baseline, so

    max_depth in [5, 6] and 
    eta (learning rate) in [0.01..0.05] is a good approach. 
    
Regularization

    subsample and 
    colsample_bytree act as regularisation. Typical values are 0.7-0.8, but here both are set to 0.9.
    
Maybe this is a matter of taste, maybe not. Either way, afterwards you will tune the parameters with GridSearchCV, TPOT, Hyperopt, etc.
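
As a starting point for that later tuning, the baseline ranges above can be expanded into a small grid of candidate parameter sets. This is only a sketch in plain Python: the exact grid points are an assumption picked from the ranges in these notes, and `expand_grid` is a hypothetical helper, not part of XGBoost or scikit-learn.

```python
from itertools import product

# Candidate values picked from the ranges mentioned in the notes above.
# The exact grid points are an assumption, not a recommendation.
grid = {
    'max_depth': [5, 6],
    'eta': [0.01, 0.05],
    'subsample': [0.7, 0.8],
    'colsample_bytree': [0.7, 0.8],
}

def expand_grid(grid):
    """Yield one parameter dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

candidates = list(expand_grid(grid))
print(len(candidates), 'candidate parameter sets')
```

Each dict in `candidates` could be merged into a `params` dict and scored with cross-validation. Tools such as GridSearchCV or Hyperopt automate the same loop.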

Stacking

---

It may help to think about it like this: we have three solutions (three sets of numbers) that are somehow related to a real solution (a single set of numbers). Our job is to find a mathematical transformation that uses our three sets of numbers to get as close as possible to the real solution. There is no rule that says a set of 3 multiplying weights will do better or worse than a set of 3 weights that are used as powers. We try different things and find out what works best after optimizing the score.

If you are constraining weights to certain ranges, keep in mind that the ranges will be different when using one approach versus the other. For example, the following two solutions are equivalent:

```
w1=0.3333
w2=0.3333
w3=0.3333
score = w1*model1 + w2*model2 + w3*model3
```

```
w1=1
w2=1
w3=1
score = (model1^w1 + model2^w2 + model3^w3) / 3
```

Yet the first-case weights are in the 0.1-0.6 range (to be generous and to allow unequal model contributions), while for the second case they would be - generously - in the 0.8-1.2 range.
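
To make the equivalence concrete, here is a minimal sketch in plain Python. The three prediction lists are made-up numbers, and `linear_blend` and `power_blend` are hypothetical helper names, not part of any library.

```python
# Hypothetical predictions from three models for four rows (made-up numbers).
model1 = [0.10, 0.40, 0.35, 0.80]
model2 = [0.20, 0.30, 0.45, 0.70]
model3 = [0.15, 0.50, 0.25, 0.90]

def linear_blend(preds, weights):
    """Weighted sum: w1*model1 + w2*model2 + w3*model3, row by row."""
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*preds)]

def power_blend(preds, weights):
    """Mean of powers: (model1^w1 + model2^w2 + model3^w3) / n, row by row."""
    return [sum(p ** w for w, p in zip(weights, row)) / len(row)
            for row in zip(*preds)]

preds = [model1, model2, model3]
linear = linear_blend(preds, [1 / 3, 1 / 3, 1 / 3])
power = power_blend(preds, [1, 1, 1])

# Equal multiplicative weights of 1/3 and equal exponents of 1 give the same blend.
for a, b in zip(linear, power):
    assert abs(a - b) < 1e-12
```

Once the weights move away from 1/3 and 1 the two blends diverge, which is why a search range that is sensible for one form is not sensible for the other.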

https://stackoverflow.com/questions/34655628/how-to-handle-class-imbalance-in-sklearn-random-forests-should-i-use-sample-wei

In [2]:
from utils import *

In [38]:
import pandas as pd
import numpy as np
import xgboost as xgb
#import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
import gc
from sklearn import metrics

In [39]:
# custom evaluation metric, not an objective: normalized Gini, a rescaling of AUC (gini = 2 * auc - 1)

from sklearn.metrics import roc_auc_score

def gini(y, pred):
    fpr, tpr, thr = metrics.roc_curve(y, pred, pos_label=1)
    g = 2 * metrics.auc(fpr, tpr) -1
    return g

def gini_xgb(pred, y):
    y = y.get_label()
    return 'gini', gini(y, pred) / gini(y, y)

def gini_lgb(preds, dtrain):
    y = list(dtrain.get_label())
    score = gini(y, preds) / gini(y, y)
    return 'gini', score, True
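
The relationship behind `gini` is Gini = 2 * AUC - 1. As a sanity check that does not need scikit-learn, here is a minimal pure-Python sketch that computes AUC by counting correctly ordered (positive, negative) pairs. `auc_by_pairs` and `gini_from_auc` are illustrative names, and the labels and scores are made up.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties in score count as half a correct pair. This is O(n_pos * n_neg),
    fine for a toy example but too slow for a real data set.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def gini_from_auc(y_true, y_score):
    """Gini coefficient derived from AUC."""
    return 2.0 * auc_by_pairs(y_true, y_score) - 1.0

y = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
g = gini_from_auc(y, scores)   # 5 of 6 pairs are ordered correctly
perfect = gini_from_auc(y, y)  # using the labels as scores ranks every pair correctly
print(g, perfect)
```

Because scoring binary labels against themselves gives a Gini of 1, dividing by `gini(y, y)` in the cell above leaves the value unchanged for 0/1 targets.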

In [40]:
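# xgb baseline (each later `params = ...` below overwrites the previous one; only the last assignment is used)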
params = {'eta': 0.01, 'max_depth': 5, 'subsample': 0.9, 'colsample_bytree': 0.9, 
          'objective': 'binary:logistic', 'eval_metric': 'auc', 'silent': True}

# xgb.3 top .2 eta
params = {'gamma': 1.5797890063184805, 'booster': 'gbtree', 
         'max_depth': 4, 'eta': 0.1, 
         'tree_method': 'gpu_hist', 'objective': 'binary:logistic', 
         'silent': True, 'subsample': 0.9434828523993708, 
         'colsample_bytree': 0.44937522035664768, 'min_child_weight': 7.6065464675689975, 
         'max_delta_step': 0, 'seed': 42}

# xgb.4 top .05 eta
params = {'gamma': 0.066133947627838238, 'booster': 'gbtree', 
          'max_depth': 4, 'eta': 0.05, 
          'tree_method': 'gpu_hist', 'objective': 'binary:logistic', 
          'silent': True, 'subsample': 0.81293674522428461, 
          'colsample_bytree': 0.36922966018429904, 'min_child_weight': 9.6586643489118895, 
          'max_delta_step': 4, 'seed': 42}

# xgb.5 top .001 eta (.28471)
params = {'gamma': 2.0, 'booster': 'gbtree', 
          'max_depth': 4, 'eta': 0.01, 
          'tree_method': 'gpu_hist', 'objective': 'binary:logistic', 
          'silent': True, 'subsample': 0.2, 
          'colsample_bytree': 1.0, 'min_child_weight': 10.0, 
          'max_delta_step': 0, 'seed': 42}


#params['updater'] = 'grow_gpu'
#params['tree_method'] = 'gpu_hist'

train, test = read_train_test()
X = train.drop(['id', 'target'], axis=1)
features = X.columns
X = X.values
y = train['target'].values
sub = test['id'].to_frame()
sub['target'] = 0

nrounds = 5000  # need to change to 2000
kfold = 5
# random_state has no effect without shuffle=True, and recent scikit-learn raises an error for that combination
skf = StratifiedKFold(n_splits=kfold)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(' xgb kfold: {}  of  {} : '.format(i+1, kfold))
    X_train, X_valid = X[train_index], X[test_index]
    y_train, y_valid = y[train_index], y[test_index]
    d_train = xgb.DMatrix(X_train, y_train) 
    d_valid = xgb.DMatrix(X_valid, y_valid) 
    watchlist = [(d_train, 'train'), (d_valid, 'valid')]
    xgb_model = xgb.train(params, d_train, nrounds, watchlist, early_stopping_rounds=80, 
                          feval=gini_xgb, maximize=True, verbose_eval=100)
    sub['target'] += xgb_model.predict(xgb.DMatrix(test[features].values), 
                        ntree_limit=xgb_model.best_ntree_limit+50) / (kfold)
gc.collect()
sub.to_csv('./submissions/xgboost.5.csv',index=False, float_format='%.5f')

 xgb kfold: 1  of  5 : 
[0]	train-error:0.036447	valid-error:0.036449	train-gini:0.1736	valid-gini:0.179968
Multiple eval metrics have been passed: 'valid-gini' will be used for early stopping.

Will train until valid-gini hasn't improved in 80 rounds.
[100]	train-error:0.036447	valid-error:0.036449	train-gini:0.240556	valid-gini:0.239175
[200]	train-error:0.036447	valid-error:0.036449	train-gini:0.244228	valid-gini:0.242475
[300]	train-error:0.036447	valid-error:0.036449	train-gini:0.24537	valid-gini:0.242941
[400]	train-error:0.036447	valid-error:0.036449	train-gini:0.246936	valid-gini:0.244111
[500]	train-error:0.036447	valid-error:0.036449	train-gini:0.248287	valid-gini:0.245305
[600]	train-error:0.036447	valid-error:0.036449	train-gini:0.24959	valid-gini:0.246169
[700]	train-error:0.036447	valid-error:0.036449	train-gini:0.250485	valid-gini:0.24663
[800]	train-error:0.036447	valid-error:0.036449	train-gini:0.251293	valid-gini:0.247332
[900]	train-error:0.036447	valid-error:0.03644

KeyboardInterrupt: 
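
The rule the log announces ("Will train until valid-gini hasn't improved in 80 rounds") can be sketched without XGBoost. This is a simplified illustration of patience-based early stopping on a list of made-up validation scores, not XGBoost's actual implementation. `best_round_with_patience` is a hypothetical name.

```python
def best_round_with_patience(valid_scores, patience):
    """Scan per-round validation scores (higher is better).

    Returns (best_round, stopped_round): the index of the best score seen and
    the index at which scanning stopped because the score had not improved
    for `patience` rounds (or the last index if that never happened).
    """
    best_round, best_score = 0, valid_scores[0]
    for rnd, score in enumerate(valid_scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(valid_scores) - 1

# Made-up validation scores, one per boosting round.
scores = [0.18, 0.22, 0.24, 0.245, 0.244, 0.243, 0.2449, 0.241, 0.25]
best, stopped = best_round_with_patience(scores, patience=3)
print(best, stopped)
```

In this toy run the scan stops three rounds after the best score, so the later, higher value is never reached. That is the trade-off a small patience makes. A large value such as 80 makes it less likely.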

In [23]:
# random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


In [30]:
cv_pred = rf.predict_proba(X_valid)[:,1]

In [33]:
cv_pred.shape  # cv_pred is already the 1-D positive-class column

(297606,)

In [34]:
train, test = read_train_test()
X = train.drop(['id', 'target'], axis=1)
features = X.columns
X = X.values
y = train['target'].values

sub = test['id'].to_frame()
sub['target'] = 0

kfold = 2
# random_state has no effect without shuffle=True, and recent scikit-learn raises an error for that combination
skf = StratifiedKFold(n_splits=kfold)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(' rf kfold: {}  of  {} : '.format(i+1, kfold))
    X_train, X_valid = X[train_index], X[test_index]
    y_train, y_valid = y[train_index], y[test_index]
    
    rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced',
                                min_samples_leaf=25, min_samples_split=25)
    
#     rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight={0:.1, 1:.9},
#                                 min_samples_leaf=25, min_samples_split=25)
    rf.fit(X_train, y_train)
    cv_pred = rf.predict_proba(X_valid)[:,1]
    cv_score = roc_auc_score(y_valid, cv_pred)*2-1
    print(' cv-gini: {}'.format(cv_score))
    

    
    sub['target'] += rf.predict_proba(test.drop(['id'], axis=1))[:,1] / (kfold)
    
#     xgb_model = xgb.train(params, d_train, nrounds, watchlist, early_stopping_rounds=80, 
#                           feval=gini_xgb, maximize=True, verbose_eval=100)
#     sub['target'] += xgb_model.predict(xgb.DMatrix(test[features].values), 
#                         ntree_limit=xgb_model.best_ntree_limit+50) / (kfold)

gc.collect()
sub.head(2)

 rf kfold: 1  of  2 : 
 cv-gini: 0.25792038571468634
 rf kfold: 2  of  2 : 
 cv-gini: 0.2579841592567014


   id    target
0   0  0.234351
1   1  0.198716


In [35]:
sub.to_csv('./submissions/rf.2.csv',index=False, float_format='%.5f')
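
The random forest above uses `class_weight='balanced'`. scikit-learn documents that mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up, heavily imbalanced labels. `balanced_class_weights` is a hypothetical helper name, and the class counts are an assumption chosen only for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights n_samples / (n_classes * count(class)) for each class in y."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up labels: 964 negatives and 36 positives.
y = [0] * 964 + [1] * 36
weights = balanced_class_weights(y)
print(weights)

# After weighting, both classes carry the same total weight.
total_neg = weights[0] * 964
total_pos = weights[1] * 36
```

With these counts the rare class gets a weight roughly 27 times larger than the common class, which is far more aggressive than the commented-out `{0: .1, 1: .9}` alternative.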

In [18]:
train, test = read_train_test()
X = train.drop(['id', 'target'], axis=1)
features = X.columns
X = X.values
y = train['target'].values

sub = test['id'].to_frame()
sub['target'] = 0

In [19]:
%%time
rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced',
                            min_samples_leaf=25, min_samples_split=25)
rf.fit(X, y)
# predict_proba returns one column per class; keep only the positive-class probability
sub['target'] = rf.predict_proba(test.drop(['id'], axis=1))[:, 1]

CPU times: user 1h 20min 14s, sys: 25.4 s, total: 1h 20min 40s
Wall time: 3min 28s


In [20]:
sub.to_csv('./submissions/rf.1.csv',index=False, float_format='%.5f')