from https://www.kaggle.com/ruby33421/lgbm-with-new-features

## Intro
Please see Alexey Pronin's kernel (https://www.kaggle.com/graf10a/logistic-regression-with-new-features-feather) to read more on feature engineering and the performance benefit of using feather files.
Alexey Pronin's kernel also references Youri Matiounine's work here: (https://www.kaggle.com/ymatioun/santander-linear-model-with-additional-features) 

The feature engineering process adds 1,000 new features, for a total of 1,200 features on the Santander dataset. The original kernel trains a simple logistic regression, which achieves a very good score of 0.896 (AUC). This kernel uses a LightGBM model, but instead of incorporating all 1,000 additional features, we use feature importance to select only some of the top engineered features.

In [34]:
import feather
import gc
import keras
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import optuna
import os
import time
# import shutil
import sklearn
import pandas as pd
import xgboost as xgb

from functools import partial
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from numpy import sort
from pprint import pprint
from pylab import rcParams
from scipy.stats import norm, rankdata
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.decomposition import PCA


# Directory the CSV data is loaded from
loadpath = "../input/"
# Directory the CSV data is saved to
savepath = "../output/"

Now, let's read the CSV files containing the training and testing data and measure how long it takes.

Train:

In [None]:
path_train = '../input/train.feather'
path_test = '../input/test.feather'

print("Reading train data...")
start = time.time()
train = pd.read_csv('../input/train.csv')
end = time.time()

print("It takes {0:.2f} seconds to read 'train.csv'.".format(end - start))

Test:

In [57]:
start = time.time()
print("Reading test data...")
test = pd.read_csv('../input/test.csv')
end = time.time()

print("It takes {0:.2f} seconds to read 'test.csv'.".format(end - start))

Reading test data...


KeyboardInterrupt: 

In [43]:
print("Train: ",train.shape)
print("Test: ", test.shape)

Train:  (200000, 202)
Test:  (200000, 201)


Saving the 'target' and 'ID_code' data.

In [44]:
target = train.pop('target')
train_ids = train.pop('ID_code')
test_ids = test.pop('ID_code')

Saving the number of rows in 'train' for future use.

In [45]:
len_train = len(train)

Merging test and train.

In [46]:
merged = pd.concat([train, test])
merged.shape

(400000, 200)

Removing data we no longer need.

In [123]:
del test, train
gc.collect()

55788

Saving the list of original features in a new list `original_features`.

In [124]:
original_features = merged.columns

## Computing new features

In [10]:
for col in merged.columns:
    # Normalize the data, so that it can be used in norm.cdf(), 
    # as though it is a standard normal variable
    merged[col] = ((merged[col] - merged[col].mean()) 
    / merged[col].std()).astype('float32')

    # Square
    merged[col+'^2'] = merged[col] * merged[col]

    # Cube
    merged[col+'^3'] = merged[col] * merged[col] * merged[col]

    # 4th power
    merged[col+'^4'] = merged[col] * merged[col] * merged[col] * merged[col]

    # Cumulative percentile (not normalized)
    merged[col+'_cp'] = rankdata(merged[col]).astype('float32')

    # Cumulative normal percentile
    merged[col+'_cnp'] = norm.cdf(merged[col]).astype('float32')
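
The transformations in the loop above can be checked on a toy column. The following is a minimal sketch, not the kernel's code: `engineer_column` is a hypothetical helper, the normal CDF is computed with `math.erf` instead of `scipy.stats.norm.cdf`, the rank is computed with a double `argsort` (which agrees with `rankdata` only when there are no ties), and the `float32` cast is omitted.

```python
import math
import numpy as np

def engineer_column(x):
    """Return the per-column features described above for a 1-D array x."""
    x = np.asarray(x, dtype=np.float64)
    # Standardize; ddof=1 matches the default of pandas' Series.std()
    z = (x - x.mean()) / x.std(ddof=1)
    # 1-based ranks (assumes no ties, unlike scipy's rankdata which averages ties)
    ranks = np.argsort(np.argsort(z)) + 1.0
    # Standard normal CDF evaluated at each standardized value
    cnp = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    return {"z": z, "z^2": z ** 2, "z^3": z ** 3, "z^4": z ** 4,
            "cp": ranks, "cnp": cnp}

toy = engineer_column([3.0, 1.0, 4.0, 1.5, 9.0])
for name, values in toy.items():
    print(name, np.round(values, 3))
```

Note that the rank feature is on a very different scale from the others (1 to the number of rows), which is one reason a second normalization pass over the new features is useful for a linear model.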

Getting the list of names of the added features.

In [11]:
new_features = set(merged.columns) - set(original_features)

Normalizing the data again, this time only the newly added features.

In [12]:
for col in new_features:
    merged[col] = ((merged[col] - merged[col].mean()) 
    / merged[col].std()).astype('float32')

Saving the data to feather files.

In [22]:
path_target = loadpath + 'target.feather'

path_train_ids =loadpath + 'train_ids_extra_features.feather'
path_test_ids = loadpath + 'test_ids_extra_features.feather'

path_train = loadpath + 'train_extra_features.feather'
path_test = loadpath + 'test_extra_features.feather'

print("Writing target to a feather files...")
pd.DataFrame({'target' : target.values}).to_feather(path_target)

print("Writing train_ids to a feather files...")
pd.DataFrame({'ID_code' : train_ids.values}).to_feather(path_train_ids)

print("Writing test_ids to a feather files...")
pd.DataFrame({'ID_code' : test_ids.values}).to_feather(path_test_ids)

print("Writing train to a feather files...")
feather.write_dataframe(merged.iloc[:len_train], path_train)

print("Writing test to a feather files...")
feather.write_dataframe(merged.iloc[len_train:], path_test)

Writing target to a feather files...
Writing train_ids to a feather files...
Writing test_ids to a feather files...
Writing train to a feather files...
Writing test to a feather files...


Removing data we no longer need.

In [23]:
del target, train_ids, test_ids, merged
gc.collect()

27

## Loading the data from feather files

Now let's load all of this data back into memory. This will help us illustrate the advantage of using the feather file format.

In [3]:
path_target = loadpath + 'target.feather'

path_train_ids = loadpath + 'train_ids_extra_features.feather'
path_test_ids = loadpath + 'test_ids_extra_features.feather'

path_train = loadpath + 'train_extra_features.feather'
path_test = loadpath + 'test_extra_features.feather'

print("Reading target")
start = time.time()
y = feather.read_dataframe(path_target).values.ravel()
end = time.time()

print("{0:5f} sec".format(end - start))

Reading target
0.005985 sec


In [4]:
print("Reading train_ids")
start = time.time()
train_ids = feather.read_dataframe(path_train_ids).values.ravel()
end = time.time()

print("{0:5f} sec".format(end - start))

Reading train_ids
0.047362 sec


In [5]:
print("Reading test_ids")
start = time.time()
test_ids = feather.read_dataframe(path_test_ids).values.ravel()
end = time.time()

print("{0:5f} sec".format(end - start))

Reading test_ids
0.048272 sec


In [6]:
print("Reading training data")

start = time.time()
train_logistic = feather.read_dataframe(path_train)
train = train_logistic.iloc[:,:200]
end = time.time()

print("{0:5f} sec".format(end - start))

Reading training data
0.135424 sec


In [7]:
print("Reading testing data")

start = time.time()
test_logistic = feather.read_dataframe(path_test)
test = test_logistic.iloc[:,:200]
end = time.time()

print("{0:5f} sec".format(end - start))

Reading testing data
0.131395 sec


In [8]:
print("Total number of features: ",train.shape[1])

Total number of features:  200


In [9]:
original_features = train.columns

Hopefully you can now see the great advantage of using feather files: they are blazing fast. Just compare the timings shown above with those measured for the original CSV files. The processed data sets (stored in the feather file format) that we have just loaded are much bigger than the original ones (stored in CSV files), yet we can load them in almost no time!
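
The timing cells above all repeat the same `start = time.time()` / `end = time.time()` pattern. As a possible simplification, that pattern could be wrapped in a small context manager. This is only a sketch: `timed` is a hypothetical helper that does not appear in the original kernel, and it uses `time.perf_counter()` rather than `time.time()`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Time the enclosed block and print the elapsed seconds."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the elapsed time even if the block raises
        result["seconds"] = time.perf_counter() - start
        print("{0}: {1:.6f} sec".format(label, result["seconds"]))

# Stand-in workload; in the notebook this would be a read call
with timed("summing a range") as t:
    total = sum(range(100000))
```

With such a helper, each read could be written as a single `with timed("Reading target"):` block around the corresponding `feather.read_dataframe` call.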

# Logistic regression with the added features.

Now let's finally do some modeling! More specifically, we will build a straightforward logistic regression model to see whether we can improve on the linear regression result (LB 0.894).

In [22]:
NFOLDS = 10
RANDOM_STATE = 823

feature_list = train_logistic.columns

# test = test[feature_list]

# X = train.values.astype(float)
# X_test = test.values.astype(float)

folds = StratifiedKFold(n_splits=NFOLDS, shuffle=True, 
                        random_state=RANDOM_STATE)


In [14]:
oof_preds = np.zeros((len(train_logistic), 1))
test_preds = np.zeros((len(test_logistic), 1))
roc_cv =[]

for fold_, (trn_, val_) in enumerate(folds.split(y, y)):
    print("Current Fold: {}".format(fold_))
    trn_x, trn_y = train_logistic.iloc[trn_, :], y[trn_]
    val_x, val_y = train_logistic.iloc[val_, :], y[val_]
    
    clf =  LogisticRegression(solver='lbfgs', max_iter=1500, C=10)

    clf.fit(trn_x, trn_y)

    val_pred = clf.predict_proba(val_x)[:,1]
    test_fold_pred = clf.predict_proba(test_logistic)[:,1]
    
    roc_cv.append(roc_auc_score(val_y, val_pred))
    
    print("AUC = {}".format(roc_auc_score(val_y, val_pred)))
    oof_preds[val_, :] = val_pred.reshape((-1, 1))
    test_preds += test_fold_pred.reshape((-1, 1))
test_preds/=NFOLDS
test_preds = test_preds.reshape(-1)

Current Fold: 0
AUC = 0.8867748850235262
Current Fold: 1
AUC = 0.9077794663852814
Current Fold: 2
AUC = 0.905253941520856
Current Fold: 3
AUC = 0.8912520222677607
Current Fold: 4
AUC = 0.89269715900763
Current Fold: 5
AUC = 0.8970970605560303
Current Fold: 6
AUC = 0.9029870104729272
Current Fold: 7
AUC = 0.8989451851360208
Current Fold: 8
AUC = 0.899775607786526
Current Fold: 9
AUC = 0.9002646578115537
Current Fold: 10
AUC = 0.9005251673815469
Current Fold: 11
AUC = 0.9004905433919894
Current Fold: 12
AUC = 0.8970716180077932
Current Fold: 13
AUC = 0.8953268122976004
Current Fold: 14
AUC = 0.8999129975470066
Current Fold: 15
AUC = 0.8937476043904989
Current Fold: 16
AUC = 0.8953758168579007
Current Fold: 17
AUC = 0.8945592216792636
Current Fold: 18
AUC = 0.9018440966539624
Current Fold: 19
AUC = 0.8896115371753676
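
The bookkeeping in the cross-validation loop above has two parts: each training row receives exactly one out-of-fold prediction (from the model that did not see it), and the test predictions are averaged over the folds. The sketch below reproduces only that bookkeeping on synthetic data. It makes several assumptions: the data is random, the folds are plain rather than stratified, and `fit_direction` is a hypothetical stand-in for a real classifier such as `LogisticRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, n_folds = 200, 50, 5
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
X_test = rng.normal(size=(n_test, 3))

def fit_direction(X_trn, y_trn):
    """Toy 'model': the difference between the two class means."""
    return X_trn[y_trn == 1].mean(axis=0) - X_trn[y_trn == 0].mean(axis=0)

# Assign every row to one of n_folds folds (unstratified, for brevity)
fold_of_row = rng.permutation(n) % n_folds

oof = np.full(n, np.nan)        # one out-of-fold score per training row
test_pred = np.zeros(n_test)    # running average over folds

for k in range(n_folds):
    val = fold_of_row == k
    w = fit_direction(X[~val], y[~val])      # fit on the other folds only
    oof[val] = X[val] @ w                    # score the held-out rows
    test_pred += (X_test @ w) / n_folds      # average test scores over folds

print("rows without an OOF score:", int(np.isnan(oof).sum()))
```

Dividing by the number of folds inside the loop, as here, avoids a separate division afterwards that has to be kept in sync with the fold count.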


In [71]:
train_logistic.shape

(200000, 1200)

In [72]:
clf =  LogisticRegression(solver='lbfgs', max_iter=1500, C=10)
clf.fit(train_logistic,y)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1500, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Predicting on the test set without folds (a single model fit on all training data).

In [75]:
test_preds_single = clf.predict_proba(test_logistic)[:,1]

Difference from the fold-averaged predictions.

In [94]:
np.abs(test_preds_single - test_preds).sum()

38.932710383915044

### Feature Importance & Feature Selection

In [15]:
feature_importance = abs(clf.coef_[0])
sorted_idx = np.argsort(feature_importance)[::-1]

In [16]:
top_new_features = feature_list[sorted_idx[0:500]]

In [17]:
train_newf = train_logistic[top_new_features]
Orig_feature_list = list(original_features)

In [18]:
cols = [col for col in train_newf.columns if col not in Orig_feature_list]
len(cols)

348
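
The selection above ranks features by the absolute value of their logistic regression coefficient, which is meaningful here because all features were standardized first. The same steps can be sketched on a handful of made-up values. The feature names and coefficients below are invented for illustration and are not taken from the fitted model.

```python
import numpy as np

# Made-up feature names and coefficients, for illustration only
feature_names = np.array(["var_0", "var_0^2", "var_1", "var_1_cnp", "var_2"])
coef = np.array([0.8, -1.5, 0.1, 0.4, -0.05])

# Sort indices by decreasing absolute coefficient
order = np.argsort(np.abs(coef))[::-1]
top_k = feature_names[order[:3]].tolist()

# Keep only the engineered features among the top-k
original = {"var_0", "var_1", "var_2"}
top_engineered = [name for name in top_k if name not in original]

print(top_k)
print(top_engineered)
```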

In [19]:
# train2 = pd.concat([train[original_features], train[cols]], axis=1)
# test2 = test[train2.columns]

## LGBM model with additional features

In [23]:
params_tuned ={
    'max_leaves': 962,
    'min_data_in_leaf': 101,
    'learning_rate': 0.008467089868304085,
    'bagging_fraction': 0.39421465183790383,
    'feature_fraction': 0.742029399864705,
    'reg_alpha': 0.1659749229798302, 
    'reg_lambda': 1.4223979160695897, 
    'colsample_bytree': 0.7931590791796129,
    'min_gain_to_split': 0.06117879704656666, 
    'min_child_weight': 0.016977338542881623,
}

params_tuned.update({
    'boosting': 'gbdt',
    'bagging_freq': 5, 
    'num_threads': 4,
    'objective': 'binary',
    'random_state': 823,
    'metric': 'auc',
    'verbosity': -1,
})

# {'num_leaves': 9,
#          'min_data_in_leaf': 42,
#          'objective': 'binary',
#          'max_depth': 11,
#          'learning_rate': 0.03,
#          'boosting': 'gbdt',
# #          'bagging_freq': 5,
#          'bagging_fraction': 0.8,
#          'feature_fraction': 0.8201,
#          'bagging_seed': 11,
#          'reg_alpha': 3,
#          'reg_lambda': 5,
#          'random_state': 42,
#          'metric': 'auc',
#          'verbosity': -1,
#          'colsample_bytree': 0.7,
# #         'subsample': 0.81,
#          'min_gain_to_split': 0.02,
# #         'min_child_weight': 19.428902804238373,
#          'num_threads': 4,
# #          'tree_learner': 'data'
#         }


In [29]:
%%time

oof_preds_lgb = np.zeros((len(train), 1))
y_pred_lgb = np.zeros(len(test))
roc_cv_lgb =[]
for fold_n, (train_index, valid_index) in enumerate(folds.split(train,y)):
    print('Fold', fold_n, 'started at', time.ctime())
    X_train, X_valid = train[original_features].iloc[train_index, :], train[original_features].iloc[valid_index, :]
    y_train, y_valid = y[train_index], y[valid_index]
    
    lgb_model = lgb.LGBMClassifier(n_estimators=10000,**params_tuned)
    lgb_model.fit(X_train,y_train,
                    eval_set = [(X_train,y_train), (X_valid,y_valid)],verbose=300,early_stopping_rounds = 200)

    val_pred = lgb_model.predict_proba(X_valid)[:,1]
    print("AUC = {}".format(roc_auc_score(y_valid, val_pred)))
    oof_preds_lgb[valid_index] = val_pred.reshape((-1, 1))
    
    roc_cv_lgb.append(roc_auc_score(y_valid, val_pred))
    
    y_pred_lgb += lgb_model.predict_proba(test[original_features])[:,1]/NFOLDS #, num_iteration=lgb_model.best_iteration)/5
y_pred_lgb = y_pred_lgb.reshape(-1)

Fold 0 started at Sat Mar 16 14:45:39 2019
Training until validation scores don't improve for 200 rounds.
[300]	training's auc: 0.859574	valid_1's auc: 0.830581
[600]	training's auc: 0.894037	valid_1's auc: 0.861971
[900]	training's auc: 0.911017	valid_1's auc: 0.875216
[1200]	training's auc: 0.922601	valid_1's auc: 0.883683
[1500]	training's auc: 0.930889	valid_1's auc: 0.888404
[1800]	training's auc: 0.937147	valid_1's auc: 0.891889
[2100]	training's auc: 0.942196	valid_1's auc: 0.894442
[2400]	training's auc: 0.946532	valid_1's auc: 0.895984
[2700]	training's auc: 0.950185	valid_1's auc: 0.897266
[3000]	training's auc: 0.953687	valid_1's auc: 0.898099
[3300]	training's auc: 0.956812	valid_1's auc: 0.898618
[3600]	training's auc: 0.959748	valid_1's auc: 0.899059
[3900]	training's auc: 0.9625	valid_1's auc: 0.899253
[4200]	training's auc: 0.965139	valid_1's auc: 0.899437
[4500]	training's auc: 0.967681	valid_1's auc: 0.899703
Early stopping, best iteration is:
[4540]	training's auc: 0

[1800]	training's auc: 0.937126	valid_1's auc: 0.888783
[2100]	training's auc: 0.942145	valid_1's auc: 0.891264
[2400]	training's auc: 0.946344	valid_1's auc: 0.89327
[2700]	training's auc: 0.950076	valid_1's auc: 0.894485
[3000]	training's auc: 0.953422	valid_1's auc: 0.895496
[3300]	training's auc: 0.95661	valid_1's auc: 0.896301
[3600]	training's auc: 0.959614	valid_1's auc: 0.896695
[3900]	training's auc: 0.962382	valid_1's auc: 0.897049
[4200]	training's auc: 0.965037	valid_1's auc: 0.89731
[4500]	training's auc: 0.967569	valid_1's auc: 0.897456
[4800]	training's auc: 0.970091	valid_1's auc: 0.8977
[5100]	training's auc: 0.972396	valid_1's auc: 0.897807
[5400]	training's auc: 0.974611	valid_1's auc: 0.897848
Early stopping, best iteration is:
[5300]	training's auc: 0.973898	valid_1's auc: 0.897919
AUC = 0.8979186059695962
Fold 8 started at Sat Mar 16 15:10:14 2019
Training until validation scores don't improve for 200 rounds.
[300]	training's auc: 0.86076	valid_1's auc: 0.833235
[

In [33]:
pprint(roc_cv_lgb)

[0.8997542441757086,
 0.8996298314994977,
 0.9018637772781452,
 0.8975483339279146,
 0.8971079289489183,
 0.8959785840115708,
 0.9002147682930539,
 0.8979186059695962,
 0.8941248262750917,
 0.900515163697768]


Predicting on the test data without folds (a single model fit on all training data).

In [30]:
lgb_model = lgb.LGBMClassifier(n_estimators=5000,**params_tuned)
lgb_model.fit(train,y)

LGBMClassifier(bagging_fraction=0.39421465183790383, bagging_freq=5,
        boosting='gbdt', boosting_type='gbdt', class_weight=None,
        colsample_bytree=0.7931590791796129,
        feature_fraction=0.742029399864705, importance_type='split',
        learning_rate=0.008467089868304085, max_depth=-1, max_leaves=962,
        metric='auc', min_child_samples=20,
        min_child_weight=0.016977338542881623, min_data_in_leaf=101,
        min_gain_to_split=0.06117879704656666, min_split_gain=0.0,
        n_estimators=5000, n_jobs=-1, num_leaves=31, num_threads=4,
        objective='binary', random_state=823, reg_alpha=0.1659749229798302,
        reg_lambda=1.4223979160695897, silent=True, subsample=1.0,
        subsample_for_bin=200000, subsample_freq=0, verbosity=-1)

In [31]:
y_pred_lgb_single = lgb_model.predict_proba(test)[:,1]

Difference from the fold-averaged predictions.

In [32]:
np.abs(y_pred_lgb - y_pred_lgb_single).sum()

1788.4834714081546

### My own idea for stacking (not working well)

In [22]:
%%time

target2 = np.argmin(np.vstack((np.abs(oof_preds.reshape(-1)-y),np.abs(oof_preds_lgb.reshape(-1)-y))),axis=0)
oof_preds2 = np.zeros((len(train_logistic), 1))
test_preds2 = np.zeros((len(test_logistic), 1))
roc_cv2 =[]
for fold_, (trn_, val_) in enumerate(folds.split(train_logistic, target2)):
    print("Current Fold: {}".format(fold_))
    trn_x, trn_y = train_logistic.iloc[trn_, :], target2[trn_]
    val_x, val_y, target_val, pred1_val, pred2_val = train_logistic.iloc[val_], target2[val_], y[val_], oof_preds[val_], oof_preds_lgb[val_]
    target_val = target_val.reshape(-1)
    pred1_val = pred1_val.reshape(-1)
    pred2_val = pred2_val.reshape(-1)
    
    clf =  LogisticRegression(solver='lbfgs', max_iter=1500, C=10)

    clf.fit(trn_x, trn_y)

    val_pred = clf.predict(val_x).reshape(-1)
    test_fold_pred = clf.predict_proba(test_logistic)[:,1].reshape(-1)
    
    val_pred_class = (1-val_pred)*pred1_val + val_pred*pred2_val
    test_fold_pred_class = (1-test_fold_pred)*test_preds + test_fold_pred*y_pred_lgb
    
    fold_score = roc_auc_score(target_val, val_pred_class)
    roc_cv2.append(fold_score)
    
    print(roc_auc_score(val_y,val_pred))
    print("AUC = {}".format(fold_score))
    print("rev=",roc_auc_score(target_val,(1-val_pred)*pred2_val + val_pred*pred1_val))
    print("ave=",roc_auc_score(target_val,(pred1_val+pred2_val)/2))
    print("log=",roc_auc_score(target_val,pred1_val))
    print("lgb=",roc_auc_score(target_val,pred2_val))
    print("max=",roc_auc_score(target_val,(1-val_y)*pred1_val + val_y*pred2_val))
    print("min=",roc_auc_score(target_val,(1-val_y)*pred2_val + val_y*pred1_val))
    oof_preds2[val_] = val_pred_class.reshape((-1, 1))
    test_preds2 += test_fold_pred_class.reshape((-1, 1))

test_preds2 /= folds.get_n_splits()  # average over the actual number of folds rather than a hard-coded 5


Current Fold: 0
0.5572748947620956
AUC = 0.8950171466311798
rev= 0.8974241341239989
ave= 0.8989273784454206
log= 0.897495248691385
lgb= 0.8960491099022705
max= 0.9340548734973438
min= 0.8448987007742826
Current Fold: 1
0.5472144196245284
AUC = 0.9026414519950692
rev= 0.9044106294805135
ave= 0.9058489231959866
log= 0.9039580188086572
lgb= 0.9043702353118201
max= 0.9391073694424535
min= 0.8565930795969816
Current Fold: 2
0.5601017788509441
AUC = 0.8914782433279254
rev= 0.8953573390675834
ave= 0.8962100727082314
log= 0.8924111319993476
lgb= 0.895980230571794
max= 0.9335953662451821
min= 0.8380586172559723
Current Fold: 3
0.5583245178701275
AUC = 0.894916060452756
rev= 0.8986766219493049
ave= 0.8995895771912379
log= 0.8967359233350197
lgb= 0.8980766300903069
max= 0.9357177165651949
min= 0.8418489378899685
Current Fold: 4
0.5518683835443744
AUC = 0.8969727038213299
rev= 0.9001740394361888
ave= 0.9014584662254367
log= 0.8992596292633755
lgb= 0.8991212527239366
max= 0.936406642523898
min= 0.8

##### Sanity checks

In [23]:
a  = np.random.rand(val_pred.shape[0])
print("rev=",roc_auc_score(target_val,(1-a)*pred1_val + a*pred2_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred2_val + a*pred1_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred1_val + a*pred1_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred2_val + a*pred2_val))
print("ave=",roc_auc_score(target_val,(pred1_val+pred2_val)/2))

rev= 0.8979429999104619
rev= 0.8991064541775937
rev= 0.8985534762848428
rev= 0.8986698756500942
ave= 0.9004940770091296


In [24]:
a  = np.ones(val_pred.shape[0])/2
print("rev=",roc_auc_score(target_val,(1-a)*pred1_val + a*pred2_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred2_val + a*pred1_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred1_val + a*pred1_val))
print("rev=",roc_auc_score(target_val,(1-a)*pred2_val + a*pred2_val))
print("ave=",roc_auc_score(target_val,(pred1_val+pred2_val)/2))

rev= 0.9004940770091296
rev= 0.9004940770091296
rev= 0.8985534762848428
rev= 0.8986698756500942
ave= 0.9004940770091296
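
The two checks above rest on simple identities: `(1-a)*p + a*p` equals `p` for any weights `a`, and a constant `a = 0.5` reproduces the plain average. Since AUC depends only on how positives are ordered relative to negatives, these can be confirmed on toy data with a small pairwise implementation. This is a sketch for illustration: `auc_pairwise` is a hypothetical helper, not `roc_auc_score`, and its quadratic memory use makes it unsuitable for the full data set.

```python
import numpy as np

def auc_pairwise(y_true, score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy labels and two toy prediction vectors
y_toy = np.array([0, 0, 1, 0, 1, 1])
p1 = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])
p2 = np.array([0.2, 0.1, 0.6, 0.5, 0.4, 0.9])

a = np.full(len(y_toy), 0.5)
blend = (1 - a) * p1 + a * p2

print("p1      :", auc_pairwise(y_toy, p1))
print("blend   :", auc_pairwise(y_toy, blend))
print("average :", auc_pairwise(y_toy, (p1 + p2) / 2))
```

Random per-row weights, as in the first check, change the ordering of the blended scores, which is why they score differently from the constant-weight average.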


In [54]:
print(roc_auc_score(y,oof_preds))
print(roc_auc_score(y,oof_preds_lgb))
print(roc_auc_score(y,oof_preds2))
print(roc_auc_score(y,(oof_preds+oof_preds_lgb)/2))

0.8975665283512196
0.8982480202268966
0.8956609102374608
0.8999414605379313


## Submission File

In [96]:
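submission_lgb = pd.DataFrame({
        "ID_code": test_ids,
        # The value was missing in the original cell. y_pred_lgb (the
        # fold-averaged LightGBM predictions) is an assumption based on the
        # variable name submission_lgb.
        "target": y_pred_lgb
    })
submission_lgb.to_csv(savepath + 'csv', index=False)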

In [27]:
target2 = np.argmin(np.vstack((np.abs(oof_preds.reshape(-1)-target),np.abs(oof_preds_lgb.reshape(-1)-target))),axis=0)

NameError: name 'target' is not defined

## LightGBM hyperparameter tuning

In [106]:
count_itr=0
def objective_optuna(X, y, trial):
    NFOLDS = 10
    RANDOM_STATE = 823
    folds = StratifiedKFold(n_splits=NFOLDS, shuffle=True, 
                        random_state=RANDOM_STATE)

    global count_itr
    print(count_itr, end=' ')
    count_itr += 1
    # Specify the parameters to optimize
    params = {
        'max_leaves': trial.suggest_int('max_leaves', 5, 1000),
        'min_data_in_leaf':  trial.suggest_int('min_data_in_leaf', 10, 1000),
#         'max_depth': trial.suggest_int('max_depth', 5, 30),
        'learning_rate': trial.suggest_uniform('learning_rate',0,1),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction',0,1),
        'feature_fraction': trial.suggest_uniform('feature_fraction',0.2,1),
        'reg_alpha': trial.suggest_uniform('reg_alpha',0,5),
        'reg_lambda': trial.suggest_uniform('reg_lambda',0,5),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree',0,1),
        'min_gain_to_split': trial.suggest_loguniform('min_gain_to_split',1e-4,1),
        'min_child_weight': trial.suggest_loguniform('min_child_weight',1e-3,1000),
        'boosting': 'gbdt',
        'bagging_freq': 5, 
        'num_threads': 4,
        'objective': 'binary',
        'random_state': 823,
        'metric': 'auc',
        'verbosity': -1,
#         'tree_learner': 'data'
    }
    
    
    oof_preds_lgb = np.zeros(len(X))
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X,y)):
#         print('Fold', fold_n, 'started at', time.ctime())
        X_train, X_valid = X.iloc[train_index, :], X.iloc[valid_index, :]
        y_train, y_valid = y[train_index], y[valid_index]

        lgb_model = lgb.LGBMClassifier(n_estimators=5000,**params)
        lgb_model.fit(X_train,y_train,
                        eval_set = [(X_train,y_train), (X_valid,y_valid)],verbose=False,early_stopping_rounds = 200)

        val_pred = lgb_model.predict_proba(X_valid)[:,1]
#         print("AUC = {}".format(roc_auc_score(y_valid, val_pred)))
        oof_preds_lgb[valid_index] = val_pred.reshape(-1)
      
    return -roc_auc_score(y,oof_preds_lgb)
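
The search space above mixes uniform and log-uniform ranges. A log-uniform range spreads the draws evenly across orders of magnitude, which suits parameters such as `min_child_weight` that span `1e-3` to `1000`. The sketch below illustrates that idea with the standard library only. It is not Optuna's implementation, and `sample_loguniform` is a hypothetical helper.

```python
import math
import random

def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(823)
draws = [sample_loguniform(rng, 1e-3, 1000) for _ in range(10000)]

# On a log scale, 1.0 is the midpoint of [1e-3, 1e3], so roughly half of
# the draws should fall below it
share_below_one = sum(d < 1.0 for d in draws) / len(draws)
print("share of draws below 1.0:", share_below_one)
```

A plain uniform draw over the same interval would place almost all samples above 1.0, so small values would hardly ever be tried.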

In [28]:
f = partial(objective_optuna, train, y)
study = optuna.create_study()
study.optimize(f, n_trials=10000)

NameError: name 'objective_optuna' is not defined

### keras 

In [81]:
model = Sequential([
    Dense(32, input_dim=train.shape[1],name = "lay1"),
    Activation('relu'),
    Dense(1,name="lay2"),
    Activation('softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer='sgd')#, metrics=['auc'])

In [82]:
# Run training
model.fit(train, y, batch_size=200, verbose=True, epochs=20, validation_split=0.1)

ValueError: You are passing a target array of shape (200000, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
```
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
```

Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
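
The error above comes from pairing `categorical_crossentropy` with integer targets: that loss expects one-hot rows of shape (samples, classes). A second issue with the first model is that a softmax over a single output unit always returns 1, so it cannot separate the classes. Both points can be illustrated with NumPy alone. In this sketch `one_hot` and `softmax` are hypothetical helpers written for illustration, not Keras functions.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[labels]

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

y_int = np.array([0, 1, 1, 0])
y_onehot = one_hot(y_int, 2)
print(y_onehot)

# A softmax over one unit normalizes a single value by itself
single_unit = softmax(np.array([[-3.0], [0.0], [5.0]]))
print(single_unit.ravel())
```

This is why the later model in this notebook ends in `Dense(2)` with a softmax and is trained on `to_categorical(y)`.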

In [67]:
scr = model.evaluate(train,y)



In [68]:
scr


14.340334671020507

In [84]:
model = Sequential()

model.add(Dense(500, input_shape=(train.shape[1],)))
model.add(Activation('sigmoid'))
model.add(Dropout(0.2))

model.add(Dense(1000))
model.add(Activation('sigmoid'))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Activation('sigmoid'))
model.add(Dropout(0.2))

model.add(Dense(2))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=2)

In [109]:
hist = model.fit(train, keras.utils.np_utils.to_categorical(y),
                 batch_size=50000,
                 verbose=2,
                 epochs=100,
                 validation_split=0.1,
                 callbacks=[early_stopping])

Train on 180000 samples, validate on 20000 samples
Epoch 1/100
 - 1s - loss: 0.2293 - acc: 0.9160 - val_loss: 0.2337 - val_acc: 0.9137
Epoch 2/100
 - 1s - loss: 0.2298 - acc: 0.9159 - val_loss: 0.2337 - val_acc: 0.9137
Epoch 3/100
 - 1s - loss: 0.2292 - acc: 0.9161 - val_loss: 0.2337 - val_acc: 0.9137
Epoch 4/100
 - 1s - loss: 0.2296 - acc: 0.9162 - val_loss: 0.2337 - val_acc: 0.9136
Epoch 5/100
 - 1s - loss: 0.2296 - acc: 0.9159 - val_loss: 0.2337 - val_acc: 0.9136
Epoch 6/100
 - 1s - loss: 0.2294 - acc: 0.9162 - val_loss: 0.2337 - val_acc: 0.9136


In [110]:
pre = model.predict_proba(train)

In [111]:
roc_auc_score(y,pre[:,1])

0.8643111086556021