## Day 3 - More Careful Tuning of XGBClassifier
Today's plan: optimize XGBoost again, tuning the parameters in this order:
1. max_depth, min_child_weight
2. gamma
3. subsample, colsample_bytree
4. reg_alpha
5. reduce learning rate

Today I do basically the same things as yesterday, but a little more carefully.
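Each tuning stage in this notebook repeats the same parameter dictionary with only one or two entries changed. A minimal sketch of a helper that builds such a grid is shown here. `make_grid` is a hypothetical name (it is not part of scikit-learn or XGBoost), and the fixed values are simply the ones used in the first search of this notebook.

```python
def make_grid(fixed, search):
    """Build a GridSearchCV-style param_grid (hypothetical helper).

    fixed:  dict of parameter -> single value held constant in this stage
    search: dict of parameter -> list of candidate values tuned in this stage
    """
    grid = {name: [value] for name, value in fixed.items()}
    grid.update({name: list(values) for name, values in search.items()})
    return [grid]


# Values held fixed while max_depth and min_child_weight are tuned.
fixed_stage1 = {
    'learning_rate': 0.05,
    'n_estimators': 1000,
    'gamma': 0.1,
    'subsample': 0.9,
    'colsample_bytree': 0.7,
    'objective': 'binary:logistic',
    'scale_pos_weight': 1,
    'reg_alpha': 1,
}
stage1_grid = make_grid(fixed_stage1, {'max_depth': [4, 5], 'min_child_weight': [4, 5]})
print(stage1_grid)
```

The result can be passed as `param_grid` to `GridSearchCV`, and moving to the next stage only requires moving one key from `search` to `fixed`.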

In [1]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, log_loss
from sklearn.model_selection import GridSearchCV
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
from tqdm import tqdm
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

In [2]:
data = pd.read_csv('train_final.csv')

In [3]:
X_pred_data = pd.read_csv('test_final.csv')
X_pred = np.asarray(X_pred_data.iloc[:, 1:25]).reshape(-1, 24)

In [4]:
X = np.asarray(data.iloc[:, 2:26]).reshape(-1, 24)
y = np.asarray(data.iloc[:, 1]).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01)

First, tune 'max_depth' and 'min_child_weight'
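For intuition about `min_child_weight`: XGBoost describes it as the minimum sum of instance weight (hessian) needed in a child. With the `binary:logistic` objective and unit sample weights, each sample contributes p(1 - p), where p is its current predicted probability. The sketch below uses plain NumPy and made-up probabilities to show how differently two leaves of the same size can score.

```python
import numpy as np


def hessian_sum(probabilities):
    """Sum of logistic-loss hessians p * (1 - p) over the samples in a leaf."""
    p = np.asarray(probabilities, dtype=float)
    return float(np.sum(p * (1.0 - p)))


uncertain_leaf = np.full(20, 0.5)   # 20 samples the model is unsure about
confident_leaf = np.full(20, 0.95)  # 20 samples predicted with high confidence

print(hessian_sum(uncertain_leaf))  # 20 * 0.25
print(hessian_sum(confident_leaf))  # 20 * 0.0475
```

So a threshold of 5 is reached by about 20 maximally uncertain samples, while the same number of confidently predicted samples falls far below it.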

In [5]:
param_test1 = [
 {'max_depth':[4, 5],
 'min_child_weight':[4, 5],
 'learning_rate': [0.05],
 'n_estimators': [1000],
 'gamma':[0.1],
 'subsample':[0.9],
 'colsample_bytree':[0.7],
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'reg_alpha':[1]}
]

In [6]:
grid1 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test1, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)

In [7]:
grid1.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:  4.0min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [4, 5], 'min_child_weight': [4, 5], 'learning_rate': [0.05], 'n_estimators': [1000], 'gamma': [0.1], 'subsample': [0.9], 'colsample_bytree': [0.7], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'reg_alpha': [1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [8]:
xgb1 = grid1.best_estimator_
print(xgb1.get_params()['max_depth'])
print(xgb1.get_params()['min_child_weight'])

5
4


Among the 4 combinations, the output above shows (max_depth, min_child_weight) = (5, 4) as the best.
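Printing only the winning parameters hides how close the candidates were. A fitted `GridSearchCV` also exposes `best_params_`, `best_score_` and `cv_results_`, which give the cross-validated AUC of every candidate. The sketch below uses synthetic data and a `DecisionTreeClassifier` as a fast stand-in for `XGBClassifier` (both are assumptions for illustration only); the same attributes exist on `grid1`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the competition data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4], 'min_samples_leaf': [1, 5]},
    scoring='roc_auc',
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate, best first, with the spread across folds.
results = pd.DataFrame(grid_demo.cv_results_)
summary = results[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')

print(grid_demo.best_params_, grid_demo.best_score_)
print(summary)
```

If the gap between the top candidates is smaller than `std_test_score`, the ranking is probably not meaningful.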

In [9]:
xgb1.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.7, gamma=0.1,
       learning_rate=0.05, max_delta_step=0, max_depth=5,
       min_child_weight=4, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.9, verbosity=1)

In [10]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb1.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb1.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.910851


In [11]:
param_test2 = [
 {'max_depth':[5, 6],
 'min_child_weight':[5, 6],
 'learning_rate': [0.05],
 'n_estimators': [1000],
 'gamma':[0.1],
 'subsample':[0.9],
 'colsample_bytree':[0.7],
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'reg_alpha':[1]}
]

In [12]:
grid2 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test2, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)

In [13]:
grid2.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  12 out of  12 | elapsed:  4.3min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5, 6], 'min_child_weight': [5, 6], 'learning_rate': [0.05], 'n_estimators': [1000], 'gamma': [0.1], 'subsample': [0.9], 'colsample_bytree': [0.7], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'reg_alpha': [1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [14]:
xgb2 = grid2.best_estimator_
print(xgb2.get_params()['max_depth'])
print(xgb2.get_params()['min_child_weight'])

6
5


Among these 4 combinations, the output above shows (max_depth, min_child_weight) = (6, 5) as the best. The cells below keep (5, 5).
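When the selected value sits on the boundary of the grid, as the printed max_depth of 6 does for [5, 6], the optimum may lie outside the range that was searched. A small check like the one below can flag this. `on_grid_edge` is a hypothetical helper written for this note, not a library function.

```python
def on_grid_edge(values, best):
    """Return 'lower' or 'upper' if best is an end point of values, else None."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return None
    if best == ordered[0]:
        return 'lower'
    if best == ordered[-1]:
        return 'upper'
    return None


print(on_grid_edge([5, 6], 6))               # best is the largest value tried
print(on_grid_edge([0.09, 0.1, 0.11], 0.1))  # best is an interior value
```

An 'upper' or 'lower' result suggests extending the grid in that direction before fixing the parameter.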

In [15]:
xgb2.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.7, gamma=0.1,
       learning_rate=0.05, max_delta_step=0, max_depth=6,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.9, verbosity=1)

In [16]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb2.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb2.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.881814


In [17]:
param_test3 = [
 {'max_depth':[5],
 'min_child_weight':[5],
 'learning_rate': [0.05],
 'n_estimators': [1000],
 'gamma':[0, 0.1, 0.2],
 'subsample':[0.9],
 'colsample_bytree':[0.7],
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'reg_alpha':[1]}
]

In [18]:
grid3 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test3, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)
grid3.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   9 out of   9 | elapsed:  2.9min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5], 'min_child_weight': [5], 'learning_rate': [0.05], 'n_estimators': [1000], 'gamma': [0, 0.1, 0.2], 'subsample': [0.9], 'colsample_bytree': [0.7], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'reg_alpha': [1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [19]:
xgb3 = grid3.best_estimator_
print(xgb3.get_params()['gamma'])

0


Among [0, 0.1, 0.2], the output above shows 0 as the best 'gamma'. The next search still centers on 0.1.
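As a reminder of what 'gamma' controls: it is the minimum loss reduction required to make a split. The sketch below computes the split gain from XGBoost's second-order objective, using only the L2 term (`reg_lambda`) and ignoring `reg_alpha` for simplicity. The gradient and hessian sums are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a split given gradient (g) and hessian (h) sums of each child.

    Simplified: L2 regularization only; the L1 term (reg_alpha) is ignored.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# The same candidate split evaluated under increasing gamma.
for gamma in [0.0, 0.1, 0.2]:
    print(gamma, split_gain(-2.0, 1.0, 2.0, 1.0, reg_lambda=1.0, gamma=gamma))
```

A split is only worth keeping when this value is positive, so gamma values as small as 0.1 or 0.2 only prune splits whose raw gain is already close to zero.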

In [20]:
xgb3.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.7, gamma=0,
       learning_rate=0.05, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.9, verbosity=1)

In [21]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb3.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb3.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.901681


In [22]:
param_test4 = [
 {'max_depth':[5],
 'min_child_weight':[5],
 'learning_rate': [0.05],
 'n_estimators': [1000],
 'gamma':[0.09, 0.1, 0.11],
 'subsample':[0.9],
 'colsample_bytree':[0.7],
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'reg_alpha':[1]}
]

In [23]:
grid4 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test4, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)
grid4.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   9 out of   9 | elapsed:  3.1min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5], 'min_child_weight': [5], 'learning_rate': [0.05], 'n_estimators': [1000], 'gamma': [0.09, 0.1, 0.11], 'subsample': [0.9], 'colsample_bytree': [0.7], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'reg_alpha': [1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [24]:
xgb4 = grid4.best_estimator_
print(xgb4.get_params()['gamma'])

0.11


Among [0.09, 0.10, 0.11], the output above shows 0.11 as the best 'gamma'. The cells below keep 0.1.

In [25]:
xgb4.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.7, gamma=0.11,
       learning_rate=0.05, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.9, verbosity=1)

In [26]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb4.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb4.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.897096


In [27]:
param_test5 = [
 {'max_depth':[5],
 'min_child_weight':[5],
 'gamma':[0.1],
 'subsample':[0.85, 0.9, 0.95],
 'colsample_bytree':[0.65, 0.7, 0.75],
 'reg_alpha':[1], 
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'learning_rate': [0.05],
 'n_estimators': [1000]}
]

In [28]:
grid5 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test5, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)
grid5.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  27 out of  27 | elapsed:  8.0min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5], 'min_child_weight': [5], 'gamma': [0.1], 'subsample': [0.85, 0.9, 0.95], 'colsample_bytree': [0.65, 0.7, 0.75], 'reg_alpha': [1], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'learning_rate': [0.05], 'n_estimators': [1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [29]:
xgb5 = grid5.best_estimator_
print(xgb5.get_params()['subsample'])
print(xgb5.get_params()['colsample_bytree'])

0.95
0.65


In [30]:
xgb5.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.05, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.95, verbosity=1)

In [31]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb5.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb5.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.901172


Combine with yesterday's findings.
1. subsample
    1. (0.7, 0.8, 0.9) -> 0.9 won.
    2. (0.9, 1.0) -> 0.9 won.
    3. (0.85, 0.9, 0.95) -> 0.95 won in the output above.
2. colsample_bytree
    1. (0.7, 0.8, 0.9) -> 0.7 won.
    2. (0.6, 0.7) -> 0.7 won.
    3. (0.65, 0.7, 0.75) -> 0.65 won.

Finally, I pick (subsample, colsample_bytree) = (0.85, 0.65) for the cells below, although the output above prefers subsample = 0.95. By the way, these changes did not move the test accuracy at all (0.939024 every time), while the test AUC moved only a little. Maybe the models work almost exactly the same way and produce the same predicted labels...
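One reason accuracy can stay frozen while AUC moves: accuracy only looks at which side of 0.5 each probability falls on, whereas AUC depends on how the probabilities are ranked. A toy example with made-up labels and probabilities (not this notebook's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 0])

# Two hypothetical models whose probabilities fall on the same side of 0.5
# for every sample but rank the samples differently.
proba_a = np.array([0.1, 0.2, 0.30, 0.8, 0.4, 0.45])
proba_b = np.array([0.1, 0.2, 0.45, 0.8, 0.3, 0.40])

labels_a = (proba_a >= 0.5).astype(int)
labels_b = (proba_b >= 0.5).astype(int)

print(np.array_equal(labels_a, labels_b))  # identical predicted labels
print(accuracy_score(y_true, labels_a), accuracy_score(y_true, labels_b))
print(roc_auc_score(y_true, proba_a), roc_auc_score(y_true, proba_b))
```

Checking `np.array_equal` on the predicted labels of two tuned models is a quick way to confirm whether they really make the same decisions on the test set.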

In [32]:
param_test6 = [
 {'max_depth':[5],
 'min_child_weight':[5],
 'gamma':[0.1],
 'subsample':[0.85],
 'colsample_bytree':[0.65],
 'reg_alpha':[0.5, 1, 1.5], 
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'learning_rate': [0.05],
 'n_estimators': [1000]}
]

In [33]:
grid6 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test6, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)
grid6.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   9 out of   9 | elapsed:  2.7min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5], 'min_child_weight': [5], 'gamma': [0.1], 'subsample': [0.85], 'colsample_bytree': [0.65], 'reg_alpha': [0.5, 1, 1.5], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'learning_rate': [0.05], 'n_estimators': [1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [34]:
xgb6 = grid6.best_estimator_
print(xgb6.get_params()['reg_alpha'])

1


So the best 'reg_alpha' seems to be 1 (at least, 1 is the best among 0.5, 1, 1.5).

In [35]:
xgb6.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.05, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)

In [37]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb6.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb6.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.896077


In [38]:
param_test7 = [
 {'max_depth':[5],
 'min_child_weight':[5],
 'gamma':[0.1],
 'subsample':[0.85],
 'colsample_bytree':[0.65],
 'reg_alpha':[1], 
 'objective': ['binary:logistic'],
 'scale_pos_weight':[1],
 'learning_rate': [0.0025, 0.05, 0.075],
 'n_estimators': [1000]}
]

In [39]:
grid7 = GridSearchCV(xgb.XGBClassifier(), param_grid = param_test7, scoring='roc_auc', n_jobs=2, iid=False, cv=3, verbose = 2)
grid7.fit(X_train, y_train.ravel())

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   9 out of   9 | elapsed:  2.8min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=False, n_jobs=2,
       param_grid=[{'max_depth': [5], 'min_child_weight': [5], 'gamma': [0.1], 'subsample': [0.85], 'colsample_bytree': [0.65], 'reg_alpha': [1], 'objective': ['binary:logistic'], 'scale_pos_weight': [1], 'learning_rate': [0.0025, 0.05, 0.075], 'n_estimators': [1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=2)

In [40]:
xgb7 = grid7.best_estimator_
print(xgb7.get_params()['learning_rate'])

0.075


So the best 'learning_rate' in the output above is 0.075 (the best among 0.0025, 0.05, 0.075).
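The learning rate interacts with the number of trees: a smaller rate usually needs more boosting rounds to reach its best score, so comparing rates at a fixed `n_estimators` can be misleading. The sketch below illustrates this with scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (an assumption made to keep the example dependency-free) and synthetic data. `staged_predict_proba` yields the predictions after each boosting round.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the competition data.
X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

best_rounds = {}
for lr in [0.01, 0.05, 0.2]:
    model = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=200, max_depth=3, random_state=0
    )
    model.fit(X_tr, y_tr)
    # Validation AUC after 1, 2, ..., 200 trees.
    aucs = [
        roc_auc_score(y_va, proba[:, 1])
        for proba in model.staged_predict_proba(X_va)
    ]
    best_rounds[lr] = (aucs.index(max(aucs)) + 1, max(aucs))

for lr, (n_trees, auc) in best_rounds.items():
    print('learning_rate=%s: best AUC %.4f at %d trees' % (lr, auc, n_trees))
```

If the best round for a small learning rate is the last one, that rate has probably not converged yet and would need more trees for a fair comparison.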

In [41]:
xgb7.fit(X_train, y_train.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.075, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)

In [43]:
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb7.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb7.predict_proba(X_test)[:, 1]))

Accuracy: 0.939024
AUC: 0.877738


In [44]:
# y_submission10 = pd.DataFrame(xgb7.predict_proba(X_pred)[:, 1], columns = ['Y']) 
# y_submission10['Id'] = X_pred_data['Id']
# y_submission10 = y_submission10.reindex(columns=["Id", "Y"])
# y_submission10.to_csv("submission10_uk734.csv", index=False)
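The same four commented lines are repeated for every submission in this notebook. A small helper could replace them. This is only a sketch: `make_submission` is a hypothetical name, and the column names 'Id' and 'Y' are taken from the commented code.

```python
import numpy as np
import pandas as pd


def make_submission(ids, positive_proba):
    """Return a two-column frame: 'Id' and the predicted probability 'Y'."""
    return pd.DataFrame({
        'Id': np.asarray(ids),
        'Y': np.asarray(positive_proba, dtype=float),
    })


# Made-up ids and probabilities, only to show the resulting layout.
demo = make_submission([101, 102, 103], [0.2, 0.9, 0.5])
print(demo)
# demo.to_csv('submission_demo.csv', index=False)
```

With a fitted model this would be called as `make_submission(X_pred_data['Id'], model.predict_proba(X_pred)[:, 1])`.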

In [45]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)

In [46]:
xgb8 = grid7.best_estimator_
xgb8.fit(X_train2, y_train2.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.075, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)

In [47]:
print('Accuracy: %f' % metrics.accuracy_score(y_test2, xgb8.predict(X_test2)))
print('AUC: %f' %metrics.roc_auc_score(y_test2, xgb8.predict_proba(X_test2)[:, 1]))

Accuracy: 0.955752
AUC: 0.862954


In [48]:
# y_submission11 = pd.DataFrame(xgb8.predict_proba(X_pred)[:, 1], columns = ['Y']) 
# y_submission11['Id'] = X_pred_data['Id']
# y_submission11 = y_submission11.reindex(columns=["Id", "Y"])
# y_submission11.to_csv("submission11_uk734.csv", index=False)

None of today's tuned models did better than yesterday's. Next, I will move on to stacking and removing outliers.

In [49]:
xgb9 = grid7.best_estimator_
xgb9.fit(X, y.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.075, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)

In [50]:
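# Note: xgb9 was fit on all of X, which includes X_test, so these scores
# are optimistic and are not a held-out estimate.
print('Accuracy: %f' % metrics.accuracy_score(y_test, xgb9.predict(X_test)))
print('AUC: %f' %metrics.roc_auc_score(y_test, xgb9.predict_proba(X_test)[:, 1]))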

Accuracy: 0.993902
AUC: 1.000000
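A score this high is what one would expect when the evaluation rows were part of the training data. The sketch below reproduces the effect on synthetic, deliberately noisy data with a fully grown `DecisionTreeClassifier` as a stand-in model (both are assumptions for illustration): the model fit on everything scores perfectly on a subset it has already seen, while the model fit only on the training split does not.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so a perfect held-out score is unrealistic.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

honest = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)     # never sees X_te
leaky = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)  # X_te is included

honest_auc = roc_auc_score(y_te, honest.predict_proba(X_te)[:, 1])
leaky_auc = roc_auc_score(y_te, leaky.predict_proba(X_te)[:, 1])

print('held-out AUC: %.4f' % honest_auc)
print('leaky AUC:    %.4f' % leaky_auc)
```

Refitting on all data before making a submission is reasonable, but the score to report is the one measured before that refit.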


In [54]:
# y_submission12 = pd.DataFrame(xgb9.predict_proba(X_pred)[:, 1], columns = ['Y']) 
# y_submission12['Id'] = X_pred_data['Id']
# y_submission12 = y_submission12.reindex(columns=["Id", "Y"])
# y_submission12.to_csv("submission12_uk734.csv", index=False)

In [51]:
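# Note: this is an alias, not a copy. xgb7, xgb8, xgb9 and xgb10 all refer to
# the same object (grid7.best_estimator_), and reg_alpha is already 1.
xgb10 = xgb9
xgb10.set_params(reg_alpha = 1)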

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.075, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)
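Plain assignment only creates another name for the same estimator, so refitting or calling `set_params` through one name changes all of them. `sklearn.base.clone` returns a new, unfitted estimator with the same parameters. A minimal sketch, using a `DecisionTreeClassifier` as a stand-in model:

```python
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

base_model = DecisionTreeClassifier(max_depth=5, random_state=0)

alias = base_model               # same object under a second name
independent = clone(base_model)  # new unfitted estimator with equal parameters

alias.set_params(max_depth=3)

print(alias is base_model)                     # True: one object, two names
print(base_model.get_params()['max_depth'])    # changed through the alias
print(independent.get_params()['max_depth'])   # unaffected
```

Using `clone(grid7.best_estimator_)` for each new fit would keep earlier fitted models such as xgb7 intact.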

In [52]:
X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.05)
xgb10.fit(X_train3, y_train3.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.65, gamma=0.1,
       learning_rate=0.075, max_delta_step=0, max_depth=5,
       min_child_weight=5, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=1, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=0.85, verbosity=1)

In [53]:
print('Accuracy: %f' % metrics.accuracy_score(y_test3, xgb10.predict(X_test3)))
print('AUC: %f' %metrics.roc_auc_score(y_test3, xgb10.predict_proba(X_test3)[:, 1]))

Accuracy: 0.947561
AUC: 0.868245


In [55]:
# y_submission13 = pd.DataFrame(xgb10.predict_proba(X_pred)[:, 1], columns = ['Y']) 
# y_submission13['Id'] = X_pred_data['Id']
# y_submission13 = y_submission13.reindex(columns=["Id", "Y"])
# y_submission13.to_csv("submission13_uk734.csv", index=False)