I only started the competition today, Saturday, June 15, 2019, as I was busy with work and life.

Important things to note:

    The baseline XGBoost model with the original aggregated features reaches an AUC of 0.68+.
    Adding the minute aggregations adds about +0.06, for an AUC of 0.74+ on the hold-out set.

### Approach

1. aggregate the dataframe with the original features (per-second readings)
2. aggregate the dataframe by minutes (seconds converted to minutes) and merge it with the original aggregated dataframe on bookingid
3. train models on the whole feature set with XGBoost (XGB) and LightGBM (LGB)
4. use Bayesian optimization to find the optimal parameters
5. retrain the models using the found parameters stored in the .dict files (for XGB and LGB)
6. predict and check the results
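
Steps 1 and 2 can be sketched on toy data as follows. This is a minimal illustration, not the code in `modules.utils`: the helper `aggregate_by_minute` is a hypothetical stand-in for `data_aggregation_in_minutes`, whose implementation is not shown in this notebook, and the readings are made up.

```python
import pandas as pd

# Toy telematics readings: one row per booking per second (values are made up).
raw = pd.DataFrame({
    "bookingid": [1, 1, 1, 1, 1, 2, 2],
    "second":    [0, 30, 60, 90, 100, 0, 61],
    "speed":     [0.0, 4.0, 6.0, 10.0, 20.0, 2.0, 8.0],
})

def aggregate_by_minute(df: pd.DataFrame, minute: int = 1) -> pd.DataFrame:
    """Hypothetical stand-in for data_aggregation_in_minutes: mean per time bucket."""
    bucketed = df.assign(minute=df["second"] // (60 * minute))
    per_bucket = bucketed.drop(columns="second").groupby(["bookingid", "minute"]).mean()
    return per_bucket.add_suffix(f"_{minute}min").reset_index()

# Step 1: per-booking aggregates over the raw per-second rows.
orig = raw.drop(columns="second").groupby("bookingid").agg(["mean", "max"])
orig.columns = ["_orig_".join(col) for col in orig.columns]
orig = orig.reset_index()

# Step 2: per-minute buckets, averaged per booking, then merged on bookingid.
by_minute = aggregate_by_minute(raw, minute=1)
by_minute = by_minute.drop(columns="minute").groupby("bookingid").mean().reset_index()
features_demo = orig.merge(by_minute, on="bookingid", how="left")
print(features_demo)
```

For booking 1 the raw mean speed and the mean of per-minute means differ, because the second minute contains more readings than the first. That difference is the extra signal the minute aggregation adds.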

### How to use the notebook
Step by step:

1. instantiate the variables for labels and features
2. run get_all_data to get the initial dataframe
3. run process_all_data to get X_train, X_test, y_train, y_test
4. run xgb_optimize and lgb_optimize
5. run get_test_result to get the hold-out evaluation (using X_test as your hold-out data and y_test as your ground truth data)

Additional steps for the Grab evaluator:
6. run get_all_data on your files to get the initial dataframe
7. run process_all_data with test_grab_evaluation=True to get your final processed dataframe
8. run get_test_result to get the evaluation score of your files (using the process_all_data output as your hold-out data and the labels of your file(s) as ground truth data)

In [1]:
import pandas as pd
import numpy as np
from modules.utils import reduce_mem_usage, interaction_features, data_aggregation_in_minutes
from modules.models import optimize_lgb, optimize_xgb, xgb_cv, lgb_cv
import glob #file handling
from itertools import combinations
import gc #garbage collector for memory usage efficiency
gc.enable()

In [2]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

In [3]:
labels = pd.read_csv(glob.glob('../labels/*.csv')[0])
features = sorted(glob.glob('../features/*.csv')) #sort from 0-9

In [4]:
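def get_all_data(labels, features):
    # normalise the label column names in place (e.g. bookingID -> bookingid)
    labels.columns = [x.lower() for x in labels.columns]
    frames = []
    for file in features:
        print(f"processing file {file}")
        new_df = pd.read_csv(file)
        frames.append(reduce_mem_usage(new_df))
        del new_df
        gc.collect()
    # DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
    return pd.concat(frames) if frames else pd.DataFrame()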

In [5]:
df = get_all_data(labels, features)

processing file ../features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.41 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.41 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00005-e6120af0-10c2-4248-97
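
The "Memory usage" lines above come from `reduce_mem_usage`, whose source lives in `modules.utils` and is not shown here. A common way to get a reduction of this kind is to downcast numeric columns to the smallest dtype that still holds their values. The sketch below assumes that approach; `downcast_numeric` is a hypothetical name, not the notebook's helper.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for reduce_mem_usage: shrink numeric dtypes where possible."""
    out = df.copy()
    for col in out.select_dtypes(include="integer").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float").columns:
        # float64 -> float32 loses precision beyond roughly 7 significant digits
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

demo = pd.DataFrame({
    "second": np.arange(1000, dtype="int64"),
    "speed": np.linspace(0.0, 30.0, 1000),  # float64 by default
})
small = downcast_numeric(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes)
```

The trade-off is precision: sensor readings stored as 32-bit floats are usually fine for tree models, but sums over very long trips can accumulate rounding error.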

In [6]:
def process_all_data(df, test_grab_evaluation=False):
    """
    This function handles the processing of the dataframe to make it model-consumable.
    Please handle the merging of the labels on your own. If you are evaluating the model(s),
    just pass your processed dataframe as the test_ param in the models module.

    Returns:
        if test_grab_evaluation is True:
            final_df : final dataframe without label (for evaluation)
        otherwise (my case):
            X_train, X_test, y_train, y_test : sets of processed data for training and evaluation
    """
    print("Generating dataframe and feature engineering.....")
    df.columns = [x.lower() for x in df.columns]
    df = df.merge(labels, on='bookingid', how='left')
    ids_with_duplicate_labels = df[['bookingid', 'label']].groupby(['bookingid']).agg(np.mean)
    ids_with_duplicate_labels=ids_with_duplicate_labels.loc[ids_with_duplicate_labels['label'] == 0.5]\
                .reset_index()['bookingid'].unique()
    df = df.set_index('bookingid').drop(ids_with_duplicate_labels).reset_index()
    df = df.sort_values(['bookingid', 'second'], ascending=True)

    df['total_acceleration'] = np.sqrt((df['acceleration_x'] ** 2) + \
                                       (df['acceleration_y'] ** 2) + (df['acceleration_z'] ** 2))

    #https://physics.stackexchange.com/questions/41653/how-do-i-get-the-total-acceleration-from-3-axes

    df['gyro_magnitude'] = np.sqrt((df['gyro_x'] ** 2) + \
                                       (df['gyro_y'] ** 2) + (df['gyro_z'] ** 2))

    #https://electronics.stackexchange.com/questions/92447/is-the-magnitude-of-gyro-xyz-meaningful
    print('generating interaction features....')
    for e, (x, y) in enumerate(combinations(['speed', 'accuracy', 'bearing','total_acceleration', 'gyro_magnitude'], 2)):
        df = interaction_features(df, x, y, e)

    df.drop('label',axis=1, inplace=True) #drop again after finding the duplicate labels

    final_df = df.drop('second', axis=1).groupby('bookingid')\
            .agg(['mean', 'max', 'min', 'std', 'sum']) ##aggregation for original df

    final_df.columns = ['_orig_'.join(col).strip() for col in final_df.columns.values]
    final_df = final_df.reset_index()
    minute_list = [1,5,10,15,20,25,30,60]
    print(f"generating feature engineering using minutes in {minute_list} ....")
    for m in minute_list:
        to_append_df = data_aggregation_in_minutes(df, minute=m)
        to_append_df = to_append_df.drop('minute', axis=1).groupby('bookingid').agg(np.mean).reset_index()
        final_df = final_df.merge(to_append_df, on='bookingid', how='left')
        del to_append_df
        gc.collect()
    del df
    gc.collect()
    print('Done!')
    if test_grab_evaluation:
        """
        Return the generated/processed dataframe for Grab evaluation.
        Please handle the labels of your files on your own (ground truth of your evaluation data)
        """
        return final_df
    else:
        """
        My Case.
        """
        final_df = final_df.merge(labels, on='bookingid', how='left')
        X_train, X_test, y_train, y_test = train_test_split(\
                    final_df.drop('label', axis=1), final_df.label, test_size=0.25, random_state=42)

        return X_train, X_test, y_train, y_test
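
The duplicate-label filter at the top of `process_all_data` can be illustrated on a tiny label table. The function drops bookings whose mean label is exactly 0.5 (one 0 and one 1). The sketch below uses a slightly more general test, any mean strictly between 0 and 1, which is an assumption on my part and not what the function does; the data is made up.

```python
import pandas as pd

labels_demo = pd.DataFrame({
    "bookingid": [10, 11, 11, 12, 12],
    "label":     [0,  0,  1,  1,  1],   # booking 11 carries conflicting labels
})

# A booking with consistent labels has a mean label of exactly 0 or 1.
mean_label = labels_demo.groupby("bookingid")["label"].mean()
conflicting = mean_label[(mean_label > 0) & (mean_label < 1)].index

# Drop the conflicting bookings and collapse exact duplicates.
clean = labels_demo[~labels_demo["bookingid"].isin(conflicting)].drop_duplicates()
print(clean)
```

Booking 12 appears twice with the same label, so it is kept once; booking 11 is removed because its ground truth is ambiguous.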

In [7]:
X_train, X_test, y_train, y_test = process_all_data(df, test_grab_evaluation=False)

Generating dataframe and feature engineering.....
generating interaction features....
generating feature engineering using minutes in [1, 5, 10, 15, 20, 25, 30, 60] ....
Done!


In [8]:
#X_train, X_test, y_train, y_test = pd.read_pickle('X_train.pkl'), \
# pd.read_pickle('X_test.pkl'),  pd.read_pickle('y_train.pkl'),  pd.read_pickle('y_test.pkl')

In [9]:
init_points = 10
num_iter = 20
num_params = 1
stratify=True

In [10]:
optimize_lgb(X_train, X_test, y_train, init_points=init_points, num_iter=num_iter,\
             num_params=num_params, stratify=stratify)

Bayesian optimization results will be stored in lgb_BO_res_optimization.dict after training...
|   iter    |  target   | baggin... | baggin... | featur... |  max_bin  | min_da... | min_ga... | min_su... | num_le... | reg_alpha | reg_la... |
-------------------------------------------------------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.764769	valid_1's auc: 0.718392
[2000]	training's auc: 0.783211	valid_1's auc: 0.719689
Early stopping, best iteration is:
[2420]	training's auc: 0.789068	valid_1's auc: 0.720201
Fold  1 AUC : 0.720201
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.763703	valid_1's auc: 0.726267
[2000]	training's auc: 0.781967	valid_1's auc: 0.72731
Early stopping, best iteration is:
[2258]	training's auc: 0.785487	valid_1's auc: 0.727839
Fold  2 AUC : 0.727839
Training until validation scores don

Fold  3 AUC : 0.720573
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.804795	valid_1's auc: 0.723313
Early stopping, best iteration is:
[786]	training's auc: 0.794903	valid_1's auc: 0.724038
Fold  4 AUC : 0.724050
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.802574	valid_1's auc: 0.733492
Early stopping, best iteration is:
[586]	training's auc: 0.780802	valid_1's auc: 0.737268
Fold  5 AUC : 0.737268
Full AUC score 0.725337
|  6        |  0.7253   |  0.1119   |  0.6072   |  0.3047   |  25.85    |  3.47e+03 |  0.9121   |  0.000397 |  39.72    |  0.9588   |  0.792    |
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.859646	valid_1's auc: 0.725753
Early stopping, best iteration is:
[571]	training's auc: 0.824413	valid_1's auc: 0.729619
Fold  1 AUC : 0.729626
Training until 

Fold  4 AUC : 0.711829
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.752852	valid_1's auc: 0.727114
Early stopping, best iteration is:
[669]	training's auc: 0.744891	valid_1's auc: 0.727539
Fold  5 AUC : 0.727539
Full AUC score 0.718213
|  12       |  0.7182   |  0.5758   |  0.9311   |  0.086    |  28.0     |  4.46e+03 |  0.333    |  0.00036  |  6.677    |  0.4523   |  0.6485   |
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.816065	valid_1's auc: 0.723482
Early stopping, best iteration is:
[818]	training's auc: 0.806556	valid_1's auc: 0.724175
Fold  1 AUC : 0.724178
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.814641	valid_1's auc: 0.724749
Early stopping, best iteration is:
[986]	training's auc: 0.814056	valid_1's auc: 0.725035
Fold  2 AUC : 0.725035
Training until 

Fold  2 AUC : 0.724204
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[478]	training's auc: 0.78275	valid_1's auc: 0.719097
Fold  3 AUC : 0.719097
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.81291	valid_1's auc: 0.725123
Early stopping, best iteration is:
[683]	training's auc: 0.794904	valid_1's auc: 0.726148
Fold  4 AUC : 0.726148
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[421]	training's auc: 0.773902	valid_1's auc: 0.739217
Fold  5 AUC : 0.739217
Full AUC score 0.726142
|  19       |  0.7261   |  0.5843   |  0.3095   |  0.1003   |  145.3    |  3.431e+0 |  0.1546   |  0.000219 |  11.25    |  0.1156   |  0.3399   |
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.775423	valid_1's auc: 0.721928
[2000]	tra

Early stopping, best iteration is:
[192]	training's auc: 0.779024	valid_1's auc: 0.725953
Fold  3 AUC : 0.725953
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.856381	valid_1's auc: 0.723348
Early stopping, best iteration is:
[957]	training's auc: 0.853249	valid_1's auc: 0.724394
Fold  4 AUC : 0.724400
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[355]	training's auc: 0.796628	valid_1's auc: 0.738848
Fold  5 AUC : 0.738848
Full AUC score 0.728263
|  25       |  0.7283   |  0.139    |  0.3233   |  0.4647   |  50.87    |  2.62e+03 |  0.3955   |  9.37e-05 |  36.69    |  0.8423   |  0.3352   |
Training until validation scores don't improve for 500 rounds.
[1000]	training's auc: 0.847218	valid_1's auc: 0.729844
Early stopping, best iteration is:
[590]	training's auc: 0.816882	valid_1's auc: 0.732695
Fold
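
To read the log above: each "Fold n AUC" line is the validation AUC of one cross-validation fold, and "Full AUC score" is presumably the AUC over all out-of-fold predictions combined (the `lgb_cv` source is not shown, so this is an assumption). The sketch below reproduces that pattern with scikit-learn on synthetic data, using a logistic regression in place of LightGBM to stay self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, mildly imbalanced binary problem (not the competition data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=1.5, size=500) > 0.8).astype(int)

oof = np.zeros(len(y))  # out-of-fold predictions, filled one fold at a time
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X, y), start=1):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    print(f"Fold {n_fold:2d} AUC : {roc_auc_score(y[valid_idx], oof[valid_idx]):.6f}")

full_auc = roc_auc_score(y, oof)
print(f"Full AUC score {full_auc:.6f}")
```

The full score is not simply the mean of the fold scores, because pooling the predictions also compares rows across folds. Stratifying keeps the positive rate similar in every fold, which matters when the dangerous-trip class is the minority.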

In [11]:
optimize_xgb(X_train, X_test, y_train, init_points=init_points, num_iter=num_iter,\
             num_params=num_params, stratify=stratify)

Bayesian optimization results will be stored in xgb_BO_res_optimization.dict after training...
|   iter    |  target   | colsam... |   gamma   | max_depth | min_ch... | reg_alpha | reg_la... | scale_... | subsample |
-------------------------------------------------------------------------------------------------------------------------
[0]	train-auc:0.649969	eval-auc:0.646602
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.744072	eval-auc:0.730307
[200]	train-auc:0.752086	eval-auc:0.733475
[300]	train-auc:0.757017	eval-auc:0.733635
[400]	train-auc:0.758905	eval-auc:0.734548
[500]	train-auc:0.761267	eval-auc:0.734517
Stopping. Best iteration:
[411]	train-auc:0.759427	eval-auc:0.734724

Fold  1 AUC : 0.734724
[0]	train-auc:0.63862	eval-auc:0.635271
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 

Fold  4 AUC : 0.737497
Full AUC score 0.735507
|  4        |  0.7355   |  0.9365   |  6.862    |  3.589    |  80.99    |  0.4535   |  0.6545   |  0.8691   |  0.4644   |
[0]	train-auc:0.70338	eval-auc:0.69706
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.770305	eval-auc:0.738098
[200]	train-auc:0.804604	eval-auc:0.741666
[300]	train-auc:0.831688	eval-auc:0.74188
Stopping. Best iteration:
[222]	train-auc:0.811429	eval-auc:0.742936

Fold  1 AUC : 0.742936
[0]	train-auc:0.705535	eval-auc:0.682703
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.768443	eval-auc:0.739137
[200]	train-auc:0.802901	eval-auc:0.742078
[300]	train-auc:0.830204	eval-auc:0.740931
Stopping. Best iteration:
[

Fold  1 AUC : 0.744175
[0]	train-auc:0.675822	eval-auc:0.664968
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.757203	eval-auc:0.735804
[200]	train-auc:0.781812	eval-auc:0.739697
[300]	train-auc:0.80387	eval-auc:0.738993
Stopping. Best iteration:
[268]	train-auc:0.797051	eval-auc:0.740695

Fold  2 AUC : 0.740695
[0]	train-auc:0.700697	eval-auc:0.673479
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.761801	eval-auc:0.715491
[200]	train-auc:0.787153	eval-auc:0.720851
[300]	train-auc:0.806852	eval-auc:0.723426
Stopping. Best iteration:
[266]	train-auc:0.800759	eval-auc:0.723844

Fold  3 AUC : 0.723844
[0]	train-auc:0.691785	eval-auc:0.696655
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved 

Fold  1 AUC : 0.738580
[0]	train-auc:0.732806	eval-auc:0.70334
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.89905	eval-auc:0.737359
Stopping. Best iteration:
[49]	train-auc:0.839044	eval-auc:0.739649

Fold  2 AUC : 0.739649
[0]	train-auc:0.747274	eval-auc:0.689813
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.890649	eval-auc:0.724633
[200]	train-auc:0.958383	eval-auc:0.726123
Stopping. Best iteration:
[145]	train-auc:0.92962	eval-auc:0.726741

Fold  3 AUC : 0.726741
[0]	train-auc:0.725273	eval-auc:0.69684
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.89528	eval-auc:0.735387
Stopping. Best iteration:
[63]	train-auc:0.857798	eval-auc:0.738343

Fold  4

Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.751276	eval-auc:0.734808
[200]	train-auc:0.771048	eval-auc:0.739531
[300]	train-auc:0.789389	eval-auc:0.742457
[400]	train-auc:0.804552	eval-auc:0.742764
[500]	train-auc:0.817719	eval-auc:0.743393
Stopping. Best iteration:
[479]	train-auc:0.815196	eval-auc:0.744142

Fold  1 AUC : 0.744142
[0]	train-auc:0.640617	eval-auc:0.640176
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.749766	eval-auc:0.734769
[200]	train-auc:0.772473	eval-auc:0.740978
[300]	train-auc:0.790087	eval-auc:0.744061
[400]	train-auc:0.805448	eval-auc:0.744417
[500]	train-auc:0.818566	eval-auc:0.745504
[600]	train-auc:0.829918	eval-auc:0.746274
[700]	train-auc:0.841764	eval-auc:0.746041
Stopping. Best iteration:
[608]	train-auc:0.831025	eval-auc:0.7465

Fold  4 AUC : 0.742714
Full AUC score 0.741852
|  22       |  0.7419   |  1.0      |  1.0      |  2.0      |  44.04    |  0.2      |  0.2      |  1.0      |  1.0      |
[0]	train-auc:0.723509	eval-auc:0.703865
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.799287	eval-auc:0.743907
[200]	train-auc:0.833394	eval-auc:0.746957
[300]	train-auc:0.853183	eval-auc:0.748871
Stopping. Best iteration:
[276]	train-auc:0.848975	eval-auc:0.749536

Fold  1 AUC : 0.749536
[0]	train-auc:0.714663	eval-auc:0.710751
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.804074	eval-auc:0.741098
[200]	train-auc:0.835638	eval-auc:0.74274
[300]	train-auc:0.852958	eval-auc:0.742985
[400]	train-auc:0.862806	

[0]	train-auc:0.642314	eval-auc:0.644005
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.739971	eval-auc:0.728688
[200]	train-auc:0.754023	eval-auc:0.731591
[300]	train-auc:0.765679	eval-auc:0.733683
[400]	train-auc:0.775716	eval-auc:0.734653
[500]	train-auc:0.783625	eval-auc:0.736064
[600]	train-auc:0.791182	eval-auc:0.736832
[700]	train-auc:0.797646	eval-auc:0.737932
[800]	train-auc:0.803704	eval-auc:0.738711
Stopping. Best iteration:
[796]	train-auc:0.80347	eval-auc:0.73876

Fold  1 AUC : 0.738760
[0]	train-auc:0.643153	eval-auc:0.641651
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.738285	eval-auc:0.731222
[200]	train-auc:0.753705	eval-auc:0.736576
[300]	train-auc:0.764023	eval-auc:0.735882
Stopping. Best iteration:
[240]	train-auc:0.757969	eval-auc:0.736669



In [12]:
import pickle

def get_test_result(X_test, y_test, model='xgb'):
    """Retrain with the best stored parameters and score the predictions on X_test.

    Relies on the notebook globals X_train, y_train, num_params and stratify.
    """
    if model not in ('xgb', 'lgb'):
        raise ValueError(f"model must be 'xgb' or 'lgb', got {model!r}")
    # pick the lexicographically last optimization result file for the chosen model
    fname = max(glob.glob(f'*-{model}*.dict'))
    with open(fname, 'rb') as fl:
        res = pickle.load(fl)

    res_df = pd.DataFrame(res)
    # positions of the num_params highest targets, best first
    top_res = res_df['target'].argsort().iloc[-num_params:].values[::-1]
    params = res_df.loc[top_res]['params']
    cv = xgb_cv if model == 'xgb' else lgb_cv
    for param in params:
        # with num_params > 1 only the last prediction is kept
        sub = cv(X_train, X_test, y_train, **param, test_phase=True, stratify=stratify)

    name = 'xgboost' if model == 'xgb' else 'lightgbm'
    print(f"{name} test set AUC: {roc_auc_score(y_test, sub)}")
    return sub
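
The parameter selection inside `get_test_result` boils down to picking the rows with the highest `target`. A small sketch, assuming the stored result is a list of `{"target": ..., "params": {...}}` records (the layout the function's `res_df['target']` and `res_df['params']` lookups imply); the values below are toy numbers for illustration only.

```python
import pandas as pd

# Assumed layout of the stored optimization results; toy values.
res = [
    {"target": 0.7355, "params": {"max_depth": 3.589, "gamma": 6.862}},
    {"target": 0.7419, "params": {"max_depth": 2.0, "gamma": 1.0}},
    {"target": 0.7290, "params": {"max_depth": 5.1, "gamma": 0.3}},
]
num_params = 2

res_df = pd.DataFrame(res)
top = res_df.nlargest(num_params, "target")  # best first
for rank, (target, params) in enumerate(zip(top["target"], top["params"]), start=1):
    print(rank, target, params)

best_params = top["params"].iloc[0]
```

`nlargest` expresses the same idea as the `argsort` plus reversed slice in the function, in one step.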

In [13]:
sub_xgb = get_test_result(X_test, y_test, model='xgb')

[0]	train-auc:0.640602	eval-auc:0.63879
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.756598	eval-auc:0.739187
[200]	train-auc:0.778555	eval-auc:0.742302
[300]	train-auc:0.79773	eval-auc:0.744532
[400]	train-auc:0.812995	eval-auc:0.745301
Stopping. Best iteration:
[396]	train-auc:0.812327	eval-auc:0.745416

Fold  1 AUC : 0.745416
[0]	train-auc:0.640799	eval-auc:0.636976
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.755552	eval-auc:0.739603
[200]	train-auc:0.776864	eval-auc:0.743066
[300]	train-auc:0.795824	eval-auc:0.744526
[400]	train-auc:0.812234	eval-auc:0.745672
[500]	train-auc:0.825632	eval-auc:0.745764
[600]	train-auc:0.838313	eval-auc:0.74604
Stopping. Best iteration:
[579]	train-auc:0.835646	eval-auc:0.746438

Fold  2 AUC : 0.746438
[0]	train-auc:0.67982

In [14]:
sub_lgb = get_test_result(X_test, y_test, model='lgb')

Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[149]	training's auc: 0.793607	valid_1's auc: 0.735233
Fold  1 AUC : 0.735233
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[148]	training's auc: 0.793273	valid_1's auc: 0.732797
Fold  2 AUC : 0.732797
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[118]	training's auc: 0.786265	valid_1's auc: 0.730245
Fold  3 AUC : 0.730245
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[190]	training's auc: 0.808311	valid_1's auc: 0.728908
Fold  4 AUC : 0.728908
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[93]	training's auc: 0.773113	valid_1's auc: 0.743349
Fold  5 AUC : 0.743349
Full AUC score 0.733629
lightgbm test set AUC: 0.7289775570928088


In [15]:
print(f'mean combined predictions for single xgb and lgb models \
{roc_auc_score(y_test, ((sub_lgb + sub_xgb) / 2))}')

mean combined predictions for single xgb and lgb models 0.7340758026587321
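
The blend above is a plain average of the two probability vectors. Since AUC depends only on the ordering of the scores, a variant sometimes tried is to average ranks instead, which removes differences in score scale between models. The sketch below compares the two on made-up predictions; rank averaging is not used in this notebook and is shown only as an alternative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Made-up labels and scores; the two "models" deliberately use different scales.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
pred_a = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.30, 0.45])
pred_b = np.array([0.02, 0.03, 0.05, 0.04, 0.09, 0.06, 0.08, 0.07])

mean_blend = (pred_a + pred_b) / 2
rank_blend = (pd.Series(pred_a).rank() + pd.Series(pred_b).rank()) / 2

auc_a = roc_auc_score(y_true, pred_a)
auc_mean = roc_auc_score(y_true, mean_blend)
auc_rank = roc_auc_score(y_true, rank_blend)
print(f"model a: {auc_a:.4f}  mean blend: {auc_mean:.4f}  rank blend: {auc_rank:.4f}")
```

With a plain mean, the model with the wider score range dominates the ordering; with ranks, both models contribute equally. Which one scores better depends on the data.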


### Demo for Grab Evaluation

In [16]:
grab_data = get_all_data(labels, features) #just use the features files

processing file ../features/part-00000-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.41 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00001-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00002-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00003-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.41 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00004-e6120af0-10c2-4248-97c4-81baf4304e5c-c000.csv
Memory usage of dataframe is 135.42 MB
Memory usage after optimization is: 46.16 MB
Decreased by 65.9%
processing file ../features/part-00005-e6120af0-10c2-4248-97

In [17]:
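#sample 10 distinct bookingIDs to test (sampling rows could pick the same booking twice)
sample_ids = grab_data['bookingID'].drop_duplicates().sample(10)
grab_data = grab_data.loc[grab_data['bookingID'].isin(sample_ids).values].reset_index(drop=True)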

In [18]:
grab_data = process_all_data(grab_data, test_grab_evaluation=True) 
#set test_grab_evaluation to True

Generating dataframe and feature engineering.....
generating interaction features....
generating feature engineering using minutes in [1, 5, 10, 15, 20, 25, 30, 60] ....
Done!


In [19]:
grab_data = grab_data.merge(labels, how='left', on='bookingid') #merge with the labels

In [20]:
test = grab_data.drop('label', axis=1) #features only, so the label never reaches the model
y_ground_truth = grab_data.label

In [21]:
grab_lgb_preds = get_test_result(test, y_ground_truth, model='lgb')

Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[149]	training's auc: 0.793607	valid_1's auc: 0.735233
Fold  1 AUC : 0.735233
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[148]	training's auc: 0.793273	valid_1's auc: 0.732797
Fold  2 AUC : 0.732797
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[118]	training's auc: 0.786265	valid_1's auc: 0.730245
Fold  3 AUC : 0.730245
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[190]	training's auc: 0.808311	valid_1's auc: 0.728908
Fold  4 AUC : 0.728908
Training until validation scores don't improve for 500 rounds.
Early stopping, best iteration is:
[93]	training's auc: 0.773113	valid_1's auc: 0.743349
Fold  5 AUC : 0.743349
Full AUC score 0.733629
lightgbm test set AUC: 0.8571428571428571


In [22]:
grab_xgb_preds = get_test_result(test, y_ground_truth, model='xgb')

[0]	train-auc:0.640602	eval-auc:0.63879
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.756598	eval-auc:0.739187
[200]	train-auc:0.778555	eval-auc:0.742302
[300]	train-auc:0.79773	eval-auc:0.744532
[400]	train-auc:0.812995	eval-auc:0.745301
Stopping. Best iteration:
[396]	train-auc:0.812327	eval-auc:0.745416

Fold  1 AUC : 0.745416
[0]	train-auc:0.640799	eval-auc:0.636976
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 100 rounds.
[100]	train-auc:0.755552	eval-auc:0.739603
[200]	train-auc:0.776864	eval-auc:0.743066
[300]	train-auc:0.795824	eval-auc:0.744526
[400]	train-auc:0.812234	eval-auc:0.745672
[500]	train-auc:0.825632	eval-auc:0.745764
[600]	train-auc:0.838313	eval-auc:0.74604
Stopping. Best iteration:
[579]	train-auc:0.835646	eval-auc:0.746438

Fold  2 AUC : 0.746438
[0]	train-auc:0.67982

In [23]:
print(f'mean combined predictions for evaluation files of single xgb and lgb models \
{roc_auc_score(y_ground_truth, ((grab_lgb_preds + grab_xgb_preds) / 2))}')

mean combined predictions for evaluation files of single xgb and lgb models 0.8571428571428572
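
A note on reading the 0.857 above: with only 10 bookings, AUC can take just a handful of values, because it is the share of (positive, negative) pairs that the model orders correctly. The value 0.857142... equals 6/7, which is what 18 correct pairs out of 21 would give for 3 positives and 7 negatives. The class split of the sample is not shown in the notebook, so that split is an assumption; the scores below are made up to reproduce the number.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Made-up scores for 3 positives and 7 negatives (10 bookings in total).
y_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores_demo = [0.9, 0.8, 0.35, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1, 0.05]

auc_demo = pairwise_auc(y_demo, scores_demo)
print(auc_demo, 18 / 21)
```

A single swapped pair moves the score by 1/21, about 0.048, so a 10-booking sample is a smoke test of the pipeline and not a reliable estimate. The 0.73 to 0.74 hold-out figures earlier in the notebook are the numbers to trust.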
