# DEPLOYMENT MODEL PREPARATION

<pre>This notebook contains the pre-processing and model tuning for deployment. Because the model will be deployed on an AWS EC2 Free Tier instance, both the model and the tables it depends on must need little compute power. We therefore model on only the top 300 features, ranked by the feature importances of the best single XGBoost model. We will tune a LightGBM model on these features and check its performance on the Test Data, to make sure that the reduced model does not perform too poorly. <br>
In effect, we are trading some performance for lower compute requirements.</pre>
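
As a rough illustration of the feature-selection idea described above, the sketch below ranks features by importance and keeps the first N. The feature names and importance values are made up, and `select_top_features` is a hypothetical helper. The notebook itself simply loads an already-ranked list from `Final_XGBOOST_Selected_features.pkl` and slices it.

```python
import numpy as np

def select_top_features(feature_names, importances, num_top_cols):
    """Return the num_top_cols feature names with the highest importance."""
    # argsort is ascending, so reverse it to get descending importance
    order = np.argsort(importances)[::-1]
    return [feature_names[i] for i in order[:num_top_cols]]

# toy values, for illustration only
names = ['EXT_SOURCE_1', 'AMT_CREDIT', 'DAYS_BIRTH', 'AMT_ANNUITY']
scores = np.array([0.40, 0.10, 0.30, 0.20])
top_two = select_top_features(names, scores, num_top_cols=2)
print(top_two)
```

Keeping the list ordered by importance is what makes a plain slice such as `[:300]` sufficient later on.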

## Importing Libraries and Utility Functions

In [None]:
#importing Useful DataStructures
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3

#importing Misc Libraries
import gc
import pickle
from datetime import datetime

#sklearn libraries
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

#bayesian optimization
from bayes_opt import BayesianOptimization

#lightgbm
import lightgbm as lgb
from lightgbm import LGBMClassifier

#for 100% jupyter notebook cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
def reduce_mem_usage(data, verbose = True):
    #source: https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
    '''
    This function reduces memory usage by downcasting each numeric column of a pandas
    DataFrame to the smallest datatype whose limits can hold the column's values.
    '''

    start_mem = data.memory_usage().sum() / 1024**2
    if verbose:
        print('-'*100)
        print('Memory usage of dataframe: {:.2f} MB'.format(start_mem))

    for col in data.columns:
        col_type = data[col].dtype

        if col_type != object:
            c_min = data[col].min()
            c_max = data[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    data[col] = data[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    data[col] = data[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    data[col] = data[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    data[col] = data[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    data[col] = data[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    data[col] = data[col].astype(np.float32)
                else:
                    data[col] = data[col].astype(np.float64)

    end_mem = data.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage after optimization: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
        print('-'*100)

    return data
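
The function above may downcast float columns all the way to `float16`, which keeps only about three significant decimal digits. As a hedged alternative, and not the approach used in this notebook, the sketch below relies on pandas' built-in `pd.to_numeric(..., downcast=...)`, which never goes below `float32` for floats. `downcast_numeric` and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def downcast_numeric(data):
    """Downcast numeric columns with pandas' built-in helper; returns a copy."""
    out = data.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # picks the smallest signed integer type that fits the values
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # 'float' stops at float32, which limits the precision loss
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# toy frame, for illustration only
toy = pd.DataFrame({'small_int': np.array([1, 2, 3], dtype=np.int64),
                    'big_int': np.array([1, 2, 100000], dtype=np.int64),
                    'ratio': np.array([0.5, 1.25, 2.0], dtype=np.float64),
                    'label': ['a', 'b', 'c']})
small = downcast_numeric(toy)
print(small.dtypes)
```

Non-numeric columns such as `label` are left untouched, and the input DataFrame is not modified.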

In [None]:
def relational_tables_prepare(table_name, file_directory = '', verbose = True, num_top_cols = 300):
    '''
    Function to pickle the relational tables which would need to be merged during production with the
    test datapoint

    Inputs:
        table_name: str
            The name of file to be pickled.
        file_directory: str, default = ''
            The directory in which files are saved
        verbose: bool, default = True
            Whether to keep verbosity or not
        num_top_cols: int, default = 300
            Number of columns to keep out of 600 for deployment

    Returns:
        None
    '''

    if verbose:
        print("Loading the tables into memory...")
        start = datetime.now()

    #loading all the tables in memory, for dimensionality reduction
    with open(file_directory + 'bureau_merged_preprocessed.pkl', 'rb') as f:
        bureau_aggregated = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'previous_application_preprocessed.pkl', 'rb') as f:
        previous_aggregated = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'installments_payments_preprocessed.pkl', 'rb') as f:
        installments_aggregated = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'POS_CASH_balance_preprocessed.pkl', 'rb') as f:
        pos_aggregated = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'credit_card_balance_preprocessed.pkl', 'rb') as f:
        cc_aggregated = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'application_train_preprocessed.pkl', 'rb') as f:
        application_train = reduce_mem_usage(pickle.load(f), verbose = False)
    with open(file_directory + 'application_test_preprocessed.pkl', 'rb') as f:
        application_test = reduce_mem_usage(pickle.load(f), verbose = False)
    #keeping only the first num_top_cols of the selected features
    with open('Final_XGBOOST_Selected_features.pkl', 'rb') as f:
        final_cols = pickle.load(f)[:num_top_cols]

    if verbose:
        print("Done.")
        print(f"Time Elapsed = {datetime.now() - start}")
        start2 = datetime.now()
        print("\nRemoving the non-useful features...")

    #removing non-useful columns from pre-processed previous_application table
    previous_app_columns_to_keep = set(previous_aggregated.columns).intersection(set(final_cols)).union(
                                    set([ele for ele in previous_aggregated.columns if 'AMT_ANNUITY' in ele] + [ele for ele in previous_aggregated.columns if 'AMT_GOODS' in ele]))
    previous_aggregated = previous_aggregated[list(previous_app_columns_to_keep)]
    #removing non-useful columns from pre-processed credit_card_balance table
    credit_card_balance_columns_to_keep = set(cc_aggregated.columns).intersection(set(final_cols)).union(
                                    set([ele for ele in cc_aggregated.columns if 'AMT_RECEIVABLE_PRINCIPAL' in ele] +
                                        [ele for ele in cc_aggregated.columns if 'AMT_RECIVABLE' in ele] +
                                        [ele for ele in cc_aggregated.columns if 'TOTAL_RECEIVABLE' in ele] + ['SK_ID_CURR']))
    cc_aggregated = cc_aggregated[list(credit_card_balance_columns_to_keep)]
    #removing non-useful columns from pre-processed installments_payments table
    installments_payments_columns_to_keep = set(installments_aggregated.columns).intersection(set(final_cols)).union(
                                            set([ele for ele in installments_aggregated.columns if 'AMT_PAYMENT' in
                                                 ele and 'RATIO' not in ele and 'DIFF' not in ele] + ['AMT_INSTALMENT_MEAN_MAX', 'AMT_INSTALMENT_SUM_MAX']))
    installments_aggregated = installments_aggregated[list(installments_payments_columns_to_keep)]
    #removing non-useful columns from pre-processed bureau-aggregated table
    bureau_columns_to_keep =  set(bureau_aggregated.columns).intersection(set(final_cols)).union([ele for ele in bureau_aggregated.columns
                                        if 'DAYS_CREDIT' in ele and 'ENDDATE' not in ele and 'UPDATE' not in ele] + [ele for ele in bureau_aggregated.columns if
                                        'AMT_CREDIT' in ele and 'OVERDUE' in ele] + [ele for ele in bureau_aggregated.columns if 'AMT_ANNUITY' in ele and 'CREDIT'  not in ele])
    bureau_aggregated = bureau_aggregated[list(bureau_columns_to_keep)]

    if verbose:
        print("Done.")
        print(f"Time Elapsed = {datetime.now() - start2}")
        print("\nMerging all the tables, and saving to pickle file 'relational_table.pkl'...")

    #merging all the tables
    relational_table = cc_aggregated.merge(bureau_aggregated, on = 'SK_ID_CURR', how = 'outer')
    relational_table = relational_table.merge(previous_aggregated, on = 'SK_ID_CURR', how = 'outer')
    relational_table = relational_table.merge(installments_aggregated, on = 'SK_ID_CURR', how = 'outer')
    relational_table = relational_table.merge(pos_aggregated, on = 'SK_ID_CURR', how = 'outer')
    relational_table = reduce_mem_usage(relational_table, verbose = False)

    with open('LGBM Deployment/' + table_name + '.pkl', 'wb') as f:
        pickle.dump(relational_table, f)

    if verbose:
        print("Done.")
        print(f"Total Time taken = {datetime.now() - start}")

## Modelling on 300 Features

### Loading Data with top 300 Features

In [None]:
#loading the training and test data
with open('train_data_final.pkl', 'rb') as f:
    train_data = pickle.load(f)
with open('test_data_final.pkl', 'rb') as f:
    test_data = pickle.load(f)

#getting the test SK_ID_CURR and train class labels
target_train = train_data.pop('TARGET')
skid_test = test_data.pop('SK_ID_CURR')
#removing SK_ID_CURR from train data
_ = train_data.pop('SK_ID_CURR')

In [None]:
#loading the final columns for modelling, obtained from XGBoost
#choosing only first 300 columns
with open('Final_XGBOOST_Selected_features.pkl', 'rb') as f:
    final_cols = pickle.load(f)[:300]
train_data = train_data[final_cols]
test_data = test_data[final_cols]
print(f"Shape of Train Data = {train_data.shape}")
print(f"Shape of Test Data = {test_data.shape}")

Shape of Train Data = (307507, 300)
Shape of Test Data = (48744, 300)


### Bayesian Optimization for LGBM Model (to be deployed)

In [None]:
def lgbm_evaluation(num_leaves, max_depth, min_split_gain, min_child_weight,
                    min_child_samples, subsample, colsample_bytree, reg_alpha, reg_lambda):
    '''
    Function for Bayesian Optimization of LightGBM's Hyperparameters. Takes the hyperparameters as input, and
    returns the Cross-Validation AUC as output.

    Inputs: Hyperparameters to be tuned.
        num_leaves, max_depth, min_split_gain, min_child_weight,
        min_child_samples, subsample, colsample_bytree, reg_alpha, reg_lambda

    Returns:
        CV ROC-AUC Score
    '''

    params = {
        'objective' : 'binary',
        'boosting_type' : 'gbdt',
        'learning_rate' : 0.05,
        'n_estimators' : 5000,
        'n_jobs' : -1,
        'num_leaves' : int(round(num_leaves)),
        'max_depth' : int(round(max_depth)),
        'min_split_gain' : min_split_gain,
        'min_child_weight' : min_child_weight,
        'min_child_samples' : int(round(min_child_samples)),
        'subsample': subsample,
        'subsample_freq' : 1,
        'colsample_bytree' : colsample_bytree,
        'reg_alpha' : reg_alpha,
        'reg_lambda' : reg_lambda,
        'verbosity' : -1,
        'seed' : 2131
    }

    stratified_cv = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 312)

    cv_preds = np.zeros(train_data.shape[0])
    for train_indices, cv_indices in stratified_cv.split(train_data, target_train):

        x_tr = train_data.iloc[train_indices]
        y_tr = target_train.iloc[train_indices]
        x_cv = train_data.iloc[cv_indices]
        y_cv = target_train.iloc[cv_indices]

        lgbm_clf = lgb.LGBMClassifier(**params)
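        #early stopping through a callback; the early_stopping_rounds and verbose
        #keywords of fit() were removed in LightGBM 4.0
        lgbm_clf.fit(x_tr, y_tr, eval_set= [(x_cv, y_cv)],
                eval_metric='auc', callbacks = [lgb.early_stopping(200, verbose = False)])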

        cv_preds[cv_indices] = lgbm_clf.predict_proba(x_cv, num_iteration = lgbm_clf.best_iteration_)[:,1]

    return roc_auc_score(target_train, cv_preds)

In [None]:
bopt_lgbm_300 = BayesianOptimization(lgbm_evaluation, {'num_leaves' : (25,50),
                                                   'max_depth' : (6,11),
                                                   'min_split_gain' : (0, 0.1),
                                                   'min_child_weight' : (5,80),
                                                   'min_child_samples' : (5,80),
                                                   'subsample' : (0.5,1),
                                                   'colsample_bytree' : (0.5,1),
                                                   'reg_alpha' : (0.001, 0.3),
                                                   'reg_lambda' : (0.001, 0.3)},
                                 random_state = 312)

#maximize() returns None; the results are stored on the optimizer object itself
bopt_lgbm_300.maximize(n_iter = 6, init_points = 4)

|   iter    |  target   | colsam... | max_depth | min_ch... | min_ch... | min_sp... | num_le... | reg_alpha | reg_la... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------
| 1         | 0.8011    | 0.6084    | 8.094     | 36.07     | 17.77     | 0.08661   | 40.52     | 0.1314    | 0.2057    | 0.5302    |
| 2         | 0.802     | 0.9637    | 6.587     | 54.72     | 34.07     | 0.08797   | 32.6      | 0.1777    | 0.02717   | 0.5482    |
| 3         | 0.8014    | 0.809     | 9.7       | 70.59     | 71.72     | 0.008772  | 42.08     | 0.02745   | 0.2914    | 0.6052    |
| 4         | 0.8017    | 0.8429

In [None]:
#extracting the best parameters (the optimizer tracks the best result in its max property)
best_params = bopt_lgbm_300.max['params']

print("Best Hyperparameters obtained for 300 selected features are:\n")
print(best_params)

Best Hyperparameters obtained for 300 selected features are:

{'colsample_bytree': 0.6729420012253402, 'max_depth': 8.465800066653655, 'min_child_samples': 55.99234119409433, 'min_child_weight': 33.97047696286344, 'min_split_gain': 0.02993571494711166, 'num_leaves': 31.080485031008543, 'reg_alpha': 0.2498037855480203, 'reg_lambda': 0.04666011834689482, 'subsample': 0.8486579368120211}
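
The optimizer searches a continuous space, so integer-valued hyperparameters such as `num_leaves`, `max_depth` and `min_child_samples` come back as floats and have to be rounded before they are passed to LightGBM. The sketch below shows one way this could be done programmatically. `cast_integer_params` is a hypothetical helper that is not part of the notebook, and only a subset of the values printed above is used.

```python
def cast_integer_params(raw_params,
                        integer_keys=('num_leaves', 'max_depth', 'min_child_samples')):
    """Round the integer-valued hyperparameters and leave the rest untouched."""
    return {key: int(round(value)) if key in integer_keys else value
            for key, value in raw_params.items()}

# subset of the best parameters printed above
raw = {'max_depth': 8.465800066653655,
       'min_child_samples': 55.99234119409433,
       'num_leaves': 31.080485031008543,
       'subsample': 0.8486579368120211}
tuned = cast_integer_params(raw)
print(tuned)
```

This mirrors the `int(round(...))` calls inside `lgbm_evaluation`, so the final model is trained with the same effective values that were scored during the search.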


### Modelling on Optimized Hyperparameters

In [None]:
params = {
        'objective' : 'binary',
        'boosting_type' : 'gbdt',
        'learning_rate' : 0.05,
        'n_estimators' : 5000,
        'n_jobs' : -1,
        'num_leaves' : 31,
        'max_depth' : 8,
        'min_split_gain' : 0.02993571494711166,
        'min_child_weight' : 33.97047696286344,
        'min_child_samples' : 56,
        'subsample': 0.8486579368120211,
        'subsample_freq' : 1,
        'colsample_bytree' : 0.6729420012253402,
        'reg_alpha' : 0.2498037855480203,
        'reg_lambda' : 0.04666011834689482,
        'verbosity' : -1,
        'seed' : 2131
    }
print("Fitting the model on Tuned parameters:")
#3 fold Stratified Cross Validation
stratified_cv = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 312)

#test and OOF-CV preds
test_preds = np.zeros(test_data.shape[0])
cv_preds = np.zeros(train_data.shape[0])

for i, (train_indices, cv_indices) in enumerate(stratified_cv.split(train_data, target_train),1):

    print(f"\n\tFold Number {i}\n")
    x_tr = train_data.iloc[train_indices]
    y_tr = target_train.iloc[train_indices]
    x_cv = train_data.iloc[cv_indices]
    y_cv = target_train.iloc[cv_indices]

    lgbm_clf = lgb.LGBMClassifier(**params)
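    #early stopping and logging through callbacks (lgb.log_evaluation needs LightGBM >= 3.3);
    #the early_stopping_rounds and verbose keywords of fit() were removed in LightGBM 4.0
    lgbm_clf.fit(x_tr, y_tr, eval_set= [(x_cv, y_cv)], eval_metric='auc',
            callbacks = [lgb.early_stopping(200), lgb.log_evaluation(200)])

    cv_preds[cv_indices] = lgbm_clf.predict_proba(x_cv, num_iteration = lgbm_clf.best_iteration_)[:,1]
    #averaging the test predictions over the folds
    test_preds += lgbm_clf.predict_proba(test_data, num_iteration = lgbm_clf.best_iteration_)[:,1] / stratified_cv.n_splits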

    #saving each folds model
    with open(f'LGBM Deployment/clf_fold{i}.pkl', 'wb') as f:
        pickle.dump(lgbm_clf, f)

#checking the Final ROC_AUC Score on CV Data
print(f"\nCV ROC-AUC Score = {roc_auc_score(target_train, cv_preds)}")

Fitting the model on Tuned parameters:

	Fold Number 1

Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.804662	valid_0's binary_logloss: 0.231271
[400]	valid_0's auc: 0.805389	valid_0's binary_logloss: 0.231017
Early stopping, best iteration is:
[279]	valid_0's auc: 0.805666	valid_0's binary_logloss: 0.230901

	Fold Number 2

Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.799072	valid_0's binary_logloss: 0.233269
[400]	valid_0's auc: 0.800581	valid_0's binary_logloss: 0.232776
[600]	valid_0's auc: 0.800355	valid_0's binary_logloss: 0.232935
Early stopping, best iteration is:
[415]	valid_0's auc: 0.800772	valid_0's binary_logloss: 0.232707

	Fold Number 3

Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.801353	valid_0's binary_logloss: 0.231669
[400]	valid_0's auc: 0.802282	valid_0's binary_logloss: 0.231335
Early stopping, best iteration is:
[298]	valid_0's auc: 0.802581	vali

In [None]:
#submitting the result
pd.DataFrame({'SK_ID_CURR':skid_test, 'TARGET' : test_preds}).to_csv('LGBM_deployment.csv', index = False)
!kaggle competitions submit -c home-credit-default-risk -f LGBM_deployment.csv -m "LightGBM model for deployment"
print('Successfully submitted to Home Credit Default Risk')

Successfully submitted to Home Credit Default Risk


<img src = 'LGBM Deployment Model.png' />

In [None]:
#tuning the threshold for best J-Statistic

fpr, tpr, threshold = roc_curve(target_train, cv_preds)
j_stat = tpr - fpr
best_threshold = threshold[np.argmax(j_stat)]
print(f"Best Threshold = {best_threshold}")

Best Threshold = 0.07180553151892075
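
For readers unfamiliar with the J-Statistic used above: Youden's J is defined as J = TPR - FPR, and the threshold that maximizes it is the ROC point farthest above the diagonal. The sketch below repeats the same computation on a tiny made-up dataset, so that the chosen threshold can be checked by hand. The labels and scores are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve

# toy labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j_stat = tpr - fpr                      # Youden's J at every candidate threshold
best_idx = np.argmax(j_stat)
best_threshold = thresholds[best_idx]

# scores at or above the threshold are predicted as the positive class
y_pred = (y_score >= best_threshold).astype(int)
print(best_threshold, j_stat[best_idx], y_pred)
```

A threshold well below 0.5, like the one obtained above, is common when the positive class is rare, because predicted probabilities then cluster near zero.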


## Saving the Relational Table

### Saving In Pickle Form

In [None]:
#saving the relational tables with reduced feature set
relational_tables_prepare('relational_300_feats')

Loading the tables into memory...
Done.
Time Elapsed = 0:02:21.028309

Removing the non-useful features...
Done.
Time Elapsed = 0:00:00.772336

Merging all the tables, and saving to pickle file 'relational_table.pkl'...
Done.
Total Time taken = 0:02:38.289286


### Saving All Data in database

<pre>For deployment, we need the applications from the application_train and application_test tables for testing purposes. Asking a user to enter around 120 input fields by hand would be too time-consuming, so we will only test with applications already present in these tables, and we will save them to the DataBase.
We will create a DataBase named HOME_CREDIT_DB which will contain all the tables required during the deployment phase.
The tables stored will be:
1. applications table
2. relational table
</pre>
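
As a sketch of how these two tables might be used at prediction time, the example below looks up one applicant by `SK_ID_CURR` and joins the pre-aggregated features onto the application row. It uses an in-memory SQLite database with made-up rows. The table names follow the list above, but the feature columns (`AMT_CREDIT`, `AMT_PAYMENT_SUM`) and the values are assumptions for illustration, and the real deployment code may work differently.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in for HOME_CREDIT_DB.db
try:
    conn.execute('CREATE TABLE applications (SK_ID_CURR INTEGER, AMT_CREDIT REAL)')
    conn.execute('CREATE TABLE relational_table (SK_ID_CURR INTEGER, AMT_PAYMENT_SUM REAL)')
    conn.executemany('INSERT INTO applications VALUES (?, ?)',
                     [(100001, 568800.0), (100002, 406597.5)])
    conn.execute('INSERT INTO relational_table VALUES (?, ?)', (100002, 219625.7))

    # LEFT JOIN keeps the application even when it has no relational history
    query = '''
        SELECT a.SK_ID_CURR, a.AMT_CREDIT, r.AMT_PAYMENT_SUM
        FROM applications AS a
        LEFT JOIN relational_table AS r ON a.SK_ID_CURR = r.SK_ID_CURR
        WHERE a.SK_ID_CURR = ?
    '''
    with_history = conn.execute(query, (100002,)).fetchone()
    without_history = conn.execute(query, (100001,)).fetchone()
finally:
    conn.close()

print(with_history)
print(without_history)
```

Passing the ID as a bound parameter (`?`) rather than formatting it into the SQL string avoids injection problems if the ID ever comes from user input.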

In [None]:
def save_to_db(verbose = True):
    '''
    Function to save the required tables to DataBase

    Inputs:
        verbose: bool, default = True
            Whether to keep verbosity or not

    Returns:
        None
    '''

    if verbose:
        print("Loading the files and saving to DataBase...\n")
        start = datetime.now()

    #loading the application tables
    application_train = reduce_mem_usage(pd.read_csv('application_train.csv'), verbose = False)
    application_test = reduce_mem_usage(pd.read_csv('application_test.csv'), verbose = False)
    #removing target column from application_train
    _ = application_train.pop('TARGET')
    #combining the train and test DataFrames
    #(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    applications_all = pd.concat([application_train, application_test], ignore_index = True)
    #saving this to sqlite database

    print("Saving applications table to DataBase...")
    try:
        #creating the DataBase
        engine = create_engine('sqlite:///HOME_CREDIT_DB.db')
        conn = engine.connect()
        table_name = 'applications'
        applications_all.to_sql(table_name, conn, index = False)
        conn.close()

        #also saving the relational table to this db
        with open('LGBM Deployment/relational_300_feats.pkl', 'rb') as f:
            relational_table = reduce_mem_usage(pickle.load(f), verbose = False)

        conn = sqlite3.connect('HOME_CREDIT_DB.db')
        table_name = 'relational_table'
        relational_table.to_sql(table_name, conn, index = False)
    finally:
        conn.close()

    if verbose:
        print("Done.")
        print(f"Time Taken = {datetime.now() - start}")

In [None]:
save_to_db()

Loading the files and saving to DataBase...

Saving applications table to DataBase...


Done.
Time Taken = 0:02:14.429377
