# **BinaryClassification: Customer_Transaction_Predict**

1. **Define the Problem**
2. **Load and Check the Data**
    * 2.1 import data modeling libraries
    * 2.2 preview data, check for outliers and missing values
3. **Exploratory Data Analysis (EDA)**
    * 3.1 data visualization and statistics
    * 3.2 missing values imputation
    * 3.3 skewed data preprocessing
    * 3.4 reducing memory size
    * 3.5 imbalanced output classes
4. **Feature Engineering**
    * 4.1 feature normalization
    * 4.2 feature interactions
    * 4.3 feature encodings
    * 4.4 feature selection
5. **Modeling & Hyperparameter Tuning**
    * 5.1 split dataset
    * 5.2 simple model and feature importance  
    * 5.3 compare MLA models
    * 5.4 hyperparameter tuning
    * 5.5 visualization: learning curves and tree structure 
6. **Model Ensemble and Prediction**
    * 6.1 prediction correlations
    * 6.2 ensemble modeling
    * 6.3 predict and submit results

# **1. Define the Problem**
 
This Kaggle project aims to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The output Y is the "target" column of the training dataset, which contains "1" for a customer who made the transaction and "0" for one who did not. This is a supervised binary classification problem.

Submissions are evaluated on the **area under the ROC curve (AUC)** between the predicted probability and the observed target, so we use **roc_auc_score** as our metric.
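
To make the metric concrete: AUC equals the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The helper name `pairwise_auc` and the toy data are illustrative assumptions; in the notebook itself `roc_auc_score` does this work.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# toy labels and predicted probabilities (made up for illustration)
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(y_toy, p_toy)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, AUC does not depend on any classification threshold.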

# **2. Load and Check the Data**

**2.1 import data modeling libraries**


In [None]:
import pandas as pd
import numpy as np
import random
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from timeit import default_timer as timer
import time

from sklearn.model_selection import StratifiedKFold, learning_curve, GridSearchCV, cross_validate, train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, roc_curve
from xgboost import XGBClassifier
import lightgbm as lgb

**2.2 preview data, check for outliers and missing values**

In [None]:
train = pd.read_csv("../input/santander-customer-transaction-prediction/train.csv")
test = pd.read_csv("../input/santander-customer-transaction-prediction/test.csv")

In [None]:
## preview the data and check missing values
train.info()
# test.info()
# train.isnull().sum()
print('\nTraining dataset:')
print('isnull: ' + str(Counter(train.isnull().sum())))
print('target_labels: ' + str(Counter(train['target'])) + '\n')
print('Testing dataset:')
print('isnull: ' + str(Counter(test.isnull().sum())) + '\n')

# train.sample(10)
train.describe(include='all')

In [None]:
## check the data outliers in each column
features = train.columns.values
outlier_indices = []

for col in features[2:]:
    # 1st quartile: 25%
    Q1 = np.percentile(train[col],25)
    # 3rd quartile: 75%
    Q3 = np.percentile(train[col],75)
    # interquartile range
    IQR = Q3 - Q1
    outlier_step = 1.5*IQR
    # generate a list of indices of outliers for feature col
    outlier_list_col = train[(train[col] < Q1-outlier_step) | (train[col] > Q3+outlier_step)].index 
    outlier_indices.extend(outlier_list_col)

# a dictionary that count outlier features for each row index 
outlier_indices = Counter(outlier_indices) 
print("Max number of outlier features in a row: ",max(outlier_indices.values()))


> As we see, the maximum number of outlier features in any row is only 4 out of 200 features, so we don't have to drop any rows as outliers.
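
The same 1.5*IQR rule can also be written in vectorized form. The following is a minimal sketch on a made-up toy frame; the helper `count_outlier_features` and the column names are hypothetical. It returns, for each row, the number of features flagged as outliers.

```python
import pandas as pd

def count_outlier_features(df, k=1.5):
    """Return, per row, how many columns fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    step = k * (q3 - q1)
    # compare every column against its own lower/upper bound
    is_outlier = df.lt(q1 - step, axis=1) | df.gt(q3 + step, axis=1)
    return is_outlier.sum(axis=1)

# toy data (made up): only the last value of var_a is far from the rest
toy = pd.DataFrame({
    "var_a": [1, 2, 3, 4, 5, 100],
    "var_b": [10, 11, 12, 13, 14, 15],
})
outlier_counts = count_outlier_features(toy)
print(outlier_counts.tolist())
```

Taking `outlier_counts.max()` then gives the same kind of summary as the printed maximum above.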

# **3. Exploratory Data Analysis (EDA)**
**3.1 data visualization and statistics**

The column "ID_code" has 200K unique values and is not expected to be correlated with the target. The distribution of ID numbers is uniform for both target labels, so we don't need to include the ID numbers in the modeling.

In [None]:
## extract the numeric part of each ID_code as an integer so that it can be plotted
train["ID"] = train["ID_code"].str.split("_").str[1].astype(int)
p = sns.FacetGrid(train,col='target',aspect=2)
p.map(sns.distplot,'ID')

For data visualization, we need to consider which kinds of plot match the data types.
Excluding 'ID_code', this dataset has two data types: (1) 'target': categorical; (2) 'var_#': numerical with continuous values.

* **For a categorical x-axis and a numerical y-axis, we can use:** boxplot, barplot, violinplot, hist, etc.
* **For a numerical x-axis and a categorical legend, we can use:** distplot, kdeplot, etc.

We cannot plot all 200 feature columns here, so we randomly pick a few to inspect.

In [None]:
## firstly, visualize the label imbalance
print(train['target'].value_counts())

f,axe=plt.subplots(1,2,figsize=(16,6))
train['target'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=axe[0],shadow=True)
# axe[0].set_title('target')
axe[0].set_ylabel('')
sns.countplot(x='target',data=train,ax=axe[1])
# axe[1].set_title('target')
plt.show()

In [None]:
## summary statistics (max/mean/min) of all feature columns in the train set
describe_df_train = train.describe(include='all').drop(labels=['ID_code', 'target','ID'], axis=1)
# print(describe_df_train.columns.values)
## str.lstrip strips a set of characters rather than a prefix, so remove the 'var_' prefix explicitly
feature_labels_train = np.array(describe_df_train.columns.str.replace('var_','',regex=False)).astype(np.int16)
feature_means_train = np.array(describe_df_train.loc['mean'])
feature_mins_train = np.array(describe_df_train.loc['min'])
feature_maxs_train = np.array(describe_df_train.loc['max'])

## summary statistics (max/mean/min) of all feature columns in the test set
describe_df_test = test.describe(include='all').drop(labels=['ID_code'], axis=1)
# print(describe_df_test.columns.values)
feature_labels_test = np.array(describe_df_test.columns.str.replace('var_','',regex=False)).astype(np.int16)
feature_means_test = np.array(describe_df_test.loc['mean'])
feature_mins_test = np.array(describe_df_test.loc['min'])
feature_maxs_test = np.array(describe_df_test.loc['max'])

## plot
plt.figure(figsize=(12,16))
plt.subplot(211)
plt.fill_between(feature_labels_train,feature_mins_train,feature_maxs_train,alpha=0.2,color='g')
plt.plot(feature_labels_train,feature_means_train,'o-',color='g',label='Train')
plt.title("Train Dataset");
plt.xlabel("Feature Number #")
plt.ylabel("Feature Value")
plt.legend(loc="best"); plt.grid()
plt.subplot(212)
plt.fill_between(feature_labels_test,feature_mins_test,feature_maxs_test,alpha=0.2,color='g')
plt.plot(feature_labels_test,feature_means_test,'o-',color='g',label='Test')
plt.title("Test Dataset");
plt.xlabel("Feature Number #")
plt.ylabel("Feature Value")
plt.legend(loc="best"); plt.grid()

In [None]:
## randomly choose 12 distinct features to look at (random.sample avoids duplicates)
feature_plot_list = ['var_%d' % i for i in random.sample(range(200), 12)]

## group by 'target' and show the mean
train[['target']+feature_plot_list].groupby('target').mean()

In [None]:
## plot the feature distributions by target
fig, axs = plt.subplots(4,3,figsize=(20,16))
feature_idx = 0
for i in range(4):
    for j in range(3): 
        g = sns.kdeplot(train[feature_plot_list[feature_idx]][train["target"] == 0], color="Red", ax =axs[i,j], shade = True)
        g = sns.kdeplot(train[feature_plot_list[feature_idx]][train["target"] == 1], color="blue", ax =axs[i,j], shade = True)
#         g = g.legend(["Skewness: %.2f"% (train[feature_plot_list[feature_idx]][train["target"] == 0].skew()),
#                       "Skewness: %.2f"% (train[feature_plot_list[feature_idx]][train["target"] == 1].skew())])
        g.legend(["0","1"])
        feature_idx += 1

**3.2 missing values imputation**

From the data preview in step 2.2, we saw no missing values in either the training dataset or the test dataset.
So here we can skip this step of filling the missing values.


**3.3 skewed data preprocessing**

In the data visualization in step 3.1, the randomly picked features showed very little skewness.
So we can skip this step as well.
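
Visual inspection can be backed by a number. The sketch below uses pandas' `skew()` on a made-up toy frame. The column names and the cutoff of 1.0 are illustrative assumptions, not values taken from this dataset.

```python
import pandas as pd

# toy data (made up): one symmetric column and one with a long right tail
toy = pd.DataFrame({
    "symmetric":    [-2.0, -1.0, 0.0, 1.0, 2.0],
    "right_skewed": [1.0, 1.0, 1.0, 1.0, 10.0],
})
skewness = toy.skew()

# flag columns whose absolute skewness exceeds a chosen cutoff (1.0 is an arbitrary assumption)
skewed_cols = skewness[skewness.abs() > 1.0].index.tolist()
print(skewness)
print(skewed_cols)
```

Applied to the feature columns of `train`, the same call would list any feature that might benefit from a transformation such as a log or Box-Cox transform.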

**3.4 reducing memory size**

From the plotted feature values in step 3.1, we can see that all features have mean values in the range (0, 25) and values within (-100, 100). Values of this size do not need float64 precision, so we can reduce the memory size of the train/test dataframes to facilitate the modeling step.

In [None]:
def reduce_memory_usage(df):
    """Downcast feature columns: integer-valued ones to int8, the rest to float32."""
    df_new = pd.DataFrame()
    col2int_list = []
    for col in df.columns:
        if col in ['ID_code','target','ID']:
            ## keep identifier and label columns unchanged
            df_new[col] = df[col]
        else:
            max_val = df[col].max()
            min_val = df[col].min()
            ## check if the column holds only integer values; use absolute residuals
            ## so that positive and negative residuals cannot cancel each other out
            residual_series = (df[col] - df[col].astype(np.int64)).abs()
            isInt = residual_series.sum() < 0.01
            ## int8 only covers [-128, 127], so also check the value range
            fits_int8 = min_val >= np.iinfo(np.int8).min and max_val <= np.iinfo(np.int8).max
            if isInt and fits_int8:
                col2int_list.append(col)
                df_new[col] = df[col].astype(np.int8)
            else:
                df_new[col] = df[col].astype(np.float32)
    return df_new, col2int_list
 

In [None]:
# start_mem_usg = train.memory_usage().sum() / 1024**2 
# print("Memory usage of train dataframe is :",start_mem_usg," MB") 
print ("############## train dataset #############")
print('Before reducing memory usage:')
train.info()
print('\nAfter reducing memory usage:')
train_reduced,_ = reduce_memory_usage(train)
train_reduced.info()
print ("\n############## test dataset #############")
print('Before reducing memory usage:')
test.info()
print('\nAfter reducing memory usage:')
test_reduced,_ = reduce_memory_usage(test)
test_reduced.info()

Here we can see that the memory usage of the train and test datasets is reduced by 50%, which is what we expect when float64 columns are downcast to float32.
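
The saving can also be measured directly instead of reading it off `info()`. This is a minimal sketch on a made-up frame of random floats; the frame and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy frame (made up): 1000 rows of float64 values in 4 columns
toy = pd.DataFrame(np.random.default_rng(0).normal(size=(1000, 4)),
                   columns=["var_0", "var_1", "var_2", "var_3"])

# bytes used by the column data only (index excluded)
bytes_before = toy.memory_usage(index=False).sum()
bytes_after = toy.astype(np.float32).memory_usage(index=False).sum()
print(bytes_before, bytes_after, bytes_after / bytes_before)
```

Each float64 value takes 8 bytes and each float32 value takes 4, so the numeric columns shrink to half their size.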

**3.5 imbalanced output classes**

From section 3.1 we can see that the output classes are highly imbalanced:

* 0: 179902
* 1: 20098

There are different techniques to deal with an imbalanced classification problem, such as:
* **Resampling**: undersampling of the majority class, or oversampling of the minority class
* **Change performance metrics**: instead of using accuracy, use the confusion matrix, precision/recall/F1-score, or ROC-AUC
* **Change machine learning algorithm**: use models that penalize mistakes on the minority class; tree models typically perform well on imbalanced data

In this study, we will focus on using the ROC-AUC performance metric and compare several different ML models.
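
A quick calculation with the class counts above shows why accuracy is a poor metric here. The variable names are illustrative; the counts are the ones reported in section 3.1.

```python
# class counts reported in section 3.1
n_neg, n_pos = 179902, 20098
n_total = n_neg + n_pos

# accuracy of a trivial model that predicts "0" for every customer
baseline_accuracy = n_neg / n_total

# ratio of negatives to positives, a common starting point for class weights
imbalance_ratio = n_neg / n_pos

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"negatives per positive: {imbalance_ratio:.2f}")
```

A model that never predicts a transaction already reaches about 90% accuracy, yet its constant output cannot rank customers at all and corresponds to an AUC of 0.5. The ratio of roughly 9 to 1 is also in line with the commented-out `class_weight` of `{0:1,1:9}` that appears later in the logistic regression grid.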

# **4. Feature Engineering**

**4.1 Feature Normalization**

From the EDA in section 3, we gained more insight into the dataset. All features are numerical and close to normally distributed, so we can standardize each feature: 

X = (X - mean) / std

Feature normalization is essential for some ML models, e.g. SVM, whereas tree models are typically not affected by any monotonic transformation.

In [None]:
def feature_standardization(df_train,df_test):
    df_train_stand = pd.DataFrame()
    df_test_stand = pd.DataFrame()
    for col in df_train.columns:
        if col in ['ID_code','target','ID']:
            df_train_stand[col] = df_train[col]
            if col in df_test.columns:
                df_test_stand[col] = df_test[col]
        else:
            mean = df_train[col].mean()
            std  = df_train[col].std()
            ## use the same parameters for both train and test set
            df_train_stand[col] = (df_train[col] - mean)/std
            df_test_stand[col] = (df_test[col] - mean)/std
    return df_train_stand,df_test_stand

In [None]:
train_standard, test_standard = feature_standardization(train,test)

# train_standard.dtypes

## after the standardization, the datatype changed to float64, so we need to reduce the memory one more time here
train_standard,_ = reduce_memory_usage(train_standard)
test_standard,_ = reduce_memory_usage(test_standard)

train_standard.describe()

**4.2 Feature Interactions**

In this dataset, all features are anonymized (their names carry no meaning), so we cannot construct meaningful feature interactions to add new features. We skip this step.


**4.3 Feature Encodings**

In this dataset, all features are numerical with anonymized names, so we cannot do meaningful grouping or categorization. This step is also skipped. 

**4.4 Feature Selection**

We have 200 features and 200k samples, which can make training quite slow. So we want to reduce the feature space by evaluating feature importance and removing the features that don't influence the outcome or that correlate strongly with other features.

Here we will try two different methods:
* **Filtering method**: evaluate the Pearson correlation between features, and between each feature and the target
* **Embedded method**: rank the features by the built-in feature importance of tree models (we will do this in **section 5**)

In [None]:
## plot the correlations between the target column and the first 20 features
## results show very small corr values, so we will move on to the embedded method to rank the feature importance
fig, axe = plt.subplots(figsize=(13,10))
colormap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(train_standard.iloc[:,1:22].corr(),annot=True, fmt = ".2f", cmap = colormap, ax=axe)

The data shows almost no correlation between features, which indicates that these anonymous features may already have been well engineered. At this point, we keep all the features and re-evaluate the feature selection in the modeling step in section 5.
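
The feature-to-target half of the filtering method can be sketched with `DataFrame.corrwith`. The example below runs on synthetic data; the frame, the column names and the noise level are assumptions for illustration, not properties of this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# synthetic binary target and two features (made up for illustration)
target = pd.Series(rng.integers(0, 2, size=n), name="target")
toy = pd.DataFrame({
    "informative": target + rng.normal(0, 1, size=n),   # shifts with the target
    "noise": rng.normal(0, 1, size=n),                  # unrelated to the target
})

# absolute Pearson correlation of each feature with the target, strongest first
corr_with_target = toy.corrwith(target).abs().sort_values(ascending=False)
print(corr_with_target)
```

On the real data the same call would be applied to the feature columns of `train_standard` against `train_standard['target']`, and features with near-zero values would be candidates for removal.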

# **5. Modeling & Hyperparameter Tuning**

**5.1 split dataset**

Here we will split our dataset into an 80% training set and a 20% validation set.
We will start with the "train_test_split" method to get a single train/validation pair for our simple model. Later, we will also use a stratified 5-fold split of the train and validation data for model comparisons.
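
Stratification matters here because of the class imbalance: it keeps the share of positives the same in every fold. The sketch below checks this on made-up labels with 10% positives; the arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels with 10% positives (made up for illustration)
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.zeros((len(y_toy), 1))   # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# share of positive labels in the validation part of each fold
fold_pos_rates = [y_toy[val_idx].mean() for _, val_idx in skf.split(X_toy, y_toy)]
print(fold_pos_rates)
```

The `stratify=y` argument of `train_test_split` serves the same purpose for the single 80/20 split.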

In [None]:
## define the X, Y of train/test set
y = train_standard['target']
X = train_standard.drop(labels=['ID_code','target','ID'],axis=1)
X_test = test_standard.drop(labels=['ID_code'],axis=1)

## train and cross-validation set split
X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=0.8,random_state=42,stratify=y)

In [None]:
## k-fold train and cross-validation split
kfold = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

**5.2 simple model and feature importance**

Starting with a simple model gives us a baseline prediction for this problem and also provides feature importance scores to guide the feature selection.

A starting model should be fast. LightGBM is a good choice when dealing with such a large dataset.

In [None]:
## start with a lgb model and non-optimized parameters
lgb_train_data = lgb.Dataset(X_train, label=y_train)
lgb_val_data   = lgb.Dataset(X_val, label=y_val)

param = {
    'metric':'auc',
    'objective':'binary',
    'verbosity':1,
}
num_round = 2000

bst = lgb.train(param,lgb_train_data,num_round,valid_sets=[lgb_val_data],verbose_eval=100,early_stopping_rounds=400)

In [None]:
## let's take a look at the ROC plot
y_predict = bst.predict(X_val,num_iteration=bst.best_iteration)
fpr, tpr, thresholds = roc_curve(y_val, y_predict)
plt.figure(figsize=(6,5.5))
plt.plot(fpr, tpr, label='LGBM')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

Here, the initial model gives us an AUC score of 88.79%, which is not bad. We can now analyze the feature importance and evaluate whether we can reduce the feature space with feature selection.

In [None]:
## plot the feature importance score of all features
feature_importances = pd.Series(bst.feature_importance(),index=X_train.columns)
plt.figure(figsize=(14,5))
feature_importances.reset_index()[0].plot(kind='line',grid=True,label='LGBM')
plt.xlabel('Feature Number')
plt.ylabel('Feature Importance')
plt.title('Feature Importance Plot')
plt.legend(loc='best')
plt.show()

In [None]:
## compare model performance when keeping only the features above each importance threshold
feature_scores = list(range(0,120,20))
display_df = pd.DataFrame(columns=['Feature Count',"Accuracy", "ROC", "Precision", "Recall", "F1 Score", "Run Time"])

for score in feature_scores:
        
    feature_selected = feature_importances[feature_importances>score].index.tolist()
    feature_count = len(feature_selected)
    
    lgb_train_data_selected = lgb.Dataset(X_train[feature_selected],y_train)
    lgb_val_data_selected   = lgb.Dataset(X_val[feature_selected], y_val)
    
    start = timer()
    bst_selected = lgb.train(param,lgb_train_data_selected,num_round,valid_sets=[lgb_val_data_selected],verbose_eval=100,early_stopping_rounds=400)
    run_time = timer()-start
    y_predict = bst_selected.predict(X_val[feature_selected],num_iteration=bst_selected.best_iteration)
    
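    # calculate the threshold that maximizes TPR - FPR (Youden's J statistic)
    fpr, tpr, thresholds = roc_curve(y_val, y_predict)
    optimal_idx = np.argmax(tpr - fpr)
    threshold = thresholds[optimal_idx]
    # roc_curve counts a sample as positive when its score is >= the threshold
    y_predict_binary = np.where(y_predict >= threshold, 1, 0)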

    # calculate evaluation metrics 
    roc = roc_auc_score(y_val,y_predict) 
    acc = accuracy_score(y_val,y_predict_binary)
    prc = precision_score(y_val,y_predict_binary)
    rcl = recall_score(y_val,y_predict_binary)
    f1  = f1_score(y_val,y_predict_binary)
    ## create a dataframe to summarize the scores
    display_df.loc[score]=[feature_count,acc,roc,prc,rcl,f1,run_time]

In [None]:
## lets take a look at the summary table
display_df

By setting the feature importance threshold to 20, we keep 188 features. Even though this saves little run time or memory, it increases the ROC AUC by 0.1%.
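
The accuracy, precision, recall and F1 columns in the table depend on the probability threshold, which the loop above picks by maximizing `tpr - fpr` (known as Youden's J statistic). The sketch below reproduces that selection with plain NumPy on made-up scores; the helper name `youden_threshold` and the toy data are assumptions for illustration.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return the score threshold that maximizes TPR - FPR, and that maximum."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    best_t, best_j = None, -1.0
    for t in np.unique(scores):              # candidate thresholds, ascending
        pred = scores >= t                   # predicted positive at this threshold
        tpr = (pred & (y_true == 1)).sum() / n_pos
        fpr = (pred & (y_true == 0)).sum() / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j

# toy labels and predicted probabilities (made up for illustration)
y_toy = [0, 0, 0, 1, 1, 0, 1, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
best_threshold, best_j = youden_threshold(y_toy, p_toy)
print(best_threshold, best_j)
```

The ROC AUC column is unaffected by this choice, since it is computed from the raw probabilities.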

In [None]:
feature_selected = feature_importances[feature_importances>20].index.tolist()
print(feature_selected)

**5.3 compare MLA models**

For this binary classification problem with a large feature space, let's consider the following models with relatively short run times:
* Logistic regression
* Naive Bayes
* LightGBM
* XGBoost
* Ensemble


In [None]:
## List of Models to check
MLA = {
    'LogisticRegression': LogisticRegressionCV(verbose=2),
    'Naive Bayes': GaussianNB(),
    'LightGBM': lgb.LGBMClassifier(verbose=1),
#     'XGBoost': XGBClassifier(),
}

In [None]:
## create a dataframe to report the training scores of each MLA
MLA_columns = ['MLA_name','Parameters','Train_AUC','Validation_AUC','Validation_AUC_Sigma','Training_Time']
MLA_rows = []

for name, clf in MLA.items():
    print("Training:",str(clf.__class__.__name__))
    cv_results = cross_validate(clf,X,y,cv=kfold,scoring='roc_auc',return_train_score=True)
    MLA_rows.append({
        'MLA_name': name,
        'Parameters': str(clf.get_params()),
        'Train_AUC': cv_results['train_score'].mean(),
        'Validation_AUC': cv_results['test_score'].mean(),
        'Validation_AUC_Sigma': cv_results['test_score'].std(),
        'Training_Time': cv_results['fit_time'].mean() 
    })

## DataFrame.append was removed in pandas 2.0, so build the table from a list of rows
MLA_compare = pd.DataFrame(MLA_rows, columns=MLA_columns)

In [None]:
MLA_compare

A simple training of these three models with default hyperparameters shows good AUC on the validation set. In the next section, let's see if we can further improve the AUC score of logistic regression and LightGBM through hyperparameter tuning.

In [None]:
FS_rows = []

for name, clf in MLA.items():
    print("Training:",str(clf.__class__.__name__))
    cv_results = cross_validate(clf,X[feature_selected],y,cv=kfold,scoring='roc_auc',return_train_score=True)
    FS_rows.append({
        'MLA_name': name + ' w/ FS',
        'Parameters': str(clf.get_params()),
        'Train_AUC': cv_results['train_score'].mean(),
        'Validation_AUC': cv_results['test_score'].mean(),
        'Validation_AUC_Sigma': cv_results['test_score'].std(),
        'Training_Time': cv_results['fit_time'].mean() 
    })

## stack the feature-selection results below the earlier results (DataFrame.append was removed in pandas 2.0)
MLA_compare_w_featureSelection = pd.concat([MLA_compare, pd.DataFrame(FS_rows, columns=MLA_compare.columns)], ignore_index=True)

In [None]:
MLA_compare_w_featureSelection

Here we can see that feature selection improves the ROC AUC score for all three models.

**5.4 hyperparameter tuning**

Here we will use the GridSearchCV method to optimize the hyperparameters.

In [None]:
## optimize the parameter of Logistic Regression 
param_grid_logReg = {
    'scoring':['roc_auc'],
    'multi_class': ['ovr'],
#     'class_weight':[None, {0:1,1:9}],
    'max_iter':[50,100],
    'fit_intercept':[True],          # "False" is faster but worse fitting
    'solver':['lbfgs'],  # 'sag' and 'saga' are slower and gave worse fitting, 'newton-cg' is slightly better but much slower 
    'random_state':[42]
}
logReg_clf = LogisticRegressionCV()
logReg_gs  = GridSearchCV(logReg_clf,param_grid=param_grid_logReg,cv=kfold,scoring='roc_auc',n_jobs=4,verbose=2)
logReg_gs.fit(X[feature_selected],y)

In [None]:
logReg_best = logReg_gs.best_estimator_
print(logReg_best.get_params())
print(logReg_gs.best_score_)

Here we can see that logistic regression shows no significant improvement, likely because such a simple model is limited by bias (underfitting).

In [None]:
## optimize the parameter of Naive Bayes
param_grid_nb = {'priors': [None], 'var_smoothing': [1e-03,1e-06,1e-09]}
nb_clf = GaussianNB()
nb_gs  = GridSearchCV(nb_clf,param_grid=param_grid_nb,cv=kfold,scoring='roc_auc',verbose=2)
nb_gs.fit(X[feature_selected],y)

In [None]:
nb_best = nb_gs.best_estimator_
print(nb_best.get_params())
print(nb_gs.best_score_)

For the LightGBM model, we checked multiple hyperparameters. To save running time, only the optimized value is listed for some hyperparameters. An "n_estimators" value of 1,000,000 gives an even better AUC score; however, we list only 10,000 here to save run time in this step. 

In [None]:
## optimize the parameter of Light GBM
param_grid_lgb = {
    'boosting_type': ['gbdt'],
#     'class_weight': [None, {0:1,1:7}],
    'colsample_bytree': [0.05],  ## compared [0.05, 0.5, 1]
    'learning_rate':[0.01],
#     'max_depth':[5,20,-1],
    'min_child_samples': [80],
    'min_child_weight': [10],
    'n_estimators':[10000],  ## compared [100,1000,10000,1000000]
    'num_leaves':[13],         ## compared [13, 31]
    'objective':['binary'],
#     'random_state':[42],
    'reg_alpha': [0],          ## compared [0,5,10]
    'reg_lambda': [10],        ## compared [0,5,10]
    'subsample': [0.4],        ## compared [0.4, 1]
    'subsample_freq': [5],     ## compared [0, 5]
#     'n_jobs':[4],
}

lgb_clf = lgb.LGBMClassifier()
lgb_gs  = GridSearchCV(lgb_clf,param_grid=param_grid_lgb,cv=kfold,scoring='roc_auc',n_jobs=4,verbose=2)
lgb_gs.fit(X[feature_selected],y)

In [None]:
lgb_best = lgb_gs.best_estimator_
print(lgb_best.get_params())
print(lgb_gs.best_score_)

Wow! By tuning the hyperparameters, we now get 0.9003 AUC with LightGBM (we will see further improvement if we increase 'n_estimators' to 1,000,000).

**5.5 visualization: learning curves and tree structure**

Learning curves are a good way to see overfitting on the training set and the effect of the training size on the AUC score.


In [None]:
# ## optimized hyperparameters
# MLA_param = {
#     'LogisticRegression': {'Cs': 10, 'class_weight': None, 'cv': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1.0, 'l1_ratios': None, 'max_iter': 50, 'multi_class': 'ovr', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'refit': True, 'scoring': 'roc_auc', 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0},
#     'Naive Bayes': {'priors': None, 'var_smoothing': 1e-09},
#     'LightGBM': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.05, 'importance_type': 'split', 'learning_rate': 0.01, 'max_depth': -1, 'min_child_samples': 80, 'min_child_weight': 10, 'min_split_gain': 0.0, 'n_estimators': 10000, 'n_jobs': -1, 'num_leaves': 13, 'objective': 'binary', 'random_state': None, 'reg_alpha': 0, 'reg_lambda': 10, 'silent': True, 'subsample': 0.4, 'subsample_for_bin': 200000, 'subsample_freq': 5},
# #     'XGBoost': {},
# }

## optimized estimators
MLA_optm = {
    'LogisticRegression': logReg_best,
    'Naive Bayes': nb_best,
    'LightGBM': lgb_best,
#     'XGBoost': XGBClassifier(),
}

In [None]:
## plot learning curves
def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=-1, train_sizes=np.linspace(.1,1.0,5)):
    plt.figure(figsize=(12,6))
    plt.title(title)
    plt.xlim((0,200000))
    plt.xlabel("Training Sample Size")  # the x-axis shows absolute sample counts, not percentages
    plt.ylabel('AUC Score')
    train_sizes_abs, train_scores, val_scores = learning_curve(estimator,X,y,scoring='roc_auc',cv=cv,n_jobs=n_jobs,train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std  = np.std(train_scores, axis=1)
    val_scores_mean   = np.mean(val_scores, axis=1)
    val_scores_std    = np.std(val_scores, axis=1)
    plt.grid()
    plt.plot(train_sizes_abs,train_scores_mean,'o-',color='blue',label='Training AUC')
    plt.plot(train_sizes_abs,val_scores_mean,'o-',color='r',label='Validating AUC')
    plt.fill_between(train_sizes_abs,train_scores_mean-train_scores_std,train_scores_mean+train_scores_std,alpha=0.1,color='blue')
    plt.fill_between(train_sizes_abs,val_scores_mean-val_scores_std,val_scores_mean+val_scores_std,alpha=0.1,color='r')
    plt.legend(loc='best')
    plt.show()
    return plt  

In [None]:
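## use the same selected features that the estimators were tuned on
plot_learning_curve(logReg_best, 'Logistic Regression', X[feature_selected], y, cv=kfold)
plot_learning_curve(nb_best, 'Naive Bayes', X[feature_selected], y, cv=kfold)
plot_learning_curve(lgb_best, 'LightGBM', X[feature_selected], y, cv=kfold)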

Let's also take a look at the first tree structure in the light GBM estimator.

In [None]:
lgb.plot_tree(lgb_best, tree_index=0, figsize = (20,14))

# **6. Model Ensemble and Prediction**

**6.1 prediction correlations**

Compare the MLA predictions with each other, where a correlation of 1 means identical predictions, 0 means no linear relationship, and -1 means exactly opposite predictions. Based on the correlations, we can create a "super algorithm" (ensemble algorithm) by combining the models.
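
A toy example helps to read such a correlation matrix. The three prediction columns below are made up for illustration and do not come from the models in this notebook.

```python
import pandas as pd

# hypothetical class predictions of three models on eight samples
preds = pd.DataFrame({
    "model_a": [0, 1, 0, 1, 1, 0, 0, 1],
    "model_b": [0, 1, 0, 1, 0, 0, 0, 1],   # disagrees with model_a on one sample
    "model_c": [1, 0, 1, 0, 0, 1, 1, 0],   # always the opposite of model_a
})

# pairwise Pearson correlation between the prediction columns
corr = preds.corr()
print(corr.round(2))
```

Models whose predictions are strongly but not perfectly correlated tend to add the most to an ensemble, because they are individually accurate and still make different mistakes.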

In [None]:
MLA_prediction = pd.DataFrame(columns=list(MLA_optm.keys()))

for i,(column_name,optm_estimator) in enumerate (MLA_optm.items()):
    MLA_prediction[column_name] = optm_estimator.predict(X_test[feature_selected])

In [None]:
for column_name in MLA_prediction.columns:
    print(MLA_prediction[column_name].value_counts())

In [None]:
## plot the correlations between the predictions
fig, axe = plt.subplots(figsize=(8,6))
colormap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(MLA_prediction.corr(),annot=True, fmt = ".2f", cmap = colormap, ax=axe)

**6.2 ensemble modeling**
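
With `voting='soft'`, scikit-learn's `VotingClassifier` averages the class probabilities predicted by the individual estimators (with equal weights unless `weights` is given) and predicts the class with the highest average. The sketch below shows that averaging step with plain NumPy for the binary case; the probabilities are made up for illustration.

```python
import numpy as np

# hypothetical positive-class probabilities from three models for four samples
proba_by_model = np.array([
    [0.9, 0.2, 0.6, 0.4],   # model 1
    [0.8, 0.1, 0.4, 0.3],   # model 2
    [0.7, 0.3, 0.7, 0.2],   # model 3
])

# soft voting: average the probabilities over the models
soft_vote_proba = proba_by_model.mean(axis=0)

# in the binary case, the positive class wins when its average probability exceeds 0.5
soft_vote_label = (soft_vote_proba > 0.5).astype(int)
print(soft_vote_proba, soft_vote_label)
```

For the third sample the models disagree (0.6, 0.4, 0.7), and the averaged probability decides the outcome.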

In [None]:
## ensembled estimator using voting classifier
estimators = list(MLA_optm.items())
vote_clf = VotingClassifier(estimators=estimators,voting='soft',n_jobs=4)
## run cross validate 
cv_results = cross_validate(vote_clf,X[feature_selected],y,cv=kfold,scoring='roc_auc',return_train_score=True,return_estimator=True)

In [None]:
print('Train_Score:',cv_results['train_score'])
print('mean:',cv_results['train_score'].mean())
print('Validate_Score:',cv_results['test_score'])
print('mean:',cv_results['test_score'].mean())
print('Run Time:',cv_results['fit_time'])
print('mean:',cv_results['fit_time'].mean())

In [None]:
## let's also check whether LightGBM outperforms the ensemble method
estimator = MLA_optm['LightGBM']
## run cross validate 
cv_results_lgb = cross_validate(estimator,X[feature_selected],y,cv=kfold,scoring='roc_auc',return_train_score=True,return_estimator=True)

In [None]:
print('Train_Score:',cv_results_lgb['train_score'])
print('mean:',cv_results_lgb['train_score'].mean())
print('Validate_Score:',cv_results_lgb['test_score'])
print('mean:',cv_results_lgb['test_score'].mean())
print('Run Time:',cv_results_lgb['fit_time'])
print('mean:',cv_results_lgb['fit_time'].mean())

**6.3 predict and submit results**

Since LightGBM outperforms the ensemble model, we will use LightGBM for the submission.

In [None]:
# vote_clf_optm = cv_results['estimator'][0]

In [None]:
## final lightGBM model
params = {'objective' : "binary", 
               'boost':"gbdt",
               'metric':"auc",
               'boost_from_average':"false",
               'num_threads':8,
               'learning_rate' : 0.01,
               'num_leaves' : 13,
               'max_depth':-1,
               'tree_learner' : "serial",
               'feature_fraction' : 0.05,
               'bagging_freq' : 5,
               'bagging_fraction' : 0.4,
               'min_data_in_leaf' : 80,
               'min_sum_hessian_in_leaf' : 10.0,
               'verbosity' : 1}
num_round = 1000000

In [None]:
%%time
kFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # random_state is only valid together with shuffle=True
y_pred_lgb = np.zeros(len(test_standard))

for fold_n, (train_index, valid_index) in enumerate(kFold.split(X,y)):
    print('Fold', fold_n, 'started at', time.ctime())
    X_train, X_valid = X.iloc[train_index], X.iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_valid, label=y_valid)
        
    lgb_model = lgb.train(params,train_data,num_round,
                          valid_sets = [train_data, valid_data],verbose_eval=1000,early_stopping_rounds = 3500)
            
    ## average the test predictions over all folds (dividing by 5 with 10 folds would double the probabilities)
    y_pred_lgb += lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)/kFold.get_n_splits()

In [None]:
## create the submit dataFrame
submit = pd.DataFrame(columns=['ID_code','target'])
submit['ID_code'] = test_standard['ID_code']
submit['target']  = y_pred_lgb

In [None]:
## finally! submit
submit.to_csv("submit3.csv", index=False)

In [None]:
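## the predictions are continuous probabilities, so summary statistics are more informative than value counts
print('Submit Data Distribution: \n', submit['target'].describe())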
submit.sample(10)