<a id="0"></a> <br>
# Table of Contents

1. [Introduction to Tabular Playground Series - Nov 2021](#1)
    1. [Variable Descriptions](#7)
1. [Load and Glance at the Data](#2)
1. [Missing Values](#3)
1. [Create a Validation Set](#4)
    1. [Remove target column](#5)
1. [Feature Scaling](#6)    
1. [First Model](#8)
    1. [Evaluation Metrics for Training set](#9)
    1. [Evaluation Metrics for Validation set](#10)
    1. [First Submission](#11)
1. [Selecting Models](#12)  
    1. [Helper Functions to Try New Models](#13) 
    1. [Split to the Small Data for Evaluating Models Fast](#14)
    1. [ML Models](#15)
    1. [XGBoost](#16)

<a id="1"></a> <br>
# 1. Introduction to Tabular Playground Series - Nov 2021 

TPS is a monthly competition prepared by Kaggle. The data used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. More information can be found on the [Competition Overview Page](https://www.kaggle.com/c/tabular-playground-series-nov-2021/overview).

**The goal** is **predicting the probability** that the observed target is 1 rather than 0. So it is a **supervised learning** problem and a **classification task**. The **evaluation metric** is the **area under the ROC curve**.
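
To make the metric concrete, here is a minimal sketch on made-up toy values (not competition data). It assumes the usual reading of ROC AUC as the share of (positive, negative) pairs that the scores rank correctly, and compares that hand count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

# Compare every positive score with every negative score.
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
diff = pos[:, None] - neg[None, :]

# Correctly ranked pairs count 1, ties count 0.5.
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, y_prob)
print(pairwise_auc, sklearn_auc)
```

Because only the ranking matters, any monotonic rescaling of the probabilities leaves the score unchanged.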

[back to the top](#0)

<a id="7"></a> <br>
## A. Variable Descriptions:
- **df_train** : Pandas data frame for training data set
- **df_test** : Pandas data frame for test data set
- **x_all_train** : All training data
- **y_all_train** : All labels for training data


- **train_set** : Training Pandas data frame split from the training data set
- **val_set** : Validation Pandas data frame split from the training data set

#### Development Data Frames:
- **x_train** : Pandas data frame made from train_set with the target column removed
- **y_train** : Array of targets from train_set
- **x_val** : Pandas data frame made from val_set with the target column removed
- **y_val** : Array of targets from val_set
- **df_test_dev** : Standardized Pandas data frame for test data set

<a id="2"></a> <br>
# 2. Load and Glance at the Data
First things first, load and glance at the data.

[back to the top](#0)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
print('numpy version : ', np.__version__)
print('pandas version : ', pd.__version__)

In [None]:
df_train = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/train.csv")
df_test = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/test.csv")

In [None]:
df_train.head()

[back to the top](#0)

In [None]:
%%time
df_train.info()

<font color=green>All variables are numerical, so we will not have to deal with categorical data.</font>

In [None]:
df_train.describe()

[back to the top](#0)

<a id="3"></a> <br>
# 3. Missing Values
Check for missing values in several ways.

[back to the top](#0)

In [None]:
print(df_train.isnull().sum().shape)

In [None]:
df_train.isnull().sum()

In [None]:
df_train.isnull().sum().sum()

In [None]:
df_train.columns[df_train.isnull().any()]  # which columns have null values

There is no null value in training set.

[back to the top](#0)

In [None]:
missing_val_count_by_column = (df_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

We already knew there are no null values; this is just another way to check.

[back to the top](#0)

In [None]:
missing_val_count_by_column_for_test = (df_test.isnull().sum())
print(missing_val_count_by_column_for_test[missing_val_count_by_column_for_test > 0])  # mask built from the test counts, not the train counts

There is no null data in test set. 
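
The checks above can be folded into one small helper. This is only a sketch: `missing_report` is a hypothetical name, and the toy frame stands in for `df_train` or `df_test` so the output is not empty.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return count and share of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({"n_missing": counts, "share": counts / len(df)})
    return report[report["n_missing"] > 0]


# Toy frame with two missing values in column "a" (illustrative data).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})
print(missing_report(toy))
```

On a frame without nulls the helper returns an empty data frame.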

[Variable Descriptions](#7)

<a id="4"></a> <br>
# 4. Create a Validation Set
Before going any further, I will create a validation set from the train data frame. We don't know what the data represents in real life as it was created using CTGAN, so I'll split it randomly.
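
As a possible variation on the random split, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below uses a made-up toy frame with 30% positives; the names `toy`, `toy_train` and `toy_val` are hypothetical and not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for df_train (hypothetical data, 30% positives).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "f0": rng.normal(size=1000),
    "target": np.repeat([0, 1], [700, 300]),
})

# stratify keeps the share of each class equal in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["target"]
)

print(toy_train["target"].mean(), toy_val["target"].mean())
```

With balanced classes a plain random split usually lands close to the same ratio anyway, so this matters most when one class is rare.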

[back to the top](#0)

In [None]:
from sklearn.model_selection import train_test_split

random_state=42
test_size = 0.2

train_set, val_set = train_test_split(df_train, test_size = test_size, random_state=random_state)

In [None]:
train_set.info()

In [None]:
train_set.head()

<a id="5"></a> <br>
## A. Remove Target Column
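Remove the target column from train_set to make the x_train data frame, and store the targets in the y_train array. The id column is kept here and left out later when selecting the feature columns.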

[back to the top](#0)


In [None]:
# x_train = train_set.drop(labels = ["id","target"], axis=1)
x_train = train_set.drop(labels = "target", axis=1)
y_train = train_set["target"].values

# validation set
x_val = val_set.drop(labels = "target", axis=1)
y_val = val_set["target"].values

# development test set
df_test_dev = df_test.copy()

[Variable Descriptions](#7)

[back to the top](#0)

In [None]:
x_train.head()

In [None]:
x_val.head()

In [None]:
print(y_train[0:5])
print(y_val[0:5])

<a id="6"></a> <br>
# 5. Feature Scaling
So that ML algorithms perform well, I will scale the data with the standardization method.
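
As a small illustration of what standardization does, the sketch below uses made-up arrays. It assumes the usual definition z = (x - mean) / std, with both statistics taken from the training data only, and checks that `StandardScaler` gives the same result as the formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train and validation arrays (illustrative values only).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val = np.array([[4.0, 40.0]])

# fit() learns mean and std from train; transform() reuses them on val.
scaler = StandardScaler().fit(train)
val_scaled = scaler.transform(val)

# Same result by hand, using the population std (ddof=0) of the train data.
manual = (val - train.mean(axis=0)) / train.std(axis=0)

print(val_scaled, manual)
```

Fitting on the training data alone keeps information from the validation and test sets out of the preprocessing step.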

[Variable Descriptions](#7)

[back to the top](#0)

In [None]:
columns = x_train.columns[1:]  # all columns except the first one (id)
print(columns)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

[Variable Descriptions](#7)

[back to the top](#0)

In [None]:
x_train.head()

In [None]:
x_val.head()

In [None]:
df_test_dev.head()

In [None]:
print(x_train.shape)

In [None]:
std_scaler = StandardScaler()

**x_train still contains the id column!**

In [None]:
# v1 method
# x_train[columns] = std_scaler.fit_transform(x_train[columns])
# x_val[columns] = std_scaler.transform(x_val[columns])

# df_test_dev[columns] = std_scaler.transform(df_test_dev[columns])

## Create a Pipeline for Preparing Data to Training
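
The pipeline cell below gives `ColumnTransformer` an explicit column list. A minimal sketch on a made-up frame shows why that keeps `id` out of the scaled output: columns that are not listed are dropped by default (`remainder="drop"`). The names `toy`, `feature_cols` and `ct` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with an id column and two features (illustrative data).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "f0": [1.0, 2.0, 3.0],
    "f1": [10.0, 20.0, 30.0],
})
feature_cols = ["f0", "f1"]

# Only the listed columns are transformed; "id" is dropped from the output.
ct = ColumnTransformer([("num", StandardScaler(), feature_cols)])
scaled = ct.fit_transform(toy)

print(scaled.shape)  # (3, 2)
```

Assigning the result back to `toy[feature_cols]` overwrites the features in place and leaves `id` untouched, which is the pattern the notebook cell follows.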

In [None]:
num_pipeline = Pipeline([('std_scaler', StandardScaler())])
full_pipeline = ColumnTransformer([('num', num_pipeline, columns)])

x_train[columns] = full_pipeline.fit_transform(x_train)
x_val[columns] = full_pipeline.transform(x_val)
df_test_dev[columns] = full_pipeline.transform(df_test_dev)

[Variable Descriptions](#7)

[back to the top](#0)

In [None]:
# x_train.shape
x_train.head()

In [None]:
x_val.head()

In [None]:
df_test_dev.head()

[Variable Descriptions](#7)

[back to the top](#0)

In [None]:
print(len(list(df_test_dev)))
print(list(df_test_dev))

In [None]:
# assert False

<a id="8"></a> <br>
# 6. First Model and Submit
I will apply simple ML models first and submit to the competition.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
%%time
lin_reg = LogisticRegression()
lin_reg.fit(x_train[columns], y_train)
y_train_head = lin_reg.predict(x_train[columns])

In [None]:
(y_train_head<1).any()

In [None]:
(y_train_head==0).sum().sum()

[Variable Descriptions](#7)

[back to the top](#0)

<a id="9"></a> <br>
## A. Evaluation Metrics for Training set

In [None]:
# first evaluation: mean accuracy on the training set
print(lin_reg.score(x_train[columns], y_train))

In [None]:
from sklearn.metrics import roc_curve, auc, roc_auc_score

In [None]:
fpr, tpr, thresholds = roc_curve(y_train, y_train_head)
roc_auc_score_train = roc_auc_score(y_train, y_train_head)
print("ROC AUC Score:",roc_auc_score_train)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    plt.legend(loc = 'lower right')
    plt.show()
    
plot_roc_curve(fpr, tpr, label=("roc auc score ="+str(round(roc_auc_score_train*100,2))))

In [None]:
print("ROC AUC Score:",roc_auc_score(y_train, y_train_head))

### Probability of Predictions
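
The ROC AUC above was computed from hard 0/1 predictions, which collapse the ROC curve to a single threshold. The sketch below uses made-up toy values and an assumed 0.5 cut-off to show how the same predictions can score differently once the probabilities themselves are passed to `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.30, 0.55, 0.45, 0.60, 0.95, 0.20, 0.70])

# Hard labels throw away the ranking within each side of the cut-off.
y_label = (y_prob >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_probs = roc_auc_score(y_true, y_prob)

print(auc_from_labels, auc_from_probs)  # 0.75 0.9375
```

Since the competition is scored on ROC AUC, it makes sense to evaluate and submit `predict_proba` output rather than `predict` output.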

In [None]:
# Before ROC AUC Score
y_train_prob_head = lin_reg.predict_proba(x_train[columns])
print(y_train_prob_head.shape)
y_train_prob_head[0,:]

In [None]:
y_train_prob_head = lin_reg.predict_proba(x_train[columns])[:,1] # score = proba of positive
roc_auc_score_train_prob = roc_auc_score(y_train, y_train_prob_head)
print("ROC AUC Score of probability:", roc_auc_score_train_prob)

In [None]:
fpr_prob, tpr_prob, thresholds_prob = roc_curve(y_train, y_train_prob_head)
plot_roc_curve(fpr_prob, tpr_prob, label=("probability roc auc score ="+str(round(roc_auc_score_train_prob*100,2))))

**These metrics are just for the training data. Now I will evaluate the model on validation data with predicted probabilities.**

[back to the top](#0)

<a id="10"></a> <br>
## B. Evaluation Metrics for Validation set

In [None]:
y_val_prob_head = lin_reg.predict_proba(x_val[columns])[:,1] # score = proba of positive
roc_auc_score_val_prob = roc_auc_score(y_val, y_val_prob_head)
print("ROC AUC Score of probability:", roc_auc_score_val_prob)

In [None]:
first_preds = lin_reg.predict_proba(df_test_dev[columns])[:,1]

In [None]:
first_preds

<a id="11"></a> <br>
## C. First Submission
[back to the top](#0)

In [None]:
sub = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv')

In [None]:
sub['target']=first_preds
sub.to_csv('submission.csv', index=False)

In [None]:
sub

<a id="12"></a> <br>
# 7. Selecting Models
[back to the top](#0)

In [None]:
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


<a id="13"></a> <br>
## A. Helper Functions to Try New Models
[back to the top](#0)

In [None]:
import time  # used for timing inside train_model_w_kfold


def plot_roc_curve(fpr, tpr, label=None, title=None):
    """Plot a ROC curve together with the chance-level diagonal."""
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
    if label is not None:
        plt.legend(loc = 'lower right')
    if title is not None:
        plt.title(title)
    plt.show()

    
def roc_auc_score_func(y_true, y_head, plot_roc=False):
    """ evaluate roc auc score
    Args:
        y_true : a numpy array. True labels.
        y_head : a numpy array. Predicted labels or scores.
        plot_roc : if you want to plot roc curve. (Default is False)

    Returns:
        ROC AUC score as a float.
    """
    
    if plot_roc:
        fpr, tpr, thresholds = roc_curve(y_true, y_head)
        plot_roc_curve(fpr, tpr, label=None, title=None)
        
    # evaluate roc auc score against the predictions (not y_true against itself)
    model_roc_auc_score = roc_auc_score(y_true, y_head)
    
    return model_roc_auc_score


def train_model_w_kfold(clf, X, y, n_splits=5):
    """train ml models with stratified kfold and return mean scores
    
    Args:
        clf : model classifier (must support predict_proba)
        X : a numpy.ndarray of training data
        y : a numpy.ndarray of training labels
        n_splits : number of KFold splits
        
    Returns:
        (roc_auc_score_mean, accuracy_mean) averaged over the folds
        
    """
    roc_auc_score_list = []  # roc auc score per fold
    acc_score_list = [] # accuracy score per fold
    
    skf = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    
    print("Model:", clf)
    
    for i, (train_index, val_index) in enumerate(skf.split(X, y)):
        print("Fitting fold", i+1)
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]
        
        
        # TRAINING
        training_start_time = time.time()
        
#         model = LogisticRegression(solver='liblinear')
        model = clf
        model.fit(X_train, y_train)
        
        training_end_time = time.time()
        training_time = training_end_time - training_start_time
        print(f"fold {i+1} elapsed seconds: {training_time}")
        
        
        # EVALUATING
        evaluating_start_time = time.time()
        
        roc_auc_score_list.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
        acc_score_list.append(accuracy_score(y_val, model.predict(X_val)))
        
        evaluating_end_time = time.time()
        evaluating_time = evaluating_end_time - evaluating_start_time
        print(f"fold {i+1} evaluating scores elapsed seconds: {evaluating_time}")
    
    
        print(f"fold: {i+1}, accuracy: {round(acc_score_list[i]*100,3)}, auc: {round(roc_auc_score_list[i]*100,3)}")

        
    roc_auc_score_mean = np.mean(roc_auc_score_list)
    accuracy_mean = np.mean(acc_score_list)    
    
    return roc_auc_score_mean, accuracy_mean

[back to the top](#0)
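
For comparison, scikit-learn's built-in `cross_validate` can produce the same per-fold ROC AUC and accuracy without a hand-written loop. The sketch below is an alternative to the helper above, not a replacement for it, and runs on synthetic data from `make_classification`; `X_toy`, `y_toy` and `cv_out` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data (not the competition data).
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# With a list of scorers, results are stored under "test_<scorer name>".
cv_out = cross_validate(
    LogisticRegression(solver="liblinear"),
    X_toy,
    y_toy,
    cv=skf,
    scoring=["roc_auc", "accuracy"],
)

mean_auc = cv_out["test_roc_auc"].mean()
mean_acc = cv_out["test_accuracy"].mean()
print(mean_auc, mean_acc)
```

`cross_validate` clones the estimator for every fold and also reports `fit_time` and `score_time`, which covers the timing printed by the manual loop.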

In [None]:
# assert False

<a id="14"></a> <br>
## B. Split to the Small Data for Evaluating Models Fast
[back to the top](#0)
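
The cells in this section take a 5% stratified sample with `StratifiedShuffleSplit`. The sketch below repeats the idea on a made-up frame and points out one assumption hidden in it: `split()` yields positions, so `.loc` only works when the frame has the default 0..n-1 index, while `.iloc` works in every case. The toy frame deliberately uses a shifted index to show this.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with 30% positives and an index that does not start at 0.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "f0": rng.normal(size=1000),
        "target": np.repeat([0, 1], [700, 300]),
    },
    index=np.arange(5000, 6000),
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=42)

# n_splits=1, so next() returns the single (train, test) pair of positions.
_, sample_idx = next(split.split(toy, toy["target"]))

# Positional lookup is safe regardless of the index labels.
toy_sample = toy.iloc[sample_idx]
print(toy_sample.shape, toy_sample["target"].mean())
```

The sample keeps roughly the same share of positives as the full frame, which is what makes quick model comparisons on it meaningful.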

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.05, random_state=42)

for train_index, test_index in split.split(df_train, df_train["target"]):
    # keep only the 5% "test" part of the split as the small development set
    strat_dev_set = df_train.loc[test_index]
    
    
print(strat_dev_set.shape)

In [None]:
strat_dev_set.head()

In [None]:
(strat_dev_set["target"]==1).sum()

In [None]:
# assert False

<a id="15"></a> <br>
## C. ML Models
[back to the top](#0)

In [None]:
!pip install --upgrade xgboost

# # xgb.__version__

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_clf = XGBClassifier(
    max_depth=8,
    learning_rate=0.01,
    n_estimators=10000,
    verbosity=1,
    objective='binary:logistic',  # logistic regression for binary classification, output probability
    tree_method = 'gpu_hist',  # needs a GPU runtime
    booster='gbtree',
    n_jobs=-1,
    eval_metric='auc',
    gamma=0,
    min_child_weight=1,
    max_delta_step=0,
    subsample=0.7,
    colsample_bytree=1,
    colsample_bylevel=1,
    colsample_bynode=1,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    base_score=0.5,
    random_state=0,
)  # the None-valued legacy arguments (silent, nthread, seed) are left out

[back to the top](#0)

In [None]:
target = strat_dev_set['target'].values
strat_dev_set = full_pipeline.fit_transform(strat_dev_set[columns])

In [None]:
%%time

import time


classifiers = [LogisticRegression(solver='liblinear', random_state = random_state),
               SVC(random_state = random_state, probability=True),
               DecisionTreeClassifier(random_state = random_state),
               RandomForestClassifier(random_state = random_state),
               KNeighborsClassifier(),
               xgb_clf]

# for SVC, predict_proba is not available when probability=False

clf_roc_auc_score_mean = []  # mean ROC AUC per classifier
clf_accuracy_mean = []       # mean accuracy per classifier

for clf in classifiers:
    
    start_time = time.time()
    
    roc_auc_score_mean, accuracy_mean = train_model_w_kfold(clf, X=strat_dev_set, y=target, n_splits=2)
    
    clf_roc_auc_score_mean.append(roc_auc_score_mean)
    clf_accuracy_mean.append(accuracy_mean)
    
    end_time = time.time()
    
    print('Elapsed seconds classifier training time:', end_time-start_time)

[back to the top](#0)

In [None]:
clf_roc_auc_score_mean

In [None]:
ML_Models = ["LogisticRegression",
             "SVC",
             "DecisionTreeClassifier",
             "RandomForestClassifier",
             "KNeighborsClassifier",
             "XGB"]

cv_results = pd.DataFrame({"clf_roc_auc_score_mean":clf_roc_auc_score_mean, 
                           "ML Models": ML_Models})

g = sns.barplot(x="clf_roc_auc_score_mean", y="ML Models", data = cv_results)  # keyword x/y work across seaborn versions
g.set_xlabel("Mean ROC AUC Score of Probability")
g.set_title("Stratified KFold")
plt.show()

<a id="16"></a> <br>
## D. XGBoost
[back to the top](#0)

### XGB Training

In [None]:
%%time

xgb_clf = XGBClassifier(
    max_depth=8,
    learning_rate=0.01,
    n_estimators=10000,
    verbosity=1,
    objective='binary:logistic',  # logistic regression for binary classification, output probability
    tree_method = 'gpu_hist',  # needs a GPU runtime
    booster='gbtree',
    n_jobs=-1,
    eval_metric='auc',
    gamma=0,
    min_child_weight=1,
    max_delta_step=0,
    subsample=0.7,
    colsample_bytree=1,
    colsample_bylevel=1,
    colsample_bynode=1,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    base_score=0.5,
    random_state=random_state,  # one seed only; the conflicting legacy `seed` argument is dropped
)

xgb_clf.fit(x_train[columns], y_train)

[back to the top](#0)

### XGB Evaluation

In [None]:
%%time
y_val_prob_head = xgb_clf.predict_proba(x_val[columns])[:,1] # score = proba of positive
roc_auc_score_val_prob = roc_auc_score(y_val, y_val_prob_head)

print("ROC AUC Score of probability:", roc_auc_score_val_prob)

### Submission with XGB

In [None]:
xgb_preds = xgb_clf.predict_proba(df_test_dev[columns])[:,1]

In [None]:
sub = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv')
sub['target'] = xgb_preds
sub.to_csv('submission.csv', index=False)
sub