<a id="0"></a> <br>
# Table of Contents

1. [Introduction to Tabular Playground Series - Nov 2021](#1)
    1. [Variable Descriptions](#2)
1. [Load and Glance at the Data](#3)   
1. [Merge the Data](#4)
1. [Helper Functions](#5) 
1. [XGBoost](#6) 

<a id="1"></a> <br>
# 1. Introduction to Tabular Playground Series - Nov 2021 

TPS is a monthly competition prepared by Kaggle. The data used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. More information can be found on the [Competition Overview Page](https://www.kaggle.com/c/tabular-playground-series-nov-2021/overview).

**The goal** is to **predict the probability** that the observed target is 1 rather than 0, so this is a **supervised learning** problem and a **binary classification task**. The **evaluation metric** is the **area under the ROC curve** (AUC).
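To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example receives the higher predicted probability, with ties counted as half. The labels and probabilities are made-up toy values, and `pairwise_auc` is a hypothetical helper written only for illustration; in the notebook itself the score comes from `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, y_prob):
    """Illustrative helper: AUC as the share of correctly ranked (positive, negative) pairs."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # A pair counts 1 if the positive is ranked higher, 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_prob)
print(auc_value)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Because only the ranking of the predictions matters, the metric rewards well-ordered probabilities rather than hard 0/1 labels.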

1. [My first Notebook on this competition: TPS Nov 2021 Starter with XGBoost](https://www.kaggle.com/ahmetekiz/tps-nov-2021-starter-with-xgboost#7.-Selecting-Models)
1. [My second Notebook on this competition: TPS Nov 2021 Stacking](https://www.kaggle.com/ahmetekiz/tps-nov-2021-stacking)

[back to the top](#0)

<a id="2"></a> <br>
## A. Variable Descriptions:
- **df_train** : pandas DataFrame holding the training data set
- **df_test** : pandas DataFrame holding the test data set
- **x_train** : df_train with the target column removed
- **y_train** : the target column taken from df_train

<a id="3"></a> <br>
# 2. Load the Data

[back to the top](#0)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df_train = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/train.csv")
df_test = pd.read_csv("/kaggle/input/tabular-playground-series-nov-2021/test.csv")

df_valid_preds_0 = pd.read_csv("/kaggle/input/tps-nov-2021-logistic-reg-xgboost-stacking1/valid_preds_LogisticRegression.csv")
df_valid_preds_1 = pd.read_csv("/kaggle/input/tps-nov-2021-logistic-reg-xgboost-stacking1/valid_preds_XGB.csv")

df_test_preds_0 = pd.read_csv("/kaggle/input/tps-nov-2021-logistic-reg-xgboost-stacking1/test_preds_LogisticRegression.csv")
df_test_preds_1 = pd.read_csv("/kaggle/input/tps-nov-2021-logistic-reg-xgboost-stacking1/test_preds_XGB.csv")

random_state = 42

In [None]:
df_train.head()

[back to the top](#0)

In [None]:
df_valid_preds_0.head()

In [None]:
df_test_preds_0.head()

[back to the top](#0)

<a id="4"></a> <br>
# 3. Merge the Data

In [None]:
df_train = df_train.merge(df_valid_preds_0, on="id", how="left")
df_train = df_train.merge(df_valid_preds_1, on="id", how="left")

df_train.head()

In [None]:
df_test = df_test.merge(df_test_preds_0, on="id", how="left")
df_test = df_test.merge(df_test_preds_1, on="id", how="left")

df_test.head()

In [None]:
useful_features_train = ["pred_0", "pred_1", "target"]

df_train[useful_features_train].head()

In [None]:
columns = df_test.columns[1:]

# columns = ["pred_0", "pred_1"]
print(columns)

In [None]:
# assert False

In [None]:
y_train = df_train["target"]
df_train = df_train.drop(labels = "target", axis=1)

df_train.head()

In [None]:
df_test.head()

<a id="4"></a> <br>
# 3. Feature Scaling
So that the ML algorithms perform well, I scale the features with standardization: each feature is shifted to zero mean and rescaled to unit variance.
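As a rough illustration of what standardization does, the sketch below applies the formula by hand with NumPy on a made-up toy matrix (the `*_toy` arrays are hypothetical and not part of the competition data). The statistics are learned on the training rows only and then reused for the test rows, which mirrors the `fit_transform` / `transform` split used in this notebook.

```python
import numpy as np

# Hypothetical toy feature matrix: two columns on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_toy = np.array([[2.0, 200.0]])

# Learn the statistics on the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), the convention StandardScaler also uses

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std  # reuse the training statistics, never refit on test

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

Reusing the training mean and standard deviation on the test set keeps both sets on the same scale and avoids leaking test information into preprocessing.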

[Variable Descriptions](#2)

[back to the top](#0)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

## Create a Pipeline to Prepare the Data for Training

In [None]:
# Standardize every feature column; `columns` holds all test columns except the id
num_pipeline = Pipeline([("std_scaler", StandardScaler())])
full_pipeline = ColumnTransformer([("num", num_pipeline, columns)])

df_train[columns] = full_pipeline.fit_transform(df_train)
df_test[columns] = full_pipeline.transform(df_test)

[Variable Descriptions](#2)

[back to the top](#0)

In [None]:
df_train.head()

In [None]:
df_test.head()

[Variable Descriptions](#2)

[back to the top](#0)

<a id="5"></a> <br>
# 4. Helper Functions
[back to the top](#0)

In [None]:
import time  # used by train_and_predict to report elapsed times

from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
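def plot_roc_curve(fpr, tpr, label=None, title=None):
    """Plot a ROC curve (fpr vs. tpr) together with the chance-level diagonal."""
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    if label is not None:
        plt.legend(loc='lower right')  # a legend needs at least one labelled curve
    if title is not None:
        plt.title(title)
    plt.show()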

    
def roc_auc_score_func(y_true, y_head, plot_roc=False):
    """Compute the ROC AUC score.

    Args:
        y_true : a numpy array. True labels.
        y_head : a numpy array. Predicted probabilities of the positive class.
        plot_roc : whether to plot the ROC curve. (Default is False)

    Returns:
        The ROC AUC score.
    """

    if plot_roc:
        fpr, tpr, thresholds = roc_curve(y_true, y_head)
        plot_roc_curve(fpr, tpr, label=None, title=None)

    # score the predictions against the true labels (not the labels against themselves)
    model_roc_auc_score = roc_auc_score(y_true, y_head)

    return model_roc_auc_score


def train_and_predict(clf, clf_name, X, Y, x_test, n_splits=5):
    """Train a model with stratified k-fold CV and predict test probabilities with each fold's model.

    Args:
        clf : model classifier
        clf_name : classifier name
        X : a numpy.ndarray of training features
        Y : a numpy.ndarray of training labels
        x_test : test features
        n_splits : number of StratifiedKFold splits

    Returns:
        roc_auc_score_mean : mean validation ROC AUC over the folds
        accuracy_mean : mean validation accuracy over the folds
        valid_preds_dict : a dict that maps a key for each validation row to its predicted probability
        test_preds : a list with one array of test set probabilities per fold
    """

    valid_preds_dict = {}  # out-of-fold predictions

    test_preds = []  # one array of test probabilities per fold

    roc_auc_score_list = []  # ROC AUC score per fold
    acc_score_list = []  # accuracy score per fold
    


    skf = StratifiedKFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    
    for i, (train_index, val_index) in enumerate(skf.split(X, Y)):
        print("Fitting fold", i+1)
        X_train, X_val = X[train_index], X[val_index]
        Y_train, Y_val = Y[train_index], Y[val_index]
        
        print("Model:", clf_name)

        # TRAINING
        training_start_time = time.time()
        local_time = time.ctime(training_start_time)
        print("Local time:", local_time)
        
        model = clf
        model.fit(X_train, Y_train)  # X is expected to already exclude the id column

        training_end_time = time.time()
        training_time = training_end_time - training_start_time
        print(f"Training elapsed seconds: {round(training_time,3)}")


        # EVALUATING
        evaluating_start_time = time.time()
        local_time = time.ctime(evaluating_start_time)
        print("Local time:", local_time)
        
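        valid_preds = model.predict_proba(X_val)[:, 1]  # predicted probability of the positive class on the validation set
        # X carries no id column (its first column is a feature), so key the predictions by row position
        valid_ids = val_index.tolist()
        valid_preds_dict.update(dict(zip(valid_ids, valid_preds)))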
    
        
        roc_auc_score_list.append(roc_auc_score(Y_val, valid_preds))
        acc_score_list.append(accuracy_score(Y_val, model.predict(X_val)))

        evaluating_end_time = time.time()
        evaluating_time = evaluating_end_time - evaluating_start_time
        print(f"evaluating scores elapsed seconds: {round(evaluating_time,3)}")
    
    
        # PREDICTION
        prediction_start_time = time.time()
        local_time = time.ctime(prediction_start_time)
        print("Local time:", local_time)
        
        test_pred = clf.predict_proba(x_test)[:, 1]
        test_preds.append(test_pred)

        prediction_end_time = time.time()
        prediction_time = prediction_end_time - prediction_start_time
        print(f"predicting test probability scores elapsed seconds: {round(prediction_time,3)}")
    
    roc_auc_score_mean = np.mean(roc_auc_score_list)
    accuracy_mean = np.mean(acc_score_list) 
    
    
    print(f"Mean accuracy: {round(accuracy_mean*100,3)}, Mean AUC Score: {round(roc_auc_score_mean*100,3)}")
    
    return roc_auc_score_mean, accuracy_mean, valid_preds_dict, test_preds

One of my references here is [abhishek's notebook](https://www.kaggle.com/abhishek/competition-part-6-stacking/notebook), especially for averaging the test predictions produced by the k-fold models.
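The averaging idea can be sketched in a few lines. The arrays below are hypothetical test-set probabilities from three fold models (made-up numbers, not results from this competition); stacking them as columns and taking the row-wise mean yields one blended probability per test row.

```python
import numpy as np

# Hypothetical test-set probabilities from three fold models (4 test rows each).
fold_preds = [
    np.array([0.2, 0.8, 0.5, 0.1]),
    np.array([0.4, 0.6, 0.5, 0.3]),
    np.array([0.3, 0.7, 0.8, 0.2]),
]

# column_stack gives shape (n_test_rows, n_folds); averaging over axis=1
# collapses the folds into a single probability per test row.
stacked = np.column_stack(fold_preds)
mean_preds = stacked.mean(axis=1)

print(stacked.shape)  # (4, 3)
print(mean_preds)     # one averaged probability per test row
```

Averaging tends to smooth out the variance between fold models, since each one was trained on a slightly different subset of the training data.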

[back to the top](#0)

<a id="7"></a> <br>
## A. Split Train and Validation Data
[back to the top](#0)

In [None]:
# QUICK CHECK ON A SMALL SAMPLE THAT THE CODE RUNS BEFORE SUBMITTING
# from sklearn.model_selection import StratifiedShuffleSplit

# split = StratifiedShuffleSplit(n_splits=1, test_size=0.001, random_state=42)

# for train_index, val_index in split.split(df_train, y_train):
#     df_train, y_train = df_train.loc[val_index], y_train.loc[val_index] 
#     x_val, y_val = df_train.loc[val_index], y_train.loc[val_index]

# # x_train_2 = x_train_2.values
# # y_train_2 = y_train_2.values

# # # x_train_2[:,0] # id

<a id="6"></a> <br>
# 5. XGBoost
[back to the top](#0)

In [None]:
!pip install --upgrade xgboost

# # xgb.__version__

In [None]:
from xgboost import XGBClassifier

In [None]:
# assert False

[back to the top](#0)

I've chosen three classifiers according to my previous notebook's evaluation results.

In [None]:
df_train.head()

In [None]:
df_train = df_train.values
y_train = y_train.values

In [None]:
df_train.shape

In [None]:
# clf = XGBClassifier(max_depth=8,
#                     learning_rate=0.01,
#                     n_estimators=10000,
#                     verbosity=1,
#                     silent=None,
#                     objective='binary:logistic',  
#                     tree_method = 'gpu_hist',
#                     booster='gbtree',
#                     n_jobs=-1,
#                     nthread=None,
#                     eval_metric='auc',
#                     gamma=0,
#                     min_child_weight=1,
#                     max_delta_step=0,
#                     subsample=0.7,
#                     colsample_bytree=1,
#                     colsample_bylevel=1,
#                     colsample_bynode=1,
#                     reg_alpha=0,
#                     reg_lambda=1,
#                     scale_pos_weight=1,
#                     base_score=0.5,
#                     random_state=random_state,
#                     seed=None)

# clf_name = "XGB"

In [None]:
clf = LogisticRegression(solver='liblinear', random_state = random_state)
clf_name = "LogisticRegression"

In [None]:
df_train[:,1:].shape

In [None]:
%%time

import time


roc_auc_score_mean, accuracy_mean, valid_preds_dict, test_preds = train_and_predict(
    clf,
    clf_name,
    df_train[:, 1:],          # every column except id
    y_train,
    df_test[columns].values,  # same feature columns, as a NumPy array like the training input
    n_splits=2,
)

# save predictions of test data to csv
sub = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv')
print(sub)
# test_preds is a list with one array per fold; stack the arrays as columns and average across folds
test_preds = np.mean(np.column_stack(test_preds), axis=1)

sub['target'] = test_preds
sub.to_csv('submission.csv', index=False)
sub

[back to the top](#0)