## **1. Introduction**

Since January of this year, the Kaggle competitions team has been running month-long Tabular Playground competitions. The series aims to bridge the gap between InClass competitions and Featured competitions with friendly, approachable datasets.

For this competition, you will be predicting a binary target based on 100 feature columns given in the data. All columns are continuous.

The data is synthetically generated by a GAN that was trained on a real-world dataset used to identify spam emails via various extracted features from the email.

*Files to work with*:

- train.csv - the training data with the target column
- test.csv - the test set; you will be predicting the target for each row in this file (the probability of the binary target)
- sample_submission.csv - a sample submission file in the correct format

*Evaluation*:

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
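
To make the metric concrete, here is a minimal, dependency-free sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration; in practice `sklearn.metrics.roc_auc_score` would normally be used.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly

# Rescaling the scores without changing their order leaves the AUC unchanged
auc_scaled = pairwise_auc(y_true, [s * 0.5 for s in y_score])
print(auc_scaled == auc)
```

Because only the ordering of the predictions matters, the submitted probabilities do not need to be well calibrated to score well on this metric.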

In [None]:
import os
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
from matplotlib.ticker import FormatStrFormatter

import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv(r'/kaggle/input/tabular-playground-series-nov-2021/train.csv', index_col='id')
test = pd.read_csv(r'/kaggle/input/tabular-playground-series-nov-2021/test.csv', index_col='id')
submission= pd.read_csv(r'/kaggle/input/tabular-playground-series-nov-2021/sample_submission.csv', index_col='id')

## **2. Dataset Overview**

### Data size
- The train dataset has 600000 rows and 101 columns (100 features plus the target variable).
- The test dataset has 540000 rows and 100 features.

### Missing Values
- No missing values in either the train or the test dataset!

### Features
- All features are numerical.

### Target
- Binary target (1, 0)
- Target distribution is balanced.

In [None]:
print('shape')
print(train.shape)
print(test.shape)

print('Null values')
display(train.isna().sum().sum())
display(test.isna().sum().sum())
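
The claims about a balanced target and purely numerical features can be checked in the same spirit. The sketch below runs on a small synthetic stand-in frame called `toy` (an assumption, so that the snippet is self-contained); on Kaggle the same calls would be applied to `train`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train` (assumption: the real frame is read from train.csv)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
toy["target"] = rng.integers(0, 2, size=1000)

# Share of each class; values close to 0.5 indicate a balanced binary target
class_share = toy["target"].value_counts(normalize=True).sort_index()
print(class_share)

# True only if every feature column has a numeric dtype
feature_cols = toy.columns.drop("target")
all_numeric = all(pd.api.types.is_numeric_dtype(toy[c]) for c in feature_cols)
print(all_numeric)
```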

## **3. Target Distribution**

In [None]:
target = train['target']

In [None]:
pal = ['#6495ED','#ff355d']
plt.figure(figsize=(8, 6))
ax = sns.countplot(x=target, palette=pal)
ax.set_title('Target variable distribution', fontsize=20, y=1.05)

sns.despine(right=True)
sns.despine(offset=10, trim=True)

In [None]:
train_ = train.sample(10000, random_state=1121)
test_ = test.sample(5000, random_state=1121)

features = train.columns
num_features = features[:-1]

In [None]:
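# Features with peculiar distributions (highlighted in the density plots below).
# Note: this rebinds `ff`, shadowing the plotly.figure_factory import above.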
ff = ['f0', 'f2',  'f4', 'f9', 'f12',  'f16', 'f19', 'f20', 'f23', 'f24',  'f27', 'f28',  'f30','f31', 'f32', 'f33', 'f35', 'f36', 'f39', 
'f42',  'f44', 'f46', 'f48', 'f49', 'f51', 'f52', 'f53', 'f56', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f68', 'f69', 
'f72', 'f73', 'f75', 'f76', 'f78', 'f79', 'f81', 'f83', 'f84',  'f87', 'f88', 'f89', 'f90', 'f92', 'f93', 'f94', 'f95', 'f98', 'f99']

## **4. Feature Distribution**

In [None]:
def density_plotter(a, b, title):    
    L = len(num_features[a:b])
    nrow= int(np.ceil(L/10))
    ncol= 10
    fig, ax = plt.subplots(nrow, ncol,figsize=(24, 12), sharey=False, facecolor='#dddddd')

    fig.subplots_adjust(top=0.90)
    i = 1
    for feature in num_features[a:b]:
        plt.subplot(nrow, ncol, i)
        # `fill=True` replaces the deprecated `shade=True` (seaborn >= 0.11)
        ax = sns.kdeplot(train_[feature], fill=True, color='#6495ED', alpha=0.85, label='train')
        ax = sns.kdeplot(test_[feature], fill=True, color='#ff355d', alpha=0.85, label='test')
        ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
        ax.xaxis.set_label_position('top')
        ax.set_ylabel('')
        ax.set_yticks([])
        ax.set_xticks([])

        # Redraw the peculiar features in black/gold so they stand out
        if feature in ff:
            ax = sns.kdeplot(train_[feature], fill=True, color='black', alpha=0.85, label='train')
            ax = sns.kdeplot(test_[feature], fill=True, color='gold', alpha=0.85, label='test')
            ax.set_facecolor('#dddddd')
        
        i += 1

    lines, labels = fig.axes[-1].get_legend_handles_labels()    
    fig.legend(lines, labels, loc = 'upper center',borderaxespad= 4.0) 

    plt.suptitle(title, fontsize=20)
    plt.show()

In [None]:
density_plotter(a=0, b=50, title='Density plot: train & test data (f0 - f49)')

In [None]:
density_plotter(a=50, b=100, title='Density plot: train & test data (f50 - f99)')
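
The density plots compare train and test distributions by eye. As a rough numeric complement, the sketch below computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) with NumPy only. This is not part of the original analysis: the helper name `ks_statistic` and the synthetic samples are assumptions for illustration, and `scipy.stats.ks_2samp` offers a fuller implementation including p-values.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic samples standing in for one feature in train and test
rng = np.random.default_rng(0)
same_a = rng.normal(size=5000)
same_b = rng.normal(size=5000)
shifted = rng.normal(loc=1.0, size=5000)

ks_same = ks_statistic(same_a, same_b)      # small: same underlying distribution
ks_shifted = ks_statistic(same_a, shifted)  # larger: the mean has moved
print(ks_same, ks_shifted)
```

Applied per feature to `train_` and `test_`, values close to zero would support what the overlapping curves suggest.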

## **5. Correlation Heatmap**

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16 , 16), facecolor='#dddddd')
corr = train.sample(600000, random_state=2021).corr()

# `np.bool` was removed in NumPy 1.24; the built-in `bool` works in all versions
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr, ax=ax, square=True, center=0, linewidth=1, vmax=0.2, vmin=-0.2,
        cmap=sns.diverging_palette(240, 10, as_cmap=True),
        cbar_kws={"shrink": .85}, mask=mask ) 

ax.set_title('Correlation heatmap: Numerical features', fontsize=24, y= 1.05);
#ax.set_facecolor(None);
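
A heatmap over 101 columns is hard to read cell by cell, so it can help to rank features by their absolute correlation with the target. The sketch below does this on a small synthetic frame `toy` (an assumption, built so that only `f1` drives the target); with the real data the same three lines would start from `train.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train`: only f1 influences the toy target
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "f0": rng.normal(size=n),
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
toy["target"] = (toy["f1"] + 0.5 * rng.normal(size=n) > 0).astype(int)

corr = toy.corr()
# Keep the target column of the matrix and drop its self-correlation
target_corr = corr["target"].drop("target")
# Rank features by strength of linear association, regardless of sign
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear association, so a low rank here does not rule out a feature being useful to a tree-based model.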

## **5. Base Models** 

### 5.1: XGBoost

In [None]:
# Adapted from Kaggle's starter notebook for the September TPS competition

import pandas as pd
import numpy as np
from pathlib import Path
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.model_selection import cross_validate
import warnings 
warnings.filterwarnings('ignore')

data_dir = Path('../input/tabular-playground-series-nov-2021/')

df_train = pd.read_csv(
    data_dir / "train.csv",
    index_col='id',
    #nrows=25000, 
)

X_test = pd.read_csv(data_dir / "test.csv", index_col='id')

FEATURES = df_train.columns[:-1]
TARGET = df_train.columns[-1]

X = df_train.loc[:, FEATURES]
y = df_train.loc[:, TARGET]

seed = 0
fold = 5

In [None]:
model_xgb = XGBClassifier(max_depth=3,
    subsample=.85,
    colsample_bytree=.1,
    n_jobs=-1,
    tree_method='gpu_hist',
    sampling_method='gradient_based', 
    random_state= seed,
)
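def score(X, y, model, cv):
    """Cross-validate `model` and return per-fold ROC AUC with mean and std."""
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
    # Compute both statistics from the fold columns only; with chained lambdas
    # in assign(), std would also include the freshly added mean column.
    return scores.assign(
        mean=scores.mean(axis=1),
        std=scores.std(axis=1),
    )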

scores = score(X, y, model_xgb, cv=fold)
display(scores)

### 5.2: LightGBM

In [None]:
model_lgbm = LGBMClassifier(
    num_iterations=100,
    objective = "binary",
    feature_pre_filter = False,  
    device_type = 'gpu',
    )
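def score(X, y, model, cv):
    """Cross-validate `model` and return per-fold ROC AUC with mean and std."""
    scoring = ["roc_auc"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    scores = pd.DataFrame(scores).T
    # Compute both statistics from the fold columns only; with chained lambdas
    # in assign(), std would also include the freshly added mean column.
    return scores.assign(
        mean=scores.mean(axis=1),
        std=scores.std(axis=1),
    )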

scores = score(X, y, model_lgbm, cv=fold)
display(scores)
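
Passing an integer as `cv` with a classifier makes `cross_validate` use stratified folds without shuffling, so the `seed` variable above only affects the model, not the split. The sketch below shows one way to make the split explicit and seeded. It is an illustration under assumptions: the data comes from `make_classification` and the estimator is a `LogisticRegression`, chosen only so the snippet runs quickly without a GPU.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Small synthetic binary problem (stand-in for X and y)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Explicit, shuffled, reproducible stratified split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

result = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy,
    y_toy,
    scoring="roc_auc",
    cv=cv,
    return_train_score=True,
)

# With a single string scorer the keys are "test_score" and "train_score"
test_auc = result["test_score"]
print(test_auc.mean(), test_auc.std())
```

The same `cv` object could be passed to `score(X, y, model_xgb, cv=cv)` in place of the integer `fold`.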

### 5.3: Submission

In [None]:
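# eval_metric has no effect in fit() without an eval_set, and newer XGBoost
# releases deprecate passing it there, so it is omitted
model_xgb.fit(X, y)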
X_test = pd.read_csv(data_dir / "test.csv", index_col='id')

y_pred_xgb = pd.Series(
    model_xgb.predict_proba(X_test)[:, 1],
    index=X_test.index,
    name=TARGET,
)
y_pred_xgb.to_csv("submission_xgb.csv")

In [None]:
y_pred_xgb
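
Before uploading, it can be worth comparing the predictions against the sample submission loaded at the start of the notebook. The sketch below is one possible check, using toy stand-ins for `submission` and `y_pred_xgb`; the helper `check_submission` is hypothetical and not part of any library.

```python
import pandas as pd

# Toy stand-ins for the sample submission and the predicted probabilities
ids = pd.Index([10, 11, 12], name="id")
sample = pd.DataFrame({"target": [0.5, 0.5, 0.5]}, index=ids)
pred = pd.Series([0.2, 0.9, 0.55], index=ids, name="target")

def check_submission(pred, sample):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not pred.index.equals(sample.index):
        problems.append("index differs from sample submission")
    if pred.name not in sample.columns:
        problems.append("column name not in sample submission")
    if pred.isna().any():
        problems.append("missing predictions")
    if not pred.between(0.0, 1.0).all():
        problems.append("values outside [0, 1]")
    return problems

problems = check_submission(pred, sample)
print(problems)  # [] for the toy data above
```

In the notebook itself the call would be `check_submission(y_pred_xgb, submission)`.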

## Thank you for visiting this notebook! 