## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Introduction</div>

**Hi**,<br><br>
I just wanted to share my quick EDA and a basic spot-check for this month's competition.
Since the dataset is quite big, I just used "the big three" (LGBM, XGB & CATB) for a spot-check with GPU enabled.
Anyway, I hope it's still useful for someone...

**Thanks for taking some time to check out my notebook. Feel free to leave an upvote if you like it or even copy some parts**

Best Regards

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Import Data</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
# colors
bg_color = '#fbfbfb'
txt_color = '#5c5c5c'

cmap = ['#68595b','#7098af','#6f636c','#907c7b']

In [None]:
%%time
df_train = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Simple EDA</div>

### <div style='background:#5d7c8c;color:white;padding:0.25em;border-radius:0.2em'>Basic Overview</div>

In [None]:
# basic overview
print('train_shape:',df_train.shape)
print('test_shape:', df_test.shape)

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
# missing values
print('missing values train:',sum(df_train.isna().sum()))
print('missing values test:',sum(df_test.isna().sum()))

#### **Insights:**
* quite a big dataset with 1 million rows and 287 columns *(...Hi there GPU quota)*
* all numerical variables
* mixture of continuous (float, 240 cols) and discrete variables (int, 47 cols including `id` and `target`)
* we have no missing values
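
The numbers above can also be pulled out programmatically instead of reading them off the `info()` output. A minimal sketch, assuming a tiny synthetic frame as a stand-in for `df_train` (the `toy` frame and the `summarize` helper are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return shape, number of columns per dtype and total missing values."""
    return {
        'shape': df.shape,
        'dtype_counts': df.dtypes.astype(str).value_counts().to_dict(),
        'missing': int(df.isna().sum().sum()),
    }

# synthetic stand-in for df_train (hypothetical data)
toy = pd.DataFrame({
    'id': np.arange(5),
    'f0': np.linspace(0.0, 1.0, 5),
    'f1': np.linspace(1.0, 2.0, 5),
    'f2': [0, 1, 0, 1, 1],
    'target': [0, 1, 1, 0, 1],
})

summary = summarize(toy)
print(summary)
```

Calling `summarize(df_train)` on the real data should give the same float/int split as listed in the insights.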

### <div style='background:#5d7c8c;color:white;padding:0.25em;border-radius:0.2em'>Univariate Analysis: Target</div>

In [None]:
# metrics
target_count_0 = df_train.query('target == 0')['target'].count()
total_count = df_train['target'].count()

# plot
fig, ax = plt.subplots(tight_layout=True, figsize=(12,2.5))
fig.patch.set_facecolor(bg_color)

ax.barh(
    y=1, width=target_count_0, 
    color=cmap[1], alpha=0.75, lw=1, edgecolor='white'
)

ax.barh(
    y=1, width=total_count-target_count_0, left=target_count_0,
    color=cmap[2], alpha=0.25, lw=1, edgecolor='white'
)

ax.axis('off')

# annotations
ax.annotate(
    f"{np.round(target_count_0/total_count*100,2)} %",  # passed positionally: recent Matplotlib no longer accepts the `s=` keyword here
    xy=(2.5e5,1.05),
    va='center', ha='center',
    fontsize=36, fontweight='bold', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    'Count Target Class: 0',
    xy=(2.5e5,0.85),
    va='center', ha='center',
    fontsize=16, fontstyle='italic', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    f"{np.round((total_count-target_count_0)/total_count*100,2)} %",
    xy=(7.5e5,1.05),
    va='center', ha='center',
    fontsize=36, fontweight='bold', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    'Count Target Class: 1',
    xy=(7.5e5,0.85),
    va='center', ha='center',
    fontsize=16, fontstyle='italic', fontfamily='serif',
    color='#fff'
)

fig.text(
    s='::Target Distribution',
    x=0, y=1.25,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    the target variable is nearly
    equally distributed among both classes.
    ''',
    x=0, y=1.2,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

### <div style='background:#5d7c8c;color:white;padding:0.25em;border-radius:0.2em'>Univariate Analysis: Features</div>

In [None]:
# sampling data to speed up EDA
np.random.seed(2003)
smpl_train = df_train.sample(10000)
smpl_test = df_test.sample(10000)

In [None]:
# get continuous variables
cont_feat = [col for col in smpl_train.columns if smpl_train[col].dtype == 'float']

# plot
fig, ax = plt.subplots(20, 12, tight_layout=True, figsize=(12, 12))
ax = ax.flatten()

for idx, feat in enumerate(cont_feat):
    
    sns.kdeplot(
        data=smpl_test,
        x=feat,
        fill=True,
        color=cmap[0],
        edgecolor='black',
        alpha=0.8,
        ax=ax[idx]
    )
    
    sns.kdeplot(
        data=smpl_train,
        x=feat,
        fill=True,
        color=cmap[1],
        edgecolor='black',
        alpha=0.8,
        ax=ax[idx]
    )
    
    ax[idx].set_xticks([])
    ax[idx].set_yticks([])
    ax[idx].set_xlabel('')
    ax[idx].set_ylabel('')
    ax[idx].set_title(f'{feat}', loc='center', fontsize=10)
    sns.despine(left=True)

fig.text(
    s='::Continuous Feature Distribution',
    x=0, y=1.05,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='Train & Test-Data are nearly equally distributed',
    x=0, y=1.02,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

In [None]:
# get discrete variables
disc_feat = [col for col in smpl_train.columns if smpl_train[col].dtype == 'int' and col not in ['id','target']]

fig, ax = plt.subplots(9, 5, tight_layout=True, figsize=(12, 12))
ax = ax.flatten()

for idx, feat in enumerate(disc_feat):
    
    sns.kdeplot(
        data=smpl_test,
        x=feat,
        fill=True,
        color=cmap[0],
        edgecolor='black',
        alpha=0.8,
        ax=ax[idx]
    )
    
    sns.kdeplot(
        data=smpl_train,
        x=feat,
        fill=True,
        color=cmap[1],
        edgecolor='black',
        alpha=0.8,
        ax=ax[idx]
    )
    
    ax[idx].set_xticks([])
    ax[idx].set_yticks([])
    ax[idx].set_xlabel('')
    ax[idx].set_ylabel('')
    ax[idx].set_title(f'{feat}', loc='center', fontsize=10)
    sns.despine(left=True)

fig.text(
    s='::Discrete Feature Distribution',
    x=0, y=1.05,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='Train & Test-Data are nearly equally distributed',
    x=0, y=1.02,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()
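
The "nearly equally distributed" impression from the KDE plots can be backed up with a number per feature. A rough sketch, assuming synthetic frames in place of `smpl_train` / `smpl_test`; the `standardized_mean_diff` helper is hypothetical and only compares means, so it is a coarse check and not a full distribution test:

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Absolute difference of feature means, scaled by the pooled standard deviation."""
    shared = [c for c in train.columns if c in test.columns]
    pooled_std = np.sqrt((train[shared].var() + test[shared].var()) / 2)
    diff = (train[shared].mean() - test[shared].mean()).abs()
    # largest values first: these features differ most between the two sets
    return (diff / pooled_std).sort_values(ascending=False)

# synthetic stand-ins (hypothetical data); f1 is deliberately shifted in the test set
rng = np.random.default_rng(2003)
toy_train = pd.DataFrame({'f0': rng.normal(0, 1, 5000), 'f1': rng.normal(0, 1, 5000)})
toy_test = pd.DataFrame({'f0': rng.normal(0, 1, 5000), 'f1': rng.normal(0.5, 1, 5000)})

smd = standardized_mean_diff(toy_train, toy_test)
print(smd)
```

Values close to zero for all features would support the claim in the plot subtitle; constant columns would need to be dropped first, since their pooled standard deviation is zero.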

#### **Insights:**
* our target classes are equally distributed (~50/50)
* we can spot heavily skewed features and features with a multi-modal distribution
* we have some interestingly skewed features among the discrete variables (e.g. f275-f284)

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Spot-Checking</div>

In [None]:
# prepare dataframe for modeling
X = df_train.drop(columns=['id','target']).copy()
y = df_train['target'].copy()

In [None]:
# model params
lgbm_params = {
    'device_type' : 'gpu'
}

catb_params = {
    'task_type' : 'GPU',
    'devices' : '0',
    'verbose' : 0
}

xgb_params = {
    'predictor': 'gpu_predictor',
    'tree_method': 'gpu_hist',
    'gpu_id' : 0,
    'verbosity': 0
}

In [None]:
%%time
# spot checking which model to choose
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=2003)

models = [
    ('LGBM', LGBMClassifier(**lgbm_params)),
    ('CATB', CatBoostClassifier(**catb_params)),
    ('XGB', XGBClassifier(**xgb_params))
]

scores = dict()

for name, model in models:
    model.fit(X_train, y_train)
    y_hat = model.predict_proba(X_test)[:,1]
    fpr, tpr, _ = roc_curve(y_test, y_hat)
    auc_score = auc(fpr, tpr)
    scores[name] = auc_score
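
For readers less familiar with the metric: the AUC stored in `scores` can be read as the share of (positive, negative) pairs in which the positive sample gets the higher predicted probability. A NumPy-only sketch of that definition; `pairwise_auc` is a hypothetical helper for illustration, not what scikit-learn runs internally, and it needs memory proportional to positives times negatives, so it only suits small samples:

```python
import numpy as np

def pairwise_auc(y_true, y_score) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```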

In [None]:
scores_df = pd.DataFrame([scores]).transpose().rename(columns={0:'AUC'})

fig, ax = plt.subplots(figsize=(12,6))

sns.barplot(
    data=scores_df,
    x='AUC',
    y=scores_df.index,
    orient='h',
    color=cmap[1],
    ax=ax
)

for idx in range(len(scores_df)):
    # positional lookup: the index holds model names, not integers
    x = scores_df['AUC'].iloc[idx]
    ax.annotate(
        f"AUC: {np.round(x,3)}",
        xy=(x-0.01, idx),
        va='center', ha='right'
    )

sns.despine(left=True)
plt.show()

#### **Conclusion:**
* CatBoost reaches the highest hold-out AUC "straight out of the box" (default parameters, single 67/33 split)
* treating the skewed features, and especially the discrete variables, might be an interesting direction for feature engineering
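
One possible treatment for the skewed features is a log transform. A small sketch on a synthetic, non-negative feature; `toy_feature` is hypothetical, and whether this helps here is untested. Tree-based models like the three above split on feature order, so a monotonic transform on its own is unlikely to change much for them and would matter more for linear or neural models:

```python
import numpy as np
import pandas as pd

# synthetic right-skewed, non-negative feature (hypothetical data)
rng = np.random.default_rng(2003)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000), name='toy_feature')

# log1p = log(1 + x); defined for x >= 0 and compresses the long right tail
transformed = np.log1p(raw)

print('skew before:', raw.skew())
print('skew after: ', transformed.skew())
```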