# Tabular Playground Series: March 2021

![](https://storage.googleapis.com/kaggle-competitions/kaggle/25225/logos/header.png?t=2021-01-27-17-34-26)

## Introduction:

Since January this year, the Kaggle competition team has been running month-long Tabular Playground competitions. The series aims to bridge the gap between InClass competitions and Featured competitions with friendly, approachable datasets.

For March, Kaggle provides a synthetic dataset that is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

The data has: 

* 19 categorical variables: **cat0** to **cat18**
* 11 continuous variables: **cont0** to **cont10**
* 1 binary **target** column

Files provided:

- train.csv - the training data with the target column
- test.csv - the test set; you will be predicting the target for each row in this file
- sample_submission.csv - a sample submission file in the correct format

The goal of the competition is to predict a binary **target** based on the given categorical and continuous features. The goal of **this notebook**, however, is to explore (EDA) and visualize the given data and, where possible, to discover (engineer) *potentially useful* features for further data modelling and prediction.


[1. Set-up](#Set-up)

# Set-up

In [None]:
import os
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kurtosis, skew
from matplotlib.offsetbox import AnchoredText

import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the data

In [None]:
train_ = pd.read_csv(r'/kaggle/input/tabular-playground-series-mar-2021/train.csv', index_col='id')
test = pd.read_csv(r'/kaggle/input/tabular-playground-series-mar-2021/test.csv', index_col='id')

submission= pd.read_csv(r'/kaggle/input/tabular-playground-series-mar-2021/sample_submission.csv', index_col='id')

# Explore the data

In [None]:
train = train_.copy()

In [None]:
print('Train data of shape {}'.format(train.shape))
display(train.head())
print('Test data of shape {}'.format(test.shape))
display(test.head())

In [None]:
display(train.describe().T)

In [None]:
target = train.pop('target')

In [None]:
cat_features =[]
num_features =[]

for col in train.columns:
    if train[col].dtype=='object':
        cat_features.append(col)
    else:
        num_features.append(col)
print('Categorical features: ', cat_features)
print('Numerical features: ', num_features)

## Null-values in the data
(No null values in the data.)

In [None]:
print('Number of NA values in train data is {}'.format(train.isna().sum().sum()))
print('Number of NA values in test data is {}'.format(test.isna().sum().sum()))

## Categorical features (group by cardinality)

- Four features have high cardinality (>20 unique values); one is very high, with 299
- The rest have 20 or fewer unique values
<div class="alert alert-block alert-danger">  
Watch for cat10 !!!
</div>

- In **cat10** there are 299 unique values in the train data, whereas the test data has 295
- Some values of cat10 are present in the test data but **NOT** in the train data, and vice versa (details below)

In [None]:
for col in cat_features:
    print('{} unique values in {}'.format(train[col].nunique(), col))

In [None]:
for col in cat_features:
    print('{} unique values in {}'.format(test[col].nunique(), col))

In [None]:
train_cat10 = list(pd.DataFrame(train_['cat10'].value_counts()).index)
test_cat10 = list(pd.DataFrame(test['cat10'].value_counts()).index)
print('Elements of cat10 which are present in train_data but NOT in test_data.')
print('-----------------------------------------------------------------------')
for item in train_cat10:
    if item not in test_cat10:
        print(item)
print('')
print('Elements of cat10 which are present in test_data but NOT in train_data.')
print('-----------------------------------------------------------------------')
for item in test_cat10:
    if item not in train_cat10:
        print(item)
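
The same train/test comparison can also be expressed with Python sets, which avoids the nested membership loops. This is a minimal sketch on made-up values; `train_col` and `test_col` are hypothetical stand-ins for `train_['cat10']` and `test['cat10']`.

```python
import pandas as pd

# Toy stand-ins for the two cat10 columns (values are invented for illustration).
train_col = pd.Series(["A", "B", "C", "C", "D"])
test_col = pd.Series(["B", "C", "E"])

# Unique levels seen in each data set.
train_levels = set(train_col.unique())
test_levels = set(test_col.unique())

# Set differences give the levels that appear on one side only.
only_in_train = sorted(train_levels - test_levels)
only_in_test = sorted(test_levels - train_levels)

print("Only in train:", only_in_train)
print("Only in test: ", only_in_test)
```

Levels that occur only in the test data matter for any encoder fitted on the train data alone, since it will not have seen them.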

In [None]:
# group columns according to cardinality/only for plotting
low_cardinal_cols = []
high_cardinal_cols = []

for col in cat_features:
    if train[col].nunique() <= 20:
        low_cardinal_cols.append(col)
    else:
        high_cardinal_cols.append(col)

# display the values
print("low_cardinal_cols")
print(low_cardinal_cols)
print("high_cardinal_cols")
print(high_cardinal_cols)

# Data Visualization
## Low cardinality features

In [None]:
def count_plot_testTrain(data1, data2, features, titleText):
    L = len(features)
    nrow= int(np.ceil(L/2))
    ncol= 2

    remove_last= (nrow * ncol) - L

    fig, ax = plt.subplots(nrow, ncol,figsize=(18, 26))#, facecolor='#D6E8D8')
    # hide only the unused trailing panels (ax.flat[-0] would hide the first panel)
    for unused_ax in ax.flat[L:]:
        unused_ax.set_visible(False)
    fig.subplots_adjust(top=0.95)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x=feature, color='#1eb069', data=data1, label='train')
        ax = sns.countplot(x=feature, color='#056d87', data=data2, label='test')
        plt.legend()
        i += 1
    plt.suptitle(titleText ,fontsize = 20)
    plt.show()
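
When the number of features does not fill the subplot grid, the trailing panels have to be hidden. Indexing with a negative count is fragile, because a count of zero addresses the first panel instead of none. The sketch below slices from the number of plots onward; the helper name `make_grid` is hypothetical, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption for running outside a notebook
import matplotlib.pyplot as plt
import numpy as np

def make_grid(n_plots, ncol=2, figsize=(8, 6)):
    """Create a grid with enough panels for n_plots and hide the unused ones."""
    nrow = int(np.ceil(n_plots / ncol))
    fig, axes = plt.subplots(nrow, ncol, figsize=figsize, squeeze=False)
    axes = axes.ravel()
    # The slice is empty when the grid is exactly full, so nothing gets hidden.
    for unused in axes[n_plots:]:
        unused.set_visible(False)
    return fig, axes[:n_plots]

fig_full, axes_full = make_grid(4)   # 2 x 2 grid, all panels used
fig_odd, axes_odd = make_grid(3)     # 2 x 2 grid, last panel hidden
print(len(axes_full), len(axes_odd))
```

Returning only the used axes also lets the plotting loop iterate over them directly with `zip(features, axes)`.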

In [None]:
count_plot_testTrain(train, test, low_cardinal_cols, titleText='Train & test data categorical features (low cardinality)')

In [None]:
def count_plot(data, features, titleText, hue=None):
    L = len(features)
    nrow= int(np.ceil(L/2))
    ncol= 2

    remove_last= (nrow * ncol) - L

    fig, ax = plt.subplots(nrow, ncol,figsize=(18, 26))
    # hide only the unused trailing panels (ax.flat[-0] would hide the first panel)
    for unused_ax in ax.flat[L:]:
        unused_ax.set_visible(False)
    fig.subplots_adjust(top=0.95)
    i = 1
    for feature in features:
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x=feature, palette='viridis', data=data, hue=hue)        
        i += 1
    plt.suptitle(titleText ,fontsize = 20)
    plt.show()    
    

In [None]:
count_plot(train, low_cardinal_cols, 'Train data cat_feats (low cardinal): target dist', hue=target)

## High cardinality categorical features
- As described above, there are FOUR categorical features with more than 20 unique values (cat5, cat7, cat8 and cat10)
- For plotting convenience, these features are condensed to 20 levels (the 19 most frequent levels, with the rest merged into an 'etc' level). Here 20 is an arbitrary number, chosen to equal the highest cardinality among the 'low cardinality' features.

In [None]:
# got a hint from notebook (https://www.kaggle.com/dwin183287/tps-mar-2021-eda/) for this code snippet
# NOTE: this overwrites the high-cardinality columns of train_ and test in place
for cat in high_cardinal_cols:
    # 19 most frequent levels, computed separately for train and test
    top_train = list(train_[cat].value_counts().index[:19])
    top_test = list(test[cat].value_counts().index[:19])
    # every other level is merged into a single 'etc' level
    train_[cat] = np.where(train_[cat].isin(top_train), train_[cat], 'etc')
    test[cat] = np.where(test[cat].isin(top_test), test[cat], 'etc')
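
The condensing step can be illustrated on a toy column. The sketch below is a variant, not the notebook's exact procedure: it takes the most frequent levels from the train column only and applies that same list to both columns, so that 'etc' means the same thing on both sides. The helper name `condense_levels` and the data are hypothetical.

```python
import pandas as pd

def condense_levels(series, keep_levels, other_label="etc"):
    """Keep the values listed in keep_levels and replace all others with other_label."""
    return series.where(series.isin(keep_levels), other_label)

# Toy columns: A is the most frequent train level, then B, C and D.
train_col = pd.Series(list("AAAABBBCCD"))
test_col = pd.Series(list("DDDCCBA"))

# Two most frequent train levels (the notebook keeps 19).
top_levels = train_col.value_counts().index[:2]

train_small = condense_levels(train_col, top_levels)
test_small = condense_levels(test_col, top_levels)

print(train_small.value_counts())
print(test_small.value_counts())
```

Choosing the kept levels per data set, as the cell above does, is fine for plotting; for modelling, a shared list is usually easier to reason about.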

In [None]:
L = len(high_cardinal_cols)
i =1
nrow= int(np.ceil(L/2))
ncol= 2

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(18, 8))
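# hide only the unused trailing panels; with four features remove_last is 0,
# and ax.flat[-0] would hide the first panel
for unused_ax in ax.flat[L:]:
    unused_ax.set_visible(False)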
fig.subplots_adjust(top=0.95)

for cat in train_[high_cardinal_cols]:
    plt.subplot(nrow, ncol, i)
    ax = sns.countplot(x=cat, color='#1eb069', data=train_, label='train') 
    ax = sns.countplot(x=cat, color='#056d87', data=test, label='test')
    plt.suptitle('Train & test data categorical features (high cardinality)' ,fontsize = 20, y=1.002)
    plt.legend()
    i+=1
plt.show() 

L = len(high_cardinal_cols)
i =1
nrow= int(np.ceil(L/2))
ncol= 2

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(18, 8))
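# hide only the unused trailing panels; with four features remove_last is 0,
# and ax.flat[-0] would hide the first panel
for unused_ax in ax.flat[L:]:
    unused_ax.set_visible(False)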
fig.subplots_adjust(top=0.95)
for cat in train_[high_cardinal_cols]:
    plt.subplot(nrow, ncol, i)
    ax = sns.countplot(x=cat, palette='viridis', data=train_, hue=target) 
    plt.suptitle('High cardinality categorical features: target dist' ,fontsize = 20, y=1.002)
    plt.legend()
    i+=1
plt.show()

## Numerical features
- KDE plots are made to compare train and test data
- No major differences in distribution
- Density is consistent with the imbalance of the target variable


In [None]:
L = len(num_features)
nrow= int(np.ceil(L/4))
ncol= 4

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(18, 12))
ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in num_features:
    plt.subplot(nrow, ncol, i)
    # fill=True replaces the deprecated shade=True (seaborn >= 0.11)
    ax = sns.kdeplot(train_[feature], fill=True, color='#1eb069',  alpha=0.5, label='train')
    ax = sns.kdeplot(test[feature], fill=True, color='#056d87',  alpha=0.5, label='test')
    plt.xlabel(feature, fontsize=9)
    plt.legend()
    i += 1
plt.suptitle('DistPlot: numerical features of train & test data', fontsize=20)
plt.show()
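
The visual comparison of train and test densities can be complemented with a number. One option, which is not used in this notebook, is the two-sample Kolmogorov-Smirnov test from `scipy.stats`. The sketch below runs it on synthetic normal samples (all arrays are made up) to show how a matching pair and a shifted pair differ.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic stand-ins for one continuous feature.
train_like = rng.normal(loc=0.0, scale=1.0, size=2000)
test_like = rng.normal(loc=0.0, scale=1.0, size=2000)   # same distribution
shifted = rng.normal(loc=1.0, scale=1.0, size=2000)     # mean shifted by one unit

same = ks_2samp(train_like, test_like)
diff = ks_2samp(train_like, shifted)

# The statistic is the largest gap between the two empirical CDFs (0 = identical).
print("same distribution: statistic={:.3f}, p-value={:.3g}".format(same.statistic, same.pvalue))
print("shifted sample:    statistic={:.3f}, p-value={:.3g}".format(diff.statistic, diff.pvalue))
```

On real data the same call could be applied per feature, for example `ks_2samp(train_[feature], test[feature])`. With very large samples even tiny differences give small p-values, so the statistic itself is often the more useful quantity.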

In [None]:
L = len(num_features)
nrow= int(np.ceil(L/4))
ncol= 4

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(18, 12))
ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.95)
i = 1
for feature in num_features:
    plt.subplot(nrow, ncol, i)
    ax = sns.kdeplot(train_[feature], fill=True, palette='viridis',  alpha=0.5, hue= target, multiple="stack")
    plt.xlabel(feature, fontsize=9)
    plt.legend(['1', '0'])
    i += 1
plt.suptitle('DistPlot: numerical features of train data', fontsize=20)
plt.show()

### The same story, a different look at the KDE plots above (looks cool to me)
(Normalized distribution at each value)

In [None]:
L = len(num_features)
nrow= int(np.ceil(L/4))
ncol= 4

remove_last= (nrow * ncol) - L

fig, ax = plt.subplots(nrow, ncol,figsize=(18, 12))
ax.flat[-remove_last].set_visible(False)
fig.subplots_adjust(top=0.92)
i = 1
for feature in num_features:
    plt.subplot(nrow, ncol, i)
    ax = sns.kdeplot(train_[feature], fill=True, palette='coolwarm',  alpha=0.75, hue= target, multiple="fill")
    plt.xlabel(feature, fontsize=9)
    plt.legend(['1', '0'])
    i += 1
plt.suptitle('DistPlot: numerical features of train data', fontsize=20)
plt.show()

# Target Variable
(Target variable is imbalanced: more 0's than 1's)

In [None]:
plt.figure(figsize=(8, 6))
ax = sns.countplot(x=target, palette='viridis')
ax.set_title('Target variable distribution', fontsize=20, y=1.05)

sns.despine(right=True)
sns.despine(offset=10, trim=True)
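
The imbalance can also be put into numbers with `value_counts`. The sketch below uses a made-up target vector; on the real data the same two calls would be applied to `target`.

```python
import pandas as pd

# Toy binary target, invented for illustration: six 0's and two 1's.
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="target")

counts = toy_target.value_counts()                 # absolute class counts
shares = toy_target.value_counts(normalize=True)   # class proportions

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

The share of the positive class is a useful reference point when reading accuracy-like metrics later on.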


# Feature-target correlation 
- More numerical features than categoricals seem to correlate with target
- cat16 has the highest correlation with target
- cont0 is the least correlated with target 


In [None]:
correlation_table = []
for cols in num_features:
    y = target
    x = train[cols]
    corr = np.corrcoef(x, y)[1][0]
    # named 'row' to avoid shadowing the built-in dict
    row = {
        'Features': cols,
        'Correlation coefficient' : corr,
        'Feat_type': 'numerical'
    }
    correlation_table.append(row)
dF1 = pd.DataFrame(correlation_table)
fig = plt.figure(figsize=(10,6), facecolor='#EAECEE')
ax = sns.barplot(x="Correlation coefficient", y="Features", 
                     data=dF1.sort_values("Correlation coefficient", ascending=False),
                     palette='viridis', alpha=0.75)
ax.grid()
#ax.set_title("Correlation of numerical features with Target", fontsize=20, y=1.05)

title =  'Correlation of numerical features with target'
sub_title = 'In comparison with categorical features \
\nnumericals are less correlated with target.'

plt.gcf().text(0.05, 1.02, title, fontsize=24)
#plt.gcf().text(0.05, 0.9, sub_title, fontsize=14)

at1 = AnchoredText(sub_title,
                   loc='lower left', frameon=True,
                   bbox_to_anchor=(-0.1, 1.01),
                   bbox_transform=ax.transAxes,
                   #prop=dict(size=8),
                   )
at1.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
ax.add_artist(at1);


In [None]:
correlation_table= []
for cols in cat_features:
    y = train_['target']
    X = train[cols]
    corr = pd.concat((X, y), axis=1).apply(lambda x : pd.factorize(x)[0]).corr()
    # named 'row' to avoid shadowing the built-in dict
    row = {
        'Features': cols,
        'Correlation coefficient' : corr['target'][:].values[0],
        'Feat_type': 'categorical'
    }
    correlation_table.append(row)
dF2 = pd.DataFrame(correlation_table)
fig = plt.figure(figsize=(12,8), facecolor='#EAECEE')
ax = sns.barplot(x="Correlation coefficient", y="Features", 
                     data=dF2.sort_values("Correlation coefficient", ascending=False),
                     palette='viridis', alpha=0.75)
ax.grid()
#ax.set_title("Correlation of categorical features with target", fontsize=20, y=1.05)

title =  'Correlation of categorical features with target'
sub_title = 'Categorical features are better\ncorrelated with target \
than\nnumerical features.\n\ncat16 and cat18 stand-out'

plt.gcf().text(0.05, 1.04, title, fontsize=24)
#plt.gcf().text(0.05, 0.9, sub_title, fontsize=14)

at1 = AnchoredText(sub_title,
                   loc='lower left', frameon=True,
                   bbox_to_anchor=(-0.1, 1.01),
                   bbox_transform=ax.transAxes,
                   #prop=dict(size=8),
                   )
at1.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
ax.add_artist(at1);
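
The categorical correlations above are computed on integer codes from `pd.factorize`, which numbers the levels in order of first appearance. The sketch below, on made-up data with a hypothetical helper `factorized_corr`, suggests why these coefficients are best read as rough indications: the same rows in a different order give a different value.

```python
import numpy as np
import pandas as pd

def factorized_corr(frame, feature, target="target"):
    """Pearson correlation between the factorize codes of a feature and the target."""
    codes, _ = pd.factorize(frame[feature])
    return float(np.corrcoef(codes, frame[target].to_numpy())[0, 1])

toy = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "C", "C"],
    "target": [0, 0, 0, 1, 1, 1],
})

# Same rows, but level B now appears first and therefore receives code 0.
reordered = toy.iloc[[2, 3, 0, 1, 4, 5]].reset_index(drop=True)

corr_original = factorized_corr(toy, "cat")
corr_reordered = factorized_corr(reordered, "cat")

print("original order: ", round(corr_original, 3))
print("reordered rows: ", round(corr_reordered, 3))
```

For nominal features, an association measure that does not depend on the level coding may be a safer basis for ranking features.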

# Feature-to-feature correlation
- Correlation between numerical features dominates correlation between categoricals

In [None]:
sns.set_style("darkgrid")

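# restrict to numerical columns; DataFrame.corr() raises on object columns in pandas >= 2.0
corr = train[num_features].corr()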
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(16, 10), facecolor='#EAECEE')
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0, vmin=-1.0, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": 0.75})

#ax.set_title('Correlation heatmap: numerical features', fontsize=24, y= 1.05)
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([-0.75, 0, 0.75])
colorbar.set_ticklabels(['negative_corr','Little_to_no_corr','positive_corr'])

title = 'Highly correlated ones are:\ncont1_cont2\ncont0_cont10'
title_ =  'Correlation heatmap: numerical features (train data)'
plt.gcf().text(0.23, 0.98, title_, fontsize=24)
#plt.gcf().text(0.2, 0.9, title, fontsize=12)

#textstr = 'Features with highest correlation\ncon1$con2, cont0&cont10, cont'
at1 = AnchoredText(title,
                   loc='lower left', frameon=True,
                   bbox_to_anchor=(-0.1, 1.01),
                   bbox_transform=ax.transAxes,
                   #prop=dict(size=8),
                   )
at1.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
ax.add_artist(at1);
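
Instead of reading the strongest pairs off the heatmap by eye, they can be listed from the correlation matrix. This is a minimal sketch on a made-up frame; the helper name `top_correlated_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack().dropna()
    # Rank by absolute value but report the signed coefficient.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)

# Toy numerical features: y follows x closely, z does not.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 11],
    "z": [5, 1, 4, 2, 3],
})

top = top_correlated_pairs(toy, n=3)
print(top)
```

On the notebook's data the call would be `top_correlated_pairs(train[num_features])`, assuming `num_features` holds the continuous columns.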

In [None]:
corr = (train_[cat_features]).apply(lambda x : pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(20, 12), facecolor="#EAECEE")
cmap = sns.color_palette("vlag", as_cmap=True)
sns.heatmap(np.round(corr, 2), mask=mask, cmap=cmap, vmax=1.0, vmin=-1.0, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": 0.75})

colorbar = ax.collections[0].colorbar
colorbar.set_ticks([-0.75, 0, 0.75])
colorbar.set_ticklabels(['negative_corr','Little_to_no_corr','positive_corr'])

sub_title = 'Not as many highly correlated as num_features.\nMost notable is:\ncat11_cat2'
title =  'Correlation heatmap: categorical features (train data)'
plt.gcf().text(0.26, 0.98, title, fontsize=24)

at1 = AnchoredText(sub_title,
                   loc='lower left', frameon=True,
                   bbox_to_anchor=(-0.098, 1.02),
                   bbox_transform=ax.transAxes, #prop=dict(size=8),
                   )
at1.patch.set_boxstyle("round,pad=0.,rounding_size=0.4")
ax.add_artist(at1);

## End of notebook!

# Thank you very much for reading this notebook!