## Tabular Playground Series - JUN

Many users have been using my EDA and visualizations from previous competitions.

**Very happy and honored. :)**

- [⚫ Dark Mode Visualization (Apple Version)](https://www.kaggle.com/subinium/dark-mode-visualization-apple-version)
- [[TPS-Apr] Highlighting the Data ✨✨✨](https://www.kaggle.com/subinium/tps-apr-highlighting-the-data)
- [[TPS-May] Categorical EDA](https://www.kaggle.com/subinium/tps-may-categorical-eda)


This time, though, I'm going to use a new visualization technique to create a nice EDA.

**Please enjoy.**

## Introduction

According to the competition announcement, the dataset was generated with CTGAN.

The original dataset deals with predicting the category of an eCommerce product given various attributes about the listing.

It is similar to May's dataset, but with **9 labels** and **75 features** it is a bit more difficult.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')

The `id` column carries no information, so I drop it up front.

In [None]:
train = train.drop('id', axis=1)
test = test.drop('id', axis=1)

## EDA with the Table

A summary table gives a quick view of the basic statistics. The visualizations below show the same information in more ways, but the table is useful for a brief look.

- [[TPS-May] Categorical EDA](https://www.kaggle.com/subinium/tps-may-categorical-eda)

In [None]:
# train.describe().T.style.bar(subset=['mean'], color='#205ff2')\
#                             .background_gradient(subset=['std'], cmap='Reds')\
#                             .background_gradient(subset=['50%'], cmap='coolwarm')

In [None]:
# test.describe().T.style.bar(subset=['mean'], color='#205ff2')\
#                             .background_gradient(subset=['std'], cmap='Reds')\
#                             .background_gradient(subset=['50%'], cmap='coolwarm')

## Train / Test Feature EDA

There are **75 features in total.**

train has **200,000** rows and test has **100,000** rows.

There are no missing values. However, because the data is anonymized, a 0 or the most frequent value (mode) of a column may actually stand in for a missing value.

In [None]:
import missingno as msno

fig, ax = plt.subplots(1, 2, figsize=(20, 5))

msno.matrix(train, ax=ax[0], sparkline=False)
msno.matrix(test, ax=ax[1], sparkline=False)

ax[0].set_title('Train Null Data Check (0)', fontweight='bold')
ax[1].set_title('Test Null Data Check (0)', fontweight='bold')
plt.show()
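
As a numeric complement to the matrix plot, the sketch below counts nulls per column and also reports the share of zeros and the share taken by the most frequent value, since either could be standing in for a missing value in anonymized data. The `toy` frame and the helper name `missing_candidates` are made up for illustration; in the notebook you would pass `train` or `test` instead.

```python
import pandas as pd

def missing_candidates(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count, zero share (%), mode, and mode share (%)."""
    mode = df.mode().iloc[0]  # first (smallest) most-frequent value per column
    return pd.DataFrame({
        'n_null': df.isna().sum(),
        'zero(%)': (df == 0).mean() * 100,
        'mode': mode,
        'mode(%)': (df == mode).mean() * 100,  # compares each column to its own mode
    })

# Toy data standing in for the real features (hypothetical values)
toy = pd.DataFrame({'feature_0': [0, 0, 1, 3], 'feature_1': [2, 2, 2, 0]})
report = missing_candidates(toy)
print(report)
```

A column whose zero share or mode share is very high is not necessarily hiding missing values, but it is a reasonable place to look first.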

## Target Distribution

The target is imbalanced.

In [None]:
fig = plt.figure(figsize=(12, 8))
gs = fig.add_gridspec(7, 4)
ax = fig.add_subplot(gs[:-1,:])
ax2 = fig.add_subplot(gs[-1,:])
ax2.axis('off')

target_cnt = train['target'].value_counts().sort_index()
target_cum = target_cnt.cumsum()
ax.bar(target_cnt.index, target_cnt, color=['#d4dddd' if i%2==0 else '#fafafa' for i in range(9)],
       width=0.55, 
       edgecolor='black', 
       linewidth=0.7)


for i in range(9):
    # Positional access: the index holds class labels, not integers
    cnt, cum = target_cnt.iloc[i], target_cum.iloc[i]
    ax.annotate(f'{cnt}({cnt/len(train)*100:.3}%)', xy=(i, cnt+1000),
                   va='center', ha='center',
               )
    ax2.barh([0], [cnt], left=[cum - cnt], height=0.2,
            edgecolor='black', linewidth=0.7, color='#d4dddd' if i%2==0 else '#fafafa'
            )
    ax2.annotate(i+1, xy=(cum-cnt/2, 0),
                 va='center', ha='center', fontsize=10)

ax.set_title('Target Distribution', weight='bold', fontsize=15)
ax.grid(axis='y', linestyle='-', alpha=0.4)

fig.tight_layout()
plt.show()

Sorted by proportion, the four largest classes account for about 75% of the data, and each of the remaining classes is below 10%.

In [None]:
target_cnt_df = pd.DataFrame(target_cnt)
# Divide the count Series (not the whole DataFrame) to get a single ratio column
target_cnt_df['ratio(%)'] = target_cnt/target_cnt.sum()*100
target_cnt_df.sort_values('ratio(%)', ascending=False, inplace=True)
target_cnt_df['cumulative_sum(%)'] = target_cnt_df['ratio(%)'].cumsum()
target_cnt_df
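
The same table can be built a little more directly with `value_counts(normalize=True)`. This is a minimal sketch on a made-up label series (`toy_target` is hypothetical), not the competition data.

```python
import pandas as pd

# Hypothetical labels standing in for train['target']
toy_target = pd.Series(['Class_2'] * 5 + ['Class_1'] * 3 + ['Class_3'] * 2)

# value_counts sorts by frequency (descending) by default
ratio = toy_target.value_counts(normalize=True) * 100
summary = pd.DataFrame({'ratio(%)': ratio, 'cumulative(%)': ratio.cumsum()})
print(summary)
```

Reading down the cumulative column shows how many of the largest classes are needed to cover a given share of the rows.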

There are 9 classes in total, and the two largest classes each account for about 25% of the data.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 6))

y = np.array([train[f'feature_{i}'].nunique() for i in range(75)])
y2 = np.array([test[f'feature_{i}'].nunique() for i in range(75)])
comp = y-y2


ax.bar(range(75), y2, alpha=0.7, color='lightgray', label='Test Dataset')
ax.bar(range(75),  comp*(comp>0), bottom=y2, color='#336666', alpha=0.7, label='Train > Test')
ax.bar(range(75), comp*(comp<0), bottom=y2-comp*(comp<0), color='#e3120b', alpha=0.7, label='Train < Test')

ax.set_yticks(range(0, 110, 5))
ax.set_xticks(range(0, 75, 5))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('# of Features Unique Values (Train/Test)', loc='left', fontweight='bold')
ax.legend()
plt.show()

Please check these features, because the number of unique values differs between train and test.

In [None]:
pd.DataFrame(data={'feature' : np.arange(75)[comp>0], 
              'delta' : comp[comp>0]}, index=None)

For these features, train contains more unique values than test. Handling them will be another issue in this competition.
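
One way to follow up on the differing counts is to list, per feature, the values that occur in only one of the two datasets. The sketch below does this with plain set differences; `toy_train`, `toy_test`, and `unseen_values` are illustrative names, and the data is made up.

```python
import pandas as pd

def unseen_values(a: pd.DataFrame, b: pd.DataFrame) -> dict:
    """Map each shared column to the sorted values present in `a` but not in `b`."""
    out = {}
    for col in a.columns.intersection(b.columns):
        diff = set(a[col].unique().tolist()) - set(b[col].unique().tolist())
        if diff:
            out[col] = sorted(diff)
    return out

# Toy frames standing in for train and test (hypothetical values)
toy_train = pd.DataFrame({'feature_0': [0, 1, 2, 7], 'feature_1': [0, 0, 1, 1]})
toy_test = pd.DataFrame({'feature_0': [0, 1, 2, 2], 'feature_1': [0, 1, 3, 1]})

only_in_train = unseen_values(toy_train, toy_test)
only_in_test = unseen_values(toy_test, toy_train)
print(only_in_train, only_in_test)
```

Checking both directions matters: a value that appears only in test is one the model never saw during training.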

## Statistical comparison

Let's sort the feature means by the test dataset and compare them with the train means.

In [None]:
test_mean = test.mean().sort_values()

# Align train means to the test ordering by feature name; target is not a feature
train_mean = train.drop(columns='target').mean()[test_mean.index]

comp = train_mean-test_mean

fig, ax = plt.subplots(1, 1, figsize=(15, 6))

ax.bar(range(75), test_mean, alpha=0.7, color='lightgray', label='Test Dataset')
ax.bar(range(75),  comp*(comp>0), bottom=test_mean, color='#336666', alpha=0.7, label='Train > Test')
ax.bar(range(75), comp*(comp<0), bottom=test_mean-comp*(comp<0), color='#e3120b', alpha=0.7, label='Train < Test')

ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('Mean of Features (Sorted by Test)', loc='left', fontweight='bold')
ax.legend()
plt.show()

The overall mean of train and test appears to be similar.
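
To back the visual impression with numbers, the gap between train and test means can be tabulated directly. This is a minimal sketch on made-up data; `toy_train` and `toy_test` are hypothetical stand-ins for the real frames, with `target` dropped before averaging.

```python
import pandas as pd

# Toy frames standing in for train and test (hypothetical values)
toy_train = pd.DataFrame({'feature_0': [0, 2, 4], 'feature_1': [1, 1, 1],
                          'target': ['Class_1', 'Class_2', 'Class_1']})
toy_test = pd.DataFrame({'feature_0': [1, 2, 3], 'feature_1': [0, 1, 5]})

train_means = toy_train.drop(columns='target').mean()
test_means = toy_test.mean()

# Index alignment matches features by name, so column order does not matter
mean_gap = (train_means - test_means).abs().sort_values(ascending=False)
print(mean_gap)
```

Features at the top of `mean_gap` are the ones where train and test differ most on average.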

Let's check whether the pattern of means differs by class.

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 12))

train_mean = train.drop(columns='target').mean().sort_values()
axes = axes.flatten()

for idx, target_feature in enumerate(sorted(train['target'].unique())):
    # Reorder the class means by feature name to match the sorted overall means
    sub_mean = train[train['target']==target_feature].drop(columns='target').mean()[train_mean.index]
    comp = train_mean-sub_mean

    ax = axes[idx]
    ax.bar(range(75), sub_mean, alpha=0.7, color='lightgray')
    ax.bar(range(75),  comp*(comp>0), bottom=sub_mean, color='#336666', alpha=0.7, label=f'Train > {target_feature}')
    ax.bar(range(75), comp*(comp<0), bottom=sub_mean-comp*(comp<0), color='#e3120b', alpha=0.7, label=f'Train < {target_feature}')

    ax.margins(0.01)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(f'{target_feature}', fontweight='bold', loc='left', bbox=dict(boxstyle='round', fc="#efe8d1", ec="k"))
    ax.spines['left'].set_visible(False)
    ax.legend(fontsize=8)

fig.supxlabel('Average by feature (by class)', ha='center', fontweight='bold')
plt.show()

This is the first time I have split the feature means by class, and the classes clearly show different tendencies. I hope the model can learn this well.

In [None]:
fig, axes = plt.subplots(15, 5, figsize=(10, 20))

# Per-class mean of every feature (rows: classes, columns: features)
mean = train.groupby('target').mean().sort_index()

for idx, ax in zip(range(75), axes.flatten()):
    ax.bar(mean[f'feature_{idx}'].index, mean[f'feature_{idx}'], 
           color=['#efe8d1' if i%2==0 else '#acc8d4' for i in range(9)],
           edgecolor='black', 
           linewidth=0.4,
           width=0.6)
    
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.margins(0.1)
    ax.spines['left'].set_visible(False)
    ax.set_title(f'Feature_{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Average by class (by feature)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()

For many features, the mean differs noticeably from class to class.
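
One rough way to put a number on this is the spread of the per-class means for each feature, for example the range (max minus min) across classes. This ranking is only an illustrative choice, not an established metric for this competition, and the data below is made up.

```python
import pandas as pd

# Toy frame standing in for train (hypothetical values)
toy = pd.DataFrame({
    'feature_0': [0, 0, 4, 6],
    'feature_1': [1, 1, 1, 1],
    'target': ['Class_1', 'Class_1', 'Class_2', 'Class_2'],
})

class_mean = toy.groupby('target').mean()  # rows: classes, columns: features
spread = (class_mean.max() - class_mean.min()).sort_values(ascending=False)
print(spread)
```

A feature with a large spread has class means that sit far apart, which is what the small multiples above show visually.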

## Distribution by each feature

Most values are 0, so the visualization focuses on the distribution of zeros.

The actual ratio of zeros per feature looks like this:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 6))

ax.bar(range(75), 100, linewidth=0.2, edgecolor='black', alpha=0.2, color='lightgray')
ax.bar(range(75), ((train == 0).sum() / len(train)*100)[:-1].sort_values(), linewidth=0.2, edgecolor='black', alpha=1, color='#244747')

ax.set_ylim(0, 100)
ax.set_yticks(range(0, 100, 10))

ax.set_xticks(range(0, 75, 5))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', linewidth=0.2, zorder=5)
ax.set_title('Ratio of Zeros (Sorted)', loc='center', fontweight='bold')
ax.set_ylabel('ratio(%)', fontsize=12)
plt.show()

- The features with the fewest zeros are about 20% zeros, but many features are close to 80%.
- Although not visualized here, the ratio of zeros in test is very similar to that in train.
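
The train/test comparison mentioned in the second point can be checked with a small table. The sketch below uses made-up frames (`toy_train` and `toy_test` are hypothetical); with the real data you would substitute `train` and `test`.

```python
import pandas as pd

# Toy frames standing in for train and test (hypothetical values)
toy_train = pd.DataFrame({'feature_0': [0, 0, 0, 5], 'feature_1': [0, 1, 2, 3],
                          'target': ['Class_1', 'Class_2', 'Class_1', 'Class_2']})
toy_test = pd.DataFrame({'feature_0': [0, 0, 1, 0], 'feature_1': [0, 0, 2, 3]})

zero_ratio = pd.DataFrame({
    'train(%)': (toy_train.drop(columns='target') == 0).mean() * 100,
    'test(%)': (toy_test == 0).mean() * 100,
})
zero_ratio['diff'] = zero_ratio['train(%)'] - zero_ratio['test(%)']
print(zero_ratio)
```

If the `diff` column stays close to zero for every feature, the zero distributions of the two datasets agree.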

## Dimension Reduction with Sampling

In [None]:
label_dict = {val:idx for idx, val in enumerate(sorted(train['target'].unique()))}
train['target'] = train['target'].map(label_dict)

In [None]:
%%time
from umap import UMAP

train_sub = train.sample(50000, random_state=72)
target = train_sub['target']
umap = UMAP(random_state=0)
dr = umap.fit_transform(train_sub.iloc[:,:-1], target)

### Class

In [None]:
# https://www.kaggle.com/subinium/dark-mode-visualization-apple-version
light_palette = [
    (0, 122, 255), # Blue
    (255, 149, 0), # Orange
    (52, 199, 89), # Green
    (255, 59, 48), # Red
    (175, 82, 222),# Purple
    (255, 45, 85), # Pink
    (88, 86, 214), # Indigo
    (90, 200, 250),# Teal
    (255, 204, 0)  # Yellow
]
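
The plotting cells below turn these RGB tuples into hex strings with `'#%02x%02x%02x' % rgb`. A small helper (the name `rgb_to_hex` is hypothetical) makes that conversion explicit:

```python
def rgb_to_hex(rgb):
    """Convert an (R, G, B) tuple of 0-255 integers to a '#rrggbb' string."""
    r, g, b = rgb
    return f'#{r:02x}{g:02x}{b:02x}'

blue = rgb_to_hex((0, 122, 255))
orange = rgb_to_hex((255, 149, 0))
print(blue, orange)
```

matplotlib accepts such hex strings anywhere a color is expected, which is why the tuples are converted before being passed to `scatter`.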

In [None]:
fig = plt.figure(figsize=(20, 20))
gs = fig.add_gridspec(10, 9)
ax = fig.add_subplot(gs[:-1,:])

sub_axes = [None] * 9
for idx in range(9): 
    sub_axes[idx] = fig.add_subplot(gs[-1,idx])

for idx in range(9):
    ax.scatter(x=dr[:,0][target==idx], y=dr[:,1][target==idx],
              s=10, alpha=0.2
              )

    for j in range(9):
        sub_axes[j].scatter(x=dr[:,0][target==idx], y=dr[:,1][target==idx],
                            s=10, alpha = 0.4 if idx==j else 0.008, 
                            color = '#%02x%02x%02x' % light_palette[j] if idx==j else 'gray',
                            zorder=(idx==j)
                           )
        
    
    sub_axes[idx].set_xticks([])
    sub_axes[idx].set_yticks([])
    sub_axes[idx].set_xlabel('')
    sub_axes[idx].set_ylabel('')
    sub_axes[idx].set_title(f'Class_{idx+1}')
    sub_axes[idx].spines['right'].set_visible(True)
    sub_axes[idx].spines['top'].set_visible(True)

ax.set_title('Dimension Reduction (UMAP)', fontweight='bold', fontfamily='serif', fontsize=20, loc='left')
    
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlabel('')
ax.set_ylabel('')
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

fig.tight_layout()
plt.show()

In [None]:
%%time
# Sample from test (not train) so the projection compares the two datasets
test_sub = test.sample(50000, random_state=72)
dr_test = umap.transform(test_sub)

In [None]:
import matplotlib.patches as mpatches


fig, ax = plt.subplots(1, 1, figsize=(20, 20))
ax.scatter(x=dr[:,0], y=dr[:,1], 
           color = '#%02x%02x%02x' % light_palette[0],
           s=10, alpha=0.3, label='Train')
ax.scatter(x=dr_test[:,0], y=dr_test[:,1], 
           color = '#%02x%02x%02x' % light_palette[1],
           s=10, alpha=0.3, label='Test')


ax.set_title('Dimension Reduction Compare (Train/Test)', fontweight='bold', fontfamily='serif', fontsize=20, loc='left')
    
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlabel('')
ax.set_ylabel('')
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

train_dot = mpatches.Patch(color='#%02x%02x%02x' % light_palette[0], label='Train')
test_dot = mpatches.Patch(color='#%02x%02x%02x' % light_palette[1], label='Test')
ax.legend(handles=[train_dot, test_dot], loc='lower center', ncol = 2, fontsize=15)
fig.tight_layout()
plt.show()

## Reference

- color reference : https://pattern-library.economist.com/color.html

## If the content is helpful, please upvote. :)