## [TPS-AUG] Simple EDA

This competition has a large number of features.

Since it is a playground contest, I kept the EDA light.

- There are no missing values.
- There are 100 continuous numerical features.
- The target variable `loss` ranges from 0 to 42, for a total of 43 discrete values. However, this is a regression problem, so it is fine to submit decimal values.
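
The checks behind these three points can be sketched as follows. This is a minimal illustration on a small synthetic frame (`demo` is a hypothetical stand-in for `train`, not the competition data), so the numbers it prints are not the competition's.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`: 3 continuous features plus an integer target.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
demo['loss'] = rng.integers(0, 43, size=1000)

# Total number of missing cells across the whole frame.
n_missing = int(demo.isna().sum().sum())

# Range and number of distinct values of the target.
loss_min, loss_max = int(demo['loss'].min()), int(demo['loss'].max())
n_loss_values = demo['loss'].nunique()

print(f'missing cells: {n_missing}')
print(f'loss range: {loss_min}..{loss_max} ({n_loss_values} distinct values)')
```

On the real data the same three lines can be run on `train` directly.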

## Load Data & Library

Let's load the basic libraries and the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 

In [None]:
# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

In [None]:
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

In [None]:
print(f'Train Shape :  {train.shape}')
print(f'Test Shape :  {test.shape}')

In [None]:
target = train['loss']
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)

In [None]:
train.head()

In [None]:
train.info()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(17, 8))

target_cnt = train['loss'].value_counts().sort_index()

# one alternating color per bar (the color list matches the number of loss values)
ax.bar(target_cnt.index, target_cnt, color=['#d4dddd' if i%2==0 else '#fafafa' for i in range(len(target_cnt))],
       width=0.55, 
       edgecolor='black', 
       linewidth=0.7)

ax.margins(0.02, 0.05)

# annotate only loss values 0..19 with their share of rows (in %)
for i in range(20):
    ax.annotate(f'{target_cnt[i]/len(train)*100:.3}', xy=(i, target_cnt[i]+1000),
                   va='center', ha='center',
               )

ax.set_title('Target Distribution', weight='bold', fontsize=15)
ax.grid(axis='y', linestyle='-', alpha=0.4)

fig.tight_layout()
plt.show()

- The loss takes 43 distinct values in total.
- The 12 most frequent values account for about 80% of the rows.
- Sorted by frequency, the loss values appear in increasing order, except that 2 is more frequent than 1.

In [None]:
# counts per loss value, with share and cumulative share of rows
target_cnt_df = target_cnt.to_frame('count')
target_cnt_df['ratio(%)'] = target_cnt_df['count'] / target_cnt_df['count'].sum() * 100
target_cnt_df.sort_values('ratio(%)', ascending=False, inplace=True)
target_cnt_df['cumulative_sum(%)'] = target_cnt_df['ratio(%)'].cumsum()
target_cnt_df.style.bar(subset=['cumulative_sum(%)'], color='#205ff2')

## Statistics Check

The features are on very different scales.

This doesn't matter if you use a tree-based model, but for most other models scaling is essential.

In [None]:
train.describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                            .background_gradient(subset=['50%'], cmap='coolwarm')

- Some features contain only integer values.
- The value ranges differ widely between features.

## Discrete Features

Some features turn out to contain only integer values.

In [None]:
discrete_features = []

# check only the features; the integer target `loss` would otherwise be counted too
for col in train.columns.drop('loss'):
    if np.array_equal(train[col].values, train[col].values.astype(int)):
        discrete_features.append(col)

print(f'Total {len(discrete_features)} : ')
print(discrete_features)

A total of 6 features contain only integer values.

- `f1`
- `f16`
- `f27`
- `f55`
- `f60`
- `f86`


In [None]:
for dcol in discrete_features:
    print(f'{dcol} unique value : {train[dcol].nunique()}')

With 250000 rows in total, `f16` and `f60` take a different value on most rows, so they can be treated as continuous. The remaining `f1`, `f27`, `f55`, and `f86` have far fewer unique values and look relatively categorical.
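
One rough way to make "looks categorical" concrete is the share of distinct values per column. The sketch below uses a synthetic frame with hypothetical column names, and the 0.05 cut-off is an arbitrary illustration rather than a tuned value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical integer-valued columns standing in for the discrete features.
demo = pd.DataFrame({
    'many_values': rng.integers(0, 1_000_000, size=n),  # nearly every row is unique
    'few_values': rng.integers(0, 50, size=n),          # only a handful of levels
})

# Share of distinct values per column: close to 1 means almost continuous.
unique_ratio = demo.nunique() / len(demo)

# Columns below the (arbitrary) cut-off are flagged as categorical-like.
categorical_like = unique_ratio[unique_ratio < 0.05].index.tolist()

print(unique_ratio)
print(categorical_like)
```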

Let's look at `f1` and `f86`, which have a small number of unique values.
To see their relationship with the loss, we group by the feature and take the mean loss.

In [None]:
f1_loss = train.groupby(['f1'])['loss'].mean().sort_values()
print((f1_loss==0).sum())

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))

ax.bar(range(len(f1_loss)), f1_loss, alpha=0.7, color='lightgray', label='Mean loss (train)')
ax.set_yticks(range(0, 20, 3))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('Average of loss grouped by f1', loc='left', fontweight='bold')
ax.legend()
plt.show()

- The mean loss varies considerably with the value of `f1`.
- For 5 values of `f1`, the loss is 0 on every row.
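
A group with a mean loss of 0 is less striking if it only holds a few rows, so it can help to look at the group size next to the mean. This is a minimal sketch on made-up values (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: a categorical-like feature and a non-negative loss.
demo = pd.DataFrame({
    'f1':   [1, 1, 1, 2, 3, 3, 4],
    'loss': [0, 3, 6, 0, 0, 0, 5],
})

# Mean loss and number of rows for each value of the feature.
stats = demo.groupby('f1')['loss'].agg(['mean', 'count'])

# Since the loss is non-negative, a mean of 0 means every row in the group is 0.
zero_groups = stats[stats['mean'] == 0]

print(zero_groups)
```

On the real data, replacing `demo` with `train` shows whether the all-zero groups are large enough to matter.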

In [None]:
f86_loss = train.groupby(['f86'])['loss'].mean().sort_values()
print((f86_loss==0).sum())

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 6))

ax.bar(range(len(f86_loss)), f86_loss, alpha=0.7, color='lightgray', label='Mean loss (train)')
ax.set_yticks(range(0, 20, 3))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('Average of loss grouped by f86', loc='left', fontweight='bold')
ax.legend()
plt.show()

- There is an imbalance here as well, though less pronounced than for `f1`.

## Scaling

Unless you use a tree-based model, you need to scale the data.

Before the next visualizations, we standardize the features so that they share a common scale.

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
features = [f'f{i}' for i in range(100)]
train[features] = ss.fit_transform(train[features])
test[features] = ss.transform(test[features])

## Target & Feature Relation

As the target value increases, the per-group feature means move further away from zero.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
sns.heatmap(train.groupby('loss').mean().sort_index(),
            square=True, center=0, linewidth=1,
            cmap=sns.diverging_palette(240, 10, as_cmap=True),
            cbar=False, 
           )

ax.set_title('Mean : Group by Target(Loss)',loc='left')
plt.show()

Let's narrow the color range to make the differences easier to see.

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12, 7))
sns.heatmap(train.groupby('loss').mean().sort_index(),
            square=True, vmin=-0.5, vmax=0.5, center=0, linewidth=1,
            cmap=sns.diverging_palette(240, 10, as_cmap=True),
            cbar=False, 
           )

ax.set_title('Mean : Group by Target(Loss)',loc='left')
plt.show()

Some features seem to show an increasing trend with the loss, and others a decreasing one.
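
To put a number on these trends instead of reading them off the heatmap, one option is to correlate each feature's per-loss mean with the loss level itself. The sketch below does this on synthetic data; the `up` and `down` columns are hypothetical features built to rise and fall with the loss.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000
loss = rng.integers(0, 10, size=n)

# Hypothetical features: one whose mean rises with the loss, one whose mean falls.
demo = pd.DataFrame({
    'up': 0.1 * loss + rng.normal(size=n),
    'down': -0.1 * loss + rng.normal(size=n),
    'loss': loss,
})

# Same table as the heatmap: mean of every feature for each loss value.
group_means = demo.groupby('loss').mean().sort_index()

# Correlate each column of group means with the loss level it belongs to.
loss_levels = pd.Series(group_means.index, index=group_means.index, dtype=float)
trend = group_means.corrwith(loss_levels)

print(trend.sort_values())
```

Values near +1 indicate an increasing trend, values near -1 a decreasing one, and values near 0 no clear monotonic pattern.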

## Feature Distribution

In [None]:
fig, axes = plt.subplots(10,10,figsize=(12, 12))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    sns.kdeplot(data=train, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    sns.kdeplot(data=test, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Feature distributions (train vs. test)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()
    
    

The features have been standardized, but the shapes of the distributions are still quite interesting.

It is safe to assume that the distributions of train and test are almost the same.
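
The visual impression can be backed by a simple number per feature. The sketch below computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) with plain NumPy. This is an added check, not something the plots above compute, and `ks_statistic` is a helper written for this illustration on synthetic samples.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two 1-D samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed point
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
train_col = rng.normal(size=5_000)              # stand-in for a train feature
test_same = rng.normal(size=5_000)              # same distribution
test_shifted = rng.normal(loc=1.0, size=5_000)  # shifted distribution

ks_same = ks_statistic(train_col, test_same)
ks_shifted = ks_statistic(train_col, test_shifted)

print(f'same distribution:    {ks_same:.3f}')
print(f'shifted distribution: {ks_shifted:.3f}')
```

A statistic close to 0 means the two samples are distributed alike; applying the helper to `train[col]` and `test[col]` for each feature would give one value per panel.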

## Correlation

It can be seen that most of the correlations are close to zero.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12 , 12))

corr = train.corr()

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr, ax=ax,
        square=True, center=0, linewidth=1,
        cmap=sns.diverging_palette(240, 10, as_cmap=True),
        cbar_kws={"shrink": .82},    
        mask=mask
       ) 

ax.set_title('Correlation', loc='left', fontweight='bold')

plt.show()
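
With about a hundred columns, a few stronger pairs are easy to miss in the heatmap. As a complement, the largest absolute correlations can be listed directly. This is a minimal sketch on a synthetic frame in which column `e` is deliberately built from column `a`; all column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: four independent columns plus one that nearly copies 'a'.
demo = pd.DataFrame(rng.normal(size=(2_000, 4)), columns=['a', 'b', 'c', 'd'])
demo['e'] = demo['a'] + 0.1 * rng.normal(size=2_000)

corr = demo.corr()

# Keep only the strict upper triangle so each pair appears once, without the diagonal.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and sort by absolute correlation.
pairs = upper.stack().dropna().abs().sort_values(ascending=False)

print(pairs.head(3))
```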