## [TPS-Oct] Simple EDA

It's surprising that only 3 months of 2021 are left.

I hope I can enjoy it with all the Kagglers this time too.

---

### Summary

- There are no missing values in either the train or the test dataset.
- The train set has 1,000,000 rows, and the test set has 500,000 rows.
- Almost all of `f0`~`f241` are continuous features (240).
- **`f22`, `f43`**, `f242`~`f284` are binary features (45).
- The target is 0 or 1.
- The two target classes are almost evenly split.

## Import Library

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 

Since there are many columns, configure matplotlib and pandas up front so that EDA is more convenient (for example, so that pandas displays every row and column).

In [None]:
# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

# pandas setting
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
train = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')

Since the dataset is large, it would be better to reduce the time spent on file input/output by using a library such as `datatable`.

- [datatable](https://github.com/h2oai/datatable)
    - example notebook : [@SRK : Getting started with Python datatable](https://www.kaggle.com/sudalairajkumar/getting-started-with-python-datatable)
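
As a rough sketch of that idea, the helper below tries `datatable.fread` and falls back to `pd.read_csv` when the library is not installed. The name `read_csv_fast` is hypothetical and not part of either library.

```python
import pandas as pd


def read_csv_fast(path):
    """Read a CSV into a pandas DataFrame, using datatable when available.

    `read_csv_fast` is a hypothetical helper name used for illustration.
    """
    try:
        import datatable as dt  # optional dependency
    except ImportError:
        # Fall back to plain pandas if datatable is not installed
        return pd.read_csv(path)
    # fread parses the file with datatable; to_pandas() converts the Frame
    return dt.fread(path).to_pandas()
```

It could be used as `train = read_csv_fast('../input/tabular-playground-series-oct-2021/train.csv')`. Note that datatable may infer different dtypes than pandas (for example for 0/1 columns), so any dtype-based check is worth re-running after switching readers.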

In [None]:
print(train.shape)
print(test.shape)

train.head()
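
The summary states that neither dataset has missing values. A minimal way to confirm this is to count the `NaN` cells. The sketch below runs on a small stand-in frame; `toy` and the helper `count_missing` are made up for illustration.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the total number of missing cells in a DataFrame."""
    return int(df.isna().sum().sum())


# Toy stand-in for train/test; the real check would be count_missing(train)
toy = pd.DataFrame({'f0': [0.1, np.nan, 0.3], 'f1': [1, 0, 1]})
print(count_missing(toy))
```

On the real data, `count_missing(train)` and `count_missing(test)` should both return 0 if the summary holds.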

Some features are continuous real numbers, others are binary. Let's take a look.

In [None]:
for col in train.columns:
    if 'int' in str(train[col].dtype):
        print(col, end=' ')
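
An integer dtype alone does not prove that a column is binary. One way to double-check, sketched here on made-up data with a hypothetical helper `binary_columns`, is to test whether every value in a column is 0 or 1.

```python
import pandas as pd


def binary_columns(df):
    """Return the names of columns whose values are all 0 or 1."""
    return [col for col in df.columns if df[col].isin([0, 1]).all()]


# Toy frame for illustration only; column names mimic the competition data
toy = pd.DataFrame({
    'f0': [0.25, 0.50, 0.75],  # continuous
    'f22': [0, 1, 1],          # binary
    'f242': [1, 0, 0],         # binary
})
print(binary_columns(toy))
```

Applied to the real data, `binary_columns(train.drop(columns='target'))` would list the candidate binary features.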

## Feature Statistics

In [None]:
train.loc[:, 'f0':'f284'].describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                            .background_gradient(subset=['50%'], cmap='coolwarm')

In [None]:
train['target'].value_counts()

The two target classes are almost evenly split.
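
Raw counts are hard to compare at a glance; `value_counts(normalize=True)` returns class proportions instead. The toy series below is invented for illustration and does not reflect the actual competition counts.

```python
import pandas as pd

# Made-up target values, four of each class
target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='target')

# normalize=True converts counts to fractions that sum to 1
proportions = target.value_counts(normalize=True).sort_index()
print(proportions)
```

On the real data the equivalent call would be `train['target'].value_counts(normalize=True)`.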

## Feature Distribution

If you have too much data, it's a good idea to sample and visualize the approximate distribution first.

In [None]:
np.random.seed(2110)
train = train.sample(10000)
test = test.sample(10000)
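
`DataFrame.sample` draws from NumPy's global random state when `random_state` is not given, which is why the seed above matters. An alternative sketch, assuming you prefer not to depend on global state, is to pass `random_state` directly and keep the full frame under its own name. The names `toy` and `toy_sample` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full training data
toy = pd.DataFrame({'f0': np.arange(100), 'f1': np.arange(100) * 0.5})

# Passing random_state makes the draw repeatable without touching global state
toy_sample = toy.sample(10, random_state=2110)
toy_sample_again = toy.sample(10, random_state=2110)

print(toy_sample.index.equals(toy_sample_again.index))
```

Keeping the sample in a separate variable also leaves the full data available for later steps.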

> Some unusual features are highlighted in red.

In [None]:
features = "f2 f10 f23 f44 f68 f70  f73 f98 f104 f105 f120 f138 f139 f141 f142 f143 f146 f157 f158 f175 f187 f188 f200 f222 f239".split()

In [None]:
fig, axes = plt.subplots(11,11,figsize=(14, 14))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    if idx in [22, 43]:
        ax.axis("off")
        continue
    sns.kdeplot(data=train, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    sns.kdeplot(data=test, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    if f'f{idx}' in features:
        ax.spines[:].set_visible(True)
        ax.spines[:].set_color('red')
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Feature Distribution (by feature f0-f120)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(11,11,figsize=(14, 14))
axes = axes.flatten()

for idx, ax in enumerate(axes, 121):
    # The binary features f22 and f43 are outside the f121-f241 range, so nothing is skipped here
    sns.kdeplot(data=train, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    sns.kdeplot(data=test, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    if f'f{idx}' in features:
        ax.spines[:].set_visible(True)
        ax.spines[:].set_color('red')
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Feature Distribution (by feature f121-f241)', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()

- The train and test distributions are similar.
- The features follow various types of distributions. Think about what the reason might be.
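
To put a number on the visual impression that train and test are similar, one option is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs. This is a suggested check rather than something the notebook computes. The sketch uses only NumPy, and `ks_statistic` is a hypothetical helper name.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Synthetic data for illustration only
rng = np.random.default_rng(2110)
same = ks_statistic(rng.normal(size=5000), rng.normal(size=5000))
shifted = ks_statistic(rng.normal(size=5000), rng.normal(loc=1.0, size=5000))
print(f'same distribution: {same:.3f}, shifted: {shifted:.3f}')
```

Values close to 0 suggest similar distributions. On the real data it could be applied per feature, for example `ks_statistic(train['f0'], test['f0'])`.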

## Binary Feature



In [None]:
binary_mean = train.loc[:,['f22', 'f43']+[f'f{i}' for i in range(242, 285)]].mean()

fig, ax = plt.subplots(1, 1, figsize=(15, 6))

ax.bar(binary_mean.index, binary_mean, linewidth=0.2, edgecolor='black', alpha=1, color='#244747')

ax.set_ylim(0, 1)
ax.set_xticks(range(0, 44, 4))
ax.margins(0.01)
ax.grid(axis='y', linestyle='--', linewidth=0.2, zorder=5)
ax.set_title('Mean of binary features', loc='center', fontweight='bold')
# No legend: the bar chart has no labeled artists to show
plt.show()