# Tabular Playground Series - May 2021

The Tabular Playground Series is a great way to practice ML skills on relatively easy datasets.
In the May 2021 competition, the dataset is based on original data that was used to predict the category of an eCommerce product from various attributes of its listing. The dataset is synthetic and was generated using CTGAN, a conditional GAN for generating synthetic tabular data.

The features in this dataset are anonymized but keep properties related to real-world features.

In this notebook, the focus is on data understanding and EDA.

- In the first part, Manual EDA, matplotlib, seaborn and pandas are used.

- In the second part, Auto EDA, [dataprep](https://pypi.org/project/dataprep/) is used to perform EDA.

Modelling will be covered in the next notebook.

## Part 1 : Manual EDA - using matplotlib, seaborn and pandas

Importing libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

In [None]:
train_data = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv")
test_data = pd.read_csv("../input/tabular-playground-series-may-2021/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-may-2021/sample_submission.csv")

In [None]:
train_data.describe().T

In [None]:
train_data.head()

In [None]:
train_data.info()

From the above information, it seems that none of the rows or features has null values.
Let's confirm it.

In [None]:
print('Missing values per column in train dataset')
for col in train_data.columns:
    temp_col = train_data[col].isnull().sum()
    print(f'{col}: {temp_col}')
print()
print('Missing values per column in test dataset')
for col in test_data.columns:
    temp_col = test_data[col].isnull().sum()
    print(f'{col}: {temp_col}')

So there is no null data. Let's visualize the data and perform EDA.
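The per-column loop above can also be condensed into a single table. The sketch below is one possible way to do it. `missing_summary` is a hypothetical helper name, and the small `toy` frame is a made-up stand-in for `train_data`, used only so that the snippet runs on its own.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    counts = df.isnull().sum()
    return pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})


# Made-up stand-in for train_data (assumption: column names follow this pattern).
toy = pd.DataFrame(
    {
        "feature_0": [0, 1, None, 3],
        "feature_1": [5, 5, 2, 0],
        "target": ["Class_1", "Class_2", "Class_2", "Class_4"],
    }
)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train_data)` and `missing_summary(test_data)` should show zeros in every row if both datasets are complete.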

The dataset has 50 features, plus an ID column and a target column.

### Correlation between all the columns

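Let's compute it directly with pandas. Only the numeric columns are selected, because newer pandas versions raise an error when `corr()` meets a non-numeric column such as `target`.

In [None]:
train_data.select_dtypes(include='number').corr()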

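Let's plot the correlations using seaborn.

In [None]:
# Correlate numeric columns only; the string-valued target is excluded.
corrMatrix = train_data.select_dtypes(include='number').corr()
plt.figure(figsize=(30, 30))
sns.heatmap(corrMatrix, annot=False, cmap='Blues')

plt.show()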

From the above heatmap, it's clear that there is no strong linear correlation among the features.
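A 50 x 50 heatmap is hard to read precisely, so it can help to back the visual impression with numbers. The following is a minimal sketch that lists the feature pairs with the largest absolute Pearson correlation. `top_correlations` is a hypothetical helper, and the `toy` frame is made-up data in which one pair is deliberately correlated.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Made-up data: feature_1 is built from feature_0, feature_2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "feature_0": base,
        "feature_1": 2 * base + 0.1 * rng.normal(size=200),
        "feature_2": rng.normal(size=200),
        "target": ["Class_1", "Class_2"] * 100,
    }
)

top = top_correlations(toy, n=3)
print(top)
```

On the competition data, `top_correlations(train_data)` would be expected to return only small values if the heatmap impression holds.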

Let's check the details of the target column.

In [None]:
train_data.target.describe()

In [None]:
train_data.target.unique()

There are four unique classes in the target column.
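Besides the number of classes, it is often useful to know how balanced they are. The sketch below shows one way to get counts and shares per class with `value_counts`. The `toy_target` series is made up and only stands in for `train_data["target"]`.

```python
import pandas as pd

# Made-up labels standing in for train_data["target"].
toy_target = pd.Series(
    ["Class_2", "Class_1", "Class_2", "Class_3", "Class_2", "Class_4"], name="target"
)

# Absolute counts and relative shares per class, ordered by class label.
balance = pd.DataFrame(
    {
        "count": toy_target.value_counts(),
        "share": toy_target.value_counts(normalize=True),
    }
).sort_index()
print(balance)
```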

Let's check the number of unique values of each feature.

In [None]:
print("Feature", "Unique Count")
for column in train_data.columns:
    print(column, len(train_data[column].unique()))
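The same information can be collected into a sortable table, which makes low-cardinality and high-cardinality columns easier to spot. This is only a sketch: `cardinality_table` is a hypothetical helper, and the `toy` frame is made-up data. Note that `nunique()` ignores missing values, which makes no difference here since none were found.

```python
import pandas as pd


def cardinality_table(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column by dtype and number of distinct (non-null) values."""
    table = pd.DataFrame({"dtype": df.dtypes.astype(str), "n_unique": df.nunique()})
    return table.sort_values("n_unique", ascending=False)


# Made-up stand-in for train_data.
toy = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "feature_0": [0, 0, 1, 0],
        "feature_1": [3, 7, 7, 9],
        "target": ["Class_1", "Class_2", "Class_2", "Class_4"],
    }
)

table = cardinality_table(toy)
print(table)
```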

We will now create a bar plot for each feature, showing its mean value in every target class.

In [None]:
fig, axes = plt.subplots(10, 5, figsize=(12, 17))
axes = axes.flatten()
fig.suptitle('Variations in Features')

# Columns 1..50 are the features; column 0 is the ID and the last column is the target.
for i in range(50):
    g = sns.barplot(ax=axes[i], y=train_data.columns[i+1], x='target', data=train_data)
    g.tick_params(axis='x', labelrotation=75)

# Apply the layout after drawing so that the rotated labels are taken into account.
fig.tight_layout(pad=3.0)

From the above bar plots, it seems that each feature has a similar mean in all four target classes.
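The bar heights can also be read off as numbers. The sketch below computes the mean of every numeric feature per class and the spread between the largest and smallest class mean. `class_means` is a hypothetical helper, and the `toy` frame is made-up data, not the competition data.

```python
import pandas as pd


def class_means(df: pd.DataFrame, target: str = "target") -> pd.DataFrame:
    """Mean of every numeric feature within each target class."""
    numeric = [c for c in df.select_dtypes(include="number").columns if c != target]
    return df.groupby(target)[numeric].mean()


# Made-up stand-in for train_data.
toy = pd.DataFrame(
    {
        "feature_0": [0, 2, 1, 3],
        "feature_1": [4, 4, 6, 2],
        "target": ["Class_1", "Class_1", "Class_2", "Class_2"],
    }
)

means = class_means(toy)
print(means)

# Spread of the class means per feature; values near zero mean the classes look alike.
spread = means.max() - means.min()
print(spread)
```

For the real data, the ID column would need to be dropped first, for example with `class_means(train_data.drop(columns="id"))`, assuming the column is named `id`.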

## Part 2 : Auto EDA - using [dataprep](https://pypi.org/project/dataprep/)

In this section, we will use DataPrep.EDA. It requires just a few lines of code to perform EDA.

Installing the necessary library.

In [None]:
!pip install -U dataprep >./temp

In [None]:
from dataprep.eda import create_report

Now, let's create the EDA report.

In [None]:
create_report(train_data).show()

### Next:

In the next part, I will attempt the modelling.

If you like this notebook, kindly upvote.