# Tabular Playground Series - Jun 2021 [EDA]
This notebook is based on the `TPS - Jun 2021` competition organised by Kaggle. Throughout the notebook we explore the dataset, look for patterns between the features and the target variable, and try to understand which features affect the target.

# Load the dataset
In this section, we first import all the required libraries and then load the training data into the notebook. We also check the number of columns, number of missing values and other factors to understand our dataset.

In [None]:
!pip install chart-studio --quiet

In [None]:
import pandas as pd
import numpy as np
import plotly.express as ex
import cufflinks as cf
import chart_studio.plotly as py
from plotly.offline import iplot, init_notebook_mode, plot, download_plotlyjs
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import matplotlib


init_notebook_mode(connected=True)
cf.go_offline()  # call the function; the bare name does nothing

import warnings
warnings.filterwarnings('ignore')

primary_color="#342A21"
secondary_color="#DA667B"

In [None]:
ds = pd.read_csv("../input/tabular-playground-series-jun-2021/train.csv")
ds.head()

In [None]:
test_ds = pd.read_csv("../input/tabular-playground-series-jun-2021/test.csv")
test_ds.head()

In [None]:
print(f"Number of columns: {len(ds.columns)}")
print(f"Length of the dataset: {len(ds)}")
print(f"Number of columns with missing values: {(ds.isna().sum() != 0).sum()}")

So, the dataset has no missing values and 77 columns, including the target variable. It is also a large dataset, with 200K rows. Let's dig deeper into it with some summary statistics.

# Perform Statistics
In this section, we compute summary statistics such as mean, min, max and skewness. We also visualize some of them, since the dataset is too large to understand from raw values alone.

In [None]:
ds.describe()

The min and max values of the feature columns differ widely, which suggests a high probability of outliers in the dataset. So we look at some of the columns more closely to check whether outliers are present.

In [None]:
fig = ex.box(ds[:10000], y=['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5'], title="Outliers Graph")
fig.update_traces(marker_color=secondary_color)

As expected, outliers are present in the dataset, and we need to deal with them when fitting the data to an ML model. Let's check the skewness of the dataset.
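
To put a number on what the box plot shows, one option is to count the values that fall outside the usual whisker range. The cell below is a minimal sketch on synthetic data, not the competition data: the helper `iqr_outlier_share`, the column names and the multiplier `k=1.5` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def iqr_outlier_share(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Fraction of values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on column names.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.mean()

# Synthetic stand-in for a few count-like feature columns (hypothetical names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.poisson(1.0, size=1000),
    "feature_b": rng.poisson(3.0, size=1000),
})
demo.loc[0, "feature_a"] = 60  # one injected extreme value

outlier_share = iqr_outlier_share(demo)
print(outlier_share)
```

On the real data the same helper could be applied to `ds.iloc[:, 1:-1]`, assuming the first column is the id and the last one is the target.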

In [None]:
def skew_frame(dataframe, train=True):
    # Skip the leading id column; the train set also ends with the target column.
    feature_cols = dataframe.columns[1:-1] if train else dataframe.columns[1:]
    # Compute skewness on the frame that was passed in, not on the global `ds`.
    skew_ds = pd.DataFrame({
        'columns': list(feature_cols),
        'values': [dataframe[col].skew() for col in feature_cols],
    })
    return skew_ds

train_skew = skew_frame(ds)
test_skew = skew_frame(test_ds, train=False)
    
fig = go.Figure(data=[go.Bar(x=train_skew['columns'], y=train_skew['values'], name="train", marker=dict(color=primary_color)),
                     go.Bar(x=test_skew['columns'], y=test_skew['values'], name="test", marker=dict(color=secondary_color))]) 
fig.update_layout(barmode='group', xaxis=dict(title="columns"), 
                  yaxis=dict(title="skewness"), title="Skewness Graph")

So, many of the columns are skewed rather than symmetrically distributed. We need to reduce this skewness before training an ML model.
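
One common way to reduce right skew in non-negative columns is a `log1p` transform. The cell below is only a sketch on a synthetic column, assuming the values are non-negative; whether this is the right transform for the competition features is something to verify, not a given.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed, non-negative column (not the competition data).
skewed = pd.Series(rng.exponential(scale=5.0, size=5000)).round()

skew_before = skewed.skew()
# log1p(x) = log(1 + x), so zeros stay at zero instead of becoming -inf.
skew_after = np.log1p(skewed).skew()

print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

The transform compresses the long right tail, so the skewness moves closer to zero.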

Let's perform the EDA to understand more about the dataset and find patterns in it.

# Exploratory Data Analysis
In this section, we perform EDA on the features and try to gain more insight from the data.

In [None]:
fig = ex.bar(x=ds['target'].value_counts().keys(), y=ds['target'].value_counts(), 
             labels={'x': 'targets', 'y': 'frequency'}, title="Target Graph")
fig.update_traces(marker_color=primary_color)

**So, we have an imbalanced dataset: `Class_6` and `Class_8` dominate, with more than 50K rows each.**
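
The imbalance can also be expressed as class shares, and one possible way to compensate for it later is inverse-frequency class weights. The cell below is a sketch on a tiny made-up label column; the counts are invented for illustration and the weight formula `n_samples / (n_classes * class_count)` is one heuristic among several.

```python
import pandas as pd

# Tiny made-up label column (invented counts, not the competition data).
labels = pd.Series(["Class_6"] * 5 + ["Class_8"] * 4 + ["Class_1"] * 1)

counts = labels.value_counts()
class_share = counts / len(labels)

# Inverse-frequency weights: rare classes get larger weights.
class_weight = len(labels) / (labels.nunique() * counts)

print(class_share)
print(class_weight)
```

With the real data, `labels` would simply be `ds['target']`.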

In [None]:
ds.head()

In [None]:
def unique(dataframe, train=True):
    # Skip the leading id column; the train set also ends with the target column.
    feature_cols = dataframe.columns[1:-1] if train else dataframe.columns[1:]
    return [dataframe[col].nunique() for col in feature_cols]

train_unique = unique(ds)
test_unique = unique(test_ds, train=False)
# Grouped bars keep the train and test counts side by side instead of summing them.
fig = go.Figure(data=[go.Bar(x=ds.columns[1:-1], y=train_unique, marker=dict(color=primary_color), name="train"),
                     go.Bar(x=test_ds.columns[1:], y=test_unique, marker=dict(color=secondary_color), name='test')]) 
fig.update_layout(barmode='group', xaxis=dict(title="columns"), yaxis=dict(title="unique values"), title="Unique Value Graph")

**Many columns have more than 50 unique values in both the train and the test dataset.**

In [None]:
num_row = 19
num_col = 4
feature_cols = ds.columns[1:-1]
fig, ax = plt.subplots(num_row, num_col, figsize=(20, 20))
for index, col in enumerate(feature_cols):
    i, j = divmod(index, num_col)
    counts = ds[col].value_counts()
    # Pass x and y as keywords; recent seaborn versions reject positional x/y arguments.
    sns.lineplot(x=counts.index, y=counts.values, ax=ax[i, j])
    ax[i, j].set_ylabel("")
    ax[i, j].set_title(col, fontweight='bold')
fig.tight_layout()
plt.show()

**In most of the columns, a high percentage of the values are zero.**
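
The share of zeros can also be computed directly instead of being read off the plots. The cell below is a minimal sketch on synthetic count columns; the column names and Poisson rates are assumptions chosen only to produce one sparse and one dense column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic count columns (hypothetical names, not the competition data).
demo = pd.DataFrame({
    "feature_a": rng.poisson(0.3, size=2000),  # mostly zeros
    "feature_b": rng.poisson(2.0, size=2000),  # fewer zeros
})

# Mean of a boolean frame gives the fraction of True values per column.
zero_share = (demo == 0).mean().sort_values(ascending=False)
print(zero_share)
```

On the real data the same expression could be applied to `ds.iloc[:, 1:-1]`, assuming that slice holds only the feature columns.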

In [None]:
# Correlate the feature columns only: drop the id column and the non-numeric target.
# (`ds[1:]` would slice rows, not columns.)
corr = ds.iloc[:, 1:-1].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(30, 20))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", [primary_color, secondary_color])
sns.heatmap(corr, mask=mask, cmap=cmap, linewidths=0.2)
plt.title('Features Heat Map', fontweight='bold', fontsize=24);

Some columns are correlated with each other, while others have a very low correlation with every other column. Before training the model we need to perform feature selection, so we should try different selection approaches and keep the one that works best.
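
With this many columns, a heat map is hard to read pair by pair, so it can help to list the most correlated pairs explicitly. The cell below is a sketch on synthetic data with hypothetical column names; it keeps the upper triangle of the absolute correlation matrix and sorts the remaining pairs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=500)

# Synthetic features (hypothetical names, not the competition data).
demo = pd.DataFrame({
    "feature_a": base,
    "feature_b": base + rng.normal(scale=0.1, size=500),  # strongly tied to feature_a
    "feature_c": rng.normal(size=500),                    # independent of the others
})

corr_abs = demo.corr().abs()
# Keep each pair once: upper triangle without the diagonal.
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)
print(top_pairs)
```

Pairs near the top of `top_pairs` are candidates for dropping one of the two columns, though that decision should be checked against model performance.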

**Other EDA can be performed to understand the dataset more clearly. All suggestions are welcome in the comment section. Please upvote if you find this useful.**