# Loading Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')
FILE_PATH='/kaggle/input/tabular-playground-series-may-2021/'

# Introduction

The Tabular Playground Series is hosted by Kaggle and is more approachable than Kaggle's regular Featured competitions. The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. They are a good fit for people looking for something in between the Titanic Getting Started competition and a Featured competition.


## Let's talk about data!

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a **CTGAN**. The original dataset deals with **predicting the category of an eCommerce product** given various attributes of the listing. Let's check them!

In [None]:
# reading the training and test data
train_data=pd.read_csv(FILE_PATH+'train.csv')
test_data=pd.read_csv(FILE_PATH+'test.csv')

In [None]:
#columns of training data
train_data.columns

The features are **anonymized**, so it's difficult to get insights about each feature directly from its name. The 'id' column is not normally used in EDA, so let's remove it for now.

In [None]:
# dropping the 'id' column from both train and test data
train_data.drop(['id'],inplace=True,axis=1)
test_data.drop(['id'],inplace=True,axis=1)

In [None]:
train_data.head(5)

In [None]:
# checking unique elements of target feature of train data
train_data['target'].unique()

All 50 features seem to hold **discrete numbers (starting from 0)**, and the target has 4 unique values with data type 'object'. ***The aim is to classify each row into one of 4 classes (namely Class_1, Class_2, Class_3, Class_4).***

In [None]:
# converting the data type from 'object' to 'int'
train_data['target']=train_data['target'].str[6:].astype('int')
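
The slice `str[6:]` works because every label starts with the fixed six-character prefix `Class_`. As a minimal sketch of an alternative that does not depend on the prefix length, the digits can be extracted with a regular expression. The Series `toy_target` below is made-up illustrative data, not the competition file.

```python
import pandas as pd

# Made-up labels in the same 'Class_N' format as the competition target.
toy_target = pd.Series(['Class_2', 'Class_1', 'Class_4', 'Class_2', 'Class_3'])

# Capture the digits that follow 'Class_' and cast them to integers.
toy_target_int = toy_target.str.extract(r'Class_(\d+)', expand=False).astype(int)

print(toy_target_int.tolist())  # [2, 1, 4, 2, 3]
```

A label that does not match the pattern would become NaN and make the integer cast fail, which surfaces unexpected labels early instead of silently mis-parsing them.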

In [None]:
# features columns
features=['feature_'+str(i) for i in range(50)]
#features

In [None]:
# display features whose minimum value is not zero, along with that minimum
print('For training dataset--')
for fea in features:
    if train_data[fea].min()!=0:
        print(fea," ",train_data[fea].min())
print('For test dataset--')
for fea in features:
    if test_data[fea].min()!=0:
        print(fea," ",test_data[fea].min())

Only a few features (listed above) have a minimum value below zero.
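
The same check can be written without an explicit loop. The sketch below assumes a small made-up frame, `toy_df`, standing in for the real data. It computes all column minima at once, then filters them.

```python
import pandas as pd

# Made-up stand-in for the real data: three discrete features.
toy_df = pd.DataFrame({
    'feature_0': [0, 1, 3, 0],
    'feature_1': [-2, 0, 5, 1],
    'feature_2': [1, 2, 2, 4],
})

col_min = toy_df.min()                 # minimum of every column in one call
nonzero_min = col_min[col_min != 0]    # columns whose minimum is not zero
negative_min = col_min[col_min < 0]    # columns that actually go below zero

print(nonzero_min)
print(negative_min)
```

Separating `!= 0` from `< 0` makes it explicit whether a feature merely starts above zero or really contains negative values.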

In [None]:
# Maximum value present in the dataset
print('Maximum value in training data',train_data.max().values.max())
print('Maximum value in test data',test_data.max().values.max())

The maximum value is **not very high (compared to the number of rows in the data)**. Let's check which feature has the most unique values.

In [None]:
# Finding feature that has maximum number of unique values both in training and test data
print('For training data--')
max_value=-1
feat=''
for fea in features:
    if train_data[fea].nunique()>max_value:
        feat=fea
        max_value=train_data[fea].nunique()
print(feat," ",max_value)
print('For test data--')
max_value=-1
feat=''
for fea in features:
    if test_data[fea].nunique()>max_value:
        feat=fea
        max_value=test_data[fea].nunique()
print(feat," ",max_value)

Thus, the maximum number of unique values is **71** in the training set and **65** in the test set (small compared to the number of samples in the dataset), both for the same feature, **feature_38**. Since every feature takes a finite set of discrete values, we can treat all of them as categorical.
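
As a compact alternative to the loop, `DataFrame.nunique()` returns the unique-value count of every column and `idxmax()` names the column with the largest count. This is a sketch on a made-up frame (`toy_df`), not the competition data.

```python
import pandas as pd

# Made-up stand-in for the real data.
toy_df = pd.DataFrame({
    'feature_0': [0, 0, 1, 1, 0],
    'feature_1': [0, 2, 5, 7, 9],
    'feature_2': [3, 3, 3, 4, 4],
})

unique_counts = toy_df.nunique()       # number of distinct values per column
top_feature = unique_counts.idxmax()   # column with the most distinct values
top_count = unique_counts.max()

print(top_feature, top_count)  # feature_1 5
```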

One more thing to check when treating the features as categorical: values that appear in the training data but not in the test data, and vice versa.

In [None]:
print('Feature values present in the training data but not in the test data')
print('-'*100)
for f in features:
    train_set=set(train_data[f])
    test_set=set(test_data[f])
    # values present in train but not in test
    dif_set=train_set.difference(test_set)
    if dif_set != set():
        print(f,'-'*10,dif_set)

In [None]:
print('Feature values present in the test data but not in the training data')
print('-'*100)
for f in features:
    train_set=set(train_data[f])
    test_set=set(test_data[f])
    # values present in test but not in train
    dif_set=test_set.difference(train_set)
    if dif_set != set():
        print(f,'-'*10,dif_set)
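
The two cells above differ only in the direction of the set difference, so the logic can be folded into one helper. The function name `unseen_values` and the toy frames are hypothetical and only for illustration.

```python
import pandas as pd

def unseen_values(left, right, columns):
    """Return {column: values present in `left` but absent from `right`}."""
    result = {}
    for col in columns:
        diff = set(left[col]) - set(right[col])
        if diff:  # keep only columns that actually differ
            result[col] = diff
    return result

# Made-up stand-ins for the training and test frames.
toy_train = pd.DataFrame({'feature_0': [0, 1, 2], 'feature_1': [0, 0, 3]})
toy_test = pd.DataFrame({'feature_0': [0, 1, 5], 'feature_1': [0, 3, 3]})
cols = ['feature_0', 'feature_1']

only_in_train = unseen_values(toy_train, toy_test, cols)
only_in_test = unseen_values(toy_test, toy_train, cols)

print(only_in_train)
print(only_in_test)
```

Values that appear only in the test data matter most for categorical encodings, because an encoder fitted on the training data has never seen them.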

# Time for Visuals!


The first thing that needs to be checked is obviously how the **target is distributed.**


In [None]:
#count plot
sns.countplot(x='target',data=train_data)

In [None]:
# Percentage of each class
for i in range(1,5):
    print('Class_',i,end=' is ')
    print(((train_data['target']==i).sum()/train_data.shape[0])*100,'%')

**57.497%** of the training data has target 'Class_2', whereas only **8.49%** has target 'Class_1'. Thus, our target is a bit **imbalanced.**
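
The class shares can also be obtained in one call with `value_counts(normalize=True)`. The sketch below uses a made-up target Series (`toy_target`), so the numbers it prints are not the competition's.

```python
import pandas as pd

# Made-up integer targets standing in for train_data['target'].
toy_target = pd.Series([2, 2, 2, 1, 3, 4, 2, 3, 2, 2])

# Fraction of rows per class, sorted by class label, as a percentage.
class_share = toy_target.value_counts(normalize=True).sort_index() * 100

# Ratio between the most and least frequent class as a rough imbalance measure.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(1))
print(round(imbalance_ratio, 1))
```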


Let's check the distribution of our features now.

In [None]:
def display(feature):
    '''Plot a feature's distribution in the training and test data side by side.'''
    fig,ax=plt.subplots(1,2,figsize=(15,5))
    ax[0].set_title("Training Data:")
    sns.histplot(x=feature,data=train_data,stat='density',kde=True,ax=ax[0])
    ax[1].set_title("Test Data:")
    sns.histplot(x=feature,data=test_data,stat='density',kde=True,ax=ax[1])
    plt.show()

In [None]:
for fea in features:
    print('\033[1m',fea.upper(),'\033[0m') #\033[1m and \033[0m can be used to make python text bold
    display(fea)

From the visuals, we can see that the majority of features are skewed. As a rule of thumb (see the figure below), a feature is **left skewed** when its mean is less than its median, so we can use that condition to tell left skew from right skew.
![Skewness](https://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness.jpg)

In [None]:
for fea in features:
    if train_data[fea].mean() < train_data[fea].median():
        print(fea,' is left skewed')

Thus, we can infer that the majority of features are **right skewed**.
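
The mean-versus-median comparison is a rule of thumb and can be unreliable for discrete data with many ties. As a complementary check, pandas can compute the sample skewness directly with `DataFrame.skew()`: positive values indicate a longer right tail, negative values a longer left tail. The frame below is made-up illustrative data.

```python
import pandas as pd

# Made-up stand-in for the real data.
toy_df = pd.DataFrame({
    'feature_0': [0, 0, 0, 1, 1, 2, 9],   # long right tail
    'feature_1': [0, 7, 8, 8, 9, 9, 9],   # long left tail
})

skewness = toy_df.skew()  # sample skewness of each column

right_skewed = skewness[skewness > 0].index.tolist()
left_skewed = skewness[skewness < 0].index.tolist()

print('right skewed:', right_skewed)
print('left skewed:', left_skewed)
```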

Let's see, for each feature, what percentage of the data is taken up by its most frequent value.

In [None]:
# table of each feature's most frequent value and the percentage of training rows it occupies
freq_table=pd.DataFrame()
freq=[]
per=[]
for fea in features:
    mode_value=train_data[fea].mode()[0]  # compute the mode once per feature
    freq.append(mode_value)
    per.append(((train_data[fea]==mode_value).sum()/train_data.shape[0])*100)
freq_table['Features']=features
freq_table['Max Occurring Value']=freq
freq_table['Percentage Occupied']=per
freq_table=freq_table.sort_values('Percentage Occupied')
freq_table.reset_index()

The most frequent value of the features in the training data is **zero**, and for many features it occupies **more than 50% of the rows**. Let's visualize it to get a better idea.
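
Because zero is the value of interest, its share per feature can also be computed directly: comparing the frame to zero gives a boolean frame, and the column means of that frame are the fractions of zeros. This is a sketch on made-up data (`toy_df`).

```python
import pandas as pd

# Made-up stand-in for the real data.
toy_df = pd.DataFrame({
    'feature_0': [0, 0, 0, 1],
    'feature_1': [0, 2, 3, 4],
    'feature_2': [0, 0, 5, 6],
})

# The mean of a boolean column is the fraction of True values.
zero_share = (toy_df == 0).mean() * 100

# Features where zeros make up strictly more than half of the rows.
mostly_zero = zero_share[zero_share > 50].index.tolist()

print(zero_share)
print(mostly_zero)  # ['feature_0']
```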

In [None]:
fig=plt.figure(figsize=(15,15))
barh=plt.barh(freq_table['Features'],freq_table['Percentage Occupied'])
plt.bar_label(barh, fmt='%.01f%%')
plt.xlabel('Percentage Occupied')

Finally, let's use a **heatmap** to visualize the **correlation** among the features and between the features and the target.

In [None]:
corr=train_data.corr()
# boolean mask that hides the upper triangle (including the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(20, 20))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True,cmap="YlGnBu",linewidths=0.5)

From the above visual, we can see that there is **no high correlation between any pair of features.**
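
Since the heatmap's colour scale is capped at `vmax=.3`, every correlation above 0.3 is drawn in the same colour. A numeric ranking of the pairwise correlations can complement the picture. The sketch below uses made-up random data in which two columns are deliberately related, so the values are illustrative only.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Made-up data: feature_1 is built from feature_0, feature_2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_df = pd.DataFrame({
    'feature_0': base,
    'feature_1': base + rng.normal(scale=0.1, size=200),
    'feature_2': rng.normal(size=200),
})

corr = toy_df.corr()

# Absolute correlation of every unordered column pair, strongest first.
pairs = pd.Series({
    (a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)
}).sort_values(ascending=False)

print(pairs.head())
```

On the real data, `toy_df` would be replaced by `train_data`. If the top entries of `pairs` stay small, that supports the reading of the heatmap.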

Suggestions on what else I could add are most welcome, and **kindly upvote the notebook if it is of any help!**