# <p style="background-color:turquoise; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">Tabular Playground Series  - May 2021 </p>



<a id='table-of-contents'></a>
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">Table of Contents</p>

* [1. Data visualization 📊](#1)
    * [1.1 Target Distribution](#1.1)
    * [1.2 Correlation Matrix](#1.2)
    * [1.3 Stats and Skewness](#1.3)
    * [1.4 Numerical Columns](#1.4)
* [2. Feature Engineering 🛠](#2)
    * [2.1 Binning](#2.1)
    * [2.2 Log transformation](#2.2)
* [3. Model building](#3)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
train = pd.read_csv('../input/tabular-playground-series-may-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')

In [None]:
#checking the data size of train and test
print(train.shape)
print(test.shape)

In [None]:
train.head()

In [None]:
train.info()

Observations:

- The column names carry little meaning: every column is named `feature_` followed by an integer, so nothing can be inferred from them from a domain standpoint.
- There are no missing values in the dataset.
- All the feature columns are of type integer.
- With 50 integer-valued feature columns, dimensionality reduction may be worth exploring.
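
These observations can also be checked programmatically instead of being read off `train.info()`. The following is a minimal sketch on a small stand-in frame; `toy`, its column values, and the `Class_*` label strings are assumptions for illustration, not the competition data.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the real frame has 50 feature columns.
toy = pd.DataFrame({
    "feature_0": [0, 1, 0, 3],
    "feature_1": [2, 0, 0, 1],
    "target": ["Class_2", "Class_1", "Class_2", "Class_4"],
})

feature_cols = [c for c in toy.columns if c.startswith("feature_")]

# Total number of missing values across the whole frame.
n_missing = int(toy.isna().sum().sum())

# True only if every feature column has an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(toy[c]) for c in feature_cols)

print(n_missing, all_integer)
```

Running the same two checks on `train` would confirm the points above without scanning the `info()` output by eye.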





Dropping the `id` column, as it carries no predictive information.

In [None]:
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

<a id='1.1'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">1.1 Target Distribution</p>



In [None]:
fig = px.histogram(
    train,
    x='target',      # one bar per class
    color='target',  # colour bars by class
)
fig.update_layout(
    title_text='Target distribution',  # title of plot
    xaxis_title_text='Value',          # x-axis label
    yaxis_title_text='Count',          # y-axis label
    bargap=0.2,                        # gap between adjacent bars
)
fig.show()

 - In the target variable, class 2 has more data points than the other classes, so we will probably need to address class imbalance.
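
One common way to quantify and counter imbalance is to compute class shares and inverse-frequency ("balanced") class weights. The sketch below uses made-up labels; `labels` and its class counts are assumptions, and whether reweighting actually helps here depends on the model and metric.

```python
import pandas as pd

# Hypothetical labels standing in for train['target'].
labels = pd.Series(
    ["Class_2"] * 6 + ["Class_1"] * 2 + ["Class_3"] * 1 + ["Class_4"] * 1
)

# Share of each class in the data.
class_share = labels.value_counts(normalize=True)

# "Balanced" weights: n_samples / (n_classes * class_count),
# so rarer classes receive larger weights.
counts = labels.value_counts()
class_weight = len(labels) / (counts.size * counts)

print(class_share.round(2).to_dict())
print(class_weight.round(2).to_dict())
```

Weights of this form can be passed to many classifiers that accept per-class or per-sample weights.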
 

<a id='1.2'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">1.2 Correlation Matrix</p>



In [None]:
rename_labels = {val:idx for idx, val in enumerate(sorted(train['target'].unique()))}
train['target'] = train['target'].map(rename_labels)

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
corr = train.corr()
# Mask the upper triangle so each pair of columns is shown only once.
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent here.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Alternative palette: cmap=sns.diverging_palette(240, 10, as_cmap=True)
sns.heatmap(corr, square=True, center=0,
            linewidth=0.2, cmap='coolwarm',
            mask=mask, ax=ax)

ax.set_title('Feature Correlation Matrix', loc='left')
plt.show()

Observations:
- The correlations between the features are mostly moderate, and a few pairs are highly correlated.
- The correlations between the features and the target are not strong.
- No single feature is strongly correlated with the class, so we are not going to drop any variable at this stage.
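
Reading highly correlated pairs off a 50-column heatmap is error-prone, so it can help to list them explicitly. A minimal sketch on synthetic data follows; the frame `toy`, its columns, and the `threshold` of 0.8 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "f0": a,
    "f1": a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with f0
    "f2": rng.normal(size=200),                 # independent noise
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.8  # arbitrary cut-off chosen for illustration
pairs = upper.stack()
# Comparisons with NaN are False, so masked entries never pass the filter.
high_pairs = pairs[pairs.abs() > threshold]

print(high_pairs)
```

Applied to `train.corr()`, the same filter would return the feature pairs that stand out in the heatmap.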

<a id='1.3'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">1.3 Stats and Skewness</p>



In [None]:
train.describe()

Observations:
- The means of all the features are close to zero.
- The variance is low across all the features.
- The median is 0 for every column except two.

<a id='1.4'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">1.4 Numerical Columns</p>

- From the statistical summary, we know that the data ranges of Feature 38 and Feature 14 are considerably larger than those of the other numeric columns.
- Feature 38 and Feature 14 have a median of 1, while the median of every remaining column is zero.
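
Columns like these can be picked out of the summary statistics automatically. The sketch below runs on a tiny made-up frame; `toy` and its values are assumptions, chosen so that one column has both a non-zero median and the widest range.

```python
import pandas as pd

# Hypothetical stand-in for the feature columns of `train`.
toy = pd.DataFrame({
    "feature_0": [0, 0, 0, 1, 2],
    "feature_1": [0, 1, 1, 4, 30],
    "feature_2": [0, 0, 0, 0, 5],
})

# Per-column median and range (max - min).
summary = pd.DataFrame({
    "median": toy.median(),
    "range": toy.max() - toy.min(),
})

# Columns whose median differs from zero, and the column with the widest range.
nonzero_median = summary.index[summary["median"] != 0].tolist()
widest = summary["range"].idxmax()

print(nonzero_median, widest)
```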

## Feature14

In [None]:
plt.rcParams['figure.dpi'] = 300
fig = plt.figure(figsize=(5, 2))
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])

ax0.tick_params(axis = "y", which = "both", left = False)

# KDE plots
# `shade` is deprecated in seaborn >= 0.11; `fill` is its replacement.
ax0_sns = sns.kdeplot(ax=ax0, x=train['feature_14'], zorder=2, fill=True)
ax0_sns = sns.kdeplot(ax=ax0, x=test['feature_14'], zorder=2, fill=True)

# Axis and grid customization
ax0_sns.set_xlabel("feature_14",fontsize=5, weight='bold')
ax0_sns.set_ylabel('')
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE')

# Legend params
ax0.legend(['train', 'test'], prop={'size': 5})
ax0_sns.tick_params(labelsize=5)

plt.show()

## Feature38

In [None]:
plt.rcParams['figure.dpi'] = 300
fig = plt.figure(figsize=(5, 2))
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])

ax0.tick_params(axis = "y", which = "both", left = False)

# KDE plots
# `shade` is deprecated in seaborn >= 0.11; `fill` is its replacement.
ax0_sns = sns.kdeplot(ax=ax0, x=train['feature_38'], zorder=2, fill=True)
ax0_sns = sns.kdeplot(ax=ax0, x=test['feature_38'], zorder=2, fill=True)

# Axis and grid customization
ax0_sns.set_xlabel("feature_38",fontsize=5, weight='bold')
ax0_sns.set_ylabel('')
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE')
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE')

# Legend params
ax0.legend(['train', 'test'], prop={'size': 5})
ax0_sns.tick_params(labelsize=5)

plt.show()

- Both features are strongly skewed and cover a considerably larger range than the other columns. Handling this skewness is likely to be key to better results.

In [None]:
# Skewness of every column, largest first.
skewed_features = train.skew().sort_values(ascending=False)
skewed_features

Observations:
- The features are highly skewed. Applying feature transformations may help us build a better model.

<a id='2'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">2 Feature Engineering</p>



<a id='2.1'></a>
Binning:
- When you bin a feature, you can keep both the binned version and the original feature. Binning also lets you treat numerical features as categorical.

In [None]:
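# 10 equal-width bins per feature; labels=False returns integer bin indices.
train["f14_bin_10"] = pd.cut(train["feature_14"], bins=10, labels=False)
# Column renamed from f38_bin_100 so that the name matches bins=10.
train["f38_bin_10"] = pd.cut(train["feature_38"], bins=10, labels=False)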
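
On skewed data, equal-width bins from `pd.cut` tend to put almost every row into the first bin. Quantile bins from `pd.qcut` are one alternative; the original cells do not use them, so the comparison below is only a sketch on made-up values (`values` is hypothetical).

```python
import pandas as pd

# Hypothetical skewed counts: mostly small values and one large outlier.
values = pd.Series([0, 0, 1, 1, 2, 2, 3, 5, 8, 100])

# Equal-width bins: the range 0..100 is split into 4 intervals of equal width.
width_bins = pd.cut(values, bins=4, labels=False)

# Equal-frequency bins: edges are quantiles of the data;
# duplicates="drop" merges repeated edges if heavy ties produce any.
freq_bins = pd.qcut(values, q=4, labels=False, duplicates="drop")

print(width_bins.value_counts().sort_index().to_dict())
print(freq_bins.value_counts().sort_index().to_dict())
```

The equal-width version leaves the outlier alone in the last bin, while the quantile version spreads the rows more evenly across bins.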

<a id='2.2'></a>
Log transformation:
- All the features except Feature 14 and Feature 38 have low variance. We therefore want to reduce the variance of these two columns, which can be done with a log transformation.


In [None]:
# Compare the variance before and after the log transformation.
print("Feature 14 variance:", train.feature_14.var())
# np.log1p(x) computes log(1 + x) element-wise on the whole column.
train['feature_14'] = np.log1p(train.feature_14)
print("Feature 14 variance after applying log(1 + x):", train.feature_14.var())
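
If the same transformation is extended to other columns, it is worth guarding against negative values, since `log(1 + x)` is undefined for `x <= -1`. The following sketch assumes a made-up frame (`toy`, `feature_a`, `feature_b` are hypothetical) and writes the result to new columns instead of overwriting the originals.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one non-negative skewed column, one with a negative value.
toy = pd.DataFrame({
    "feature_a": [0, 0, 1, 1, 2, 3, 5, 8, 20, 60],
    "feature_b": [-2, 0, 0, 1, 1, 2, 3, 4, 9, 30],
})

for col in ["feature_a", "feature_b"]:
    before = toy[col].skew()
    if (toy[col] < 0).any():
        # log(1 + x) is undefined for x <= -1; to be conservative,
        # skip any column that contains negative values.
        print(f"{col}: skipped (contains negative values)")
        continue
    toy[col + "_log"] = np.log1p(toy[col])
    after = toy[col + "_log"].skew()
    print(f"{col}: skew {before:.2f} -> {after:.2f}")
```

Comparing skewness before and after, in addition to variance, shows directly whether the transformation addressed the skew noted earlier.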


<a id='3'></a>
[back to top](#table-of-contents)
## <p style="background-color:turquoise; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">3 Model Building</p>

# In Progress...


<a id='7'></a>
[back to top](#table-of-contents)
## <p style="background-color:orange; font-family:newtimeroman; font-size:140%; text-align:center; border-radius: 15px 50px;">Stay Tuned and upvote if you like this notebook</p>

# Thanks for reading! I will keep updating this notebook with more visualizations and features built with different techniques.


