<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Tabular Playground Series - Jun 2021
</h1>
</div>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/25226/logos/header.png?t=2021-01-27-17-34-31)

## Data
The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

This competition dataset is similar to the Tabular Playground Series - May 2021 dataset, but with more observations, more features, and more class labels.

* Files
  * train.csv - the training data, one product (id) per row, with the associated features (feature_*) and class label (target)
  * test.csv - the test data; you must predict the probability the id belongs to each class
  * sample_submission.csv - a sample submission file in the correct format

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Visualizing the Data
    with PCA and t-SNE
</h1>
</div>

## 1. Import Libraries and Load Data
### This time, I will use only the training data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE

train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
train = train.drop('id',axis=1)
train.head()

### It has so many columns...

![](https://media4.giphy.com/media/BEob5qwFkSJ7G/giphy.gif)

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Check the Target Distribution
</h1>
</div>

In [None]:
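# Count rows per class; naming the columns explicitly keeps this working
# across pandas versions (pandas 2.x names them 'target' and 'count' by default)
target_count = train['target'].value_counts().rename_axis('index').reset_index(name='target')
target_count = target_count.sort_values(by='index').reset_index(drop=True)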
fig = plt.figure(figsize=(20,13))
sns.barplot(data = target_count, x= 'index', y='target')
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
for i in range(len(target_count)):
    value = target_count.loc[i,'target']
    plt.text(s=f'{value}',x = i-0.2, y = value+500, fontsize=20)
plt.ylabel('')
plt.xlabel('')
plt.show()

### Class_6 and Class_8 each have more than 50,000 rows, but Class_5 has only about 3,000...

In [None]:
import plotly.express as px
fig = px.pie(target_count, names='index',values='target',hole = 0.8)
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.update_layout(
    annotations=[dict(text="Target's percentage", x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.update_layout(showlegend=False)

### Looking at the target percentages, Class_8 and Class_6 together make up about half of the data.
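
As a rough sketch of how these shares can be computed directly rather than read off the chart, `value_counts(normalize=True)` returns each class's fraction of the rows. The small `toy` frame below is made-up data for illustration, not the competition file, and its proportions are arbitrary.

```python
import pandas as pd

# Hypothetical toy labels standing in for train['target']
toy = pd.DataFrame(
    {'target': ['Class_6'] * 5 + ['Class_8'] * 5 + ['Class_5'] * 2 + ['Class_2'] * 8}
)

# Fraction of rows per class, sorted by class name
share = toy['target'].value_counts(normalize=True).sort_index()

# Combined share of the two classes of interest
top_two = share[['Class_6', 'Class_8']].sum()

print(share.round(3))
print(f"Class_6 + Class_8: {top_two:.1%}")
```

On the real data, replacing `toy` with `train` should give the same percentages the pie chart displays.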

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Visualization with 2D scatter plot
    - Using PCA
</h1>
</div>

In [None]:
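# Project the features (every column except the last, 'target') onto two principal components
pca = PCA(n_components=2)
X_train = pd.DataFrame(pca.fit_transform(train[train.columns[:-1]]))
# Plot a random sample of 100,000 rows; look up the labels by the sampled index
X_train = X_train.sample(100000)
sns.jointplot(x = X_train[0], y= X_train[1],hue = train.loc[X_train.index, 'target'],height = 15)
plt.show()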

![](https://media2.giphy.com/media/7SF5scGB2AFrgsXP63/giphy.gif)

In [None]:
# Same plot, zoomed in on the dense region
sns.jointplot(x = X_train[0], y= X_train[1],hue = train.loc[X_train.index, 'target'],height = 15,
             xlim = (-15,50), ylim =(-60,50))
plt.axis('off')
plt.show()

### I can't separate the classes visually in 2D...
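
One possible way to see why a 2D PCA plot can look uninformative is to check how much of the total variance the two components keep. The sketch below uses randomly generated count-like features (`X_demo`, with an arbitrarily chosen shape) as a stand-in for the `feature_*` columns, so the numbers it prints say nothing about the real data; it only shows the `explained_variance_ratio_` attribute of a fitted `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the feature_* columns: independent count-like features
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=1.0, size=(1000, 75)).astype(float)

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance captured by each of the two components
ratios = pca_demo.explained_variance_ratio_
kept = ratios.sum()

print("per component:", ratios)
print("kept by the 2D projection:", kept)
```

If the same check on the training features returns a small total, most of the structure is lost in the projection and overlapping classes are to be expected.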

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
How about Visualization with a 3D Scatter Plot?
</h1>
</div>

In [None]:
pca = PCA(n_components=3)
train_3d = pd.DataFrame(pca.fit_transform(train[train.columns[:-1]]))
train_3d['target'] = train['target']

fig = plt.figure(figsize=(20, 13))
ax = fig.add_subplot(111, projection='3d')
for target_class in train_3d['target'].unique():
    # Select this class once, then plot its three components
    subset = train_3d[train_3d['target'] == target_class]
    ax.scatter(subset[0], subset[1], subset[2], s= 10, alpha=0.1)
plt.title("3D scatter")
plt.show()

### Again, it's almost all green...

In [None]:
scaler = MinMaxScaler()
scaler_data = scaler.fit_transform(train_3d[train_3d.columns[:-1]])
scaler_data = pd.DataFrame(scaler_data)
scaler_data['target'] = train['target']

fig = plt.figure(figsize=(20, 13))
ax = fig.add_subplot(111, projection='3d')
for target_class in scaler_data['target'].unique():
    # Draw 3000 rows per class so every class contributes the same number of points
    data = scaler_data[scaler_data['target'] == target_class].sample(3000)
    ax.scatter(data[0], data[1], data[2], s= 10, alpha=0.3)
plt.title("Balanced_3D scatter")
plt.show()

### I think this isn't a good visualization either...

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Visualization with 2D scatter plot
    - Using t-SNE
</h1>
</div>

### I use only 50,000 rows because t-SNE takes so much time...
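
A plain random sample is what the cell below uses. As an alternative sketch, not what this notebook does, a per-class sample keeps the class proportions exactly the same in the subset. The `toy_train` frame and its class names are invented for illustration.

```python
import pandas as pd

# Hypothetical imbalanced labels standing in for the training frame
toy_train = pd.DataFrame({
    'feature_0': range(1000),
    'target': ['Class_a'] * 600 + ['Class_b'] * 300 + ['Class_c'] * 100,
})

# Sample 25% of the rows within each class; random_state makes the draw repeatable
toy_sample = toy_train.groupby('target').sample(frac=0.25, random_state=0)

print(toy_sample['target'].value_counts())
```

Because the sampled frame keeps the original row labels, any embedding computed from it needs the same index (or positional `.values`) when it is joined back to the labels.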

In [None]:
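# t-SNE is slow, so fit it on a random sample of 50,000 rows
train_sample = train.sample(50000)
x = train_sample[train_sample.columns[:-1]]
tsne = TSNE(n_components=2)
# Keep the sampled index so the embedding lines up with the labels used for hue
y = pd.DataFrame(tsne.fit_transform(x), index=train_sample.index)
sns.jointplot(x = y[0], y= y[1],hue = train_sample['target'],height = 15)
plt.show()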

### It's better than PCA

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:skyblue;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
              color:white">
Visualization with 3D scatter plot
    - Using t-SNE
</h1>
</div>

In [None]:
tsne = TSNE(n_components=3)
train_3d = pd.DataFrame(tsne.fit_transform(x))
# Assign by position: train_sample keeps its original row labels, train_3d does not
train_3d['target'] = train_sample['target'].values

fig = plt.figure(figsize=(20, 13))
ax = fig.add_subplot(111, projection='3d')
for target_class in train_3d['target'].unique():
    # Select this class once; this also avoids overwriting the feature matrix x
    subset = train_3d[train_3d['target'] == target_class]
    ax.scatter(subset[0], subset[1], subset[2], s= 10, alpha=0.5)
plt.title("3D scatter")
plt.show()