

<a id="section-zero"></a>

# TABLE OF CONTENTS


* [Library Imports](#section-one)
* [Loading Datasets](#section-two)
* [Exploratory Data Analysis](#section-three)
* [Data Preprocessing](#section-four)
* [Building Models](#section-six)
    - [Neural Network](#section-six-one)
* [Submission](#section-nine)

<a id="section-one"></a>
# Import all the required libraries

In [None]:
import numpy as np 
import pandas as pd
import os


from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from tensorflow.keras.optimizers import Adam

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

import optuna


<a id="section-two"></a>
# Load datasets

In [None]:
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/train.csv', index_col='id')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/test.csv', index_col='id')
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-may-2021/sample_submission.csv')

<a id="section-three"></a>
# Exploratory Data Analysis (EDA)

Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

> The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features.

Check the shape of the training and test sets:

In [None]:
print(df_train.shape)
print(df_test.shape)

View the first five rows of the training set:

In [None]:
df_train.head()

There are 50 features of dtype int64, numbered from 0 (starting at `feature_0`):

In [None]:
print(df_train.info())

Check which columns contain missing (NaN) values in the training and test datasets:

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

Let's look at the summary statistics for each feature.

In [None]:
df_train.describe()

In [None]:
df_test.describe()

**Target column**

Let's check the target distribution and plot it as a bar chart:

In [None]:
df_train['target'].value_counts()

In [None]:
target_counts = df_train['target'].value_counts()
sns.barplot(x=target_counts.index, y=target_counts.values, palette='rocket')

To summarize:
* There is no missing data
* There are no categorical features
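
The two summary points can also be confirmed programmatically. The following is a minimal sketch, assuming a pandas DataFrame; `summarize_frame` and the toy data are hypothetical and only illustrate the idea:

```python
import pandas as pd


def summarize_frame(df):
    """Return the total number of missing cells and the non-numeric column names."""
    n_missing = int(df.isnull().sum().sum())
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return n_missing, non_numeric


# Toy frame standing in for the feature columns (illustrative values only)
toy = pd.DataFrame({"feature_0": [0, 1, 2], "feature_1": [3, 0, 1]})
n_missing, non_numeric = summarize_frame(toy)
print(n_missing, non_numeric)
```

If the helper is run on `df_train` before the target is separated, a `target` column that holds class-name strings would be reported as non-numeric; that is expected and does not contradict the summary about the features.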

<a id="section-four"></a>
# Preprocessing the data

Separate the target column from the features:

In [None]:
target = df_train['target']
df_train.drop(['target'], inplace=True, axis=1)

We map each class label to an integer index, then use the Keras `to_categorical` function to one-hot encode the target column:

In [None]:
# Map each class label to an integer index (in sorted order)
label = {var: index for index, var in enumerate(sorted(target.unique()))}
target = target.map(label)

# One-hot encode the integer indices
target = to_categorical(target)
target
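
To make the encoding concrete, here is a NumPy-only sketch of what a label mapping followed by one-hot encoding produces. The label list is made up for illustration, and `np.eye` indexing is used here as a stand-in for `to_categorical`, not as its actual implementation:

```python
import numpy as np

# Hypothetical class labels, for illustration only
labels = ["Class_2", "Class_1", "Class_4", "Class_2", "Class_3"]

# Sorted unique labels -> integer indices
label_to_index = {name: i for i, name in enumerate(sorted(set(labels)))}
indices = np.array([label_to_index[name] for name in labels])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(label_to_index))[indices]
print(one_hot)

# argmax recovers the integer index from each one-hot row
recovered = one_hot.argmax(axis=1)
```

Each row contains a single 1 in the column of its class, which is the target format that `categorical_crossentropy` expects.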

In [None]:
#X_train, X_val, y_train, y_val = train_test_split(df_train, target, test_size = 0.1, random_state = 2, stratify=target)

<a id="section-six"></a>
# Building Models

<a id="section-six-one"></a>
**Neural Network**

Define the callbacks used in the neural network's fit step:
* ReduceLROnPlateau: reduces the learning rate when a metric (here `val_categorical_accuracy`) has stopped improving.
* EarlyStopping: stops training when a monitored metric (by default `val_loss`) has stopped improving.

In [None]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)


early_stopping = EarlyStopping(
    min_delta=0.001,  # minimum amount of change to count as an improvement
    patience=7,  # how many epochs to wait before stopping
    restore_best_weights=True,
)
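
To see how `patience`, `factor`, `min_lr`, and `min_delta` interact, the sketch below simulates both callbacks in plain Python. It is a simplified re-implementation for illustration, not the Keras code: the function names and metric sequences are made up, the starting learning rate of 0.001 assumes the Keras default for Adam, and the `1e-4` improvement threshold in the first function assumes the default `min_delta` of `ReduceLROnPlateau`.

```python
def simulate_reduce_lr(val_accs, lr=0.001, factor=0.5, patience=3,
                       min_lr=1e-5, min_delta=1e-4):
    """Return the learning rate in effect during each epoch."""
    best = float("-inf")
    wait = 0
    schedule = []
    for acc in val_accs:
        schedule.append(lr)
        if acc > best + min_delta:      # accuracy improved
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:        # plateau: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule


def simulate_early_stopping(val_losses, patience=7, min_delta=0.001):
    """Return (stop_epoch, best_epoch), 0-based; stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:     # loss improved by more than min_delta
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up metric histories
schedule = simulate_reduce_lr([0.50, 0.55, 0.55, 0.55, 0.55, 0.56])
stop_epoch, best_epoch = simulate_early_stopping([1.20, 1.15, 1.12] + [1.12] * 7)
print(schedule)
print(stop_epoch, best_epoch)
```

In the first history the accuracy stalls for three epochs, so the rate is halved for the sixth epoch. In the second, the loss stops improving after epoch 2, training stops seven epochs later, and with `restore_best_weights=True` the weights from epoch 2 would be the ones kept.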

In [None]:
# Neural network: 50 input features, 4 output classes
nn = keras.Sequential([
    layers.Dense(256, activation='relu', input_shape=[50]),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(4, activation='softmax')
])

nn.summary()
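
The parameter counts reported by `nn.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. A small sketch of that arithmetic (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the units of each Dense layer
layer_sizes = [50, 256, 64, 4]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [13056, 16448, 260] 29764
```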

In [None]:
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
history = nn.fit(
    df_train, target,
    validation_split=0.2,
    batch_size=128,
    epochs=25,
    callbacks=[early_stopping, learning_rate_reduction]
)

loss: 1.0864 - categorical_accuracy: 0.5767 - val_loss: 1.1021 - val_categorical_accuracy: 0.5836
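
For context on these loss values, it helps to compare them with a naive reference. A model that assigns equal probability to all four classes has a categorical cross-entropy of ln 4, regardless of the true class. The sketch below computes that reference in plain Python; it is an illustration only, and a baseline built from the observed class frequencies could be lower than this uniform one.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for one sample: -sum(true_k * log(pred_k))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


n_classes = 4
uniform = [1.0 / n_classes] * n_classes

# One-hot target with the second class as the true class (arbitrary choice)
baseline = categorical_crossentropy([0, 1, 0, 0], uniform)
print(round(baseline, 4))  # 1.3863
```

The reported validation loss of 1.1021 is below this uniform baseline of about 1.386.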

In [None]:
# Plot training and validation accuracy per epoch
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['val_categorical_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
# Plot training and validation loss per epoch
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

XGB

<a id="section-nine"></a>
# Submission

In [None]:
sample_submission[['Class_1','Class_2', 'Class_3', 'Class_4']] = nn.predict(df_test)
sample_submission.to_csv('submission.csv', index=False)
sample_submission
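
Before submitting, it can be worth confirming that the predictions look like class probabilities: one column per class, values between 0 and 1, and rows summing to roughly 1, as a softmax output should. The following is a sketch of such a check; `check_probabilities` and the sample values are hypothetical:

```python
import numpy as np


def check_probabilities(probs, n_classes=4, tol=1e-4):
    """Return True if probs is (n_rows, n_classes), within [0, 1], with rows summing to ~1."""
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[1] != n_classes:
        return False
    in_range = bool(np.all((probs >= 0.0) & (probs <= 1.0)))
    rows_sum_to_one = bool(np.allclose(probs.sum(axis=1), 1.0, atol=tol))
    return in_range and rows_sum_to_one


# Made-up predictions standing in for nn.predict(df_test)
fake_preds = [[0.1, 0.6, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
print(check_probabilities(fake_preds))
```

In the notebook, the same function could be applied to the four class columns of `sample_submission` after they are filled in.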