
<a id="section-zero"></a>

# TABLE OF CONTENTS


* [Library Importations](#section-libraryimportation)
* [Loading Datasets](#section-loadingdatasets)
* [Exploratory Data Analysis](#section-EDA)
* [Data Preprocessing](#section-preprocessing)
* [Building Model](#section-six)
    - [Neural Network](#subsection-six-five)
* [Submission](#section-submission)

<a id="section-libraryimportation"></a>
# Import all the required libraries

In [None]:
import numpy as np 
import pandas as pd
import os
import re


from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn import preprocessing
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder

import tensorflow as tf
# import everything from tensorflow.keras so standalone keras and tf.keras objects are not mixed
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical



# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns


[Back to Top](#section-zero)

<a id="section-loadingdatasets"></a>
# Load datasets

In [None]:
df_train = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/train.csv', index_col='id')
df_test = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/test.csv', index_col='id')
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/sample_submission.csv')

df_all = pd.concat([df_train, df_test]).reset_index(drop=True)

In [None]:
df_all.shape

In [None]:
df_test.shape

In [None]:
df_train.head()

[Back to Top](#section-zero)

<a id="section-EDA"></a>
# Exploratory Data Analysis (EDA)

**Missing Values**

Count the missing values in each column of the train dataset with
*dataset.isnull().sum()*

In [None]:
df_train.isnull().sum()

Count the total number of missing values by chaining a second *sum()* onto the command from the previous cell:
*dataset.isnull().sum().sum()*

In [None]:
df_train.isnull().sum().sum()

> The train dataset does not appear to have any missing values.

Count the missing values in each column of the test dataset in the same way, with
*dataset.isnull().sum()*

In [None]:
df_test.isnull().sum()

In [None]:
df_test.isnull().sum().sum()

> The test dataset does not appear to have any missing values either.
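
The two checks above can be folded into one small helper. This is only a sketch: `missing_report` is a hypothetical helper name, and the toy frames stand in for `df_train` and `df_test`.

```python
import numpy as np
import pandas as pd

def missing_report(frames):
    """Return the total number of missing cells for each named DataFrame."""
    return pd.Series({name: int(df.isnull().sum().sum()) for name, df in frames.items()})

# Toy frames (made-up values) standing in for df_train / df_test.
toy_train = pd.DataFrame({"feature_0": [0, 1, np.nan], "feature_1": [2, 0, 5]})
toy_test = pd.DataFrame({"feature_0": [3, 0], "feature_1": [1, 1]})

report = missing_report({"train": toy_train, "test": toy_test})
print(report)
```

In the notebook this would be called as `missing_report({"train": df_train, "test": df_test})`.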

[Back to Top](#section-zero)

**Target Column**

*value_counts()* returns a Series containing the count of each unique value.

The result is sorted in descending order, so the first element is the most frequent value. NA values are excluded by default.

In [None]:
df_train.target.value_counts()
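
When judging class imbalance, relative shares are often easier to read than raw counts. The sketch below uses a made-up toy Series in place of `df_train.target` and the `normalize=True` option of *value_counts()*.

```python
import pandas as pd

# Toy labels (made-up) in the same "Class_<n>" format as the target column.
toy_target = pd.Series(["Class_2", "Class_6", "Class_2", "Class_8", "Class_2", "Class_6"])

counts = toy_target.value_counts()                # absolute counts, most frequent first
shares = toy_target.value_counts(normalize=True)  # same order, as fractions that sum to 1

print(counts)
print(shares.round(3))
```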

In [None]:
target_vc = df_train.target.value_counts()
values = target_vc.values.tolist()
indexes = target_vc.index.tolist()
colors = ['lightskyblue', 'indianred', 'aqua', 'limegreen', 'gold', 'teal', 'coral', 'tan', 'deeppink']

# plt.subplots returns (figure, axes), in that order
fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(20, 8))
ax_bar.bar(indexes, values, color='darkturquoise')
ax_bar.set_title("Target Distribution Bar Chart")
ax_pie.pie(values, colors=colors, labels=indexes)
ax_pie.set_title("Target Distribution Pie Chart")
plt.show()

**Features**

Let's plot the number of unique values of the features.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

y = np.array([df_train[f'feature_{i}'].nunique() for i in range(75)])
y2 = np.array([df_test[f'feature_{i}'].nunique() for i in range(75)])
comp = y-y2


ax.bar(range(75), y2, alpha=0.7, color='darkturquoise', label='Test Dataset')
ax.bar(range(75),  comp*(comp>0), bottom=y2, color='green', alpha=0.7, label='Train > Test')
ax.bar(range(75), comp*(comp<0), bottom=y2-comp*(comp<0), color='red', alpha=0.7, label='Train < Test')

ax.set_yticks(range(0, 120, 5))
ax.margins(0.02)
ax.grid(axis='y', linestyle='--', zorder=5)
ax.set_title('# of Features Unique Values (Train/Test)', loc='left', fontweight='bold')
ax.set_xlabel('Feature')
ax.legend()
plt.show()

Let's list the features that have more unique values in the train dataset than in the test dataset.

In [None]:
pd.DataFrame(data={'feature' : np.arange(75)[comp>0], 
              'delta' : comp[comp>0]}, index=None)

Plot the lower triangle of the feature correlation matrix as a heatmap.

In [None]:
features_set = df_train.drop(labels=['target'],axis=1)
def plot_diag_heatmap(data):
    corr = data.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr, mask=mask, cmap='viridis', center=0,square=True, linewidths=1, cbar_kws={"shrink": 1.0})
plot_diag_heatmap(features_set)

In [None]:
display(df_train.sort_values(by=['target'], ascending=True).head())

In [None]:
display(df_train.sort_values(by=['target'], ascending=False).head())

In [None]:
df_train.shape

In [None]:
df_train.describe()

[Back to Top](#section-zero)

<a id="section-preprocessing"></a>
# Data Preprocessing

In [None]:
target = df_train['target'].apply(lambda x: int(x.split("_")[-1])-1).to_numpy()

In [None]:
target

In [None]:
y_train = tf.keras.utils.to_categorical(target, num_classes=9)

In [None]:
y_train
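
To make the two preprocessing steps above concrete, here is a NumPy-only sketch of what they produce: the label strings are mapped to integers 0 to 8, and each integer becomes a row with a single 1. The `np.eye` indexing is a stand-in used for illustration, not the internals of `to_categorical`, and the labels are made up.

```python
import numpy as np

# Made-up labels in the "Class_<n>" format of the target column.
labels = ["Class_1", "Class_9", "Class_3"]

# Same parsing as the lambda above: take the number after "_" and shift to 0-based.
target = np.array([int(s.split("_")[-1]) - 1 for s in labels])

# Row i of the 9x9 identity matrix is the one-hot vector for class i.
one_hot = np.eye(9, dtype="float32")[target]

print(target)
print(one_hot)
```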

In [None]:
df_train.drop('target', axis=1, inplace=True)

In [None]:
scaler = StandardScaler()

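In [None]:
# fit_transform learns each feature's mean and standard deviation and scales in one step,
# so a separate scaler.fit call is not needed
df_train = scaler.fit_transform(df_train)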

In [None]:
df_train
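
Standardization subtracts each feature's mean and divides by its standard deviation. The statistics are learned from the training data and then reused, unchanged, on any other data passed to the model. The sketch below reproduces that arithmetic with NumPy on made-up numbers; it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature matrices standing in for the train and test features.
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, i.e. population standard deviation

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

With a fitted `StandardScaler`, the last step corresponds to calling `scaler.transform` on the new data.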

In [None]:
# Train/validation split: hold out 20% of the training data for validation
train_X, val_X, train_y, val_y = train_test_split(df_train,y_train,random_state=1,test_size=0.2)

In [None]:
train_X

In [None]:
train_y

[Back to Top](#section-zero)

<a id="section-six"></a>
# Building Model

<a id="subsection-six-five"></a>
**Neural Network**

**Define Callbacks**

In [None]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)


early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=5, # how many epochs to wait before stopping
    restore_best_weights=True,
)
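
To see how the `patience`, `factor`, and `min_lr` settings interact, here is a simplified pure-Python simulation of the plateau rule. It is a sketch, not the Keras implementation: it ignores the callback's `min_delta`, `cooldown`, and `mode` options, and `simulate_reduce_lr` and the accuracy values are hypothetical.

```python
def simulate_reduce_lr(val_accuracy, lr=0.01, patience=3, factor=0.5, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0  # epochs since the monitored value last improved
    history = []
    for acc in val_accuracy:
        if acc > best:
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Made-up validation accuracies: the value stalls for three epochs after epoch 1.
lrs = simulate_reduce_lr([0.70, 0.72, 0.72, 0.71, 0.72, 0.73])
print(lrs)
```

In this toy run the rate is halved once, at the third consecutive epoch without improvement.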

**Define Model**

In [None]:
model = keras.Sequential([
    layers.Dense(units=128, activation='relu', input_shape=[df_train.shape[1]]),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(units=16, activation='relu'),
    layers.Dense(units=8, activation='relu'),
    # the softmax output layer: one probability per class
    layers.Dense(9, activation='softmax'),
])

In [None]:
model.compile(
    optimizer=Adam(learning_rate=0.01),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
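
The `categorical_crossentropy` loss compares the one-hot targets with the softmax probabilities: for each row it takes the negative log of the probability assigned to the true class, then averages over rows. The NumPy sketch below shows that arithmetic on made-up values; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability of the true class."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two made-up samples with three classes each.
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

A confident wrong prediction is penalized heavily, which is why this loss can keep changing even when accuracy is flat.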

In [None]:
history = model.fit(
    train_X, train_y,
    validation_data=(val_X, val_y),
    #batch_size=32,
    epochs=50,
    callbacks=[early_stopping, learning_rate_reduction]
)

In [None]:
score = model.evaluate(val_X, val_y, verbose = 0)
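# score holds the loss and accuracy on the validation split; the loss is not a percentage
print('Validation loss: {:.4f}'.format(score[0]))
print('Validation accuracy: {:.2f}%'.format(score[1] * 100))
print("MLP Error: %.2f%%" % (100 - score[1] * 100))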

Plot the training loss and validation loss for each epoch.

In [None]:
fig, ax = plt.subplots(figsize = (10, 4))
sns.lineplot(x = history.epoch, y = history.history['loss'],color ='red')
sns.lineplot(x = history.epoch, y = history.history['val_loss'], color='blue')
ax.set_title('Learning Curve (Loss)')
ax.set_ylabel('Loss')
ax.set_xlabel('Epoch')
ax.legend(['train', 'validation'], loc = 'best')
plt.show()

[Back to Top](#section-zero)

<a id="section-submission"></a>
# Making the Submission

In [None]:
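# scale the test features with the scaler fitted on the training features before predicting
class_columns = ['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6', 'Class_7', 'Class_8', 'Class_9']
sample_submission[class_columns] = model.predict(scaler.transform(df_test))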
sample_submission.to_csv('submission.csv', index = False)
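
Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible check, with `check_submission` as a hypothetical helper; it assumes the file has an `id` column followed by `Class_1` to `Class_9`, and it uses a made-up two-row frame in place of `sample_submission`.

```python
import numpy as np
import pandas as pd

class_cols = [f"Class_{i}" for i in range(1, 10)]

def check_submission(sub, expected_rows):
    """Return True if the row count matches and each row is a valid probability vector."""
    probs = sub[class_cols].to_numpy()
    return bool(
        len(sub) == expected_rows
        and not np.isnan(probs).any()
        and np.allclose(probs.sum(axis=1), 1.0, atol=1e-4)
    )

# Made-up frame: uniform probabilities and hypothetical id values.
toy_sub = pd.DataFrame(np.full((2, 9), 1 / 9), columns=class_cols)
toy_sub.insert(0, "id", [200000, 200001])

ok = check_submission(toy_sub, expected_rows=2)
print(ok)
```

In the notebook the call would be `check_submission(sample_submission, len(df_test))`.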

[Back to Top](#section-zero)