# 1. Data preprocessing

## Importing the relevant libraries and datasets

In [18]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [4]:
train_df=pd.read_csv('Datasets/train.csv')
train_df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
test_df=pd.read_csv('Datasets/test.csv')
test_df.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As we can see, the data comes as two files: a training dataset and a testing dataset. We will also create a separate validation dataset. The `label` column in the training dataset holds the actual digit that the model must predict; the testing file has no `label` column.


Since we are dealing with image data in the form of pixel values, we will need to preprocess our data for it to be fed into the neural network.
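
To make the "pixel values" concrete: each row has 784 pixel columns (`pixel0` to `pixel783`), which matches a 28×28 image flattened into one row. Below is a minimal sketch of undoing that flattening. It assumes a row-major layout (pixel index = row × 28 + column), and `flat_pixels` is a hypothetical stand-in for one row of the dataframe, not real data.

```python
import numpy as np

# Hypothetical stand-in for one row of pixel columns (pixel0 ... pixel783).
# The values are just the pixel indices, so the layout is easy to inspect.
flat_pixels = np.arange(784)

# Assumed row-major layout: pixel index = row * 28 + column.
image = flat_pixels.reshape(28, 28)

print(image.shape)    # (28, 28)
print(image[1, 0])    # 28, the first pixel of the second row
```

The network in this notebook is fed the flat rows directly, so this reshape is only useful for inspecting or plotting individual digits.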

## Extracting data from the csv files

Looking at the training dataframe, we can see that the first column, `label`, is the target column, while every other column is an input for the neural net.

Let us try to separate this data.

In [13]:
unscaled_inputs=train_df.iloc[:,1:].values

In [17]:
targets=train_df.iloc[:,0].values
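
Positional slicing with `iloc` relies on `label` being the first column. As a hedged alternative, the same separation can be written by column name. The sketch below uses a tiny hypothetical frame (`toy_df` and its values are made up for illustration) with the same layout as the training data.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with the same layout as train.csv:
# the label column first, then the pixel columns.
toy_df = pd.DataFrame({
    "label": [1, 0, 4],
    "pixel0": [0, 0, 0],
    "pixel1": [0, 255, 12],
})

# Select the target by name and treat every remaining column as an input.
toy_targets = toy_df["label"].values
toy_inputs = toy_df.drop(columns=["label"]).values

# For this layout, name-based and position-based selection agree.
assert np.array_equal(toy_targets, toy_df.iloc[:, 0].values)
assert np.array_equal(toy_inputs, toy_df.iloc[:, 1:].values)
```

Selecting by name keeps working if an extra column ever appears in front of `label`, which positional slicing would silently get wrong.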

## Standardize the inputs

In [19]:
scaled_inputs=preprocessing.scale(unscaled_inputs)
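
`preprocessing.scale` standardizes each column to zero mean and unit variance. The NumPy-only sketch below shows the same idea on a toy array. `standardize_columns` and `toy_pixels` are hypothetical names, and the handling of constant columns (dividing by 1 so they become all zeros) is an assumption of this sketch, not a statement about the library's internals.

```python
import numpy as np

def standardize_columns(x):
    """Column-wise standardization: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # A constant column has zero spread; divide by 1 instead so it becomes all zeros.
    safe_std = np.where(std == 0, 1.0, std)
    return (x - mean) / safe_std

# Three toy "images" with three pixel columns; the first and last columns are constant.
toy_pixels = np.array([[0, 10, 5],
                       [0, 20, 5],
                       [0, 30, 5]])

toy_scaled = standardize_columns(toy_pixels)
print(toy_scaled.mean(axis=0))  # every column mean is (numerically) zero
print(toy_scaled.std(axis=0))   # 1 for the varying column, 0 for the constant ones
```

Note that the notebook scales all 42000 rows at once, so the column means and deviations also include rows that later end up in the validation and test sets. A stricter variant would compute the statistics on the training rows only and reuse them for the other two sets.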

## Shuffling the data

If the rows are stored in some particular order (for example, grouped by digit), slicing them into subsets would give each subset a different mix of labels. Shuffling the data before splitting makes the subsets more homogeneous and prevents this kind of bias in the model.

In [22]:
total_indices=scaled_inputs.shape[0]

In [25]:
print('Total amount of data in the training dataset: {}'.format(total_indices))

Total amount of data in the training dataset: 42000


Let us now shuffle all 42000 row indices so that the rows can be reordered at random.

In [30]:
shuffled_indices=np.arange(total_indices)

In [33]:
np.random.shuffle(shuffled_indices)

In [34]:
shuffled_indices

array([15662, 22463,  3731, ..., 28468, 26921,  4389])

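As we can see, the indices are now in random order. Indexing the inputs and the targets with the same shuffled indices keeps every input row paired with its own target.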

In [38]:
shuffled_inputs=scaled_inputs[shuffled_indices]
shuffled_targets=targets[shuffled_indices]
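
`np.random.shuffle` gives a different order on every run. If a reproducible split is wanted, one option is a seeded generator. This is a minimal sketch on toy arrays; the seed value 42 and the `toy_` names are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator: the same seed always yields the same permutation.
rng = np.random.default_rng(seed=42)

# Toy data built so that row i has target i, which makes alignment easy to check.
toy_inputs = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
toy_targets = np.array([0, 1, 2, 3])

# One permutation of the row indices, applied to both arrays.
order = rng.permutation(len(toy_targets))
toy_shuffled_inputs = toy_inputs[order]
toy_shuffled_targets = toy_targets[order]

# Each input row still sits next to its own target.
assert np.array_equal(toy_shuffled_inputs[:, 0], toy_shuffled_targets)
```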

## Splitting the dataset into train, validation and test sets

In [39]:
samples_count=total_indices

train_samples_count=int(0.8*samples_count)
validation_samples_count=int(0.1*samples_count)
test_samples_count=samples_count-train_samples_count-validation_samples_count

As we can see from the code above, we have allocated **80%** of the dataset for **training**, **10%** for **validation** and the remaining **10%** for **testing**.
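
To check the arithmetic, the same counts can be computed in a small helper. `split_counts` is a hypothetical name used only for this sketch; the fractions are the ones used above.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes whatever is left."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    # Using the remainder guarantees that the three counts add up to samples_count.
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(42000))  # (33600, 4200, 4200)
```

Taking the test count as the remainder means no sample is dropped when the total is not a multiple of ten.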

In [43]:
train_inputs=shuffled_inputs[:train_samples_count]
train_targets=shuffled_targets[:train_samples_count]

validation_inputs=shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets=shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs=shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets=shuffled_targets[train_samples_count+validation_samples_count:]

With the code above, we have sliced the shuffled data into train, validation and test sets, keeping the inputs and the targets separate in each.

## Saving the three datasets in .npz format for use in the neural network

In [44]:
np.savez('MNIST_train',inputs=train_inputs,target=train_targets)
np.savez('MNIST_validation',inputs=validation_inputs,target=validation_targets)
np.savez('MNIST_test',inputs=test_inputs,target=test_targets)
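
`np.savez` appends the `.npz` extension when the file name does not already have it, and the arrays are stored under the keyword names used in the call (here `inputs` and `target`). Below is a minimal sketch of a save and load round trip. It uses a temporary directory and toy arrays (`toy_inputs`, `toy_targets`, `toy_train` are made-up names), so it does not touch the real `MNIST_*.npz` files.

```python
import os
import tempfile

import numpy as np

toy_inputs = np.array([[0.0, 1.0], [2.0, 3.0]])
toy_targets = np.array([7, 3])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train")

    # Same keyword names as in the notebook: 'inputs' and 'target'.
    np.savez(path, inputs=toy_inputs, target=toy_targets)

    # np.savez wrote "toy_train.npz", so the extension is needed when loading.
    with np.load(path + ".npz") as npz:
        loaded_keys = sorted(npz.files)
        loaded_inputs = npz["inputs"]
        loaded_targets = npz["target"]

print(loaded_keys)  # ['inputs', 'target']
```

When loading the saved datasets later, the targets are therefore read with the key `target` (singular), matching the name used when saving.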