# 1. Data preprocessing

## Importing the relevant libraries and datasets

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [2]:
train_df=pd.read_csv('Datasets/train.csv')
train_df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
test_df=pd.read_csv('Datasets/test.csv')
test_df.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As we can see, the above data is divided into training and testing datasets. We will also create a separate validation dataset. The label column in the training dataset gives the actual digit that the model must predict.


Since we are dealing with image data in the form of pixel values, we will need to preprocess our data for it to be fed into the neural network.

## Extracting data from the csv files

Glancing through the training dataframe, we can see that the first column, named label, is the target column, while all the other columns are inputs for the neural net.

Let us try to separate this data.

In [4]:
unscaled_inputs=train_df.iloc[:,1:].values

In [5]:
targets=train_df.iloc[:,0].values

## Standardize the inputs

In [6]:
scaled_inputs=preprocessing.scale(unscaled_inputs)
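As a rough illustration of what standardization does, the sketch below rescales each column to zero mean and unit variance using NumPy only. To my understanding this is approximately what `preprocessing.scale` does with its default arguments, but the helper `standardize_columns` and the demo array are hypothetical and not part of the notebook.

```python
import numpy as np

def standardize_columns(x):
    """Scale each column to zero mean and unit variance (population std)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Constant columns (e.g. border pixels that are always 0) have std == 0;
    # dividing by 1 instead leaves them at 0 after centering.
    safe_std = np.where(std == 0, 1.0, std)
    return (x - mean) / safe_std

# Hypothetical 3-sample, 3-feature input; the first column is constant.
demo = np.array([[0, 10, 0],
                 [0, 20, 4],
                 [0, 30, 8]])
scaled_demo = standardize_columns(demo)
```

The constant first column stays at zero instead of producing a division by zero, which matters for MNIST-style data where many border pixels never change.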

## Shuffling the data

In case the data was arranged in some particular order, we want to remove any bias by shuffling the data completely. This makes the dataset more homogeneous, so that every batch and every split contains a similar mix of digits, and prevents any undue bias in the model.

In [7]:
total_indices=scaled_inputs.shape[0]

In [8]:
print('Total amount of data in the training dataset: {}'.format(total_indices))

Total amount of data in the training dataset: 42000


Let us now shuffle all these 42000 indices to make the data homogeneous in nature.

In [9]:
shuffled_indices=np.arange(total_indices)

In [10]:
np.random.shuffle(shuffled_indices)

In [11]:
shuffled_indices

array([20500, 26758, 18709, ..., 19864, 12634, 28051])

As we can see, the indices have now all been shuffled.

In [12]:
shuffled_inputs=scaled_inputs[shuffled_indices]
shuffled_targets=targets[shuffled_indices]
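The key point of the cell above is that one set of shuffled indices is applied to both arrays, so each row stays paired with its label. A minimal sketch of the same idea, assuming a seeded generator for reproducibility (the seed value and the demo arrays are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed value is an arbitrary assumption

demo_inputs = np.arange(12).reshape(6, 2)    # 6 samples, 2 features each
demo_targets = np.array([0, 1, 2, 3, 4, 5])  # one label per sample

# One permutation, applied to both arrays, keeps every row paired with its label.
order = rng.permutation(len(demo_inputs))
demo_shuffled_inputs = demo_inputs[order]
demo_shuffled_targets = demo_targets[order]
```

Fixing the seed is optional; it only makes the split repeatable between runs.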

## Splitting the dataset into train, validation and test sets

In [13]:
samples_count=total_indices

train_samples_count=int(0.8*samples_count)
validation_samples_count=int(0.1*samples_count)
test_samples_count=samples_count-train_samples_count-validation_samples_count

As the code above shows, we have allocated **80%** of the dataset for **training**, **10%** for **validation** and the remaining **10%** for **testing**.
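To make the arithmetic concrete, the sketch below applies the same formulas to the 42000 rows of the training file. The helper name `split_counts` is hypothetical; the fractions are the ones used in the cell above.

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train_count = int(train_fraction * samples_count)
    validation_count = int(validation_fraction * samples_count)
    test_count = samples_count - train_count - validation_count
    return train_count, validation_count, test_count

counts = split_counts(42000)
print(counts)
```

Because the test count is computed as a remainder, the three sizes always add up to the total even when the fractions do not divide the sample count evenly.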

In [14]:
train_inputs=shuffled_inputs[:train_samples_count]
train_targets=shuffled_targets[:train_samples_count]

validation_inputs=shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets=shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs=shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets=shuffled_targets[train_samples_count+validation_samples_count:]

The code above separates the train, validation and test data, and keeps the inputs separate from the targets as well.

## Saving the three datasets in .npz format for use by the neural network

In [15]:
np.savez('MNIST_train',inputs=train_inputs,target=train_targets)
np.savez('MNIST_validation',inputs=validation_inputs,target=validation_targets)
np.savez('MNIST_test',inputs=test_inputs,target=test_targets)

## Loading the NPZ files

In [21]:
npz=np.load('MNIST_train.npz')
train_inputs=npz['inputs'].astype(float)   #np.float was removed from recent NumPy releases
train_targets=npz['target'].astype(int)    #np.int was removed as well

In [22]:
npz=np.load('MNIST_test.npz')
test_inputs=npz['inputs'].astype(float)
test_targets=npz['target'].astype(int)

In [23]:
npz=np.load('MNIST_validation.npz')
validation_inputs=npz['inputs'].astype(float)
validation_targets=npz['target'].astype(int)
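The save and load steps can be checked with a small round trip. The sketch below writes a hypothetical two-sample split to a temporary directory (the file name `demo_split` and the arrays are assumptions for illustration) and reads it back using the same `inputs` and `target` keys as the notebook.

```python
import os
import tempfile
import numpy as np

demo_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
demo_targets = np.array([3, 7])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split')  # no extension given
    np.savez(path, inputs=demo_inputs, target=demo_targets)
    saved_files = os.listdir(tmp_dir)           # np.savez appended '.npz'

    # np.load on an .npz returns a lazy archive; the context manager closes it.
    with np.load(path + '.npz') as archive:
        loaded_inputs = archive['inputs'].astype(float)
        loaded_targets = archive['target'].astype(int)
```

This also shows why the notebook saves to `'MNIST_train'` but loads from `'MNIST_train.npz'`: `np.savez` adds the extension when it is missing.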

# 2. Deep learning model

## Creating the neural network model

From the CSV files, it is clear that we have the values for a total of **784 pixels** for each digit. Each image was originally a rank 3 tensor of shape **28 X 28 X 1**, and the CSV stores it already flattened into a single row of 784 values.

A 28 X 28 X 1 tensor cannot be fed directly into a simple dense neural network, whereas a convolutional neural network has no issue with such an input. Had we loaded the images in their original shape, we would need the flattening layer provided by Keras (`tf.keras.layers.Flatten`). Because the CSV rows are already flat, they can be fed to the dense layers as they are.


As we have 784 pixels for each digit, our input nodes (or values) will be 784.

Let us take the hidden layer size (the number of units in each hidden layer) as 50.

The digits may range from 0-9. Hence, the number of output values is taken as 10


We are implementing **three hidden layers** initially.
The activation function we plan to use for the hidden layers is **ReLU**.


For the output layer, the activation function used is **Softmax**, which converts the 10 output values into probabilities that sum to 1.
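For readers unfamiliar with Softmax, here is a minimal NumPy sketch of the function: it exponentiates each output value and divides by the row sum, so every row becomes a probability distribution. The demo values are hypothetical, and subtracting the row maximum is a common numerical-stability trick rather than something taken from the notebook.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs for two samples and three classes.
demo_logits = np.array([[2.0, 1.0, 0.1],
                        [1000.0, 1000.0, 1000.0]])
demo_probs = softmax(demo_logits)
```

The second row would overflow without the shift; with it, three equal inputs give three equal probabilities.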

In [32]:
input_size=784
output_size=10
hidden_layer_size=50

model=tf.keras.Sequential([
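    #First layer: a Dense layer with 784 units and no activation (linear).
    #It receives the 784 pixel inputs; it is not a separate Keras input layer.
    tf.keras.layers.Dense(input_size),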
    
    #Hidden layer 1
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
    #Hidden layer 2
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
    #Hidden layer 3
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
    
    #Output layer
    tf.keras.layers.Dense(output_size,activation='softmax')
])
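It can help to see how many trainable parameters this stack implies. The sketch below counts weights and biases per Dense layer, assuming the model receives 784 input features and that the first `Dense(input_size)` layer is therefore a full 784-to-784 layer. The helper `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input features, followed by the number of units in each Dense layer.
layer_sizes = [784, 784, 50, 50, 50, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Under these assumptions the first layer alone accounts for the large majority of the parameters, which is worth keeping in mind when tuning the architecture.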

## Choosing the optimizer and the loss function

In [33]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
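The `sparse_categorical_crossentropy` loss is used because the targets are plain integer digits rather than one-hot vectors. A minimal NumPy sketch of the idea: for each sample, take the predicted probability of the true class and average the negative logarithms. The clipping constant `eps` and the demo values are assumptions for illustration; this is not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(probabilities, labels, eps=1e-7):
    """Mean of -log(p[true class]) over the batch; labels are integer class ids."""
    probabilities = np.clip(np.asarray(probabilities, dtype=float), eps, 1.0)
    labels = np.asarray(labels, dtype=int)
    picked = probabilities[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

# Hypothetical predictions for two samples and three classes.
demo_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
demo_labels = np.array([0, 1])
demo_loss = sparse_categorical_crossentropy(demo_probs, demo_labels)
```

A confident wrong prediction contributes a large term, which is why the loss can rise even while accuracy stays roughly constant.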

## Training the model


We have the option of setting an early stopping criterion, which monitors the validation loss and stops training once it has failed to improve for a given number of consecutive epochs (the patience). The patience can be set to any value less than the number of epochs. This helps to control overfitting. Here the callback is passed to `fit` with a patience of 20 epochs.
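The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras callback itself: the function `early_stopping_epoch` is hypothetical, and the validation losses are the nine complete values from the training log in this notebook.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# val_loss for epochs 1 to 9, copied from the training log in this notebook.
logged_val_losses = [0.1498, 0.1525, 0.1683, 0.1516, 0.1482,
                     0.1201, 0.1195, 0.1198, 0.1080]

stop_with_patience_2 = early_stopping_epoch(logged_val_losses, patience=2)
stop_with_patience_20 = early_stopping_epoch(logged_val_losses, patience=20)
```

With a patience of 2 this rule would stop at epoch 3, before the later improvements; with a patience of 20 it never triggers within these nine epochs, which illustrates how strongly the choice of patience affects training length.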




In [49]:
NUM_EPOCHS=50
BATCH_SIZE=100

early_stopping=tf.keras.callbacks.EarlyStopping(patience=20)

model.fit(train_inputs,train_targets,
          batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS,
          callbacks=[early_stopping],
          validation_data=(validation_inputs,validation_targets),
          verbose=2,validation_steps=10)

Train on 33600 samples, validate on 4200 samples
Epoch 1/50
33600/33600 - 2s - loss: 0.0281 - accuracy: 0.9936 - val_loss: 0.1498 - val_accuracy: 0.9570
Epoch 2/50
33600/33600 - 2s - loss: 0.0111 - accuracy: 0.9968 - val_loss: 0.1525 - val_accuracy: 0.9590
Epoch 3/50
33600/33600 - 2s - loss: 0.0095 - accuracy: 0.9975 - val_loss: 0.1683 - val_accuracy: 0.9610
Epoch 4/50
33600/33600 - 2s - loss: 0.0255 - accuracy: 0.9949 - val_loss: 0.1516 - val_accuracy: 0.9550
Epoch 5/50
33600/33600 - 2s - loss: 0.0265 - accuracy: 0.9942 - val_loss: 0.1482 - val_accuracy: 0.9540
Epoch 6/50
33600/33600 - 3s - loss: 0.0117 - accuracy: 0.9966 - val_loss: 0.1201 - val_accuracy: 0.9630
Epoch 7/50
33600/33600 - 2s - loss: 0.0080 - accuracy: 0.9981 - val_loss: 0.1195 - val_accuracy: 0.9620
Epoch 8/50
33600/33600 - 3s - loss: 0.0166 - accuracy: 0.9965 - val_loss: 0.1198 - val_accuracy: 0.9600
Epoch 9/50
33600/33600 - 3s - loss: 0.0125 - accuracy: 0.9964 - val_loss: 0.1080 - val_accuracy: 0.9650
Epoch 10/50
336

<tensorflow.python.keras.callbacks.History at 0x7f824f02ad50>

## Testing the model

Initial testing is done first on the 10% of the training file that we set aside as the test split, to check how the neural net performs.

In [50]:
test_loss,test_accuracy=model.evaluate(test_inputs,test_targets)



In [52]:
print('\n Test loss:{0:.2f} Test accuracy: {1:.2f} %'.format(test_loss,test_accuracy*100))


 Test loss:0.61 Test accuracy: 96.17 %


In [53]:
values=model.predict(test_inputs)

In [57]:
pd.DataFrame(values).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.0,0.0,0.0,5.4631380000000005e-33,0.0,3.549717e-32,0.0,0.0,2.011065e-33
1,0.0,0.0,1.050896e-25,1.0,0.0,1.443968e-12,0.0,5.47698e-30,9.982512e-30,6.561062e-24
2,0.0,1.0,1.067324e-16,1.612615e-20,7.120113e-14,8.602918e-19,3.497817e-17,4.229974e-15,1.437009e-16,1.2083189999999999e-26
3,2.603826e-22,1.5794090000000001e-29,1.5719029999999998e-20,1.354971e-21,2.08697e-13,4.400346e-20,1.010215e-35,3.974627e-18,3.7126180000000002e-22,1.0
4,0.0,1.984955e-31,0.0,0.0,1.0,0.0,0.0,0.0,2.675328e-33,2.271231e-26


As we can see, in each row the entry equal (or very close) to 1 sits in the column whose name is the predicted digit.
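Reading the predicted digit off each row amounts to taking the column index of the largest probability. A minimal sketch, using hypothetical probabilities and targets for three classes, that also recomputes accuracy by hand:

```python
import numpy as np

# Hypothetical predicted probabilities (4 samples, 3 classes) and true labels.
demo_values = np.array([[0.90, 0.05, 0.05],
                        [0.10, 0.20, 0.70],
                        [0.30, 0.60, 0.10],
                        [0.50, 0.40, 0.10]])
demo_targets = np.array([0, 2, 1, 1])

predicted_labels = np.argmax(demo_values, axis=1)      # most probable class per row
accuracy = np.mean(predicted_labels == demo_targets)   # fraction of correct rows
```

The same two lines applied to `values` and `test_targets` should reproduce the accuracy reported by `model.evaluate`, up to rounding.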

## Final testing on new dataset

Once the model has been completely trained, we import the test dataset provided to us.

In [58]:
test_df.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [59]:
unscaled_inputs_test=test_df.values
scaled_inputs=preprocessing.scale(unscaled_inputs_test)
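Note that the cell above standardizes the test file with its own mean and standard deviation, so the test pixels are not scaled exactly as the training pixels were. A common alternative is to compute the statistics on the training data once and reuse them. The NumPy-only sketch below illustrates this; the helper names and demo arrays are hypothetical.

```python
import numpy as np

def fit_standardizer(train_x):
    """Compute per-column mean and std on the training data only."""
    train_x = np.asarray(train_x, dtype=float)
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(x, mean, std):
    """Scale any dataset with statistics that were fitted elsewhere."""
    return (np.asarray(x, dtype=float) - mean) / std

demo_train = np.array([[0.0, 10.0], [0.0, 20.0], [0.0, 30.0]])
demo_test = np.array([[0.0, 20.0], [5.0, 40.0]])

mean, std = fit_standardizer(demo_train)
demo_test_scaled = apply_standardizer(demo_test, mean, std)
```

In scikit-learn the same pattern is usually written with `StandardScaler`, calling `fit` on the training inputs and `transform` on everything else.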

Unlike the training dataset, we shall not shuffle the testing dataset, since we need to preserve the order for submission.

In [61]:
test_inputs=scaled_inputs

In [62]:
test_inputs.shape

(28000, 784)

As we can see, we have 28000 different images with their pixel intensities.

In [63]:
np.savez('Final_test',inputs=test_inputs)

In [64]:
npz_test=np.load('Final_test.npz')
test_inputs=npz_test['inputs'].astype(float)

In [67]:
values_df=pd.DataFrame(model.predict(test_inputs))

In [73]:
values_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,8.09581e-15,3.174967e-21,1.0,3.8432e-16,6.315272e-32,9.411662e-14,5.613766e-11,1.3336360000000001e-25,3.022911e-15,3.326376e-19
1,1.0,8.946794e-31,2.193718e-32,4.355092e-27,3.765841e-20,0.0,1.98422e-24,2.158371e-34,6.676272e-30,8.120166999999999e-19
2,9.377874e-24,3.573833e-17,5.152291e-13,5.891345e-13,8.699407e-13,5.10243e-11,7.084996999999999e-26,4.705539e-11,3.607684e-11,1.0
3,4.066479e-07,4.685435e-10,0.006393158,5.276445e-07,1.975e-06,3.024094e-10,2.636718e-16,4.279735e-08,1.530647e-12,0.9936039
4,1.0804690000000001e-33,1.455272e-26,7.463134e-15,1.0,1.6338760000000002e-31,3.366477e-08,1.7179560000000002e-27,3.042576e-24,3.059523e-16,3.485196e-14


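We now need to extract a label for each entry. Here we convert every element of the dataframe to the int datatype so that only 1s and 0s remain, which is easier to read. Note that this cast truncates rather than rounds, so a row whose largest probability is just below 1 (row 3 above, at 0.9936) becomes all zeros.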

In [74]:
values_df=values_df.astype(int)
values_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0
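The all-zero row 3 in the table above is a consequence of the integer cast, which discards the fractional part. The sketch below reproduces the effect with rows 0 and 3 of the predicted probabilities shown earlier in this notebook.

```python
import numpy as np

# Rows 0 and 3 of the predicted probabilities shown earlier in this notebook.
row_0 = [8.09581e-15, 3.174967e-21, 1.0, 3.8432e-16, 6.315272e-32,
         9.411662e-14, 5.613766e-11, 1.3336360000000001e-25,
         3.022911e-15, 3.326376e-19]
row_3 = [4.066479e-07, 4.685435e-10, 0.006393158, 5.276445e-07, 1.975e-06,
         3.024094e-10, 2.636718e-16, 4.279735e-08, 1.530647e-12, 0.9936039]
probabilities = np.array([row_0, row_3])

truncated = probabilities.astype(int)        # fractional part is discarded
rows_with_a_one = truncated.sum(axis=1) > 0  # True only where an entry reached 1.0
```

Row 3 has a clear winner in column 9, yet it ends up with no 1 at all, so any later step that looks for 1s will treat it as missing.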


In [113]:
predictions_df=pd.DataFrame(values_df[values_df==1].stack())

In [116]:
predictions_df.drop(0,axis=1,inplace=True)

In [117]:
predictions_df.shape

(22520, 0)

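Only 22520 rows still contain a 1 after the integer cast. The network did produce probabilities for the remaining rows too; their largest probability was slightly below 1.0 (for example 0.9936 in row 3 above), so the cast truncated it to 0.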

Let us organise the dataframe properly.

In [119]:
predictions_df.reset_index(inplace=True)

Unnamed: 0,level_0,level_1
0,0,2
1,1,0
2,2,9
3,4,3
4,6,0
5,8,0
6,9,3
7,10,5
8,13,0
9,17,1


In [121]:
predictions_df.rename(columns={'level_1':'Label'},inplace=True)
predictions_df.head()

Unnamed: 0,level_0,Label
0,0,2
1,1,0
2,2,9
3,4,3
4,6,0


In [156]:
image_id=pd.DataFrame(np.arange(0,28000),columns=['ImageId'])
image_id.head()

Unnamed: 0,ImageId
0,0
1,1
2,2
3,3
4,4


In [135]:
predictions_df.rename(columns={'level_0':'ImageId'},inplace=True)
predictions_df.head()

Unnamed: 0,ImageId,Label
0,0,2
1,1,0
2,2,9
3,4,3
4,6,0


In [137]:
final_preds=predictions_df.copy()

In [151]:
final_preds=pd.merge_ordered(final_preds,image_id,on='ImageId',fill_method=None)   #None (not the string 'None') means no filling

In [158]:
final_preds.head()

Unnamed: 0,ImageId,Label
0,0,2.0
1,1,0.0
2,2,9.0
3,3,
4,4,3.0


In [159]:
final_preds['Label'].isna().value_counts()

False    22520
True      5480
Name: Label, dtype: int64

After the truncation, 5480 test cases are left without a label. Let us see how the labelled cases are distributed across the digits.

In [160]:
final_preds['Label'].value_counts()

1.0    3020
0.0    2553
6.0    2515
8.0    2295
4.0    2182
7.0    2167
5.0    2108
9.0    2108
2.0    2018
3.0    1554
Name: Label, dtype: int64

As a crude and inaccurate workaround, we will fill the null values with the mode of the Label column.

In [162]:
final_preds['Label']=final_preds['Label'].fillna(final_preds['Label'].mode()[0])   #assign back instead of inplace on a column selection

In [165]:
final_preds=final_preds.astype(int)

In [167]:
final_preds['ImageId']=final_preds['ImageId']+1

In [169]:
final_preds.to_csv('Final_predictions_ann.csv',index=False)

Unnamed: 0,ImageId,Label
0,1,2
1,2,0
2,3,9
3,4,1
4,5,3
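As an alternative to filling missing labels with the mode, the label of every image can be taken directly as the column with the highest predicted probability, which leaves no row unlabelled. The sketch below builds a submission-style dataframe this way; the probabilities are hypothetical, and the `ImageId` and `Label` column names follow the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities for three images and three classes.
demo_values = np.array([[0.01, 0.98, 0.01],
                        [0.60, 0.30, 0.10],
                        [0.20, 0.25, 0.55]])

submission = pd.DataFrame({
    'ImageId': np.arange(1, len(demo_values) + 1),  # ids start at 1, as in the notebook
    'Label': demo_values.argmax(axis=1),            # most probable class per row
})
```

Applied to the real predictions, the same pattern would replace the stack, merge and fill steps with a single `argmax`, and rows such as the one with probability 0.9936 would keep their predicted digit.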
