# Deep Neural Network for MNIST Classification

The dataset provides 70,000 images (28x28 pixels) of handwritten digits (1 digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

The goal here is to build a neural network with 2 hidden layers.

Each image consists of 28 * 28 = 784 pixels, and each pixel will be an input to the neural network.

Each pixel value is the intensity of the colour, from 0 to 255 (black to white).
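
As a small illustration of the 28 * 28 = 784 inputs, the sketch below flattens one image into a vector and scales it to the range 0 to 1, which is what the notebook does later before training. The image is random stand-in data (an assumption, not a real MNIST digit), and the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical stand-in for one MNIST image: 28x28 grayscale values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the 2-D pixel grid into a vector with one entry per pixel
inputs = image.reshape(-1)

# Scale the intensities from [0, 255] to [0, 1]
scaled = inputs.astype(np.float32) / 255.0

print(inputs.shape)  # (784,)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)  # True True
```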

**The strategy will be:**

    1- Prepare and preprocess the data (create training, validation and test datasets).
    2- Outline a model and choose the activation functions.
    3- Set an appropriate optimizer and loss function.
    4- Train the model (backpropagation updates the weights during each epoch).
    5- Test the accuracy of the model.
  
  



## Import libraries

In [1]:
import numpy as np
import tensorflow as tf 
import tensorflow_datasets as tfds

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2



## Loading Data

In [2]:
# as_supervised=True loads the dataset as 2-tuples (input, target); otherwise each element is a dictionary
# with_info=True also returns an object with information about the version, features and number of samples
# The information about the dataset is stored in mnist_info

mnist_dataset, mnist_info = tfds.load(name="mnist", with_info=True, as_supervised=True)

### Splitting data into Train, Validation and Test datasets

In [12]:
# Extracting Training and Test data
mnist_train = mnist_dataset["train"] # Extract the training data
mnist_test = mnist_dataset["test"] # Extract the test data


# Reserve 10% of the training data as validation data
num_validation_samples = 0.1 * mnist_info.splits["train"].num_examples # 10% of the number of training samples
num_validation_samples = tf.cast(num_validation_samples, tf.int64) # Make sure number is an integer


# Get the number of test samples
num_test_samples = mnist_info.splits["test"].num_examples # Get number of test data
num_test_samples = tf.cast(num_test_samples, tf.int64) # Make sure the number is an integer

# Scale the data to make it more numerically stable (e.g. inputs between 0 and 1)
def scale_inputs(image, label): # takes an image and label
    image = tf.cast(image, tf.float32) # Make sure all image values are floats
    # Since all the MNIST images contain values of 0 to 255, dividing images by 255 will give values between 0 and 1
    image /= 255.0 # the ".0" enforces the results to be of type float
    return image, label

# Scaling the Training and validation data
scaled_trained_and_validation_data = mnist_train.map(scale_inputs)
test_data = mnist_test.map(scale_inputs) # Scale the test data the same way as the training and validation data


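# Shuffle the data so that each batch contains a random mix of samples

# shuffle() fills a buffer with BUFFER_SIZE samples, draws samples from it at random and refills it
# with the next samples until the whole dataset has been consumed.
# This limits memory use when the whole dataset cannot be loaded into the computer's memory at once.

BUFFER_SIZE = 10000
# reshuffle_each_iteration=False keeps the shuffled order fixed on every pass, so the take/skip split below
# stays disjoint; with the default (reshuffle on every pass) validation samples can end up in later training epochs
shuffled_scaled_trained_and_validation_data = scaled_trained_and_validation_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)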

# Extract the validation data from the newly shuffled dataset
validation_data = shuffled_scaled_trained_and_validation_data.take(num_validation_samples)

# Extract the training data from the same shuffled dataset, skipping the validation samples
training_data = shuffled_scaled_trained_and_validation_data.skip(num_validation_samples)

# Use mini-batch gradient descent to train the model: it balances training speed against the quality of each gradient estimate
BATCH_SIZE = 100
training_data = training_data.batch(BATCH_SIZE) # Groups consecutive samples into batches of BATCH_SIZE (adds a leading batch dimension)
# No weight updates happen during validation and testing, so each of these sets is taken as one single batch
validation_data = validation_data.batch(num_validation_samples)
test_data = test_data.batch(num_test_samples)

validation_inputs, validation_targets = next(iter(validation_data))
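
To make the shuffle, take, skip and batch steps more concrete, here is a simplified pure-Python model of them. It is only a sketch: `shuffle_buffer` is a hypothetical helper written for this illustration, not the TensorFlow implementation, and integers stand in for the 60,000 training images.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in a buffer-shuffled order (simplified model of a shuffle buffer)."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is full: emit one random element, freeing a slot
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

NUM_TRAIN_EXAMPLES = 60000   # size of the MNIST "train" split
BUFFER_SIZE = 10000
BATCH_SIZE = 100

rng = random.Random(0)
shuffled = list(shuffle_buffer(range(NUM_TRAIN_EXAMPLES), BUFFER_SIZE, rng))

num_validation = NUM_TRAIN_EXAMPLES // 10     # 10% of the training data
validation = shuffled[:num_validation]        # plays the role of .take(n)
training = shuffled[num_validation:]          # plays the role of .skip(n)

# Group the training samples into mini-batches of BATCH_SIZE
batches = [training[i:i + BATCH_SIZE] for i in range(0, len(training), BATCH_SIZE)]
print(len(validation), len(training), len(batches))  # 6000 54000 540
```

The 540 batches match the `540/540` shown in the training log. In this model the validation samples can only come from the first `BUFFER_SIZE + num_validation` records, because the buffer never looks further ahead than that.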

# Model

## Outlining the Model

Deep learning is mostly about building models. The model here is a feed-forward network with 784 inputs, two hidden layers and 10 outputs.

In [19]:
input_size = 784
output_size = 10
# Use same hidden layer size for both hidden layers. Not a necessity.
# hidden_layer_size = 50
hidden_layer_size = 100

# Defining what the model will look like

model = tf.keras.Sequential([
    # the first layer (the input layer)
    # each observation is 28x28x1 pixels, therefore it is a tensor of rank 3
    # this network uses no convolutional layers (CNNs), so the images must be flattened before they are fed in
    # the convenient 'Flatten' layer takes the 28x28x1 tensor and reorders it into a
    # (28*28*1,) = (784,) vector
    # this allows us to build a feed-forward neural network
    
    tf.keras.layers.Flatten(input_shape=(28,28,1)), # input layer
    
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
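
To show what the layers above compute, the following NumPy sketch runs a forward pass through the same 784-100-100-10 architecture, applying `output = activation(dot(input, weight) + bias)` at each layer. The weights are random illustrative values (an assumption, not the Keras initialisation), and the helper names are chosen for this example only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
layer_sizes = [784, 100, 100, 10]  # input, two hidden layers, output

# Random weights and zero biases (illustrative values only)
params = [(rng.normal(scale=0.05, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(images):
    x = images.reshape(len(images), -1)   # Flatten: (batch, 28, 28, 1) -> (batch, 784)
    for W, b in params[:-1]:
        x = relu(x @ W + b)               # hidden Dense layers
    W, b = params[-1]
    return softmax(x @ W + b)             # output Dense layer

batch = rng.random((5, 28, 28, 1))        # hypothetical batch of 5 scaled images
probs = forward(batch)
num_params = sum(W.size + b.size for W, b in params)
print(probs.shape, num_params)  # (5, 10) 89610
```

Each row of `probs` sums to 1, so it can be read as a probability distribution over the 10 digits.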

# Choose the optimizer and the loss function

In [20]:
# Define the optimizer to be used,
# the loss function (sparse_categorical_crossentropy, because the targets are integer labels rather than one-hot vectors)
# and the metrics to be reported at each epoch

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
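
The choice of loss depends on how the targets are encoded. As a minimal sketch for a single sample, the two functions below (written for this illustration, not the Keras implementations) show that the sparse variant with an integer label gives the same value as the one-hot variant.

```python
import math

def sparse_categorical_crossentropy(probabilities, target):
    """Loss for one sample; target is an integer class index."""
    return -math.log(probabilities[target])

def categorical_crossentropy(probabilities, one_hot_target):
    """Loss for one sample; target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

probabilities = [0.05, 0.05, 0.7, 0.2]   # hypothetical softmax output over 4 classes
target = 2                                # integer label
one_hot = [0, 0, 1, 0]                    # the same label, one-hot encoded

loss_sparse = sparse_categorical_crossentropy(probabilities, target)
loss_one_hot = categorical_crossentropy(probabilities, one_hot)
print(round(loss_sparse, 4), round(loss_one_hot, 4))  # 0.3567 0.3567
```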

# Training the Model

Train the model that was just built.

**During Training for each epoch:**

    1- At the beginning of each epoch, the training loss is set to 0
    2- The algorithm iterates over a preset number of batches from training_data
    3- The weights and biases are updated as many times as there are batches
    4- The value of the loss function is displayed to show how the training is going
    5- The training accuracy is also displayed
    6- At the end of each epoch, the algorithm forward propagates the whole validation dataset
    7- The above steps are repeated as many times as there are epochs; training is over at the end of the last epoch.
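
The steps above can be sketched as an explicit loop. This is an illustration under stated assumptions, not what `model.fit` does internally: it trains a single softmax layer with plain mini-batch gradient descent (instead of Adam) on small synthetic data, and all names and sizes are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class problem (hypothetical data standing in for MNIST)
X = rng.normal(size=(600, 20))
y = (X @ rng.normal(size=(20, 3))).argmax(axis=1)
X_train, y_train = X[:500], y[:500]
X_val, y_val = X[500:], y[500:]

def predict(X, W, b):
    z = X @ W + b
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_accuracy(p, y):
    loss = -np.log(p[np.arange(len(y)), y]).mean()
    return loss, (p.argmax(axis=1) == y).mean()

W, b = np.zeros((20, 3)), np.zeros(3)
BATCH_SIZE, NUM_EPOCHS, LEARNING_RATE = 100, 6, 0.3
history, num_updates = [], 0

for epoch in range(NUM_EPOCHS):
    epoch_loss, num_batches = 0.0, 0                   # step 1: reset the training loss
    for start in range(0, len(X_train), BATCH_SIZE):   # step 2: iterate over the batches
        xb = X_train[start:start + BATCH_SIZE]
        yb = y_train[start:start + BATCH_SIZE]
        p = predict(xb, W, b)
        batch_loss, _ = loss_and_accuracy(p, yb)
        grad = p.copy()
        grad[np.arange(len(yb)), yb] -= 1.0            # gradient of the loss w.r.t. the logits
        grad /= len(yb)
        W -= LEARNING_RATE * (xb.T @ grad)             # step 3: one update per batch
        b -= LEARNING_RATE * grad.sum(axis=0)
        epoch_loss += batch_loss
        num_batches += 1
        num_updates += 1
    # step 6: forward propagate the whole validation set
    val_loss, val_acc = loss_and_accuracy(predict(X_val, W, b), y_val)
    history.append({"loss": epoch_loss / num_batches, "val_loss": val_loss, "val_accuracy": val_acc})
    print(f"Epoch {epoch + 1}/{NUM_EPOCHS} - loss: {history[-1]['loss']:.4f} "
          f"- val_loss: {val_loss:.4f} - val_accuracy: {val_acc:.4f}")
```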
    

In [23]:
# NUM_EPOCHS = 5 # Number of epochs for which the Model should be trained
NUM_EPOCHS = 6

# Fit the model, specifying the
# training data
# the total number of epochs
# and the validation data created in the format: (inputs,targets)

model.fit(training_data, epochs=NUM_EPOCHS, validation_data=(validation_inputs, validation_targets), verbose=2)

Epoch 1/6
540/540 - 6s - loss: 0.0193 - accuracy: 0.9940 - val_loss: 0.0196 - val_accuracy: 0.9942
Epoch 2/6
540/540 - 5s - loss: 0.0182 - accuracy: 0.9939 - val_loss: 0.0258 - val_accuracy: 0.9918
Epoch 3/6
540/540 - 5s - loss: 0.0133 - accuracy: 0.9961 - val_loss: 0.0215 - val_accuracy: 0.9937
Epoch 4/6
540/540 - 5s - loss: 0.0126 - accuracy: 0.9960 - val_loss: 0.0131 - val_accuracy: 0.9958
Epoch 5/6
540/540 - 5s - loss: 0.0138 - accuracy: 0.9952 - val_loss: 0.0177 - val_accuracy: 0.9937
Epoch 6/6
540/540 - 5s - loss: 0.0109 - accuracy: 0.9967 - val_loss: 0.0117 - val_accuracy: 0.9955


<tensorflow.python.keras.callbacks.History at 0x19eb4a6ed00>

Changing hidden_layer_size from 50 to 100 increased the validation accuracy (val_accuracy) from 97% to 98% (0.9808).

Changing NUM_EPOCHS from 5 to 6 increased the validation accuracy (val_accuracy) from 98% to 99% (0.9955).

# Testing the Model

**Training** 

**Validation**

**Testing**

During training, overfitting was kept in check by validating the model on validation_data.
After the first training run, however, each modification of the hyperparameters effectively overfitted the validation dataset.

After training on the training data and validating on the validation data, the final predictive power of the model is obtained by running it on the test dataset, which the model has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the final step, so it should only be run once the model has been completely adjusted.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [25]:
test_loss, test_accuracy = model.evaluate(test_data)



In [26]:
# We can apply some nice formatting if we want to
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

Test loss: 0.09. Test accuracy: 98.03%
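
For reference, the accuracy metric is the fraction of samples whose most probable class matches the target. The sketch below computes it by hand on four made-up predictions; the `accuracy` helper and the data are assumptions for illustration, not the Keras metric itself.

```python
def accuracy(probabilities, targets):
    """Fraction of samples whose most probable class equals the target."""
    predictions = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    correct = sum(pred == t for pred, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical softmax outputs for 4 samples over 3 classes
probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
targets = [0, 1, 2, 1]   # the last sample is misclassified

test_accuracy = accuracy(probabilities, targets)
print('Test accuracy: {0:.2f}%'.format(test_accuracy * 100.))  # Test accuracy: 75.00%
```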


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly 97%.

Each time the code is rerun, the accuracy differs because the batches are shuffled differently, the weights are initialized differently, etc.

Finally, the solution reached here is intentionally suboptimal; the model could nevertheless be deployed at this point.