## MNIST dataset with Neural Networks!

This week we are going to play with neural networks (NNs) using our old MNIST dataset! As usual, please add the MNIST dataset and run the first cell to get its path.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os
for dirname in os.listdir('/kaggle/input/phys591000-2022-week06/'):
    print(dirname,"/")
    for filename in os.listdir('/kaggle/input/phys591000-2022-week06/'+ dirname):
        print(filename)
    print("\n")

    
    
# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Load and transform the MNIST dataset

We'll use the first 10000 samples from the MNIST dataset, and treat each pixel as a feature (i.e. we have 28x28=784 features). But this time we don't have to do anything with the x_train and x_test--we'll take care of the flatten/reshape when we build the NN model.

On the other hand, we need to translate the ground truth y_train and y_test from a single digit to one-hot encoding, i.e.
0 -> [1,0,...,0], 1 -> [0,1,0,...,0], ..., 9 -> [0,0,...,1]

(The reason why that's necessary will be clearer later).  The code in the next cell shows how to do that using 'np.eye'. Please check that you understand how it works.
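
As a standalone illustration of the indexing trick (separate from the notebook cell that follows), here is a minimal sketch. The `labels` array is made up for the example. It indexes the identity matrix with a whole array at once, which should give the same result as the list comprehension used in the notebook.

```python
import numpy as np

# Hypothetical labels, used only for illustration
labels = np.array([0, 3, 9])

# np.eye(10) is the 10x10 identity matrix, so row n is the one-hot vector for digit n.
# Indexing with an integer array picks one row per label.
one_hot = np.eye(10)[labels]

print(one_hot.shape)   # (3, 10)
print(one_hot[1])      # the one-hot vector for digit 3

# np.argmax along each row recovers the original digits
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

Going back from one-hot to digits with `np.argmax` is the same operation used later to turn the NN output into a predicted digit.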


In [None]:
# Use only the first 10000 training samples
# Use np.eye for one-hot encoding
# 0 -> [1,0,...,0], 1 -> [0,1,0,...,0], ..., 9 -> [0,0,...,1]

mnist = np.load('/kaggle/input/mnist-numpy/mnist.npz')
x_train = mnist['x_train'][:10000]/255.
y_train = np.array([np.eye(10)[n] for n in mnist['y_train'][:10000]])
x_test = mnist['x_test']/255.
y_test = np.array([np.eye(10)[n] for n in mnist['y_test']])

print('x_train shape is: ', x_train.shape)
print('y_train shape is: ', y_train.shape)

Now we will build a NN model with 784 neurons in the input layer (since we have 784 features), one hidden layer with 30 neurons, and an output layer with 10 neurons (since we have 10 possible outcomes, 0-9). 
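
To get a feel for the size of this model, the number of trainable parameters can be counted by hand. This is a small sketch assuming fully connected layers with one bias per neuron (which is what a default `Dense` layer uses); the variable names are only for illustration.

```python
# Layer sizes of the 784-30-10 network described above
layer_sizes = [784, 30, 10]

# Each dense layer has (inputs x outputs) weights plus one bias per output neuron
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)  # [23550, 310]
print(total_params)      # 23860
```

The total can be compared with the number reported by `model.summary()` once the model is built.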

Run the cell below to see the performance. Pay attention to how we specify hyperparameters such as the activation functions (of each layer) and the loss function. **Please make sure you understand what each line in the code means.**

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape
from tensorflow.keras.optimizers import SGD

# build a 784-30-10 model
model = Sequential()
model.add(Reshape((784,), input_shape=(28,28)))
model.add(Dense(30, activation='sigmoid'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=1.0),
              metrics=['accuracy'])

rec = model.fit(x_train, y_train, epochs=50, batch_size=100,
                validation_data=(x_test, y_test))


print('Performance (training)')
print('Loss: %.5f, Acc: %.5f' % tuple(model.evaluate(x_train, y_train)))
print('Performance (testing)')
print('Loss: %.5f, Acc: %.5f' % tuple(model.evaluate(x_test, y_test)))

## A close look at the prediction from the NN

As before, we'd like to see which images have fooled our NN. The first thing to check is exactly what the prediction looks like. In the next cell, print out the prediction for the first image in x_test, and compare that to the ground truth (y_test) of this image. What do you find?

Hint: The prediction can be fetched via model.predict('name of test sample').

In [None]:
# Print the first element of the prediction and compare to y_test

p_test = model.predict(x_test)

print('The first element of the prediction is: ',p_test[0])
print('The first element of the ground truth is: ',y_test[0])

The prediction is an array of 10 numbers, all between 0 and 1. What do you think it means? How does that relate to the 'real' prediction it should make (i.e. a digit between 0 and 9)?
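
One way to think about these 10 numbers: the softmax activation of the output layer turns 10 arbitrary real values into non-negative numbers that sum to 1, so they can be read as probabilities for each digit. The sketch below reproduces this with plain NumPy. The input values `z` are invented for illustration and are not taken from the trained model.

```python
import numpy as np

def softmax(z):
    """Map a vector of real numbers to probabilities that sum to 1."""
    shifted = z - np.max(z)      # subtracting the maximum avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e)

# Hypothetical pre-activation values of the 10 output neurons
z = np.array([0.5, -1.0, 2.0, 0.1, 0.0, -0.3, 1.2, 3.5, 0.4, -2.0])

p = softmax(z)
digit = int(np.argmax(p))        # the most probable digit

print(np.round(p, 3))
print('sum of probabilities:', p.sum())
print('predicted digit:', digit)
```

The index of the largest entry is the digit the network considers most likely, which is why `np.argmax` is the natural way to get a single predicted digit.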

In the next cell, please plot (at least) 10 images that are wrongly classified by the NN. Please label the prediction and the ground truth.

Hint: Use np.argmax to compare the **indices** of the prediction and the y_test.

In [None]:
# Plot wrongly classified images using any method you like
# Note: It's better to print ~10 images for later comparison

p_test_index = np.argmax(p_test, axis=1)
y_test_index = np.argmax(y_test, axis=1)
wrong_index  = np.where(p_test_index != y_test_index)

In [None]:
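fig = plt.figure(figsize=(12,6), dpi=200)

# Show the first 10 misclassified test images, labelled with prediction and ground truth
for i in range(10):
    idx = wrong_index[0][i]
    plt.subplot(2,5,i+1)
    plt.imshow(x_test[idx])
    plt.title("pred:"+str(p_test_index[idx])+" real:"+str(y_test_index[idx]))
plt.show()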

To appreciate the 'probability' nature of the prediction from the softmax output layer, what if we manually choose the index corresponding to the second largest probability and interpret as the predicted digit? Will it happen to be the same as the ground truth?

In the next cell, compare the output from the second largest probability to the wrongly classified digits you've plotted above. Has any '2nd choice' made the right prediction?

Hint: You might find np.argsort useful for this case.
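
As a quick standalone illustration of the hint (not using the real predictions), `np.argsort` returns the indices that would sort each row in ascending order, so the last column holds the most probable class and the second-to-last column holds the runner-up. The `probs` array below is made up, with only 4 classes to keep it readable.

```python
import numpy as np

# Hypothetical probabilities for 2 samples and 4 classes
probs = np.array([[0.04, 0.60, 0.30, 0.06],
                  [0.40, 0.10, 0.15, 0.35]])

# Indices that sort each row from smallest to largest probability
order = np.argsort(probs, axis=1)

first_choice = order[:, -1]    # same as np.argmax(probs, axis=1)
second_choice = order[:, -2]   # index of the second largest probability

print(first_choice)    # [1 0]
print(second_choice)   # [2 3]
```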

In [None]:
# Compare the output using 2nd largest probability to the ground truth
# for wrongly classified digits

second_pindex = np.argsort(p_test, axis=1)[:,-2]

fig = plt.figure(figsize=(12,6), dpi=200)

for i in range(10):
    plt.subplot(2,5,i+1)
    plt.imshow(x_test[wrong_index[0][i]])
    plt.title("real:"+str(y_test_index[wrong_index[0][i]])+" second:"+str(second_pindex[wrong_index[0][i]]))
plt.show()

## Overfitting and Regularization

The NN model we just built performs very well on the training data, but slightly worse on the test data. This seems to be a sign of overtraining. Let's check the 'history' of the performance as a function of the epoch.

The cell below shows you how to plot the evolution of loss and accuracy from the NN model. Please make sure you understand the code.

In [None]:
# Plot showing the evolution of loss and accuracy, comparing the training and the test samples.
fig = plt.figure(figsize=(6,6), dpi=80)
plt.subplot(2,1,1)
plt.plot(rec.history['loss'], lw=3, label='Train')
plt.plot(rec.history['val_loss'], lw=3, label='Validation')
plt.xlabel('epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(2,1,2)
plt.plot(rec.history['accuracy'], lw=3, label='Train')
plt.plot(rec.history['val_accuracy'], lw=3, label='Validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()

So indeed, while the performance with the training data keeps improving with each epoch, the performance saturates much earlier for the test data. This means the model is overtrained/overfitting.
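
The same observation can be made numerically rather than by eye. The sketch below uses an invented `history` dictionary with the same keys as `rec.history` (the numbers are hypothetical, not results from this notebook) and looks for the epoch where the validation loss is smallest, as well as how the gap between validation and training loss grows.

```python
import numpy as np

# Hypothetical loss curves, shaped like rec.history but with made-up values
history = {
    'loss':     [0.90, 0.50, 0.30, 0.20, 0.10],
    'val_loss': [0.95, 0.60, 0.45, 0.50, 0.60],
}

train_loss = np.array(history['loss'])
val_loss = np.array(history['val_loss'])

# Epoch (0-based) where the validation loss is smallest
best_epoch = int(np.argmin(val_loss))

# A gap that keeps growing after best_epoch is a typical sign of overfitting
gap = val_loss - train_loss

print('best epoch:', best_epoch)
print('gap per epoch:', np.round(gap, 2))
```

With the real model, the same lines could be applied to `rec.history` directly.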

In the next cell we try to mitigate the overfitting by adding a regularization term at the output layer, using an 'L2' regularization.
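
Before running the regularized model, it may help to see what the L2 term looks like on its own. As far as the Keras `l2` regularizer is concerned, the penalty added to the loss is the factor (0.01 in the cell below) times the sum of the squared weights of that layer. The NumPy sketch below mimics this for a randomly generated weight matrix; `W` and `lam` are illustrative names, and the weights are not those of the trained model.

```python
import numpy as np

lam = 0.01                          # regularization strength, as in l2(0.01)

# Hypothetical 30x10 weight matrix of the output layer (random, for illustration)
rng = np.random.default_rng(0)
W = rng.normal(size=(30, 10))

# L2 penalty added to the loss: lam * sum of squared weights
penalty = lam * np.sum(W ** 2)

# Its gradient is proportional to W itself, so each update step
# pulls the weights toward zero
grad = 2 * lam * W

print('penalty:', penalty)
print('gradient shape:', grad.shape)
```

Because the gradient of the penalty points along the weights, large weights are discouraged, which limits how closely the model can fit the training data.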

In [None]:
# Regularization at output layer
from tensorflow.keras.regularizers import l2
m2 = Sequential()
m2.add(Reshape((784,), input_shape=(28,28)))
m2.add(Dense(30, activation='sigmoid'))
m2.add(Dense(10, activation='softmax', kernel_regularizer=l2(0.01)))

m2.compile(loss='categorical_crossentropy',
           optimizer=SGD(learning_rate=1.0),
           metrics=['accuracy'])

rec2 = m2.fit(x_train, y_train, epochs=50, batch_size=100,
              validation_data=(x_test, y_test))

Please print out the performance of this regularized model on the training and on the test samples in the next cell.

In [None]:
# Print out performances on the training and the test data


print('Performance (training)')
print('Loss: %.5f, Acc: %.5f' % tuple(m2.evaluate(x_train, y_train)))
print('Performance (testing)')
print('Loss: %.5f, Acc: %.5f' % tuple(m2.evaluate(x_test, y_test)))

So the performances on the training and the test data are much more consistent this time. Let's verify that's really the case by checking the history of this L2-regularized model.

In the next 2 cells please compare the evolution of loss and accuracy from the first (un-regularized) model and this L2-regularized one.

In [None]:
# Compare accuracy of the unregularized to the regularized one

# Plot showing the evolution of accuracy for both models, on the training and the test samples.
fig = plt.figure(figsize=(6,6), dpi=80)
plt.subplot(2,1,1)
plt.plot(rec.history['accuracy'], lw=3, label='Train unregularized')
plt.plot(rec2.history['accuracy'], lw=3, label='Train regularized')
plt.plot(rec.history['val_accuracy'], lw=3, label='Validation unregularized')
plt.plot(rec2.history['val_accuracy'], lw=3, label='Validation regularized')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()

In [None]:
# Compare loss of the unregularized to the regularized one
# Plot showing the evolution of loss for both models, on the training and the test samples.
fig = plt.figure(figsize=(6,6), dpi=80)
plt.subplot(2,1,1)
plt.plot(rec.history['loss'], lw=3, label='Train unregularized')
plt.plot(rec2.history['loss'], lw=3, label='Train regularized')
plt.plot(rec.history['val_loss'], lw=3, label='Validation unregularized')
plt.plot(rec2.history['val_loss'], lw=3, label='Validation regularized')
plt.xlabel('epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()