# Fully connected neural network with Tensorflow for MNIST

## Introduction

Tensorflow is is a symbolic math library and one of the widely used libraries for implementing Machine learning/other algorithms involving large number of mathematical operations. Tensorflow was developed by Google and it’s open source now. It is used for both research and production at Google e.g. for implementing Machine learning in almost all applications 
- Google photos 
- Google voice search 

In this notebook we are going to build a fully connected neural network with Tensorflow

## Requirements

### Imports

In [1]:
import tensorflow as tf
from deep_teaching_commons.data.fundamentals.mnist import Mnist
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

### Loading dataset

The MNIST dataset is a classic Machine Learning dataset you can get it and more information about it from the website of Yann Lecun. MNIST contains handwrittin digits and is split into a tranings set of 60000 examples and a test set of 10000 examples. We use the ```deep_teaching_commons``` package to load the MNIST dataset in a convenient way.

In [2]:
train_images, train_labels, test_images, test_labels = Mnist().get_all_data(one_hot_enc=True, flatten=False)
train_images, test_images = train_images.reshape(60000, 28, 28, 1), test_images.reshape(10000,28,28,1)
print('train shapes:', train_images.shape, train_labels.shape)
print('test shapes:', test_images.shape, test_labels.shape)

auto download is active, attempting download
mnist data directory already exists, download aborted
train shapes: (60000, 28, 28, 1) (60000, 10)
test shapes: (10000, 28, 28, 1) (10000, 10)


### Placeholders

So far we have used numpy arrays to manage our data, but in order to build a model in tensorflow we need another structure, the placeholder. A placeholder is simply a variable that we will assign data to at a later date. It allows us to create our operations and build our computation graph, without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.

In [3]:
# input X: 28x28 grayscale images, the first dimension (None) will index the images in the mini-batch
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
# correct answers will go here
Y = tf.placeholder(tf.float32, [None, 10])

## Fully connected neural network for MNIST


### Initializing the weights
By initializing the weights of our neural network (the learnable parameter), we already define how our network is going to look like. We decided to use a neural network with 3 layer with a ReLU and Dropout function on top of each layer. 

In [4]:
# our neural network architecture:
#
#    · · · · · · ·           (input data, flattened pixels)               X [batch, 784]   # 784 = 28*28
#     \x/x\x/x\x/           -- fully connected layer (ReLU + Droput)      W1 [784, 256]    B3[256]
#      · · · · ·                                                          Y1 [batch, 256]
#       \x/x\x/             -- fully connected layer (ReLU + Droput)      W2 [256, 128]    B4[128]
#        · · ·                                                            Y2 [batch, 128]
#         \x/               -- fully connected layer (softmax)            W3 [128, 10]        B5[10]
#          ·                                                              Y3 [batch, 10]

W1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))  # 784 = 28 * 28
B1 = tf.Variable(tf.zeros([256]))
W2 = tf.Variable(tf.truncated_normal([256, 128], stddev=0.1))
B2 = tf.Variable(tf.zeros([128]))
W3 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
B3 = tf.Variable(tf.zeros([10]))

### Dropout

Dropout is a regularization technique which tries to prevent overfitting. Overfitting means that our network can't perform very well on images it haven't seen before which is obviously really bad.

In [5]:
# Probability of keeping a node during dropout = 1.0 at test time (no dropout) and 0.75 at training time
pkeep = tf.placeholder(tf.float32)

### Building the network

We have, as described above, a 3 layer fully connected neural network with ReLU and Dropout on top of each layer.

In [6]:
flatten = tf.reshape(X, [-1, 784])
hidden1 = tf.nn.relu(tf.matmul(flatten, W1) + B1)
dropout1 = tf.nn.dropout(hidden1, pkeep)
hidden2 = tf.nn.relu(tf.matmul(hidden1, W2) + B2)
dropout2 = tf.nn.dropout(hidden2, pkeep)
output = tf.nn.relu(tf.matmul(hidden2, W3) + B3)

In our **Use Case**, we need a kind of prediction layer on top of our output layer. We use a, so called, Softmax layer or the prediction which we put on top of the output layer. 

In [7]:
prediction = tf.nn.softmax(output)

### Cross Entropy Loss function
In general, the loss functions tells us how "good" or how "bad" our neural network is. This function is then minimized by the neural network so that the neural network gives us the best performance based on the defined loss function. For this purpose we are going to use the cross entropy loss function which is used very heavily in neural networks and seems to work very well.

**Note:** TensorFlow provides the ```softmax_cross_entropy_with_logits``` function to avoid numerical stability problems with log(0) which is NaN

In [8]:
# cross-entropy loss function (= -sum(Y_i * log(Yi)) ), normalised for batches of 100  images
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=output, labels=Y)
cross_entropy = tf.reduce_mean(cross_entropy)*10

### Optimizer
We are going to use the gradient descent method **Adam** to minimize our loss function. We also add a learning rate with an exponential decay. In our setting we start at a learning rate of $0.003$ and exponentially reduce it to $0.00001$.

In [9]:
# step for variable learning rate
step = tf.placeholder(tf.int32)

# the learning rate is: # 0.0001 + 0.003 * (1/e)^(step/2000)
learning_rate = 0.0001 +  tf.train.exponential_decay(0.003, step, 2000, 1/np.exp(1))
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

### Training the network

We define a ```accuracy``` so that we can see whether our network actually improves while training

In [10]:
correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

0.0928


**Hyperparameter**

In [None]:
epochs = 100
batch_size = 256

**Main**

In [None]:
loss_history = []
for e in range(epochs):
    for batch_i in tqdm(range(0, train_images.shape[0], batch_size)):
        data, label = train_images[batch_i:batch_i + batch_size], train_labels[batch_i:batch_i + batch_size]

        # run the computational graph and calculate loss + training step
        with tf.Session() as sess:
            init = tf.global_variables_initializer()
            sess.run(init)
            # optimizer will not return something which is why we store it into a variable called empty
            loss, empty = sess.run([cross_entropy, train_step], feed_dict={X: data, Y: label,  pkeep: 0.75, step: e})
        # append to loss history
        loss_history.append(loss)

    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        train_acc = sess.run(accuracy, feed_dict={X:train_images, Y: train_labels,  pkeep: 1})
        test_acc = sess.run(accuracy, feed_dict={X:test_images, Y: test_labels,  pkeep: 1})
    print('epoch:', e, 'loss:', loss)
    print('test accuracy', test_acc, 'train accuracy', train_acc)

100%|██████████| 235/235 [00:19<00:00, 11.89it/s]
  0%|          | 1/235 [00:00<00:23,  9.79it/s]

epoch: 0 loss: 1332.14
test accuracy 0.098 train accuracy 0.099416666


100%|██████████| 235/235 [00:22<00:00, 10.56it/s]
  0%|          | 1/235 [00:00<00:23,  9.81it/s]

epoch: 1 loss: 1486.891
test accuracy 0.0796 train accuracy 0.0735


100%|██████████| 235/235 [00:26<00:00,  9.03it/s]
  0%|          | 1/235 [00:00<00:27,  8.61it/s]

epoch: 2 loss: 1352.1313
test accuracy 0.0823 train accuracy 0.07935


100%|██████████| 235/235 [00:29<00:00,  8.09it/s]
  0%|          | 1/235 [00:00<00:27,  8.61it/s]

epoch: 3 loss: 1479.1973
test accuracy 0.1103 train accuracy 0.10543333


100%|██████████| 235/235 [00:30<00:00,  7.73it/s]
  0%|          | 1/235 [00:00<00:30,  7.61it/s]

epoch: 4 loss: 1515.1885
test accuracy 0.0705 train accuracy 0.06713333


100%|██████████| 235/235 [00:38<00:00,  6.11it/s]
  0%|          | 1/235 [00:00<00:42,  5.47it/s]

epoch: 5 loss: 1999.0657
test accuracy 0.1049 train accuracy 0.0991


100%|██████████| 235/235 [00:43<00:00,  5.45it/s]
  0%|          | 1/235 [00:00<00:46,  5.01it/s]

epoch: 6 loss: 2029.7053
test accuracy 0.0953 train accuracy 0.0989


100%|██████████| 235/235 [00:49<00:00,  4.79it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 7 loss: 1419.071
test accuracy 0.0952 train accuracy 0.09266666


100%|██████████| 235/235 [00:53<00:00,  4.37it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 8 loss: 1368.365
test accuracy 0.1125 train accuracy 0.108216666


100%|██████████| 235/235 [00:58<00:00,  4.03it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 9 loss: 1687.5942
test accuracy 0.1326 train accuracy 0.12691666


100%|██████████| 235/235 [01:03<00:00,  3.68it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 10 loss: 1664.1945
test accuracy 0.0858 train accuracy 0.08828333


100%|██████████| 235/235 [01:03<00:00,  3.73it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 11 loss: 1464.2645
test accuracy 0.0752 train accuracy 0.07173333


100%|██████████| 235/235 [01:12<00:00,  3.24it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 12 loss: 1563.6461
test accuracy 0.0902 train accuracy 0.08846667


100%|██████████| 235/235 [01:19<00:00,  2.95it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 13 loss: 1628.1176
test accuracy 0.1023 train accuracy 0.104033336


100%|██████████| 235/235 [01:24<00:00,  2.78it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 14 loss: 1396.0579
test accuracy 0.0744 train accuracy 0.073466666


100%|██████████| 235/235 [01:29<00:00,  2.63it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 15 loss: 1476.1855
test accuracy 0.075 train accuracy 0.080133334


100%|██████████| 235/235 [01:30<00:00,  2.59it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 16 loss: 1219.443
test accuracy 0.1038 train accuracy 0.103766665


100%|██████████| 235/235 [01:42<00:00,  2.30it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 17 loss: 1234.592
test accuracy 0.0927 train accuracy 0.09458333


100%|██████████| 235/235 [01:47<00:00,  2.18it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 18 loss: 1578.2714
test accuracy 0.0707 train accuracy 0.06625


100%|██████████| 235/235 [02:06<00:00,  1.85it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 19 loss: 2078.0562
test accuracy 0.156 train accuracy 0.15385


100%|██████████| 235/235 [02:26<00:00,  1.61it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 20 loss: 1915.5763
test accuracy 0.0595 train accuracy 0.058683332


100%|██████████| 235/235 [02:34<00:00,  1.52it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 21 loss: 2389.2183
test accuracy 0.0974 train accuracy 0.09911667


100%|██████████| 235/235 [02:32<00:00,  1.54it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 22 loss: 1357.7815
test accuracy 0.0868 train accuracy 0.09091666


100%|██████████| 235/235 [02:51<00:00,  1.37it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 23 loss: 1333.1732
test accuracy 0.0646 train accuracy 0.06956667


100%|██████████| 235/235 [02:42<00:00,  1.44it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 24 loss: 1107.1466
test accuracy 0.1205 train accuracy 0.1162


100%|██████████| 235/235 [03:03<00:00,  1.28it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 25 loss: 1399.3723
test accuracy 0.0904 train accuracy 0.085266665


100%|██████████| 235/235 [02:58<00:00,  1.31it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 26 loss: 1529.4054
test accuracy 0.0814 train accuracy 0.08546667


100%|██████████| 235/235 [03:19<00:00,  1.18it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 27 loss: 1682.2285
test accuracy 0.1062 train accuracy 0.104666665


100%|██████████| 235/235 [03:12<00:00,  1.22it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 28 loss: 1223.7552
test accuracy 0.0837 train accuracy 0.08308333


100%|██████████| 235/235 [03:33<00:00,  1.10it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 29 loss: 1503.5419
test accuracy 0.0913 train accuracy 0.09205


100%|██████████| 235/235 [03:29<00:00,  1.12it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 30 loss: 2280.2078
test accuracy 0.1202 train accuracy 0.11546667


100%|██████████| 235/235 [03:49<00:00,  1.02it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 31 loss: 1906.0785
test accuracy 0.1142 train accuracy 0.11568333


100%|██████████| 235/235 [03:52<00:00,  1.01it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 32 loss: 1546.8396
test accuracy 0.0878 train accuracy 0.10053334


100%|██████████| 235/235 [04:13<00:00,  1.08s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 33 loss: 2368.9478
test accuracy 0.1069 train accuracy 0.10733333


100%|██████████| 235/235 [04:29<00:00,  1.15s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 34 loss: 1130.6174
test accuracy 0.1348 train accuracy 0.13745


100%|██████████| 235/235 [04:34<00:00,  1.17s/it]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 35 loss: 1818.8826
test accuracy 0.1057 train accuracy 0.10785


 38%|███▊      | 89/235 [01:44<02:51,  1.17s/it]

### Evaluate model
Let us look at the optimization results. Final loss tells us how far we could reduce costs during traning process. Further we can use the first loss value as a sanity check and validate our implementation of the loss function works as intended. Recall loss value after first iteration should be $ log\:c$ with $c$ being number of classes. To visulize the whole tranings process we can plot losss values from each iteration as a loss curve. 

In [None]:
# check loss after last and first iteration
print('last iteration loss:',loss_history[-1])
print('first iteration loss:',loss_history[0])
# Plot a loss curve
plt.plot(loss_history)
plt.ylabel('loss')
plt.xlabel('iterations')

Evaluation above gave us some inside about the optimization process but did not quantified our final model. One possibility is to calculate model accuracy.

In [None]:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    acc = sess.run(accuracy, feed_dict={X:test_images, Y: test_labels,  pkeep: 1})

print(acc)