# Fully connected neural network with Tensorflow for MNIST

## Introduction

Tensorflow is is a symbolic math library and one of the widely used libraries for implementing Machine learning/other algorithms involving large number of mathematical operations. Tensorflow was developed by Google and it’s open source now. It is used for both research and production at Google e.g. for implementing Machine learning in almost all applications 
- Google photos 
- Google voice search 

In this notebook we are going to build a fully connected neural network with Tensorflow

## Requirements

### Imports

In [1]:
import tensorflow as tf
from deep_teaching_commons.data.fundamentals.mnist import Mnist
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

### Loading and exploring our dataset

The MNIST dataset is a classic Machine Learning dataset you can get it and more information about it from the website of Yann Lecun. MNIST contains handwrittin digits and is split into a tranings set of 60000 examples and a test set of 10000 examples. We use the ```deep_teaching_commons``` package to load the MNIST dataset in a convenient way.

In [2]:
train_images, train_labels, test_images, test_labels = Mnist().get_all_data(one_hot_enc=True, flatten=False)
train_images, test_images = train_images.reshape(60000, 28, 28, 1), test_images.reshape(10000,28,28,1)
print('train shapes:', train_images.shape, train_labels.shape)
print('test shapes:', test_images.shape, test_labels.shape)

auto download is active, attempting download
mnist data directory already exists, download aborted
train shapes: (60000, 28, 28, 1) (60000, 10)
test shapes: (10000, 28, 28, 1) (10000, 10)


### Placeholders

So far we have used numpy arrays to manage our data, but in order to build a model in tensorflow we need another structure, the placeholder. A placeholder is simply a variable that we will assign data to at a later date. It allows us to create our operations and build our computation graph, without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.

In [3]:
# input X: 28x28 grayscale images, the first dimension (None) will index the images in the mini-batch
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
# correct answers will go here
Y = tf.placeholder(tf.float32, [None, 10])

## Fully connected neural network for MNIST

In this network, we are not going to use any regularization techniques (techniques which prevents overfitting: not being able to have a good performance on images it haven't seen before). 

### Initializing the weights
By initializing the weights of our neural network (the learnable parameter), we already define how our network is going to look like. We decided to use a neural network with 3 layer with a sigmoid function on top of each layer.

In [4]:
# our neural network architecture:
#
#    · · · · · · ·           (input data, flattened pixels)         X [batch, 784]   # 784 = 28*28
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W1 [784, 256]    B3[256]
#      · · · · ·                                                    Y1 [batch, 256]
#       \x/x\x/             -- fully connected layer (sigmoid)      W2 [256, 128]    B4[128]
#        · · ·                                                      Y2 [batch, 128]
#         \x/               -- fully connected layer (softmax)      W3 [128, 10]        B5[10]
#          ·                                                        Y3 [batch, 10]

W1 = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))  # 784 = 28 * 28
B1 = tf.Variable(tf.zeros([256]))
W2 = tf.Variable(tf.truncated_normal([256, 128], stddev=0.1))
B2 = tf.Variable(tf.zeros([128]))
W3 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
B3 = tf.Variable(tf.zeros([10]))

### Building the network

We have a 3 layer fully connected neural network with sigmoid on top of each layer. But you can also swap out every sigmoid function to another activation function.

In [5]:
flatten = tf.reshape(X, [-1, 784])
hidden1 = tf.nn.sigmoid(tf.matmul(flatten, W1) + B1)
hidden2 = tf.nn.sigmoid(tf.matmul(hidden1, W2) + B2)
output = tf.nn.sigmoid(tf.matmul(hidden2, W3) + B3)

In our **Use Case**, we need a kind of prediction layer on top of our output layer. We use a, so called, Softmax layer or the prediction which we put on top of the output layer. 

In [6]:
prediction = tf.nn.softmax(output)

### Cross Entropy Loss function
In general, the loss functions tells us how "good" or how "bad" our neural network is. This function is then minimized by the neural network so that the neural network gives us the best performance based on the defined loss function. For this purpose we are going to use the cross entropy loss function which is used very heavily in neural networks and seems to work very well.

**Note:** TensorFlow provides the ```softmax_cross_entropy_with_logits``` function to avoid numerical stability problems with log(0) which is NaN

In [7]:
# cross-entropy loss function (= -sum(Y_i * log(Yi)) ), normalised for batches of 100  images
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=output, labels=Y)
cross_entropy = tf.reduce_mean(cross_entropy)*10

### Optimizer
We are going to use the gradient descent method **Adam** to minimize our loss function. 

In [8]:
# training step, learning rate = 0.003
learning_rate = 0.001
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

### Training the network

We define a ```accuracy``` so that we can see whether our network actually improves while training

In [9]:
correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

**Hyperparamter**

In [10]:
epochs = 100
batch_size = 256

**Main**

In [None]:
loss_history = []
for e in range(epochs):
    for batch_i in tqdm(range(0, train_images.shape[0], batch_size)):
        data, label = train_images[batch_i:batch_i +
                                   batch_size], train_labels[batch_i:batch_i + batch_size]

        # run the computational graph and calculate loss + training step
        with tf.Session() as sess:
            init = tf.global_variables_initializer()
            sess.run(init)
            # optimizer will not return something which is why we store it into a variable called empty
            loss, empty = sess.run(
                [cross_entropy, train_step], feed_dict={X: data, Y: label})
        # append to loss history
        loss_history.append(loss)

    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        train_acc = sess.run(accuracy, feed_dict={
                             X: train_images, Y: train_labels})
        test_acc = sess.run(accuracy, feed_dict={X: test_images, Y: test_labels})
    print('epoch:', e, 'loss:', loss)
    print('test accuracy', test_acc, 'train accuracy', train_acc)

100%|██████████| 235/235 [00:26<00:00,  8.77it/s]
  0%|          | 1/235 [00:00<00:30,  7.74it/s]

epoch: 0 loss: 23.117252
test accuracy 0.1088 train accuracy 0.107933335


100%|██████████| 235/235 [00:27<00:00,  8.56it/s]
  0%|          | 1/235 [00:00<00:32,  7.18it/s]

epoch: 1 loss: 23.076494
test accuracy 0.0963 train accuracy 0.095233336


100%|██████████| 235/235 [00:33<00:00,  6.98it/s]
  0%|          | 1/235 [00:00<00:35,  6.54it/s]

epoch: 2 loss: 23.157417
test accuracy 0.0874 train accuracy 0.08485


100%|██████████| 235/235 [00:37<00:00,  6.27it/s]
  0%|          | 1/235 [00:00<00:34,  6.83it/s]

epoch: 3 loss: 23.055893
test accuracy 0.0959 train accuracy 0.08845


100%|██████████| 235/235 [00:44<00:00,  5.25it/s]
  0%|          | 1/235 [00:00<00:37,  6.30it/s]

epoch: 4 loss: 23.003235
test accuracy 0.0892 train accuracy 0.09035


100%|██████████| 235/235 [00:45<00:00,  5.12it/s]
  0%|          | 1/235 [00:00<00:44,  5.26it/s]

epoch: 5 loss: 23.086988
test accuracy 0.1305 train accuracy 0.13266666


100%|██████████| 235/235 [00:46<00:00,  5.04it/s]
  0%|          | 1/235 [00:00<00:45,  5.17it/s]

epoch: 6 loss: 23.145964
test accuracy 0.1009 train accuracy 0.1063


100%|██████████| 235/235 [00:56<00:00,  4.13it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 7 loss: 23.04631
test accuracy 0.0874 train accuracy 0.08033333


100%|██████████| 235/235 [01:02<00:00,  3.75it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 8 loss: 23.108122
test accuracy 0.0958 train accuracy 0.098633334


100%|██████████| 235/235 [01:13<00:00,  3.19it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 9 loss: 23.118208
test accuracy 0.1015 train accuracy 0.11075


100%|██████████| 235/235 [01:18<00:00,  2.99it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 10 loss: 23.111507
test accuracy 0.1021 train accuracy 0.10011667


100%|██████████| 235/235 [01:18<00:00,  2.99it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 11 loss: 23.185568
test accuracy 0.1009 train accuracy 0.09915


100%|██████████| 235/235 [01:18<00:00,  3.01it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 12 loss: 23.038567
test accuracy 0.1028 train accuracy 0.10411666


100%|██████████| 235/235 [01:36<00:00,  2.44it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 13 loss: 23.112528
test accuracy 0.1097 train accuracy 0.110316664


100%|██████████| 235/235 [01:41<00:00,  2.31it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 14 loss: 23.008224
test accuracy 0.097 train accuracy 0.09795


100%|██████████| 235/235 [01:43<00:00,  2.28it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 15 loss: 23.016012
test accuracy 0.1218 train accuracy 0.1243


100%|██████████| 235/235 [01:46<00:00,  2.21it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 16 loss: 23.016098
test accuracy 0.1356 train accuracy 0.1355


100%|██████████| 235/235 [01:59<00:00,  1.96it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 17 loss: 23.118822
test accuracy 0.1026 train accuracy 0.099366665


100%|██████████| 235/235 [02:06<00:00,  1.86it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 18 loss: 23.113745
test accuracy 0.095 train accuracy 0.093666665


100%|██████████| 235/235 [02:04<00:00,  1.89it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 19 loss: 23.015125
test accuracy 0.1012 train accuracy 0.10298333


100%|██████████| 235/235 [02:21<00:00,  1.66it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 20 loss: 23.042852
test accuracy 0.0754 train accuracy 0.0767


100%|██████████| 235/235 [02:22<00:00,  1.65it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 21 loss: 23.103647
test accuracy 0.0968 train accuracy 0.09911667


100%|██████████| 235/235 [02:30<00:00,  1.56it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 22 loss: 23.082792
test accuracy 0.0725 train accuracy 0.073916666


100%|██████████| 235/235 [02:43<00:00,  1.44it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 23 loss: 23.08358
test accuracy 0.09 train accuracy 0.09053333


100%|██████████| 235/235 [02:38<00:00,  1.48it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 24 loss: 23.16326
test accuracy 0.1029 train accuracy 0.10445


100%|██████████| 235/235 [03:10<00:00,  1.24it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 25 loss: 23.09859
test accuracy 0.0795 train accuracy 0.07895


100%|██████████| 235/235 [03:06<00:00,  1.26it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 26 loss: 23.120596
test accuracy 0.0825 train accuracy 0.085


100%|██████████| 235/235 [03:24<00:00,  1.15it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 27 loss: 23.106327
test accuracy 0.0954 train accuracy 0.09498333


100%|██████████| 235/235 [03:28<00:00,  1.12it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 28 loss: 22.924032
test accuracy 0.1006 train accuracy 0.10096667


100%|██████████| 235/235 [03:41<00:00,  1.06it/s]
  0%|          | 0/235 [00:00<?, ?it/s]

epoch: 29 loss: 23.143307
test accuracy 0.1153 train accuracy 0.112116665


  0%|          | 1/235 [00:00<03:07,  1.25it/s]

### Evaluate model
Let us look at the optimization results. Final loss tells us how far we could reduce costs during traning process. Further we can use the first loss value as a sanity check and validate our implementation of the loss function works as intended. Recall loss value after first iteration should be $ log\:c$ with $c$ being number of classes. To visulize the whole tranings process we can plot losss values from each iteration as a loss curve. 

In [None]:
# check loss after last and first iteration
print('last iteration loss:',loss_history[-1])
print('first iteration loss:',loss_history[0])
# Plot a loss curve
plt.plot(loss_history)
plt.ylabel('loss')
plt.xlabel('iterations')

Evaluation above gave us some inside about the optimization process but did not quantified our final model. One possibility is to calculate model accuracy.

In [None]:
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    acc = sess.run(accuracy, feed_dict={X:test_images, Y: test_labels})

print(acc)