# Create a DNN using TensorFlow and Keras

## Introduction

In this example, a simple Deep Neural Network (DNN) is trained to recognize images of handwritten digits. It is partly inspired by the official [tensorflow keras tutorial](https://www.tensorflow.org/tutorials/keras/basic_classification), which is also a good introduction to the concepts of DNNs for beginners.

This tutorial gives a brief insight into building DNNs in four major steps:
* set up the model structure
* train the model
* evaluate accuracy
* make predictions

Finally, the results are discussed and improved with some tricks:
* plot loss and accuracy of the model
* improve the performance of training the model

## set up the model structure


### load necessary libraries

This example uses the Keras library, which is now part of TensorFlow. The following code loads the required libraries.

In [1]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

1.10.0


### use MNIST data

The [MNIST dataset](http://yann.lecun.com/exdb/mnist/) is a set of labeled images of handwritten digits (0-9). Each image has 28x28 pixels. It consists of a training set of 60,000 examples and a test set of 10,000 examples. Since it is a popular dataset that is often used to benchmark machine learning libraries, TensorFlow offers code to load MNIST.

In [2]:
# load mnist dataset
mnist_dataset = keras.datasets.mnist
#mnist_dataset = keras.datasets.fashion_mnist # for fashion mnist dataset instead of classic mnist dataset
(mnist_train_images, mnist_train_labels), (mnist_test_images, mnist_test_labels) = mnist_dataset.load_data()

It is good practice to normalize the data of any dataset before using an optimization algorithm. Inputs on a common, small scale make the cost function better conditioned, which may help gradient-based training converge faster. The pixel values of MNIST range from 0 to 255. These integer values are cast to float values ranging from zero to one.

TODO: cite *Hands-On Machine Learning*.

In [3]:
mnist_train_images = mnist_train_images / 255.0
mnist_test_images = mnist_test_images / 255.0
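
As a minimal sketch of what this division does, the following NumPy snippet applies the same scaling to a small made-up array. The name `raw_pixels` is a hypothetical stand-in for the MNIST images, which are stored as unsigned 8-bit integers.

```python
import numpy as np

# Hypothetical 1x3 "image" covering the full 8-bit value range
raw_pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Dividing by a Python float promotes the integers to floating point
scaled_pixels = raw_pixels / 255.0

print(scaled_pixels.dtype, scaled_pixels.min(), scaled_pixels.max())
```

The result is a floating point array whose values lie between zero and one, which is the range the model is trained on.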

### create the model

The DNN should consist of 28x28 (=784) input neurons, one hidden layer of 128 neurons and an output layer of 10 neurons (each corresponding to one digit from zero to nine).
![alt text](images/dnn_tf_keras/dnn_form.svg)
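
As a rough sanity check of this topology, the number of trainable parameters can be worked out by hand. The sketch assumes fully connected layers with one bias per neuron; the helper name `dense_parameters` is only illustrative.

```python
def dense_parameters(inputs, neurons):
    # One weight per input-to-neuron connection plus one bias per neuron
    return inputs * neurons + neurons

input_size = 28 * 28
hidden_parameters = dense_parameters(input_size, 128)
output_parameters = dense_parameters(128, 10)
total_parameters = hidden_parameters + output_parameters

print(input_size, hidden_parameters, output_parameters, total_parameters)
# prints: 784 100480 1290 101770
```

Almost all parameters sit between the input and the hidden layer, so the size of the hidden layer dominates the size of the model.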

The following code shows how to create a DNN with this network topology using Keras.

In [5]:
# create the model
mnist_model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
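
The two activation functions used in the model can be sketched in plain NumPy. This is a simplified illustration of the math, not the TensorFlow implementation, and the `logits` values are made up.

```python
import numpy as np

def relu(x):
    # Negative inputs become zero, positive inputs pass through unchanged
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

logits = np.array([2.0, -1.0, 0.5])  # hypothetical raw outputs of a layer
probabilities = softmax(logits)

print(relu(logits))
print(probabilities, probabilities.sum())
```

Softmax turns the raw outputs into values that sum to one, so the output layer can be read as one probability per digit.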

### configure the model

Before using the model, it is necessary to configure its behaviour:
* **optimizer:** The optimization method (Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Nesterov, AdaGrad, Adam<sup>1</sup> or conjugate gradient, BFGS, L-BFGS<sup>2</sup>).
* **loss:** The cost function that the optimizer minimizes during training.
* **metrics:** A metric function is used to judge the performance of a model. It is similar to a loss function, but it is not used for training the model.

<sup>1: Optimization methods described in *Hands-On Machine Learning with Scikit-Learn and TensorFlow* by Aurélien Géron</sup><br>
<sup>2: Conjugate gradient, BFGS and L-BFGS are frequently used optimization algorithms that Andrew Ng recommends besides gradient descent in his [lecture on logistic regression](https://youtu.be/6vO3DVJlsK4?t=1m53s) of his Stanford machine learning course.</sup>

In [6]:
# configure the model
mnist_model.compile(optimizer=tf.train.AdamOptimizer(), 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
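
To make the difference between loss and metric more concrete, here is a minimal pure-Python sketch of what `sparse_categorical_crossentropy` and `accuracy` measure. The function names and the example values are assumptions for illustration; Keras computes both internally on tensors.

```python
import math

def sparse_crossentropy(probabilities, labels):
    # Mean negative log-probability assigned to the correct class
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

def accuracy(probabilities, labels):
    # Fraction of examples whose most probable class is the correct one
    hits = [max(range(len(p)), key=p.__getitem__) == y
            for p, y in zip(probabilities, labels)]
    return sum(hits) / len(hits)

# Hypothetical model outputs for two examples with two classes each
example_probabilities = [[0.1, 0.9], [0.6, 0.4]]
example_labels = [1, 1]

example_loss = sparse_crossentropy(example_probabilities, example_labels)
example_accuracy = accuracy(example_probabilities, example_labels)
print(example_loss, example_accuracy)
```

The loss changes smoothly with the predicted probabilities, which is what the optimizer needs, while the accuracy only counts hits and misses and is easier to interpret.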

## train the model

Now everything is set up to train the model for the first time. This may take a while.

In [7]:
# hold out the first 10,000 training examples as a validation set
x_val = mnist_train_images[:10000]
partial_x_train = mnist_train_images[10000:]

y_val = mnist_train_labels[:10000]
partial_y_train = mnist_train_labels[10000:]

# train the model
history = mnist_model.fit(partial_x_train, 
                partial_y_train, 
                epochs=50,
                batch_size=512,
                validation_data=(x_val, y_val),
                verbose=1)

Train on 50000 samples, validate on 10000 samples
Epoch 1/50
[...]
Epoch 50/50
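
The numbers in this training run can be checked with a little arithmetic. The sketch below assumes that each epoch passes over all remaining training examples once and that the last batch of an epoch may be smaller than `batch_size`.

```python
import math

training_examples = 60000
validation_examples = 10000
batch_size = 512
epochs = 50

# Examples left for training after the validation split
partial_training_examples = training_examples - validation_examples

# Weight updates per epoch, counting the smaller final batch
updates_per_epoch = math.ceil(partial_training_examples / batch_size)
last_batch_size = partial_training_examples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print(partial_training_examples, updates_per_epoch, last_batch_size, total_updates)
# prints: 50000 98 336 4900
```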


The model's accuracy on the test set is about 97.7%:

In [8]:
# test the model
mnist_test_loss, mnist_test_acc = mnist_model.evaluate(mnist_test_images, mnist_test_labels)
print('Test accuracy:', mnist_test_acc)

Test accuracy: 0.9766


## make predictions

The model is trained and ready to use. In the following code, the first image of the test dataset is used to predict which digit it shows.

In [9]:
# inference

predictions = mnist_model.predict(mnist_test_images)

if (np.argmax(predictions[0]) == mnist_test_labels[0]):
    print("Prediction correct!")
else:
    print("Prediction Error, check your code.")

Prediction correct!
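
The same check can be applied to many images at once. The following sketch uses a small made-up array in place of the real `predictions`, so the values and the three-class setup are assumptions for illustration only.

```python
import numpy as np

# Hypothetical softmax outputs: one row per image, one column per class
example_predictions = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
])
example_labels = np.array([1, 0, 1])

# The predicted class is the column with the highest probability in each row
predicted_labels = np.argmax(example_predictions, axis=1)

# Fraction of rows where the prediction matches the label
example_accuracy = np.mean(predicted_labels == example_labels)

print(predicted_labels, example_accuracy)
```

Applied to the real `predictions` and `mnist_test_labels`, the same two lines should give a value close to the test accuracy reported by `evaluate`.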


## plot loss and accuracy of the model

To get a better impression of what is happening during training, it is useful to plot the loss and accuracy over the epochs.

In [10]:
history_dict = history.history
history_dict.keys()

dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])

In [11]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

output_notebook()

# training loss in blue, validation loss in red
p = figure(x_axis_label='epoch', y_axis_label='loss')
p.line(epochs, loss, color='blue', line_width=2)
p.line(epochs, val_loss, color='red', line_width=2)
show(p)

The accuracy on the training data and the validation data is plotted as well. It can be seen that, after many epochs, there is no further significant improvement.

In [12]:
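# training accuracy in blue, validation accuracy in red
p = figure(x_axis_label='epoch', y_axis_label='accuracy')
p.line(epochs, acc, color='blue', line_width=2)
p.line(epochs, val_acc, color='red', line_width=2)
show(p)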
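
Besides reading the plots, the epoch with the lowest validation loss can be looked up directly. The list below is a made-up example; with the real data, `history.history['val_loss']` would take its place.

```python
# Hypothetical validation losses, one value per epoch
example_val_loss = [0.40, 0.25, 0.18, 0.15, 0.16, 0.17, 0.19]

# Index of the smallest loss, shifted by one because epochs are counted from 1
best_epoch = min(range(len(example_val_loss)), key=example_val_loss.__getitem__) + 1

print(best_epoch, example_val_loss[best_epoch - 1])
# prints: 4 0.15
```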

## improve the performance of training the model
The first plot shows that the loss on the training data (blue) decreases steadily with each epoch. At about epoch 30, however, the loss on the validation data (red) starts to increase. This is the point where the model starts to overfit the training data, so the training can be stopped here. The following code shows how this is implemented in Keras, which is quite straightforward.


In [13]:
# patience is the number of epochs without improvement after which training stops
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)

# train the model
history = mnist_model.fit(partial_x_train, 
                partial_y_train, 
                epochs=50,
                batch_size=512,
                validation_data=(x_val, y_val),
                callbacks=[early_stop],
                verbose=1)

Train on 50000 samples, validate on 10000 samples
Epoch 1/50
[...]
Epoch 21/50


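This way, the same accuracy can be achieved with significantly fewer epochs. The training takes about half the time. This is what Geoffrey Hinton calls a "beautiful free lunch". Note that calling `fit` again on `mnist_model` continues from the already trained weights; for a fair comparison, the model would have to be re-created and compiled again before this second run.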

In [14]:
# test the model
mnist_test_loss, mnist_test_acc = mnist_model.evaluate(mnist_test_images, mnist_test_labels)
print('Test accuracy:', mnist_test_acc)

Test accuracy: 0.9772
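
The idea behind the `patience` parameter can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that any decrease of the monitored loss counts as an improvement; it is not the Keras implementation, and the function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The loss kept improving often enough: all epochs are used
    return len(val_losses)

example_val_losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.19]
stopped_at = early_stopping_epoch(example_val_losses, patience=3)
print(stopped_at)
# prints: 6
```

In this example the best loss is reached in epoch 3, and training stops three epochs later. A larger `patience` tolerates longer plateaus at the cost of more training time.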


## strategies to overcome overfitting

* a model that is too small cannot capture the complexity of the data (underfitting)
* a model that is too big has too many parameters and will soon overfit
* finding the right size of a model can be a complex task, which unfortunately requires some experience. A good strategy can be to use a model that is more complex than necessary and to apply early stopping (the "stretch pants" approach).
* the best way to prevent overfitting is to train the model with more data. If this is not possible, L1/L2 regularization or dropout can be used instead.
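
The two regularization techniques named in the last point can be sketched in NumPy. This is a simplified illustration of the underlying idea, not the way Keras implements them; the function names, the `strength` and `rate` values and the example arrays are assumptions.

```python
import numpy as np

def l2_penalty(weights, strength):
    # Extra cost added to the loss: large weights are penalized quadratically
    return strength * float(np.sum(np.square(weights)))

def inverted_dropout(activations, rate, rng):
    # Randomly zero a fraction `rate` of the activations during training
    # and rescale the rest so that the expected value stays the same
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

example_weights = np.array([1.0, -2.0])
penalty = l2_penalty(example_weights, strength=0.01)

rng = np.random.default_rng(0)
example_activations = np.ones(1000)
dropped = inverted_dropout(example_activations, rate=0.5, rng=rng)

print(penalty, np.unique(dropped))
```

Both techniques make it harder for the model to rely on a few large weights or individual neurons, which tends to reduce overfitting.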

#### further steps
- strategies to overcome overfitting ...
- notebook on CNNs with the same dataset, showing the better results
- AutoML / Auto-Keras / Bayesian hyperparameter optimization (see *Deep Learning* by Ian Goodfellow et al.)

Idea: Notebook "It's no rocket science"

- rocket equation and an artificial neural network that learns it
- model-based approach vs. learning from data
- "no free lunch" theorem and universal approximation theorem