<a href="https://colab.research.google.com/github/rufimelo99/MNIST/blob/main/ML9_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

## Practical Lecture 9: Neural Networks in Code

This is an introductory tutorial about the creation, training, and evaluation of neural networks in code. There are multiple deep learning libraries available, such as [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), or [DyNet](http://dynet.io/). In this tutorial we will use [TensorFlow](https://www.tensorflow.org/) and, more specifically, [Keras](https://www.tensorflow.org/api_docs/python/tf/keras), which is a high-level API over [TensorFlow](https://www.tensorflow.org/). Since this tutorial is not exhaustive, refer to the [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) documentation for further details.

If you have questions and/or suggestions, please send them to [eugenio.ribeiro@tecnico.ulisboa.pt](mailto:eugenio.ribeiro@tecnico.ulisboa.pt).



Let's start by importing the libraries that will be used in this tutorial:

* [tensorflow](https://www.tensorflow.org/): the neural network library
* [tensorflow_datasets](https://www.tensorflow.org/datasets): provides the datasets that we will use
* [numpy](https://numpy.org/): we will use it to store the data in array format for visualization
* [sklearn](https://scikit-learn.org/): provides a [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) implementation that we will use for visualization
* [matplotlib](https://matplotlib.org/): plotting library for visualization




In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

import numpy as np

import sklearn.decomposition
import matplotlib.pyplot as plt

## Feed-Forward Networks

To explain how to create and train a neural network model using the [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras), we will start by exploring feed-forward networks.

### Dataset

In this part of the tutorial we will use [Iris](https://archive.ics.uci.edu/ml/datasets/iris), which is a widely used dataset for introducing machine learning problems and approaches: 

* 150 examples of iris flowers
* 3 classes: iris setosa, iris virginica, and iris versicolor
* 4 features: length and width of sepals and petals

In [None]:
iris_data, iris_info = tfds.load('iris', with_info=True)

In [None]:
print(iris_info)

Among other things, in the description that accompanies the dataset, we can see that the dataset is not partitioned, that is, all the examples belong to the training set. Furthermore, the [tensorflow_datasets](https://www.tensorflow.org/datasets) library provides the data in the form of tensors which can be used directly to train and evaluate models. However, in order to visualize it and to generalize the tutorial to datasets that are not provided by that library, we will convert it to [NumPy](https://numpy.org/) arrays:

In [None]:
iris_x = np.asarray([instance['features'] for instance in tfds.as_numpy(iris_data['train'])])
iris_y = np.asarray([instance['label'] for instance in tfds.as_numpy(iris_data['train'])])

Let's take a look at some examples:

In [None]:
for f, c in zip(iris_x[:5], iris_y[:5]):
    print('Features: {}\tClass: {}'.format(f,c))

Since there are four features, it is hard to visualize the spatial distribution of the classes. Thus, just for visualization purposes, we will map the examples into a two-dimensional space, by applying [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html):

In [None]:
pca = sklearn.decomposition.PCA(n_components=2)
iris_2d = pca.fit_transform(iris_x)
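
To make the projection less of a black box, the following is a minimal NumPy sketch of what PCA computes: center the data, take its singular value decomposition, and project onto the leading directions. The helper `pca_project` and the synthetic `toy_x` array are assumptions made for illustration (they are not part of the tutorial's code), and the result may differ from the scikit-learn output in the sign of each component.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project the rows of x onto the top principal components."""
    centered = x - x.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Synthetic stand-in for iris_x: 150 examples, 4 features with different spreads
toy_x = rng.normal(size=(150, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
toy_2d = pca_project(toy_x)
print(toy_2d.shape)
```

The first projected coordinate carries the largest variance and the two coordinates are uncorrelated, which is why a 2D scatter plot of them tends to be informative.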

Now we can visualize the dataset and see that two of the classes are not linearly separable:

In [None]:
iris_classes = iris_info.features['label'].names
colors = ['.r', '.g', '.b']

plt.figure()
plt.title('Iris Dataset')
for i in range(iris_info.features['label'].num_classes):
    plt.plot(*iris_2d[np.where(iris_y==i)].T, colors[i], label=iris_classes[i])
plt.legend(loc='best')
plt.show()

### Models

Now, let's create and train some networks to approach the problem posed by the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset. [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) provides two ways to create a model: the [Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential) model, in which the layers form a linear stack, and the [Functional API](https://www.tensorflow.org/guide/keras/functional), which is more flexible. For this tutorial, the first is enough. 

#### Single-Layer Network

We will start by creating a very simple network without hidden layers. Thus, in addition to the [input](https://www.tensorflow.org/api_docs/python/tf/keras/Input), it will only have a fully connected layer which, in the [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras), is called a [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer. Since the dataset poses a multiclass classification problem, the layer will have a number of neurons equal to the number of classes and we will use the softmax [activation](https://www.tensorflow.org/api_docs/python/tf/keras/activations). The *name* parameter is optional for both the model and the layers, but it is useful in more complex scenarios.

In [None]:
single_layer_model = tf.keras.Sequential(name='single_layer')
single_layer_model.add(tf.keras.layers.Input(iris_info.features['features'].shape))
single_layer_model.add(tf.keras.layers.Dense(iris_info.features['label'].num_classes, activation='softmax', name='output'))
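
As a rough illustration of what this layer computes, the sketch below reproduces a dense softmax layer in plain NumPy: an affine transformation followed by a softmax over the classes. The weights are random placeholders and the names `softmax` and `dense_softmax` are hypothetical; Keras uses its own initializer and implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dense_softmax(x, kernel, bias):
    """Fully connected layer followed by softmax: softmax(x @ W + b)."""
    return softmax(x @ kernel + bias)

rng = np.random.default_rng(0)
kernel = rng.normal(scale=0.1, size=(4, 3))  # 4 input features x 3 classes
bias = np.zeros(3)                           # one bias per class

batch = np.array([[5.1, 3.5, 1.4, 0.2],
                  [6.7, 3.0, 5.2, 2.3]])     # two illustrative feature vectors
probs = dense_softmax(batch, kernel, bias)
n_params = kernel.size + bias.size           # 4 * 3 + 3 trainable parameters
print(probs, n_params)
```

Each row of `probs` is a probability distribution over the three classes, and the parameter count (15) is the value one would expect to see for this layer in the model summary.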

With the architecture complete, we compile the model, defining the [loss function](https://www.tensorflow.org/api_docs/python/tf/keras/losses), the [optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers), and any additional [metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) we are interested in. Since it is a multiclass classification problem, we will use categorical cross-entropy as the loss function. We select the sparse version, since the labels are represented by integer values rather than one-hot vectors. Additionally, we request the accuracy values. To update the weights, we will use Stochastic Gradient Descent.

In [None]:
single_layer_model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
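
The difference between the sparse and the one-hot versions of the loss is only in how the targets are encoded. The NumPy sketch below, with made-up probabilities and hypothetical helper names, shows that both give the same per-example value; it omits the clipping that a library implementation would apply to avoid taking the logarithm of zero.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    # Integer labels: pick the probability assigned to the true class of each example
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked)

def categorical_crossentropy(y_onehot, probs):
    # One-hot labels: the sum keeps only the term of the true class
    return -(y_onehot * np.log(probs)).sum(axis=1)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])  # illustrative softmax outputs
labels = np.array([0, 1])            # integer class labels
onehot = np.eye(3)[labels]           # the same labels, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(labels, probs)
dense_loss = categorical_crossentropy(onehot, probs)
print(sparse_loss, dense_loss)
```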

We can request a summary of the model, to check if everything is as expected:

In [None]:
single_layer_model.summary()

##### Training

Now the model has been initialized and is ready for training. In order to avoid building the model again, we can serialize the initial weights and load them later:

In [None]:
single_layer_model.save_weights('single_layer_init.h5')

To train the model, we call the [fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method. It receives the training data and the corresponding targets and trains the network for the number of epochs defined by the *epochs* parameter. By default, the training examples are shuffled every epoch. Additionally, the *batch_size* parameter defines the number of examples considered in each update of the weights. In more complex scenarios, the data can be provided using generators that create the batches according to some specific criteria.


In [None]:
single_layer_train = single_layer_model.fit(iris_x, iris_y, epochs=100, batch_size=32)
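
To get a feeling for how *epochs* and *batch_size* interact, the sketch below enumerates the mini-batches of one epoch for a dataset of 150 examples. The generator `iterate_minibatches` is a hypothetical helper that only approximates what happens inside `fit`: the indices are reshuffled and then consumed in consecutive slices, with a smaller final batch.

```python
import math
import numpy as np

def iterate_minibatches(n_examples, batch_size, rng):
    """Yield index arrays covering all examples once, in shuffled order."""
    order = rng.permutation(n_examples)  # a new random order for this epoch
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(iterate_minibatches(150, 32, rng))
updates_per_epoch = len(batches)
batch_sizes = [len(b) for b in batches]
print(updates_per_epoch, batch_sizes)
```

With 150 examples and a batch size of 32 there are five weight updates per epoch, so 100 epochs correspond to 500 updates.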

##### Evaluation

The output of the training process is a [History](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/History) object that describes how the network evolved over the epochs in terms of the loss and the other selected metrics. We can plot this information:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(single_layer_train.history['loss'], '-r', label='Train')

acc_ax.set_title('Accuracy')
acc_ax.plot(single_layer_train.history['accuracy'], '-r', label='Train')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

Additionally, we can use the [predict](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict) method to obtain the output of the trained model for a given set of examples:

In [None]:
predictions = single_layer_model.predict(iris_x[:5])
print(predictions)

This gives the actual output of the network. Post-processing is necessary to attribute a class to each example. In this case, we select the index corresponding to the maximum value:

In [None]:
np.argmax(predictions, axis=1)

In order to evaluate the performance on a set of examples, given their expected targets, the [evaluate](https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate) method can be used directly:

In [None]:
loss, acc = single_layer_model.evaluate(iris_x, iris_y)
print('Accuracy: {}'.format(acc))

Since we only have training results, we don't know if the model generalizes well. The dataset has neither validation nor test partitions. Thus, we will use part of the training set for validation. In order to assess the generalization ability of the model, we must train it again, without that data. We have saved the initial weights of the model, so we can load them instead of building the model again:

In [None]:
single_layer_model.load_weights('single_layer_init.h5')

The [fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method has an optional parameter, *validation_split*, which creates the validation partition automatically by taking the defined fraction of examples from the end of the training data, before any shuffling. Thus, we must be careful when using it, since the training data may be ordered by class. This is not the case here, so we can use it. If we have a predefined validation partition, we can provide it using the *validation_data* parameter instead.

In [None]:
single_layer_train = single_layer_model.fit(iris_x, iris_y, validation_split=0.2, epochs=100, batch_size=32)
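
The warning about ordered data can be made concrete with a small sketch. The helper `split_for_validation` is hypothetical and only mimics the idea of holding out the last fraction of the arrays; the toy labels are deliberately sorted by class to show what goes wrong in that situation.

```python
import numpy as np

def split_for_validation(x, y, validation_split):
    """Hold out the last fraction of the examples, without shuffling."""
    n_val = int(len(x) * validation_split)
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

toy_x = np.arange(150 * 4).reshape(150, 4)  # placeholder features
toy_y = np.repeat([0, 1, 2], 50)            # labels sorted by class (worst case)

(train_x, train_y), (val_x, val_y) = split_for_validation(toy_x, toy_y, 0.2)
print(len(train_x), len(val_x), np.unique(val_y))
```

Here the validation partition contains only examples of the last class, so its accuracy would say little about the other two. Shuffling the arrays once before training avoids the problem when the data is ordered.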

Now we also have information regarding the validation data that we can plot:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(single_layer_train.history['loss'], '-r', label='Train')
loss_ax.plot(single_layer_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(single_layer_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(single_layer_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

Using the validation data, we can assess the generalization ability of our model. However, when should we stop training? An option is to stop after a predefined number of epochs without improvement on the validation data. To do this, let's start by loading the initial weights again:

In [None]:
single_layer_model.load_weights('single_layer_init.h5')

In Keras, this behavior is implemented using callbacks that are provided to the [fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method. To stop the training phase, we use the [EarlyStopping](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) callback. The metric that controls the process is defined using the *monitor* parameter and the number of epochs to wait for improvement is defined using the *patience* parameter:

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=50, verbose=1)

Also, we can save the weights of the model whenever the performance on the validation data improves, so that we can use the best model to classify new examples. To save the best model, we use the [ModelCheckpoint](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) callback. In this case we must provide the filename in which we want to save the weights. We must also provide the metric to consider using the *monitor* parameter and state that we only want to keep the best model, by setting the *save_best_only* parameter to *True*:

In [None]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('single_layer_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

Now we can train the model again, providing the list of callbacks in the *callbacks* parameter. In this case, we select an excessively high number of epochs, since we expect the training phase to stop before that number is reached.

In [None]:
single_layer_train = single_layer_model.fit(iris_x, iris_y, validation_split=0.2, callbacks=[earlystop, checkpoint], epochs=10000, batch_size=32)
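
The interplay between patience and the best checkpoint can be sketched in a few lines of plain Python. The function `early_stopping_epoch` and the list of validation accuracies are made up for illustration; the logic assumes that only a strict improvement resets the waiting counter.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch) for validation scores where higher is better."""
    best_score = float('-inf')
    best_epoch = -1
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # Improvement: this is where a checkpoint of the weights would be saved
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted, stop training
    return len(val_scores) - 1, best_epoch  # ran out of epochs first

scores = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69]  # hypothetical val_accuracy per epoch
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)
```

Training stops at epoch index 5, three epochs after the last improvement at epoch index 2. The weights in memory at that point are those of the last epoch, which is why the checkpoint file is loaded afterwards.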

We can still plot the evolution as the number of epochs increases:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(single_layer_train.history['loss'], '-r', label='Train')
loss_ax.plot(single_layer_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(single_layer_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(single_layer_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

To load the weights of the best model, we use the same method as for loading the initial weights, but using the filename defined in the callback:

In [None]:
single_layer_model.load_weights('single_layer_best.h5')

#### Multi-Layer Model

To create a network with hidden layers, we simply add additional [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layers before the output layer:

In [None]:
multi_layer_model = tf.keras.Sequential(name='multi_layer')
multi_layer_model.add(tf.keras.layers.Input(iris_info.features['features'].shape))
multi_layer_model.add(tf.keras.layers.Dense(256, activation='tanh', name='hidden'))
multi_layer_model.add(tf.keras.layers.Dense(iris_info.features['label'].num_classes, activation='softmax', name='output'))

multi_layer_model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
multi_layer_model.summary()

Let's also use callbacks for stopping the training phase:



In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=50, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('multi_layer_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

Now we can train the model:

In [None]:
multi_layer_train = multi_layer_model.fit(iris_x, iris_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=32)

And plot the evolution:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(multi_layer_train.history['loss'], '-r', label='Train')
loss_ax.plot(multi_layer_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(multi_layer_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(multi_layer_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

##### Regularization

When adding layers to the network, we can also include regularization in those layers using three different parameters: *kernel_regularizer*, *bias_regularizer*, and *activity_regularizer*. The first applies regularization to the weights of the layer, the second to its bias, and the last to its output. [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) also has implementations of multiple [regularizers](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers). As an example, let's create a network with the same architecture as the previous one, but with [L2](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers/l2) regularization applied to the weights of the hidden layer:

In [None]:
multi_layer_reg_model = tf.keras.Sequential(name='multi_layer_regularization')
multi_layer_reg_model.add(tf.keras.layers.Input(iris_info.features['features'].shape))
multi_layer_reg_model.add(tf.keras.layers.Dense(256, activation='tanh', kernel_regularizer=tf.keras.regularizers.l2(0.01), name='hidden'))
multi_layer_reg_model.add(tf.keras.layers.Dense(iris_info.features['label'].num_classes, activation='softmax', name='output'))

multi_layer_reg_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
multi_layer_reg_model.summary()
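
In terms of the objective, L2 regularization adds a penalty proportional to the sum of the squared weights to the loss that is minimized. The sketch below uses a tiny hand-written weight matrix and a hypothetical cross-entropy value just to show the arithmetic; `l2_penalty` is an illustrative helper, not a Keras function.

```python
import numpy as np

def l2_penalty(weights, factor):
    """L2 regularization term: factor * sum of squared weights."""
    return factor * np.sum(np.square(weights))

weights = np.array([[1.0, -2.0],
                    [0.5, 0.0]])  # illustrative kernel of a tiny layer
data_loss = 0.40                  # hypothetical cross-entropy value

penalty = l2_penalty(weights, 0.01)  # 0.01 * (1 + 4 + 0.25)
total_loss = data_loss + penalty
print(penalty, total_loss)
```

Because large weights increase the penalty, the optimizer is pushed towards smaller weights, which tends to produce smoother decision boundaries.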

Now, let's train it on the [Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset:

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=50, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('multi_layer_reg_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

multi_layer_reg_train = multi_layer_reg_model.fit(iris_x, iris_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=32)

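We can see that the model converges faster. Note, however, that this model is compiled with the Adam optimizer rather than SGD, so the difference cannot be attributed to the regularization alone. By plotting the evolution, we can also see that there are fewer oscillations: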

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(multi_layer_reg_train.history['loss'], '-r', label='Train')
loss_ax.plot(multi_layer_reg_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(multi_layer_reg_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(multi_layer_reg_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

## Convolutional Neural Networks

Convolutional layers capture patterns corresponding to relevant features independently of where they occur in the input. To do so, they slide a window over the input and apply the convolution operation with a set of kernels or filters that represent the features. Although it is not their only field of application, convolutional neural networks are mainly praised for their performance on image processing tasks. 

### Dataset

In this part of the tutorial we will use the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset, which is widely used for introducing image classification problems: 

* 70k examples of handwritten digits
* Image size: 28x28
* 1 channel
* 10 classes: [0-9]

In [None]:
mnist_data, mnist_info = tfds.load('mnist', with_info=True)

In [None]:
print(mnist_info)

We can see that, in this case, the dataset has a standard partition of 60k examples for training and 10k for testing. Let's convert them to [NumPy](https://numpy.org/) arrays:

In [None]:
mnist_train_x = np.asarray([instance['image']/255 for instance in tfds.as_numpy(mnist_data['train'])])
mnist_train_y = np.asarray([instance['label'] for instance in tfds.as_numpy(mnist_data['train'])])

mnist_test_x = np.asarray([instance['image']/255 for instance in tfds.as_numpy(mnist_data['test'])])
mnist_test_y = np.asarray([instance['label'] for instance in tfds.as_numpy(mnist_data['test'])])

Furthermore, the dataset includes methods for visualizing examples:

In [None]:
tfds.show_examples(mnist_data['test'], mnist_info)

### Models

Now, let's create and train some networks to approach the problem posed by the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset.

#### Baseline

As a baseline, let's use a feed-forward network to approach the problem. The only difference from the previous networks is that we need to [flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten) the image before passing it to the [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer:

In [None]:
mnist_baseline_model = tf.keras.Sequential(name='mnist_baseline')
mnist_baseline_model.add(tf.keras.layers.Input(mnist_info.features['image'].shape))
mnist_baseline_model.add(tf.keras.layers.Flatten(name='flatten'))
mnist_baseline_model.add(tf.keras.layers.Dense(mnist_info.features['label'].num_classes, activation='softmax', name='output'))
mnist_baseline_model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
mnist_baseline_model.summary()

Now we can train the model. We will still use part of the training set for validation, in order to control when to stop the training phase:

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('mnist_baseline_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

mnist_baseline_model_train = mnist_baseline_model.fit(mnist_train_x, mnist_train_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=256)

Let's see how the performance evolved on the training and validation data:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(mnist_baseline_model_train.history['loss'], '-r', label='Train')
loss_ax.plot(mnist_baseline_model_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(mnist_baseline_model_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(mnist_baseline_model_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

Now let's load the best model for the validation data and evaluate it on the test set:

In [None]:
mnist_baseline_model.load_weights('mnist_baseline_best.h5')
loss, acc = mnist_baseline_model.evaluate(mnist_test_x, mnist_test_y)
print('Accuracy: {}'.format(acc))

#### Convolutional Neural Network

To create our CNN, instead of flattening the image and feeding it directly to the output layer, we will first pass it through a convolutional layer followed by a max pooling operation. Since we are dealing with 2D data, we will use the [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) and [MaxPool2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool2D) layers.

In the convolutional layer, the *filters* parameter defines the number of kernels or filters used in the layer. The *kernel_size* parameter defines the size of the kernels. If only one number is provided, the kernel is assumed to be square. The stride values default to one, but can be changed using the *strides* parameter. Also, we can use same padding, by setting the *padding* parameter to 'same'.

For the pooling operation, we define the size of the pooling window using the *pool_size* parameter. Similarly to the *kernel_size* parameter of the convolutional layer, if only one number is provided, the window is assumed to be square. Additionally, the *strides* parameter defaults to the size of the pooling window. That is, there is no overlap.

In [None]:
mnist_conv_model = tf.keras.Sequential(name='mnist_cnn')
mnist_conv_model.add(tf.keras.layers.Input(mnist_info.features['image'].shape))
mnist_conv_model.add(tf.keras.layers.Conv2D(filters=16, kernel_size=4, activation='relu', padding='same', name='convolution'))
mnist_conv_model.add(tf.keras.layers.MaxPool2D(pool_size=2, name='pooling'))
mnist_conv_model.add(tf.keras.layers.Flatten(name='flatten'))
mnist_conv_model.add(tf.keras.layers.Dense(mnist_info.features['label'].num_classes, activation='softmax', name='output'))
mnist_conv_model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
mnist_conv_model.summary()
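
The shapes and parameter counts reported by the summary can be checked by hand. The following sketch does the arithmetic for the architecture above using two hypothetical helpers; it relies only on the standard output-size formulas for 'same' and 'valid' padding.

```python
import math

def conv2d_output_size(size, kernel, stride=1, padding='valid'):
    """Spatial output size of a convolution or pooling window along one dimension."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

def conv2d_params(kernel, in_channels, filters):
    """One kernel x kernel x in_channels weight block plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

conv_side = conv2d_output_size(28, kernel=4, padding='same')  # 28x28 preserved
pool_side = conv2d_output_size(conv_side, kernel=2, stride=2)  # non-overlapping 2x2 pooling
flat_size = pool_side * pool_side * 16                         # input size of the output layer

conv_p = conv2d_params(kernel=4, in_channels=1, filters=16)
dense_p = flat_size * 10 + 10
baseline_p = 28 * 28 * 10 + 10                                 # flatten + dense baseline
print(conv_side, pool_side, flat_size, conv_p, dense_p, baseline_p)
```

The convolutional layer itself is small (272 parameters); most of the 31,642 parameters sit in the output layer, which now receives 3,136 values instead of the 784 pixels of the baseline.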

Let's train the model using the same approach as before:

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('mnist_conv_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

mnist_conv_model_train = mnist_conv_model.fit(mnist_train_x, mnist_train_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=256)

And visualize the evolution:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(mnist_conv_model_train.history['loss'], '-r', label='Train')
loss_ax.plot(mnist_conv_model_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(mnist_conv_model_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(mnist_conv_model_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

Finally, we can evaluate the model on the test set, and verify that the performance is higher than without the convolutional layers.

In [None]:
mnist_conv_model.load_weights('mnist_conv_best.h5')
loss, acc = mnist_conv_model.evaluate(mnist_test_x, mnist_test_y)
print('Accuracy: {}'.format(acc))

#### Dropout

In this scenario, instead of applying regularization to the weights, we will use a different approach to regularization, namely, dropout. The idea behind dropout is to disable a percentage of randomly selected neurons during each step of the training phase, in order to avoid overfitting. In [Keras](https://www.tensorflow.org/api_docs/python/tf/keras), we can apply dropout directly to some layers by defining the corresponding parameters, or by using the [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layer and stating the percentage of neurons to disable.

In [None]:
mnist_conv_drop_model = tf.keras.Sequential(name='mnist_cnn_dropout')
mnist_conv_drop_model.add(tf.keras.layers.Input(mnist_info.features['image'].shape))
mnist_conv_drop_model.add(tf.keras.layers.Conv2D(filters=16, kernel_size=4, activation='relu', padding='same', name='convolution'))
mnist_conv_drop_model.add(tf.keras.layers.MaxPool2D(pool_size=2, name='pooling'))
mnist_conv_drop_model.add(tf.keras.layers.Dropout(0.5, name='dropout'))
mnist_conv_drop_model.add(tf.keras.layers.Flatten(name='flatten'))
mnist_conv_drop_model.add(tf.keras.layers.Dense(mnist_info.features['label'].num_classes, activation='softmax', name='output'))
mnist_conv_drop_model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
mnist_conv_drop_model.summary()
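
A minimal NumPy sketch of the mechanism may help here. It assumes the usual "inverted dropout" formulation, in which the surviving activations are scaled up during training so that their expected value is unchanged and nothing needs to be done at inference time. The function `dropout` below is an illustrative helper, not the Keras layer.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of the values and rescale the rest."""
    if not training:
        return x  # dropout is a no-op at inference time
    keep_mask = rng.random(x.shape) >= rate  # True for the units that survive
    return x * keep_mask / (1.0 - rate)      # rescale to preserve the expected value

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, 0.5, rng)
print(np.unique(dropped), dropped.mean())
```

With a rate of 0.5, roughly half of the values become zero and the rest are doubled, so the mean stays close to one.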

Let's train the model:

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=10, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('mnist_conv_drop_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

mnist_conv_drop_model_train = mnist_conv_drop_model.fit(mnist_train_x, mnist_train_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=256)

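By looking at the evolution, we can see that the performance of the model on the training data is now lower. This is expected, since dropout is active only during training and is disabled when the model is evaluated on the validation data.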

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(mnist_conv_drop_model_train.history['loss'], '-r', label='Train')
loss_ax.plot(mnist_conv_drop_model_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(mnist_conv_drop_model_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(mnist_conv_drop_model_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

And we can assess the performance on the test set:

In [None]:
mnist_conv_drop_model.load_weights('mnist_conv_drop_best.h5')
loss, acc = mnist_conv_drop_model.evaluate(mnist_test_x, mnist_test_y)
print('Accuracy: {}'.format(acc))

## Recurrent Neural Networks

Recurrent neural networks are particularly appropriate to deal with sequential inputs, which include dependencies between the elements of the sequence. Among others, an important field of application of this kind of network is text processing. 

### Dataset

In this part of the tutorial we will use the [IMDb Reviews](http://ai.stanford.edu/~amaas/data/sentiment/) dataset, which can be used to introduce not only the use of recurrent neural networks, but also the problems that arise while processing text.

* 100k textual movie reviews: 25k for training, 25k for testing, and 50k unlabeled
* 2 classes: Positive or Negative (Sentiment Analysis)

In [None]:
imdb_data, imdb_info = tfds.load('imdb_reviews', with_info=True)

In [None]:
imdb_info

The dataset is partitioned into train and test sets, as well as a set of unlabeled data. Let's obtain the text and labels for the training and test data:

In [None]:
# tfds.as_numpy yields each review as bytes; decode it instead of calling str(), which would keep the b'...' wrapper
imdb_train_text = np.asarray([instance['text'].decode('utf-8') for instance in tfds.as_numpy(imdb_data['train'])])
imdb_train_y = np.asarray([instance['label'] for instance in tfds.as_numpy(imdb_data['train'])])

imdb_test_text = np.asarray([instance['text'].decode('utf-8') for instance in tfds.as_numpy(imdb_data['test'])])
imdb_test_y = np.asarray([instance['label'] for instance in tfds.as_numpy(imdb_data['test'])])

We can take a look at some examples:

In [None]:
for t, c in zip(imdb_train_text[:5], imdb_train_y[:5]):
    print('Text:\n{}\nClass: {}'.format(t,c))

### Model

We cannot feed the textual reviews directly to a network. First, we must tokenize them, that is, transform them into a sequence of words. Then, we transform each word into a numerical index that represents it. [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) has its own [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) implementation. When creating it, we can set the *num_words* parameter to only consider the most common words in the training set:

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
tokenizer.fit_on_texts(imdb_train_text)

Now that we have the tokenizer, we can use it to transform the textual reviews into sequences of token indexes:

In [None]:
imdb_train_seqs = tokenizer.texts_to_sequences(imdb_train_text)
imdb_test_seqs = tokenizer.texts_to_sequences(imdb_test_text)
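
The two steps, building a frequency-ranked vocabulary and mapping texts to indexes, can be sketched in plain Python. This is a simplified stand-in with hypothetical helper names: it only lowercases and splits on whitespace, whereas the real tokenizer also filters punctuation. It assumes that indexes start at 1 and that words ranked at or beyond *num_words* are dropped.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an index, with the most frequent word getting index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so that it can later serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their indexes, dropping those outside the kept vocabulary."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ['the movie was great', 'the movie was bad', 'the plot was great']
word_index = fit_word_index(toy_texts)
seqs = texts_to_sequences(toy_texts, word_index, num_words=4)
print(word_index)
print(seqs)
```

Rare words simply disappear from the sequences, so the third toy review is reduced to two tokens. This is one of the reasons why sequence lengths differ from the raw word counts.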

Not all the reviews have the same number of words:

In [None]:
print('Min length: {}'.format(min([len(seq) for seq in imdb_train_seqs])))
print('Max length: {}'.format(max([len(seq) for seq in imdb_train_seqs])))

However, all the examples in a training batch must have the same length. A possible approach (not the best) is to truncate the sequences to a maximum number of words and add padding to those which are shorter than that limit. We can do that using the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function:

In [None]:
imdb_train_x = tf.keras.preprocessing.sequence.pad_sequences(imdb_train_seqs, maxlen=50)
imdb_test_x = tf.keras.preprocessing.sequence.pad_sequences(imdb_test_seqs, maxlen=50)
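
With its default settings, this step pads and truncates at the start of each sequence, so the end of a long review is what survives. The sketch below mimics that behavior for a single sequence; `pad_or_truncate` is a hypothetical helper written for illustration.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` tokens and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]                         # 'pre' truncation drops tokens from the start
    return [value] * (maxlen - len(seq)) + seq  # 'pre' padding adds the filler on the left

short_seq = pad_or_truncate([5, 8, 2], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq, long_seq)
```

Padding on the left keeps the actual words next to the final step of the recurrent layer, whose last output is what the classifier sees.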

Now let's create the model. The first layer in the network will be an [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer. This layer transforms the sparse index-based representation of words into a dense representation. Then, we add the recurrent layer. Since there are long-distance dependencies between the words, instead of using a basic recurrent layer, which in [Keras](https://www.tensorflow.org/api_docs/python/tf/keras) is called [SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN), we will use an [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), which is a more complex recurrent layer that is able to capture those dependencies. Finally, since the dataset poses a binary classification problem, the output layer is a single neuron with sigmoid activation, which amounts to a logistic regression over the output of the LSTM:

In [None]:
imdb_rnn_model = tf.keras.Sequential(name='imdb_rnn')
imdb_rnn_model.add(tf.keras.layers.Embedding(tokenizer.num_words, 128, name='embedding'))
imdb_rnn_model.add(tf.keras.layers.LSTM(128, name='recurrent'))
imdb_rnn_model.add(tf.keras.layers.Dense(1, activation='sigmoid', name='output'))
imdb_rnn_model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
imdb_rnn_model.summary()
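
The parameter counts in the summary can again be checked by hand. The sketch below assumes the standard LSTM formulation with four gates, each with input weights, recurrent weights, and a bias; the helper functions are hypothetical and only do the arithmetic.

```python
def embedding_params(vocab_size, dim):
    """One dense vector of size `dim` per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

emb_p = embedding_params(10000, 128)
lstm_p = lstm_params(128, 128)
out_p = dense_params(128, 1)
total_p = emb_p + lstm_p + out_p
print(emb_p, lstm_p, out_p, total_p)
```

Most of the parameters belong to the embedding table, which is why limiting the vocabulary with *num_words* has a direct effect on the size of the model.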

Now we can train the model. We use a lower patience for early stopping than for the previous models because each training epoch of a recurrent network takes much longer.

In [None]:
earlystop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint('imdb_rnn_best.h5', monitor='val_accuracy', verbose=1, save_best_only=True)

imdb_rnn_model_train = imdb_rnn_model.fit(imdb_train_x, imdb_train_y, validation_split=0.2, callbacks=[earlystop,checkpoint], epochs=10000, batch_size=64)

Let's visualize the evolution:

In [None]:
fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(20,7))

loss_ax.set_title('Loss')
loss_ax.plot(imdb_rnn_model_train.history['loss'], '-r', label='Train')
loss_ax.plot(imdb_rnn_model_train.history['val_loss'], '-g', label='Validation')

acc_ax.set_title('Accuracy')
acc_ax.plot(imdb_rnn_model_train.history['accuracy'], '-r', label='Train')
acc_ax.plot(imdb_rnn_model_train.history['val_accuracy'], '-g', label='Validation')

loss_ax.legend(loc='best')
acc_ax.legend(loc=4)
plt.show()

Finally, let's assess the performance of the model on the test data:

In [None]:
imdb_rnn_model.load_weights('imdb_rnn_best.h5')
loss, acc = imdb_rnn_model.evaluate(imdb_test_x, imdb_test_y)
print('Accuracy: {}'.format(acc))