## A first look
In this notebook we will do the "Hello world!" of deep learning (DL). Do not worry if you do not understand all the steps performed in the notebook. Setting up a network in the beginning can seem complicated, but we will explain all steps later in the session and during the rest of the course. This notebook is only meant to show how "things look and work" in practice, and show you the typical steps involved in solving a problem with deep learning.

## Classification

In this notebook we will perform **classification** on a dataset containing 70000 images of handwritten digits (the MNIST dataset). Our task is to **classify** each handwritten digit as a number from 0 to 9. Every image in the dataset has already been correctly **labelled** as one of these numbers and we use these labels to guide the training.

Let's look at some examples:

![The MNIST dataset](images/mnist.png)

## Getting the data
The dataset is built into the **Keras** library, so let us start by loading the library and the dataset. Please select the cell below and run it, either by pressing `Shift + Enter` or with the run button at the top of the notebook, which looks like this: ![The Run button](images/run-button.png)

In [None]:
library(keras)

mnist <- dataset_mnist()
train_images <- mnist$train$x
train_labels <- mnist$train$y

Let's inspect the **dimensions** of the training images using R's [`dim`](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/dim) command:

In [None]:
dim(train_images)

In deep learning, the first dimension is typically used for **instances** or **samples**. That is, we have 60000 training images (samples) available.

The remaining dimensions are the dimensions for each particular instance. We can see from the output above that each instance has two dimensions corresponding to the size of the image (28 by 28 pixels).
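
The "samples first" layout can be tried out on a tiny array. The sketch below uses a hypothetical stand-in, `toy_images`, holding 3 blank "images" of 4 by 4 pixels; it is not the MNIST data, only an illustration of how the dimensions are organised.

```r
# Hypothetical toy data: 3 "images" of 4 x 4 pixels, all zeros.
toy_images <- array(0, dim = c(3, 4, 4))

# The first dimension counts samples; the remaining two describe one image.
toy_dims <- dim(toy_images)
toy_dims

# Selecting one sample drops the first dimension and leaves a 4 x 4 matrix.
first_image <- toy_images[1, , ]
dim(first_image)
```

The same indexing pattern, `train_images[1, , ]`, is used further down to look at a real image.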

Let's do the same for the training *labels*:

In [None]:
dim(train_labels)

As you can see, we have 60000 labels as well, each of them corresponding to an image. This data is one-dimensional, since we only need a single number as a label. To get a feel for the labels, we use the [`str`](https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/str) command:

In [None]:
str(train_labels)

You can see that the labels are integers. Since we have 10 classes, they run from 0 to 9.
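
Because `str` only prints the first few values, the full range of labels is easier to check with base R's `range` and `table`. The sketch below assumes a small hand-typed vector, `toy_labels`, as a stand-in so that it runs on its own.

```r
# Hypothetical stand-in for the label vector: ten hand-typed integer labels.
toy_labels <- c(5L, 0L, 4L, 1L, 9L, 2L, 1L, 3L, 1L, 4L)

# Smallest and largest label value.
label_range <- range(toy_labels)
label_range

# How often each label value occurs.
label_counts <- table(toy_labels)
label_counts
```

On the real data, the same calls would be `range(train_labels)` and `table(train_labels)`.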

We can show the first image as a matrix to get a sense of how the images are represented numerically:

In [None]:
options(repr.matrix.max.cols=28, repr.matrix.max.rows=28)  # This will tell R to show the full matrix
train_images[1,,]

You should see a `5` digit appear in the matrix, with the background (black) represented as 0, and white as 255. Let's check the label of this sample to confirm:

In [None]:
train_labels[1]

## Building a model

Having inspected the data we now define the model, or network. The network will take an image as its input and output a label. At this stage it is not important to understand how this happens or what the lines below do exactly.

The single important thing to notice is how we use the Keras library and that we define two **dense** layers.

In [None]:
model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, activation = "sigmoid", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

At a high level, the first layer will learn to **represent** the data and the second layer will output labels, 0-9. This is what deep learning is all about: putting layers on top of layers and then producing some output.

Usually, the first layers will **compress** raw data to useful **features**, which bypasses a lot of work (*feature engineering*) required for conventional machine learning (ML) techniques.
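
For the curious, here is a rough base R sketch of what two stacked dense layers compute for a single sample. It is an illustration under simplifying assumptions, not how Keras implements it: the sizes (4 inputs, 3 hidden units, 2 classes), the random weights and the helper names `sigmoid` and `softmax` are all made up for this example.

```r
set.seed(1)  # make the random toy weights reproducible

# Hypothetical toy sizes: 4 input values, 3 hidden units, 2 output classes.
x  <- matrix(runif(4), nrow = 1)       # one sample as a row vector
W1 <- matrix(rnorm(4 * 3), nrow = 4)   # weights of the first dense layer
b1 <- rep(0, 3)                        # biases of the first dense layer
W2 <- matrix(rnorm(3 * 2), nrow = 3)   # weights of the second dense layer
b2 <- rep(0, 2)                        # biases of the second dense layer

# Activation functions, written out by hand for illustration.
sigmoid <- function(z) 1 / (1 + exp(-z))
softmax <- function(z) exp(z - max(z)) / sum(exp(z - max(z)))

# First layer: weighted sum plus bias, squashed into (0, 1) by the sigmoid.
hidden <- sigmoid(x %*% W1 + b1)

# Second layer: one score per class, turned into probabilities by softmax.
output <- softmax(hidden %*% W2 + b2)
output
```

The values in `output` are non-negative and sum to 1, so they can be read as the probability the network assigns to each class.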

### Loss, optimizer and metric

But how do we make sure that the output of the model makes sense and produces good results? Put differently, how do we give feedback on the model's predictions? This is done during **training**. During training the model is fed an image and outputs a label. By passing the actual network output and the expected output (in this case a label between 0 and 9) to a **loss function**, we can calculate a **loss value**, or simply **loss**.
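
As a small illustration of a loss value, the sketch below computes the cross-entropy for one sample by hand: the negative logarithm of the probability the model assigned to the correct class. The probabilities, the three-class setup and the helper `cross_entropy` are hypothetical and only serve the example.

```r
# Hypothetical predicted probabilities for the classes 0, 1 and 2.
probs_good <- c(0.1, 0.8, 0.1)  # most of the probability on class 1
probs_bad  <- c(0.7, 0.2, 0.1)  # most of the probability on class 0
true_label <- 1                 # the correct class for this sample

# Cross-entropy for a single sample: minus the log of the probability
# given to the true class. The + 1 is needed because R indexes from 1.
cross_entropy <- function(probs, label) -log(probs[label + 1])

loss_good <- cross_entropy(probs_good, true_label)  # about 0.223
loss_bad  <- cross_entropy(probs_bad, true_label)   # about 1.609
c(loss_good, loss_bad)
```

A confident correct prediction gives a loss close to 0, while a prediction that puts little probability on the true class gives a larger loss.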

We seek to minimize this loss by updating the **parameters** of the model. The **optimizer** takes care of updating the parameters based on the loss function. Usually, the model processes a number of images in a **batch** and then performs one update **step**. Little by little the model starts performing better. 
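
To make the idea of repeated update steps concrete, here is a minimal sketch of plain gradient descent on a toy loss with a single parameter. This is a deliberate simplification: it is not the RMSprop optimizer used in this notebook, it uses no batches, and the loss function, learning rate and number of steps are assumptions chosen for illustration.

```r
# Toy loss with a single parameter w; its minimum is at w = 3.
loss <- function(w) (w - 3)^2

# Derivative of the toy loss with respect to w.
gradient <- function(w) 2 * (w - 3)

w <- 0                # starting value of the parameter
learning_rate <- 0.1  # size of each update step (an assumed value)

# Each step moves w a little in the direction that lowers the loss.
for (step in 1:50) {
  w <- w - learning_rate * gradient(w)
}

w        # close to 3
loss(w)  # close to 0
```

A real network has many thousands of parameters instead of one, but each update follows the same pattern: compute how the loss changes with each parameter, then nudge the parameters to reduce it.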

The loss function is generally not easy to relate to any evaluation criterion that we as humans may understand. To relate the network performance to a number understandable by us, we define a **metric**.
In the next cell we define an optimizer, a loss function and a metric which measures the performance of our model, in this case **accuracy**. For the time being, ignore the values given to the optimizer and loss.

In [None]:
model %>% compile(
    optimizer = "rmsprop",
    loss = "categorical_crossentropy",
    metrics = c("accuracy")
)
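
The accuracy metric named in the `compile` call is the fraction of samples for which the predicted label equals the true label. A minimal sketch with hypothetical predictions:

```r
# Hypothetical predicted and true labels for six samples.
predicted <- c(5, 0, 4, 1, 9, 2)
actual    <- c(5, 0, 4, 7, 9, 3)

# Comparing element by element gives TRUE/FALSE; the mean of those values
# is the fraction of correct predictions (here 4 out of 6).
accuracy <- mean(predicted == actual)
accuracy
```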

### Pre-process data

Before we can run our data through the model and start training it, we need to make small adjustments to the data. At this stage it is not important to understand what the lines below do.

In [None]:
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
train_labels <- to_categorical(train_labels)

Let's inspect the dimensions of the preprocessed training data again:

In [None]:
dim(train_images)

In [None]:
dim(train_labels)

## Exercise 1
Compare the dimensions of the data with the ones from the original data set. Can you guess how the data has changed, and why?

In [None]:
<FILL IN YOUR ANSWER>

### Train the network

Now we can start training the model. We do this by calling the Keras [`fit`](https://keras.rstudio.com/reference/fit.html) function. Since neural networks are updated incrementally, we often train using the same **samples** many times. The `epochs` parameter controls how many times we run the whole training set through the network; one such pass is called an **epoch**.

In [None]:
history <- model %>% fit(
    train_images, train_labels, 
    epochs = 10, 
    batch_size = 1024
)

We can plot the **loss** and the **accuracy** of the trained model after each epoch:

In [None]:
plot(history)

As you can see, the loss decreases steadily, and the accuracy of the model improves after each epoch.
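
To get a feel for how much updating happened during this run, we can count the update steps from the numbers used above (60000 samples, a batch size of 1024 and 10 epochs). The arithmetic assumes the last, smaller batch of each epoch is also used for an update.

```r
n_samples  <- 60000  # number of training images
batch_size <- 1024   # images processed per update step
epochs     <- 10     # passes over the whole training set

# Number of batches needed to cover all samples once, rounding up
# because the final batch holds only the leftover samples.
steps_per_epoch <- ceiling(n_samples / batch_size)
last_batch_size <- n_samples - (steps_per_epoch - 1) * batch_size
total_steps     <- steps_per_epoch * epochs

c(steps_per_epoch, last_batch_size, total_steps)
```

A smaller batch size would give more, noisier update steps per epoch; a larger one would give fewer.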

### Evaluate the network

In the plot above, you can see that the training accuracy is more than 99%. This is very high, of course, but not necessarily a good measure of how the model performs on samples it has not seen before.

To properly evaluate the performance of our model, we need a **test dataset**. Let's gather the test dataset and adjust it like we did with the training dataset:

In [None]:
test_images <- mnist$test$x
test_labels <- mnist$test$y

test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255
test_labels <- to_categorical(test_labels)

In [None]:
dim(test_images)

As you can see, we have 10000 images to test our model's performance on. Let's do that with the Keras [`evaluate`](https://www.rdocumentation.org/packages/keras/versions/0.3.8/topics/evaluate) function.

In [None]:
metrics <- model %>% evaluate(test_images, test_labels)
metrics

As you can see, performance is still great but lower than on our training set.

The reason is that the model is **overfitting**, a concept that we will explore more closely in the next session.

## Bonus exercise (optional)
Try to improve the training accuracy by increasing the number of epochs little by little. What effects do you see on the training and test accuracy? Can you think of a reason why you see these effects?

We have provided you with the necessary code in the cell below, so you can run it to train and evaluate the model.

**NOTE: please be aware that each epoch takes around one second, so try not to use too many epochs.**

In [None]:
# Build the model

model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, activation = "sigmoid", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

# Compile it

model %>% compile(
    optimizer = "rmsprop",
    loss = "categorical_crossentropy",
    metrics = c("accuracy")
)

# Train it

history <- model %>% fit(
    train_images, train_labels, 
    epochs = 100, 
    batch_size = 1024
)

# Plot the training accuracy and loss

plot(history)

# Show the loss and accuracy on the test set

model %>% evaluate(test_images, test_labels)