# Exercise 03.2 - Keras Sequential Model 

Before you can start, you have to find a GPU on the system that is not heavily used by other users. Otherwise you cannot initialize your neural network.


**Hint:** the command to run in a terminal is **nvidia-smi** (a single hyphenated word, in case a line break splits it across two lines).

As a result you get a summary of the GPUs available in the system, their current memory usage (in MiB, i.e. mebibytes), and their current utilization (in %). There should be six or eight GPUs listed, numbered 0 to n-1 (n being the number of GPUs). The GPU number (id) appears at the start of each GPU section, and the ids increase by 1 from top to bottom.

Find a GPU where the memory usage is low. For this purpose look at the memory usage, which looks something like '365MiB / 16125MiB'. The first value is the already used up memory and the second value is the total memory of the GPU. Look for a GPU where there is a large difference between the first and the second value.
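
The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the sample text and the helper name `pick_gpu` are made up for this example, and it assumes that each GPU contributes exactly one `used MiB / total MiB` entry, listed in order of GPU id.

```python
import re

# Hypothetical excerpt of the memory column, one entry per GPU (ids 0, 1, 2)
sample_output = """
|   365MiB / 16125MiB |
| 15800MiB / 16125MiB |
|     3MiB / 16125MiB |
"""

def pick_gpu(text):
    """Return (gpu_id, free_mib) for the GPU with the most free memory."""
    # Each match is a (used, total) pair of strings, both in MiB
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", text)
    free = [int(total) - int(used) for used, total in pairs]
    best = max(range(len(free)), key=free.__getitem__)
    return best, free[best]

gpu_id, free_mib = pick_gpu(sample_output)
print(f"GPU {gpu_id} has the most free memory: {free_mib} MiB")
```

In this made-up sample, GPU 2 would be the one to choose.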

**Remember the GPU id and write it in the next line instead of the character X.**

In [1]:
# Change X to the GPU number you want to use,
# otherwise you will get a Python error
# e.g. USE_GPU = 4
USE_GPU = 0

Alternatively, you can use the terminal command in the Jupyter notebook by prefixing the command with an exclamation mark.

In [3]:
# Alternatively, run the shell command directly from the notebook
!nvidia-smi

Wed Nov 10 11:55:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8    29W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Choose one GPU

**The following code is very important and must always be executed before using TensorFlow in the exercises, so that only one GPU is used and that it is set in a way that not all its memory is used at once. Otherwise, the other students will not be able to work with this GPU.**

The following program code imports the TensorFlow library for Deep Learning and outputs the version of the library.

Then, TensorFlow is configured to see only the one GPU whose number you assigned to `USE_GPU` in the cell above.

Finally, the GPU is set so that it does not immediately reserve all memory, but only uses more memory when needed. 

(The comments within the code cell explain a bit of what is happening, if you are interested in understanding it better. See also the TensorFlow documentation for an explanation of the methods used.)

In [2]:
# Import TensorFlow 
import tensorflow as tf

# Print the installed TensorFlow version
print(f'TensorFlow version: {tf.__version__}\n')

# Get all GPU devices on this server
gpu_devices = tf.config.list_physical_devices('GPU')

# Print the name and the type of all GPU devices
print('Available GPU Devices:')
for gpu in gpu_devices:
    print(' ', gpu.name, gpu.device_type)
    
# Set only the GPU specified as USE_GPU to be visible
tf.config.set_visible_devices(gpu_devices[USE_GPU], 'GPU')

# Get all visible GPU  devices on this server
visible_devices = tf.config.get_visible_devices('GPU')

# Print the name and the type of all visible GPU devices
print('\nVisible GPU Devices:')
for gpu in visible_devices:
    print(' ', gpu.name, gpu.device_type)
    
# Set the visible device(s) to not allocate all available memory at once,
# but rather let the memory grow whenever needed
for gpu in visible_devices:
    tf.config.experimental.set_memory_growth(gpu, True)

TensorFlow version: 2.7.0

Available GPU Devices:
  /physical_device:GPU:0 GPU

Visible GPU Devices:
  /physical_device:GPU:0 GPU


# Introduction to the Keras Sequential Model

### Part II: Training a Keras model

In the previous section of this tutorial you learned to build neural networks using the Sequential API. In this notebook you will learn how to configure the training process and train the model.

## Learning objectives

- configure the learning process via the `compile` method
- understand the various ways of passing the required arguments to the `compile` method
- learn how to launch the training process  
- learn to evaluate the trained model
- learn to make predictions on test/new data



For this purpose we will build a classification network for classifying the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples.
The data is available in the [keras.datasets API](https://www.tensorflow.org/api_docs/python/tf/keras/datasets), which provides some utility functions to fetch and load common datasets. We can load the data using the `load_data()` function.


In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import  Flatten, Dense

In [5]:
mnist = tf.keras.datasets.mnist
(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [6]:
print(X_train_full.shape)

(60000, 28, 28)


In [7]:
print(X_train_full.dtype)

uint8


- As you can observe, the dataset is already split into a training set and a test set, but there is no validation set, so we will put aside 5000 examples for validation purposes.
- Moreover, we will scale the input features, which improves the optimization process. For simplicity, we just scale the pixel intensities down to the 0-1 range by dividing them by 255.0 (this also converts them to floats):

In [8]:
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
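
To make the effect of the split and the scaling visible without downloading MNIST, here is a small sketch on synthetic data. The array sizes and the variable names (`X_full`, `X_val`, `X_tr`, and so on) are chosen for this illustration only and deliberately differ from the notebook's variables.

```python
import numpy as np

# Synthetic stand-in for MNIST: 20 "images" of 28x28 uint8 pixels
rng = np.random.default_rng(0)
X_full = rng.integers(0, 256, size=(20, 28, 28), dtype=np.uint8)
y_full = rng.integers(0, 10, size=20)

n_valid = 5  # the notebook sets aside 5000 of the 60,000 training images
X_val, X_tr = X_full[:n_valid] / 255.0, X_full[n_valid:] / 255.0
y_val, y_tr = y_full[:n_valid], y_full[n_valid:]

print(X_val.shape, X_tr.shape)               # (5, 28, 28) (15, 28, 28)
print(X_tr.dtype)                            # float64
print(X_tr.min() >= 0.0, X_tr.max() <= 1.0)  # True True
```

Dividing a `uint8` array by the float `255.0` yields a floating-point array, which is why no explicit type conversion is needed.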

As you already learned in the first section of the tutorial, we first import the `Sequential` class from `tensorflow.keras.models`. We also import the `Flatten` and `Dense` layers from `tensorflow.keras.layers`.

In [9]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import  Flatten, Dense

In [10]:
model = Sequential()                           # creates a Sequential model object
model.add(Flatten(input_shape=[28, 28]))       # to convert each input image into a 1D array, specify the input_shape because it is the first layer
model.add(Dense(300, activation="relu", name = 'first_hidden'))       # Dense hidden layer with 300 neurons, ReLU non-linearity
model.add(Dense(10, activation ="softmax", name = 'output_layer'))     #  output layer with 10 neurons (one per class)

In [11]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 784)               0         
                                                                 
 first_hidden (Dense)        (None, 300)               235500    
                                                                 
 output_layer (Dense)        (None, 10)                3010      
                                                                 
Total params: 238,510
Trainable params: 238,510
Non-trainable params: 0
_________________________________________________________________
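
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per neuron. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_pixels = 28 * 28                        # Flatten outputs 784 values and has no parameters
first_hidden = dense_params(n_pixels, 300)
output_layer = dense_params(300, 10)

print(first_hidden)                  # 235500
print(output_layer)                  # 3010
print(first_hidden + output_layer)   # 238510
```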


### Configure the learning process 

We build a Keras model and configure the learning process via the `compile()` method, which expects three important arguments:

   - An optimizer. This can be the string identifier of an existing optimizer (e.g. “rmsprop” or “adagrad”) or an instance of an optimizer class (e.g. tensorflow.keras.optimizers.SGD()).

   - A loss function. This is the objective that the model will try to minimize. It can be the string identifier of an existing loss function (e.g. “binary_crossentropy” or “mse”), a loss function (e.g. tf.keras.losses.binary_crossentropy) or an instance of a loss class (e.g. tf.keras.losses.BinaryCrossentropy()).

   - A list of metrics. For any classification problem you will want to set this to metrics = ['accuracy']. A metric can be the string identifier of an existing metric, a metric function (e.g. tf.keras.metrics.categorical_accuracy) or an instance of a metric class.
 


The image below shows the required and optional arguments that can be passed to the compile method:
<center>
<img src ='images/keras_compile.png', style = "zoom: 75%">
   
</center>


The possible values for the keyword arguments can be found at:
[Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers), 
[Losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses), 
[Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics)

In [12]:
# Define the model optimizer, loss function and metrics
model.compile(optimizer ='sgd', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

#### Training

We train the compiled model by calling `model.fit()` and passing it the input features (`X_train`) and the target classes (`y_train`), as well as the number of epochs to train (otherwise it defaults to just 1, which would definitely not be enough to converge to a good solution). </br>
During this process, the optimization algorithm chosen when we compiled the model iteratively updates the model weights so as to minimize the value of the cost function. </br>
Another relevant argument of the `fit()` method is `batch_size`. It specifies the number of training examples used for each weight update, so one epoch consists of many such updates (the default value is 32, which is a reasonable starting point in many cases).
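
In Keras, `batch_size` is the number of samples per gradient update, so one epoch is made up of many updates. A quick calculation, assuming the 55,000 training examples left after the validation split, shows how many:

```python
import math

n_train = 60000 - 5000   # training examples left after the validation split
batch_size = 32

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)       # 1719 weight updates per epoch
print(last_batch)            # 24 examples in the final, smaller batch
print(5 * steps_per_epoch)   # 8595 weight updates over 5 epochs
```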

Optionally, we can also pass a validation set: Keras will measure the loss and the extra metrics on this set at the
end of each epoch, which is very useful to see how well the model really performs: if
the performance on the training set is much better than on the validation set, your model is overfitting (you will learn about this concept in a couple of weeks).

The fit method supports many other arguments. Please check the documentation for further details.

In [13]:
history = model.fit(x = X_train, y= y_train, epochs = 5, batch_size = 32, validation_data=(X_valid, y_valid) )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The returned "history" object has the attribute 'history' which is a dictionary that holds a record of the loss values and metric values during training.

In [14]:
history.history.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

In [15]:
# getting the loss for all training epochs
history.history['loss']

[0.6308482885360718,
 0.3378755748271942,
 0.2884528636932373,
 0.2571693956851959,
 0.23355156183242798]
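
Because `history.history` is a plain dictionary of Python lists, it can be inspected with ordinary Python. The sketch below copies the loss values printed above into a list (named `loss_per_epoch` here, for illustration only) and checks that the loss decreased in every epoch.

```python
# Training loss per epoch, copied from the output above
loss_per_epoch = [0.6308482885360718, 0.3378755748271942, 0.2884528636932373,
                  0.2571693956851959, 0.23355156183242798]

# Change in loss from one epoch to the next (negative means improvement)
deltas = [b - a for a, b in zip(loss_per_epoch, loss_per_epoch[1:])]

# Epochs are numbered from 1 in the Keras progress output
best_epoch = min(range(len(loss_per_epoch)), key=loss_per_epoch.__getitem__) + 1

print(best_epoch)                   # 5
print(all(d < 0 for d in deltas))   # True
```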

### Evaluating the model
Once we are satisfied with the validation accuracy, we can evaluate the model on the test set to estimate the generalization error before deploying the model to production. 
We can easily do this by calling the `evaluate()` method on the trained model object (it also supports several other arguments, such as `batch_size` or `sample_weight`, please check the documentation for more details).


In [16]:
model.evaluate(X_test, y_test)



[27.82224464416504, 0.9370999932289124]
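
The test loss above is far larger than the training losses even though the accuracy is high. One likely explanation is that `X_test` was passed in its original 0-255 range, whereas the model was trained on inputs divided by 255.0. The toy NumPy sketch below illustrates the effect under strong simplifying assumptions: a random linear model without bias and random labels, not the network trained above. Multiplying the inputs by 255 multiplies the logits by 255, which leaves the predicted class essentially unchanged but makes every wrong prediction extremely confident and therefore very costly.

```python
import numpy as np

def mean_cross_entropy(logits, labels):
    """Mean sparse categorical cross-entropy computed from raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
X_unit = rng.random((100, 784))               # toy inputs in [0, 1), like scaled pixels
W = rng.normal(scale=0.05, size=(784, 10))    # toy linear "model" without bias
labels = rng.integers(0, 10, size=100)        # random toy labels

logits_unit = X_unit @ W             # inputs in the range used for training
logits_raw = (X_unit * 255.0) @ W    # the same inputs left in the 0-255 range

agreement = np.mean(logits_unit.argmax(axis=1) == logits_raw.argmax(axis=1))
loss_unit = mean_cross_entropy(logits_unit, labels)
loss_raw = mean_cross_entropy(logits_raw, labels)

print(agreement)             # fraction of identical predictions
print(loss_unit, loss_raw)   # the second loss is much larger
```

If this is indeed the cause, evaluating on `X_test / 255.0` should give a loss on the same scale as the validation loss.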

### Using the model to make predictions


We can call the `predict()` method on the trained model object to make predictions on new instances.
Since we don’t have actual new instances, we will just use the first 3 instances of the test set:

In [17]:
X_new = X_test[:3]
classifications = model.predict(X_new)
print(classifications[1])

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
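
Each row returned by `predict()` holds one probability per class, so the predicted digit is the index of the largest entry. A minimal sketch using the vector printed above (copied by hand into `probs`):

```python
import numpy as np

# Predicted class probabilities for one image, copied from the output above
probs = np.array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

predicted_digit = int(np.argmax(probs))
print(predicted_digit)  # 2
```

For a whole batch of predictions, `np.argmax(classifications, axis=1)` gives one predicted digit per row.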


### Important observation
As we saw in the previous section of the tutorial (for activation functions), Keras makes it possible to configure models using **readable strings** that can be passed to many of the options in the Keras API (including the options of the compile method).
It is important to know that each of these strings is a **reference to another object or function**, and we can always use that object or function directly.

### Keras 'compile()' - Option II

A)  argument values set by instantiating **Keras classes with default parameters**

The code below is equivalent to the previous one:

###### Observation: to compile the model using the alternative code for the compile function, please run the code that defines the model again!

In [18]:
model = Sequential()                           # creates a Sequential model object
model.add(Flatten(input_shape=[28, 28]))       # to convert each input image into a 1D array, specify the input_shape because it is the first layer
model.add(Dense(300, activation="relu", name = 'first_hidden'))       # Dense hidden layer with 300 neurons, ReLU non-linearity
model.add(Dense(10, activation ="softmax", name = 'output_layer'))     #  output layer with 10 neurons (one per class)

In [19]:
# equivalently....
model.compile(optimizer = tf.keras.optimizers.SGD(),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics = [tf.keras.metrics.SparseCategoricalAccuracy()])

In [20]:
# running this cell will train the model for 5 epochs. More explanations are provided in the Training section
history = model.fit(x = X_train, y= y_train, epochs = 5, batch_size = 32, validation_data=(X_valid, y_valid) )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Customizing the training process
By passing the respective **objects** directly we gain **greater flexibility**, as many of these objects have **options** of their own that we might want to have **control** over, as you can see in the following cell:

B)  argument values set by instantiating **Keras classes and passing in custom values for their parameters**
###### Observation: to compile the model using the alternative code for the compile function, please run the code that defines the model again!

In [21]:
model = Sequential()                           # creates a Sequential model object
model.add(Flatten(input_shape=[28, 28]))       # to convert each input image into a 1D array, specify the input_shape because it is the first layer
model.add(Dense(300, activation="relu", name = 'first_hidden'))       # Dense hidden layer with 300 neurons, ReLU non-linearity
model.add(Dense(10, name = 'output_layer'))  

In [22]:
model.compile(optimizer =tf.keras.optimizers.SGD(learning_rate = 0.001, momentum = 0.9, nesterov =True),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits= True),
              metrics= [tf.keras.metrics.SparseCategoricalAccuracy()]
             )

The cell above shows some of the options we can use to **control the training process:**</br>

##### The stochastic gradient descent 

An important parameter of the **stochastic gradient descent optimization algorithm** is the **learning_rate**. By default, the learning_rate is set to 0.01, but here we create the SGD optimizer object with a learning_rate of 0.001. </br>
We are also applying momentum with a value of 0.9. By default, the momentum value is 0. </br>
An extra option we can choose to set is whether or not to use Nesterov momentum, which is set to True here. </br>
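
To make these three options concrete, here is a plain-Python sketch of a single update for one scalar weight. It follows the update rule given in the Keras documentation for `SGD`; the function name `sgd_step` and the toy gradient are assumptions made for this illustration, not part of Keras.

```python
def sgd_step(w, velocity, grad, learning_rate=0.01, momentum=0.0, nesterov=False):
    """One SGD update for a single scalar weight; returns (new_w, new_velocity)."""
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov variant: apply the momentum term on top of the plain gradient step
        w = w + momentum * velocity - learning_rate * grad
    else:
        w = w + velocity
    return w, velocity

# Toy setting: current weight 1.0 with gradient 2.0, starting from zero velocity
w_default, _ = sgd_step(1.0, 0.0, grad=2.0)
w_custom, _ = sgd_step(1.0, 0.0, grad=2.0,
                       learning_rate=0.001, momentum=0.9, nesterov=True)

print(round(w_default, 6))  # 0.98
print(round(w_custom, 6))   # 0.9962
```

With `momentum=0` the rule reduces to the familiar `w = w - learning_rate * grad`.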

##### The Sparse Categorical Crossentropy function.
Here we are setting the option `from_logits=True`. If you look carefully, you will notice that the activation function in the last layer of the network was changed from softmax (in the previous model) to linear. In other words, there is now no activation function, which is why the `activation` argument is simply left out: the linear activation is the default. The network therefore outputs the `logits`, i.e. the raw real-valued scores before they are passed through an activation function. The *from_logits=True* option tells the loss function that the network outputs are logits rather than probabilities.
As a consequence, the loss function itself must handle the squeezing of the output through the softmax function. Mathematically, there is no difference between this and what we had before, but this way turns out to be a more numerically stable approach.
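
The numerical-stability argument can be illustrated with a small NumPy sketch. It is not the Keras implementation, only a comparison of two ways of computing the same loss for a single example: applying softmax first and taking the logarithm afterwards, versus computing the log-probability directly from the logits with the log-sum-exp trick. The function names are chosen for this example.

```python
import numpy as np

def loss_from_probs(logits, label):
    """Apply softmax first, then take -log of the probability of the true class."""
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    with np.errstate(divide="ignore"):   # log(0) is expected in the extreme case
        return -np.log(probs[label])

def loss_from_logits(logits, label):
    """Fuse softmax and log by using the log-sum-exp trick."""
    shifted = logits - logits.max()
    return np.log(np.exp(shifted).sum()) - shifted[label]

moderate = np.array([2.0, 1.0, 0.1])
extreme = np.array([1000.0, 0.0, -1000.0])

print(np.isclose(loss_from_probs(moderate, 1), loss_from_logits(moderate, 1)))  # True
print(loss_from_probs(extreme, 1))    # inf
print(loss_from_logits(extreme, 1))   # 1000.0
```

For moderate logits both routes agree. For extreme logits the probability of the true class underflows to 0, so the first route loses the information, while the second still returns a finite loss.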


In [23]:
# running this cell will train the compiled model for 5 epochs. More explanations are provided in the Training section
history = model.fit(x = X_train, y= y_train, epochs = 5, batch_size = 32, validation_data=(X_valid, y_valid) )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Practice

### Exercise 1: 

Consider the final (output) layer. Why does it have 10 neurons? What would happen if you had a different number than 10? For example, try training the network with 5.

You get an error as soon as training encounters an unexpected label value. Another rule of thumb: the number of neurons in the last layer should match the number of classes you are classifying. In this case these are the digits 0-9, so there are 10 classes, and hence you should have 10 neurons in your final layer.



- The reason why there are 10 neurons in the last (output) layer is that there are 10 different classes in total to be classified (from 0 to 9). 
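
The underlying reason can be shown with a small NumPy sketch: the sparse loss looks up the predicted probability at the index given by the label, and with only 5 outputs the labels 5 to 9 have no matching index. How Keras itself reports the problem depends on the setup; the snippet only mimics the lookup.

```python
import numpy as np

probs = np.full(5, 0.2)   # toy output of a 5-neuron softmax layer
label = 7                 # a valid MNIST label with no matching output neuron

try:
    loss = -np.log(probs[label])
    message = f"loss = {loss}"
except IndexError as err:
    message = f"IndexError: {err}"

print(message)
```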





### Exercise 2: 

Consider the effects of additional layers in the network or of changing the number of neurons in the hidden layers. What will happen if you add another layer between the one with 300 neurons and the final layer with 10?

Ans: There isn't a significant impact, because this is relatively simple data. For far more complex data (like natural images), extra layers are often necessary. 

- There is no obvious effect, since the MNIST task is quite easy for a neural network.

### Exercise 3: 

Consider the impact of training for more or fewer epochs. Why do you think that would be the case? 

- Too many epochs may cause your model to overfit the training data. This means that the model memorizes the training examples instead of learning patterns that generalize. 
- But if we train the network for too few epochs, it may underfit, which means that the model can still be improved.

### Exercise 4: 

Train the model (two different versions) with the default learning rate, as introduced in the first version of the 'compile' method, and with the learning rate 1e-3. What difference can you notice in the training process?

- The default learning rate of SGD is 1e-2, so the second version uses a smaller learning rate. Nevertheless, it reached a lower loss and a higher accuracy, and it converged even faster. A smaller learning rate takes smaller steps and can possibly get closer to a minimum. Note also that the 1e-3 configuration shown earlier adds Nesterov momentum of 0.9, which may account for much of the faster convergence. 
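
How the learning rate and momentum interact can be explored on a toy problem. The sketch below minimizes f(w) = w**2 (gradient 2 * w) for 200 steps with three settings. It is not MNIST and says nothing definitive about the network above; the helper `run_sgd` is a hypothetical name and uses the SGD update rule from the Keras documentation.

```python
def run_sgd(learning_rate, momentum=0.0, nesterov=False, steps=200):
    """Minimize f(w) = w**2 starting from w = 1.0 and return the final w."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        grad = 2 * w
        velocity = momentum * velocity - learning_rate * grad
        if nesterov:
            w = w + momentum * velocity - learning_rate * grad
        else:
            w = w + velocity
    return w

w_default = run_sgd(learning_rate=0.01)                                # SGD defaults
w_small = run_sgd(learning_rate=0.001)                                 # smaller rate only
w_nesterov = run_sgd(learning_rate=0.001, momentum=0.9, nesterov=True) # smaller rate plus momentum

print(w_default, w_small, w_nesterov)
```

On this toy problem the smaller learning rate alone ends up farthest from the minimum at 0, while the smaller learning rate combined with Nesterov momentum ends up closest.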