# Deep Learning - Day 4 - Classify Electrocardiograms

### Exercise objectives:
- Discover a new type of application with temporal data
- Try different recurrent neural networks

<hr>

We have seen that RNNs can predict what comes after an observed sequence of data. Here, we will use RNNs differently: instead of predicting a value that follows the observed sequence, we will classify the entire sequence itself, treating the whole sequence as belonging to a given category.

# Data

The data consists of electrocardiograms (ECGs), which are basically recordings of heartbeats. Each sequence is therefore a sequence of amplitudes. ECGs are often used to detect heart malfunctions! This dataset contains 87554 heartbeats, each labeled with a heartbeat type from 0 to 4:
- 0 : Normal beat
- 1 : Supraventricular
- 2 : Ventricular
- 3 : Fusion
- 4 : Beats that cannot be classified


❓ **Question** ❓ Download the data from [here](https://storage.googleapis.com/data-sciences-bootcamp/ECG_data.zip), unzip it, and load it with:
`np.load(path/to/data, allow_pickle=True).tolist()`

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Plot one ECG for each category in the dataset to see what an ECG looks like

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ You have probably noticed that each sequence (each ECG) has a different length. To corroborate your observation, plot the distribution of the sequence lengths in the dataset.

In [None]:
### YOUR CODE HERE
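
As a rough illustration of the idea, the lengths can be collected with a list comprehension and summarized before plotting. The sketch below uses a made-up toy list `X_toy` (an assumption, not the real ECG data); with the real data, the `lengths` array could be passed to a histogram function such as `plt.hist`.

```python
import numpy as np

# Toy stand-in for the ECG list: three sequences of different lengths (hypothetical values)
X_toy = [[0.9, 0.5, 0.1], [0.8, 0.4], [1.0, 0.7, 0.3, 0.2, 0.0]]

# One length per sequence
lengths = np.array([len(seq) for seq in X_toy])

# Count how many sequences share each length
unique_lengths, counts = np.unique(lengths, return_counts=True)
print(lengths.min(), lengths.max())
print(dict(zip(unique_lengths.tolist(), counts.tolist())))
```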



Remember that we pass batches of data to the neural network. The full dataset should therefore be a tensor of shape (number of sequences, number of observations per sequence, size of each observation), from which batches of `batch_size` sequences are drawn.

- The `batch_size` will be chosen when fitting the model
- There are 87554 sequences
- Each observation is of size 1

However, the number of observations per sequence varies from one sequence to another. For computational reasons, such data cannot be fed into an RNN. You therefore need to "fill in the blanks" with `pad_sequences`, which pads each sequence with fake values so that all the resulting sequences have the same length.
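
To make the idea concrete, here is a minimal NumPy sketch of what padding does. The helper `pad_to_same_length` is hypothetical and written only for illustration; it is not the Keras implementation, which offers more options (such as `maxlen` and truncation).

```python
import numpy as np

def pad_to_same_length(sequences, value=0.0, padding="pre"):
    """Hypothetical helper: pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    # Start from an array entirely filled with the padding value
    padded = np.full((len(sequences), max_len), value, dtype="float32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy for an empty sequence
        if padding == "pre":
            padded[i, -len(seq):] = seq  # fake values go at the beginning
        else:
            padded[i, :len(seq)] = seq   # fake values go at the end
    return padded

# Toy sequences of different lengths (hypothetical values)
X_toy = [[0.9, 0.5, 0.1], [0.8, 0.4]]
X_toy_pad = pad_to_same_length(X_toy)
print(X_toy_pad.shape)
print(X_toy_pad)
```

Every row now has the same length, so the result is a regular 2D array instead of a ragged list.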


❓ **Question** ❓ Use the `pad_sequences` function on X directly (without extra arguments here), store the result in `X_pad` and print the first sequence.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

### YOUR CODE HERE

❓ **Question** ❓ You probably see that the returned sequence is composed only of 0's. The reason is that, by default, `pad_sequences` returns integers, so any float between 0.0 and 0.99999 is converted to 0. To change this default behavior, set the `dtype` argument of `pad_sequences` to `float32`. Pad the sequences once again, store the new result in `X_pad` and print the first sequence.

In [None]:
# YOUR CODE HERE
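
The truncation itself can be reproduced with NumPy alone. This small sketch (toy values, not taken from the dataset) casts floats to `int32`, similar to what happens when floats are written into an integer array, and shows that every value below 1 collapses to 0.

```python
import numpy as np

# Toy amplitudes between 0 and 1 (hypothetical values)
amplitudes = [0.97, 0.42, 0.0, 1.0]

as_int = np.asarray(amplitudes, dtype="int32")      # fractional part is dropped
as_float = np.asarray(amplitudes, dtype="float32")  # values are preserved

print(as_int)
print(as_float)
```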

Thanks to a `Masking` layer, the neural network will ignore the 0s that you padded for computational reasons.

**However**, if you look closely at the padded version of the first sequence, you have the padded zeros at the beginning of the sequence. But, also, there is a 0 value **_IN_** the heart-beat values. 
How could the neural network know which one to keep and which one to remove?

❓ **Question** ❓ Add the `value` keyword in the `pad_sequences` function to pad with values that **ARE NOT** in the initial dataset. Negative values for instance. Store it in `X_pad` and print the first sequence.

❗ **Remark** ❗ It is a good habit to pad **at the end** of the sequence (instead of at the beginning, as is done by default). You can do that by setting the `padding` keyword to `post` (instead of the default `pre`).

[See full documentation here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [None]:
### YOUR CODE HERE
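
To see why a padding value outside the data range matters, consider this NumPy sketch. It uses made-up amplitudes and a padding value of `-1.0` (an arbitrary choice), and compares how many time steps are treated as real when filtering on `0.0` versus `-1.0`. It only mimics the idea of masking; it is not how the Keras `Masking` layer is implemented.

```python
import numpy as np

# A toy heartbeat with a genuine 0.0 amplitude, real length 4 (hypothetical values)
beat = [0.9, 0.0, 0.3, 0.1]

# Post-padded to length 6 with zeros, then with a value absent from the data
padded_with_zero = np.array(beat + [0.0, 0.0], dtype="float32")
padded_with_minus_one = np.array(beat + [-1.0, -1.0], dtype="float32")

# Time steps that differ from the padding value are treated as real
kept_zero = int((padded_with_zero != 0.0).sum())
kept_minus_one = int((padded_with_minus_one != -1.0).sum())

print(kept_zero)       # the genuine 0.0 is discarded along with the padding
print(kept_minus_one)  # the full original length is recovered
```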

❓ **Question** ❓ Print the shape of `X_pad`

In [None]:
### YOUR CODE HERE

Remember that the input data must have shape (number of sequences, number of observations per sequence, observation size), apart from the batch size dimension, which is handled automatically by the neural network. Here, we only have the first two dimensions. This is because each observation is a single value, so the last dimension, of size 1, is missing.

❓ **Question** ❓ To remedy this issue, add the last dimension with the `np.expand_dims` function.

❗ **Remark** ❗ The assert should not raise any error ;)

In [None]:
### YOUR CODE HERE

assert(X_pad.shape == (87554, 187, 1))
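
For reference, here is how `np.expand_dims` behaves on a small toy array (the shape below is made up for the illustration and is much smaller than the real data).

```python
import numpy as np

# Toy padded data: 4 sequences of 6 time steps (hypothetical shape)
X_toy_pad = np.zeros((4, 6), dtype="float32")

# Add a trailing axis so that each observation has size 1
X_toy_3d = np.expand_dims(X_toy_pad, axis=-1)
print(X_toy_3d.shape)
```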

❓ **Question** ❓ The labels `y` have to be one-hot encoded categories. For that reason, transform them to categories thanks to the appropriate Keras function and store the result in `y_cat`

In [None]:
### YOUR CODE HERE
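
If one-hot encoding is unfamiliar, the following NumPy sketch shows what the encoded labels look like. It uses toy labels and `np.eye` instead of the Keras utility, so it is an illustration of the target format rather than the expected answer.

```python
import numpy as np

y_toy = np.array([0, 2, 4, 1])  # toy integer labels (hypothetical)
num_classes = 5                 # five heartbeat types, from 0 to 4

# Row i of the identity matrix is the one-hot vector for class i
y_toy_cat = np.eye(num_classes, dtype="float32")[y_toy]
print(y_toy_cat)
```

Each row contains a single 1, placed at the index of the class, and `np.argmax(y_toy_cat, axis=1)` recovers the original integer labels.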

❓ **Question** ❓ Split your data between a train and test set (80/20 ratio).

In [None]:
### YOUR CODE HERE
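
A split like this can be sketched with NumPy by shuffling indices and cutting at 80%. The data below is random toy data and the seed is arbitrary; in practice, scikit-learn's `train_test_split` is a common alternative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

# Toy data: 10 sequences of 6 time steps with 1 feature, 5 classes (hypothetical)
X_toy = rng.random((10, 6, 1))
y_toy = np.eye(5)[rng.integers(0, 5, size=10)]

# Shuffle the indices, then cut at 80%
indices = rng.permutation(len(X_toy))
cut = int(0.8 * len(X_toy))
train_idx, test_idx = indices[:cut], indices[cut:]

X_train, X_test = X_toy[train_idx], X_toy[test_idx]
y_train, y_test = y_toy[train_idx], y_toy[test_idx]
print(X_train.shape, X_test.shape)
```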

# Model

We will now write the Recurrent Neural Network

❓ **Question** ❓ Write a model that has the following layers:
- a Masking layer whose `mask_value` corresponds to the value you decided to pad your data with (it is probably a negative value as suggested) - this layer will simply tell the network not to take into account the computation artifact
- a `SimpleRNN` layer with 10 units and the `tanh` as the activation function
- a dense layer with 20 units
- a last layer suited to the 5-class classification task

In [None]:
### YOUR CODE HERE
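
One possible layout for such a model is sketched below. Several choices are assumptions rather than requirements of the exercise: the padding value `-1.0`, the input length of 187 (taken from the assert above), the `relu` activation on the hidden dense layer, and a 5-unit `softmax` output matching the five heartbeat types.

```python
from tensorflow.keras import Sequential, layers

PAD_VALUE = -1.0  # assumption: must match the value used when padding

model = Sequential([
    layers.Input(shape=(187, 1)),            # 187 time steps, 1 feature each
    layers.Masking(mask_value=PAD_VALUE),    # skip the padded time steps
    layers.SimpleRNN(10, activation="tanh"),
    layers.Dense(20, activation="relu"),     # activation is an assumption
    layers.Dense(5, activation="softmax"),   # one probability per heartbeat type
])
model.summary()
```

Swapping `layers.SimpleRNN` for `layers.LSTM` or `layers.GRU` keeps the rest of the architecture unchanged.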

❓ **Question** ❓ Compile your model and train it with early stopping. A very small patience of 2 should be sufficient, because you have a lot of sequences and thus many optimization steps per epoch.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Evaluate your model on the test data.

In [None]:
### YOUR CODE HERE

### You should have quite promising results, with a test accuracy a little higher than 80%, which is a good result on a 5-class problem.

❓ **Question** ❓ Let's try to improve this result. Repeat the last steps using an `LSTM` instead of a `SimpleRNN`. If you feel like it, you can change the neural network parameters to improve the accuracy. Evaluate your accuracy on the test set.

In [None]:
### YOUR CODE HERE

Quite surprisingly, the LSTM is not much better than the SimpleRNN. What about a GRU?

❓ **Question** ❓ Build another model where you will use a GRU (instead of the LSTM or of the SimpleRNN), and the parameters are yours to choose. Report the test accuracy.

In [None]:
### YOUR CODE HERE

Once again, the final accuracy is likely to be similar to the one you got previously, which might be a bit strange. To investigate these results, we will compare the results to a baseline model.

❓ **Question** ❓ What is the accuracy of a baseline model that predicts, for every sample in `y_test`, the most frequent category in `y_train`?

In [None]:
### YOUR CODE HERE
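
The baseline computation can be sketched on toy labels as follows. The arrays are made up and hold integer labels; with one-hot encoded labels, `np.argmax(..., axis=1)` would first be needed to get back to integers.

```python
import numpy as np

# Toy integer labels (hypothetical), heavily skewed toward class 0
y_train_toy = np.array([0, 0, 0, 0, 1, 2, 0, 4])
y_test_toy = np.array([0, 0, 1, 0, 3])

# Most frequent class in the training labels
classes, counts = np.unique(y_train_toy, return_counts=True)
majority_class = classes[np.argmax(counts)]

# Accuracy obtained by always predicting that class
baseline_accuracy = np.mean(y_test_toy == majority_class)
print(majority_class, baseline_accuracy)
```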

Basically, your RNNs are as good (bad?) as a model that always predicts the most frequent category. The likely reason is that the RNNs themselves only ever return that category.

❓ **Question** ❓ Use the `predict` function on any of the previous models to see which categories the model is predicting.

In [None]:
### YOUR CODE HERE
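
The output of `predict` is one row of class probabilities per sequence, so the predicted category is the index of the largest value in each row. The sketch below uses a hand-written toy probability matrix (an assumption standing in for real model output) to show how to list the predicted categories and their counts.

```python
import numpy as np

# Toy stand-in for the output of model.predict (hypothetical probabilities)
y_pred_proba = np.array([
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.60, 0.20, 0.10, 0.05, 0.05],
])

# Predicted class = index of the highest probability in each row
y_pred_class = np.argmax(y_pred_proba, axis=1)

# Which classes are predicted, and how often
predicted, counts = np.unique(y_pred_class, return_counts=True)
print(dict(zip(predicted.tolist(), counts.tolist())))
```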

Your models are returning the category that is the most frequent in your train set.

One possibility here is to subsample the data to obtain balanced classes in the training set. Another possibility is to do some data augmentation on the temporal data. However, neither of these methods would work right away. In fact, predicting the category of ECG data is not an easy task. Also, each sample contains only **one** heartbeat, with no repetitions!
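
For the curious, the mechanics of subsampling to balanced classes can be sketched as below. The labels are toy integers and the seed is arbitrary; as noted above, this step alone is not expected to solve the problem.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

# Toy imbalanced integer labels (hypothetical): 6 of class 0, 2 of class 1, 2 of class 2
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

# Keep as many samples per class as the rarest class has
classes, counts = np.unique(y_toy, return_counts=True)
n_per_class = counts.min()

# Draw n_per_class indices at random, without replacement, from each class
balanced_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_toy == c), size=n_per_class, replace=False)
    for c in classes
])
print(np.unique(y_toy[balanced_idx], return_counts=True))
```

The same `balanced_idx` would then be used to select the matching rows of the padded sequences.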

Classifying ECGs is actually quite a complex task. So let's move on to another exercise.

**The lesson here is not to be satisfied with results until you have compared them to a baseline method.**