In [None]:
import time

from tensorflow import keras 
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
%matplotlib inline

# First, we will get a better understanding of the data at hand

The samples you usually work with are $n$-dimensional vectors. If you have $m$ samples, you end up with a data matrix of size $m \times n$. Each batch $k$ of data is a subsample of size $s_k \times n$, such that $\sum s_k = m$.

Remember that in the case of $m$ images, each sample is of size $n_1 \times n_2$ ($n_1$ is the number of pixels in the vertical direction, $n_2$ is the number of pixels in the horizontal direction). Therefore, you end up with a tensor (i.e. a matrix with more than 2 dimensions) of size $m \times n_1 \times n_2$. Each batch of data is a subsample of size $s_k \times n_1 \times n_2$ such that $\sum s_k = m$.

In the case of temporal data, we are in a slightly different setting. You have $m$ samples. Each is a temporal sequence (a series of _time steps_), where each element is of size $n$. If the $k$-th sequence has $s_k$ elements, that sample is of size $s_k \times n$. (From here on, $s_k$ denotes the length of a sequence, not the size of a batch.)

An example: consider that, while walking in a park, you observe and measure a tree. You measure its height, its width and its depth. Each observation of the tree (i.e. each **element**) is of size 3: the height, the width, the depth. Let's say now that you walk regularly in that park, and that you observe the tree 10 times. Your sequence is therefore a matrix of size $10 \times 3$.

Let's say that you observe another tree 5 times, which results in a matrix of size $5 \times 3$ (5 measurements of 3 features: height, width and depth).

Now, let's say that you observe tree $k$ a total of $s_k$ times, $s_k$ being different for each $k$, and that you have a total of $m$ trees.

The problem is that you _CANNOT_ stack these matrices together the way you can stack images on top of each other. The reason is that $s_k$, i.e. the number of measurements per tree, is different for each tree.

And if you cannot stack these matrices, you cannot have a batch of data and thus cannot train your algorithm.
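
To see the problem concretely, here is a minimal sketch with two hypothetical trees (the arrays `tree_a` and `tree_b` are placeholders filled with zeros, not real measurements). Stacking works when the sequences have the same length and fails otherwise.

```python
import numpy as np

# Two hypothetical trees: 10 and 5 observations of 3 features each.
tree_a = np.zeros((10, 3))
tree_b = np.zeros((5, 3))

# Sequences of equal length stack into a 3D tensor.
same_length = np.stack([tree_a, tree_a])
print(same_length.shape)  # (2, 10, 3)

# Sequences of different lengths do not.
try:
    np.stack([tree_a, tree_b])
    stacked = True
except ValueError as err:
    stacked = False
    print("Cannot stack:", err)
```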

### Question: Before reading on, think a bit about what a good solution could be. Don't jump to the answer too quickly.

<hr><hr><hr><hr><hr><hr>

The way to solve this issue is to build a tensor of size $m \times \max(s_k) \times n$. What does this mean?
If you have a sequence (A1, A2, A3), another sequence (B1, B2, B3, B4) and a third one (C1, C2), you take the length of the longest one, here 4. Every sequence is extended to length $\max(s_k) = 4$, and the empty positions are filled with zeros, as in this example:


![Padding](padd.png)

Now, imagine that each element (A1, B2, C2, B4 and the others) is not a single number but a vector, so that there is an additional dimension: the size of each element.

In the case of our initial forest, imagine that we have observed 150 trees. If the most-observed tree has been observed 14 times, then the final tensor is of size $(150, 14, 3)$, 3 being the number of features: height, width, depth.

In [None]:
### Now, let's consider that we measure the size and weight of a shell

### The first shell is observed 3 times. The first time, it measures 1 and weighs 2. Then it measures 2 and weighs 4, etc.
X_1 = [[1, 2], [2, 4], [4, 9]]

### The second and third shells have been observed 5 and 2 times
X_2 = [[2, 2], [2, 4], [4, 9], [4, 7], [5, 10]]
X_3 = [[7, 10], [8, 13]]

### We gather these observations into a list X
X = [X_1, X_2, X_3]

### Question: Write a function `init_X_pad` that constructs an array of zeros given `X`. This array should be of size $m \times \max(s_k) \times n$

In [None]:
def init_X_pad(X):
    ## Todo

### Question: Write a function that fills in the `X_pad` array with the values of `X` at the right places, as explained in the paragraph above.

Remark: the output has to be a NumPy array

_Hint_: The assertion two cells below should pass

In [None]:
def pad(X):
    ## TODO


In [None]:
X_to_check = np.array([
[[ 1.,  2.], [ 2.,  4.], [ 4.,  9.], [ 0.,  0.], [ 0.,  0.]],
[[ 2.,  2.], [ 2.,  4.], [ 4.,  9.], [ 4.,  7.], [ 5., 10.]],
[[ 7., 10.], [ 8., 13.], [ 0.,  0.], [ 0.,  0.], [ 0.,  0.]]
])

assert(pad(X).tolist() == X_to_check.tolist())

Congratulations. You are now able to take input sequences of any length and transform them into a tensor that Keras can use.

Actually, the function `pad_sequences` [(see the tutorial here)](https://www.tensorflow.org/guide/keras/masking_and_padding) does that for you. The `padding` argument controls where the `0`s are added: at the beginning (`'pre'`, the default) or at the end (`'post'`).
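
As a quick illustration of the `padding` argument, here is a small sketch on a hypothetical list `toy` of scalar sequences (not the shell data). It assumes the same import as at the top of the notebook.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical toy sequences of different lengths.
toy = [[1, 2, 3], [4, 5]]

pre = pad_sequences(toy, padding='pre')    # zeros added at the beginning
post = pad_sequences(toy, padding='post')  # zeros added at the end

print(pre.tolist())   # [[1, 2, 3], [0, 4, 5]]
print(post.tolist())  # [[1, 2, 3], [4, 5, 0]]
```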

### Question: Check that the output of this cell is the same as the output of the function you coded:

In [None]:
pad_sequences(X, padding='post', dtype='float32') 

assert(pad(X).tolist() == pad_sequences(X, padding='post', dtype='float32').tolist())

In the following, please use only `pad_sequences` from the Keras library

<hr><hr>

There is another key point when you pad the input data of a Recurrent Neural Network: **masking**. As you added arbitrary `0`s to some of your inputs, you should tell the model that these `0`s were added arbitrarily and should not be taken into account during fitting and prediction. Otherwise, the algorithm will treat the `0`s as real inputs: in the tree example, you definitely don't want to tell the algorithm that there was a tree with `0` width, height or depth.

To this end, we inform the model that some values of the sequence are not to be taken into account. This is done with a Keras layer that acts as a mask, as follows:

In [None]:
### Padding
padded_inputs = pad_sequences(X, padding='post')
print(padded_inputs)
print(padded_inputs.shape)

### Masking
embedding = layers.Embedding(input_dim=5000, output_dim=1, mask_zero=True)
masked_output = embedding(padded_inputs)
print(masked_output._keras_mask, masked_output.shape)
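
To make the idea of a mask more concrete, here is a NumPy-only sketch of a per-time-step mask on the padded shell data. It only illustrates what a mask contains, not how Keras computes it internally, and it assumes that a real observation is never all zeros. The Keras layer `layers.Masking(mask_value=0.)` applies a similar per-time-step rule.

```python
import numpy as np

# Padded batch from the shell example: 3 sequences, 5 time steps, 2 features.
X_pad = np.array([
    [[1, 2], [2, 4], [4, 9], [0, 0], [0, 0]],
    [[2, 2], [2, 4], [4, 9], [4, 7], [5, 10]],
    [[7, 10], [8, 13], [0, 0], [0, 0], [0, 0]],
], dtype="float32")

# A time step counts as real if at least one of its features is non-zero.
mask = (X_pad != 0).any(axis=-1)
print(mask.astype(int).tolist())
# [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]

# Number of real time steps per sequence.
print(mask.sum(axis=1).tolist())  # [3, 5, 2]
```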

# Let's dig into real data now

Now, we consider stock market prices. In this setting, the idea is to forecast the stock price $n$ days ahead. Your input data are the stock market prices over $m$ days, and you predict the stock market price $n$ days after the last seen day.

We will refer to the "seen days" as the days for which we know the stock market price. We will refer to the "predicted day" as the day at which we predict the stock market price.

Let's load the data and see what it looks like.

In [None]:
df = pd.read_csv('data/aa.us.txt')
df = df.sort_values('Date')

df['Middle'] = (df['High'] + df['Low'])/2.

df['Middle'].plot()
plt.show()

In [None]:
df.tail()

### What you will do

We will cut this very long time series into shorter samples of "seen days" and "predicted day". Considering that this time series has more than 12,000 consecutive days, it is feasible to draw many samples with, let's say, 10 to 100 "seen days" and a predicted day 5 to 25 days in the future. (These are examples.)

# Dataset construction


To ease the prediction, we will ignore the fact that nothing happens on Saturdays and Sundays. We will act as if Monday were the day after Friday. To this end, we will add a new column which corresponds to the consecutive days, excluding Saturdays and Sundays.

In [None]:
df['Consecutive Days'] = df.index

### Question: Write a function that, given a dataframe `df`, an integer `length` and an integer `temporal_horizon`, outputs a sample `X_sample, y_sample` of the data:

- The dataframe corresponds to the previous dataset.
- The `length` corresponds to `m`, the number of days during which the stock price is seen.
- The `temporal_horizon` corresponds to `n`, the forecast horizon, i.e. the number of days between the last seen day and the predicted day.

Finally, `X_sample` corresponds to the values of the stock price during the "seen" days and `y_sample` is the value of the stock price $n$ days after the last seen day.

To add some normalization to the data, divide all the values of `X_sample` by the first value of `X_sample`. Similarly, divide `y_sample` by the first value of `X_sample`.
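
The normalization step can be sketched on made-up numbers as follows. The prices below are hypothetical and not taken from the dataset; only the division by the first seen value comes from the text above.

```python
import numpy as np

# Hypothetical prices, not taken from the dataset.
seen_prices = np.array([20.0, 22.0, 25.0])  # the "seen days"
future_price = 30.0                         # the "predicted day"

# Both the seen values and the target are divided by the first seen value.
reference = seen_prices[0]
X_sample = seen_prices / reference
y_sample = future_price / reference

print(X_sample.tolist())  # [1.0, 1.1, 1.25]
print(y_sample)           # 1.5
```

After this step, every sample starts at 1, so samples drawn from periods with very different price levels become comparable.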


**Remark 1**: The selection of the seen days should be random within the whole possible range of values! You can use `np.random.randint` to select the first seen day within the dataframe.

**Remark 2**: Be careful about the randomness: you cannot select the last days of the dataframe as seen days, as the "predicted day" would then fall outside the dataframe!

**Remark 3**: `X_sample` must be an array of shape `(m,)` and `y_sample` a scalar

**Remark 4**: `temporal_horizon=1` means that you predict "tomorrow"
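
The index arithmetic behind Remarks 1, 2 and 4 can be sketched on a toy setting. The values of `n_rows`, `length` and `temporal_horizon` below are hypothetical, and this is one possible way to bound the random start, not necessarily the intended solution.

```python
import numpy as np

n_rows = 10            # hypothetical number of rows in the dataframe
length = 3             # number of seen days (m)
temporal_horizon = 2   # forecast horizon (n)

# Seen days occupy positions start .. start + length - 1.
# The predicted day sits temporal_horizon positions after the last seen day.
# It must stay inside the data, so start <= n_rows - length - temporal_horizon.
max_start = n_rows - length - temporal_horizon

# np.random.randint excludes the upper bound, hence the + 1.
start = np.random.randint(0, max_start + 1)

last_seen = start + length - 1
predicted = last_seen + temporal_horizon
print(start, last_seen, predicted)
```

With `length=1` and `temporal_horizon=n_rows - 1`, the only valid start is 0 and the predicted position is the last row, which matches the check proposed in the next question.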

In [None]:

def get_sample(df, length, temporal_horizon):

    # TODO
    
    return X_sample, y_sample

### Question: Verify that the following function call works and outputs:
### `[[1.]], 18.897`

This corresponds to asking for `length=1`, i.e. one "seen day", with a temporal horizon equal to the size of the dataframe minus 1. So you basically forecast the last value from the first one.

This is a way to check that your function behaves correctly.

In [None]:
get_sample(df, 1, df.shape[0] - 1)

### Question: Check that the output of the following function corresponds to :
- `X`, aka the seen days, are all the values but the last
- `y`, aka the predict day, is the last day

In [None]:
get_sample(df, df.shape[0]-1, 1)

### Question: You will create the dataset `X, y` 

You should write a function that creates the dataset of seen days `X` and days to predict `y`. `X` is a list of sequences, each being consecutive values of the stock market price, and `y` is the list of future stock prices, i.e. your target.

`X` and `y` are of the same length. However, each element in `X` corresponds to one sequence of seen days, and there is no reason for all these sequences to have the same length. Therefore, to construct the dataset, the argument `length_of_sequences` is a list of integers, each corresponding to the number of seen days of one sequence.

For instance, for `get_X_y(df, 3, [3, 5, 2, 9])`, `X` and `y` should be of length 4 (the length of `length_of_sequences`), where the first element of `X` is of length 3, the second of length 5, the third of length 2 and the fourth of length 9.

**Remark 1**: `y` has to be an array, `X` is a list.

**Remark 2**: Don't forget to use your previous function.

In [None]:
def get_X_y(df, n_days, length_of_sequences):
    # TODO 
    return X, np.array(y)


### Question: Draw a dataset `X, y` such that you predict the stock market price 10 days ahead.

The `length_of_sequences` should be a list of 1000 values, each between 8 and 12. This means that you will have 1000 time series, each with 8 to 12 seen days.

Then, use `pad_sequences` to pad your data. Store the padded `X` in `X_pad` and check its shape.

### Question: Split the dataset into train and test set (70/30 ratio). 

### Question: Write a function that gathers all the previous steps

It should generate the X_train, X_test, y_train and y_test directly from the following arguments : 

In [None]:
def generate_data(df, n_days, length_of_sequences):
    
    ### TODO 
    
    return X_train, X_test, y_train, y_test


### Question: Use this function to generate data with a forecast horizon of 2 days, and 1000 time series of 3 to 5 seen days.

### Question: Initialize a model that has :

- An initial Embedding layer that masks the padded values of the input, with an output size of 3
- An LSTM layer - add your arguments [(see documentation)](https://keras.io/layers/recurrent/#lstm)
- A fully connected layer - add your arguments

- The Mean square error as a loss
- You can choose the optimizer that you consider the best 

At first, select parameters that simply allow the model to run. You will be able to optimize them later on.

In [None]:
def init_model():
    
    ### TODO 
    
    return model

### Question: Fit the model 

- First initialize the model
- Then initialize the early stopping criterion
- Then, fit on the training data (do not forget the validation split) (batch_size of 32)

**Remark** : Be sure you hit the Stopping Criteria 

### Question: Given the next plot function, look at the loss and the MAE (add the `MAE` to the model if not done yet)

In [None]:
def plot_loss(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Mean Square Error - Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='best')
    plt.show()
    
    plt.plot(history.history['mae'])
    plt.plot(history.history['val_mae'])
    plt.title('Model MAE')
    plt.ylabel('Mean Absolute Error')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='best')
    plt.show()
    

### Question: Is this a good or bad final error?

You should (almost) **always** have a **benchmark** value you should compare to. In this case, let's compare to the following benchmark: the constant prediction.

### Question: Write a function that computes the mean absolute error of the constant prediction, i.e. the prediction that the future price is equal to the last seen value.
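
For reference, the quantity asked for can be written as follows. The notation is introduced here for illustration only: $N$ is the number of samples, $y_i$ is the target of sample $i$, and $x_i^{\text{last}}$ is the last seen value of sequence $i$.

$$\mathrm{MAE}_{\text{constant}} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - x_i^{\text{last}} \right|$$

Note that if `X` has been padded with `padding='post'`, the last seen value of a short sequence is not in the last column, which then contains padding zeros.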



In [None]:
def benchmark_prediction(X, y):

    ### TODO 



### Question: Compare the error of the model evaluated on the test set to the error of the constant prediction. Conclusion?

# We will now look at different settings

### Question: Try running the neural network with a higher number of seen days. You can go to large numbers: around 100, or even 200 or 300. Try these out.



### Question: Do the same, but this time increase the temporal horizon of the prediction