# Deep Learning week - Day 4 - Predict Air Pollution

### Exercise objectives
- Prepare the data
- Further dig into Recurrent Neural Networks
- Stack multiple layers of RNNs

<hr>
<hr>

In this exercise, you will predict the pollution (measured as a concentration of particles) at the next time step (the measurements are hourly) given a sequence of weather features, such as the temperature, the pressure, etc.

In real-life applications, the data are not as well-prepared as in the previous exercises. For this reason, the first steps of the notebook correspond to the data preparation.

Then, given your new RNN ninja skills and the fact that the exercise is similar to previous challenges, less help is given on how to write an RNN. This mirrors real-life problems, where you will always be able to go back to the Le Wagon exercises and copy-paste what you have done to get started.


<hr><hr>

# Data

The data here correspond to hourly measurements of the air pollution (feature: `pm2.5`, which is the concentration of fine particles with a diameter of 2.5 micrometers or less) that you will try to predict. Among the other related features, you have:
- TEMP: Temperature
- DEWP: Dew Point
- PRES: Pressure
- Ir: Cumulated hours of rain
- Iws: Cumulated wind speed
- Is: Cumulated hours of snow

❓ **Question** ❓ Load the data `data.txt` - use the first column as the index of a pandas DataFrame.
Let's consider only the features presented above (pm2.5, TEMP, DEWP, PRES, Ir, Iws and Is).

In [None]:
### YOUR CODE HERE
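
As a rough illustration of the loading step, here is a minimal sketch. It assumes the file is comma-separated with a header row; the three-row excerpt and the name `df_demo` are made up for the example and stand in for the real `data.txt`.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for data.txt
# (assumption: comma-separated values with a header row).
raw = io.StringIO(
    "No,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir\n"
    "1,129.0,-16,-4.0,1020.0,SE,1.79,0,0\n"
    "2,148.0,-15,-4.0,1020.0,SE,2.68,0,0\n"
    "3,,-11,-5.0,1021.0,SE,3.57,0,0\n"
)

# index_col=0 uses the first column as the index of the DataFrame
df_demo = pd.read_csv(raw, index_col=0)

# Keep only the features listed above
features = ["pm2.5", "TEMP", "DEWP", "PRES", "Ir", "Iws", "Is"]
df_demo = df_demo[features]
print(df_demo.shape)
```

With the real file, you would pass its path to `pd.read_csv` instead of the in-memory excerpt.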

❓ **Question** ❓ Plot the temporal progression of the different variables

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Let's rescale the variables `pm2.5` and `PRES`, as their values can get very high. Just divide their values by 1000.

In [None]:
### YOUR CODE HERE
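
One possible way to do the rescaling is sketched below on a tiny made-up dataframe (`df_demo` and its values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Tiny made-up dataframe, only to illustrate the rescaling
df_demo = pd.DataFrame(
    {"pm2.5": [129.0, 148.0], "PRES": [1020.0, 1021.0], "TEMP": [-4.0, -5.0]}
)

# Divide only the two large-valued columns by 1000; other columns are untouched
cols_to_scale = ["pm2.5", "PRES"]
df_demo[cols_to_scale] = df_demo[cols_to_scale] / 1000
```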

In the previous exercises, we had multiple independent data sequences. Here, you will notice that there is only one, which is quite often the case. So how do we deal with such data? In fact, this long sequence can be split into many short sequences that we will treat as independent.

❓ **Question** ❓ Write a function that, given the initial dataframe, returns a shorter dataframe sequence of length `length`. This shorter sequence should be selected at random.

In [None]:
def subsample_sequence(df, length):
    ### YOUR CODE HERE
    return df_sample

df_subsample = subsample_sequence(df, length=10)
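
One way to draw such a random window is sketched here. The function name `subsample_sequence_sketch`, the optional `rng` argument and the toy dataframe are assumptions made for the example, not part of the exercise.

```python
import numpy as np
import pandas as pd

def subsample_sequence_sketch(df, length, rng=None):
    """Return `length` consecutive rows of `df`, starting at a random position."""
    rng = np.random.default_rng() if rng is None else rng
    last_possible_start = len(df) - length
    # The upper bound of rng.integers is exclusive, hence the +1
    start = rng.integers(0, last_possible_start + 1)
    return df.iloc[start:start + length]

# Toy dataframe with 100 rows, only for illustration
df_demo = pd.DataFrame({"pm2.5": np.arange(100.0), "TEMP": np.arange(100.0)})
df_sample_demo = subsample_sequence_sketch(df_demo, length=10, rng=np.random.default_rng(0))
```

Drawing the start index rather than individual rows keeps the observations consecutive, which the recurrent model needs.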

❓ **Question** ❓ Write a function that, given a full dataframe, first subsamples this dataset into a shorter sequence, and then splits the shorter dataframe into a training sequence and a value to predict.

Basically, if your sub-sampled dataframe is of size N, you will take the features during the first N-1 time steps as your variables `X`, and the value of the pollution at time step N as your variable `y`.

❗ **Remark**❗ There are missing values in the dataframe. If the value to predict `y` is missing, the function should rerun. If there are missing values in the variables `X`, they should be replaced by the mean values over the other selected hours. If all the other hours are missing, they should all be replaced by the mean value of the dataframe.

❗ **Remark**❗ The outputs should be arrays or lists, not dataframes.

In [None]:
def split_subsample_sequence(df, length):
    df_subsample = subsample_sequence(df, length)
    
    ### YOUR CODE HERE
    
    return X_subsample, y_subsample

X_subsample, y_subsample = split_subsample_sequence(df, 30)
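
The missing-value rules above can be sketched as follows. This is only one possible reading of the remarks: the function name, the `target` and `rng` parameters and the synthetic dataframe are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def split_subsample_sequence_sketch(df, length, target="pm2.5", rng=None):
    """Draw a random window; return (X, y) with missing values in X filled."""
    rng = np.random.default_rng() if rng is None else rng
    # Redraw until the value to predict is present
    # (this would loop forever if every target value were missing).
    while True:
        start = rng.integers(0, len(df) - length + 1)
        sample = df.iloc[start:start + length]
        y = sample[target].iloc[-1]
        if not pd.isna(y):
            break
    X = sample.iloc[:-1]
    # First choice: column means over the other selected hours
    X = X.fillna(X.mean())
    # Fallback: a column entirely missing in the window gets the dataframe mean
    X = X.fillna(df.mean())
    return X.to_numpy(), float(y)

# Synthetic dataframe with roughly 10% missing values, only for illustration
features = ["pm2.5", "TEMP", "DEWP", "PRES", "Ir", "Iws", "Is"]
rng = np.random.default_rng(0)
values = rng.random((200, 7))
values[rng.random((200, 7)) < 0.1] = np.nan
df_demo = pd.DataFrame(values, columns=features)

X_demo, y_demo = split_subsample_sequence_sketch(df_demo, length=30, rng=rng)
```

Here `X_demo` has `length - 1` rows, because the last row of the window only supplies the value to predict.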

❓ **Question** ❓ Using the previous function, write another function that generates an entire dataset $(X, y)$ of multiple subsamples, given an initial dataframe `df`, a number of desired sequences, and a `length` for each sequence.

In [None]:
def get_X_y(df, number_of_sequences, length):
    ### YOUR CODE HERE
    return X, y

❓ **Question** ❓ Generate a dataset $(X, y)$ that consists of 100 sequences of 20 observations each, the value of the pollution at the 21st time step being the value to predict.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Check the shape of your inputs. `X` should have shape (100, 20, 7), i.e. (sequences, length, number of features).

In [None]:
### YOUR CODE HERE
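
To see where the shape (100, 20, 7) comes from, here is a small sketch on random numbers. The helper `draw_one` is hypothetical, ignores missing values, and assumes that `pm2.5` is stored in column 0.

```python
import numpy as np

def draw_one(values, length, rng):
    """Hypothetical helper: random window of `length` rows, split into (X, y)."""
    start = rng.integers(0, len(values) - length + 1)
    window = values[start:start + length]
    # All rows but the last are inputs; the last row's column 0 is the target
    return window[:-1], window[-1, 0]

rng = np.random.default_rng(0)
values = rng.random((500, 7))  # stand-in for the 7 selected features

# 21 rows per window: 20 observed time steps + 1 time step to predict
pairs = [draw_one(values, 21, rng) for _ in range(100)]
X_demo = np.array([x for x, _ in pairs])
y_demo = np.array([y for _, y in pairs])

print(X_demo.shape, y_demo.shape)
```

Stacking 100 arrays of shape (20, 7) with `np.array` yields the 3D array that recurrent layers expect.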

❗ **IMPORTANT REMARK: POTENTIAL DATA LEAKAGE**❗ If you split this dataset (X, y) into a training and a test set, it is very likely that some data in the train set are also in the test set. In particular, you will predict values in the test set that are input data in the train set.

To avoid this situation, you should _first_ separate your initial dataframe `df` into a training dataframe and a test dataframe.

❓ **Question** ❓ Separate `df` into `df_train` and `df_test` such that the first 80% of the dataframe is in the training set and the last 20% is in the test set.

In [None]:
### YOUR CODE HERE
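
A chronological split of this kind could look like the sketch below, shown on a toy dataframe (the names with the `_demo` suffix are illustrative).

```python
import numpy as np
import pandas as pd

# Toy dataframe with 100 time-ordered rows, only for illustration
df_demo = pd.DataFrame({"pm2.5": np.arange(100.0)})

# Cut at 80% of the rows, without shuffling, so the test set is strictly later
split_index = int(len(df_demo) * 0.8)
df_train_demo = df_demo.iloc[:split_index]
df_test_demo = df_demo.iloc[split_index:]
```

Because no row belongs to both parts, windows sampled from the training part cannot overlap windows sampled from the test part.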

❓ **Question** ❓ Now, you can generate (X_train, y_train) from df_train and (X_test, y_test) from df_test.
The training set should correspond to 1000 sequences, each of size 50 (+ the time step to predict). The test set should correspond to 200 sequences.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Initialize a model the way you want and compile it within the `init_model` function. _TRY_ to do it, before looking at the previous exercise.
Start here with a simple `LSTM`.

In [None]:
def init_model():
    ### YOUR CODE HERE

❓ **Question** ❓ Fit your model and evaluate it on the test data

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Compare your prediction to a benchmark prediction

In [None]:
# To complete
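
The exercise does not say which benchmark to use. One common choice, assumed here, is a persistence baseline that predicts the last observed pollution value. The sketch below uses random arrays and assumes that `pm2.5` is stored in column 0 of the features.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((200, 50, 7))  # (sequences, time steps, features)
y_demo = rng.random(200)           # values to predict

# Persistence baseline: predict the pm2.5 value of the last observed time step
# (assumption: pm2.5 is column 0 of the feature axis)
y_persistence = X_demo[:, -1, 0]

# Mean absolute error of the baseline, comparable to the model's MAE
baseline_mae = np.mean(np.abs(y_demo - y_persistence))
print(baseline_mae)
```

A model is only useful if its error on the test set is lower than that of such a trivial predictor.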

# Stack RNN layers

❓ **Question** ❓ Now that you know how to write a recurrent architecture, let's see how to stack several recurrent layers.
If you want to stack multiple RNN, LSTM, GRU layers, it is very easy. Do it as if they were Dense (or any other) layers.

But don't forget: all RNN layers (**EXCEPT** the last one) should have `return_sequences` set to `True`, so that the entire sequence of outputs of a given layer is passed to the next layer. Otherwise, only the last output is passed to the next layer.

In [None]:
def init_model_2():
    ### YOUR CODE HERE 
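
To make the role of `return_sequences` concrete, here is a toy recurrence written in NumPy. It is not Keras code and not a trained model; `toy_rnn` and its random weights are made up for the example. Keras recurrent layers follow the same shape convention: (batch, timesteps, units) with `return_sequences=True`, and (batch, units) otherwise.

```python
import numpy as np

def toy_rnn(x, units, return_sequences, rng):
    """Toy tanh recurrence over x of shape (batch, timesteps, features)."""
    batch, timesteps, n_features = x.shape
    w_in = rng.normal(size=(n_features, units)) * 0.1
    w_rec = rng.normal(size=(units, units)) * 0.1
    h = np.zeros((batch, units))
    states = []
    for t in range(timesteps):
        # New hidden state from the current input and the previous state
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)  # (batch, timesteps, units)
    return h  # (batch, units): last hidden state only

rng = np.random.default_rng(0)
x = rng.random((4, 50, 7))

# First layer keeps the whole sequence so that the second layer has a sequence to read
hidden_sequence = toy_rnn(x, units=8, return_sequences=True, rng=rng)
# Last recurrent layer returns only its final state, ready for a Dense layer
last_state = toy_rnn(hidden_sequence, units=5, return_sequences=False, rng=rng)

print(hidden_sequence.shape, last_state.shape)
```

If the first layer returned only its last state, the second layer would receive a 2D array and would have no time axis to iterate over.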

❓ **Question** ❓ Evaluate your new model and compare it to the previous model and to the baseline.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Now, let's see how the performance changes depending on the number of observed time steps used to sample our initial splits (50 in the previous example).

For different sequence lengths, resplit your data, run your model and evaluate its performance (do not forget to reinitialize your model between runs).

In [None]:
### YOUR CODE HERE