# Deep Learning - Day 4 - Your First RNN

### Exercise objectives:

- Better understand temporal data
- Build your first Recurrent Neural Network


<hr>
<hr>

As you will see across the exercises, temporal data comes in very different types, and therefore in very different levels of complexity. For that reason, let's start with simple sequences of observations.


# The data


The data describes the evolution of the employment status of a person, year after year: each sequence corresponds to 10 consecutive years, where each year describes a job (for simplicity, let's say it is the job held on the 1st of January). Each job is described by:
- the salary,
- the number of persons under one's responsibility,
- the size of the company.

For instance, if in a given year you earn 2500 ($, €, ¥, ...), you have 4 persons under your responsibility and the company has 200 employees, then that year corresponds to the vector (2.5, 4, 200). Note that the salary is divided by 1000 to bring it to a roughly normalized scale. You have one such observation for each of the 10 consecutive years.

So, from these 25,000 sequences, each made of 10 consecutive observations, the goal is to predict the salary of the 11th year based on the past observations.
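
To make this layout concrete, here is a minimal sketch built on randomly generated stand-in arrays rather than the real `X.npy` and `y.npy`. The names `X_demo` and `y_demo` are hypothetical, and the sketch assumes the shape convention `(n_sequences, n_years, n_features)` with features ordered as salary, persons, company size, as in the simulation code at the end of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for the 25,000 real sequences
n_sequences, n_years, n_features = 5, 10, 3

# Hypothetical data: one (salary, persons, company size) vector per year
X_demo = rng.random((n_sequences, n_years, n_features))
y_demo = rng.random(n_sequences)  # salary of the 11th year, one value per sequence

print(X_demo.shape)  # (5, 10, 3)
print(y_demo.shape)  # (5,)

first_sequence = X_demo[0]    # all 10 years of one person, shape (10, 3)
salaries = X_demo[0, :, 0]    # salary over the 10 years, shape (10,)
last_year = X_demo[0, -1]     # (salary, persons, company size) in year 10
```

Once the real data is loaded, checking `X.shape` and `y.shape` in the same way is a quick sanity test.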

❓ **Question** ❓ Load the data

In [None]:
import numpy as np

X = np.load('X.npy')
y = np.load('y.npy')

Let's check the data here.

❓ **Question** ❓ Take some sequences and plot the evolution of their salaries, of the persons under their responsibility and of the company sizes. You might see some correlation between the three variables.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Plot the distribution of all the salaries, persons under one's responsibility, and company sizes to get a better understanding of the variability of observations.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Split your dataset into a train set and a test set (80/20%)

In [None]:
### YOUR CODE HERE

# The model

Now, you will create your first Recurrent Neural Network.

❓ **Question** ❓ Write a model that has: 
- a `SimpleRNN` layer with 20 `units` - don't forget to choose the `tanh` activation function
- a Dense layer with 10 neurons
- a last Dense layer specific to your task (predict a salary)
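
For intuition about what the recurrent layer computes, the sketch below reproduces the standard simple-RNN recurrence `h_t = tanh(x_t W + h_{t-1} U + b)` in plain NumPy for a single sequence. The weights are random, and the names `W`, `U`, `b` and `simple_rnn_forward` are illustrative assumptions. This is not the Keras implementation, only the same recurrence written out by hand.

```python
import numpy as np

def simple_rnn_forward(sequence, W, U, b):
    """Run a tanh simple-RNN over one sequence and return the last hidden state.

    sequence: (n_steps, n_features)
    W: (n_features, units), U: (units, units), b: (units,)
    """
    h = np.zeros(U.shape[0])  # initial hidden state
    for x_t in sequence:
        # The new state mixes the current observation with the previous state
        h = np.tanh(x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
n_steps, n_features, units = 10, 3, 20

sequence = rng.normal(size=(n_steps, n_features))    # one hypothetical sequence
W = rng.normal(scale=0.1, size=(n_features, units))  # input weights
U = rng.normal(scale=0.1, size=(units, units))       # recurrent weights
b = np.zeros(units)                                  # bias

h_last = simple_rnn_forward(sequence, W, U, b)
print(h_last.shape)  # (20,)
```

The 20-dimensional vector `h_last` plays the role of what the following Dense layers receive. With 3 input features and 20 units, such a layer has 3×20 + 20×20 + 20 = 480 weights.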

In [None]:
# To complete

❓ **Question** ❓ Compile your model. Remember to first use the `rmsprop` optimizer (instead of Adam).

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Run your model on your data. Use a validation split of 20% and an early stopping criterion (patience=5).

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Evaluate your model on the test set

In [None]:
### YOUR CODE HERE

# Baseline model


### Standard problems

As with any model, you should quickly get an idea of the performance to expect. For instance, in a balanced 2-class classification problem, a trivial classifier reaches 50% accuracy. If the classes are unbalanced, the trivial baseline consists in always predicting the most frequent class. Similar ideas apply to multiclass classification.
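
As an illustration of the majority-class baseline, here is a minimal sketch on made-up labels. The arrays `y_train_demo` and `y_test_demo` are hypothetical and unrelated to the salary data.

```python
import numpy as np

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
y_train_demo = np.array([0] * 80 + [1] * 20)
y_test_demo = np.array([0] * 16 + [1] * 4)

# The baseline always predicts the most frequent class of the training set
classes, counts = np.unique(y_train_demo, return_counts=True)
majority_class = classes[np.argmax(counts)]

# Accuracy of that constant prediction on the test labels
baseline_accuracy = np.mean(y_test_demo == majority_class)
print(majority_class, baseline_accuracy)
```

Any classifier trained on such data should beat this accuracy to be considered useful.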

In the case of a regression model, a baseline prediction for `y_test` could be to predict the average of `y_train`.
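
A minimal sketch of this mean-prediction baseline, assuming made-up targets (`y_train_demo` and `y_test_demo` are hypothetical stand-ins for `y_train` and `y_test`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression targets
y_train_demo = rng.normal(loc=2.0, scale=0.5, size=800)
y_test_demo = rng.normal(loc=2.0, scale=0.5, size=200)

# Baseline: predict the training mean for every test sample
baseline_prediction = np.full_like(y_test_demo, y_train_demo.mean())

# Mean Absolute Error of the constant prediction
mae_baseline = np.mean(np.abs(y_test_demo - baseline_prediction))
print(f"Mean-prediction baseline MAE: {mae_baseline:.3f}")
```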

### Temporal problems

With temporal data, you often try to predict a value that you have already observed in the past: here, the salary. In that case, a baseline model can predict a value based on these past occurrences. For instance, here, you could predict that the 11-th salary is equal to the 10-th salary.
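
The sketch below shows this "last value" baseline on simulated stand-in data. `X_test_demo` and `y_test_demo` are hypothetical, and the salary is assumed to sit in feature 0, as in the simulation code at the end of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical salary paths over 11 years: a start value plus yearly raises
salary_steps = rng.normal(loc=0.1, scale=0.05, size=(200, 11))
salary_paths = 1.0 + np.cumsum(salary_steps, axis=1)

# Stand-in for X_test: 200 sequences, 10 years, 3 features (salary in feature 0)
X_test_demo = np.zeros((200, 10, 3))
X_test_demo[:, :, 0] = salary_paths[:, :10]  # years 1 to 10
y_test_demo = salary_paths[:, 10]            # year 11

# Baseline: the 11-th salary is predicted to equal the 10-th salary
y_pred_last_value = X_test_demo[:, -1, 0]

mae_last_value = np.mean(np.abs(y_test_demo - y_pred_last_value))
print(f"Last-value baseline MAE: {mae_last_value:.3f}")
```

Here the error equals the average absolute raise between the 10-th and 11-th year, which is why this baseline is hard to beat when values change slowly.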

❓ **Question** ❓ Compute the Mean Absolute Error of a model that would predict that the salary remains constant between the 10-th and 11-th year and compare it to your RNN.

In [None]:
### YOUR CODE HERE

You have probably seen that your RNN's prediction is slightly better than that of the baseline model.

# A bit more complex model

❓ **Question** ❓ Write the exact same model, but with an `LSTM` instead of a `SimpleRNN`, and evaluate its performance on the test set

In [None]:
### YOUR CODE HERE

# Well done!

## You now know how to run an RNN on sequence data!

Note: The sequences you worked with are totally fake. In case you need to reproduce or generate similar data, you can find below the functions that were used to simulate it.

In [None]:
def create_sequences(number):
    X, y = [], []
    
    for i in range(number):
        x_i, y_i = create_individual_sequence(10)
        X.append(x_i)
        y.append(y_i)
        
    return np.array(X), np.array(y)
            
def create_individual_sequence(length):
    company_sizes = []
    nb_persons = []
    salaries = []
    
    
    # Education level
    educ_level = [max(0, int(np.random.normal(10, 2)))]*length
    
    # Company size
    current_size = int(1 + np.random.beta(.4, 4)*500)
    for i in range(length):
        if not np.random.randint(4): # Change with probability 1/4 (randint(4) draws from 0..3)
            current_size = int(max(1, np.random.normal(current_size, 50)))
        company_sizes.append(current_size)
    
    # Number of persons
    nb_iter = np.random.beta(.15, 4)*300
    for i in range(length):
        if not np.random.randint(2): # Change 1 out of 2 possibilities
            R_1 = np.random.beta(0.5, 8)*3
            nb_iter = nb_iter + max(-2, R_1*company_sizes[i] + np.random.randint(-2, 2))
            nb_iter = max(0, nb_iter)
            nb_iter = int(min(company_sizes[i]-1, nb_iter))
        nb_persons.append(nb_iter)
        
    
    # Salary
    salary_iter = max(800, int(np.random.normal(1200, 300)+ 0.05*company_sizes[0] +  np.random.normal(40, 400)))
    salaries.append(salary_iter)
    for i in range(1, length + 1):
        R_1 = np.random.normal(100, 50)
        # Note: for i == 1, index i-2 is -1, so the difference is taken against the last year
        change_person = nb_persons[i-1] - nb_persons[i-2]
        change_company = max(0, company_sizes[i-1] - company_sizes[i-2])
        salary_iter = salary_iter + 0.05*change_company + change_person*R_1 + np.random.normal(100, 50)
        salary_iter = max(int(salary_iter), 500)
        
        salaries.append(salary_iter)

    y = salaries[-1]/1000
    salaries = [_/1000 for _ in salaries[:-1]]
    
    return np.array([salaries, nb_persons, company_sizes]).T, y

In [None]:
#X, y = create_sequences(25000)

#np.save('X', X.astype(np.float32))
#np.save('y', y)