___

# Machine Learning in Geosciences
Department of Applied Geoinformatics and Cartography, Charles University

Lukas Brodsky lukas.brodsky@natur.cuni.cz


# Basic Recurrent Neural Network

This notebook introduces approaches to problems where we try to explain changes over time, that is, situations in which a predicted value depends on a series of past values.

This challenge of incorporating a series of measurements over time into the model parameters is addressed by Recurrent Neural Networks. 

**PyTorch** offers a number of RNN layers and options.<br>

* <a href='https://pytorch.org/docs/stable/nn.html#rnn'><tt><strong>torch.nn.RNN()</strong></tt></a> provides a basic model which applies a multilayer RNN with either <em>tanh</em> or <em>ReLU</em> non-linearity functions to an input sequence.<br>
As we learned in the theory lectures, however, this has its limits.<br><br>
* <a href='https://pytorch.org/docs/stable/nn.html#lstm'><tt><strong>torch.nn.LSTM()</strong></tt></a> adds a multi-layer long short-term memory (LSTM) process which greatly extends the memory of the RNN.

To demonstrate the potential of LSTMs, we'll look at a simple sine wave. 
**Goal**: given a value, predict the next value in the sequence. Due to the cyclical nature of sine waves, a typical neural network won't know whether it should predict upward or downward, while an LSTM is capable of learning patterns of values.

In [None]:
import torch
import torch.nn as nn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Periodic data set 
For this exercise we'll look at a simple sine wave. We'll take **800 data points** and assign **40 points per full cycle**, for a total of **20 complete cycles**. We'll train our model on all but the last cycle, and hold out that last cycle to evaluate our test predictions.
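As a quick sanity check on these numbers, the sketch below rebuilds the same sine series with Python's standard `math` module (instead of the PyTorch tensors used in this notebook) and confirms the cycle count, the periodicity, and the size of the held-out cycle. The variable names are illustrative only.

```python
import math

n_points = 800          # total samples in the series
points_per_cycle = 40   # samples in one full sine period

# Same formula as the notebook: sin(2*pi*x / 40) for x = 0..799
series = [math.sin(2 * math.pi * i / points_per_cycle) for i in range(n_points)]

n_cycles = n_points // points_per_cycle
print(n_cycles)  # 20

# Shifting by one period should reproduce the same values (up to rounding)
max_shift_error = max(
    abs(series[i] - series[i + points_per_cycle])
    for i in range(n_points - points_per_cycle)
)
print(max_shift_error < 1e-9)

# Hold out the last cycle for testing
train, test = series[:-points_per_cycle], series[-points_per_cycle:]
print(len(train), len(test))  # 760 40
```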

In [None]:
# Create & plot data points
x = torch.linspace(0,799,steps=800)
# 40 points per cycle; np.pi replaces the rounded constant 3.1416
y = torch.sin(x*2*np.pi/40)

plt.figure(figsize=(12,4))
plt.xlim(-10,801)
plt.grid(True)
plt.plot(y.numpy());

## Train and test sets
We want to take the first 760 samples in our series as a training sequence, and the last 40 for testing.

NOTE:  We tend to use the terms "series" and "sequence" interchangeably. Usually "series" refers to the entire population of data, or the full time series, and "sequence" refers to some portion of it.

In [None]:
test_size = 40

train_set = y[:-test_size]
test_set = y[-test_size:]

In [None]:
plt.figure(figsize=(12,4))
plt.xlim(-10,801)
plt.grid(True)
plt.plot(y.numpy(), 'w');
plt.plot(train_set.numpy(), 'b')
plt.plot(torch.linspace(train_set.size()[0],799,steps=40), test_set.numpy(), 'r')

## Prepare the training data

When working with LSTM models, we start by **dividing the training sequence into a series of overlapping "windows"**. Each window consists of a **connected sequence of samples**. The **label used for comparison is equal to the next value in the sequence**. 

Think of a moving window.

In this way our **network learns what value should follow a given pattern of preceding values**. 

For example, we have a series of 15 records, and a window size of 5. We feed $[x_1,..,x_5]$ into the model, and compare the prediction to $x_6$. Then we backpropagate, update parameters, and feed $[x_2,..,x_6]$ into the model. We compare the new output to $x_7$ and so forth up to $[x_{10},..,x_{14}]$.

To simplify this, we'll define a function called `input_data` that builds a list of <tt>(seq, label)</tt> tuples. Windows overlap, so the first tuple might contain $([x_1,..,x_5],[x_6])$, the second would have $([x_2,..,x_6],[x_7])$, etc. 

Here $k$ is the width of the window. Due to the overlap, we'll have a total number of <tt>(seq, label)</tt> tuples equal to $\textrm{len}(series)-k$
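The 15-record example can be traced with plain Python lists. This is a minimal sketch of the windowing logic described for `input_data`. The helper `make_windows` is a hypothetical name, and unlike the notebook's version it returns each label as a scalar rather than a one-element slice.

```python
def make_windows(series, k):
    """Return (window, label) pairs; hypothetical plain-list helper."""
    pairs = []
    for i in range(len(series) - k):
        window = series[i:i + k]   # k consecutive samples
        label = series[i + k]      # the value that follows the window
        pairs.append((window, label))
    return pairs

series = list(range(1, 16))   # stands in for x_1 .. x_15
pairs = make_windows(series, k=5)

print(len(pairs))    # 10, i.e. len(series) - k
print(pairs[0])      # ([1, 2, 3, 4, 5], 6)
print(pairs[-1])     # ([10, 11, 12, 13, 14], 15)
```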

In [None]:
def input_data(seq,ws):  
    """seq .. input sequence 
       ws .. the window size
    """
    out = []
    L = len(seq)
    for i in range(L-ws):
        window = seq[i:i+ws]
        label = seq[i+ws:i+ws+1]
        out.append((window,label))
    return out

NOTE: "Windows" are different from "batches". **In our example we'll feed one window into the model at a time**, so our batch size would be 1. If we passed two windows into the model before we backprop and update weights, our batch size would be 2.
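To make the distinction concrete, the sketch below counts how many weight updates one pass over 10 windows would involve for batch sizes 1 and 2. The `batches` helper and the integer stand-ins for windows are illustrative assumptions, not part of this notebook's code.

```python
windows = list(range(10))   # stand-ins for 10 (seq, label) tuples

def batches(items, batch_size):
    """Yield consecutive chunks of `items`; hypothetical helper."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# One weight update happens per batch
updates_bs1 = len(list(batches(windows, 1)))
updates_bs2 = len(list(batches(windows, 2)))
print(updates_bs1, updates_bs2)  # 10 5
```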

In [None]:
# To train on our sine wave data we'll use a window size of 40 (one entire cycle). 

# test_size = 40
# train_set = y[:-test_size]
# test_set = y[-test_size:]

window_size = 40

# Create the training dataset of sequence/label tuples:
train_data = input_data(train_set, window_size)

len(train_data) # this should equal 760-40

### The Window data

In [None]:
# Display the first (seq/label) tuple in train_data
torch.set_printoptions(sci_mode=False) # to improve the appearance of tensors
train_data[0]

### Step 1

In [None]:
plt.figure(figsize=(12,4))
plt.xlim(-1,43)
plt.grid(True)
plt.plot(train_data[0][0].numpy(), 'b*--')
plt.plot(torch.tensor(40), train_data[0][1].numpy(), 'ro')

### Step 2

In [None]:
plt.figure(figsize=(12,4))
plt.xlim(-1,43)
plt.grid(True)
plt.plot(torch.linspace(1,40,steps=40), train_data[1][0].numpy(), 'b*--')
plt.plot(torch.tensor(41), train_data[1][1].numpy(), 'ro')

### Step 3

In [None]:
plt.figure(figsize=(12,4))
plt.xlim(-1,43)
plt.grid(True)
plt.plot(torch.linspace(2,41,steps=40), train_data[2][0].numpy(), 'b*--')
plt.plot(torch.tensor(42), train_data[2][1].numpy(), 'ro')

### ... and so on! 

## Define the model

* one LSTM layer    
* input size of 1     
* hidden size of 50 (can be changed)     
* output: a fully-connected layer that reduces the LSTM output to the prediction size of 1<br>
    
NOTE: You will often see the terms `input_dim` and `hidden_dim` used in place of `input_size` and `hidden_size`. They mean the same thing. Let's stick to `input_size` and `hidden_size` to stay consistent with PyTorch's built-in keywords.

During training we pass three tensors through the LSTM layer - the sequence, the hidden state **$h_0$** and the cell state **$c_0$**.

This means we need to initialize $h_0$ and $c_0$. This can be done with random values, but we'll use zeros instead.
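The tensor shapes involved can be checked with a minimal sketch. It assumes a single-layer, unidirectional `nn.LSTM` with the default `batch_first=False`, a batch size of 1, and a dummy all-zero window of 40 values. The variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 50
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size)

window = torch.zeros(40)               # dummy window of 40 samples
x = window.view(len(window), 1, -1)    # (seq_len, batch, input_size) = (40, 1, 1)

# Zero-initialised states: (num_layers, batch, hidden_size)
h0 = torch.zeros(1, 1, hidden_size)
c0 = torch.zeros(1, 1, hidden_size)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)   # one hidden vector per time step
print(hn.shape)    # final hidden state
print(cn.shape)    # final cell state
```

The output holds one hidden vector per time step, which is why a model built on it can keep only the prediction for the last step.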

In [None]:
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=50, out_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        
        # LSTM layer:
        self.lstm = nn.LSTM(input_size, hidden_size)
        
        # Fully-connected layer:
        self.linear = nn.Linear(hidden_size, out_size)
        
        # Initialize h0 and c0:
        self.hidden = (torch.zeros(1, 1, hidden_size),
                       torch.zeros(1, 1, hidden_size))
    
    def forward(self,seq):
        lstm_out, self.hidden = self.lstm(
            seq.view(len(seq), 1, -1), self.hidden)
        pred = self.linear(lstm_out.view(len(seq),-1))

        # return only the last prediction
        return pred[-1]   
    

## Initialisation 

* instantiate the model
* define the loss function
* define the optimization function

Since we're comparing single values, we'll use <a href='https://pytorch.org/docs/stable/nn.html#mseloss'><tt><strong>torch.nn.MSELoss</strong></tt></a>.

Also, we've found that <a href='https://pytorch.org/docs/stable/optim.html#torch.optim.SGD'><tt><strong>torch.optim.SGD</strong></tt></a> converges faster for this application than <tt><strong>torch.optim.Adam</strong></tt>.
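For intuition, here is what one squared-error loss evaluation and one plain SGD update look like for a single value. This is a hand-worked sketch on a hypothetical one-parameter model `pred = w * x`, not the LSTM itself. The numbers are made up, and the learning rate of 0.01 matches the value used in this notebook.

```python
# Hypothetical one-parameter model: prediction = w * x
w = 0.5
x, target = 2.0, 3.0
lr = 0.01

pred = w * x                       # 1.0
loss = (pred - target) ** 2        # squared error for a single value: 4.0
grad = 2 * (pred - target) * x     # d(loss)/dw = -8.0

w_new = w - lr * grad              # SGD update: 0.5 + 0.08
new_loss = (w_new * x - target) ** 2

print(loss, new_loss)              # the loss decreases after the step
```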

In [None]:
torch.manual_seed(42)
# instance of our model
model = LSTM()
# model = RNN()
# optimisation criterion 
criterion = nn.MSELoss()
# optimizer 
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model

In [None]:
model.hidden[0].size()

In [None]:
# [p.numel() for p in model.parameters()]

In [None]:
def count_parameters(model):
    params = [p.numel() for p in model.parameters() if p.requires_grad]
    for item in params:
        print(f'{item:>6}')
    print(f'______\n{sum(params):>6}')
    
count_parameters(model)
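The counts printed above can be reproduced by hand. The sketch below assumes the standard PyTorch layout for a single-layer LSTM (four gates, each with input-to-hidden weights, hidden-to-hidden weights, and two bias vectors) followed by one fully-connected layer.

```python
input_size, hidden_size, out_size = 1, 50, 1

lstm_sizes = [
    4 * hidden_size * input_size,    # input-to-hidden weights (4 gates)
    4 * hidden_size * hidden_size,   # hidden-to-hidden weights
    4 * hidden_size,                 # input-to-hidden biases
    4 * hidden_size,                 # hidden-to-hidden biases
]
linear_sizes = [
    hidden_size * out_size,          # fully-connected weights
    out_size,                        # fully-connected bias
]

sizes = lstm_sizes + linear_sizes
print(sizes)       # [200, 10000, 200, 200, 50, 1]
print(sum(sizes))  # 10651
```

Almost all of the parameters sit in the hidden-to-hidden weights, so the total grows roughly with the square of `hidden_size`.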

In [None]:
# Train the model for 10 epochs
epochs = 10
future = 40
loss_history = []

for i in range(epochs):
    
    # tuple-unpack the train_data set
    for seq, y_train in train_data:
        
        # reset the gradients and the hidden/cell states
        optimizer.zero_grad()
        model.hidden = (torch.zeros(1,1,model.hidden_size),
                        torch.zeros(1,1,model.hidden_size))
        
        y_pred = model(seq)
        
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        
    # print training result
    print(f'Epoch: {i+1:2} Loss: {loss.item():10.8f}')
    loss_history.append(loss.item())

# print('Training finished!')

In [None]:
plt.plot(loss_history)

## Predicting future values
To show how an LSTM model improves after each epoch, we'll run predictions and plot the results. Our goal is to predict the last sequence of 40 values, and compare them to the known data in our test set. However, we have to be careful <em>not</em> to use test data in the predictions - that is, each new prediction derives from previously predicted values.

The task is to take the last known window, predict the next value, then <em>append</em> the predicted value to the sequence and run a new prediction on a window that includes the value we've just predicted. 
In this way, a well-trained model <em>should</em> follow any regular trends/cycles in the data.
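The append-and-repredict loop can be sketched without any trained network. In the example below, `toy_predict` is a hypothetical stand-in for the model: it uses the exact sine recurrence $x_{n+1} = 2\cos(\omega)\,x_n - x_{n-1}$, so the rollout should track the true curve. A trained LSTM would only approximate this behaviour.

```python
import math

window_size = 40
future = 40
omega = 2 * math.pi / 40   # angular step of the notebook's sine wave

def toy_predict(window):
    """Hypothetical stand-in for a trained model.

    Uses the exact sine recurrence x[n+1] = 2*cos(omega)*x[n] - x[n-1].
    """
    return 2 * math.cos(omega) * window[-1] - window[-2]

# Last known window: points 720..759 of the series
preds = [math.sin(omega * i) for i in range(720, 760)]

for _ in range(future):
    window = preds[-window_size:]        # most recent 40 values, incl. predictions
    preds.append(toy_predict(window))    # feed the prediction back in

forecast = preds[window_size:]           # the 40 newly predicted values
truth = [math.sin(omega * i) for i in range(760, 800)]
max_err = max(abs(f - t) for f, t in zip(forecast, truth))
print(len(forecast), max_err < 1e-9)
```

Note that no value from `truth` ever enters `preds`. It is used only for the final comparison.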

## Train and simultaneously evaluate the model


In [None]:
window_size

In [None]:
preds = train_set[-window_size:].tolist()
# preds
seq = torch.FloatTensor(preds[-window_size:])
seq

Note: In the training approach, the gradient is propagated through the hidden states of the LSTM across the time dimension in the batch.

The hidden state stores the internal state of the RNN from the predictions made on previous tokens in the current sequence; this is what allows RNNs to take context into account. The hidden state is determined by the output of the previous token.

When you predict for the first token of any sequence, retaining the hidden state from the previous sequence would make the model behave as if the new sequence were a continuation of the old one, which would give worse results. Instead, for the first token you initialise an empty hidden state, which is then filled with the model state and used for the second token.

Consider hidden states as just outputs, which are not updated during backprop. So, for every new epoch (and every iteration) we re-initialise the hidden-state vectors, so that they are computed for each sequence individually.
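The effect of a leftover hidden state can be seen on a toy one-unit recurrent cell, $h_t = \tanh(w_x x_t + w_h h_{t-1})$. This is a deliberately simplified sketch, not an LSTM. The weights and input values are arbitrary illustrative numbers.

```python
import math

def run_cell(sequence, h=0.0, w_x=0.8, w_h=0.5):
    """Toy one-unit recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1}).

    The weights are arbitrary illustrative values, not learned ones.
    """
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

seq_a = [0.9, 0.7, 0.4]
seq_b = [0.1, 0.2, 0.3]

# Fresh (zero) state for seq_b, as done before every window in this notebook
h_fresh = run_cell(seq_b, h=0.0)

# State left over from seq_a leaks into the processing of seq_b
h_leaky = run_cell(seq_b, h=run_cell(seq_a))

print(h_fresh != h_leaky)  # True: the same input gives a different result
```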

In [None]:
# We'll train 10 epochs. For clarity, we'll "zoom in" on the test set, 
# and only display from point 700 to the end.
epochs = 10
future = 40

for i in range(epochs):
    
    # tuple-unpack the train_data set
    for seq, y_train in train_data:
        
        # reset the gradients and the hidden/cell states
        optimizer.zero_grad()
        model.hidden = (torch.zeros(1,1,model.hidden_size),
                        torch.zeros(1,1,model.hidden_size))
        
        y_pred = model(seq)
        
        loss = criterion(y_pred, y_train)
        loss.backward()
        optimizer.step()
        
    # print training result
    print(f'Epoch: {i+1:2} Loss: {loss.item():10.8f}')
    
    # MAKE PREDICTIONS after training in each epoch 
    # start with a list of the last 40 training records
    preds = train_set[-window_size:].tolist()
    print(len(preds))

    for f in range(future):  
        seq = torch.FloatTensor(preds[-window_size:])
        with torch.no_grad():
            model.hidden = (torch.zeros(1,1,model.hidden_size),
                            torch.zeros(1,1,model.hidden_size))
            preds.append(model(seq).item())
            # print(len(preds))
            
    loss = criterion(torch.tensor(preds[-window_size:]),y[760:])
    print(f'Loss on test predictions: {loss}')

    # Plot from point 700 to the end
    plt.figure(figsize=(12,4))
    plt.xlim(700,801)
    plt.grid(True)
    plt.plot(y.numpy())
    plt.plot(range(760,800),preds[window_size:])
    plt.show()

# Forecasting into an unknown future
We'll continue to train our model, this time using the entire dataset. Then we'll predict what the <em>next</em> 40 points should be.

## Train the model
Expect this to take a few minutes.

In [None]:
epochs = 10
window_size = 40
future = 40

# Create the full set of sequence/label tuples:
all_data = input_data(y,window_size)
len(all_data)  # this should equal 800-40

In [None]:
import time
start_time = time.time()

for i in range(epochs):
    
    # tuple-unpack the entire set of data
    for seq, y_train in all_data:  
       
        # reset the gradients and the hidden/cell states
        optimizer.zero_grad()
        model.hidden = (torch.zeros(1,1,model.hidden_size),
                        torch.zeros(1,1,model.hidden_size))
        
        y_pred = model(seq)
        
        loss = criterion(y_pred, y_train)
        
        loss.backward()
        optimizer.step()
        
    # print training result
    print(f'Epoch: {i+1:2} Loss: {loss.item():10.8f}')
    
print(f'\nDuration: {time.time() - start_time:.0f} seconds')

## Predict future values, plot the result

In [None]:
preds = y[-window_size:].tolist()

for i in range(future):  
    seq = torch.FloatTensor(preds[-window_size:])
    with torch.no_grad():
        # Reset the hidden and cell states
        model.hidden = (torch.zeros(1,1,model.hidden_size),
                        torch.zeros(1,1,model.hidden_size))  
        preds.append(model(seq).item())

plt.figure(figsize=(12,4))
plt.xlim(-10,841)
plt.grid(True)
plt.plot(y.numpy())
plt.plot(range(800,800+future),preds[window_size:])
plt.show()