<a href="https://colab.research.google.com/github/xtyangpsp/HarvardEPS_MLclass2019/blob/master/lstm_seq2seq_tutorial_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> Building an LSTM Encoder-Decoder using PyTorch to make Sequence-to-Sequence Predictions</h1>


**Laura Kulowski & Cedric Flamant**<br>
**Fall 2019**<br>


---



<h2> Overview </h2>

There are many instances where we want to predict how a time series will behave in the future. For example, given the behavior of the stock market over the last month, we may want to know how the stock market will behave in the future. 

<center> <img src="https://i.imgur.com/7uqw73X.png" alt="" width="500"> </center> 

Other time series we may wish to forecast include weather conditions (temperature, humidity, etc.), power usage, and traffic volume. The Long Short-Term Memory (LSTM) neural network is well-suited to these problems because it can capture long-term dependencies in the data (i.e., values far in the past may influence future values).

Our goal in this lab is to make sequence-to-sequence predictions, or predictions where the input and output sequences might be different lengths, using an LSTM. For the stock market, this might involve providing the LSTM with 20 days of stock prices and predicting the next 5 days. To make sequence-to-sequence predictions, we use an LSTM with a special architecture: the LSTM encoder-decoder.

The LSTM encoder-decoder involves combining two LSTMs. The first LSTM, or the encoder, processes the input sequence and outputs an encoded state (i.e., a summary of the input sequence). The second LSTM, or the decoder, uses the encoded state to generate an output sequence. The LSTM encoder-decoder architecture is shown below. 

<center> <img src="https://i.imgur.com/05OKp8k.png" alt="" width="900"> </center> 
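As a rough illustration of the hand-off described above, the sketch below runs a toy encoder over a 20-step input and passes its final hidden and cell states to a separate decoder LSTM for a single step. The layer sizes, the batch of 3 series, and the names `toy_encoder`, `toy_decoder`, and `toy_head` are assumptions chosen for illustration; they are not the lab's model, which is built in Section 2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this illustration.
seq_len, batch_size, num_features, hidden_size = 20, 3, 1, 8

toy_encoder = nn.LSTM(input_size=num_features, hidden_size=hidden_size)
toy_decoder = nn.LSTM(input_size=num_features, hidden_size=hidden_size)
toy_head = nn.Linear(hidden_size, num_features)  # maps hidden state to a value

x = torch.randn(seq_len, batch_size, num_features)  # (seq_len, batch, features)

with torch.no_grad():
    # The encoder reads the whole input; its final (hidden, cell) pair is the
    # "encoded state" that summarizes the sequence.
    _, (h_enc, c_enc) = toy_encoder(x)

    # The decoder starts from the encoded state and the last observed value.
    first_input = x[-1].unsqueeze(0)  # (1, batch, features)
    dec_out, (h_dec, c_dec) = toy_decoder(first_input, (h_enc, c_enc))
    first_prediction = toy_head(dec_out.squeeze(0))  # (batch, features)

print(h_enc.shape)             # torch.Size([1, 3, 8])
print(first_prediction.shape)  # torch.Size([3, 1])
```

The encoded state has a fixed size regardless of how long the input sequence is, which is what lets the input and output lengths differ.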

In this lab, we will 

1.   Prepare a time series dataset to input into our LSTM encoder-decoder 
2.   Build the LSTM encoder-decoder using PyTorch
3.   Train the model and make predictions 
4.   Evaluate our model on the train and test data
5.   Experiment with the model







In [0]:
import numpy as np
import matplotlib.pyplot as plt
import random
import os, errno
import sys
import time
from tqdm import tqdm_notebook as tqdm
from tqdm import tnrange

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F



<h2> 1. Preparing the Time Series Dataset </h2>

<h3> Creating a Dataset </h3>

To make training easier, we will use a synthetic time series dataset.

In [0]:
# create synthetic time series dataset
def synthetic_data(Nt = 2000, tf = 80 * np.pi): 
  t = np.linspace(0., tf, Nt)
  y = np.sin(2. * t) + 0.5 * np.cos(t) + np.random.normal(0., 0.2, Nt)
  return t, y

In [0]:
t, y = synthetic_data() 

Let's take a look at our time series data.

In [0]:
# plot time series 
## your code here

<h3> Train/Test Split </h3>

Now, we will split our time series into train and test data.

In [0]:
# split time series into train/test sets 
def train_test_split(t, y, split = 0.8):
  '''
  : param t: time data 
  : param y: feature 
  : param split: fraction of data to include in training set (e.g., 0.8)
  : return: t_train, y_train, t_test, y_test; the t arrays are 1D and the
  :         y arrays have shape [# samples, 1]
  '''
  
  indx_split = int(split * len(y))
  indx_train = np.arange(0, indx_split)
  indx_test = np.arange(indx_split, len(y))
  
  t_train = t[indx_train]
  y_train = y[indx_train]
  y_train = y_train.reshape(-1, 1)
  
  t_test = t[indx_test]
  y_test = y[indx_test]
  y_test = y_test.reshape(-1, 1)
  
  return t_train, y_train, t_test, y_test 
  

In [0]:
t_train, y_train, t_test, y_test = train_test_split(t, y, split = 0.8)

Let's plot the train/test data.

In [0]:
# plot train/test data
## your code here


<h3> Creating a Windowed Dataset </h3>

We need to organize our data into sequences of $n_{i}$ input values and $n_{o}$ target values. To do this, we slide a moving window over the dataset. We start at the first $y$ value and collect $n_{i}$ values as inputs and the next $n_{o}$ values as targets. Then, we slide our window to the second (stride = 1) or third (stride = 2) $y$ value and repeat the procedure. The windowed dataset for $n_{i} = 3$, $n_{o} = 2$, and stride = 1 is shown below.

<center> <img src="https://i.imgur.com/nBQUU61.png" alt="" width="900"> </center> 
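To make the sliding window concrete, here is a small sketch that windows a toy series of seven values with $n_{i} = 3$, $n_{o} = 2$, and stride = 1. The toy series and the variable names are assumptions for illustration only; the arrays are stacked so that the window index sits in the middle dimension.

```python
import numpy as np

# Toy series 0, 1, ..., 6 with a single feature (shape: [7, 1]).
y_toy = np.arange(7, dtype=float).reshape(-1, 1)

n_i, n_o, stride = 3, 2, 1

# Number of positions where a full input + target window fits.
num_windows = (len(y_toy) - n_i - n_o) // stride + 1

# Stack windows along axis 1 so the result is [window length, # windows, # features].
X_toy = np.stack(
    [y_toy[k * stride : k * stride + n_i] for k in range(num_windows)], axis=1
)
Y_toy = np.stack(
    [y_toy[k * stride + n_i : k * stride + n_i + n_o] for k in range(num_windows)],
    axis=1,
)

print(X_toy.shape, Y_toy.shape)        # (3, 3, 1) (2, 3, 1)
print(X_toy[:, 0, 0], Y_toy[:, 0, 0])  # [0. 1. 2.] [3. 4.]
print(X_toy[:, 2, 0], Y_toy[:, 2, 0])  # [2. 3. 4.] [5. 6.]
```

Each column of `X_toy` is one input window, and the matching column of `Y_toy` holds the values that immediately follow it.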




To feed our data into the LSTM, the data must have shape [sequence length, batch size, # features]. The shape of $X$ should be [$n_{i}$, # of windows that fit in the data, 1], where the last value is 1 because we only have one feature, $y$. Similarly, the shape of $Y$ should be [$n_{o}$, # of windows that fit in the data, 1]. In the cell below, we window the data and make sure it has the right dimensions for the LSTM.

In [0]:
def windowed_dataset(y, input_window = 5, output_window = 1, stride = 1, num_features = 1):
    '''
    : param y: time series feature (shape: [# samples, # features])
    : param input_window: number of y samples to give model 
    : param output_window: number of future y samples to predict  
    : param stride: spacing between windows 
    : param num_features: number of features (i.e., 1 for us, but we could have multiple features)
    : return: X, Y arrays with correct dimensions for LSTM
    :         (i.e., [time steps, # samples, # features])
    '''
    L = y.shape[0]
    num_samples = (L - input_window - output_window) // stride + 1

    X = np.zeros([input_window, num_samples, num_features])
    Y = np.zeros([output_window, num_samples, num_features])    
    
    for ff in np.arange(num_features):
        for ii in np.arange(num_samples):
            start_x = stride * ii
            end_x = start_x + input_window
            X[:, ii, ff] = y[start_x:end_x, ff]

            start_y = stride * ii + input_window
            end_y = start_y + output_window 
            Y[:, ii, ff] = y[start_y:end_y, ff]

    return X, Y

In [0]:
# set size of input/output windows 
iw = 80 
ow = 20 
s = 5

# generate windowed training/test datasets
Xtrain, Ytrain = windowed_dataset(y_train, input_window = iw, output_window = ow, stride = s)
Xtest, Ytest = windowed_dataset(y_test, input_window = iw, output_window = ow, stride = s)

In [0]:
# check the shape of the data 
print(f'Xtrain has shape {Xtrain.shape} and Ytrain has shape {Ytrain.shape}')
print(f'Xtest has shape {Xtest.shape} and Ytest has shape {Ytest.shape}')

Xtrain has shape (80, 301, 1) and Ytrain has shape (20, 301, 1)
Xtest has shape (80, 61, 1) and Ytest has shape (20, 61, 1)
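The sample counts above follow from the window-count formula used in `windowed_dataset`. The short check below reproduces them, assuming the default 2000-point series and the 0.8 split used earlier; `count_windows` is a hypothetical helper introduced only for this check.

```python
def count_windows(series_len, input_window, output_window, stride):
    """Number of full (input + output) windows that fit in a series."""
    return (series_len - input_window - output_window) // stride + 1


n_total = 2000                 # default Nt in synthetic_data
n_train = int(0.8 * n_total)   # 1600 training points
n_test = n_total - n_train     # 400 test points

print(count_windows(n_train, 80, 20, 5))  # 301
print(count_windows(n_test, 80, 20, 5))   # 61
```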


In [0]:
# plot an example in the windowed data 
## your code here

<h2> 2. Build LSTM Encoder-Decoder in PyTorch</h2>

We will use PyTorch to build the LSTM encoder-decoder. Although we build the encoder and the decoder separately, these two modules work together when we use the model to train and predict. 

<center> <img src="https://i.imgur.com/05OKp8k.png" alt="" width="900"> </center> 

For the encoder, we initialize the cell and hidden states to zero. We define a `forward` function for both the encoder and the decoder, which represents forward propagation through the graph.
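In PyTorch, `nn.LSTM` uses zero hidden and cell states whenever no initial state is passed, so calling the layer without a state should behave the same as passing explicit zeros. The short check below illustrates this under assumed toy sizes; the variable names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer and input: 6 time steps, batch of 2, 1 feature, 4 hidden units.
lstm = nn.LSTM(input_size=1, hidden_size=4, num_layers=1)
x = torch.randn(6, 2, 1)

# Explicit zero states with shape (num_layers, batch_size, hidden_size).
zeros = (torch.zeros(1, 2, 4), torch.zeros(1, 2, 4))

with torch.no_grad():
    out_default, (h_default, c_default) = lstm(x)      # no state passed
    out_zeros, (h_zeros, c_zeros) = lstm(x, zeros)     # zeros passed explicitly

# Both calls should give the same outputs.
same_output = torch.allclose(out_default, out_zeros)

# For a single-layer LSTM, the last output step equals the final hidden state.
last_step_is_hidden = torch.allclose(out_default[-1], h_default[0])

print(same_output, last_step_is_hidden)
```

Both checks are expected to hold, which is why an encoder can rely on the default zero state without passing one in.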




In [0]:
class lstm_encoder(nn.Module):
    ''' Encodes time-series sequence '''

    def __init__(self, input_size, hidden_size, num_layers = 1):
        '''
        : param input_size: the number of expected features in the input x
        : param hidden_size: the number of features in the hidden state h
        : param num_layers: number of recurrent layers (i.e., 2 means there are
        :                   2 stacked LSTMs)
        '''
        super(lstm_encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # define LSTM layer
        self.lstm = nn.LSTM(input_size = input_size, hidden_size = hidden_size,
                            num_layers = num_layers)

    def forward(self, x_input):
        '''
        : param x_input: input of shape (seq_len, # in batch, input_size)
        '''
        lstm_out, self.hidden = self.lstm(x_input.view(x_input.shape[0], x_input.shape[1], self.input_size))
        return lstm_out, self.hidden     
    
    def init_hidden(self, batch_size):
        '''
        initialize hidden state
        : param batch_size: x_input.shape[1]
        '''
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size),
                torch.zeros(self.num_layers, batch_size, self.hidden_size))


class lstm_decoder(nn.Module):
    ''' Decodes hidden state output by encoder '''
    def __init__(self, input_size, hidden_size, num_layers = 1):
        super(lstm_decoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(input_size = input_size, hidden_size = hidden_size,
                            num_layers = num_layers)
        self.linear = nn.Linear(hidden_size, input_size)           

    def forward(self, x_input, encoder_hidden_states):
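        '''
        : param x_input: 2D input of shape (batch_size, input_size); a sequence
        :                dimension of length 1 is added before the LSTM
        : param encoder_hidden_states: tuple of (hidden state, cell state)
        : return: output of shape (batch_size, input_size) and updated states
        '''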
        lstm_out, self.hidden = self.lstm(x_input.unsqueeze(0), encoder_hidden_states)
        output = self.linear(lstm_out.squeeze(0))     
        
        return output, self.hidden


<h2> 3. Train the LSTM Encoder-Decoder and make Predictions </h2>

There are a few ways we can train the LSTM encoder-decoder. First, we can recurrently feed the predicted decoder outputs into the LSTM decoder until we have an output sequence of the desired length. This is the typical model structure we have considered so far.   

<center> <img src="https://i.imgur.com/v8SYiEM.png" alt="" width="900"> </center> 

Instead of feeding the predicted outputs into the LSTM decoder, we could feed in the true target values. This is called teacher forcing. In training, teacher forcing acts as "training wheels" --- if the model makes a bad prediction, it is put back in place with the true value. 
<center> <img src="https://i.imgur.com/uhdJJFp.png" alt="" width="900"> </center> 

Another alternative is to randomly use teacher forcing (the "training wheels" are on some of the time). 
<center> <img src="https://i.imgur.com/KQoPfmZ.png" alt="" width="900"> </center> 

In each case, we compare the predictions provided by the LSTM decoder to the true values, compute the loss, and update the weight matrices and biases in the LSTM gates (via backpropagation) to minimize the loss. The function `train_model` carries out this training process and allows the user to decide how much teacher forcing to use.
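The `train_model` function defined in the cell below decides once per batch whether to use teacher forcing, by comparing a uniform random number with `teacher_forcing_ratio`. A minimal sketch of that coin flip follows; the helper `teacher_forcing_flags` and the fixed seed are assumptions added for repeatability and are not part of the lab code.

```python
import random


def teacher_forcing_flags(n_batches, teacher_forcing_ratio, seed=0):
    """Return one True/False decision per batch (True = use teacher forcing)."""
    rng = random.Random(seed)  # local generator so the sketch is repeatable
    return [rng.random() < teacher_forcing_ratio for _ in range(n_batches)]


for ratio in (0.0, 0.5, 1.0):
    flags = teacher_forcing_flags(n_batches=1000, teacher_forcing_ratio=ratio)
    fraction = sum(flags) / len(flags)
    print(f"ratio = {ratio}: teacher forcing used in {fraction:.2f} of batches")
```

Because `random()` returns values in [0, 1), a ratio of 0 never uses teacher forcing and a ratio of 1 always does; values in between mix the two modes from batch to batch.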

We also define the function `predict`, which allows us to make predictions once the model has been trained. The predictions are recursive: each predicted value is fed back into the decoder as the input for the next step.
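The recursive pattern can be sketched independently of the LSTM. In the example below, `rollout` and `toy_step` are hypothetical stand-ins: `toy_step` plays the role of one decoder step (it halves its input and counts steps in its state), which makes the feedback loop easy to follow by hand.

```python
def rollout(step_fn, first_input, state, target_len):
    """Generate target_len outputs, feeding each output back in as the next input."""
    outputs = []
    x = first_input
    for _ in range(target_len):
        y, state = step_fn(x, state)  # one "decoder" step
        outputs.append(y)
        x = y                         # the prediction becomes the next input
    return outputs


def toy_step(x, state):
    """Stand-in for a decoder step: halve the input, count steps in the state."""
    return 0.5 * x, state + 1


print(rollout(toy_step, first_input=8.0, state=0, target_len=3))  # [4.0, 2.0, 1.0]
```

With teacher forcing, the only change would be replacing `x = y` with the true target for that step.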


In [0]:
class lstm_seq2seq(nn.Module):
    def __init__(self, input_size, hidden_size):
        
        super(lstm_seq2seq, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size

        self.encoder = lstm_encoder(input_size = input_size, hidden_size = hidden_size)
        self.decoder = lstm_decoder(input_size = input_size, hidden_size = hidden_size)


    def train_model(self, input_tensor, target_tensor, n_epochs, target_len, batch_size, teacher_forcing_ratio, learning_rate = 0.01, dynamic_tf = False):
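        '''
        train the LSTM encoder-decoder
        : param input_tensor: input data of shape (seq_len, # samples, # features)
        : param target_tensor: target data of shape (target_len, # samples, # features)
        : param teacher_forcing_ratio: float in [0, 1]; probability that a batch is
        :                              trained with teacher forcing (higher means more)
        : param dynamic_tf: if True, reduce teacher_forcing_ratio by 0.02 each epoch
        : return: per-epoch losses (all batches, teacher-forced batches, recursive batches)
        '''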
        # initialize array of losses 
        losses = np.full(n_epochs, np.nan)
        losses_tf = np.full(n_epochs, np.nan)
        losses_no_tf = np.full(n_epochs, np.nan)


        optimizer = optim.Adam(self.parameters(), lr = learning_rate)
        criterion = nn.MSELoss()

        # calculate number of batch iterations
        n_batches = int(input_tensor.shape[1] / batch_size)

        with tnrange(n_epochs) as tr:
            for it in tr:
                
                batch_loss = 0.
                batch_loss_tf = 0.
                batch_loss_no_tf = 0.
                num_tf = 0
                num_no_tf = 0

                for b in range(n_batches):
                    # select data (non-overlapping batches of size batch_size)
                    input_batch = input_tensor[:, b * batch_size: (b + 1) * batch_size, :]
                    target_batch = target_tensor[:, b * batch_size: (b + 1) * batch_size, :]

                    # outputs tensor
                    outputs = torch.zeros(target_len, batch_size, input_batch.shape[2])

                    # initialize hidden state
                    encoder_hidden = self.encoder.init_hidden(batch_size)

                    # zero the gradient
                    optimizer.zero_grad()

                    # encoder outputs
                    encoder_output, encoder_hidden = self.encoder(input_batch)

                    # decoder with teacher forcing
                    decoder_input = input_batch[-1, :, :]   # shape: (batch_size, input_size)
                    decoder_hidden = encoder_hidden

                    use_teacher_forcing = ( random.random() < teacher_forcing_ratio )

                    if use_teacher_forcing:
                        # teacher forcing: feed the target as the next input
                        for t in range(target_len): 
                            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
                            outputs[t] = decoder_output
                            decoder_input = target_batch[t, :, :]

                    else:
                        # predict recursively
                        for t in range(target_len): 
                            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
                            outputs[t] = decoder_output
                            decoder_input = decoder_output

                    loss = criterion(outputs, target_batch)

                    batch_loss += loss.item()
                    
                    # count batches (MSELoss already averages within a batch)
                    if use_teacher_forcing:
                        num_tf += 1
                        batch_loss_tf += loss.item()
                    else:
                        num_no_tf += 1
                        batch_loss_no_tf += loss.item()

                    loss.backward()

                    optimizer.step()

                # mean loss per batch for this epoch
                batch_loss /= n_batches
                losses[it] = batch_loss

                
                if num_no_tf != 0: 
                    batch_loss_no_tf /= num_no_tf
                    losses_no_tf[it] = batch_loss_no_tf

                    
                if num_tf != 0:
                    batch_loss_tf /= num_tf
                    losses_tf[it] = batch_loss_tf

                if dynamic_tf and teacher_forcing_ratio > 0:
                    teacher_forcing_ratio = teacher_forcing_ratio - 0.02 
                    
                tr.set_postfix(loss="{0:.3f}".format(batch_loss))
                    
        return losses, losses_tf, losses_no_tf

    def predict(self, input_tensor, target_len):
        '''
        : param input_tensor: (seq_len, input_size)
        '''

        # encode input_tensor
        input_tensor = input_tensor.unsqueeze(1)     # add in batch size of 1
        encoder_output, encoder_hidden = self.encoder(input_tensor)

        # initialize tensor for predictions
        outputs = torch.zeros(target_len, input_tensor.shape[2])

        # decode input_tensor
        decoder_input = input_tensor[-1, :, :]
        decoder_hidden = encoder_hidden
        
        for t in range(target_len):
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
            outputs[t] = decoder_output.squeeze(0)
            decoder_input = decoder_output
            
        np_outputs = outputs.detach().numpy()
        
        return np_outputs


<h2>4. Evaluate the LSTM Encoder-Decoder on the Train/Test Sets</h2>

Let's train a model. 

In [0]:
# the data needs to be in a torch format, not np.array
X_train = torch.from_numpy(Xtrain).type(torch.Tensor)
Y_train = torch.from_numpy(Ytrain).type(torch.Tensor)

X_test = torch.from_numpy(Xtest).type(torch.Tensor)
Y_test = torch.from_numpy(Ytest).type(torch.Tensor)
input_size = X_train.shape[2]

# specify model parameters and train 
model = lstm_seq2seq(input_size = input_size, hidden_size = 15)
loss, loss_tf, loss_no_tf = model.train_model(X_train, Y_train, n_epochs = 10, target_len = ow, batch_size = 5, teacher_forcing_ratio = 0.1, learning_rate = 0.01, dynamic_tf = False)


Let's see how the model performs on the train/test data by plotting a few examples. 

In [0]:
def plot_train_test_results(lstm_model, num_rows = 4): 
  num_cols = 2
  num_plots = num_rows * num_cols

  fig, ax = plt.subplots(num_rows, num_cols, figsize = (8, 12))

  for ii in range(num_rows):
      # train set
      X_train_plt = X_train[:, ii, :]
      Y_train_pred = lstm_model.predict(X_train_plt, target_len = ow)
      ax[ii, 0].plot(np.arange(0, iw), Xtrain[:, ii, 0], 'k', label = 'Input')
      ax[ii, 0].plot(np.arange(iw - 1, iw + ow), np.concatenate([[Xtrain[-1, ii, 0]], Ytrain[:, ii, 0]]), 'b', label = 'Target')
      ax[ii, 0].plot(np.arange(iw - 1, iw + ow),  np.concatenate([[Xtrain[-1, ii, 0]], Y_train_pred[:, 0]]), 'r', label = 'Prediction')
      ax[ii, 0].set_xlabel('t')
      ax[ii, 0].set_ylabel('y')


      # test set
      X_test_plt = X_test[:, ii, :]
      Y_test_pred = lstm_model.predict(X_test_plt, target_len = ow)
      ax[ii, 1].plot(np.arange(0, iw), Xtest[:, ii, 0], 'k', label = 'Input')
      ax[ii, 1].plot(np.arange(iw - 1, iw + ow), np.concatenate([[Xtest[-1, ii, 0]], Ytest[:, ii, 0]]), 'b', label = 'Target')
      ax[ii, 1].plot(np.arange(iw - 1, iw + ow), np.concatenate([[Xtest[-1, ii, 0]], Y_test_pred[:, 0]]), 'r', label = 'Prediction')
      ax[ii, 1].set_xlabel('t')
      ax[ii, 1].set_ylabel('y')

      if ii == 0:
        ax[ii, 0].set_title('Train')
        
        ax[ii, 1].legend(bbox_to_anchor=(1, 1))
        ax[ii, 1].set_title('Test')

        
  return 

In [0]:
plot_train_test_results(model)

<h2> Your Turn </h2>

Now you can play around with the LSTM encoder-decoder. You can try changing the


*    function used to generate the dataset 
*    input/output window size
*    number of epochs and the batch size
*    amount of teacher forcing (0 = no teacher forcing, 1 = full teacher forcing)
*    learning rate
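One further option in `train_model` is `dynamic_tf`: when it is set to `True`, the teacher forcing ratio is lowered by 0.02 at the end of each epoch for as long as it is above zero. The sketch below traces that schedule; `tf_schedule` is a hypothetical helper that mirrors the update rule and is not part of the lab code.

```python
def tf_schedule(initial_ratio, n_epochs, step=0.02):
    """Teacher forcing ratio in effect during each epoch when dynamic_tf is True."""
    ratios = []
    ratio = initial_ratio
    for _ in range(n_epochs):
        ratios.append(ratio)
        # Same rule as in train_model: only decrease while the ratio is positive.
        if ratio > 0:
            ratio = ratio - step
    return ratios


for epoch, ratio in enumerate(tf_schedule(initial_ratio=0.1, n_epochs=8)):
    print(f"epoch {epoch}: teacher forcing ratio is about {ratio:.2f}")
```

Because of floating-point rounding, the ratio can end slightly below zero rather than exactly at zero. That is harmless here, since a random number in [0, 1) is never smaller than a non-positive ratio, so teacher forcing is simply switched off.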



In [0]:
# your model 
hidden_size = 15 
n_epochs = 1 
batch_size = 5
tf_ratio = 0.5
learning_rate = 0.01 

model_new = lstm_seq2seq(input_size = input_size, hidden_size = hidden_size)
loss, loss_tf, loss_no_tf = model_new.train_model(X_train, Y_train, n_epochs = n_epochs, target_len = ow, batch_size = batch_size, teacher_forcing_ratio = tf_ratio, learning_rate = learning_rate, dynamic_tf = False)
plot_train_test_results(model_new)