# Chapter 8 -- Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTMs)

In this tutorial, we will use [PyTorch](https://pytorch.org/) and [Lightning](https://lightning.ai/) to create, optimize, and make predictions with a **Long Short-Term Memory (LSTM)** network. We will implement the **LSTM** unit schematized below, which uses sequential data to predict the stock price of two different companies.

<img src="./images/lstm_image.001.png" alt="A Long Short-Term Memory Unit" style="width: 800px;">

The training data (below) consist of stock prices for two different companies, *Company A* and *Company B*. The goal is to use the data from the first **4** days to predict what the price will be on the **5**th day. If we look closely at the data, we'll see that the only differences in the prices occur on Day **1** and Day **5**. So the LSTM has to remember what happened on Day **1** in order to predict what will happen on Day **5**.

<img src="./images/company_data.png" alt="Data for Companies A and B" style="width: 800px;">

In this tutorial, we will:

- Build a Long Short-Term Memory (LSTM) unit by hand with Lightning
- Train the LSTM unit and use Lightning and TensorBoard to evaluate
- Add additional epochs to the training without starting over
- Build a Long Short-Term Memory Unit with nn.LSTM() and train it with Lightning

In [1]:
import torch
import torch.nn as nn
from torch.optim import Adam

import lightning as L
from torch.utils.data import TensorDataset, DataLoader
# from lightning.pytorch.loggers import TensorBoardLogger  # optional: explicit logger

## Build a Long Short-Term Memory unit by hand

Just like in previous tutorials, building a neural network means creating a new class, and a Long Short-Term Memory (LSTM) unit is a type of neural network. To make it easy to train the LSTM, this class will inherit from `LightningModule` and we'll create the following methods:

- `__init__()` to initialize the Weights and Biases and keep track of a few other housekeeping things.
- `lstm_unit()` to do the LSTM math. For example, to calculate the percentage of the long-term memory to remember.
- `forward()` to make a forward pass through the unrolled LSTM. In other words, `forward()` calls `lstm_unit()` for each data point.
- `configure_optimizers()` to configure the optimizer. In this tutorial we'll use `Adam`, another popular algorithm for optimizing the Weights and Biases.
- `training_step()` to pass the training data to `forward()`, calculate the loss and to keep track of the loss values in a log file.

In [2]:
class LSTMbyHand(L.LightningModule):

    def __init__(self):

        super().__init__()
        L.seed_everything(seed=42)

        # nn.LSTM() uses random values from a uniform distribution to initialize the tensors
        # Here we can do it 2 different ways 1) Normal Distribution and 2) Uniform Distribution
        # We'll start with the Normal Distribution using `torch.normal()`
        mean = torch.tensor(0.0)
        std = torch.tensor(1.0)

        # In this case, we are only using the normal distribution for the Weights.
        # All Biases are initialized to 0.

        # These are the Weights and Biases in the first stage, which determines what percentage
        # of the long-term memory (blue cell, or *forget gate*) the LSTM unit will remember.
        self.wlr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wlr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.blr1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        # These are the Weights and Biases in the second stage, which determines the new
        # potential long-term memory (*input gate*) and what percentage will be remembered.
        self.wpr1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wpr2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bpr1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        self.wp1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wp2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bp1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        # These are the Weights and Biases in the third stage, which determines the
        # new short-term memory (*output gate*) and what percentage will be sent to the output.
        self.wo1 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.wo2 = nn.Parameter(torch.normal(mean=mean, std=std), requires_grad=True)
        self.bo1 = nn.Parameter(torch.tensor(0.), requires_grad=True)

        # # We can also initialize all Weights and Biases using a uniform distribution:
        # self.wlr1 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.wlr2 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.blr1 = nn.Parameter(torch.rand(1), requires_grad=True)

        # self.wpr1 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.wpr2 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.bpr1 = nn.Parameter(torch.rand(1), requires_grad=True)

        # self.wp1 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.wp2 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.bp1 = nn.Parameter(torch.rand(1), requires_grad=True)

        # self.wo1 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.wo2 = nn.Parameter(torch.rand(1), requires_grad=True)
        # self.bo1 = nn.Parameter(torch.rand(1), requires_grad=True)


    def lstm_unit(self, input_value, long_memory, short_memory):
        # lstm_unit does the math for a single LSTM unit.

        # long term memory is also called "cell state",
        # short term memory is also called "hidden state"

        # 1) The first stage determines what percent of the current
        # long-term memory should be remembered
        long_remember_percent = torch.sigmoid(
            (short_memory * self.wlr1)
            +
            (input_value * self.wlr2)
            +
            self.blr1
        )

        # 2) The second stage creates a new, potential long-term memory
        # and determines what percentage of that to add to the current long-term memory
        potential_remember_percent = torch.sigmoid(
            (short_memory * self.wpr1)
            +
            (input_value * self.wpr2)
            +
            self.bpr1
        )
        
        potential_memory = torch.tanh(
            (short_memory * self.wp1)
            +
            (input_value * self.wp2)
            +
            self.bp1
        )

        # Once we have gone through the first two stages, we can update the long-term memory
        updated_long_memory = (
            (long_memory * long_remember_percent)
            +
            (potential_remember_percent * potential_memory)
        )

        # 3) The third stage creates a new, potential short-term memory and determines
        # what percentage of that should be remembered and used as output.
        output_percent = torch.sigmoid(
            (short_memory * self.wo1)
            +
            (input_value * self.wo2)
            +
            self.bo1
        )

        updated_short_memory = torch.tanh(updated_long_memory) * output_percent

        # Finally, we return the updated long and short-term memories
        return([updated_long_memory, updated_short_memory])


    def forward(self, input):
        # forward() unrolls the LSTM for the training data by calling lstm_unit()
        # for each day of training data that we have. forward() also keeps track of
        # the long and short-term memories after each day and returns the final 
        # short-term memory, which is the 'output' of the LSTM.

        long_memory = 0  # long-term memory is also called "cell state" and indexed with c0, c1, ..., cN
        short_memory = 0  # short-term memory is also called "hidden state" and indexed with h0, h1, ..., hN
        
        day1 = input[0]
        day2 = input[1]
        day3 = input[2]
        day4 = input[3]

        # Day 1
        long_memory, short_memory = self.lstm_unit(day1, long_memory, short_memory)

        # Day 2
        long_memory, short_memory = self.lstm_unit(day2, long_memory, short_memory)

        # Day 3
        long_memory, short_memory = self.lstm_unit(day3, long_memory, short_memory)

        # Day 4
        long_memory, short_memory = self.lstm_unit(day4, long_memory, short_memory)

        # Now return short_memory, which is the 'output' of the LSTM.
        return short_memory


    def configure_optimizers(self):
        # this configures the optimizer we want to use for backpropagation.

        # Setting the learning rate to 0.1 trains way faster than using the 
        # default learning rate, lr=0.001, which requires a lot more training. 
        # However, if we use the default value, we get the exact same Weights 
        # and Biases that were used in the LSTM Clearly Explained StatQuest video. 
        # So we'll use the default value.
        # return Adam(self.parameters(), lr=0.1)
        return Adam(self.parameters())


    def training_step(self, batch):
        # take a step during gradient descent.
        input_i, label_i = batch  # collect input
        output_i = self.forward(input_i[0])  # run input through the neural network
        loss = (output_i - label_i)**2  # loss = squared residual

        ###################
        ## Logging the loss and the predicted values so we can evaluate the training
        ###################

        self.log("train_loss", loss)
        
        # Our dataset consists of two sequences of values representing Company A and 
        # Company B. For Company A, the goal is to predict that the value on Day 5 = 0, 
        # and for Company B, the goal is to predict that the value on Day 5 = 1. We use 
        # label_i, the value we want to predict, to keep track of which company we just 
        # made a prediction for and log that output value in a company specific file
        if (label_i == 0):
            self.log("out_0", output_i)
        else:
            self.log("out_1", output_i)

        return loss

Once we have created the class that defines an LSTM, we can use it to create a model and print out the randomly initialized Weights and Biases.

Then, just for fun, we'll see what those random Weights and Biases predict for *Company A* and *Company B*. If they are good predictions, then we're done! However, the chances of getting good predictions from random values are very small.

In [3]:
# Create the model object, print out parameters and see how well
# the untrained LSTM can make predictions...
model = LSTMbyHand()

print("Before optimization, the parameters are:")
for name, param in model.named_parameters():
    print(name, param.data)

print("\nNow let's compare the observed and predicted values:")

# To make predictions, we pass in the first 4 days worth of stock 
# values in an array for each company. In this case, the only difference 
# between the input values for Company A and B occurs on the first day. 
# Company A has 0 and Company B has 1.
print(
    "Company A: Observed = 0, Predicted =",
    model(
        torch.tensor([0., 0.5, 0.25, 1.])
    ).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(
        torch.tensor([1., 0.5, 0.25, 1.])
    ).detach())

Seed set to 42


Before optimization, the parameters are:
wlr1 tensor(0.3367)
wlr2 tensor(0.1288)
blr1 tensor(0.)
wpr1 tensor(0.2345)
wpr2 tensor(0.2303)
bpr1 tensor(0.)
wp1 tensor(-1.1229)
wp2 tensor(-0.1863)
bp1 tensor(0.)
wo1 tensor(2.2082)
wo2 tensor(-0.6380)
bo1 tensor(0.)

Now let's compare the observed and predicted values:
Company A: Observed = 0, Predicted = tensor(-0.0377)
Company B: Observed = 1, Predicted = tensor(-0.0383)


With the unoptimized parameters, the predicted value for *Company A*, **-0.0377**, isn't terrible, since it is relatively close to the observed value, **0**. However, the predicted value for *Company B*, **-0.0383**, _is_ terrible, because it is relatively far from the observed value, **1**. So, that means we need to train the LSTM.
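
As a sanity check, the two untrained predictions above can be reproduced without PyTorch. The sketch below is a plain-Python re-implementation of the same math as `lstm_unit()` and `forward()`, using the initial Weights as they were printed (rounded to four decimals), so small differences in the last digit are expected. The dictionary `W` and the helpers `sigmoid()` and `predict()` are names introduced here for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Initial Weights copied from the printout above; all Biases start at 0.
W = {
    "wlr1": 0.3367, "wlr2": 0.1288, "blr1": 0.0,
    "wpr1": 0.2345, "wpr2": 0.2303, "bpr1": 0.0,
    "wp1": -1.1229, "wp2": -0.1863, "bp1": 0.0,
    "wo1": 2.2082, "wo2": -0.6380, "bo1": 0.0,
}


def lstm_unit(x, long_memory, short_memory, w):
    """One LSTM step on plain floats, mirroring LSTMbyHand.lstm_unit()."""
    # Stage 1: percentage of the long-term memory to remember
    long_remember_percent = sigmoid(short_memory * w["wlr1"] + x * w["wlr2"] + w["blr1"])
    # Stage 2: potential long-term memory and the percentage of it to keep
    potential_remember_percent = sigmoid(short_memory * w["wpr1"] + x * w["wpr2"] + w["bpr1"])
    potential_memory = math.tanh(short_memory * w["wp1"] + x * w["wp2"] + w["bp1"])
    updated_long = long_memory * long_remember_percent + potential_remember_percent * potential_memory
    # Stage 3: new short-term memory, which is also the output
    output_percent = sigmoid(short_memory * w["wo1"] + x * w["wo2"] + w["bo1"])
    updated_short = math.tanh(updated_long) * output_percent
    return updated_long, updated_short


def predict(days, w):
    """Unroll the LSTM over the given days and return the final short-term memory."""
    long_memory, short_memory = 0.0, 0.0
    for x in days:
        long_memory, short_memory = lstm_unit(x, long_memory, short_memory, w)
    return short_memory


pred_a = predict([0.0, 0.5, 0.25, 1.0], W)
pred_b = predict([1.0, 0.5, 0.25, 1.0], W)
print(f"Company A: {pred_a:.4f}")
print(f"Company B: {pred_b:.4f}")
```

If the arithmetic matches the class above, the two printed values should land close to **-0.0377** and **-0.0383**.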

## LSTM with Lightning and TensorBoard

### Getting Started

Since we are using **Lightning**, training the LSTM we created by hand is pretty easy. All we have to do is create the training data and put it into a `DataLoader`.

In [4]:
# Create the training data for the neural network.
inputs = torch.tensor(
    [
        [0., 0.5, 0.25, 1.],
        [1., 0.5, 0.25, 1.]
    ])
labels = torch.tensor([0., 1.])

dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset)
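
To see why `training_step()` indexes the input with `input_i[0]`, it can help to look at what the `DataLoader` actually yields. The sketch below rebuilds the same dataset and records the shape of each batch; the list `batch_shapes` is a name introduced here for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.tensor([[0., 0.5, 0.25, 1.],
                       [1., 0.5, 0.25, 1.]])
labels = torch.tensor([0., 1.])

# batch_size defaults to 1 and shuffle defaults to False
dataloader = DataLoader(TensorDataset(inputs, labels))

batch_shapes = []
for input_i, label_i in dataloader:
    # Each batch holds one company: inputs have shape (1, 4), labels have shape (1,)
    batch_shapes.append((tuple(input_i.shape), tuple(label_i.shape)))
    # input_i[0] drops the batch dimension and leaves the 4 daily values
    print(input_i.shape, label_i.shape, input_i[0])
```

Because every batch carries an extra leading dimension of size 1, `input_i[0]` is the 4-day sequence that `forward()` expects.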

We then create a **Lightning Trainer**, `L.Trainer`, and fit it to the training data. We will be starting with **2000** epochs. This may be enough to successfully optimize all of the parameters, but it might not. We'll find out after we compare the predictions to the observed values.

Since we'll be using TensorBoard to visualize the training progress, we need to ensure the correct logger is used. By default (as long as the `tensorboard` package is installed), `Lightning` **automatically uses the `TensorBoardLogger`**. However, to have more control over our experiment (like setting a custom name for our run), we can **explicitly** create a `TensorBoardLogger` and pass it to the **Trainer** using the `logger` argument.

In [5]:
trainer = L.Trainer(
    max_epochs=2000,  # with default learning rate, 0.001
    # logger=logger,    # pass another logger
)

trainer.fit(model, train_dataloaders=dataloader)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name         | Type | Params | Mode
---------------------------------------------
  | other params | n/a  | 12     | n/a 
---------------------------------------------
12        Trainable params
0         Non-trainable params
12        Total params
0.000     Total estimated model params size (MB)
0         Modules in train mode
0         Modules in eval mode
c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\.env\Lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=2000` reached.


Now that we've trained the model with **2000** epochs, let's see how good the predictions are.

In [6]:
print("New observed and predicted values (after 2000 epochs; lr=0.001):")
print(
    "Company A: Observed = 0, Predicted =",
    model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

New observed and predicted values (after 2000 epochs; lr=0.001):
Company A: Observed = 0, Predicted = tensor(0.4342)
Company B: Observed = 1, Predicted = tensor(0.6171)


### TensorBoard

Unfortunately, these predictions are terrible. So it seems like we'll have to do more training. However, it would be awesome if we could be confident that more training will actually improve the predictions. If not, we can spare ourselves a lot of time, and potentially money, and just give up.
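
To put a number on "terrible", we can compute the squared residual (the same loss used in `training_step()`) from the predictions printed so far. This is only a back-of-the-envelope sketch: the predictions are the rounded values from the printouts above, and the names `results`, `squared_residual()`, and `total_loss` are introduced here for illustration.

```python
# (label, prediction) pairs copied from the printouts above
results = {
    "before training": [(0.0, -0.0377), (1.0, -0.0383)],
    "2000 epochs": [(0.0, 0.4342), (1.0, 0.6171)],
}


def squared_residual(label, prediction):
    # Same loss as in training_step(): (output - label)**2
    return (prediction - label) ** 2


total_loss = {}
for stage, pairs in results.items():
    losses = [squared_residual(label, pred) for label, pred in pairs]
    total_loss[stage] = sum(losses)
    print(f"{stage}: Company A = {losses[0]:.4f}, "
          f"Company B = {losses[1]:.4f}, total = {total_loss[stage]:.4f}")
```

The total drops from roughly 1.08 to roughly 0.34, so training is making progress overall even though the prediction for *Company A* got worse along the way.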

Before we dive into more training, let's look at the loss values and predictions that we saved in log files with **TensorBoard**. **TensorBoard** will graph everything that we logged during training, making it super easy to see if things are headed in the right direction or not.

We first need to make sure that **TensorBoard** is installed. If it isn't, we can install it by following the [instructions to use TensorBoard with PyTorch](https://docs.pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html).

Then, to get TensorBoard working:

- In the Jupyter browser window, go to the **File** menu and select **New**.
- In the submenu, select **Terminal**.
- In the terminal, navigate to the same directory that contains the **lightning_logs** directory, i.e. to `chapter_08/`.
- Then in the terminal, enter `tensorboard --logdir=lightning_logs/` to start the **TensorBoard** server.
- When the **TensorBoard** server starts, it will print out a URL that looks like this `http://localhost:6006/`.
- Copy the URL and paste it into a new browser window and then we are good to go.

Alternatively, we can run the server in a Jupyter notebook cell by entering the following command:

```python
# Load the TensorBoard notebook extension
%load_ext tensorboard

# Start TensorBoard, pointing it to the log directory
%tensorboard --logdir lightning_logs
```

Note that with this approach, the connection to the TensorBoard server may fail after a kernel restart, [as explained in this GitHub issue](https://github.com/tensorflow/tensorboard/issues/2481).

_If the graphs look messed up and we see a bunch of different lines, instead of just one red line per graph, then check where this notebook is saved for a directory called `lightning_logs`. Delete `lightning_logs` and then re-run everything in this notebook. One source of problems with the graphs is that every time we train a model, a new batch of log files is created and stored in `lightning_logs` and **TensorBoard**, by default, will plot all of them. We can turn off unwanted log files in **TensorBoard**, and we'll do this later on in this notebook, but for now, the easiest thing to do is to start with a clean slate._

Below are the graphs of **loss** (`train_loss`), the predictions for *Company A* (`out_0`), and the predictions for *Company B* (`out_1`). Remember that for *Company A*, we want to predict **0** and for *Company B*, we want to predict **1**.

<img src="./images/train_loss_2000_epochs.png" alt="Loss" style="width: 300px;"> <img src="./images/out_0_2000_epochs.png" alt="out_0" style="width: 300px;"> <img src="./images/out_1_2000_epochs.png" alt="out_1" style="width: 300px;">

If we look at the **loss** (`train_loss`), we see that it is going down, which is good, but it still has further to go. When we look at the predictions for *Company A* (`out_0`), we see that they started out pretty good, close to **0**, but then got really bad early on in training, shooting all the way up to **0.5**, but are starting to get smaller. In contrast, when we look at the predictions for *Company B* (`out_1`), we see that they started out really bad, close to **0**, but have been getting better ever since and look like they could continue to get better if we kept training.

In summary, the graphs seem to suggest that if we continued training our model, the predictions would improve. So let's add more epochs to the training.

## Adding More Epochs

The good news is that because we're using **Lightning**, we can pick up where we left off training without having to start over from scratch. This is because when we train with **Lightning**, it creates _checkpoint_ files that keep track of the Weights and Biases as they change. As a result, all we have to do to pick up where we left off is tell the `Trainer` where the checkpoint files are located. This will save us a lot of time since we don't have to retrain the first **2000** epochs. So let's add an additional **1000** epochs to the training.

In [7]:
# First, find where the most recent checkpoint files are stored
path_to_checkpoint = trainer.checkpoint_callback.best_model_path  # By default, "best" = "most recent"

print(
    "The new trainer will start where the last left off:",
    path_to_checkpoint,
    sep='\n')

The new trainer will start where the last left off:
c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\chapter_08\lightning_logs\version_0\checkpoints\epoch=1999-step=4000.ckpt


Resuming will also create a new subfolder within the `lightning_logs` folder, a new "version", which TensorBoard plots alongside the earlier runs. We can tell the runs apart because each version gets its own graph color. Don't forget to click "fit domain to data" when analyzing the graphs in TensorBoard.

In [8]:
# Then create a new Lightning Trainer
trainer = L.Trainer(
    max_epochs=3000,  # by setting it to 3000, we're adding 1000 more
    # logger=logger,
)

# And then call fit() using the path to the most recent checkpoint files
# so that we can pick up where we left off
trainer.fit(
    model,
    train_dataloaders=dataloader,
    ckpt_path=path_to_checkpoint)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\chapter_08\lightning_logs\version_0\checkpoints\epoch=1999-step=4000.ckpt
c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\.env\Lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:445: The dirpath has changed from 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_08\\lightning_logs\\version_0\\checkpoints' to 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_08\\lightning_logs\\version_1\

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3000` reached.


Now that we have added **1000** epochs to the training, let's check the predictions.

In [9]:
print("New observed and predicted values (after 3000 total epochs):")
print(
    "Company A: Observed = 0, Predicted =",
    model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

New observed and predicted values (after 3000 total epochs):
Company A: Observed = 0, Predicted = tensor(0.2708)
Company B: Observed = 1, Predicted = tensor(0.7534)


They are much better than before. We can also check the logs with **TensorBoard** to see if it makes sense to add more epochs to the training (also note that we can set logarithmic scale).

Since we already have **TensorBoard** running in a separate browser window, all we have to do is reload that page to update the graphs (below).

<img src="./images/train_loss_3000_epochs.png" alt="Loss" style="width: 300px;"> <img src="./images/out_0_3000_epochs.png" alt="out_0" style="width: 300px;"> <img src="./images/out_1_3000_epochs.png?raw=1" alt="out_1" style="width: 300px;">

The blue lines in each graph represent the values we logged during the extra **1000** epochs. The **loss** is getting smaller and the predictions for both companies are improving!

However, because it looks like there is even more room for improvement, let's add **2000** more epochs to the training.

In [10]:
path_to_checkpoint = trainer.checkpoint_callback.best_model_path

trainer = L.Trainer(
    max_epochs=5000,  # By setting it to 5000, we're adding 2000 more
    # logger=logger,
)

trainer.fit(
    model,
    train_dataloaders=dataloader,
    ckpt_path=path_to_checkpoint)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Restoring states from the checkpoint path at c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\chapter_08\lightning_logs\version_1\checkpoints\epoch=2999-step=6000.ckpt
c:\Users\Sébastien\Documents\data_science\machine_learning\statsquest_neural_networks\.env\Lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:445: The dirpath has changed from 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_08\\lightning_logs\\version_1\\checkpoints' to 'c:\\Users\\Sébastien\\Documents\\data_science\\machine_learning\\statsquest_neural_networks\\chapter_08\\lightning_logs\\version_2\

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=5000` reached.


Now that we have added **2000** more epochs to the training (for a total of **5000** epochs), let's check the predictions.

In [11]:
print("New observed and predicted values (after 5000 total epochs):")
print(
    "Company A: Observed = 0, Predicted =",
    model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

New observed and predicted values (after 5000 total epochs):
Company A: Observed = 0, Predicted = tensor(0.0022)
Company B: Observed = 1, Predicted = tensor(0.9693)


And they look good! The prediction for *Company A* is super close to **0**, which is exactly what we want, and the prediction for *Company B* is close to **1**, which is also what we want.

Now let's look at the graphs in **TensorBoard** by updating the page (though we can set an automatic reload of the data).

<img src="./images/train_loss_5000_epochs.png" alt="Loss" style="width: 300px;"> <img src="./images/out_0_5000_epochs.png" alt="out_0" style="width: 300px;"> <img src="./images/out_1_5000_epochs.png" alt="out_1" style="width: 300px;">

The dark red lines show how things changed when we added an additional **2000** epochs to the training, for a total of **5000** epochs. Now we see that the **loss** (`train_loss`) and the predictions for each company appear to be tapering off, suggesting that adding more epochs may not improve the predictions much, so we're done.

Lastly, let's print out the final estimates for the Weights and Biases. In theory, they should be the same (within rounding error) as what we used in the **StatQuest** on **Long Short-Term Memory** and seen in the diagram of the **LSTM** unit at the top of this Jupyter notebook.

In [12]:
print("After optimization, the parameters are:")
for name, param in model.named_parameters():
    print(name, param.data)

After optimization, the parameters are:
wlr1 tensor(2.7043)
wlr2 tensor(1.6307)
blr1 tensor(1.6234)
wpr1 tensor(1.9983)
wpr2 tensor(1.6525)
bpr1 tensor(0.6204)
wp1 tensor(1.4122)
wp2 tensor(0.9393)
bp1 tensor(-0.3217)
wo1 tensor(4.3848)
wo2 tensor(-0.1943)
bo1 tensor(0.5935)
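
With the optimized values in hand, we can also check the claim from the introduction that the LSTM has to remember what happened on Day **1**. The plain-Python sketch below plugs the (rounded) parameters printed above into the same equations as `lstm_unit()` and records the long-term and short-term memories after each day. The tuples `FORGET`, `INPUT`, `CANDIDATE`, and `OUTPUT` and the helpers `pre()` and `trace()` are names introduced here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Optimized values copied (rounded) from the printout above.
# Each tuple: (weight on short-term memory, weight on input, bias)
FORGET = (2.7043, 1.6307, 1.6234)      # wlr1, wlr2, blr1
INPUT = (1.9983, 1.6525, 0.6204)       # wpr1, wpr2, bpr1
CANDIDATE = (1.4122, 0.9393, -0.3217)  # wp1, wp2, bp1
OUTPUT = (4.3848, -0.1943, 0.5935)     # wo1, wo2, bo1


def pre(gate, short_memory, x):
    """Weighted sum that feeds a gate's activation function."""
    w_short, w_input, bias = gate
    return short_memory * w_short + x * w_input + bias


def trace(days):
    """Return [(long_memory, short_memory), ...] recorded after each day."""
    long_memory, short_memory = 0.0, 0.0
    history = []
    for x in days:
        remember = sigmoid(pre(FORGET, short_memory, x))
        keep = sigmoid(pre(INPUT, short_memory, x))
        potential = math.tanh(pre(CANDIDATE, short_memory, x))
        output = sigmoid(pre(OUTPUT, short_memory, x))
        long_memory = long_memory * remember + keep * potential
        short_memory = math.tanh(long_memory) * output
        history.append((long_memory, short_memory))
    return history


trace_a = trace([0.0, 0.5, 0.25, 1.0])
trace_b = trace([1.0, 0.5, 0.25, 1.0])

for day, ((c_a, _), (c_b, _)) in enumerate(zip(trace_a, trace_b), start=1):
    print(f"Day {day}: long-term memory A = {c_a:+.3f}, B = {c_b:+.3f}")
print(f"Predictions: A = {trace_a[-1][1]:.4f}, B = {trace_b[-1][1]:.4f}")
```

The two companies only differ on Day 1, yet their long-term memories should already have opposite signs after that first step and stay apart through Day 4, which is what lets the final predictions end up near **0** and **1**.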


## Optimizing the LSTM using `nn.LSTM()`

Now that we know how to create an LSTM unit by hand, train it, and then use it to make good predictions, let's learn how to take advantage of PyTorch's `nn.LSTM()`. For the most part, using `nn.LSTM()` allows us to simplify the `__init__()` and `forward()` methods. The other big difference is that this time, we're not going to try to recreate the parameter values we used so far, which means we can set the learning rate for `Adam` to **0.1**. This will speed up training a lot. Everything else stays the same.
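
One way to build intuition for why the learning rate matters so much with `Adam` is to work through its very first update by hand. The sketch below follows the textbook Adam update with PyTorch's default `betas=(0.9, 0.999)` and `eps=1e-8`; it is an illustration of the formula, not PyTorch's actual implementation, and `adam_first_step()` is a name introduced here.

```python
import math


def adam_first_step(gradient, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a single parameter (step t=1)."""
    m = (1 - beta1) * gradient          # running mean of the gradient
    v = (1 - beta2) * gradient ** 2     # running mean of the squared gradient
    m_hat = m / (1 - beta1)             # bias correction for t=1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# The same (made-up) gradient with the two learning rates used in this tutorial
first_step_default = adam_first_step(0.5, lr=0.001)
first_step_fast = adam_first_step(0.5, lr=0.1)
print(first_step_default, first_step_fast)
```

On the first step the gradient's magnitude cancels out, so the update is approximately equal to the learning rate itself. With `lr=0.1`, each parameter can therefore move about 100 times further per step than with the default `lr=0.001`.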

In [13]:
class LightningLSTM(L.LightningModule):

    def __init__(self): # __init__() is the class constructor function, and we use it to initialize the Weights and Biases.

        super().__init__() # initialize an instance of the parent class, LightningModule.

        L.seed_everything(seed=42)

        # input_size = number of features (or variables) in the data.
        # In our example we only have a single feature (value).
        # hidden_size = this determines the dimension of the output.
        # In other words, if we set hidden_size=1, then we have 1 output node;
        # if we set hidden_size=50, then we have 50 output nodes (that can then
        # be 50 input nodes to a subsequent fully connected neural network).
        self.lstm = nn.LSTM(input_size=1, hidden_size=1)

    def forward(self, input):
        input_trans = input.view(len(input), 1)  # reshape to (sequence length, input_size)
        lstm_out, _ = self.lstm(input_trans)

        # lstm_out has the short-term memories for all inputs.
        # We make our prediction with the last one
        prediction = lstm_out[-1]
        return prediction

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.1)  # we now set the learning rate to 0.1

    def training_step(self, batch):
        input_i, label_i = batch
        output_i = self.forward(input_i[0])
        loss = (output_i - label_i)**2

        self.log("train_loss", loss)

        # Log the prediction under the name that matches the company's label
        if (label_i == 0):
            self.log("out_0", output_i)
        else:
            self.log("out_1", output_i)

        return loss
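
To make the reshaping in `forward()` more concrete, the sketch below runs a standalone `nn.LSTM` on one 4-day sequence and inspects the shapes involved. It assumes a PyTorch version that accepts unbatched 2-D input, which the class above also relies on. The variable names are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
lstm = nn.LSTM(input_size=1, hidden_size=1)

days = torch.tensor([0., 0.5, 0.25, 1.])
sequence = days.view(len(days), 1)  # (4,) -> (sequence length=4, input_size=1)

# lstm_out holds the short-term memory after every day;
# h_n and c_n are the final short-term and long-term memories
lstm_out, (h_n, c_n) = lstm(sequence)

print(lstm_out.shape)  # one short-term memory per day
print(h_n.shape, c_n.shape)

# Count the trainable values: 4 gates x (2 weights + 2 biases)
n_params = sum(p.numel() for p in lstm.parameters())
print(n_params)
```

The last row of `lstm_out` is the same value as the final short-term memory `h_n`, which is why `forward()` can simply return `lstm_out[-1]`.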

Now let's create the model and print out the initial Weights and Biases and predictions.

In [14]:
model = LightningLSTM()

print("Before optimization, the parameters are:")
for name, param in model.named_parameters():
    print(name, param.data)

print("\nObserved and predicted values:")
print(
    "Company A: Observed = 0, Predicted =",
    model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

Seed set to 42


Before optimization, the parameters are:
lstm.weight_ih_l0 tensor([[ 0.7645],
        [ 0.8300],
        [-0.2343],
        [ 0.9186]])
lstm.weight_hh_l0 tensor([[-0.2191],
        [ 0.2018],
        [-0.4869],
        [ 0.5873]])
lstm.bias_ih_l0 tensor([ 0.8815, -0.7336,  0.8692,  0.1872])
lstm.bias_hh_l0 tensor([ 0.7388,  0.1354,  0.4822, -0.1412])

Observed and predicted values:
Company A: Observed = 0, Predicted = tensor([0.6675])
Company B: Observed = 1, Predicted = tensor([0.6665])


As expected, the predictions are bad, so we will train the model.

With the handmade LSTM and the default learning rate, 0.001, it took about 5000 epochs to fully train the model. Now, with the learning rate set to 0.1, we only need 300 epochs. Because this run is so much shorter, we also need to log more often than the default to get enough points to plot.

_Recall that one **epoch** is one full pass over all of the data. To complete one epoch, the model must see both *Company A* and *Company B*. Our **DataLoader** uses the default `batch_size=1`, so one "step" processes a single row. Therefore, in our setup, **2 steps = 1 epoch**._
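
The epoch-to-step arithmetic can be written down in a few lines. The helpers `total_steps()` and `logged_points()` below are names introduced for this sketch, and the count of logged points is only a rough estimate of how many values end up on each TensorBoard curve.

```python
import math


def total_steps(epochs, n_rows=2, batch_size=1):
    """Number of optimizer steps for a given number of epochs."""
    steps_per_epoch = math.ceil(n_rows / batch_size)
    return epochs * steps_per_epoch


def logged_points(epochs, log_every_n_steps):
    """Rough number of values written to the log during training."""
    return total_steps(epochs) // log_every_n_steps


for epochs in (2000, 3000, 300):
    print(f"{epochs} epochs -> {total_steps(epochs)} steps")

print("300 epochs, logging every 50 steps:", logged_points(300, 50), "points")
print("300 epochs, logging every 2 steps:", logged_points(300, 2), "points")
```

The step counts for 2000 and 3000 epochs match the checkpoint file names seen earlier (`epoch=1999-step=4000.ckpt` and `epoch=2999-step=6000.ckpt`), and the last two lines show why the default logging interval would leave us with very few points for a 300-epoch run.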

In [15]:
# Because we are doing so few epochs, we have to tell the trainer to log
# every 2 steps (i.e., once per epoch, since we have two rows of training data).
# By default, the log files are only updated every 50 steps.
trainer = L.Trainer(max_epochs=300, log_every_n_steps=2)

trainer.fit(model, train_dataloaders=dataloader)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name | Type | Params | Mode 
--------------------------------------
0 | lstm | LSTM | 16     | train
--------------------------------------
16        Trainable params
0         Non-trainable params
16        Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=300` reached.


In [16]:
print("After optimization, the parameters are:")
for name, param in model.named_parameters():
    print(name, param.data)

After optimization, the parameters are:
lstm.weight_ih_l0 tensor([[3.5364],
        [1.3869],
        [1.5390],
        [1.2488]])
lstm.weight_hh_l0 tensor([[5.2070],
        [2.9577],
        [3.2652],
        [2.0678]])
lstm.bias_ih_l0 tensor([-0.9143,  0.3724, -0.1815,  0.6376])
lstm.bias_hh_l0 tensor([-1.0570,  1.2414, -0.5685,  0.3092])


Now that training is done, let's print out the new predictions...

In [17]:
print("New observed and predicted values (after 300 total epochs; lr=0.1):")
print(
    "Company A: Observed = 0, Predicted =",
    model(torch.tensor([0., 0.5, 0.25, 1.])).detach())
print(
    "Company B: Observed = 1, Predicted =",
    model(torch.tensor([1., 0.5, 0.25, 1.])).detach())

New observed and predicted values (after 300 total epochs; lr=0.1):
Company A: Observed = 0, Predicted = tensor([6.8118e-05])
Company B: Observed = 1, Predicted = tensor([0.9809])


As we can see, after just **300** epochs, the LSTM is making great predictions. The prediction for *Company A* is close to the observed value **0** and the prediction for *Company B* is close to the observed value **1**.
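
To connect `nn.LSTM()` back to the unit we built by hand, we can recompute these predictions from the printed parameters with plain Python. The sketch relies on PyTorch's documented layout, in which the rows of each parameter tensor are ordered input gate, forget gate, cell (potential memory), output gate, and the two bias vectors are simply added together. The values are the rounded ones from the printout above, so the results will only match approximately. The function name `predict()` is introduced here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Optimized nn.LSTM parameters copied (rounded) from the printout above.
# Row order: input gate, forget gate, cell (potential memory), output gate
weight_ih = [3.5364, 1.3869, 1.5390, 1.2488]   # weights on the input value
weight_hh = [5.2070, 2.9577, 3.2652, 2.0678]   # weights on the short-term memory
bias_ih = [-0.9143, 0.3724, -0.1815, 0.6376]
bias_hh = [-1.0570, 1.2414, -0.5685, 0.3092]


def predict(days):
    long_memory, short_memory = 0.0, 0.0
    for x in days:
        # Both bias vectors are added, so each gate effectively has one bias
        z = [weight_ih[k] * x + weight_hh[k] * short_memory + bias_ih[k] + bias_hh[k]
             for k in range(4)]
        keep = sigmoid(z[0])         # like potential_remember_percent
        remember = sigmoid(z[1])     # like long_remember_percent
        potential = math.tanh(z[2])  # like potential_memory
        output = sigmoid(z[3])       # like output_percent
        long_memory = remember * long_memory + keep * potential
        short_memory = output * math.tanh(long_memory)
    return short_memory


pred_a = predict([0.0, 0.5, 0.25, 1.0])
pred_b = predict([1.0, 0.5, 0.25, 1.0])
print(f"Company A: {pred_a:.4f}")
print(f"Company B: {pred_b:.4f}")
```

This also accounts for the difference in parameter counts reported by Lightning: the handmade unit has 12 trainable values (2 Weights and 1 Bias per gate), while `nn.LSTM()` has 16 because it stores 2 Biases per gate.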

Lastly, let's refresh the **TensorBoard** page to see the latest graphs.

_To make it easier to see what we just did, deselect `version_0`, `version_1` and `version_2` and make sure `version_3` is checked on the left-hand side of the page, under where it says `Runs`. See below. This allows us to just look at the log files from the most recent training, which only went for **300** epochs._

<img src="./images/selecting_run_version_3.png" alt="Loss" style="width: 150px;">

In all three graphs, the loss (`train_loss`) and the predictions for *Company A* (`out_0`) and *Company B* (`out_1`) started to taper off after **500** steps, or just **250** epochs, suggesting that adding more epochs may not improve the predictions much, so we're done.

<img src="./images/train_loss_nn.lstm_300_epochs.png" alt="Loss" style="width: 300px;"><img src="./images/out_0_nn.lstm_300_epochs.png" alt="Loss" style="width: 300px;"><img src="./images/out_1_nn.lstm_300_epochs.png" alt="Loss" style="width: 300px;">