# LSTM Structure and Hidden State

We know that RNNs maintain a kind of memory by linking the output of one node to the input of the next. In an LSTM, each piece of data in a sequence (say, each word in a given sentence) has a corresponding *hidden state* $h_t$. This hidden state is a function of the pieces of data that the LSTM has seen over time; it is computed with the LSTM's learned weights and, together with the cell memory, represents the short-term and long-term memory of the data that the LSTM has already seen.

So, for an LSTM that is looking at words in a sentence, **the hidden state of the LSTM will change based on each new word it sees. And, we can use the hidden state to predict the next word in a sequence** or help identify the type of word in a language model, and lots of other things!


## LSTMs in PyTorch

To create and train an LSTM, you have to know how to structure its inputs and hidden state. In PyTorch, an LSTM can be defined as: `lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)`.

In PyTorch, an LSTM expects all of its inputs to be 3D tensors. The constructor arguments are defined as follows:
>* `input_size` = the number of values in the input at each time step (an `input_size` of 20 means each time step supplies 20 input values)
>* `hidden_size` = the size of the hidden state; this will be the number of outputs that each LSTM cell produces at each time step.
>* `num_layers` = the number of stacked LSTM layers to use; this is typically a value between 1 and 3; a value of 1 means a single LSTM layer, so there is one hidden state per time step. This has a default value of 1.

<img src='./images/lstm_simple_ex.png'>
    
### Hidden State

Once an LSTM has been defined with input and hidden dimensions, we can call it and retrieve the output and hidden state at every time step: `out, hidden = lstm(input.view(1, 1, -1), (h0, c0))`.

The returned `hidden` is a tuple made up of the new short-term memory (the hidden state) and the new long-term memory (the cell memory). When we pass in `h0` and `c0`, we are simply supplying the initial values of these two components.

Why can the new short-term memory be the same as `out`? A single LSTM layer does not represent a complete model; it is more like a complex hidden layer, and in the examples below we have not said how to process the activations coming from it. Hence, when we look at them one time step at a time, `out` and the first part of `hidden` (the new short-term memory) are one and the same in this example.

The inputs to an LSTM are **`(input, (h0, c0))`**.
>* `input` = a Tensor containing the values in an input sequence; this has shape (seq_len, batch, input_size)
>* `h0` = a Tensor containing the initial hidden state (short term memory) for each element in a batch
>* `c0` = a Tensor containing the initial cell memory (long term memory) for each element in the batch

`h0` and `c0` default to tensors of zeros if they are not specified. Their dimensions are: (n_layers, batch, hidden_dim).
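As a quick check of these shapes, here is a minimal standalone sketch. The sizes (a batch of 2, sequence length 5) and the `demo_` variable names are assumptions chosen only for illustration; the sketch compares passing explicit zero states against passing no initial state at all.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the notebook's later cells)
seq_len, batch, input_size, hidden_size, num_layers = 5, 2, 4, 3, 1
demo_lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)

demo_input = torch.randn(seq_len, batch, input_size)  # (seq_len, batch, input_size)

# Explicit zero initial states, shaped (num_layers, batch, hidden_size)
demo_h0 = torch.zeros(num_layers, batch, hidden_size)
demo_c0 = torch.zeros(num_layers, batch, hidden_size)

out_explicit, (h_n, c_n) = demo_lstm(demo_input, (demo_h0, demo_c0))
out_default, _ = demo_lstm(demo_input)  # no initial state supplied: zeros are used

print(out_explicit.shape)      # torch.Size([5, 2, 3])
print(h_n.shape, c_n.shape)    # torch.Size([1, 2, 3]) torch.Size([1, 2, 3])
print(torch.allclose(out_explicit, out_default))
```

The output tensor has one row of `hidden_size` values per time step and batch element, while `h_n` and `c_n` only hold the state after the last time step.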

These will become clearer in the example in this notebook. This and the following notebook are modified versions of [this PyTorch LSTM tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#lstm-s-in-pytorch).

Let's take a simple example and say we want to process a single sentence through an LSTM. If we want to run the sequence model over one sentence "Giraffes in a field", our input should look like this `1x4` row vector of individual words:

\begin{align}\begin{bmatrix}
   \text{Giraffes  } 
   \text{in  } 
   \text{a  } 
   \text{field} 
   \end{bmatrix}\end{align}

In this case, we know that we have **4 input words**, and we decide how many outputs to generate at each time step; say we want each LSTM cell to generate **3 hidden state values**. We'll keep the number of layers in our LSTM at the default size of 1.

The hidden state and cell memory will have dimensions (n_layers, batch, hidden_dim), and in this case that will be (1, 1, 3) for a 1 layer model with one batch/sequence of words to process (this one sentence) and 3 generated hidden state values.
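To make these shapes concrete, here is a hedged sketch of how the sentence could be fed to an LSTM. It assumes each word is first turned into a vector with `nn.Embedding`; the toy vocabulary `word_to_ix`, the embedding size of 4, and the variable names are all hypothetical choices for illustration, not something defined elsewhere in this notebook.

```python
import torch
import torch.nn as nn

sentence = "Giraffes in a field".split()

# Hypothetical toy vocabulary: map each word to an integer index
word_to_ix = {word: ix for ix, word in enumerate(sentence)}
indices = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)

embedding_dim, hidden_dim = 4, 3  # assumed sizes for illustration
embed = nn.Embedding(num_embeddings=len(word_to_ix), embedding_dim=embedding_dim)
sentence_lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)

# (4,) word indices -> (4, embedding_dim) -> (seq_len=4, batch=1, input_size=4)
word_vectors = embed(indices).view(len(sentence), 1, -1)

sent_out, (sent_h, sent_c) = sentence_lstm(word_vectors)
print(word_vectors.shape)          # torch.Size([4, 1, 4])
print(sent_out.shape)              # torch.Size([4, 1, 3])
print(sent_h.shape, sent_c.shape)  # torch.Size([1, 1, 3]) torch.Size([1, 1, 3])
```

Here the sequence length (4 words) appears in the first dimension of the input and output, while the hidden state and cell memory keep the (1, 1, 3) shape described above.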


### Example Code

Next, let's see an example of one LSTM that is designed to look at a sequence of 4 values (numerical values since those are easiest to create and track) and generate 3 values as output. This is what the sentence processing network from above will look like, and you are encouraged to change these input/hidden-state sizes to see the effect on the structure of the LSTM!

In [27]:
from typing import List
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

%matplotlib inline

torch.manual_seed(2) # so that random variables will be consistent and repeatable for testing

<torch._C.Generator at 0x21fb0103f90>

### Define a simple LSTM


**A note on hidden and output dimensions**

The `hidden_dim` and the size of the output will be the same unless you define your own model and change the number of outputs by adding a linear layer at the end of the network, e.g. `fc = nn.Linear(hidden_dim, output_dim)`.

> As we will see, a single call to the `nn.LSTM` layer spans all of the time steps in an example sequence; intuitively, it represents an unrolled model.
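A minimal sketch of that idea follows. The class name `LSTMWithLinear` and the sizes are hypothetical, chosen only to show how a linear layer changes the size of the last dimension.

```python
import torch
import torch.nn as nn

class LSTMWithLinear(nn.Module):
    """Hypothetical model: an LSTM followed by a linear layer."""

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(x)   # (seq_len, batch, hidden_dim)
        return self.fc(lstm_out)     # (seq_len, batch, output_dim)

model = LSTMWithLinear(input_dim=4, hidden_dim=3, output_dim=2)
scores = model(torch.randn(5, 1, 4))  # 5 time steps, batch of 1, 4 input values
print(scores.shape)  # torch.Size([5, 1, 2])
```

The linear layer is applied to the last dimension of the LSTM output, so every time step gets its own vector of `output_dim` values.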

###  Note about the weights of the LSTM

`LSTM.weight_ih_l[k]`: the learnable input-hidden weights of the $k^{th}$ layer (W_ii|W_if|W_ig|W_io), of shape `(4*hidden_size, input_size)` for k = 0. Otherwise, the shape is `(4*hidden_size, num_directions * hidden_size)`. If `proj_size > 0` was specified, the shape will be `(4*hidden_size, num_directions * proj_size)` for k > 0.

`LSTM.weight_hh_l[k]`: the learnable hidden-hidden weights of the $k^{th}$ layer (W_hi|W_hf|W_hg|W_ho), of shape `(4*hidden_size, hidden_size)`. If `proj_size > 0` was specified, the shape will be `(4*hidden_size, proj_size)`.

The index $k$ is greater than 0 only when we stack LSTM layers, which we are not doing in this notebook, so here k = 0. Notice also that the four per-gate weight matrices are stacked vertically into a single compact matrix, which is why the first dimension is `4*hidden_size`.
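To see what the vertical stacking means in practice, here is a standalone sketch that splits the stacked matrices back into four per-gate blocks and recomputes one LSTM step by hand. It assumes the gate order (input, forget, cell candidate, output) given in the weight descriptions above and uses the standard LSTM update equations; the variable names are illustrative.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 4, 3
one_layer = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1)

x_t = torch.randn(1, 1, input_size)       # one time step, batch of 1
h_prev = torch.randn(1, 1, hidden_size)   # previous short term memory
c_prev = torch.randn(1, 1, hidden_size)   # previous long term memory

# Split each stacked (4*hidden_size, ...) tensor into four gate blocks: i, f, g, o
W_ii, W_if, W_ig, W_io = one_layer.weight_ih_l0.chunk(4, dim=0)
W_hi, W_hf, W_hg, W_ho = one_layer.weight_hh_l0.chunk(4, dim=0)
b_ii, b_if, b_ig, b_io = one_layer.bias_ih_l0.chunk(4, dim=0)
b_hi, b_hf, b_hg, b_ho = one_layer.bias_hh_l0.chunk(4, dim=0)

x, h, c = x_t[0], h_prev[0], c_prev[0]    # drop leading dim -> (batch, features)
i = torch.sigmoid(x @ W_ii.T + b_ii + h @ W_hi.T + b_hi)  # input gate
f = torch.sigmoid(x @ W_if.T + b_if + h @ W_hf.T + b_hf)  # forget gate
g = torch.tanh(x @ W_ig.T + b_ig + h @ W_hg.T + b_hg)     # candidate cell values
o = torch.sigmoid(x @ W_io.T + b_io + h @ W_ho.T + b_ho)  # output gate
c_manual = f * c + i * g
h_manual = o * torch.tanh(c_manual)

_, (h_n, c_n) = one_layer(x_t, (h_prev, c_prev))
print(torch.allclose(h_manual, h_n[0], atol=1e-5))
print(torch.allclose(c_manual, c_n[0], atol=1e-5))

# With stacked layers, k = 1 exists and its input is the hidden state of layer 0
stacked = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=2)
print(stacked.weight_ih_l1.shape)  # torch.Size([12, 3])
```

Each chunk has `hidden_size` rows, so the four blocks account for the `4*hidden_size` first dimension.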

In [28]:
# All of the tensors used here track gradients through autograd, so the deprecated
# Variable wrapper type is no longer needed. When a normal Tensor needs to be converted
# to a differentiable Tensor, we can use the requires_grad_ method of the Tensor class,
# whose boolean argument defaults to True.

# Defines an LSTM with an input dim of 4 and hidden dim of 3
# This expects to see 4 values as input and generates 3 values as output
batch_size: int = 1  # Because we have a single example input sequence
input_size: int = 4  # Input size is the length of the input vector at each time step of the sequence
hidden_size: int = 3 # Hidden size is analogous to having that number of neurons in a hidden layer
num_layers: int = 1  # We have a single LSTM layer in the vertical direction rather than stacking LSTMs

lstm: nn.LSTM = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers) 



# The four gate weight matrices are stacked along the vertical axis into one compact matrix,
# so we see the shape (3,4) | (3,4) | (3,4) | (3,4) -> (12, 4) for the weights connecting the
# input to the hidden state. Input has size 4 and hidden size is 3.
print(f'LSTM layer input to hidden parameters shape (W_x): {lstm.weight_ih_l0.shape}')
# Similarly, the weights connecting the hidden state of the previous time step to the hidden
# state of the current time step have shape (3,3) | (3,3) | (3,3) | (3,3) -> (12, 3).
print(f'LSTM layer hidden to hidden parameters shape (W_x): {lstm.weight_hh_l0.shape}')

# Makes an input sequence having length 5 i.e., it has 5 time steps, where each time step has an 
# input of size R^input_size. This is important to remember, that we have a single input sequence 
# or in other words, a single training example in a single batch, hence batch_size = 1.
input_sequence: List[torch.Tensor] = [torch.randn(1, input_size, requires_grad=True) for _ in range(5)]
print('inputs: \n', input_sequence)
print('\n')


# Next we initialize the initial states h0 and c0. Here, we need to keep in mind what
# these vectors really are. h0, or generally h_t, is our short term memory (STM), which
# also leads to the output for a given time step. The other component c0, or generally
# c_t, is the long term memory (LTM).

# The LSTM object returns two things, which we refer to as out and hidden. out holds the
# short term memory h_t for each time step that was processed. hidden is a tuple of
# two things: the final hidden state and the final long term memory. In this example we
# have a single LSTM layer and we have not said what to do with the final hidden state
# (think of it as a dense hidden layer activation). Hence, in the following iteration
# over our five time steps, out and the first part of hidden are one and the same.
# We have various options here, which we'd explore later.
# For now, we have to keep these semantics in mind.

# Initialize the hidden state
# (1 layer, 1 batch_size, 3 outputs)
# first tensor is the hidden state, h0
# second tensor initializes the cell memory, c0
h0: torch.Tensor = torch.randn(1, 1, hidden_size, requires_grad=True)
c0: torch.Tensor = torch.randn(1, 1, hidden_size, requires_grad=True)

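# Step through the sequence one element at a time.
for timestep_item in input_sequence:
    # After each step, hidden contains the hidden state and cell memory.
    # Note: (h0, c0) is passed at every step, so each step starts from the same initial
    # state. To carry memory across steps, pass the returned hidden back in instead.
    out, hidden = lstm(timestep_item.view(1, 1, -1), (h0, c0))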
    print('out: \n', out)
    print('hidden: \n', hidden)

LSTM layer input to hidden parameters shape (W_x): torch.Size([12, 4])
LSTM layer hidden to hidden parameters shape (W_x): torch.Size([12, 3])
inputs: 
 [tensor([[1.4934, 0.4987, 0.2319, 1.1746]], requires_grad=True), tensor([[-1.3967,  0.8998,  1.0956, -0.5231]], requires_grad=True), tensor([[-0.8462, -0.9946,  0.6311,  0.5327]], requires_grad=True), tensor([[-0.8454,  0.9406, -2.1224,  0.0233]], requires_grad=True), tensor([[ 0.4836,  1.2895,  0.8957, -0.2465]], requires_grad=True)]


out: 
 tensor([[[-0.4372,  0.2583,  0.2947]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.4372,  0.2583,  0.2947]]], grad_fn=<StackBackward0>), tensor([[[-0.7344,  0.6209,  0.4191]]], grad_fn=<StackBackward0>))
out: 
 tensor([[[-0.2836,  0.1314,  0.4133]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.2836,  0.1314,  0.4133]]], grad_fn=<StackBackward0>), tensor([[[-0.5041,  0.2672,  0.6370]]], grad_fn=<StackBackward0>))
out: 
 tensor([[[-0.3404,  0.4880,  0.1949]]], grad_fn=<StackBackward0>)

You should see that the output and hidden Tensors are always of length 3, which we specified when we defined the LSTM with `hidden_size`. 

### All at once

A for loop is not very efficient for large sequences of data, so we can also **process all of these inputs at once.** Instead of processing each time step one by one, we prepare the full input sequence by combining the inputs for all time steps.

1. Concatenate all of our time step inputs $x_t$ into one big tensor with a defined batch_size, which in this case is still 1.
2. Define the initial hidden state, whose shape stays the same as before.
3. Get the outputs and the *most recent* hidden state (created after the last word in the sequence has been seen).


The outputs may look slightly different due to our differently initialized hidden state.
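Processing the whole sequence in one call gives the same result as a step-by-step loop, provided the loop feeds the state returned by each step into the next step. The standalone sketch below illustrates this under assumed sizes; the `demo_` and `carried_` names are illustrative and independent of the notebook's own variables.

```python
import torch
import torch.nn as nn

demo_lstm = nn.LSTM(input_size=4, hidden_size=3, num_layers=1)
demo_steps = [torch.randn(1, 4) for _ in range(5)]  # 5 time steps, 4 values each
demo_h0 = torch.randn(1, 1, 3)
demo_c0 = torch.randn(1, 1, 3)

# Step by step: the state returned by one step is the input state of the next
carried_state = (demo_h0, demo_c0)
step_outputs = []
for x_t in demo_steps:
    out_t, carried_state = demo_lstm(x_t.view(1, 1, -1), carried_state)
    step_outputs.append(out_t)
looped_out = torch.cat(step_outputs)  # (5, 1, 3)

# All at once: one tensor of shape (seq_len, batch, input_size)
full_sequence = torch.cat(demo_steps).view(len(demo_steps), 1, -1)
full_out, full_state = demo_lstm(full_sequence, (demo_h0, demo_c0))

print(looped_out.shape)  # torch.Size([5, 1, 3])
print(torch.allclose(looped_out, full_out, atol=1e-6))
print(torch.allclose(carried_state[0], full_state[0], atol=1e-6))
```

A loop that passes the same initial state at every step would not match, because each step would then forget what came before it.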

In [29]:
# Turn inputs into a tensor with 5 rows of data
# add the extra 2nd dimension (1) for batch_size
# This inputs array represents a complete sequence spread over 5 time steps together.
inputs: torch.Tensor = torch.cat(input_sequence).view(len(input_sequence), batch_size, -1).requires_grad_()

# print out our inputs and their shape
# you should see (sequence length, batch size, input_dim)
print('inputs size: \n', inputs.size())
print('\n')

print('inputs: \n', inputs)
print('\n')

# initialize the hidden state
h0: torch.Tensor = torch.randn(1, 1, hidden_size, requires_grad=True)
c0: torch.Tensor = torch.randn(1, 1, hidden_size, requires_grad=True)


# get the outputs and hidden state
out, hidden = lstm(inputs, (h0, c0))

print('out: \n', out)
print('hidden: \n', hidden)

inputs size: 
 torch.Size([5, 1, 4])


inputs: 
 tensor([[[ 1.4934,  0.4987,  0.2319,  1.1746]],

        [[-1.3967,  0.8998,  1.0956, -0.5231]],

        [[-0.8462, -0.9946,  0.6311,  0.5327]],

        [[-0.8454,  0.9406, -2.1224,  0.0233]],

        [[ 0.4836,  1.2895,  0.8957, -0.2465]]], grad_fn=<ViewBackward0>)


out: 
 tensor([[[ 0.1611,  0.2200,  0.2213]],

        [[ 0.0364, -0.0390,  0.2638]],

        [[-0.1425, -0.0174,  0.1504]],

        [[-0.1583,  0.1264,  0.1709]],

        [[-0.2007, -0.1559,  0.2489]]], grad_fn=<StackBackward0>)
hidden: 
 (tensor([[[-0.2007, -0.1559,  0.2489]]], grad_fn=<StackBackward0>), tensor([[[-0.4429, -0.2975,  0.3252]]], grad_fn=<StackBackward0>))


Looking at `out` and `hidden`, we now see that the last row of the `out` tensor is exactly the same as the first part of the final hidden state. This is expected, because the first part of the final hidden state is the short-term memory from the last time step. As mentioned before, since we have a single LSTM layer, the `out` for a given time step is the same as the short-term memory for that time step, and the last row of `out` belongs to the last time step.

Now, we'll bring this knowledge to the next notebook, where we'll consider a concrete LSTM training scenario.
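The single-layer condition matters here. As a hedged standalone sketch (sizes and names are assumptions for illustration), with two stacked layers the hidden state holds one entry per layer, and the last row of `out` corresponds to the final hidden state of the last layer:

```python
import torch
import torch.nn as nn

two_layer = nn.LSTM(input_size=4, hidden_size=3, num_layers=2)
demo_seq = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)

seq_out, (final_h, final_c) = two_layer(demo_seq)

print(seq_out.shape)   # torch.Size([5, 1, 3])
print(final_h.shape)   # torch.Size([2, 1, 3]), one hidden state per layer
# Last time step of the output vs. final hidden state of the last layer
print(torch.allclose(seq_out[-1], final_h[-1]))
```

With `num_layers=1`, the first and last entries of the hidden state are the same tensor row, which is why the comparison in this notebook works with the first part of `hidden` directly.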

### Next: Hidden State and Gates

This notebook shows you the structure of the input and output of an LSTM in PyTorch. Next, you'll learn more about how exactly an LSTM represents long-term and short-term memory in its hidden state, and you'll reach the next notebook exercise.

#### Part of Speech

In the notebook that comes later in this lesson, you'll see how to define a model to tag parts of speech (nouns, verbs, determiners), combine an LSTM with a Linear layer to set the desired output size, *and* finally train the model to produce a distribution of class scores that associates each input word with a part of speech.