### SEQUENCE MODELS AND LONG SHORT-TERM MEMORY NETWORKS

Feed-forward networks maintain no state at all. Sequence models, by contrast, capture dependence between inputs over time. Classical examples of sequence models are the Hidden Markov Model for part-of-speech tagging and the conditional random field.

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.
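To make the idea of carried state concrete, here is a minimal sketch of a plain (Elman-style) recurrent update, not an LSTM. The weight names `W_xh`, `W_hh`, `b_h`, the helper `rnn_step`, and the sizes are all assumptions chosen for illustration; the point is only that each new state is computed from the current input and the previous state.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 3, 4  # arbitrary sizes for illustration

# Hypothetical, randomly initialised weights for a plain recurrent cell.
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """One recurrent update: the new state depends on the input and the previous state."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)


sequence = [torch.randn(input_size) for _ in range(5)]
h = torch.zeros(hidden_size)  # initial state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h)  # h carries information forward to the next step
    states.append(h)

print(len(states), states[-1].shape)
```

An LSTM follows the same pattern but replaces the single `tanh` update with gated updates and carries an additional cell state alongside $h_t$.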

#### LSTMs in PyTorch

PyTorch’s LSTM expects all of its inputs to be 3D tensors.
* The first axis is the sequence itself.
* The second indexes instances in the mini-batch.
* The third indexes elements of the input.

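> Isn't it like [batch_size, num_rows, num_cols]? By default, no: `nn.LSTM` expects `[seq_len, batch_size, input_size]`. Passing `batch_first=True` to the constructor switches the input and output layout to `[batch_size, seq_len, input_size]`.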
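The axis order can be checked directly by inspecting shapes. The sketch below uses arbitrary sizes (a sequence of 5 steps, a batch of 2, input size 3, hidden size 4) and compares the default layout with the `batch_first=True` option of `nn.LSTM`.

```python
import torch
import torch.nn as nn

seq_len, batch_size, input_size, hidden_size = 5, 2, 3, 4  # arbitrary sizes

# Default layout: (seq_len, batch, input_size)
lstm = nn.LSTM(input_size, hidden_size)
x = torch.randn(seq_len, batch_size, input_size)
out, (h_n, c_n) = lstm(x)  # initial states default to zeros when omitted
print(out.shape)  # (seq_len, batch, hidden_size)
print(h_n.shape)  # (num_layers, batch, hidden_size)

# Batch-first layout: (batch, seq_len, input_size)
lstm_bf = nn.LSTM(input_size, hidden_size, batch_first=True)
x_bf = x.transpose(0, 1)
out_bf, (h_bf, c_bf) = lstm_bf(x_bf)
print(out_bf.shape)  # (batch, seq_len, hidden_size)
print(h_bf.shape)    # still (num_layers, batch, hidden_size)
```

Note that `batch_first` only affects the input and output tensors; the hidden and cell states keep the same layout in both cases.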


We haven’t discussed mini-batching, so let’s just ignore that and assume we will always have just 1 dimension on the second axis. If we want to run the sequence model over the sentence “The cow jumped”, our input should look like

$$
\begin{bmatrix}
\text{The} \\
\text{cow} \\
\text{jumped}
\end{bmatrix}
$$

Except remember there is an additional 2nd dimension with size 1.
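As a rough sketch of what such an input could look like as a tensor, the snippet below maps each word to a vector and reshapes the result to `(3, 1, embedding_dim)`. The toy vocabulary `word_to_ix`, the use of `nn.Embedding`, and the embedding size of 4 are assumptions made for illustration, not something the text above prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sentence = "The cow jumped".split()
word_to_ix = {word: i for i, word in enumerate(sentence)}  # hypothetical toy vocabulary
embedding_dim = 4  # arbitrary size chosen for illustration

embeddings = nn.Embedding(len(word_to_ix), embedding_dim)
indices = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)

vectors = embeddings(indices)                    # shape (3, 4): one row per word
lstm_input = vectors.view(len(sentence), 1, -1)  # shape (3, 1, 4): add the batch axis
print(lstm_input.shape)
```

The first axis walks over the three words, the second axis is the mini-batch of size 1, and the third axis holds the elements of each word vector.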

In addition, you could go through the sequence one at a time, in which case the 1st axis will have size 1 also.

Let’s see a quick example.

In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f6f586003d0>

In [24]:
lstm = nn.LSTM(3, 3)  # input dim is 3, hidden (output) dim is 3

In [25]:
# make a sequence of 5 tensors, each of shape (1, 3)
inputs = [torch.randn(1, 3) for _ in range(5)]
inputs

[tensor([[-0.1473,  0.3482,  1.1371]]),
 tensor([[-0.3339, -1.4724,  0.7296]]),
 tensor([[-0.1312, -0.6368,  1.0429]]),
 tensor([[ 0.4903,  1.0318, -0.5989]]),
 tensor([[ 1.6015, -1.0735, -1.2173]])]

In [26]:
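# initialize the hidden state h_0 and the cell state c_0,
# each of shape (num_layers, batch_size, hidden_size)
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))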
hidden

(tensor([[[ 0.6472, -0.0412, -0.1775]]]),
 tensor([[[-0.5000,  0.8673, -0.2732]]]))

In [27]:
inputs[0].view(1, 1, -1)

tensor([[[-0.1473,  0.3482,  1.1371]]])

In [28]:
for in_val in inputs:
    # Step through the sequence one element at a time (seq_len=1, batch_size=1).
    # nn.LSTM expects a 3D input, so reshape each (1, 3) tensor to (1, 1, 3).
    # After each step, hidden holds the updated (hidden state, cell state) pair.
    out, hidden = lstm(in_val.view(1, 1, -1), hidden)

In [29]:
out

tensor([[[-0.1077,  0.0289, -0.0487]]], grad_fn=<StackBackward>)

Alternatively, we can process the entire sequence at once. The first value returned by the LSTM is all of the hidden states throughout the sequence. The second is the most recent hidden state together with the most recent cell state (compare the last slice of `out` with the first element of `hidden` below; they are the same).

The reason for this is that `out` gives you access to all hidden states in the sequence, while `hidden` lets you continue the sequence and backpropagate by passing it as an argument to the LSTM at a later time.


In [30]:
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
inputs

tensor([[[-0.1473,  0.3482,  1.1371]],

        [[-0.3339, -1.4724,  0.7296]],

        [[-0.1312, -0.6368,  1.0429]],

        [[ 0.4903,  1.0318, -0.5989]],

        [[ 1.6015, -1.0735, -1.2173]]])

In [31]:
# reset the hidden and cell states before processing the whole sequence
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

In [32]:
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[ 0.2486, -0.0525, -0.2524]],

        [[ 0.1750, -0.0048, -0.1143]],

        [[-0.0102,  0.0536, -0.1400]],

        [[-0.0357,  0.0877, -0.0192]],

        [[ 0.2145,  0.0192, -0.0337]]], grad_fn=<StackBackward>)
(tensor([[[ 0.2145,  0.0192, -0.0337]]], grad_fn=<StackBackward>), tensor([[[ 0.2984,  0.0952, -0.1647]]], grad_fn=<StackBackward>))
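The relationship between `out` and `hidden` can also be checked programmatically rather than by eye. The following sketch, which builds its own small LSTM and random inputs (names such as `h0`, `c0`, `out_all`, and `out_steps` are illustrative), compares the last slice of the output with the final hidden state, and compares whole-sequence processing with step-by-step processing started from the same initial state.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3)
inputs = torch.randn(5, 1, 3)  # (seq_len, batch, input_size)
h0 = torch.randn(1, 1, 3)      # initial hidden state
c0 = torch.randn(1, 1, 3)      # initial cell state

# Process the whole sequence in one call.
out_all, (h_n, c_n) = lstm(inputs, (h0, c0))

# Process the same sequence one step at a time, threading the state through.
hidden = (h0, c0)
step_outs = []
for x_t in inputs:  # each x_t has shape (1, 3)
    out_t, hidden = lstm(x_t.view(1, 1, -1), hidden)
    step_outs.append(out_t)
out_steps = torch.cat(step_outs)  # shape (5, 1, 3)

# The last output slice should match the final hidden state,
# and both processing styles should agree up to floating-point tolerance.
same_last = torch.allclose(out_all[-1], h_n[0])
same_paths = torch.allclose(out_all, out_steps, atol=1e-6)
print(same_last, same_paths)
```

The step-by-step loop works because the returned state tuple is exactly what the next call needs as its starting point, which is also how a long sequence can be processed in chunks.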
