In [1]:
import torch as tc
from torch.autograd import Variable

# Sequence Models and Long Short-Term Memory Networks

This notebook shows how to use Long Short-Term Memory networks, or **LSTMs** for short.

Let $h_t$ be the hidden state at time `t`, 

$c_t$ be the cell state at time `t`,

$x_t$ be the input at time `t` (the hidden state of the previous layer, or $input_t$ for the first layer), and

$i_t$, $f_t$, $g_t$, $o_t$ be the input, forget, cell (candidate update), and output gates, respectively. Then,


### 1. Forget gate: 
$$f_t = \mathrm{sigmoid}(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf})$$
### 2.1 Input gate 1: Decide which elements to update
$$i_t = \mathrm{sigmoid}(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi})$$
### 2.2 Input gate 2: Compute the candidate values
$$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg})$$
### 3. Update the cell
$$c_t = f_t * c_{(t-1)} + i_t * g_t$$
### 4. Output
$$o_t = \mathrm{sigmoid}(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho})$$
$$h_t = o_t * \tanh(c_t)$$
        
#### Note: $\mathrm{sigmoid}$ outputs values between 0 and 1, so each gate acts as a soft element-wise filter on a vector, and $\tanh$ squashes its input to values between -1 and 1. Here `*` denotes element-wise multiplication.

<http://excelsior-cjh.tistory.com/entry/RNN-LSTMLong-Short-Term-Memory-networks>
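The gate equations above can be traced by hand with a small sketch. The following is a plain-Python illustration of a single time step, not how `tc.nn.LSTM` is implemented internally; the helper names `affine` and `lstm_step` and the `params` layout are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_i, b_i, W_h, b_h, x, h):
    """Return W_i x + b_i + W_h h + b_h as a list, one entry per hidden unit."""
    return [
        sum(w * v for w, v in zip(W_i[k], x)) + b_i[k]
        + sum(w * v for w, v in zip(W_h[k], h)) + b_h[k]
        for k in range(len(b_i))
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps a gate name ('f', 'i', 'g', 'o') to a tuple (W_i, b_i, W_h, b_h).
    This layout is a hypothetical choice for the sketch.
    """
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # candidate values
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    # Cell update: keep part of the old cell state, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: filtered, squashed cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Sanity check with all-zero weights (input size 3, hidden size 2):
# every sigmoid gate is then 0.5 and the candidate g is 0.
zeros = ([[0.0] * 3 for _ in range(2)], [0.0] * 2,
         [[0.0] * 2 for _ in range(2)], [0.0] * 2)
params = {gate: zeros for gate in "figo"}

h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -2.0], params)
print(c)  # [1.0, -1.0], i.e. half of the previous cell state
print(h)  # each entry is 0.5 * tanh of the matching cell entry
```

With zero weights the input has no effect, which makes the roles of the gates easy to see: the forget gate halves the old cell state, and the output gate halves what is exposed as the hidden state.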

## 1. Learn how to use `tc.nn.LSTM`

PyTorch's LSTM expects all of its inputs to be **3D tensors.** The meaning of each axis matters. **The first axis is the sequence itself (time steps), the second indexes instances in the mini-batch, and the third indexes elements of the input.** Here, we'll use a mini-batch size of 1, so the second dimension has size 1.
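As a quick illustration of the axis layout, the sketch below feeds a random batch through `tc.nn.LSTM` and prints the resulting shapes. The sizes (sequence length 5, batch size 2, input size 3, hidden size 4) are arbitrary values chosen for this example, and it assumes the default `batch_first=False`.

```python
import torch as tc

seq_len, batch_size, input_size, hidden_size = 5, 2, 3, 4
lstm = tc.nn.LSTM(input_size=input_size, hidden_size=hidden_size)

# Axis 0: time steps, axis 1: mini-batch instances, axis 2: input features.
x = tc.randn(seq_len, batch_size, input_size)

# When no initial state is passed, h_0 and c_0 default to zeros.
out, (h_n, c_n) = lstm(x)

print(out.shape)  # torch.Size([5, 2, 4]): one hidden state per time step
print(h_n.shape)  # torch.Size([1, 2, 4]): final hidden state per layer
print(c_n.shape)  # torch.Size([1, 2, 4]): final cell state per layer
```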

In [243]:
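tc.manual_seed(1)
lstm = tc.nn.LSTM(3, 3)  # input size 3, hidden size 3
inputs = [Variable(tc.randn(1,3)) for i in range(5)]
# hidden is a tuple of two elements: the initial hidden state h_0 and the initial cell state c_0.
hidden = (Variable(tc.randn(1,1,3)), Variable(tc.randn(1,1,3)))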

In [244]:
for i in inputs:
    out, hidden = lstm(i.view(1,1,-1), hidden)

After the loop, `out` is equal to the first element of `hidden`:

In [245]:
out == hidden[0]

Variable containing:
(0 ,.,.) = 
  1  1  1
[torch.ByteTensor of size 1x1x3]

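This is because `hidden` is a tuple of two elements: the latest hidden state and the latest cell state, both of which are passed on to the next time step. `out` contains the hidden state for every time step it was given; since each call above processes a single step, `out` equals `hidden[0]`.

You can also process the entire sequence at once, without a `for` loop, as follows: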

In [246]:
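tc.manual_seed(1)
inputs = [Variable(tc.randn(1,3)) for i in range(5)]
# The initial state must be a tuple (h_0, c_0), each of shape (num_layers, batch, hidden_size).
hidden = (Variable(tc.randn(1,1,3)), Variable(tc.randn(1,1,3)))

# Stack the five (1, 3) inputs into one (seq_len, batch, input_size) = (5, 1, 3) tensor.
inputs = tc.cat(inputs).view(len(inputs), 1, -1)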

In [247]:
out, hidden = lstm(inputs, hidden)
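To check that the two approaches agree, the sketch below runs the same sequence step by step and all at once, then compares the results. It is self-contained and uses its own freshly created `lstm`, inputs, and initial state, so the variable names here (`h0`, `c0`, `looped`, `full_out`) are assumptions for this example rather than names from the cells above.

```python
import torch as tc

tc.manual_seed(1)
lstm = tc.nn.LSTM(input_size=3, hidden_size=3)
inputs = tc.randn(5, 1, 3)   # (seq_len, batch, input_size)
h0 = tc.randn(1, 1, 3)       # initial hidden state
c0 = tc.randn(1, 1, 3)       # initial cell state

with tc.no_grad():
    # Approach 1: feed one time step at a time, carrying the state forward.
    hidden = (h0, c0)
    step_outputs = []
    for x_t in inputs:       # x_t has shape (1, 3)
        out, hidden = lstm(x_t.view(1, 1, -1), hidden)
        step_outputs.append(out)
    looped = tc.cat(step_outputs)   # back to shape (5, 1, 3)

    # Approach 2: feed the whole sequence in a single call.
    full_out, (h_n, c_n) = lstm(inputs, (h0, c0))

same_outputs = tc.allclose(looped, full_out, atol=1e-5)
last_is_h_n = tc.allclose(full_out[-1], h_n[0], atol=1e-5)
same_cell = tc.allclose(hidden[1], c_n, atol=1e-5)
print(same_outputs, last_is_h_n, same_cell)
```

The second comparison reflects the point made earlier: the last row of the per-step outputs is the final hidden state, while the cell state is only available through the returned tuple.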

## 2. LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to predict a part-of-speech tag for each word in a sentence.

In [232]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

word_to_idx = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}


EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [233]:
print(word_to_idx)

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
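Before a sentence can be fed to an embedding layer, its words have to be turned into a tensor of integer indices. A minimal sketch of that conversion follows; the helper name `prepare_sequence` is an assumption introduced for this example, and the dictionaries are copied from the cells above so the block runs on its own.

```python
import torch as tc

# Vocabulary and tag set copied from the cells above.
word_to_idx = {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4,
               'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

def prepare_sequence(seq, to_idx):
    """Map a list of tokens to a 1-D LongTensor of indices (hypothetical helper)."""
    return tc.tensor([to_idx[token] for token in seq], dtype=tc.long)

sentence_idxs = prepare_sequence("Everybody read that book".split(), word_to_idx)
tag_idxs = prepare_sequence(["NN", "V", "DET", "NN"], tag_to_ix)
print(sentence_idxs)  # tensor([5, 6, 7, 8])
print(tag_idxs)       # tensor([1, 2, 0, 1])
```

Note that the lookup is case-sensitive: `'The'` and `'the'` have different indices, and a word that is not in `word_to_idx` raises a `KeyError`.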


In [None]:
class LSTMTagger(tc.nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = tc.nn.Embedding(vocab_size, embedding_dim)
        
        self.lstm = tc.nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = tc.nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        return (Variable(tc.zeros(1,1,self.hidden_dim)), Variable(tc.zeros(1,1,self.hidden_dim)))
    
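    def forward(self, x):
        # x is a 1-D LongTensor of word indices for one sentence.
        embeds = self.word_embeddings(x)
        # Reshape to (seq_len, batch=1, embedding_dim) as the LSTM expects.
        out, self.hidden = self.lstm(embeds.view(len(x), 1, -1), self.hidden)
        # Project each hidden state to tag scores, then take log-probabilities over tags.
        tag_space = self.hidden2tag(out.view(len(x), -1))
        tag_scores = tc.nn.functional.log_softmax(tag_space, dim=1)
        return tag_scores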