[This PyTorch tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html) shows how to train an LSTM to predict a part-of-speech tag (noun, verb, etc.) for each word in a sentence. The inputs are words and the outputs are tags.

Below is a summary.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]

Both the sentence and the tags are first encoded as integers:  

- inputs: `"The dog ate the apple"` --> `["The", "dog", "ate", "the", "apple"]` --> `[0, 1, 2, 3, 4]`
- targets: `["DET", "NN", "V", "DET", "NN"]` --> `[0, 1, 2, 0, 1]`

The number of distinct words gives `vocab_size`, and the number of distinct tags gives `tagset_size`.
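
As a minimal sketch of this encoding in plain Python (the helper name `build_index` is hypothetical and not part of the tutorial), each new token is assigned the next unused integer in order of first appearance. Note that `"The"` and `"the"` are different strings, so they receive different indices.

```python
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]


def build_index(sequences):
    """Map each distinct token to the next unused integer (hypothetical helper)."""
    index = {}
    for seq in sequences:
        for token in seq:
            # setdefault only inserts when the token has not been seen yet
            index.setdefault(token, len(index))
    return index


word_to_ix = build_index(sent for sent, _ in training_data)
tag_to_ix = build_index(tags for _, tags in training_data)

vocab_size = len(word_to_ix)   # number of distinct words
tagset_size = len(tag_to_ix)   # number of distinct tags

first_sentence, first_tags = training_data[0]
encoded_words = [word_to_ix[w] for w in first_sentence]
encoded_tags = [tag_to_ix[t] for t in first_tags]

print(vocab_size, tagset_size)  # 9 3
print(encoded_words)            # [0, 1, 2, 3, 4]
print(encoded_tags)             # [0, 1, 2, 0, 1]
```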

The LSTM model gives a tag to a word as follows: 

1. word as a string
2. encode word as integer in `[0, vocab_size)`
3. embed to `embedding_dim` real numbers 
4. lstm hidden state: `hidden_dim` real numbers 
5. linear layer: `tagset_size` real numbers
6. scores: `tagset_size` log-probabilities
7. tag = argmax(scores)

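In practice, the pipeline takes a whole sentence of `sequence_length` words. Every step is applied to each word independently (vectorized over the sentence) except the LSTM layer, which is recurrent: it processes the words in order and passes information about each word to the next through its hidden state. This is closer to a scan than a reduce, because a hidden state is emitted for every word rather than only at the end.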
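
The recurrence can be checked with a small sketch. It assumes a random tensor as a stand-in for the embedded sentence and illustrative sizes (`embedding_dim=4`, `hidden_dim=6`, five words); the variable names are not from the tutorial. Feeding the words to `nn.LSTM` one at a time while carrying the `(hidden, cell)` state forward should reproduce the output of a single call on the whole sentence, up to floating-point tolerance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embedding_dim, hidden_dim, sequence_length = 4, 6, 5
lstm = nn.LSTM(embedding_dim, hidden_dim)

# Stand-in for an embedded sentence: (sequence_length, batch=1, embedding_dim)
embeds = torch.randn(sequence_length, 1, embedding_dim)

with torch.no_grad():
    # One call over the whole sentence
    full_out, _ = lstm(embeds)

    # Same computation one word at a time, carrying the state forward
    state = None  # None means zero initial hidden and cell states
    step_outs = []
    for t in range(sequence_length):
        out_t, state = lstm(embeds[t:t + 1], state)
        step_outs.append(out_t)
    stepped_out = torch.cat(step_outs, dim=0)

outputs_match = torch.allclose(full_out, stepped_out, atol=1e-5)
print(full_out.shape, outputs_match)
```

The state returned at step `t` is the only channel through which word `t + 1` can see the earlier words, which is why this layer cannot be applied to each word independently.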

In [8]:
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}


def prepare_sequence(seq, to_ix):
    """Convert a sequence of words or tags to a tensor of integer indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

In [10]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim

        # Like a one-hot encoding, but each word maps to a dense real vector.
        # In: integers in [0, vocab_size)
        # Out: vectors in R^embedding_dim
        # The table holds `vocab_size * embedding_dim` learnable parameters;
        # they are part of model.parameters(), which is passed to the optimizer.
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Word embeddings to hidden states
        # In: R^embedding_dim,
        # Out: R^hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # Hidden states to tag space
        # In: R^hidden_dim
        # Out: R^tagset_size
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence_in):
        embeds = self.word_embeddings(sentence_in)
        lstm_out, _ = self.lstm(embeds.view(len(sentence_in), 1, -1))
        tag_space = self.hidden2tag(
            lstm_out.view(len(sentence_in), -1)
        )  # tagset_size numbers
        tag_scores = F.log_softmax(tag_space, dim=1)  # log-probabilities
        return tag_scores
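
The parameter count mentioned in the comments can be verified with a short sketch. It builds the three layers on their own, assuming the sizes of this toy problem (9 distinct words, 3 tags, `embedding_dim=4`, `hidden_dim=6`); the dictionary names are illustrative only.

```python
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim, tagset_size = 9, 4, 6, 3

layers = {
    "word_embeddings": nn.Embedding(vocab_size, embedding_dim),
    "lstm": nn.LSTM(embedding_dim, hidden_dim),
    "hidden2tag": nn.Linear(hidden_dim, tagset_size),
}

# Count the learnable scalars in each layer
param_counts = {
    name: sum(p.numel() for p in layer.parameters())
    for name, layer in layers.items()
}
print(param_counts)
# {'word_embeddings': 36, 'lstm': 288, 'hidden2tag': 21}
```

The embedding holds `9 * 4 = 36` values. The LSTM has four gates, each with input weights, hidden weights, and two bias vectors, giving `4 * 6 * (4 + 6) + 2 * 4 * 6 = 288`. The linear layer has `6 * 3 + 3 = 21`.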

In [31]:
model = LSTMTagger(
    embedding_dim=4,
    hidden_dim=6,
    vocab_size=len(word_to_ix),
    tagset_size=len(tag_to_ix),
)
loss_function = nn.NLLLoss()
sentence, tags = training_data[0]
sentence_in = prepare_sequence(sentence, word_to_ix)
targets = prepare_sequence(tags, tag_to_ix)

embeds = model.word_embeddings(sentence_in)
lstm_out = model.lstm(embeds)[0]
tag_space = model.hidden2tag(lstm_out)
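
`loss_function` is defined in this cell but not applied. As a hedged sketch (random scores stand in for `tag_space`, and the variable names are illustrative), `nn.NLLLoss` expects log-probabilities, which is why the model's `forward` ends in `log_softmax`. Applying it to log-softmax output gives the same value as applying `nn.CrossEntropyLoss` to the raw linear-layer scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the linear layer output: 5 words, 3 tag scores each
raw_scores = torch.randn(5, 3)
example_targets = torch.tensor([0, 1, 2, 0, 1])  # DET, NN, V, DET, NN

log_probs = F.log_softmax(raw_scores, dim=1)
nll = nn.NLLLoss()(log_probs, example_targets)

# Same value by hand: mean negative log-probability of each correct tag
manual = -log_probs[torch.arange(5), example_targets].mean()

# CrossEntropyLoss combines log_softmax and NLLLoss in one step
ce = nn.CrossEntropyLoss()(raw_scores, example_targets)

print(nll.item(), manual.item(), ce.item())
```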

In [37]:
print("1. Words and tags\n", sentence, tags)
print("2. Encoded words and tags\n", sentence_in, targets)
print("3. Embedding\n", embeds)
print("4. LSTM\n", lstm_out)
print("5. Linear layer\n", tag_space)
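# softmax gives probabilities; the model's forward() returns log_softmax of
# the same scores, and the argmax is identical in both cases.
print("6. Scores\n", F.softmax(tag_space, dim=1))
print("7. argmax\n", torch.argmax(F.softmax(tag_space, dim=1), dim=1))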

1. Words and tags
 ['The', 'dog', 'ate', 'the', 'apple'] ['DET', 'NN', 'V', 'DET', 'NN']
2. Encoded words and tags
 tensor([0, 1, 2, 3, 4]) tensor([0, 1, 2, 0, 1])
3. Embedding
 tensor([[-0.7992, -0.8164, -0.1888, -1.1254],
        [ 0.0066,  3.7635, -1.4803, -1.3627],
        [-1.9548,  0.6337,  1.7014,  1.2115],
        [ 0.5464, -1.6631,  0.1975, -2.0238],
        [-0.0664,  0.5113,  1.8811, -0.5149]], grad_fn=<EmbeddingBackward0>)
4. LSTM
 tensor([[-0.1046,  0.0785,  0.0024, -0.0414, -0.1322,  0.0199],
        [ 0.0335, -0.4348,  0.0360, -0.2240, -0.3378, -0.3356],
        [ 0.2729, -0.4089, -0.0230, -0.3448, -0.1964, -0.1101],
        [ 0.0023, -0.0265,  0.0568, -0.0335, -0.2466,  0.0274],
        [ 0.0989, -0.3172,  0.0404, -0.1156, -0.1845,  0.1310]],
       grad_fn=<SqueezeBackward1>)
5. Linear layer
 tensor([[-0.0543,  0.3550,  0.0950],
        [-0.0334,  0.4461,  0.0407],
        [ 0.1042,  0.4463, -0.0926],
        [-0.0433,  0.3911,  0.0404],
        [ 0.1101,  0.4131, -0.0