### An LSTM for Part-of-Speech Tagging

Let our input sentence be $w_1, \dots, w_M$, where $w_i \in V$, our vocabulary. Also, let $T$ be our tag set, and $y_i$ the tag of word $w_i$. Denote our prediction of the tag of word $w_i$ by $\hat{y}_i$.

This is a structure prediction model, where our output is a sequence $\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden state at timestep $i$ as $h_i$. Also, assign each tag a unique index (like how we had `word_to_ix` in the word embeddings section). Then our prediction rule for $\hat{y}_i$ is

$$ \hat{y}_i = \text{argmax}_j \ (\log \text{Softmax}(Ah_i + b))_j $$

> Are A and b the weights of the linear layer?

That is, take the log softmax of the affine map of the hidden state, and the predicted tag is the tag that has the maximum value in this vector. Note this implies immediately that the dimensionality of the target space of $A$ is $|T|$.
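
The prediction rule can be sketched in a few lines of PyTorch. This is a minimal illustration rather than part of the notebook: the sizes `hidden_dim = 6` and `num_tags = 3` and the random vector standing in for $h_i$ are assumptions chosen for the demo, and $A$ and $b$ are taken to be the weight and bias of an `nn.Linear` layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes for illustration; num_tags plays the role of |T|.
hidden_dim, num_tags = 6, 3

# nn.Linear stores A as .weight (shape |T| x hidden_dim) and b as .bias.
affine = nn.Linear(hidden_dim, num_tags)

# Random stand-in for the LSTM hidden state at timestep i.
h_i = torch.randn(hidden_dim)

scores = affine(h_i)                       # A h_i + b, one raw score per tag
log_probs = F.log_softmax(scores, dim=0)   # log softmax over the tag axis
predicted_tag = torch.argmax(log_probs).item()

# Log softmax is monotonic, so the argmax matches that of the raw scores.
assert predicted_tag == torch.argmax(scores).item()

print(affine.weight.shape)  # torch.Size([3, 6])
print(predicted_tag)
```

The shape of `affine.weight` makes the closing remark concrete: the rows of $A$ are indexed by tags, so its output dimension is $|T|$.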

In [36]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

In [20]:
# Prepare data: map a sequence of tokens (words or tags) to a tensor of indices.
def word2idx(seq, to_ix):
    idxs = [to_ix[w] for w in seq]

    return torch.tensor(idxs, dtype=torch.long)


In [21]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
training_data

[(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN']),
 (['Everybody', 'read', 'that', 'book'], ['NN', 'V', 'DET', 'NN'])]

In [22]:
dict_word2idx = {}

for sentence, tag in training_data:
    for word in sentence:
        if word not in dict_word2idx:
            dict_word2idx[word] = len(dict_word2idx)
print(dict_word2idx)

dict_tag2idx = {"DET": 0, "NN": 1, "V": 2}

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


In [23]:
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

> How do we decide the dimensions?

> Do both of them have to be the same?
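
On the two questions above: as far as `nn.LSTM` is concerned, the embedding size and the hidden size are separate arguments (`input_size` and `hidden_size`), so they do not have to match. The sizes themselves are hyperparameters that are usually chosen by experiment. The sketch below uses deliberately different, illustrative sizes (6 and 4) and an assumed vocabulary size of 9 to show the resulting shapes.

```python
import torch
import torch.nn as nn

embedding_dim, hidden_dim = 6, 4   # intentionally different (illustrative values)
vocab_size, seq_len = 9, 5         # assumed sizes for the demo

embed = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)  # input_size=6, hidden_size=4

word_ids = torch.tensor([0, 1, 2, 3, 4])

# By default nn.LSTM expects input of shape (seq_len, batch, input_size).
inputs = embed(word_ids).view(seq_len, 1, -1)
lstm_out, (h_n, c_n) = lstm(inputs)

print(inputs.shape)    # torch.Size([5, 1, 6])
print(lstm_out.shape)  # torch.Size([5, 1, 4])
```

Only the first argument of `nn.LSTM` has to agree with the embedding size, and only the input size of the final linear layer has to agree with the hidden size.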

In [28]:
class Model(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(Model, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.fc = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeddings = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeddings.view(len(sentence), 1, -1))
        tag_space = self.fc(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        
        return tag_scores

In [29]:
model = Model(EMBEDDING_DIM, HIDDEN_DIM, len(dict_word2idx), len(dict_tag2idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
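
`nn.NLLLoss` expects log-probabilities, which is why the model's `forward` ends with `F.log_softmax`. As a sanity check, the following sketch (with made-up scores and targets, not taken from the notebook) shows that this pairing gives the same value as `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

raw_scores = torch.randn(5, 3)            # made-up scores: 5 words, 3 tags
targets = torch.tensor([0, 1, 2, 0, 1])   # made-up gold tag indices

# Negative log-likelihood on log-probabilities ...
nll = nn.NLLLoss()(F.log_softmax(raw_scores, dim=1), targets)
# ... matches cross-entropy on the unnormalized scores.
ce = nn.CrossEntropyLoss()(raw_scores, targets)

print(torch.allclose(nll, ce))  # True
```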

In [41]:
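# See what the scores are before training.
# Note that element i,j of the output is the score for tag j for word i.
# (The output shown below matches the post-training scores, so this cell
# appears to have been re-run after the training loop.)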

with torch.no_grad():
    inputs = word2idx(training_data[0][0], dict_word2idx)
    print(inputs)
    tag_scores = model(inputs)
    print(tag_scores)
    for i in range(len(tag_scores)):
        print(np.argmax(tag_scores[i]))
        
    print("------------------------------------------------------------")
    embeddings = nn.Embedding(6, 6)
    embeds = embeddings(inputs)
    print(embeds)
    for i in range(len(embeds)):
        print(np.argmax(embeds[i]))

tensor([0, 1, 2, 3, 4])
tensor([[-0.0407, -3.7853, -4.0645],
        [-4.1793, -0.0685, -2.9776],
        [-3.9639, -3.1709, -0.0629],
        [-0.0136, -4.6197, -5.6138],
        [-4.5342, -0.0339, -3.7879]])
tensor(0)
tensor(1)
tensor(2)
tensor(0)
tensor(1)
------------------------------------------------------------
tensor([[ 0.3923, -0.0048, -1.3002, -1.2137, -0.6849, -0.1871],
        [-0.4928, -0.0336, -0.4821,  1.0056, -1.3088,  0.1777],
        [ 0.6634, -0.7987,  2.5142,  1.4382,  2.5683,  0.7818],
        [ 1.0874,  1.6349, -0.6058,  1.3048,  0.0794, -0.2434],
        [ 0.6373, -0.6516,  0.2432,  0.1030, -2.0268, -1.4972]])
tensor(0)
tensor(3)
tensor(4)
tensor(1)
tensor(0)
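
The tag scores printed above are log-probabilities, so exponentiating a row should give a probability distribution over the three tags. A small check using the first row of the printed output (the values are copied from above, so the sum is only approximately 1 because of rounding):

```python
import math

# First row of the tag_scores printed above (the word "The").
log_probs = [-0.0407, -3.7853, -4.0645]

# exp() turns log-probabilities back into probabilities.
probs = [math.exp(lp) for lp in log_probs]
total = sum(probs)

print([round(p, 3) for p in probs])  # [0.96, 0.023, 0.017]
print(round(total, 3))               # 1.0
```

Index 0 (`DET`) holds about 96% of the probability mass.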


In [32]:
for epoch in range(300):
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance.
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into tensors of word and tag indices.
        sentence_in = word2idx(sentence, dict_word2idx)
        targets = word2idx(tags, dict_tag2idx)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

In [38]:
# See what the scores are after training
with torch.no_grad():
    inputs = word2idx(training_data[0][0], dict_word2idx)
    tag_scores = model(inputs)

    # The sentence is "The dog ate the apple". Element i,j is the score for
    # tag j for word i. The predicted tag is the maximum-scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1,
    # since 0 is the index of the maximum value of row 1,
    # 1 is the index of the maximum value of row 2, etc.
    # That is DET NN V DET NN, the correct sequence!
    print(tag_scores)
    for i in range(len(tag_scores)):
        print(np.argmax(tag_scores[i]))

tensor([[-0.0407, -3.7853, -4.0645],
        [-4.1793, -0.0685, -2.9776],
        [-3.9639, -3.1709, -0.0629],
        [-0.0136, -4.6197, -5.6138],
        [-4.5342, -0.0339, -3.7879]])
tensor(0)
tensor(1)
tensor(2)
tensor(0)
tensor(1)
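
To read the prediction as tag names rather than indices, the tag dictionary can be inverted. A minimal sketch, where `idx2tag` and `predicted` are names introduced here for illustration and the score tensor is copied from the printed output above:

```python
import torch

dict_tag2idx = {"DET": 0, "NN": 1, "V": 2}
# Hypothetical helper: inverse mapping from index back to tag name.
idx2tag = {idx: tag for tag, idx in dict_tag2idx.items()}

# Scores copied from the output printed above.
tag_scores = torch.tensor([[-0.0407, -3.7853, -4.0645],
                           [-4.1793, -0.0685, -2.9776],
                           [-3.9639, -3.1709, -0.0629],
                           [-0.0136, -4.6197, -5.6138],
                           [-4.5342, -0.0339, -3.7879]])

# argmax over dim=1 picks the best tag index for every word at once.
predicted = [idx2tag[i] for i in tag_scores.argmax(dim=1).tolist()]
print(predicted)  # ['DET', 'NN', 'V', 'DET', 'NN']
```

Calling `argmax(dim=1)` on the whole tensor also replaces the per-row loop with a single operation.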
