# BiLSTM-CRF 

References:  
1. [Advanced: Making Dynamic Decisions and the Bi-LSTM CRF](https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html)  
2. [Implementing a linear-chain Conditional Random Field (CRF) in PyTorch](https://towardsdatascience.com/implementing-a-linear-chain-conditional-random-field-crf-in-pytorch-16b0b9c4b4ea)


## Linear-Chain Conditional Random Field (CRF) 

The source sequence is $x = \{x_1, x_2, \dots, x_T \}$, and the target sequence is $y = \{y_1, y_2, \dots, y_T \}$.  
If we ignore the dependencies among the elements of $y$, each label is predicted independently and $P(y|x)$ factorizes as:
$$
\begin{aligned}
P(y|x) &= \prod_{t=1}^T \frac{\exp \left( U(x_t, y_t) \right)}{\sum_{y'_t} \exp \left( U(x_t, y'_t) \right)} \\
&= \prod_{t=1}^T \frac{\exp \left( U(x_t, y_t) \right)}{Z(x_t)} \\
&= \frac{\exp \left( \sum_{t=1}^T U(x_t, y_t) \right)}{\prod_{t=1}^T Z(x_t)} \\
&= \frac{\exp \left( \sum_{t=1}^T U(x_t, y_t) \right)}{Z(x)}
\end{aligned}
$$
where $U(x_t, y_t)$ are the *emission scores* (also called *unary scores*), $Z(x_t)$ is the per-step *partition function* (a normalization factor), and $Z(x) = \prod_{t=1}^T Z(x_t)$.  
In a *linear-chain CRF*, we add *transition scores* $T(y_t, y_{t+1})$ to the above equation:  
$$
P(y|x) = \frac{\exp \left( \sum_{t=1}^T U(x_t, y_t) + \sum_{t=0}^T T(y_t, y_{t+1}) \right)}{Z(x)}
$$
where $y_0$ and $y_{T+1}$ are the starting and stopping tags, respectively; their values are fixed.  
The partition function sums over all possible label sequences, i.e., every combination of labels across the $T$ timesteps: 
$$
Z(x) = \sum_{y'_1} \sum_{y'_2} \dots \sum_{y'_T} \exp \left( \sum_{t=1}^T U(x_t, y'_t) + \sum_{t=0}^T T(y'_t, y'_{t+1}) \right)
$$

The *negative log-likelihood loss (NLL-Loss)* is: 
$$
\begin{aligned}
L &= -\log \left( P(y|x) \right) \\
&= \log \left( Z(x) \right) - \left( \sum_{t=1}^T U(x_t, y_t) + \sum_{t=0}^T T(y_t, y_{t+1}) \right)
\end{aligned}
$$
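
As a sanity check on the definitions above, the following is a minimal sketch that computes $Z(x)$, $P(y|x)$ and the NLL by brute-force enumeration. The scores in `EMIT` and `TRANS` are made-up numbers, and `TRANS[i][j]` follows the math notation $T(i, j)$ (from $i$ to $j$), not the `transitions[to, from]` layout used in the code cells later on.

```python
import itertools
import math

# Toy setup; every number below is made up for illustration.
TAGS = [0, 1]        # the real labels
START, STOP = 2, 3   # fixed boundary tags y_0 and y_{T+1}

# EMIT[t][k] plays the role of U(x_{t+1}, k) for a sequence of T = 3 steps.
EMIT = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
# TRANS[i][j] plays the role of T(i, j): the score of moving from tag i to tag j.
TRANS = [
    [0.2, -0.1, 0.0, 0.3],   # from tag 0
    [-0.3, 0.4, 0.0, 0.1],   # from tag 1
    [0.6, -0.5, 0.0, 0.0],   # from START
    [0.0, 0.0, 0.0, 0.0],    # from STOP (never used)
]

def path_score(y):
    """Emissions plus transitions, including START -> y_1 and y_T -> STOP."""
    padded = [START, *y, STOP]
    emissions = sum(EMIT[t][tag] for t, tag in enumerate(y))
    transitions = sum(TRANS[a][b] for a, b in zip(padded, padded[1:]))
    return emissions + transitions

# Z(x): sum of exp(score) over all |TAGS|^T label sequences.
all_paths = list(itertools.product(TAGS, repeat=len(EMIT)))
log_Z = math.log(sum(math.exp(path_score(y)) for y in all_paths))

def prob(y):
    return math.exp(path_score(y) - log_Z)

def nll(y):
    return log_Z - path_score(y)

gold = (0, 1, 1)
print(f"P(gold|x) = {prob(gold):.4f}, NLL = {nll(gold):.4f}")
print(f"sum of P over all paths = {sum(prob(y) for y in all_paths):.4f}")
```

Enumeration is only feasible for tiny problems, since the number of paths grows exponentially with the sequence length.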

## Forward Algorithm: Dynamic Programming for Computing the Partition Function

Computing $Z(x)$ by direct enumeration takes $O(\vert \mathcal{Y} \vert^T)$ time, where $\mathcal{Y}$ is the label set, but *dynamic programming* reduces this to $O(T \vert \mathcal{Y} \vert^2)$.  
Specifically, we define the state:
$$
\alpha_s (y_s) = \sum_{y'_1} \sum_{y'_2} \dots \sum_{y'_{s-1}} \exp \left( \sum_{t=1}^{s-1} U(x_t, y'_t) + \sum_{t=0}^{s-2} T(y'_t, y'_{t+1}) + T(y'_{s-1}, y_s) \right)
$$
where $\alpha_s (y_s)$ is the sum of the exponentiated scores of all partial paths reaching $y_s$ (the emission at step $s$ is not yet included). Note:  
$$
\begin{aligned}
\alpha_1(y_1) &= \exp \left( T(y_0, y_1) \right) \\
\alpha_{T+1}(y_{T+1}) &= Z(x)
\end{aligned}
$$

When computing $\alpha_{s+1}(y_{s+1})$, we only need $\alpha_s(y'_s)$ for each $y'_s$, not the individual paths that reach each $y'_s$ before step $s$. Hence, this is a dynamic programming problem.  
In log-space, the *state transition equation* is:  
$$
\begin{aligned}
\log( \alpha_{s+1}(y_{s+1}) ) &= \log \left( \sum_{y'_1} \sum_{y'_2} \dots \sum_{y'_s} \exp \left( \sum_{t=1}^s U(x_t, y'_t) + \sum_{t=0}^{s-1} T(y'_t, y'_{t+1}) + T(y'_s, y_{s+1}) \right) \right) \\
&= \log \left( \sum_{y'_s} \exp \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) \right) \sum_{y'_1} \sum_{y'_2} \dots \sum_{y'_{s-1}} \exp \left( \sum_{t=1}^{s-1} U(x_t, y'_t) + \sum_{t=0}^{s-2} T(y'_t, y'_{t+1}) + T(y'_{s-1}, y'_s) \right) \right) \\
&= \log \left( \sum_{y'_s} \exp \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) \right) \alpha_s(y'_s) \right) \\
&= \log \left( \sum_{y'_s} \exp \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) + \log (\alpha_s(y'_s)) \right) \right)
\end{aligned}
$$
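
The recursion above can be sketched in plain Python and checked against brute-force enumeration. This is an illustrative sketch under assumptions: the scores are made up, `TRANS[i][j]` means "from $i$ to $j$" as in the math, and the helper names (`logsumexp`, `forward_log_partition`, `brute_force_log_partition`) are hypothetical.

```python
import itertools
import math

TAGS = [0, 1, 2]
START, STOP = 3, 4

# Made-up scores. EMIT[t][k] ~ U(x_{t+1}, k); TRANS[i][j] ~ T(i, j) (from i to j).
EMIT = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.6], [-0.4, 0.8, 0.2], [0.0, -0.1, 0.4]]
TRANS = [
    [0.2, -0.1, 0.4, 0.0, 0.3],    # from tag 0
    [-0.3, 0.4, 0.1, 0.0, 0.1],    # from tag 1
    [0.5, -0.6, 0.2, 0.0, -0.2],   # from tag 2
    [0.6, -0.5, 0.1, 0.0, 0.0],    # from START
    [0.0, 0.0, 0.0, 0.0, 0.0],     # from STOP (never used)
]

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_partition(emit, trans):
    # log alpha_1(y_1) = T(y_0, y_1)
    log_alpha = [trans[START][k] for k in TAGS]
    # log alpha_{s+1}(j) = logsumexp_i( U(x_s, i) + T(i, j) + log alpha_s(i) )
    for s in range(len(emit) - 1):
        log_alpha = [
            logsumexp([emit[s][i] + trans[i][j] + log_alpha[i] for i in TAGS])
            for j in TAGS
        ]
    # Final step: the only possible next tag is STOP, so this is log Z(x).
    last = len(emit) - 1
    return logsumexp([emit[last][i] + trans[i][STOP] + log_alpha[i] for i in TAGS])

def brute_force_log_partition(emit, trans):
    scores = []
    for y in itertools.product(TAGS, repeat=len(emit)):
        padded = [START, *y, STOP]
        score = sum(emit[t][k] for t, k in enumerate(y))
        score += sum(trans[a][b] for a, b in zip(padded, padded[1:]))
        scores.append(score)
    return logsumexp(scores)

log_Z_dp = forward_log_partition(EMIT, TRANS)
log_Z_bf = brute_force_log_partition(EMIT, TRANS)
print(log_Z_dp, log_Z_bf)
```

The dynamic-programming version performs on the order of $T \cdot 3^2$ additions here, while the enumeration visits all $3^4 = 81$ paths.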

**Note** that adding the *unary scores* before or after `logsumexp` gives the same result:  
$$
\begin{aligned}
\log( \alpha_{s+1}(y_{s+1}) ) + U(x_{s+1}, y_{s+1}) &= \log \left( \sum_{y'_s} \exp \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) + \log (\alpha_s(y'_s)) \right) \right) + U(x_{s+1}, y_{s+1}) \\
&= \log \left( \sum_{y'_s} \exp \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) + \log (\alpha_s(y'_s)) \right) \right) + \log( \exp ( U(x_{s+1}, y_{s+1}) )) \\
&= \log \left( \sum_{y'_s} \exp \left( U(x_{s+1}, y_{s+1}) + U(x_s, y'_s) + T(y'_s, y_{s+1}) + \log (\alpha_s(y'_s)) \right) \right)
\end{aligned}
$$
Hence, at each step we may group $U(x_s, y_s)$ either with the incoming transition $T(y_{s-1}, y_s)$ or with the outgoing transition $T(y_s, y_{s+1})$; both groupings give the same result.  
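
The identity behind this note is that a term that does not depend on the summation variable can be moved in or out of `logsumexp`. A small numeric check, with made-up values standing in for the per-tag terms and the emission score:

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Made-up values of U(x_s, y'_s) + T(y'_s, y_{s+1}) + log(alpha_s(y'_s)),
# one entry per previous tag y'_s.
terms = [0.3, -1.2, 0.7]
# Made-up emission score U(x_{s+1}, y_{s+1}); it does not depend on y'_s.
emission = 0.45

added_after = logsumexp(terms) + emission
added_before = logsumexp([term + emission for term in terms])
print(added_after, added_before)
```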

## Viterbi Algorithm: Finding the Best Sequence of Labels 

Similarly, we use dynamic programming to find the best sequence of labels (i.e., decoding).  
Specifically, we define the state: 
$$
\beta_s (y_s) = \max_{y'_1, y'_2, \dots, y'_{s-1}} \left( \sum_{t=1}^{s-1} U(x_t, y'_t) + \sum_{t=0}^{s-2} T(y'_t, y'_{t+1}) + T(y'_{s-1}, y_s) \right)
$$
where $\beta_s (y_s)$ may be regarded as the max score reaching $y_s$. Note: 
$$
\begin{aligned}
\beta_1 (y_1) &= T(y_0, y_1) \\
\beta_{T+1}(y_{T+1}) &= \max_{y'_1, y'_2, \dots, y'_T} \left( \sum_{t=1}^T U(x_t, y'_t) + \sum_{t=0}^{T-1} T(y'_t, y'_{t+1}) + T(y'_T, y_{T+1}) \right)
\end{aligned} 
$$

As with the forward algorithm, this is a dynamic programming problem. The *state transition equation* is: 
$$
\begin{aligned}
\beta_{s+1} (y_{s+1}) &= \max_{y'_1, y'_2, \dots, y'_s} \left( \sum_{t=1}^s U(x_t, y'_t) + \sum_{t=0}^{s-1} T(y'_t, y'_{t+1}) + T(y'_s, y_{s+1}) \right) \\
&= \max_{y'_s} \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) + \max_{y'_1, y'_2, \dots, y'_{s-1}} \left( \sum_{t=1}^{s-1} U(x_t, y'_t) + \sum_{t=0}^{s-2} T(y'_t, y'_{t+1}) + T(y'_{s-1}, y'_s) \right) \right) \\
&= \max_{y'_s} \left( U(x_s, y'_s) + T(y'_s, y_{s+1}) + \beta_s (y'_s) \right)
\end{aligned}
$$
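
The $\beta$ recursion gives only the best score; to recover the label sequence we also need to remember which $y'_s$ achieved each maximum. Below is a minimal sketch with backpointers, assuming made-up scores and the math convention `TRANS[i][j]` = "from $i$ to $j$". The function names are hypothetical.

```python
import itertools

TAGS = [0, 1, 2]
START, STOP = 3, 4

# Made-up scores. EMIT[t][k] ~ U(x_{t+1}, k); TRANS[i][j] ~ T(i, j) (from i to j).
EMIT = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.6], [-0.4, 0.8, 0.2], [0.0, -0.1, 0.4]]
TRANS = [
    [0.2, -0.1, 0.4, 0.0, 0.3],    # from tag 0
    [-0.3, 0.4, 0.1, 0.0, 0.1],    # from tag 1
    [0.5, -0.6, 0.2, 0.0, -0.2],   # from tag 2
    [0.6, -0.5, 0.1, 0.0, 0.0],    # from START
    [0.0, 0.0, 0.0, 0.0, 0.0],     # from STOP (never used)
]

def path_score(y):
    padded = [START, *y, STOP]
    score = sum(EMIT[t][k] for t, k in enumerate(y))
    return score + sum(TRANS[a][b] for a, b in zip(padded, padded[1:]))

def viterbi(emit, trans):
    beta = [trans[START][k] for k in TAGS]   # beta_1(y_1) = T(y_0, y_1)
    backpointers = []
    for s in range(len(emit) - 1):
        new_beta, pointers = [], []
        for j in TAGS:
            # Candidates for beta_{s+1}(j), one per previous tag i.
            candidates = [emit[s][i] + trans[i][j] + beta[i] for i in TAGS]
            best_i = max(TAGS, key=lambda i: candidates[i])
            new_beta.append(candidates[best_i])
            pointers.append(best_i)
        beta = new_beta
        backpointers.append(pointers)
    # Final transition into STOP.
    last = len(emit) - 1
    finals = [emit[last][i] + trans[i][STOP] + beta[i] for i in TAGS]
    best_last = max(TAGS, key=lambda i: finals[i])
    # Follow the backpointers from the last tag to the first.
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return finals[best_last], path

best_score, best_path = viterbi(EMIT, TRANS)
brute_best = max(path_score(y) for y in itertools.product(TAGS, repeat=len(EMIT)))
print(best_score, best_path, brute_best)
```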

In [1]:
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

SEED = 515
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Preparing Data

In [2]:
training_data = [("the wall street journal reported today that apple corporation made money".split(),
                  "B I I I O O O B I O O".split()), 
                 ("georgia tech is a university in georgia".split(),
                  "B I O O O O B".split())]
training_data

[(['the',
   'wall',
   'street',
   'journal',
   'reported',
   'today',
   'that',
   'apple',
   'corporation',
   'made',
   'money'],
  ['B', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'O']),
 (['georgia', 'tech', 'is', 'a', 'university', 'in', 'georgia'],
  ['B', 'I', 'O', 'O', 'O', 'O', 'B'])]

In [3]:
START_TAG = "<START>"
STOP_TAG = "<STOP>"

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}
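
As a quick illustration of these lookup tables, the sketch below rebuilds them and encodes the second training example as index lists. The `encode` helper is hypothetical (the cells below build tensors inline instead), and it assumes every token is already in the vocabulary.

```python
START_TAG = "<START>"
STOP_TAG = "<STOP>"

training_data = [
    ("the wall street journal reported today that apple corporation made money".split(),
     "B I I I O O O B I O O".split()),
    ("georgia tech is a university in georgia".split(),
     "B I O O O O B".split()),
]

# Assign each new word the next free index, in order of first appearance.
word_to_ix = {}
for sentence, _ in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

def encode(items, to_ix):
    """Hypothetical helper: map tokens (or tags) to their integer indices."""
    return [to_ix[item] for item in items]

sentence, tags = training_data[1]
word_ids = encode(sentence, word_to_ix)
tag_ids = encode(tags, tag_to_ix)
print(word_ids)
print(tag_ids)
```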

## Testing Algorithms

In [4]:
EMB_DIM = 5
HID_DIM = 4
VOC_DIM = len(word_to_ix)
TAG_DIM = len(tag_to_ix)

emb = nn.Embedding(VOC_DIM, EMB_DIM)
rnn = nn.LSTM(EMB_DIM, HID_DIM//2, num_layers=1, bidirectional=True)
hid2tag = nn.Linear(HID_DIM, TAG_DIM)

In [5]:
ex_idx = 0
# Use the first several steps as illustration
# sent/tags: (step, batch=1)
sent = torch.tensor([word_to_ix[w] for w in training_data[ex_idx][0]], dtype=torch.long).unsqueeze(1)[:5]
tags = torch.tensor([tag_to_ix[t] for t in training_data[ex_idx][1]], dtype=torch.long).unsqueeze(1)[:5]
print(sent)
print(tags)

tensor([[0],
        [1],
        [2],
        [3],
        [4]])
tensor([[0],
        [1],
        [1],
        [1],
        [2]])


In [6]:
embedded = emb(sent)
rnn_outs, _ = rnn(embedded)

# feats: (step, batch=1, tag_dim)
feats = hid2tag(rnn_outs)
print(feats)

tensor([[[ 0.1154,  0.1013, -0.2363, -0.1659, -0.3134]],

        [[-0.1415,  0.1352, -0.2847, -0.2100, -0.2518]],

        [[-0.0351,  0.0659, -0.1660, -0.1903, -0.2150]],

        [[ 0.0703, -0.0114,  0.0622,  0.0429, -0.1247]],

        [[-0.0787,  0.0378, -0.0107,  0.0106, -0.1368]]],
       grad_fn=<AddBackward0>)


In [7]:
# transitions[i, j] is the score of transitioning *to* i *from* j.
transitions = nn.Parameter(torch.randn(TAG_DIM, TAG_DIM))
transitions.data[tag_to_ix[START_TAG], :] = -1e4
transitions.data[:, tag_to_ix[STOP_TAG]] = -1e4
transitions

Parameter containing:
tensor([[-1.1290e+00, -1.2998e+00, -1.6684e+00, -8.6616e-01, -1.0000e+04],
        [-4.6060e-01,  4.6536e-01, -1.5174e+00, -1.5891e+00, -1.0000e+04],
        [-1.2220e+00, -2.4080e-01, -8.2866e-01,  7.9692e-01, -1.0000e+04],
        [-1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04],
        [-5.8353e-01, -6.6203e-01,  1.1687e-01,  1.1368e+00, -1.0000e+04]],
       requires_grad=True)
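
Note that the `transitions[to, from]` layout is the transpose of the math notation $T(y_t, y_{t+1})$. The following sketch spells out the mapping and the two hard constraints using plain lists; the zero-filled matrix is a stand-in for the random parameter, and the helper `T` is hypothetical.

```python
import math

tag_to_ix = {"B": 0, "I": 1, "O": 2, "<START>": 3, "<STOP>": 4}
n_tags = len(tag_to_ix)
FORBIDDEN = -1e4

# Stand-in for the learned parameter: zeros instead of random values.
# Layout follows the cell above: trans[to][from].
trans = [[0.0] * n_tags for _ in range(n_tags)]
for from_tag in range(n_tags):
    trans[tag_to_ix["<START>"]][from_tag] = FORBIDDEN   # never transition *to* START
for to_tag in range(n_tags):
    trans[to_tag][tag_to_ix["<STOP>"]] = FORBIDDEN      # never transition *from* STOP

def T(y_t, y_next):
    """Math notation T(y_t, y_{t+1}) corresponds to trans[y_{t+1}][y_t]."""
    return trans[y_next][y_t]

print(T(tag_to_ix["<START>"], tag_to_ix["B"]))   # allowed: START -> B
print(T(tag_to_ix["B"], tag_to_ix["<START>"]))   # forbidden: B -> START
print(math.exp(FORBIDDEN))                        # underflows to 0.0
```

Because `exp(-1e4)` underflows to zero, any path that uses a forbidden transition contributes nothing to the partition function.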

In [8]:
# The numerator: score
# The original implementation
def _score_sentence(feats, tags):
    score = torch.zeros(1)
    tags = torch.cat([torch.tensor([tag_to_ix[START_TAG]], dtype=torch.long), tags])
    for i, feat in enumerate(feats):
        score = score + transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    score = score + transitions[tag_to_ix[STOP_TAG], tags[-1]]
    return score

_score_sentence(feats.squeeze(1), tags.squeeze(1))

tensor([-0.2256], grad_fn=<AddBackward0>)

In [9]:
# Vectorized implementation
def _score_sentence_vec(feats, tags):
    feat_scores = feats.gather(dim=-1, index=tags.unsqueeze(-1)).squeeze(-1)
    # print(feat_scores.size())

    from_tags = torch.cat([torch.full((1, tags.size(1)), fill_value=tag_to_ix[START_TAG], dtype=torch.long), tags], dim=0)
    to_tags = torch.cat([tags, torch.full((1, tags.size(1)), fill_value=tag_to_ix[STOP_TAG], dtype=torch.long)], dim=0)
    trans_scores = transitions[to_tags, from_tags]
    # print(trans_scores.size())

    return feat_scores.sum(dim=0) + trans_scores.sum(dim=0)

_score_sentence_vec(feats, tags)

tensor([-0.2256], grad_fn=<AddBackward0>)

In [10]:
import itertools
scores = []
for potential_tags in itertools.product(*[range(TAG_DIM) for _ in range(feats.size(0))]):
    potential_tags = torch.tensor(potential_tags, dtype=torch.long).unsqueeze(1)
    potential_score = _score_sentence_vec(feats, potential_tags)
    scores.append(potential_score)

torch.logsumexp(torch.cat(scores), dim=-1)

tensor(2.4575, grad_fn=<LogsumexpBackward>)

In [11]:
# The denominator: partition function
# The original implementation
def log_sum_exp(vec):
    max_score = vec[0, torch.argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

def _forward_alg(feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.full((1, TAG_DIM), -1e4)
    # START_TAG has all of the score.
    init_alphas[0][tag_to_ix[START_TAG]] = 0.

    # No wrapping is needed: autograd tracks gradients through `transitions` and `feats`
    forward_var = init_alphas

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward tensors at this timestep
        for next_tag in range(TAG_DIM):
            # broadcast the emission score: it is the same regardless of
            # the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, TAG_DIM)
            # the ith entry of trans_score is the score of transitioning to
            # next_tag from i
            trans_score = transitions[next_tag].view(1, -1)
            # The ith entry of next_tag_var is the value for the
            # edge (i -> next_tag) before we do log-sum-exp
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var).view(1))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + transitions[tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha

_forward_alg(feats.squeeze(1))

tensor(2.4575, grad_fn=<AddBackward0>)
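
The `log_sum_exp` helper above subtracts the maximum before exponentiating. A small sketch of why this matters, in plain Python with made-up inputs (the function names are hypothetical):

```python
import math

def naive_logsumexp(values):
    return math.log(sum(math.exp(v) for v in values))

def stable_logsumexp(values):
    # Shift by the maximum so the largest exponent is exp(0) = 1.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def try_naive(values):
    """Return the naive result, or None if it overflows or hits log(0)."""
    try:
        return naive_logsumexp(values)
    except (OverflowError, ValueError):
        return None

moderate = [0.1, -0.3, 0.7]
large = [1000.0, 999.0, 998.0]      # exp(1000) overflows a float
tiny = [-10000.0, -9999.0]          # exp(-1e4) underflows to 0, so log(0) fails

for values in (moderate, large, tiny):
    print(values, try_naive(values), stable_logsumexp(values))
```

The `tiny` case is the relevant one here, since forbidden transitions are encoded with scores of `-1e4`.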

In [12]:
# Vectorized implementation
alphas = torch.full((feats.size(1), TAG_DIM), fill_value=-1e4)
alphas[:, tag_to_ix[START_TAG]] = 0
print(alphas)

for t in range(feats.size(0)):
    # alphas: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
    # feats[t]: (batch=1, tag_dim) -> (batch=1, tag_dim, 1)
    alphas = torch.logsumexp(alphas.unsqueeze(1) + feats[t].unsqueeze(2) + transitions, dim=-1)

alphas = alphas + transitions[tag_to_ix[STOP_TAG]]
print(torch.logsumexp(alphas, dim=-1))

tensor([[-10000., -10000., -10000.,      0., -10000.]])
tensor([2.4575], grad_fn=<LogsumexpBackward>)


In [13]:
# Note: It is equivalent to add the unary scores before or after `logsumexp`. 
alphas = transitions[:, tag_to_ix[START_TAG]].unsqueeze(0)
print(alphas)

for t in range(feats.size(0)):
    # alphas: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
    # feats[t]: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
    alphas = torch.logsumexp((alphas + feats[t]).unsqueeze(1) + transitions, dim=-1)

print(alphas[:, tag_to_ix[STOP_TAG]])

tensor([[-8.6616e-01, -1.5891e+00,  7.9692e-01, -1.0000e+04,  1.1368e+00]],
       grad_fn=<UnsqueezeBackward0>)
tensor([2.4575], grad_fn=<SelectBackward>)


In [14]:
# Remove the starting and stopping tags in computation
# (this assumes START_TAG and STOP_TAG are the last two tag indices)
alphas = transitions[:-2, tag_to_ix[START_TAG]].unsqueeze(0)
print(alphas)

for t in range(feats.size(0)-1):
    # Slice with `:-2` (not `-2`) to keep the emissions of the real tags only
    alphas = torch.logsumexp((alphas + feats[t, :, :-2]).unsqueeze(1) + transitions[:-2, :-2], dim=-1)

# Only transitioning to the stopping tag
# The result matches the full computation, since forbidden paths contribute exp(-1e4) = 0
alphas = torch.logsumexp(alphas + feats[-1, :, :-2] + transitions[tag_to_ix[STOP_TAG], :-2], dim=-1)
print(alphas)

tensor([[-0.8662, -1.5891,  0.7969]], grad_fn=<UnsqueezeBackward0>)
tensor([2.4575], grad_fn=<LogsumexpBackward>)


In [15]:
# Viterbi decoding
# The original implementation
def _viterbi_decode(feats):
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.full((1, TAG_DIM), -10000.)
    init_vvars[0][tag_to_ix[START_TAG]] = 0

    # forward_var at step i holds the viterbi variables for step i-1
    forward_var = init_vvars
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(TAG_DIM):
            # next_tag_var[i] holds the viterbi variable for tag i at the
            # previous step, plus the score of transitioning
            # from tag i to next_tag.
            # We don't include the emission scores here because the max
            # does not depend on them (we add them in below)
            next_tag_var = forward_var + transitions[next_tag]
            best_tag_id = torch.argmax(next_tag_var)
            bptrs_t.append(best_tag_id)
            viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
        # Now add in the emission scores, and assign forward_var to the set
        # of viterbi variables we just computed
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + transitions[tag_to_ix[STOP_TAG]]
    best_tag_id = torch.argmax(terminal_var)
    path_score = terminal_var[0][best_tag_id]

    # Follow the back pointers to decode the best path.
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path

_viterbi_decode(feats.squeeze(1))

(tensor(0.0290, grad_fn=<SelectBackward>),
 [tensor(2), tensor(1), tensor(1), tensor(1), tensor(2)])

In [16]:
# Vectorized implementation
alphas = transitions[:, tag_to_ix[START_TAG]].unsqueeze(0)
# best_paths: (step=1, batch, tag_dim)
best_paths = torch.arange(TAG_DIM).repeat(feats.size(1), 1).unsqueeze(0)
print(alphas)

for t in range(feats.size(0)):
    # alphas: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
    # feats[t]: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
    # indices: (batch=1, tag_dim)
    # indices[b, j]: the best tag at step t for batch item b, given tag j at step t+1
    alphas, indices = torch.max((alphas + feats[t]).unsqueeze(1) + transitions, dim=-1)

    # selected_paths: (step, batch=1, tag_dim) 
    # The paths selected according to indices
    selected_paths = torch.cat([best_paths[:, i, indices[i]].unsqueeze(1) for i in range(feats.size(1))], dim=1)
    this_step = torch.arange(TAG_DIM).repeat(feats.size(1), 1).unsqueeze(0)
    best_paths = torch.cat([selected_paths, this_step], dim=0)

print(alphas[:, tag_to_ix[STOP_TAG]])
print(best_paths[:, :, tag_to_ix[STOP_TAG]])

tensor([[-8.6616e-01, -1.5891e+00,  7.9692e-01, -1.0000e+04,  1.1368e+00]],
       grad_fn=<UnsqueezeBackward0>)
tensor([0.0290], grad_fn=<SelectBackward>)
tensor([[2],
        [1],
        [1],
        [1],
        [2],
        [4]])


## Building the Model

In [17]:
class BiLSTMCRF(nn.Module):
    def __init__(self, voc_dim, emb_dim, hid_dim, tag_dim, start_idx, stop_idx):
        super().__init__()
        self.start_idx = start_idx
        self.stop_idx = stop_idx

        self.emb = nn.Embedding(voc_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim//2, num_layers=1, bidirectional=True)
        self.hid2tag = nn.Linear(hid_dim, tag_dim)

        # transitions[i, j] is the score of transitioning from j to i
        self.transitions = nn.Parameter(torch.randn(tag_dim, tag_dim))
        self.transitions.data[self.start_idx, :] = -1e4
        self.transitions.data[:, self.stop_idx] = -1e4

    def _get_features(self, src):
        embedded = self.emb(src)
        rnn_outs, _ = self.rnn(embedded)

        # feats: (step, batch=1, tag_dim)
        feats = self.hid2tag(rnn_outs)
        return feats

    def _score_sentence(self, feats, tags):
        feat_scores = feats.gather(dim=-1, index=tags.unsqueeze(-1)).squeeze(-1)

        from_tags = torch.cat([torch.full((1, tags.size(1)), fill_value=self.start_idx, dtype=torch.long), tags], dim=0)
        to_tags = torch.cat([tags, torch.full((1, tags.size(1)), fill_value=self.stop_idx, dtype=torch.long)], dim=0)
        trans_scores = self.transitions[to_tags, from_tags]

        return feat_scores.sum(dim=0) + trans_scores.sum(dim=0)

    def _forward_alg(self, feats):
        alphas = self.transitions[:, self.start_idx].unsqueeze(0)
        
        for t in range(feats.size(0)):
            # alphas: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
            # feats[t]: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
            alphas = torch.logsumexp((alphas + feats[t]).unsqueeze(1) + self.transitions, dim=-1)

        return alphas[:, self.stop_idx]

    def _viterbi_decode(self, feats):
        alphas = self.transitions[:, self.start_idx].unsqueeze(0)
        # best_paths: (step=1, batch, tag_dim)
        best_paths = torch.arange(feats.size(-1)).repeat(feats.size(1), 1).unsqueeze(0)

        for t in range(feats.size(0)):
            # alphas: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
            # feats[t]: (batch=1, tag_dim) -> (batch=1, 1, tag_dim)
            # indices: (batch=1, tag_dim)
            # indices[b, j]: the best tag at step t for batch item b, given tag j at step t+1
            alphas, indices = torch.max((alphas + feats[t]).unsqueeze(1) + self.transitions, dim=-1)

            # selected_paths: (step, batch=1, tag_dim) 
            # The paths selected according to indices
            selected_paths = torch.cat([best_paths[:, i, indices[i]].unsqueeze(1) for i in range(feats.size(1))], dim=1)
            this_step = torch.arange(feats.size(-1)).repeat(feats.size(1), 1).unsqueeze(0)
            best_paths = torch.cat([selected_paths, this_step], dim=0)

        return alphas[:, self.stop_idx], best_paths[:, :, self.stop_idx]

    def neg_log_likelihood(self, src, tags):
        feats = self._get_features(src)
        partitions = self._forward_alg(feats)
        scores = self._score_sentence(feats, tags)
        return partitions - scores

    def forward(self, src):
        feats = self._get_features(src)

        scores, tags = self._viterbi_decode(feats)
        return scores, tags

In [18]:
model = BiLSTMCRF(VOC_DIM, EMB_DIM, HID_DIM, TAG_DIM, tag_to_ix[START_TAG], tag_to_ix[STOP_TAG])
model.emb = emb
model.rnn = rnn
model.hid2tag = hid2tag
model.transitions = transitions


feats = model._get_features(sent)
print(model._score_sentence(feats, tags))
print(model._forward_alg(feats))
print(model._viterbi_decode(feats))

print(model.neg_log_likelihood(sent, tags))
print(model(sent))

tensor([-0.2256], grad_fn=<AddBackward0>)
tensor([2.4575], grad_fn=<SelectBackward>)
(tensor([0.0290], grad_fn=<SelectBackward>), tensor([[2],
        [1],
        [1],
        [1],
        [2],
        [4]]))
tensor([2.6831], grad_fn=<SubBackward0>)
(tensor([0.0290], grad_fn=<SelectBackward>), tensor([[2],
        [1],
        [1],
        [1],
        [2],
        [4]]))


In [19]:
src1 = torch.tensor([word_to_ix[w] for w in training_data[0][0]], dtype=torch.long).unsqueeze(1)[:5]
src2 = torch.tensor([word_to_ix[w] for w in training_data[1][0]], dtype=torch.long).unsqueeze(1)[:5]
batch_src = torch.cat([src1, src2], dim=1)
print(batch_src)

tags1 = torch.tensor([tag_to_ix[t] for t in training_data[0][1]], dtype=torch.long).unsqueeze(1)[:5]
tags2 = torch.tensor([tag_to_ix[t] for t in training_data[1][1]], dtype=torch.long).unsqueeze(1)[:5]
batch_tags = torch.cat([tags1, tags2], dim=1)
print(batch_tags)

tensor([[ 0, 11],
        [ 1, 12],
        [ 2, 13],
        [ 3, 14],
        [ 4, 15]])
tensor([[0, 0],
        [1, 1],
        [1, 2],
        [1, 2],
        [2, 2]])


In [20]:
print(model.neg_log_likelihood(batch_src, batch_tags))
print(model(batch_src))

tensor([2.6831, 5.0988], grad_fn=<SubBackward0>)
(tensor([0.0290, 0.3021], grad_fn=<SelectBackward>), tensor([[2, 2],
        [1, 1],
        [1, 1],
        [1, 1],
        [2, 2],
        [4, 4]]))


## Training the Model

In [21]:
scores, decoded = model(batch_src)
print(scores)
print(decoded)
print(batch_tags)

tensor([0.0290, 0.3021], grad_fn=<SelectBackward>)
tensor([[2, 2],
        [1, 1],
        [1, 1],
        [1, 1],
        [2, 2],
        [4, 4]])
tensor([[0, 0],
        [1, 1],
        [1, 2],
        [1, 2],
        [2, 2]])


In [22]:
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(300): 
    # Forward pass & Calculate loss
    batch_losses = model.neg_log_likelihood(batch_src, batch_tags)
    loss = batch_losses.sum()
    # Backward propagation
    optimizer.zero_grad()
    loss.backward()
    # Update weights
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        print(epoch+1, loss.item())

50 3.9643068313598633
100 2.418396472930908
150 1.6392817497253418
200 1.1952505111694336
250 0.9155898094177246
300 0.7282357215881348


In [23]:
scores, decoded = model(batch_src)
print(scores)
print(decoded)
print(batch_tags)

tensor([6.4811, 6.9006], grad_fn=<SelectBackward>)
tensor([[0, 0],
        [1, 1],
        [1, 2],
        [1, 2],
        [2, 2],
        [4, 4]])
tensor([[0, 0],
        [1, 1],
        [1, 2],
        [1, 2],
        [2, 2]])
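
The decoded paths carry one extra trailing row holding the STOP tag, so they cannot be compared with `batch_tags` element by element without trimming. Below is a minimal sketch of a token-level accuracy check, using plain lists copied from the outputs shown in this notebook (for tensors, `.tolist()` would give the same structure). The `token_accuracy` helper is hypothetical.

```python
STOP_IDX = 4

# Rows are time steps, columns are batch items (copied from the printed outputs).
decoded_before = [[2, 2], [1, 1], [1, 1], [1, 1], [2, 2], [4, 4]]
decoded_after = [[0, 0], [1, 1], [1, 2], [1, 2], [2, 2], [4, 4]]
gold = [[0, 0], [1, 1], [1, 2], [1, 2], [2, 2]]

def token_accuracy(decoded, gold, stop_idx):
    """Fraction of positions where the decoded tag equals the gold tag."""
    # Drop the trailing row if it only contains the STOP tag.
    if decoded and all(tag == stop_idx for tag in decoded[-1]):
        decoded = decoded[:-1]
    if len(decoded) != len(gold):
        raise ValueError("decoded and gold must have the same number of steps")
    pairs = [
        (pred, true)
        for pred_row, gold_row in zip(decoded, gold)
        for pred, true in zip(pred_row, gold_row)
    ]
    return sum(pred == true for pred, true in pairs) / len(pairs)

acc_before = token_accuracy(decoded_before, gold, STOP_IDX)
acc_after = token_accuracy(decoded_after, gold, STOP_IDX)
print(acc_before, acc_after)
```

Since the model is evaluated on the same two sentences it was trained on, this only confirms that it has fit the training data.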


In [24]:
model.transitions

Parameter containing:
tensor([[-1.5686e+00, -1.8517e+00, -2.1087e+00,  6.7935e-01, -9.9971e+03],
        [ 6.5971e-01,  3.3785e-01, -2.2905e+00, -2.0786e+00, -9.9971e+03],
        [-1.5681e+00,  7.6895e-01, -2.7771e-01, -2.5855e-01, -9.9971e+03],
        [-9.9971e+03, -9.9971e+03, -9.9971e+03, -9.9971e+03, -9.9971e+03],
        [-8.0492e-01, -1.2626e+00,  9.3918e-01,  1.1364e+00, -9.9971e+03]],
       requires_grad=True)