# chunker: default program

In [None]:
from default import *
import os

## Run the default solution on dev

In [4]:
chunker = LSTMTagger(os.path.join('data', 'train.txt.gz'), os.path.join('data', 'chunker'), '.tar')
decoder_output = chunker.decode('data/input/dev.txt')

100%|██████████| 1027/1027 [00:02<00:00, 459.66it/s]


## Evaluate the default output

In [5]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11672 phrases; correct: 8568.
accuracy:  84.35%; (non-O)
accuracy:  85.65%; precision:  73.41%; recall:  72.02%; FB1:  72.71
             ADJP: precision:  36.49%; recall:  11.95%; FB1:  18.00  74
             ADVP: precision:  71.36%; recall:  39.45%; FB1:  50.81  220
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  70.33%; recall:  76.80%; FB1:  73.42  6811
               PP: precision:  92.40%; recall:  87.14%; FB1:  89.69  2302
              PRT: precision:  65.00%; recall:  57.78%; FB1:  61.18  40
             SBAR: precision:  84.62%; recall:  41.77%; FB1:  55.93  117
               VP: precision:  63.66%; recall:  58.25%; FB1:  60.83  2108


(73.40644276901988, 72.02420981842637, 72.70875763747455)
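
The tuple above appears to hold precision, recall and FB1 in percent. As a rough check, the sketch below recomputes the three numbers from the counts printed in the report, assuming the usual chunk-level definitions (precision = correct / found, recall = correct / reference phrases). The helper name `prf` is hypothetical and is not part of `conlleval`.

```python
def prf(correct, found, gold):
    """Return (precision, recall, F1) in percent from phrase counts."""
    precision = 100.0 * correct / found   # share of predicted phrases that are right
    recall = 100.0 * correct / gold       # share of reference phrases that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Counts taken from the report line:
# "11896 phrases; found: 11672 phrases; correct: 8568"
default_scores = prf(correct=8568, found=11672, gold=11896)
print(default_scores)  # approximately (73.41, 72.02, 72.71)
```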

## Documentation


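## Baseline: Implementing a Character-Level Representation

### Create a character-level representation of each word
1. Create a one-hot vector v1 for the first character of the word.
2. Create a vector v2 in which the index of each character holds the number of times that character occurs among the internal characters of the word (everything except the first and last character).
3. Create a one-hot vector v3 for the last character of the word.

The three vectors are concatenated into a single vector per word. With `string.printable` as the alphabet (100 characters), this gives a 300-dimensional vector, and the vectors for all words in a sentence are stacked into one `torch.tensor` matrix with one row per word.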

In [1]:
from chunker import *
import os, string
import torch

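def char_rep(seq, to_ix):
    """Build the semi-character representation for every word in seq.

    to_ix is expected to map each character to an index below
    len(string.printable).
    """
    chars_vec = []
    size = len(string.printable)
    for word in seq:
        v1 = torch.zeros(size)  # one-hot: first character
        v2 = torch.zeros(size)  # counts: internal characters
        v3 = torch.zeros(size)  # one-hot: last character
        if len(word) > 0:
            v1[to_ix[word[0]]] += 1
            for c in word[1:-1]:
                v2[to_ix[c]] += 1
            v3[to_ix[word[-1]]] += 1
        chars_vec.append(torch.cat((v1, v2, v3), 0))
    # torch.stack already returns a tensor; calling .long() on it avoids the
    # UserWarning raised by wrapping an existing tensor in torch.tensor().
    return torch.stack(chars_vec).long()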
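
To make the three steps concrete, here is a small dependency-free sketch that builds the same kind of vector with plain Python lists. It assumes each character is indexed by its position in `string.printable`; the names `CHAR_TO_IX` and `semi_char_vector` are hypothetical and only serve the illustration.

```python
import string

# Assumption: a character's index is its position in string.printable.
CHAR_TO_IX = {c: i for i, c in enumerate(string.printable)}
SIZE = len(string.printable)  # 100 printable characters

def semi_char_vector(word):
    """Return the concatenated [v1, v2, v3] vector for one word as a list."""
    v1, v2, v3 = [0] * SIZE, [0] * SIZE, [0] * SIZE
    if word:
        v1[CHAR_TO_IX[word[0]]] = 1        # first character (one-hot)
        for c in word[1:-1]:
            v2[CHAR_TO_IX[c]] += 1         # internal characters (counts)
        v3[CHAR_TO_IX[word[-1]]] = 1       # last character (one-hot)
    return v1 + v2 + v3

vec = semi_char_vector("hello")
print(len(vec))                          # 300
print(vec[CHAR_TO_IX["h"]])              # 1: "h" is the first character
print(vec[SIZE + CHAR_TO_IX["l"]])       # 2: "l" occurs twice internally
print(vec[2 * SIZE + CHAR_TO_IX["o"]])   # 1: "o" is the last character
```

For a one-character word, the same character is marked in both v1 and v3 and v2 stays all zeros, which matches the behaviour of `char_rep` above.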

### For the Second RNN Implementation
We got a dev score of 76.7605 with the second RNN method.

To make better use of the semi-character representation, a separate RNN is dedicated to learning from the character-level data alone.
The semi-character representation is the same as defined above, but it is now passed into an LSTM layer that produces a hidden state output for each word.
This hidden state output, rather than the raw semi-character representation, is concatenated with the word embedding and passed on to the layers of the baseline model.

In other words, we define a second RNN that takes the character-level representation as input, and we concatenate its hidden states with the word embeddings to form the input to the original chunker RNN.

In [None]:
class CharLSTMTaggerModel(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, char_hidden_dim, vocab_size, tagset_size):
        torch.manual_seed(1)
        super(CharLSTMTaggerModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.char_hidden_dim = char_hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The character-level LSTM takes the 300-dimensional semi-character
        # representation as input and outputs hidden states with
        # dimensionality char_hidden_dim.
        self.char_lstm = nn.LSTM(300, char_hidden_dim, bidirectional = False)

        # The main LSTM takes the word embeddings concatenated with the
        # character LSTM output as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim+char_hidden_dim, hidden_dim, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, char_rep):
        # Run the character LSTM over the sentence:
        # (seq_len, 1, 300) -> (seq_len, 1, char_hidden_dim)
        char_lstm_out,_ = self.char_lstm(char_rep.view(len(char_rep), 1, -1))
        char_lstm_out = char_lstm_out.reshape(len(char_rep),self.char_hidden_dim)
        # Concatenate the word embeddings with the character LSTM output.
        embeds = torch.cat([self.word_embeddings(sentence),char_lstm_out],1)
        
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
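
The shapes flowing through the first half of `forward` can be checked in isolation. The sketch below is a minimal illustration and not the trained model: the sentence length, vocabulary size and `char_hidden_dim` value are made-up assumptions, and the inputs are random tensors rather than real data.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Hypothetical sizes chosen only for this illustration.
sent_len, vocab_size = 5, 20
embedding_dim, char_dim, char_hidden_dim = 128, 300, 32

word_embeddings = nn.Embedding(vocab_size, embedding_dim)
char_lstm = nn.LSTM(char_dim, char_hidden_dim)

sentence = torch.randint(0, vocab_size, (sent_len,))  # random word indices
char_rep = torch.rand(sent_len, char_dim)             # random stand-in for semi-character vectors

# nn.LSTM expects (seq_len, batch, input_size); the batch size here is 1.
char_out, _ = char_lstm(char_rep.view(sent_len, 1, -1))
char_out = char_out.reshape(sent_len, char_hidden_dim)

# One row per word: word embedding followed by the character LSTM output.
embeds = torch.cat([word_embeddings(sentence), char_out], 1)
print(embeds.shape)  # torch.Size([5, 160])
```

The second dimension is `embedding_dim + char_hidden_dim`, which is why the main LSTM is constructed with that input size.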

Inside the class `CharLSTMTagger` we initialize the `CharLSTMTaggerModel` and train the second RNN using stochastic gradient descent as the optimizer, with a learning rate of 0.01.

In [None]:
self.model = CharLSTMTaggerModel(self.embedding_dim, self.hidden_dim, self.char_hidden, len(self.word_to_ix), len(self.tag_to_ix))
self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

def train(self):
    loss_function = nn.NLLLoss()

    self.model.train()
    loss = float("inf")
    for epoch in range(self.epochs):
        for sentence, tags in tqdm.tqdm(self.training_data):
            # Step 1. Remember that Pytorch accumulates gradients.
            # We need to clear them out before each instance
            self.model.zero_grad()

            # Step 2. Get our inputs ready for the network, that is, turn them into
            # Tensors of word indices.
            sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
            char_in = semi_char(sentence)
            targets = prepare_sequence(tags, self.tag_to_ix, self.unk)
            
            # Step 3. Run our forward pass.
            tag_scores = self.model(sentence_in, char_in)
            
            # Step 4. Compute the loss, gradients, and update the parameters by
            #  calling optimizer.step()
            loss = loss_function(tag_scores, targets)
            loss.backward()
            self.optimizer.step()

        if epoch == self.epochs-1:
            epoch_str = '' # last epoch so do not use epoch number in model filename
        else:
            epoch_str = str(epoch)
        savefile = self.modelfile + epoch_str + self.modelsuffix
        print("saving model file: {}".format(savefile), file=sys.stderr)
        torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.model.state_dict(),
                    'optimizer_state_dict': self.optimizer.state_dict(),
                    'loss': loss,
                    'unk': self.unk,
                    'word_to_ix': self.word_to_ix,
                    'tag_to_ix': self.tag_to_ix,
                    'ix_to_tag': self.ix_to_tag,
                }, savefile)

## Implementing the GRU Model Combining the Character Level Representations of Words

As discussed in lecture, the LSTM and GRU have very similar architectures; the consensus is to start with an LSTM and switch to a GRU if faster computation is needed.
To test whether this changes the dev score, we replaced the LSTM layer that takes the word embedding and the hidden state output of the semi-character RNN with a GRU layer from the PyTorch library.

Below is the `CharGRUTaggerModel` class, in which the GRU takes the word embeddings concatenated with the character-level hidden states as input and outputs hidden states.

In [None]:
class CharGRUTaggerModel(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, char_hidden_dim, vocab_size, tagset_size):
        torch.manual_seed(1)
        super(CharGRUTaggerModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.char_hidden_dim = char_hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # We still use an LSTM to learn from the 300-dimensional semi-character
        # representation; it outputs hidden states with dimensionality
        # char_hidden_dim.
        self.char_lstm = nn.LSTM(300, char_hidden_dim, bidirectional = False)

        # The GRU takes the word embeddings concatenated with the character
        # LSTM output as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.gru = nn.GRU(embedding_dim+char_hidden_dim, hidden_dim, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, char_rep):
        char_lstm_out,_ = self.char_lstm(char_rep.view(len(char_rep), 1, -1))
        char_lstm_out = char_lstm_out.reshape(len(char_rep),self.char_hidden_dim)

        #Concatenate the word embeddings with the hidden state output.
        embeds = torch.cat([self.word_embeddings(sentence),char_lstm_out],1)
        
        #CHANGED: GRU output instead of LSTM.
        gru_out, _ = self.gru(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(gru_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [None]:
self.model = CharGRUTaggerModel(self.embedding_dim, self.hidden_dim, self.char_hidden, len(self.word_to_ix), len(self.tag_to_ix))
self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

## Results
We got a dev score of 77.1090 with the baseline method.

We got a dev score of 76.7605 with the second RNN (option 2) method using an LSTM.

We got a dev score of 77.4384 with the GRU model.


## Analysis
### For the Baseline Method
We got a dev score of 77.1090 with the implemented baseline method. 

In [3]:
chunker = LSTMTagger(os.path.join('../data', 'train.txt.gz'), os.path.join('../data', 'chunker'), '.tar')
print(chunker.model)
decoder_output = chunker.decode('../data/input/dev.txt')
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

LSTMTaggerModel(
  (word_embeddings): Embedding(9675, 128)
  (lstm): LSTM(428, 64)
  (hidden2tag): Linear(in_features=64, out_features=22, bias=True)
)


100%|██████████| 1027/1027 [00:04<00:00, 213.38it/s]


processed 23663 tokens with 11896 phrases; found: 11930 phrases; correct: 9186.
accuracy:  86.95%; (non-O)
accuracy:  87.91%; precision:  77.00%; recall:  77.22%; FB1:  77.11
             ADJP: precision:  45.56%; recall:  18.14%; FB1:  25.95  90
             ADVP: precision:  68.38%; recall:  46.73%; FB1:  55.52  272
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  75.38%; recall:  80.52%; FB1:  77.87  6662
               PP: precision:  91.37%; recall:  88.45%; FB1:  89.88  2363
              PRT: precision:  70.27%; recall:  57.78%; FB1:  63.41  37
             SBAR: precision:  86.29%; recall:  45.15%; FB1:  59.28  124
               VP: precision:  69.06%; recall:  71.40%; FB1:  70.21  2382


(76.99916177703268, 77.21923335574982, 77.10904054394359)


## Analysis

Of the suggested options, the baseline performs better than the second RNN (77.1090 versus 76.7605).
However, changing the LSTM layer to a GRU layer gives a small improvement of approximately 0.3 over the baseline (77.4384).
This is an interesting result because a GRU layer is usually considered simpler than an LSTM layer.
It may indicate that the LSTM-based model tends to overfit slightly more.

We could also tune the learning rate and the number of `epochs` when training the models to see whether the performance of the second RNN improves.

However, these models took a very long time to run, which shows that training them is expensive in terms of time. On the same simple dataset, the GRU model performed slightly better, which is notable given that it has fewer trainable parameters and a simpler architecture. The LSTM, on the other hand, is more complex, but might work better than the GRU if the training data were larger.
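
The parameter difference between the two layers can be estimated by hand. In PyTorch, a single-layer LSTM stores four gate blocks and a GRU stores three, each with input weights, recurrent weights and two bias vectors. The sketch below applies that formula; `embedding_dim` and `hidden_dim` are taken from the printed baseline model, while `char_hidden_dim = 64` is a hypothetical value because the notebook does not show it.

```python
def rnn_param_count(input_size, hidden_size, num_gates):
    """Parameters of one unidirectional PyTorch RNN layer with biases.

    Each gate block has input weights (hidden x input), recurrent weights
    (hidden x hidden) and two bias vectors (b_ih and b_hh).
    """
    return num_gates * hidden_size * (input_size + hidden_size + 2)

embedding_dim, hidden_dim = 128, 64  # from the printed baseline model
char_hidden_dim = 64                 # hypothetical: not shown in the notebook
input_size = embedding_dim + char_hidden_dim

lstm_params = rnn_param_count(input_size, hidden_dim, num_gates=4)
gru_params = rnn_param_count(input_size, hidden_dim, num_gates=3)
print(lstm_params, gru_params)   # 66048 49536
print(gru_params / lstm_params)  # 0.75
```

Under these assumptions the GRU layer has three quarters of the parameters of the LSTM layer it replaces, regardless of the exact dimensions.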