# chunker: default program

In [1]:
from default import *
import os

## Run the default solution on dev

In [4]:
chunker = LSTMTagger(os.path.join('data', 'train.txt.gz'), os.path.join('data', 'chunker'), '.tar')
decoder_output = chunker.decode('data/input/dev.txt')

100%|██████████| 1027/1027 [00:02<00:00, 459.66it/s]


## Evaluate the default output

In [5]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11672 phrases; correct: 8568.
accuracy:  84.35%; (non-O)
accuracy:  85.65%; precision:  73.41%; recall:  72.02%; FB1:  72.71
             ADJP: precision:  36.49%; recall:  11.95%; FB1:  18.00  74
             ADVP: precision:  71.36%; recall:  39.45%; FB1:  50.81  220
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  70.33%; recall:  76.80%; FB1:  73.42  6811
               PP: precision:  92.40%; recall:  87.14%; FB1:  89.69  2302
              PRT: precision:  65.00%; recall:  57.78%; FB1:  61.18  40
             SBAR: precision:  84.62%; recall:  41.77%; FB1:  55.93  117
               VP: precision:  63.66%; recall:  58.25%; FB1:  60.83  2108


(73.40644276901988, 72.02420981842637, 72.70875763747455)
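
The tuple above appears to be (precision, recall, FB1). As a sanity check, the sketch below recomputes those three numbers from the phrase counts in the report, assuming `conlleval` follows the usual CoNLL definitions (precision = correct / found, recall = correct / total, FB1 = harmonic mean of the two). The helper name `prf` is ours and is not part of `conlleval`.

```python
def prf(correct, found, total):
    """Precision, recall and FB1 (in percent) from phrase counts."""
    precision = 100.0 * correct / found
    recall = 100.0 * correct / total
    fb1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fb1

# Counts taken from the default-solution report above.
precision, recall, fb1 = prf(correct=8568, found=11672, total=11896)
print(f"{precision:.2f} {recall:.2f} {fb1:.2f}")  # 73.41 72.02 72.71
```

The same function reproduces the scores of the later runs when given their `found` and `correct` counts.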

## Documentation


### Some explanation of our code

***chunker.py*** is the script that achieves our best score.

***chunkerBase.py*** is the script that implements only the baseline method.

### Implement the character level representation

We concatenate a character-level representation of each word with its word embedding and feed the result to the chunker RNN.

Each word is encoded by three vectors of length `len(string.printable)`: v1 for the first character, v2 for the middle characters, and v3 for the last character. We separate the character-level representation into three cases:

1. **len(word) == 1**

    In this case, only v1 is non-zero; v2 and v3 contain only zeros.


2. **len(word) == 2**
    
    In this case, v1 and v3 each contain a single one, and v2 is all zeros.


3. **len(word) >= 3**

    In this case, v1 and v3 each contain a single one, and v2 holds the count of every character between the first and the last.
    
The three vectors are then concatenated into one vector of size 300 per word.
    


In [None]:
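import string
import torch

def prepare_char(seq, to_ix):
    """Encode each word in seq as a 3 * len(string.printable) count vector."""
    one_hot = []
    length = len(string.printable)
    for w in seq:
        start = torch.zeros(length)  # first character
        mid = torch.zeros(length)    # counts of the middle characters
        end = torch.zeros(length)    # last character
        if len(w) == 1:
            start[to_ix[w]] += 1
        elif len(w) == 2:
            start[to_ix[w[0]]] += 1
            end[to_ix[w[-1]]] += 1
        else: # >= 3 letters
            start[to_ix[w[0]]] += 1
            end[to_ix[w[-1]]] += 1
            for l in w[1:-1]:
                mid[to_ix[l]] += 1
        vector = torch.cat((start, mid, end), dim=-1)
        one_hot.append(vector)
    # torch.stack already returns a tensor, so convert it with .long() instead
    # of re-wrapping it in torch.tensor(), which triggers a UserWarning.
    return torch.stack(one_hot).long()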
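
To make the three cases concrete, here is a dependency-free sketch of the same encoding using plain lists instead of tensors. It assumes that `char_to_ix` maps each character to its position in `string.printable`; the notebook does not show how that mapping is built, so treat it as a guess. The helper `char_vector` is a hypothetical name used only here.

```python
import string

# Assumption: characters are indexed by their position in string.printable.
char_to_ix = {ch: i for i, ch in enumerate(string.printable)}
n_chars = len(string.printable)  # 100 printable characters

def char_vector(word):
    """Return the concatenated start/mid/end count vector for one word."""
    start, mid, end = [0] * n_chars, [0] * n_chars, [0] * n_chars
    start[char_to_ix[word[0]]] += 1
    if len(word) >= 2:
        end[char_to_ix[word[-1]]] += 1
    for ch in word[1:-1]:          # empty for words of length 1 or 2
        mid[char_to_ix[ch]] += 1
    return start + mid + end

for word in ["a", "of", "letter"]:
    vec = char_vector(word)
    start, mid, end = vec[:n_chars], vec[n_chars:2 * n_chars], vec[2 * n_chars:]
    print(word, len(vec), sum(start), sum(mid), sum(end), max(mid))
# a 300 1 0 0 0
# of 300 1 0 1 0
# letter 300 1 4 1 2
```

The last row shows that the middle vector stores counts rather than a strict one-hot pattern: "letter" has two `e`s and two `t`s between its first and last characters.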

### Implement a second RNN model

We use a second RNN that takes the character-level representation as input. The output of its hidden layer is concatenated with the word embedding to form the new input to the chunker RNN.

In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondRNN(nn.Module):

    def __init__(self, input_dim = 300, hidden_dim = 128, output_dim = 64, target_size = 22):
        torch.manual_seed(1)
        super(SecondRNN, self).__init__()
        self.hidden_dim = hidden_dim

        # LSTM over the per-word character-level vectors
        self.lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=False)

        # Maps the LSTM hidden state to the feature vector that is later
        # concatenated with the word embedding
        self.hidden1 = nn.Linear(hidden_dim, output_dim)
        # Maps the feature vector to tag space; only needed for training
        self.hidden2tag = nn.Linear(output_dim, target_size)

    def forward(self, chars):
        # chars has shape (sentence_length, input_dim)
        chars = chars.float()
        length = chars.shape[0]
        lstm_out, _ = self.lstm(chars.view(length, 1, -1))
        hidden_out = self.hidden1(lstm_out.view(length, -1))
        tag_space = self.hidden2tag(hidden_out.view(length, -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

## Analysis


### Result summary

**The baseline method gives a dev FB1 score of 77.01555275325767.**

**Adding the second RNN raises the dev FB1 score to 79.95543819112065.**

### Explanation

After implementing the baseline method, we looked for additional information to concatenate with the word embedding as input to the chunker RNN.

We chose Option 2: a second RNN that extracts more useful features from the character-level representation.

**The second RNN has the following structure:**

An **LSTM** takes the character-level representation of size 300 as input and outputs a hidden state of size 128.

The **"hidden1"** layer maps that hidden state to a vector of size 64, which is concatenated with the word embedding to form the new input to the chunker RNN.

The **"hidden2tag"** layer mirrors the "hidden2tag" layer of the chunker RNN (64 inputs, 22 tags) and is used only to train the second RNN.

In [9]:
RNN = SecondRNN()
print(RNN)

SecondRNN(
  (lstm): LSTM(300, 128)
  (hidden1): Linear(in_features=128, out_features=64, bias=True)
  (hidden2tag): Linear(in_features=64, out_features=22, bias=True)
)


As in the chunker model, the second RNN outputs **tag_scores**, which we use to train it. The optimizer is **stochastic gradient descent** with a **learning rate of 0.02**, and we train for **20 epochs**.

In [None]:
# Excerpt from the LSTMTagger class; it relies on sys, tqdm, torch,
# torch.nn as nn and torch.optim as optim being imported.

# Optimizer for the second RNN (set up once, outside the training loop)
self.optimizer_rnn = optim.SGD(self.rnn.parameters(), lr=0.02)

def train_RNN(self):
    loss_function = nn.NLLLoss()
    num_epochs = 20

    # Train the second RNN
    self.rnn.train()
    for epoch in range(num_epochs):
        for sentence, tags in tqdm.tqdm(self.training_data):
            # Step 1. PyTorch accumulates gradients, so clear them
            # before each instance.
            self.rnn.zero_grad()

            # Step 2. Get the inputs ready for the network: tag indices as
            # targets and character-level vectors as input.
            targets = prepare_sequence(tags, self.tag_to_ix, self.unk)
            char_in = prepare_char(sentence, self.char_to_ix)
            # Step 3. Run the forward pass.
            rnn_scores = self.rnn(char_in)
            # Step 4. Compute the loss and gradients, then update the
            # parameters by calling optimizer.step().
            loss = loss_function(rnn_scores, targets)
            loss.backward()
            self.optimizer_rnn.step()

        if epoch == num_epochs - 1:
            epoch_str = '' # last epoch so do not use epoch number in model filename
        else:
            epoch_str = str(epoch)
        savernnfile = self.rnnfile + epoch_str + self.modelsuffix
        print("saving model file: {}".format(savernnfile), file=sys.stderr)
        torch.save({
                    'epoch': epoch,
                    'model_state_dict': self.rnn.state_dict(),
                    'optimizer_state_dict': self.optimizer_rnn.state_dict()
                }, savernnfile)
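
For readers unfamiliar with the loss used here, the following pure-Python sketch shows what the combination of `log_softmax` and `nn.NLLLoss` computes for one sentence: the mean negative log-probability assigned to the gold tag at each position. The scores and tags are made-up illustration values over 3 tags (the real model uses 22), and the helper names are ours.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll_loss(log_probs_per_token, gold_tags):
    """Mean negative log-likelihood of the gold tag at each position."""
    picked = [lp[t] for lp, t in zip(log_probs_per_token, gold_tags)]
    return -sum(picked) / len(picked)

# Hypothetical raw scores for a 2-token sentence over 3 tags.
raw_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
gold_tags = [0, 2]

log_probs = [log_softmax(row) for row in raw_scores]
loss = nll_loss(log_probs, gold_tags)
print(round(loss, 4))
```

The loss shrinks towards zero as the model puts more probability mass on the gold tags, which is the signal that `loss.backward()` propagates through `hidden2tag`, `hidden1` and the LSTM.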

After training the second RNN, we discard its last layer, **"hidden2tag"**, since it is needed only for training. We then take the output of the second-to-last layer, **"hidden1"**, which is a vector of **size 64**, and concatenate it with the word embedding. The chunker RNN therefore receives an input of **size 192 (128 + 64)**.

We then train the chunker model with the same hyperparameters as the default solution.
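
The shape bookkeeping can be traced with a small PyTorch sketch. It uses randomly initialised stand-ins for the trained layers, a random tensor in place of the `prepare_char` output, and a hypothetical 5-word sentence, so it only illustrates how the dimensions line up. Whether the second RNN stays frozen while the chunker trains is not stated in this notebook; `torch.no_grad()` is used here purely because we are only inspecting shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sentence_len = 5      # hypothetical sentence length
vocab_size = 9675     # matches Embedding(9675, 128) in the chunker model

# Stand-ins for the trained layers (random weights, for shape tracing only).
char_lstm = nn.LSTM(300, 128)                    # second RNN over character vectors
hidden1 = nn.Linear(128, 64)                     # feature layer kept after training
word_embeddings = nn.Embedding(vocab_size, 128)
chunker_lstm = nn.LSTM(192, 64)                  # chunker LSTM with the wider input

word_ids = torch.randint(0, vocab_size, (sentence_len,))
char_vectors = torch.rand(sentence_len, 300)     # placeholder for prepare_char output

with torch.no_grad():
    lstm_out, _ = char_lstm(char_vectors.view(sentence_len, 1, -1))
    char_features = hidden1(lstm_out.view(sentence_len, -1))    # (5, 64)
    embeds = word_embeddings(word_ids)                          # (5, 128)
    chunker_input = torch.cat((embeds, char_features), dim=1)   # (5, 192)
    chunker_out, _ = chunker_lstm(chunker_input.view(sentence_len, 1, -1))

print(chunker_input.shape)  # torch.Size([5, 192])
print(chunker_out.shape)    # torch.Size([5, 1, 64])
```

Note that `hidden2tag` of the second RNN never appears in this path, which is why it can be dropped once training is finished.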

**The following is the new structure of the chunker model**

In [6]:
from chunker import *
import os
chunker = LSTMTagger(os.path.join(os.path.dirname(os.getcwd()), 'data', 'train.txt.gz'), os.path.join(os.path.dirname(os.getcwd()), 'data', 'chunker'), os.path.join(os.path.dirname(os.getcwd()), 'data', 'rnn'), '.tar')
print(chunker.model)

LSTMTaggerModel(
  (word_embeddings): Embedding(9675, 128)
  (lstm): LSTM(192, 64)
  (hidden2tag): Linear(in_features=64, out_features=22, bias=True)
)


## Comparison of results between only baseline and baseline + second RNN

### Result of the baseline method

In [4]:
from chunkerBase import *
import os

In [5]:
chunker = LSTMTagger(os.path.join(os.path.dirname(os.getcwd()), 'data', 'train.txt.gz'), os.path.join(os.path.dirname(os.getcwd()), 'data', 'chunkerBase'), '.tar')
decoder_output = chunker.decode(os.path.join(os.path.dirname(os.getcwd()), 'data/input/dev.txt'))

  return torch.tensor(one_hot, dtype=torch.long)
100%|██████████| 1027/1027 [00:02<00:00, 343.76it/s]


In [6]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11894 phrases; correct: 9161.
accuracy:  86.83%; (non-O)
accuracy:  87.82%; precision:  77.02%; recall:  77.01%; FB1:  77.02
             ADJP: precision:  46.07%; recall:  18.14%; FB1:  26.03  89
             ADVP: precision:  66.79%; recall:  46.48%; FB1:  54.81  277
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  75.30%; recall:  80.28%; FB1:  77.71  6649
               PP: precision:  91.55%; recall:  88.37%; FB1:  89.93  2356
              PRT: precision:  69.44%; recall:  55.56%; FB1:  61.73  36
             SBAR: precision:  85.60%; recall:  45.15%; FB1:  59.12  125
               VP: precision:  69.39%; recall:  71.14%; FB1:  70.25  2362


(77.02202791323356, 77.00907868190988, 77.01555275325767)

### Result of baseline + second RNN

With the second RNN we get a dev score of 79.95543819112065.

In [1]:
from chunker import *
import os

In [2]:
chunker = LSTMTagger(os.path.join(os.path.dirname(os.getcwd()), 'data', 'train.txt.gz'), os.path.join(os.path.dirname(os.getcwd()), 'data', 'chunker'), os.path.join(os.path.dirname(os.getcwd()), 'data', 'rnn'), '.tar')
decoder_output = chunker.decode(os.path.join(os.path.dirname(os.getcwd()), 'data/input/dev.txt'))

  return torch.tensor(one_hot, dtype=torch.long)
100%|██████████| 1027/1027 [00:05<00:00, 200.94it/s]


In [3]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join(os.path.dirname(os.getcwd()), 'data','reference','dev.out')) as r:
    for sent in conlleval.read_file(r):
        true_seqs += sent.split()
conlleval.evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 12340 phrases; correct: 9689.
accuracy:  88.33%; (non-O)
accuracy:  89.30%; precision:  78.52%; recall:  81.45%; FB1:  79.96
             ADJP: precision:  45.69%; recall:  23.45%; FB1:  30.99  116
             ADVP: precision:  61.05%; recall:  55.53%; FB1:  58.16  362
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  78.01%; recall:  84.14%; FB1:  80.96  6727
               PP: precision:  91.42%; recall:  89.96%; FB1:  90.69  2402
              PRT: precision:  71.05%; recall:  60.00%; FB1:  65.06  38
             SBAR: precision:  77.56%; recall:  51.05%; FB1:  61.58  156
               VP: precision:  71.80%; recall:  79.12%; FB1:  75.28  2539


(78.51701782820098, 81.44754539340954, 79.95543819112065)
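
To see where the improvement comes from, the sketch below compares the per-type FB1 values copied from the two reports above. CONJP and INTJ are left out because both runs score 0.00 on them.

```python
# FB1 per chunk type, copied from the two conlleval reports above.
baseline = {"ADJP": 26.03, "ADVP": 54.81, "NP": 77.71, "PP": 89.93,
            "PRT": 61.73, "SBAR": 59.12, "VP": 70.25}
second_rnn = {"ADJP": 30.99, "ADVP": 58.16, "NP": 80.96, "PP": 90.69,
              "PRT": 65.06, "SBAR": 61.58, "VP": 75.28}

gains = {tag: round(second_rnn[tag] - baseline[tag], 2) for tag in baseline}

# Print the chunk types sorted by FB1 gain, largest first.
for tag, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:>5}: {baseline[tag]:6.2f} -> {second_rnn[tag]:6.2f}  (+{gain:.2f})")
```

Every listed chunk type improves; VP and ADJP gain the most (about 5 FB1 points each), while PP, which was already close to 90, gains the least.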