# chunker

Phrasal chunking is the task of finding non-recursive syntactic groups of words.

By implementing the Baseline solution, we achieved an FB1 score of 77.11; with further improvements, we reached an FB1 score of 78.73.

## 1. Baseline Implementation

To deal with noisy input, we implemented a semi-character RNN, as directed by the homework instructions.

### 1.1 Character Level Representation

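First, we create a character-level representation of each word by adding a function create_char_level_vectors to the default solution. For each word it builds three vectors: a one-hot vector v1 for the first character, a count vector v2 for the interior characters, and a one-hot vector v3 for the last character. The final character-level representation is the concatenation of the three vectors.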

In [23]:
from chunker import *

In [24]:
import torch

def create_char_level_vectors(sentence, word_to_ix, char_to_ix):
    # word_to_ix is unused here; it is kept so existing call sites stay unchanged.
    width = len(char_to_ix)
    vectors = []
    for word in sentence:
        v1 = torch.zeros([width])
        v2 = torch.zeros([width])
        v3 = torch.zeros([width])
        # 1. Create a one-hot vector v1 for the first character of the word.
        v1[char_to_ix[word[0]]] = 1

        # 2. Create a vector v2 where the index of a character has the count
        #    of that character in the interior of the word (first and last excluded).
        for c in word[1:-1]:
            v2[char_to_ix[c]] += 1

        # 3. Create a one-hot vector v3 for the last character of the word.
        v3[char_to_ix[word[-1]]] = 1

        # 4. Concatenate the three vectors to get the final character level representation.
        vectors.append(torch.cat((v1, v2, v3), dim=0))
    if not vectors:
        # Empty sentence: return an empty tensor with the expected row width.
        return torch.zeros([0, 3 * width])
    # Stack into an n x (3 * width) tensor, one row per word.
    return torch.stack(vectors)
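
To make the three parts concrete, here is a minimal pure-Python sketch of the same idea on a toy alphabet. The helper name `semi_char_vector` and the four-letter alphabet are assumptions made for illustration; the function above works on `string.printable` and torch tensors instead.

```python
def semi_char_vector(word, alphabet):
    """Return [first one-hot | interior counts | last one-hot] as a flat list."""
    index = {c: i for i, c in enumerate(alphabet)}
    width = len(alphabet)
    v1 = [0] * width  # one-hot for the first character
    v2 = [0] * width  # counts of the interior characters
    v3 = [0] * width  # one-hot for the last character
    v1[index[word[0]]] = 1
    for c in word[1:-1]:
        v2[index[c]] += 1
    v3[index[word[-1]]] = 1
    return v1 + v2 + v3


vec = semi_char_vector("abba", "abcd")
print(vec)  # [1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0]

# Interior order does not matter, so a swap inside the word leaves the vector unchanged.
print(semi_char_vector("abcd", "abcd") == semi_char_vector("acbd", "abcd"))  # True
```

Because only the counts of the interior characters are kept, a word with scrambled interior letters maps to the same vector as the clean word, which is what makes this representation useful for noisy input.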

### 1.2 Concatenate to the word embedding

After creating the character-level representation of each word, we use the first method for combining the semi-character RNN idea with the phrasal chunker: we concatenate the character-level vector to the word embedding that is fed to the chunker RNN. Because the character-level vector has length 300 (char_vector_dim: v1, v2 and v3 at 100 entries each), the LSTM input size must grow from embedding_dim to embedding_dim + char_vector_dim. The model's forward pass now receives both the word indices and the character-level vectors for each sentence.
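
The size arithmetic can be checked with a short sketch. The variable names below are chosen for illustration; the values come from the defaults used in this notebook (an embedding size of 128 and the `string.printable` alphabet).

```python
import string

embedding_dim = 128                      # default word embedding size in LSTMTagger
alphabet_size = len(string.printable)    # 100 printable ASCII characters
char_vector_dim = 3 * alphabet_size      # v1, v2 and v3 concatenated
lstm_input_dim = embedding_dim + char_vector_dim

print(alphabet_size, char_vector_dim, lstm_input_dim)  # 100 300 428
```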

To achieve this, we modified the class LSTMTaggerModel:

In [25]:
class LSTMTaggerModel(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        torch.manual_seed(1)
        super(LSTMTaggerModel, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        # The character level vector is concatenated before the LSTM, so the input
        # size increases from 128 to 128 + 3 * 100 = 428, where 128 is embedding_dim
        # and 100 is len(string.printable), once for each of the vectors v1, v2, v3.
        self.lstm = nn.LSTM(embedding_dim+300, hidden_dim, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence, char_vectors):
        embeds = self.word_embeddings(sentence)

        # concatenate embedding vector with character level vector before lstm
        embeds = torch.cat((embeds,char_vectors), dim=1)

        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

After concatenating the character-level vector to the word embedding, we make the corresponding changes in the class LSTMTagger: argmax() and train() now call create_char_level_vectors() and pass the result to self.model() alongside the word indices:

In [26]:
class LSTMTagger:

    def __init__(self, trainfile, modelfile, modelsuffix, unk="[UNK]", epochs=10, embedding_dim=128, hidden_dim=64):
        self.unk = unk
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.epochs = epochs
        self.modelfile = modelfile
        self.modelsuffix = modelsuffix
        self.training_data = []
        if trainfile[-3:] == '.gz':
            with gzip.open(trainfile, 'rt') as f:
                self.training_data = read_conll(f)
        else:
            with open(trainfile, 'r') as f:
                self.training_data = read_conll(f)

        self.word_to_ix = {} # replaces words with an index (one-hot vector)
        self.char_to_ix = {} # replaces character with an index
        self.tag_to_ix = {} # replace output labels / tags with an index
        self.ix_to_tag = [] # during inference we produce tag indices so we have to map it back to a tag

        for sent, tags in self.training_data:
            for word in sent:
                if word not in self.word_to_ix:
                    self.word_to_ix[word] = len(self.word_to_ix)
            for tag in tags:
                if tag not in self.tag_to_ix:
                    self.tag_to_ix[tag] = len(self.tag_to_ix)
                    self.ix_to_tag.append(tag)

        for c in string.printable:
            self.char_to_ix[c] = string.printable.find(c)

        logging.info("char_to_ix: %s", self.char_to_ix)
        logging.info("word_to_ix: %s", self.word_to_ix)
        logging.info("tag_to_ix: %s", self.tag_to_ix)
        logging.info("ix_to_tag: %s", self.ix_to_tag)

        self.model = LSTMTaggerModel(self.embedding_dim, self.hidden_dim, len(self.word_to_ix), len(self.tag_to_ix))
        self.optimizer = optim.SGD(self.model.parameters(), lr=0.01)

    def argmax(self, seq):
        output = []
        with torch.no_grad():
            inputs = prepare_sequence(seq, self.word_to_ix, self.unk)
            char_vectors = create_char_level_vectors(seq, self.word_to_ix, self.char_to_ix)
            tag_scores = self.model(inputs,char_vectors)
            for i in range(len(inputs)):
                output.append(self.ix_to_tag[int(tag_scores[i].argmax(dim=0))])
        return output

    def train(self):
        loss_function = nn.NLLLoss()

        self.model.train()
        loss = float("inf")
        for epoch in range(self.epochs):
            # random.shuffle(generate_noise(self.training_data))
            for sentence, tags in tqdm.tqdm(self.training_data):
                # Step 1. Remember that Pytorch accumulates gradients.
                # We need to clear them out before each instance
                self.model.zero_grad()

                # Step 2. Get our inputs ready for the network, that is, turn them into
                # Tensors of word indices.
                sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
                targets = prepare_sequence(tags, self.tag_to_ix, self.unk)

                # Step 2.1. Create character level vector representation of size n x 300
                char_vectors = create_char_level_vectors(sentence, self.word_to_ix, self.char_to_ix)

                # Step 3. Run our forward pass.
                tag_scores = self.model(sentence_in, char_vectors)

                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores, targets)
                loss.backward()
                self.optimizer.step()

            if epoch == self.epochs-1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print("saving model file: {}".format(savefile), file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.optimizer.state_dict(),
                        'loss': loss,
                        'unk': self.unk,
                        'word_to_ix': self.word_to_ix,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                    }, savefile)

    def decode(self, inputfile):
        if inputfile[-3:] == '.gz':
            with gzip.open(inputfile, 'rt') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)
        else:
            with open(inputfile, 'r') as f:
                input_data = read_conll(f, input_idx=0, label_idx=-1)

        if not os.path.isfile(self.modelfile + self.modelsuffix):
            raise IOError("Error: missing model file {}".format(self.modelfile + self.modelsuffix))

        saved_model = torch.load(self.modelfile + self.modelsuffix)
        self.model.load_state_dict(saved_model['model_state_dict'])
        self.optimizer.load_state_dict(saved_model['optimizer_state_dict'])
        epoch = saved_model['epoch']
        loss = saved_model['loss']
        self.unk = saved_model['unk']
        self.word_to_ix = saved_model['word_to_ix']
        self.tag_to_ix = saved_model['tag_to_ix']
        self.ix_to_tag = saved_model['ix_to_tag']
        self.model.eval()
        decoder_output = []
        for sent in tqdm.tqdm(input_data):
            decoder_output.append(self.argmax(sent))
        return decoder_output

### 1.3 Run the Baseline Solution on dev

Download the trained baseline model, chunker77.11.tar, with the wget command below:

In [29]:
!wget --no-check-certificate -O chunker77.11.tar "https://onedrive.live.com/download?cid=1ED3E57B6F717CEC&resid=1ED3E57B6F717CEC%21123632&authkey=AOBFktG3ATadiE0"

--2019-11-07 21:07:51--  https://onedrive.live.com/download?cid=1ED3E57B6F717CEC&resid=1ED3E57B6F717CEC%21123632&authkey=AOBFktG3ATadiE0
Resolving onedrive.live.com (onedrive.live.com)... 13.107.42.13
Connecting to onedrive.live.com (onedrive.live.com)|13.107.42.13|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nihy2w.by.files.1drv.com/y4mxtrAf6Pq8M4QTE3vAD0EzrGhwRdxgCQS1eZfB-SxfnUqnwLUJyS5asCpU3eJH20cLVwGAvuYTZC0gD0VViFHretsnaXMiETnZ2AiLc4Yq8EVAkxqDFxcmkwnkU8DqCL1kQi2VwScoregOKJAi8PCN_0TRmfu0gbo3hoKbnv8NiV3hFLIrzQZyg0TrXNfh0BWobV-wgwFG2-872k-0w4vLw/chunker77.11.tar?download&psid=1 [following]
--2019-11-07 21:07:52--  https://nihy2w.by.files.1drv.com/y4mxtrAf6Pq8M4QTE3vAD0EzrGhwRdxgCQS1eZfB-SxfnUqnwLUJyS5asCpU3eJH20cLVwGAvuYTZC0gD0VViFHretsnaXMiETnZ2AiLc4Yq8EVAkxqDFxcmkwnkU8DqCL1kQi2VwScoregOKJAi8PCN_0TRmfu0gbo3hoKbnv8NiV3hFLIrzQZyg0TrXNfh0BWobV-wgwFG2-872k-0w4vLw/chunker77.11.tar?download&psid=1
Resolving nihy2w.by.files.1drv.com (nihy2w.by.file

After training, we have the chunker model file, which we load to decode the dev set:

In [30]:
chunker = LSTMTagger(os.path.join('../data', 'train.txt.gz'), 'chunker77.11', '.tar')
decoder_output = chunker.decode('../data/input/dev.txt')

100%|██████████| 1027/1027 [00:02<00:00, 364.99it/s]


### 1.4 Evaluate the baseline Output

To evaluate the result against the reference dev.out, we use the functions provided in conlleval.py. (Because we ran into issues importing the conlleval module, we copied its code directly into the cell below.)

In [31]:
"""
Author: https://github.com/sighsmile/conlleval
Modified by: Anoop Sarkar (anoopsarkar.github.io)

This script applies to IOB2 or IOBES tagging scheme.
If you are using a different scheme, please convert to IOB2 or IOBES.

"""

import sys, re
from collections import defaultdict

def split_tag(chunk_tag):
    """
    split chunk tag into IOBES prefix and chunk_type
    e.g. 
    B-PER -> (B, PER)
    O -> (O, None)
    """
    if chunk_tag == 'O':
        return ('O', None)
    return chunk_tag.split('-', maxsplit=1)

def is_chunk_end(prev_tag, tag):
    """
    check if the previous chunk ended between the previous and current word
    e.g. 
    (B-PER, I-PER) -> False
    (B-LOC, O)  -> True

    Note: in case of contradicting tags, e.g. (B-PER, I-LOC)
    this is considered as (B-PER, B-LOC)
    """
    prefix1, chunk_type1 = split_tag(prev_tag)
    prefix2, chunk_type2 = split_tag(tag)

    if prefix1 == 'O':
        return False
    if prefix2 == 'O':
        return prefix1 != 'O'

    if chunk_type1 != chunk_type2:
        return True

    return prefix2 in ['B', 'S'] or prefix1 in ['E', 'S']

def is_chunk_start(prev_tag, tag):
    """
    check if a new chunk started between the previous and current word
    """
    prefix1, chunk_type1 = split_tag(prev_tag)
    prefix2, chunk_type2 = split_tag(tag)

    if prefix2 == 'O':
        return False
    if prefix1 == 'O':
        return prefix2 != 'O'

    if chunk_type1 != chunk_type2:
        return True

    return prefix2 in ['B', 'S'] or prefix1 in ['E', 'S']


def calc_metrics(tp, p, t, percent=True):
    """
    compute overall precision, recall and FB1 (default values are 0.0)
    if percent is True, return 100 * original decimal value
    """
    precision = tp / p if p else 0
    recall = tp / t if t else 0
    fb1 = 2 * precision * recall / (precision + recall) if precision + recall else 0
    if percent:
        return 100 * precision, 100 * recall, 100 * fb1
    else:
        return precision, recall, fb1


def count_chunks(true_seqs, pred_seqs):
    """
    true_seqs: a list of true tags
    pred_seqs: a list of predicted tags

    return: 
    correct_chunks: a dict (counter), 
                    key = chunk types, 
                    value = number of correctly identified chunks per type
    true_chunks:    a dict, number of true chunks per type
    pred_chunks:    a dict, number of identified chunks per type

    correct_counts, true_counts, pred_counts: similar to above, but for tags
    """
    correct_chunks = defaultdict(int)
    true_chunks = defaultdict(int)
    pred_chunks = defaultdict(int)

    correct_counts = defaultdict(int)
    true_counts = defaultdict(int)
    pred_counts = defaultdict(int)

    prev_true_tag, prev_pred_tag = 'O', 'O'
    correct_chunk = None

    for true_tag, pred_tag in zip(true_seqs, pred_seqs):
        if true_tag == pred_tag:
            correct_counts[true_tag] += 1
        true_counts[true_tag] += 1
        pred_counts[pred_tag] += 1

        _, true_type = split_tag(true_tag)
        _, pred_type = split_tag(pred_tag)

        if correct_chunk is not None:
            true_end = is_chunk_end(prev_true_tag, true_tag)
            pred_end = is_chunk_end(prev_pred_tag, pred_tag)

            if pred_end and true_end:
                correct_chunks[correct_chunk] += 1
                correct_chunk = None
            elif pred_end != true_end or true_type != pred_type:
                correct_chunk = None

        true_start = is_chunk_start(prev_true_tag, true_tag)
        pred_start = is_chunk_start(prev_pred_tag, pred_tag)

        if true_start and pred_start and true_type == pred_type:
            correct_chunk = true_type
        if true_start:
            true_chunks[true_type] += 1
        if pred_start:
            pred_chunks[pred_type] += 1

        prev_true_tag, prev_pred_tag = true_tag, pred_tag
    if correct_chunk is not None:
        correct_chunks[correct_chunk] += 1

    return (correct_chunks, true_chunks, pred_chunks, 
        correct_counts, true_counts, pred_counts)

def get_result(correct_chunks, true_chunks, pred_chunks,
    correct_counts, true_counts, pred_counts, verbose=True):
    """
    if verbose, print overall performance, as well as performance per chunk type;
    otherwise, simply return overall prec, rec, f1 scores
    """
    # sum counts
    sum_correct_chunks = sum(correct_chunks.values())
    sum_true_chunks = sum(true_chunks.values())
    sum_pred_chunks = sum(pred_chunks.values())

    sum_correct_counts = sum(correct_counts.values())
    sum_true_counts = sum(true_counts.values())

    nonO_correct_counts = sum(v for k, v in correct_counts.items() if k != 'O')
    nonO_true_counts = sum(v for k, v in true_counts.items() if k != 'O')

    chunk_types = sorted(list(set(list(true_chunks) + list(pred_chunks))))

    # compute overall precision, recall and FB1 (default values are 0.0)
    prec, rec, f1 = calc_metrics(sum_correct_chunks, sum_pred_chunks, sum_true_chunks)
    res = (prec, rec, f1)
    if not verbose:
        return res

    # print overall performance, and performance per chunk type
    
    print("processed %i tokens with %i phrases; " % (sum_true_counts, sum_true_chunks), end='')
    print("found: %i phrases; correct: %i.\n" % (sum_pred_chunks, sum_correct_chunks), end='')
        
    print("accuracy: %6.2f%%; (non-O)" % (100*nonO_correct_counts/nonO_true_counts))
    print("accuracy: %6.2f%%; " % (100*sum_correct_counts/sum_true_counts), end='')
    print("precision: %6.2f%%; recall: %6.2f%%; FB1: %6.2f" % (prec, rec, f1))

    # for each chunk type, compute precision, recall and FB1 (default values are 0.0)
    for t in chunk_types:
        prec, rec, f1 = calc_metrics(correct_chunks[t], pred_chunks[t], true_chunks[t])
        print("%17s: " %t , end='')
        print("precision: %6.2f%%; recall: %6.2f%%; FB1: %6.2f" %
                    (prec, rec, f1), end='')
        print("  %d" % pred_chunks[t])

    return res
    # you can generate LaTeX output for tables like in
    # http://cnts.uia.ac.be/conll2003/ner/example.tex
    # but I'm not implementing this

def evaluate(true_seqs, pred_seqs, verbose=True):
    (correct_chunks, true_chunks, pred_chunks,
        correct_counts, true_counts, pred_counts) = count_chunks(true_seqs, pred_seqs)
    result = get_result(correct_chunks, true_chunks, pred_chunks,
        correct_counts, true_counts, pred_counts, verbose=verbose)
    return result

def evaluate_conll_file(fh):
    true_seqs, pred_seqs = [], []
    
    for line in fh:
        cols = line.strip().split()
        # each non-empty line must contain >= 3 columns
        if not cols:
            true_seqs.append('O')
            pred_seqs.append('O')
        elif len(cols) < 3:
            raise IOError("conlleval: too few columns in line %s\n" % line)
        else:
            # extract tags from last 2 columns
            true_seqs.append(cols[-2])
            pred_seqs.append(cols[-1])
    return evaluate(true_seqs, pred_seqs)

def read_file(handle):
    contents = re.sub(r'\n\s*\n', r'\n\n', handle.read())
    contents = contents.rstrip()
    return contents.split('\n\n')
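
The chunk-level scoring idea behind these functions can be illustrated on a toy example. The sketch below is a simplified stand-in, not the conlleval code: `extract_chunks` and `chunk_f1` are hypothetical helpers that handle IOB2 tags only and compare chunks as (type, start, end) spans.

```python
def extract_chunks(tags):
    """Return a set of (chunk_type, start, end) spans from IOB2 tags; end is exclusive."""
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        prefix, _, t = tag.partition("-")
        # A chunk ends at O, at a B- tag, or at an I- tag of a different type.
        if ctype is not None and (prefix != "I" or t != ctype):
            chunks.add((ctype, start, i))
            start, ctype = None, None
        # A chunk starts at B-, or at an I- tag that does not continue a chunk.
        if prefix in ("B", "I") and ctype is None:
            start, ctype = i, t
    return chunks


def chunk_f1(true_tags, pred_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    true_chunks = extract_chunks(true_tags)
    pred_chunks = extract_chunks(pred_tags)
    correct = len(true_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(true_chunks) if true_chunks else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-NP", "I-NP", "B-VP", "B-NP", "O"]
pred = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]
precision, recall, f1 = chunk_f1(gold, pred)
print(precision, recall, f1)
```

Here the prediction gets four of five tags right, but only one of its two chunks matches a gold chunk exactly, so the chunk-level score is much lower than the tag accuracy. This is why FB1 and accuracy differ in the reports in this notebook.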



We can use the evaluate function to see the FB1 score for the implementation:

In [32]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in read_file(r):
        true_seqs += sent.split()
    evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11930 phrases; correct: 9186.
accuracy:  86.95%; (non-O)
accuracy:  87.91%; precision:  77.00%; recall:  77.22%; FB1:  77.11
             ADJP: precision:  45.56%; recall:  18.14%; FB1:  25.95  90
             ADVP: precision:  68.38%; recall:  46.73%; FB1:  55.52  272
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  75.38%; recall:  80.52%; FB1:  77.87  6662
               PP: precision:  91.37%; recall:  88.45%; FB1:  89.88  2363
              PRT: precision:  70.27%; recall:  57.78%; FB1:  63.41  37
             SBAR: precision:  86.29%; recall:  45.15%; FB1:  59.28  124
               VP: precision:  69.06%; recall:  71.40%; FB1:  70.21  2382


## 2. Improvement

### 2.1 Shuffle the training data

First, we tried shuffling the training data to see whether it improves generalization. We modified the train function inside the class LSTMTagger by adding a shuffling step in every epoch.
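
A per-epoch shuffle of this kind can be sketched as follows. This is an illustration under assumptions, not the code used for the experiment: `iter_epochs` is a hypothetical helper, and the toy sentences stand in for the (sentence, tags) pairs returned by read_conll.

```python
import random


def iter_epochs(training_data, epochs, seed=0):
    """Yield a freshly shuffled copy of the training data for each epoch."""
    rng = random.Random(seed)          # local generator, so global random state is untouched
    for _ in range(epochs):
        order = list(training_data)    # copy: the original list keeps its order
        rng.shuffle(order)
        yield order


toy_data = [(["He", "runs"], ["B-NP", "B-VP"]),
            (["Stop"], ["B-VP"]),
            (["The", "dog"], ["B-NP", "I-NP"])]
epochs_seen = list(iter_epochs(toy_data, epochs=3))
for order in epochs_seen:
    print([sentence for sentence, _ in order])
```

Inside train(), the inner loop would then iterate over the shuffled copy instead of self.training_data.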

The improvement was not significant, so we tried another approach.

### 2.2 Modify Loss Value

To improve the FB1 score, we multiply the loss value by a constant scale_factor (set to 2.0 below).

Inside the LSTMTagger train function, in step 4 (compute the loss, gradients, and update the parameters by calling optimizer.step()), we scale the loss as shown below:

In [33]:
    def train(self):
        loss_function = nn.NLLLoss()
        scale_factor = 2.0
        self.model.train()
        loss = float("inf")
        for epoch in range(self.epochs):
            # random.shuffle(generate_noise(self.training_data))
            for sentence, tags in tqdm.tqdm(self.training_data):
                # Step 1. Remember that Pytorch accumulates gradients.
                # We need to clear them out before each instance
                self.model.zero_grad()

                # Step 2. Get our inputs ready for the network, that is, turn them into
                # Tensors of word indices.
                sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
                targets = prepare_sequence(tags, self.tag_to_ix, self.unk)

                # Step 2.1. Create character level vector representation of size n x 300
                char_vectors = create_char_level_vectors(sentence, self.word_to_ix, self.char_to_ix)

                # Step 3. Run our forward pass.
                tag_scores = self.model(sentence_in, char_vectors)

                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores, targets) * scale_factor
                loss.backward()
                self.optimizer.step()

            if epoch == self.epochs-1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print("saving model file: {}".format(savefile), file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.optimizer.state_dict(),
                        'loss': loss,
                        'unk': self.unk,
                        'word_to_ix': self.word_to_ix,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                    }, savefile)
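
One way to read this change: with the plain SGD optimizer configured in LSTMTagger (lr=0.01, no momentum), multiplying the loss by a constant multiplies every gradient by the same constant, so a scale_factor of 2.0 should act like doubling the learning rate. The sketch below checks this on a hypothetical one-parameter squared-error loss; `sgd_step` and `grad_of_loss` are illustrative names, not part of the chunker code.

```python
def sgd_step(w, grad, lr):
    """One plain SGD update (no momentum, no weight decay)."""
    return w - lr * grad


def grad_of_loss(w, x, y, scale=1.0):
    """Gradient of scale * (w * x - y) ** 2 with respect to w."""
    return scale * 2.0 * (w * x - y) * x


w0, x, y = 0.5, 3.0, 2.0

# Scaled loss with the original learning rate ...
w_scaled_loss = sgd_step(w0, grad_of_loss(w0, x, y, scale=2.0), lr=0.01)
# ... versus the unscaled loss with a doubled learning rate.
w_scaled_lr = sgd_step(w0, grad_of_loss(w0, x, y), lr=0.02)

print(abs(w_scaled_loss - w_scaled_lr) < 1e-12)  # True
```

This equivalence is exact only for plain SGD; it would not hold for optimizers with adaptive step sizes such as Adam.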

#### 2.2.1 Run the Improved Solution on dev

Download the trained model for this result, chunker78.73.tar, with the wget command below:

In [37]:
!wget --no-check-certificate -O chunker78.73.tar "https://onedrive.live.com/download?cid=1ED3E57B6F717CEC&resid=1ED3E57B6F717CEC%21123630&authkey=AMc9o1GHbrKlywA"

--2019-11-07 21:10:30--  https://onedrive.live.com/download?cid=1ED3E57B6F717CEC&resid=1ED3E57B6F717CEC%21123630&authkey=AMc9o1GHbrKlywA
Resolving onedrive.live.com (onedrive.live.com)... 13.107.42.13
Connecting to onedrive.live.com (onedrive.live.com)|13.107.42.13|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ampbiq.by.files.1drv.com/y4mWhvVxS-IhruHigEm7G-aXRy1V54DPYLu2ufQEcKrkJD4llaqtSaDS2rsimRV_r45NAA3vyA9QUapXsAffrwC7O6yTdz-U6XejSlNILDC87eR5-1QBzngZGPmhD9s1gc72bK9DmUhRbQqUCWiEOaDmW4AUzfPQXgYQdpITzjO3lJ27dSB79O3M4_J23ZJgCtBcbcTR99jbKhZlZKBJF62yQ/chunker78.72.tar?download&psid=1 [following]
--2019-11-07 21:10:31--  https://ampbiq.by.files.1drv.com/y4mWhvVxS-IhruHigEm7G-aXRy1V54DPYLu2ufQEcKrkJD4llaqtSaDS2rsimRV_r45NAA3vyA9QUapXsAffrwC7O6yTdz-U6XejSlNILDC87eR5-1QBzngZGPmhD9s1gc72bK9DmUhRbQqUCWiEOaDmW4AUzfPQXgYQdpITzjO3lJ27dSB79O3M4_J23ZJgCtBcbcTR99jbKhZlZKBJF62yQ/chunker78.72.tar?download&psid=1
Resolving ampbiq.by.files.1drv.com (ampbiq.by.file

After training, we have the model file for the scaled-loss chunker, which we load to decode the dev set:

In [38]:
chunker = LSTMTagger(os.path.join('../data', 'train.txt.gz'), 'chunker78.73', '.tar')
decoder_output = chunker.decode('../data/input/dev.txt')

100%|██████████| 1027/1027 [00:02<00:00, 353.75it/s]


#### 2.2.2 Evaluate the improved Output

In [39]:
flat_output = [ output for sent in decoder_output for output in sent ]
import conlleval
true_seqs = []
with open(os.path.join('../data','reference','dev.out')) as r:
    for sent in read_file(r):
        true_seqs += sent.split()
    evaluate(true_seqs, flat_output)

processed 23663 tokens with 11896 phrases; found: 11986 phrases; correct: 9401.
accuracy:  87.81%; (non-O)
accuracy:  88.69%; precision:  78.43%; recall:  79.03%; FB1:  78.73
             ADJP: precision:  46.34%; recall:  25.22%; FB1:  32.66  123
             ADVP: precision:  66.25%; recall:  53.27%; FB1:  59.05  320
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  77.02%; recall:  81.74%; FB1:  79.31  6619
               PP: precision:  92.81%; recall:  88.32%; FB1:  90.51  2323
              PRT: precision:  77.78%; recall:  62.22%; FB1:  69.14  36
             SBAR: precision:  78.62%; recall:  48.10%; FB1:  59.69  145
               VP: precision:  71.74%; recall:  75.35%; FB1:  73.50  2420


By scaling the loss function, we reached an FB1 score of 78.73.

### 2.3 Try with Data Augmentation

To further improve the score, we implemented a data augmentation function.
The function is simple: for each word of a sentence with at least three characters, it randomly swaps two adjacent characters, drops a character, or leaves the word unchanged. However, the final result does not change much after this change; the score on dev is similar to section 2.2.

In [74]:
import random

def data_shuffle(training_data, random_seed=0):
    new_training_data = []
    for sentence, tags in training_data:
        new_sentence = []
        for word in sentence:
            new_word = word  # words shorter than 3 characters stay unchanged
            if len(word) >= 3:
                # Re-seed before each draw so the same training data
                # always receives the same noise.
                random.seed(random_seed)
                rand = random.randint(0, 2)
                random_seed += 1

                random.seed(random_seed)
                target_index = random.randint(1, len(word) - 2)
                random_seed += 1

                if rand == 2:
                    # Swap the characters at target_index and target_index + 1,
                    # keeping the rest of the word intact.
                    new_word = (word[:target_index] + word[target_index + 1]
                                + word[target_index] + word[target_index + 2:])
                elif rand == 1:
                    # Drop the character at target_index.
                    new_word = word[:target_index] + word[target_index + 1:]
                # rand == 0: keep the word as it is.
            new_sentence.append(new_word)
        new_training_data.append((new_sentence, tags))
    random.shuffle(new_training_data)
    return new_training_data
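
The two noise operations can be looked at in isolation. The sketch below uses hypothetical helpers, `swap_adjacent` and `drop_char`, with a fixed position instead of a random one so the effect on a single word is easy to see.

```python
def swap_adjacent(word, i):
    """Swap the characters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def drop_char(word, i):
    """Remove the character at position i."""
    return word[:i] + word[i + 1:]


print(swap_adjacent("chunker", 2))  # chnuker
print(drop_char("chunker", 2))      # chnker
```

A swap keeps the word length and the multiset of characters, while a drop shortens the word by one character, so the two operations affect the character counts differently.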


In LSTMTagger, we call data_shuffle() inside train() in every epoch to shuffle the training data and add noise to it.
The new train() is shown below.

In [75]:
    def train(self):
        loss_function = nn.NLLLoss()
        lamb = 2.5

        self.model.train()
        loss = float("inf")
        for epoch in range(self.epochs):
            training_data = data_shuffle(self.training_data)
            for sentence, tags in tqdm.tqdm(training_data):
                # Step 1. Remember that Pytorch accumulates gradients.
                # We need to clear them out before each instance
                self.model.zero_grad()
                # Step 2. Get our inputs ready for the network, that is, turn them into
                # Tensors of word indices.
                sentence_in = prepare_sequence(sentence, self.word_to_ix, self.unk)
                
                # get character level vector
                sentence_char_in = prepare_char_sequence(sentence, self.word_to_char_ix, self.unk)
                targets = prepare_sequence(tags, self.tag_to_ix, self.unk)

                # Step 3. Run our forward pass. calling forward in LSTMTaggerModel above
                tag_scores = self.model((sentence_in, sentence_char_in))

                # Step 4. Compute the loss, gradients, and update the parameters by
                #  calling optimizer.step()
                loss = loss_function(tag_scores, targets) * lamb
                logging.info("loss: %s", loss)

                loss.backward()
                self.optimizer.step()

            if epoch == self.epochs-1:
                epoch_str = '' # last epoch so do not use epoch number in model filename
            else:
                epoch_str = str(epoch)
            savefile = self.modelfile + epoch_str + self.modelsuffix
            print("saving model file: {}".format(savefile), file=sys.stderr)
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': self.model.state_dict(),
                        'optimizer_state_dict': self.optimizer.state_dict(),
                        'loss': loss,
                        'unk': self.unk,
                        'word_to_ix': self.word_to_ix,
                        'tag_to_ix': self.tag_to_ix,
                        'ix_to_tag': self.ix_to_tag,
                    }, savefile)


Unfortunately, incorporating data augmentation into our model did not offer any performance improvement; it produced a score of 78.73 on the dev data set.


## 3. Conclusion

In this task, we aim to find non-recursive syntactic groups of words.
We first implemented the Baseline solution, which combines the semi-character RNN idea with the phrasal chunker, and reached an FB1 score of 77.11.

Then, we improved the solution by scaling the loss function, since we expected that the loss penalty would not be sufficient in only 10 epochs. After scaling the loss, the FB1 score on the dev set improved by 1.62 points, from 77.11 to 78.73.

We also tried to further improve our score using data augmentation. However, this did not improve the score, which stayed at 78.73, the same as with the scaling method.

To sum up, scaling the loss function in the Baseline solution, which combines the semi-character RNN idea with the phrasal chunker, allows us to achieve an FB1 score of 78.73.