# Homework: Phrasal Chunking

## Group: Wisefish 

* Wenhao Zhang, wenhaoz 
* Graeme Milne, gmilne 
* Mitchell McCormack, mmccorma
* Jonathan Lo, jcl60

### Development process

To make the best use of our knowledge and time, we decided to work through the base algorithm together.  
We met in a lecture room for a few hours to understand the algorithm and the data structures it expects, and to implement at least a functional baseline.

The result of that session was a baseline implementation of the algorithm, but with some remaining issues in how we updated the feature vector at each epoch (described below).  
After further adjustments, and after helping each other understand how the algorithm worked, we arrived at our final implementation.

## The baseline perceptron algorithm:

In [1]:
import inspect

In [30]:
from chunk_baseline import perc_train

print(inspect.getsource(perc_train))

def perc_train(train_data, tagset, numepochs):
    """
    Perceptron training algorithm

    We run the provided training data and tagset for `numepoch` training runs over the entire training set.
    """
    feat_vec = defaultdict(int)
    z = []
    truth = []
    previous_error_count = 0

    for i in range(numepochs):
        final_z = []
        final_truth = []
        for (labeled_list, feat_list) in train_data:
            truth = [x.split(" ")[2] for x in labeled_list]
            z = perc.perc_test(feat_vec, labeled_list,
                               feat_list, tagset, tagset[0])

            final_z.append(z)
            final_truth.append(truth)

            # the resulting prediction is not the same (ordered list comparison) as the truth,
            # update the feature vector
            if z != truth:
                feat_vec = update_feat_vect(feat_vec, feat_list, z, truth)

        error_count = count_errors(final_truth, final_z)
        print("number of mistakes: 

## The averaged perceptron algorithm:

In [29]:
from chunk import perc_train

print(inspect.getsource(perc_train))

def perc_train(train_data, tagset, numepochs):

    """
    Perceptron training algorithm

    We run the provided training data and tagset for `numepoch` training runs over the entire training set.
    """   

    feat_vec = defaultdict(int)
    avg_feat_vec = defaultdict(int)
    z = []
    truth = []
    last_update = {}
    train_data_index = 0
    train_data_size = len(train_data)

    previous_error_count = 0

    for i in range(numepochs):
        final_z = []
        final_truth = []
        for (labeled_list, feat_list) in train_data:
            truth = [x.split(" ")[2] for x in labeled_list]
            z = perc.perc_test(feat_vec, labeled_list,
                               feat_list, tagset, tagset[0])

            final_z.append(z)
            final_truth.append(truth)

            if z != truth:
                feat_vec, avg_feat_vec, last_update = update_feat_vect(feat_vec, avg_feat_vec, i, numepochs,
                                            last_update, train_data_

### Updating the feature vector (Baseline)

If the predicted chunking labels for a given sentence do not match the labels in the training data, the feature vector is updated using the `update_feat_vect` function. An element whose predicted tag matches the true tag is skipped. For a mispredicted element, the weights of its features paired with the incorrect tag are punished and the weights of its features paired with the correct tag are rewarded. Example:

* Predicted: `NN -> B-NP`
* Truth:     `NN -> B-PP`
* `(U14:NN, B-PP)`: (value += 1)
* `(U14:NN, B-NP)`: (value -= 1)
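
The example above can be sketched for a single token as follows. This is only an illustration under simplifying assumptions: the helper name `update_single_token` is hypothetical, and it ignores the bigram feature that the full `update_feat_vect` also handles.

```python
from collections import defaultdict


def update_single_token(feat_vec, features, predicted_tag, true_tag):
    """Hypothetical helper: apply the reward/punish rule to one token."""
    if predicted_tag == true_tag:
        # Correct prediction: the weights are left untouched.
        return feat_vec
    for feature in features:
        feat_vec[(feature, predicted_tag)] -= 1  # punish the wrong tag
        feat_vec[(feature, true_tag)] += 1       # reward the true tag
    return feat_vec


weights = defaultdict(int)
update_single_token(weights, ["U14:NN"], predicted_tag="B-NP", true_tag="B-PP")
print(dict(weights))
```

Both updates happen together and only on a misprediction, which is the behaviour the bullet list describes.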

In our earliest implementations we made a mistake in the punish and reward logic: we punished the predicted tag only when the prediction was wrong and rewarded the true tag only when the prediction was correct, instead of applying both updates on a misprediction.
```
if prediction != truth: 
    (feature_function, prediction) - 1 
else: 
    (feature_function, truth)  + 1
```

This small detail with a large impact was easily overlooked at first. However, after a much closer reading of the assignment outline and baseline algorithm description, the error was discovered. 

In [28]:
from chunk_baseline import update_feat_vect

print(inspect.getsource(update_feat_vect))

def update_feat_vect(feat_vec, feat_list, z, truth):
    """
    Update the provided weighting feature vector based on the given feature list, predicted output, and the truth reference.
    """
    for i in range(len(z)):
        if z[i] != truth[i]:

            # Penalize the incorrect features and reward the actual correct features
            for j in range(19):
                feat_vec[(feat_list[20*i+j], z[i])] -= 1
                feat_vec[(feat_list[20*i+j], truth[i])] += 1

            for j in range(2):
                if (i == 0 and j == 0) or (i == len(z) - 1 and j == 1):
                    continue

                feat_vec[(feat_list[20*i+19] + ":" + z[i-1+j], z[i+j])] -= 1
                feat_vec[(feat_list[20*i+19] + ":" + truth[i-1+j], truth[i+j])] += 1

    return feat_vec



### Updating the feature vector (Averaged perceptron)

To improve upon the baseline, we implemented the algorithm from Sarkar 2011, page 38. The averaged perceptron averages the weights over the training steps. At each update we punish the features paired with the predicted (incorrect) tag and reward the features paired with the true tag, and then update the averaged weight vector. We store the averaged weights in a dictionary called `avg_feat_vec` and the non-averaged weights in a dictionary called `feat_vec`. To update the averaged weights, we multiply the value of a feature in `feat_vec` by the number of training steps since that feature was last updated and add the result to `avg_feat_vec` under the same key. The updating is lazy: we only update the weights that need to be updated instead of all the weights.
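
The idea of lazy averaging can be illustrated with a small, self-contained sketch. It is not the `update_feat_vect` code from this report: the `LazyAverager` class, its method names, and the convention of calling `tick()` once per training example are all assumptions made for the illustration.

```python
from collections import defaultdict


class LazyAverager:
    """Hypothetical helper that averages weights without touching every key each step."""

    def __init__(self):
        self.weights = defaultdict(int)    # current (non-averaged) weights
        self.totals = defaultdict(int)     # sum of each weight over the steps accounted for
        self.last_step = defaultdict(int)  # step at which each key's total was brought up to date
        self.step = 0                      # number of training examples processed so far

    def tick(self):
        """Call once after each training example."""
        self.step += 1

    def update(self, key, delta):
        # Credit the old value for every step since this key was last touched,
        # then apply the change. Untouched keys cost nothing.
        elapsed = self.step - self.last_step[key]
        self.totals[key] += elapsed * self.weights[key]
        self.last_step[key] = self.step
        self.weights[key] += delta

    def averaged(self):
        """Return the averaged weights, flushing the steps not yet accounted for."""
        if self.step == 0:
            return {}
        result = {}
        for key, value in self.weights.items():
            elapsed = self.step - self.last_step[key]
            result[key] = (self.totals[key] + elapsed * value) / self.step
        return result


avg = LazyAverager()
avg.update("a", +1)   # example 0: mistake involving feature "a"
avg.tick()
avg.tick()            # example 1: no mistake, nothing is touched
avg.update("a", -1)   # example 2: "a" punished, "b" rewarded
avg.update("b", +1)
avg.tick()
averaged_weights = avg.averaged()
print(averaged_weights)
```

Over the three examples, the weight of `"a"` is 1, 1, 0 and the weight of `"b"` is 0, 0, 1, so the sketch yields averages of 2/3 and 1/3, the same as averaging after every step would.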

In [27]:
from chunk import update_feat_vect

print(inspect.getsource(update_feat_vect))

def update_feat_vect(feat_vec, avg_feat_vec, iter_num, epoch, last_update, train_data_index, train_data_size, feat_list, z, truth):

    """
    Update the provided weighting feature vector based on the given feature list, predicted output, and the truth reference.
    """

    if iter_num != epoch-1 or train_data_index != train_data_size-1:
        for i in range(len(z)):
            if z[i] != truth[i]:
                for j in range(19):
                    yprime = (feat_list[20*i+j], z[i])
                    y = (feat_list[20*i+j], truth[i])

                    feat_vec[yprime] -= 1
                    feat_vec[y] += 1

                    if yprime in last_update:
                        scale = iter_num * train_data_size + train_data_index - last_update[yprime][1] * train_data_size - last_update[yprime][0]
                        avg_feat_vec[yprime] += scale * feat_vec[yprime]
                    else:
                        avg_feat_vec[yprime] -= 1
                        

### Evaluating errors at each epoch

The `count_errors` function counts the mispredictions made in each epoch. As called from `perc_train`, each item it compares is the full tag sequence of one sentence, so the count is the number of sentences with at least one mispredicted label. Tracking this count lets us see the trend over time as the weights are adjusted and the model is trained.

We expect the weight adjustments to improve the predicted output labels and to reduce the error count as the weights converge. We treat an error count that stops decreasing as a sign that further epochs are no longer improving the weights.

After each epoch, the number of errors is compared to the previous epoch's count. If it is greater than or equal to the previous epoch's count, the algorithm stops.
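
The stopping rule can be sketched in isolation as below. The function name `epochs_until_stop` is hypothetical, and skipping the comparison for the first epoch is an assumption, since the part of `perc_train` that performs the check is not shown in full here.

```python
def epochs_until_stop(error_counts):
    """Return how many epochs run before the stop rule fires.

    `error_counts` holds the error count observed after each epoch.
    Assumption: the first epoch has nothing to compare against and never stops.
    """
    previous = None
    for epoch, errors in enumerate(error_counts, start=1):
        if previous is not None and errors >= previous:
            return epoch  # errors did not decrease: stop after this epoch
        previous = errors
    return len(error_counts)  # ran out of epochs without stopping early


# Strictly decreasing counts run for every epoch.
print(epochs_until_stop([5630, 4019, 3059, 2332, 1877]))
# An increase at the third epoch stops training there.
print(epochs_until_stop([10, 8, 9, 7]))
```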

In [5]:
from chunk import count_errors
print(inspect.getsource(count_errors))

def count_errors(test, truth):
    """
    Helper to count the amount of errors between the predicted output and the known truth reference.
    """
    count = 0
    for idx, item in enumerate(test):
        if item != truth[idx]:
            count += 1
    return count



# Results of our Implementation

**NOTE:** For performance, we are executing these commands in a separate shell with `!`.

We run our `perc_train` on the specified training data and write the resulting model to the file `baseline.model`. For grading purposes, we run it on the `train.txt.gz` / `train.feats.gz` data set for 10 iterations (epochs).

After each epoch, we output the number of mistakes that were made by the perceptron using the feature vector provided to it. As the training runs, the number of mistakes should decrease, and stop when we detect the best possible training output or we reach the number of epochs specified.

In [31]:
! python3 chunk.py -m data/baseline.model -t data/tagset.txt -i data/train.txt.gz -f data/train.feats.gz -e 10

reading data ...
done.
number of mistakes:  5630
number of mistakes:  4019
number of mistakes:  3059
number of mistakes:  2332
number of mistakes:  1877
number of mistakes:  1448
number of mistakes:  1201
number of mistakes:  1046
number of mistakes:  836
number of mistakes:  711


We now use our trained and saved model on a testing data set (`dev.txt` and `dev.feats`).

We write the output predictions for the testing data set to the file `output`.

In [34]:
! python3 perc.py -m data/baseline.model -t data/tagset.txt -i data/dev.txt -f data/dev.feats > output 

reading data ...
done.


Finally we take the output of our test predictions and feed it to `score_chunks.py` to evaluate.

This program takes the given output from `perc.py` and evaluates it against the gold reference `reference500.txt`, which is for `dev.txt`.

The final line of output contains the evaluation, which from our test runs is an F1 score of `92.73`.
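
As a sanity check, the overall F1 score can be recomputed from the phrase counts in the first line of the scorer output. This sketch assumes `score_chunks.py` uses the standard definitions (precision = correct / found, recall = correct / total phrases); the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(correct, found, total):
    """Compute precision, recall and F1 from phrase counts."""
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the scorer output: 5783 phrases in the reference,
# 5771 phrases found, 5357 of them correct.
precision, recall, f1 = precision_recall_f1(correct=5357, found=5771, total=5783)
print(f"precision: {precision:.2%}; recall: {recall:.2%}; F1: {100 * f1:.2f}")
```

With these counts the F1 value rounds to 92.73, which agrees with the score quoted above.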

In [35]:
! python3 score_chunks.py -t output -r data/reference500.txt

processed 500 sentences with 10375 tokens and 5783 phrases; found phrases: 5771; correct phrases: 5357
             ADJP: precision:  72.73%; recall:  72.73%; F1:  72.73; found:     99; correct:     99
             ADVP: precision:  76.12%; recall:  75.74%; F1:  75.93; found:    201; correct:    202
            CONJP: precision: 100.00%; recall:  60.00%; F1:  75.00; found:      3; correct:      5
             INTJ: precision:   0.00%; recall:   0.00%; F1:   0.00; found:      0; correct:      1
               NP: precision:  93.82%; recall:  93.29%; F1:  93.55; found:   3009; correct:   3026
               PP: precision:  96.29%; recall:  97.79%; F1:  97.03; found:   1240; correct:   1221
              PRT: precision:  77.78%; recall:  63.64%; F1:  70.00; found:     18; correct:     22
             SBAR: precision:  80.56%; recall:  81.31%; F1:  80.93; found:    108; correct:    107
               VP: precision:  92.50%; recall:  91.91%; F1:  92.20; found:   1093; correct:   11