Problem Set 4: Sequence labeling
=====================

This project focuses on sequence labeling, in the target domain of Twitter part-of-speech tagging.
Part (b) focuses on *discriminative* approaches, mainly averaged perceptron and structured perceptron.


### Submission guidelines ###

Please follow these guidelines when submitting the problem set on T-Square; they make grading simpler.

* Submit these three things on T-Square:

   * A compressed gtnlplib folder containing all your code. Please don't attach the Python files separately on T-Square.
    
   * pset4.ipynb to present all your explanation answers and results.
    
   * Multiple response files will be generated throughout the assignment: 4 for your normal models on dev data and 1 for the bakeoff on test data. Use the createSubmission.sh script to compress these files and submit the generated response_files.tar on T-Square.


   * For the 'Error Analysis' part, write your answers in the notebook only. If you want to point to any code or functions that you have written separately, give the location of the code in the notebook file.

* Please don't modify any of the relative paths to data. You can copy the 'data' folder to the relative path given in 'gtnlplib/constants.py' while working through the assignment.

In [1]:
import operator
from collections import defaultdict, Counter
import matplotlib.pyplot as plt
%pylab --no-import-all inline

import gtnlplib.preproc
import gtnlplib.viterbi
import gtnlplib.clf_base
import gtnlplib.scorer
import gtnlplib.constants
import gtnlplib.features
import gtnlplib.tagger_base
import gtnlplib.avg_perceptron
import gtnlplib.str_perceptron
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib


In [2]:
reload(gtnlplib.preproc)

<module 'gtnlplib.preproc' from 'gtnlplib/preproc.pyc'>

In [3]:
## Define the file names
trainfile = gtnlplib.constants.TRAIN_FILE
devfile = gtnlplib.constants.DEV_FILE
testfile = gtnlplib.constants.TEST_FILE # You do not have this for now
offset = gtnlplib.constants.OFFSET

In [4]:
# for convenience
tr_all = []
for i,(words,tags) in enumerate(gtnlplib.preproc.conllSeqGenerator(trainfile)):
    tr_all.append((words,tags))

In [5]:
## Demo
alltags = set()
for i,(words, tags) in enumerate(gtnlplib.preproc.conllSeqGenerator(trainfile)):    
    for tag in tags:
        alltags.add(tag)
print alltags


set(['!', '#', '$', '&', ',', 'A', '@', 'E', 'D', 'G', 'M', 'L', 'O', 'N', 'P', 'S', 'R', 'U', 'T', 'V', 'Y', 'X', 'Z', '^', '~'])


# 1. Classification-based tagging #

First, you will perform tagging as a classification problem.

Recall that in structured prediction, the feature function decomposes over positions in the sequence:

\begin{align*}
\renewcommand{\vec}[1]{\mathbf{#1}}
\vec{f}(\vec{w},\vec{y}) & = \sum_m \vec{f}(\vec{w},y_m, y_{m-1}, m)
\end{align*}

You will explicitly define your feature functions in this way -- even for the classification-based tagger, which won't consider $y_{m-1}$. The features themselves are defined as tuples, as in pset 3.

Here is a simple example:

In [6]:
def wordFeatures(words,tag,prev_tag,m):
    '''
    :param words: a list of words
    :type words: list
    :param tag: a tag
    :type tag: string
    :type prev_tag: string
    :type m: int
    '''
    out = {(offset,tag):1}
    if m < len(words): #we can have m = M, for the transition to the end state
        out[(gtnlplib.constants.EMIT,tag,words[m])]=1
    return out

In [7]:
sent = 'they can can fish'.split()

In [8]:
wordFeatures(sent,'V','V',0)

{('**OFFSET**', 'V'): 1, ('--EMISSION--', 'V', 'they'): 1}

**Deliverable 1a** (1 point) Complete the feature function 'wordCharFeatures' in gtnlplib/features.py, which includes the final character of the current word and the final character of the preceding word (if $m > 0$, counting positions from zero), along with the features above. The names for these features are defined in gtnlplib.constants.
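As a rough illustration of the feature set described above, here is a minimal sketch. The constant names `CURR_SUFFIX` and `PREV_SUFFIX` are hypothetical stand-ins for whatever gtnlplib.constants actually defines; only their string values are taken from the sanity-check output in this notebook. Skipping the suffix features at the end-state position is also an assumption.

```python
# Hypothetical stand-ins for the names in gtnlplib.constants; the string
# values are copied from the outputs printed in this notebook.
OFFSET = '**OFFSET**'
EMIT = '--EMISSION--'
CURR_SUFFIX = '--curr-suff--'
PREV_SUFFIX = '--prev-suff--'


def word_char_features_sketch(words, tag, prev_tag, m):
    """Offset and emission features plus final-character features."""
    feats = {(OFFSET, tag): 1}
    if m < len(words):  # m == len(words) is the transition to the end state
        feats[(EMIT, tag, words[m])] = 1
        feats[(CURR_SUFFIX, tag, words[m][-1])] = 1
        if m > 0:  # the first word has no predecessor
            feats[(PREV_SUFFIX, tag, words[m - 1][-1])] = 1
    return feats
```

Note that `prev_tag` is accepted but unused, which keeps the signature compatible with the structured feature functions later on.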

In [9]:
reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [10]:
# sanity check desired output
print gtnlplib.features.wordCharFeatures(sent,'V','V',1)
# no prev-suff feature in this one, because m=0
print gtnlplib.features.wordCharFeatures(sent,'V','V',0)

{('--curr-suff--', 'V', 'n'): 1, ('--EMISSION--', 'V', 'can'): 1, ('**OFFSET**', 'V'): 1, ('--prev-suff--', 'V', 'y'): 1}
{('--curr-suff--', 'V', 'y'): 1, ('**OFFSET**', 'V'): 1, ('--EMISSION--', 'V', 'they'): 1}


Now you will define a classification-based tagger. To get you started, here are some test weights.

In [11]:
test_weights = defaultdict(float)
test_tags = ['N','V','V','N']
for i in range(len(sent)):
    for feat in wordFeatures(sent,test_tags[i],'X',i): # for key in dict: iterate the keys
        test_weights[feat] = 1
    for feat in wordFeatures(sent,'X','X',i):
        test_weights[feat] = 1
print test_weights

defaultdict(<type 'float'>, {('--EMISSION--', 'X', 'fish'): 1, ('--EMISSION--', 'N', 'fish'): 1, ('--EMISSION--', 'X', 'they'): 1, ('--EMISSION--', 'V', 'can'): 1, ('**OFFSET**', 'V'): 1, ('--EMISSION--', 'N', 'they'): 1, ('**OFFSET**', 'N'): 1, ('--EMISSION--', 'X', 'can'): 1, ('**OFFSET**', 'X'): 1})


In [12]:
# use this to find the highest-scoring label
argmax = lambda x : max(x.iteritems(),key=operator.itemgetter(1))[0]

**Deliverable 1b** (1 point): Complete the function classifierTagger in gtnlplib/tagger_base.py, which takes a list of words, a feature function, a dict of weights, and a tagset, and outputs a list of predicted tags (one per word).

You should use featfunc to get the features and return, for each word, the tag with the highest score.
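One way to picture classification-based tagging is the sketch below: score every tag at every position as a dot product of weights and features, and keep the best tag. This is not the reference solution; `classifier_tagger_sketch` is a hypothetical name, `word_features` restates wordFeatures so the block runs on its own, and breaking ties by sorted tag order is an arbitrary choice.

```python
OFFSET = '**OFFSET**'   # values copied from the notebook output
EMIT = '--EMISSION--'


def word_features(words, tag, prev_tag, m):
    """Same features as wordFeatures above, restated so this block runs alone."""
    feats = {(OFFSET, tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats


def classifier_tagger_sketch(words, featfunc, weights, tagset):
    """Tag each word independently with its highest-scoring tag."""
    candidates = sorted(tagset)  # fixed order so ties break deterministically
    tags = []
    for m in range(len(words)):
        scores = {}
        for tag in candidates:
            feats = featfunc(words, tag, None, m)  # prev_tag is ignored here
            # .get avoids inserting new keys into a defaultdict of weights
            scores[tag] = sum(weights.get(f, 0.0) * v for f, v in feats.items())
        tags.append(max(candidates, key=scores.get))
    return tags
```

Because each position is scored on its own, the previous tag never enters the decision; that is exactly the limitation that part 2 addresses.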

In [13]:
gtnlplib.tagger_base.classifierTagger(sent,wordFeatures,test_weights,alltags)

['N', 'V', 'V', 'N']

In [14]:
confusion = gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,wordFeatures,test_weights,alltags),'test')
print gtnlplib.scorer.accuracy(confusion)

0.139539705577


**Deliverable 1c** (3 points): Apply your averaged perceptron from pset 2 to do part-of-speech tagging. Start by adapting your oneItAvgPerceptron function. You'll have to make some changes:

- Replace your call to the predict() function with a call to classifierTagger()
- The instanceGenerator now produces word lists and tag lists as instances, instead of feature counts.
- You can treat entire sentences as instances, if you want -- this may be slightly easier. This means that you only update the weights after seeing an entire sentence, sort of like a minibatch.
- You'll want to add the feature function as an extra argument to both oneItAvgPerceptron and trainAvgPerceptron
- return the training accuracy rather than the number of errors

Complete the oneItAvgPerceptron function in gtnlplib/avg_perceptron.py for this part.
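The changes listed above can be sketched roughly as follows, treating each sentence as one instance. Everything here is an assumption made for illustration: the function names, the `t_init` counter, the toy `emission_features`, and the naive bookkeeping that adds the whole weight vector to the running sum after every sentence (simple, but slower than a lazy update). The return order mirrors the call in the next cell.

```python
from collections import defaultdict


def emission_features(words, tag, prev_tag, m):
    """Hypothetical minimal feature function: a per-tag bias plus the word."""
    feats = {('bias', tag): 1}
    if m < len(words):
        feats[('word', tag, words[m])] = 1
    return feats


def score(words, tag, m, featfunc, weights):
    feats = featfunc(words, tag, None, m)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def predict_tags(words, featfunc, weights, tagset):
    """Independent per-word argmax, standing in for classifierTagger."""
    candidates = sorted(tagset)  # fixed order so ties break the same way
    return [max(candidates, key=lambda t: score(words, t, m, featfunc, weights))
            for m in range(len(words))]


def one_iteration_sketch(instances, featfunc, weights, wsum, tagset, t_init=0):
    """One pass over the data; predictions are made once per sentence."""
    correct, total = 0, 0
    t = t_init
    for words, gold in instances:
        pred = predict_tags(words, featfunc, weights, tagset)
        for m, (g, p) in enumerate(zip(gold, pred)):
            total += 1
            if g == p:
                correct += 1
                continue
            for f, v in featfunc(words, g, None, m).items():
                weights[f] += v   # reward the gold tag's features
            for f, v in featfunc(words, p, None, m).items():
                weights[f] -= v   # penalize the predicted tag's features
        for f, v in weights.items():
            wsum[f] += v          # running sum, divided by t later to average
        t += 1
    return weights, wsum, correct / float(max(total, 1)), t
```

Dividing each entry of `wsum` by the final `t` gives averaged weights under this bookkeeping; other implementations track the sum differently, so the numbers in the sanity check below need not match this sketch.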

In [15]:
reload(gtnlplib.avg_perceptron)

<module 'gtnlplib.avg_perceptron' from 'gtnlplib/avg_perceptron.pyc'>

In [16]:
weights,wsum,tr_acc,i = gtnlplib.avg_perceptron.oneItAvgPerceptron(tr_all,wordFeatures,defaultdict(float),defaultdict(float),alltags)

In [17]:
#sanity check. The weight sum numbers might be different if you don't treat sentences as instances, which is what I do.
print weights[gtnlplib.constants.EMIT,'D','the'], wsum[gtnlplib.constants.EMIT,'D','the']
print weights[gtnlplib.constants.EMIT,'N','the'], wsum[gtnlplib.constants.EMIT,'N','the']
print weights[gtnlplib.constants.EMIT,'V','like'], wsum[gtnlplib.constants.EMIT,'V','like']
print weights[gtnlplib.constants.EMIT,'P','like'], wsum[gtnlplib.constants.EMIT,'P','like']

16.0 2611.0
-1.0 -212.0
2.0 587.0
5.0 942.0


**Deliverable 1d** (2 points): Now adapt the trainAvgPerceptron function in gtnlplib/avg_perceptron.py to do tagging. This should require fewer changes than oneItAvgPerceptron, but you will have to:

- take a feature function as an argument
- call evalTagger instead of evalClassifier to get the confusion matrix
- don't forget you've modified oneItAvgPerceptron to return the training set accuracy, not the number of errors

In [18]:
w, tr_acc, dv_acc =  gtnlplib.avg_perceptron.trainAvgPerceptron(15,tr_all,gtnlplib.features.wordCharFeatures,alltags)

0 dev: 0.576819407008 train: 0.496545591354
1 dev: 0.66618287373 train: 0.664956563376
2 dev: 0.643375492432 train: 0.780149121007
3 dev: 0.637777317023 train: 0.831725836241
4 dev: 0.643168152602 train: 0.862439291333
5 dev: 0.687538876218 train: 0.879403515972
6 dev: 0.637984656853 train: 0.886791162186
7 dev: 0.633630520423 train: 0.898146248033
8 dev: 0.645241550902 train: 0.902045283535
9 dev: 0.676342525399 train: 0.905807510774
10 dev: 0.658303960191 train: 0.911895478487
11 dev: 0.6659755339 train: 0.910458991723
12 dev: 0.685880157578 train: 0.91415281483
13 dev: 0.66742691271 train: 0.907928038854
14 dev: 0.651461745801 train: 0.911690266092


In [19]:
#You will get the test file later (48 hours before the deadline)
gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,gtnlplib.features.wordCharFeatures,w,alltags),'avg_perceptron.response',testfile=devfile)

defaultdict(int,
            {('!', '!'): 45,
             ('!', '#'): 2,
             ('!', ','): 2,
             ('!', '@'): 10,
             ('!', 'D'): 1,
             ('!', 'N'): 4,
             ('!', 'P'): 13,
             ('!', 'R'): 14,
             ('!', 'V'): 8,
             ('#', '#'): 6,
             ('#', '$'): 1,
             ('#', '@'): 12,
             ('#', 'A'): 1,
             ('#', 'N'): 3,
             ('#', 'P'): 1,
             ('#', 'R'): 13,
             ('#', 'V'): 14,
             ('#', '^'): 1,
             ('$', '$'): 41,
             ('$', '@'): 25,
             ('$', 'A'): 1,
             ('$', 'G'): 1,
             ('$', 'O'): 1,
             ('$', 'P'): 11,
             ('$', 'V'): 5,
             ('$', '^'): 1,
             ('&', '&'): 87,
             ('&', 'D'): 2,
             ('&', 'P'): 1,
             ('&', 'R'): 1,
             (',', ','): 453,
             (',', '@'): 3,
             (',', 'G'): 8,
             (',', 'P'): 4,
             (',',

**Deliverable 1e** (3 points): Make it better! Design a killer feature set that improves performance on the devset.

I'm able to get above 84% on the dev set, without going too crazy. Warning: my additional features slow things down considerably.


Please complete the yourFeatures function in gtnlplib/features.py for this.
To pass the unit tests for this part, you should reach at least 81% accuracy on the dev set.

In [20]:
reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [21]:
w, tr_acc, dv_acc = gtnlplib.avg_perceptron.trainAvgPerceptron(15,tr_all,gtnlplib.features.yourFeatures,alltags)

0 dev: 0.729628861704 train: 0.679731855804
1 dev: 0.77669500311 train: 0.861960462412
2 dev: 0.790172092059 train: 0.914836856146
3 dev: 0.803441841178 train: 0.940009576578
4 dev: 0.803649181008 train: 0.955537314454
5 dev: 0.806759278457 train: 0.966481975511
6 dev: 0.803441841178 train: 0.976810999384
7 dev: 0.810284055567 train: 0.979068335727
8 dev: 0.808625336927 train: 0.984198645598
9 dev: 0.808003317437 train: 0.985635132362
10 dev: 0.808832676757 train: 0.987550448047
11 dev: 0.817333609786 train: 0.989534167864
12 dev: 0.817126269956 train: 0.990628633969
13 dev: 0.815260211487 train: 0.990286613311
14 dev: 0.811735434377 train: 0.992885970313


In [22]:
confusion_clf = gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,gtnlplib.features.yourFeatures,w,alltags),'avg_perceptron_custom.response',testfile=devfile)

# 2. Discriminative Structure Prediction #

Now you will implement a structured perceptron, which is trained to find the optimal *sequence* $\hat{\vec{y}} = \arg\max_{\vec{y}} \theta^{\top} \vec{f}(\vec{w},\vec{y})$.

A key difference from the classification-based setting is that we compute features over the entire sequence.

**Deliverable 2a** (1 point): Implement a function seqFeatures in gtnlplib/features.py, which takes a list of words, a list of tags, and a feature function, and returns a dictionary of features and their counts.
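The decomposition $\vec{f}(\vec{w},\vec{y}) = \sum_m \vec{f}(\vec{w},y_m,y_{m-1},m)$ suggests a simple loop over positions. The sketch below is one possible reading, not the reference solution: it assumes the previous tag at position 0 is '--START--' and that one extra position with tag '--END--' is appended, which is what the outputs in this notebook appear to show. The constant names are hypothetical.

```python
from collections import defaultdict

OFFSET = '**OFFSET**'      # string values copied from the notebook output
EMIT = '--EMISSION--'
START_TAG = '--START--'
END_TAG = '--END--'


def word_features(words, tag, prev_tag, m):
    """Same features as wordFeatures above, restated so this block runs alone."""
    feats = {(OFFSET, tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats


def seq_features_sketch(words, tags, featfunc):
    """Sum the local feature vectors over positions 0..M, where M = len(words)."""
    counts = defaultdict(float)
    prev_tag = START_TAG
    # The extra END_TAG position accounts for the transition to the end state.
    for m, tag in enumerate(list(tags) + [END_TAG]):
        for feat, value in featfunc(words, tag, prev_tag, m).items():
            counts[feat] += value
        prev_tag = tag
    return counts
```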

In [23]:
reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [24]:
gtnlplib.features.seqFeatures(sent,['N','V','V','N'],wordFeatures)

defaultdict(float,
            {('**OFFSET**', '--END--'): 1.0,
             ('**OFFSET**', 'N'): 2.0,
             ('**OFFSET**', 'V'): 2.0,
             ('--EMISSION--', 'N', 'fish'): 1.0,
             ('--EMISSION--', 'N', 'they'): 1.0,
             ('--EMISSION--', 'V', 'can'): 2.0})

**Deliverable 2b** (1 point): Now complete the function wordTransFeatures in gtnlplib/features.py, which adds tag-to-tag transition features to wordFeatures. Note that this feature set is identical to what the HMM uses.

In [25]:
gtnlplib.features.seqFeatures(sent,['N','V','V','N'],gtnlplib.features.wordTransFeatures)

defaultdict(float,
            {('**OFFSET**', '--END--'): 1.0,
             ('**OFFSET**', 'N'): 2.0,
             ('**OFFSET**', 'V'): 2.0,
             ('--EMISSION--', 'N', 'fish'): 1.0,
             ('--EMISSION--', 'N', 'they'): 1.0,
             ('--EMISSION--', 'V', 'can'): 2.0,
             ('--TRANS--', '--END--', 'N'): 1.0,
             ('--TRANS--', 'N', '--START--'): 1.0,
             ('--TRANS--', 'N', 'V'): 1.0,
             ('--TRANS--', 'V', 'N'): 1.0,
             ('--TRANS--', 'V', 'V'): 1.0})

**Deliverable 2c** (1 point): Copy in your viterbiTagger from problem set 3. If you implemented it correctly, you should be able to use it without modification here.
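For readers who want a reminder of what such a tagger does, here is a compact first-order Viterbi sketch driven by a feature function. It is not the pset 3 solution: all names are hypothetical, the transition tuple layout `(TRANS, tag, prev_tag)` is inferred from the output above, and ties are broken arbitrarily (by tag string), so results can differ from your own implementation when several paths score the same.

```python
START_TAG = '--START--'
END_TAG = '--END--'
OFFSET = '**OFFSET**'
EMIT = '--EMISSION--'
TRANS = '--TRANS--'


def word_trans_features(words, tag, prev_tag, m):
    """Offset, emission, and tag-to-tag transition features (assumed layout)."""
    feats = {(OFFSET, tag): 1, (TRANS, tag, prev_tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats


def local_score(words, tag, prev_tag, m, featfunc, weights):
    feats = featfunc(words, tag, prev_tag, m)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def viterbi_sketch(words, featfunc, weights, tagset):
    """Return (best tag sequence, its score) for a first-order model."""
    tags = sorted(tagset)
    M = len(words)
    if M == 0:
        return [], local_score(words, END_TAG, START_TAG, 0, featfunc, weights)
    # best[m][tag] = (score of the best path ending in tag at m, previous tag)
    best = [{t: (local_score(words, t, START_TAG, 0, featfunc, weights), None)
             for t in tags}]
    for m in range(1, M):
        column = {}
        for t in tags:
            column[t] = max(
                (best[m - 1][p][0] + local_score(words, t, p, m, featfunc, weights), p)
                for p in tags)
        best.append(column)
    # Close the path with the transition into the end state.
    final_score, last = max(
        (best[M - 1][p][0] + local_score(words, END_TAG, p, M, featfunc, weights), p)
        for p in tags)
    path = [last]
    for m in range(M - 1, 0, -1):   # follow back-pointers
        path.append(best[m][path[-1]][1])
    path.reverse()
    return path, final_score
```

The two nested loops over tags at each position are where the quadratic cost in the size of the tagset comes from.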

In [26]:
reload(gtnlplib.viterbi)

<module 'gtnlplib.viterbi' from 'gtnlplib/viterbi.pyc'>

In [27]:
gtnlplib.viterbi.viterbiTagger(['they','can','can','fish'],gtnlplib.features.wordTransFeatures,test_weights,alltags)

(['N', 'V', 'V', 'N'], 8.0)

**Deliverable 2d** (3 points): Complete the function oneItAvgStructPerceptron in gtnlplib/str_perceptron.py, which performs a single iteration of averaged structured perceptron. It should be similar to your oneItAvgPerceptron, but will have to be different in some ways to reflect the structured prediction scenario.

- To make predictions, you must call your viterbiTagger function
- To compute the features for a given sequence of words and tags, you must call your seqFeatures function
- As above, output the training accuracy, not the number of training errors
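The core of the structured update is to add the features of the gold sequence and subtract the features of the predicted sequence. The sketch below illustrates that idea only. To stay short and self-contained it decodes by brute force over all tag sequences, which is a slow stand-in for viterbiTagger and is usable only on toy inputs; the function names, the `t_init` counter, and the naive running-sum bookkeeping are all assumptions.

```python
from collections import defaultdict
from itertools import product

START_TAG, END_TAG = '--START--', '--END--'
EMIT, TRANS = '--EMISSION--', '--TRANS--'


def word_trans_features(words, tag, prev_tag, m):
    """Emission and transition features only (offsets omitted for brevity)."""
    feats = {(TRANS, tag, prev_tag): 1}
    if m < len(words):
        feats[(EMIT, tag, words[m])] = 1
    return feats


def seq_features(words, tags, featfunc):
    counts = defaultdict(float)
    prev_tag = START_TAG
    for m, tag in enumerate(list(tags) + [END_TAG]):
        for f, v in featfunc(words, tag, prev_tag, m).items():
            counts[f] += v
        prev_tag = tag
    return counts


def brute_force_decode(words, featfunc, weights, tagset):
    """Exhaustive argmax over tag sequences; a toy replacement for Viterbi."""
    def seq_score(tags):
        feats = seq_features(words, tags, featfunc)
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return list(max(product(sorted(tagset), repeat=len(words)), key=seq_score))


def one_iteration_struct_sketch(instances, featfunc, weights, wsum, tagset, t_init=0):
    correct, total = 0, 0
    t = t_init
    for words, gold in instances:
        pred = brute_force_decode(words, featfunc, weights, tagset)
        if pred != list(gold):
            for f, v in seq_features(words, gold, featfunc).items():
                weights[f] += v   # move toward the gold sequence
            for f, v in seq_features(words, pred, featfunc).items():
                weights[f] -= v   # move away from the predicted sequence
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
        for f, v in weights.items():
            wsum[f] += v          # running sum for averaging
        t += 1
    return weights, wsum, correct / float(max(total, 1)), t
```

Features that appear in both the gold and predicted sequences cancel out, so only the positions where the two disagree (and the transitions around them) change the weights.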

In [28]:
reload(gtnlplib.str_perceptron)

<module 'gtnlplib.str_perceptron' from 'gtnlplib/str_perceptron.pyc'>

Speed is important here. Use this line to benchmark your code.
- My "optimized" implementation takes 1.1 seconds per iteration. 
- My "less optimized" implementation takes 1.6 seconds per iteration.

In [29]:
%%timeit
weights,wsum,tr_acc,i = gtnlplib.str_perceptron.oneItAvgStructPerceptron(tr_all[:100],
                                                                         gtnlplib.features.wordTransFeatures,
                                                                         defaultdict(float),
                                                                         defaultdict(float),
                                                                         alltags)
# careful, the %%timeit magic means that this block doesn't change the notebook state 

1 loops, best of 3: 1.22 s per loop


In [30]:
weights,wsum,tr_acc,i = gtnlplib.str_perceptron.oneItAvgStructPerceptron(tr_all[:100],gtnlplib.features.wordTransFeatures,defaultdict(float),defaultdict(float),alltags)

In [31]:
for tag1 in list(alltags)[:7]:
    for tag2 in list(alltags)[:7]:
        if weights[gtnlplib.constants.TRANS,tag1,tag2] != 0:
            print tag1,tag2,weights[(gtnlplib.constants.TRANS,tag1,tag2)],wsum[gtnlplib.constants.TRANS,tag1,tag2]

! ! -29.0 18.0
! , 2.0 -49.0
! @ 5.0 -75.0
# # -3.0 -130.0
# , -3.0 -329.0
$ $ -14.0 -194.0
$ , 1.0 -164.0
& @ 2.0 127.0
, ! 3.0 -196.0
, # -4.0 -393.0
, $ 3.0 68.0
, & -3.0 -150.0
, , -1.0 -43.0
, A 3.0 158.0
, @ 1.0 43.0
A ! 2.0 108.0
A , -1.0 -36.0
A A -11.0 -400.0
A @ -1.0 -33.0
@ ! 1.0 95.0
@ & 1.0 45.0
@ @ -9.0 -412.0


**Deliverable 2e** (2 points): Implement trainAvgStructPerceptron in gtnlplib/str_perceptron.py. This will be quite similar to your trainAvgPerceptron from pset 2, but will have to take slightly different arguments to handle the structured prediction case. Don't forget to use evalTagger to produce output.

In [32]:
# your code should roughly reproduce this sanity check. It may be a little slow, so we'll just test on the first 50 instances.
# While you're debugging your code, you can run on even smaller datasets.
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(5,tr_all[:50],gtnlplib.features.wordTransFeatures,alltags)

0 dev: 0.340037321169 train: 0.431638418079
1 dev: 0.230354551109 train: 0.567914438503
2 dev: 0.458635703919 train: 0.734010152284
3 dev: 0.408666804893 train: 0.846376811594
4 dev: 0.467965996268 train: 0.861751152074


In [33]:
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(10,tr_all,gtnlplib.features.wordTransFeatures,alltags)

0 dev: 0.55525606469 train: 0.513501203722
1 dev: 0.651669085631 train: 0.700996271622
2 dev: 0.618909392494 train: 0.782924300035
3 dev: 0.66473149492 train: 0.845881455655
4 dev: 0.631764461953 train: 0.886489279256
5 dev: 0.680074642339 train: 0.910095799558
6 dev: 0.66162139747 train: 0.926761882463
7 dev: 0.696039809247 train: 0.937910981883
8 dev: 0.728592162554 train: 0.946606722329
9 dev: 0.693759071118 train: 0.952247998358


In [34]:
confusion = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.wordTransFeatures,theta,alltags)[0],'str_avg_perceptron.response',testfile=devfile)
print gtnlplib.scorer.accuracy(confusion)

0.693759071118


**Deliverable 2f** (3 points): Implement a better feature set for structured prediction by completing the yourHMMFeatures function in gtnlplib/features.py. For speed reasons, you might not want to use all the features you used in 1e, but try to get as good an accuracy as you can. Last year I was able to get my structured perceptron to work a little better than my best classifier, but this year my classifier is (very slightly) better!

In [35]:
reload(gtnlplib.features)

<module 'gtnlplib.features' from 'gtnlplib/features.pyc'>

In [36]:
theta,tr_acc,dv_acc = gtnlplib.str_perceptron.trainAvgStructPerceptron(15,tr_all,gtnlplib.features.yourHMMFeatures,alltags)

0 dev: 0.710138917686 train: 0.809820548215
1 dev: 0.77773170226 train: 0.897476759628
2 dev: 0.772133526851 train: 0.927508682142
3 dev: 0.764254613311 train: 0.944418973726
4 dev: 0.787476674269 train: 0.951626639414
5 dev: 0.789550072569 train: 0.959578323176
6 dev: 0.800953763218 train: 0.963610269494
7 dev: 0.797221646278 train: 0.969061968784
8 dev: 0.802612481858 train: 0.972687752961
9 dev: 0.796806966618 train: 0.973890263426
10 dev: 0.804271200498 train: 0.974232781785
11 dev: 0.785403275969 train: 0.97791225416
12 dev: 0.799709724238 train: 0.980350689332
13 dev: 0.795148247978 train: 0.979924390562
14 dev: 0.799917064068 train: 0.979875492314


In [40]:
confusion_str = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.yourHMMFeatures,theta,alltags)[0],'str_avg_perceptron_custom.response',testfile=devfile)
print gtnlplib.scorer.accuracy(confusion_str)

0.799917064068


# 3. Error analysis #

(3 points; 7650 only). The scorer.py script produces a confusion matrix, which shows the most common types of errors. Consider your best tagger in any part of the assignment, and identify the three most frequent errors (e.g., N classified as V). For each type of error, find an example sentence in which your tagger made it, explain why you think it made the mistake, and say how it could be fixed. (If you are feeling competitive, you can then use this information to go back and try to improve your features.)

In [41]:
import operator
confusion_sorted = sorted(confusion_clf.iteritems(), key=operator.itemgetter(1))[::-1]
confusion_errs = [err for err in confusion_sorted if err[0][0]!= err[0][1]]
confusion_errs[0:3]

[(('^', 'N'), 100), (('V', 'N'), 66), (('A', 'N'), 40)]

My best tagger is the one from part 1e, i.e. the classifier-based tagger with my own features defined in features.yourFeatures(). According to the confusion matrix shown above, the three most common types of errors are:
- 1). '^' classified as 'N': a proper noun tagged as a common noun.
  + Example: I LOVE MY CELTICS, !!!
  + True tags:       ['O', 'V', 'D', '^', ',']
  + Predicted tags:  ['O', 'V', 'D', 'N', ',']
  + Here 'CELTICS' should be tagged as a proper noun but was tagged as a common noun. For this type of error, according to this [seminal paper](http://dl.acm.org/citation.cfm?id=2002747), we can add a tagger specialized in recognizing proper nouns through the likelihood that a token is capitalized. But because capitalization is inconsistent in microblog text, the improvement is limited: the recall of proper nouns is only 71%, as mentioned in the paper.
- 2). 'V' classified as 'N': a verb tagged as a noun.
  + Example: @BettyMcSweets that hurts!
  + True tags:       ['@', 'O', 'V', ',']
  + Predicted tags:  ['@', 'O', 'N', ',']
  + Here 'hurts' should be tagged as a verb but was tagged as a noun. This error stems from an inherent drawback of classifier-based taggers: they ignore the surrounding tags and label each word independently. Structured prediction, as implemented in part 2, should help considerably. With local context features such as tag-to-tag transitions, the model can capture the fact that a two-word sequence tagged ['N', 'V'] is much more common than one tagged ['N', 'N']: the transition probability (or, in the discriminative setting here, the transition weight) is much higher for the first case.
  + I didn't use the AvgStructPerceptron tagger here because its overall accuracy is slightly worse than the classifier's. I think part of the reason is that, for speed, I gave the AvgStructPerceptron tagger only half of the features used in the classifier tagger. It is already much slower even with half of the features, which is expected because the Viterbi algorithm's running time grows with the square of the number of tags, and the feature function is evaluated for every pair of tags at every position.
- 3). 'A' classified as 'N': an adjective tagged as a noun.
  + Example: It's time to make big changes ...
  + True tags:       ['L', 'N', 'P', 'V', 'A','N',',']
  + Predicted tags:  ['L', 'N', 'P', 'V', 'N','N',',']
  + Here 'big' should be tagged as an adjective but was tagged as a noun. This type of error has the same root as the previous one: a lack of context features. Clearly ['V','A','N'] is more likely than ['V','N','N'], so this can be improved by structured prediction as described in part 2. For this specific case, a trigram HMM tagger would likely perform significantly better, since quite a few verbs can be followed by two nouns, by adjective + noun, by noun + adverb, etc.

# 4. Bakeoff! #

48 hours before the assignment is due, we will send you unlabeled test data. Your job is to produce a response file that I can evaluate. I'll present the results in class and give the best scorers a chance to explain what they did.


**Deliverable 4** (3 points) Run your best system from any part of the
assignment on the test data using the `generateKaggleSubmission()` function. Submit
your response file to the class [Kaggle bakeoff](https://inclass.kaggle.com/c/gt-book-review-sentiment-analysis). Also **submit your Kaggle response file to T-Square as 'lastname-firstname.response'.** The top
scores will be announced in class.


#### I joined the Kaggle bakeoff late, so I submitted the response file Shi-Chao.response in the tarball of response files, using the classifier tagger from 1e.

In [39]:
#confusion = gtnlplib.tagger_base.evalTagger(lambda words, alltags : gtnlplib.viterbi.viterbiTagger(words,gtnlplib.features.yourFeatures,theta,alltags)[0],'lastname-firstname.response',testfile=devfile)
confusion_clf = gtnlplib.tagger_base.evalTagger(lambda words,alltags : gtnlplib.tagger_base.classifierTagger(words,gtnlplib.features.yourFeatures,w,alltags),'Shi-Chao.response',testfile=devfile)