### NLP Lab Session Week 9
### More on Features and Evaluation for Classification
### Part 1:  Bigram Features

Getting Started

For this lab session download the examples:  LabWk9.bigramsPOSeval.py and put it in your class folder for copy/pasting examples.  Start your jupyter notebook session.


In [2]:
import nltk

In this week’s lab, we show two more types of features sometimes used in classification and how to use more classifier evaluation measures and methods.  After this week’s lab, you should be able to use a variety of features to test with your final project data, and also be able to report better evaluation measures with cross-validation.

#### Bigram Features

One more important source of features often used in sentiment and other document or sentence-level classifications is bigram features.  Typically these features are added to word level features.

First, we restart by loading the movie review sentences and getting the baseline performance of the unigram features.  This is a repeat from last week in order to get started, except that we will change the size of the feature sets.


In [1]:
from nltk.corpus import sentence_polarity
import random


The movie review documents are not labeled individually, but are separated into file directories by category.  We first create the list of documents/sentences where each is paired with its label.  

In [2]:
documents = [(sent, cat) for cat in sentence_polarity.categories() 
    for sent in sentence_polarity.sents(categories=cat)]


In this list, each item is a pair (d,c) where d is a list of words from a sentence and c is its label, either ‘pos’ or ‘neg’.

Since the documents are in order by label, we mix them up for later separation into training and test sets.


In [3]:
random.shuffle(documents)
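Because every accuracy figure in this lab depends on how the shuffle happens to fall, it can help to make the shuffle repeatable. The following is a minimal sketch, not part of the original lab code: the helper name `seeded_shuffle`, the toy data, and the seed value 42 are all assumptions chosen for illustration.

```python
import random

# Toy stand-in for the (sentence, label) pairs built above.
toy_documents = [(['good', 'fun'], 'pos'), (['dull', 'plot'], 'neg'),
                 (['great', 'cast'], 'pos'), (['too', 'long'], 'neg')]

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)
    # A private Random instance avoids disturbing the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled

first_run = seeded_shuffle(toy_documents, seed=42)
second_run = seeded_shuffle(toy_documents, seed=42)
print(first_run == second_run)
```

With a fixed seed, two people running the notebook would get the same train/test split and could compare accuracy numbers directly.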

In [5]:
documents[:5]

[(['the',
   'movie',
   'keeps',
   'coming',
   'back',
   'to',
   'the',
   'achingly',
   'unfunny',
   'phonce',
   'and',
   'his',
   'several',
   'silly',
   'subplots',
   '.'],
  'neg'),
 (["it's",
   'clear',
   'the',
   'filmmakers',
   "weren't",
   'sure',
   'where',
   'they',
   'wanted',
   'their',
   'story',
   'to',
   'go',
   ',',
   'and',
   'even',
   'more',
   'clear',
   'that',
   'they',
   'lack',
   'the',
   'skills',
   'to',
   'get',
   'us',
   'to',
   'this',
   'undetermined',
   'destination',
   '.'],
  'neg'),
 (['at',
   'heart',
   'the',
   'movie',
   'is',
   'a',
   'deftly',
   'wrought',
   'suspense',
   'yarn',
   'whose',
   'richer',
   'shadings',
   'work',
   'as',
   'coloring',
   'rather',
   'than',
   'substance',
   '.'],
  'pos'),
 (['falls',
   'neatly',
   'into',
   'the',
   'category',
   'of',
   'good',
   'stupid',
   'fun',
   '.'],
  'pos'),
 (['big',
   'fat',
   'liar',
   'is',
   'little',
   'more',
  

In [6]:
# We need to define the set of words that will be used for features.  For this week’s lab, we will limit the length of the 
# word features to 1500.

all_words_list = [word for (sent,cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
word_items = all_words.most_common(1500)
word_features = [word for (word, freq) in word_items]


In [7]:
# As before, the word feature labels will be ‘V_keyword’ for each keyword (aka word) in the word_features set, 
# and the value of the feature will be Boolean, according to whether the word is contained in that document.

def document_features(document, word_features):
	document_words = set(document)
	features = {}
	for word in word_features:
		features['V_{}'.format(word)] = (word in document_words)
	return features
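To see what these Boolean features look like, here is a small sketch on toy data. It assumes only the standard library: `collections.Counter` stands in for `nltk.FreqDist` (which is a `Counter` subclass), and the names `toy_sents`, `vocab` and `toy_document_features` are hypothetical.

```python
from collections import Counter

toy_sents = [['a', 'fun', 'movie'], ['a', 'dull', 'movie'], ['fun', 'and', 'fun']]

# Count every token and keep the 3 most frequent words as the vocabulary.
counts = Counter(word for sent in toy_sents for word in sent)
vocab = [word for word, freq in counts.most_common(3)]

def toy_document_features(document, word_features):
    """One Boolean feature per vocabulary word: is it in the document?"""
    document_words = set(document)
    return {'V_{}'.format(w): (w in document_words) for w in word_features}

features = toy_document_features(['a', 'fun', 'movie'], vocab)
other = toy_document_features(['dull', 'plot'], vocab)
print(features)
print(other)
```

Every document gets the same set of feature names, and only the True/False values differ, which is what the Naïve Bayes classifier expects.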


In [8]:
# Define the feature sets for the documents. 
featuresets = [(document_features(d,word_features), c) for (d,c) in documents]
len(featuresets)


10662

In [9]:
# We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy.  
# We separate the data into a 90%, 10% split for training and testing.

train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


0.742


Now that we have a baseline for performance for this random split of the data, we’ll create some bigram features.  

As we saw in the lab in Week 3, when we worked on generating bigrams from documents, if we want to use highly frequent bigrams, we need to filter out special characters, which were very frequent in the bigrams, and also filter by frequency.  The bigram pmi measure also required some filtering to get frequent and meaningful bigrams.  

But there is another bigram association measure that is more often used to filter bigrams for classification features.  This is the chi-squared measure, which scores how strongly the two words of a bigram are associated.  Because rare pairs can also receive high chi-squared scores, it is worth inspecting the selected bigrams and, if needed, applying a frequency filter as we did in Week 3.  Another frequently used alternative is to just use frequency, which is the bigram measure raw_freq.
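To make the chi-squared idea concrete, the sketch below computes the standard Pearson chi-squared statistic for a single bigram from a 2x2 table of adjacent word pairs. It is an illustration under stated assumptions, not NLTK's implementation: the function name `chi_squared` and the toy token list are made up, and NLTK's `BigramAssocMeasures.chi_sq` may count the marginals slightly differently.

```python
from collections import Counter

def chi_squared(tokens, bigram):
    """Pearson chi-squared for one bigram from a 2x2 table of adjacent pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    w1, w2 = bigram
    n_total = len(pairs)
    n_both = pair_counts[bigram]  # w1 followed by w2
    # w1 followed by some other word, and some other word followed by w2
    n_w1_only = sum(c for (a, b), c in pair_counts.items() if a == w1) - n_both
    n_w2_only = sum(c for (a, b), c in pair_counts.items() if b == w2) - n_both
    n_neither = n_total - n_both - n_w1_only - n_w2_only
    denom = ((n_both + n_w1_only) * (n_both + n_w2_only)
             * (n_w1_only + n_neither) * (n_w2_only + n_neither))
    if denom == 0:
        return 0.0
    return n_total * (n_both * n_neither - n_w1_only * n_w2_only) ** 2 / denom

tokens = ['new', 'york', 'is', 'big', 'new', 'york', 'is', 'old',
          'the', 'city', 'is', 'new']
for pair in [('new', 'york'), ('the', 'city'), ('is', 'big')]:
    print(pair, round(chi_squared(tokens, pair), 3))
```

In this toy data, ('new', 'york') occurs twice and ('the', 'city') only once, yet both reach the same maximum score because their words never appear in any other pair. That is one reason to look at the selected bigrams before trusting them as features.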

We’ll start by importing the collocations package and creating a short cut variable name for the bigram association measures.


In [10]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()


In [11]:
# We create a bigram collocation finder using the original movie review words, since the bigram finder must have the words 
# in order.  Note that our all_words_list has exactly this list.

all_words_list[:50]
finder = BigramCollocationFinder.from_words(all_words_list)


In [12]:
# We use the chi-squared measure to get bigrams that are informative features.  
# Note that we don’t need to get the scores of the bigrams, so we use the nbest function which just returns the highest scoring 
# bigrams, using the number specified. (Or try bigram_measures.raw_freq.)

bigram_features = finder.nbest(bigram_measures.chi_sq, 500)

In [13]:
# The nbest function returns a list of significant bigrams in this corpus, and we can look at some of them.

print(bigram_features[:50])

# We are going to use these bigrams as features in a new features function.  In order to test if any bigram in the 
# bigram_features list is in the document, we need to generate the bigrams of the document, which we do using the 
# nltk.bigrams function.  To show this, we define a sentence and show the bigrams.

sent = ['Arthur','carefully','rode','the','brown','horse','around','the','castle']
sentbigrams = list(nltk.bigrams(sent))
sentbigrams


[("''independent", "film''"), ("'60s-homage", 'pokepie'), ("'[the", 'cockettes]'), ("'ace", "ventura'"), ("'alternate", "reality'"), ("'aunque", 'recurre'), ("'black", "culture'"), ("'blue", "crush'"), ("'chan", "moment'"), ("'chick", "flicks'"), ("'date", "movie'"), ("'ethnic", 'cleansing'), ("'face", "value'"), ("'fully", "experienced'"), ("'jason", "x'"), ("'juvenile", "delinquent'"), ("'laugh", "therapy'"), ("'masterpiece", "theatre'"), ("'nicholas", "nickleby'"), ("'old", "neighborhood'"), ("'opening", "up'"), ("'rare", "birds'"), ("'sacre", 'bleu'), ("'science", "fiction'"), ("'shindler's", "list'"), ("'snow", "dogs'"), ("'some", "body'"), ("'special", "effects'"), ("'terrible", "filmmaking'"), ("'time", "waster'"), ("'true", "story'"), ("'unfaithful'", 'cheats'), ("'very", "sneaky'"), ("'we're", '-doing-it-for'), ("'who's", "who'"), ('-after', 'spangle'), ('-as-it-', 'thinks-it-is'), ('-as-nasty', '-as-it-'), ('-doing-it-for', "-the-cash'"), ('10-course', 'banquet'), ('10-year',

[('Arthur', 'carefully'),
 ('carefully', 'rode'),
 ('rode', 'the'),
 ('the', 'brown'),
 ('brown', 'horse'),
 ('horse', 'around'),
 ('around', 'the'),
 ('the', 'castle')]

In [14]:
# For any one bigram, we can test if it is in the bigrams of the sentence and we can use string formatting, 
# with two occurrences of {}s, to insert the two words of the bigram into the name of the feature.

bigram = ('brown','horse')
print(bigram in sentbigrams)
print('B_{}_{}'.format(bigram[0], bigram[1]))

# Now we create a feature extraction function that has all the word features as before, but also has bigram features.

def bigram_document_features(document, word_features, bigram_features):
    document_words = set(document)
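    # nltk.bigrams returns a generator, so collect the pairs into a set for repeated membership tests
    document_bigrams = set(nltk.bigrams(document))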
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    for bigram in bigram_features:
        features['B_{}_{}'.format(bigram[0], bigram[1])] = (bigram in document_bigrams)    
    return features



True
B_brown_horse
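One detail matters when testing membership against a document's bigrams: in NLTK 3, nltk.bigrams returns a generator, and a generator can only be consumed once. The sketch below shows the effect using a hypothetical stand-in, `adjacent_pairs`, so that it runs without NLTK; the behavior it demonstrates is plain Python generator behavior.

```python
def adjacent_pairs(words):
    """Yield adjacent word pairs lazily, as a generator."""
    for first, second in zip(words, words[1:]):
        yield (first, second)

sent = ['Arthur', 'carefully', 'rode', 'the', 'brown', 'horse',
        'around', 'the', 'castle']

lazy_pairs = adjacent_pairs(sent)
found_late = ('brown', 'horse') in lazy_pairs        # scans forward to the match
found_early = ('Arthur', 'carefully') in lazy_pairs  # that pair was already consumed

# Materializing the pairs once makes every lookup independent of the others.
pair_set = set(adjacent_pairs(sent))
in_set_late = ('brown', 'horse') in pair_set
in_set_early = ('Arthur', 'carefully') in pair_set
print(found_late, found_early, in_set_late, in_set_early)
```

The second lookup on the generator fails even though the pair is in the sentence. A feature function that checks hundreds of bigrams against one document therefore needs a list or set of that document's bigrams, not the generator itself.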


In [15]:
# Now we create feature sets as before, but using this feature extraction function.

bigram_featuresets = [(bigram_document_features(d,word_features,bigram_features), c) for (d,c) in documents]

#There should be 2000 features:  1500 word features and 500 bigram features

len(bigram_featuresets[0][0].keys())

train_set, test_set = bigram_featuresets[1000:], bigram_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# In the run recorded here, adding the bigram features left the accuracy unchanged from the unigram baseline.
# Results depend on the random train/test split, and there are many classification tasks for which bigrams are important.


0.742


### Part 2:  POS tag features

#### Continuing our session with the movie review sentences

#### POS tag features

There are some classification tasks where part-of-speech tag features can have an effect.  In my experience, this is more likely for shorter units of classification, such as sentence level classification or shorter social media such as tweets.

The most common way to use POS tagging information is to include counts of various types of word tags.  Here is an example feature function that counts nouns, verbs, adjectives and adverbs for features.  [Note that this function calls nltk.pos_tag every time that it is run and for repeated experiments, you could pre-compute the pos tags and save them for every document.]


In [16]:
# Observing NLTK's default POS tagger (nltk.pos_tag, an averaged perceptron tagger) on a sentence:
print(sent)
print(nltk.pos_tag(sent))


['Arthur', 'carefully', 'rode', 'the', 'brown', 'horse', 'around', 'the', 'castle']
[('Arthur', 'NNP'), ('carefully', 'RB'), ('rode', 'VBD'), ('the', 'DT'), ('brown', 'JJ'), ('horse', 'NN'), ('around', 'IN'), ('the', 'DT'), ('castle', 'NN')]


In [17]:
# Here is the definition of our new feature function, adding POS tag counts to the word features.

def POS_features(document):
    document_words = set(document)
    tagged_words = nltk.pos_tag(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    numNoun = 0
    numVerb = 0
    numAdj = 0
    numAdverb = 0
    for (word, tag) in tagged_words:
        if tag.startswith('N'): numNoun += 1
        if tag.startswith('V'): numVerb += 1
        if tag.startswith('J'): numAdj += 1
        if tag.startswith('R'): numAdverb += 1
    features['nouns'] = numNoun
    features['verbs'] = numVerb
    features['adjectives'] = numAdj
    features['adverbs'] = numAdverb
    return features
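As noted above, POS_features calls nltk.pos_tag each time it runs. For repeated experiments, one option is to tag every document once and reuse the result. The sketch below is an assumption-laden illustration: `precompute_tags`, `count_pos` and `toy_tagger` are hypothetical names, and the toy tagger (with tags copied from the output above) only exists so the example runs without NLTK data. In the lab, the `tagger` argument would be nltk.pos_tag.

```python
def count_pos(tagged_words):
    """Count nouns, verbs, adjectives and adverbs by Penn Treebank tag prefix."""
    counts = {'nouns': 0, 'verbs': 0, 'adjectives': 0, 'adverbs': 0}
    for word, tag in tagged_words:
        if tag.startswith('N'):
            counts['nouns'] += 1
        elif tag.startswith('V'):
            counts['verbs'] += 1
        elif tag.startswith('J'):
            counts['adjectives'] += 1
        elif tag.startswith('R'):
            counts['adverbs'] += 1
    return counts

def precompute_tags(documents, tagger):
    """Tag every document once; returns a list of (tagged_words, label)."""
    return [(tagger(words), label) for (words, label) in documents]

# Hypothetical stand-in tagger; unknown words default to 'NN'.
TOY_TAGS = {'Arthur': 'NNP', 'carefully': 'RB', 'rode': 'VBD', 'the': 'DT',
            'brown': 'JJ', 'horse': 'NN', 'around': 'IN', 'castle': 'NN'}

def toy_tagger(words):
    return [(w, TOY_TAGS.get(w, 'NN')) for w in words]

docs = [(['Arthur', 'carefully', 'rode', 'the', 'brown', 'horse',
          'around', 'the', 'castle'], 'pos')]
tagged_docs = precompute_tags(docs, toy_tagger)
pos_counts = count_pos(tagged_docs[0][0])
print(pos_counts)
```

The tagged list could then be saved (for example with pickle) so that different feature functions can be tried without re-running the tagger.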


In [18]:
# Try out the POS features.
POS_featuresets = [(POS_features(d), c) for (d, c) in documents]
# number of features for document 0
len(POS_featuresets[0][0].keys())

# Show the first sentence in your (randomly shuffled) documents and look at its POS tag features.

print(documents[0])
# the pos tag features for this sentence
print('num nouns', POS_featuresets[0][0]['nouns'])
print('num verbs', POS_featuresets[0][0]['verbs'])
print('num adjectives', POS_featuresets[0][0]['adjectives'])
print('num adverbs', POS_featuresets[0][0]['adverbs'])

# Now split into training and test and rerun the classifier.
train_set, test_set = POS_featuresets[1000:], POS_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

# In the run shown below, the accuracy is 0.738, close to the 0.742 baseline above;
# the effect of the POS counts is small and varies with the train/test split.




(["it's", 'hard', 'not', 'to', 'feel', "you've", 'just', 'watched', 'a', 'feature-length', 'video', 'game', 'with', 'some', 'really', 'heavy', 'back', 'story', '.'], 'neg')
num nouns 3
num verbs 2
num adjectives 2
num adverbs 7


0.738

### Part 3:  The Evaluation Method of Cross-Validation

#### Continuing our session with the movie review sentences

#### Cross-Validation

As a final topic in evaluation, we have discussed that our testing of the features on the movie reviews and movie review sentences data is often skewed by the random sample.  The remedy for this is to use different chunks of the data as the test set to repeatedly train a model and then average our performance over those models.

This method is called cross-validation, or sometimes k-fold cross-validation.  In this method, we choose a number of folds, k, which is usually a small number like 5 or 10.  We first randomly partition the development data into k subsets, each approximately equal in size.  Then we train the classifier k times, where, at each iteration, we use each subset in turn as the test set and the others as a training set.


<img src="5_fold_cv.png" width="400">

 
NLTK does not have a built-in function for cross-validation, but we can program the process in a function that takes the number of folds and the feature sets, and iterates over training and testing a classifier.  This function only reports accuracy for each fold and for the overall average.

In [19]:
subset_size = len(featuresets)//10
i=0
print(subset_size,i)


1066 0


In [25]:
def cross_validation(num_folds, featuresets):
    subset_size = len(featuresets)//num_folds
    accuracy_list = []
    # iterate over the folds
    for i in range(num_folds):
        test_this_round = featuresets[i*subset_size:][:subset_size]
        train_this_round = featuresets[:i*subset_size]+featuresets[(i+1)*subset_size:]
        # train using train_this_round
        classifier = nltk.NaiveBayesClassifier.train(train_this_round)
        # evaluate against test_this_round and save accuracy
        accuracy_this_round = nltk.classify.accuracy(classifier, test_this_round)
        print(i, accuracy_this_round)
        accuracy_list.append(accuracy_this_round)
    # find mean accuracy over all rounds
    print('mean accuracy', sum(accuracy_list) / num_folds)
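It is worth checking which items each fold actually tests. The sketch below reproduces only the index arithmetic of the function above; `fold_slices` is a hypothetical helper name, and 10662 is the number of feature sets reported earlier in this lab.

```python
def fold_slices(num_items, num_folds):
    """Return the (start, end) test-set boundaries for each fold."""
    subset_size = num_items // num_folds
    return [(i * subset_size, (i + 1) * subset_size) for i in range(num_folds)]

slices = fold_slices(10662, 10)
never_tested = 10662 - slices[-1][1]
print(slices[0], slices[-1], never_tested)
```

Because of the integer division, the folds cover 10 * 1066 = 10660 items. The remaining 2 items are in the training data in every round but never appear in a test set, which is harmless here but would matter more for small data sets.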

In [40]:
# Run the cross-validation on our word feature sets with 10 folds.
cross_validation(10, featuresets)

# Instead of accuracy, we should have a cross-validation function to report precision and recall for each label.


0 0.7504690431519699
1 0.7401500938086304
2 0.7073170731707317
3 0.7420262664165104
4 0.7157598499061913
5 0.7504690431519699
6 0.7542213883677298
7 0.7607879924953096
8 0.7495309568480301
9 0.7317073170731707
mean accuracy 0.7402439024390245
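A cross-validation function that reports precision and recall per label could look roughly like the sketch below. This is one possible design, not the lab's own solution: the names `per_label_scores`, `cross_validation_PRF` and `train_keyword_rule` are hypothetical, the toy "learner" ignores its training data, and the code assumes every label occurs in the gold labels of every fold.

```python
def per_label_scores(gold, predicted):
    """Return {label: (precision, recall)} from two aligned label lists."""
    scores = {}
    for lab in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, predicted) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, predicted) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, predicted) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[lab] = (precision, recall)
    return scores

def cross_validation_PRF(num_folds, featuresets, train_fn):
    """Average per-label precision and recall over the folds.

    train_fn(train_set) must return a function mapping features to a label.
    """
    subset_size = len(featuresets) // num_folds
    totals = {}
    for i in range(num_folds):
        test = featuresets[i * subset_size:(i + 1) * subset_size]
        train = featuresets[:i * subset_size] + featuresets[(i + 1) * subset_size:]
        classify = train_fn(train)
        gold = [label for (features, label) in test]
        predicted = [classify(features) for (features, label) in test]
        for lab, (p, r) in per_label_scores(gold, predicted).items():
            p_sum, r_sum = totals.get(lab, (0.0, 0.0))
            totals[lab] = (p_sum + p, r_sum + r)
    return {lab: (p / num_folds, r / num_folds) for lab, (p, r) in totals.items()}

def train_keyword_rule(train_set):
    # Toy stand-in for a real learner: predict 'pos' when the 'V_good' feature is on.
    return lambda features: 'pos' if features.get('V_good') else 'neg'

toy_featuresets = [({'V_good': True}, 'pos'), ({'V_good': False}, 'neg'),
                   ({'V_good': True}, 'neg'), ({'V_good': False}, 'pos')]
averages = cross_validation_PRF(2, toy_featuresets, train_keyword_rule)
print(averages)
```

With NLTK, `train_fn` could be something like `lambda train_set: nltk.NaiveBayesClassifier.train(train_set).classify`, so the same loop would work with the feature sets built in this lab.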


In [26]:
# Run the cross-validation on our word bigram feature sets with 10 folds.
cross_validation(10, bigram_featuresets)

0 0.7373358348968105
1 0.7523452157598499
2 0.7476547842401501
3 0.7410881801125704
4 0.7157598499061913
5 0.7532833020637899
6 0.7307692307692307
7 0.7485928705440901
8 0.7223264540337712
9 0.7392120075046904
mean accuracy 0.7388367729831146


In [27]:
# Run the cross-validation on our word POS feature sets with 10 folds.
cross_validation(10, POS_featuresets)

0 0.7382739212007504
1 0.7476547842401501
2 0.7467166979362101
3 0.7354596622889306
4 0.7176360225140713
5 0.7504690431519699
6 0.7317073170731707
7 0.7514071294559099
8 0.7195121951219512
9 0.7335834896810507
mean accuracy 0.7372420262664166
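When comparing the three mean accuracies, it helps to know how much accuracy varies from fold to fold. The sketch below uses the per-fold accuracies printed above for the word features and the three reported means; the variable names are made up for illustration, and the sample standard deviation is only a rough guide to the spread.

```python
import statistics

# Per-fold accuracies copied from the word-feature run above.
word_fold_accuracies = [
    0.7504690431519699, 0.7401500938086304, 0.7073170731707317,
    0.7420262664165104, 0.7157598499061913, 0.7504690431519699,
    0.7542213883677298, 0.7607879924953096, 0.7495309568480301,
    0.7317073170731707,
]

# Mean accuracies reported above for each feature set.
reported_means = {'word': 0.7402439024390245,
                  'word+bigram': 0.7388367729831146,
                  'word+POS': 0.7372420262664166}

mean_accuracy = statistics.mean(word_fold_accuracies)
spread = statistics.stdev(word_fold_accuracies)
largest_gap = max(reported_means.values()) - min(reported_means.values())
print(round(mean_accuracy, 4), round(spread, 4), round(largest_gap, 4))
```

In these runs the fold-to-fold standard deviation (roughly 0.017) is several times larger than the largest gap between the three means (about 0.003), so the differences between the feature sets should not be read as meaningful without further testing.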


### Part 4:  Evaluation Measures:  Precision, Recall and F1

#### Continuing our session with the movie review sentences

#### Other Evaluation Measures

So far, we have been using simple accuracy for a performance evaluation measure of the predictive capability of the model that was learned from the training data.  But we can learn more by looking at the predictions for each of the labels in our classifier.

We start by looking at the confusion matrix, which shows the results of a test for how many of the actual class labels (the gold standard labels) match with the predicted labels.  In this diagram the two labels are called “Yes” and “No”.


<img src="confusion_matrix.png" width="600">

When the actual class is Yes and the predicted class is also Yes, we call those examples the true positives.  When the actual class was Yes, but the predicted class was No, we call those examples the false negatives.  When the actual class is No, but the classifier incorrectly predicted Yes, we call those examples the false positives.  The true negatives are the remaining examples that were correctly predicted No.  The number of each of these types of examples in the test set is put into the confusion matrix.

Note that the intuition for the terminology comes from the idea that we are trying to find all the examples where the class label is Yes, the positive examples.  The false positives represent the positives which were predicted Wrong, and the false negatives represent the positives that were Missed.  This idea originated in the Information Retrieval field where the Yes answers represented documents that were correctly retrieved as the result of a search.  

In keeping with this intuition, two commonly used measures come from IR, where IR is only interested in the positive labels.

recall = TP / ( TP + FN )   	(the percentage of actual yes answers that are right)
precision =  TP / ( TP + FP ) (the percentage of predicted yes answers that are right)

These two measures are sometimes combined into a kind of average, the harmonic mean, called the F-measure, which in its simplest form is:

F-measure = 2 * (recall * precision) / (recall + precision)

In situations where we are equally interested in correctly predicting Yes and No, and the numbers of these are roughly equal, then we may compute precision and recall for both the positive and negative labels.  And we can also use the accuracy measure.

accuracy = (TP + TN) / (TP + FP + FN + TN)    (percentage of correct Yes and No out of all test examples)
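As a quick worked example of these measures, the sketch below plugs made-up counts into the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as their harmonic mean, and accuracy = (TP + TN) / total. The counts and the function name `binary_metrics` are hypothetical and are not results from this lab.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the predicted Yes, how many are right
    recall = tp / (tp + fn)      # of the actual Yes, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts for 100 test examples.
precision, recall, f1, accuracy = binary_metrics(tp=40, fp=10, fn=20, tn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))
```

Here the classifier is right about 80% of its Yes predictions but finds only two thirds of the actual Yes examples, and the F-measure falls between the two values.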


In the NLTK, the confusion matrix is given by a function that takes two lists of labels for the test set.  NLTK calls the first list the reference list, which is all the correct/gold labels for the test set, and the second list is the test list, which is all the predicted labels in the test set.  These two lists are both in the order of the test set, so they can be compared to see which examples the classifier model agreed on or not.

First we build the reference and test lists from the classifier on the test set, but we will call them the gold list and the predicted list.


In [20]:
# First we build the reference and test lists from the classifier on the test set, but we will call them the gold list and 
#the predicted list.

goldlist = []
predictedlist = []
for (features, label) in test_set:
    goldlist.append(label)
    predictedlist.append(classifier.classify(features))

# We can look at the first 30 examples and think about whether the corresponding elements of the lists match.

print(goldlist[:30])
print(predictedlist[:30])

# Now we use the NLTK function to define the confusion matrix, and we print it out:


['neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg']
['neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg']


In [21]:
cm = nltk.ConfusionMatrix(goldlist, predictedlist)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))

 
# (row = gold; col = predicted)


    |      p      n |
    |      o      e |
    |      s      g |
----+---------------+
pos | <36.4%> 13.8% |
neg |  12.4% <37.4%>|
----+---------------+
(row = reference; col = test)
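The same kind of table can be tallied by hand, which makes it clear what each cell counts. The sketch below uses only the first 30 gold and predicted labels printed earlier, so its counts cover those 30 examples and not the full test set; `collections.Counter` and the variable names are choices made for this illustration.

```python
from collections import Counter

# First 30 gold and predicted labels, copied from the output above.
gold = ['neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg',
        'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos',
        'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg']
pred = ['neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg',
        'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos',
        'neg', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg']

# Each key is a (gold label, predicted label) pair, i.e. one cell of the matrix.
cells = Counter(zip(gold, pred))
for g in ['pos', 'neg']:
    print(g, [cells[(g, p)] for p in ['pos', 'neg']])
```

Rows are gold labels and columns are predicted labels, matching the layout of the NLTK matrix. The diagonal cells are the agreements, and the off-diagonal cells are the two kinds of error.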



In our movie sentences classification task, we have two class labels:  ‘neg’ and ‘pos’ (instead of Yes and No).  If we consider the ‘pos’ class as the positive class and the ‘neg’ as the negative class, then this confusion matrix is reversed from our previous version, and with 1000 test sentences the percentages above correspond to 364 True Positives, 374 True Negatives, 124 False Positives and 138 False Negatives.  Since this classification task is symmetric with respect to the two classes, we can flip the terminology and consider the ‘neg’ class as positive and the ‘pos’ class as negative.  In that case, there are 374 True Positives, 364 True Negatives, 138 False Positives, and 124 False Negatives.

Since we are interested in both the ‘pos’ and ‘neg’ classes, we next want to compute precision, recall and F1 for each class.  There are NLTK functions to do this, but they require a lot of setup to get the input in the correct forms.

Instead, I have written a function that takes the gold list and the predicted list, computes the True Positives, True Negatives, False Positives, False Negatives and then uses those to compute the other measures for each class.  I called this function eval_measures.


In [22]:
# Function to compute precision, recall and F1 for each label
#  and for any number of labels
# Input: list of gold labels, list of predicted labels (in same order)
# Output:  prints precision, recall and F1 for each label
def eval_measures(gold, predicted):
    # get a list of labels
    labels = list(set(gold))
    # these lists have values for each label 
    recall_list = []
    precision_list = []
    F1_list = []
    for lab in labels:
        # for each label, compare gold and predicted lists and compute values
        TP = FP = FN = TN = 0
        for i, val in enumerate(gold):
            if val == lab and predicted[i] == lab:  TP += 1
            if val == lab and predicted[i] != lab:  FN += 1
            if val != lab and predicted[i] == lab:  FP += 1
            if val != lab and predicted[i] != lab:  TN += 1
        # use these to compute recall, precision, F1
        # recall: share of gold items with this label that were found
        # precision: share of predictions of this label that are right
        recall = TP / (TP + FN)
        precision = TP / (TP + FP)
        recall_list.append(recall)
        precision_list.append(precision)
        F1_list.append( 2 * (recall * precision) / (recall + precision))

    # the evaluation measures in a table with one row per label
    print('\tPrecision\tRecall\t\tF1')
    # print measures for each label
    for i, lab in enumerate(labels):
        print(lab, '\t', "{:10.3f}".format(precision_list[i]), \
          "{:10.3f}".format(recall_list[i]), "{:10.3f}".format(F1_list[i]))



In [23]:
# Now we can call this function on our data.

eval_measures(goldlist, predictedlist)

# This gives us more information about the performance of the model for each label.  
# We can see that the ‘pos’ label is predicted with higher precision, .746,
# while the ‘neg’ label is predicted with higher recall, .751.


	Precision	Recall		F1
pos 	      0.746      0.725      0.735
neg 	      0.730      0.751      0.741
