Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit, by Steven Bird, Ewan Klein, and Edward Loper. http://www.nltk.org/book/

# Chapter 06 - Learning to Classify Text

## 6.1 Supervised Classification

Classification is the task of choosing the correct class label for a given input. In basic
classification tasks, each input is considered in isolation from all other inputs, and the
set of labels is defined in advance.

### Gender Identification

In Section 2.4, we saw that male and female names have some distinctive characteristics.
Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and
t are likely to be male. Let’s build a classifier to model these differences more precisely.

In [1]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [2]:
gender_features('Shrek')

{'last_letter': 'k'}

In [15]:
import nltk
from nltk.corpus import names

In [16]:
import random

In [17]:
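# Note: this rebinds `names`, so the corpus reader imported above is no longer
# reachable under that name; re-run the import cell before re-running this one.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])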

In [18]:
random.shuffle(names)

In [19]:
featuresets = [(gender_features(n), g) for (n, g) in names]

In [20]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [21]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [22]:
classifier.classify(gender_features('Neo'))

'male'

In [23]:
classifier.classify(gender_features('Trinity'))

'female'

In [24]:
print(nltk.classify.accuracy(classifier, test_set))

0.746


In [25]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     38.3 : 1.0
             last_letter = 'k'              male : female =     32.1 : 1.0
             last_letter = 'f'              male : female =     16.1 : 1.0
             last_letter = 'p'              male : female =      9.9 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0


In [26]:
from nltk.classify import apply_features

In [27]:
train_set = apply_features(gender_features, names[500:])

In [28]:
test_set = apply_features(gender_features, names[:500])

### Choosing the Right Features

Typically, feature extractors are built through a process of trial-and-error, guided by
intuitions about what information is relevant to the problem. It’s common to start with
a “kitchen sink” approach, including all the features that you can think of, and then
checking to see which features actually are helpful.

In [30]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [86]:
print(gender_features2('John'))

{'firstletter': 'j', 'lastletter': 'n', 'count(a)': 0, 'has(a)': False, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 1, 'has(h)': True, 'count(i)': 0, 'has(i)': False, 'count(j)': 1, 'has(j)': True, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 1, 'has(n)': True, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}


In [32]:
featuresets = [(gender_features2(n), g) for (n,g) in names]

In [33]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [34]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [35]:
print(nltk.classify.accuracy(classifier, test_set))

0.734


In [44]:
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

In [45]:
# train_set = [(gender_features(n), g) for (n, g) in train_names]
# devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
# test_set = [(gender_features(n), g) for (n, g) in test_names]

In [46]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)
test_set = apply_features(gender_features, test_names)

In [47]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [48]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.79


In [49]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [50]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Adriaens                      
correct=female   guess=male     name=Agnes                         
correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Alyson                        
correct=female   guess=male     name=Arlen                         
correct=female   guess=male     name=Ashlen                        
correct=female   guess=male     name=Ayn                           
correct=female   guess=male     name=Betteann                      
correct=female   guess=male     name=Bidget                        
correct=female   guess=male     name=Brittan                       
correct=female   guess=male     name=Candis                        
correct=female   guess=male     name=Carlin                        
correct=female   guess=male     name=Celestyn                      
correct=female   guess=male     name=Chad                          
correct=female   guess=male     name=Charlot    

In [51]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [52]:
# train_set = [(gender_features(n), g) for (n,g) in train_names]
# devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]

In [53]:
train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)

In [54]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [55]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.809


### Document Classification

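Some corpora, such as the Movie Reviews Corpus used below, contain documents that have
already been labeled with categories. Using these corpora, we can build classifiers that
will automatically tag new documents with appropriate category labels. First, we construct
a list of documents, labeled with the appropriate categories.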

In [79]:
from nltk.corpus import movie_reviews

In [80]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [81]:
random.shuffle(documents)

In [82]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [83]:
word_features = [w for w, f in all_words.most_common(2000)]

In [84]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [85]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'contains(,)': True, 'contains(the)': True, 'contains(.)': True, 'contains(a)': True, 'contains(and)': True, 'contains(of)': True, 'contains(to)': True, "contains(')": True, 'contains(is)': True, 'contains(in)': True, 'contains(s)': True, 'contains(")': True, 'contains(it)': True, 'contains(that)': True, 'contains(-)': True, 'contains())': True, 'contains(()': True, 'contains(as)': True, 'contains(with)': True, 'contains(for)': True, 'contains(his)': True, 'contains(this)': True, 'contains(film)': False, 'contains(i)': False, 'contains(he)': True, 'contains(but)': True, 'contains(on)': True, 'contains(are)': True, 'contains(t)': False, 'contains(by)': True, 'contains(be)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(an)': True, 'contains(who)': True, 'contains(not)': True, 'contains(you)': True, 'contains(from)': True, 'contains(at)': False, 'contains(was)': False, 'contains(have)': True, 'contains(they)': True, 'contains(has)': True, 'contains(her)': False, 'conta

In [94]:
# featuresets = [(document_features(d), c) for (d, c) in documents]

In [95]:
featuresets = apply_features(document_features, documents)

In [96]:
train_set, test_set = featuresets[100:], featuresets[:100]

In [97]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [98]:
print(nltk.classify.accuracy(classifier, test_set))

0.79


In [100]:
classifier.show_most_informative_features(5)

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.6 : 1.0
        contains(seagal) = True              neg : pos    =      7.8 : 1.0
         contains(mulan) = True              pos : neg    =      7.7 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.7 : 1.0
        contains(poorly) = True              neg : pos    =      5.7 : 1.0


### Part-of-Speech Tagging

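In Chapter 5, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal makeup of the word. However, this regular expression tagger had to be handcrafted. Instead, we can train a classifier to work out which suffixes are most informative.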

In [125]:
from nltk.corpus import brown

In [126]:
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [127]:
common_suffixes = [s for (s, f) in suffix_fdist.most_common(100)]

In [128]:
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [129]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In [140]:
# file_ids = brown.fileids(categories='news')

In [141]:
tagged_words = brown.tagged_words(categories='news')

In [142]:
# featuresets = [(pos_features(n), g) for (n, g) in tagged_words]

In [143]:
featuresets = apply_features(pos_features, tagged_words)

In [144]:
size = int(len(featuresets) * 0.1)

In [145]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.DecisionTreeClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
classifier.classify(pos_features('cats'))

In [None]:
print(classifier.pseudocode(depth=4))

### Exploiting Context

By augmenting the feature extraction function, we could modify this part-of-speech
tagger to leverage a variety of other word-internal features, such as the length of the
word, the number of syllables it contains, or its prefix. However, as long as the feature
extractor just looks at the target word, we have no way to add features that depend on
the context in which the word appears.

In [None]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [None]:
pos_features(brown.sents()[0], 8)

In [None]:
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
featuresets = []

In [None]:
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag) )

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

### Sequence Classification

In [None]:
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

In [None]:
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # Return a list rather than a lazy zip object so the result can be reused.
        return list(zip(sentence, history))

In [None]:
tagged_sents = brown.tagged_sents(categories='news')

In [None]:
size = int(len(tagged_sents) * 0.1)

In [None]:
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [None]:
tagger = ConsecutivePosTagger(train_sents)

In [None]:
print(tagger.evaluate(test_sents))

### Other Methods for Sequence Classification

One shortcoming of this approach is that we commit to every decision that we make.
For example, if we decide to label a word as a noun, but later find evidence that it should
have been a verb, there’s no way to go back and fix our mistake. One solution to this
problem is to adopt a transformational strategy instead. Transformational joint classifiers
work by creating an initial assignment of labels for the inputs, and then iteratively
refining that assignment in an attempt to repair inconsistencies between related inputs.

# Further Examples of Supervised Classification

## Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever
we encounter a symbol that could possibly end a sentence, such as a period or a question
mark, we have to decide whether it terminates the preceding sentence.

In [None]:
sents = nltk.corpus.treebank_raw.sents()

In [None]:
tokens = []

In [None]:
boundaries = set()

In [None]:
offset = 0

In [None]:
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [None]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

In [None]:
featuresets = [(punct_features(tokens, i), (i in boundaries))
    for i in range(1, len(tokens)-1)
    if tokens[i] in '.?!']

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
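def segment_sentences(words):
    """Split a token list into sentences using the trained boundary classifier."""
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features looks at tokens i-1 and i+1, so skip the first and last token.
        if (0 < i < len(words) - 1 and word in '.?!'
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents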

## Identifying Dialogue Act Types

When processing dialogue, it can be useful to think of utterances as a type of action performed by the speaker. This interpretation is most straightforward for performative statements such as “I forgive you” or “I bet you can’t climb that hill.” But greetings, questions, answers, assertions, and clarifications can all be thought of as types of speech-based actions. Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation.

In [None]:
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

In [None]:
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

In [None]:
featuresets = [(dialogue_act_features(post.text), post.get('class'))
                for post in posts]

In [None]:
size = int(len(featuresets) * 0.1)

In [None]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

## Recognizing Textual Entailment

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the “hypothesis” (as already discussed in Section 1.5). To date, there have been four RTE Challenges, where shared development and test data is made available to competing teams. Each text/hypothesis pair in the Challenge 3 development dataset carries a label: True indicates that the entailment holds, and False indicates that it fails to hold.

In [None]:
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

In [None]:
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]

In [None]:
extractor = nltk.RTEFeatureExtractor(rtepair)

In [None]:
print(extractor.text_words)

In [None]:
print(extractor.hyp_words)

In [None]:
print(extractor.overlap('word'))

In [None]:
print(extractor.overlap('ne'))

In [None]:
print(extractor.hyp_extra('word'))
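
The overlap and hyp_extra counts used as features above can be approximated with plain set operations. The following is a rough sketch only: the sentences are invented, `word_set`, `overlap`, and `hyp_extra` are hypothetical helper names, and the tokenization is a simple whitespace split. The NLTK extractor may filter and normalize words differently, so its output will not necessarily match.

```python
def word_set(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = {w.strip('.,;:!?"\'()').lower() for w in sentence.split()}
    return words - {""}

def overlap(text, hyp):
    """Words found in both the text and the hypothesis."""
    return word_set(text) & word_set(hyp)

def hyp_extra(text, hyp):
    """Words found in the hypothesis but not in the text."""
    return word_set(hyp) - word_set(text)

# Invented example pair, for illustration only.
text = "The cat sat on the mat in the hall."
hyp = "The cat slept on the mat."

print(sorted(overlap(text, hyp)))    # ['cat', 'mat', 'on', 'the']
print(sorted(hyp_extra(text, hyp)))  # ['slept']
```

A hypothesis with many extra words that the text never mentions is less likely to be entailed, which is why both counts are useful as features.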

# Evaluation

In order to decide whether a classification model is accurately capturing a pattern, we must evaluate that model. The result of this evaluation is important for deciding how trustworthy the model is, and for what purposes we can use it. Evaluation can also be an effective tool for guiding us in making future improvements to the model.

## The Test Set

It is very important that the test set be distinct from the training corpus: if we simply reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
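
To make the memorization problem concrete, here is a minimal sketch with a toy "model" that stores its training pairs verbatim. The names, labels, and helper functions (`train_memorizer`, `memorizer_accuracy`) are invented for illustration and are not part of NLTK.

```python
def train_memorizer(train_pairs):
    """'Train' by storing every (input, label) pair verbatim."""
    return dict(train_pairs)

def memorizer_accuracy(model, pairs, default="male"):
    """Fraction of pairs labeled correctly; unseen inputs get `default`."""
    correct = sum(1 for x, y in pairs if model.get(x, default) == y)
    return correct / len(pairs)

# Toy data: invented names and labels.
data = [("Anna", "female"), ("Bella", "female"), ("Carla", "female"),
        ("Dora", "female"), ("Erik", "male"), ("Frank", "male"),
        ("Greg", "male"), ("Hans", "male")]

train_pairs, test_pairs = data[2:], data[:2]
model = train_memorizer(train_pairs)

print(memorizer_accuracy(model, train_pairs))  # 1.0
print(memorizer_accuracy(model, test_pairs))   # 0.0
```

The perfect score on the training pairs says nothing about how the model handles names it has never seen, which is exactly what the held-out pairs reveal.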

In [None]:
import random

In [None]:
from nltk.corpus import brown

In [None]:
tagged_sents = list(brown.tagged_sents(categories='news'))

In [None]:
random.shuffle(tagged_sents)

In [None]:
size = int(len(tagged_sents) * 0.1)

In [None]:
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

In this case, our test set will be very similar to our training set. The training set and test set are taken from the same genre, and so we cannot be confident that evaluation results would generalize to other genres. What’s worse, because of the call to random.shuffle(), the test set contains sentences that are taken from the same documents that were used for training.

In [None]:
file_ids = brown.fileids(categories='news')

In [None]:
size = int(len(file_ids) * 0.1)

In [None]:
train_set = brown.tagged_sents(file_ids[size:])

In [None]:
test_set = brown.tagged_sents(file_ids[:size])

In [None]:
train_set = brown.tagged_sents(categories='news')

In [None]:
test_set = brown.tagged_sents(categories='fiction')

## Accuracy

The simplest metric that can be used to evaluate a classifier, accuracy, measures the percentage of inputs in the test set that the classifier correctly labeled.

In [None]:
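# Note: train_set and test_set here must be lists of (featureset, label) pairs,
# such as the name feature sets built in Section 6.1, not the tagged sentences above.
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set)))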
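
The accuracy computation itself is small enough to write by hand. This is a sketch with invented labels; `label_accuracy` is a hypothetical helper, not the NLTK function used above.

```python
def label_accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists differ in length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy over an empty test set")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

# Invented gold labels and predictions: 3 of the 5 predictions are right.
gold_labels = ["male", "female", "female", "male", "female"]
predicted_labels = ["male", "female", "male", "male", "male"]

print("Accuracy: {:4.2f}".format(label_accuracy(gold_labels, predicted_labels)))  # Accuracy: 0.60
```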

## Confusion Matrices

When performing classification tasks with three or more labels, it can be informative to subdivide the errors made by the model based on which types of mistake it made. A confusion matrix is a table where each cell [i,j] indicates how often label j was predicted when the correct label was i. Thus, the diagonal entries (i.e., cells [i,i]) indicate labels that were correctly predicted, and the off-diagonal entries indicate errors.

In [None]:
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

In [None]:
def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

In [None]:
gold = tag_list(brown.tagged_sents(categories='editorial'))

In [None]:
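# t2 is assumed to be a tagger trained earlier (in the book, the bigram tagger with backoff from Chapter 5).
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))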

In [None]:
cm = nltk.ConfusionMatrix(gold, test)
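
To see what the cells of a confusion matrix hold, the counts can be tallied by hand. This is a minimal sketch with invented tag sequences; `confusion_counts` is a hypothetical helper and does not reproduce the formatting of the NLTK class.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Map each (correct_label, predicted_label) pair to how often it occurred."""
    return Counter(zip(gold, predicted))

# Invented gold tags and tagger output.
gold_tags = ["NN", "NN", "VB", "JJ", "NN", "VB"]
pred_tags = ["NN", "VB", "VB", "NN", "NN", "VB"]
counts = confusion_counts(gold_tags, pred_tags)

# One row per correct label i, one column per predicted label j.
labels = sorted(set(gold_tags) | set(pred_tags))
print("gold\\pred " + " ".join("{:>3}".format(l) for l in labels))
for i in labels:
    row = " ".join("{:>3}".format(counts[i, j]) for j in labels)
    print("{:>9} {}".format(i, row))

# Diagonal cells [i,i] are the correct predictions.
print(sum(counts[l, l] for l in labels))  # 4
```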

## Cross-Validation

In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if a limited amount of annotated data is available.
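
Cross-validation addresses this trade-off by running several evaluations on different test sets and combining the scores: the data is divided into N folds, and each fold in turn serves as the test set while the remaining folds are used for training. The sketch below assumes interleaved folds and a trivial majority-label model; the data and the helpers `k_fold_splits`, `cross_validate`, `train_majority`, and `score_majority` are all invented for illustration.

```python
from collections import Counter

def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    for fold in range(k):
        test = items[fold::k]
        train = [x for idx, x in enumerate(items) if idx % k != fold]
        yield train, test

def cross_validate(items, k, train_fn, score_fn):
    """Average the score obtained on each of the k held-out folds."""
    scores = [score_fn(train_fn(train), test)
              for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

def train_majority(pairs):
    """Stand-in model: remember the most common training label."""
    return Counter(label for _, label in pairs).most_common(1)[0][0]

def score_majority(majority_label, pairs):
    """Accuracy of always predicting `majority_label`."""
    return sum(label == majority_label for _, label in pairs) / len(pairs)

# Invented labeled data: every third item is 'male', the rest 'female'.
data = [("n{}".format(i), "female" if i % 3 else "male") for i in range(12)]
result = cross_validate(data, 4, train_majority, score_majority)
print(result)
```

Because every item is used for testing exactly once, the combined score draws on all of the annotated data even though each individual test fold is small.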

# Decision Trees

In the next three sections, we’ll take a closer look at three machine learning methods that can be used to automatically build classification models: decision trees, naive Bayes classifiers, and Maximum Entropy classifiers.

## Entropy and Information Gain

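As was mentioned before, there are several methods for identifying the most informative feature for a decision stump. One popular alternative, called information gain, measures how much more organized the input values become when we divide them up using a given feature.
To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which is high if the labels are highly varied and low if many input values share the same label.
In particular, entropy is defined as the negated sum, over all labels, of the probability of each label times the log probability of that same label:

$$H = -\sum_{l \in \text{labels}} P(l) \times \log_2 P(l) \tag{1}$$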

In [None]:
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)

In [None]:
print(entropy(['male', 'male', 'male', 'male']))

In [None]:
print(entropy(['male', 'female', 'male', 'male']))

In [None]:
print(entropy(['female', 'male', 'female', 'male']))

In [None]:
print(entropy(['female', 'female', 'male', 'female']))

In [None]:
print(entropy(['female', 'female', 'female', 'female']))
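
Information gain builds directly on entropy: it is the entropy of the original labels minus the average entropy of the groups produced by splitting on a feature, weighted by group size. The following is a self-contained sketch using only the standard library; the feature sets are invented, and `label_entropy` and `information_gain` are hypothetical helpers rather than NLTK functions.

```python
import math
from collections import Counter, defaultdict

def label_entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(pairs, feature):
    """Entropy reduction from splitting (featureset, label) pairs on `feature`."""
    labels = [label for _, label in pairs]
    groups = defaultdict(list)
    for features, label in pairs:
        groups[features[feature]].append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(labels) * label_entropy(g)
                    for g in groups.values())
    return label_entropy(labels) - remainder

# Invented feature sets: last_letter separates the labels perfectly, length not at all.
pairs = [({"last_letter": "a", "length": 4}, "female"),
         ({"last_letter": "a", "length": 5}, "female"),
         ({"last_letter": "k", "length": 4}, "male"),
         ({"last_letter": "k", "length": 5}, "male")]

print(information_gain(pairs, "last_letter"))  # 1.0
print(information_gain(pairs, "length"))       # 0.0
```

A decision stump would therefore be built on last_letter here, since that split leaves every group with a single label.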