# Chapter 6 Learning to Classify Text

Read through the chapter following along with the examples.  Then complete the programming problems below.  Add additional cells as necessary.

## Problem 1

Using any of the three classifiers described in this chapter, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect? (5 pts.)



In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import nltk
from nltk.corpus import names
import random
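# Use a separate variable so the imported `names` corpus reader is not overwritten
# (otherwise re-running this cell fails on names.words).
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
test, devtest, training = labeled_names[:500], labeled_names[500:1000], labeled_names[1000:]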

def gender_features1(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["suffix1"] = name[-1:].lower()
    features["prefix"] = name[:3].lower()
    # Lower-case first so that an initial capital vowel (e.g. the "A" in "Anne") is counted
    features["vowels"] = len([letter for letter in name.lower() if letter in 'aeiou'])
    #for letter in 'abcdefghijklmnopqrstuvwxyz':
        #features["count({})".format(letter)] = name.lower().count(letter)
        #features["has({})".format(letter)] = (letter in name.lower())
    return features

train_set = [(gender_features1(n), g) for (n,g) in training]
devtest_set = [(gender_features1(n), g) for (n,g) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.83


In [3]:
def error_analysis(gender_features):
    errors = []
    for (name, tag) in devtest:
        roll = classifier.classify(gender_features(name))
        if roll != tag:
            errors.append((tag, roll, name))
    print('no. of errors: ', len(errors))       
        
    for (tag, roll, name) in sorted(errors):
        print('correct={:8} roll={:<8s} name={:30}'.format(tag, roll, name))        
        
error_analysis(gender_features1)

no. of errors:  85
correct=female   roll=male     name=Ajay                          
correct=female   roll=male     name=Anne-Mar                      
correct=female   roll=male     name=Barb                          
correct=female   roll=male     name=Bren                          
correct=female   roll=male     name=Brigid                        
correct=female   roll=male     name=Buffy                         
correct=female   roll=male     name=Cal                           
correct=female   roll=male     name=Carolan                       
correct=female   roll=male     name=Cody                          
correct=female   roll=male     name=Dagmar                        
correct=female   roll=male     name=Dion                          
correct=female   roll=male     name=Doloritas                     
correct=female   roll=male     name=Drew                          
correct=female   roll=male     name=Fey                           
correct=female   roll=male     name=Gael   

In [4]:
# Performance on test set
test_set = [(gender_features1(n), g) for (n,g) in test]
print(nltk.classify.accuracy(classifier, test_set))

0.828


#### Analysis: 
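Performance on the test set (0.828) closely mirrors performance on the dev-test set (0.83), with the classifier scoring slightly lower on the test set. This is roughly what one would expect: the features were tuned against the dev-test set, so the dev-test score is likely to be mildly optimistic. Accuracy and the specific errors also seemed to change when the features were reordered, and accuracy seemed to rise over repeated runs. Since the classifier is retrained from scratch on a freshly shuffled split each time the cell runs (no random seed is set), these differences most likely reflect different training/dev-test/test splits rather than feature order or learning carried over between runs.
Note: Enabling the commented-out per-letter features causes overfitting.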
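
To tell real improvements from split-to-split noise, the shuffle can be made repeatable. The following is a minimal sketch, not the code used for the results above: `split_names`, its parameters, and the toy data are hypothetical, and it assumes the same 500/500/rest split as the problem statement.

```python
import random

def split_names(labeled_names, seed=0, test_size=500, devtest_size=500):
    """Shuffle a copy of the data with a fixed seed and split it three ways."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, global random state is unaffected
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    training = shuffled[test_size + devtest_size:]
    return test, devtest, training

# Hypothetical stand-in data; in the notebook this would be the labeled names list.
toy = [("name{}".format(i), "female" if i % 2 else "male") for i in range(2000)]
test, devtest, training = split_names(toy, seed=42)
print(len(test), len(devtest), len(training))
```

With a fixed seed, two runs see identical splits, so a change in dev-test accuracy can be attributed to the feature change rather than to the shuffle.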

## Problem 2

Using the movie review document classifier discussed in this chapter, generate a list of the 30 features that the classifier finds to be most informative. Can you explain why these particular features are informative? Do you find any of them surprising? (5 pts.)

In [8]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

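all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Note: this slice takes the first 2000 distinct words in iteration order, which is
# not guaranteed to be the 2000 most frequent; all_words.most_common(2000) selects by frequency.
word_features = list(all_words)[:2000]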
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(30)

0.84
Most Informative Features
 contains(unimaginative) = True              neg : pos    =      7.5 : 1.0
    contains(schumacher) = True              neg : pos    =      7.2 : 1.0
        contains(suvari) = True              neg : pos    =      6.9 : 1.0
          contains(mena) = True              neg : pos    =      6.9 : 1.0
        contains(shoddy) = True              neg : pos    =      6.9 : 1.0
        contains(sexist) = True              neg : pos    =      6.9 : 1.0
     contains(atrocious) = True              neg : pos    =      6.5 : 1.0
        contains(turkey) = True              neg : pos    =      6.4 : 1.0
       contains(unravel) = True              pos : neg    =      5.8 : 1.0
       contains(singers) = True              pos : neg    =      5.8 : 1.0
        contains(poorly) = True              neg : pos    =      5.7 : 1.0
           contains(ugh) = True              neg : pos    =      5.7 : 1.0
        contains(justin) = True              neg : pos    =      5.7 

#### Analysis:
Most of the features make intuitive sense: "unimaginative" reads as a negative word, and the same goes for "ugh", "sexist", "shoddy", "groan", "waste", and "uninspired". What is surprising is how many names are associated with negative reviews: "schumacher", "justin", and "bronson". "Toll" is associated with positive reviews, but its sense within a review is ambiguous.
Reviewers tend to use more pronounced vocabulary when positive emotions are evoked, such as "explores" and "kudos". "Surveillance" is associated with negative reviews, which would also make sense given its real-world connotations.
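
One possible reason names rank so highly: the values reported are likelihood ratios, and a word that appears in a handful of reviews of one category and in none of the other gets a large ratio even though it is rare. The sketch below illustrates this with additive smoothing. The function name and all counts are hypothetical, and the add-0.5 smoothing over two feature values is an assumption about how the probabilities are estimated, not something confirmed from the output above.

```python
def smoothed_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio P(feature=True | a) / P(feature=True | b) with additive smoothing."""
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Hypothetical counts: a rare word seen in 7 of 950 negative reviews and 0 of 950 positive ones
rare = smoothed_ratio(7, 950, 0, 950)
# Hypothetical counts: a common word seen in 400 negative and 300 positive reviews
common = smoothed_ratio(400, 950, 300, 950)
print("rare word   neg : pos = {:.1f} : 1.0".format(rare))
print("common word neg : pos = {:.1f} : 1.0".format(common))
```

Under these assumptions the rare word comes out at 15 : 1 while the common word is only about 1.3 : 1. An actor's or director's name that happens to occur in a few negative reviews would behave like the rare word here.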

## Problem 3

Word features can be very useful for performing document classification, since the words that appear in a document give a strong indication about what its semantic content is. However, many words occur very infrequently, and some of the most informative words in a document may never have occurred in our training data. One solution is to make use of a lexicon, which describes how different words relate to one another. Using the WordNet lexicon, augment the movie review document classifier presented in this chapter to use features that generalize the words that appear in a document, making it more likely that they will match words found in the training data. (7 pts.)

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.corpus import wordnet as wn
from nltk.classify import apply_features
documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [2]:
def synsets(words):
    syns = set()
    for w in words:
        syns.update(str(s) for s in wn.synsets(w))
    return syns

In [3]:
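all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Note: this slice takes the first 2000 distinct words in iteration order, which is
# not guaranteed to be the 2000 most frequent; all_words.most_common(2000) selects by frequency.
word_features = list(all_words)[:2000]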
synset_features = synsets(word_features)

def document_features2(document):
    # synsets() already collects every synset for the document's words,
    # so no second pass over the words is needed.
    document_synsets = synsets(set(document))
    features = {}
    for synset in synset_features:
        features[synset] = (synset in document_synsets)
    return features
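
The idea behind the synset features can be illustrated without WordNet. The sketch below uses a small hand-written dictionary (`TOY_LEXICON`, hypothetical, standing in for `wn.synsets`) to map words to concept labels. The helper names and documents are also made up for illustration. It shows how a document containing a word never seen in training can still share a feature with the training data.

```python
# Hypothetical mini-lexicon standing in for WordNet: word -> set of concept labels
TOY_LEXICON = {
    "shoddy": {"poor_quality"},
    "jerry-built": {"poor_quality"},
    "superb": {"high_quality"},
    "outstanding": {"high_quality"},
}

def concepts(words):
    """Collect the concept labels of every word that the lexicon knows."""
    found = set()
    for w in words:
        found.update(TOY_LEXICON.get(w.lower(), set()))
    return found

def concept_features(document, concept_vocab):
    """One boolean feature per concept, analogous to document_features2."""
    doc_concepts = concepts(document)
    return {c: (c in doc_concepts) for c in concept_vocab}

concept_vocab = sorted(concepts(TOY_LEXICON))   # every concept in the lexicon
train_doc = ["a", "shoddy", "film"]
test_doc = ["a", "jerry-built", "plot"]         # shares no content word with train_doc
print(concept_features(train_doc, concept_vocab))
print(concept_features(test_doc, concept_vocab))
```

Both documents produce the same feature dictionary, with `poor_quality` set to `True`, even though they share no content word. That is the generalization the WordNet version aims for, at the cost of also merging unrelated senses of ambiguous words.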

In [5]:
train_set = apply_features(document_features2, documents[100:])
test_set = apply_features(document_features2, documents[:100])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(30)

0.72
Most Informative Features
 Synset('plodding.n.02') = True              neg : pos    =     13.8 : 1.0
Synset('unimaginative.s.02') = True              neg : pos    =      7.7 : 1.0
Synset('jerry-built.s.01') = True              neg : pos    =      7.1 : 1.0
    Synset('vomit.n.01') = True              neg : pos    =      7.1 : 1.0
   Synset('shoddy.n.01') = True              neg : pos    =      6.4 : 1.0
Synset('disorderly.s.02') = True              neg : pos    =      6.4 : 1.0
Synset('squandered.s.01') = True              neg : pos    =      5.8 : 1.0
Synset('surveillance.n.01') = True              neg : pos    =      5.7 : 1.0
Synset('underbrush.n.01') = True              neg : pos    =      5.7 : 1.0
   Synset('turkey.n.02') = True              neg : pos    =      5.4 : 1.0
   Synset('turkey.n.01') = True              neg : pos    =      5.4 : 1.0
   Synset('turkey.n.04') = True              neg : pos    =      5.4 : 1.0
 Synset('farcical.s.01') = True              neg : pos   