## NLP Lab Session Week 7
## Classification and Feature Sets in NLTK
## Part 1:  Classifying names for gender

### Getting Started


For this lab session download the examples:  LabWk7.classifybasics.txt put it in your class folder for copy/pasting examples.  Start your Jupyter notebook session.

In [1]:
import nltk

These examples and others appear in Chapter 6 of the NLTK book.

The goal of this week’s lab is to show how to set up a variety of text classification problems in the NLTK.

Learning to Classify Text
We have seen that many NLP tasks, particularly the semantic tasks such as WSD (Word Sense Disambiguation) and SRL (Semantic Role Labeling) are typically solved used machine learning techniques to classify text.  The WSD classification is to take each word in text and label it with one of the word senses from WordNet.  The SRL problem is to take each verb in a sentence and label each constituent phrase in the sentence with a semantic role label or not, thus finding phrases in the sentence that can be labeled with one of the semantic roles of that verb.

For these types of classification problems, we need to represent the input text by a set of features for the classifier algorithm.  Now you may have seen in some machine learning tools such as Weka or SKLearn that there are text processing functions – Weka’s is called StringToWordVector – that can obtain a set of features that represent a piece of text by the frequencies of the words that it contains.  These functions may also have filters such as stop words or minimum frequencies.  But with more advanced text processing, we can define many more types of features from text that are required for some text classification tasks.  In the next weeks, we will look at several ways to define text features for classification, ending with sentiment classification.

In this first lab using NLP for classification, we look at how to prepare data for classification in the NLTK.  These examples and others appear in Chapter 6 of the NLTK book. For each example, or instance, of the classification problem, we prepare a set of features that will represent that example to the machine learning algorithm.  (Look at the diagram in Section 1, Figure 1.1)  In the NLTK terminology, our focus in this class will be to write feature extractors that take advantage of our knowledge of NLP to design good features to represent text problems.

Name Gender Classifier
We start with a simple problem that will illustrate the process of preparing text data for classification and training and using classifiers.  This problem is based on the idea that male and female first names (in English) have distinctive characteristics.  For example, names ending in a, e, and i are likely to be female, while names ending in k, o, r, s and t are likely to be male.  We will build a classifier that will label any name with its gender.

For each item to be classified, in this case a single word, in NLTK we build the features of that item as a dictionary that maps each feature name to a value, which can be a Boolean, a number or a string.  A feature set is the feature dictionary together with the label of the item to be classified, in this case the gender. 

 
We first define a function that will extract or build the features for a single instance of the problem, in this case a single name.  To start with, we will generate a single feature which consists of the last letter of the name.  Note that the function returns a dictionary with a single item.


In [51]:
def gender_features(word):
    return{'last_letter': word[-1]}

def gender_features2(word):
    return{'last_letter': word[-2:]}

def gender_features3(word):
    return{'first_letter': word[:2]}

def gender_features4(word):
    return{'first_last_letter': word[:2]+word[-2:]}

In [52]:
# We can apply this function to any name:
print(gender_features('Shrek'))
print(gender_features2('Shrek'))
print(gender_features3('Shrek'))
print(gender_features4('Shrek'))

{'last_letter': 'k'}
{'last_letter': 'ek'}
{'first_letter': 'Sh'}
{'first_last_letter': 'Shek'}


Now that we’ve defined our features, we need to construct the training data, or “gold standard” data.  This will be a list of first names, each of which will be labeled either male or female.  So we want a list of names with known gender where we can construct the feature set for each name.  For example, if the gender of Shrek is known to be “male”, we could have:

 


The NLTK corpus contains a names corpus which has a list of male first names and another list of female first names, so we can use this data to create a list of all the first names, but where each is labeled with its gender.


In [5]:
from nltk.corpus import names

The names corpus has a function words that will return either the names identified by the string ‘male.txt’ or ‘female.txt’.  Here we look at the first 20 male names.

In [6]:
names.words('male.txt')[:20]

['Aamir',
 'Aaron',
 'Abbey',
 'Abbie',
 'Abbot',
 'Abbott',
 'Abby',
 'Abdel',
 'Abdul',
 'Abdulkarim',
 'Abdullah',
 'Abe',
 'Abel',
 'Abelard',
 'Abner',
 'Abraham',
 'Abram',
 'Ace',
 'Adair',
 'Adam']

From the male and female names lists, we will create one long list with (name, gender) pairs to create the labeled data.

In [7]:
namesgender = ([(name, 'male') for name in names.words('male.txt')] +
          [(name, 'female') for name in names.words('female.txt')])


In [8]:
# Take a look at this list with the first 20 names and the last 20 names.
print(len(namesgender))

print(namesgender[:20])
print(namesgender[7924:])


7944
[('Aamir', 'male'), ('Aaron', 'male'), ('Abbey', 'male'), ('Abbie', 'male'), ('Abbot', 'male'), ('Abbott', 'male'), ('Abby', 'male'), ('Abdel', 'male'), ('Abdul', 'male'), ('Abdulkarim', 'male'), ('Abdullah', 'male'), ('Abe', 'male'), ('Abel', 'male'), ('Abelard', 'male'), ('Abner', 'male'), ('Abraham', 'male'), ('Abram', 'male'), ('Ace', 'male'), ('Adair', 'male'), ('Adam', 'male')]
[('Zena', 'female'), ('Zenia', 'female'), ('Zia', 'female'), ('Zilvia', 'female'), ('Zita', 'female'), ('Zitella', 'female'), ('Zoe', 'female'), ('Zola', 'female'), ('Zonda', 'female'), ('Zondra', 'female'), ('Zonnya', 'female'), ('Zora', 'female'), ('Zorah', 'female'), ('Zorana', 'female'), ('Zorina', 'female'), ('Zorine', 'female'), ('Zsa Zsa', 'female'), ('Zsazsa', 'female'), ('Zulema', 'female'), ('Zuzana', 'female')]


In [9]:
# Now we create a random shuffle of the namesgender list so that we can easily split it into a training and test set.
import random
random.shuffle(namesgender)
print(namesgender[:20])


[('Marjie', 'female'), ('Eustacia', 'female'), ('Bubba', 'male'), ('Kraig', 'male'), ('Sherlock', 'male'), ('Gwenette', 'female'), ('Larina', 'female'), ('Dawn', 'female'), ('Lura', 'female'), ('Mathew', 'male'), ('Cherry', 'female'), ('Corilla', 'female'), ('Eliott', 'male'), ('Er', 'male'), ('Sanders', 'male'), ('Miriam', 'female'), ('Marissa', 'female'), ('Carley', 'female'), ('Rory', 'female'), ('Starla', 'female')]


In [10]:
# We split this list into training and test sets and run the Naïve Bayes classifier algorithm to create a trained classifier.  
# (The training set is the last 7444 examples and the test set is the first 500 examples.)  What do you think will happen if 
# some of the examples with last letter = “n” are female and some of them are male?

train_names = namesgender[500:]
test_names = namesgender[:500]


In [12]:
# Next we use the feature extractor function to create the list of instances of the problem that consists only of the features 
# and the gender label.  (This is the equivalent of the spread sheet for structured data in a classification problem.)  But 
# this example is unusual in that most of the actual text is ignored in the features!  

train_set = [(gender_features(n), g) for (n,g) in train_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
print(train_set[:20])
classifier = nltk.NaiveBayesClassifier.train(train_set)

[({'last_letter': 'y'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'i'}, 'female'), ({'last_letter': 'y'}, 'female'), ({'last_letter': 'r'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'n'}, 'male'), ({'last_letter': 'b'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'l'}, 'male'), ({'last_letter': 'n'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female')]


We can the compute the accuracy of the classifier on the test set.  The accuracy function for each classifier:

•	first removes the gender labels from the test set,

•	runs the classifier on each name in the test set to get a predicted gender, and 

•	compares the predicted gender with each actual gender from the test set to get the evaluation score.  

In this case, it just produces an accuracy score, instead of precision and recall.



In [14]:
print(nltk.classify.accuracy(classifier, test_set))

0.756


In [15]:
# One of the things that we can use a classifier for is to label totally new instances of the problem, in this case, names that 
# come from the future:

print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))


male
female


In [16]:
# Finally, the classifier class for Naïve Bayes has a function that shows the feature values that were most important in 
# doing the classification.
classifier.show_most_informative_features(20)

Most Informative Features
             last_letter = 'k'              male : female =     44.8 : 1.0
             last_letter = 'a'            female : male   =     39.6 : 1.0
             last_letter = 'f'              male : female =     15.4 : 1.0
             last_letter = 'p'              male : female =     12.0 : 1.0
             last_letter = 'm'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
             last_letter = 'd'              male : female =      9.1 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.9 : 1.0
             last_letter = 'g'              male : female =      5.0 : 1.0
             last_letter = 'w'              male : female =      4.8 : 1.0
             last_letter = 'z'              male : female =      4.4 : 1.0
             last_letter = 't'              male : female =      4.1 : 1.0

When the output of this function shows a line like:
             last_letter = 'a'            female : male   =     35.4 : 1.0
it means that when an example has the feature last_letter = ‘a’, it is 35.4 times more likely to be classified as female than male.


### Choosing Good Features

Selecting relevant features can usually be the most important part of training a classifier.  Often the approach is to throw in as many features as possible and then try to figure out which ones were important.  For most machine learning algorithms, throwing in too many features can cause a problem known as “overfitting”, which is that the classifier is trained on so many of the exact details of the training set that it is not as good on new examples.

In developing a classifier for a classification problem, we will want to do some error analysis of the test set and then perhaps change our features and retrain the classifier.  This is known as the development process.  In the real world, it would be important to keep a separate test set that was not used in the error analysis for our final evaluation.  So we would actually divide our labeled data into a training set, a development test set, and a test set.  But for this lab, we will continue to just use a training and a test set.


In [17]:
def geterrors(test):
    errors = []
    for (name, tag) in test:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            errors.append( (tag, guess, name) )
    return errors


In [18]:
errors = geterrors(test_names)
len(errors)

122

In [42]:
# Then we define a function to print all the errors, sorted by the correct labels, so that we can look at the differences 
# (with even more string formatting).

def printerrors(errors):
    for (tag, guess, name) in sorted(errors):
        print('correct={:<8s} guess={:<8s} name={:<30s}'.format(tag, guess, name))
printerrors(errors)


correct=female   guess=male     name=Aleen                         
correct=female   guess=male     name=Allis                         
correct=female   guess=male     name=Ambur                         
correct=female   guess=male     name=Aurel                         
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Betteann                      
correct=female   guess=male     name=Bidget                        
correct=female   guess=male     name=Blair                         
correct=female   guess=male     name=Bridgett                      
correct=female   guess=male     name=Caren                         
correct=female   guess=male     name=Ceil                          
correct=female   guess=male     name=Charmain                      
correct=female   guess=male     name=Christal                      
correct=female   guess=male     name=Clem                          
correct=female   guess=male     name=Coleen     

Looking through the list of errors, we observe cases where using the last two letters of each name might be more informative.  As observed in the book, “For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female.”

In [54]:
train_set = [(gender_features2(n), g) for (n,g) in train_names]
test_set = [(gender_features2(n), g) for (n,g) in test_names]
print(train_set[:20])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))
classifier.show_most_informative_features(20)

[({'last_letter': 'ey'}, 'male'), ({'last_letter': 'ra'}, 'female'), ({'last_letter': 'di'}, 'female'), ({'last_letter': 'ey'}, 'female'), ({'last_letter': 'ar'}, 'male'), ({'last_letter': 'na'}, 'female'), ({'last_letter': 'ie'}, 'female'), ({'last_letter': 'ne'}, 'female'), ({'last_letter': 'ca'}, 'female'), ({'last_letter': 'ra'}, 'female'), ({'last_letter': 'en'}, 'male'), ({'last_letter': 'eb'}, 'male'), ({'last_letter': 'ie'}, 'female'), ({'last_letter': 'na'}, 'female'), ({'last_letter': 'na'}, 'female'), ({'last_letter': 'ie'}, 'male'), ({'last_letter': 'el'}, 'male'), ({'last_letter': 'an'}, 'male'), ({'last_letter': 'ea'}, 'female'), ({'last_letter': 'ye'}, 'female')]
0.78
female
female
Most Informative Features
             last_letter = 'na'           female : male   =    166.3 : 1.0
             last_letter = 'la'           female : male   =     74.5 : 1.0
             last_letter = 'ia'           female : male   =     40.1 : 1.0
             last_letter = 'sa'           f

## NLP Lab Session Week 7
## Classification and Feature Sets in NLTK
## Part 2:  POS Tagging


### POS Tagging Classifier


We next use the example of POS tagging in order to show how to build a feature set in the NLTK and to run a classifier.  We will set up the POS tagging problem as a classification problem that tries to label each word with the correct POS tag.  From our previous discussions of POS tagging in the lectures, we know that the best POS taggers use a combination of an HMM sequential tagger that can use the previous tag and a feature-based classifier similar to the one that we’ll set up here.

As we saw last week, for each item to be classified, in this case a single word, in NLTK we build the features of that item as a dictionary that maps each feature name to a value, which can be a Boolean, a number or a string.  A feature set is the feature dictionary together with the label of the item to be classified, in this case the POS tag.

 

One source of information for POS tagging is the morphology of the word, and we can start by looking at suffixes of words and building features.

We also know that we can improve POS tagging if we take account of the context of the word.  So we define a POS feature function that takes an entire sentence and can use the previous word in the sentence.  We will use feature names of ‘suffix(1)’, ‘suffix(2)’, and ‘suffix(3)’ and the values of the these features will be the string that contains the suffix letters of lengths 1, 2, and 3.  


In [59]:
# the pos features function takes the sentence and the index of a word i
#   it creates features for word i, including the previous word i-1
def pos_features(sentence, i):
	features = {"suffix(1)": sentence[i][-1:],
		    "suffix(2)": sentence[i][-2:],
		    "suffix(3)": sentence[i][-3:]}
	if i == 0:
		features["prev-word"] = "<START>"
	else:
		features["prev-word"] = sentence[i-1]
	return features


In [61]:
# Recall that the corpus function “sents” returns a list of sentences, where each sentence is a list of tokens.  
# Look at the features of the word at index 8 of the first sentence in the Brown corpus:
from nltk.corpus import brown

sentence0 = brown.sents()[0]
sentence0
print(sentence0)
print(sentence0[8])
print(pos_features(sentence0, 8))


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']
investigation
{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}


For this word, our POS_features function gives the feature dictionary.  Now we need to apply our feature function to all words in the training corpus and to pair it with the correct tag.

 

Now we take all the sentences in the news portion of Brown and apply our function to get the POS features, as a dictionary, of each (untagged) word.  In order to apply the pos_features function, we use the nltk.tag.untag function to get an untagged sentence, e.g. here is the untag function applied to the first sentence.


In [63]:
tagged_sents = brown.tagged_sents(categories='news')
tag_sent0 = tagged_sents[0]
nltk.tag.untag(tag_sent0)


['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that',
 'any',
 'irregularities',
 'took',
 'place',
 '.']

In order to apply the POS_features function to the untagged sentence, we need an index number for each word, and the python enumerate function will return a list that pairs the index number of each word with the word and tag.  

In [64]:
for i,(word,tag) in enumerate(tag_sent0):
    print (i, word, tag)


0 The AT
1 Fulton NP-TL
2 County NN-TL
3 Grand JJ-TL
4 Jury NN-TL
5 said VBD
6 Friday NR
7 an AT
8 investigation NN
9 of IN
10 Atlanta's NP$
11 recent JJ
12 primary NN
13 election NN
14 produced VBD
15 `` ``
16 no AT
17 evidence NN
18 '' ''
19 that CS
20 any DTI
21 irregularities NNS
22 took VBD
23 place NN
24 . .


In [65]:
# After applying the pos_features function to get features for the word, we pair the features with the correct (gold) 
# tag to get a feature set for each word.

featuresets = []
for tagged_sent in tagged_sents:
	untagged_sent = nltk.tag.untag(tagged_sent)
	for i, (word, tag) in enumerate(tagged_sent):
		featuresets.append( (pos_features(untagged_sent, i), tag) )

# Look at the feature sets of the first 10 words.
for f in featuresets[:10]:
	print (f)


({'suffix(1)': 'e', 'suffix(2)': 'he', 'suffix(3)': 'The', 'prev-word': '<START>'}, 'AT')
({'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ton', 'prev-word': 'The'}, 'NP-TL')
({'suffix(1)': 'y', 'suffix(2)': 'ty', 'suffix(3)': 'nty', 'prev-word': 'Fulton'}, 'NN-TL')
({'suffix(1)': 'd', 'suffix(2)': 'nd', 'suffix(3)': 'and', 'prev-word': 'County'}, 'JJ-TL')
({'suffix(1)': 'y', 'suffix(2)': 'ry', 'suffix(3)': 'ury', 'prev-word': 'Grand'}, 'NN-TL')
({'suffix(1)': 'd', 'suffix(2)': 'id', 'suffix(3)': 'aid', 'prev-word': 'Jury'}, 'VBD')
({'suffix(1)': 'y', 'suffix(2)': 'ay', 'suffix(3)': 'day', 'prev-word': 'said'}, 'NR')
({'suffix(1)': 'n', 'suffix(2)': 'an', 'suffix(3)': 'an', 'prev-word': 'Friday'}, 'AT')
({'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}, 'NN')
({'suffix(1)': 'f', 'suffix(2)': 'of', 'suffix(3)': 'of', 'prev-word': 'investigation'}, 'IN')


Finally we separate our corpus into training and test sets and use these feature sets to train a Naïve Bayes classifier and look at the accuracy.  The training and test sets are created using a 90%/10% split of the data, which is typical for problems with large enough data.  Remember that the nltk.classify.accuracy function uses the classifier to classify the unlabeled words from the test set and then compares those tags with the gold tags.  

In [66]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
print(len(train_set))
print(len(test_set)) 

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


90499
10055
0.7891596220785678


Note that we have not incorporated other features from the surrounding words, so this classifier accuracy is not bad for this amount of feature information.  And, of course, the feature-based classifier should be combined with a sequential classifier, such as HMM, to achieve the highest performance. 

## NLP Lab Session Week 7
## Classification and Feature Sets in NLTK
## Part 3:  Text Categorization

### Text Classification (aka Text Categorization)


For a different type of classification problem, we next look at text classification.  In this problem, the items to be classified are documents.  Most widely known are datasets that label each document with a topic category (hence the name categorization), but we will look at documents from the NLTK Movie Review corpus, where each document is labeled either ‘pos’ for positive or ‘neg’ for negative, according to the opinion of the review.  There are 1000 positive reviews and 1000 negative reviews in the part of the corpus in NLTK.

The features of each document will be the words contained in the document, but limited to a set of words that are frequent in the whole document collection.


In [67]:
from nltk.corpus import movie_reviews
import random


In [68]:
print(movie_reviews.categories())

['neg', 'pos']


The movie review documents are not labeled individually, but are separated into file directories by category.  We first create the list of documents where each document is paired with its label.

In [69]:
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
print(len(documents))


2000


Since the documents are in order by label, we mix them up for later separation into training and test sets.



In [71]:
random.shuffle(documents)
# We look at the first document, which will consist of all the words in the review, followed by the label.  Since we did 
# independent shuffles, each person should have a different document.

print(documents[0])

(['ironically', ',', 'one', 'of', 'the', 'themes', 'of', '200', 'cigarettes', 'is', 'that', 'you', 'should', 'try', 'to', 'make', 'the', 'best', 'of', 'every', 'moment', 'in', 'life', '.', 'unfortunately', 'i', 'will', 'no', 'longer', 'be', 'able', 'to', 'say', 'that', 'i', 'lived', 'every', 'moment', 'of', 'my', 'life', 'to', 'the', 'fullest', ',', 'because', 'i', 'spent', '95', 'otherwise', 'perfectly', 'good', 'minutes', ',', 'watching', 'this', 'sad', 'excuse', 'for', 'a', 'movie', '.', 'actually', 'smoking', '200', 'cigarettes', 'in', 'the', 'same', '95', 'minutes', 'couldn', "'", 't', 'be', 'any', 'worse', 'for', 'you', 'than', 'it', 'is', 'to', 'watch', 'the', 'movie', '.', 'not', 'only', 'is', 'it', 'a', 'pathetically', 'bad', 'film', 'overall', ',', 'but', '200', 'cigarettes', 'actually', 'has', 'no', 'redeeming', 'qualities', 'whatsoever', '.', 'the', 'plot', 'is', 'scattered', ',', 'thin', 'and', 'predictable', ';', 'the', 'acting', 'is', 'monumentally', 'bad', ';', 'and', '

We need to define the set of words that will be used for features.  This is essentially all the words in the entire document collection, except that we will limit it to the 2000 most frequent words

In [88]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_items = all_words.most_common(2500)


In [89]:
# And then, the word features is the list of just the words, without frequencies.
word_features = [word for (word, freq) in word_items]

In [90]:
# Look at the first 100 words in the word_features list.

print(word_features[:100])

# Now we can define the features for each document.  The feature label will be ‘V_keyword’ for each keyword (aka word) 
# in the word_features set, and the value of the feature will be Boolean, according to whether the word is contained in that 
# document.

# (For topic categorization, it is better to represent each word feature by its frequency (or a related score) in the document, 
# but for sentiment classification, it is better to just use True of False depending on whether the word is present.)

def document_features(document):
	document_words = set(document)
	features = {}
	for word in word_features:
		features['V_{}'.format( word)] = (word in document_words)
	return features



[',', 'the', '.', 'a', 'and', 'of', 'to', "'", 'is', 'in', 's', '"', 'it', 'that', '-', ')', '(', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', '?', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', ':', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', ';', 'first', '--', 'after', 'see', '!', 'way', 'because', 'make', 'life']


The feature dictionary has 2000 features, each with the value True or False.

 

Define the feature sets for the documents.  We can look at the first one, but remember that it contains 2000 words.


In [91]:
featuresets = [(document_features(d), c) for (d,c) in documents]
# (optional – very long)
print(featuresets[0])




In [92]:
# We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy.

train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))

# The function show_most_informative_features shows the top ranked features according to the ratio of one label to the other 
# one.  For example, if there are 20 times as many positive documents containing this word as negative ones, then the ratio will be reported as     20.00: 1.00   pos:neg.

0.82


In [82]:
print(classifier.show_most_informative_features(30))

Most Informative Features
           V_outstanding = True              pos : neg    =     10.9 : 1.0
                 V_mulan = True              pos : neg    =      8.4 : 1.0
           V_wonderfully = True              pos : neg    =      7.9 : 1.0
                V_seagal = True              neg : pos    =      6.6 : 1.0
                V_wasted = True              neg : pos    =      6.1 : 1.0
                 V_damon = True              pos : neg    =      6.1 : 1.0
                 V_flynt = True              pos : neg    =      5.7 : 1.0
                 V_awful = True              neg : pos    =      5.7 : 1.0
                   V_era = True              pos : neg    =      5.4 : 1.0
                V_poorly = True              neg : pos    =      5.3 : 1.0
            V_ridiculous = True              neg : pos    =      5.2 : 1.0
                  V_lame = True              neg : pos    =      4.9 : 1.0
                 V_waste = True              neg : pos    =      4.9 : 1.0