## Gender Identification

Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male.

In [1]:
import nltk

In [29]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [17]:
gender_features('Shrek')

{'last_letter': 'k'}

In [4]:
from nltk.corpus import names

In [5]:
import random

In [6]:
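# Not shuffled: male names come first, so names[:500] below is all male (random.shuffle(names) would mix them).
names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])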

In [30]:
featuresets = [(gender_features(n), g) for (n,g) in names]

In [31]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [32]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [33]:
classifier.classify(gender_features('Neo'))

'male'

In [34]:
classifier.classify(gender_features('Trinity'))

'female'

In [35]:
nltk.classify.accuracy(classifier, test_set)

0.602

In [36]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0


This listing shows that the names in the training set that end in a are female about 36 times more often than they are male, while names that end in k are male about 34 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
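A ratio of this kind can be reproduced by hand from counts. The following is a minimal sketch on a toy name list; `likelihood_ratio` and `toy_names` are hypothetical names introduced for illustration. It uses raw relative frequencies, whereas NLTK smooths its probability estimates, so the values will not match the listing exactly.

```python
from collections import Counter

def likelihood_ratio(labeled_names, letter, label_a, label_b):
    """Return P(last letter == letter | label_a) / P(last letter == letter | label_b).

    Uses unsmoothed relative frequencies (an assumption of this sketch), so it
    raises ZeroDivisionError if no label_b name ends in the given letter.
    """
    label_totals = Counter(label for _, label in labeled_names)
    letter_counts = Counter(label for name, label in labeled_names
                            if name[-1].lower() == letter)
    p_a = letter_counts[label_a] / label_totals[label_a]
    p_b = letter_counts[label_b] / label_totals[label_b]
    return p_a / p_b

# Toy data, not taken from the names corpus.
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Ella', 'female'), ('Ruth', 'female'),
             ('Luca', 'male'), ('Mark', 'male'), ('John', 'male'), ('Erik', 'male')]

ratio = likelihood_ratio(toy_names, 'a', 'female', 'male')
print('female : male = %.1f : 1.0' % ratio)
```

Three of the four toy female names end in a, against one of the four male names, so the ratio is 0.75 / 0.25.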

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:

In [37]:
from nltk.classify import apply_features

In [38]:
train_set = apply_features(gender_features, names[500:])

In [39]:
test_set = apply_features(gender_features, names[:500])
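The idea behind such a list-like object can be sketched in a few lines. This is an illustrative stand-in, not NLTK's implementation; `LazyFeatureList` and `last_letter_features` are hypothetical names.

```python
class LazyFeatureList:
    """List-like view that computes each feature set only when it is accessed."""

    def __init__(self, feature_func, labeled_tokens):
        self._func = feature_func
        self._tokens = labeled_tokens

    def __len__(self):
        return len(self._tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view rather than computing features.
            return LazyFeatureList(self._func, self._tokens[index])
        token, label = self._tokens[index]  # raises IndexError past the end
        return (self._func(token), label)

def last_letter_features(word):
    return {'last_letter': word[-1]}

toy = [('Neo', 'male'), ('Trinity', 'female'), ('Shrek', 'male')]
lazy = LazyFeatureList(last_letter_features, toy)

print(len(lazy))
print(lazy[1])
print(list(lazy[1:]))
```

Only the labeled tokens are stored; each feature dictionary is built on access and discarded afterwards, which trades repeated computation for lower memory use.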

In [42]:
def gender_features2(name): 
    features = {}
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower()) 
    return features

In [44]:
gender_features2('John')

{'firstletter': 'j',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [45]:
featuresets = [(gender_features2(n), g) for (n,g) in names]

In [46]:
train_set, test_set = featuresets[500:], featuresets[:500]

In [47]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [49]:
nltk.classify.accuracy(classifier, test_set)

0.052

Once an initial set of features has been chosen, a very productive method for refining the feature set is error analysis. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set, which is used to train the model, and the dev-test set, which is used for error analysis. The test set is held back for the final evaluation.

In [51]:
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]
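Slicing like this assumes the order of the list carries no information about the label. A minimal sketch of the same three-way split with an explicit shuffle first; `three_way_split` and the toy data are assumptions for illustration.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle a copy of items, then slice it into train, dev-test and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy labeled data ordered by label, like a list built by concatenation.
toy = [('name%d' % i, 'male' if i < 50 else 'female') for i in range(100)]
train, devtest, test = three_way_split(toy, n_test=10, n_devtest=20)

print(len(train), len(devtest), len(test))
```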

In [52]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [53]:
nltk.classify.accuracy(classifier, devtest_set)

0.347

In [54]:
errors = []

In [55]:
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name)) 
    if guess != tag:
        errors.append( (tag, guess, name) )

In [57]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=male     guess=female   name=Clinton                       
correct=male     guess=female   name=Clive                         
correct=male     guess=female   name=Clyde                         
correct=male     guess=female   name=Cob                           
correct=male     guess=female   name=Cobb                          
correct=male     guess=female   name=Cobbie                        
correct=male     guess=female   name=Cobby                         
correct=male     guess=female   name=Cody                          
correct=male     guess=female   name=Colbert                       
correct=male     guess=female   name=Cole                          
correct=male     guess=female   name=Coleman                       
correct=male     guess=female   name=Colin                         
correct=male     guess=female   name=Collin                        
correct=male     guess=female   name=Conan                         
correct=male     guess=female   name=Connie     
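Scanning a long error list by eye is slow, so it can help to tally the errors by a candidate feature. A sketch, assuming `errors` holds `(tag, guess, name)` tuples as built above; the sample copies a few rows from the listing, and `suffix_error_counts` is a hypothetical helper.

```python
from collections import Counter

def suffix_error_counts(errors, n=2):
    """Count misclassified names by (last n letters, correct tag)."""
    return Counter((name[-n:].lower(), tag) for (tag, guess, name) in errors)

# A few rows copied from the error listing.
sample_errors = [
    ('male', 'female', 'Clinton'),
    ('male', 'female', 'Coleman'),
    ('male', 'female', 'Colin'),
    ('male', 'female', 'Collin'),
    ('male', 'female', 'Conan'),
    ('male', 'female', 'Cody'),
]

counts = suffix_error_counts(sample_errors)
for (suffix, tag), count in counts.most_common(3):
    print('suffix=%-4s correct=%-8s errors=%d' % (suffix, tag, count))
```

Suffixes that account for many errors are candidates for new features, which is the motivation for the two-letter suffix feature defined next.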

In [58]:
def gender_features(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

In [59]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, devtest_set)

0.604

## Document Classification

In [61]:
from nltk.corpus import movie_reviews

In [62]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

In [68]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [70]:
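# keys() follows insertion order, not frequency; the words from all_words.most_common(2000) would be the most frequent ones.
word_features = list(all_words.keys())[:2000]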

In [71]:
def document_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words) 
    return features
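The same two steps, choosing a vocabulary and testing each document for the presence of its words, can be sketched with the standard library alone. `top_words`, `contains_features` and the toy documents are hypothetical; unlike the cell above, this sketch picks the vocabulary by frequency using `Counter.most_common`.

```python
from collections import Counter

def top_words(documents, n):
    """Return the n most frequent lowercased words across all documents."""
    counts = Counter(w.lower() for doc in documents for w in doc)
    return [word for word, _ in counts.most_common(n)]

def contains_features(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    document_words = set(w.lower() for w in document)
    return {'contains(%s)' % word: (word in document_words) for word in vocabulary}

# Toy tokenized documents, not taken from the movie_reviews corpus.
toy_docs = [
    ['a', 'great', 'film', 'a', 'great', 'cast'],
    ['a', 'dull', 'film'],
]

vocabulary = top_words(toy_docs, 3)
features = contains_features(toy_docs[1], vocabulary)
print(features)
```

Converting the document to a set first keeps each membership test cheap, which matters when the vocabulary has thousands of entries.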

In [72]:
document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': True,
 ...}

In [73]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [74]:
train_set, test_set = featuresets[100:], featuresets[:100]

In [75]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [77]:
nltk.classify.accuracy(classifier, test_set)

0.78

In [78]:
classifier.show_most_informative_features(5)

Most Informative Features
    contains(recognizes) = True              pos : neg    =      8.1 : 1.0
    contains(schumacher) = True              neg : pos    =      7.8 : 1.0
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
     contains(atrocious) = True              neg : pos    =      6.4 : 1.0


## Part-of-Speech Tagging

In [79]:
from nltk.corpus import brown 
suffix_fdist = nltk.FreqDist()

In [83]:
for word in brown.words():
    word = word.lower() 
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [84]:
suffix_fdist

FreqDist({'e': 202946, ',': 175002, '.': 152999, 's': 128722, 'd': 105687, 't': 94459, 'he': 92084, 'n': 87889, 'a': 74912, 'of': 72978, ...})

In [85]:
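# keys() follows insertion order, not frequency; suffix_fdist.most_common(100) would give the 100 most frequent suffixes.
common_suffixes = list(suffix_fdist.keys())[:100]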

In [87]:
def pos_features(word): 
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

In [88]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [89]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

In [90]:
classifier = nltk.DecisionTreeClassifier.train(train_set)

In [91]:
nltk.classify.accuracy(classifier, test_set)

0.5689706613625062

In [92]:
classifier.classify(pos_features('cats'))

'NNS'

In [93]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]} 
    if i == 0:
        features["prev-word"] = "<START>" 
    else:
        features["prev-word"] = sentence[i-1] 
    return features

In [94]:
pos_features(brown.sents()[0], 8)

{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}

In [95]:
tagged_sents = brown.tagged_sents(categories='news')

In [96]:
featuresets = []

In [97]:
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent) 
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

In [98]:
size = int(len(featuresets) * 0.1)

In [99]:
train_set, test_set = featuresets[size:], featuresets[:size]

In [100]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [101]:
nltk.classify.accuracy(classifier, test_set)

0.7891596220785678

## Sequence Classification

In [102]:
def pos_features(sentence, i, history): 
    features = {"suffix(1)": sentence[i][-1:], 
                "suffix(2)": sentence[i][-2:], 
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>" 
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1] 
        features["prev-tag"] = history[i-1]
    return features

class ConsecutivePosTagger(nltk.TaggerI): 
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent) 
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history) 
                train_set.append( (featureset, tag) ) 
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence): 
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag) 
        return list(zip(sentence, history))

In [103]:
tagged_sents = brown.tagged_sents(categories='news')

In [104]:
size = int(len(tagged_sents) * 0.1)

In [105]:
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

In [106]:
tagger = ConsecutivePosTagger(train_sents)

In [107]:
tagger.evaluate(test_sents)

0.7980528511821975
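The key point of the tagger above is that each prediction is appended to `history` and becomes a feature for the next word. A self-contained sketch of that greedy left-to-right loop, with hand-written rules standing in for the trained classifier; `toy_features`, `toy_classify`, `greedy_tag` and the tag names are all hypothetical.

```python
def toy_features(sentence, i, history):
    """Features for word i: the word itself and the tag predicted for word i-1."""
    return {
        'word': sentence[i].lower(),
        'prev-tag': history[i - 1] if i > 0 else '<START>',
    }

def toy_classify(features):
    """Hand-written rules standing in for a trained classifier (assumption)."""
    if features['word'] in ('the', 'a'):
        return 'DET'
    if features['prev-tag'] == 'DET':
        return 'NOUN'
    if features['prev-tag'] == 'NOUN':
        return 'VERB'
    return 'NOUN'

def greedy_tag(sentence):
    """Tag left to right, feeding each predicted tag back in as history."""
    history = []
    for i in range(len(sentence)):
        history.append(toy_classify(toy_features(sentence, i, history)))
    return list(zip(sentence, history))

tagged = greedy_tag(['The', 'dog', 'barks'])
print(tagged)
```

Because each decision is final once made, an early mistake propagates through `prev-tag` to later words; that is the usual trade-off of greedy sequence classification.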

## Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

In [108]:
sents = nltk.corpus.treebank_raw.sents()

In [110]:
tokens = []
boundaries = set()
offset = 0

In [111]:
for sent in sents:
    tokens.extend(sent)
    offset += len(sent) 
    boundaries.add(offset-1)
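With `tokens` and `boundaries` in hand, each candidate punctuation token can be turned into a labeled feature set for a classifier. A sketch on toy data; `punct_features` and its feature names are assumptions chosen for illustration, and the toy sentences are not from the treebank corpus.

```python
def punct_features(tokens, i):
    """Describe the context of the punctuation token at position i."""
    return {
        'punct': tokens[i],
        'prev-word': tokens[i - 1].lower(),
        'next-word-capitalized': tokens[i + 1][0].isupper(),
        'prev-word-is-one-char': len(tokens[i - 1]) == 1,
    }

# Toy pre-segmented sentences; built into tokens/boundaries as in the loop above.
toy_sents = [['Dr', '.', 'Lee', 'spoke', '.'], ['Was', 'it', 'long', '?']]
tokens, boundaries, offset = [], set(), 0
for sent in toy_sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)  # index of the last token of each sentence

# Label each candidate: True if it ends a sentence. The final token is skipped
# because punct_features needs a following token.
labeled = [(punct_features(tokens, i), i in boundaries)
           for i in range(1, len(tokens) - 1)
           if tokens[i] in {'.', '?', '!'}]

for features, is_boundary in labeled:
    print(is_boundary, features)
```

The period after "Dr" is labeled as a non-boundary and the period after "spoke" as a boundary, which is the distinction the classifier has to learn.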