In [9]:
import nltk

# Gender Identification

In 4 we saw that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

In [1]:
def gender_features(word):
    return {"last_letter": word[-1]}

gender_features("Shrek")

{'last_letter': 'k'}

**Note**

Most classification methods require that features be encoded using simple value types, such as booleans, numbers, and strings. But note that just because a feature has a simple type, this does not necessarily mean that the feature's value is simple to express or compute. Indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.


Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels.

In [6]:
from nltk.corpus import names
import random

labeled_names = ([(name, "male") for name in names.words("male.txt")] +
                 [(name, "female") for name in names.words("female.txt")])

random.shuffle(labeled_names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

In [10]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

We will learn more about the naive Bayes classifier later in the chapter. For now, let's just test it out on some names that did not appear in its training data:

In [25]:
classifier.classify(gender_features("Neo"))

'male'

In [27]:
classifier.classify(gender_features("Trinity"))

'female'

Observe that these character names from The Matrix are correctly classified. Although this science fiction movie is set in 2199, it still conforms with our expectations about names and genders. We can systematically evaluate the classifier on a much larger quantity of unseen data:

In [28]:
print(nltk.classify.accuracy(classifier, test_set))

0.766


Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

In [29]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.7 : 1.0
             last_letter = 'k'              male : female =     29.7 : 1.0
             last_letter = 'v'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     17.3 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0


This listing shows that the names in the training set that end in "a" are female 33 times more often than they are male, but names that end in "k" are male 32 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.

**Note**

Your Turn: Modify the gender_features() function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.

In [45]:
# Add another features
def gender_features(word):
    return {"first_letter"       : word[0],
            "second_letter"      : word[1],
            "last_letter"        : word[-1],
            "second_last_letter" : word[-2]}

gender_features("rakka")

{'first_letter': 'r',
 'second_letter': 'a',
 'last_letter': 'a',
 'second_last_letter': 'k'}

In [46]:
features_sets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [47]:
classifier.classify(gender_features("rakka"))

'female'

In [48]:
print(nltk.classify.accuracy(classifier, test_set))

0.766
