# Name Gender Identifier

## 1. Building a feature extractor

An idea is to use the last letter of the name to predict the gender. For instance, names ending in *a*, *e* and *i* are likely to be female, while names ending in *k*, *o*, *r*, *s* and *t* are likely to be male.

In [1]:
# Feature extractor
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('John')

{'last_letter': 'n'}

The returned dictionary is known as a **feature set**.

## 2. Exploring the `names` corpus

In [2]:
from nltk.corpus import names

names.readme().replace('\n', ' ')

'Names Corpus, Version 1.3 (1994-03-29) Copyright (C) 1991 Mark Kantrowitz Additions by Bill Ross  This corpus contains 5001 female names and 2943 male names, sorted alphabetically, one per line.  You may use the lists of names for any purpose, so long as credit is given in any published work. You may also redistribute the list if you provide the recipients with a copy of this README file. The lists are not in the public domain (I retain the copyright on the lists) but are freely redistributable.  If you have any additions to the lists of names, I would appreciate receiving them.  Mark Kantrowitz <mkant+@cs.cmu.edu> http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/'

In [3]:
names.fileids()

['female.txt', 'male.txt']

In [4]:
names.words('female.txt')[:5]

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi']

## 3. Building the classifier

We need to prepare a list of examples and corresponding class labels.

In [5]:
labeled_names = ([(name, 'female') for name in names.words('female.txt')] + [(name, 'male') for name in names.words('male.txt')])
labeled_names[:5]

[('Abagael', 'female'),
 ('Abagail', 'female'),
 ('Abbe', 'female'),
 ('Abbey', 'female'),
 ('Abbi', 'female')]

In [6]:
import random
random.shuffle(labeled_names) # We shuffle the data so that we can split it by index into training and test data.
labeled_names[:5]

[('Mufinella', 'female'),
 ('Jabez', 'male'),
 ('Lenee', 'female'),
 ('Demetris', 'female'),
 ('Renell', 'female')]

In [7]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets[:5]

[({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'z'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 's'}, 'female'),
 ({'last_letter': 'l'}, 'female')]

In [8]:
len(featuresets)

7944

In [9]:
from nltk import NaiveBayesClassifier

# We split the data into a training (80%) and test (20%) set:
TRAIN_SET_SIZE = round(len(featuresets) * .8)
train_set, test_set = featuresets[:TRAIN_SET_SIZE], featuresets[TRAIN_SET_SIZE:]

# We also get the names in the test set, to be used later:
test_names = labeled_names[TRAIN_SET_SIZE:]

classifier = NaiveBayesClassifier.train(train_set)

# When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory: 
# from nltk.classify import apply_features
# train_names, test_names = labeled_names[:round(len(featuresets) * .8)], labeled_names[round(len(featuresets) * .8):]
# train_set = apply_features(gender_features, labeled_names[500:])
# test_set = apply_features(gender_features, labeled_names[:500])

In [10]:
classifier.show_most_informative_features(10) # Prints likelihood ratios for most informative features

Most Informative Features
             last_letter = 'k'              male : female =     40.2 : 1.0
             last_letter = 'a'            female : male   =     35.6 : 1.0
             last_letter = 'v'              male : female =     17.5 : 1.0
             last_letter = 'f'              male : female =     11.8 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =      9.4 : 1.0
             last_letter = 'm'              male : female =      8.4 : 1.0
             last_letter = 'o'              male : female =      8.4 : 1.0
             last_letter = 'r'              male : female =      5.7 : 1.0
             last_letter = 'w'              male : female =      4.6 : 1.0


## 4. Testing the classifier

In [11]:
classifier.labels()

['female', 'male']

In [12]:
from nltk.classify import accuracy

round(accuracy(classifier, test_set), 2)

0.76

In [13]:
classifier.classify(gender_features('Aphrodite'))

'female'

In [14]:
classifier.classify(gender_features('Zeus'))

'male'

## 5. Building a classifier with more features

In [15]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('John')

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [16]:
featuresets2 = [(gender_features2(n), gender) for (n, gender) in labeled_names]
featuresets2[0]

({'first_letter': 'm',
  'last_letter': 'a',
  'count(a)': 1,
  'has(a)': True,
  'count(b)': 0,
  'has(b)': False,
  'count(c)': 0,
  'has(c)': False,
  'count(d)': 0,
  'has(d)': False,
  'count(e)': 1,
  'has(e)': True,
  'count(f)': 1,
  'has(f)': True,
  'count(g)': 0,
  'has(g)': False,
  'count(h)': 0,
  'has(h)': False,
  'count(i)': 1,
  'has(i)': True,
  'count(j)': 0,
  'has(j)': False,
  'count(k)': 0,
  'has(k)': False,
  'count(l)': 2,
  'has(l)': True,
  'count(m)': 1,
  'has(m)': True,
  'count(n)': 1,
  'has(n)': True,
  'count(o)': 0,
  'has(o)': False,
  'count(p)': 0,
  'has(p)': False,
  'count(q)': 0,
  'has(q)': False,
  'count(r)': 0,
  'has(r)': False,
  'count(s)': 0,
  'has(s)': False,
  'count(t)': 0,
  'has(t)': False,
  'count(u)': 1,
  'has(u)': True,
  'count(v)': 0,
  'has(v)': False,
  'count(w)': 0,
  'has(w)': False,
  'count(x)': 0,
  'has(x)': False,
  'count(y)': 0,
  'has(y)': False,
  'count(z)': 0,
  'has(z)': False},
 'female')

In [17]:
train_set2, test_set2 = featuresets2[:TRAIN_SET_SIZE], featuresets2[TRAIN_SET_SIZE:]
classifier2 = NaiveBayesClassifier.train(train_set2)
round(accuracy(classifier2, test_set2), 2)

0.79

We would have expected that having too many specific features on a small dataset would lead to overfitting, but it seems the classifier was good at avoiding that since its performance is slightly better.

In [18]:
classifier2.show_most_informative_features(15)

Most Informative Features
             last_letter = 'k'              male : female =     40.2 : 1.0
             last_letter = 'a'            female : male   =     35.6 : 1.0
             last_letter = 'v'              male : female =     17.5 : 1.0
             last_letter = 'f'              male : female =     11.8 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =      9.4 : 1.0
                count(v) = 2              female : male   =      8.9 : 1.0
             last_letter = 'm'              male : female =      8.4 : 1.0
             last_letter = 'o'              male : female =      8.4 : 1.0
             last_letter = 'r'              male : female =      5.7 : 1.0
            first_letter = 'w'              male : female =      5.0 : 1.0
                count(a) = 3              female : male   =      4.7 : 1.0
             last_letter = 'w'              male : female =      4.6 : 1.0

Indeed, it seems the classifier is mainly using the last letter, along with some other features that happen to improve the accuracy.

## 6. Comparing the two classifiers using `nltk.metrics`

Before we start, here's a useful function for comparing strings:

In [19]:
from nltk.metrics import edit_distance

edit_distance("John", "Joan")

1

The NLTK metrics module provides functions for calculating metrics beyond mere accuracy. But in order to do so, we need to build 2 sets for each classification label: a reference set of correct values, and a test set of observed values.

In [20]:
import collections

# Classifier 1
refsets = collections.defaultdict(set) # For what this is: https://stackoverflow.com/questions/5900578/how-does-collections-defaultdict-work
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    
# Classifier 2
refsets2 = collections.defaultdict(set)
testsets2 = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set2):
    refsets2[label].add(i)
    observed = classifier2.classify(feats)
    testsets2[observed].add(i)

In [21]:
refsets

defaultdict(set,
            {'female': {0,
              3,
              5,
              6,
              7,
              9,
              10,
              12,
              14,
              16,
              17,
              18,
              19,
              20,
              23,
              26,
              27,
              28,
              29,
              30,
              31,
              32,
              33,
              34,
              35,
              37,
              38,
              41,
              42,
              46,
              47,
              48,
              50,
              52,
              54,
              58,
              61,
              63,
              65,
              66,
              67,
              68,
              71,
              72,
              75,
              76,
              77,
              78,
              79,
              81,
              83,
              84,
              85,
              86,
       

In [22]:
testsets

defaultdict(set,
            {'female': {0,
              3,
              5,
              6,
              7,
              10,
              14,
              17,
              19,
              20,
              23,
              24,
              26,
              27,
              29,
              30,
              31,
              32,
              33,
              34,
              36,
              37,
              38,
              39,
              41,
              42,
              44,
              46,
              47,
              48,
              50,
              52,
              53,
              54,
              58,
              61,
              65,
              66,
              67,
              71,
              72,
              73,
              75,
              76,
              77,
              78,
              79,
              81,
              83,
              84,
              85,
              86,
              88,
              89,
      

In [23]:
from nltk.metrics.scores import (precision, recall, f_measure)

# We can proceed to print the metrics for each classifier. However, we cannot get the accuracy in this manner because nltk.metrics.scores.accuracy(reference, test) works by comparing test[i] == reference[i] and our reference and test are not formatted in a way that allows for this. It's the same for the confusion matrix.
args = (
    round(precision(refsets['female'], testsets['female']), 2),
    round(precision(refsets['male'], testsets['male']), 2),
    round(recall(refsets['female'], testsets['female']), 2),
    round(recall(refsets['male'], testsets['male']), 2),
    round(f_measure(refsets['female'], testsets['female']), 2),
    round(f_measure(refsets['male'], testsets['male']), 2)
)

args2 = (
    round(precision(refsets2['female'], testsets2['female']), 2),
    round(precision(refsets2['male'], testsets2['male']), 2),
    round(recall(refsets2['female'], testsets2['female']), 2),
    round(recall(refsets2['male'], testsets2['male']), 2),
    round(f_measure(refsets2['female'], testsets2['female']), 2),
    round(f_measure(refsets2['male'], testsets2['male']), 2)
)

print('''
CLASSIFIER 1
------------ 
Female precision: {0}
Male precision: {1}
Female recall: {2}
Male recall: {3}
Female F1 score: {4}
Male F1 score: {5}

CLASSIFIER 2
------------ 
Female precision: {6}
Male precision: {7}
Female recall: {8}
Male recall: {9}
Female F1 score: {10}
Male F1 score: {11}
'''.format(*args, *args2))


CLASSIFIER 1
------------ 
Female precision: 0.8
Male precision: 0.68
Female recall: 0.82
Male recall: 0.65
Female F1 score: 0.81
Male F1 score: 0.67

CLASSIFIER 2
------------ 
Female precision: 0.83
Male precision: 0.72
Female recall: 0.84
Male recall: 0.69
Female F1 score: 0.83
Male F1 score: 0.71



## 7. Error analysis

In [24]:
errors = []
for (name, tag) in test_names:
    guess = classifier2.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors[:5]

[('female', 'male', 'Gwen'),
 ('female', 'male', 'Joelynn'),
 ('female', 'male', 'Karyl'),
 ('female', 'male', 'Marilyn'),
 ('male', 'female', 'Vance')]

In [25]:
for (tag, guess, name) in sorted(errors):
    print('Correct = {:8} guess = {:8} name = {}'.format(tag, guess, name)) # :8 creates spaces between columns.

Correct = female   guess = male     name = Abagail
Correct = female   guess = male     name = Abigail
Correct = female   guess = male     name = Adel
Correct = female   guess = male     name = Adrien
Correct = female   guess = male     name = Aigneis
Correct = female   guess = male     name = Aileen
Correct = female   guess = male     name = Aleen
Correct = female   guess = male     name = Alisun
Correct = female   guess = male     name = Allison
Correct = female   guess = male     name = Allyn
Correct = female   guess = male     name = Amargo
Correct = female   guess = male     name = Annabel
Correct = female   guess = male     name = April
Correct = female   guess = male     name = Ardelis
Correct = female   guess = male     name = Arden
Correct = female   guess = male     name = Ardys
Correct = female   guess = male     name = Aryn
Correct = female   guess = male     name = Averil
Correct = female   guess = male     name = Avis
Correct = female   guess = male     name = Beilul
Corre

Looking through this list of errors, it seems that some suffixes that are more than one letter long can be indicative of name genders. For example, names ending in *yn* appear to be predominantly female, despite the fact that names ending in *n* tend to be male; and names ending in *ch* are usually male, even though names that end in *h* tend to be female.

## 8. Building a classifier with even more features

In [26]:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["suffix1"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features3('John')

{'first_letter': 'j',
 'suffix1': 'n',
 'suffix2': 'hn',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [27]:
featuresets3 = [(gender_features3(n), gender) for (n, gender) in labeled_names]
featuresets3[0]

({'first_letter': 'm',
  'suffix1': 'a',
  'suffix2': 'la',
  'count(a)': 1,
  'has(a)': True,
  'count(b)': 0,
  'has(b)': False,
  'count(c)': 0,
  'has(c)': False,
  'count(d)': 0,
  'has(d)': False,
  'count(e)': 1,
  'has(e)': True,
  'count(f)': 1,
  'has(f)': True,
  'count(g)': 0,
  'has(g)': False,
  'count(h)': 0,
  'has(h)': False,
  'count(i)': 1,
  'has(i)': True,
  'count(j)': 0,
  'has(j)': False,
  'count(k)': 0,
  'has(k)': False,
  'count(l)': 2,
  'has(l)': True,
  'count(m)': 1,
  'has(m)': True,
  'count(n)': 1,
  'has(n)': True,
  'count(o)': 0,
  'has(o)': False,
  'count(p)': 0,
  'has(p)': False,
  'count(q)': 0,
  'has(q)': False,
  'count(r)': 0,
  'has(r)': False,
  'count(s)': 0,
  'has(s)': False,
  'count(t)': 0,
  'has(t)': False,
  'count(u)': 1,
  'has(u)': True,
  'count(v)': 0,
  'has(v)': False,
  'count(w)': 0,
  'has(w)': False,
  'count(x)': 0,
  'has(x)': False,
  'count(y)': 0,
  'has(y)': False,
  'count(z)': 0,
  'has(z)': False},
 'female')

In [28]:
train_set3, test_set3 = featuresets3[:TRAIN_SET_SIZE], featuresets3[TRAIN_SET_SIZE:]
classifier3 = NaiveBayesClassifier.train(train_set3)
round(accuracy(classifier3, test_set3), 2)

0.8

In [29]:
classifier3.show_most_informative_features(15)

Most Informative Features
                 suffix2 = 'na'           female : male   =     84.9 : 1.0
                 suffix2 = 'la'           female : male   =     66.2 : 1.0
                 suffix1 = 'k'              male : female =     40.2 : 1.0
                 suffix2 = 'ta'           female : male   =     37.8 : 1.0
                 suffix1 = 'a'            female : male   =     35.6 : 1.0
                 suffix2 = 'ia'           female : male   =     34.7 : 1.0
                 suffix2 = 'us'             male : female =     34.1 : 1.0
                 suffix2 = 'ra'           female : male   =     31.3 : 1.0
                 suffix2 = 'rt'             male : female =     29.5 : 1.0
                 suffix2 = 'ch'             male : female =     23.7 : 1.0
                 suffix2 = 'do'             male : female =     22.6 : 1.0
                 suffix2 = 'rd'             male : female =     21.2 : 1.0
                 suffix2 = 'ld'             male : female =     19.5 : 1.0

## 9. Trying to use a maximum entropy classifier

The principle of **maximum entropy** states that the probability distribution which best represents the current state of knowledge is the one with largest entropy.

The principle of maximum entropy is invoked when we have some piece(s) of information about a probability distribution, but not enough to characterize it completely—likely because we do not have the means or resources to do so. As an example, if all we know about a distribution is its average, we can imagine infinite shapes that yield a particular average. The principle of maximum entropy says that we should humbly choose the distribution that maximizes the amount of unpredictability contained in the distribution.

Taking the idea to the extreme, it wouldn’t be scientific to choose a distribution that simply yields the average value 100% of the time.

From all the models that fit our training data, the Maximum Entropy classifier selects the one which has the largest entropy. Due to the minimum assumptions that the Maximum Entropy classifier makes, it is usually used when we don’t know anything about the prior distributions and when it is unsafe to make any assumptions. Also, the maximum entropy classifier is used when we can’t assume the conditional independence of the features.

In [30]:
from nltk import MaxentClassifier

me_classifier = MaxentClassifier.train(train_set3, max_iter=25) # max_iter has default value 100. In this example, the performance in terms of accuracy on the test set starts significantly improving beyond the previous model's at around 25 iterations.

  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.371
             2          -0.60317        0.629
             3          -0.58163        0.629
             4          -0.56185        0.634
             5          -0.54374        0.665
             6          -0.52719        0.704
             7          -0.51209        0.735
             8          -0.49831        0.752
             9          -0.48572        0.767
            10          -0.47420        0.777
            11          -0.46366        0.785
            12          -0.45399        0.787
            13          -0.44510        0.791
            14          -0.43692        0.796
            15          -0.42937        0.799
            16          -0.42239        0.800
            17          -0.41593        0.802
            18          -0.40992        0.804
            19          -0.40434        0.805
  

In [31]:
round(accuracy(me_classifier, test_set3), 2) # The accuracies above were on the training set so this is what matters.

0.81

In [32]:
me_classifier.show_most_informative_features(10)

  -1.926 suffix2=='na' and label is 'male'
  -1.899 suffix2=='la' and label is 'male'
  -1.538 suffix2=='ta' and label is 'male'
  -1.513 suffix1=='k' and label is 'female'
  -1.491 suffix2=='ra' and label is 'male'
  -1.465 suffix1=='a' and label is 'male'
  -1.422 suffix2=='ia' and label is 'male'
  -1.359 suffix2=='us' and label is 'female'
  -1.312 suffix2=='ch' and label is 'female'
  -1.285 suffix2=='rt' and label is 'female'


## 10. More classifiers

Scikit-learn (sklearn) is a popular library which features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN.

NLTK provides an API to quickly use sklearn classifiers in `nltk.classify.scikitlearn`. The other option is to import and use sklearn directly.

For an example of integrating sklearn with NLTK, you can check out [this](https://www.kaggle.com/alvations/basic-nlp-with-nltk) notebook on Kaggle. Kaggle is a great website for NLP and machine learning in general, creating an account is highly recommended.