Stephanie Chiang  
DATA 620 Summer 2025  
### Project 3:
# Gender Classifier for Names

### Introduction

In this project, I build, evaluate and try to improve a gender classifier for names using the Natural Language Toolkit (NLTK) library in Python. The goal is for the model to classify each name as either male or female with the highest accuracy possible.

First, the NLTK `names` corpus is loaded, labeled and shuffled, then split into three subsets: 500 names for the test set, 500 for the dev-test set and the remaining 6944 for the training set.

In [134]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random

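# Fix the seed so the shuffle, and therefore the split, is reproducible.
random.seed(101)

# Label every name with the gender of the file it comes from.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Slice the shuffled list into disjoint test, dev-test and training subsets.
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]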

print(len(test_names), len(devtest_names), len(train_names))

500 500 6944


Next, a simple initial function is defined to extract the last 2 letters of each name to be used as gender features. This function returns a dictionary and will be applied to each subset of names.

In [135]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

train_set = apply_features(gender_features, train_names)
devtest_set = apply_features(gender_features, devtest_names)
test_set = apply_features(gender_features, test_names)
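
To make the feature representation concrete, here is a minimal sketch that repeats the same function so it runs on its own and applies it to two names. The names are illustrative inputs chosen for this example, not items singled out in the analysis.

```python
def gender_features(word):
    # Same definition as in the cell above, repeated so the snippet is self-contained.
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

# Illustrative inputs chosen for this example.
for name in ['Anna', 'Robert']:
    print(name, gender_features(name))
# Anna {'suffix1': 'a', 'suffix2': 'na'}
# Robert {'suffix1': 't', 'suffix2': 'rt'}
```

Because the function uses slices rather than single indexes, it also works for a one-letter name, where both features are simply that letter.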

I will use NLTK's Naive Bayes classifier on the training set to create a model, then evaluate its accuracy against the dev-test set.

The most informative features are shown below.

In [136]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))
classifier.show_most_informative_features(5)

0.764
Most Informative Features
                 suffix2 = 'na'           female : male   =    155.5 : 1.0
                 suffix2 = 'la'           female : male   =     71.7 : 1.0
                 suffix2 = 'rt'             male : female =     51.8 : 1.0
                 suffix1 = 'k'              male : female =     43.2 : 1.0
                 suffix1 = 'a'            female : male   =     37.6 : 1.0
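
Each ratio in this output compares how often a feature value occurs among names of one gender versus the other. The sketch below shows that idea on hypothetical toy counts, not the real corpus, using raw relative frequencies. NLTK smooths its probability estimates, so its figures will not match a raw calculation like this one exactly. The function name `likelihood_ratio` is introduced here only for illustration.

```python
from collections import Counter

# Hypothetical toy training data as (suffix2, label) pairs; not the real corpus.
toy = ([('na', 'female')] * 30 + [('na', 'male')] * 1 +
       [('rt', 'male')] * 20 + [('rt', 'female')] * 2 +
       [('ie', 'female')] * 18 + [('ie', 'male')] * 9)

label_counts = Counter(label for _, label in toy)  # names per label
pair_counts = Counter(toy)                         # names per (suffix, label)

def likelihood_ratio(value, label_a, label_b):
    """Ratio P(value | label_a) / P(value | label_b), unsmoothed."""
    p_a = pair_counts[(value, label_a)] / label_counts[label_a]
    p_b = pair_counts[(value, label_b)] / label_counts[label_b]
    return p_a / p_b

print(f"na  female : male = {likelihood_ratio('na', 'female', 'male'):.1f} : 1.0")
print(f"rt  male : female = {likelihood_ratio('rt', 'male', 'female'):.1f} : 1.0")
# na  female : male = 18.0 : 1.0
# rt  male : female = 16.7 : 1.0
```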


At 0.764, the dev-test accuracy is already decent. To see where the model goes wrong, we can collect the misclassified dev-test names and print the first ten.

In [137]:
# Collect every dev-test name whose predicted label differs from its true label.
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, name))

print(errors[:10])

[('female', 'Joellen'), ('male', 'Filipe'), ('female', 'Inez'), ('female', 'Ingaborg'), ('male', 'Simone'), ('female', 'Gillan'), ('female', 'Colleen'), ('male', 'Ellsworth'), ('male', 'Donnie'), ('female', 'Kristien')]
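
A quick tally can complement reading the list by eye. The sketch below copies the ten printed errors and counts them by true label and by final two letters. It is only a rough look at a small sample, not a full error analysis.

```python
from collections import Counter

# The ten misclassified (true_label, name) pairs printed above.
errors = [('female', 'Joellen'), ('male', 'Filipe'), ('female', 'Inez'),
          ('female', 'Ingaborg'), ('male', 'Simone'), ('female', 'Gillan'),
          ('female', 'Colleen'), ('male', 'Ellsworth'), ('male', 'Donnie'),
          ('female', 'Kristien')]

by_label = Counter(tag for tag, _ in errors)          # errors per true label
by_suffix = Counter(name[-2:] for _, name in errors)  # errors per 2-letter suffix

print(by_label)                  # Counter({'female': 6, 'male': 4})
print(by_suffix.most_common(1))  # [('en', 3)]
```

In this sample, three of the ten errors are female names ending in "en", which hints that this suffix is one the model tends to associate with male names.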


A quick visual inspection of the errors shows that many of the names are less familiar to English speakers or are ambiguous even to a human reader, which may be contributing to the misclassifications.

To improve the classifier, I also include the first 2 letters of each name as features. This raises the dev-test accuracy from 0.764 to 0.816, although the five most informative features are unchanged.

In [138]:
def gender_features2(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'prefix1': word[:1],
            'prefix2': word[:2]}

train_set2 = apply_features(gender_features2, train_names)
devtest_set2 = apply_features(gender_features2, devtest_names)
test_set2 = apply_features(gender_features2, test_names)

classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
print(nltk.classify.accuracy(classifier2, devtest_set2))
classifier2.show_most_informative_features(5)

0.816
Most Informative Features
                 suffix2 = 'na'           female : male   =    155.5 : 1.0
                 suffix2 = 'la'           female : male   =     71.7 : 1.0
                 suffix2 = 'rt'             male : female =     51.8 : 1.0
                 suffix1 = 'k'              male : female =     43.2 : 1.0
                 suffix1 = 'a'            female : male   =     37.6 : 1.0


I thought it would be interesting and more informative to examine how the suffixes and prefixes work in combination, so a simple concatenation of the first and last letters is added to the feature function. This has almost no effect on dev-test accuracy (0.812, down slightly from 0.816), but the most informative features now include two of these combinations.

In [139]:
def gender_features3(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'prefix1': word[:1],
            'prefix2': word[:2],
            'first-last': word[0] + word[-1]}

train_set3 = apply_features(gender_features3, train_names)
devtest_set3 = apply_features(gender_features3, devtest_names)
test_set3 = apply_features(gender_features3, test_names)

classifier3 = nltk.NaiveBayesClassifier.train(train_set3)
print(nltk.classify.accuracy(classifier3, devtest_set3))
classifier3.show_most_informative_features(5)

0.812
Most Informative Features
                 suffix2 = 'na'           female : male   =    155.5 : 1.0
                 suffix2 = 'la'           female : male   =     71.7 : 1.0
              first-last = 'Aa'           female : male   =     66.0 : 1.0
              first-last = 'Ca'           female : male   =     52.3 : 1.0
                 suffix2 = 'rt'             male : female =     51.8 : 1.0


The performance of this classifier on the final test set is evaluated below, with an accuracy of 0.78. A drop from the dev-test score is expected, since I used the dev-test set to guide the earlier feature changes, which can naturally lead to some overfitting to that data. The test accuracy is still slightly higher than the 0.764 that the initial classifier scored on the dev-test set, though the two figures come from different subsets and are not directly comparable.

In [140]:
print(nltk.classify.accuracy(classifier3, test_set3))

0.78
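
One rough way to judge how much weight these differences can bear is to attach an approximate interval to each accuracy. The sketch below uses the normal approximation to a binomial proportion, which is an assumption added here and not part of the analysis above. The helper name `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n items
    (normal approximation to the binomial)."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # standard error of a proportion
    return accuracy - z * se, accuracy + z * se

# Accuracies reported above, each measured on a 500-name subset.
for label, acc in [('dev-test', 0.812), ('test', 0.78)]:
    low, high = accuracy_interval(acc, 500)
    print(f"{label}: {acc:.3f} (about {low:.3f} to {high:.3f})")
```

With 500 names per evaluation set, each interval spans roughly three to four percentage points on either side, so the gaps between 0.816, 0.812 and 0.78 are comparable in size to the sampling noise.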


As one last test, I compare the Naive Bayes classifier to a Decision Tree classifier. Trained on the original feature set of only the last 2 letters of each name, the Decision Tree reaches a dev-test accuracy of 0.776, slightly better than my first Naive Bayes model (0.764). However, its accuracy drops to 0.722 when the prefix and first-last combination features are added.

A final accuracy score for the Decision Tree, using the original suffix-only features, is generated on the test set: 0.78, the same as the final version of the Naive Bayes classifier. To me, this suggests that the data itself may be the limiting factor for improvement, rather than any tweaks that we can make to the features or classifiers.

In [141]:
classifier_dt = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier_dt, devtest_set))

classifier_dt3 = nltk.DecisionTreeClassifier.train(train_set3)
print(nltk.classify.accuracy(classifier_dt3, devtest_set3))

print(nltk.classify.accuracy(classifier_dt, test_set))

0.776
0.722
0.78
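
One way to probe whether the data limits accuracy is to look for names listed under both labels: a classifier that sees only the spelling gives one answer per name, so it cannot label both copies correctly. The sketch below shows the check on hypothetical toy lists. With NLTK, the inputs would be `names.words('male.txt')` and `names.words('female.txt')`. The helper name `ambiguous_names` is introduced here for illustration.

```python
def ambiguous_names(male_names, female_names):
    """Return the names that appear under both labels, sorted alphabetically."""
    return sorted(set(male_names) & set(female_names))

# Hypothetical toy lists standing in for the two corpus files.
toy_male = ['Robert', 'Leslie', 'Kim', 'Frank']
toy_female = ['Anna', 'Leslie', 'Kim', 'Carla']

shared = ambiguous_names(toy_male, toy_female)
print(shared)       # ['Kim', 'Leslie']
print(len(shared))  # 2
```

If the real corpus contains many such names, they set a floor on the error rate that no choice of spelling-based features can remove.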
