<H1>Gender Identification from a Names Corpus</H1>

<H2>Introduction</H2>

The task at hand is to work with a corpus containing approximately 7900 names, each labeled male or female (about 60% are female), and train a classifier to determine the gender of a given name.  Three different classifiers were tried: Max Entropy, Naive Bayesian, and Decision Trees.

The corpus was divided into three pieces.  The largest piece, approximately 6900 names, was the training set.  The other two were test sets of 500 names each.  One was used to test the classifier after each round of training, and the other was held back as the final test once we thought we had identified the best classifier.

The first set of code below loads the data, shuffles it (the names are not randomly distributed between male and female, hence the need for the shuffle), and then divides the data into the three sets described above.

In [45]:
import random

import nltk

names = nltk.corpus.names

vowels = {'a','e','i','o', 'u'}

names = ([(name.lower(), 'male') for name in names.words('male.txt')] +
         [(name.lower(), 'female') for name in names.words('female.txt')])

random.seed(42)
random.shuffle(names)

devTestNames = names[:500]
testNames = names[500:1000]
trainNames = names[1000:len(names)]


<H2>From the NLTK Text</H2>
The first classification is taken directly from the NLTK Natural Language Processing text.  The "feature" set for this classification is just the last letter of the name.  Surprisingly (or perhaps not), this is substantially better than an empty feature set (i.e., a feature function that always returns True).  We get 74.8% accuracy with both the Bayesian and the Decision Tree classifiers, which is much better than the roughly 61% an empty feature set would return.

We also print out the most informative features, which show that many ending letters are strongly indicative of one gender or the other.
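The roughly 61% figure for an empty feature set corresponds to always guessing the most common label. A minimal sketch of that baseline follows; the helper name `majority_baseline_accuracy` and the toy data are assumptions made for illustration and are not part of the notebook or the corpus.

```python
from collections import Counter

def majority_baseline_accuracy(labeled_names):
    """Accuracy obtained by always guessing the most common label."""
    counts = Counter(label for _, label in labeled_names)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labeled_names))

# Hypothetical toy data, not taken from the corpus.
toy_names = [('anna', 'female'), ('maria', 'female'), ('lena', 'female'),
             ('mark', 'male'), ('josef', 'male')]

# Three of the five toy names are 'female', so the baseline is 3/5.
baseline = majority_baseline_accuracy(toy_names)
```

Any feature set worth keeping should beat this number on the same test data.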

In [46]:
def gender_features(word):
    return {'last_ketter': word[-1]}

trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(5)  # prints the table itself and returns None

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

0.748
Most Informative Features
             last_ketter = u'a'           female : male   =     35.8 : 1.0
             last_ketter = u'k'             male : female =     29.5 : 1.0
             last_ketter = u'v'             male : female =     16.5 : 1.0
             last_ketter = u'f'             male : female =     14.7 : 1.0
             last_ketter = u'p'             male : female =     12.0 : 1.0
0.748


<H2>Expanding (slightly) the Simple Classifier</H2>
Since even a one-letter suffix gave promising results, we expand the feature set to the last three letters of the name.  We see a slight improvement (almost 1 percentage point) for the Bayesian classifier, and an increase of over 5 points with the decision tree, up to 80.4%.  Note that in this cell suffix2 and suffix3 hold the single second-to-last and third-to-last letters (word[-2] and word[-3]) rather than full 2 and 3 letter suffixes; the later sections use word[-2:] and word[-3:].  In the top 10 "informative" features for the Bayesian classifier, the second-to-last letter shows up only in the last two places and the third-to-last letter not at all, another indicator that they are not helping much.

Not shown, but examined, was whether first letter prefixes improved the score (as well as 2 or 3 letter prefixes).  They did not noticeably help.
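The difference between indexing a single letter (word[-2]) and slicing a trailing substring (word[-2:]) matters for what the suffix features mean. The sketch below shows both variants side by side; the function names `last_letters` and `true_suffixes` are hypothetical, and slices are used throughout so that very short names yield an empty string instead of raising an IndexError.

```python
def last_letters(word):
    """Single letters at positions -1, -2 and -3 of the name."""
    return {'suffix1': word[-1:],
            'suffix2': word[-2:-1],   # second-to-last letter only
            'suffix3': word[-3:-2]}   # third-to-last letter only

def true_suffixes(word):
    """Trailing substrings of length 1, 2 and 3."""
    return {'suffix1': word[-1:],
            'suffix2': word[-2:],
            'suffix3': word[-3:]}

single = last_letters('anna')    # letters 'a', 'n', 'n'
whole = true_suffixes('anna')    # suffixes 'a', 'na', 'nna'
short = last_letters('al')       # no third-to-last letter, so ''
```

With real suffixes the classifier can learn endings such as 'na' as a unit, which single letters cannot express.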

In [49]:
#function needed to handle a 2 letter name, some of which must exist...
def returnSuffix(word):
    if len(word) > 2:
        return(word[-3])
    else:
        return ''
def gender_features(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2],
            'suffix3': returnSuffix(word)}

trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(10)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

0.756
Most Informative Features
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix1 = u'v'             male : female =     16.5 : 1.0
                 suffix1 = u'f'             male : female =     14.7 : 1.0
                 suffix1 = u'p'             male : female =     12.0 : 1.0
                 suffix1 = u'm'             male : female =      9.7 : 1.0
                 suffix1 = u'd'             male : female =      9.0 : 1.0
                 suffix1 = u'o'             male : female =      8.4 : 1.0
                 suffix2 = u'o'             male : female =      7.9 : 1.0
                 suffix2 = u'u'             male : female =      7.6 : 1.0
0.804


<H2>Further Attempts to Improve the Classifier</H2>

Here we try a number of different features to see if they might improve the classification.  The features tried were:
* Repeated vowels (the thought was that names with combinations like 'ia' might prove more feminine)
* Repeated consonants (the opposite thought, that multiple consonants might indicate more masculine names)
* Multiple capitals in the name (e.g., BettySue), which might indicate feminine (note this was originally tried before lowercasing all names)
* Length, in the hope that the length of a name might indicate a gender preference

For this analysis we did not use the 2 and 3 letter suffixes, so as to reduce the "noise" in determining whether any of the above ideas, together or alone, would improve the classification.  Sadly they did not improve much: both classifiers went from 74.8% to 75.2%.  That is not enough improvement to hang on to these ideas, and they are not used in the rest of the analysis.
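One quick way to check whether a feature such as name length carries any signal is to compare its average per gender before adding it to the classifier. The following is a minimal sketch under that assumption; `mean_length_by_label` and the toy list are hypothetical and do not use the corpus.

```python
def mean_length_by_label(labeled_names):
    """Average name length for each label."""
    totals = {}
    counts = {}
    for name, label in labeled_names:
        totals[label] = totals.get(label, 0) + len(name)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / float(counts[label]) for label in totals}

# Hypothetical toy data, not taken from the corpus.
toy_names = [('anna', 'female'), ('maria', 'female'),
             ('mark', 'male'), ('bo', 'male')]

# female: (4 + 5) / 2, male: (4 + 2) / 2
means = mean_length_by_label(toy_names)
```

If the two averages are close, as the small accuracy change in this section suggests, the feature is unlikely to help much.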

In [51]:
def findDoubleLetters(word):
    # True if the same character appears twice in a row
    for i in range(1,len(word)):
        if word[i-1] == word[i]:
            return True
    return False

def findRepeatedVowels(word):
    # True if two vowels appear next to each other
    for i in range(1,len(word)):
        if word[i-1] in vowels and word[i] in vowels:
            return True
    return False

def findRepeatedConsonants(word):
    # True if two non-vowel characters appear next to each other
    for i in range(1,len(word)):
        if word[i-1] not in vowels and word[i] not in vowels:
            return True
    return False

def gender_features(word):
    return {'suffix1': word[-1],
            'findRepeatedVowels': findRepeatedVowels(word),
            'doubleCap': sum(1 for c in word if c.isupper()),
            'repeatedLetter': findDoubleLetters(word),
            # the key must be a string, not the function object itself
            'findRepeatedConsonants': findRepeatedConsonants(word),
            'length' : len(word)
           }

trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(5)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
errors = []

for (name, tag) in devTestNames:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors

sorted(errors, key = lambda x: x[0])
# sum([len(x[0]) for x in names if x[1] == 'male'])/float(sum([1 for x in names if x[1] == 'male']))  
# sum([len(x[0]) for x in names if x[1] == 'female'])/float(sum([1 for x in names if x[1] == 'female']))

0.752
Most Informative Features
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix1 = u'v'             male : female =     16.5 : 1.0
                 suffix1 = u'f'             male : female =     14.7 : 1.0
                 suffix1 = u'p'             male : female =     12.0 : 1.0
0.752


[('female', 'male', u'jenifer'),
 ('female', 'male', u'brittany'),
 ('female', 'male', u'amy'),
 ('female', 'male', u'melisent'),
 ('female', 'male', u'linell'),
 ('female', 'male', u'terri-jo'),
 ('female', 'male', u'jourdan'),
 ('female', 'male', u'berget'),
 ('female', 'male', u'dorcas'),
 ('female', 'male', u'ivett'),
 ('female', 'male', u'dynah'),
 ('female', 'male', u'kim'),
 ('female', 'male', u'cass'),
 ('female', 'male', u'marian'),
 ('female', 'male', u'marjory'),
 ('female', 'male', u'oprah'),
 ('female', 'male', u'sheelagh'),
 ('female', 'male', u'ester'),
 ('female', 'male', u'ceil'),
 ('female', 'male', u'miran'),
 ('female', 'male', u'noelyn'),
 ('female', 'male', u'garland'),
 ('female', 'male', u'cris'),
 ('female', 'male', u'phillis'),
 ('female', 'male', u'arden'),
 ('female', 'male', u'piper'),
 ('female', 'male', u'pris'),
 ('female', 'male', u'nan'),
 ('female', 'male', u'adrian'),
 ('female', 'male', u'jaquelin'),
 ('female', 'male', u'fleur'),
 ('female', 'male'

<H2>Phonetics, or Does It Sound Like a Girl's or Boy's Name</H2>

Here we try to see whether the sounds a name can be parsed into have a positive effect on classification.  We imported the CMU (Carnegie Mellon University) pronouncing dictionary to try this out.

We tried many combinations of things, and found that phonetics do indeed improve our accuracy.  The simplest approach seemed to return the best results, i.e., using the individual sounds at each position (phonemes, called "syllables" in the code) improved the scores the most.  Some of the other things tried (and now commented out) were:
* Is a specific phoneme ('OW0') a good feature, as it turned up in the 'informative' features list?  (It helped, but less than the generic method we ended up staying with.)
* Does the syllable count of names help?
* Do names with multiple known pronunciations improve classification?

Some of these reduced performance, suggesting we were overfitting.

At this point we also added the Max Entropy classifier.  We didn't use it before now because of how long it takes to execute.  The results now stand at 77.8% for Bayesian and 76% for the Decision Tree classifier.  Max Entropy is up to 79.6%, even if it takes a long time to get there.
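The positional sound features can be sketched without the full dictionary. In the block below, `toy_pronunciations` is a hypothetical stand-in for the mapping returned by cmudict.dict() (a word mapped to a list of pronunciations, each a list of phoneme strings), and `phoneme_features` is an assumed helper name. Unknown words and positions past the end of a pronunciation both fall back to an empty string.

```python
# Hypothetical stand-in for cmudict.dict(); the entries are illustrative only.
toy_pronunciations = {
    'anna': [['AE1', 'N', 'AH0']],
    'mark': [['M', 'AA1', 'R', 'K']],
}

def phoneme_features(word, pronunciations, max_positions=8):
    """One feature per position, taken from the first listed pronunciation."""
    entries = pronunciations.get(word)
    phonemes = entries[0] if entries else []
    features = {}
    for i in range(max_positions):
        features['phonetic' + str(i)] = phonemes[i] if i < len(phonemes) else ''
    return features

anna = phoneme_features('anna', toy_pronunciations)
unknown = phoneme_features('zzyx', toy_pronunciations)  # not in the toy dictionary
```

Because every name gets the same fixed set of keys, names missing from the dictionary contribute only empty values rather than missing features.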

In [54]:
from nltk.corpus import cmudict
phonetic = cmudict.dict()

def GetPhonetic(word, syllable):
    # Return the phoneme at index `syllable` of the first listed pronunciation
    # (there may be multiple pronunciations, we want just the first).
    try:
        return phonetic[word][0][syllable]
    except (KeyError, IndexError):  # word not in dict, or pronunciation too short
        return ''

def ContainsOW0(word):
    try:
        return 'OW0' in phonetic[word][0]
    except KeyError:  # word is not in dict
        return False

def GetSyllableCount(word):
    # Despite the name, this counts the phonemes in the first pronunciation.
    try:
        return len(phonetic[word][0])
    except KeyError:
        return 0

def DoMultiplePronunciationsExist(word):
    try:
        return len(phonetic[word]) > 1
    except KeyError:
        return False
    

def gender_features(word):    
    return {'suffix1': word[-1],            
#            'findRepeatedVowels': findRepeatedVowels(word),
#           'multipleCap': sum(1 for c in word if c.isupper()),
#            'repeatedLetter': findDoubleLetters(word),
#            findRepeatedConsonants: findRepeatedConsonants(word),
#            'length' : len(word),
#            'phonemeOW0':   ContainsOW0(word),
             'phonetic0': GetPhonetic(word, 0),
             'phonetic1': GetPhonetic(word, 1),
             'phonetic2': GetPhonetic(word, 2),
             'phonetic3': GetPhonetic(word, 3),
             'phonetic4': GetPhonetic(word, 4),
             'phonetic5': GetPhonetic(word, 5),
             'phonetic6': GetPhonetic(word, 6),
             'phonetic7': GetPhonetic(word, 7),
#              'phonetic8': getPhonetic(word, 8),
#              'phonetic9': getPhonetic(word, 9),
# #              'phonetic10': getPhonetic(word, 10),
#              'syllableCount': GetSyllableCount(word),
#            'multiPronunciation': DoMultiplePronunciationsExist(word)
#            'returnTrue':  True            
           }


trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(40)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

classifier = nltk.classify.MaxentClassifier.train(trainSet, max_iter=50)
print nltk.classify.accuracy(classifier, devTestSet)

0.778
Most Informative Features
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
               phonetic6 = u'OW0'           male : female =     21.5 : 1.0
               phonetic5 = u'OW0'           male : female =     21.0 : 1.0
               phonetic5 = u'Z'             male : female =     18.7 : 1.0
                 suffix1 = u'v'             male : female =     16.5 : 1.0
               phonetic2 = u'W'             male : female =     16.5 : 1.0
                 suffix1 = u'f'             male : female =     14.7 : 1.0
               phonetic2 = u'NG'            male : female =     14.2 : 1.0
                 suffix1 = u'p'             male : female =     12.0 : 1.0
                 suffix1 = u'm'             male : female =      9.7 : 1.0
               phonetic5 = u'ER0'           male : female =      9.5 : 1.0
               phonetic4 = u'OW0'           male : female =      9.4

<H2>Individual Letters</H2>

Here we try to see if using individual letters as features will improve our scores.  The section immediately below counts all the letters in all the names and compares their relative proportions in the male and female name sets.

The letters 'a', 'f', and 'w' stand out as being the most distinctive in either male or female names, so we will use them as features in the next code section.

males = [name[0] for name in names if name[1] == 'male']
females = [name[0] for name in names if name[1] == 'female']

import string
dLettersMale = dict.fromkeys(string.ascii_lowercase, 0)
dLettersFemale = dict.fromkeys(string.ascii_lowercase, 0)
dLettersNames = dict.fromkeys(string.ascii_lowercase, 0)

def addToLetterCount(dLetters, word):
    for letter in word:
        if letter in string.ascii_lowercase:
            dLetters[letter] += 1

for name in males:
    addToLetterCount(dLettersMale, name)
for name in females:
    addToLetterCount(dLettersFemale, name)

# normalize the counts to occurrences per name
for letter in dLettersMale:
    dLettersMale[letter] /= float(len(males))

for letter in dLettersFemale:
    dLettersFemale[letter] /= float(len(females))

# ratio of the male rate to the female rate for each letter
for letter in dLettersMale:
    dLettersNames[letter] = dLettersMale[letter] / dLettersFemale[letter]

# sort from most female-leaning to most male-leaning
sortedLetters = sorted(dLettersNames.items(), key = lambda kv: kv[1])
sortedLetters
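The same per-name letter rates and male-to-female ratios can be sketched with collections.Counter. The cell above divides directly, which would raise a ZeroDivisionError if some letter never appeared in the female names; the sketch skips such letters instead. The helper names and the toy lists are assumptions for illustration only.

```python
from collections import Counter
import string

def letters_per_name(name_list):
    """Average number of occurrences of each letter per name."""
    counts = Counter()
    for name in name_list:
        counts.update(ch for ch in name if ch in string.ascii_lowercase)
    return {ch: counts[ch] / float(len(name_list))
            for ch in string.ascii_lowercase}

def male_to_female_ratio(male_names, female_names):
    """Sorted (letter, ratio) pairs; letters absent from female names are skipped."""
    male_rate = letters_per_name(male_names)
    female_rate = letters_per_name(female_names)
    ratios = {}
    for ch in string.ascii_lowercase:
        if female_rate[ch] > 0:
            ratios[ch] = male_rate[ch] / female_rate[ch]
    return sorted(ratios.items(), key=lambda kv: kv[1])

# Hypothetical toy data, not taken from the corpus.
toy_males = ['mark', 'wolf']
toy_females = ['anna', 'mia']
ratios = male_to_female_ratio(toy_males, toy_females)
```

Letters at either end of the sorted list are the candidates for "contains letter" features.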

<H2>Letter Classification</H2>

As noted above, we add the letters 'a', 'f' and 'w' to the classification feature set.  Each feature simply records whether the name contains the letter.  At this point we also add the 2 and 3 letter suffixes (the letter features were tested without those suffixes as well, and improved scores that way too).

Our scores are now noticeably higher for two of the three methods.  We are at 81.4% for both Bayesian and Max Entropy.  The decision tree went down to 75.6%, which suggests that method may be overfitting.

We also printed out an error list from the Max Entropy classifier to look for additional ideas on how to improve, based on what is being misclassified.
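Collecting the misclassified names can be wrapped in a small function so that each run starts from an empty list. This is a minimal sketch: `collect_errors` is an assumed helper name, and `toy_classify` is a hypothetical stand-in rule used in place of a trained NLTK classifier.

```python
def collect_errors(labeled_names, classify):
    """Return (true_label, guess, name) for every misclassified name."""
    errors = []  # start from an empty list on every call
    for name, tag in labeled_names:
        guess = classify(name)
        if guess != tag:
            errors.append((tag, guess, name))
    return sorted(errors, key=lambda e: e[0])

def toy_classify(name):
    # Hypothetical stand-in rule: names ending in 'a' are guessed female.
    return 'female' if name.endswith('a') else 'male'

# Hypothetical toy data, not taken from the corpus.
toy_dev = [('anna', 'female'), ('kim', 'female'),
           ('mark', 'male'), ('luca', 'male')]
toy_errors = collect_errors(toy_dev, toy_classify)
```

Sorting by the true label groups the female names guessed as male ahead of the male names guessed as female, which makes patterns in the mistakes easier to spot.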

In [57]:
def ContainsLetter(word,letter):
    return letter in word

    
def gender_features(word): 
    features = {}
    features['suffix1'] = word[-1]
    features['suffix2'] = word[-2:]
    features['suffix3'] = word[-3:]
    for i in range(0,8):
        features['phonetic'+ str(i)] = GetPhonetic(word, i)
    features['lettera'] = ContainsLetter(word, 'a')
    features['letterf'] = ContainsLetter(word, 'f')
    features['letterw'] = ContainsLetter(word, 'w')        
    return features
    
trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]
    
classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(20)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

classifier = nltk.classify.MaxentClassifier.train(trainSet, max_iter=50)
print nltk.classify.accuracy(classifier, devTestSet)

errors = []  # start fresh so errors from earlier cells are not carried over
for (name, tag) in devTestNames:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

errors = sorted(errors, key = lambda x: x[0])
errors

0.814
Most Informative Features
                 suffix2 = u'na'          female : male   =     93.2 : 1.0
                 suffix2 = u'ia'          female : male   =     86.5 : 1.0
                 suffix2 = u'la'          female : male   =     68.2 : 1.0
                 suffix2 = u'us'            male : female =     39.2 : 1.0
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix2 = u'sa'          female : male   =     32.4 : 1.0
                 suffix2 = u'ta'          female : male   =     32.1 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix2 = u'do'            male : female =     27.4 : 1.0
                 suffix3 = u'ita'         female : male   =     27.1 : 1.0
                 suffix3 = u'tta'         female : male   =     24.1 : 1.0
                 suffix3 = u'ana'         female : male   =     24.1 : 1.0
                 suffix2 = u'ra'          female : male   =     23.9

[('female', 'male', u'jenifer'),
 ('female', 'male', u'brittany'),
 ('female', 'male', u'philippe'),
 ('female', 'male', u'marje'),
 ('female', 'male', u'jewel'),
 ('female', 'male', u'georgie'),
 ('female', 'male', u'melisent'),
 ('female', 'male', u'benny'),
 ('female', 'male', u'gretal'),
 ('female', 'male', u'orly'),
 ('female', 'male', u'jourdan'),
 ('female', 'male', u'berget'),
 ('female', 'male', u'winny'),
 ('female', 'male', u'dorcas'),
 ('female', 'male', u'marylou'),
 ('female', 'male', u'ivett'),
 ('female', 'male', u'kim'),
 ('female', 'male', u'mariam'),
 ('female', 'male', u'tammy'),
 ('female', 'male', u'cass'),
 ('female', 'male', u'marian'),
 ('female', 'male', u'lory'),
 ('female', 'male', u'ethyl'),
 ('female', 'male', u'jacquie'),
 ('female', 'male', u'chastity'),
 ('female', 'male', u'randy'),
 ('female', 'male', u'marjory'),
 ('female', 'male', u'abbey'),
 ('female', 'male', u'salome'),
 ('female', 'male', u'cordy'),
 ('female', 'male', u'ester'),
 ('female', 'm

<H2>All Letters, and Counts of Letters</H2>

Here we try to see if taking all letters (rather than just 'a', 'f' and 'w'), along with the count of each letter in the name, assists in classification.

We improve a little with Bayesian, up to 82.2%, and by a very small amount with Max Entropy, to 81.8%, but it took an exceptionally long time to calculate.  The decision tree is even worse, down to 75.2%, which suggests we are overfitting even more.

In [58]:
def ContainsLetter(word,letter):
    return letter in word

    
def gender_features(word):    
    features = {}
    features['suffix1'] = word[-1]
    features['suffix2'] = word[-2:]
    features['suffix3'] = word[-3:]
    for i in range(0,8):
        features['phonetic'+ str(i)] = GetPhonetic(word, i)
    for letter in string.ascii_lowercase:
        features['count' + letter] = word.count(letter)
        features['contains' + letter] = ContainsLetter(word, letter)
    return features
                
    
trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(40)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

classifier = nltk.classify.MaxentClassifier.train(trainSet, max_iter=50)
print nltk.classify.accuracy(classifier, devTestSet)    

0.822
Most Informative Features
                 suffix2 = u'na'          female : male   =     93.2 : 1.0
                 suffix2 = u'ia'          female : male   =     86.5 : 1.0
                 suffix2 = u'la'          female : male   =     68.2 : 1.0
                 suffix2 = u'us'            male : female =     39.2 : 1.0
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix2 = u'sa'          female : male   =     32.4 : 1.0
                 suffix2 = u'ta'          female : male   =     32.1 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix2 = u'do'            male : female =     27.4 : 1.0
                 suffix3 = u'ita'         female : male   =     27.1 : 1.0
                 suffix3 = u'tta'         female : male   =     24.1 : 1.0
                 suffix3 = u'ana'         female : male   =     24.1 : 1.0
                 suffix2 = u'ra'          female : male   =     23.9

<H2>Common Longer Suffixes</H2>
Here we find the 15 most common longer suffixes (4 to 6 letters), and we will use those in the next section to see if they can help us.  The improvement is likely to be minimal at best, as many names aren't even as long as some of these suffixes, but we will see if it helps.

In [59]:
from nltk.probability import FreqDist
import operator
suffixFDist = FreqDist()
for name in names:
    suffixFDist[name[0][-4:]] +=1
    suffixFDist[name[0][-5:]] +=1
    suffixFDist[name[0][-6:]] +=1

# keep the 15 most frequent suffixes
fDist = sorted(suffixFDist.items(), key = operator.itemgetter(1), reverse=True)[:15]
fDist

[(u'ette', 91),
 (u'elle', 80),
 (u'ella', 70),
 (u'anne', 56),
 (u'etta', 55),
 (u'nnie', 52),
 (u'line', 51),
 (u'anna', 49),
 (u'lina', 43),
 (u'rina', 37),
 (u'llie', 35),
 (u'elia', 33),
 (u'bert', 32),
 (u'lene', 32),
 (u'nette', 31)]
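One detail of the counting above is worth noting: when a name is shorter than the requested length, the slice returns the whole name, so a four-letter name such as 'anne' is counted three times under its own spelling. The sketch below counts only proper suffixes, using collections.Counter in place of FreqDist; the function name `count_long_suffixes` and the toy list are assumptions for illustration.

```python
from collections import Counter

def count_long_suffixes(name_list, lengths=(4, 5, 6)):
    """Count trailing substrings, skipping lengths that would cover the whole name."""
    counts = Counter()
    for name in name_list:
        for n in lengths:
            if len(name) > n:
                counts[name[-n:]] += 1
    return counts

# Hypothetical toy data, not taken from the corpus.
toy_list = ['annette', 'jeanette', 'anne', 'bert']
counts = count_long_suffixes(toy_list)
top = counts.most_common(2)
```

Whether a whole name should count as its own suffix is a design choice; this sketch excludes it so that short names do not inflate the table.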

<H2>Common Longer Suffixes Classification</H2>

In this step we took out the per-letter counts and the all-letters features.  Their improvement was mixed and the increase in run time was substantial.

Taking those out and adding the common longer suffixes, we see very little difference.  Max Entropy went from 81.4% to 81.8%, but the other two methods stayed the same.

Therefore we will stick with the classification based on these characteristics:
* Suffixes of 1, 2, and 3 letters
* Sounds of the name (phonemes by position)
* The letters 'a', 'f', 'w'

In the next section we will run against the remaining test set.

In [60]:
def ContainsSuffix(word,suffixes):
    for tup in suffixes:        
        if word == tup[0]:
            return True
    return False    
    
def gender_features(word):    
    features = {}
    features['suffix1'] = word[-1]
    features['suffix2'] = word[-2:]
    features['suffix3'] = word[-3:]
    for i in range(0,8):
        features['phonetic'+ str(i)] = GetPhonetic(word, i)
    features['lettera'] = ContainsLetter(word, 'a')
    features['letterf'] = ContainsLetter(word, 'f')
    features['letterw'] = ContainsLetter(word, 'w')
    for i in range(4,7):
        features['suffix' + str(i)] = ContainsSuffix(word, fDist)     
    return features
            
trainSet = [(gender_features(n), g) for (n,g) in trainNames]
devTestSet = [(gender_features(n), g) for (n,g) in devTestNames]
    
classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)
classifier.show_most_informative_features(40)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, devTestSet)

classifier = nltk.classify.MaxentClassifier.train(trainSet, max_iter=50)
print nltk.classify.accuracy(classifier, devTestSet)    

0.814
Most Informative Features
                 suffix2 = u'na'          female : male   =     93.2 : 1.0
                 suffix2 = u'ia'          female : male   =     86.5 : 1.0
                 suffix2 = u'la'          female : male   =     68.2 : 1.0
                 suffix2 = u'us'            male : female =     39.2 : 1.0
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix2 = u'sa'          female : male   =     32.4 : 1.0
                 suffix2 = u'ta'          female : male   =     32.1 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix2 = u'do'            male : female =     27.4 : 1.0
                 suffix3 = u'ita'         female : male   =     27.1 : 1.0
                 suffix3 = u'tta'         female : male   =     24.1 : 1.0
                 suffix3 = u'ana'         female : male   =     24.1 : 1.0
                 suffix2 = u'ra'          female : male   =     23.9
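Note that ContainsSuffix as written in the cell above returns True only when the whole name equals one of the listed suffixes, and the same value is stored under suffix4, suffix5 and suffix6. If the intent was to test whether a name ends with a common suffix of each length, a sketch could look like the following. The function name `long_suffix_features` and the toy suffix list are hypothetical, and the accuracies reported in this section were not produced with this variant.

```python
def long_suffix_features(word, common_suffixes):
    """One boolean per suffix length: does the name end with a common suffix of that length?"""
    features = {}
    for n in range(4, 7):
        candidates = [s for s in common_suffixes if len(s) == n]
        features['suffix' + str(n)] = any(word.endswith(s) for s in candidates)
    return features

# Hypothetical short list standing in for the top-15 table.
toy_suffixes = ['ette', 'elle', 'nette']

feats = long_suffix_features('annette', toy_suffixes)
none = long_suffix_features('mark', toy_suffixes)
```

Here a plain list of suffix strings is passed in; with the (suffix, count) tuples built earlier, the strings would first need to be extracted from each tuple.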

<H2>Final Test</H2>

Here we run against the final test set, to see if our results are similar to what we got with the "dev" test set.

In [61]:
def ContainsSuffix(word,suffixes):
    for tup in suffixes:        
        if word == tup[0]:
            return True
    return False    
    
def gender_features(word):    
    features = {}
    features['suffix1'] = word[-1]
    features['suffix2'] = word[-2:]
    features['suffix3'] = word[-3:]
    for i in range(0,8):
        features['phonetic'+ str(i)] = GetPhonetic(word, i)
    features['lettera'] = ContainsLetter(word, 'a')
    features['letterf'] = ContainsLetter(word, 'f')
    features['letterw'] = ContainsLetter(word, 'w')
    return features
        
    
trainSet = [(gender_features(n), g) for (n,g) in trainNames]
testSet = [(gender_features(n), g) for (n,g) in testNames]    
    
classifier = nltk.NaiveBayesClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, testSet)
classifier.show_most_informative_features(40)

classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, testSet)

classifier = nltk.classify.MaxentClassifier.train(trainSet, max_iter=50)
print nltk.classify.accuracy(classifier, testSet)    

0.794
Most Informative Features
                 suffix2 = u'na'          female : male   =     93.2 : 1.0
                 suffix2 = u'ia'          female : male   =     86.5 : 1.0
                 suffix2 = u'la'          female : male   =     68.2 : 1.0
                 suffix2 = u'us'            male : female =     39.2 : 1.0
                 suffix1 = u'a'           female : male   =     35.8 : 1.0
                 suffix2 = u'sa'          female : male   =     32.4 : 1.0
                 suffix2 = u'ta'          female : male   =     32.1 : 1.0
                 suffix1 = u'k'             male : female =     29.5 : 1.0
                 suffix2 = u'do'            male : female =     27.4 : 1.0
                 suffix3 = u'ita'         female : male   =     27.1 : 1.0
                 suffix3 = u'tta'         female : male   =     24.1 : 1.0
                 suffix3 = u'ana'         female : male   =     24.1 : 1.0
                 suffix2 = u'ra'          female : male   =     23.9

<H2>Results</H2>

Here on the final test set, we see that Bayesian is down a little, to 79.4%.  Perhaps we are overfitting slightly with the Bayesian method; ideally, we should rerun the experiment with several differently shuffled train/test splits to check.
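Rerunning with several shuffles can be sketched as a loop over random seeds. To keep the example self-contained it uses a simple last-letter lookup in place of the NLTK classifiers; `train_last_letter`, `accuracy_over_seeds` and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_last_letter(train):
    """Map each final letter to its most common label in the training data."""
    by_letter = defaultdict(Counter)
    overall = Counter()
    for name, label in train:
        by_letter[name[-1]][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen letters
    table = {ch: c.most_common(1)[0][0] for ch, c in by_letter.items()}
    return lambda name: table.get(name[-1], default)

def accuracy_over_seeds(labeled_names, seeds, test_size):
    """Shuffle with each seed, hold out test_size names, and record the accuracy."""
    scores = []
    for seed in seeds:
        data = list(labeled_names)
        random.Random(seed).shuffle(data)
        test, train = data[:test_size], data[test_size:]
        classify = train_last_letter(train)
        correct = sum(1 for name, label in test if classify(name) == label)
        scores.append(correct / float(len(test)))
    return scores

# Hypothetical toy data, not taken from the corpus.
toy = [('anna', 'female'), ('maria', 'female'), ('lena', 'female'),
       ('julia', 'female'), ('mark', 'male'), ('josef', 'male'),
       ('erik', 'male'), ('hans', 'male')]
scores = accuracy_over_seeds(toy, seeds=[1, 2, 3], test_size=2)
```

A wide spread in the resulting scores would point to an unlucky split, while a consistent gap between dev and test accuracy would point to overfitting.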

The decision tree method was exceptionally poor, down to 74.6%, no better than when we started; we seem to be overfitting (or we were exceptionally unlucky with our test data).  So below we rerun a decision tree on the final test set using just the last three letters, and the result is much improved at 80.4%, which supports the overfitting explanation.

Lastly, Max Entropy gave similar (but slightly better) results at 82%.  It looks like we are not overfitting with that method; we just have to wait a while for it to work through its computations.

In [63]:
#function needed to handle a 2 letter name, some of which must exist...
def returnSuffix(word):
    if len(word) > 2:
        return(word[-3])
    else:
        return ''
def gender_features(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2],
            'suffix3': returnSuffix(word)}

trainSet = [(gender_features(n), g) for (n,g) in trainNames]
testSet = [(gender_features(n), g) for (n,g) in testNames]    


classifier = nltk.DecisionTreeClassifier.train(trainSet)
print nltk.classify.accuracy(classifier, testSet)

0.804
