<h1>Project 3: Gender Name Classifier</h1>

<h3> DATA 620 Web Analytics, CUNY Spring 2018 </h3>

Team: Andy Carson, Nathan Cooper, Walt Wells

<h2> Assignment Details </h2>

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their
code and findings in our meetup.

<i>The ability to be an effective member of a virtual team is highly valued in the data science job market. </i>


-------------------------------------------------------------------------------------------------------------

Using any of the three classifiers described in chapter 6 of <b>Natural Language Processing with Python</b>,
and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test
set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.



In [1]:
#import nltk
#nltk.download('names')

In [80]:
import nltk
from nltk.corpus import names
import random
from nltk.classify import apply_features
import pandas as pd



In [81]:
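#get data: label every name with its gender, then shuffle
#note: this rebinds `names` from the corpus reader to a list of (name, gender)
#pairs, so re-run the import cell before re-running this one
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)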

In [82]:
names[0:10]

[(u'Olga', 'female'),
 (u'Tommy', 'male'),
 (u'Son', 'male'),
 (u'Ursala', 'female'),
 (u'Phillipe', 'male'),
 (u'Dillon', 'male'),
 (u'Charity', 'female'),
 (u'Brett', 'male'),
 (u'Rockwell', 'male'),
 (u'Jillane', 'female')]

In [83]:
#code adapted from Natural Language Processing with Python (ch. 6) and various sources online
def gender_features(name):
    name = name.lower()  # normalize once so every feature is case-insensitive
    features = {}
    features["firstletter"] = name[0]
    features["lastletter"] = name[-1]
    features["last_is_vowel"] = (name[-1] in 'aeiouy')
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.count(letter)
        features["has(%s)" % letter] = (letter in name)
        features["first(%s)" % letter] = name.find(letter)  # index of first occurrence, -1 if absent
    features["suffix2"] = name[-2]  # second-to-last letter (assumes len(name) >= 2)
    features["last2"] = name[-2:]
    # left-pad with a space so two-letter names still yield a three-character value
    features["last3"] = name[-3:].rjust(3)
    features["length"] = len(name)
    return features
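
To make the size of the feature dictionary concrete, here is a minimal sketch of only the per-letter part of the extractor. The helper name `letter_features` is hypothetical and not part of the notebook; it simply repeats the loop above for a single name. Note that `first(...)` uses `-1` as a sentinel for "letter absent", which, as far as we understand, NLTK's Naive Bayes classifier treats as just another categorical value rather than a number.

```python
import string

def letter_features(name):
    """Hypothetical helper: the per-letter features only (count, has, first)."""
    lowered = name.lower()
    features = {}
    for letter in string.ascii_lowercase:
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = letter in lowered
        features["first(%s)" % letter] = lowered.find(letter)  # -1 when absent
    return features

feats = letter_features("Olga")
print("%d features" % len(feats))  # 78 features (26 letters x 3)
print("first(a)=%d first(z)=%d" % (feats["first(a)"], feats["first(z)"]))  # first(a)=3 first(z)=-1
```

Most of these 78 entries are zero, `False`, or `-1` for any given name, which is worth keeping in mind when judging how much each feature family actually contributes.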

In [84]:
#split into train, devtest, and test sets
featuresets = [(gender_features(n), g) for (n,g) in names]

test_set = featuresets[:500] #500
dev_test_set = featuresets[500:1000] #500
train_set = featuresets[1000:] #6944
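
Because the shuffle is unseeded, each run of the notebook produces a different split, which is why the recorded accuracies vary from run to run. A minimal sketch of a repeatable three-way split follows. The function name `split_three_ways`, the seed value, and the placeholder data are all assumptions made for illustration; the notebook itself slices `featuresets` directly.

```python
import random

def split_three_ways(items, n_test=500, n_dev=500, seed=620):
    """Shuffle a copy with a fixed seed, then slice into train/dev-test/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG; global state untouched
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev_test, test

# Placeholder data standing in for the 7944 labeled names (not the real corpus)
labeled = [("name%d" % i, "male" if i % 2 else "female") for i in range(7944)]
train, dev_test, test = split_three_ways(labeled)
print("%d %d %d" % (len(train), len(dev_test), len(test)))  # 6944 500 500
```

Splitting the raw `(name, gender)` pairs first and extracting features afterwards would also keep the name lists and feature sets aligned without relying on matching slice indices.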

In [85]:
featuresets[0]

({'count(a)': 1,
  'count(b)': 0,
  'count(c)': 0,
  'count(d)': 0,
  'count(e)': 0,
  'count(f)': 0,
  'count(g)': 1,
  'count(h)': 0,
  'count(i)': 0,
  'count(j)': 0,
  'count(k)': 0,
  'count(l)': 1,
  'count(m)': 0,
  'count(n)': 0,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 0,
  'count(s)': 0,
  'count(t)': 0,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'first(a)': 3,
  'first(b)': -1,
  'first(c)': -1,
  'first(d)': -1,
  'first(e)': -1,
  'first(f)': -1,
  'first(g)': 2,
  'first(h)': -1,
  'first(i)': -1,
  'first(j)': -1,
  'first(k)': -1,
  'first(l)': 1,
  'first(m)': -1,
  'first(n)': -1,
  'first(o)': 0,
  'first(p)': -1,
  'first(q)': -1,
  'first(r)': -1,
  'first(s)': -1,
  'first(t)': -1,
  'first(u)': -1,
  'first(v)': -1,
  'first(w)': -1,
  'first(x)': -1,
  'first(y)': -1,
  'first(z)': -1,
  'firstletter': u'o',
  'has(a)': True,
  'has(b)': False,
  'has(c)': False,
  'has(d)': False,
  ...

In [86]:
#classify - use NB and DT
classifier_NB = nltk.NaiveBayesClassifier.train(train_set)
classifier_DT = nltk.DecisionTreeClassifier.train(train_set)

In [87]:
#check accuracy
# trailing comments list dev-test accuracies from earlier runs (the shuffle is unseeded)
print(nltk.classify.accuracy(classifier_NB, dev_test_set)) #.784, .786, .794, 0.804, 0.826, 0.786, 0.826, 0.816
print(nltk.classify.accuracy(classifier_DT, dev_test_set)) #.766, .766, .8, 0.738, 0.744, 0.726, 0.742, 0.734

0.816
0.734


In [88]:
#show important features
classifier_NB.show_most_informative_features(5)


Most Informative Features
                   last2 = u'na'          female : male   =     94.0 : 1.0
                   last2 = u'la'          female : male   =     69.2 : 1.0
              lastletter = u'k'             male : female =     41.7 : 1.0
                   last2 = u'ld'            male : female =     37.1 : 1.0
                   last2 = u'ia'          female : male   =     35.8 : 1.0
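
Each ratio above compares how likely a feature value is under one label versus the other. The sketch below shows the arithmetic under the assumption that probabilities are smoothed with expected-likelihood estimation (adding 0.5 to every bin), which we believe is the default for NLTK's Naive Bayes classifier. The function names and all counts are hypothetical and are not taken from the Names Corpus.

```python
def ele_prob(count, total, bins):
    """Expected-likelihood estimate: add 0.5 to each bin before normalizing."""
    return (count + 0.5) / (total + 0.5 * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, bins):
    """Ratio of the larger to the smaller smoothed P(feature value | label)."""
    p_a = ele_prob(count_a, total_a, bins)
    p_b = ele_prob(count_b, total_b, bins)
    return max(p_a, p_b) / min(p_a, p_b)

# Hypothetical counts: a suffix seen 300 times in 4000 names of one label
# and 2 times in 2500 names of the other, with 200 distinct suffix values.
ratio = likelihood_ratio(count_a=300, total_a=4000, count_b=2, total_b=2500, bins=200)
print("%.1f : 1.0" % ratio)  # 76.2 : 1.0
```

The smoothing is what keeps the ratio finite when a value never occurs under one label; it also means ratios for rare values depend heavily on the smoothing constant.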


In [89]:
#check errors
dev_test_names = names[500:1000]
dev_test_names[0]


(u'Jacklyn', 'female')

In [90]:
errors = []
for (name, tag) in dev_test_names:
    guess = classifier_NB.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [91]:
print("Error count: " + str(len(errors))) #earlier runs: 87, 107, 87, 92

Error count: 92


In [92]:
#collect (tag, guess, name) for every dev-test name, not only the errors
all_guesses = []
for (name, tag) in dev_test_names:
    guess = classifier_NB.classify(gender_features(name))
    all_guesses.append( (tag, guess, name) )

In [93]:
#make dataframe
all_guesses_pd = pd.DataFrame(all_guesses)

In [94]:
#confusion matrix: column 0 holds the reference (true) labels, column 1 the predicted labels
print(nltk.ConfusionMatrix(list(all_guesses_pd[0]), list(all_guesses_pd[1])))

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<256> 50 |
  male |  42<152>|
-------+---------+
(row = reference; col = test)
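
The confusion matrix contains more than overall accuracy. As a small worked example, the sketch below recomputes accuracy and adds per-label precision and recall from the four counts printed above. The dictionary layout and variable names are our own choices for illustration.

```python
# Counts copied from the confusion matrix above: (reference, predicted) -> count
matrix = {
    ("female", "female"): 256, ("female", "male"): 50,
    ("male", "female"): 42, ("male", "male"): 152,
}
labels = ["female", "male"]

total = sum(matrix.values())
correct = sum(matrix[(label, label)] for label in labels)
accuracy = correct / float(total)
print("accuracy = %.3f" % accuracy)  # accuracy = 0.816

scores = {}
for label in labels:
    predicted = sum(matrix[(ref, label)] for ref in labels)  # column total
    actual = sum(matrix[(label, pred)] for pred in labels)   # row total
    precision = matrix[(label, label)] / float(predicted)
    recall = matrix[(label, label)] / float(actual)
    scores[label] = (precision, recall)
    print("%-6s precision=%.3f recall=%.3f" % (label, precision, recall))
# female precision=0.859 recall=0.837
# male   precision=0.752 recall=0.784
```

The 92 off-diagonal cases (50 + 42) match the error count reported earlier, and the lower male scores are consistent with the corpus containing more female than male names in this dev-test sample (306 versus 194).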



In [95]:
#list each misclassified name with its correct label and the classifier's guess
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Adrian                        
correct=female   guess=male     name=Aigneis                       
correct=female   guess=male     name=Berty                         
correct=female   guess=male     name=Bren                          
correct=female   guess=male     name=Cam                           
correct=female   guess=male     name=Cinnamon                      
correct=female   guess=male     name=Clair                         
correct=female   guess=male     name=Corliss                       
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Dix                           
correct=female   guess=male     name=Elspeth                       
correct=female   guess=male     name=Fan                           
correct=female   guess=male     name=Flore                         
correct=female   guess=male     name=Floris                        
correct=female   guess=male     name=Florry     

In [96]:
#check test accuracy
# trailing comments list test accuracies from earlier runs
print(nltk.classify.accuracy(classifier_NB, test_set)) #.786, 0.79, 0.822, 0.812, 0.818
print(nltk.classify.accuracy(classifier_DT, test_set)) #.758, 0.724, 0.754, 0.738, 0.73

0.818
0.73


<b>Question:</b>

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

<b>Answer:</b>

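We would generally expect test-set accuracy to be somewhat lower than dev-test accuracy. The features are tuned against the dev-test set, so the classifier can overfit to it a little, and the results may not generalize as well to data the model has never seen (the test set). If the overfitting is mild, the drop should be small. In this run the two were essentially equal: Naive Bayes scored 0.818 on the test set versus 0.816 on the dev-test set, and the decision tree scored 0.73 versus 0.734, which suggests the classifier was not noticeably overfit to the dev-test set.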
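
One rough way to judge whether a gap between two accuracies is meaningful is to compare it with the sampling noise of a 500-name evaluation set. The sketch below uses the binomial standard error, which assumes each name is classified independently; that is an approximation we are introducing here, not part of the original analysis, and the function name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# Naive Bayes accuracies reported in this notebook, on 500 names each
dev_acc, test_acc, n = 0.816, 0.818, 500

se = accuracy_standard_error(dev_acc, n)
gap = abs(test_acc - dev_acc)
print("standard error = %.3f" % se)   # standard error = 0.017
print("observed gap   = %.3f" % gap)  # observed gap   = 0.002
```

Under this approximation the observed gap is far smaller than one standard error, so the dev-test and test results are statistically indistinguishable for this run. The same noise level also explains much of the run-to-run spread in the accuracies recorded in the code comments.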