<h1>Project 3: Gender Name Classifier</h1>

<h3> DATA 620 Web Analytics, CUNY Spring 2018 </h3>

Team: Andy Carson, Nathan Cooper, Walt Wells

<h2> Assignment Details </h2>

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their
code and findings in our meetup.

<i>The ability to be an effective member of a virtual team is highly valued in the data science job market. </i>


-------------------------------------------------------------------------------------------------------------

Using any of the three classifiers described in chapter 6 of <b>Natural Language Processing with Python</b>,
and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test
set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.



In [121]:
#import nltk
#nltk.download('names')

In [122]:
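import nltk
from nltk.corpus import names as names_corpus
import random
from nltk.classify import apply_features
import pandas as pd

#get data: label every name, then shuffle so each split mixes both genders
#(the corpus reader is aliased so the list below does not shadow it when the cell is re-run)
names = ([(name, 'male') for name in names_corpus.words('male.txt')] +
         [(name, 'female') for name in names_corpus.words('female.txt')])
random.shuffle(names)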

In [123]:
names[0:10]

[(u'Konstance', 'female'),
 (u'Antonella', 'female'),
 (u'Concordia', 'female'),
 (u'Giles', 'male'),
 (u'Roseanna', 'female'),
 (u'Serge', 'male'),
 (u'Gavra', 'female'),
 (u'Huntington', 'male'),
 (u'Lorine', 'female'),
 (u'Brooks', 'female')]

In [124]:
def gender_features(name):
    #lowercase once so every feature is case-insensitive
    lowered = name.lower()
    features = {}
    features["firstletter"] = lowered[0]
    features["lastletter"] = lowered[-1]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = (letter in lowered)
    features["suffix2"] = lowered[-2]   #second-to-last letter
    features["last2"] = lowered[-2:]    #last two letters
    return features
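
The extractor above indexes the second-to-last character directly, which raises an `IndexError` on a one-letter name. The corpus run shown here did not hit that case, but a minimal sketch of a slicing-based variant is given below in case the features are reused on other data. The name `gender_features_safe` is hypothetical and is not part of the original notebook; the feature keys are kept the same.

```python
import string

def gender_features_safe(name):
    """Same feature keys as gender_features, but tolerant of one-letter names."""
    lowered = name.lower()
    features = {
        "firstletter": lowered[0],
        "lastletter": lowered[-1],
        "suffix2": lowered[-2:-1],  # second-to-last letter, "" if there is none
        "last2": lowered[-2:],      # last two letters, or the whole name if shorter
    }
    for letter in string.ascii_lowercase:
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = letter in lowered
    return features

example = gender_features_safe("Konstance")
short = gender_features_safe("A")
```

Slicing never raises on short strings, so `short["suffix2"]` is simply empty. An empty name would still fail on `lowered[0]`; that case is left unhandled here.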

In [125]:
#split into train, devtest, and test sets
featuresets = [(gender_features(n), g) for (n,g) in names]

test_set = featuresets[:500] #500
dev_test_set = featuresets[500:1000] #500
train_set = featuresets[1000:] #6944
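
Because the shuffle is unseeded, each run of the notebook produces a different split, which makes accuracy figures hard to compare between runs. A minimal sketch of a seeded three-way split follows. The helper name `split_three_ways`, the seed value, and the placeholder data are all assumptions for illustration; only the set sizes (500, 500, and the remaining 6944) come from the cell above.

```python
import random

def split_three_ways(items, n_test=500, n_dev=500, seed=620):
    """Shuffle a copy of items with a fixed seed, then slice into three disjoint sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG, leaves the global state alone
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev_test, test

# Placeholder data standing in for the 7944 labeled names used in this notebook.
placeholder = [("name%d" % i, "female" if i % 2 else "male") for i in range(7944)]
train, dev_test, test = split_three_ways(placeholder)
```

Calling the helper twice with the same seed returns the same three lists, so a change in dev-test accuracy can be attributed to a feature change rather than to a different split.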

In [126]:
featuresets[0]

({'count(a)': 1,
  'count(b)': 0,
  'count(c)': 1,
  'count(d)': 0,
  'count(e)': 1,
  'count(f)': 0,
  'count(g)': 0,
  'count(h)': 0,
  'count(i)': 0,
  'count(j)': 0,
  'count(k)': 1,
  'count(l)': 0,
  'count(m)': 0,
  'count(n)': 2,
  'count(o)': 1,
  'count(p)': 0,
  'count(q)': 0,
  'count(r)': 0,
  'count(s)': 1,
  'count(t)': 1,
  'count(u)': 0,
  'count(v)': 0,
  'count(w)': 0,
  'count(x)': 0,
  'count(y)': 0,
  'count(z)': 0,
  'firstletter': u'k',
  'has(a)': True,
  'has(b)': False,
  'has(c)': True,
  'has(d)': False,
  'has(e)': True,
  'has(f)': False,
  'has(g)': False,
  'has(h)': False,
  'has(i)': False,
  'has(j)': False,
  'has(k)': True,
  'has(l)': False,
  'has(m)': False,
  'has(n)': True,
  'has(o)': True,
  'has(p)': False,
  'has(q)': False,
  'has(r)': False,
  'has(s)': True,
  'has(t)': True,
  'has(u)': False,
  'has(v)': False,
  'has(w)': False,
  'has(x)': False,
  'has(y)': False,
  'has(z)': False,
  'last2': u'ce',
  'lastletter': u'e',
  'suffix

In [127]:
#classify - use NB and DT
classifier_NB = nltk.NaiveBayesClassifier.train(train_set)
classifier_DT = nltk.DecisionTreeClassifier.train(train_set)

In [128]:
#check accuracy
print(nltk.classify.accuracy(classifier_NB, dev_test_set)) #.784, .786, .794
print(nltk.classify.accuracy(classifier_DT, dev_test_set)) #.766, .766, .8

0.794
0.8


In [129]:
#show important features
classifier_NB.show_most_informative_features(5)


Most Informative Features
                   last2 = u'na'          female : male   =    154.4 : 1.0
                   last2 = u'la'          female : male   =     74.4 : 1.0
                   last2 = u'us'            male : female =     62.6 : 1.0
                   last2 = u'ia'          female : male   =     37.8 : 1.0
                   last2 = u'ld'            male : female =     36.5 : 1.0
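
Each ratio in this table compares how likely a feature value is under one label versus the other. The sketch below shows the arithmetic in simplified form, assuming add-0.5 (expected likelihood) smoothing, which to our understanding is the default estimator in NLTK's Naive Bayes trainer. All counts are hypothetical and chosen only for illustration; they are not taken from this run.

```python
def ele_prob(count, total, bins):
    """Expected likelihood estimate: add 0.5 to every bin before normalizing."""
    return (count + 0.5) / (total + 0.5 * bins)

# Hypothetical counts (assumptions, not values from this notebook):
# training names per gender, and how many of them end in one particular suffix.
female_total, male_total = 4500, 2444
female_with_suffix, male_with_suffix = 300, 1
distinct_suffixes = 200  # assumed number of distinct last2 values

p_female = ele_prob(female_with_suffix, female_total, distinct_suffixes)
p_male = ele_prob(male_with_suffix, male_total, distinct_suffixes)
ratio = p_female / p_male
print("female : male = %.1f : 1.0" % ratio)
```

The smoothing term is what keeps the ratio finite when a suffix never appears for one gender in the training set. It also means that ratios built on very small counts, like the single male example here, are sensitive to one or two names.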


In [130]:
#check errors
dev_test_names = names[500:1000]
dev_test_names[0]


(u'Laila', 'female')

In [131]:
errors = []
for (name, tag) in dev_test_names:
    guess = classifier_NB.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [132]:
print("Error count: " + str(len(errors)))

Error count: 103


In [133]:
#collect (tag, guess, name) for every dev-test name, not just the errors
all_guesses = []
for (name, tag) in dev_test_names:
    guess = classifier_NB.classify(gender_features(name))
    all_guesses.append( (tag, guess, name) )

In [134]:
#make dataframe
all_guesses_pd = pd.DataFrame(all_guesses)

In [135]:
#confusion matrix
#column 0 holds the reference tags, column 1 the classifier's guesses
print(nltk.ConfusionMatrix(list(all_guesses_pd[0]), list(all_guesses_pd[1])))

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<260> 65 |
  male |  38<137>|
-------+---------+
(row = reference; col = test)
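
The four counts in this matrix are enough to recover the dev-test accuracy and to see where the errors concentrate. The worked sketch below uses only the counts printed above; the variable and function names are our own.

```python
# Counts read off the dev-test confusion matrix (row = reference, col = guess).
female_as_female, female_as_male = 260, 65
male_as_female, male_as_male = 38, 137

total = female_as_female + female_as_male + male_as_female + male_as_male
accuracy = (female_as_female + male_as_male) / float(total)

def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for one class from its confusion counts."""
    precision = true_pos / float(true_pos + false_pos)
    recall = true_pos / float(true_pos + false_neg)
    return precision, recall

female_precision, female_recall = precision_recall(
    female_as_female, male_as_female, female_as_male)
male_precision, male_recall = precision_recall(
    male_as_male, female_as_male, male_as_female)

# Baseline: always guess the majority class of the dev-test set.
majority_baseline = max(female_as_female + female_as_male,
                        male_as_female + male_as_male) / float(total)

print("accuracy = %.3f, majority baseline = %.3f" % (accuracy, majority_baseline))
print("female: precision = %.3f, recall = %.3f" % (female_precision, female_recall))
print("male:   precision = %.3f, recall = %.3f" % (male_precision, male_recall))
```

The accuracy works out to 397 of 500, matching the 0.794 reported earlier, against 0.650 for always guessing female. Male precision is the weakest figure: only 137 of the 202 names guessed male are actually male.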



In [136]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Barby                         
correct=female   guess=male     name=Beau                          
correct=female   guess=male     name=Brett                         
correct=female   guess=male     name=Caro                          
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Chriss                        
correct=female   guess=male     name=Christel                      
correct=female   guess=male     name=Christin                      
correct=female   guess=male     name=Cody                          
correct=female   guess=male     name=Constance                     
correct=female   guess=male     name=Cortney                       
correct=female   guess=male     name=Daveen                        
correct=female   guess=male     name=Devin                         
correct=female   guess=male     name=Doreen                        
correct=female   guess=male     name=Edin       

In [137]:
#check test accuracy
print(nltk.classify.accuracy(classifier_NB, test_set)) #.786
print(nltk.classify.accuracy(classifier_DT, test_set)) #.758

0.786
0.758


<b>Question:</b>

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

<b>Answer:</b>

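In this run, test accuracy is lower than dev-test accuracy for both classifiers: Naive Bayes drops from 0.794 to 0.786, and the decision tree drops from 0.800 to 0.758. This is the direction we would expect. Because we choose features by checking them against the dev-test set, the model ends up slightly tuned to that set, so the dev-test score is an optimistic estimate of performance on names the model has never seen (the test set). As long as we have not overfit badly, the gap should be small, as it is for Naive Bayes here.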
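
To judge whether a gap between dev-test and test accuracy is large, it helps to compare it with the sampling noise of a 500-name evaluation set. The sketch below is a rough check under two stated assumptions: each accuracy is treated as a binomial proportion, and the two sets are treated as independent samples. It does not correct for the optimism introduced by tuning against the dev-test set. The function names are our own.

```python
import math

def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

def gap_in_std_errors(dev_acc, test_acc, n_dev=500, n_test=500):
    """Dev-test minus test accuracy, in units of the standard error of the gap."""
    se_gap = math.sqrt(accuracy_std_error(dev_acc, n_dev) ** 2 +
                       accuracy_std_error(test_acc, n_test) ** 2)
    return (dev_acc - test_acc) / se_gap

# Accuracies reported in this notebook: (classifier, dev-test, test).
results = [("Naive Bayes", 0.794, 0.786), ("Decision tree", 0.800, 0.758)]
for label, dev_acc, test_acc in results:
    print("%-13s gap=%.3f  gap/SE=%.2f" %
          (label, dev_acc - test_acc, gap_in_std_errors(dev_acc, test_acc)))
```

Under these assumptions a single 500-name accuracy near 0.79 carries a standard error of about 0.018. The Naive Bayes gap is roughly 0.3 standard errors and the decision tree gap roughly 1.6, so neither drop is clearly distinguishable from sampling noise on sets this small.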