### DATA 620
#### Project 3   
### [Video Presentation]()
##### Group Four
- Santosh Cheruku
- Vinicio Haro
- Javern Wilson
- Saayed Alam  

In [53]:
# load libraries
import nltk
from nltk.corpus import names
import random

### Introduction
In this assignment, we will work with naive Bayes classifiers to build a name gender classifier. We will select relevant features as we go along to improve the accuracy of our classifier. Deciding on relevant features for a classifier can have an enormous impact on the classifier's ability to extract a good model.

In [54]:
# load names from nltk library
labeled_names = ([(name, "male") for name in names.words("male.txt")] + 
                 [(name, "female") for name in names.words("female.txt")])

# random shuffle the names
random.shuffle(labeled_names)

### Data Preparation
We begin by splitting the names courpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the classifier.

In [55]:
# split data into three subsets
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]

Using the dev-test set, we can generate a list of the errors that the classifier makes when predicting name genders. We will build a function to generate the error list with the number of errors.

In [56]:
# define error analysis function
def error_analysis(gender_features):
    # error list
    errors = []
    # extract mislabels
    for (name, tag) in devtest_names:
        guess = classifier.classify(gender_features(name))
        if guess != tag:
            errors.append((tag, guess, name))
    print("Number of Errors: ", len(errors))
    # print the errors
    for (tag, guess, name) in sorted(errors):
        print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

### Feature Engineering
#### Gender Feature 1
Our first feature function is a the example from the textbook. It classifies name based on the number of English alphabets and for that reason it overfits gender features. However, we will start with this function as benchmark and add or remove features to improve our classifier.

In [57]:
# define first version of gender features
def gender_features1(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in "abcdefghijklmnopqrstuvwxyz":
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

# train model and print accuracy score
train_set = [(gender_features1(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features1(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.758


 As expected, our first classifier score matches the one from the textbook. We will improve it.

In [58]:
# print error results
error_analysis(gender_features1)

Number of Errors:  121
correct=female   guess=male     name=Barry                         
correct=female   guess=male     name=Bonny                         
correct=female   guess=male     name=Bryn                          
correct=female   guess=male     name=Chrysler                      
correct=female   guess=male     name=Cody                          
correct=female   guess=male     name=Cordey                        
correct=female   guess=male     name=Dawn                          
correct=female   guess=male     name=Delores                       
correct=female   guess=male     name=Demetris                      
correct=female   guess=male     name=Diamond                       
correct=female   guess=male     name=Dorey                         
correct=female   guess=male     name=Dory                          
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Ethyl                         
correct=female   guess=ma

#### Gender Feature 2
As discussed in the textbook and from common knowledge the list above makes it clear that some suffixes have more than one letter that can be suggestive of name genders. For example, names ending in `ie` appear to be mislabeled. The same can be said about prefixes as well. For example, names starting with `Do` is mostly female. Also, names starting with `Je` is generally males in the error list. 

In [59]:
# define second version of gender features
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in "abcdefghijklmnopqrstuvwxyz":
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["prefix3"] = name[:3].lower()
    return features

# train model and print accuracy score
train_set = [(gender_features2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features2(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.822


We see good improvement in our accuracy score. The number of errors has decreased as well. Let us see if we can improve our classifier even further. 

In [60]:
# print error results
error_analysis(gender_features2)

Number of Errors:  89
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Barry                         
correct=female   guess=male     name=Chrysler                      
correct=female   guess=male     name=Cody                          
correct=female   guess=male     name=Darell                        
correct=female   guess=male     name=Dawn                          
correct=female   guess=male     name=Delores                       
correct=female   guess=male     name=Demetris                      
correct=female   guess=male     name=Diamond                       
correct=female   guess=male     name=Dorey                         
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Fawn                          
correct=female   guess=male     name=Gredel                        
correct=female   guess=male     name=Guinevere                     
correct=female   guess=mal

#### Gender Feature 3
For our last gender feature, we play around with few more numbers of prefixes and suffixes to yield the best accuracy score possible. We also notice `yn` is very indicative of female gender, so we implement that feature as well.

In [61]:
# define third version of gender features
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in "abcdefghijklmnopqrstuvwxyz":
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    features["suffix4"] = name[-4:].lower()
    features["prefix3"] = name[:3].lower()
    features["prefix4"] = name[:4].lower()
    features["has_yn"] = "yn" in name
    return features

# train model and print accuracy score
train_set = [(gender_features3(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features3(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.834


The resulting score is the best yet. We will test the classifier with `gender_features3()` on our unseen data. 

In [62]:
# print error results
error_analysis(gender_features3)

Number of Errors:  83
correct=female   guess=male     name=Adrien                        
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Barry                         
correct=female   guess=male     name=Chrysler                      
correct=female   guess=male     name=Cody                          
correct=female   guess=male     name=Darell                        
correct=female   guess=male     name=Dawn                          
correct=female   guess=male     name=Delores                       
correct=female   guess=male     name=Demetris                      
correct=female   guess=male     name=Diamond                       
correct=female   guess=male     name=Dory                          
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Ethyl                         
correct=female   guess=male     name=Eve                           
correct=female   guess=mal

In [63]:
# final performance test
test_set = [(gender_features3(n), gender) for (n, gender) in test_names]
print(nltk.classify.accuracy(classifier, test_set))

0.83


### Conclusion
The performance on the test set is on par with the performance on the dev-test set. We did not expect much improvement.