# DATA 620, Project 3: Name/Gender Classifier
## Author: Kevin Kirby

I chose a decision tree classifier because decision trees are known to perform better on name classification tasks than Naive Bayes or Maximum Entropy classifiers.

First, import the required libraries. Note that I downloaded the `names` corpus beforehand using the NLTK corpus downloader from my terminal.

In [1]:
import random
from nltk.corpus import names
from nltk import DecisionTreeClassifier
from nltk.classify import accuracy

## Classifier Functions

`name_features()`: defines the features I'm interested in. The last letter, the first letter, and whether the name ends in a vowel usually fall along gender lines. The last two letters and the name length are meant to make the model more interesting than the single-character features alone would.

`ng_classfiy()`: labels each name in the Names corpus as male or female, shuffles the list, and builds the train/dev-test/test breakouts required by the assignment (6,900, 500, and 500 names). It then creates `ng_decision_tree` by training a decision tree on the training set. The `entropy_cutoff` of 0.1 lets a branch keep splitting until its labels are nearly pure, so the tree can grow to the point where it starts splitting on rare occurrences and begins to overfit. The `support_cutoff` of 7 requires more than 7 training examples at a node before that node is split further.
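To make the `entropy_cutoff` more concrete, here is a minimal sketch of the quantity it is compared against: the Shannon entropy of the labels that reach a node. The helper `label_entropy` is a hypothetical name rather than part of NLTK, the node contents are made up for illustration, and the sketch assumes entropy is measured in bits (base-2 logarithm).

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up node contents, for illustration only.
mixed_node = ['female'] * 50 + ['male'] * 50    # evenly split
skewed_node = ['female'] * 99 + ['male'] * 1    # nearly pure

print(label_entropy(mixed_node))    # 1.0, well above 0.1, so worth splitting
print(label_entropy(skewed_node))   # roughly 0.08, below 0.1, so left as a leaf
```

Under these assumptions, a cutoff of 0.1 only stops splitting once a node is about 99% one gender, which is why it allows a fairly deep tree.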


In [2]:
def name_features(name, feature_names):
    features = {}
    if 'last_letter' in feature_names:
        features['last_letter'] = name[-1]
    if 'first_letter' in feature_names:
        features['first_letter'] = name[0]
    if 'last_is_vowel' in feature_names:
        features['last_is_vowel'] = name[-1].lower() in 'aeiou'
    if 'last_two' in feature_names:
        features['last_two'] = name[-2:]
    if 'name_length' in feature_names:
        features['name_length'] = len(name)
    return features

def ng_classfiy(feature_names):
    """Train a decision tree on the chosen features; return (dev-test, test) accuracy."""
    # Label every name in the corpus with its gender.
    labels = ([(name, 'male') for name in names.words('male.txt')] +
              [(name, 'female') for name in names.words('female.txt')])
    # Unseeded shuffle: each call produces a different split.
    random.shuffle(labels)

    # 6,900 training names, 500 dev-test names, 500 test names.
    train_set = [(name_features(n, feature_names), g) for (n, g) in labels[:6900]]
    devtest_set = [(name_features(n, feature_names), g) for (n, g) in labels[6900:7400]]
    test_set = [(name_features(n, feature_names), g) for (n, g) in labels[7400:7900]]

    # Train on the training set only, then score both held-out sets.
    ng_decision_tree = DecisionTreeClassifier.train(train_set, entropy_cutoff=0.1, support_cutoff=7)
    dt_accuracy = accuracy(ng_decision_tree, devtest_set)
    test_accuracy = accuracy(ng_decision_tree, test_set)

    return dt_accuracy, test_accuracy

## Testing the Model

I went through three rounds of incrementally adding features to assess performance. For readability, I've organized the rounds below into one section per test.
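One caveat when comparing the rounds: `ng_classfiy()` reshuffles the corpus on every call, so each round is trained and scored on a different random split. The sketch below shows one way to hold the split fixed with a seeded shuffle. The function `split_names`, the seed value, and the toy data are all hypothetical and are not what the runs below used.

```python
import random

def split_names(labeled_names, seed=620, n_train=6900, n_dev=500, n_test=500):
    """Shuffle a copy of the data with a fixed seed and cut it into three sets."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, same order on every call
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test

# Placeholder data standing in for the labeled Names corpus.
toy = [("name{}".format(i), 'female' if i % 2 else 'male') for i in range(20)]
first = split_names(toy, n_train=14, n_dev=3, n_test=3)
second = split_names(toy, n_train=14, n_dev=3, n_test=3)
print(first == second)   # True: the split is identical on every call
```

With a fixed split, a change in accuracy between rounds can be attributed to the added feature rather than to a luckier or unluckier shuffle.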

### First Test, Three Features

In [6]:
ngf_start = ['last_is_vowel', 'first_letter', 'last_letter']
dev_start, test_start = ng_classfiy(ngf_start)

print("Results of using three features:\n dev test accuracy: {}\n test accuracy: {}".format(dev_start, test_start))

Results of using three features:
 dev test accuracy: 0.788
 test accuracy: 0.798


### Second Test, Four Features

In [7]:
ngf_second = ['last_is_vowel', 'first_letter', 'last_letter', 'last_two']
dev_second, test_second = ng_classfiy(ngf_second)

print("Results of using four features:\n dev test accuracy: {}\n test accuracy: {}".format(dev_second, test_second))

Results of using four features:
 dev test accuracy: 0.786
 test accuracy: 0.768


### Final Test, Five Features

In [8]:
ngf_final = ['last_is_vowel', 'first_letter', 'last_letter', 'last_two', 'name_length']
dev_final, test_final = ng_classfiy(ngf_final)

print("Results of using all five features:\n dev test accuracy: {}\n test accuracy: {}".format(dev_final, test_final))

Results of using all five features:
 dev test accuracy: 0.778
 test accuracy: 0.782


## Results Analysis

It's interesting that three features performed best overall. The dataset is fairly small, so this could be the result of:
* Entropy and support cutoffs that are too high relative to the amount of data the tree draws from
* Additional features that introduced noise rather than useful signal

I expected the five features to outperform the three. Looking back, I realize this was a naive expectation. Fewer than 8,000 data points is a small dataset, and I should have expected better performance from a simpler model. Smaller datasets are vulnerable to overfitting, and with dev-test and test sets of only 500 names each, the measured accuracies are noisy as well.
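A rough way to see how noisy a 500-name test set is: treat each accuracy as a binomial proportion and compute an approximate 95% interval. This is a sketch using the normal approximation, not something the original runs computed, and `accuracy_interval` is a hypothetical helper name.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    se = math.sqrt(acc * (1 - acc) / n)   # binomial standard error
    return acc - z * se, acc + z * se

# Test accuracies reported above, each measured on 500 names.
for label, acc in [("three", 0.798), ("four", 0.768), ("five", 0.782)]:
    low, high = accuracy_interval(acc, 500)
    print("{} features: {:.3f} to {:.3f}".format(label, low, high))
```

Each interval is roughly three to four percentage points wide on either side and the three overlap heavily, so the differences between rounds may not reflect a real difference between the feature sets.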

I'm impressed by the baseline performance, though. Close to 80% accuracy as a starting point is a good launch pad for further refinement with a larger dataset. 
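To put the roughly 80% figure in context, it helps to compare it against the accuracy of always guessing the most common gender. The sketch below shows that calculation. `majority_baseline` is a hypothetical helper and the labels are placeholders; in the notebook, the input would be the gender labels of the evaluation set.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Placeholder labels, for illustration only.
toy_labels = ['female'] * 6 + ['male'] * 4
print(majority_baseline(toy_labels))   # 0.6
```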