# Project 3
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.
How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


# Libraries used

In [1]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random
import pandas as pd

The nltk library was central to this project: it supplied both the Names Corpus and the classifiers. The random library was used to shuffle the names, and pandas was used to collect the per-run accuracy results in a data frame so that they could be summarized efficiently.

In [2]:
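# Note: this rebinds `names` from the corpus reader to a list of (name, gender) pairs,
# so the cell cannot be re-run without importing `names` again first.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])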

The names provided by nltk were used to train and test the classifiers, with male and female names stored together in a single list of (name, gender) pairs.

# Determination of accuracy
When creating a function to measure the accuracy of any given combination of features, the Naive Bayes method of classification was judged to be best suited.

In [3]:
def accuracy(number_of_runs, function_to_use):
    acc_df = {
        "classifier": [],
        "train_set_accuracy": [],
        "test_set_accuracy": [],
        "devtest_set_accuracy": [],
        "devtest_errors": []
    }
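    for i in range(number_of_runs):
        # Reshuffle so that every run draws a fresh split
        random.shuffle(names)
        acc_train_names = names[1000:]       # all remaining names
        acc_devtest_names = names[500:1000]  # next 500 names
        acc_test_names = names[:500]         # first 500 names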
        acc_train_set = [(function_to_use(n), g) for (n,g) in acc_train_names]
        acc_devtest_set = [(function_to_use(n), g) for (n,g) in acc_devtest_names]
        acc_test_set = [(function_to_use(n), g) for (n,g) in acc_test_names]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier)
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["devtest_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_devtest_set))
        acc_errors = []
        for (name, tag) in acc_devtest_names:
            acc_guess = acc_classifier.classify(function_to_use(name))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, name) )
        acc_df["devtest_errors"].append(acc_errors)
    # Convert the collected results into a data frame with one row per run
    acc_df = pd.DataFrame.from_dict(acc_df)
    return acc_df

The dictionary is converted into a data frame at the end of the function, with one row for each run of the given feature function against the names in the names variable. The function, accuracy, takes a parameter called number_of_runs, which sets how many times the feature function is evaluated, so that its accuracy is averaged over many random splits rather than read from a single one. Ultimately the number settled on was 100.

Within the accuracy function itself, the names are shuffled for every run. After each shuffle, the first 500 names are used as the test set, the next 500 as the dev-test set, and the remaining names as the training set. The classifier for each run is kept, as is the list of dev-test errors.

Lastly, the data frame is returned; it is best stored in a user-defined variable.
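The shuffle-and-slice split described above can be sketched on its own, without nltk. This is only an illustration: the helper name `split_names`, its `seed` parameter, and the toy data are assumptions made for the example, and unlike the accuracy function it shuffles a copy rather than the original list.

```python
import random

def split_names(labeled_names, test_size=500, devtest_size=500, seed=None):
    """Shuffle a copy of the list and slice it into train, dev-test, and test parts."""
    rng = random.Random(seed)
    shuffled = list(labeled_names)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    train = shuffled[test_size + devtest_size:]
    return train, devtest, test

# Toy stand-in for the labeled names list (hypothetical data, not the corpus)
toy = [("name%d" % i, "male" if i % 2 else "female") for i in range(20)]
train, devtest, test = split_names(toy, test_size=3, devtest_size=3, seed=0)
print(len(train), len(devtest), len(test))  # 14 3 3
```

Because the three slices never overlap, no name can appear in both the training set and a held-out set within the same run.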

# Gender features
Natural Language Processing with Python, Chapter 6, provided two premade feature functions to check against the corpus of names. A third function was created so that its accuracy could be compared with that of the textbook's examples.

In [4]:
def textbook_gender_features_1(word):
    return {'last_letter': word[-1]}

This is the textbook's first example of testing for gender features. All it tests for is the last letter of the name.

In [5]:
def textbook_gender_features_2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

This is the textbook's second example of testing for gender features. It expands upon the previous example: besides the last letter of a given name, it also records the first letter, the number of times each letter of the alphabet appears, and whether or not each letter is present in the name at all.
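To make the size of this feature set concrete, here is a small standalone restatement of the same idea applied to one name. The name `letter_features` is a stand-in chosen for this sketch; the logic is meant to mirror textbook_gender_features_2.

```python
import string

def letter_features(name):
    """First letter, last letter, and a count and presence flag for every letter."""
    lowered = name.lower()
    features = {"firstletter": lowered[0], "lastletter": lowered[-1]}
    for letter in string.ascii_lowercase:
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = letter in lowered
    return features

features = letter_features("Ann")
print(len(features))         # 54
print(features["count(n)"])  # 2
print(features["has(b)"])    # False
```

Every name yields the same 54 keys (2 positional features plus 26 counts and 26 presence flags), most of which are zero or False for any single name.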

In [6]:
def function_gender_features(name):
    features = {}
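    # Working copy of the name. The cluster search below is case-sensitive, so a
    # cluster that includes the capitalized first letter of a name is not detected.
    temp_name = name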
    eng_cons_clusters = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th", "tr", "tw", "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr", "squ", "str", "thr"]
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    clusters = []
    for cluster in eng_cons_clusters[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["english_consonant_clusters_1"] = clusters[0] if len(clusters) > 0 else None
    features["english_consonant_clusters_2"] = clusters[1] if len(clusters) > 1 else None
    features["english_consonant_clusters_3"] = clusters[2] if len(clusters) > 2 else None
    return features

The third function reuses the first and last letter features from the textbook's second function. It also records the prefix and suffix of a name (its first and last three letters, or two letters for names of four letters or fewer) and up to three English consonant clusters found in the name, searching for the three-letter clusters before the two-letter ones.
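The cluster search can be isolated into a small helper to see what it returns for a single name. This is a sketch rather than the notebook's code: `find_clusters` and its `lowercase` flag are hypothetical names introduced here, with `lowercase=False` intended to reproduce the case-sensitive matching of function_gender_features.

```python
CLUSTERS = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr",
            "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th", "tr", "tw",
            "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr", "squ", "str", "thr"]

def find_clusters(name, lowercase=False):
    """Return the clusters found in name, checking three-letter clusters first."""
    text = name.lower() if lowercase else name  # `lowercase` is a hypothetical option
    found = []
    for cluster in CLUSTERS[::-1]:
        if cluster in text:
            # Remove the match so its letters are not counted again in a shorter cluster
            text = text.replace(cluster, "")
            found.append(cluster)
    return found

print(find_clusters("Christopher"))                  # ['st']
print(find_clusters("Christopher", lowercase=True))  # ['st', 'ch']
```

With case-sensitive matching the initial "Ch" is missed because the corpus names are capitalized; lowercasing first would also pick up word-initial clusters, though that variant was not the one evaluated in this notebook.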

# Testing accuracy
The aim of this project is to improve on the accuracy of the gender feature functions provided by the textbook. To compare them, each feature function is evaluated over 100 runs of the accuracy function.

In [7]:
textbook_df_1 = accuracy(100, textbook_gender_features_1)
textbook_df_1.describe()

| | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.762978 | 0.75984 | 0.75984 |
| std | 0.001871 | 0.018614 | 0.018279 |
| min | 0.759649 | 0.716 | 0.712 |
| 25% | 0.761629 | 0.7475 | 0.7475 |
| 50% | 0.762961 | 0.76 | 0.761 |
| 75% | 0.764113 | 0.7725 | 0.772 |
| max | 0.767425 | 0.808 | 0.814 |


The first function, while simplistic, has fairly impressive results; its mean accuracy is 76.0% on both the test and dev-test sets and 76.3% on the training set.
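The run-to-run spread on the held-out sets is roughly what sampling noise alone would produce with only 500 names. As a rough check, and assuming each prediction can be treated as an independent trial with the same success probability (a simplification), the binomial standard error can be computed from the reported mean test accuracy:

```python
import math

def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

# Mean test accuracy reported above for the first function, on a 500-name test set
se = binomial_standard_error(0.75984, 500)
print(round(se, 4))  # 0.0191
```

This is close to the reported test-set standard deviation of 0.018614, which suggests that most of the variation between runs comes from the small size of the held-out sets rather than from the classifier itself.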

In [8]:
textbook_df_2 = accuracy(100, textbook_gender_features_2)
textbook_df_2.describe()

| | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.778291 | 0.77312 | 0.77314 |
| std | 0.00204 | 0.016736 | 0.01776 |
| min | 0.774338 | 0.738 | 0.732 |
| 25% | 0.776642 | 0.76 | 0.76 |
| 50% | 0.778226 | 0.774 | 0.774 |
| 75% | 0.779558 | 0.784 | 0.7865 |
| max | 0.783842 | 0.804 | 0.816 |


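The second function provided by the textbook, while slightly more complex, had a mean accuracy that ranged from 77.3% (test and dev-test sets) to 77.8% (training set), a gain of roughly 1.3 to 1.5 percentage points over the first function. This suggested that looking into a few more features could produce a more substantial increase in accuracy.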

In [9]:
function_df = accuracy(100, function_gender_features)
function_df.describe()

| | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.88343 | 0.83058 | 0.83106 |
| std | 0.001477 | 0.016537 | 0.017267 |
| min | 0.880328 | 0.796 | 0.786 |
| 25% | 0.882344 | 0.82 | 0.8195 |
| 50% | 0.883497 | 0.83 | 0.833 |
| 75% | 0.884505 | 0.8425 | 0.842 |
| max | 0.887241 | 0.872 | 0.874 |


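The third function, developed for this project, was more complex than what the textbook offered. It produced a mean accuracy of 83.1% on the test and dev-test sets and 88.3% on the training set, with individual runs reaching 87.2% on the test set and 87.4% on the dev-test set. Its accuracy is noticeably higher on the training set than on held-out data, but its held-out accuracy still exceeds that of both textbook functions.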
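The difference between training and test accuracy can be read directly off the mean rows of the three summary tables. The short calculation below uses only those reported means; the dictionary layout and the helper name `gap_in_points` are choices made for this sketch.

```python
# Mean accuracies copied from the three describe() summaries above
reported_means = {
    "textbook_gender_features_1": {"train": 0.762978, "test": 0.75984},
    "textbook_gender_features_2": {"train": 0.778291, "test": 0.77312},
    "function_gender_features": {"train": 0.88343, "test": 0.83058},
}

def gap_in_points(means):
    """Training accuracy minus test accuracy, in percentage points."""
    return round((means["train"] - means["test"]) * 100, 1)

gaps = {name: gap_in_points(means) for name, means in reported_means.items()}
for name, gap in gaps.items():
    print(name, gap)
# textbook_gender_features_1 0.3
# textbook_gender_features_2 0.5
# function_gender_features 5.3
```

The larger gap for the third function is consistent with its prefix and suffix features partly memorizing training names, although the notebook does not test that explanation directly.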

# Conclusion
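The challenge of producing a feature function more accurate than those provided by the textbook was accomplished. Across the 100 runs, mean test and dev-test accuracy differ by less than 0.1 percentage points for all three functions, which is expected because both sets are redrawn at random from the same corpus on every run.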