# Data 620, Project 3
July 10, 2019 
Team 6: Alice Friedman, Scott Jones, Jeff Littlejohn, and Jun Pan

## Assignment Description
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

## Text Classification: Identifying Gender from the ```NLTK``` Names Corpus  

### `nltk`  

Adapted from the following sites:

- https://gist.github.com/vinovator/6e5bf1e1bc61687a1e809780c30d6bf6
- https://www.geeksforgeeks.org/python-gender-identification-by-name-using-nltk/

### Setup

First, we import the names corpus from ```nltk``` and create three sets of names. The sets are generated from a randomized shuffle of the combined male and female corpora: the test and dev-test sets hold 500 names each, and the training set holds the remainder.

- A training set, used to train the model based on our selected features

- A "dev-test" set, re-drawn on each iteration, which we will use to check progress on the gender identifier and perform error analysis

- A final "test" set, which we will use to test how well our predictions ultimately worked

We will attempt XXX different versions of the 

In [1]:
import nltk
from nltk.corpus import names
import random

In [326]:
mcorpus = [(name, "male") for name in names.words("male.txt")]
fcorpus = [(name, "female") for name in names.words("female.txt")]
random.shuffle(mcorpus); random.shuffle(fcorpus)
print(mcorpus[0:5],len(mcorpus))
print(fcorpus[0:5],len(fcorpus))

corpus = mcorpus + fcorpus
random.shuffle(corpus)
print(corpus[:5], len(corpus))

[('Lucio', 'male'), ('Donal', 'male'), ('Markus', 'male'), ('Iago', 'male'), ('Kelsey', 'male')] 2943
[('Antoinette', 'female'), ('Lorilyn', 'female'), ('Letti', 'female'), ('Ella', 'female'), ('Corry', 'female')] 5001
[('Niall', 'male'), ('Munroe', 'male'), ('Samuella', 'female'), ('Francis', 'female'), ('Zoe', 'female')] 7944


Python slices lists from the first index up to *but not including* the second index. This is counter-intuitive, but it makes consecutive slicing easier: `list[:10] + list[10:]` returns the complete, original list.
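
A quick check of the half-open slicing rule, using a made-up list (the values are illustrative only):

```python
# Illustrative list; any list behaves the same way
items = list(range(25))

head = items[:10]   # indices 0..9 (index 10 is excluded)
tail = items[10:]   # indices 10..24

# Consecutive slices that share a boundary rebuild the original list
assert head + tail == items
print(len(head), len(tail))
```

This is the property the slicing code in this notebook relies on: cutting at `final_test_n` and again at `dev_test_n` never drops or duplicates a name.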

From the textbook: "Each time the error analysis is repeated, we should select a *different* dev-test/training split, to ensure that the classifier does not start to reflect the idiosyncracies in the dev-test set." (pg 227)

To prevent a lot of re-coding, we can write a function to remix the development-training corpus, after setting aside the first 500 names of the shuffled corpus as the final test slice.

In [340]:
# Create a function to return a new training and dev-test mix of the corpus for each iteration of the model
def reslicer(corpus):
    final_test_n = 500 # can be adjusted
    dev_test_n = 500 # can be adjusted
    
    test_corpus = corpus[:final_test_n] #reserve first 500 for the final test
    print("Test Corpus Sample: ", test_corpus[0:3], ", Length: ", len(test_corpus))
    
    dev_set = corpus[final_test_n:] #create a copy of the dev_set to preserve the original test set before shuffling
    random.shuffle(dev_set) #remix before re-slicing
    
    dev_test_corpus = dev_set[:dev_test_n]
    print("Dev-Test Corpus Sample: ",dev_test_corpus[0:3], ", Length: ", len(dev_test_corpus)) #should have length 500
    
    train_corpus = dev_set[dev_test_n:]
    print("Training Corpus Sample: ",train_corpus[0:3], ", Length: ", len(train_corpus)) #should be longer

    return train_corpus, dev_test_corpus, test_corpus


In [341]:
train_names, dev_names, test_names = reslicer(corpus)

Test Corpus Sample:  [('Niall', 'male'), ('Munroe', 'male'), ('Samuella', 'female')] , Length:  500
Dev-Test Corpus Sample:  [('Letitia', 'female'), ('Kylen', 'female'), ('Judy', 'male')] , Length:  500
Training Corpus Sample:  [('Theophyllus', 'male'), ('Hubert', 'male'), ('Nanice', 'female')] , Length:  6944


In [342]:
#re-running the code will remix the training and dev-test sets, while leaving the original test names intact
train_names, dev_names, test_names = reslicer(corpus)

Test Corpus Sample:  [('Niall', 'male'), ('Munroe', 'male'), ('Samuella', 'female')] , Length:  500
Dev-Test Corpus Sample:  [('Greg', 'male'), ('Camille', 'female'), ('Frannie', 'female')] , Length:  500
Training Corpus Sample:  [('Belita', 'female'), ('Marijo', 'female'), ('Gunvor', 'female')] , Length:  6944


### Function definitions to process and classify the data
Because we will be changing the feature function each time, we can create a series of functions to process and classify the data to minimize repeated code.

In [351]:
# Define function to process the names through feature extractor
def feature_ext(feature_func, corpus):
    
    #first, remix and reslice the data to ensure we are using a new mix of dev-test and training data each time
    train_names, dev_names, test_names = reslicer(corpus)
    
    #then, extract features from the names slices
    train_set = [(feature_func(n), gender) for (n, gender) in train_names]
    devtest_set = [(feature_func(n), gender) for (n, gender) in dev_names]
    test_set = [(feature_func(n), gender) for (n, gender) in test_names]
    
    #also return the dev-test names so error analysis uses the same slice the classifier is scored on
    return train_set, devtest_set, test_set, dev_names

In [423]:
import pandas as pd  # needed for the error table

def test_model(feature_func, corpus):
    # Run the feature_ext function to create the necessary labeled feature sets
    train_set, devtest_set, test_set, dev_names = feature_ext(feature_func, corpus)
    
    # Train the Naive Bayes classifier
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Test the accuracy of the classifier on the dev data--this is so we can evaluate errors and make tweaks
    a = round(nltk.classify.accuracy(classifier, devtest_set), 4)*100
    accuracy = f'{a:.2f}'
    print("\n")
    print("Model is %s percent accurate" % accuracy)
    print("\n")
    # examine classifier to determine which feature values are most effective for
    # distinguishing the name's gender (the method prints its own table and returns None)
    classifier.show_most_informative_features(10)
    
    def find_errors(dev_names, feature_func):
        errors = {'name' : [], 'label' : [], 'guess' : [], 'features': [] }
        for (name, label) in dev_names:
            features = feature_func(name)
            guess = classifier.classify(features)
            if guess != label:
                errors['name'].append(name)
                errors['label'].append(label)
                errors['guess'].append(guess)
                errors['features'].append(features)
        errors = pd.DataFrame(errors)
        print("\nErrors")
        # sample at most 20 rows so short error lists do not raise an exception
        print(errors.sample(min(20, len(errors))))
        return errors
    
    errors = find_errors(dev_names, feature_func)
    
    return errors
    


### Training on the last letter of the name

Now that we have our functions set up, we can define some different feature extraction functions and test each one.

First, we will test a model by examining only the last letter of each name. We will use ```nltk```'s built-in Naive Bayes classifier to train the model based on this feature.

In [424]:
def last_letter(word): #first feature function to test
    
    return {"last_letter": word[-1]}

In [425]:
test_model(last_letter, corpus)

Test Corpus Sample:  [('Niall', 'male'), ('Munroe', 'male'), ('Samuella', 'female')] , Length:  500
Dev-Test Corpus Sample:  [('Penni', 'female'), ('Prissie', 'female'), ('Parker', 'male')] , Length:  500
Training Corpus Sample:  [('Dani', 'female'), ('Essa', 'female'), ('Anne-Corinne', 'female')] , Length:  6944


Model is 76.40 percent accurate


Most Informative Features
             last_letter = 'a'            female : male   =     41.1 : 1.0
             last_letter = 'k'              male : female =     28.4 : 1.0
             last_letter = 'f'              male : female =     26.6 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : female =      9.9 : 1.0
             last_letter = 'd'              male : female =      9.5 : 1.0
             last_letter = 'o'              male : female =      8.7 : 1.0
             last_letter = 'm'              male : female =      7.9 : 1.0
             last_lette

These are interesting results! A final `a` is the *only* last letter in the top ten that predicts female names instead of male names.
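
The ratios in the table compare how often a feature value appears under each label. A minimal sketch of that calculation on a made-up toy sample follows. The names in `toy` are illustrative, `likelihood_ratio` is a hypothetical helper, and the additive smoothing of 0.5 is an assumption meant to resemble the classifier's default estimator rather than reproduce it exactly.

```python
from collections import Counter

# Toy labeled names, made up for illustration (not the NLTK corpus)
toy = [("Anna", "female"), ("Maria", "female"), ("Ella", "female"),
       ("Ruth", "female"), ("Joshua", "male"), ("Mark", "male"),
       ("Frank", "male"), ("Derek", "male")]

def likelihood_ratio(data, letter, smoothing=0.5):
    """Return P(last letter | female) / P(last letter | male), smoothed."""
    counts = {"female": Counter(), "male": Counter()}
    for name, label in data:
        counts[label][name[-1].lower()] += 1
    # Number of distinct last letters seen, used as the number of bins
    bins = len(set(counts["female"]) | set(counts["male"]))
    probs = {}
    for label, c in counts.items():
        total = sum(c.values())
        probs[label] = (c[letter] + smoothing) / (total + smoothing * bins)
    return probs["female"] / probs["male"]

ratio_a = likelihood_ratio(toy, "a")
ratio_k = likelihood_ratio(toy, "k")
print(f"a: {ratio_a:.2f} : 1 (female : male)")
print(f"k: 1 : {1 / ratio_k:.2f} (female : male)")
```

A ratio far from 1 in either direction marks a feature value that separates the two labels well, which is why `a`, `k`, and `f` head the list.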

Does this basic model work for our names? Let's write a short script to test our names and print a result.
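
One way to write that script is sketched below. It assumes a trained classifier object exposing a `classify(features)` method, as NLTK classifiers do. Since `test_model` does not return its classifier, a stand-in `LastLetterRule` class (hypothetical, a hand-written rule rather than a trained model) is used here so the snippet runs on its own. The names are the team's first names from the header.

```python
def last_letter(word):
    # Same feature extractor as defined above
    return {"last_letter": word[-1]}

class LastLetterRule:
    """Hypothetical stand-in for a trained classifier (NOT a trained model)."""
    def classify(self, features):
        # Crude hand-written rule, for demonstration only
        return "female" if features["last_letter"] in "aeiy" else "male"

def classify_names(classifier, feature_func, names):
    """Return (name, predicted label) pairs for each name."""
    return [(n, classifier.classify(feature_func(n))) for n in names]

team = ["Alice", "Scott", "Jeff", "Jun"]
results = classify_names(LastLetterRule(), last_letter, team)
for name, guess in results:
    print(f"{name}: {guess}")
```

To use the real model, pass the trained Naive Bayes classifier in place of `LastLetterRule()`; `classify_names` itself does not change.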

## Insert Data Viz?

### Training on the last 3 letters of the name

Next, we add the last three letters as a second feature alongside the last letter.

In [426]:
def two_features(word):
    return {"last_letter": word[-1], "last3letters": word[-3:]}  # feature set

#note, we are automatically re-slicing the training/dev-test slices
errors = test_model(two_features, corpus) #setting the result to errors will allow us to investigate further

Test Corpus Sample:  [('Niall', 'male'), ('Munroe', 'male'), ('Samuella', 'female')] , Length:  500
Dev-Test Corpus Sample:  [('Dennis', 'male'), ('Alf', 'male'), ('Salem', 'male')] , Length:  500
Training Corpus Sample:  [('Selie', 'female'), ('Elihu', 'male'), ('Reagan', 'male')] , Length:  6944


Model is 81.80 percent accurate


Most Informative Features
             last_letter = 'a'            female : male   =     40.8 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'f'              male : female =     26.4 : 1.0
            last3letters = 'ita'          female : male   =     25.4 : 1.0
            last3letters = 'ana'          female : male   =     23.7 : 1.0
            last3letters = 'tta'          female : male   =     23.2 : 1.0
            last3letters = 'nne'          female : male   =     19.1 : 1.0
            last3letters = 'ard'            male : female =     18.6 : 1.0
            last3letters = 'vin'       

Adding the last three letters increased our accuracy by about 5 percentage points! Let's see if we can learn anything from the remaining errors to improve our model further.
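
One simple way to mine the errors is to tally them by a feature value and look for clusters. The sketch below assumes the errors are available as `(name, label, guess)` tuples. The sample rows are invented for illustration and are not results from the model above, and `tally_errors` is a hypothetical helper.

```python
from collections import Counter

# Invented example errors: (name, true label, predicted label)
sample_errors = [
    ("Kelsey", "male", "female"),
    ("Francis", "female", "male"),
    ("Judy", "male", "female"),
    ("Gillian", "female", "male"),
    ("Tracy", "male", "female"),
]

def tally_errors(errors, key=lambda name: name[-1].lower()):
    """Count misclassifications grouped by (key(name), true label)."""
    return Counter((key(name), label) for name, label, _guess in errors)

by_last_letter = tally_errors(sample_errors)
for (letter, label), n in by_last_letter.most_common():
    print(f"names ending in '{letter}' labeled {label}: {n} error(s)")
```

Passing a different `key`, such as `lambda name: name[-2:]`, groups the same errors by suffix instead, which can suggest the next feature to try.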

# Maybe can cut this section -- or use for more error analysis?

### Analysis of name parts

Next, let's check whether the presence or absence of particular letters, or consecutive groups of letters, tells us anything about the names.

First, we create a function that returns every run of consecutive letters in a name, including the name itself, each wrapped in a one-entry feature dictionary. So, for example, for the name "Noah", it returns the substrings:
```['N', 'No', 'Noa', 'Noah', 'o', 'oa', 'oah', 'a', 'ah', 'h']```

In [410]:
def name_parts(name):
    # Collect every contiguous substring of the name as a one-feature dict
    parts = []
    for start in range(len(name)):
        for end in range(start + 1, len(name) + 1):
            parts.append({'feature': name[start:end]})
    return parts
print(name_parts("Noah"))

[{'feature': 'N'}, {'feature': 'No'}, {'feature': 'Noa'}, {'feature': 'Noah'}, {'feature': 'o'}, {'feature': 'oa'}, {'feature': 'oah'}, {'feature': 'a'}, {'feature': 'ah'}, {'feature': 'h'}]


In [266]:
def parts_set(name_set):
    # Pair every substring feature of every name with that name's label
    labeled_parts = []
    for name, label in name_set:
        labeled_parts += [(item, label) for item in name_parts(name)]
    return labeled_parts

In [428]:
train_set2 = parts_set(train_names)
dev_set2 = parts_set(dev_names)

In [429]:
# Train the naiveBayes classifier
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)

# Test the accuracy of the classifier on the dev-test data
# (each substring is scored as its own example, so this is per-substring accuracy, not per-name)
accuracy = round(nltk.classify.accuracy(classifier2, dev_set2), 2)*100

print("Model is %d percent accurate" % accuracy)
print("")

# examine classifier to determine which feature is most effective for
# distinguishing the name's gender
print(classifier2.show_most_informative_features(10))

Model is 70 percent accurate

Most Informative Features
                 feature = 'tte'          female : male   =     37.9 : 1.0
                 feature = 'rv'             male : female =     30.6 : 1.0
                 feature = 'tta'          female : male   =     23.4 : 1.0
                 feature = 'iss'          female : male   =     22.6 : 1.0
                 feature = 'hu'             male : female =     22.0 : 1.0
                 feature = 'ing'            male : female =     18.8 : 1.0
                 feature = 'etta'         female : male   =     18.4 : 1.0
                 feature = 'ton'            male : female =     17.4 : 1.0
                 feature = 'Ros'          female : male   =     16.8 : 1.0
                 feature = 'rw'             male : female =     16.6 : 1.0
None


## `sklearn`  

From the site: https://blog.ayoungprogrammer.com/2016/04/determining-gender-of-name-with-80.html/    

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

Create the names corpus with gender assignment; each name is converted to lowercase.

In [None]:
labeled_names = ([(name.lower(), "M") for name in names.words("male.txt")] +
                 [(name.lower(), "F") for name in names.words("female.txt")])

my_data = np.asarray(labeled_names) 

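Using `sklearn`, we train models on numeric encodings of the letters in each name. The first version counts how many times each letter occurs, with `a` at index 0, `b` at index 1, and so on. Later versions add position information and encode individual letters as integers, where a=1, b=2, c=3, ...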
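
To make the encoding concrete, here is a sketch of a letter-count vector for a single name, written with a plain Python list so it runs without NumPy. The helper `letter_counts` is hypothetical and uses 26 slots, one per letter. It illustrates the idea and is not the exact function used in this notebook.

```python
def letter_counts(name):
    """Count occurrences of each letter a-z; slot 0 is 'a', slot 25 is 'z'."""
    counts = [0] * 26
    for ch in name.lower():
        if "a" <= ch <= "z":          # skip hyphens, spaces, apostrophes
            counts[ord(ch) - ord("a")] += 1
    return counts

vec = letter_counts("Hannah")
# Report only the non-zero slots as letter -> count
nonzero = {chr(i + ord("a")): c for i, c in enumerate(vec) if c}
print(nonzero)
```

Every name maps to a vector of the same length regardless of how long the name is, which is what lets a classifier such as a random forest consume it.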

In [232]:
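def name_count(name):
    # One counter per letter: index 0 is 'a', index 1 is 'b', and so on
    arr = np.zeros(65)
    for x in name:
        # Characters outside a-z (such as '-') give a negative offset,
        # which NumPy indexes from the end of the array
        arr[ord(x)-ord('a')] += 1
    return arr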

name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xlist = name_map(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7219679633867276
0.7292143401983219
0.7315026697177727
0.7257818459191457
0.7265446224256293
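
Because each run draws a different random split, the five accuracies differ slightly. A short sketch using only the standard library summarizes them; the values are copied from the output above.

```python
from statistics import mean, stdev

# Accuracies copied from the five runs printed above
accuracies = [
    0.7219679633867276,
    0.7292143401983219,
    0.7315026697177727,
    0.7257818459191457,
    0.7265446224256293,
]

avg = mean(accuracies)
spread = stdev(accuracies)  # sample standard deviation across splits
print(f"mean accuracy: {avg:.3f}, std dev: {spread:.3f}")
```

Comparing means with their spread makes it easier to judge whether a later feature set is a real improvement or within the noise of the random splits.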


In [186]:
def name_count2(name):
    arr = np.zeros(65+26)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    return arr

name_map2 = np.vectorize(name_count2, otypes=[np.ndarray])
Xlist = name_map2(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7829900839054157
0.7807017543859649
0.7864225781845919
0.7688787185354691
0.7936689549961862


In [193]:
def name_count3(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    return arr

name_map3 = np.vectorize(name_count3, otypes=[np.ndarray])
Xlist = name_map3(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7963386727688787
0.7967200610221206
0.7894736842105263
0.7917620137299771
0.7921434019832189


In [194]:
def name_count4(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')+1
    # Second Last character
    arr[-2] = ord(name[-2])-ord('a')+1
    # Length of name
    arr[-1] = len(name)
    return arr

name_map4 = np.vectorize(name_count4, otypes=[np.ndarray])
Xlist = name_map4(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.8180778032036613
0.8188405797101449
0.8096872616323417
0.8032036613272311
0.8081617086193745


In [195]:
def name_count7(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')+1
    # Second Last character
    arr[-2] = ord(name[-2])-ord('a')+1
    # Length of name
    arr[-1] = len(name)
    return arr

name_map7 = np.vectorize(name_count7, otypes=[np.ndarray])
Xlist = name_map7(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(clf.feature_importances_.argsort()[-10:][::-1])
    print(np.mean(clf.predict(Xte) == yte))

[1797   26 1798   40   30   34    0   43   29    8]
0.8081617086193745
[1797   26 1798   40   30    0   34   43   22   29]
0.8119755911517925
[1797   26 1798   40    0   30   34   43 1799    8]
0.8005339435545386
[1797   26 1798   40   30    0   34   43   29 1799]
0.8012967200610221
[1797   26 1798   40    0   30   34   43   29 1799]
0.8154080854309688


The following code trains a model based on the last letter of the name.

In [224]:
def name_count8(name):
    arr = np.zeros(1)
    arr[0] = ord(name[-1])-ord('a')+1
    return arr

name_map8 = np.vectorize(name_count8, otypes=[np.ndarray])
Xlist = name_map8(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.761632341723875
0.7715484363081617
0.7528604118993135
0.7570556826849733
0.7589626239511823


The following code trains a model based on the last three letters of the name.

In [228]:
def name_count9(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1   # last letter
    arr[1] = ord(name[-2])-ord('a')+1   # second-to-last letter
    # third-to-last letter, left at 0 for names shorter than three letters
    if len(name) >= 3:
        arr[2] = ord(name[-3])-ord('a')+1
    
    return arr

name_map9 = np.vectorize(name_count9, otypes=[np.ndarray])
Xlist = name_map9(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7890922959572845
0.7913806254767353
0.8012967200610221
0.7940503432494279
0.8032036613272311


In [221]:
def name_count10(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    
    return arr

name_map10 = np.vectorize(name_count10, otypes=[np.ndarray])
Xlist = name_map10(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))


0.7784134248665141
0.7738367658276125
0.7734553775743707
0.7837528604118993
0.7738367658276125


In [207]:
idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False)
Xname = [x[0] for x in my_data]
xs = [Xname[x] for x in idx]
ys = [y[x] for x in idx]
pred = clf.predict(X[idx])

for a,b, p in zip(xs,ys, pred):
    print(a,b, p)

reid M M
galina F F
bishop M M
channa F F
uta F F
tracy M M
gillian F M
cindi F F
eustacia F F
raleigh M M


In [231]:
for x in range(5): 
    print(X[x])
    print(Xname[x])

[18.  9. 13.]
aamir
[14. 15. 18.]
aaron
[25.  5.  2.]
abbey
[5. 9. 2.]
abbie
[20. 15.  2.]
abbot
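
The vectors printed above match the last-three-letters encoding: the last, second-to-last, and third-to-last letters, each mapped to a=1 through z=26. A NumPy-free sketch (the helper `last_three_codes` is hypothetical) reproduces them for the same five names:

```python
def last_three_codes(name):
    """Encode the last, second-to-last, and third-to-last letters as a=1..z=26.

    Assumes a lowercase name of at least three letters.
    """
    return [ord(name[-k]) - ord("a") + 1 for k in (1, 2, 3)]

for n in ["aamir", "aaron", "abbey", "abbie", "abbot"]:
    print(n, last_three_codes(n))
```

Reading the first row, `aamir` ends in `r` (18), preceded by `i` (9) and `m` (13), which is the `[18. 9. 13.]` shown in the output.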
