# Data 620, Project 3
July 10, 2019 
Team 6: Alice Friedman, Scott Jones, Jeff Littlejohn, and Jun Pan

## Assignment Description
Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

### Text Classification: Identifying Gender from the ```NLTK``` Names Corpus  

Adapted from:

- [GitHub, Vinovator](https://gist.github.com/vinovator/6e5bf1e1bc61687a1e809780c30d6bf6)

- [Geeks for Geeks: Python Gender Identification by Name](https://www.geeksforgeeks.org/python-gender-identification-by-name-using-nltk/)


First, we import the names corpus from the ```nltk``` collection of corpora and build three lists of labeled names: male, female, and a shuffled combination of the two.

In [51]:
import nltk
from nltk.corpus import names
import random
import pandas as pd
import matplotlib.pyplot as plt

In [21]:
mcorpus = [(name, "male") for name in names.words("male.txt")]
fcorpus = [(name, "female") for name in names.words("female.txt")]
random.shuffle(mcorpus); random.shuffle(fcorpus)
print(mcorpus[0:5],len(mcorpus))
print(fcorpus[0:5],len(fcorpus))

corpus = mcorpus + fcorpus
random.shuffle(corpus)
print(corpus[:5], len(corpus))

[('Mark', 'male'), ('Bjorn', 'male'), ('Stephan', 'male'), ('Hyatt', 'male'), ('Romain', 'male')] 2943
[('Sheilah', 'female'), ('Carolyne', 'female'), ('Devondra', 'female'), ('Tessi', 'female'), ('Jean', 'female')] 5001
[('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female'), ('Markus', 'male'), ('Petunia', 'female')] 7944


Before examining the data for patterns, we subdivide the shuffled names corpus as follows:
- A training set, used to train the model based on our selected features

- A development test (dev-test) set, which we will use to test progress on the gender identifier and perform error analysis

- A final "test" set, which we will use to test how well our predictions ultimately worked

In order to avoid overfitting, we will remix the training and dev-test sets with each new feature extraction model. To avoid repeated code, we write a function that sets aside the first 500 names of the shuffled corpus as the final test slice and then remixes the remaining development-training names.

In [22]:
# Create a function to return a new training and dev-test mix of the corpus for each iteration of the model
def reslicer(corpus):
    
    #prints message to explain output
    print("Reslicer returns 3 sliced, remixed set of corpuses:")
    print("\tThe first returned value is the remixed training corpus, length is variable")
    print("\tThe second returned value is the remixed dev-test corpus, length is 500")
    print("\tThe third returned value is the un-remixed test set, length is 500\n")
    
    final_test_n = 500 # per assignment instructions
    dev_test_n = 500 # per assignment instructions
    
    #reserve first 500 for the final test    
    test_corpus = corpus[:final_test_n] 
    
    #create a copy of the dev_set to preserve the original test set before shuffling
    dev_set = corpus[final_test_n:] 
    random.shuffle(dev_set) #remix before re-slicing
    
    #re-cut re-shuffled development set into dev-test set (len 500) and training set (remainder)
    dev_test_corpus = dev_set[:dev_test_n]
    train_corpus = dev_set[dev_test_n:]
    
    #prints sample of sets
    print("Training Corpus Sample: ",train_corpus[0:3], ", Length: ", len(train_corpus)) #should be longer
    print("Dev-Test Corpus Sample: ",dev_test_corpus[0:3], ", Length: ", len(dev_test_corpus)) #should have length 500
    print("Test Corpus Sample: ", test_corpus[0:3], ", Length: ", len(test_corpus))
    
    return train_corpus, dev_test_corpus, test_corpus


Using the above function, we can split the names corpus into a training set, dev-test set, and test set.

In [23]:
train_names, dev_names, test_names = reslicer(corpus)

Reslicer returns 3 sliced, remixed set of corpuses:
	The first returned value is the remixed training corpus, length is variable
	The second returned value is the remixed dev-test corpus, length is 500
	The third returned value is the un-remixed test set, length is 500

Training Corpus Sample:  [('Roxanne', 'female'), ('Charmane', 'female'), ('Teddy', 'female')] , Length:  6944
Dev-Test Corpus Sample:  [('Vikki', 'female'), ('Jerold', 'male'), ('Tomi', 'female')] , Length:  500
Test Corpus Sample:  [('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female')] , Length:  500


Re-running the code will remix the training and dev-test sets, while leaving the original test names intact, allowing us to quickly run new models.

In [24]:
train_names, dev_names, test_names = reslicer(corpus)

Reslicer returns 3 sliced, remixed set of corpuses:
	The first returned value is the remixed training corpus, length is variable
	The second returned value is the remixed dev-test corpus, length is 500
	The third returned value is the un-remixed test set, length is 500

Training Corpus Sample:  [('Gerhardine', 'female'), ('Broddy', 'male'), ('Vale', 'female')] , Length:  6944
Dev-Test Corpus Sample:  [('Kiri', 'female'), ('Manda', 'female'), ('Tedda', 'female')] , Length:  500
Test Corpus Sample:  [('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female')] , Length:  500
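
Because `random.shuffle` is unseeded, the training and dev-test slices above differ on every run. A minimal sketch of a reproducible variant follows. The helper name `split_names` and its `seed` parameter are illustrative assumptions, not part of the notebook, and the toy list stands in for the names corpus.

```python
import random

def split_names(labeled, n_test=500, n_dev=500, seed=None):
    """Hypothetical helper: hold out the first n_test items, then shuffle and split the rest."""
    test = labeled[:n_test]            # fixed hold-out slice, never reshuffled
    rest = labeled[n_test:]            # slicing copies, so `labeled` is not mutated
    random.Random(seed).shuffle(rest)  # a private RNG leaves the global random state untouched
    return rest[n_dev:], rest[:n_dev], test

# Toy stand-in for the labeled corpus (assumption: 20 fake names)
toy = [("name%d" % i, "female" if i % 2 else "male") for i in range(20)]
train_a, dev_a, test_a = split_names(toy, n_test=5, n_dev=5, seed=42)
train_b, dev_b, test_b = split_names(toy, n_test=5, n_dev=5, seed=42)
print(len(train_a), len(dev_a), len(test_a))
print(dev_a == dev_b)  # same seed, same dev-test slice
```

Passing a different seed for each model would keep the remixing behavior while making any single run repeatable.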


Before proceeding, let's take a look at the labeled data. To avoid biasing our feature choices, we will only look at the training set.

In [47]:
def name_parts(name):
    # Return every contiguous substring of name, grouped by starting position
    parts = []
    for start in range(len(name)):
        for end in range(start + 1, len(name) + 1):
            parts.append(name[start:end])
    return parts
print(name_parts("Noah"))

['N', 'No', 'Noa', 'Noah', 'o', 'oa', 'oah', 'a', 'ah', 'h']


In [55]:
name_dict = {'label':[], 'feature':[], 'name':[]}

for name, label in train_names:
    parts = name_parts(name)
    for part in parts:
        name_dict['label'].append(label)
        name_dict['feature'].append(part)
        name_dict['name'].append(name)

data = pd.DataFrame(name_dict)

In [66]:
counts = data.groupby(['feature', 'label'])['feature'].count()
counts = counts.sort_values(ascending = False)
counts.head(20)

feature  label 
a        female    3583
e        female    3415
i        female    2567
n        female    2308
l        female    2026
r        female    1757
e        male      1609
a        male      1274
r        male      1237
n        male      1073
i        male      1049
t        female    1021
o        female     946
         male       925
l        male       873
y        female     796
s        female     775
d        female     616
t        male       572
h        female     557
Name: feature, dtype: int64
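
The raw counts above are dominated by the female class, which is about 1.7 times the size of the male class in the full corpus (5,001 versus 2,943 names), so a letter can look "female" simply because more female names exist. One way to correct for this is to compare within-class shares. The sketch below uses a tiny made-up sample and a hypothetical helper, `last_letter_shares`; it is an illustration, not the analysis the notebook ran.

```python
from collections import Counter, defaultdict

def last_letter_shares(labeled_names):
    """Return {label: {letter: share of that label's names ending in letter}}."""
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        counts[label][name[-1].lower()] += 1
    return {
        label: {letter: n / sum(c.values()) for letter, n in c.items()}
        for label, c in counts.items()
    }

# Made-up sample, for illustration only
sample = [("Anna", "female"), ("Maria", "female"), ("Jean", "female"),
          ("Carol", "female"), ("Mark", "male"), ("Joshua", "male")]
shares = last_letter_shares(sample)
print(shares["female"]["a"])  # 2 of 4 female names end in 'a'
print(shares["male"]["a"])    # 1 of 2 male names end in 'a'
```

In this toy sample the raw counts (2 versus 1) suggest a female skew for a final 'a', while the within-class shares are equal.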

Now that we've taken a look at the data, we can start to work on developing a model.

First, we can combine the ```reslicer``` function with a feature extractor function, ```feature_ext```, to generate labeled data with features.

In [76]:
# Define function to process the names through feature extractor
def feature_ext(feature_func, corpus):
    
    #first, remix and reslice the data to ensure we are using a new mix of dev-test and training data each time
    train_names, dev_names, test_names = reslicer(corpus)
    
    #then, extract features from the names slices
    train_set = [(feature_func(n), gender) for (n, gender) in train_names]
    devtest_set = [(feature_func(n), gender) for (n, gender) in dev_names]

    
    return train_set, devtest_set

Finally, a ```test_model``` function will combine all of the above to provide feedback on the feature extraction method selected to develop a model.

In [96]:
def find_errors(names, feature_func, classifier):
    # Collect every name in `names` whose predicted label differs from its true label
    errors = {'name' : [], 'label' : [], 'guess' : [], 'features': [] }
    for (name, label) in names:
        features = feature_func(name)
        guess = classifier.classify(features)

        if guess != label:
            errors['name'].append(name)
            errors['label'].append(label)
            errors['guess'].append(guess)
            errors['features'].append(features)

    errors = pd.DataFrame(errors)

    # Prints a sample of up to 20 errors
    print("\nErrors")
    print(errors.sample(min(20, len(errors))))

    return errors

In [177]:
def test_model(feature_func, corpus):
    
    # Run the feature_ext function to create the necessary labeled feature sets
    train_set, devtest_set = feature_ext(feature_func, corpus)
    
    # Train on the training set using the naiveBayes classifier built in to nltk
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Test the accuracy of the classifier on the dev data--this is so we can evaluate errors and make tweaks
    a = round(nltk.classify.accuracy(classifier, devtest_set), 4)*100
    
    # Format results as a 2 digit decimal
    accuracy = f'{a:.2f}'
    
    # Print message with results
    print("\n")
    print("Model is %s percent accurate" % accuracy)
    print("\n")
    
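    # Examine classifier to determine which features are most effective for predicting gender
    # (show_most_informative_features prints its own output and returns None)
    classifier.show_most_informative_features(10)
    
    # Runs errors function
    # Note: dev_names is the global slice from the earlier reslicer call, not the
    # slice feature_ext just drew, so it can overlap with this model's training names
    errors = find_errors(dev_names, feature_func, classifier)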
    
    return classifier

### Training on the last letter of the name

Now that we have our functions set up, we can use them to test different feature extraction functions, starting with the last letter.

In [178]:
def last_letter(name): #first feature extraction function to test
    
    return {"last_letter": name[-1]}

In [179]:
last_letter_model = test_model(last_letter, corpus)

Reslicer returns 3 sliced, remixed set of corpuses:
	The first returned value is the remixed training corpus, length is variable
	The second returned value is the remixed dev-test corpus, length is 500
	The third returned value is the un-remixed test set, length is 500

Training Corpus Sample:  [('Koo', 'female'), ('Gerry', 'female'), ('Evania', 'female')] , Length:  6944
Dev-Test Corpus Sample:  [('Prudi', 'female'), ('Rycca', 'female'), ('Pet', 'female')] , Length:  500
Test Corpus Sample:  [('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female')] , Length:  500


Model is 74.80 percent accurate


Most Informative Features
             last_letter = 'a'            female : male   =     32.9 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'f'              male : female =     24.5 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : femal

These are interesting results! 
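
Each ratio in the table compares how likely a feature value is under one label versus the other. The sketch below computes such a ratio from counts with add-0.5 smoothing. The counts are made up, the helper `likelihood_ratio` is hypothetical, and the smoothing is a simplification of what ```nltk``` does internally, so the numbers will not reproduce the table.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, n_values=26, gamma=0.5):
    """Smoothed P(value | label A) / P(value | label B).

    n_values is the number of distinct values the feature can take
    (assumption: 26 possible last letters).
    """
    p_a = (count_a + gamma) / (total_a + gamma * n_values)
    p_b = (count_b + gamma) / (total_b + gamma * n_values)
    return p_a / p_b

# Made-up counts: 1000 of 3000 female names and 10 of 2000 male names end in 'a'
ratio = likelihood_ratio(1000, 3000, 10, 2000)
print(round(ratio, 1))
```

A ratio far above 1 means the feature value is strong evidence for the first label, and the smoothing keeps the ratio finite even when one label never shows the value.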

Does this basic model work for our names? Using the output from our model, we can try. 

In [180]:
team6names = ['Alice', 'Jun', 'Scott', 'Jeff']
for name in team6names:
    print ("Name: "+name+". Guess: ", last_letter_model.classify(last_letter(name)))

Name: Alice. Guess:  female
Name: Jun. Guess:  male
Name: Scott. Guess:  male
Name: Jeff. Guess:  male


It works! Let's see if we can do any better by adding an additional feature.

### Training on the last 3 letters of the name + last letter

Next, we try adding the last 3 letters as well as just the last letter. 

In [181]:
def two_features(name):
    return {"last_letter": name[-1], "last3letters": name[-3:]}  # feature set

#note, we are automatically re-slicing the training/dev-test slices
two_features_model = test_model(two_features, corpus) 

Reslicer returns 3 sliced, remixed set of corpuses:
	The first returned value is the remixed training corpus, length is variable
	The second returned value is the remixed dev-test corpus, length is 500
	The third returned value is the un-remixed test set, length is 500

Training Corpus Sample:  [('Shaylynn', 'female'), ('Lil', 'female'), ('Augustine', 'female')] , Length:  6944
Dev-Test Corpus Sample:  [('Earle', 'male'), ('Kizzee', 'female'), ('Betteanne', 'female')] , Length:  500
Test Corpus Sample:  [('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female')] , Length:  500


Model is 77.80 percent accurate


Most Informative Features
             last_letter = 'a'            female : male   =     33.9 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
            last3letters = 'ana'          female : male   =     24.8 : 1.0
            last3letters = 'tta'          female : male   =     22.2 : 1.0
            last3letters = 'ard'            m

Adding the last three letters increased our accuracy by about 3 percentage points (from 74.80 to 77.80 percent), although each run is scored on a different dev-test slice. Let's see if we can learn anything from the remaining errors to improve our model further.

### Three features
   
If two features are better than one, will three be even better?

In [182]:
def three_features(name):
    return {"last_letter": name[-1], "last3letters": name[-3:], "first_letter": name[0]}  # feature set

#note, we are automatically re-slicing the training/dev-test slices
three_features_model = test_model(three_features, corpus) 

Reslicer returns 3 sliced, remixed set of corpuses:
	The first returned value is the remixed training corpus, length is variable
	The second returned value is the remixed dev-test corpus, length is 500
	The third returned value is the un-remixed test set, length is 500

Training Corpus Sample:  [('Kass', 'female'), ('Lorne', 'male'), ('Reynolds', 'male')] , Length:  6944
Dev-Test Corpus Sample:  [('Ahmet', 'male'), ('Cherie', 'female'), ('Silvia', 'female')] , Length:  500
Test Corpus Sample:  [('Vikky', 'female'), ('Tamera', 'female'), ('Melisande', 'female')] , Length:  500


Model is 83.00 percent accurate


Most Informative Features
             last_letter = 'a'            female : male   =     34.3 : 1.0
             last_letter = 'k'              male : female =     29.8 : 1.0
            last3letters = 'ita'          female : male   =     25.2 : 1.0
            last3letters = 'ana'          female : male   =     23.9 : 1.0
            last3letters = 'tta'          female : male

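The run shown above scored 83.00 percent, compared with 77.80 percent for the two-feature model. However, each call to ```test_model``` remixes the training and dev-test slices, so scores from different runs are not directly comparable, and we did not see a meaningful improvement from the first-letter feature. We therefore keep the ```two_features_model``` as our final model. We can now test it on the test data, which has so far not been used to train any of the models.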

### Test model on unused data

The final step is to classify the held-out ```test_names``` using ```two_features_model``` to see how we did.

In [183]:
def final_test(classifier, feature_func, name_set):
    
    #generate test_set
    test_set = [(feature_func(n), gender) for (n, gender) in name_set]
        
    
    #test the accuracy of the model on the test set
    a = round(nltk.classify.accuracy(classifier, test_set), 4)*100
    
    #format and print output
    accuracy = f'{a:.2f}'
    print("\n")
    print("Model is %s percent accurate when used on the test set" % accuracy)
    print("\n")
    

final_test(two_features_model, two_features, test_names)



Model is 81.60 percent accurate when used on the test set




## NLTK Summary
In conclusion, the final model does not produce identical results on the test set (81.60 percent) and on the dev-test set (77.80 percent), and results also change when the model is re-run on a remixed development set. This should not be surprising: the model predicts from statistical patterns that are not hard and fast rules, and each score is measured on a sample of only 500 names.
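
A rough way to quantify this variation is the binomial standard error of an accuracy estimate, `sqrt(p * (1 - p) / n)`. The sketch below applies it to a 500-name evaluation set. It assumes the names are independent draws, which is an approximation, and `accuracy_margin` is a hypothetical helper.

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

margin = accuracy_margin(0.80, 500)
print(f"+/- {margin * 100:.1f} percentage points")  # +/- 3.5 percentage points
```

With a margin of roughly 3.5 points on each 500-name set, the gap between the dev-test and test scores is consistent with sampling variation alone.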

## `sklearn`  

Adapted from: https://blog.ayoungprogrammer.com/2016/04/determining-gender-of-name-with-80.html/    

In [186]:
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

Create the names corpus with gender assignment; each name is converted to lowercase.

In [None]:
labeled_names = ([(name.lower(), "M") for name in names.words("male.txt")] +
                 [(name.lower(), "F") for name in names.words("female.txt")])

my_data = np.asarray(labeled_names) 

Using `sklearn`, we convert each name into a numeric vector: every letter maps to an index (a=0, b=1, c=2, ...), and the vector counts how many times each letter occurs in the name. A random forest is then trained on these integer-valued features.

In [232]:
def name_count(name):
    arr = np.zeros(65)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
    return arr

name_map = np.vectorize(name_count, otypes=[np.ndarray])
Xlist = name_map(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7219679633867276
0.7292143401983219
0.7315026697177727
0.7257818459191457
0.7265446224256293
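
To make the encoding concrete, the sketch below rebuilds the letter-count vector in plain Python for a single name. It is an illustration only: `letter_counts` is a hypothetical helper, and unlike `name_count` it skips non-letter characters explicitly instead of relying on negative indices into a longer array (for example, `ord('-') - ord('a')` is -52).

```python
def letter_counts(name):
    """Count occurrences of each letter a-z in a lowercase name."""
    counts = [0] * 26
    for ch in name:
        idx = ord(ch) - ord("a")  # a -> 0, b -> 1, ..., z -> 25
        if 0 <= idx < 26:         # ignore hyphens, spaces, apostrophes
            counts[idx] += 1
    return counts

vec = letter_counts("anna")
print(vec[0], vec[13])  # counts for 'a' and 'n': 2 2
```

Note that this representation discards letter order, so "anna" and "nana" map to the same vector.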


In [186]:
def name_count2(name):
    arr = np.zeros(65+26)
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    return arr

name_map2 = np.vectorize(name_count2, otypes=[np.ndarray])
Xlist = name_map2(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7829900839054157
0.7807017543859649
0.7864225781845919
0.7688787185354691
0.7936689549961862


In [193]:
def name_count3(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    return arr

name_map3 = np.vectorize(name_count3, otypes=[np.ndarray])
Xlist = name_map3(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7963386727688787
0.7967200610221206
0.7894736842105263
0.7917620137299771
0.7921434019832189


In [194]:
def name_count4(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')+1
    # Second Last character
    arr[-2] = ord(name[-2])-ord('a')+1
    # Length of name
    arr[-1] = len(name)
    return arr

name_map4 = np.vectorize(name_count4, otypes=[np.ndarray])
Xlist = name_map4(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.8180778032036613
0.8188405797101449
0.8096872616323417
0.8032036613272311
0.8081617086193745


In [195]:
def name_count7(name):
    arr = np.zeros(1800)
    # Iterate each character
    for ind, x in enumerate(name):
        arr[ord(x)-ord('a')] += 1
        arr[ord(x)-ord('a')+26] += ind+1
    # Iterate every 2 characters
    for x in range(len(name)-1):
        ind = (ord(name[x])-ord('a'))*26 + (ord(name[x+1])-ord('a')) + 52
        arr[ind] += 1
    # Last character
    arr[-3] = ord(name[-1])-ord('a')+1
    # Second Last character
    arr[-2] = ord(name[-2])-ord('a')+1
    # Length of name
    arr[-1] = len(name)
    return arr

name_map7 = np.vectorize(name_count7, otypes=[np.ndarray])
Xlist = name_map7(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(clf.feature_importances_.argsort()[-10:][::-1])
    print(np.mean(clf.predict(Xte) == yte))

[1797   26 1798   40   30   34    0   43   29    8]
0.8081617086193745
[1797   26 1798   40   30    0   34   43   22   29]
0.8119755911517925
[1797   26 1798   40    0   30   34   43 1799    8]
0.8005339435545386
[1797   26 1798   40   30    0   34   43   29 1799]
0.8012967200610221
[1797   26 1798   40    0   30   34   43   29 1799]
0.8154080854309688
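
The index lists above are easier to read once mapped back to the layout that `name_count7` writes into its 1800-slot array. The decoder below is a sketch derived from that function's indexing. `describe_feature` is a hypothetical helper, not part of the notebook, and it assumes names contain only the letters a-z.

```python
import string

def describe_feature(index, size=1800):
    """Map an index of the name_count7 array back to the feature it stores."""
    letters = string.ascii_lowercase
    if index == size - 3:
        return "last letter (a=1 ... z=26)"
    if index == size - 2:
        return "second-to-last letter (a=1 ... z=26)"
    if index == size - 1:
        return "name length"
    if 0 <= index < 26:
        return f"count of '{letters[index]}'"
    if 26 <= index < 52:
        return f"sum of positions of '{letters[index - 26]}'"
    if 52 <= index < 52 + 26 * 26:
        first, second = divmod(index - 52, 26)
        return f"count of bigram '{letters[first]}{letters[second]}'"
    return "unused slot"

for i in [1797, 26, 1798, 40, 30]:
    print(i, describe_feature(i))
```

Read this way, the four top-ranked features in every run are the last letter, the positions of 'a', the second-to-last letter, and the positions of 'o'.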


The following code trains a model based on the last letter of the name.

In [224]:
def name_count8(name):
    arr = np.zeros(1)
    arr[0] = ord(name[-1])-ord('a')+1
    return arr

name_map8 = np.vectorize(name_count8, otypes=[np.ndarray])
Xlist = name_map8(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.761632341723875
0.7715484363081617
0.7528604118993135
0.7570556826849733
0.7589626239511823


The following code trains a model based on the last three letters of the name.

In [228]:
def name_count9(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Third-from-last letter, when the name is long enough
    if len(name)>=3:
        arr[2] = ord(name[-3])-ord('a')+1
    
    return arr

name_map9 = np.vectorize(name_count9, otypes=[np.ndarray])
Xlist = name_map9(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))

0.7890922959572845
0.7913806254767353
0.8012967200610221
0.7940503432494279
0.8032036613272311


In [221]:
def name_count10(name):
    arr = np.zeros(3)
    arr[0] = ord(name[-1])-ord('a')+1
    arr[1] = ord(name[-2])-ord('a')+1
    # Order of a's
    for ind, x in enumerate(name):
        if x == 'a':
            arr[2] += ind+1
    
    return arr

name_map10 = np.vectorize(name_count10, otypes=[np.ndarray])
Xlist = name_map10(np.asarray(list(zip(*my_data))[0],dtype=str))
X = np.array(Xlist.tolist())
y = [x[1] for x in my_data]

for x in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
    clf = RandomForestClassifier(n_estimators=150, min_samples_split=20)
    clf.fit(Xtr, ytr)
    print(np.mean(clf.predict(Xte) == yte))


0.7784134248665141
0.7738367658276125
0.7734553775743707
0.7837528604118993
0.7738367658276125


In [207]:
idx = np.random.choice(np.arange(len(Xlist)), 10, replace=False)
Xname = [x[0] for x in my_data]
xs = [Xname[x] for x in idx]
ys = [y[x] for x in idx]
pred = clf.predict(X[idx])

for a,b, p in zip(xs,ys, pred):
    print(a,b, p)

reid M M
galina F F
bishop M M
channa F F
uta F F
tracy M M
gillian F M
cindi F F
eustacia F F
raleigh M M


In [231]:
for x in range(5): 
    print(X[x])
    print(Xname[x])

[18.  9. 13.]
aamir
[14. 15. 18.]
aaron
[25.  5.  2.]
abbey
[5. 9. 2.]
abbie
[20. 15.  2.]
abbot


## Conclusion

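The last three letters of a name carry most of the signal in both toolkits. In ```nltk```, naive Bayes trained on the last letter plus the last three letters scored 81.60 percent on the test set. In ```sklearn```, a random forest trained only on the last three letters scored about 79 to 80 percent, close to the roughly 80 to 82 percent reached with the full 1,800-slot feature vector.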

An interesting project for further study would be to add weights to the names based on the number of people who have each name so that more common names more heavily tilt the model. While this might not produce a more accurate result looking at a list of names, it should be more accurate when dealing with new, real-world data.
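
As a rough sketch of that idea, the snippet below weights each name by a popularity count when tallying last letters. The counts are invented for illustration and `weighted_last_letter_counts` is a hypothetical helper. In ```sklearn```, a comparable effect could be obtained by passing per-name weights through the `sample_weight` argument of `RandomForestClassifier.fit`.

```python
from collections import Counter, defaultdict

def weighted_last_letter_counts(records):
    """records: iterable of (name, label, weight); returns {label: Counter of weighted last letters}."""
    totals = defaultdict(Counter)
    for name, label, weight in records:
        totals[label][name[-1].lower()] += weight
    return totals

# Invented popularity counts, for illustration only
records = [("Emma", "female", 900), ("Melisande", "female", 3),
           ("Noah", "male", 800), ("Bjorn", "male", 20)]
totals = weighted_last_letter_counts(records)
print(totals["female"].most_common(1))  # [('a', 900)]
```

With weights like these, a rare name such as "Melisande" barely moves the tallies, while common names dominate them.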