# Project 3

## Team members: 

#### <font color='sapphire'>Dennis Pong, Stefano Biguzzi, Ian Costello </font>

### Natural Language Processing with Python, exercise 6.10.2 (P. 257)

#### Problem: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

#### Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

#### How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


## import packages required

In [153]:
# !pip install nltk
# !pip install emoji --upgrade
# !pip install gender-guesser

In [154]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
from nltk.metrics import ConfusionMatrix, accuracy, precision, recall, f_measure
import pandas as pd
import random
import collections
import seaborn as sns
import matplotlib.pyplot as plt
import emoji 

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

import gender_guesser.detector as gender
    
    
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Loading of Data from the Names Corpora

In [4]:
# Load the names corpus, use random to shuffle the names
# !pwd
# nltk.download()
# nltk.download('averaged_perceptron_tagger')

# check the corpus, there are two files, female.txt and male.txt
nltk.corpus.names.fileids()

print(f"There are {len(names.words('male.txt'))} male names.")
print(f"There are {len(names.words('female.txt'))} female names.")
# concatenate the lists 
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])


There are 2943 male names.
There are 5001 female names.


###### we see that the male to female ratios is currently at  ~ (59:100)

In [5]:
labeled_names[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

In [6]:
# random shuffling of the list 
random.shuffle(labeled_names)

In [7]:
labeled_names[:10]

[('Beverlie', 'female'),
 ('Miran', 'female'),
 ('Mei', 'female'),
 ('Riva', 'female'),
 ('Grissel', 'female'),
 ('Raine', 'female'),
 ('Melina', 'female'),
 ('Donna', 'female'),
 ('Doyle', 'male'),
 ('Heathcliff', 'male')]

In [8]:
# Extract name from the list, and check for length
len(set(item[0] for item in labeled_names))

7579

### We've to remove names that is labeled as both male and female

In [9]:
#3 examples 
sorted([item for item in labeled_names if item[0] in ["Jude","Pen","Gabriel"]])

[('Gabriel', 'female'),
 ('Gabriel', 'male'),
 ('Jude', 'female'),
 ('Jude', 'male'),
 ('Pen', 'female'),
 ('Pen', 'male')]

In [10]:
# Remove the duplicates, and check for the total number of unique names
names_freq = nltk.FreqDist(item[0] for item in labeled_names)
nm_dupes = [(k,v) for k,v in names_freq.items() if v >1]
nm_dupes

first_names_to_be_removed = [item[0] for item in nm_dupes]
labeled_names_dedupped = [item for item in labeled_names if not item[0] in first_names_to_be_removed]

len(labeled_names_dedupped)

7214

### With the removal of doubly-labeled names, we're ready to do splitting of the datasets for test, dev-test, and training set.

In [325]:
test = labeled_names_dedupped[0:500]
dev_test = labeled_names_dedupped[500:1000]
train = labeled_names_dedupped[1000:]

# Confirm the size of the three subsets
print("Training Set = {}".format(len(train)))
print("Dev-test (or the valiation) Set = {}".format(len(dev_test)))
print("Test Set = {}".format(len(test)))

Training Set = 6214
Dev-test (or the valiation) Set = 500
Test Set = 500


In [12]:
# train

In [13]:
# Extract the male/female category
train_dist = [cat  for (nm, cat) in train]
nltk.FreqDist(train_dist)

FreqDist({'female': 3970, 'male': 2244})

male to female ratios is currently at ~ (34:100)

#### Because of the male-to-feamle ratios that is very imbalanced in terms of labels, we normally deemed accuracy not the most appropriate measure as it doesn't depict the actual prediction accuracy for the least represented class, male, in this case. In addition to accuracy, recall and precision are reported for each class via a custom function.



- Precision or positive predictive value  
${\displaystyle \mathrm {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }}}$ , i.e., higher precision means there are fewer false positives.

- Recall or true positive rate  
${\displaystyle \mathrm {TPR} = {\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }} }$, i.e., higher recall means there are fewer false negatives.

- F1 score
is the harmonic mean of precision and sensitivity:  
${\displaystyle \mathrm {F} _{1}=2\times {\frac {\mathrm {PPV} \times \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}}{\displaystyle \mathrm {F} _{1}=2\times {\frac {\mathrm {PPV} \times \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}}$

## Naive Bayes Classification
#### We are going to look at the 8 distinct features functions and evaluate them with all the them using a set of performance metrics to determine their relevance to what we're trying to predict using NB classifier.

### Feature Sets

### 1st feature: last letter of the given name

In [14]:
def gender_features(name):
  return {'last_letter': name[-1]}

In [15]:
gender_features("Mary")

{'last_letter': 'y'}

### <b> Building a function for easier bundling of performance metrics </b>

In [125]:
def performance_metrics(model, training_set, digits=4):
    """Prints the precision, recall, and F-measure (or F1 score) of an NLTK Naive Bayes classifer.
       alpha for F-measure is default to 0.5
    """
    reference = collections.defaultdict(set)
    test = collections.defaultdict(set)
    
    for i, (features, label) in enumerate(training_set):
        reference[label].add(i)
        pred = model.classify(features)
        test[pred].add(i)
        
    m_precision = round(precision(reference['male'], test['male']), digits)
    f_precision = round(precision(reference['female'], test['female']), digits)
    
    m_recall = round(recall(reference['male'], test['male']), digits)
    f_recall = round( recall(reference['female'], test['female']), digits)
    
    m_f_measure = round(f_measure(reference['male'], test['male']), digits)
    f_f_measure = round(f_measure(reference['female'], test['female']), digits)
    
    print('Male precision: ', m_precision)
    print('Female precision: ', f_precision)
    print('Male recall: ', m_recall)
    print('Female recall: ', f_recall)
    printmd('Male F1 Score: '); print(m_f_measure)
    printmd('Female F1 Score: '); print(f_f_measure)
    



In [126]:
train_set = [(gender_features(n), g) for (n,g) in train]
dev_test_set = [(gender_features(n), g) for (n,g) in dev_test]
test_set = [(gender_features(n), g) for (n,g) in test]
nb1 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb1, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb1, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb1, train_set )

Validation accuracy is
0.822
Test accuracy is
0.8

Performance metrics for training set: 

Male precision:  0.7376
Female precision:  0.8693
Male recall:  0.7754
Female recall:  0.8441


Male F1 Score: 

0.756


Female F1 Score: 

0.8565


In [127]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb1, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.7399
Female precision:  0.8654
Male recall:  0.7442
Female recall:  0.8628


Male F1 Score: 

0.742


Female F1 Score: 

0.8641


There is not much of a dropoff for Male F1 Score while Female F1 Score saw a slight increase. That tells me that there is no indications of overfitting with this NB classfier.

### 2nd feature: kitchen sink approach - first_letter and last_letter are printed. Then out of all alphabets, whether it's present with that letter or not, and what's the count.

In [18]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [19]:
gender_features2('John') 

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [128]:
train_set = [(gender_features2(n), g) for (n,g) in train]
dev_test_set = [(gender_features2(n), g) for (n,g) in dev_test]
test_set = [(gender_features2(n), g) for (n,g) in test]
nb2 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb2, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb2, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb2, train_set )

Validation accuracy is
0.786
Test accuracy is
0.832

Performance metrics for training set: 

Male precision:  0.7433
Female precision:  0.8446
Male recall:  0.7201
Female recall:  0.8594


Male F1 Score: 

0.7316


Female F1 Score: 

0.8519


In [129]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb2, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.6836
Female precision:  0.8421
Male recall:  0.7035
Female recall:  0.8293


Male F1 Score: 

0.6934


Female F1 Score: 

0.8356


There is a small dropoff for Male F1 Score while Female F1 Score saw a slight decrease. It is still not at an alarming
level where as far as overfitting is concerned.

### 3rd feature: Last letter and last 2 letters of name
Some suffixes that are more than one letter can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. 

In [65]:
def gender_features3(name):
    return {'suffix1': name[-1:],
            'suffix2': name[-2:]
           }

In [75]:
suffix2_yn_dist = [item[1] for item in labeled_names_dedupped if gender_features3(item[0])['suffix2'] == 'yn']
nltk.FreqDist(suffix2_yn_dist)

FreqDist({'female': 77, 'male': 10})

In [107]:
print(emoji.emojize(':bulb:It\'s indeed true that suffixes ending in \'yn\' is overwhemlingly more likely to be a female',
                    use_aliases=True)) 

💡It's indeed true that suffixes ending in 'yn' is overwhemlingly more likely to be a female


In [130]:
train_set = [(gender_features3(n), g) for (n,g) in train]
dev_test_set = [(gender_features3(n), g) for (n,g) in dev_test]
test_set = [(gender_features3(n), g) for (n,g) in test]
nb3 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb3, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb3, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb3, train_set )

Validation accuracy is
0.822
Test accuracy is
0.8

Performance metrics for training set: 

Male precision:  0.7376
Female precision:  0.8693
Male recall:  0.7754
Female recall:  0.8441


Male F1 Score: 

0.756


Female F1 Score: 

0.8565


In [131]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb3, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.7399
Female precision:  0.8654
Male recall:  0.7442
Female recall:  0.8628


Male F1 Score: 

0.742


Female F1 Score: 

0.8641


There is a slight dropoff for Male F1 Score while Female F1 Score saw a slight increase. No evidence for overfitting.

If you're interested in the most informative features of the NB classifier built with the training set, there is a built-in function called show_most_informative_features

In [150]:
# showing just in the training set
nb3.show_most_informative_features(None) #None will give me all

Most Informative Features
                 suffix2 = 'na'           female : male   =    152.5 : 1.0
                 suffix2 = 'ta'           female : male   =     65.2 : 1.0
                 suffix1 = 'a'            female : male   =     63.3 : 1.0
                 suffix1 = 'k'              male : female =     62.9 : 1.0
                 suffix2 = 'la'           female : male   =     62.8 : 1.0
                 suffix2 = 'ra'           female : male   =     53.9 : 1.0
                 suffix2 = 'ia'           female : male   =     48.6 : 1.0
                 suffix2 = 'us'             male : female =     39.7 : 1.0
                 suffix2 = 'ld'             male : female =     38.5 : 1.0
                 suffix2 = 'rt'             male : female =     29.3 : 1.0
                 suffix2 = 'do'             male : female =     27.0 : 1.0
                 suffix2 = 'rd'             male : female =     23.6 : 1.0
                 suffix1 = 'p'              male : female =     20.6 : 1.0

### 4th feature: 1-letter suffix, 2-letter suffix + last trigram + first trigram + first fourgram

######  A combination of features: A name's last letter, last two letters, the last three letters, the first trigram, and the first 4-gram.
###### Trigram: a group of three consecutive written units such as letters, syllables, or words

In [158]:
def gender_features4(name):
        name = name.lower()
        return {
            'suffix1': name[-1:],
            'suffix2': name[-2:],
            'last_trigram': name[-3:],
            'first_trigram': name[:3], 
            'first_fourgram': name[:4]
               }


In [159]:
gender_features4("Tarrah")

{'suffix1': 'h',
 'suffix2': 'ah',
 'last_trigram': 'rah',
 'first_trigram': 'tar',
 'first_fourgram': 'tarr'}

In [160]:
train_set = [(gender_features4(n), g) for (n,g) in train]
dev_test_set = [(gender_features4(n), g) for (n,g) in dev_test]
test_set = [(gender_features4(n), g) for (n,g) in test]
nb4 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb4, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb4, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb4, train_set )

Validation accuracy is
0.892
Test accuracy is
0.908

Performance metrics for training set: 

Male precision:  0.9098
Female precision:  0.9649
Male recall:  0.9389
Female recall:  0.9474


Male F1 Score: 

0.9241


Female F1 Score: 

0.956


In [161]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb4, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.8391
Female precision:  0.9202
Male recall:  0.8488
Female recall:  0.9146


Male F1 Score: 

0.8439


Female F1 Score: 

0.9174


#### This is the first feature that suffers quite markedly a dent for both Male F1 Score and Female F1 Score. I think there is an overfitting going on with the training set

### 5th Feature: Vowel positions - a combination of ending in vowel, last letter, last three letters, and last two letters.

 #### Female names end more often with a vowel than male names.

In [233]:
def vowel_features(name):
    return({'last_is_vowel': (name[-1] in 'aeiouy'),
            'last_letter': name[-1],
            'last_three': name[-3:],
            'last_two': name[-2:]
           }
          )

In [234]:
[item[1] for item in labeled_names_dedupped if vowel_features(item[0])['last_is_vowel'] is True]

['female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'male',
 'female',
 'male',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female'

In [235]:
last_is_vowel_dist = [item[1] for item in labeled_names_dedupped if vowel_features(item[0])['last_is_vowel'] is True]
nltk.FreqDist(last_is_vowel_dist)

FreqDist({'female': 3779, 'male': 813})

At first glance, names ending in vowels really has a higher percentage of being female. With this feature being true, we were able to get a male to female ratios is currently at ~ (22:100)

### 6th Feature: Consonent blends - look for 1 or 2 clusters of consonants

In [242]:
def consonant_blends(name):
    features = {}
    temp_name = name
    consonant_blends = ["bl", 
                         "br", 
                         "ch", 
                         "cl", 
                         "cr", 
                         "dr", 
                         "fl", 
                         "fr", 
                         "gl", 
                         "gr", 
                         "pl", 
                         "pr", 
                         "sc", 
                         "sh", 
                         "sk", 
                         "sl", 
                         "sm", 
                         "sn", 
                         "sp", 
                         "st", 
                         "sw", 
                         "th", 
                         "tr", 
                         "tw", 
                         "wh", 
                         "wr", 
                         "sch", 
                         "scr", 
                         "shr", 
                         "sph", 
                         "spl", 
                         "spr", 
                         "squ", 
                         "str", 
                         "thr"
                       ]
    clusters = []
    for cluster in consonant_blends[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["consonant_blends_1"] = clusters[0] if len(clusters) > 0 else None
    features["consonant_blends_2"] = clusters[1] if len(clusters) > 1 else None
    return features

In [213]:
consonant_blends = ["bl", 
                         "br", 
                         "ch", 
                         "cl", 
                         "cr", 
                         "dr", 
                         "fl", 
                         "fr", 
                         "gl", 
                         "gr", 
                         "pl", 
                         "pr", 
                         "sc", 
                         "sh", 
                         "sk", 
                         "sl", 
                         "sm", 
                         "sn", 
                         "sp", 
                         "st", 
                         "sw", 
                         "th", 
                         "tr", 
                         "tw", 
                         "wh", 
                         "wr", 
                         "sch", 
                         "scr", 
                         "shr", 
                         "sph", 
                         "spl", 
                         "spr", 
                         "squ", 
                         "str", 
                         "thr"
                       ]

In [214]:
consonant_blends[::-1]

['thr',
 'str',
 'squ',
 'spr',
 'spl',
 'sph',
 'shr',
 'scr',
 'sch',
 'wr',
 'wh',
 'tw',
 'tr',
 'th',
 'sw',
 'st',
 'sp',
 'sn',
 'sm',
 'sl',
 'sk',
 'sh',
 'sc',
 'pr',
 'pl',
 'gr',
 'gl',
 'fr',
 'fl',
 'dr',
 'cr',
 'cl',
 'ch',
 'br',
 'bl']

In [210]:
# f1=consonant_blends('Beverlie')
# f1
# type(f1)
# f1['consonant_blends_1']

In [211]:
con_bl_1_dist = [item[1] for item in labeled_names_dedupped if consonant_blends(item[0])['consonant_blends_1'] is not None]
nltk.FreqDist(con_bl_1_dist)

FreqDist({'female': 578, 'male': 439})

In [212]:
con_bl_2_dist = [item[1] for item in labeled_names_dedupped if consonant_blends(item[0])['consonant_blends_2'] is not None]
nltk.FreqDist(con_bl_2_dist)

FreqDist({'male': 10, 'female': 4})

#### As I found out there is never a name that has more than 2 consonant clusters, we are just going to check for consonant_blends1 and consonant_blends2. If there are consonant_blends_2 existed for the name given, it's more likely to be a male name. If there is a consonant_blends_1 existed, it's more likely to be a female name. How does that sound to you? What a simple but yet an effective feature, isn't it?

### 7th Feature: bouba_letters blends & kiki_letters. 


#### The “bouba/kiki effect” is the robust tendency to associate rounded objects (vs. angular objects) with names that require rounding of the mouth to pronounce, and may reflect synesthesia-like mapping across perceptual modalities. Here we show for the first time a “social” bouba/kiki effect, such that experimental participants associate round names (“Bob,” “Lou”) with round-faced (vs. angular-faced) individuals. 

In [220]:
def bouba_kiki_features(name):
        name=name.lower()
        return {
            'bouba_letters': len([v for v in name if v in 'blmnuo']),
            'kiki_letters':len([v for v in name if v in 'kptiezv']),
               }

In [221]:
bouba_kiki_features('Adirel')

{'bouba_letters': 1, 'kiki_letters': 2}

In [326]:
# built a choose_features function for features 5 thru' 7

def choose_features(metric):
    train_empty = []
    dev_test_empty = [] 
    test_set_empty = []
    if metric == "vowel_features":
        train1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]
        return train1, dev_test1, test1      
    elif metric == "consonant_blends":
        train2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]     
        return train2, dev_test2, test2      
    elif metric== 'bouba_kiki_features':
        train3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]
        return train3, dev_test3, test3      
    else:
        print("Invalid Metric")
        return train_empty, dev_test_empty, test_set_empty


### Feature 5 Performance Metrics

In [328]:
train_set, dev_test_set, test_set = choose_features(metric='vowel_features')

nb5 = nltk.NaiveBayesClassifier.train(train_set) 

print('Validation accuracy is')
print(nltk.classify.accuracy(nb5, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb5, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb5, train_set )

Validation accuracy is
0.83
Test accuracy is
0

Performance metrics for training set: 

Male precision:  0.7696
Female precision:  0.8892
Male recall:  0.8097
Female recall:  0.863


Male F1 Score: 

0.7891


Female F1 Score: 

0.8759


In [329]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb5, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.7574
Female precision:  0.8671
Male recall:  0.7442
Female recall:  0.875


Male F1 Score: 

0.7507


Female F1 Score: 

0.871


There is a slight dropoff in Male F1 Score while there is virtually no change in Female F1 Score. No evidence of overfitting in training set.

[('Odelia', 'female'),
 ('Eveline', 'female'),
 ('Bentley', 'male'),
 ('Ardys', 'female'),
 ('Case', 'male'),
 ('Eldon', 'male'),
 ('Tia', 'female'),
 ('Jannel', 'female'),
 ('Waly', 'female'),
 ('Zeb', 'male'),
 ('Deidre', 'female'),
 ('Gretta', 'female'),
 ('Karalynn', 'female'),
 ('Acacia', 'female'),
 ('Rhodie', 'female'),
 ('Tarra', 'female'),
 ('Charlotte', 'female'),
 ('Cathyleen', 'female'),
 ('Nunzio', 'male'),
 ('Englebert', 'male'),
 ('Jennings', 'male'),
 ('Nicki', 'female'),
 ('Kory', 'male'),
 ('Meriel', 'female'),
 ('Baily', 'male'),
 ('Herbie', 'male'),
 ('Jonell', 'female'),
 ('Anthia', 'female'),
 ('Darin', 'male'),
 ('Marmaduke', 'male'),
 ('Bella', 'female'),
 ('Noach', 'male'),
 ('Waldon', 'male'),
 ('Amadeus', 'male'),
 ('Kaari', 'female'),
 ('Alameda', 'female'),
 ('Annalena', 'female'),
 ('Jean-Marc', 'male'),
 ('Mellissa', 'female'),
 ('Cathe', 'female'),
 ('Felicio', 'male'),
 ('Blythe', 'female'),
 ('Connolly', 'male'),
 ('Vitia', 'female'),
 ('Garvy', 'male'

In [307]:
type([(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ])

list

In [290]:
[(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]

[({'last_is_vowel': True,
   'last_letter': 'e',
   'last_three': 'lle',
   'last_two': 'le'},
  'female'),
 ({'last_is_vowel': False,
   'last_letter': 'k',
   'last_three': 'ick',
   'last_two': 'ck'},
  'male'),
 ({'last_is_vowel': True,
   'last_letter': 'e',
   'last_three': 'ore',
   'last_two': 're'},
  'male'),
 ({'last_is_vowel': False,
   'last_letter': 'f',
   'last_three': 'eff',
   'last_two': 'ff'},
  'male'),
 ({'last_is_vowel': False,
   'last_letter': 'g',
   'last_three': 'oug',
   'last_two': 'ug'},
  'male'),
 ({'last_is_vowel': False,
   'last_letter': 'd',
   'last_three': 'and',
   'last_two': 'nd'},
  'female'),
 ({'last_is_vowel': True,
   'last_letter': 'i',
   'last_three': 'ori',
   'last_two': 'ri'},
  'female'),
 ({'last_is_vowel': False,
   'last_letter': 'n',
   'last_three': 'Win',
   'last_two': 'in'},
  'male'),
 ({'last_is_vowel': True,
   'last_letter': 'e',
   'last_three': 'ese',
   'last_two': 'se'},
  'male'),
 ({'last_is_vowel': False,
   'last

In [314]:
[(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ],\
                [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ],\
                [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]

([({'last_is_vowel': True,
    'last_letter': 'a',
    'last_three': 'lia',
    'last_two': 'ia'},
   'female'),
  ({'last_is_vowel': True,
    'last_letter': 'e',
    'last_three': 'ine',
    'last_two': 'ne'},
   'female'),
  ({'last_is_vowel': True,
    'last_letter': 'y',
    'last_three': 'ley',
    'last_two': 'ey'},
   'male'),
  ({'last_is_vowel': False,
    'last_letter': 's',
    'last_three': 'dys',
    'last_two': 'ys'},
   'female'),
  ({'last_is_vowel': True,
    'last_letter': 'e',
    'last_three': 'ase',
    'last_two': 'se'},
   'male'),
  ({'last_is_vowel': False,
    'last_letter': 'n',
    'last_three': 'don',
    'last_two': 'on'},
   'male'),
  ({'last_is_vowel': True,
    'last_letter': 'a',
    'last_three': 'Tia',
    'last_two': 'ia'},
   'female'),
  ({'last_is_vowel': False,
    'last_letter': 'l',
    'last_three': 'nel',
    'last_two': 'el'},
   'female'),
  ({'last_is_vowel': True,
    'last_letter': 'y',
    'last_three': 'aly',
    'last_two': 'ly'},


In [312]:
[(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]

[({'last_is_vowel': True,
   'last_letter': 'a',
   'last_three': 'lia',
   'last_two': 'ia'},
  'female'),
 ({'last_is_vowel': True,
   'last_letter': 'e',
   'last_three': 'ine',
   'last_two': 'ne'},
  'female'),
 ({'last_is_vowel': True,
   'last_letter': 'y',
   'last_three': 'ley',
   'last_two': 'ey'},
  'male'),
 ({'last_is_vowel': False,
   'last_letter': 's',
   'last_three': 'dys',
   'last_two': 'ys'},
  'female'),
 ({'last_is_vowel': True,
   'last_letter': 'e',
   'last_three': 'ase',
   'last_two': 'se'},
  'male'),
 ({'last_is_vowel': False,
   'last_letter': 'n',
   'last_three': 'don',
   'last_two': 'on'},
  'male'),
 ({'last_is_vowel': True,
   'last_letter': 'a',
   'last_three': 'Tia',
   'last_two': 'ia'},
  'female'),
 ({'last_is_vowel': False,
   'last_letter': 'l',
   'last_three': 'nel',
   'last_two': 'el'},
  'female'),
 ({'last_is_vowel': True,
   'last_letter': 'y',
   'last_three': 'aly',
   'last_two': 'ly'},
  'female'),
 ({'last_is_vowel': False,
   '