# Project 3

## Team members: 

#### <font color='sapphire'>Dennis Pong, Stefano Biguzzi, Ian Costello </font>

### Natural Language Processing with Python, exercise 6.10.2 (P. 257)

#### Problem: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

#### Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

#### How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


## import packages required

In [3]:
# !pip install nltk
# !pip install emoji --upgrade
# !pip install gender-guesser

In [4]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
from nltk.metrics import ConfusionMatrix, accuracy, precision, recall, f_measure
import pandas as pd
import random
import collections
import seaborn as sns
import matplotlib.pyplot as plt
import emoji 

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

import gender_guesser.detector as gender
    
    
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## Loading of Data from the Names Corpora

In [5]:
# Load the names corpus, use random to shuffle the names
# !pwd
# nltk.download()
# nltk.download('averaged_perceptron_tagger')

# check the corpus, there are two files, female.txt and male.txt
nltk.corpus.names.fileids()

print(f"There are {len(names.words('male.txt'))} male names.")
print(f"There are {len(names.words('female.txt'))} female names.")
# concatenate the lists 
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])


There are 2943 male names.
There are 5001 female names.


###### we see that the male to female ratios is currently at  ~ (59:100)

In [6]:
labeled_names[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

In [7]:
# random shuffling of the list 
random.shuffle(labeled_names)

In [8]:
labeled_names[:10]

[('Antonina', 'female'),
 ('Lynsey', 'female'),
 ('Lois', 'female'),
 ('Nate', 'male'),
 ('Holly-Anne', 'female'),
 ('Bailie', 'male'),
 ('Leslie', 'male'),
 ('Melina', 'female'),
 ('Celinka', 'female'),
 ('Bernhard', 'male')]

In [9]:
# Extract name from the list, and check for length
len(set(item[0] for item in labeled_names))

7579

### We've to remove names that is labeled as both male and female

In [10]:
#3 examples 
sorted([item for item in labeled_names if item[0] in ["Jude","Pen","Gabriel"]])

[('Gabriel', 'female'),
 ('Gabriel', 'male'),
 ('Jude', 'female'),
 ('Jude', 'male'),
 ('Pen', 'female'),
 ('Pen', 'male')]

In [11]:
# Remove the duplicates, and check for the total number of unique names
names_freq = nltk.FreqDist(item[0] for item in labeled_names)
nm_dupes = [(k,v) for k,v in names_freq.items() if v >1]
nm_dupes

first_names_to_be_removed = [item[0] for item in nm_dupes]
labeled_names_dedupped = [item for item in labeled_names if not item[0] in first_names_to_be_removed]

len(labeled_names_dedupped)

7214

### With the removal of doubly-labeled names, we're ready to do splitting of the datasets for test, dev-test, and training set.

In [119]:
test = labeled_names_dedupped[0:500]
dev_test = labeled_names_dedupped[500:1000]
train = labeled_names_dedupped[1000:]

# Confirm the size of the three subsets
print("Training Set = {}".format(len(train)))
print("Dev-test (or the valiation) Set = {}".format(len(dev_test)))
print("Test Set = {}".format(len(test)))

Training Set = 6214
Dev-test (or the valiation) Set = 500
Test Set = 500


In [13]:
# train

In [14]:
# Extract the male/female category
train_dist = [cat  for (nm, cat) in train]
nltk.FreqDist(train_dist)

FreqDist({'female': 3996, 'male': 2218})

male to female ratios is currently at ~ (56:100)

#### Because of the male-to-feamle ratios that is very imbalanced in terms of labels, we normally deemed accuracy not the most appropriate measure as it doesn't depict the actual prediction accuracy for the least represented class, male, in this case. In addition to accuracy, recall and precision are reported for each class via a custom function.



- Precision or positive predictive value  
${\displaystyle \mathrm {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }}}$ , i.e., higher precision means there are fewer false positives.

- Recall or true positive rate  
${\displaystyle \mathrm {TPR} = {\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }} }$, i.e., higher recall means there are fewer false negatives.

- F1 score
is the harmonic mean of precision and sensitivity:  
${\displaystyle \mathrm {F} _{1}=2\times {\frac {\mathrm {PPV} \times \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}}{\displaystyle \mathrm {F} _{1}=2\times {\frac {\mathrm {PPV} \times \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}}$

## Naive Bayes Classification
#### We are going to look at the 8 distinct features functions and evaluate them with all the them using a set of performance metrics to determine their relevance to what we're trying to predict using NB classifier.

### Feature Sets

### 1st feature: last letter of the given name

In [15]:
def gender_features(name):
  return {'last_letter': name[-1]}

In [16]:
gender_features("Mary")

{'last_letter': 'y'}

### <b> Building a function for easier bundling of performance metrics </b>

In [1]:
def performance_metrics(model, training_set, digits=4):
    """Prints the precision, recall, and F-measure (or F1 score) of an NLTK Naive Bayes classifer.
       alpha for F-measure is default to 0.5
    """
    reference = collections.defaultdict(set)
    test = collections.defaultdict(set)
    
    for i, (features, label) in enumerate(training_set):
        reference[label].add(i)
        pred = model.classify(features)
        test[pred].add(i)
        
    m_precision = round(precision(reference['male'], test['male']), digits)
    f_precision = round(precision(reference['female'], test['female']), digits)
    
    m_recall = round(recall(reference['male'], test['male']), digits)
    f_recall = round( recall(reference['female'], test['female']), digits)
    
    m_f_measure = round(f_measure(reference['male'], test['male']), digits)
    f_f_measure = round(f_measure(reference['female'], test['female']), digits)
    
    print('Male precision: ', m_precision)
    print('Female precision: ', f_precision)
    print('Male recall: ', m_recall)
    print('Female recall: ', f_recall)
    printmd('Male F1 Score: '); print(m_f_measure)
    printmd('Female F1 Score: '); print(f_f_measure)
    



In [19]:
train_set = [(gender_features(n), g) for (n,g) in train]
dev_test_set = [(gender_features(n), g) for (n,g) in dev_test]
test_set = [(gender_features(n), g) for (n,g) in test]
nb1 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb1, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb1, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb1, train_set )

Validation accuracy is
0.75
Test accuracy is
0.822

Performance metrics for training set: 

Male precision:  0.7005
Female precision:  0.8418
Male recall:  0.7191
Female recall:  0.8293


Male F1 Score: 

0.7097


Female F1 Score: 

0.8355


In [20]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb1, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.6592
Female precision:  0.8006
Male recall:  0.6484
Female recall:  0.8082


Male F1 Score: 

0.6537


Female F1 Score: 

0.8044


There is a noticeable dropoff for Male F1 Score while Female F1 Score saw a slight decrease. That tells me that there is no a strong indication of overfitting with this NB classfier.

### 2nd feature: kitchen sink approach - first_letter and last_letter are printed. Then out of all alphabets, whether it's present with that letter or not, and what's the count.

In [21]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [22]:
gender_features2('John') 

{'first_letter': 'j',
 'last_letter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [23]:
train_set = [(gender_features2(n), g) for (n,g) in train]
dev_test_set = [(gender_features2(n), g) for (n,g) in dev_test]
test_set = [(gender_features2(n), g) for (n,g) in test]
nb2 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb2, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb2, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb2, train_set )

Validation accuracy is
0.8
Test accuracy is
0.798

Performance metrics for training set: 

Male precision:  0.7359
Female precision:  0.8458
Male recall:  0.7187
Female recall:  0.8569


Male F1 Score: 

0.7272


Female F1 Score: 

0.8513


In [24]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb2, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.7384
Female precision:  0.8323
Male recall:  0.6978
Female recall:  0.8585


Male F1 Score: 

0.7175


Female F1 Score: 

0.8452


There is a slight dropoff for Male F1 Score while Female F1 Score saw a slight decrease. It is still not at an alarming level where as far as overfitting is concerned.

### 3rd feature: Last letter and last 2 letters of name
Some suffixes that are more than one letter can be indicative of name genders. For example, names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male; and names ending in ch are usually male, even though names that end in h tend to be female. 

In [25]:
def gender_features3(name):
    return {'suffix1': name[-1:],
            'suffix2': name[-2:]
           }

In [26]:
suffix2_yn_dist = [item[1] for item in labeled_names_dedupped if gender_features3(item[0])['suffix2'] == 'yn']
nltk.FreqDist(suffix2_yn_dist)

FreqDist({'female': 73, 'male': 6})

In [27]:
print(emoji.emojize(':bulb:It\'s indeed true that suffixes ending in \'yn\' is overwhemlingly more likely to be a female',
                    use_aliases=True)) 

💡It's indeed true that suffixes ending in 'yn' is overwhemlingly more likely to be a female


In [28]:
train_set = [(gender_features3(n), g) for (n,g) in train]
dev_test_set = [(gender_features3(n), g) for (n,g) in dev_test]
test_set = [(gender_features3(n), g) for (n,g) in test]
nb3 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb3, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb3, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb3, train_set )

Validation accuracy is
0.766
Test accuracy is
0.834

Performance metrics for training set: 

Male precision:  0.7441
Female precision:  0.8671
Male recall:  0.7642
Female recall:  0.8541


Male F1 Score: 

0.754


Female F1 Score: 

0.8606


In [29]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb3, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.6757
Female precision:  0.819
Male recall:  0.6868
Female recall:  0.8113


Male F1 Score: 

0.6812


Female F1 Score: 

0.8152


There is a noticeable dropoff for Male F1 Score while Female F1 Score saw a 5% increase. No strong evidence for overfitting.

If you're interested in the most informative features of the NB classifier built with the training set, there is a built-in function called show_most_informative_features

In [30]:
# showing just in the training set
nb3.show_most_informative_features(None) #None will give me all

Most Informative Features
                 suffix2 = 'ia'           female : male   =     81.5 : 1.0
                 suffix1 = 'k'              male : female =     79.7 : 1.0
                 suffix1 = 'a'            female : male   =     67.1 : 1.0
                 suffix2 = 'us'             male : female =     62.6 : 1.0
                 suffix2 = 'rt'             male : female =     56.7 : 1.0
                 suffix2 = 'ra'           female : male   =     53.8 : 1.0
                 suffix2 = 'ta'           female : male   =     38.2 : 1.0
                 suffix2 = 'ch'             male : female =     26.3 : 1.0
                 suffix2 = 'do'             male : female =     25.1 : 1.0
                 suffix2 = 'rd'             male : female =     24.8 : 1.0
                 suffix2 = 'ld'             male : female =     21.4 : 1.0
                 suffix2 = 'os'             male : female =     19.3 : 1.0
                 suffix1 = 'p'              male : female =     18.6 : 1.0

### 4th feature: 1-letter suffix, 2-letter suffix + last trigram + first trigram + first fourgram

######  A combination of features: A name's last letter, last two letters, the last three letters, the first trigram, and the first 4-gram.
###### Trigram: a group of three consecutive written units such as letters, syllables, or words

In [31]:
def gender_features4(name):
        name = name.lower()
        return {
            'suffix1': name[-1:],
            'suffix2': name[-2:],
            'last_trigram': name[-3:],
            'first_trigram': name[:3], 
            'first_fourgram': name[:4]
               }


In [32]:
gender_features4("Tarrah")

{'suffix1': 'h',
 'suffix2': 'ah',
 'last_trigram': 'rah',
 'first_trigram': 'tar',
 'first_fourgram': 'tarr'}

In [33]:
train_set = [(gender_features4(n), g) for (n,g) in train]
dev_test_set = [(gender_features4(n), g) for (n,g) in dev_test]
test_set = [(gender_features4(n), g) for (n,g) in test]
nb4 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb4, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb4, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb4, train_set )

Validation accuracy is
0.882
Test accuracy is
0.9

Performance metrics for training set: 

Male precision:  0.912
Female precision:  0.9659
Male recall:  0.9396
Female recall:  0.9497


Male F1 Score: 

0.9256


Female F1 Score: 

0.9577


In [34]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb4, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.8361
Female precision:  0.9085
Male recall:  0.8407
Female recall:  0.9057


Male F1 Score: 

0.8384


Female F1 Score: 

0.9071


#### This is the first feature that suffers quite markedly a dent for both Male F1 Score and Female F1 Score. I think there is an overfitting going on with the training set

### 5th Feature: Vowel positions - a combination of ending in vowel, last letter, last three letters, and last two letters.

 #### Female names end more often with a vowel than male names.

In [35]:
def vowel_features(name):
    return({'last_is_vowel': (name[-1] in 'aeiouy'),
            'last_letter': name[-1],
            'last_three': name[-3:],
            'last_two': name[-2:]
           }
          )

In [36]:
[item[1] for item in labeled_names_dedupped if vowel_features(item[0])['last_is_vowel'] is True]

['female',
 'female',
 'male',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'male',
 'female',
 'male',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female',
 'female',
 'female',
 'female',
 'male',
 'male',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'male',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'female',
 'male

In [37]:
last_is_vowel_dist = [item[1] for item in labeled_names_dedupped if vowel_features(item[0])['last_is_vowel'] is True]
nltk.FreqDist(last_is_vowel_dist)

FreqDist({'female': 3779, 'male': 813})

At first glance, names ending in vowels really has a higher percentage of being female. With this feature being true, we were able to get a male to female ratios is currently at ~ (22:100)

### 6th Feature: Consonent blends - look for 1 or 2 clusters of consonants

In [44]:
def consonant_blends(name):
    features = {}
    temp_name = name
    consonant_blends = ["bl", 
                         "br", 
                         "ch", 
                         "cl", 
                         "cr", 
                         "dr", 
                         "fl", 
                         "fr", 
                         "gl", 
                         "gr", 
                         "pl", 
                         "pr", 
                         "sc", 
                         "sh", 
                         "sk", 
                         "sl", 
                         "sm", 
                         "sn", 
                         "sp", 
                         "st", 
                         "sw", 
                         "th", 
                         "tr", 
                         "tw", 
                         "wh", 
                         "wr", 
                         "sch", 
                         "scr", 
                         "shr", 
                         "sph", 
                         "spl", 
                         "spr", 
                         "squ", 
                         "str", 
                         "thr"
                       ]
    clusters = []
    for cluster in consonant_blends[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["consonant_blends_1"] = clusters[0] if len(clusters) > 0 else None
    features["consonant_blends_2"] = clusters[1] if len(clusters) > 1 else None
    return features

In [41]:
consonant_blends = ["bl", 
                     "br", 
                     "ch", 
                     "cl", 
                     "cr", 
                     "dr", 
                     "fl", 
                     "fr", 
                     "gl", 
                     "gr", 
                     "pl", 
                     "pr", 
                     "sc", 
                     "sh", 
                     "sk", 
                     "sl", 
                     "sm", 
                     "sn", 
                     "sp", 
                     "st", 
                     "sw", 
                     "th", 
                     "tr", 
                     "tw", 
                     "wh", 
                     "wr", 
                     "sch", 
                     "scr", 
                     "shr", 
                     "sph", 
                     "spl", 
                     "spr", 
                     "squ", 
                     "str", 
                     "thr"
                   ]

In [42]:
consonant_blends[::-1]

['thr',
 'str',
 'squ',
 'spr',
 'spl',
 'sph',
 'shr',
 'scr',
 'sch',
 'wr',
 'wh',
 'tw',
 'tr',
 'th',
 'sw',
 'st',
 'sp',
 'sn',
 'sm',
 'sl',
 'sk',
 'sh',
 'sc',
 'pr',
 'pl',
 'gr',
 'gl',
 'fr',
 'fl',
 'dr',
 'cr',
 'cl',
 'ch',
 'br',
 'bl']

In [210]:
# f1=consonant_blends('Beverlie')
# f1
# type(f1)
# f1['consonant_blends_1']

In [45]:
con_bl_1_dist = [item[1] for item in labeled_names_dedupped if consonant_blends(item[0])['consonant_blends_1'] is not None]
nltk.FreqDist(con_bl_1_dist)

FreqDist({'female': 547, 'male': 408})

In [46]:
con_bl_2_dist = [item[1] for item in labeled_names_dedupped if consonant_blends(item[0])['consonant_blends_2'] is not None]
nltk.FreqDist(con_bl_2_dist)

FreqDist({'male': 10, 'female': 4})

#### As I found out there is never a name that has more than 2 consonant clusters, we are just going to check for consonant_blends1 and consonant_blends2. If there are consonant_blends_2 existed for the name given, it's more likely to be a male name. If there is a consonant_blends_1 existed, it's more likely to be a female name. How does that sound to you? What a simple but yet an effective feature, isn't it?

### 7th Feature: bouba_letters blends & kiki_letters. 


#### The “bouba/kiki effect” is the robust tendency to associate rounded objects (vs. angular objects) with names that require rounding of the mouth to pronounce, and may reflect synesthesia-like mapping across perceptual modalities. Here we show for the first time a “social” bouba/kiki effect, such that experimental participants associate round names (“Bob,” “Lou”) with round-faced (vs. angular-faced) individuals. 

In [47]:
def bouba_kiki_features(name):
        name=name.lower()
        return {
            'bouba_letters': len([v for v in name if v in 'blmnuo']),
            'kiki_letters':len([v for v in name if v in 'kptiezv']),
               }

In [48]:
bouba_kiki_features('Adirel')

{'bouba_letters': 1, 'kiki_letters': 2}

In [49]:
# built a choose_features function for features 5 thru' 7

def choose_features(metric):
    train_empty = []
    dev_test_empty = [] 
    test_set_empty = []
    if metric == "vowel_features":
        train1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test1 = [(vowel_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]
        return train1, dev_test1, test1      
    elif metric == "consonant_blends":
        train2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test2 = [(consonant_blends(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]     
        return train2, dev_test2, test2      
    elif metric== 'bouba_kiki_features':
        train3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in train ]
        dev_test3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in dev_test ]
        test3 = [(bouba_kiki_features(n), gender) for (n, gender) in labeled_names_dedupped if (n,gender) in test_set ]
        return train3, dev_test3, test3      
    else:
        print("Invalid Metric")
        return train_empty, dev_test_empty, test_set_empty


### Feature 5 Performance Metrics

In [50]:
train_set, dev_test_set, test_set = choose_features(metric='vowel_features')

nb5 = nltk.NaiveBayesClassifier.train(train_set) 

print('Validation accuracy is')
print(nltk.classify.accuracy(nb5, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb5, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb5, train_set )

Validation accuracy is
0.778
Test accuracy is
0

Performance metrics for training set: 

Male precision:  0.7763
Female precision:  0.8923
Male recall:  0.8106
Female recall:  0.8704


Male F1 Score: 

0.7931


Female F1 Score: 

0.8812


In [51]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb5, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.694
Female precision:  0.8265
Male recall:  0.6978
Female recall:  0.8239


Male F1 Score: 

0.6959


Female F1 Score: 

0.8252


There is a noticeable dropoff in Male F1 Score while there is 5% decrease in Female F1 Score. There is some overfitting in training set.

### Feature 6 Performance Metrics


In [52]:
train_set, dev_test_set, test_set = choose_features(metric='consonant_blends')

nb6 = nltk.NaiveBayesClassifier.train(train_set) 

print('Validation accuracy is')
print(nltk.classify.accuracy(nb6, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb6, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb6, train_set )

Validation accuracy is
0.648
Test accuracy is
0

Performance metrics for training set: 

Male precision:  0.7032
Female precision:  0.6519
Male recall:  0.0491
Female recall:  0.9885


Male F1 Score: 

0.0919


Female F1 Score: 

0.7857


In [53]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb6, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.8
Female precision:  0.6449
Male recall:  0.044
Female recall:  0.9937


Male F1 Score: 

0.0833


Female F1 Score: 

0.7822


As there is virtually no change for both Male F1 Score and Female F1 score, I don't see there is any evidence of overfitting

### Feature 7 Performance Metrics


In [55]:
train_set, dev_test_set, test_set = choose_features(metric='bouba_kiki_features')

nb7 = nltk.NaiveBayesClassifier.train(train_set) 

print('Validation accuracy is')
print(nltk.classify.accuracy(nb7, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb7, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb7, train_set )

Validation accuracy is
0.638
Test accuracy is
0

Performance metrics for training set: 

Male precision:  0.5417
Female precision:  0.6438
Male recall:  0.0059
Female recall:  0.9972


Male F1 Score: 

0.0116


Female F1 Score: 

0.7824


In [56]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb7, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.6
Female precision:  0.6384
Male recall:  0.0165
Female recall:  0.9937


Male F1 Score: 

0.0321


Female F1 Score: 

0.7774


As there is almost no change to the Female F1 Score, and slight uptick in Male F1 Score, there is no indications of overfitting

### 8th Feature: phonetic gender score - leveraging get_gender( ) 
#### The result will be one of unknown (name not found), andy (androgynous), male, female, mostly_male, or mostly_female. The difference between andy and unknown is that the former is found to have the same probability to be male than to be female, while the later means that the name wasn’t found in the database.



In [92]:
# removing d from the gender_features8() func to avoid creating many Detectors, as each creation means reading the 
# data file
d = gender.Detector(case_sensitive=False)

def gender_features8(name):
    return {'phonetic_gender_score':  d.get_gender(name)}

In [93]:
gender_features8('Ann')

{'phonetic_gender_score': 'female'}

In [94]:
train_set = [(gender_features8(n), g) for (n,g) in train]
dev_test_set = [(gender_features8(n), g) for (n,g) in dev_test]
test_set = [(gender_features8(n), g) for (n,g) in test]
nb8 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb8, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb8, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb8, train_set )

Validation accuracy is
0.822
Test accuracy is
0.864

Performance metrics for training set: 

Male precision:  0.9108
Female precision:  0.8162
Male recall:  0.6078
Female recall:  0.967


Male F1 Score: 

0.729


Female F1 Score: 

0.8852


In [95]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb8, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.8843
Female precision:  0.8021
Male recall:  0.5879
Female recall:  0.956


Male F1 Score: 

0.7063


Female F1 Score: 

0.8723


As there is virutally no dropoffs from bot Male F1 Score and Female F1 Score, I'm ascertained that there is no issues of overfitting

In [97]:
nb8.show_most_informative_features(6)


Most Informative Features
   phonetic_gender_score = 'female'       female : male   =     41.9 : 1.0
   phonetic_gender_score = 'male'           male : female =     23.6 : 1.0
   phonetic_gender_score = 'mostly_male'    male : female =      7.1 : 1.0
   phonetic_gender_score = 'mostly_female' female : male   =      3.4 : 1.0
   phonetic_gender_score = 'andy'           male : female =      2.2 : 1.0
   phonetic_gender_score = 'unknown'      female : male   =      1.2 : 1.0


###### There phonetic score really are performing quite well just by looking at the results from the function show_most_informative_features( )

#### Let's merge the features that do not result in overfitting into one feature as a classifer

In [101]:
def gender_features_finalized(name):
    name = name.lower()
    features = {}
    features["first_letter"] = name[0]
    features["last_letter"] = name[-1]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.count(letter)
        features["has({})".format(letter)] = (letter in name)

    features["suffix1"] = name[-1:] 
    features["suffix2"] = name[-2:] 
        
    features = {}
    temp_name = name
    consonant_blends = ["bl", 
                         "br", 
                         "ch", 
                         "cl", 
                         "cr", 
                         "dr", 
                         "fl", 
                         "fr", 
                         "gl", 
                         "gr", 
                         "pl", 
                         "pr", 
                         "sc", 
                         "sh", 
                         "sk", 
                         "sl", 
                         "sm", 
                         "sn", 
                         "sp", 
                         "st", 
                         "sw", 
                         "th", 
                         "tr", 
                         "tw", 
                         "wh", 
                         "wr", 
                         "sch", 
                         "scr", 
                         "shr", 
                         "sph", 
                         "spl", 
                         "spr", 
                         "squ", 
                         "str", 
                         "thr"
                       ]
    clusters = []
    for cluster in consonant_blends[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["consonant_blends_1"] = clusters[0] if len(clusters) > 0 else None
    features["consonant_blends_2"] = clusters[1] if len(clusters) > 1 else None    
    
    features['bouba_letters'] = len([v for v in name if v in 'blmnuo'])
    features['kiki_letters'] = len([v for v in name if v in 'kptiezv'])
    
    features['phonetic_gender_score'] =  d.get_gender(name)
    return features

In [102]:
train_set = [(gender_features_finalized(n), g) for (n,g) in train]
dev_test_set = [(gender_features_finalized(n), g) for (n,g) in dev_test]
test_set = [(gender_features_finalized(n), g) for (n,g) in test]
nb9 = nltk.NaiveBayesClassifier.train(train_set) 
print('Validation accuracy is')
print(nltk.classify.accuracy(nb9, dev_test_set))
print('Test accuracy is')
print(nltk.classify.accuracy(nb9, test_set))
print("")
print("Performance metrics for training set: \n", )
performance_metrics(nb9, train_set )

Validation accuracy is
0.822
Test accuracy is
0.866

Performance metrics for training set: 

Male precision:  0.9048
Female precision:  0.8258
Male recall:  0.6339
Female recall:  0.963


Male F1 Score: 

0.7455


Female F1 Score: 

0.8891


In [103]:
print("Performance metrics for validation set: \n", )
performance_metrics(nb9, dev_test_set )

Performance metrics for validation set: 

Male precision:  0.872
Female precision:  0.8053
Male recall:  0.5989
Female recall:  0.9497


Male F1 Score: 

0.7101


Female F1 Score: 

0.8716


In [131]:
print("Performance metrics for test set: \n", )
performance_metrics(nb9, test_set )

Performance metrics for test set: 

Male precision:  0.8936
Female precision:  0.8552
Male recall:  0.7079
Female recall:  0.9534


Male F1 Score: 

0.79


Female F1 Score: 

0.9016


In [114]:
#Function to generate errors
def generate_errors(classifier, dataset): 
    
    errors = [] 

    for (name, tag) in dataset:
        guess = classifier.classify(gender_features_finalized(name)) 
        if guess != tag: 
            errors.append((tag, guess, name))
            
    return errors
#Function to print error
def show_errors(errors, n = None):
   
    if n is not None: errors = errors[:n]
            
    for (tag, guess, name) in sorted(errors): 
        print('label=%-8s guess=%-8s name=%-30s' %(tag, guess, name))
    print(len(errors))

In [115]:
# Show error in devtest
show_errors(generate_errors(nb9, dev_test))


label=female   guess=male     name=Blondelle                     
label=female   guess=male     name=Demeter                       
label=female   guess=male     name=Devan                         
label=female   guess=male     name=Fabrice                       
label=female   guess=male     name=Franni                        
label=female   guess=male     name=Jammie                        
label=female   guess=male     name=Jessy                         
label=female   guess=male     name=Karol                         
label=female   guess=male     name=Lian                          
label=female   guess=male     name=Lilyan                        
label=female   guess=male     name=Lulu                          
label=female   guess=male     name=Marin                         
label=female   guess=male     name=Sile                          
label=female   guess=male     name=Tomi                          
label=female   guess=male     name=Ventura                       
label=fema

#### Out of 500 names in the validation set, we have incorrectly classified 89 names, which is essentially 82.2 accuracy entails. There are really no additional hyperparamter tuning that can be done. We can move onto the test set as it's very respectable and within acceptable tolerance.


#### Conclusions: 
##### The overall accuracy of the test set is 86.6 % while other performance metrics are as follows, 
| Gender | Metric | Percentage |
| --- | --- | --- |
| Male | precision | 89.36 |
| Female | precision | 85.52 |
| Male | recall | 70.79 |
| Female | recall | 95.34 |
| Male | F1 Score | 79 |
| Female | F1 Score | 90.16 |

#### and if you compare that with the results from the validation set, you can see that we got a set of better scores for F-measure, which is expected.

| Gender | Metric | Percentage |
| --- | --- | --- |
| Male | precision | 87.2 |
| Female | precision | 80.53 |
| Male | recall | 59.89 |
| Female | recall | 94.97 |
| Male | F1 Score | 71.01 |
| Female | F1 Score | 87.16 |