# Project: Classifying Gender Based on a Name

Three feature extractor functions and two types of classifier were built in this project. The first model is a simple baseline that classifies gender based on the first letter of a given name, the second uses the character bigrams of a name, and the third uses the character trigrams of a name. All three are Naïve Bayes classifiers. To see whether a different classifier yields better results, a decision tree classifier is trained at the end on the character trigram features.

## 1. Baseline Model

Feature: First letter of a name

In [None]:
def gender_features(word):
    return {'first_letter': word[0]}
gender_features('Sam')

{'first_letter': 'S'}

The feature generation function “gender_features(word)” takes a name as input, pulls the first letter from the name (the feature), and returns it in a dictionary under the key “first_letter”. This output is used to build the feature set in section 1.1.

In [None]:
import nltk
nltk.download('names')

from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
                 [(name, 'female') for name in names.words('female.txt')])

import random
random.shuffle(labeled_names)

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


### 1.1 Dividing the labeled_names list into training and test sets

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

### 1.2 Testing on unknown/test data

In [None]:
classifier.classify(gender_features('William'))

'male'

In [None]:
classifier.classify(gender_features('Katie'))

'female'

In [None]:
print(nltk.classify.accuracy(classifier, test_set))

0.628


The accuracy of this baseline model is 62.8% on the test set, which is modest. In other words, the model predicts the wrong gender for about 37 of every 100 names it is given.
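As a quick check of the arithmetic, the error rate is one minus the accuracy. The sketch below assumes the 500-name test set from section 1.1 and the reported accuracy of 0.628; the variable names are illustrative and not part of the notebook.

```python
# Reported accuracy and test-set size taken from section 1.1
accuracy = 0.628
test_size = 500

# Number of names classified correctly and incorrectly
correct = round(accuracy * test_size)
errors = test_size - correct
error_rate = errors / test_size

print(correct, errors, f"{error_rate:.1%}")
```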

### 1.3 Fine-tuning: Dividing the labeled_names list into training, dev-test, and test sets

In [None]:
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1500]
train_names = labeled_names[1500:]

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
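Because the names are shuffled without a fixed seed, the exact split (and therefore every accuracy figure) changes from run to run. A minimal sketch of a reproducible three-way split is shown here; the helper name `split_names`, the seed value, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import random

def split_names(labeled, n_test=500, n_devtest=1000, seed=0):
    """Shuffle a copy of `labeled` with a fixed seed and split it three ways."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for labeled_names (hypothetical data, not the NLTK corpus)
toy = [(f"name{i}", "female" if i % 2 else "male") for i in range(2000)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

With a fixed seed the three subsets are disjoint and identical on every run, which makes accuracy figures comparable across feature sets.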

In [None]:
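# Note: classifier is still the section 1.1 model (trained on featuresets[500:]); it is not retrained on train_set here
print(nltk.classify.accuracy(classifier, devtest_set))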

0.641


### 1.4 Which features were most effective for the baseline model classification?

In [None]:
classifier.show_most_informative_features(5)

Most Informative Features
            first_letter = 'W'              male : female =      4.5 : 1.0
            first_letter = 'Q'              male : female =      2.9 : 1.0
            first_letter = 'U'              male : female =      2.5 : 1.0
            first_letter = 'K'            female : male   =      2.5 : 1.0
            first_letter = 'X'              male : female =      2.3 : 1.0


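A development-test (dev-test) set is carved out of the data so that error analysis and feature tuning do not touch the final test set. Note that the classifier is not retrained in this step: the 64.1% figure is the section 1.1 model evaluated on the dev-test set, and because that model was trained on featuresets[500:], it has already seen the dev-test names. The change from 62.8% to 64.1% therefore reflects a different evaluation subset rather than a better model, and either way it is not a great accuracy score. The first letters “W”, “Q”, “U”, “K”, and “X” contribute most to the model’s predictions. Names starting with W, Q, U, and X are more likely to be male, while names starting with K are about 2.5 times more likely to be female.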
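The ratios printed by show_most_informative_features compare how likely a feature value is under one label versus the other. The sketch below illustrates that idea with add-0.5 (expected likelihood) smoothing, which is assumed here to approximate the estimator the classifier uses by default. All counts are made up for illustration and are not taken from the names corpus.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Smoothed estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: names starting with 'W' among male and female training names
male_w, male_total = 90, 2500
female_w, female_total = 34, 4300
bins = 26  # assumed number of distinct first letters

p_w_male = smoothed_prob(male_w, male_total, bins)
p_w_female = smoothed_prob(female_w, female_total, bins)

# Likelihood ratio: how much more likely 'W' is for male names than for female names
ratio = p_w_male / p_w_female
print(f"male : female = {ratio:.1f} : 1.0")
```

A ratio far from 1.0 means the feature value is strongly associated with one label, even if that value is rare overall.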

In [None]:
#displaying the predictions

baseline_predictions = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    baseline_predictions.append( (tag, guess, name) )
    
for (tag, guess, name) in sorted(baseline_predictions):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=female   name=Abigail                       
correct=female   guess=female   name=Acacia                        
correct=female   guess=female   name=Adeline                       
correct=female   guess=female   name=Adiana                        
correct=female   guess=female   name=Aila                          
correct=female   guess=female   name=Ailee                         
correct=female   guess=female   name=Aili                          
correct=female   guess=female   name=Ailina                        
correct=female   guess=female   name=Aime                          
correct=female   guess=female   name=Alexi                         
correct=female   guess=female   name=Alexina                       
correct=female   guess=female   name=Alexine                       
correct=female   guess=female   name=Alfie                         
correct=female   guess=female   name=Alidia                        
correct=female   guess=female   name=Alis       

Next, we'll look into improving feature selection.

## 2. Choosing the Right Features - Bigram Model

Feature: Character bigrams

On the assumption that better features will improve model performance, the feature generation function is updated to extract the character bigrams of a given name. The aim is to see whether the sequences of characters in names show patterns that improve gender prediction.

In [None]:
#defining a new function to retrieve character bigrams

from nltk import bigrams
def gender_features_2(word):
    bigram_list = []
    c = 0
    for char in nltk.bigrams(word):
        bigram_list += [('bg'+str(c), (char[0] + char[1]))]
        c = c+1
    return dict(bigram_list)
gender_features_2('Shrek')

{'bg0': 'Sh', 'bg1': 'hr', 'bg2': 're', 'bg3': 'ek'}
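The same position-keyed features can also be produced with plain string slicing, which generalizes to any n-gram length. This is a sketch of an equivalent formulation rather than the notebook's own function; the name `char_ngram_features` and its `prefix` parameter are assumptions.

```python
def char_ngram_features(word, n, prefix="bg"):
    """Map each character n-gram of `word` to a key that encodes its position."""
    return {f"{prefix}{i}": word[i:i + n] for i in range(len(word) - n + 1)}

# n=2 gives bigrams, n=3 gives trigrams
print(char_ngram_features("Shrek", 2))
print(char_ngram_features("Abigail", 3))
```

Names shorter than n characters yield an empty dictionary, so the classifier receives no features for them.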

### 2.1 Retraining the classifier with the new features (bigrams) and checking the accuracy on the dev-test set

In [None]:
train_set = [(gender_features_2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_2(n), gender) for (n, gender) in test_names]
classifier_2 = nltk.NaiveBayesClassifier.train(train_set) 
print(nltk.classify.accuracy(classifier_2, devtest_set))

0.774


In [None]:
classifier_2.show_most_informative_features(10)

Most Informative Features
                     bg2 = 'rk'             male : female =     17.2 : 1.0
                     bg6 = 'rd'             male : female =     16.4 : 1.0
                     bg0 = 'Fo'             male : female =     16.3 : 1.0
                     bg0 = 'Hu'             male : female =     16.3 : 1.0
                     bg2 = 'lb'             male : female =     16.1 : 1.0
                     bg4 = 'ss'           female : male   =     13.3 : 1.0
                     bg3 = 'to'             male : female =     13.1 : 1.0
                     bg1 = 'is'           female : male   =     12.9 : 1.0
                     bg5 = 'ta'           female : male   =     12.7 : 1.0
                     bg5 = 'rd'             male : female =     12.6 : 1.0


The train, dev-test, and test sets are rebuilt with the character bigram feature function, and a new Naïve Bayes classifier is trained on the training set and evaluated on the dev-test set. This increases the accuracy to 77.4%, a substantial improvement over the baseline. Because each feature key encodes the bigram's position, the patterns are position-specific: “ss” as the fifth bigram (bg4) and “ta” as the sixth (bg5), as in Melissa and Roberta, point toward female names, while “rk” as the third bigram (bg2) and “rd” as the sixth (bg5), as in Mark and Richard, point toward male names. This is a useful insight; however, the goal is to see whether the accuracy can be pushed to something like 85% or 90%.
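The feature keys bg0, bg1, and so on encode where a bigram occurs, which is why “rd” appears twice in the list above (as bg5 and as bg6). The short sketch below shows how the same bigram lands under different keys depending on name length; the helper `positions_of` and the example names are illustrative assumptions.

```python
def positions_of(bigram, word):
    """Return the position-encoded keys under which `bigram` occurs in `word`."""
    return [f"bg{i}" for i in range(len(word) - 1) if word[i:i + 2] == bigram]

for name in ["Richard", "Bernhard", "Ford"]:
    print(name, positions_of("rd", name))
```

Since the classifier treats each key as a separate feature, evidence for a name-final “rd” is split across several keys instead of being pooled.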

## 3. Choosing the Right Features - Trigram Model

Feature: Character trigrams

To further improve the model's performance, the feature generator function is updated to retrieve the three-character sequences (trigrams) of a name, which are then used to train a new Naïve Bayes classifier.

In [None]:
#defining a new function to retrieve character trigrams (the feature keys keep the 'bg' prefix)

from nltk import trigrams

def gender_features_3(word):
    trigram_list = []
    c = 0
    for char in nltk.trigrams(word):
        trigram_list += [('bg'+str(c), (char[0] + char[1] + char[2]))]
        c = c+1
    return dict(trigram_list)
gender_features_3('Abigail')

{'bg0': 'Abi', 'bg1': 'big', 'bg2': 'iga', 'bg3': 'gai', 'bg4': 'ail'}

### 3.1 Retraining the classifier with the new features (trigrams) and checking the accuracy on the dev-test set

In [None]:
train_set = [(gender_features_3(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_3(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_3(n), gender) for (n, gender) in test_names]
classifier_3 = nltk.NaiveBayesClassifier.train(train_set) 
print(nltk.classify.accuracy(classifier_3, devtest_set))

0.794


In [None]:
#listing all predictions 

all_predictions = []
for (name, tag) in devtest_names:
    guess = classifier_3.classify(gender_features_3(name))
    all_predictions.append( (tag, guess, name) )
    
for (tag, guess, name) in sorted(all_predictions):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=female   name=Abigail                       
correct=female   guess=female   name=Acacia                        
correct=female   guess=female   name=Adeline                       
correct=female   guess=female   name=Adiana                        
correct=female   guess=female   name=Aila                          
correct=female   guess=female   name=Ailee                         
correct=female   guess=female   name=Aili                          
correct=female   guess=female   name=Ailina                        
correct=female   guess=female   name=Aime                          
correct=female   guess=female   name=Alexi                         
correct=female   guess=female   name=Alexina                       
correct=female   guess=female   name=Alexine                       
correct=female   guess=female   name=Alidia                        
correct=female   guess=female   name=Alis                          
correct=female   guess=female   name=Alla       

In [None]:
classifier_3.show_most_informative_features(15)

Most Informative Features
                     bg0 = 'Gar'            male : female =     21.5 : 1.0
                     bg2 = 'rre'            male : female =     20.3 : 1.0
                     bg3 = 'ett'          female : male   =     17.0 : 1.0
                     bg0 = 'Ros'          female : male   =     15.9 : 1.0
                     bg0 = 'Dor'          female : male   =     14.6 : 1.0
                     bg4 = 'ina'          female : male   =     14.5 : 1.0
                     bg4 = 'ard'            male : female =     12.8 : 1.0
                     bg3 = 'eli'          female : male   =     11.5 : 1.0
                     bg0 = 'Cat'          female : male   =     11.2 : 1.0
                     bg3 = 'man'            male : female =     11.1 : 1.0
                     bg0 = 'Tha'            male : female =     11.0 : 1.0
                     bg4 = 'lla'          female : male   =     10.8 : 1.0
                     bg3 = 'der'            male : female =     10.1 : 1.0

Using trigram features increases the accuracy to 79.4% on the dev-test set. This is slightly higher than the accuracy of the bigram model and is the best score achieved so far across the three feature sets. In other words, the trigram model predicts the gender correctly for almost 80% of the dev-test names. According to the model’s most informative features, the trigrams “ett” (bg3), “Ros” (bg0), “Dor” (bg0), and “ina” (bg4) are 17.0, 15.9, 14.6, and 14.5 times more likely, respectively, to occur at those positions in female names than in male names.

In [None]:
#error analysis...displaying the wrong predictions
errors = []
for (name, tag) in devtest_names:
    guess = classifier_3.classify(gender_features_3(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Alfie                         
correct=female   guess=male     name=Aloysia                       
correct=female   guess=male     name=Angie                         
correct=female   guess=male     name=Auberta                       
correct=female   guess=male     name=Aubrey                        
correct=female   guess=male     name=Auguste                       
correct=female   guess=male     name=Becka                         
correct=female   guess=male     name=Briteny                       
correct=female   guess=male     name=Brooke                        
correct=female   guess=male     name=Clementia                     
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Cynthie                       
correct=female   guess=male     name=Darda                         
correct=female   guess=male     name=Dusty                         
correct=female   guess=male     name=Emanuela   

## 4. Decision Tree Classifier

To see if a different classifier results in better accuracy, a decision tree model is trained with the character trigram features as input.

In [None]:
classifier_4 = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier_4, devtest_set))

0.698


In [None]:
print(nltk.classify.accuracy(classifier_4, test_set))

0.694


This model yields an accuracy of 69.8% on the dev-test set and 69.4% on the test set, both lower than the Naïve Bayes classifier trained on the same input features (79.4% on the dev-test set).

## 5. Precision, Recall, and F-1 Score

### 5.1 Metrics - Baseline Naïve Bayes Classifier Model

In [None]:
baseline_correct_gender = []
for (tag, guess, name) in sorted(baseline_predictions):
    baseline_correct_gender.append(tag)
print(len(baseline_correct_gender))
print(baseline_correct_gender[0:10])

y_true_baseline = []
for gender in baseline_correct_gender:
    if gender == "female":
        true = 0
    else:
        true = 1
    y_true_baseline.append(true)
print(y_true_baseline[0:10]) 

1000
['female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
baseline_predicted_gender = []
for (tag, guess, name) in sorted(baseline_predictions):
    baseline_predicted_gender.append(guess)
print(len(baseline_predicted_gender))
print(baseline_predicted_gender[0:10])

y_pred_baseline = []
for gender in baseline_predicted_gender:
    if gender == "female":
        pred = 0
    else:
        pred = 1
    y_pred_baseline.append(pred)
print(y_pred_baseline[0:10]) 

1000
['female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
from sklearn import metrics
target_labels = ['Female','Male']
print(metrics.classification_report(y_true_baseline, y_pred_baseline, target_names=target_labels))

              precision    recall  f1-score   support

      Female       0.64      0.95      0.77       624
        Male       0.61      0.12      0.20       376

    accuracy                           0.64      1000
   macro avg       0.63      0.54      0.49      1000
weighted avg       0.63      0.64      0.56      1000
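The support column above gives the class balance of the dev-test set, which allows a quick sanity check against a trivial classifier that always predicts the more common gender. The sketch below only uses the two support counts from the report; the variable names are illustrative.

```python
# Support counts copied from the classification report above
support = {"Female": 624, "Male": 376}

total = sum(support.values())
majority_label = max(support, key=support.get)

# Accuracy of always predicting the majority class
majority_accuracy = support[majority_label] / total
print(majority_label, majority_accuracy)
```

Always predicting the majority class would score 62.4% on this dev-test set, so the baseline model's 64% accuracy is only marginally better than that, which is consistent with its very low recall for male names.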



### 5.2 Metrics - Trigram-based Naïve Bayes Classifier Model

In [None]:
print(nltk.classify.accuracy(classifier_3, test_set))

0.77


In [None]:
correct_gender = []
for (tag, guess, name) in sorted(all_predictions):
    correct_gender.append(tag)
print(len(correct_gender))
print(correct_gender[0:10])

y_true_trigram_feature = []
for gender in correct_gender:
    if gender == "female":
        true = 0
    else:
        true = 1
    y_true_trigram_feature.append(true)
print(y_true_trigram_feature[0:10]) 

1000
['female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
predicted_gender = []
for (tag, guess, name) in sorted(all_predictions):
    predicted_gender.append(guess)
print(len(predicted_gender))
print(predicted_gender[0:10])

y_pred_trigram_feature = []
for gender in predicted_gender:
    if gender == "female":
        pred = 0
    else:
        pred = 1
    y_pred_trigram_feature.append(pred)
print(y_pred_trigram_feature[0:10]) 

1000
['female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
from sklearn import metrics
target_labels = ['Female','Male']
print(metrics.classification_report(y_true_trigram_feature, y_pred_trigram_feature, target_names=target_labels))

              precision    recall  f1-score   support

      Female       0.82      0.86      0.84       624
        Male       0.75      0.68      0.71       376

    accuracy                           0.79      1000
   macro avg       0.78      0.77      0.78      1000
weighted avg       0.79      0.79      0.79      1000



## 6. Baseline Model vs. Trigram Naïve Bayes Model Comparison

The trigram model [section 3] performed better than the baseline model [section 1] on most metrics, all measured on the dev-test set. Precision for the baseline model is 64% for female and 61% for male, whereas it is 82% for female and 75% for male in the trigram model. Precision is the share of names predicted as a given gender that actually belong to that gender, so the trigram model's predictions are more reliable for both genders.

Recall for the baseline model is 95% for female and 12% for male, whereas it is 86% for female and 68% for male in the trigram model. Recall is the share of names of a given gender that the model identifies correctly. The baseline therefore finds slightly more of the female names but misses 88% of the male names; its high female recall largely reflects a tendency to predict female for almost every name. In both models, recall is higher for female names.

The F1 score is the harmonic mean of precision and recall. It is high only when both are high, which makes it a convenient single number when precision and recall trade off against each other. The F1 score is higher for both genders in the trigram model, with scores of 84% for female and 71% for male compared to 77% for female and 20% for male in the baseline model.
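To make the three metrics concrete, the sketch below computes them from scratch for one class and then recomputes F1 from the rounded precision and recall values reported above. The function names and the toy labels are assumptions for illustration; the recomputed F1 values can differ from the reported ones by about 0.01 because the inputs are already rounded.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical, not from the dev-test set)
y_true = ["female", "female", "female", "male", "male"]
y_pred = ["female", "female", "male", "female", "male"]
print(precision_recall_f1(y_true, y_pred, "female"))
print(precision_recall_f1(y_true, y_pred, "male"))

def harmonic_f1(p, r):
    """F1 as the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# (precision, recall, reported F1) rows from the two classification reports
reported = [(0.64, 0.95, 0.77), (0.61, 0.12, 0.20), (0.82, 0.86, 0.84), (0.75, 0.68, 0.71)]
for p, r, f in reported:
    print(f"recomputed={harmonic_f1(p, r):.3f} reported={f:.2f}")
```

The baseline's male row shows why the harmonic mean is useful: a precision of 0.61 combined with a recall of 0.12 gives an F1 near 0.20, much closer to the weaker of the two numbers than a simple average would be.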

Overall, the trigram model clearly stands out. Although its accuracy is far from a figure like 95%, it predicts correctly 77% of the time on the test data. In the future, more training examples and/or better features than character trigrams could be used to increase the model's accuracy.
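One possible direction for better features is to add the last letters of a name and to record trigrams without their position, so that evidence for the same trigram is pooled. The extractor below is a hypothetical sketch that has not been evaluated on this data; the function name `gender_features_alt`, the lowercasing step, and the key names are all assumptions.

```python
def gender_features_alt(word, n=3):
    """Hypothetical extractor: name suffixes plus position-independent n-grams."""
    name = word.lower()
    # Last one and two letters of the name
    features = {"suffix1": name[-1:], "suffix2": name[-2:]}
    # One boolean feature per n-gram, regardless of where it occurs
    for i in range(len(name) - n + 1):
        features[f"contains({name[i:i + n]})"] = True
    return features

print(gender_features_alt("Richard"))
```

Whether this helps would need to be checked on the dev-test set in the same way as the bigram and trigram models.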

## 7. Some Test Examples on the Trigram Naïve Bayes Model

In [None]:
classifier_3.classify(gender_features_3('Jakob')) 

'male'

In [None]:
classifier_3.classify(gender_features_3('Whitney')) #wrong prediction

'male'

In [None]:
classifier_3.classify(gender_features_3('Carroll')) #wrong prediction

'male'

In [None]:
classifier_3.classify(gender_features_3('Aanand')) #wrong prediction

'female'

In [None]:
classifier_3.classify(gender_features_3('Collin')) #wrong prediction

'female'

In [None]:
classifier_3.classify(gender_features_3('Jacqueline'))

'female'

In [None]:
classifier_3.classify(gender_features_3('Alann')) #wrong prediction

'female'

In [None]:
classifier_3.classify(gender_features_3('Matt')) 

'male'

In [None]:
classifier_3.classify(gender_features_3('Agatha')) 

'female'

In [None]:
classifier_3.classify(gender_features_3('Louis')) #wrong prediction

'female'

In [None]:
classifier_3.classify(gender_features_3('Rita')) 

'female'

In [None]:
classifier_3.classify(gender_features_3('Harper')) 

'male'

In [None]:
classifier_3.classify(gender_features_3('John')) 

'male'

In [None]:
classifier_3.classify(gender_features_3('Sally')) 

'female'

In [None]:
classifier_3.classify(gender_features_3('Suzanne')) 

'female'

In [None]:
classifier_3.classify(gender_features_3('Ross')) 

'female'