# Gender Classifier - Project 3
**DATA620**
<br>
**Wilson Ng**

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python (or any classifiers you find on the Internet), and any features you can think of, build the best name gender classifier you can.
<br>
1. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set.
2. Then, starting with the example name gender classifier, make incremental improvements.
3. Use the dev-test set to check your progress.
4. Once you are satisfied with your classifier, check its final performance on the test set.
<br>

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?


In [159]:
from nltk.corpus import names
from nltk import NaiveBayesClassifier
from nltk import classify
import random

In [160]:
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

random.shuffle(names)

Function provided by the book to extract features of the data. In this case, we have the last letter and last two letters of a given name.

In [161]:
def gender_features(word):
    return {
        'suffix1': word[-1:],
        'suffix2': word[-2:]
    }

gender_features('shrek')

{'suffix1': 'k', 'suffix2': 'ek'}
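
One possible next increment, sketched here as an assumption and not something evaluated in this notebook, is to add a few cheap features alongside the two suffixes. The name `gender_features_v2`, the lowercasing step, and the extra features (three-letter suffix, first letter, vowel ending with 'y' counted as a vowel) are all hypothetical choices; whether they help would have to be checked on the dev-test set.

```python
# Treating 'y' as a vowel is an illustrative assumption.
VOWELS = set('aeiouy')

def gender_features_v2(word):
    """Hypothetical extension of gender_features; not evaluated in this notebook."""
    lowered = word.lower()
    return {
        'suffix1': lowered[-1:],
        'suffix2': lowered[-2:],
        'suffix3': lowered[-3:],
        'first_letter': lowered[:1],
        'ends_in_vowel': lowered[-1:] in VOWELS,
    }

print(gender_features_v2('Shrek'))
```

Adding features one at a time and re-checking dev-test accuracy after each makes it easier to tell which feature actually helped.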

Splitting the data into two sets for now: the first 500 names for testing and the rest for training.

In [184]:
featuresets = [(gender_features(name), g) for (name, g) in names]

train_set, test_set = featuresets[500:], featuresets[:500]

classifier = NaiveBayesClassifier.train(train_set)

Testing the classify method on the output of gender_features. So far so good.

In [188]:
classifier.classify(gender_features('Ronaldo'))

'male'

In [190]:
classifier.classify(gender_features('Angelina'))

'female'

Viewing accuracy of the classifier with the test_set.

In [187]:
print(classify.accuracy(classifier, test_set))

0.772


In [166]:
classifier.show_most_informative_features(5)

Most Informative Features
                 suffix2 = 'na'           female : male   =    166.7 : 1.0
                 suffix2 = 'la'           female : male   =     73.9 : 1.0
                 suffix2 = 'ia'           female : male   =     39.7 : 1.0
                 suffix2 = 'sa'           female : male   =     34.9 : 1.0
                 suffix1 = 'a'            female : male   =     34.4 : 1.0


In [167]:
# When working with large corpora, constructing a single list that contains the features of every instance
# can use up a large amount of memory.
# In these cases, use the function nltk.classify.apply_features, which
# returns an object that acts like a list but does not store all the feature sets in memory:

train_set = classify.apply_features(gender_features, names[500:])
test_set = classify.apply_features(gender_features, names[:500])
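
The idea behind a list-like object that does not store every feature set can be illustrated with a small class that computes each `(features, label)` pair only when it is indexed. This is a rough sketch of the concept, not NLTK's actual implementation; `LazyFeatureList` and `last_letter` are hypothetical names.

```python
class LazyFeatureList:
    """List-like view that computes (features, label) pairs on demand."""

    def __init__(self, feature_func, labeled_items):
        self._feature_func = feature_func
        self._items = labeled_items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Features are built here, at access time, and are not cached.
        item, label = self._items[index]
        return (self._feature_func(item), label)

def last_letter(word):
    return {'suffix1': word[-1:]}

lazy = LazyFeatureList(last_letter, [('Ronaldo', 'male'), ('Angelina', 'female')])
print(len(lazy))
print(lazy[1])
```

The trade-off is that features are recomputed on every access, which costs time in exchange for lower memory use.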

In [168]:
# what not to do, overfitting

letters = [chr(x) for x in range(ord('a'), ord('z') + 1)]

def gender_features2(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for letter in letters:
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    # Return only after every letter has been processed
    return features


In [169]:
gender_features2('John')

{'firstletter': 'j', 'lastletter': 'n', 'count(a)': 0, 'has(a)': False, ..., 'count(z)': 0, 'has(z)': False}

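Following the book, the data should be split into three sets as below: the first 500 names for the test set, the next 1,000 for the dev-test set, and the remainder for training. Note that this dev-test set is larger than the 500 names the assignment asks for.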

In [174]:
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]
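
A split with the sizes the assignment asks for (500 test, 500 dev-test, the rest for training) could be sketched as follows. The helper `three_way_split`, its `seed` parameter, and the placeholder data are assumptions for illustration; seeding the shuffle is one way to make the reported accuracies repeatable between runs.

```python
import random

def three_way_split(items, n_test=500, n_devtest=500, seed=0):
    """Shuffle a copy of items and return (train, devtest, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data: 7,900 items, the total implied by the assignment text.
placeholder = [('name%d' % i, 'female') for i in range(7900)]
train, devtest, test = three_way_split(placeholder)
print(len(train), len(devtest), len(test))  # 6900 500 500
```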

In [175]:
# The training set is used to train the model
# the dev-test set is used to perform error analysis
# the test set serves in our final evaluation of the system

train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = NaiveBayesClassifier.train(train_set)

print(classify.accuracy(classifier, devtest_set))

0.795


Once an initial set of features has been chosen, a very productive method for refining
the feature set is error analysis.


In [191]:
# Using the dev-test set, we can generate a list of the errors
# that the classifier makes when predicting name genders:

errors = []

for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [192]:
# Unpack into `tag` so each row prints its own correct label
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Adelind                       
correct=female   guess=male     name=Adrien                        
correct=female   guess=male     name=Aeriell                       
correct=female   guess=male     name=Ajay                          
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Alis                          
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Amabel                        
correct=female   guess=male     name=Anet                          
correct=female   guess=male     name=Aubry                         
correct=female   guess=male     name=Bamby                         
correct=female   guess=male     name=Beilul                        
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Berget                        
correct=female   guess=male     name=Bevvy      
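
To turn an error list like this into ideas for new features, it can help to count the errors by suffix and see which endings are misclassified most often. The sketch below uses a handful of the errors listed above; `errors_by_suffix` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def errors_by_suffix(errors, n=2):
    """Count misclassified names by (correct tag, last n letters)."""
    return Counter((tag, name[-n:].lower()) for (tag, guess, name) in errors)

# A few of the errors listed above, in the same (tag, guess, name) layout.
sample_errors = [
    ('female', 'male', 'Anet'),
    ('female', 'male', 'Beret'),
    ('female', 'male', 'Berget'),
    ('female', 'male', 'Aubry'),
]

suffix_counts = errors_by_suffix(sample_errors)
for (tag, suffix), count in suffix_counts.most_common():
    print(tag, suffix, count)
```

Suffixes that dominate the error counts are candidates for a longer suffix feature or some other distinguishing feature.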

I also experimented with scikit-learn; however, the accuracy was not as high as with the approach from the book.

In [111]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import pandas as pd

Scikit-learn works well with DataFrames, so I'm converting the names data from a list of tuples into a DataFrame.

In [202]:
names_df = pd.DataFrame(names, columns=['Name', 'Gender'])

names_df.head()

|   | Name | Gender |
|---|---|---|
| 0 | Angelina | female |
| 1 | Brett | female |
| 2 | Adolfo | male |
| 3 | Fanechka | female |
| 4 | Maury | male |


Getting the last two letters of a given name and adding it to the dataframe.

In [203]:
def get_last_two_letters(name):
    return name[-2:]

names_df['last_two_letters'] = names_df['Name'].apply(get_last_two_letters)

names_df

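|   | Name | Gender | last_two_letters |
|---|---|---|---|
| 0 | Angelina | female | na |
| 1 | Brett | female | tt |
| 2 | Adolfo | male | fo |
| 3 | Fanechka | female | ka |
| 4 | Maury | male | ry |
| ... | ... | ... | ... |
| 7939 | Jacinta | female | ta |
| 7940 | Ed | male | Ed |
| 7941 | Kaia | female | ia |
| 7942 | Saul | male | ul |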


Using train_test_split to hold out 30% of the data as a test set, with the remaining 70% used for training.

In [212]:
y = names_df.Gender

X_train, X_test, y_train, y_test = train_test_split(names_df['last_two_letters'], y, test_size = 0.30, random_state = 53)

Created a CountVectorizer instance and fit it on the training features, which are the last two letters of each name in the training set.

In [220]:
count_vectorizer = CountVectorizer()

count_train = count_vectorizer.fit_transform(X_train)

count_test = count_vectorizer.transform(X_test)

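# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(count_vectorizer.get_feature_names_out()[:10].tolist())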

['ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak']


Fitting the classifier on the training data (count_train) and training labels (y_train).
<br>
Also computing the accuracy score and producing a confusion matrix.
<br>
The confusion matrix shows that 1330 female names were classified correctly and 176 incorrectly.
<br>
It also shows that 527 male names were classified correctly and 351 incorrectly.
<br>
This might be because there were more female names than male names in the training data, which would skew predictions toward female.

In [234]:
nb_classifier = MultinomialNB()


nb_classifier.fit(count_train, y_train)


pred = nb_classifier.predict(count_test)


score = metrics.accuracy_score(y_test, pred)
print(score)


cm = metrics.confusion_matrix(y_test, pred, labels=['female', 'male'])
print(cm)

0.7789429530201343
[[1330  176]
 [ 351  527]]
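
The numbers in this confusion matrix can be broken down further. The sketch below recomputes the accuracy from the matrix and adds per-class recall and a majority-class baseline (the accuracy of always guessing the more common class in the test split); the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels, both ordered ['female', 'male'].
cm = [[1330, 176],
      [351, 527]]
labels = ['female', 'male']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions divided by the true count of that class.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

# Accuracy of always predicting the most common class in the test split.
baseline = max(sum(row) for row in cm) / total

print('accuracy: %.3f' % accuracy)
print('baseline: %.3f' % baseline)
for label in labels:
    print('recall(%s): %.3f' % (label, recall[label]))
```

The test split holds 1,506 female and 878 male names, so recall is noticeably lower for the smaller class, which fits the observation about class imbalance.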


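One way to try to improve the accuracy of our classifier is to tune the smoothing parameter alpha.
<br>
However, I don't see any difference in accuracy for alpha values ranging from 0.0 to 0.9 below. I might have implemented this function incorrectly. Another possibility is that, because each name contributes only a single two-letter suffix token, changing alpha alters the prediction only for suffixes whose counts are nearly tied between the two classes.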

In [235]:
alphas = np.arange(0, 1, 0.1)

def train_and_predict(alpha):
    nb_classifier = MultinomialNB(alpha=alpha)

    nb_classifier.fit(count_train, y_train)

    pred = nb_classifier.predict(count_test)

    score = metrics.accuracy_score(y_test, pred)
    return score

for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()


Alpha:  0.0
Score:  0.7789429530201343

Alpha:  0.1
Score:  0.7789429530201343

Alpha:  0.2
Score:  0.7789429530201343

Alpha:  0.30000000000000004
Score:  0.7789429530201343

Alpha:  0.4
Score:  0.7789429530201343

Alpha:  0.5
Score:  0.7789429530201343

Alpha:  0.6000000000000001
Score:  0.7789429530201343

Alpha:  0.7000000000000001
Score:  0.7789429530201343

Alpha:  0.8
Score:  0.7789429530201343

Alpha:  0.9
Score:  0.7789429530201343



