This notebook takes the gender names dataset from Kaggle (sourced from Social Security records) and uses it to train a Naive Bayes classifier on character bigrams. The classifier learns spelling patterns in gendered names so that it can estimate the likely gender split of a list of names.

Here, we import all of the needed libraries. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
import requests
from bs4 import BeautifulSoup
import matplotlib as mlp
#mlp.use("TKAgg")
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
mlp.rcParams.update({'font.family': "Open Sans", 'font.size' : 16})

Now we load the data file (the Social Security database), do some basic cleaning, and display the first five entries to get a sense of the data.

In [None]:
# Load the Social Security names database from Kaggle, reading the counts as integers
names = pd.read_csv("../input/vindata/allBabyNamesUSA_2000s.csv", dtype = {'Count': np.int32})
names = names.fillna(0)
names.head()

Now let's group by name and gender and sum the counts across years, which gives the total number of female and male occurrences for each name.

In [None]:
namechart = names.groupby(['Name', 'Gender'], as_index = False)['Number'].sum()
#namechart = names.groupby(['Name', 'Gender'], as_index = False)['Count'].sum()

namechart.head(5)
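
To make the aggregation concrete, here is a minimal sketch on a hypothetical toy table. The rows and the Year column are assumptions made up for illustration; only the Name, Gender, and Number columns mirror the cell above.

```python
import pandas as pd

# Hypothetical toy rows (not taken from the real dataset).
toy = pd.DataFrame({
    "Name":   ["Alex", "Alex", "Alex", "Maria", "Maria"],
    "Gender": ["male", "male", "female", "female", "female"],
    "Year":   [2000, 2001, 2000, 2000, 2001],
    "Number": [120, 130, 40, 300, 310],
})

# Sum the counts over all years for each (Name, Gender) pair.
toy_chart = toy.groupby(["Name", "Gender"], as_index=False)["Number"].sum()
print(toy_chart)
```

Each (Name, Gender) pair collapses to a single row, so a name used for both genders still appears twice.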

Now let's add a column that labels each name as male or female. The Mpercent score is the male count minus the female count, divided by their total, so it runs from -1 (all female) to 1 (all male). Names scoring above 0.001 are labeled male and the rest female.

In [None]:
# Recent pandas versions require keyword arguments for pivot
namechartdiff = namechart.pivot(index='Name', columns='Gender', values='Number')
namechartdiff = namechartdiff.fillna(0)

#namechartdiff = namechart
#namechartdiff["Mpercent"] = ((namechartdiff["M"] - namechartdiff["F"])/(namechartdiff["M"] + namechartdiff["F"]))
namechartdiff["Mpercent"] = ((namechartdiff["male"] - namechartdiff["female"])/(namechartdiff["male"] + namechartdiff["female"]))
namechartdiff['gender'] = np.where(namechartdiff['Mpercent'] > 0.001, 'male', 'female')
namechartdiff.head()
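
The labeling step can be sketched on a few hypothetical rows. The names and counts below are invented for illustration; the formula and the 0.001 threshold are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated counts (not taken from the real dataset).
toy_chart = pd.DataFrame({
    "Name":   ["Alex", "Alex", "Maria", "John"],
    "Gender": ["female", "male", "female", "male"],
    "Number": [40, 250, 610, 500],
})

# One row per name, one column per gender; missing combinations become 0.
wide = toy_chart.pivot(index="Name", columns="Gender", values="Number").fillna(0)

# Normalized difference: -1 means all female, 1 means all male.
wide["Mpercent"] = (wide["male"] - wide["female"]) / (wide["male"] + wide["female"])
wide["gender"] = np.where(wide["Mpercent"] > 0.001, "male", "female")
print(wide)
```

Note that a name split exactly evenly scores 0 and therefore falls into the female bucket under this threshold.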

Let's now break the name strings down into character bigrams with CountVectorizer.

In [None]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
X = char_vectorizer.fit_transform(namechartdiff.index)
X = X.tocsc()
# np.int was removed in NumPy 1.24; the builtin int is the replacement
y = (namechartdiff.gender == 'male').values.astype(int)
print(X)
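
To see what the vectorizer produces, here is a small sketch on two example names chosen for illustration. CountVectorizer lowercases its input by default, so "Anna" and "anna" yield the same bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_names = ["Anna", "Hannah"]  # illustrative names only

toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
toy_counts = toy_vectorizer.fit_transform(toy_names)

# vocabulary_ maps each bigram to its column index; columns are in sorted order.
bigrams = sorted(toy_vectorizer.vocabulary_)
print(bigrams)
print(toy_counts.toarray())
```

Each row is a name and each column a bigram, so the classifier sees only how often each pair of adjacent letters occurs, not where in the name it occurs.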

Let's split our training and test data now. 

In [None]:
# Split the row positions 70/30, then build a boolean mask that is True for training rows
itrain, itest = train_test_split(range(namechartdiff.shape[0]), train_size=0.7)
mask = np.zeros(namechartdiff.shape[0], dtype=bool)
mask[itrain] = True
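
As an alternative sketch, train_test_split can also split the feature matrix and labels directly, without an explicit mask. The toy arrays and the random_state value below are assumptions added for illustration; fixing random_state only makes the split reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix X and label vector y.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# 70% of the rows go to training, the remaining 30% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0
)
print(X_train.shape, X_test.shape)
```

The same call accepts SciPy sparse matrices, so it would also work on the output of CountVectorizer.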

Now we train the model and compare its accuracy on the training and test sets.

In [None]:
Xtrainthis=X[mask]
Ytrainthis=y[mask]
Xtestthis=X[~mask]
Ytestthis=y[~mask]
clf = MultinomialNB(alpha = 1)
clf.fit(Xtrainthis, Ytrainthis)
training_accuracy = clf.score(Xtrainthis,Ytrainthis)
test_accuracy = clf.score(Xtestthis,Ytestthis)
        
print(training_accuracy)
print(test_accuracy)
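
Accuracy alone can look better than it should if one label dominates the data. The notebook imports f1_score without using it, so here is a hedged sketch of how it could be applied. The toy names and labels are invented for illustration, and the score is computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; 1 = male, 0 = female, matching the notebook's encoding.
toy_names = ["anna", "maria", "julia", "sophia", "john", "mark", "peter", "james"]
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X_toy = toy_vectorizer.fit_transform(toy_names)

toy_clf = MultinomialNB(alpha=1).fit(X_toy, toy_labels)
toy_pred = toy_clf.predict(X_toy)

# F1 for the positive (male) class balances precision and recall.
toy_f1 = f1_score(toy_labels, toy_pred)
print(toy_f1)
```

In the notebook itself, the equivalent call would take Ytestthis and clf.predict(Xtestthis) as its two arguments.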

Now let's define a function that will allow us to easily look up and predict individual names. 

In [None]:
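def lookup(x):
    # Coerce the input to a string (the original call discarded the result)
    x = str(x)
    new = char_vectorizer.transform([x])
    y_pred = clf.predict(new)
    # Label 1 means male, matching how y was built above
    if y_pred[0] == 1:
        print("This is most likely a male name!")
    else:
        print("This is most likely a female name!")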

I've looked up my own name and determined that it's most likely male!

In [None]:
lookup("Ganesh")
lookup("Laxmi")
print(" ")
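
The lookup above returns a hard label for one name. To estimate the gender split of a whole list, one possible approach is to average the predicted male probability over the list. This is a sketch, not the notebook's own method: estimate_male_share is a hypothetical helper, and the toy training data is invented so the example runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set; 1 = male, 0 = female, as in the notebook.
toy_names = ["anna", "maria", "julia", "sophia", "john", "mark", "peter", "james"]
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
toy_clf = MultinomialNB(alpha=1).fit(toy_vectorizer.fit_transform(toy_names), toy_labels)


def estimate_male_share(name_list, vectorizer, model):
    """Average the predicted male probability over a list of names."""
    features = vectorizer.transform([str(n) for n in name_list])
    # Find the predict_proba column that corresponds to label 1 (male).
    male_column = list(model.classes_).index(1)
    return float(model.predict_proba(features)[:, male_column].mean())


share = estimate_male_share(["Joanna", "Marcus", "Julian"], toy_vectorizer, toy_clf)
print(f"Estimated male share: {share:.2f}")
```

With the notebook's objects, the call would be estimate_male_share(list_of_names, char_vectorizer, clf).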