**A machine learning classifier that predicts gender from a name, written in Python with the scikit-learn library**

In [1]:
import pandas as pd
import numpy as np

The data set used here contains a large list of names from many countries, each with its gender. The list is multinational on purpose, so that the model generalizes instead of fitting the names of a single region.

Fortunately, the 'gender' column in the data was already labelled with numeric values: '1' for male and '0' for female.
On further exploration I found that some names were labelled with the value '3'. At first I thought these might be unisex names like 'Jaskirat' or 'Micky', but on further investigation that turned out not to be the case, so I removed those rows from the data.

In [2]:
df = pd.read_csv('gender_refine-csv.csv')
del df['score']  # the 'score' column is not used
# gender is already encoded as 1 for male and 0 for female; drop the rows labelled 3
df_temp = df[df.gender != 3]
df_temp.head()

Now, one may wonder how text and machine learning work together. The scikit-learn library provides a way to do that.
Each word is encoded as a vector of numbers; these vectors might represent the occurrence of each character, or some other feature.
You can learn more about these methods here - [Text and Machine Learning](http://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)
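
To make the idea of a word encoded as a vector concrete, here is a minimal sketch in plain Python that counts how often each letter from a to z occurs in a name. The helper name `char_count_vector` and the fixed 26-letter vocabulary are my own assumptions for illustration; scikit-learn's vectorizers build comparable count vectors, but learn their vocabulary from the data.

```python
import string

ALPHABET = string.ascii_lowercase  # fixed a-z vocabulary (an assumption of this sketch)

def char_count_vector(name):
    """Return a list of 26 counts, one per letter a-z; other characters are ignored."""
    name = name.lower()
    return [name.count(letter) for letter in ALPHABET]

vec = char_count_vector("Anna")
print(len(vec))   # one entry per letter of the alphabet
print(vec[0])     # how many times 'a' occurs
```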

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

cv = CountVectorizer()
pre_feature = df_temp['name']
X = cv.fit_transform(pre_feature)
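
It is worth checking what `CountVectorizer` actually extracts here. With its default settings it splits text into word tokens, so a single-word name becomes a single feature. The sketch below compares that with character n-grams via `analyzer='char'`; the three names and the n-gram range are arbitrary choices for illustration, not settings used in this kernel.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["maria", "marco", "mario"]  # toy input, not taken from the data set

# Default settings: word-level tokens, so every name is its own feature
word_cv = CountVectorizer()
word_cv.fit(names)
print(sorted(word_cv.vocabulary_))

# Character level: names now share features such as 'ma' and 'ar'
char_cv = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X_char = char_cv.fit_transform(names)
print(sorted(char_cv.vocabulary_))
print(X_char.shape)
```

With word-level tokens, a name that appears only in the test set shares no feature with the training set, which may be part of the reason the first score below is low.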


When I wrote this kernel I had just started with machine learning, so I couldn't tell from the data alone which classifier would do better. I therefore try most of the classifiers I know, compare their results, and finally save the model with the best accuracy.

I'll test different classifiers in the following order -
* Naive Bayes Classifier
* Support Vector Machine
* Decision tree

I'll use the GridSearchCV method to tune the hyperparameters to get the best results out of each model.
You can read more about GridSearch here - [GridSearchCV](https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/)

In [4]:
from sklearn.model_selection import train_test_split
Y = df_temp['gender']
features_train, features_test, labels_train, labels_test = train_test_split(X, Y, test_size = 0.3)

I used 70% of the data to train the model and the remaining 30% to test it.
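
The split above is random, so the scores change from run to run. A hedged variant, assuming reproducibility and equal class ratios matter, fixes `random_state` and passes `stratify`. The toy arrays below only stand in for `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,     # 70% train, 30% test
    random_state=42,   # same split on every run
    stratify=y_toy,    # keep the 0/1 ratio the same in both parts
)
print(X_tr.shape, X_te.shape)
```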

In [5]:
from sklearn.naive_bayes import MultinomialNB

In [6]:
clf = MultinomialNB()
clf.fit(features_train, labels_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [7]:
clf.score(features_test, labels_test)

0.6150403375238747

The MultinomialNB did quite a bad job here: 61% accuracy is not what I was looking for, but we still have 2 more classifiers to go.
Let's see where we can take this score from here.

Next up is the SVM.

In [8]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In an SVC the hyperparameters play an important role, so unlike with the MultinomialNB, I use GridSearchCV here to get the best results.
(*note that this doesn't mean that hyperparameters in a MultinomialNB are useless*)
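
As a small illustration of that note, the smoothing parameter `alpha` of `MultinomialNB` can be tuned with `GridSearchCV` as well. This is only a sketch on random count data with an arbitrary grid of `alpha` values; it was not run on the name data set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 5, size=(60, 8))  # non-negative counts, as MultinomialNB expects
y_toy = rng.randint(0, 2, size=60)       # random 0/1 labels

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
g_search = GridSearchCV(MultinomialNB(), param_grid, scoring="accuracy", cv=3)
g_search.fit(X_toy, y_toy)

print(g_search.best_params_, g_search.best_score_)
```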

So, I tried to run:

In [9]:

#  clf = SVC()
#  parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
#                     'C': [0.01,0.1,1,10,100,1000]},
#                    {'kernel': ['sigmoid'], 'gamma': [1e-1, 1e-2, 1, 5, 10],
#                    'C': [0.01,0.1,1,10,100,1000]},
#                    {'kernel': ['linear'], 'C': [0.01,0.1,1,10,100,1000]}]
#  g_search = GridSearchCV(estimator= clf, param_grid = parameters, scoring= 'accuracy', cv = 5)
#  g_search = g_search.fit(features_train, labels_train) 


I later realized that this was too much processing for this kernel: I got exit code 137 twice, which usually means the process was killed for using too much memory.
So I decided to cut the data down by a significant factor,
choose the best parameters on that sample, and then use them to train on the whole data.
I know this isn't great practice, but I had to give up some generality to save time.

The following parameters gave the best result on my local machine for 20,000 random observations: {'kernel': 'rbf', 'C': 100, 'gamma': 0.01}.
I saved the trained classifier in a pickle file.
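
The workflow described above (search on a random subsample, refit on everything with the winning parameters, save the model) could look roughly like this. It is a sketch on synthetic data: the sizes, the reduced grid and the file name `svm_sketch.pkl` are placeholders of mine, not the values used for the real run.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_full = rng.rand(500, 5)                   # synthetic stand-in for the full feature matrix
y_full = (X_full[:, 0] > 0.5).astype(int)   # synthetic labels

# Random subsample without replacement (20,000 rows in the original run)
idx = rng.choice(len(X_full), size=200, replace=False)
X_small, y_small = X_full[idx], y_full[idx]

# Search a small, hypothetical grid on the subsample only
search = GridSearchCV(SVC(kernel="rbf"), {"C": [1, 100], "gamma": [0.01, 0.1]}, cv=3)
search.fit(X_small, y_small)

# Refit on the whole data with the parameters found on the subsample
clf = SVC(kernel="rbf", **search.best_params_)
clf.fit(X_full, y_full)

# Save the fitted classifier and load it back
path = os.path.join(tempfile.mkdtemp(), "svm_sketch.pkl")
joblib.dump(clf, path)
restored = joblib.load(path)
print(search.best_params_, restored.score(X_full, y_full))
```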

In [10]:
# import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# clf = joblib.load('../input/name_to_gender_svm_pickle.pkl')
# print(clf.score(features_test, labels_test))

Even the SVM didn't give the best result in this case. For the decision tree classifier we will build a custom function for feature extraction, based on the following observations:
* Most female names end in a vowel, mostly 'A' or 'E'.
* Most male names end in a consonant (not always).
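
Both observations boil down to one feature: whether the last letter of the name is a vowel. A minimal sketch, with a made-up helper name `ends_with_vowel` and a few example names chosen only to exercise it:

```python
VOWELS = set("aeiou")

def ends_with_vowel(name):
    """True if the last letter of the name is a, e, i, o or u."""
    return name[-1].lower() in VOWELS

for name in ["Maria", "Sophie", "Ankit", "Somesh"]:
    print(name, ends_with_vowel(name))
```

The function in the next cell generalizes this by also keeping the first and last two and three letters.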

In [11]:
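def pattern(name):
    # prefixes and suffixes of length 1 to 3, taken from the lower-cased name
    name = name.lower()
    return {
        'first' : name[0],          # first letter
        'second' : name[0:2],       # first two letters
        'third' : name[0:3],        # first three letters
        'last' : name[-1],          # last letter
        'second_last' : name[-2:],  # last two letters
        'third_last' : name[-3:],   # last three letters
    }
# apply the function element-wise; note that this rebinds the name `pattern`
pattern = np.vectorize(pattern)
print(pattern(["shubham", "ankit", "somesh", "abhishek"]))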

[{'third_last': 'ham', 'third': 'shu', 'second_last': 'am', 'first': 's', 'last': 'm', 'second': 'sh'}
 {'third_last': 'kit', 'third': 'ank', 'second_last': 'it', 'first': 'a', 'last': 't', 'second': 'an'}
 {'third_last': 'esh', 'third': 'som', 'second_last': 'sh', 'first': 's', 'last': 'h', 'second': 'so'}
 {'third_last': 'hek', 'third': 'abh', 'second_last': 'ek', 'first': 'a', 'last': 'k', 'second': 'ab'}]
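
`np.vectorize` is a convenience wrapper that is essentially a Python loop, so a plain list comprehension produces the same dictionaries. A sketch, with the feature function named `name_features` (my own name) so that nothing gets rebound:

```python
def name_features(name):
    """Prefix and suffix features of a single name (same keys as `pattern`)."""
    name = name.lower()
    return {
        'first': name[0],
        'second': name[0:2],
        'third': name[0:3],
        'last': name[-1],
        'second_last': name[-2:],
        'third_last': name[-3:],
    }

# One feature dictionary per name, collected in an ordinary list
features = [name_features(n) for n in ["shubham", "ankit", "somesh", "abhishek"]]
print(features[0])
```

A list of dictionaries like this is the input format that `DictVectorizer` expects.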


In [12]:
X_features = pattern(df_temp['name']) 
Y_labels = df_temp['gender']
X_train, X_test,  Y_train, Y_test = train_test_split(X_features, Y_labels, test_size = 0.33, random_state = 42) 

In [13]:
from sklearn.feature_extraction import DictVectorizer 
sample = pattern(["shubham", "ankit", "somesh", "abhishek"]) 
dv = DictVectorizer()
dv.fit(sample)
transformed = dv.transform(sample) 
# print(transformed)
dv.get_feature_names()  # removed in scikit-learn 1.2; use dv.get_feature_names_out() there

['first=a',
 'first=s',
 'last=h',
 'last=k',
 'last=m',
 'last=t',
 'second=ab',
 'second=an',
 'second=sh',
 'second=so',
 'second_last=am',
 'second_last=ek',
 'second_last=it',
 'second_last=sh',
 'third=abh',
 'third=ank',
 'third=shu',
 'third=som',
 'third_last=esh',
 'third_last=ham',
 'third_last=hek',
 'third_last=kit']
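
To see what the transformed matrix itself looks like, the sparse output can be converted to a dense array. In this sketch on two of the sample names (with the vectorizer called `dv_demo` to keep it apart from the one above), every row has exactly one 1 per feature key, so six ones per name.

```python
from sklearn.feature_extraction import DictVectorizer

sample = [
    {'first': 's', 'second': 'sh', 'third': 'shu',
     'last': 'm', 'second_last': 'am', 'third_last': 'ham'},
    {'first': 'a', 'second': 'an', 'third': 'ank',
     'last': 't', 'second_last': 'it', 'third_last': 'kit'},
]

dv_demo = DictVectorizer()
dense = dv_demo.fit_transform(sample).toarray()  # sparse matrix -> dense NumPy array

print(dv_demo.feature_names_)  # learned one-hot columns such as 'first=a'
print(dense.shape)             # names x one-hot columns
print(dense.sum(axis=1))       # number of active features per name
```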

In [14]:
from sklearn.tree import DecisionTreeClassifier
dv = DictVectorizer()
trans_feat = dv.fit_transform(X_train)  # learn the feature vocabulary on the training set only
clf = DecisionTreeClassifier()
clf.fit(trans_feat, Y_train)
print(clf.score(dv.transform(X_test), Y_test))

0.852333687511338


So, with the decision tree classifier and a custom function for feature generation we achieved an accuracy of 85.2%.
Now we can try the features extracted by the custom function on the other 2 classifiers and see how the results differ.

In [15]:
clf = MultinomialNB() 
clf.fit(trans_feat, Y_train) 
print(clf.score(dv.transform(X_test), Y_test))

0.8451810195143442


That gives 84.5% with the MultinomialNB, slightly below the decision tree.
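
To compare classifiers on the same features with less repetition, the vectorizer and the model can be wrapped in a scikit-learn pipeline. The sketch below uses a tiny made-up list of names and labels purely to show the mechanics; the labels are illustrative and the predictions say nothing about the real data. `name_features` is a hypothetical stand-alone version of the feature function.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def name_features(name):
    """Prefix and suffix features of a single name."""
    name = name.lower()
    return {
        'first': name[0], 'second': name[0:2], 'third': name[0:3],
        'last': name[-1], 'second_last': name[-2:], 'third_last': name[-3:],
    }

# Made-up toy data: 1 = male, 0 = female (illustrative only)
names = ["mario", "maria", "paolo", "paola", "carlo", "carla"]
labels = [1, 0, 1, 0, 1, 0]
feats = [name_features(n) for n in names]

predictions = {}
for model in (DecisionTreeClassifier(random_state=0), MultinomialNB()):
    # DictVectorizer and the classifier are fitted together as one estimator
    pipe = make_pipeline(DictVectorizer(), model)
    pipe.fit(feats, labels)
    predictions[type(model).__name__] = int(pipe.predict([name_features("anna")])[0])

print(predictions)
```

Because the vectorizer lives inside the pipeline, a single object can be saved and later applied directly to feature dictionaries of new names.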