## Project 3: Gender Name Classifier
## CUNY MSDS DATA 620 Web Analytics, Spring 2018
---
### Team5: Christopher Estevez, Meaghan Burke, Rickidon Singh,  Ritesh Lohiya, Rose Koh
### 07/09/2018 (due date)
##### python version: 3.6
---



<h2> Assignment Details </h2>

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their
code and findings in our meetup.

<i>The ability to be an effective member of a virtual team is highly valued in the data science job market. </i>


-------------------------------------------------------------------------------------------------------------

Using any of the three classifiers described in chapter 6 of <b>Natural Language Processing with Python</b>,
and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest
set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.



## Load Packages

import nltk
nltk.download('names')

In [1]:
from nltk.corpus import names
import random
from nltk.classify import apply_features

## Get Data

In [2]:
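# Note: this rebinds `names` from the corpus reader to a list of (name, gender) pairs,
# so the cell cannot be re-run without re-importing `names` first.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)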

In [3]:
names[0:10]

[('Matilda', 'female'),
 ('Cora', 'female'),
 ('Wilfred', 'male'),
 ('Gloriane', 'female'),
 ('Aphrodite', 'female'),
 ('Chrissa', 'female'),
 ('Sal', 'female'),
 ('Janeva', 'female'),
 ('Delora', 'female'),
 ('Clemence', 'female')]

In [4]:
import nltk
from nltk.classify import apply_features
import random
import math

In [5]:
len(names)

7944

In [6]:
test_set = names[:500]
devtest_set = names[500:1000]   # Error-analysis set
train_set = names[1000:]        # Training set
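
Because the shuffle above is unseeded, every run produces a different split and slightly different accuracies. A minimal sketch of a reproducible three-way split follows; the helper `split_three_way`, the seed value 620, and the stand-in data are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_three_way(items, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of `items` with a fixed seed and slice it into
    (test, devtest, train). Hypothetical helper, not from the notebook."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG; global random state is unchanged
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Stand-in data with the same length as the labeled names list (7944 entries).
demo = [("name%d" % i, "female" if i % 2 else "male") for i in range(7944)]
test_demo, devtest_demo, train_demo = split_three_way(demo)
print(len(test_demo), len(devtest_demo), len(train_demo))
```

Calling the helper twice with the same seed returns the same three lists, which makes accuracy numbers comparable between runs.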

## Features - A

In [7]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["suffix2"] = name[-2:].lower()
    features["preffix2"] = name[:2].lower()
    for letter in 'aeiou':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

## Split Data

In [8]:
featuresets = [(gender_features2(n), g) for (n,g) in names]
featuresets[0]


({'firstletter': 'm',
  'lastletter': 'a',
  'suffix2': 'da',
  'preffix2': 'ma',
  'count(a)': 2,
  'has(a)': True,
  'count(e)': 0,
  'has(e)': False,
  'count(i)': 1,
  'has(i)': True,
  'count(o)': 0,
  'has(o)': False,
  'count(u)': 0,
  'has(u)': False},
 'female')

In [9]:
train_set_fe = featuresets[1000:]
test_set_fe = featuresets[:500]
devtest_set_fe = featuresets[500:1000]

## Classifier - NaiveBayes

In [10]:
classifier = nltk.NaiveBayesClassifier.train(train_set_fe)

In [11]:
print(classifier.classify(gender_features2('Neo')))      # expected: male
print(classifier.classify(gender_features2('Trinity')))  # expected: female

male
male


In [12]:
# Show Accuracy
print("train_set: ", nltk.classify.accuracy(classifier, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier, devtest_set_fe))

train_set:  0.803139400921659
test_set:  0.802
devtest_set:  0.83


In [13]:
# Show important features
classifier.show_most_informative_features(20)

Most Informative Features
                 suffix2 = 'na'           female : male   =     94.6 : 1.0
                 suffix2 = 'la'           female : male   =     67.1 : 1.0
                 suffix2 = 'us'             male : female =     60.4 : 1.0
                 suffix2 = 'ia'           female : male   =     38.6 : 1.0
              lastletter = 'a'            female : male   =     37.4 : 1.0
                 suffix2 = 'ta'           female : male   =     30.7 : 1.0
              lastletter = 'k'              male : female =     30.2 : 1.0
                 suffix2 = 'rd'             male : female =     29.2 : 1.0
                 suffix2 = 'rt'             male : female =     28.3 : 1.0
                 suffix2 = 'do'             male : female =     26.1 : 1.0
                 suffix2 = 'ra'           female : male   =     24.1 : 1.0
                 suffix2 = 'ld'             male : female =     21.0 : 1.0
                 suffix2 = 'os'             male : female =     17.2 : 1.0
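
Each ratio in this table compares how often a feature value occurs under one label versus the other. A rough sketch of that calculation is shown here with made-up counts and add-0.5 (expected likelihood) smoothing; the function names and every number are illustrative assumptions, not values taken from the trained classifier.

```python
def smoothed_prob(count, total, n_bins, gamma=0.5):
    """Additive smoothing: (count + gamma) / (total + gamma * n_bins)."""
    return (count + gamma) / (total + gamma * n_bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, n_bins):
    """Ratio P(value | label A) / P(value | label B) for one feature value."""
    p_a = smoothed_prob(count_a, total_a, n_bins)
    p_b = smoothed_prob(count_b, total_b, n_bins)
    return p_a / p_b

# Made-up counts: a suffix seen in 300 of 4000 female names and in 2 of 2500 male names,
# with 200 distinct suffix values observed overall.
ratio = likelihood_ratio(300, 4000, 2, 2500, n_bins=200)
print("female : male = %.1f : 1.0" % ratio)
```

Smoothing keeps the ratio finite when a value never appears under one of the labels, which is common for rare two-letter suffixes.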

In [14]:
# Check errors
errors = []
for (name, tag) in devtest_set:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append( (tag, guess, name) )


In [15]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))


correct=female   guess=male     name=Aubry                         
correct=female   guess=male     name=Beilul                        
correct=female   guess=male     name=Carrol                        
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Chrystel                      
correct=female   guess=male     name=Conney                        
correct=female   guess=male     name=Conny                         
correct=female   guess=male     name=Coreen                        
correct=female   guess=male     name=Courtney                      
correct=female   guess=male     name=Cynthy                        
correct=female   guess=male     name=Debor                         
correct=female   guess=male     name=Dew                           
correct=female   guess=male     name=Diamond                       
correct=female   guess=male     name=Doe                           
correct=female   guess=male     name=Dorey      

In [16]:
print("Error count: ", len(errors))

Error count:  85
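
One way to turn an error list like this into new feature ideas is to tally the misclassified names by their endings. The sketch below assumes `(tag, guess, name)` tuples in the same shape as `errors`; the sample tuples are copied from the printed errors above, and `suffix_error_counts` is a hypothetical helper.

```python
from collections import Counter

def suffix_error_counts(errors, n=2):
    """Count misclassified names by (true tag, last-n-letters)."""
    return Counter((tag, name[-n:].lower()) for (tag, guess, name) in errors)

# A few (tag, guess, name) tuples taken from the printed error list.
sample_errors = [
    ("female", "male", "Conney"),
    ("female", "male", "Courtney"),
    ("female", "male", "Dorey"),
    ("female", "male", "Carrol"),
    ("female", "male", "Aubry"),
]

for (tag, suffix), count in suffix_error_counts(sample_errors).most_common(3):
    print("true=%-8s suffix=%-3s errors=%d" % (tag, suffix, count))
```

Endings that show up repeatedly in the tally are candidates for a longer suffix feature or a dedicated indicator.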


## Classifier - DecisionTree

In [17]:
classifier_tree = nltk.DecisionTreeClassifier.train(train_set_fe)

print("train_set: ", nltk.classify.accuracy(classifier_tree, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier_tree, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier_tree, devtest_set_fe))

train_set:  0.935339861751152
test_set:  0.762
devtest_set:  0.776


In [18]:
errors2 = []
for (name, tag) in devtest_set:
    guess = classifier_tree.classify(gender_features2(name))
    if guess != tag:
        errors2.append( (tag, guess, name) )


In [19]:
for (tag, guess, name) in sorted(errors2):  # errors2 holds the decision-tree errors
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))


In [20]:
print("Error count")
len(errors2)

Error count


112

---

## Features - B

In [21]:
# Extract character-level features from a name
def gender_features(name):
    name = name.lower()
    return {
        'first_letter': name[0],
        'first2_letter': name[0:2],
        'first3_letter': name[0:3],
        'last_letter': name[-1],
        'last2_letter': name[-2:],
        'last3_letter': name[-3:],
        'last_vowel': (name[-1] in 'aeiou')   # True if the name ends in a vowel
    }

In [22]:
gender_features("Rose")

{'first_letter': 'r',
 'first2_letter': 'ro',
 'first3_letter': 'ros',
 'last_letter': 'e',
 'last2_letter': 'se',
 'last3_letter': 'ose',
 'last_vowel': True}

## Vectorize

In [23]:
import numpy as np

# Vectorize the features function
features = np.vectorize(gender_features)
print(features(['Rose', 'Mike']))

[{'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]


In [24]:
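# Extract the features for the entire dataset. `names` is a list of (name, gender)
# pairs, so the vectorized function is applied to both columns; keep only column 0.
X = np.array(features(names))[:, 0] # X contains the features

# Get the gender column
y = np.array(names)[:, 1]           # y contains the targets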

print("Name: %s, features=%s, gender=%s" % (names[0][0], X[0], y[0]))

Name: Matilda, features={'first_letter': 'm', 'first2_letter': 'ma', 'first3_letter': 'mat', 'last_letter': 'a', 'last2_letter': 'da', 'last3_letter': 'lda', 'last_vowel': True}, gender=female


In [25]:
# Shuffle and split:  train, dev-test, test 
from sklearn.utils import shuffle
X,y = shuffle(X,y)

X_test, X_dev_test, X_train = X[:500], X[500:1000], X[1000:]
y_test, y_dev_test, y_train = y[:500], y[500:1000], y[1000:]

print("test: " , len(X_test))
print("devtest: ", len(X_dev_test))
print("train: ", len(X_train))

test:  500
devtest:  500
train:  6944


In [26]:
# Use vectorizer to transform the features into feature-vectors.
from sklearn.feature_extraction import DictVectorizer

print(features(['Rose', 'Mike']))

# train the vectorizer to know the possible features and values.
vectorizer = DictVectorizer()
vectorizer.fit(X_train)

transform = vectorizer.transform(features(['Rose', 'Mike']))
print(transform)
print(type(transform))
print(transform.toarray()[0][12])
print(vectorizer.feature_names_[12])

[{'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]
  (0, 166)	1.0
  (0, 1214)	1.0
  (0, 1515)	1.0
  (0, 1732)	1.0
  (0, 2679)	1.0
  (0, 3140)	1.0
  (0, 3161)	1.0
  (1, 128)	1.0
  (1, 995)	1.0
  (1, 1510)	1.0
  (1, 1646)	1.0
  (1, 2278)	1.0
  (1, 3140)	1.0
  (1, 3161)	1.0
<class 'scipy.sparse.csr.csr_matrix'>
0.0
first2_letter=ap
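
The sparse output above has one column per distinct `feature=value` pair seen during fitting. A simplified pure-Python sketch of that encoding idea follows. It is not scikit-learn's actual implementation, and `column_name`, `fit_vocabulary`, and `encode` are hypothetical names; string values become indicator columns while booleans stay numeric.

```python
def column_name(key, value):
    """String values get a 'key=value' column; numbers and booleans keep 'key'."""
    return "%s=%s" % (key, value) if isinstance(value, str) else key

def fit_vocabulary(dicts):
    """Map every column name seen in the training dicts to a sorted column index."""
    columns = sorted({column_name(k, v) for d in dicts for k, v in d.items()})
    return {name: i for i, name in enumerate(columns)}

def encode(d, vocabulary):
    """Return a dense row; pairs unseen during fitting are silently dropped."""
    row = [0.0] * len(vocabulary)
    for k, v in d.items():
        col = column_name(k, v)
        if col in vocabulary:
            row[vocabulary[col]] = 1.0 if isinstance(v, str) else float(v)
    return row

train_dicts = [{"last_letter": "e", "last_vowel": True},
               {"last_letter": "k", "last_vowel": False}]
vocab = fit_vocabulary(train_dicts)
print(sorted(vocab))
print(encode({"last_letter": "e", "last_vowel": True}, vocab))
print(encode({"last_letter": "z", "last_vowel": False}, vocab))
```

The last call illustrates why the vectorizer is fit on the training split only: a value that never appeared in training has no column, so it contributes nothing to the encoded row.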


## Classifier - DecisionTree

In [27]:
from sklearn.tree import DecisionTreeClassifier
# DT classifier to extract discriminating rules from the features. 
DT_classifier = DecisionTreeClassifier()

DT_classifier.fit(vectorizer.transform(X_train), y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [28]:
print(DT_classifier.predict(vectorizer.transform(features(["Sebastian", "Amy"]))))

['male' 'female']


In [29]:
# Accuracy
print("Accuracy on training set: ", DT_classifier.score(vectorizer.transform(X_train), y_train))
print("Accuracy on dev-test set: ",DT_classifier.score(vectorizer.transform(X_dev_test), y_dev_test))
print("Accuracy on test set: ",DT_classifier.score(vectorizer.transform(X_test), y_test))


Accuracy on training set:  0.9582373271889401
Accuracy on dev-test set:  0.784
Accuracy on test set:  0.762


In [30]:
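# Cross-validation: each subset is split into 100 folds and evaluated on its own;
# cross_val_predict refits a fresh copy of the estimator for every fold.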
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

pred_train = cross_val_predict(DT_classifier, vectorizer.transform(X_train), y_train, cv = 100)
pred_dev_test = cross_val_predict(DT_classifier, vectorizer.transform(X_dev_test), y_dev_test, cv = 100)
pred_test = cross_val_predict(DT_classifier, vectorizer.transform(X_test), y_test, cv = 100)

score_train = accuracy_score(y_train, pred_train)
score_dev_test = accuracy_score(y_dev_test, pred_dev_test)
score_test = accuracy_score(y_test, pred_test)

print("Cross Validation")
print("Train Score = {0:5f}".format(score_train))
print("Dev Test Score = {0:5f}".format(score_dev_test))
print("Test Score = {0:5f}".format(score_test))


Cross Validation
Train Score = 0.781682
Dev Test Score = 0.740000
Test Score = 0.730000
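
With `cv = 100`, each 500-name set is cut into folds of about 5 names, and every prediction comes from a tree trained on the remaining names of that same set. A simplified sketch of contiguous k-fold partitioning is given below; scikit-learn stratifies folds by class for classifiers, so its actual fold membership differs, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, held_out_idx) pairs; each sample is held out exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, held_out
        start += size

folds = list(k_fold_indices(500, 100))
# Number of folds, held-out size, and training size for the first fold.
print(len(folds), len(folds[0][1]), len(folds[0][0]))
```

Because the dev-test and test folds train on fewer than 500 names, their cross-validated scores are not directly comparable with the scores of the tree fit on the full training split.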


## Conclusion

* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

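Test-set accuracy is lower than dev-test accuracy for every classifier: 0.802 vs. 0.83 for Naive Bayes, 0.762 vs. 0.776 for the NLTK decision tree, and 0.762 vs. 0.784 for the scikit-learn decision tree. This is broadly what we would expect. Because the dev-test set is used to check progress and guide feature choices, the model can end up tuned to its quirks, so dev-test accuracy tends to be somewhat optimistic, while the test set remains genuinely unseen data. Training accuracy is higher still for the decision trees (0.935 and 0.958), since a model performs better on data it has already seen than on data it has not; that gap between training and held-out accuracy is a sign of overfitting.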
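
Whether a gap of this size is meaningful can be gauged with a rough binomial standard error, treating each 500-name set as an independent sample. The back-of-the-envelope sketch below uses the Naive Bayes accuracies reported above (0.83 dev-test, 0.802 test); the helper name `accuracy_std_error` and the independence assumption are ours, for illustration only.

```python
import math

def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

dev_acc, test_acc, n = 0.83, 0.802, 500
se_dev = accuracy_std_error(dev_acc, n)
se_test = accuracy_std_error(test_acc, n)
se_gap = math.sqrt(se_dev ** 2 + se_test ** 2)   # assumes the two sets are independent
gap = dev_acc - test_acc

print("dev-test: %.3f +/- %.3f" % (dev_acc, se_dev))
print("test:     %.3f +/- %.3f" % (test_acc, se_test))
print("gap:      %.3f (about %.1f standard errors)" % (gap, gap / se_gap))
```

On 500 names each accuracy carries an uncertainty of just under two percentage points, so small differences between the two held-out sets may partly reflect sampling noise.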

## Source

https://www.nltk.org/book/ch06.html