## Project 3: Gender Name Classifier
## CUNY MSDS DATA 620 Web Analytics, CUNY Spring 2018
---
### Team 5: Christopher Estevez, Meaghan Burke, Rickidon Singh, Ritesh Lohiya, Rose Koh
### 07/09/2018 (due date)
##### python version: 3.6
---



<h2> Assignment Details </h2>

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their
code and findings in our meetup.

<i>The ability to be an effective member of a virtual team is highly valued in the data science job market. </i>


-------------------------------------------------------------------------------------------------------------

Using any of the three classifiers described in chapter 6 of <b>Natural Language Processing with Python</b>,
and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test
set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.



## Load Packages

In [1]:
import nltk
nltk.download('names')

In [2]:
from nltk.corpus import names
import random
from nltk.classify import apply_features

## Get Data

In [3]:
names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

In [4]:
names[0:10]

[('Ephraim', 'male'),
 ('Avivah', 'female'),
 ('Emlyn', 'female'),
 ('Maressa', 'female'),
 ('Margy', 'female'),
 ('Horacio', 'male'),
 ('Orren', 'male'),
 ('Binni', 'female'),
 ('Witty', 'male'),
 ('Stephanus', 'male')]

In [5]:
import nltk
from nltk.classify import apply_features
import random
import math

In [6]:
len(names)

7944

In [7]:
test_set = names[:500]
devtest_set = names[500:1000]   # Error-analysis set
train_set = names[1000:]        # Training set
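
Because `random.shuffle` is called without a seed, the three subsets (and therefore every accuracy reported below) change from run to run. A minimal sketch of a reproducible split follows. It assumes a hypothetical helper `split_names` and an arbitrary seed of 620, and the toy list only stands in for the labeled Names Corpus:

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of the data with a fixed seed and slice it into
    test, dev-test and training subsets (hypothetical helper)."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, reproducible order
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Toy data standing in for the labeled names.
toy = [("name%d" % i, "male" if i % 2 else "female") for i in range(20)]
test_demo, devtest_demo, train_demo = split_names(toy, n_test=5, n_devtest=5)
print(len(test_demo), len(devtest_demo), len(train_demo))
```

Calling `split_names` again with the same seed returns the same three subsets, which makes accuracy figures comparable across runs.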

## Features - A

In [8]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["suffix2"]= name[-2:].lower()
    features["preffix2"]= name[:2].lower()
    for letter in 'aeiou':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

## Split Data - A

In [9]:
featuresets = [(gender_features2(n), g) for (n,g) in names]
featuresets[0]


({'count(a)': 1,
  'count(e)': 1,
  'count(i)': 1,
  'count(o)': 0,
  'count(u)': 0,
  'firstletter': 'e',
  'has(a)': True,
  'has(e)': True,
  'has(i)': True,
  'has(o)': False,
  'has(u)': False,
  'lastletter': 'm',
  'preffix2': 'ep',
  'suffix2': 'im'},
 'male')

In [10]:
train_set_fe = featuresets[1000:]
test_set_fe = featuresets[:500]
devtest_set_fe = featuresets[500:1000]

## Classifier - NaiveBayes

In [11]:
classifier = nltk.NaiveBayesClassifier.train(train_set_fe)

In [12]:
print(classifier.classify(gender_features2('Neo')))      # expected: male
print(classifier.classify(gender_features2('Trinity')))  # expected: female

male
male


In [13]:
# Show Accuracy
print("train_set: ", nltk.classify.accuracy(classifier, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier, devtest_set_fe))

train_set:  0.8120679723502304
test_set:  0.782
devtest_set:  0.782


In [14]:
# Show important features
classifier.show_most_informative_features(200)

Most Informative Features
                 suffix2 = 'na'           female : male   =     93.2 : 1.0
                 suffix2 = 'la'           female : male   =     69.0 : 1.0
                 suffix2 = 'rd'             male : female =     41.2 : 1.0
                 suffix2 = 'ia'           female : male   =     37.6 : 1.0
                 suffix2 = 'sa'           female : male   =     34.4 : 1.0
              lastletter = 'a'            female : male   =     33.4 : 1.0
              lastletter = 'k'              male : female =     27.6 : 1.0
                 suffix2 = 'us'             male : female =     24.7 : 1.0
                 suffix2 = 'ta'           female : male   =     24.6 : 1.0
                 suffix2 = 'ra'           female : male   =     24.3 : 1.0
                 suffix2 = 'ld'             male : female =     23.1 : 1.0
                 suffix2 = 'do'             male : female =     20.7 : 1.0
                 suffix2 = 'os'             male : female =     20.7 : 1.0

In [15]:
# Check errors
errors = []
for (name, tag) in devtest_set:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append( (tag, guess, name) )


In [16]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))


correct=female   guess=male     name=Avis                          
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Bab                           
correct=female   guess=male     name=Barry                         
correct=female   guess=male     name=Betty                         
correct=female   guess=male     name=Binny                         
correct=female   guess=male     name=Bliss                         
correct=female   guess=male     name=Brear                         
correct=female   guess=male     name=Cameo                         
correct=female   guess=male     name=Chris                         
correct=female   guess=male     name=Cris                          
correct=female   guess=male     name=Cyndy                         
correct=female   guess=male     name=Darb                          
correct=female   guess=male     name=Dody                          
correct=female   guess=male     name=Donny      

In [17]:
print("Error count: ", len(errors))

Error count:  109
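
Scanning the error list by eye gets tedious with more than a hundred entries. One possible way to look for patterns is to tally the misclassifications by name ending. The sketch below uses a hypothetical helper `errors_by_suffix` and a handful of entries copied from the listing above:

```python
from collections import Counter

def errors_by_suffix(errors, n=1):
    """Count (correct_tag, guess, suffix) combinations in a list of
    (tag, guess, name) tuples (hypothetical helper)."""
    return Counter((tag, guess, name[-n:].lower()) for tag, guess, name in errors)

# A few misclassified names taken from the dev-test error listing.
sample_errors = [
    ("female", "male", "Barry"),
    ("female", "male", "Betty"),
    ("female", "male", "Binny"),
    ("female", "male", "Chris"),
    ("female", "male", "Cris"),
]

for (tag, guess, suffix), count in errors_by_suffix(sample_errors).most_common():
    print("correct=%-8s guess=%-8s suffix=%-3s count=%d" % (tag, guess, suffix, count))
```

Endings that account for many errors are candidates for a new or refined feature, which can then be checked against the dev-test set.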


## Classifier - DecisionTree

In [18]:
classifier_tree = nltk.DecisionTreeClassifier.train(train_set_fe)

print("train_set: ", nltk.classify.accuracy(classifier_tree, train_set_fe))
print("test_set: ", nltk.classify.accuracy(classifier_tree, test_set_fe))
print("devtest_set: ", nltk.classify.accuracy(classifier_tree, devtest_set_fe))

train_set:  0.9362039170506913
test_set:  0.742
devtest_set:  0.736


In [19]:
errors2 = []
for (name, tag) in devtest_set:
    guess = classifier_tree.classify(gender_features2(name))
    if guess != tag:
        errors2.append( (tag, guess, name) )


In [20]:
for (tag, guess, name) in sorted(errors2):  # errors made by the decision tree
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

In [21]:
print("Error count")
len(errors2)

Error count


132

## Features - B

In [22]:
def gender_features(name):
    name = name.lower()
    return {
        'first_letter': name[0],
        'first2_letter': name[0:2],
        'first3_letter': name[0:3],
        'last_letter': name[-1],
        'last2_letter': name[-2:],
        'last3_letter': name[-3:],
        'last_vowel': (name[-1] in 'aeiou')
    }

In [23]:
gender_features("Rose")

{'first2_letter': 'ro',
 'first3_letter': 'ros',
 'first_letter': 'r',
 'last2_letter': 'se',
 'last3_letter': 'ose',
 'last_letter': 'e',
 'last_vowel': True}

## Vectorize

In [24]:
import numpy as np
features = np.vectorize(gender_features)
print(features(['Rose', 'Mike']))

[ {'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]


In [25]:
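# Extract the features for the entire dataset.
# np.vectorize applies gender_features to every cell of the (name, gender)
# array, so column 0 holds the features computed from the names.
X = np.array(features(names))[:, 0] # X contains the features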

# Get the gender column
y = np.array(names)[:, 1]           # y contains the targets

print("Name: %s, features=%s, gender=%s" % (names[0][0], X[0], y[0]))

Name: Ephraim, features={'first_letter': 'e', 'first2_letter': 'ep', 'first3_letter': 'eph', 'last_letter': 'm', 'last2_letter': 'im', 'last3_letter': 'aim', 'last_vowel': False}, gender=male


In [26]:
# Shuffle and split:  train, dev-test, test 
from sklearn.utils import shuffle
X,y = shuffle(X,y)

X_test, X_dev_test, X_train = X[:500], X[500:1000], X[1000:]
y_test, y_dev_test, y_train = y[:500], y[500:1000], y[1000:]

print("test: " , len(X_test))
print("devtest: ", len(X_dev_test))
print("train: ", len(X_train))

test:  500
devtest:  500
train:  6944


In [27]:
from sklearn.feature_extraction import DictVectorizer

print(features(['Rose', 'Mike']))

vectorizer = DictVectorizer()
vectorizer.fit(X_train)

transform = vectorizer.transform(features(['Rose', 'Mike']))
print(transform)
print(type(transform))
print(transform.toarray()[0][12])
print(vectorizer.feature_names_[12])

[ {'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]
  (0, 169)	1.0
  (0, 1225)	1.0
  (0, 1525)	1.0
  (0, 1740)	1.0
  (0, 2683)	1.0
  (0, 3144)	1.0
  (0, 3165)	1.0
  (1, 130)	1.0
  (1, 1007)	1.0
  (1, 1520)	1.0
  (1, 1654)	1.0
  (1, 2284)	1.0
  (1, 3144)	1.0
  (1, 3165)	1.0
<class 'scipy.sparse.csr.csr_matrix'>
0.0
first2_letter=ap
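
The sparse output above is a one-hot encoding: each distinct `feature=value` string seen during `fit` gets its own column, while a boolean feature such as `last_vowel` keeps a single column. The pure-Python sketch below is a simplified stand-in that mimics this behaviour for illustration only. `column_name`, `fit_columns` and `encode` are hypothetical helpers, not scikit-learn functions, and the real `DictVectorizer` returns a sparse matrix rather than a list:

```python
def column_name(key, value):
    """String values get a 'key=value' column; other values keep the key."""
    return "%s=%s" % (key, value) if isinstance(value, str) else key

def fit_columns(feature_dicts):
    """Collect the sorted column names seen in the training dictionaries."""
    return sorted({column_name(k, v) for d in feature_dicts for k, v in d.items()})

def encode(feature_dict, columns):
    """Turn one feature dictionary into a dense row; unseen columns are ignored."""
    index = {name: i for i, name in enumerate(columns)}
    row = [0.0] * len(columns)
    for key, value in feature_dict.items():
        col = column_name(key, value)
        if col in index:
            row[index[col]] = 1.0 if isinstance(value, str) else float(value)
    return row

train_dicts = [
    {"last_letter": "e", "last_vowel": True},
    {"last_letter": "m", "last_vowel": False},
]
columns = fit_columns(train_dicts)
print(columns)
print(encode({"last_letter": "e", "last_vowel": True}, columns))
```

This also explains why 'Rose' and 'Mike' share the column indices 3144 and 3165 in the output above: both names end in the same letter, and both have `last_vowel` set to `True`.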


## Classifier - DecisionTree

In [28]:
from sklearn.tree import DecisionTreeClassifier
 
DT_classifier = DecisionTreeClassifier()

DT_classifier.fit(vectorizer.transform(X_train), y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [29]:
print(DT_classifier.predict(vectorizer.transform(features(["Sebastian", "Amy"]))))

['male' 'female']


In [30]:
# Accuracy
print("Accuracy on training set: ", DT_classifier.score(vectorizer.transform(X_train), y_train))
print("Accuracy on dev-test set: ", DT_classifier.score(vectorizer.transform(X_dev_test), y_dev_test))
print("Accuracy on test set: ", DT_classifier.score(vectorizer.transform(X_test), y_test))


Accuracy on training set:  0.958381336406
Accuracy on dev-test set:  0.77
Accuracy on test set:  0.77


## Conclusion

* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

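The performance on the test set is essentially the same as on the dev-test set. The Naive Bayes classifier scores 0.782 on both, the scikit-learn decision tree scores 0.77 on both, and the NLTK decision tree scores 0.742 on the test set against 0.736 on the dev-test set. In general we would expect the test score to be somewhat lower: features are chosen by checking progress against the dev-test set, so the model can end up fitted to that set's quirks, while the test set is entirely new to it. In this notebook both sets are random samples of the same shuffled corpus and only two feature sets were compared, so near-identical scores are plausible. The training accuracy is higher in every case (0.812, 0.936 and 0.958) because a model performs better on data it has seen than on data it has not. The large gap for the two decision trees indicates overfitting to the training set.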
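
To judge whether a gap between dev-test and test accuracy is meaningful, it helps to estimate the sampling noise of an accuracy measured on only 500 names. A rough sketch follows, assuming a normal approximation to the binomial distribution (an assumption added here, not part of the original analysis) and a hypothetical helper `accuracy_margin`:

```python
import math

def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n items,
    using the normal approximation to the binomial (z = 1.96)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# Accuracies reported above for the NLTK decision tree (500 names per set).
for label, acc in [("devtest_set", 0.736), ("test_set", 0.742)]:
    print("%s: %.3f +/- %.3f" % (label, acc, accuracy_margin(acc, 500)))
```

With a margin of roughly 0.04 on each set, the 0.006 gap between the two scores is well within sampling noise.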

## Source

https://www.nltk.org/book/ch06.html