## Project 3: Gender Name Classifier
## CUNY MSDS DATA 620 Web Analytics, Spring 2018
---
### Team 5: Christopher Estevez, Meaghan Burke, Rickidon Singh, Ritesh Lohiya, Rose Koh
### 07/09/2018 (due date)
##### python version: 3.6
---


<h2> Assignment Details </h2>

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by the end of the due date. The group should present their
code and findings in our meetup.

<i>The ability to be an effective member of a virtual team is highly valued in the data science job market. </i>


-------------------------------------------------------------------------------------------------------------

Using any of the three classifiers described in chapter 6 of <b>Natural Language Processing with Python</b>,
and any features you can think of, *build the best name gender classifier* you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test
set, and the remaining 6900 words for the training set. Then, starting with the example name gender
classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are
satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what
you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.



In [1]:
# Download the Names Corpus (only needed once per machine)
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /Users/rosekoh/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [2]:
import random
from nltk.corpus import names

# Build (name, gender) pairs from the Names Corpus.
# Note: this rebinds `names` from the corpus reader to a plain list.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)  # unseeded, so the order differs on every run

In [3]:
names[0:10]

[('Rahal', 'female'),
 ('Joannes', 'female'),
 ('Loreen', 'female'),
 ('Chelsae', 'female'),
 ('Dasie', 'female'),
 ('Imelda', 'female'),
 ('Hank', 'male'),
 ('Kostas', 'male'),
 ('Ulberto', 'male'),
 ('Elyse', 'female')]
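
Because `random.shuffle` is called without a seed, the ten names shown above and every accuracy figure later in the notebook will differ from run to run. Below is a minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for the assignment. The seed value 620, the helper name `seeded_shuffle`, and the toy `labeled` list are illustrative and not taken from the notebook.

```python
import random

# Toy stand-in for the labeled names list (illustrative data only).
labeled = [('Hank', 'male'), ('Imelda', 'female'), ('Kostas', 'male'),
           ('Elyse', 'female'), ('Ulberto', 'male'), ('Loreen', 'female')]

def seeded_shuffle(items, seed=620):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # private RNG, global state untouched
    return shuffled

first = seeded_shuffle(labeled)
second = seeded_shuffle(labeled)
print(first == second)  # True: both calls used the same seed
```

Using a private `random.Random` instance keeps the seed from affecting any other code that relies on the global generator.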

In [76]:
# Adapted from the NLTK book (p. 223) and extended with prefix and suffix features
def gender_features(name):
    name = name.lower()
    return {
        'first_letter': name[0],
        'first2_letter': name[0:2],
        'first3_letter': name[0:3],
        'last_letter': name[-1],
        'last2_letter': name[-2:],
        'last3_letter': name[-3:],
        'last_vowel': (name[-1] in 'aeiou')
    }

In [78]:
gender_features("Rose")

{'first2_letter': 'ro',
 'first3_letter': 'ros',
 'first_letter': 'r',
 'last2_letter': 'se',
 'last3_letter': 'ose',
 'last_letter': 'e',
 'last_vowel': True}
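
Python slicing never raises on short strings, so the prefix and suffix features degrade gracefully for names shorter than three letters: the slices simply return the whole name. Here is a small check that repeats the feature function so the block runs on its own (the name `"Al"` is only an illustration).

```python
def gender_features(name):
    name = name.lower()
    return {
        'first_letter': name[0],
        'first2_letter': name[0:2],
        'first3_letter': name[0:3],
        'last_letter': name[-1],
        'last2_letter': name[-2:],
        'last3_letter': name[-3:],
        'last_vowel': (name[-1] in 'aeiou'),
    }

short = gender_features("Al")
print(short['first3_letter'], short['last3_letter'])  # al al
print(short['last_vowel'])                            # False
```

An empty string would still raise `IndexError` at `name[0]`, so input from outside the corpus may need a guard.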

In [103]:
# Wrap the feature function so it applies element-wise to array-like input
import numpy as np
features = np.vectorize(gender_features)
print(features(['Rose', 'Mike']))

[ {'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]


In [104]:
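# Extract the features for the entire dataset. `names` is converted to an (N, 2)
# array of (name, gender) strings, so column 0 holds the features of each name.
X = np.array(features(names))[:, 0]  # X contains the feature dicts

# Get the gender column
y = np.array(names)[:, 1]            # y contains the targets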

print("Name: %s, features=%s, gender=%s" % (names[0][0], X[0], y[0]))

Name: Rahal, features={'first_letter': 'r', 'first2_letter': 'ra', 'first3_letter': 'rah', 'last_letter': 'l', 'last2_letter': 'al', 'last3_letter': 'hal', 'last_vowel': False}, gender=female
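
`np.vectorize` is essentially a convenience wrapper around a Python loop, and in the cell above it is also applied to the gender column before `[:, 0]` discards those results. An equivalent sketch in plain Python follows, assuming only the name column needs features. Here `labeled` is a toy stand-in for `names`, and the reduced `gender_features` is for brevity only.

```python
def gender_features(name):
    # Reduced feature set, for illustration only.
    name = name.lower()
    return {'last_letter': name[-1], 'last_vowel': name[-1] in 'aeiou'}

labeled = [('Rahal', 'female'), ('Hank', 'male'), ('Elyse', 'female')]

# Features come from the name, targets from the label; no array reshaping needed.
X = [gender_features(name) for name, _ in labeled]
y = [gender for _, gender in labeled]

print(X[0], y[0])
```

Both lists can be passed to `np.array` afterwards if array slicing is still wanted.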


---------

In [105]:
# Shuffle X and y together, then split into test, dev-test, and train
from sklearn.utils import shuffle
X,y = shuffle(X,y)

X_test, X_dev_test, X_train = X[:500], X[500:1000], X[1000:]
y_test, y_dev_test, y_train = y[:500], y[500:1000], y[1000:]

print("test: " , len(X_test))
print("devtest: ", len(X_dev_test))
print("train: ", len(X_train))

test:  500
devtest:  500
train:  6944
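
The assignment text mentions 6900 training words, while the output above reports 6944: the two 500-item slices are taken first and everything left over becomes training data. Here is a quick sketch of the same slicing on placeholder indices, assuming a corpus size of 7944 (the sum of the three counts printed above).

```python
n_total = 7944                      # 500 + 500 + 6944, from the printed sizes
items = list(range(n_total))        # placeholder indices instead of real names

test, dev_test, train = items[:500], items[500:1000], items[1000:]

# The three slices are disjoint and together cover every item exactly once.
assert len(set(test) | set(dev_test) | set(train)) == n_total
print(len(test), len(dev_test), len(train))  # 500 500 6944
```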


In [106]:
from sklearn.feature_extraction import DictVectorizer

print(features(['Rose', 'Mike']))

vectorizer = DictVectorizer()
vectorizer.fit(X_train)

transform = vectorizer.transform(features(['Rose', 'Mike']))
print(transform)
print(type(transform))
print(transform.toarray()[0][12])
print(vectorizer.feature_names_[12])

[ {'first_letter': 'r', 'first2_letter': 'ro', 'first3_letter': 'ros', 'last_letter': 'e', 'last2_letter': 'se', 'last3_letter': 'ose', 'last_vowel': True}
 {'first_letter': 'm', 'first2_letter': 'mi', 'first3_letter': 'mik', 'last_letter': 'e', 'last2_letter': 'ke', 'last3_letter': 'ike', 'last_vowel': True}]
  (0, 169)	1.0
  (0, 1232)	1.0
  (0, 1529)	1.0
  (0, 1746)	1.0
  (0, 2688)	1.0
  (0, 3140)	1.0
  (0, 3161)	1.0
  (1, 130)	1.0
  (1, 1010)	1.0
  (1, 1524)	1.0
  (1, 1660)	1.0
  (1, 2293)	1.0
  (1, 3140)	1.0
  (1, 3161)	1.0
<class 'scipy.sparse.csr.csr_matrix'>
0.0
first2_letter=ap
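
The sparse output above is a one-hot encoding: each distinct `feature=value` string seen during `fit` gets its own column (for example `first2_letter=ap` at index 12), while the boolean `last_vowel` occupies a single numeric column. The following is a simplified pure-Python illustration of that idea, not scikit-learn's implementation; the helper names `fit_columns` and `encode` are hypothetical.

```python
def fit_columns(dicts):
    """Map each feature name (or 'name=value' for strings) to a column index."""
    names = set()
    for d in dicts:
        for key, value in d.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return {name: i for i, name in enumerate(sorted(names))}

def encode(d, columns):
    """Return a dense row; features unseen during fitting are silently dropped."""
    row = [0.0] * len(columns)
    for key, value in d.items():
        if isinstance(value, str):
            name, x = f"{key}={value}", 1.0
        else:
            name, x = key, float(value)
        if name in columns:
            row[columns[name]] = x
    return row

train = [{'last_letter': 'e', 'last_vowel': True},
         {'last_letter': 'k', 'last_vowel': False}]
columns = fit_columns(train)
print(columns)
print(encode({'last_letter': 'e', 'last_vowel': True}, columns))
print(encode({'last_letter': 'z', 'last_vowel': False}, columns))
```

The last call shows that a value never seen during fitting (`last_letter=z`) contributes nothing to the encoded row, which matches how `DictVectorizer.transform` ignores unseen features.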


In [107]:
from sklearn.tree import DecisionTreeClassifier
 
DT_classifier = DecisionTreeClassifier()
DT_classifier.fit(vectorizer.transform(X_train), y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [108]:
print(DT_classifier.predict(vectorizer.transform(features(["Sebastian", "Amy"]))))

['male' 'female']


In [112]:
print("Accuracy on training set: ", DT_classifier.score(vectorizer.transform(X_train), y_train))
print("Accuracy on dev-test set: ", DT_classifier.score(vectorizer.transform(X_dev_test), y_dev_test))
print("Accuracy on test set: ", DT_classifier.score(vectorizer.transform(X_test), y_test))


Accuracy on training set:  0.957661290323
Accuracy on dev-test set:  0.796
Accuracy on test set:  0.776
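
Following the error-analysis approach in chapter 6 of the NLTK book, a natural next step for incremental improvement is to list the dev-test names the model gets wrong and look for patterns that suggest new features. Below is a minimal sketch, assuming the names are kept alongside their gold labels and predictions as parallel lists. The toy data and the helper name `collect_errors` are hypothetical.

```python
def collect_errors(names, gold, predicted):
    """Return sorted (gold, predicted, name) triples for misclassified names."""
    return sorted((g, p, n) for n, g, p in zip(names, gold, predicted) if g != p)

names = ['Hank', 'Elyse', 'Kostas', 'Loreen']
gold = ['male', 'female', 'male', 'female']
predicted = ['male', 'female', 'female', 'male']   # made-up predictions

errors = collect_errors(names, gold, predicted)
accuracy = 1 - len(errors) / len(names)
for g, p, n in errors:
    print(f"correct={g:<8} guess={p:<8} name={n}")
print("accuracy:", accuracy)
```

Only the dev-test set should be inspected this way; looking at test-set errors would leak information into feature design.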


---------

* How does the performance on the test set compare to the performance on the dev-test set? 
  Is this what you'd expect?

In the run shown above, the test-set accuracy (0.776) is slightly lower than the dev-test accuracy (0.796). This is what we would expect. Because we use the dev-test set to check progress and choose features, the model ends up slightly tuned to it, so performance on data that played no part in those decisions (the test set) tends to be a little lower. If we are not overfitting to the dev-test set too much, the drop should be small, and a gap of 0.02 on 500 names is small.

The shuffles are unseeded, so the exact figures change from run to run. Across runs, dev-test accuracy falls roughly in the range of 0.784 to 0.826 and test accuracy in the range of 0.786 to 0.822, which is very similar. Training accuracy is considerably higher (0.958 in the run above), which is also what we would expect, since a model performs better on data it has seen than on data it has not. The size of that gap suggests that the unconstrained decision tree is overfitting the training set.
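
One way to judge whether the gap between dev-test and test accuracy is meaningful is the binomial standard error of an accuracy estimate, $\sqrt{p(1-p)/n}$. Here is a rough sketch using the figures printed in the accuracy cell (0.796 and 0.776 on 500 names each), assuming the two sets are independent random samples.

```python
import math

def accuracy_se(p, n):
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

dev_acc, test_acc, n = 0.796, 0.776, 500   # values from the notebook output

se_dev = accuracy_se(dev_acc, n)
se_test = accuracy_se(test_acc, n)
se_diff = math.sqrt(se_dev ** 2 + se_test ** 2)   # SE of the difference

print(round(se_dev, 3), round(se_test, 3), round(se_diff, 3))
print(abs(dev_acc - test_acc) < 2 * se_diff)
```

With only 500 names per set, each accuracy carries a standard error of roughly 0.02, so a difference of 0.02 between the two sets is well within sampling noise.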