# Text Classification Features and NLTK Classification Code #
This example is based on the NLTK book and uses the Names corpus to guess the gender of a name.

In [3]:
import nltk
from nltk.corpus import names
import random

**A feature recognition function**

In [4]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Samantha')

{'last_letter': 'a'}

**Create name datasets**

In [5]:
def create_name_data():
    male_names = [(name, 'male') for name in names.words('male.txt')]
    female_names = [(name, 'female') for name in names.words('female.txt')]
    allnames = male_names + female_names
    
    # Randomize the order of male and female names, and de-alphabetize
    random.shuffle(allnames)
    return allnames

names_data = create_name_data()

**Make Training, Development, and Test Data Sets**

We need a development set to try our features on before testing on the real test set, so we divide the data three ways. In this case we divide the data before applying the feature function, so that we can keep track of the original names.

In [6]:
# This function allows experimentation with different feature definitions
# items is a list of (key, value) pairs from which features are extracted and training sets are made
# Feature sets returned are dictionaries of features

# This function also optionally returns the names of the training, development, 
# and test data for the purposes of error checking

def create_training_sets(feature_function, items, return_items=False):
    # Create the feature sets by calling the function that was passed in.
    # For names data, key is the name and value is the gender.
    featuresets = [(feature_function(key), value) for (key, value) in items]

    # Divide the data into thirds: training, development, and test.
    # Other proportions would work as well.
    third = len(featuresets) // 3

    train_set, dev_set, test_set = featuresets[:third], featuresets[third:third*2], featuresets[third*2:]
    train_items, dev_items, test_items = items[:third], items[third:third*2], items[third*2:]
    if return_items:
        return train_set, dev_set, test_set, train_items, dev_items, test_items
    return train_set, dev_set, test_set
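
To see what the split does, here is a minimal sketch on a made-up list of seven names. The toy data and the helper names `last_letter` and `split_in_thirds` are assumptions for illustration, not part of the notebook. It shows that the feature sets and the raw items are sliced at the same positions, so entry *i* of the development set always corresponds to entry *i* of the development items.

```python
# Toy stand-in for names_data: (name, label) pairs, assumed already shuffled.
toy_items = [("Ann", "female"), ("Bob", "male"), ("Cara", "female"),
             ("Dan", "male"), ("Eve", "female"), ("Finn", "male"),
             ("Gia", "female")]

def last_letter(word):
    # Hypothetical feature function, same idea as gender_features
    return {"last_letter": word[-1]}

def split_in_thirds(feature_function, items):
    # Build the feature sets, then slice features and raw items identically
    featuresets = [(feature_function(key), value) for (key, value) in items]
    third = len(featuresets) // 3
    sets = (featuresets[:third], featuresets[third:2 * third], featuresets[2 * third:])
    raw = (items[:third], items[third:2 * third], items[2 * third:])
    return sets, raw

(train, dev, test), (train_items, dev_items, test_items) = split_in_thirds(last_letter, toy_items)

# The last slice absorbs the remainder when the length is not divisible by 3
print(len(train), len(dev), len(test))  # 2 2 3
print(dev_items[0], dev[0])
```

Because the remainder goes to the final slice, the test set can be up to two items larger than the other two.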

**Train the NLTK classifier on the training data, with the first definition of features**

In [7]:
# Pass in the feature function itself (the function object, not a string)
train_set, dev_set, test_set = create_training_sets(gender_features, names_data)
cl = nltk.NaiveBayesClassifier.train(train_set)

**Test the classifier on some examples**

In [9]:
print ("Carl: " + cl.classify(gender_features('Carl')))
print ("Carla: " + cl.classify(gender_features('Carla')))
print ("Carly: " + cl.classify(gender_features('Carly')))
print ("Carlo: " + cl.classify(gender_features('Carlo')))
print ("Carlos: " + cl.classify(gender_features('Carlos')))

Carl: male
Carla: female
Carly: female
Carlo: male
Carlos: male


In [10]:
print ("Carli: " + cl.classify(gender_features('Carli')))
print ("Carle: " + cl.classify(gender_features('Carle')))
print ("Charles: " + cl.classify(gender_features('Charles')))
print ("Carlie: " + cl.classify(gender_features('Carlie')))
print ("Charlie: " + cl.classify(gender_features('Charlie')))

Carli: female
Carle: female
Charles: male
Carlie: female
Charlie: female


**Run the NLTK evaluation function on the development set**

In [11]:
print ("%.3f" % nltk.classify.accuracy(cl, dev_set))

0.751
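
The accuracy figure is simply the fraction of labeled examples for which the predicted label matches the gold label. A minimal sketch of that computation follows; the `accuracy` helper, the rule-based `rule_classifier`, and the four toy examples are all assumptions made up for illustration and are not NLTK code.

```python
def accuracy(classify, gold):
    """Fraction of (features, label) pairs whose predicted label matches the gold label."""
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

def rule_classifier(features):
    # Hypothetical hand-written rule: names ending in a, e, or i are guessed female
    return "female" if features["last_letter"] in "aei" else "male"

toy_dev = [
    ({"last_letter": "a"}, "female"),  # guessed female: correct
    ({"last_letter": "k"}, "male"),    # guessed male: correct
    ({"last_letter": "e"}, "male"),    # guessed female: wrong
    ({"last_letter": "n"}, "female"),  # guessed male: wrong
]

print("%.3f" % accuracy(rule_classifier, toy_dev))  # 0.500
```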


**Run the NLTK feature inspection function on the classifier**

In [12]:
cl.show_most_informative_features(15)

Most Informative Features
             last_letter = 'a'            female : male   =     31.0 : 1.0
             last_letter = 'k'              male : female =     15.4 : 1.0
             last_letter = 'm'              male : female =     12.5 : 1.0
             last_letter = 'f'              male : female =     12.0 : 1.0
             last_letter = 'v'              male : female =      8.6 : 1.0
             last_letter = 'o'              male : female =      8.4 : 1.0
             last_letter = 'd'              male : female =      7.7 : 1.0
             last_letter = 'w'              male : female =      7.4 : 1.0
             last_letter = 'r'              male : female =      7.0 : 1.0
             last_letter = 's'              male : female =      4.4 : 1.0
             last_letter = 'g'              male : female =      4.4 : 1.0
             last_letter = 't'              male : female =      4.0 : 1.0
             last_letter = 'i'            female : male   =      3.8 : 1.0
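
Each row above is a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below computes such ratios from counts on invented toy data. The add-0.5 smoothing and the helper names are assumptions for illustration, and NLTK's own probability estimates may differ in detail.

```python
from collections import Counter, defaultdict

# Toy (last_letter, label) observations; invented for illustration only.
toy = [("a", "female"), ("a", "female"), ("a", "female"), ("e", "female"),
       ("k", "male"), ("k", "male"), ("a", "male"), ("o", "male")]

counts = defaultdict(Counter)          # label -> Counter of feature values
for value, label in toy:
    counts[label][value] += 1
values = {value for value, _ in toy}   # every feature value observed

def smoothed_prob(value, label, gamma=0.5):
    """Estimate P(value | label) with add-gamma smoothing (an assumption here)."""
    total = sum(counts[label].values())
    return (counts[label][value] + gamma) / (total + gamma * len(values))

def likelihood_ratio(value):
    """Return (ratio, more_likely_label, less_likely_label) for one feature value."""
    probs = {label: smoothed_prob(value, label) for label in counts}
    hi = max(probs, key=probs.get)
    lo = min(probs, key=probs.get)
    return probs[hi] / probs[lo], hi, lo

for value in sorted(values):
    ratio, hi, lo = likelihood_ratio(value)
    print("last_letter = %r  %6s : %-6s = %5.1f : 1.0" % (value, hi, lo, ratio))
```

Smoothing matters here: without it, a value never seen with one label (such as 'k' for female names in the toy data) would give a division by zero.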

**Let's add some more features to improve results**

In [181]:
def gender_features2(word):
    features = {}
    word = word.lower()
    features['last'] = word[-1]
    features['first'] = word[:1]
    features['second'] = word[1:2] # get the 'h' in Charlie?
    return features
gender_features2('Samantha')

def gender_features3(word):
    features = {}
    word = word.lower()
    features['first'] = word[:1]
    features['second'] = word[1:2] # get the 'h' in Charlie?
    features['firsttwo'] = word[:2]
    features['lasttwo'] = word[-2:]
    features['lastthree'] = word[-3:]
    features["second_is_vowel"] = word[1:2] in ['a', 'e', 'i', 'o', 'u']
    features["three_and_four"] = word[2:4] # account for ue, ie, etc.
    return features

gender_features3('Samantha')

{'first': 's',
 'firsttwo': 'sa',
 'lastthree': 'tha',
 'lasttwo': 'ha',
 'second': 'a',
 'second_is_vowel': True,
 'three_and_four': 'ma'}

**We wrote the code so that we can easily pass in the new feature function. Let's see if this improves the results on the development set.**

In [182]:
train_set2, dev_set2, test_set2 = create_training_sets(gender_features3, names_data)
cl2 = nltk.NaiveBayesClassifier.train(train_set2)
print ("%.3f" % nltk.classify.accuracy(cl2, dev_set2))

0.815


**Let's hand-check some of the harder cases. Oops: some are right, but some are now wrong.**

In [183]:
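# Caution: cl2 was trained on gender_features3, but these calls pass the output of
# gender_features, whose only key ('last_letter') cl2 never saw in training.
# To test the new features, call cl2.classify(gender_features3(name)) instead.
print ("Carli: " + cl2.classify(gender_features('Carli')))
print ("Carle: " + cl2.classify(gender_features('Carle')))
print ("Charles: " + cl2.classify(gender_features('Charles')))
print ("Carlie: " + cl2.classify(gender_features('Carlie')))
print ("Charlie: " + cl2.classify(gender_features('Charlie')))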

Carli: female
Carle: female
Charles: female
Carlie: female
Charlie: female
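
When hand-checking a classifier, it is worth confirming that the feature dictionary passed to `classify` uses the same keys as the feature function the classifier was trained with. To my understanding, NLTK's Naive Bayes classifier ignores feature names it never saw during training, so a mismatch does not raise an error and the prediction then rests on whatever is left. The sketch below is a small check of key overlap. The helper `shared_keys` is a hypothetical name introduced for illustration.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

def gender_features3(word):
    features = {}
    word = word.lower()
    features['first'] = word[:1]
    features['second'] = word[1:2]
    features['firsttwo'] = word[:2]
    features['lasttwo'] = word[-2:]
    features['lastthree'] = word[-3:]
    features['second_is_vowel'] = word[1:2] in ['a', 'e', 'i', 'o', 'u']
    features['three_and_four'] = word[2:4]
    return features

def shared_keys(train_fn, query_fn, sample='Carli'):
    # Feature names produced by both functions for a sample name
    return set(train_fn(sample)) & set(query_fn(sample))

# No overlap: a classifier trained on gender_features3 gets nothing usable
print(shared_keys(gender_features3, gender_features))        # set()
# Full overlap when the same function is used for training and querying
print(len(shared_keys(gender_features3, gender_features3)))  # 7
```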


**We can see the influence of some of the new features**

In [184]:
cl2.show_most_informative_features(15)

Most Informative Features
                 lasttwo = 'na'           female : male   =     58.4 : 1.0
                 lasttwo = 'rd'             male : female =     36.9 : 1.0
                 lasttwo = 'us'             male : female =     25.9 : 1.0
                 lasttwo = 'ta'           female : male   =     19.3 : 1.0
                 lasttwo = 'sa'           female : male   =     15.1 : 1.0
                 lasttwo = 'rt'             male : female =     14.9 : 1.0
                 lasttwo = 'ld'             male : female =     14.9 : 1.0
                 lasttwo = 'ra'           female : male   =     14.2 : 1.0
               lastthree = 'lle'          female : male   =     13.1 : 1.0
                firsttwo = 'wa'             male : female =     12.9 : 1.0
                 lasttwo = 'am'             male : female =     12.7 : 1.0
               lastthree = 'tta'          female : male   =     12.2 : 1.0
                 lasttwo = 'as'             male : female =     11.6 : 1.0

**Below we use code from the NLTK chapter to collect the errors, so that we can inspect the names that were classified wrongly. We use the option of the training-set function that lets us get the original names for the training and development sets.**

In [186]:
train_set3, dev_set3, test_set3, train_items, dev_items, test_items = create_training_sets(gender_features3, names_data, True)
cl3 = nltk.NaiveBayesClassifier.train(train_set3)
# This is code from the NLTK chapter
errors = []
for (name, label) in dev_items:
    print(str(name) + " " + str(label))
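    # Caution: cl3 was trained on gender_features3, but gender_features2 is applied here.
    # The two functions share only the 'first' and 'second' keys;
    # use gender_features3(name) to evaluate cl3 on its own features.
    guess = cl3.classify(gender_features2(name))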
    if guess != label:
        errors.append( (label, guess, name) )

Dagmar female
Shaine male
Jordan male
Halimeda female
Dody female
Vi female
Andrus male
Dugan male
Porter male
Bing male
Reggie male
Salvidor male
Audrey female
Nedda female
Darice female
Marjory female
Fernande female
Kerrill female
Nike female
Novelia female
Levy male
Germaine male
Lorettalorna female
Butler male
Morty male
Shelby male
Nelle female
Velma female
Oralie female
Christopher male
Coral female
Andrzej male
Ruben male
Kaye female
JoAnne female
Blancha female
Jeremiah male
Daisi female
Ninon female
Valida female
Alexis male
Etty female
Loreen female
Harvey male
Anallese female
Yank male
Ripley male
Brana female
Geri female
Westleigh male
Mellisa female
Dari female
Benni female
Carla female
Raynard male
Dede female
Cristopher male
Hilliard male
Leanne female
Kia female
Sivert male
Lemar male
Josi female
Ulrich male
Ari male
Florian male
Roanne female
Marlyn female
Luella female
Remus male
Veda female
Wade male
Bessy female
Lazar male
Nicole female
Paco male
Meara female
Verla

**Print out the correct vs. the guessed answer for the errors, in order to inspect those that were wrong.**

In [187]:
for (tag, guess, name) in sorted(errors): 
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Aphrodite                     
correct=female   guess=male     name=April                         
correct=female   guess=male     name=Bunnie                        
correct=female   guess=male     name=Burta                         
correct=female   guess=male     name=Gui                           
correct=female   guess=male     name=Guinevere                     
correct=female   guess=male     name=Gunilla                       
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Gusella                       
correct=female   guess=male     name=Gussi                         
correct=female   guess=male     name=Gusti                         
correct=female   guess=male     name=Gusty                         
correct=female   guess=male     name=Hyacintha                     
correct=female   guess=male     name=Ophelie                       
correct=female   guess=male     name=Oprah      

### Testing new Features

In [191]:
import pandas as pd
df = pd.DataFrame(errors)
df.columns = ["correct", "guess", "name"]
df["name_length"] = df["name"].map(len)
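# Count of lowercase vowels (case-sensitive, so a capitalized initial vowel is not counted)
df["vowel_count"] = df["name"].map(lambda x: sum(char in 'aeiou' for char in x))
df["ends_in_vowel"] = df["name"].map(lambda x: x[-1:] in ['a', 'e', 'i', 'o', 'u'])
# Average only the numeric columns; the string columns cannot be averaged
df.groupby("correct")[["name_length", "vowel_count", "ends_in_vowel"]].mean()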

| correct | name_length | vowel_count | ends_in_vowel |
|---|---|---|---|
| female | 6.102362 | 2.346457 | 0.685039 |
| male | 5.881481 | 2.065432 | 0.253086 |


**Exercise:** Rewrite the feature function above to add some additional features, then rerun the classifier on the development set to see whether it improves or degrades results. Check the results on the dev items to see where you still make errors, and add or remove features accordingly. When you are satisfied with the results, *freeze your algorithm*, **run it one time only on the test collection**, and report the results with the evaluation function.

Ideas for features:
* name length
* pairs of characters
* your idea goes here

### Explanation
I tried various features, such as the number of vowels in the name, the length of the name, and whether the name ends in a vowel. However, I didn't have much luck with any of them. I analyzed the errors by exploring them in pandas and trying various features to see whether they would differentiate between the genders.
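
For reference, the kinds of features described above could be written as a feature function along these lines. This is only a sketch: the name `gender_features4` and the exact set of keys are hypothetical and are not the function that produced the results in this notebook.

```python
VOWELS = set("aeiou")

def gender_features4(word):
    # Hypothetical variant: suffix features plus the three experimental ones
    word = word.lower()
    return {
        "lasttwo": word[-2:],
        "name_length": len(word),
        "vowel_count": sum(ch in VOWELS for ch in word),
        "ends_in_vowel": word[-1:] in VOWELS,
    }

print(gender_features4("Aphrodite"))
```

Lowercasing first means an initial capital vowel is counted too. Such a function can be passed to `create_training_sets` in the same way as the others and scored on the development set.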

In [195]:
def gender_features3(word):
    features = {}
    word = word.lower()
    features['first'] = word[:1]
    features['second'] = word[1:2] # get the 'h' in Charlie?
    features['firsttwo'] = word[:2]
    features['lasttwo'] = word[-2:]
    features['lastthree'] = word[-3:]
    features["second_is_vowel"] = word[1:2] in ['a', 'e', 'i', 'o', 'u']
    features["three_and_four"] = word[2:4] # account for ue, ie, etc.
    return features

In [196]:
train_set2, dev_set2, test_set2 = create_training_sets(gender_features3, names_data)
cl2 = nltk.NaiveBayesClassifier.train(train_set2)
print ("%.3f" % nltk.classify.accuracy(cl2, test_set2))

0.802
