# **Data 620 Project 3**
Seung Min Song, Krutika Patel<br>

03/31/2024

## **Project 3**

Using any of the three classifiers described in this chapter, and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6,900 words for the training set. 

Then, starting with the example name gender classifier, make incremental improvements. 

Use the devtest set to check your progress. 

Once you are satisfied with your classifier, check its final performance on the test set. 

How does the performance on the test set compare to the performance on the dev-test set? Is this what you’d expect?

## Supervised Classification

1. Training: labeled data (examples that are already classified) → pattern learning
    * Feature: a criterion for identifying and describing patterns in the data
    * Algorithm: a method for computing a classification from feature values

2. Test = Prediction: learned pattern → classify new data


## Gender Identification

### Name gender classification: feature extraction

Pattern:
* Names ending in a, e, i → likely female
* Names ending in k, o, r, s, t → likely male

Feature:
* What is the last letter of the name?

Function definition:
* Extract the feature value from a given name (function name: gender_features)

Input:
* string (variable name: word)

Output:
* dictionary

Key:
* last_letter

Value:
* word[-1] (the last letter of word, lowercased)

In [229]:
import re
def gender_features(word):
    # Strip any digits, then use the lowercased final character as the only feature.
    return {'last_letter': re.sub('[0-9]', '', word)[-1].lower()}

In [230]:
import random
import numpy as np
import nltk
from nltk.corpus import names

# Fix the seeds so the shuffle (and therefore the split) is reproducible.
random.seed(123)
np.random.seed(123)

# Label every name with the gender of the corpus file it came from.
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
print(labeled_names[:10])

[('Cordelie', 'female'), ('Peggie', 'female'), ('Solange', 'female'), ('Rana', 'female'), ('Jessy', 'female'), ('Lelia', 'female'), ('Dorothy', 'female'), ('Ulrick', 'male'), ('Roshelle', 'female'), ('Caitrin', 'female')]


### Name Gender Classification: Construct a classifier → learn from training set

1. Feature set composition: list (variable name: featuresets)
* Each element is a 2-tuple: (feature dictionary, gender label)
* Example: ({'last_letter': 'n'}, 'male') | ({'last_letter': 'e'}, 'female')
    
     ▶ Note: Specific names such as Aaron and Zoe are reduced to their features.

2. Corpus partitioning
* Test set: first 500 ([:500])
* Development test set: next 500 ([500:1000])
* Training set: the rest ([1000:])
    
3. Classifier (variable name: classifier)
* Algorithm: Naive Bayes (nltk.NaiveBayesClassifier)

### Name Gender Classification: Classifier Performance Evaluation

* Function: nltk.classify.accuracy()
* Input: classifier, evaluation set
* Output: accuracy
* Result: 0.78, higher than the 0.5 expected from guessing between the two labels at random.
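
As a rough illustration of what the accuracy figure means, the sketch below computes accuracy by hand and compares it with a majority-class baseline (always predicting the most frequent label). The helper names `accuracy` and `majority_baseline` and the toy labels are hypothetical and are not part of the notebook; the point is only that when one label is more common than the other, the majority baseline can be a stricter reference than 0.5.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def majority_baseline(gold):
    """Most frequent label and the accuracy of always predicting it."""
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)

# Hypothetical toy labels, not drawn from the Names Corpus.
gold = ['female', 'female', 'male', 'female', 'male']
predicted = ['female', 'male', 'male', 'female', 'female']

print("accuracy:", accuracy(predicted, gold))
print("majority baseline:", majority_baseline(gold))
```

With the notebook's data, the same baseline could be computed from the gold labels of dev_test_set, for example `majority_baseline([label for (_, label) in dev_test_set])`.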


In [291]:
import nltk

# Build (feature dictionary, label) pairs for every name.
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# Split: 500 for test, 500 for dev-test, the rest for training.
test_set = featuresets[:500]
dev_test_set = featuresets[500:1000]
train_set = featuresets[1000:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)

Accuracy on the development test set:  0.78


### Name gender classification: Apply classifier

Function: classifier.classify()
* Input: feature dictionary
* Output: label=gender

In [232]:
classifier.classify(gender_features('Ashley'))
#classifier.classify(gender_features('John'))

'female'

### Name Gender Classification: Review Features

* Feature: evaluation of how informative each feature is
* Function: classifier.show_most_informative_features()
* Input: n, the number of features to show
* Output: the top n most informative features
* Example: names ending in a are female about 33 times more often than they are male.

In [234]:
classifier.show_most_informative_features()

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     29.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0


Implement a function that predicts gender and returns the result as shown below.

In [235]:
def predict(usernames):
    return [{u: classifier.classify(gender_features(u))} for u in usernames]

In [236]:
input_data = [
    'Trump', 
    'Ashley',
    'Biden',
    'David',
    'Seungmin',
    'Krutika',     
]

Executing the predict() function 

In [237]:
predict(input_data)

[{'Trump': 'male'},
 {'Ashley': 'female'},
 {'Biden': 'male'},
 {'David': 'male'},
 {'Seungmin': 'male'},
 {'Krutika': 'female'}]

### Name Gender Classification: Performance Comparison

gender_features() vs gender_features2()
1. Construct a feature set based on gender_features2()
2. Train a new classifier on the same training set
3. Apply the new classifier to the same dev-test set
4. Accuracy: 0.78 → 0.782

Adding features increased dev-test accuracy slightly.

In [292]:
import string

def gender_features2(name):
    name_lower = name.lower() 
    features = {
        'first_letter': name_lower[0],
        'last_letter': name_lower[-1],
        **{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        **{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase}
    }
    return features


In [293]:
featuresets = [(gender_features2(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)


Accuracy on the development test set:  0.782


When vowel and consonant counts were added as features, dev-test accuracy increased from 0.782 to 0.788.

In [254]:
import string

def gender_features3(name):
    name_lower = name.lower()
    vowels = 'aeiou'
    consonants = ''.join(set(string.ascii_lowercase) - set(vowels))
    
    num_vowels = sum(name_lower.count(v) for v in vowels)
    num_consonants = sum(name_lower.count(c) for c in consonants)
    
    features = {
        'first_letter': name_lower[0],
        'last_letter': name_lower[-1],
        'num_vowels': num_vowels,  
        'num_consonants': num_consonants,  
        **{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        **{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase},
        
    }
    return features

In [284]:
featuresets = [(gender_features3(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)

Accuracy on the development test set:  0.788
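
One way to choose the next feature is to look at the dev-test names the current classifier gets wrong and search for patterns in them. The sketch below collects those errors. The helper `collect_errors`, the rule-based `toy_classify`, and the four example names are hypothetical stand-ins so that the block runs on its own without the corpus.

```python
def collect_errors(labeled_items, feature_fn, classify_fn):
    """Return sorted (gold, guess, name) triples for every misclassified name."""
    errors = []
    for name, gold in labeled_items:
        guess = classify_fn(feature_fn(name))
        if guess != gold:
            errors.append((gold, guess, name))
    return sorted(errors)

def last_letter_features(name):
    """Minimal feature extractor: only the lowercased last letter."""
    return {'last_letter': name[-1].lower()}

def toy_classify(features):
    """Hypothetical stand-in for classifier.classify, using a fixed rule."""
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Hypothetical dev-test items, not taken from the Names Corpus.
dev_names = [('Anna', 'female'), ('John', 'male'),
             ('Luca', 'male'), ('Megan', 'female')]

errors = collect_errors(dev_names, last_letter_features, toy_classify)
for gold, guess, name in errors:
    print(f"correct={gold:<8} guess={guess:<8} name={name}")
```

In the notebook, the equivalent call would be `collect_errors(labeled_names[500:1000], gender_features3, classifier.classify)`, since the dev-test slice of featuresets is built from the same positions in labeled_names.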


### Name Gender Classification: Modify Features

* Last 1 letter of name (suffix1)
* Last 2 letters of name (suffix2)
* Length of name (commented out in the code below)

With these features, accuracy is 0.794 on the dev-test set and 0.81 on the test set.

In [280]:
def gender_features4(name):
    name_lower = name.lower() 
    features = {
        'first_letter': name_lower[0],
        'last_letter': name_lower[-1],
        **{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        **{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase},
        'suffix1': name[-1:], 
        'suffix2': name[-2:],
        #'length': len(name),
    }
    return features

In [294]:
featuresets = [(gender_features4(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

dev_accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", dev_accuracy)

test_accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy on the test set: ", test_accuracy)

Accuracy on the development test set:  0.794
Accuracy on the test set:  0.81


### Test-set vs Dev-test-set

Typically, while developing and tuning a model, we repeatedly check its performance on the dev-test set and adjust the features accordingly. The model can therefore end up fitted to the dev-test set, so it is common for dev-test accuracy to be higher than test accuracy.

Here the opposite happened: accuracy was 0.81 on the test set and 0.794 on the dev-test set. This suggests that the model has not overfit the dev-test set and generalizes reasonably well. However, both sets contain only 500 names, so accuracy estimates vary noticeably from one sample to another, and a gap of 0.016 (8 names out of 500) is small enough to be explained by chance.
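
To give a sense of how much variability 500 names allows, the sketch below computes an approximate 95% interval around each reported accuracy. It assumes the normal approximation to the binomial distribution and treats each name as an independent trial; the helper name `accuracy_interval` is hypothetical and the result is a back-of-the-envelope estimate rather than a formal test.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate interval for an accuracy measured on n items (normal approximation)."""
    se = math.sqrt(acc * (1 - acc) / n)  # binomial standard error
    return acc - z * se, acc + z * se

# Accuracies reported above, each measured on 500 names.
results = {'dev-test': 0.794, 'test': 0.81}

for name, acc in results.items():
    low, high = accuracy_interval(acc, 500)
    print(f"{name}: {acc:.3f} (roughly {low:.3f} to {high:.3f})")
```

Under these assumptions each interval is about 0.07 wide and contains the other set's accuracy, which is consistent with reading the 0.016 gap as sampling noise.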