# **Data 620 Project 3**
Seung Min Song, Krutika Patel<br>

03/31/2024

## **Project 3**

Using any of the three classifiers described in this chapter, and any features you can think of, build the best name gender classifier you can. 

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6,900 words for the training set. 

Then, starting with the example name gender classifier, make incremental improvements. 

Use the dev-test set to check your progress. 

Once you are satisfied with your classifier, check its final performance on the test set. 

How does the performance on the test set compare to the performance on the dev-test set? Is this what you’d expect?

## *Supervised Learning*

1. Training: labeled examples (data that has already been classified correctly) → pattern learning
    * Feature: a criterion for identifying and describing patterns in the data
    * Algorithm: a method for computing a classification result from the feature values

2. Test = Prediction: learned pattern → classify new data
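
The two phases above can be illustrated with a deliberately simple learner. This is only a sketch: the names `toy_train` and `toy_predict` and the toy data are hypothetical, and the "algorithm" is a majority vote per last letter, not the Naive Bayes classifier used later in this notebook.

```python
from collections import Counter, defaultdict

def last_letter(name):
    """Feature: the final letter of the name, lowercased."""
    return name[-1].lower()

def toy_train(labeled_examples):
    """Training: count how often each label occurs with each feature value."""
    counts = defaultdict(Counter)
    for name, label in labeled_examples:
        counts[last_letter(name)][label] += 1
    return counts

def toy_predict(counts, name, default="female"):
    """Prediction: return the most common training label for this feature value."""
    seen = counts.get(last_letter(name))
    if not seen:
        return default  # feature value never seen during training
    return seen.most_common(1)[0][0]

# Hypothetical toy data, not drawn from the Names Corpus.
toy_training = [("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
                ("Mark", "male"), ("Erik", "male")]
model = toy_train(toy_training)

print(toy_predict(model, "Julia"))     # last letter 'a'
print(toy_predict(model, "Frederik"))  # last letter 'k'
```

The learned "pattern" here is just a table of counts; a real classifier replaces the majority vote with a probabilistic or tree-based decision rule.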


## Gender Identification

### Name gender classification: feature extraction

Pattern:
* Names ending in a, e, or i → likely female
* Names ending in k, o, r, s, or t → likely male

Feature:
* What is the last letter of the name?

Function definition: 
* Extracts the feature value for a given name (function name: gender_features)

Input:
* string (variable name: word)

Output:
* dictionary

Key: 
* last_letter

Value: 
* word[-1] (the last letter of word)
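
As a quick illustration of the specification above, the feature extractor can be written in one line. This mirrors the `gender_features` definition used later in the notebook, which also lowercases the letter; the sample name is arbitrary.

```python
def gender_features(word):
    # Map a name to its feature dictionary: only the lowercased last letter.
    return {'last_letter': word[-1].lower()}

print(gender_features('Shrek'))  # {'last_letter': 'k'}
```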

In [412]:

import random
import numpy as np
from nltk.corpus import names

random.seed(123)
np.random.seed(123)

# Load the name data and label each name with its gender
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')]
)

# Shuffle the data
random.shuffle(labeled_names)

# Find and print any names that contain digits
names_with_digits = [name for name, gender in labeled_names if any(char.isdigit() for char in name)]
print(names_with_digits[:10])

[]


In [414]:
import random
import numpy as np
from nltk.corpus import names

# Re-seed so the shuffle below is reproducible
random.seed(123)
np.random.seed(123)

# Rebuild the labeled list, shuffle it, and preview the first ten entries
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')]
)
random.shuffle(labeled_names)
print(labeled_names[:10])



[('Cordelie', 'female'), ('Peggie', 'female'), ('Solange', 'female'), ('Rana', 'female'), ('Jessy', 'female'), ('Lelia', 'female'), ('Dorothy', 'female'), ('Ulrick', 'male'), ('Roshelle', 'female'), ('Caitrin', 'female')]


### Name Gender Classification: Construct a classifier → learn from training set

1. Feature set composition: list (variable name: featuresets)
* Each element is a 2-tuple: (features = dictionary, label = gender)
* Example: ({'last_letter': 'n'}, 'male') | ({'last_letter': 'e'}, 'female')
    
     ▶ Note: Specific names such as Aaron and Zoe are reduced to their features.

2. Corpus partitioning
* Test set: first 500 names [:500]
* Development test set: next 500 names [500:1000]
* Training set: the rest [1000:]
    
3. Classifier (variable name: classifier)
* Algorithm: Naive Bayes (nltk.NaiveBayesClassifier)
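
The partitioning in step 2 is plain list slicing on an already shuffled list. A minimal sketch follows; the helper name `split_three_ways` and the placeholder data are assumptions for illustration, not part of the notebook.

```python
import random

def split_three_ways(items, n_test=500, n_dev=500):
    """Slice an already shuffled list into (test, dev_test, train)."""
    test = items[:n_test]
    dev_test = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return test, dev_test, train

# Hypothetical stand-in data: 2,000 placeholder items instead of real names.
rng = random.Random(123)
items = list(range(2000))
rng.shuffle(items)

test, dev_test, train = split_three_ways(items)
print(len(test), len(dev_test), len(train))  # 500 500 1000
```

Because the three slices never overlap, no example used for tuning or final evaluation is ever seen during training.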

### Name Gender Classification: Classifier Performance Evaluation

* Function: nltk.classify.accuracy()
* Input: classifier, evaluation set
* Output: accuracy
* Result: 0.78 on the dev-test set, higher than the 0.5 expected from guessing at random.
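
Accuracy here is simply the fraction of examples whose predicted label matches the true label. The sketch below shows that computation in plain Python; `accuracy`, `vowel_rule`, and the toy set are hypothetical stand-ins, not NLTK's implementation or a trained classifier.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical rule-based stand-in for a trained classifier.
def vowel_rule(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

toy_set = [({'last_letter': 'a'}, 'female'),
           ({'last_letter': 'k'}, 'male'),
           ({'last_letter': 'y'}, 'female'),   # the rule gets this one wrong
           ({'last_letter': 'e'}, 'female')]

print(accuracy(vowel_rule, toy_set))  # 0.75
```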


In [427]:
import nltk

def gender_features(word):
    return {'last_letter': word[-1].lower()}

# Build (features, label) pairs for every name
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# Partition: 500 test, 500 dev-test, remainder for training
test_set = featuresets[:500]
dev_test_set = featuresets[500:1000]
train_set = featuresets[1000:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)

Accuracy on the development test set:  0.78


### Name gender classification: Apply classifier

Function: classifier.classify()
* Input: feature dictionary
* Output: label=gender

In [416]:
classifier.classify(gender_features('Ashley'))


'female'

### Name Gender Classification: Review Features

* Feature: evaluating how informative each feature value is
* Function: classifier.show_most_informative_features()
* Input: n, the number of feature values to show
* Output: the n most informative feature values, printed as likelihood ratios
* Example: in the output that follows, names ending in "a" are about 33 times more likely to be female than male.

In [417]:
classifier.show_most_informative_features()

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     29.2 : 1.0
             last_letter = 'p'              male : female =     18.6 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'd'              male : female =      9.8 : 1.0
             last_letter = 'm'              male : female =      9.2 : 1.0
             last_letter = 'o'              male : female =      8.0 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0


The following function predicts the gender of each name in a list and returns one {name: gender} dictionary per name.

In [418]:
def predict(usernames):
    return [{u: classifier.classify(gender_features(u))} for u in usernames]

In [419]:
input_data = [
    'Trump', 
    'Ashley',
    'Biden',
    'Kim',
    'Mei',
    'April', 
    'Sonny'    
]

Run the predict() function on the sample names:

In [420]:
predict(input_data)

[{'Trump': 'male'},
 {'Ashley': 'female'},
 {'Biden': 'male'},
 {'Kim': 'male'},
 {'Mei': 'female'},
 {'April': 'male'},
 {'Sonny': 'female'}]

### Name gender classification: Modify feature

* First letter of name
* First two letters of name
* Last letter of name
* Count and presence of each letter of the alphabet

On the dev-test set, accuracy is 0.786.

In [426]:

import string

def gender_features2(name):
    name_lower = name.lower() 
    

    features = {
        'first_letter': name_lower[0],
        'first_two_letters': name_lower[:2] if len(name_lower) >= 2 else name_lower[0],
        'last_letter': name_lower[-1],
        **{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        **{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase}
    }
    return features

featuresets = [(gender_features2(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)


Accuracy on the development test set:  0.786


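Next, the per-letter count and presence features are replaced with vowel and consonant totals, and the last two letters are added:

* First letter of name
* First two letters of name
* Last letter of name
* Last two letters of name
* Number of vowels and number of consonants

On the dev-test set, accuracy is 0.816.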

In [424]:

import string

def gender_features3(name):
    name_lower = name.lower()
    vowels = 'aeiou'
    consonants = ''.join(set(string.ascii_lowercase) - set(vowels))
    
    num_vowels = sum(name_lower.count(v) for v in vowels)
    num_consonants = sum(name_lower.count(c) for c in consonants)
    

    features = {
        'first_letter': name_lower[0],
        'first_two_letters': name_lower[:2] if len(name_lower) >= 2 else name_lower[0],
        'last_letter': name_lower[-1],
        'num_vowels': num_vowels,  
        'num_consonants': num_consonants,  
       # **{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
       # **{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase},
        'suffix2': name[-2:],
        
    }
    return features

featuresets = [(gender_features3(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", accuracy)

Accuracy on the development test set:  0.816


### Name gender classification: Modify feature

* First letter of name
* First two letters of name
* Last letter of name
* Last two letters of name
* Length of name

On the dev-test set, accuracy is 0.818; on the test set, it is 0.798.

In [425]:
def gender_features4(name):
    name_lower = name.lower() 
    features = {
        'first_letter': name_lower[0],
        'first_two_letters': name_lower[:2] if len(name_lower) >= 2 else name_lower[0],
        'last_letter': name_lower[-1],
        #**{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        #**{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase},
        'suffix2': name[-2:],
        'length': len(name),
    }
    return features

featuresets = [(gender_features4(n), gender) 
            for (n, gender) in labeled_names]

test_set = featuresets[:500]  
dev_test_set = featuresets[500:1000] 
train_set = featuresets[1000:]  

classifier = nltk.NaiveBayesClassifier.train(train_set)

dev_accuracy = nltk.classify.accuracy(classifier, dev_test_set)
print("Accuracy on the development test set: ", dev_accuracy)

test_accuracy = nltk.classify.accuracy(classifier, test_set)
print("Accuracy on the test set: ", test_accuracy)

Accuracy on the development test set:  0.818
Accuracy on the test set:  0.798


## *Decision Tree*

The feature extraction step is modified again for the decision tree model.

The candidate features were the **first and last letters of the name**, **the count and presence of each letter of the alphabet**, **the last two letters of the name**, and **the length of the name**. Candidates that did not make the final combination are left commented out in the code.

Accuracy on the dev-test set for the combinations tried:

'first_letter' and 'last_letter': 0.782

'first_letter', 'last_letter', and 'length': 0.796

The combination of the first letter, the last letter, and the length of the name was therefore the most accurate on the development test set.
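
scikit-learn's decision tree needs a numeric array rather than feature dictionaries, which is why the code passes the features through `DictVectorizer`. The pure-Python sketch below imitates that kind of encoding: string values become 0/1 indicator columns named `key=value`, and numeric values are kept as they are. The helper `one_hot_encode` is a hypothetical illustration, not scikit-learn's implementation.

```python
def one_hot_encode(feature_dicts):
    """Turn feature dicts into numeric rows with one column per
    'key=value' string feature and one column per numeric feature."""
    columns = sorted({
        f"{key}={value}" if isinstance(value, str) else key
        for features in feature_dicts
        for key, value in features.items()
    })
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for features in feature_dicts:
        row = [0.0] * len(columns)
        for key, value in features.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0   # indicator column
            else:
                row[index[key]] = float(value)       # numeric passthrough
        rows.append(row)
    return columns, rows

columns, rows = one_hot_encode([
    {'first_letter': 'a', 'last_letter': 'a', 'length': 4},  # e.g. "Anna"
    {'first_letter': 'm', 'last_letter': 'k', 'length': 4},  # e.g. "Mark"
])
print(columns)
print(rows)
```

One consequence of this encoding is that the tree splits on individual letters (for example, "is the last letter a?") rather than on the letter feature as a whole.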

In [438]:

import numpy as np
import random
from nltk.corpus import names
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score


random.seed(123)
np.random.seed(123)

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

def gender_features5(name):
    name_lower = name.lower()
    features = {
        'first_letter': name_lower[0],
        #'first_two_letters': name_lower[:2] if len(name_lower) >= 2 else name_lower[0],
        'last_letter': name_lower[-1],
        #**{'count({})'.format(letter): name_lower.count(letter) for letter in string.ascii_lowercase},
        #**{'has({})'.format(letter): (letter in name_lower) for letter in string.ascii_lowercase},
        #'suffix2': name_lower[-2:],
        'length': len(name),
    }
    return features

featuresets = [(gender_features5(n), gender) for (n, gender) in labeled_names]
v = DictVectorizer(sparse=False)
X = v.fit_transform([feature for feature, gender in featuresets])
y = np.array([1 if gender == 'male' else 0 for _, gender in featuresets])

X_train, y_train = X[1000:], y[1000:]
X_dev_test, y_dev_test = X[500:1000], y[500:1000]
X_test, y_test = X[:500], y[:500]

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

accuracy_dev_test = accuracy_score(y_dev_test, clf.predict(X_dev_test))
print("Accuracy on the development test set:", accuracy_dev_test)

accuracy_test = accuracy_score(y_test, clf.predict(X_test))
print("Accuracy on the test set:", accuracy_test)


Accuracy on the development test set: 0.796
Accuracy on the test set: 0.768


## *Test-set vs Dev-test-set*

Typically, while developing and tuning a model, we repeatedly check its performance on the dev-test set and adjust the model in response. The feature choices therefore end up partly fitted to the dev-test set, so it is common for dev-test accuracy to be higher than test accuracy. That is what we see here: the Naive Bayes classifier scores 0.818 on the dev-test set and 0.798 on the test set, and the decision tree scores 0.796 and 0.768.

Had the test set scored higher instead, that would suggest the model had not overfit the dev-test set and generalizes well. Either way, the dev-test set and the test set contain only 500 names each, so the measured accuracies carry noticeable sampling variability, and small differences between them should be interpreted with caution.
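
To put a rough number on that variability, one can treat each prediction as an independent correct/incorrect trial and use the binomial standard error. This is a simplifying assumption for illustration, and the helper name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Naive Bayes accuracies reported above, each measured on 500 names.
for label, acc in [("dev-test", 0.818), ("test", 0.798)]:
    se = accuracy_standard_error(acc, 500)
    print(f"{label}: {acc:.3f} +/- {1.96 * se:.3f} (approx. 95% interval)")
```

With 500 names the standard error is about 0.018, so the approximate 95% interval is a little over ±0.03. The 0.02 gap between dev-test and test accuracy falls within that range.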