### Bishoy Sokkar
### Project 3: Name Gender Classifier - Natural Language Processing with Python Chapter 6 

## Introduction
To build the best name gender classifier using the Names Corpus from NLTK, I began by splitting the data into test (500), dev-test (500), and training (the remaining 6,944) sets. I started with the Naive Bayes classifier with a last-letter feature, the basic example from Chapter 6, then made incremental improvements by expanding the feature set and experimenting with classifiers (Naive Bayes, Decision Tree, and Maximum Entropy). The Python code below implements this (assuming NLTK and the Names corpus are available), and I explain the results with reference to the typical performance reported in the book and similar implementations.

## Step 1: Loading and Splitting the Data

The first thing I did was import the necessary libraries: `nltk`, the `names` corpus, and `random` for shuffling. I loaded all the male names using `names.words('male.txt')` and all the female names with `names.words('female.txt')`. This gave me about 2,943 male names and 5,001 female names — a total of 7,944 labeled examples. To prepare the data for training, I created a single list called `labeled_names` where each entry was a tuple: the name and its gender, like `('John', 'male')` or `('Emma', 'female')`. I used list comprehensions for this, which felt advanced at first, but once I saw how clean it made the code, I loved it.

Next, I needed to split the data. I set a random seed with `random.seed(42)` so that every time I run the code, I get the same shuffle — this is called reproducibility, and it’s super important when you're learning, because it lets you compare results fairly. Then I shuffled the entire list using `random.shuffle(labeled_names)`. After that, I sliced it into three parts: the first 500 names became my test set (which I promised myself I wouldn’t touch until the very end), the next 500 became my dev-test set (used to check progress as I improved the model), and everything after index 1000 — that’s 6,944 names — became the training set. I printed the sizes to confirm: 500, 500, and 6,944. Perfect.



In [3]:
# Libraries Needed 
import nltk
from nltk.corpus import names
import random

In [4]:
# Load and label names
male_names = names.words('male.txt')
female_names = names.words('female.txt')
labeled_names = [(name, 'male') for name in male_names] + [(name, 'female') for name in female_names]

# Shuffle for random split
random.seed(42)  # For reproducibility
random.shuffle(labeled_names)

# Split into subsets
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]
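
The size check mentioned above is not shown in the cell. Here is a minimal sketch of it that runs without the corpus, assuming a stand-in list of 7,944 placeholder tuples (the `name0`, `name1`, ... entries are synthetic, not real names):

```python
import random

# Stand-in for labeled_names: 7,944 synthetic (name, gender) tuples.
# The real list comes from the NLTK Names corpus; these entries are placeholders.
labeled_names = [(f"name{i}", "male" if i < 2943 else "female") for i in range(7944)]

random.seed(42)  # Same seed as the notebook cell
random.shuffle(labeled_names)

test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]

# Confirm the sizes of the three subsets
print(len(test_names), len(devtest_names), len(train_names))

# The slices should not overlap and should cover the whole list
all_split = test_names + devtest_names + train_names
print(len(set(all_split)) == len(labeled_names))
```

With the real `labeled_names` from the cell above, the same `len()` calls apply unchanged.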

## Step 2: My First Classifier 

I began with a classifier that only looks at the last letter of the name, as described in Chapter 6. I wrote a function called `gender_features_v1` that takes a name and returns a dictionary with one feature, the lowercased last letter: `'last_letter': name[-1].lower()` (so a final 'A' and 'a' are treated the same). Then I transformed both my training and dev-test names into feature sets using list comprehensions again; for example, `train_set_v1 = [(gender_features_v1(n), g) for n, g in train_names]`. This creates a list of `(features, label)` pairs, which is the format NLTK expects.

I trained a Naive Bayes classifier using `nltk.NaiveBayesClassifier.train(train_set_v1)`. It was surprisingly fast! Then I evaluated it on the dev-test set with `nltk.classify.accuracy()`. The result? 75% accuracy. I was surprised that just one letter got 75% of the names right! But I also saw the limits. Names ending in 'a' were usually female, and names ending in 'k' or 'o' were usually male. But what about 'Kim', 'Tracy', or 'Alex'? Those got confused. This told me the model was learning something real, but it needed more context. That’s when I realized this is how machine learning works: start simple, see what breaks, then fix it.

In [6]:
def gender_features_v1(name):
    return {'last_letter': name[-1].lower()}

# Prepare feature sets
train_set_v1 = [(gender_features_v1(n), g) for (n, g) in train_names]
devtest_set_v1 = [(gender_features_v1(n), g) for (n, g) in devtest_names]

# Train Naive Bayes
classifier_v1 = nltk.NaiveBayesClassifier.train(train_set_v1)

# Evaluate on dev-test
accuracy_v1 = nltk.classify.accuracy(classifier_v1, devtest_set_v1)
print(f"Baseline accuracy on dev-test: {accuracy_v1:.2f}")

Baseline accuracy on dev-test: 0.75
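
To see what breaks, it helps to list the names the classifier gets wrong. The sketch below shows one way to collect them. It is not the notebook's code: `last_letter_rule` is a hypothetical stand-in for a trained classifier's `classify` method, and `toy_devtest` is a made-up list, so the snippet runs without NLTK.

```python
def gender_features_v1(name):
    return {'last_letter': name[-1].lower()}

def last_letter_rule(features):
    # Hypothetical stand-in for classifier.classify():
    # names ending in a, e, or i are guessed female, everything else male.
    return 'female' if features['last_letter'] in 'aei' else 'male'

def collect_errors(classify, featurize, labeled):
    """Return sorted (true_label, guess, name) tuples for misclassified names."""
    errors = []
    for name, tag in labeled:
        guess = classify(featurize(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return sorted(errors)

# Made-up examples for illustration, not the project's dev-test set
toy_devtest = [('Emma', 'female'), ('Kim', 'female'), ('Tracy', 'female'),
               ('Alex', 'male'), ('Joshua', 'male')]

errors = collect_errors(last_letter_rule, gender_features_v1, toy_devtest)
for tag, guess, name in errors:
    print(f"correct={tag:<8} guess={guess:<8} name={name}")
```

With the trained model, passing `classifier_v1.classify` and `devtest_names` in place of the stand-ins should list the real dev-test errors.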


## Step 3: Improving the Features — Version 4

After the baseline, I focused on better features. I created `gender_features_v4`, which was much richer. First, I lowercased the name to avoid case sensitivity. Then I added the first letter, the last letter, the last two letters (suffix2), and the last three letters (suffix3), with a safety check: if the name was shorter than three letters, I just used the last two. I also added the length of the name because I suspected female names might be slightly longer on average.

I then wrote a loop over the entire alphabet, 'a' to 'z', and for each letter I added two features: how many times it appears in the name (`count_a`, `count_b`, etc.) and whether it appears at all (`has_a`, `has_b`, etc.). Together with the five features above, that makes 57 features per name! At first, I worried this was too many, but NLTK handled it smoothly, and the accuracy jumped. I tested this feature set with Naive Bayes first and got 82% on the dev-test, a solid improvement from 75%. This taught me a huge lesson: in NLP, **what you feed the model matters more than which model you use**.

In [8]:
def gender_features_v4(name):
    name_lower = name.lower()
    features = {}
    features['first_letter'] = name_lower[0]
    features['last_letter'] = name_lower[-1]
    features['suffix2'] = name_lower[-2:]
    features['suffix3'] = name_lower[-3:] if len(name_lower) > 2 else name_lower[-2:]
    features['length'] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features[f'count_{letter}'] = name_lower.count(letter)
        features[f'has_{letter}'] = (letter in name_lower)
    return features
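
A quick way to inspect what this function produces is to call it on a couple of names. The sketch below repeats the function so it runs on its own. The names 'Emma' and 'Al' are only illustrative inputs.

```python
def gender_features_v4(name):
    name_lower = name.lower()
    features = {}
    features['first_letter'] = name_lower[0]
    features['last_letter'] = name_lower[-1]
    features['suffix2'] = name_lower[-2:]
    features['suffix3'] = name_lower[-3:] if len(name_lower) > 2 else name_lower[-2:]
    features['length'] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features[f'count_{letter}'] = name_lower.count(letter)
        features[f'has_{letter}'] = (letter in name_lower)
    return features

features = gender_features_v4('Emma')
print(len(features))                             # 57: 5 base features + 26 counts + 26 flags
print(features['suffix2'], features['suffix3'])  # ma mma
print(features['count_m'], features['has_z'])    # 2 False

# Two-letter name: suffix3 falls back to the last two letters
short = gender_features_v4('Al')
print(short['suffix3'])                          # al
```

Python slices never raise on short strings (`'al'[-3:]` is already `'al'`), so the length check in `suffix3` is a harmless extra guard and not strictly required.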

In [10]:
train_set_v4 = [(gender_features_v4(n), g) for (n, g) in train_names]
devtest_set_v4 = [(gender_features_v4(n), g) for (n, g) in devtest_names]

classifier_nb = nltk.NaiveBayesClassifier.train(train_set_v4)
accuracy_nb_dev = nltk.classify.accuracy(classifier_nb, devtest_set_v4)
print(f"Naive Bayes accuracy on dev-test: {accuracy_nb_dev:.2f}")

Naive Bayes accuracy on dev-test: 0.82


In [11]:
classifier_dt = nltk.DecisionTreeClassifier.train(train_set_v4)
accuracy_dt_dev = nltk.classify.accuracy(classifier_dt, devtest_set_v4)
print(f"Decision Tree accuracy on dev-test: {accuracy_dt_dev:.2f}")

Decision Tree accuracy on dev-test: 0.74


In [12]:
classifier_me = nltk.MaxentClassifier.train(train_set_v4, max_iter=25)
accuracy_me_dev = nltk.classify.accuracy(classifier_me, devtest_set_v4)
print(f"Max Entropy accuracy on dev-test: {accuracy_me_dev:.2f}")

  ==> Training (25 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.59458        0.632
             3          -0.56857        0.632
             4          -0.54496        0.649
             5          -0.52359        0.695
             6          -0.50430        0.735
             7          -0.48688        0.766
             8          -0.47114        0.785
             9          -0.45690        0.799
            10          -0.44399        0.810
            11          -0.43225        0.817
            12          -0.42156        0.822
            13          -0.41180        0.825
            14          -0.40285        0.830
            15          -0.39463        0.831
            16          -0.38707        0.833
            17          -0.38008        0.834
            18          -0.37361        0.836
            19          -0.36761        0.838
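
A quick sanity check on the first row of the log, as a sketch: assuming the model starts with no learned weights, it gives each of the two labels probability 1/2, and the natural log of 1/2 matches the -0.69315 shown for iteration 1.

```python
import math

# With two labels and no learned weights, each label gets probability 0.5.
# This starting point is an assumption, but it is consistent with the logged value.
initial_log_likelihood = math.log(0.5)
print(f"{initial_log_likelihood:.5f}")  # -0.69315
```

As training proceeds, the log likelihood moves toward zero, which means the model assigns higher probability to the correct labels in the training set.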
  

## Step 4: Testing Three Classifiers

With my best features ready, I wanted to try all three classifiers from Chapter 6. First, I retrained Naive Bayes on the full V4 feature set: 82% on dev-test. Then I tried the Decision Tree with `nltk.DecisionTreeClassifier.train()`. It only reached 74%. I learned later that decision trees can overfit when you have dozens of sparse features like letter counts, because they create overly specific rules. Not ideal here.

Finally, I trained a Maximum Entropy classifier using `nltk.MaxentClassifier.train(train_set_v4, max_iter=25)`. This one was slower and printed a table showing training accuracy improving with each iteration: from 36.8% at iteration 1 to 63.2% at iteration 2 and 83.8% by iteration 19, the last row captured in the log above. I loved watching it learn! MaxEnt ended up with the highest dev-test accuracy: **84.6%**. It handles overlapping and correlated features better than Naive Bayes, which assumes the features are independent given the label. As a beginner, seeing the training log made the "learning" process feel real.
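
The independence assumption matters here because several V4 features overlap: a name ending in 'a' sets `last_letter`, `has_a`, and the end of `suffix2` all at once. The sketch below uses made-up probabilities (not values estimated from the corpus) to show how Naive Bayes becomes overconfident when it counts the same evidence twice.

```python
# Hypothetical numbers for illustration only
p_female, p_male = 0.5, 0.5   # prior probability of each label
p_feat_given_female = 0.4     # P(evidence | female)
p_feat_given_male = 0.02      # P(evidence | male)

def nb_posterior_female(n_copies):
    """P(female | evidence) when the same evidence is multiplied in n_copies times."""
    female_score = p_female * p_feat_given_female ** n_copies
    male_score = p_male * p_feat_given_male ** n_copies
    return female_score / (female_score + male_score)

once = nb_posterior_female(1)
twice = nb_posterior_female(2)
print(f"counted once:  {once:.4f}")
print(f"counted twice: {twice:.4f}")
```

The second figure is higher even though no new information was added. A Maximum Entropy model fits its feature weights jointly, so it can share weight across correlated features rather than stacking them.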

In [14]:
test_set_v4 = [(gender_features_v4(n), g) for (n, g) in test_names]
accuracy_me_test = nltk.classify.accuracy(classifier_me, test_set_v4)
print(f"Max Entropy accuracy on test: {accuracy_me_test:.2f}")

Max Entropy accuracy on test: 0.80


## Step 5: Final Test


For the final check, I transformed the 500 held-out names using `gender_features_v4`, then ran `nltk.classify.accuracy(classifier_me, test_set_v4)`. The result: **80%**, as the cell output above shows. That is 4.6 points below the dev-test score (84.6%). At first, I thought I had made a mistake, but then I realized this can happen. Both sets are random samples of only 500 names from the same data, so a gap of a few points can come from sampling variation. I had also chosen my features and classifier by watching the dev-test score, which tends to make that number a little optimistic. A much larger drop would have pointed to serious overfitting; a gap of this size suggests the model still generalizes reasonably well to names it has never seen.
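
To get a feel for how much an accuracy measured on 500 names can move by chance, here is a rough sketch using the normal approximation to the binomial. It takes the dev-test figure quoted in the text (0.846) and the test figure printed by the cell above (0.80). The helper `accuracy_interval` is my own illustrative function, not part of NLTK.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(acc * (1 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error

intervals = {}
for label, acc in [("dev-test", 0.846), ("test", 0.80)]:
    intervals[label] = accuracy_interval(acc, 500)
    low, high = intervals[label]
    print(f"{label}: {acc:.3f} (roughly {low:.3f} to {high:.3f})")
```

Each interval is about three to four points wide on either side, so differences of a few points between two 500-name samples should be read with caution.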

## Step 6: Peeking Inside the Model

One of the coolest parts was interpreting the model. I ran `classifier_me.show_most_informative_features(10)` and saw things like: names ending in 'tta' are 56 times more likely to be female, 'na' 48 times more likely female, 'a' 42 times. On the male side, 'ard', 'k', and having the letter 'k' anywhere were strong signals. This wasn’t random — the model had learned real linguistic patterns in English names. As a first-time coder, being able to *explain* why the model makes a prediction felt like magic.
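
A statement like "N times more likely to be female" is a ratio of how often a feature value occurs under each label. The sketch below computes such a ratio by hand for the last letter. It uses a made-up eight-name list and a simple add-0.5 smoothing that I chose for illustration. NLTK's own estimates come from the full training set and its own smoothing, so its numbers will differ.

```python
from collections import Counter

# Made-up training examples for illustration, not the project's training split
toy_train = [('Emma', 'female'), ('Anna', 'female'), ('Maria', 'female'), ('Kim', 'female'),
             ('Mark', 'male'), ('Jack', 'male'), ('Marco', 'male'), ('Tim', 'male')]

def likelihood_ratio(labeled, value, numerator, denominator, smoothing=0.5):
    """Ratio P(last_letter == value | numerator) / P(last_letter == value | denominator)."""
    totals = Counter(gender for _, gender in labeled)
    hits = Counter(gender for name, gender in labeled if name[-1].lower() == value)

    def prob(label):
        # Smoothed estimate so that a zero count does not give a zero probability
        return (hits[label] + smoothing) / (totals[label] + 2 * smoothing)

    return prob(numerator) / prob(denominator)

ratio_a = likelihood_ratio(toy_train, 'a', 'female', 'male')
ratio_k = likelihood_ratio(toy_train, 'k', 'male', 'female')
print(f"last_letter='a'  female : male = {ratio_a:.1f} : 1.0")
print(f"last_letter='k'  male : female = {ratio_k:.1f} : 1.0")
```

On this toy list the ratios are small because there are only eight names. With thousands of training names, rare but very consistent endings can produce much larger ratios.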

## Final Reflection

This project taught me that machine learning isn’t about fancy algorithms at first — it’s about clean data, thoughtful features, and disciplined evaluation. I started knowing nothing about NLTK or classification. Now I can build, train, tune, and interpret a real text classifier in Python. And I did it all by following the scientific method: hypothesize, test, improve, validate.