# **Title: Project 3 - Natural Language Processing with Python**

---



**Submitted by:** Umais Siddiqui, Banu Boopalan

**Date:** March 31st, 2025

**Course:** Data Science – DATA620

**Video Link:**

**Github Repository:** https://github.com/umais/DATA620/blob/master/Project3/Project3_Natural_Language_Processing_with_Python.ipynb




# **Introduction**

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets:

- 500 words for the test set
- 500 words for the dev-test set
- The remaining 6900 words for the training set.

Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

**Source:** Natural Language Processing with Python, exercise 6.10.2.

# **Loading the Data**

This project uses the Names Corpus from nltk.corpus.names, which contains lists of male and female names. The labeled names are shuffled so that each subset drawn from them is a random sample.

In [13]:
import nltk
import random
from nltk.corpus import names
from nltk.classify import DecisionTreeClassifier, MaxentClassifier, NaiveBayesClassifier
nltk.download('names')

# Load and shuffle the labeled names dataset
labeled_names = [(name, 'male') for name in names.words('male.txt')] + \
                [(name, 'female') for name in names.words('female.txt')]
random.shuffle(labeled_names)

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


# **Splitting the Data**

The dataset is divided into:

- Training set (the remaining names, about 6,900) → used to train the classifier.

- Development-test (dev-test) set (500 names) → used to tune and evaluate improvements.

- Test set (500 names) → used for final performance evaluation.

In [7]:
# Split the dataset into training, dev-test, and test sets
train_names = labeled_names[1000:]
dev_test_names = labeled_names[500:1000]
test_names = labeled_names[:500]
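
Because the shuffle above is unseeded, every run produces a different split and slightly different accuracies. A minimal sketch of a reproducible three-way split follows. The helper name `split_names`, the seed value, and the placeholder data are assumptions for illustration and are not part of the original notebook; the placeholder list has 7,900 entries only to match the sizes quoted in the assignment.

```python
import random

def split_names(labeled, n_test=500, n_dev=500, seed=620):
    """Shuffle a copy of `labeled` with a fixed seed, then split it three ways."""
    shuffled = list(labeled)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # private RNG, so the split is repeatable
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev_test, test

# Placeholder data standing in for the real corpus (hypothetical names).
toy_names = [(f"name{i}", "female" if i % 3 else "male") for i in range(7900)]

train, dev_test, test = split_names(toy_names)
print(len(train), len(dev_test), len(test))  # 6900 500 500
```

With the real data, `split_names(labeled_names)` would take the place of the three slices above, at the cost of producing different numbers from the ones reported in this notebook.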



# **Feature Engineering**

Instead of only using the last letter of a name, we extract:

- Last letter
- Last two letters
- Last three letters
- First letter
- Length of the name
- Vowel count

In [8]:
def gender_features(name):
    return {
        "last_letter": name[-1],
        "last_two": name[-2:],
        "last_three": name[-3:],
        "first_letter": name[0],
        "length": len(name),
        "vowel_count": sum(1 for char in name.lower() if char in 'aeiou')
    }

# Extract features for each dataset
train_set = [(gender_features(n), g) for (n, g) in train_names]
dev_test_set = [(gender_features(n), g) for (n, g) in dev_test_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]
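
To make the feature set concrete, the sketch below repeats the extractor so that it runs on its own and prints the dictionary it produces for two names. The sample names are chosen only for illustration.

```python
def gender_features(name):
    """Same feature extractor as above, repeated so this cell is self-contained."""
    return {
        "last_letter": name[-1],
        "last_two": name[-2:],
        "last_three": name[-3:],
        "first_letter": name[0],
        "length": len(name),
        "vowel_count": sum(1 for char in name.lower() if char in 'aeiou'),
    }

print(gender_features("Alice"))
# {'last_letter': 'e', 'last_two': 'ce', 'last_three': 'ice',
#  'first_letter': 'A', 'length': 5, 'vowel_count': 3}

# For names shorter than three letters the suffix slices simply return
# the whole name, so "last_two" and "last_three" coincide.
print(gender_features("Jo"))
# {'last_letter': 'o', 'last_two': 'Jo', 'last_three': 'Jo',
#  'first_letter': 'J', 'length': 2, 'vowel_count': 1}
```

Note that the suffix features keep the original casing, so a very short name such as "Jo" yields a capitalized suffix that longer names never produce.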

# **Training the Model**

Three classifiers are trained on the extracted features: Naive Bayes, Decision Tree, and Maximum Entropy (MaxEnt).

In [9]:
# Train Naive Bayes classifier
nb_classifier = NaiveBayesClassifier.train(train_set)

# Train Decision Tree classifier
dt_classifier = DecisionTreeClassifier.train(train_set)

# Train Maximum Entropy classifier (GIS algorithm, capped at 10 iterations, training log silenced)
me_classifier = MaxentClassifier.train(train_set, algorithm='GIS', trace=0, max_iter=10)




# **Evaluating Performance**

Each classifier is evaluated on:
- The dev-test set (to fine-tune the model).
- The test set (to estimate performance on unseen names).

The accuracy on both sets is printed, followed by the most informative features of the Naive Bayes classifier.

In [12]:
# Evaluate each classifier on dev-test set
print("Dev-Test Accuracy:")
print(f"Naive Bayes: {nltk.classify.accuracy(nb_classifier, dev_test_set):.4f}")
print(f"Decision Tree: {nltk.classify.accuracy(dt_classifier, dev_test_set):.4f}")
print(f"MaxEnt: {nltk.classify.accuracy(me_classifier, dev_test_set):.4f}")

# Evaluate each classifier on test set
print("\nTest Accuracy:")
print(f"Naive Bayes: {nltk.classify.accuracy(nb_classifier, test_set):.4f}")
print(f"Decision Tree: {nltk.classify.accuracy(dt_classifier, test_set):.4f}")
print(f"MaxEnt: {nltk.classify.accuracy(me_classifier, test_set):.4f}")

# Display the most informative features for Naive Bayes
nb_classifier.show_most_informative_features(10)



Dev-Test Accuracy:
Naive Bayes: 0.8200
Decision Tree: 0.7360
MaxEnt: 0.8180

Test Accuracy:
Naive Bayes: 0.8060
Decision Tree: 0.7380
MaxEnt: 0.8100
Most Informative Features
                last_two = 'na'           female : male   =     93.6 : 1.0
                last_two = 'ia'           female : male   =     84.4 : 1.0
                last_two = 'la'           female : male   =     69.7 : 1.0
             last_letter = 'a'            female : male   =     37.2 : 1.0
                last_two = 'sa'           female : male   =     34.4 : 1.0
             last_letter = 'k'              male : female =     30.1 : 1.0
                last_two = 'rd'             male : female =     30.0 : 1.0
                last_two = 'ta'           female : male   =     29.5 : 1.0
                last_two = 'us'             male : female =     27.6 : 1.0
              last_three = 'ana'          female : male   =     24.6 : 1.0
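
Accuracy alone does not show which names are being missed. In the spirit of the error-analysis step in chapter 6 of the book, the sketch below lists the misclassified names in a labeled set. The helper `misclassified`, the stand-in `LastLetterRule` classifier, and the four sample names are assumptions for illustration; the helper only needs an object with a `classify(featureset)` method, which the NLTK classifiers trained above provide.

```python
def misclassified(classifier, labeled_names, feature_fn):
    """Return sorted (true_label, guess, name) tuples for every wrong prediction."""
    errors = []
    for name, label in labeled_names:
        guess = classifier.classify(feature_fn(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

class LastLetterRule:
    """Stand-in classifier (hypothetical): 'female' if the name ends in a, e, or i."""
    def classify(self, featureset):
        return "female" if featureset["last_letter"] in "aei" else "male"

sample = [("Alice", "female"), ("John", "male"),
          ("Joshua", "male"), ("Megan", "female")]

errors = misclassified(LastLetterRule(), sample,
                       lambda name: {"last_letter": name[-1].lower()})
for label, guess, name in errors:
    print(f"correct={label:<8} guess={guess:<8} name={name}")
# correct=female   guess=male     name=Megan
# correct=male     guess=female   name=Joshua
```

With the objects defined in this notebook, the equivalent call would be `misclassified(nb_classifier, dev_test_names, gender_features)`. Inspecting those errors on the dev-test set, never on the test set, is what should guide new features.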


# **Prediction**

Each trained classifier is then applied to a few sample names.



In [11]:
# Function to predict gender
def predict_gender(name, classifier):
    return classifier.classify(gender_features(name))
# Test the classifiers with sample names
sample_names = ["Alice", "John", "Taylor", "Jordan", "Sam"]
for name in sample_names:
    print(f"\nName: {name}")
    print(f"  Naive Bayes: {predict_gender(name, nb_classifier)}")
    print(f"  Decision Tree: {predict_gender(name, dt_classifier)}")
    print(f"  MaxEnt: {predict_gender(name, me_classifier)}")


Name: Alice
  Naive Bayes: female
  Decision Tree: female
  MaxEnt: female

Name: John
  Naive Bayes: male
  Decision Tree: male
  MaxEnt: male

Name: Taylor
  Naive Bayes: male
  Decision Tree: male
  MaxEnt: male

Name: Jordan
  Naive Bayes: male
  Decision Tree: female
  MaxEnt: male

Name: Sam
  Naive Bayes: male
  Decision Tree: male
  MaxEnt: male


# **Comparison of Test Set vs. Dev-Test Set Performance**

The Naive Bayes and MaxEnt classifiers show only a slight drop in accuracy from the dev-test set to the test set, whereas the Decision Tree classifier maintains almost the same accuracy.

**Naive Bayes:**

- Dev-Test Accuracy: 82.0%

- Test Accuracy: 80.6%

- Drop: 1.4 percentage points

**MaxEnt (Maximum Entropy):**

- Dev-Test Accuracy: 81.8%

- Test Accuracy: 81.0%

- Drop: 0.8 percentage points


**Decision Tree:**

- Dev-Test Accuracy: 73.6%

- Test Accuracy: 73.8%

- Increase: 0.2 percentage points
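
To judge whether these differences are meaningful, it helps to estimate how much an accuracy measured on 500 names can vary by chance. The sketch below uses the normal approximation to the binomial, which is an assumption chosen for simplicity rather than the only reasonable interval. The helper name `accuracy_interval` is hypothetical; the accuracies are the ones reported above.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Standard error and approximate 95% interval for an accuracy measured on n items."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return se, accuracy - z * se, accuracy + z * se

# (dev-test accuracy, test accuracy) as reported in this notebook
results = {
    "Naive Bayes": (0.820, 0.806),
    "MaxEnt": (0.818, 0.810),
    "Decision Tree": (0.736, 0.738),
}

for model, (dev, test) in results.items():
    se, low, high = accuracy_interval(test, 500)
    print(f"{model}: test={test:.3f}  se={se:.3f}  "
          f"interval=[{low:.3f}, {high:.3f}]  dev-test inside: {low <= dev <= high}")
```

Under this approximation the standard error is a little under 2 percentage points for every classifier, and each dev-test accuracy falls inside the interval around the corresponding test accuracy. The observed gaps are therefore no larger than what sampling noise alone could produce.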

# **Is This Expected?**

Yes, this is generally expected because:

**Slight Accuracy Drop (NB & MaxEnt)**

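The slight decrease in accuracy for Naive Bayes and MaxEnt indicates that the dev-test set was representative of the test set, meaning the models generalized well.

A small drop is normal. Both sets are random samples of 500 names from the same shuffled corpus, so an accuracy near 80% carries a standard error of roughly 1.8 percentage points, and gaps of 0.8 to 1.4 points are within sampling noise. A drop is also the expected direction whenever features are tuned against the dev-test set, because the test set plays no part in those choices.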

**Decision Tree's Stability**

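The Decision Tree classifier scored almost the same on both sets (73.6% vs. 73.8%), so its dev-test accuracy was also a reliable estimate of its test accuracy. Its lower accuracy overall is consistent with overfitting, since decision trees tend to memorize patterns in the training data rather than generalize, but confirming this would require comparing against its training-set accuracy, which was not measured here.

**Feature Effectiveness**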

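The most informative features for Naive Bayes (mostly the last two letters and the last letter) align well with common naming conventions; for example, names ending in "na", "ia", or "la" are overwhelmingly female in this corpus. All three classifiers used the same features, so the accuracy gap likely reflects how each model uses them rather than the features themselves.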
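
Overfitting shows up as a gap between training accuracy and held-out accuracy, not as a difference between two held-out sets. The sketch below measures that gap for any object with a `classify(featureset)` method. The helper names and the deliberately overfit `MemorizingClassifier` are assumptions for illustration, and the tiny feature sets are made up.

```python
def accuracy(classifier, labeled_featuresets):
    """Fraction of (featureset, label) pairs the classifier labels correctly."""
    correct = sum(1 for feats, label in labeled_featuresets
                  if classifier.classify(feats) == label)
    return correct / len(labeled_featuresets)

def generalization_gap(classifier, train_set, heldout_set):
    """Return (train accuracy, held-out accuracy, difference)."""
    train_acc = accuracy(classifier, train_set)
    heldout_acc = accuracy(classifier, heldout_set)
    return train_acc, heldout_acc, train_acc - heldout_acc

class MemorizingClassifier:
    """Stand-in (hypothetical): memorizes the training pairs, else returns a default."""
    def __init__(self, train_set, default):
        self._seen = {tuple(sorted(f.items())): label for f, label in train_set}
        self._default = default

    def classify(self, featureset):
        return self._seen.get(tuple(sorted(featureset.items())), self._default)

toy_train = [({"last_letter": "a"}, "female"), ({"last_letter": "k"}, "male"),
             ({"last_letter": "n"}, "male"), ({"last_letter": "e"}, "female")]
toy_heldout = [({"last_letter": "a"}, "female"), ({"last_letter": "y"}, "female"),
               ({"last_letter": "o"}, "male"), ({"last_letter": "i"}, "female")]

memorizer = MemorizingClassifier(toy_train, default="male")
print(generalization_gap(memorizer, toy_train, toy_heldout))  # (1.0, 0.5, 0.5)
```

With the objects from this notebook, `generalization_gap(dt_classifier, train_set, dev_test_set)` would show whether the Decision Tree scores much higher on its own training data than on held-out names, which is the evidence needed to call it overfit.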

# **Conclusion**

Naive Bayes and MaxEnt are the best choices for this task: both reach about 81% accuracy on the test set, compared with about 74% for the Decision Tree.

The Decision Tree struggled, most likely because decision trees overfit easily.

The small drop in accuracy is expected and shows that the dev-test set was a good predictor of test performance.

To improve further, we could:

- Add more features (e.g., syllables, consonant-vowel patterns).

- Use an ensemble model (combine NB + MaxEnt for better results).

- Train on a larger dataset to capture more name variations.
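
Two of these ideas can be sketched briefly: a consonant-vowel pattern feature and a simple majority-vote ensemble. This is a rough illustration under stated assumptions, not a tested improvement. The names `cv_pattern`, `majority_vote`, and `FixedClassifier` are hypothetical, "y" is treated as a consonant to match the vowel_count feature, and the vote uses all three classifiers rather than only NB and MaxEnt so that two labels can never tie.

```python
from collections import Counter

VOWELS = set("aeiou")

def cv_pattern(name):
    """Encode a name as a string of 'V' (vowel) and 'C' (any other letter)."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in name.lower() if ch.isalpha())

def majority_vote(classifiers, featureset):
    """Return the label predicted by the most classifiers."""
    votes = Counter(clf.classify(featureset) for clf in classifiers)
    return votes.most_common(1)[0][0]

class FixedClassifier:
    """Stand-in (hypothetical) that always returns the same label."""
    def __init__(self, label):
        self.label = label

    def classify(self, featureset):
        return self.label

print(cv_pattern("Alice"))  # VCVCV
print(cv_pattern("John"))   # CVCC

voters = [FixedClassifier("male"), FixedClassifier("female"), FixedClassifier("male")]
print(majority_vote(voters, {}))  # male
```

In this notebook the pattern could be added to the dictionary returned by `gender_features`, and the vote could be called as `majority_vote([nb_classifier, dt_classifier, me_classifier], gender_features(name))`. Whether either change helps would have to be checked on the dev-test set.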