# Data 620 - Web Analytics Project 3
Yina Qiao

video link:


Using any of the three classifiers described in chapter 6 of NLP with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

# Intro

This project focuses on classifying names by gender using two classifiers, Naive Bayes and Decision Tree, and evaluates their performance on the training, dev-test, and test sets.

# Data import and wrangling

In [2]:
import random

import nltk
from sklearn.metrics import accuracy_score

nltk.download('names')

# Get data
names = ([(name, 'male') for name in nltk.corpus.names.words('male.txt')] +
         [(name, 'female') for name in nltk.corpus.names.words('female.txt')])
random.shuffle(names)

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


# Train-Test-Split

In [3]:

# Split into test, dev-test, and training sets
test_set = names[:500]
dev_test_set = names[500:1000]
train_set = names[1000:]
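
Because `random.shuffle` is called without a seed, each run produces a different split and slightly different accuracies. The sketch below shows one way the same slicing could be made repeatable. The helper `split_names`, the seed value, and the toy name list are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_names(labeled_names, seed=620, n_test=500, n_dev=500):
    """Shuffle a copy with a fixed seed, then slice into test, dev-test, and train."""
    shuffled = list(labeled_names)
    # A dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return test, dev_test, train

# Hypothetical stand-in for the labeled Names Corpus
toy_names = [("name%d" % i, "male" if i % 2 else "female") for i in range(2000)]

test, dev_test, train = split_names(toy_names)
test_again, _, _ = split_names(toy_names)

print(len(test), len(dev_test), len(train))
print(test == test_again)  # the same seed gives the same split
```

With the real corpus, the list built from `male.txt` and `female.txt` would be passed in place of `toy_names`.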

# Feature Extraction

In [4]:

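# Feature extraction function
def gender_features(name):
    """Map a name to a dict of letter-, suffix-, and length-based features."""
    lowered = name.lower()
    features = {}
    features["firstletter"] = lowered[0]
    features["lastletter"] = lowered[-1]
    # Compare the lowercased letter so the check does not depend on case
    features["last_is_vowel"] = (lowered[-1] in 'aeiouy')
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = (letter in lowered)
        # Index of the first occurrence, or -1 if the letter is absent
        features["first(%s)" % letter] = lowered.find(letter)
    # Note: "suffix2" and "last2" hold the same value, so the two-letter
    # suffix appears twice in the feature set
    features["suffix2"] = lowered[-2:]
    features["last2"] = lowered[-2:]
    # Names shorter than three letters get a leading space in "last3"
    if len(name) >= 3:
        features["last3"] = lowered[-3:]
    else:
        features["last3"] = " " + lowered[-2:]
    features["length"] = len(name)
    return features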

# Create feature sets
test_features = [(gender_features(n), g) for (n, g) in test_set]
dev_test_features = [(gender_features(n), g) for (n, g) in dev_test_set]
train_features = [(gender_features(n), g) for (n, g) in train_set]
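
The feature function stores the two-letter suffix under two keys, `suffix2` and `last2`. Naive Bayes multiplies per-feature likelihoods as if the features were independent, so a duplicated feature is counted twice. The sketch below illustrates that effect with plain arithmetic. The function `posterior_female` and the likelihood values are hypothetical and are not taken from the trained classifier.

```python
def posterior_female(likelihoods_female, likelihoods_male, prior_female=0.5):
    """Naive Bayes posterior P(female | features) from per-feature likelihoods."""
    score_f = prior_female
    score_m = 1.0 - prior_female
    for p_f, p_m in zip(likelihoods_female, likelihoods_male):
        score_f *= p_f
        score_m *= p_m
    return score_f / (score_f + score_m)

# Hypothetical likelihoods for one suffix value:
# P(suffix | female) = 0.6, P(suffix | male) = 0.3
once = posterior_female([0.6], [0.3])
# The same evidence supplied under two feature names
twice = posterior_female([0.6, 0.6], [0.3, 0.3])

print(round(once, 3), round(twice, 3))
```

In this toy case the posterior moves from about 0.667 to 0.8 with no new information, which is one reason removing redundant features can be worth testing on the dev-test set.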

# Classification & Accuracy

In [5]:


# Classify using Naive Bayes and Decision Tree
classifier_NB = nltk.NaiveBayesClassifier.train(train_features)
classifier_DT = nltk.DecisionTreeClassifier.train(train_features)

# Training accuracy
print("Training Accuracy (Naive Bayes):", nltk.classify.accuracy(classifier_NB, train_features))
print("Training Accuracy (Decision Tree):", nltk.classify.accuracy(classifier_DT, train_features))

# Check accuracy on dev-test set
dev_test_actual = [g for (_, g) in dev_test_features]
dev_test_NB_predicted = [classifier_NB.classify(gender_features(n)) for (n, _) in dev_test_set]
dev_test_DT_predicted = [classifier_DT.classify(gender_features(n)) for (n, _) in dev_test_set]

print("Dev-Test Accuracy (Naive Bayes):", accuracy_score(dev_test_actual, dev_test_NB_predicted))
print("Dev-Test Accuracy (Decision Tree):", accuracy_score(dev_test_actual, dev_test_DT_predicted))

# Check accuracy on test set
test_actual = [g for (_, g) in test_features]
test_NB_predicted = [classifier_NB.classify(gender_features(n)) for (n, _) in test_set]
test_DT_predicted = [classifier_DT.classify(gender_features(n)) for (n, _) in test_set]

print("Test Accuracy (Naive Bayes):", accuracy_score(test_actual, test_NB_predicted))
print("Test Accuracy (Decision Tree):", accuracy_score(test_actual, test_DT_predicted))


Training Accuracy (Naive Bayes): 0.8254608294930875
Training Accuracy (Decision Tree): 0.9544930875576036
Dev-Test Accuracy (Naive Bayes): 0.8
Dev-Test Accuracy (Decision Tree): 0.716
Test Accuracy (Naive Bayes): 0.842
Test Accuracy (Decision Tree): 0.746


| Classifier | Training Accuracy | Dev-Test Accuracy | Test Accuracy |
|---|---|---|---|
| Naive Bayes | 0.825 | 0.800 | 0.842 |
| Decision Tree | 0.954 | 0.716 | 0.746 |
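
The exercise asks for incremental improvements checked against the dev-test set. One common way to guide those improvements is to list the names a classifier gets wrong and look for patterns. The sketch below is a minimal version of that idea: `collect_errors`, `toy_classify`, and the six labeled names are hypothetical stand-ins, not the notebook's classifiers or data.

```python
from collections import Counter

def collect_errors(labeled_names, classify):
    """Return (true_label, guess, name) for every misclassified name."""
    errors = []
    for name, label in labeled_names:
        guess = classify(name)
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Hypothetical stand-in classifier: guess 'female' when the name ends in a vowel
def toy_classify(name):
    return "female" if name[-1].lower() in "aeiouy" else "male"

# Hypothetical toy dev-test data
toy_dev_test = [("Anna", "female"), ("John", "male"), ("Joshua", "male"),
                ("Megan", "female"), ("Luke", "male"), ("Carol", "female")]

errors = collect_errors(toy_dev_test, toy_classify)
for label, guess, name in errors:
    print("correct=%-8s guess=%-8s name=%s" % (label, guess, name))

# Count errors by two-letter suffix to see which endings need better features
by_suffix = Counter(name[-2:].lower() for _, _, name in errors)
print(by_suffix.most_common())
```

With the notebook's objects, `classify` could be something like `lambda n: classifier_NB.classify(gender_features(n))` applied to `dev_test_set`.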

# Conclusion



Question: How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

1.  For the Naive Bayes Classifier, test-set accuracy (0.842) is slightly higher than dev-test accuracy (0.8), and both are close to the training accuracy (0.825). Both sets contain names the classifier did not see during training, so the gap between them most likely reflects sampling variation across two 500-name samples rather than a real difference. The classifier generalizes well.
2.  For the Decision Tree Classifier, test-set accuracy (0.746) is also slightly higher than dev-test accuracy (0.716), but both are well below the training accuracy (0.954), indicating that performance drops on unseen data.
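
To judge whether the dev-test and test accuracies really differ, the gaps can be compared with the sampling error expected from 500-name samples. The sketch below uses the binomial standard error and a normal approximation, and it assumes the two sets behave as independent random samples. The helper names are hypothetical; the accuracies are the ones reported above.

```python
import math

def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

def difference_z(acc_a, acc_b, n_a, n_b):
    """z-score of the gap between two accuracies from independent samples."""
    se_a = accuracy_std_error(acc_a, n_a)
    se_b = accuracy_std_error(acc_b, n_b)
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)

n = 500  # size of both the dev-test and test sets
reported = [("Naive Bayes", 0.800, 0.842), ("Decision Tree", 0.716, 0.746)]

z_scores = {}
for label, dev_acc, test_acc in reported:
    z = difference_z(test_acc, dev_acc, n, n)
    z_scores[label] = z
    print("%s: gap=%.3f, z=%.2f" % (label, test_acc - dev_acc, z))
```

Both z-scores fall below 1.96, the usual two-sided 5% threshold, so under these assumptions neither gap is clear evidence of a real difference between the two sets.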



Overall, test-set performance is consistent with dev-test performance for both classifiers, which is what we would expect, since both sets are random samples of held-out names from the same corpus. The Naive Bayes Classifier performs similarly on all three sets (0.825, 0.8, 0.842), while the Decision Tree Classifier drops from 0.954 on the training set to 0.716 and 0.746 on the held-out sets, indicating overfitting and weaker generalization.