# Gender Prediction from Yelp Reviews
**Task**: Given a list of Yelp reviews all written by the same Yelp user, predict whether the user is female or male.

**Approach**: Guess the gender of Yelp users from their first names by matching the names against a name lookup table. Use these guesses as training labels to fit a gender prediction model.
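
The name-matching idea can be sketched roughly as follows. This is only an illustration, not the code inside `GenderPredictionData`: the helper names `build_name_lookup` and `guess_gender`, the `min_ratio` threshold, and the sample counts are all made up for this example. It assumes the name list uses the `name,sex,count` layout of the US Social Security baby-name files such as `yob2019.txt`.

```python
import csv
import io

# Hypothetical excerpt in "name,sex,count" layout; the counts are invented.
SAMPLE_NAME_LIST = """Olivia,F,1000
Liam,M,1000
Riley,F,520
Riley,M,480
"""

def build_name_lookup(lines, min_ratio=0.9):
    """Map lower-cased first names to 'F' or 'M', skipping ambiguous names."""
    counts = {}
    for name, sex, count in csv.reader(lines):
        per_name = counts.setdefault(name.lower(), {"F": 0, "M": 0})
        per_name[sex] += int(count)

    lookup = {}
    for name, per_name in counts.items():
        total = per_name["F"] + per_name["M"]
        majority = max(per_name, key=per_name.get)
        # Keep a name only if one gender clearly dominates (assumed threshold).
        if total > 0 and per_name[majority] / total >= min_ratio:
            lookup[name] = majority
    return lookup

def guess_gender(user_name, lookup):
    """Return 'F', 'M', or None when the name is unknown or ambiguous."""
    return lookup.get(user_name.strip().lower())

lookup = build_name_lookup(io.StringIO(SAMPLE_NAME_LIST))
print(guess_gender("Olivia", lookup))
print(guess_gender("Riley", lookup))
```

Users whose names are missing from the table or are used for both genders get no label in this sketch, so they would simply be left out of the training set. Labels obtained this way are guesses, so some label noise is to be expected.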

In [1]:
import os
from timeit import default_timer as timer
from datetime import timedelta

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

from GenderPredictionData import GenderPredictionData

In [2]:
pkl_file_path = 'gender_prediction_data.pkl'
yelp_dataset_dir = 'data/yelp_dataset'
name_list_path = 'data/names/yob2019.txt'

## Load training and test data
Load the dataset from the pickle file if it was created and stored in an earlier run. Otherwise, build it from the Yelp dataset JSON files and the name list, then pickle it for later runs.

In [3]:
gpd = GenderPredictionData(verbose=True)

if os.path.exists(pkl_file_path):
    gpd.unpickle(pkl_file_path)
else:
    gpd.read_data(yelp_dataset_dir, name_list_path)
    gpd.pickle(pkl_file_path)

Pickled data to gender_prediction_data.pkl


## Prepare Data

The data are prepared using the following pipeline:
- Shuffle all samples (one sample = one Yelp user)
- Drop all users with fewer than 10 reviews
- Shrink the dataset to 20,000 users and balance it
- Shuffle the review list of each user
- Keep only the first 10 reviews of each user
- Sanitize all reviews (remove numbers and punctuation, convert to lower case)
- Concatenate the ten reviews of each user into a single text
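
The per-user steps of this pipeline (filter, shuffle, truncate, sanitize, merge) could look roughly like the sketch below. It is not the implementation in `GenderPredictionData`; the function names `sanitize` and `prepare_user` are hypothetical, and the regular expression is an assumption that keeps only ASCII letters and whitespace.

```python
import random
import re

def sanitize(text):
    """Lower-case the text and replace digits and punctuation with spaces."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace is dropped.
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def prepare_user(reviews, num_reviews=10, rng=None):
    """Return one merged, sanitized text, or None if the user has too few reviews."""
    if len(reviews) < num_reviews:
        return None
    rng = rng or random.Random()
    shuffled = list(reviews)   # copy, so the caller's list stays untouched
    rng.shuffle(shuffled)      # shuffle the review list
    selected = shuffled[:num_reviews]  # keep only the first num_reviews
    return " ".join(sanitize(review) for review in selected)

print(sanitize("Great food!! 10/10, would visit AGAIN."))
```

Shuffling before truncating means the ten reviews kept for a user are a random sample of that user's reviews rather than the ten that happen to come first in the file.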

In [4]:
gpd.shuffle().min_review_num(10).shrink(20000, balance=True).shuffle_reviews().truncate(10).sanitize().merge()

Shuffle samples...
Drop all users with less than 10 reviews...
Shrink dataset to 20000 samples...
Balance dataset...
Shuffle review lists...
Truncate review list to 10 review per sample...
Sanitize review texts...
Merge reviews...


<GenderPredictionData.GenderPredictionData at 0x120a4f5e0>

In [6]:
print(f"Number of samples: {gpd.size}")

Number of samples: 20000


In [7]:
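# Each sample is a (gender, reviews) pair; after merge(), reviews[0] holds the user's concatenated text.
corpus = [reviews[0] for gender, reviews in gpd.data]
labels = [gender for gender, reviews in gpd.data]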

In [8]:
corpus[0]



In [9]:
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.3)

In [10]:
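vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
# Fit the vocabulary and IDF weights on the training corpus only.
X_train = vectorizer.fit_transform(corpus_train).toarray()
# Reuse the fitted vectorizer so the test features match the training columns.
X_test = vectorizer.transform(corpus_test).toarray()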
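
For the classifier to make sense of the test data, column `j` must stand for the same word in `X_train` and `X_test`. In scikit-learn this means calling `fit_transform` on the training corpus and only `transform` on the test corpus. The sketch below illustrates the idea with a deliberately simplified stand-in that uses raw word counts instead of TF-IDF weights; `fit_vocabulary` and `transform` are hypothetical helpers written for this example, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features=None):
    """Learn a word -> column index mapping from the training documents only."""
    counts = Counter(word for doc in train_docs for word in doc.split())
    most_common = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(most_common))}

def transform(docs, vocabulary):
    """Turn documents into count vectors; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            idx = vocabulary.get(word)
            if idx is not None:
                row[idx] += 1
        rows.append(row)
    return rows

train_docs = ["great food great service", "bad service"]
test_docs = ["great pizza", "bad food bad mood"]

vocabulary = fit_vocabulary(train_docs)          # learned from training data only
X_train_demo = transform(train_docs, vocabulary)
X_test_demo = transform(test_docs, vocabulary)   # same columns as X_train_demo

print(vocabulary)
print(X_test_demo)
```

Words that appear only in the test documents (`pizza`, `mood`) are dropped, which is the intended behavior: the model was never trained on them. Fitting a second vocabulary on the test corpus would instead produce columns that no longer line up with the ones the model was trained on.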

In [11]:
start = timer()
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
end = timer()
print(f"Fit svm in {timedelta(seconds=end-start)}")


In [None]:
start = timer()
y_pred = svm_clf.predict(X_test)
end = timer()
print(f"Predicted test data in {timedelta(seconds=end-start)}")