# Data Preparation
This notebook prepares the datasets used for the gender prediction task. A separate dataset is created for each number of reviews per user (e.g. a dataset in which every sample consists of 5 reviews by one user), and each dataset is split into a training and a test subset.

### The datasets are created as follows:
- Read review data from the Yelp JSON files and group it by user
- Guess each user's gender from a list of children's names
- Skip all users whose gender is unknown or ambiguous (name used for both genders)
- Sanitize reviews (replace common punctuation, special characters, and standalone numbers with spaces, then lowercase the text)
- Shuffle data (users)
- Select balanced (same number of M and F samples) training and test datasets (pick only users with at least the required number of reviews)
- Serialize data
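
As a rough illustration of the name-based gender guess in the steps above, the sketch below assumes the name list holds one `name,sex,count` record per line and labels a name only when it is listed under a single sex. The helpers `build_name_index` and `guess_gender` are hypothetical. They are not the API of the `GenderGuesser` module imported in this notebook, whose actual rules may differ (for example, it may take the counts into account).

```python
from collections import defaultdict


def build_name_index(lines):
    """Map each lower-cased name to the set of sexes it is listed under.

    Assumes one 'name,sex,count' record per line (hypothetical input layout).
    """
    index = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sex, _count = line.split(',')
        index[name.lower()].add(sex)
    return dict(index)


def guess_gender(index, full_name):
    """Return 'M', 'F', 'both', or 'unknown' based on the first token of the name."""
    tokens = full_name.split()
    if not tokens:
        return 'unknown'
    sexes = index.get(tokens[0].lower(), set())
    if sexes == {'M'}:
        return 'M'
    if sexes == {'F'}:
        return 'F'
    return 'both' if sexes else 'unknown'


# Toy records in the assumed layout, not real counts.
toy_lines = ['Olivia,F,100', 'Liam,M,100', 'Riley,F,60', 'Riley,M,40']
index = build_name_index(toy_lines)
labels = {name: guess_gender(index, name) for name in ['Olivia', 'liam', 'Riley', 'Xyzzy']}
```

In this sketch, users labelled `'both'` or `'unknown'` would be the ones skipped in the next step.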

### The resulting datasets:
- `all`: A balanced subset of 300,000 samples; each sample (user) keeps all of its reviews (representative of the whole dataset)
- `n = 1, 2, 5, 10, 20`: A balanced subset of at most 300,000 samples; only users with at least `n` reviews are picked, and exactly `n` randomly chosen reviews are kept for each of them

Each dataset is split into 90% training and 10% test samples.
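
The selection and split described above can be sketched on toy data as follows. The helper names `select_balanced` and `split` are made up for this illustration, plain strings stand in for the `Gender` enum, and the sketch leaves out the shuffling that the notebook performs before truncating and splitting.

```python
def select_balanced(data_f, data_m, n, max_size):
    """Pick up to max_size samples, half per class, each truncated to n reviews."""
    half = max_size // 2
    picked_f = [(g, r[:n]) for g, r in data_f if len(r) >= n][:half]
    picked_m = [(g, r[:n]) for g, r in data_m if len(r) >= n][:half]
    size = min(len(picked_f), len(picked_m))  # keep both classes the same size
    return picked_f[:size] + picked_m[:size]


def split(dataset, train_fraction):
    """Split into (train, test) at the given fraction."""
    cut = int(train_fraction * len(dataset))
    return dataset[:cut], dataset[cut:]


# Toy users: (gender label, list of reviews).
toy_f = [('F', ['a', 'b', 'c']), ('F', ['d']), ('F', ['e', 'f'])]
toy_m = [('M', ['g', 'h']), ('M', ['i'])]

balanced = select_balanced(toy_f, toy_m, n=2, max_size=10)
train, test = split(balanced, train_fraction=0.5)
```

Only one male user has two or more reviews here, so the balanced set shrinks to one sample per class. The same effect explains why the datasets for larger `n` end up smaller than 300,000 samples.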

In [1]:
from datetime import timedelta
import json
from pathlib import Path
import pickle
import random
import re
from timeit import default_timer as timer

from GenderGuesser import GenderGuesser, Gender

In [2]:
data_dir = Path('data')
yelp_dataset_dir = data_dir / 'yelp_dataset'
name_list_file = data_dir / 'names/yob2019.txt'

In [3]:
dataset_size = 300_000
train_size = 0.9
number_of_reviews = ['all', 1, 2, 5, 10, 20]

In [4]:
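def sanitize_review(text):
    # Collapse runs of whitespace and common punctuation/special characters into one space.
    sanitized_review = re.sub(r'[\s,+&%$!?.*-]+', ' ', text)
    # Replace standalone numbers (delimited by whitespace or the string boundaries) with a space.
    sanitized_review = re.sub(r'(\s|^)\d+(\.\d+)?(\s|$)', ' ', sanitized_review)
    # Normalize case.
    sanitized_review = sanitized_review.lower()
    return sanitized_review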
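
A few made-up inputs show what `sanitize_review` does in practice. The function body is repeated in condensed form so that the snippet runs on its own; the example strings are not taken from the Yelp data.

```python
import re


def sanitize_review(text):
    # Same steps as the function above, repeated so this snippet runs standalone.
    text = re.sub(r'[\s,+&%$!?.*-]+', ' ', text)
    text = re.sub(r'(\s|^)\d+(\.\d+)?(\s|$)', ' ', text)
    return text.lower()


examples = [
    "Great food!!! We paid 20 dollars & left.",
    "Don't go (ever)",
    "Table for 2 3 times",
]
for example in examples:
    print(repr(sanitize_review(example)))
# 'great food we paid dollars left '
# "don't go (ever)"
# 'table for 3 times'
```

The second example shows that characters outside the listed set, such as apostrophes and parentheses, are kept. The third shows that of two adjacent numbers only the first is removed, because regex matches cannot overlap and the space between the numbers is consumed by the first match.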

In [5]:
def read_data(yelp_dataset_dir, name_list_file):
    """
    Reads the Yelp user and review JSON files and extracts the sanitized reviews of all users whose names are
    recognized as either male or female.

    :param yelp_dataset_dir: Path to the directory containing the Yelp dataset JSON files.
    :param name_list_file: Path to the name list file passed to GenderGuesser.
    :return: List of tuples (gender, reviews) for each user with a male or female name.
    """
    gender_guesser = GenderGuesser(name_list_file)
    data = dict()

    start = timer()
    with open(f'{yelp_dataset_dir}/yelp_academic_dataset_user.json', 'r') as fd:
        for line in fd:
            record = json.loads(line)

            gender = gender_guesser.guess(record['name'])
            if gender in (Gender.M, Gender.F):
                data[record['user_id']] = (gender, [])

    with open(f'{yelp_dataset_dir}/yelp_academic_dataset_review.json', 'r') as fd:
        for line in fd:
            record = json.loads(line)

            if record['user_id'] in data:
                data[record['user_id']][1].append(sanitize_review(record['text']))

    end = timer()
    print(f"Read JSON data in {timedelta(seconds=end-start)}")

    return list(data.values())

In [6]:
data = read_data(yelp_dataset_dir, name_list_file)

Read JSON data in 0:08:23.352105


In [7]:
seed = 0
random.seed(seed)
random.shuffle(data)
for _, reviews in data:
    random.shuffle(reviews)
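
Seeding the generator is what makes the shuffled order repeatable from run to run. The sketch below illustrates this with a hypothetical helper `shuffled_copy` that uses a dedicated `random.Random` instance. The notebook itself seeds the global generator instead, so this is an alternative illustration and not the method used above.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    result = list(items)
    rng.shuffle(result)
    return result


users = ['u1', 'u2', 'u3', 'u4', 'u5']
first = shuffled_copy(users, seed=0)
second = shuffled_copy(users, seed=0)
# Both calls start from the same seed, so first and second hold the same order.
```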

In [8]:
data_f = [x for x in data if x[0] == Gender.F]
data_m = [x for x in data if x[0] == Gender.M]
del data

In [9]:
dataset_dir = data_dir / 'datasets'
dataset_dir.mkdir(parents=True, exist_ok=True)

for n in number_of_reviews:
    if n == 'all':
        n_reviews = max([len(reviews) for _, reviews in data_f] + [len(reviews) for _, reviews in data_m])
    else:
        n_reviews = n
        
    dataset_f = [
        (gender, reviews[:n_reviews]) for gender, reviews in data_f if len(reviews) >= n_reviews or n == 'all'
    ][:int(dataset_size/2)] 
    dataset_m = [
        (gender, reviews[:n_reviews]) for gender, reviews in data_m if len(reviews) >= n_reviews or n == 'all'
    ][:int(dataset_size/2)]
    
    size = min(len(dataset_f), len(dataset_m))
    
    dataset = dataset_f[:size] + dataset_m[:size]
    random.shuffle(dataset)
    
    with open(dataset_dir / f'dataset_{n}_train.pkl', 'wb') as fd:
        pickle.dump(dataset[:int(train_size * len(dataset))], fd)
                            
    with open(dataset_dir / f'dataset_{n}_test.pkl', 'wb') as fd:
        pickle.dump(dataset[int(train_size * len(dataset)):], fd)
                            
    print(
        f"{n}: Created dataset with {int(train_size * len(dataset))} training "
        f"and {len(dataset) - int(train_size * len(dataset))} test samples"
    )

all: Created dataset with 270000 training and 30000 test samples
1: Created dataset with 270000 training and 30000 test samples
2: Created dataset with 270000 training and 30000 test samples
5: Created dataset with 132503 training and 14723 test samples
10: Created dataset with 56871 training and 6319 test samples
20: Created dataset with 22140 training and 2460 test samples
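
To consume the serialized files in a later step, a loader along the following lines might be used. The function `load_dataset` is a hypothetical helper, and the round trip below runs on toy data in a temporary directory, with strings standing in for `Gender` members.

```python
import pickle
import tempfile
from pathlib import Path


def load_dataset(dataset_dir, n, subset):
    """Load one split, e.g. n=5 and subset='train' reads dataset_5_train.pkl."""
    with open(Path(dataset_dir) / f'dataset_{n}_{subset}.pkl', 'rb') as fd:
        return pickle.load(fd)


# Round trip on toy data; the real files are written by the cell above.
with tempfile.TemporaryDirectory() as tmp:
    toy = [('F', ['nice place']), ('M', ['too loud'])]
    with open(Path(tmp) / 'dataset_1_train.pkl', 'wb') as fd:
        pickle.dump(toy, fd)
    loaded = load_dataset(tmp, 1, 'train')

genders = [gender for gender, _ in loaded]
reviews = [texts for _, texts in loaded]
```

Because the real samples store `Gender` enum members, the `GenderGuesser` module has to be importable wherever the files are unpickled. As with any pickle file, only load files from a trusted source.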
