# Notebook description

This notebook contains the code used to generate the tables that the rest of the notebooks rely on. In most cases, these tables are just samples of the original tables in the Challenge dataset, but some of them are a bit more involved. Using these generated tables reduces running times in the rest of the notebooks.

Having this code in a separate notebook allows for cleaner notebooks.

# Imports and main params

In [1]:
import numpy as np
import pandas as pd

In [2]:
base_path = '/media/sebastian/926CA79B6CA7791D/trabajo/busqueda/research/NewYorker/project/data/'

# Useful functions

In [3]:
def extract_json_to_csv(json_path, csv_path, keep_cols=None, chunksize=100000, encoding=None):
    """Read a JSON-lines file in chunks, optionally keep only keep_cols, and save the result as CSV."""
    reader = pd.read_json(json_path, lines=True, chunksize=chunksize)
    df = pd.DataFrame()
    for chunk in reader:
        # Drop unneeded columns as early as possible to keep memory usage down
        if keep_cols is not None:
            chunk = chunk[keep_cols]
        print('Concatenating new chunk...')
        df = pd.concat([df, chunk])
        print('Read {} rows'.format(df.shape[0]))

    print('Saving CSV...')
    df.to_csv(csv_path, index=False, encoding=encoding)
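
Concatenating inside the loop copies the accumulated frame on every iteration. A minimal sketch of one possible alternative, assuming the same JSON-lines input, collects the chunks in a list and concatenates once at the end. The helper name `read_json_lines_in_chunks` and the in-memory sample records are hypothetical and only serve as illustration.

```python
import io

import pandas as pd


def read_json_lines_in_chunks(json_source, keep_cols=None, chunksize=2):
    """Hypothetical helper: read a JSON-lines source chunk by chunk into one DataFrame."""
    chunks = []
    reader = pd.read_json(json_source, lines=True, chunksize=chunksize)
    for chunk in reader:
        if keep_cols is not None:
            chunk = chunk[keep_cols]
        chunks.append(chunk)
    # A single concat at the end avoids re-copying the rows read so far
    return pd.concat(chunks, ignore_index=True)


# Made-up records standing in for a few lines of a reviews file
sample_json = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "great"}\n'
    '{"business_id": "b2", "stars": 3, "text": "ok"}\n'
    '{"business_id": "b1", "stars": 1, "text": "bad"}\n'
)
sample_df = read_json_lines_in_chunks(sample_json, keep_cols=['business_id', 'stars'])
print(sample_df)
```

The returned frame could then be written with `to_csv` exactly as in the function above.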

# Dataset generation

## Reviews

In [5]:
reviews_path = base_path + 'yelp_academic_dataset_review.json'
generated_reviews_path = base_path + 'generated/reviews.csv'
keep_cols = ['business_id', 'date', 'stars', 'user_id']
# Uncomment the next line to (re)generate the CSV
# extract_json_to_csv(reviews_path, generated_reviews_path, keep_cols, chunksize = 500000)

## Users

In [None]:
users_path = base_path + 'yelp_academic_dataset_user.json'
generated_users_path = base_path + 'generated/users.csv'
keep_cols = ['user_id', 'friends']
# Uncomment the next line to (re)generate the CSV
# extract_json_to_csv(users_path, generated_users_path, keep_cols, chunksize = 500000)

## Businesses

In [8]:
businesses_path = base_path + 'yelp_academic_dataset_business.json'
generated_businesses_path = base_path + 'generated/businesses.csv'
# No keep_cols here: every column of the businesses table is kept
extract_json_to_csv(businesses_path, generated_businesses_path, chunksize=500000, encoding='utf-8')

Concatenating new chunk...
Read 188593 rows
Saving CSV...


## Checkins

The following cell creates a checkins CSV with just two columns (apart from business_id): the number of checkins on weekends and the number of checkins on working days.

In [178]:
checkins_path = base_path + 'yelp_academic_dataset_checkin.json'
generated_checkins_path = base_path + 'generated/checkins.csv'


reader = pd.read_json(checkins_path, lines=True, chunksize=20000)
df = pd.DataFrame()
for chunk in reader:
    print('Processing chunk')
    # Expand the 'time' column into one column per key; missing values become 0
    chunk = pd.concat([chunk.drop(['time'], axis=1), chunk['time'].apply(pd.Series)], axis=1).fillna(0)
    # The part of each expanded column name before '-' is the day of the week
    time_cols = chunk.drop('business_id', axis=1).columns
    weekend_cols = [col for col in time_cols if col.split('-')[0] in ['Sat', 'Sun']]
    week_cols = [col for col in time_cols if col.split('-')[0] in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']]
    chunk['weekends'] = chunk[weekend_cols].sum(axis=1)
    chunk['week'] = chunk[week_cols].sum(axis=1)
    chunk = chunk[['business_id', 'weekends', 'week']]
    print('Concatenating new chunk...')
    df = pd.concat([df, chunk])
    print('Read {} rows'.format(df.shape[0]))
df.to_csv(generated_checkins_path, index=False, encoding='utf-8')

Processing chunk
Concatenating new chunk...
Read 20000 rows
Processing chunk
Concatenating new chunk...
Read 40000 rows
Processing chunk
Concatenating new chunk...
Read 60000 rows
Processing chunk
Concatenating new chunk...
Read 80000 rows
Processing chunk
Concatenating new chunk...
Read 100000 rows
Processing chunk
Concatenating new chunk...
Read 120000 rows
Processing chunk
Concatenating new chunk...
Read 140000 rows
Processing chunk
Concatenating new chunk...
Read 157075 rows
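
To make the weekend/working-day aggregation easier to follow, here is a small sketch on toy data. It assumes that each `time` entry is a dict keyed by strings such as `'Sat-20'` (day, then hour), which is inferred from the `split('-')` in the cell above rather than checked against the dataset; the business IDs and counts are made up.

```python
import pandas as pd

# Toy stand-in for one chunk of the checkins file (assumed 'Day-hour' keys)
chunk = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'time': [{'Sat-20': 3, 'Mon-9': 1}, {'Sun-11': 2, 'Fri-18': 4, 'Sat-20': 1}],
})

# One column per 'Day-hour' key; slots without checkins become 0
counts = pd.DataFrame(chunk['time'].tolist(), index=chunk.index).fillna(0)

# Day-of-week prefix of every column name
days = counts.columns.str.split('-').str[0]
is_weekend = days.isin(['Sat', 'Sun'])
is_week = days.isin(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])

summary = pd.DataFrame({
    'business_id': chunk['business_id'],
    'weekends': counts.loc[:, is_weekend].sum(axis=1),
    'week': counts.loc[:, is_week].sum(axis=1),
})
print(summary)
```

Here `b1` ends up with 3 weekend checkins and 1 working-day checkin, and `b2` with 3 and 4.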


## Friends

The 3 cells below create a handy mapping between friends. The generated table is based on a sample from the generated users table (run the cell under _Users_ first if you are going to run the cells below). First, users with fewer than a certain number of friends are filtered out. Then, a fixed number of the remaining users are randomly sampled. The final table contains two columns: 'user_id' and 'friend_id'. A row [user_id, friend_id] is present in the final table if and only if user_id belongs to the sampled table and user_id has friend_id among their friends.

Params:

In [None]:
# Number of users to sample from the users table 
sample = 20000

# Min number of friends. Users with fewer friends than this value are filtered out
min_n_friends = 20

Read users, preprocess the friends column, filter out users with fewer than min_n_friends friends, and sample

In [None]:
users = pd.read_csv(generated_users_path)
users = users[users.friends != 'None'].copy()
users['friends'] = users.friends.apply(lambda x: [i.strip() for i in x.split(',')])
users = users[users.friends.apply(lambda x: len(x) >= min_n_friends)].copy()
users = users.sample(sample)
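
The filter-then-sample step can be sketched on toy data as follows. Two things here are assumptions that are not part of the notebook: the `random_state` argument, added so that the sample is reproducible, and the cap on `n`, added because `DataFrame.sample` raises an error when asked for more rows than are available without replacement. The user IDs and parameter values are made up.

```python
import pandas as pd

sample = 2          # toy value; the notebook uses 20000
min_n_friends = 2   # toy value; the notebook uses 20

# Toy users table with the friends column already split into lists
users = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'friends': [['a', 'b'], ['a'], ['a', 'b', 'c'], ['b', 'c']],
})

# Keep only users with at least min_n_friends friends
eligible = users[users.friends.apply(len) >= min_n_friends]

# Cap n at the number of eligible users; fix the seed for reproducibility
n = min(sample, len(eligible))
sampled = eligible.sample(n=n, random_state=0)
print(sampled)
```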

Explode list of friends and save

In [None]:
friends_path = base_path + 'generated/friends.csv'
friends = users.friends.apply(pd.Series) \
               .stack() \
               .reset_index(level=1, drop=True) \
               .to_frame('friend_id') \
               .join(users) \
               .drop('friends', axis=1) \
               .reset_index(level=0, drop=True)
friends = friends[friends.friend_id != '']
friends.to_csv(friends_path, index=False)
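
On pandas 0.25 or later, `DataFrame.explode` offers a more direct way to turn the list column into one row per friend. The following is a sketch of that alternative on toy data, not the method used in this notebook; the user IDs are made up and the availability of `explode` depends on the installed pandas version.

```python
import pandas as pd

# Toy stand-in for the generated users table
users = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'friends': ['u2, u3', 'None', 'u1'],
})

# Same preprocessing idea as above: drop users without friends, split the string
users = users[users.friends != 'None'].copy()
users['friends'] = users.friends.str.split(',')

# One row per (user_id, friend) pair
friends = users.explode('friends').rename(columns={'friends': 'friend_id'})
friends['friend_id'] = friends.friend_id.str.strip()
friends = friends[friends.friend_id != ''].reset_index(drop=True)
print(friends)
```

The result has the columns 'user_id' and 'friend_id', in that order, with one row per friendship of each remaining user.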