# CS396 Data Science: Project

### Project Overview
This project aims to understand any underlying biases or trends on Yelp by answering the following questions:
1. Is there a correlation between a user's activity/popularity and their review score distribution?
2. Do Yelp users have similar behaviors to their friends on the platform?
3. Is there a correlation between a review's score and whether it was posted during the business's operating hours?
4. Is there a correlation between the number of reviews a business has and its score distribution?

### Data Used
All of these questions were answered using various data sources included in the Yelp Open Dataset. The specific files used for each question are listed below:
1. user.json, checkin.json, review.json
2. user.json, checkin.json, review.json
3. business.json, review.json
4. business.json, review.json

## Environment Setup and Utility Functions

In [38]:
DATA_PATH = '../../yelp_dataset/'

import pandas as pd
import json



def load_data(path):
    """Load a Yelp data file into a DataFrame.
    Only JSON (one object per line) and pickle formats are supported.
    
    function(string) => pd.DataFrame"""
    f_type = path.split('.')[-1]
    
    if f_type == 'json': 
        # Use a context manager so the file handle is closed after reading
        with open(DATA_PATH+path, encoding='utf-8') as f:
            return pd.DataFrame.from_records([json.loads(l) for l in f])
    elif f_type in ['pickle', 'pkl']:
        return pd.read_pickle(DATA_PATH+path)
    else:
        raise ValueError('Unsupported file type provided "{0}"'.format(f_type))
        

def save_data(df, path):
    """Save a pd.DataFrame to the specified file.
    Only JSON (one object per line) and pickle formats are supported.
    
    function(pd.DataFrame, string) => None"""
    f_type = path.split('.')[-1]
    
    if f_type == 'json':
        with open(DATA_PATH+path, 'w', encoding='utf-8') as out:
            for i, r in df.iterrows():
                print(r.to_json(), file=out)
        return
    elif f_type in ['pickle', 'pkl']:
        df.to_pickle(DATA_PATH+path)
        return
    else:
        raise ValueError('Unsupported file type provided "{0}"'.format(f_type))
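
The JSON branch of `load_data` assumes the file holds one JSON object per line (the JSON Lines layout), which is also what `save_data` writes. A minimal stdlib-only sketch of that layout follows. The two records and the file name `sample.json` are made up for illustration and are not taken from the Yelp dataset.

```python
import json
import os
import tempfile

# Hypothetical records, invented for this example only.
records = [
    {"user_id": "u1", "review_count": 3},
    {"user_id": "u2", "review_count": 7},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")  # hypothetical file name

    # Write one JSON object per line, as save_data does for '.json' paths.
    with open(sample_path, "w", encoding="utf-8") as out:
        for record in records:
            print(json.dumps(record), file=out)

    # Read the file back line by line, as load_data does before
    # handing the parsed records to pd.DataFrame.from_records.
    with open(sample_path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f]

# The records survive the round trip unchanged.
print(loaded == records)
```

A file containing a single top-level JSON array would not fit this pattern, since each line has to parse as a complete object on its own.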
            
        

## Data Cleaning

#### Convert JSON to Pickle
All data files used in this project are converted to pickle format, which loads significantly faster than JSON. The one-time conversion is slow (>20 minutes on my system) and memory-intensive (>8 GB RAM usage), but it pays off through much shorter load times on every later access. The pickle files are also slightly more compact, saving about 1 GB in total across the converted files.

In [50]:
'''
# Commented out to prevent accidental execution
files = ['yelp_academic_dataset_user', 'yelp_academic_dataset_checkin', 'yelp_academic_dataset_review', 'yelp_academic_dataset_business']
for f in files:
    df = load_data(f+'.json')
    save_data(df, f+'.pickle')
'''
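
The conversion above parses every line into a Python list before building the DataFrame, which contributes to the high peak memory use. One possible alternative, offered here as an assumption rather than the method used in this project, is to let pandas parse the line-delimited JSON in chunks via `pd.read_json(..., lines=True, chunksize=...)`. The helper name `json_to_pickle_chunked`, the chunk size, and the sample records are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def json_to_pickle_chunked(json_path, pickle_path, chunksize=100_000):
    """Convert a line-delimited JSON file to pickle, parsing it in chunks.

    Hypothetical helper, not part of the original notebook.
    Returns the number of chunks read.
    """
    # With lines=True and a chunksize, read_json yields DataFrames lazily.
    chunks = list(pd.read_json(json_path, lines=True, chunksize=chunksize))
    # The combined frame still has to fit in memory; only parsing is chunked.
    df = pd.concat(chunks, ignore_index=True)
    df.to_pickle(pickle_path)
    return len(chunks)


# Tiny made-up input so the sketch runs without the Yelp files.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.json")
    dst = os.path.join(tmp, "sample.pickle")
    with open(src, "w", encoding="utf-8") as out:
        for i in range(5):
            row = {"business_id": "b{}".format(i), "stars": i % 5 + 1}
            print(json.dumps(row), file=out)

    n_chunks = json_to_pickle_chunked(src, dst, chunksize=2)
    restored = pd.read_pickle(dst)

print(n_chunks, len(restored))
```

Whether this actually lowers peak memory on the full dataset would need to be measured. Note also that `pd.read_json` applies its own dtype inference, so column types may differ from those produced by `load_data`.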