# CS396 Data Science: Project

### Project Overview
This project aims to understand any underlying biases or trends on Yelp by answering the following questions:
1. Is there a correlation between a user's activity/popularity and their review score distribution?
2. Do Yelp users have similar behaviors to their friends on the platform?
3. Is there a correlation between a review's score and whether the review was posted during the business's operating hours?
4. Is there a correlation between the number of reviews a business has and its score distribution?

### Data Used
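All four questions were answered using files from the Yelp Open Dataset. The files used for each question are listed below, numbered to match the questions above. On disk each file carries the prefix `yelp_academic_dataset_`, which is omitted here for brevity: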
1. user.json, checkin.json, review.json
2. user.json, checkin.json, review.json
3. business.json, review.json
4. business.json, review.json

## Environment Setup and Utility Functions

In [38]:
DATA_PATH = '../../yelp_dataset/'

import pandas as pd
import json



def load_data(path):
    """Load a Yelp data file.
    Only JSON (one object per line) and pickle formats are currently supported.
    
    function(string) => pd.DataFrame"""
    f_type = path.split('.')[-1]
    
    if f_type == 'json': 
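        # Each line holds one JSON object; the context manager closes the file when done
        with open(DATA_PATH+path, encoding='utf-8') as f:
            return pd.DataFrame.from_records([json.loads(line) for line in f])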
    elif f_type in ['pickle', 'pkl']:
        return pd.read_pickle(DATA_PATH+path)
    else:
        raise ValueError('Unsupported file type provided "{0}"'.format(f_type))
        

def save_data(df, path):
    """Save a pd.DataFrame to the specified file.
    Only JSON (one object per line) and pickle formats are currently supported.
    
    function(pd.DataFrame, string) => None"""
    f_type = path.split('.')[-1]
    
    if f_type == 'json':
        with open(DATA_PATH+path, 'w', encoding='utf-8') as out:
            for i, r in df.iterrows():
                print(r.to_json(), file=out)
        return
    elif f_type in ['pickle', 'pkl']:
        df.to_pickle(DATA_PATH+path)
        return
    else:
        raise ValueError('Unsupported file type provided "{0}"'.format(f_type))
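
pandas can also read and write the one-object-per-line JSON layout directly, which may be a simpler alternative to the manual loops in `load_data` and `save_data`. The sketch below is only an illustration: the toy frame and the temporary file are hypothetical stand-ins for the Yelp files, and it has not been compared against the helpers above for speed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for one of the Yelp tables
toy = pd.DataFrame({'user_id': ['a', 'b'], 'review_count': [3, 7]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'toy.json')
    # orient='records' with lines=True writes one JSON object per line
    toy.to_json(out_path, orient='records', lines=True)
    # lines=True tells the reader to expect the same layout
    restored = pd.read_json(out_path, lines=True)

print(restored)
```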
            
        

## Data Cleaning

#### Convert JSON to Pickle
All of the data files used are converted to pickle format for faster loading. The conversion is slow (more than 20 minutes on my system) and memory-intensive (more than 8 GB of RAM), but it pays off on every later load. The pickle files are also slightly more compact, saving about 1 GB in total across the converted files.
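
If the memory footprint of the conversion is a problem, one option is to read the JSON file in chunks rather than building one large list of dictionaries. The following is a minimal sketch under that assumption: `load_json_in_chunks` is a hypothetical helper, the five-row file is made up for the demonstration, and the actual memory savings on the Yelp files have not been measured.

```python
import json
import os
import tempfile

import pandas as pd


def load_json_in_chunks(path, chunksize=100_000):
    """Read a JSON Lines file chunk by chunk and concatenate the pieces.

    Hypothetical helper, not part of the notebook above.
    """
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    return pd.concat([chunk for chunk in reader], ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.json')
    # Write five made-up records, one JSON object per line
    with open(demo_path, 'w', encoding='utf-8') as out:
        for i in range(5):
            print(json.dumps({'review_id': 'r{0}'.format(i), 'stars': i % 5 + 1}), file=out)
    demo = load_json_in_chunks(demo_path, chunksize=2)

print(len(demo))
```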

In [50]:
'''
# Commented out to prevent accidental execution
files = ['yelp_academic_dataset_user', 'yelp_academic_dataset_checkin', 'yelp_academic_dataset_review', 'yelp_academic_dataset_business']
for f in files:
    df = load_data(f+'.json')
    save_data(df, f+'.pickle')
'''

#### User Data Cleaning

In [86]:
def clean_user(df):
    """Apply cleaning to user dataset or a subset of it
    
    function(pd.DataFrame) => pd.DataFrame
    """
    print('=== User Data Cleaning Results ===')
    
    # Drop rows that contain any null value
    clean = df.dropna()
    
    # Report how many user_ids are duplicated (none are removed here)
    num_duplicates = len(clean) - clean.user_id.nunique()
    print('  {0} duplicate user_ids'.format(num_duplicates))
    
    return clean

In [92]:
df = load_data('yelp_academic_dataset_user.pickle')
df = clean_user(df)

=== User Data Cleaning Results ===
  0 duplicate user_ids
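
The check above only counts duplicates. Had any been found, they could be inspected and removed along the following lines. This is a sketch on a made-up frame, not a step the notebook needed to run, and keeping the first occurrence is an assumption about the desired policy.

```python
import pandas as pd

# Made-up data with one repeated user_id
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u2', 'u3'],
    'review_count': [5, 2, 2, 9],
})

# keep=False marks every row that shares a user_id with another row
dupes = toy[toy.duplicated(subset='user_id', keep=False)]

# Keep the first row for each user_id and drop the rest
deduped = toy.drop_duplicates(subset='user_id', keep='first')

print(len(dupes), len(deduped))
```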


#### Business Data Cleaning

In [90]:
def clean_business(df):
    """Apply cleaning to business dataset or a subset of it
    
    function(pd.DataFrame) => pd.DataFrame
    """
    print("=== Business Data Cleaning Results ===")
    
    # Drop rows that contain any null value
    clean = df.dropna()
    
    # Report how many business_ids are duplicated (none are removed here)
    num_duplicates = len(clean) - clean.business_id.nunique()
    print('{0} duplicate business_ids'.format(num_duplicates))
    
    return clean

In [93]:
df = load_data('yelp_academic_dataset_business.pickle')
df = clean_business(df)

=== Business Data Cleaning Results ===
0 duplicate business_ids


#### Checkin Data Cleaning

In [94]:
df = load_data('yelp_academic_dataset_checkin.pickle')

In [95]:
df

| | business_id | date |
|---|---|---|
| 0 | --0r8K_AQ4FZfLsX3ZYRDA | 2017-09-03 17:13:59 |
| 1 | --0zrn43LEaB4jUWTQH_Bg | 2010-10-08 22:21:20, 2010-11-01 21:29:14, 2010... |
| 2 | --164t1nclzzmca7eDiJMw | 2010-02-26 02:06:53, 2010-02-27 08:00:09, 2010... |
| 3 | --2aF9NhXnNVpDV0KS3xBQ | 2014-11-03 16:35:35, 2015-01-30 18:16:03, 2015... |
| 4 | --2mEJ63SC_8_08_jGgVIg | 2010-12-15 17:10:46, 2013-12-28 00:27:54, 2015... |
| ... | ... | ... |
| 138871 | zzoUa7lyeM-qKPKFYSrAhg | 2012-10-12 17:11:06, 2012-10-22 23:38:12, 2012... |
| 138872 | zzpmoTVq4yn86U7ArHyFBQ | 2020-07-18 18:33:18, 2020-07-18 20:13:49, 2020... |
| 138873 | zzqq8J7Pibxod1YcknlkWA | 2014-08-29 00:00:54, 2014-10-23 19:00:58, 2017... |
| 138874 | zzwK-TJsCJX5wZrdtKemPg | 2010-08-29 17:39:58, 2010-10-25 22:58:03, 2011... |


In [96]:
df.iloc[0]

business_id    --0r8K_AQ4FZfLsX3ZYRDA
date              2017-09-03 17:13:59
Name: 0, dtype: object
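
The output above shows that each `date` value is a single string holding one or more timestamps. One way to make these usable is to split each string and expand it to one row per check-in. The sketch below assumes the separator is a comma followed by a space, as it appears in the displayed rows. The toy frame reuses two business IDs from the output, with the second list cut to its two visible timestamps.

```python
import pandas as pd

# Two rows modeled on the output above; the second list is cut to the visible timestamps
toy = pd.DataFrame({
    'business_id': ['--0r8K_AQ4FZfLsX3ZYRDA', '--0zrn43LEaB4jUWTQH_Bg'],
    'date': ['2017-09-03 17:13:59', '2010-10-08 22:21:20, 2010-11-01 21:29:14'],
})

# Split each string into a list, then expand to one row per timestamp
checkins_long = (
    toy.assign(date=toy['date'].str.split(', '))
       .explode('date')
       .reset_index(drop=True)
)

# Convert the timestamp strings to datetime values
checkins_long['date'] = pd.to_datetime(checkins_long['date'], format='%Y-%m-%d %H:%M:%S')

# Number of check-ins per business
checkin_counts = checkins_long.groupby('business_id').size()
print(checkin_counts)
```

With the data in this long form, per-business check-in counts or time-of-day breakdowns become ordinary `groupby` operations.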