# Yelp Dataset Challenge

![Yelp Data Challenge](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/6d323fc75cb1/assets/img/dataset/960x225_dataset@2x.png)

## Data processing

### 1. Load data into Pandas DataFrame

In [281]:
import json
import pandas as pd

#### Prepare dataset

The downloaded dataset is 3.13 GB, which is very large.

Use a smaller sample of the dataset for testing purposes.

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, iCQpiavjjPzJ5_3gPD5Ebg to AJWrjfJ0GcR5ar2oU4gbow
Data columns (total 8 columns):
cool         1000 non-null int64
date         1000 non-null object
funny        1000 non-null int64
review_id    1000 non-null object
stars        1000 non-null int64
text         1000 non-null object
useful       1000 non-null int64
user_id      1000 non-null object
dtypes: int64(4), object(4)
memory usage: 70.3+ KB


In [282]:
path = '/Users/ytshen/Desktop/code_tmp/practice/Yelp_Data_Challenge/'

# smaller data
smaller_business, smaller_checkin, smaller_review, smaller_tip, smaller_user = [
    'sample_business.json',
    'sample_checkin.json',
    'sample_review.json',
    'sample_tip.json',
    'sample_user.json'
]

# all data
all_business, all_checkin, all_review, all_tip, all_user = [
    'yelp_academic_dataset_business.json',
    'yelp_academic_dataset_checkin.json',
    'yelp_academic_dataset_review.json',
    'yelp_academic_dataset_tip.json',
    'yelp_academic_dataset_user.json',
]

In [283]:
# Use smaller data
file_business, file_checkin, file_review, file_tip, file_user = [
    path + smaller_business,
    path + smaller_checkin,
    path + smaller_review,
    path + smaller_tip,
    path + smaller_user
]

In [284]:
# # Use all data
# file_business, file_checkin, file_review, file_tip, file_user = [
#     path + all_business,
#     path + all_checkin,
#     path + all_review,
#     path + all_tip,
#     path + all_user
# ]

In [285]:
# # Print out file names
# print(file_business)
# print(file_checkin)
# print(file_review)
# print(file_tip)
# print(file_user)

#### Try to load json file into Pandas DataFrame
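* `pandas.read_json()` with its default arguments fails on these files, most likely because each line holds a separate JSON object rather than the file being one JSON document.
* `json.loads()` (note the trailing "s") parses one string at a time, so it can be applied to the file line by line.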
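
As a minimal sketch of the two bullets above, assuming the files are line-delimited JSON (one object per line), the snippet below parses the same text both ways. The two records are invented for illustration; `lines=True` is the `pandas.read_json()` option for this layout.

```python
import io
import json

import pandas as pd

# Two made-up records in JSON Lines form: one JSON object per line.
raw = (
    '{"business_id": "a1", "stars": 4.0}\n'
    '{"business_id": "b2", "stars": 2.5}\n'
)

# json.loads() parses one string at a time, so it is applied line by line.
df_loads = pd.DataFrame(json.loads(line) for line in io.StringIO(raw))

# pandas can read the same layout directly when told it is line-delimited.
df_lines = pd.read_json(io.StringIO(raw), lines=True)

print(df_loads.shape, df_lines.shape)
```

Both routes should give the same rows and columns here; the line-by-line version is the one this notebook wraps in a function.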

In [286]:
# # Test to load single file
# with open(file_business) as f:
#     df_test1 = pd.DataFrame(json.loads(line) for line in f) # note: loads() needs the trailing s

# # Check
# df_test1.head()

In [287]:
# df_test1.info()

In [288]:
# Loading a single file works, wrap in function
def read_json_file(input_file):
    with open(input_file) as fin:
        df = pd.DataFrame(json.loads(line) for line in fin)
    return df

In [289]:
# df_test2 = read_json_file(file_business)
# df_test2.head()

In [290]:
# # Compare the two DataFrames; they should be equal.
# df_test2.equals(df_test1)

#### Load all files into DataFrame.

In [291]:
df_business = read_json_file(file_business)
df_checkin = read_json_file(file_checkin)
df_review = read_json_file(file_review)
df_tip = read_json_file(file_tip)
df_users = read_json_file(file_user)

In [292]:
# Check
df_business.head()

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
0,1314 44 Avenue NE,"{'BikeParking': 'False', 'BusinessAcceptsCredi...",Apn5Q_b6Nz61Tq4XzPdf9A,"Tours, Breweries, Pizza, Restaurants, Food, Ho...",Calgary,"{'Monday': '8:30-17:0', 'Tuesday': '11:0-21:0'...",1,51.091813,-114.031675,Minhas Micro Brewery,,T2E 6L6,24,4.0,AB
1,,"{'Alcohol': 'none', 'BikeParking': 'False', 'B...",AjEbIBw6ZFfln7ePHha9PA,"Chicken Wings, Burgers, Caterers, Street Vendo...",Henderson,"{'Friday': '17:0-23:0', 'Saturday': '17:0-23:0...",0,35.960734,-114.939821,CK'S BBQ & Catering,,89002,3,4.5,NV
2,1335 rue Beaubien E,"{'Alcohol': 'beer_and_wine', 'Ambience': '{'ro...",O8S5hYJ1SMc8fA4QBtVujA,"Breakfast & Brunch, Restaurants, French, Sandw...",Montréal,"{'Monday': '10:0-22:0', 'Tuesday': '10:0-22:0'...",0,45.540503,-73.5993,La Bastringue,Rosemont-La Petite-Patrie,H2G 1K7,5,4.0,QC
3,211 W Monroe St,,bFzdJJ3wp3PZssNEsyU23g,"Insurance, Financial Services",Phoenix,,1,33.449999,-112.076979,Geico Insurance,,85003,8,1.5,AZ
4,2005 Alyth Place SE,{'BusinessAcceptsCreditCards': 'True'},8USyCYqpScwiNEb58Bt6CA,"Home & Garden, Nurseries & Gardening, Shopping...",Calgary,"{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",1,51.035591,-114.027366,Action Engine,,T2H 0N5,4,2.0,AB


In [293]:
# df_business.equals(df_test1)

In [294]:
df_business.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
address         1000 non-null object
attributes      838 non-null object
business_id     1000 non-null object
categories      998 non-null object
city            1000 non-null object
hours           737 non-null object
is_open         1000 non-null int64
latitude        1000 non-null float64
longitude       1000 non-null float64
name            1000 non-null object
neighborhood    1000 non-null object
postal_code     1000 non-null object
review_count    1000 non-null int64
stars           1000 non-null float64
state           1000 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 117.3+ KB


In [295]:
# df_checkin.info()

In [296]:
df_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
business_id    1000 non-null object
cool           1000 non-null int64
date           1000 non-null object
funny          1000 non-null int64
review_id      1000 non-null object
stars          1000 non-null int64
text           1000 non-null object
useful         1000 non-null int64
user_id        1000 non-null object
dtypes: int64(4), object(5)
memory usage: 70.4+ KB


In [297]:
# df_tip.info()

In [298]:
# df_users.info()

### 2. Filter data by city and category

#### Create filters/masks

* Create filters that select businesses
    * that are located in "Las Vegas"
    * that contain "Restaurants" in their categories (you may need to filter out null categories first)
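
The filters described above can be sketched on a toy frame. This is only an illustration: the rows are invented, and `in_vegas` and `is_restaurant` are hypothetical names. It uses `Series.str.contains` with `na=False`, which is one way to handle null categories without dropping rows first.

```python
import pandas as pd

# Toy stand-in for df_business; the rows are invented for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "city": ["Las Vegas", "Las Vegas", "Phoenix", "Las Vegas"],
    "categories": ["Mexican, Restaurants", None, "Restaurants, Thai", "Shopping"],
})

in_vegas = toy["city"] == "Las Vegas"

# na=False turns missing categories into False instead of NaN,
# so the mask covers every row of the frame and stays boolean.
is_restaurant = toy["categories"].str.contains("Restaurants", regex=False, na=False)

# Combine the two masks with & to keep rows that satisfy both conditions.
result = toy[in_vegas & is_restaurant]
print(result["name"].tolist())
```

Because both masks share the frame's index, combining them with `&` needs no index alignment.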

In [299]:
# # Make sure there is Las Vegas in city
# df_business['city'].value_counts()

In [300]:
# # Set the mask for city
# mask_city = df_business['city'] == 'Las Vegas'

In [301]:
# df_business[mask_city].head()

In [309]:
# # Check the categories
# df_business['categories'].head(30)

0     Tours, Breweries, Pizza, Restaurants, Food, Ho...
1     Chicken Wings, Burgers, Caterers, Street Vendo...
2     Breakfast & Brunch, Restaurants, French, Sandw...
3                         Insurance, Financial Services
4     Home & Garden, Nurseries & Gardening, Shopping...
5                                    Coffee & Tea, Food
6                                        Food, Bakeries
7                                     Restaurants, Thai
8                                  Mexican, Restaurants
9                 Flowers & Gifts, Gift Shops, Shopping
10                                Restaurants, Japanese
11                  Cajun/Creole, Southern, Restaurants
12    Bars, Sports Bars, Dive Bars, Burgers, Nightli...
13       Restaurants, Pakistani, Indian, Middle Eastern
14                               Beauty & Spas, Barbers
15                       Delis, Restaurants, Sandwiches
16    Nightlife, Bars, American (Traditional), Tapas...
17                 Shopping, Fashion, Department

In [310]:
df_business[df_business['categories'].isnull()]

Unnamed: 0,address,attributes,business_id,categories,city,hours,is_open,latitude,longitude,name,neighborhood,postal_code,review_count,stars,state
288,"Adelaide Centre, 130 Adelaide St West Concours...","{'BikeParking': 'False', 'BusinessParking': '{...",EBzr465prEffkpmE8Mk5AA,,Toronto,"{'Monday': '7:0-19:0', 'Tuesday': '7:0-19:0', ...",1,43.649592,-79.383394,Polish'd Nail Bar,Financial District,M5H 3P5,7,2.5,ON
603,7700 W Arrowhead Towne Ctr,,CN3BLZwfG4eqZjvKrIZoAg,,Glendale,,1,33.642064,-112.225217,Fuzziwigs Candy Factory,,85308,4,1.0,AZ


In [339]:
# # Set the mask for restaurant
# print(df_business['categories'].isnull().sum())
# null_categories = df_business['categories'].isnull()

In [340]:
# print(df_business[~null_categories]['categories'].count())
# print(df_business['categories'].notnull().sum())

In [341]:
# mask_restaurants = df_business[~null_categories]['categories'].apply(lambda x: True if 'Restaurants' in x else False)

In [342]:
# print(mask_restaurants)

In [343]:
# Create Pandas DataFrame filters
city = df_business['city'] == 'Las Vegas'

null_categories = df_business['categories'].isnull()
# na=False treats missing categories as "not a restaurant",
# so this mask keeps the same index as df_business
restaurants = df_business['categories'].str.contains('Restaurants', regex=False, na=False)

In [344]:
# Create filtered DataFrame, and name it df_filtered
df_filtered = df_business[city & restaurants]

In [347]:
# df_filtered.head()

#### Keep relevant columns

* only keep some useful columns
    * business_id
    * name
    * categories
    * stars

In [348]:
selected_features = [u'business_id', u'name', u'categories', u'stars']

In [349]:
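# Make a DataFrame that contains only the columns listed above, and name it df_selected_business
# .copy() makes it an independent DataFrame, which avoids SettingWithCopyWarning on later changes
df_selected_business = df_filtered[selected_features].copy()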

In [350]:
# df_selected_business.head()

In [351]:
# Rename the column name "stars" to "avg_stars" to avoid naming conflicts with review dataset
df_selected_business = df_selected_business.rename(columns={'stars': 'avg_stars'})

In [354]:
# Check
df_selected_business.head()

Unnamed: 0,business_id,name,categories,avg_stars
19,vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5
32,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5
33,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0
61,JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0
141,zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5


#### Save results to csv files

In [355]:
df_selected_business.to_csv(path + 'selected_business.csv', index=False)

In [356]:
# Try reloading the csv file to check that everything works
df_selected_business_test = pd.read_csv(path + 'selected_business.csv')

In [357]:
df_selected_business_test.head()

Unnamed: 0,business_id,name,categories,avg_stars
0,vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5
1,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5
2,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0
3,JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0
4,zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5


### 3. Use the "business_id" column to filter review data

* We want a DataFrame that contains only the reviews of the businesses we just selected

#### Prepare DataFrames to be joined on business_id

In [371]:
# Prepare the business dataframe and set index to column "business_id", and name it as df_left
df_left = pd.read_csv(path + 'selected_business.csv')

df_left.set_index('business_id', inplace=True)

In [372]:
# df_left.head()

In [373]:
# Prepare the review dataframe and set index to column "business_id", and name it as df_right
df_right = df_review.copy()

df_right.set_index('business_id', inplace=True)

In [374]:
# df_right.head()

In [375]:
# df_right.info()

#### Convert date column from object to datetime

In [376]:
df_right['date'] = pd.to_datetime(df_right['date'])

In [377]:
# df_right.info()

In [378]:
# df_right.head()

#### Join! and reset index

In [379]:
# Join df_left and df_right.
df_joined = pd.merge(df_left, df_right, how='left', on='business_id')
df_joined.head()

business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
vJIuDBdu01vCA8y1fwR1OQ,CakesbyToi,"American (Traditional), Food, Bakeries, Restau...",1.5,,NaT,,,,,,
kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5,,NaT,,,,,,
0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0,,NaT,,,,,,
JJEx5wIqs9iGGATOagE8Sg,Baja Fresh Mexican Grill,"Mexican, Restaurants",2.0,,NaT,,,,,,
zhxnD7J5_sCrKSw5cwI9dQ,Popeyes Louisiana Kitchen,"Chicken Wings, Restaurants, Fast Food",1.5,,NaT,,,,,,
2kWrSFkIes_d2BMg4YrRtA,Pizza Hut,"Restaurants, Pizza",2.5,,NaT,,,,,,
6llKs7K_tn8ChXcIM-oTvg,Sansei Japan,"Japanese, Restaurants",4.5,,NaT,,,,,,
YV9GVfmDSDM7HSV0jVdTOA,El Pollo Loco,"Restaurants, Salad, Fast Food, Mexican",3.0,,NaT,,,,,,
F7OsiFk9aLZtqZczA84xpw,Popeyes Louisiana Kitchen,"Southern, Chicken Wings, Fast Food, American (...",2.0,,NaT,,,,,,
XeDLyY2a7nZ3IEY4RYslXA,Chicago Brewing Company,"American (New), Restaurants, Food, Breweries, ...",3.5,,NaT,,,,,,
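
The all-NaN review columns in the output above are what a left join produces for businesses that have no matching review in the sample. As a hedged sketch on invented data (the frames `businesses` and `reviews` are hypothetical), the snippet below contrasts `how="left"` with `how="inner"`, which would keep only businesses that have reviews.

```python
import pandas as pd

# Toy frames indexed by business_id, mirroring df_left and df_right.
businesses = pd.DataFrame(
    {"name": ["Cafe X", "Diner Y"]},
    index=pd.Index(["b1", "b2"], name="business_id"),
)
reviews = pd.DataFrame(
    {"stars": [5, 3]},
    index=pd.Index(["b1", "b1"], name="business_id"),
)

# how="left": every business is kept; "b2" has no review, so its stars is NaN.
left = pd.merge(businesses, reviews, how="left", on="business_id")

# how="inner": only businesses with at least one review survive.
inner = pd.merge(businesses, reviews, how="inner", on="business_id")

print(len(left), len(inner))
```

Which variant fits depends on whether businesses without reviews should stay in the result. With the small sample files, few business ids may overlap between the two frames.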


In [None]:
# reset the index
df_joined.reset_index(inplace=True)

#### Further filter the data by date, e.g. keep reviews from the last 2 years

* Otherwise your laptop may run out of memory when running machine learning algorithms
* Purposefully ignore reviews that were made too long ago

In [None]:
# Make a filter that selects date after 2015-01-20

In [None]:
# Filter the joined DataFrame and name it as df_final
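
One possible way to fill in the two cells above is sketched here on a toy frame. The data and the names `toy_joined`, `cutoff` and `recent` are assumptions for illustration. It relies on the `date` column already being `datetime64`, as converted earlier.

```python
import pandas as pd

# Toy stand-in for df_joined after reset_index(); values are invented.
toy_joined = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "date": pd.to_datetime(["2014-06-01", "2016-03-15", "2017-11-02"]),
    "stars": [4, 5, 2],
})

# Comparing a datetime64 column with a Timestamp yields a boolean mask.
cutoff = pd.Timestamp("2015-01-20")
recent = toy_joined["date"] > cutoff

# .copy() gives an independent frame that is safe to modify later.
df_final = toy_joined[recent].copy()
print(df_final)
```

Rows whose date is `NaT` compare as False, so businesses without any review would be dropped by this mask as well.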

#### Take a glance at the final dataset

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# calculate counts of reviews per business entity, and plot it
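
A minimal sketch of the counting step, assuming the final frame has one row per review. The frame `toy_final` and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the final dataset: one row per review.
toy_final = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3", "b3", "b3"],
    "stars": [4, 5, 2, 3, 3, 1],
})

# Counting rows per business_id gives the number of reviews per business,
# sorted from most to least reviewed.
reviews_per_business = toy_final["business_id"].value_counts()
print(reviews_per_business)
```

In the notebook, calling `reviews_per_business.plot.bar()` or `reviews_per_business.plot.hist()` on the result would draw the counts inline once matplotlib is set up.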

## Save your preprocessed dataset to csv file

* Respect your laptop's hard work! You don't want to make it run everything again.
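
A hedged sketch of the save step follows. The file name `preprocessed_reviews.csv` and the toy frame are hypothetical, and a temporary directory stands in for the notebook's `path`. Passing `parse_dates` on reload restores the datetime column, which a CSV file stores as plain text.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final dataset; values are invented.
toy_final = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": pd.to_datetime(["2016-03-15", "2017-11-02"]),
    "stars": [5, 2],
})

# Hypothetical output name; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "preprocessed_reviews.csv")

# index=False skips the row index, matching the earlier to_csv call.
toy_final.to_csv(out_path, index=False)

# Reload and convert the date column back to datetime64.
reloaded = pd.read_csv(out_path, parse_dates=["date"])
print(reloaded.dtypes)
```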