# Yelp Dataset Challenge

## Dataset Introduction

- 4.1M reviews for 144k businesses
- 1.1M business attributes, e.g., hours, parking availability, ambience
- 111 unique cities from around the world

This project works with Las Vegas, the most popular city in the dataset.


In [2]:
import datetime
import json
import pandas as pd
import numpy as np

In [2]:
business = pd.read_json('data/yelp_academic_dataset_business.json',lines=True)

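If loading the whole business file in one call is too slow or uses too much memory, it can be read in chunks instead. The following is a minimal sketch, not the approach used in this notebook: the helper name `load_city_businesses`, the `city_prefix` filter, and the in-memory sample records are all assumptions made for illustration.

```python
import io
import pandas as pd

def load_city_businesses(path_or_buf, city_prefix="las", chunksize=50_000):
    """Read a JSON-lines file in chunks, keeping rows whose city starts with `city_prefix`."""
    kept = []
    # With lines=True and a chunksize, read_json yields one DataFrame per chunk
    for chunk in pd.read_json(path_or_buf, lines=True, chunksize=chunksize):
        mask = chunk["city"].str.lower().str.startswith(city_prefix, na=False)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for the real file (hypothetical records)
sample = io.StringIO(
    '{"business_id": "a", "city": "Las Vegas"}\n'
    '{"business_id": "b", "city": "Phoenix"}\n'
    '{"business_id": "c", "city": "LAS VEGAS AP"}\n'
)
subset = load_city_businesses(sample, chunksize=2)
print(subset)
```

With the real data, the file path would be passed in place of `sample`. Filtering per chunk keeps only the rows of interest in memory.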

First, check the various spellings that refer to Las Vegas so that they can be standardized.

In [26]:
#check for all variations of "Las Vegas"

for name in business['city'].unique():
    if name.lower().startswith('las'):
        print(name)

Las Vegas
Lasalle
LaSalle
LAS VEGAS AP


In [6]:
# Standardize all variants to "Las Vegas"
city_names = ["las vegas", "las  vegas", "las vegas east",
                    "las vegass", "lasvegas", "las vegas,", "las vegas nv",
                    "las vegas, nv", "las vegas nevada",'las vegas, nevada']

def fix_city_name(name,names_to_replace,key=None):
    if name.lower().strip() in names_to_replace:
        return key
    return name

business.city = business.city.apply(fix_city_name,args=(city_names,"Las Vegas"))
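
The same clean-up can also be written with vectorized string methods, which additionally collapse repeated spaces before comparing. This is only a sketch under assumptions: the spellings in `cities` and the shortened `variants` set are hypothetical examples, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical spellings, for illustration only
cities = pd.Series(["Las Vegas", " las  vegas ", "LAS VEGAS, NV", "Lasalle", "Henderson"])

# Lower-case, trim, and collapse runs of whitespace before comparing
normalized = (
    cities.str.lower()
          .str.strip()
          .str.replace(r"\s+", " ", regex=True)
)
variants = {"las vegas", "las vegas, nv", "las vegas nv", "las vegas nevada"}

# Replace matching variants with the canonical name; leave everything else untouched
fixed = cities.where(~normalized.isin(variants), "Las Vegas")
print(fixed.tolist())
```

Unrelated cities such as "Lasalle" are left alone because only exact matches against the normalized variants are replaced.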

In [7]:
LV = business[business['city'] == 'Las Vegas']
LV.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31689 entries, 6 to 209386
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   31689 non-null  object 
 1   name          31689 non-null  object 
 2   address       31689 non-null  object 
 3   city          31689 non-null  object 
 4   state         31689 non-null  object 
 5   postal_code   31689 non-null  object 
 6   latitude      31689 non-null  float64
 7   longitude     31689 non-null  float64
 8   stars         31689 non-null  float64
 9   review_count  31689 non-null  int64  
 10  is_open       31689 non-null  int64  
 11  attributes    27548 non-null  object 
 12  categories    31614 non-null  object 
 13  hours         25093 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 3.6+ MB


## Filter data by city and category

Extract all data associated with restaurants in Las Vegas

In [8]:
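# str.match anchors at the start of the string, so this keeps businesses whose
# category list begins with 'Restaurants'; na=False treats missing categories as non-matches
lv_res = business[(business.city == 'Las Vegas') & (business.categories.str.match('Restaurants', na=False))]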
lv_res

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 238 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 3759 E Desert Inn Rd | Las Vegas | NV | 89121 | 36.129178 | -115.092483 | 5.0 | 222 | 1 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | Restaurants, Italian | {'Monday': '0:0-0:0', 'Tuesday': '16:0-21:0', ... |
| 246 | AtD6B83S4Mbmq0t7iDnUVA | Veggie House | 5115 Spring Mountain Rd, Ste 203 | Las Vegas | NV | 89146 | 36.125569 | -115.210911 | 4.5 | 1142 | 1 | {'RestaurantsPriceRange2': '2', 'BikeParking':... | Restaurants, Specialty Food, Japanese, Sushi B... | {'Monday': '11:30-21:30', 'Tuesday': '11:30-21... |
| 364 | ZsKWULhwwB61RHzCrb1i9A | Blue Burrito Grille | 5757 Wayne Newton Blvd | Las Vegas | NV | 89119 | 36.080542 | -115.146459 | 2.5 | 31 | 1 | {'RestaurantsDelivery': 'False', 'GoodForKids'... | Restaurants, Mexican | |
| 450 | c1JoHp602zilpDU_57DsMg | Starbucks | 3950 Las Vegas Blvd So | Las Vegas | NV | 89119 | 36.093270 | -115.175966 | 2.0 | 57 | 1 | {'BikeParking': 'False', 'BusinessParking': '{... | Restaurants, Cafes, Food, Internet Cafes, Coff... | {'Monday': '6:0-23:0', 'Tuesday': '6:0-23:0', ... |
| 867 | kUntNQ5P9IrRzEoHdRxV-w | Mark Rich's New York Pizza & Pasta | 7930 W Tropical Pkwy | Las Vegas | NV | 89149 | 36.271037 | -115.267937 | 3.0 | 103 | 0 | {'RestaurantsReservations': 'False', 'Business... | Restaurants, Pizza | {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 209086 | RhV7sraRUB3km-gF-tmDow | Roberto's Taco Shop | 1101 S Fort Apache Rd | Las Vegas | NV | 89117 | 36.158356 | -115.292372 | 3.0 | 72 | 1 | {'RestaurantsPriceRange2': '1', 'BusinessAccep... | Restaurants, Mexican, Fast Food | |
| 209137 | Nk8rRsl9u4GrbxEeEVsedg | Thai Cuisine 2 | 4420 E Charleston Blvd, Ste 5 | Las Vegas | NV | 89110 | 36.173634 | -115.059814 | 2.5 | 3 | 0 | {'OutdoorSeating': 'False', 'RestaurantsTakeOu... | Restaurants, Thai | |
| 209260 | d4Mw96Hb6ZoHEL2AxqGrbg | Ice House America | 1735 E Warm Springs Rd | Las Vegas | NV | 89119 | 36.057189 | -115.128511 | 5.0 | 4 | 1 | {'RestaurantsGoodForGroups': 'True', 'Restaura... | Restaurants, Food, American (New), Local Services | |
| 209273 | Q2pyp7Nb3cD4Br2BOnxUDA | Greek Delights | 301 S MLK Blvd | Las Vegas | NV | 89106 | 36.170901 | -115.160635 | 4.0 | 28 | 0 | {'WiFi': "'no'", 'RestaurantsPriceRange2': '2'... | Restaurants, Food Trucks, Greek, Food, Mediter... | {'Monday': '0:0-0:0', 'Tuesday': '11:0-19:0', ... |
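
Note that `str.match` only matches at the beginning of each string, so the filter above keeps businesses whose category list starts with "Restaurants". Whether that restriction is intended is not stated here. The sketch below contrasts it with `str.contains` on a few made-up category strings (the values in `cats` are hypothetical).

```python
import pandas as pd

# Hypothetical category strings, including a missing value
cats = pd.Series(["Restaurants, Italian", "Food, Restaurants", "Shopping", None])

# Anchored at the start of the string
starts_with = cats.str.match("Restaurants", na=False)

# Matches anywhere in the string
anywhere = cats.str.contains("Restaurants", na=False)

print(pd.DataFrame({"categories": cats, "match": starts_with, "contains": anywhere}))
```

Passing `na=False` makes rows with missing categories count as non-matches, so the result is a clean Boolean mask.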


Next, drop the columns irrelevant to sentiment analysis

In [35]:
lv_res.columns

Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count', 'is_open',
       'attributes', 'categories', 'hours'],
      dtype='object')

`hours` is all NA and `is_open` contains only 9 values, so both are dropped because they carry minimal information.
Of the geographic columns, only `postal_code` is kept, so that differences between districts can be investigated later.

In [39]:
lv_res['is_open'].value_counts()

0    6
1    3
Name: is_open, dtype: int64
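
A quick way to support this kind of decision is to tabulate, per column, the share of missing values and the number of distinct values. The following is a sketch on a toy frame; the values in `toy` are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one all-missing column, one binary flag, one partly missing column
toy = pd.DataFrame({
    "hours": [np.nan, np.nan, np.nan, np.nan],
    "is_open": [1, 1, 0, 1],
    "postal_code": ["89121", "89146", None, "89119"],
})

summary = pd.DataFrame({
    "missing_share": toy.isna().mean(),  # fraction of NA values per column
    "n_unique": toy.nunique(),           # distinct non-NA values per column
})
print(summary)
```

Columns with a `missing_share` close to 1 or a very small `n_unique` are natural candidates for dropping.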

In [9]:
lv_res = lv_res.drop(columns = ['address','city','state','latitude','longitude','is_open','categories',
                                'hours'])

In [48]:
#check new dataframe
#lv_res.head()

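Join the restaurant dataframe with the reviews and save the result to CSV for easier reuse later

In [None]:
# Join the restaurant dataframe with the review dataframe.
# An inner join is used because we only want restaurants with reviews attached to them.
# NOTE: `review` is assumed to be the review dataset, loaded beforehand (not shown in this notebook).
br_joined = lv_res.merge(review, on="business_id", how="inner")

# Reset the index so that it runs from 0 to n-1
br_joined = br_joined.reset_index(drop=True)

# Save the joined dataframe; it is reloaded from this file in the next section
br_joined.to_csv('br_joined.csv', index=False)
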
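
The business and review data both have a `stars` column, so after the merge pandas distinguishes them with the suffixes `_x` (left frame) and `_y` (right frame). A small sketch with invented frames shows the effect of the inner join and how the `suffixes` argument can give more descriptive names. The frames `restaurants` and `reviews` and the suffix names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the restaurant and review dataframes
restaurants = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["A", "B", "C"],
    "stars": [4.5, 3.0, 5.0],
})
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b3", "b9"],
    "review_id": ["r1", "r2", "r3", "r4"],
    "stars": [5, 4, 5, 1],
})

# Default suffixes: the overlapping column becomes stars_x and stars_y
joined = restaurants.merge(reviews, on="business_id", how="inner")
print(joined.columns.tolist())

# Explicit suffixes make the origin of each column clearer
named = restaurants.merge(
    reviews, on="business_id", how="inner", suffixes=("_business", "_review")
)
print(named.columns.tolist())
```

The restaurant without reviews (`b2`) and the review without a matching restaurant (`b9`) are both dropped by the inner join.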

## Load review dataset

In [3]:
br = pd.read_csv('br_joined.csv')
br.head()

| | business_id | name | postal_code | stars_x | review_count | attributes | review_id | user_id | stars_y | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 89121.0 | 5.0 | 222 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | htSvAB0GEPZOdvebmVqg4g | GgCjStvmclW9uedJa_tTlA | 5.0 | 1 | 0 | 0 | Very good restaurant, they have many choices a... | 2018-09-03 02:54:29 |
| 1 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 89121.0 | 5.0 | 222 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | yoXlN_RAJAVuhvR4lLs_nw | 4CR7rQLHuXZpfLzDvqlaIA | 5.0 | 0 | 1 | 1 | Awsome little Italian place. Never would have ... | 2018-06-19 17:20:53 |
| 2 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 89121.0 | 5.0 | 222 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | sjGYJKxxNtKcFeZHwjfYLg | UkBp300T1dfvMK8BLq08qQ | 5.0 | 0 | 0 | 0 | We moved back to Vegas about a year ago and he... | 2018-08-05 03:13:21 |
| 3 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 89121.0 | 5.0 | 222 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | 3ccPyWbmFfR8T-Ev3qQDcg | IyQGS915aIQcwPtSwKLI8w | 5.0 | 0 | 0 | 0 | We came to Los Olivos to have our Festa Di Nat... | 2017-12-16 15:10:08 |
| 4 | AN0bWhisCf6LN9eHZ7DQ3w | Los Olivos Ristorante | 89121.0 | 5.0 | 222 | {'WiFi': "u'free'", 'RestaurantsPriceRange2': ... | 2lg2JofvDwl7zoRxJuZMmw | 95RciOZdfdypm7DhNcn7cw | 5.0 | 0 | 0 | 0 | A hidden Gem! We enjoyed Albino's cooking seve... | 2017-11-19 04:19:30 |


To speed up computation, only reviews from the most recent two years are kept.

In [4]:
# Find out when the latest review was made
br.date.max()

'2019-12-13 15:34:17'

In [4]:
# Keep only reviews from the last two years, then check that the slicing worked
br = br.query("`date`>'2017-12-13'")
br.date.min()

'2017-12-13 00:10:44'
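
The cutoff above is hard-coded and compared as a string, which works because the timestamps are in ISO format. As an alternative sketch, the cutoff can be derived from the latest review after parsing the column as datetimes. The four timestamps in `toy` are borrowed from the outputs shown in this notebook purely as sample values; the variable names are assumptions.

```python
import pandas as pd

# Sample timestamps for illustration
toy = pd.DataFrame({"date": [
    "2019-12-13 15:34:17",
    "2018-06-19 17:20:53",
    "2017-12-13 00:10:44",
    "2017-11-19 04:19:30",
]})
toy["date"] = pd.to_datetime(toy["date"])

# Two years before the latest review, truncated to midnight
cutoff = (toy["date"].max() - pd.DateOffset(years=2)).normalize()

# Keep only reviews made after the cutoff
recent = toy[toy["date"] > cutoff]
print(cutoff, recent["date"].min())
```

Deriving the cutoff this way avoids editing the date by hand when the dataset is refreshed.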

In [5]:
br.to_csv('data/preprocessed.csv')
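
By default, `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is reloaded. The sketch below shows the difference using an in-memory buffer and a made-up frame; passing `index=False` is one option if the index carries no information.

```python
import io
import pandas as pd

# Hypothetical frame with a non-default index
toy = pd.DataFrame({"business_id": ["b1", "b2"], "stars": [5.0, 4.0]}, index=[7, 9])

# Default: the index is written and reappears as an unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```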