Experiments with Yelp API

Notes:
* Documentation: https://www.yelp.com/developers/documentation/v3
* Limit of 25,000 calls per day (see FAQ)
* Options for search: https://www.yelp.com/developers/documentation/v3/business_search
    * 'term', 'location', 'limit' (max 50), 'offset', 'price'
    * 'categories' - comma-delimited string, using category identifier (e.g. "bars,french")
    * 'sort_by': 'best_match', 'rating', 'review_count', 'distance'
        * sorting by 'review_count' or 'rating' breaks after 200 results for NYC
        * sorting by 'best_match' throws an error when asking for results beyond the first 1000
    * 'attributes': comma-delimited string, e.g. 'hot_and_new,cashback'; other values:
        * 'request_a_quote', 'waitlist_reservation', 'deals', 'gender_neutral_restrooms'
    * Returns: 'total' (number of businesses matching the search)
* City population data obtained from:
    * https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk
    * Wikipedia, for populations of US cities
    
Scrapes
1. Get top 1000 restaurants in each city (761 cities * 20 requests of 50 results = ~15,000 scrapes)
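
The request budget for this scrape can be checked with a few lines. This is only a sketch: the 761 cities, the page limit of 50, the 1000-result cap, and the 25,000-calls-per-day limit come from the notes above, while the variable names are illustrative.

```python
# Request budget for scraping the top 1000 results per city.
n_cities = 761        # number of cities, from the notes
page_size = 50        # maximum 'limit' per search request
max_results = 1000    # results reachable per query before the API errors
daily_limit = 25000   # daily call limit, per the FAQ

# Offsets passed as the 'offset' search parameter: 0, 50, ..., 950
offsets = list(range(0, max_results, page_size))
requests_per_city = len(offsets)
total_requests = n_cities * requests_per_city

print(requests_per_city)              # 20
print(total_requests)                 # 15220
print(total_requests <= daily_limit)  # True
```

Under these numbers the full scrape fits inside a single day's quota, with room left for retries.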

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import json
import time
import os
from json.decoder import JSONDecodeError

In [2]:
import util

### Load food categories and cities

In [3]:
# The 'alias' values are used for the 'categories' input to the search function
df_categories = pd.read_json('/gh/data2/yelp/categories.json')
df_categories.head()

Unnamed: 0,alias,country_blacklist,country_whitelist,parents,title
0,3dprinting,,,[localservices],3D Printing
1,abruzzese,,[IT],[italian],Abruzzese
2,absinthebars,,[CZ],[bars],Absinthe Bars
3,acaibowls,"[AR, PL, TR, MX, CL, IT]",,[food],Acai Bowls
4,accessories,,,[fashion],Accessories
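
The 'categories' search option expects a comma-delimited string of aliases, so a small helper can build it from this table. The following is a minimal sketch: `categories_param` is a hypothetical helper (it is not part of `util`), and `df_demo` just re-creates the five rows printed above so the example runs on its own.

```python
import pandas as pd

# Rows copied from the df_categories.head() output above
df_demo = pd.DataFrame({
    'alias': ['3dprinting', 'abruzzese', 'absinthebars', 'acaibowls', 'accessories'],
    'parents': [['localservices'], ['italian'], ['bars'], ['food'], ['fashion']],
})

def categories_param(df, parents):
    """Join the aliases whose 'parents' list overlaps `parents` into one string."""
    wanted = set(parents)
    mask = df['parents'].apply(lambda p: bool(wanted & set(p)))
    return ','.join(df.loc[mask, 'alias'])

print(categories_param(df_demo, ['bars', 'food']))  # absinthebars,acaibowls
```

The returned string could then be assigned to `search_params['categories']` to restrict a search to those categories.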


In [None]:
# Load cities info
df_cities = pd.read_csv('/gh/data2/yelp/city_pop.csv', index_col=0)
df_cities.head()

Unnamed: 0,city,state,population
0,New York,New York,8537673
1,Los Angeles,California,3976322
2,Chicago,Illinois,2704958
3,Houston,Texas,2303482
4,Phoenix,Arizona,1615017


# Query API
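
The scraping loop in this section calls `util.query_api`, whose source is not shown in this notebook. Assuming it sends `search_params` to the business search endpoint and unpacks the JSON body into `(total, latitude, longitude, businesses)`, the unpacking step might look like the sketch below. The function name `parse_search_response` and the sample payload are made up for illustration; the field names ('total', 'businesses', 'region' with a 'center') are assumed from the business search response format.

```python
def parse_search_response(payload):
    """Return (total, latitude, longitude, businesses) from a search response dict.

    Returns (None, None, None, []) when the page holds no businesses, which
    matches the `t is None` stop condition used by the scraping loop.
    """
    businesses = payload.get('businesses', [])
    if not businesses:
        return None, None, None, []
    center = payload.get('region', {}).get('center', {})
    return (payload.get('total'),
            center.get('latitude'),
            center.get('longitude'),
            businesses)

# Hand-written payload with made-up values, shaped like a search response
sample = {
    'total': 2,
    'businesses': [{'id': 'a', 'name': 'Place A'}, {'id': 'b', 'name': 'Place B'}],
    'region': {'center': {'latitude': 40.7, 'longitude': -74.0}},
}

t, lat, lon, bus = parse_search_response(sample)
print(t, lat, lon, len(bus))
```

Keeping the parsing separate from the HTTP request makes the stop condition easy to check without spending API calls.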

In [None]:
t_start = time.time()

# Define search term
search_term = 'food'
step_size = 50
N_steps = 20 # 1000 results max

# Prepare parameters and outputs for looping through each city
N_cities = len(df_cities)
search_params = {'term': search_term,
                'limit': step_size
                }
# Collect restaurant data from each city
for i, row in df_cities.iterrows():
    # Print city and time elapsed
    print('\n{:}, time = {:.2f} seconds'.format(row['city'], time.time()-t_start))
    
    # Check whether this city has already been scraped
    json_name = '/gh/data2/yelp/food_by_city/places/'+row['city']+'.json'
    if os.path.isfile(json_name):
        print('Already scraped')

    else:
        # Update location
        search_params['location'] = row['city'] + ', ' + row['state']

        # Loop through the first 1000 in steps of 50
        total_temp = []
        lats_temp = []
        longs_temp = []
        businesses_temp = []
        for j in range(N_steps):
            # Determine range of restaurants to acquire
            search_params['offset'] = step_size*j

            # Scrape 50 restaurants
            try:
                t, lat, lon, bus = util.query_api(search_params, verbose=True)
            except JSONDecodeError:
                print('Got a JSON decode error. Try again.')
                time.sleep(5)
                try:
                    t, lat, lon, bus = util.query_api(search_params)
                except JSONDecodeError:
                    print('Another JSON decode error. Try the next block.')
                    break

            # Exit loop if no more restaurants
            if t is None:
                print('Finished getting restaurants after scraping:', search_params['offset'])
                break

            # Save business data
            total_temp.append(t)
            lats_temp.append(lat)
            longs_temp.append(lon)
            businesses_temp.append(bus)

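        # Skip this city if nothing was retrieved (np.hstack fails on an empty list)
        if not businesses_temp:
            print('No results retrieved; skipping')
            continue

        # Save the business data to a JSON file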
        with open(json_name, 'w') as fout:
            json.dump(list(np.hstack(businesses_temp)), fout)

        # Save totals array
        totals_name = '/gh/data2/yelp/food_by_city/totals/'+row['city']+'.npy'
        np.save(totals_name, total_temp)

        # Save latitude (a single value if it is the same for every request)
        lats_vary = np.any(np.array(lats_temp) != lats_temp[0])
        if lats_vary:
            print('Latitude not constant:')
            print(lats_temp)
            lats_name = '/gh/data2/yelp/food_by_city/lats/'+row['city']+'.npy'
            np.save(lats_name, lats_temp)
        else:
            lats_name = '/gh/data2/yelp/food_by_city/lats/'+row['city']+'.txt'
            with open(lats_name, "w") as f:
                f.write(str(lats_temp[0]))

        # Save longitude (a single value if it is the same for every request)
        longs_vary = np.any(np.array(longs_temp) != longs_temp[0])
        if longs_vary:
            print('Longitude not constant:')
            print(longs_temp)
            longs_name = '/gh/data2/yelp/food_by_city/longs/'+row['city']+'.npy'
            np.save(longs_name, longs_temp)
        else:
            longs_name = '/gh/data2/yelp/food_by_city/longs/'+row['city']+'.txt'
            with open(longs_name, "w") as f:
                f.write(str(longs_temp[0]))


New York, time = 0.00 seconds
Already scraped

Los Angeles, time = 0.00 seconds
Already scraped

Chicago, time = 0.00 seconds
Already scraped

Houston, time = 0.00 seconds
Already scraped

Phoenix, time = 0.00 seconds
Already scraped

Philadelphia, time = 0.00 seconds
Already scraped

San Antonio, time = 0.00 seconds
Already scraped

San Diego, time = 0.00 seconds
Already scraped

Dallas, time = 0.00 seconds
Already scraped

San Jose, time = 0.01 seconds
Already scraped

Austin, time = 0.01 seconds
Already scraped

Jacksonville, time = 0.01 seconds
Already scraped

San Francisco, time = 0.01 seconds
Already scraped

Columbus, time = 0.01 seconds
Already scraped

Indianapolis, time = 0.01 seconds
Already scraped

Fort Worth, time = 0.01 seconds
Already scraped

Charlotte, time = 0.01 seconds
Already scraped

Seattle, time = 0.01 seconds
Already scraped

Denver, time = 0.01 seconds
Already scraped

El Paso, time = 0.01 seconds
Already scraped

Washington, time = 0.01 seconds
Already scr