# Yelp API - Business Search

## 0 Get ready
Import the necessary libraries and set the API parameters.

In [1]:
# request setting
# https://github.com/Yelp/yelp-python
from yelp.client import Client
import requests
import pandas as pd
import numpy as np
import json
# For Python 3.0 and later
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.parse import urlencode

In [69]:
# set api parameters
# strip() drops any trailing newline or spaces from the key file
with open('api.txt', 'r') as f:
    API_KEY = f.read().strip()
API_HOST = 'https://api.yelp.com'
SEARCH_PATH = '/v3/businesses/'
SEARCH_LIMIT = 50
SEARCH_REVIEW = '/v3/businesses/' #{id}/reviews

In [70]:
SEARCH_PATH+'search'

'/v3/businesses/search'
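
Both endpoints used in this notebook share the `/v3/businesses/` prefix. The following is a minimal sketch of how the full URLs are composed. `endpoint_url` is a hypothetical helper that does not exist in the notebook, and the business id is the one printed later in the Reviews section.

```python
API_HOST = 'https://api.yelp.com'
SEARCH_PATH = '/v3/businesses/'

def endpoint_url(suffix):
    """Hypothetical helper: join the host, the shared path prefix, and a suffix."""
    return API_HOST + SEARCH_PATH + suffix

# business search endpoint
search_url = endpoint_url('search')
# reviews endpoint for a single business id
reviews_url = endpoint_url('t9rfY_0J9YrsjAHw1FcupA' + '/reviews')

print(search_url)
print(reviews_url)
```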

## 1 Create functions
Functions:
1. request: takes the host, path, api_key, and query arguments, then sends a GET request to the API.
2. search: takes api_key, term, location, and offset, then passes them to the `request` function and returns its response.
3. parse_json: parses a JSON file downloaded from Postman and returns a pandas DataFrame.
4. url_arg: builds the URL parameters for a request.
5. clean: cleans the data returned by the Yelp API and returns a DataFrame.

In [3]:
# define a function to send request
def request(host, path, api_key, url_params=None):
    """Given your API_KEY, send a GET request to the API.
    Args:
        host (str): The domain host of the API.
        path (str): The path of the API after the domain.
        api_key (str): Your API Key.
        url_params (dict): An optional set of query parameters in the request.
    Returns:
        dict: The JSON response from the request.
    Raises:
        requests.exceptions.HTTPError: If the API responds with an error status.
    """
    url_params = url_params or {}
    url = '{0}{1}'.format(host, quote(path.encode('utf8')))
    headers = {
        'Authorization': 'Bearer %s' % api_key,
    }

    #print(u'Querying {0} ...'.format(url))

    response = requests.request('GET', url, headers=headers, params=url_params)
    # raise on 4xx/5xx responses so failures are not silently parsed as data
    response.raise_for_status()

    return response.json()

In [4]:
# define a function to search business
def search(api_key, term, location, offset):
    """Query the Search API by a search term and location.
    Args:
        api_key (str): Your API Key.
        term (str): The search term passed to the API.
        location (str): The search location passed to the API.
        offset (int): Number of results to skip, used for paging.
    Returns:
        dict: The JSON response from the request.
    """

    # requests URL-encodes the values, so spaces are passed through unchanged
    url_params = {
        'term': term,
        'location': location,
        'limit': SEARCH_LIMIT,
        'offset': offset
    }
    # use the api_key argument and the full search endpoint
    return request(API_HOST, SEARCH_PATH + 'search', api_key, url_params=url_params)

In [5]:
# define a function to parse the json file from Postman
def parse_json(file_path, name=None):
    """Parse a JSON file that holds one JSON document per line.
    If the file was downloaded from Postman, pass the search type as `name` to get
    a pandas DataFrame. Otherwise leave it empty to get the list of parsed documents."""
    # the with block closes the file once it has been read
    with open(file_path, 'r') as f:
        temp = [json.loads(line) for line in f]
    if name is not None:
        temp = pd.DataFrame.from_dict(temp[0][name])
    return temp
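
Because `parse_json` calls `json.loads` on each line, it expects every JSON document to sit on a single line; a pretty-printed file spread over many lines would fail on its first line. The sketch below shows the same line-by-line logic on a made-up sample written to a temporary file. The sample record and its field values are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up sample in the shape of a search response, kept on one line
sample = {'businesses': [{'id': 'a1', 'name': 'Example Cafe'}], 'total': 1}

# Write the sample to a temporary file, one JSON document per line
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    f.write(json.dumps(sample) + '\n')
    path = f.name

# Same logic as parse_json: parse each line as its own JSON document
with open(path, 'r') as f:
    records = [json.loads(line) for line in f]
os.remove(path)

# Equivalent of passing name='businesses': take that key from the first document
businesses = records[0]['businesses']
print(businesses)
```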

In [6]:
def url_arg(location, term='restaurants', categories=None, offset=0):
    """Return the url parameters for search function."""
    # requests URL-encodes the values, so spaces are passed through unchanged
    url_params = {
            'term': term,
            'location': location,
            'limit': SEARCH_LIMIT,
            'offset':offset,
        }
    if categories is not None:
        url_params['categories']=categories
    return url_params
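
How spaces in parameter values end up in the final URL can be checked with the standard library. The sketch below uses `urllib.parse.urlencode` as a stand-in for the form-style encoding that `requests` applies to `params`; treat that equivalence as an assumption. It compares a value containing a plain space with a value where the space was already replaced by `+`.

```python
from urllib.parse import urlencode

# A plain space is encoded as '+'
plain = urlencode({'term': 'ice cream', 'location': 'seattle'})

# A literal '+' in the value is percent-encoded as '%2B'
pre_replaced = urlencode({'term': 'ice+cream', 'location': 'seattle'})

print(plain)
print(pre_replaced)
```

Under this encoding, a value that already contains `+` reaches the server as a literal plus sign rather than a space.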

### Parameters for business search
[Yelp Fusion | /businesses/search](https://www.yelp.com/developers/documentation/v3/business_search)

- term (string)    
Optional. Search term, for example "food" or "restaurants". The term may also be business names, such as "Starbucks". If term is not included the endpoint will default to searching across businesses from a small number of popular categories.  
- location (string)  
Required if either latitude or longitude is not provided. This string indicates the geographic area to be used when searching for businesses. Examples: "New York City", "NYC", "350 5th Ave, New York, NY 10118". Businesses returned in the response may not be strictly within the specified location.  
- latitude (decimal)  
Required if location is not provided. Latitude of the location you want to search nearby.  
- longitude (decimal)  
Required if location is not provided. Longitude of the location you want to search nearby.  
- radius (int)  
Optional. A suggested search radius in meters. This field is used as a suggestion to the search. The actual search radius may be lower than the suggested radius in dense urban areas, and higher in regions of less business density. If the specified value is too large, an AREA_TOO_LARGE error may be returned. The max value is 40000 meters (about 25 miles).  
- categories (string)  
Optional. Categories to filter the search results with. See the list of supported categories. The category filter can be a list of comma delimited categories. For example, "bars,french" will filter by Bars OR French. The category identifier should be used (for example "discgolf", not "Disc Golf").  
- locale (string)  
Optional. Specify the locale into which to localize the business information. See the list of supported locales. Defaults to en_US.  
- limit (int)  
Optional. Number of business results to return. By default, it will return 20. Maximum is 50.  
- offset (int)  
Optional. Offset the list of returned business results by this amount.  
- sort_by (string)  
Optional. Suggestion to the search algorithm that the results be sorted by one of these modes: best_match, rating, review_count or distance. The default is best_match. Note that specifying the sort_by is a suggestion (not strictly enforced) to Yelp's search, which considers multiple input parameters to return the most relevant results. For example, the rating sort is not strictly sorted by the rating value, but by an adjusted rating value that takes into account the number of ratings, similar to a Bayesian average. This is to prevent skewing results to businesses with a single review.  
- price (string)  
Optional. Pricing levels to filter the search result with: 1 to 4 dollar signs. The price filter can be a list of comma delimited pricing levels. For example, "1, 2, 3".  
- open_now (boolean)  
Optional. Default to false. When set to true, only return the businesses open now. Notice that open_at and open_now cannot be used together.  
- open_at (int)  
Optional. An integer representing the Unix time in the same timezone of the search location. If specified, it will return business open at the given time. Notice that open_at and open_now cannot be used together.  
- attributes (string)  
Optional. Try these additional filters to return specific search results!  
    - hot_and_new - popular businesses which recently joined Yelp
    - request_a_quote - businesses which actively reply to Request a Quote inquiries
    - reservation - businesses with Yelp Reservations bookings enabled on their profile page
    - waitlist_reservation - businesses with Yelp Waitlist bookings enabled on their profile screen (iOS/Android)
    - cashback - businesses offering Yelp Cash Back to in-house customers
    - deals - businesses offering Yelp Deals on their profile page
    - gender_neutral_restrooms - businesses which provide gender neutral restrooms
    - open_to_all - businesses which are Open To All
    - wheelchair_accessible - businesses which are Wheelchair Accessible  
You can combine multiple attributes by providing a comma-separated list such as "attribute1,attribute2". If multiple attributes are used, only businesses that satisfy ALL attributes will be returned in search results. For example, the attributes "hot_and_new,cashback" will return businesses that are Hot and New AND offer Cash Back.
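
Several of the constraints listed above can be checked before a request is sent. The following is a minimal sketch, not part of the notebook: `build_search_params` is a hypothetical helper, and it only covers the rules stated in the list (location or latitude plus longitude, `limit` up to 50, `radius` up to 40000, and `open_now` not combined with `open_at`).

```python
def build_search_params(location=None, latitude=None, longitude=None,
                        radius=None, limit=20, offset=0,
                        open_now=None, open_at=None, **extra):
    """Hypothetical helper: check a few documented constraints, then build the dict."""
    if location is None and (latitude is None or longitude is None):
        raise ValueError('location is required unless latitude and longitude are both given')
    if limit > 50:
        raise ValueError('limit cannot exceed 50')
    if radius is not None and radius > 40000:
        raise ValueError('radius cannot exceed 40000 meters')
    if open_now is not None and open_at is not None:
        raise ValueError('open_now and open_at cannot be used together')

    optional = {
        'location': location, 'latitude': latitude, 'longitude': longitude,
        'radius': radius, 'open_now': open_now, 'open_at': open_at,
    }
    # keep only the parameters that were actually provided
    params = {key: value for key, value in optional.items() if value is not None}
    params.update({'limit': limit, 'offset': offset})
    params.update(extra)  # pass-through for term, categories, sort_by, price, attributes
    return params

# coordinates here are illustrative values, not taken from the data files
params = build_search_params(latitude=47.6, longitude=-122.3, radius=500,
                             limit=50, term='restaurants')
print(params)
```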

## 2 Data preparation

In [8]:
# get the categories of the data
categories = pd.read_json('Data/categories.json')
# join each list of parent categories into a single space-separated string
categories['parents'] = categories['parents'].apply(' '.join)

# extract food and restaurant categories
food_categories = categories[categories['parents']=='food']
restaurant_categories = categories[categories['parents']=='restaurants']
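
The filter above compares the joined parents string with a single name, so a category that lists more than one parent would not match. The sketch below contrasts that equality test with a membership test on the original list. The sample entries are made up in the shape assumed for `categories.json` (`alias`, `title`, `parents`), and `example_multi` is a hypothetical alias.

```python
# Made-up entries in the shape assumed for categories.json
sample_categories = [
    {'alias': 'thai', 'title': 'Thai', 'parents': ['restaurants']},
    {'alias': 'bagels', 'title': 'Bagels', 'parents': ['food']},
    {'alias': 'example_multi', 'title': 'Example', 'parents': ['restaurants', 'food']},
]

# Equality on the joined string, as in the cell above
joined_match = [c['alias'] for c in sample_categories
                if ' '.join(c['parents']) == 'restaurants']

# Membership test on the list of parents
member_match = [c['alias'] for c in sample_categories
                if 'restaurants' in c['parents']]

print(joined_match)
print(member_match)
```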

In [64]:
def clean(df):
    """Expand the coordinates and location data into individual columns,
    drop duplicated businesses, and keep only rows located in Seattle."""
    # work on a copy so the input DataFrame is left untouched
    df = df.copy()
    coordinate_columns = ['latitude','longitude']
    location_columns = ['address1','address2','address3','city','zip_code','state','country']
    # build each column in one pass instead of assigning cell by cell
    for col in coordinate_columns:
        df[col] = [d[col] for d in df['coordinates']]
    for col in location_columns:
        df[col] = [d[col] for d in df['location']]
    df = df.drop(columns=['coordinates','location'])
    # delete duplicated restaurants
    bool_series = df['id'].duplicated()
    cleaned = df[~bool_series]
    # drop other cities
    cleaned = cleaned[cleaned['city']=='Seattle']
    return cleaned
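
The expansion step in `clean` can be illustrated on a single record without pandas. This is a sketch under assumptions: `flatten_business` is a hypothetical helper, and the sample record is made up to mirror the nested `coordinates` and `location` fields that the function expects.

```python
COORDINATE_KEYS = ('latitude', 'longitude')
LOCATION_KEYS = ('address1', 'address2', 'address3', 'city',
                 'zip_code', 'state', 'country')

def flatten_business(record):
    """Hypothetical helper: lift nested coordinates/location fields to the top level."""
    flat = {k: v for k, v in record.items() if k not in ('coordinates', 'location')}
    for key in COORDINATE_KEYS:
        flat[key] = record['coordinates'].get(key)
    for key in LOCATION_KEYS:
        flat[key] = record['location'].get(key)
    return flat

# Made-up record for illustration
sample = {
    'id': 'example-id',
    'name': 'Example Cafe',
    'coordinates': {'latitude': 47.6, 'longitude': -122.3},
    'location': {'address1': '1 Example St', 'address2': None, 'address3': None,
                 'city': 'Seattle', 'zip_code': '98101', 'state': 'WA',
                 'country': 'US'},
}

flat = flatten_business(sample)
print(flat)
```

A list of such flat dicts can be passed to `pd.DataFrame` directly.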

## 3 Extract Data
### Restaurant by neighborhood
Manually break down the area of each neighborhood into several small circles, and record each circle's center latitude, center longitude, and radius (in meters) in the `Neighborhood_locations.csv` file. Import these locations and pull all the restaurants within each circle.

In [99]:
neighborhood = pd.read_csv('Data/Neighborhood_locations.csv')
neighborhood = neighborhood[['Neighborhood','central_lat','central_long','radius_m']].dropna()

In [101]:
records = range(0,50,50)
restaurant_by_neighborhood=pd.DataFrame()
for j in range(len(neighborhood)):
    list_temp = []
    for i in records:
        url_params = {
            'term': 'restaurants',
            'location': 'seattle',
            'latitude':float(neighborhood.iloc[j]['central_lat']),
            'longitude':float(neighborhood.iloc[j]['central_long']),
            'radius':int(neighborhood.iloc[j]['radius_m']),
            'limit': SEARCH_LIMIT,
            'offset':i
        }
        temp = request(API_HOST, SEARCH_PATH+'search', API_KEY, url_params)['businesses']
        #temp = search(api_key, type_temp,'seattle',i)['businesses']
        list_temp = list_temp + temp
    list_temp = pd.DataFrame(list_temp)
    list_temp['neighborhood']=neighborhood.iloc[j]['Neighborhood']
    restaurant_by_neighborhood = pd.concat([restaurant_by_neighborhood,list_temp],sort=False)
# export the data to csv
restaurant_by_neighborhood.to_csv('restaurant_by_neighborhood.csv')
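
Since the API treats `radius` as a suggestion, some returned businesses may lie outside the circle that was requested. One way to check this afterwards is a great-circle distance test. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; `haversine_m` and `in_circle` are hypothetical helpers, and the coordinates are illustrative rather than taken from `Neighborhood_locations.csv`.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def in_circle(lat, lon, center_lat, center_lon, radius_m):
    """True if the point lies within radius_m meters of the circle center."""
    return haversine_m(lat, lon, center_lat, center_lon) <= radius_m

# Illustrative circle center and two nearby points
inside = in_circle(47.0005, -122.0, 47.0, -122.0, 100)
outside = in_circle(47.002, -122.0, 47.0, -122.0, 100)
print(inside, outside)
```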

### Restaurant by cuisine
Extract the first 100 results in each restaurant category from Yelp (two pages of 50, at offsets 0 and 50).

In [9]:
records = range(0,100,50)
restaurant_by_cuisine=pd.DataFrame()
for rest_type in list(restaurant_categories['alias']):
    list_temp = []
    for i in records:
        url_params = {
            'term': 'restaurants',
            'location': 'seattle',
            'limit': SEARCH_LIMIT,
            'categories': rest_type,
            'offset':i
        }
        temp = request(API_HOST, SEARCH_PATH+'search', API_KEY, url_params)['businesses']
        #temp = search(api_key, type_temp,'seattle',i)['businesses']
        list_temp = list_temp + temp
    list_temp = pd.DataFrame(list_temp)
    list_temp['cuisine']=rest_type
    restaurant_by_cuisine = pd.concat([restaurant_by_cuisine,list_temp],sort=False)
# export the data to csv
restaurant_by_cuisine.to_csv('restaurant_by_cuisine.csv')
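
The `records` range above determines how many pages are requested per category: `range(0, 100, 50)` yields the offsets 0 and 50, so up to 100 results with a limit of 50. A small sketch of generating the offsets for any target count follows; `page_offsets` is a hypothetical helper.

```python
def page_offsets(target, limit=50):
    """Hypothetical helper: offsets needed to fetch up to `target` results."""
    return list(range(0, target, limit))

print(page_offsets(50))
print(page_offsets(100))
print(page_offsets(150))
```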

### Reviews

In [79]:
# create subset from by_cuisine (defined in section 4)
african = by_cuisine[by_cuisine['cuisine']=='african']
# collect the reviews of every business in the subset
reviews = []
for i in range(len(african)):
    idtemp = african.iloc[i].id
    url_params = {}
    reviews = reviews + request(API_HOST, SEARCH_PATH+idtemp+'/reviews', API_KEY, url_params)['reviews']
    

In [96]:
african.to_csv('african.csv')

In [93]:
idtemp = african.iloc[0].id
idtemp
#url_params = {}
#reviews = request(API_HOST, SEARCH_PATH+idtemp+'/reviews', API_KEY, url_params)['reviews']

't9rfY_0J9YrsjAHw1FcupA'

In [92]:
reviews

[{'id': 'gqi50EECFOtP17Hx2D7FUw',
  'url': 'https://www.yelp.com/biz/salare-seattle?adjust_creative=1IljKHTYmX99yWQTpvY92g&hrid=gqi50EECFOtP17Hx2D7FUw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_reviews&utm_source=1IljKHTYmX99yWQTpvY92g',
  'text': "Met a friend for Sunday brunch, and she chose Salare. What a lovely spot! From the moment you walk in, it has a very 'brunch' feel (if that makes sense)....",
  'rating': 4,
  'time_created': '2020-03-01 16:27:34',
  'user': {'id': 'TgG3J58v5bEhMB8YCQ1TbQ',
   'profile_url': 'https://www.yelp.com/user_details?userid=TgG3J58v5bEhMB8YCQ1TbQ',
   'image_url': 'https://s3-media1.fl.yelpcdn.com/photo/jXI9yPC82saiTkNmA0ztNA/o.jpg',
   'name': 'Cristina P.'}},
 {'id': 'k-itO6leoEZWyNpowFeYWw',
  'url': 'https://www.yelp.com/biz/salare-seattle?adjust_creative=1IljKHTYmX99yWQTpvY92g&hrid=k-itO6leoEZWyNpowFeYWw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_reviews&utm_source=1IljKHTYmX99yWQTpvY92g',
  'text': "We came here to celebrate 
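
Each review in the output above is a nested dict, with the author stored under `user`. The sketch below turns such records into flat rows that are easier to store in a table. `review_rows` is a hypothetical helper, and the sample review is made up; only the field names are taken from the output shown above.

```python
def review_rows(business_id, reviews):
    """Hypothetical helper: one flat dict per review, tagged with its business id."""
    return [
        {
            'business_id': business_id,
            'review_id': r['id'],
            'rating': r['rating'],
            'time_created': r['time_created'],
            'user_name': r['user']['name'],
            'text': r['text'],
        }
        for r in reviews
    ]

# Made-up review in the same shape as the API output
sample_reviews = [
    {'id': 'review-1', 'url': 'https://example.com/review-1',
     'text': 'Sample text...', 'rating': 4,
     'time_created': '2020-03-01 16:27:34',
     'user': {'id': 'user-1', 'name': 'Sample U.'}},
]

rows = review_rows('example-business', sample_reviews)
print(rows)
```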

## 4 Clean the data

In [56]:
# expand location and coordinates, remove duplicates
by_cuisine = clean(restaurant_by_cuisine)
# export the data to csv
by_cuisine.to_csv('by_cuisine_cleaned.csv')


In [103]:
by_neighborhood = clean(restaurant_by_neighborhood)
# export the data to csv
by_neighborhood.to_csv('by_neighborhood_cleaned.csv')

In [108]:
by_neighborhood[['neighborhood','name','address1','city','zip_code','review_count','categories','id']].to_csv('by_neighborhood_OED.csv')
by_neighborhood['categories']

0                [{'alias': 'pizza', 'title': 'Pizza'}]
1     [{'alias': 'bars', 'title': 'Bars'}, {'alias':...
2     [{'alias': 'thai', 'title': 'Thai'}, {'alias':...
3     [{'alias': 'filipino', 'title': 'Filipino'}, {...
4     [{'alias': 'mediterranean', 'title': 'Mediterr...
                            ...                        
45    [{'alias': 'hotpot', 'title': 'Hot Pot'}, {'al...
46    [{'alias': 'vietnamese', 'title': 'Vietnamese'...
47    [{'alias': 'persian', 'title': 'Persian/Irania...
48                 [{'alias': 'thai', 'title': 'Thai'}]
49    [{'alias': 'shavedice', 'title': 'Shaved Ice'}...
Name: categories, Length: 364, dtype: object
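
The `categories` column holds a list of dicts per business, which is awkward to read back from a CSV file. A minimal sketch of reducing each list to a comma-separated string of aliases follows. `category_aliases` is a hypothetical helper, and the sample lists are shortened examples in the same shape as the output above.

```python
def category_aliases(categories):
    """Hypothetical helper: join the alias of every category into one string."""
    return ','.join(c['alias'] for c in categories)

single = category_aliases([{'alias': 'pizza', 'title': 'Pizza'}])
multiple = category_aliases([{'alias': 'bars', 'title': 'Bars'},
                             {'alias': 'thai', 'title': 'Thai'}])
print(single)
print(multiple)
```

With pandas, the same function could be applied to the column, for example `by_neighborhood['categories'].apply(category_aliases)`.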

In [104]:
pd.DataFrame(by_neighborhood.groupby(['neighborhood']).id.count().sort_values(ascending=False))

| neighborhood | id |
| --- | --- |
| Central District | 75 |
| Chinatown/International District | 72 |
| Columbia City | 55 |
| University District | 49 |
| Delridge | 32 |
| Beacon Hill | 23 |
| Rainier Beach | 18 |
| South Park | 17 |
| Little Saigon | 9 |
| Seward Park | 6 |


In [17]:
pd.DataFrame(by_cuisine.groupby(['cuisine']).id.count().sort_values(ascending=False))

| cuisine | id |
| --- | --- |
| burgers | 98 |
| breakfast_brunch | 97 |
| italian | 95 |
| delis | 91 |
| cafes | 85 |
| ... | ... |
| brazilian | 1 |
| catalan | 1 |
| polynesian | 1 |
| persian | 1 |
