# Project 4: New Light Technologies - Yelp Affluence Model
### Britt Allen, Bernard Kurka, Thomas Ludlow - NY-DSI-6

# Notebook 2: Yelp Fusion API Exploration - 1/2/19

Figure out how to pull `price` and supporting data directly from Yelp using *Fusion API*. 

This notebook contains the following functions designed to interact with the Yelp Fusion API to extract data required for development of our Affluence classification system.
 - `zip_query_to_df`
  - Pulls summary zip code data for input location and saves to Pandas dataframe.
 - `zip_query`
  - Pulls summary zip code data, limited to establishments physically located within zip
 - `query_to_df`
  - **Final API Function**
  - More general API pull function that returns output in dataframe, including:
   -'categories', 'alias', 'city', 'state', 'zip_code', 'price', 'review_count', 'latitude', 'longitude'
  - Used as main API function in `yelpaffluence_nyc` class

### Resources

GitHub: 
 - https://github.com/Yelp/yelp-python
 - https://github.com/gfairchild/yelpapi *(Best library)*
  - https://github.com/gfairchild/yelpapi/blob/master/examples/examples.py

Endpoint Documentation: https://www.yelp.com/developers/documentation/v3/business

Using regular search, a location-based query is formatted like this:
`https://www.yelp.com/search?find_loc=10128`

```
My App
Client ID
ea2TodAq4YX-4W3lzSJrcA

API Key
21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx
```

### Libraries

In [1]:
import numpy as np
import pandas as pd
import json
from yelpapi import YelpAPI


In [28]:
api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
yelp_api = YelpAPI(api_key, timeout_s=1.0)

`search_results = yelp_api.search_query(args)`

Search Query:
`response = yelp_api.search_query(term='ice cream', location='austin, tx', sort_by='rating', limit=5)`

In [13]:
response = yelp_api.search_query(location='10128')

In [None]:
response

In [18]:
len(response['businesses'])

20

In [37]:
response = yelp_api.search_query(location='dresher, PA', sort_by='review_count', limit=50)

In [None]:
response['businesses']

In [39]:
ydf = pd.DataFrame(response['businesses'])

In [40]:
ydf.head()

Unnamed: 0,alias,categories,coordinates,display_phone,distance,id,image_url,is_closed,location,name,phone,price,rating,review_count,transactions,url
0,mad-mex-willow-grove-willow-grove-2,"[{'alias': 'mexican', 'title': 'Mexican'}, {'a...","{'latitude': 40.147965, 'longitude': -75.1282592}",(267) 495-5000,3454.414192,rAe-1HU5Z-DuUXEbzASXDA,https://s3-media4.fl.yelpcdn.com/bphoto/Y5AKNV...,False,"{'address1': '2862 W Moreland Rd', 'address2':...",Mad Mex - Willow Grove,12674955000,$$,3.0,363,[delivery],https://www.yelp.com/biz/mad-mex-willow-grove-...
1,cantina-feliz-fort-washington,"[{'alias': 'mexican', 'title': 'Mexican'}]","{'latitude': 40.136374, 'longitude': -75.213808}",(215) 646-1320,3950.728609,w_B0phzyFXPmeiVlzQrq6A,https://s3-media2.fl.yelpcdn.com/bphoto/eCg8z6...,False,"{'address1': '424 S Bethlehem Pike', 'address2...",Cantina Feliz,12156461320,$$,4.0,358,[],https://www.yelp.com/biz/cantina-feliz-fort-wa...
2,ooka-restaurant-willow-grove,"[{'alias': 'japanese', 'title': 'Japanese'}, {...","{'latitude': 40.15676, 'longitude': -75.12175}",(215) 659-7688,4300.544616,2Q1R2OhBbAQ581vK_r7NhA,https://s3-media1.fl.yelpcdn.com/bphoto/vlL1qC...,False,"{'address1': '1109 Easton Rd', 'address2': '',...",Ooka Restaurant,12156597688,$$,4.0,333,[],https://www.yelp.com/biz/ooka-restaurant-willo...
3,kitchen-bar-abington,"[{'alias': 'tradamerican', 'title': 'American ...","{'latitude': 40.1249122619629, 'longitude': -7...",(215) 576-9766,4562.159885,JmzNw0WCPmZPZdq5nx9brg,https://s3-media2.fl.yelpcdn.com/bphoto/PTtRRJ...,False,"{'address1': '1482 Old York Rd', 'address2': '...",Kitchen Bar,12155769766,$$,3.0,324,"[restaurant_reservation, delivery]",https://www.yelp.com/biz/kitchen-bar-abington?...
4,the-cheesecake-factory-willow-grove,"[{'alias': 'newamerican', 'title': 'American (...","{'latitude': 40.1402781186111, 'longitude': -7...",(215) 659-0270,3589.628567,uP42QDUxC2lxz15BUDbFnA,https://s3-media4.fl.yelpcdn.com/bphoto/g4RnPr...,False,"{'address1': '2500 W Moreland Rd', 'address2':...",The Cheesecake Factory,12156590270,$$,3.0,290,[delivery],https://www.yelp.com/biz/the-cheesecake-factor...


In [175]:
ydf.shape

(50, 16)

In [43]:
ydf.loc[0]['categories']

[{'alias': 'mexican', 'title': 'Mexican'},
 {'alias': 'tex-mex', 'title': 'Tex-Mex'}]

In [None]:
ydf['location']

`YelpAPI.search_query` arguments:
- `location`
- `sort_by`
- `limit`
- `offset`
- `categories` *comma-separated list*

Category dictionary: https://www.yelp.com/developers/documentation/v3/all_category_list

In [46]:
with open('./categories.json', encoding='utf-8') as json_file:
    categories = json.load(json_file)

In [None]:
categories

In [52]:
type(categories)

list

### Parent Categories

In [53]:
parent_cats = []

for cat in categories:
    if cat['parents']==[]:
        print(cat['alias'])
        parent_cats.append(cat['alias'])

active
arts
auto
beautysvc
bicycles
education
eventservices
financialservices
food
health
homeservices
hotelstravel
localflavor
localservices
massmedia
nightlife
pets
professional
publicservicesgovt
religiousorgs
restaurants
shopping


In [54]:
len(parent_cats)

22

## Function to pull results from a ZIP code

In [153]:
def zip_query_to_df(zip_in, cats=['restaurants','shopping','nightlife',
                                  'hotelstravel','localservices','food',
                                  'eventservices','beautysvc','dryclean',
                                  'hair','bar']):
    
    api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
    api_obj = YelpAPI(api_key, timeout_s=3.0)
    
    zip_results = pd.DataFrame(index=cats, columns=['one','two','three','four'])
    #print(zip_results)
    
    for cat in cats:
        print(cat)
        response = api_obj.search_query(location=str(zip_in), categories=[cat], 
                                        sort_by='distance', limit=50)
        res_df = pd.DataFrame(response['businesses'])
        #print(res_df.columns)
        for i in range(len(res_df)):
            if res_df['location'][i]['zip_code'] != str(zip_in):
                res_df.iloc[i, 11] = np.NaN
        
        try:
#             print(res_df[res_df.price.str.strip()=='$'])
#             print(res_df[res_df.price.str.strip()=='$$'])
#             print(res_df[res_df.price.str.strip()=='$$$'])
#             print(res_df[res_df.price.str.strip()=='$$$$'])
            
            one_ct = res_df[res_df.price.str.strip()=='$'].shape[0]
            two_ct = res_df[res_df.price.str.strip()=='$$'].shape[0]
            three_ct = res_df[res_df.price.str.strip()=='$$$'].shape[0]
            four_ct = res_df[res_df.price.str.strip()=='$$$$'].shape[0]
            
            zip_results.loc[cat,'one'] = res_df[res_df.price.str.strip()=='$'].shape[0]
            zip_results.loc[cat,'two'] = res_df[res_df.price.str.strip()=='$$'].shape[0]
            zip_results.loc[cat,'three'] = res_df[res_df.price.str.strip()=='$$$'].shape[0]
            zip_results.loc[cat,'four'] = res_df[res_df.price.str.strip()=='$$$$'].shape[0]
        except:
            pass
        #if one_ct + two_ct + three_ct + four_ct < 40:
        print(cat, '2')
        res2 = api_obj.search_query(location=str(zip_in), categories=[cat], 
                                    sort_by='distance', offset=50, limit=50)
        res2_df = pd.DataFrame(res2['businesses'])
        for j in range(len(res2_df)):
            if res2_df['location'][j]['zip_code'] != str(zip_in):
                res2_df.iloc[j, 11] = np.NaN
        try:
            zip_results.loc[cat,'one'] += res2_df[res2_df.price.str.strip()=='$'].shape[0]
            zip_results.loc[cat,'two'] += res2_df[res2_df.price.str.strip()=='$$'].shape[0]
            zip_results.loc[cat,'three'] += res2_df[res2_df.price.str.strip()=='$$$'].shape[0]
            zip_results.loc[cat,'four'] += res2_df[res2_df.price.str.strip()=='$$$$'].shape[0]
        except:
            pass 

    return zip_results

In [149]:
zip_query_to_df('19025')

restaurants
restaurants 2
shopping
shopping 2
nightlife
nightlife 2
hotelstravel
hotelstravel 2
localservices
localservices 2
food
food 2
eventservices
eventservices 2
beautysvc
beautysvc 2
dryclean
dryclean 2
hair
hair 2
bar
bar 2


Unnamed: 0,one,two,three,four
restaurants,9,5,0,0
shopping,0,2,0,0
nightlife,0,0,0,0
hotelstravel,0,0,0,0
localservices,0,0,0,0
food,2,4,1,0
eventservices,0,0,0,0
beautysvc,1,2,0,0
dryclean,0,0,0,0
hair,0,1,0,0


In [150]:
zip_query_to_df('10128')

restaurants
restaurants 2
shopping
shopping 2
nightlife
nightlife 2
hotelstravel
hotelstravel 2
localservices
localservices 2
food
food 2
eventservices
eventservices 2
beautysvc
beautysvc 2
dryclean
dryclean 2
hair
hair 2
bar
bar 2


Unnamed: 0,one,two,three,four
restaurants,14,47,5,0
shopping,6,32,5,5
nightlife,6,18,2,0
hotelstravel,0,3,1,0
localservices,5,13,2,1
food,17,32,2,5
eventservices,1,10,1,1
beautysvc,13,35,7,3
dryclean,0,4,1,2
hair,4,20,1,1


In [119]:
df2.sum().sum()

8

In [208]:
def zip_query(zip_in, cats=['restaurants','shopping','localservices']):
    """Available arguments for zip_query: 
    zip_in: zip code (str)
    cats: categories (list) - default is ['restaurants','shopping','localservices']
    """
    
    api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
    api_obj = YelpAPI(api_key, timeout_s=3.0)
    
    zip_results = pd.DataFrame(index=cats, columns=['1','2','3','4','returned'])
    
    for cat in cats:
        in_zip = True
        
        offset_val = 0
        return_ct = 0
        
        while in_zip:
            response = api_obj.search_query(location=zip_in, categories=[cat], 
                                            sort_by='distance', offset=offset_val, 
                                            limit=50)
            cat_df = pd.DataFrame(response['businesses'])
            return_ct += cat_df.shape[0]
            
            for i in range(len(cat_df)):
                if cat_df['location'][i]['zip_code'] != zip_in:
                    cat_df.iloc[i, 11] = '0' # index location for 'price' key is 11
                    in_zip = False
            
            zip_results.loc[cat,'1'] = cat_df[cat_df.price.str.strip()=='$'].shape[0]
            zip_results.loc[cat,'2'] = cat_df[cat_df.price.str.strip()=='$$'].shape[0]
            zip_results.loc[cat,'3'] = cat_df[cat_df.price.str.strip()=='$$$'].shape[0]
            zip_results.loc[cat,'4'] = cat_df[cat_df.price.str.strip()=='$$$$'].shape[0]
            
            offset_val += 50
        zip_results.loc[cat,'returned'] = return_ct
    return zip_results
    

In [209]:
zip_query('10128')

Unnamed: 0,1,2,3,4,returned
restaurants,14,31,3,0,50
shopping,5,26,5,5,50
localservices,4,10,4,2,100


## Search query to dataframe

In [235]:
def query_to_df(loc_in, cat_in=['restaurants','shopping','localservices'], 
                sort_in='distance', limit_in=50, 
                cols=['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']):
    """Available arguments:
    loc_in (str): location (zip, city, neighborhood, etc.)
    cat_in (list): categories - default is ['restaurants','shopping','localservices']
    sort_in (str): sort criterion of 'distance','best_match','review_count' - default is 'distance'
    limit_in (int): number of results to pull per category, max is 50 - default is 50
    cols (list): columns for dataframe, matching API results key names - default is
    ['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']
    """
    api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
    api_obj = YelpAPI(api_key, timeout_s=3.0)
    
    output_df = pd.DataFrame(columns=['search_term']+cols)
    
    limit_list = []
    if limit_in > 50:
        req = limit_in
        while req > 50:
            limit_list.append(50)
            req -= 50
        limit_list.append(req)
    else:
        limit_list.append(limit_in)
    
    for cat in cat_in:
        cat_df = pd.DataFrame(columns=['search_term']+cols)
        for j, limit in enumerate(limit_list):
            response = api_obj.search_query(location=loc_in, categories=[cat], sort_by=sort_in, limit=limit, offset=(j*50))
            response_df = pd.DataFrame(response['businesses'])            
            iter_df = pd.DataFrame(columns=['search_term']+cols)
            iter_df['search_term'] = [cat for i in range(len(response_df))]

            for col_name in cols:
                if col_name == 'categories':
                    for k, cell in enumerate(response_df['categories']):
                        iter_cat_str = ''
                        for d in cell:
                            iter_cat_str += str(d['alias']+', ')
                        iter_df.loc[k, 'categories'] = iter_cat_str[:-2]                    
                elif col_name in ('city','state','zip_code'):
                    iter_df[col_name] = [response_df['location'][i][col_name] for i in range(response_df.shape[0])]
                elif col_name in ('latitude','longitude'):
                    iter_df[col_name] = [response_df['coordinates'][i][col_name] for i in range(response_df.shape[0])]
                else:
                    iter_df[col_name] = response_df[col_name]
            cat_df = cat_df.append(iter_df)
        output_df = output_df.append(cat_df)
    output_df.index = range(output_df.shape[0])
    
    return output_df


In [236]:
test_df = query_to_df('10128', limit_in=70, cat_in=['restaurants'])

In [215]:
test_df.shape

(70, 9)

In [200]:
zip_query('10128')

Unnamed: 0,1,2,3,4,returned
restaurants,14,31,3,0,50
shopping,5,26,5,5,50
localservices,4,10,4,2,100


In [206]:
test_df.groupby('search_term').price.value_counts()

search_term    price
localservices  $$        7
               $         4
               $$$       2
               $$$$      1
restaurants    $$       31
               $        14
               $$$       3
shopping       $$       27
               $         5
               $$$       5
               $$$$      5
Name: price, dtype: int64

In [216]:
test_df.groupby('search_term').price.value_counts()

search_term  price
restaurants  $$       47
             $        14
             $$$       5
Name: price, dtype: int64

In [218]:
test_df.groupby('search_term').zip_code.value_counts()

search_term  zip_code
restaurants  10128       69
             10028        1
Name: zip_code, dtype: int64

In [238]:
test_df.describe()

Unnamed: 0,latitude,longitude
count,70.0,70.0
mean,40.781166,-73.951206
std,0.00141,0.001578
min,40.779055,-73.954674
25%,40.78004,-73.952674
50%,40.780795,-73.950862
75%,40.782462,-73.950127
max,40.78412,-73.947887


In [239]:
test_df.describe(include=[np.object])

Unnamed: 0,search_term,categories,alias,city,state,zip_code,price,review_count
count,70,70,70,70,70,70,66,70
unique,1,57,70,2,1,2,3,61
top,restaurants,italian,the-filmore-delicatessen-new-york,New York,NY,10128,$$,139
freq,70,4,1,69,70,69,47,3
