# Yelp Fusion API - Training Data Pull 1/11/19

## Britt Allen, Bernard Kurka, Thomas Ludlow - NY-DSI-6

Figure out how to pull `price` and supporting data directly from Yelp using *Fusion API*.  

### Resources

GitHub: 
 - https://github.com/Yelp/yelp-python
 - https://github.com/gfairchild/yelpapi *(Best library)*
  - https://github.com/gfairchild/yelpapi/blob/master/examples/examples.py

Endpoint Documentation: https://www.yelp.com/developers/documentation/v3/business

Using regular search, a location-based query is formatted like this:
`https://www.yelp.com/search?find_loc=10128`

```
My App
Client ID
ea2TodAq4YX-4W3lzSJrcA

API Key
21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx
```

### Libraries

In [3]:
import numpy as np
import pandas as pd
import json
import time
from yelpapi import YelpAPI


## Search query to dataframe

In [4]:
def query_to_df(loc_in, cat_in=['restaurants','shopping','localservices'], 
                sort_in='distance', limit_in=50, 
                cols=['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']):
    """Available arguments:
    loc_in (str): location (zip, city, neighborhood, etc.)
    cat_in (list): categories - default is ['restaurants','shopping','localservices']
    sort_in (str): sort criterion of 'distance','best_match','review_count' - default is 'distance'
    limit_in (int): number of results to pull per category, max is 50 - default is 50
    cols (list): columns for dataframe, matching API results key names - default is
    ['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']
    """
    
    # Set Yelp Fusion API Key and establish API connection
    api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
    api_obj = YelpAPI(api_key, timeout_s=3.0)
    
    # Instantiate empty DataFrame with desired output columns
    output_df = pd.DataFrame(columns=['search_term']+cols)
    
    # Create iterable list of limit amounts <= 50 so that full limit argument is covered
    # ex. 70 -> [50,20]
    limit_list = []
    if limit_in > 50:
        req = limit_in  # req starts at limit argument and counts down by 50 until < 50
        while req > 50:
            limit_list.append(50)
            req -= 50
        limit_list.append(req)
    else:
        limit_list.append(limit_in) # if req < 50 append remaining amount to list
    
    # Loop through category argument list items
    for cat in cat_in:
        cat_df = pd.DataFrame(columns=['search_term']+cols) # Create empty DataFrame with addl col for category
        for j, limit in enumerate(limit_list): # Perform API pulls with all limits in limit_list
            
            # API call saved to json dict
            if cat=='none':
                response = api_obj.search_query(location=loc_in, sort_by=sort_in, limit=limit, offset=(j*50))
            else:
                response = api_obj.search_query(location=loc_in, categories=[cat], sort_by=sort_in, limit=limit, offset=(j*50))
            response_df = pd.DataFrame(response['businesses']) # Save business data to DataFrame
            
            # Create iteration DataFrame to process each API response (up to 50 results)
            iter_df = pd.DataFrame(columns=['search_term']+cols)
            iter_df['search_term'] = [cat for i in range(len(response_df))] # Add category value for each row

            # Iterate through each requested column argument and format for storage in output DataFrame
            for col_name in cols:
                # Convert list of categories into single comma-separated string
                if col_name == 'categories':
                    # Exception handling: not all responses include all categories
                    try:
                        for k, cell in enumerate(response_df['categories']):
                            iter_cat_str = ''
                            for d in cell:
                                iter_cat_str += str(d['alias']+', ')
                            iter_df.loc[k, 'categories'] = iter_cat_str[:-2] # Save final string, without final ', ' 
                    except:
                        pass
                elif col_name in ('city','state','zip_code'): # Access location data through 'location' key value
                    try:
                        iter_df[col_name] = [response_df['location'][i][col_name] for i in range(response_df.shape[0])]
                    except:
                        pass
                elif col_name in ('latitude','longitude'): # Access latitude/longitude through 'coordinates' key value
                    try:
                        iter_df[col_name] = [response_df['coordinates'][i][col_name] for i in range(response_df.shape[0])]
                    except:
                        pass
                else:
                    try:
                        iter_df[col_name] = response_df[col_name] # Anything else access directly
                    except:
                        pass
            cat_df = cat_df.append(iter_df)
        output_df = output_df.append(cat_df)
    output_df.index = range(output_df.shape[0])
    
    return output_df


In [55]:
test_df = query_to_df('10128', limit_in=70, cat_in=['restaurants'])

In [56]:
test_df.head()

Unnamed: 0,search_term,categories,alias,city,state,zip_code,price,review_count,latitude,longitude
0,restaurants,"catering, delis, grocery",3rd-avenue-garden-new-york,New York,NY,10128,$$,15,40.78193,-73.95194
1,restaurants,"wine_bars, southafrican, tapas",kaia-wine-bar-new-york,New York,NY,10128,$$,376,40.7819,-73.95197
2,restaurants,"japanese, korean",maroo-new-york,New York,NY,10128,$$,120,40.782476,-73.951333
3,restaurants,ramen,naruto-ramen-new-york,New York,NY,10128,$$,853,40.78117,-73.9525
4,restaurants,tradamerican,the-corner-restaurant-new-york,New York,NY,10128,$$$,13,40.78263,-73.95121


In [4]:
test_df.shape

(70, 10)

In [5]:
test_df.groupby('search_term').price.value_counts()

search_term  price
restaurants  $$       47
             $        14
             $$$       5
Name: price, dtype: int64

In [7]:
test_df.groupby('search_term').zip_code.value_counts()

search_term  zip_code
restaurants  10128       69
             10028        1
Name: zip_code, dtype: int64

## API Pull from List of ZIP codes and categories

In [17]:
zip_list = ['10128','19025']
cats = ['restaurants, shopping, localservices']

## RESET RESULTS DATAFRAME `api_data`

In [13]:
api_data = pd.DataFrame(columns=['zip','city','state','cat','pr_1','rv_1','pr_2','rv_2','pr_3','rv_3','pr_4','rv_4','avg_lat','avg_long'])


In [14]:
api_data.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long


In [5]:
def api_pull(zip_list, cats, sort='best_match', limit=50):
    column_list = ['zip','city','state','cat',
                   'pr_1','rv_1','pr_2','rv_2',
                   'pr_3','rv_3','pr_4','rv_4',
                   'avg_lat','avg_long']
    
    api_data = pd.DataFrame(columns=column_list)
    
    for z in zip_list:
        df = query_to_df(z, cats, limit_in=limit, sort_in=sort)
        
        for c in cats:
            loop_df = pd.Series(index=column_list)
            in_zip = df[df.zip_code==z]

            loop_df['zip'] = z
            try:
                loop_df['city'] = in_zip.city[0]
                loop_df['state'] = in_zip.state[0]
            except: 
                pass
            loop_df['cat'] = c
            
            in_cat = in_zip[in_zip.search_term==c]
            
            loop_df['pr_1'] = in_cat[df.price=='$'].shape[0]
            loop_df['rv_1'] = in_cat[df.price=='$'].review_count.sum()
            loop_df['pr_2'] = in_cat[df.price=='$$'].shape[0]
            loop_df['rv_2'] = in_cat[df.price=='$$'].review_count.sum()
            loop_df['pr_3'] = in_cat[df.price=='$$$'].shape[0]
            loop_df['rv_3'] = in_cat[df.price=='$$$'].review_count.sum()
            loop_df['pr_4'] = in_cat[df.price=='$$$$'].shape[0]
            loop_df['rv_4'] = in_cat[df.price=='$$$$'].review_count.sum()

            loop_df['avg_lat'] = in_cat.latitude.mean()
            loop_df['avg_long'] = in_cat.longitude.mean()

            api_data = api_data.append(loop_df, ignore_index=True)
    
    api_data.zip = api_data.zip.astype(str)    
    return api_data

In [46]:
new_test = api_pull(zip_list, cats, limit=100)



# Main Data Pull - Manhattan ZIPs, 100 best match, no category

In [6]:
nyc_zips = pd.read_csv('../nyc_zip.csv', header=None, names=['zip'], dtype={'zip':str})

In [7]:
nyc_zips.head()

Unnamed: 0,zip
0,10001
1,10002
2,10003
3,10004
4,10005


In [8]:
zips = nyc_zips.zip
cats = ['none']

In [11]:
yelp_manh_1 = api_pull(zips[:80], cats, limit=100)



In [12]:
yelp_manh_1.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001.0,New York,NY,none,8,9740,28,29095,4,5507,0,0,40.747709,-73.990216
1,10002.0,,,none,18,15201,39,43425,4,4740,0,0,40.719057,-73.989387
2,10003.0,New York,NY,none,16,19139,54,91952,6,8836,3,5125,40.730866,-73.988554
3,10004.0,New York,NY,none,12,1980,26,10361,6,2128,2,1153,40.704432,-74.011839
4,10005.0,,,none,7,1425,11,2268,1,89,1,177,40.706222,-74.008576


In [13]:
yelp_manh_1.shape

(80, 14)

In [14]:
yelp_manh_2 = api_pull(zips[80:], cats, limit=100)



In [15]:
yelp_manh_2.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10117.0,,,none,0,0,0,0,0,0,0,0,,
1,10118.0,,,none,0,0,0,0,0,0,0,0,,
2,10119.0,,,none,2,346,7,1807,0,0,0,0,40.751207,-73.992701
3,10120.0,,,none,0,0,0,0,0,0,0,0,,
4,10121.0,,,none,4,171,7,605,0,0,0,0,40.75013,-73.992074


In [16]:
yelp_manh_2.shape

(85, 14)

In [17]:
yelp_manh = yelp_manh_1.append(yelp_manh_2, ignore_index=True)

In [18]:
yelp_manh.reindex(axis=0)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001.0,New York,NY,none,8,9740,28,29095,4,5507,0,0,40.747709,-73.990216
1,10002.0,,,none,18,15201,39,43425,4,4740,0,0,40.719057,-73.989387
2,10003.0,New York,NY,none,16,19139,54,91952,6,8836,3,5125,40.730866,-73.988554
3,10004.0,New York,NY,none,12,1980,26,10361,6,2128,2,1153,40.704432,-74.011839
4,10005.0,,,none,7,1425,11,2268,1,89,1,177,40.706222,-74.008576
5,10006.0,New York,NY,none,16,1041,26,5108,4,389,0,0,40.708468,-74.013175
6,10007.0,New York,NY,none,7,1620,24,6247,5,1059,1,256,40.714133,-74.008538
7,10008.0,,,none,0,0,0,0,0,0,0,0,,
8,10009.0,,,none,8,8015,25,23741,5,4901,2,851,40.726357,-73.982569
9,10010.0,,,none,0,0,0,0,0,0,0,0,,


In [19]:
yelp_manh['city'] = 'New York'
yelp_manh['state'] = 'NY'

In [20]:
yelp_manh.zip = yelp_manh.zip.str.split('.', expand=True)[0]

In [21]:
yelp_manh.zip = yelp_manh.zip.map(lambda x: '0'+str(x) if int(x) <= 9999 else x)

In [22]:
yelp_manh.sample(40)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
7,10008,New York,NY,none,0,0,0,0,0,0,0,0,,
3,10004,New York,NY,none,12,1980,26,10361,6,2128,2,1153,40.704432,-74.011839
85,10122,New York,NY,none,0,0,0,0,0,0,0,0,40.751865,-73.991707
106,10158,New York,NY,none,0,0,0,0,0,0,0,0,,
50,10069,New York,NY,none,1,45,0,0,0,0,0,0,40.776627,-73.989094
111,10163,New York,NY,none,0,0,0,0,0,0,0,0,,
80,10117,New York,NY,none,0,0,0,0,0,0,0,0,,
57,10087,New York,NY,none,0,0,0,0,0,0,0,0,,
155,10276,New York,NY,none,0,0,0,0,0,0,0,0,,
15,10016,New York,NY,none,3,1299,36,26966,7,6431,1,897,40.745673,-73.981732


ZIP Code Database: https://www.unitedstateszipcodes.org/zip-code-database/

In [23]:
zip_db = pd.read_csv('../Data/zip_code_database.csv', dtype={'zip':str})

In [24]:
zip_db.head()

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


In [25]:
lat_map = {zip_db.zip[i]: zip_db.latitude[i] for i in range(zip_db.shape[0])}
long_map = {zip_db.zip[i]: zip_db.longitude[i] for i in range(zip_db.shape[0])}

In [37]:
yelp = yelp_manh[yelp_manh.pr_1 + yelp_manh.pr_2 + yelp_manh.pr_3 + yelp_manh.pr_4 > 0]

In [38]:
yelp.to_csv('../Data/yelp.csv', index=False)

# Main Data Pull - Manhattan ZIPs, 50 Best Match for Rest/Shop/Lcl Svc

In [4]:
nyc_zips = pd.read_csv('../nyc_zip.csv', header=None, names=['zip'], dtype={'zip':str})

In [5]:
nyc_zips.head()

Unnamed: 0,zip
0,10001
1,10002
2,10003
3,10004
4,10005


In [7]:
zips = nyc_zips.zip
cats = ['restaurants','shopping','localservices']

In [10]:
yelp_manh_1 = api_pull(zips[:80], cats)



In [11]:
yelp_manh_1.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001.0,New York,NY,restaurants,6,9124,15,16306,2,3570,0,0,40.7479,-73.990748
1,10001.0,New York,NY,shopping,1,311,22,5811,2,201,0,0,40.749071,-73.990807
2,10001.0,New York,NY,localservices,0,0,5,847,0,0,0,0,40.749225,-73.991876
3,10002.0,,,restaurants,7,7448,24,30513,3,4269,0,0,40.719261,-73.98943
4,10002.0,,,shopping,5,409,13,1163,6,437,1,135,40.718379,-73.990161


In [12]:
yelp_manh_1.shape

(240, 14)

In [13]:
yelp_manh_2 = api_pull(zips[80:], cats)



In [14]:
yelp_manh_2.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10117.0,,,restaurants,0,0,0,0,0,0,0,0,,
1,10117.0,,,shopping,0,0,0,0,0,0,0,0,,
2,10117.0,,,localservices,0,0,0,0,0,0,0,0,,
3,10118.0,,,restaurants,0,0,0,0,0,0,0,0,,
4,10118.0,,,shopping,0,0,0,0,0,0,0,0,,


In [15]:
yelp_manh_2.shape

(255, 14)

In [16]:
yelp_manh = yelp_manh_1.append(yelp_manh_2, ignore_index=True)

In [21]:
yelp_manh.reindex(axis=0)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001.0,New York,NY,restaurants,6,9124,15,16306,2,3570,0,0,40.747900,-73.990748
1,10001.0,New York,NY,shopping,1,311,22,5811,2,201,0,0,40.749071,-73.990807
2,10001.0,New York,NY,localservices,0,0,5,847,0,0,0,0,40.749225,-73.991876
3,10002.0,,,restaurants,7,7448,24,30513,3,4269,0,0,40.719261,-73.989430
4,10002.0,,,shopping,5,409,13,1163,6,437,1,135,40.718379,-73.990161
5,10002.0,,,localservices,3,261,5,209,0,0,0,0,40.718603,-73.988761
6,10003.0,New York,NY,restaurants,8,12100,29,66493,3,6702,2,4192,40.730844,-73.988139
7,10003.0,New York,NY,shopping,6,2392,22,3836,5,1147,0,0,40.731217,-73.988863
8,10003.0,New York,NY,localservices,3,341,2,830,0,0,0,0,40.731231,-73.987854
9,10004.0,New York,NY,restaurants,6,1242,14,7096,4,1626,2,1150,40.704391,-74.011904


In [23]:
yelp_manh['city'] = 'New York'
yelp_manh['state'] = 'NY'

In [24]:
yelp_manh.zip = yelp_manh.zip.str.split('.', expand=True)[0]

In [25]:
yelp_manh.zip = yelp_manh.zip.map(lambda x: '0'+str(x) if int(x) <= 9999 else x)

In [26]:
yelp_manh.sample(40)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
365,10173,New York,NY,localservices,0,0,0,0,0,0,0,0,,
479,10280,New York,NY,localservices,2,84,0,0,1,1,1,23,40.708535,-74.01729
214,10108,New York,NY,shopping,0,0,0,0,0,0,0,0,,
54,10019,New York,NY,restaurants,1,9081,19,27836,9,7599,6,9652,40.76415,-73.983546
189,10099,New York,NY,restaurants,0,0,0,0,0,0,0,0,,
399,10200,New York,NY,restaurants,0,0,0,0,0,0,0,0,,
143,10055,New York,NY,localservices,0,0,0,0,0,0,0,0,,
27,10010,New York,NY,restaurants,0,0,0,0,0,0,0,0,,
217,10109,New York,NY,shopping,0,0,0,0,0,0,0,0,,
403,10203,New York,NY,shopping,0,0,0,0,0,0,0,0,,


ZIP Code Database: https://www.unitedstateszipcodes.org/zip-code-database/

In [27]:
zip_db = pd.read_csv('../Data/zip_code_database.csv', dtype={'zip':str})

In [28]:
zip_db.head()

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


In [29]:
lat_map = {zip_db.zip[i]: zip_db.latitude[i] for i in range(zip_db.shape[0])}
long_map = {zip_db.zip[i]: zip_db.longitude[i] for i in range(zip_db.shape[0])}

In [30]:
yelp_manh['avg_lat'] = yelp_manh.zip.map(lambda x: lat_map[x])
yelp_manh['avg_long'] = yelp_manh.zip.map(lambda x: long_map[x])


In [31]:
yelp_manh.to_csv('../Data/yelp_manh.csv', index=False)

## Neighbor functions for Regression modeling

In [7]:
def get_neighbor_data(zip_in, cats=['restaurants, shopping, localservices']):
    column_list = ['zip','city','state','cat',
                   'pr_1','rv_1','pr_2','rv_2',
                   'pr_3','rv_3','pr_4','rv_4',
                   'avg_lat','avg_long']
    
    neighbors = query_to_df(zip_in, cats, sort_in='best_match', limit_in=50)
    n_zips = neighbors[neighbors.zip_code!=zip_in].zip_code.unique()
    print(zip_in)
    print(n_zips)
    print('')
    n_df = pd.DataFrame(columns=column_list)
    for z in n_zips:
        z_df = api_pull(z, cats, limit=50)
        n_df = n_df.append(z_df)

    return n_df

In [8]:
def pull_zip_w_neighbors(zip_in, cats=['restaurants', 'shopping', 'localservices'], limit_in=150):
    column_list = ['zip','city','state','cat',
                   'pr_1','rv_1','pr_2','rv_2',
                   'pr_3','rv_3','pr_4','rv_4',
                   'avg_lat','avg_long','n_pr_1',
                   'n_rv_1','n_pr_2','n_rv_2',
                   'n_pr_3','n_rv_3','n_pr_4','n_rv_4']
    
    df = pd.DataFrame(columns=column_list)
    for z in zip_in:
        target_df = api_pull(z, cats, limit_in)
        n_df = get_neighbor_data(z, cats)
        target_df['n_pr_1'] = n_df['pr_1'].sum()
        target_df['n_rv_1'] = n_df['rv_1'].sum()
        target_df['n_pr_2'] = n_df['pr_2'].sum()
        target_df['n_rv_2'] = n_df['rv_2'].sum()
        target_df['n_pr_3'] = n_df['pr_3'].sum()
        target_df['n_rv_3'] = n_df['rv_3'].sum()
        target_df['n_pr_4'] = n_df['pr_4'].sum()
        target_df['n_rv_4'] = n_df['rv_4'].sum()
        df = df.append(target_df)
    return df
        

In [26]:
def get_neighbor_list(zip_in, cats=['restaurants', 'shopping', 'localservices']):
    n_list = {}
    for z in zip_in:
        neighbors = query_to_df(z, cats, sort_in='best_match', limit_in=50)
        n_list[z] = neighbors[neighbors.zip_code!=z].zip_code.unique()
    return n_list

In [9]:
zip_list = ['10128','19025']
cats = ['restaurants','shopping']

In [10]:
pull_zip_w_neighbors(zip_list, cats)



10128
['10120' '10028' '10075' '10029']

19025
['19090' '19034' '19038' '19001' '19046' '19044' '19075' '19002' '19137']



Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,...,avg_lat,avg_long,n_pr_1,n_rv_1,n_pr_2,n_rv_2,n_pr_3,n_rv_3,n_pr_4,n_rv_4
0,1.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
1,0.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
2,1.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
3,2.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
4,8.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
0,1.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
1,9.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
2,0.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
3,2.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
4,5.0,,,"restaurants, shopping, localservices",0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0


In [18]:
nyc_zips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 1 columns):
zip    165 non-null object
dtypes: object(1)
memory usage: 1.4+ KB


In [21]:
%%time
nyc_api_40 = api_pull(nyc_zips.zip[0:40], cats=['restaurants', 'shopping', 'localservices'], limit=150)



CPU times: user 11 s, sys: 181 ms, total: 11.2 s
Wall time: 3min 50s


In [22]:
%%time
nyc_api_41_80 = api_pull(nyc_zips.zip[40:80], cats=['restaurants', 'shopping', 'localservices'], limit=150)



CPU times: user 9.54 s, sys: 175 ms, total: 9.71 s
Wall time: 3min 21s


In [23]:
%%time
nyc_api_81_120 = api_pull(nyc_zips.zip[80:120], cats=['restaurants', 'shopping', 'localservices'], limit=150)



CPU times: user 9.01 s, sys: 172 ms, total: 9.18 s
Wall time: 3min 39s


In [24]:
%%time
nyc_api_121_end = api_pull(nyc_zips.zip[120:], cats=['restaurants', 'shopping', 'localservices'], limit=150)



CPU times: user 10.7 s, sys: 199 ms, total: 10.9 s
Wall time: 4min 20s


In [27]:
nyc_df = nyc_api_40

In [29]:
nyc_df = nyc_df.append(nyc_api_41_80)

In [30]:
nyc_df = nyc_df.append(nyc_api_81_120)

In [31]:
nyc_df = nyc_df.append(nyc_api_121_end)

In [32]:
nyc_df

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001.0,New York,NY,restaurants,47,4032,47,7811,1,195,3,76,40.749689,-73.993514
1,10001.0,New York,NY,shopping,9,35,52,1212,15,271,10,91,40.749305,-73.993129
2,10001.0,New York,NY,localservices,1,9,16,876,2,11,1,12,40.749581,-73.992892
3,10002.0,New York,NY,restaurants,51,11313,65,22793,3,790,2,325,40.718464,-73.989811
4,10002.0,New York,NY,shopping,18,182,42,1036,19,436,11,214,40.718233,-73.990031
5,10002.0,New York,NY,localservices,12,339,13,339,3,62,0,0,40.717862,-73.989638
6,10003.0,New York,NY,restaurants,40,9453,71,46659,7,4198,2,257,40.731164,-73.988603
7,10003.0,New York,NY,shopping,14,2713,57,2931,25,1134,5,75,40.731148,-73.989332
8,10003.0,New York,NY,localservices,12,434,20,1086,12,296,1,9,40.731351,-73.989617
9,10004.0,New York,NY,restaurants,47,2869,42,8986,4,1627,3,1159,40.704193,-74.011952


# API Pulls

In [33]:
nyc_df.zip = nyc_df.zip.str.split('.', expand=True)[0]

In [34]:
nyc_df.zip = nyc_df.zip.map(lambda x: '0'+str(x) if int(x) <= 9999 else x)

In [35]:
nyc_df.state = 'NY'

In [36]:
nyc_df.city = 'New York'

In [37]:
nyc_df.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10001,New York,NY,restaurants,47,4032,47,7811,1,195,3,76,40.749689,-73.993514
1,10001,New York,NY,shopping,9,35,52,1212,15,271,10,91,40.749305,-73.993129
2,10001,New York,NY,localservices,1,9,16,876,2,11,1,12,40.749581,-73.992892
3,10002,New York,NY,restaurants,51,11313,65,22793,3,790,2,325,40.718464,-73.989811
4,10002,New York,NY,shopping,18,182,42,1036,19,436,11,214,40.718233,-73.990031


ZIP Code Database: https://www.unitedstateszipcodes.org/zip-code-database/

In [39]:
zip_db = pd.read_csv('../Data/zip_code_database.csv', dtype={'zip':str})

In [40]:
zip_db.head()

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


In [41]:
lat_map = {zip_db.zip[i]: zip_db.latitude[i] for i in range(zip_db.shape[0])}
long_map = {zip_db.zip[i]: zip_db.longitude[i] for i in range(zip_db.shape[0])}

In [45]:
lat_nulls = nyc_df[nyc_df.avg_lat.isnull()].index
long_nulls = nyc_df[nyc_df.avg_long.isnull()].index

In [47]:
nyc_df.to_csv('./nyc_df_orig_lat_long.csv', index=False)

In [49]:
nyc_df['avg_lat'] = nyc_df.zip.map(lambda x: lat_map[x])
nyc_df['avg_long'] = nyc_df.zip.map(lambda x: long_map[x])


In [50]:
nyc_df.sample(50)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
29,10126,New York,NY,localservices,0,0,0,0,0,0,0,0,40.71,-73.99
95,10032,New York,NY,localservices,6,49,5,33,3,26,0,0,40.84,-73.94
65,10153,New York,NY,localservices,0,0,0,0,1,982,0,0,40.76,-73.97
2,10117,New York,NY,localservices,0,0,0,0,0,0,0,0,40.71,-73.99
67,10023,New York,NY,shopping,7,54,61,1849,44,1386,1,10,40.78,-73.98
89,10270,New York,NY,localservices,0,0,0,0,0,0,0,0,40.71,-73.99
107,10112,New York,NY,localservices,0,0,1,14,0,0,0,0,40.76,-73.98
13,10121,New York,NY,shopping,0,0,1,5,0,0,0,0,40.71,-73.99
74,10156,New York,NY,localservices,0,0,0,0,0,0,0,0,40.71,-73.99
90,10107,New York,NY,restaurants,0,0,0,0,0,0,0,0,40.71,-73.99


In [51]:
nyc_df.isnull().sum()

zip         0
city        0
state       0
cat         0
pr_1        0
rv_1        0
pr_2        0
rv_2        0
pr_3        0
rv_3        0
pr_4        0
rv_4        0
avg_lat     0
avg_long    0
dtype: int64

In [52]:
nyc_df.to_csv('./nyc_df.csv', index=False)