# Yelp Fusion API - Training Data Pull 1/8/19

## Britt Allen, Bernard Kurka, Thomas Ludlow - NY-DSI-6

Figure out how to pull `price` and supporting data directly from Yelp using *Fusion API*.  

### Resources

GitHub: 
 - https://github.com/Yelp/yelp-python
 - https://github.com/gfairchild/yelpapi *(Best library)*
  - https://github.com/gfairchild/yelpapi/blob/master/examples/examples.py

Endpoint Documentation: https://www.yelp.com/developers/documentation/v3/business

Using regular search, a location-based query is formatted like this:
`https://www.yelp.com/search?find_loc=10128`

```
My App
Client ID
ea2TodAq4YX-4W3lzSJrcA

API Key
21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx
```

### Libraries

In [82]:
import numpy as np
import pandas as pd
import json
import time
from yelpapi import YelpAPI


## Search query to dataframe

In [54]:
def query_to_df(loc_in, cat_in=['restaurants','shopping','localservices'], 
                sort_in='distance', limit_in=50, 
                cols=['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']):
    """Available arguments:
    loc_in (str): location (zip, city, neighborhood, etc.)
    cat_in (list): categories - default is ['restaurants','shopping','localservices']
    sort_in (str): sort criterion of 'distance','best_match','review_count' - default is 'distance'
    limit_in (int): number of results to pull per category, max is 50 - default is 50
    cols (list): columns for dataframe, matching API results key names - default is
    ['categories','alias','city','state','zip_code','price','review_count','latitude','longitude']
    """
    
    # Set Yelp Fusion API Key and establish API connection
    api_key = '21Pt2l8__qgIdL0ZpgYC_yWblJ_O8_vJ3_-tIybHDyuQl9oVBXAzAXQWqMmIrz7idLyc7owv4-lfSON0QjKJN4pvQei4rUQAGSZcGcVTQc4HtBseUcztUPkVrAItXHYx'
    api_obj = YelpAPI(api_key, timeout_s=3.0)
    
    # Instantiate empty DataFrame with desired output columns
    output_df = pd.DataFrame(columns=['search_term']+cols)
    
    # Create iterable list of limit amounts <= 50 so that full limit argument is covered
    # ex. 70 -> [50,20]
    limit_list = []
    if limit_in > 50:
        req = limit_in  # req starts at limit argument and counts down by 50 until < 50
        while req > 50:
            limit_list.append(50)
            req -= 50
        limit_list.append(req)
    else:
        limit_list.append(limit_in) # if req < 50 append remaining amount to list
    
    # Loop through category argument list items
    for cat in cat_in:
        cat_df = pd.DataFrame(columns=['search_term']+cols) # Create empty DataFrame with addl col for category
        for j, limit in enumerate(limit_list): # Perform API pulls with all limits in limit_list
            
            # API call saved to json dict
            response = api_obj.search_query(location=loc_in, categories=[cat], sort_by=sort_in, limit=limit, offset=(j*50))
            response_df = pd.DataFrame(response['businesses']) # Save business data to DataFrame
            
            # Create iteration DataFrame to process each API response (up to 50 results)
            iter_df = pd.DataFrame(columns=['search_term']+cols)
            iter_df['search_term'] = [cat for i in range(len(response_df))] # Add category value for each row

            # Iterate through each requested column argument and format for storage in output DataFrame
            for col_name in cols:
                # Convert list of categories into single comma-separated string
                if col_name == 'categories':
                    # Exception handling: not all responses include all categories
                    try:
                        for k, cell in enumerate(response_df['categories']):
                            iter_cat_str = ''
                            for d in cell:
                                iter_cat_str += str(d['alias']+', ')
                            iter_df.loc[k, 'categories'] = iter_cat_str[:-2] # Save final string, without final ', ' 
                    except:
                        pass
                elif col_name in ('city','state','zip_code'): # Access location data through 'location' key value
                    try:
                        iter_df[col_name] = [response_df['location'][i][col_name] for i in range(response_df.shape[0])]
                    except:
                        pass
                elif col_name in ('latitude','longitude'): # Access latitude/longitude through 'coordinates' key value
                    try:
                        iter_df[col_name] = [response_df['coordinates'][i][col_name] for i in range(response_df.shape[0])]
                    except:
                        pass
                else:
                    try:
                        iter_df[col_name] = response_df[col_name] # Anything else access directly
                    except:
                        pass
            cat_df = cat_df.append(iter_df)
        output_df = output_df.append(cat_df)
    output_df.index = range(output_df.shape[0])
    
    return output_df


In [55]:
test_df = query_to_df('10128', limit_in=70, cat_in=['restaurants'])

In [56]:
test_df.head()

Unnamed: 0,search_term,categories,alias,city,state,zip_code,price,review_count,latitude,longitude
0,restaurants,"catering, delis, grocery",3rd-avenue-garden-new-york,New York,NY,10128,$$,15,40.78193,-73.95194
1,restaurants,"wine_bars, southafrican, tapas",kaia-wine-bar-new-york,New York,NY,10128,$$,376,40.7819,-73.95197
2,restaurants,"japanese, korean",maroo-new-york,New York,NY,10128,$$,120,40.782476,-73.951333
3,restaurants,ramen,naruto-ramen-new-york,New York,NY,10128,$$,853,40.78117,-73.9525
4,restaurants,tradamerican,the-corner-restaurant-new-york,New York,NY,10128,$$$,13,40.78263,-73.95121


In [4]:
test_df.shape

(70, 10)

In [5]:
test_df.groupby('search_term').price.value_counts()

search_term  price
restaurants  $$       47
             $        14
             $$$       5
Name: price, dtype: int64

In [7]:
test_df.groupby('search_term').zip_code.value_counts()

search_term  zip_code
restaurants  10128       69
             10028        1
Name: zip_code, dtype: int64

## API Pull from List of ZIP codes and categories

In [17]:
zip_list = ['10128','19025']
cats = ['restaurants','shopping','localservices']

## RESET RESULTS DATAFRAME `api_data`

In [13]:
api_data = pd.DataFrame(columns=['zip','city','state','cat','pr_1','rv_1','pr_2','rv_2','pr_3','rv_3','pr_4','rv_4','avg_lat','avg_long'])


In [14]:
api_data.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long


In [57]:
def api_pull(zip_list, cats, limit=50):
    column_list = ['zip','city','state','cat',
                   'pr_1','rv_1','pr_2','rv_2',
                   'pr_3','rv_3','pr_4','rv_4',
                   'avg_lat','avg_long']
    
    api_data = pd.DataFrame(columns=column_list)
    
    for z in zip_list:
        df = query_to_df(z, cats, limit_in=limit)
        
        for c in cats:
            loop_df = pd.Series(index=column_list)
            in_zip = df[df.zip_code==z]

            loop_df['zip'] = z
            try:
                loop_df['city'] = in_zip.city[0]
                loop_df['state'] = in_zip.state[0]
            except: 
                pass
            loop_df['cat'] = c
            
            in_cat = in_zip[in_zip.search_term==c]
            
            loop_df['pr_1'] = in_cat[df.price=='$'].shape[0]
            loop_df['rv_1'] = in_cat[df.price=='$'].review_count.sum()
            loop_df['pr_2'] = in_cat[df.price=='$$'].shape[0]
            loop_df['rv_2'] = in_cat[df.price=='$$'].review_count.sum()
            loop_df['pr_3'] = in_cat[df.price=='$$$'].shape[0]
            loop_df['rv_3'] = in_cat[df.price=='$$$'].review_count.sum()
            loop_df['pr_4'] = in_cat[df.price=='$$$$'].shape[0]
            loop_df['rv_4'] = in_cat[df.price=='$$$$'].review_count.sum()

            loop_df['avg_lat'] = in_cat.latitude.mean()
            loop_df['avg_long'] = in_cat.longitude.mean()

            api_data = api_data.append(loop_df, ignore_index=True)
    
    api_data.zip = api_data.zip.astype(str)    
    return api_data

In [46]:
new_test = api_pull(zip_list, cats, limit=100)



In [44]:
new_test

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,10128.0,New York,NY,restaurants,20,1694,58,15178,6,973,0,0,40.781171,-73.950907
1,10128.0,New York,NY,shopping,8,229,37,592,5,97,7,81,40.781718,-73.951206
2,10128.0,New York,NY,localservices,8,86,17,364,6,77,3,47,40.781682,-73.951285
3,19025.0,Dresher,PA,restaurants,9,176,5,354,0,0,0,0,40.139697,-75.167804
4,19025.0,Dresher,PA,shopping,0,0,2,11,0,0,0,0,40.145581,-75.168302
5,19025.0,Dresher,PA,localservices,0,0,0,0,0,0,0,0,40.141426,-75.168589


In [6]:
test_df.head()

Unnamed: 0,search_term,categories,alias,city,state,zip_code,price,review_count,latitude,longitude
0,restaurants,"catering, delis, grocery",3rd-avenue-garden-new-york,New York,NY,10128,$$,15,40.78193,-73.95194
1,restaurants,"wine_bars, southafrican, tapas",kaia-wine-bar-new-york,New York,NY,10128,$$,376,40.7819,-73.95197
2,restaurants,"japanese, korean",maroo-new-york,New York,NY,10128,$$,120,40.782476,-73.951333
3,restaurants,ramen,naruto-ramen-new-york,New York,NY,10128,$$,852,40.78117,-73.9525
4,restaurants,tradamerican,the-corner-restaurant-new-york,New York,NY,10128,$$$,13,40.78263,-73.95121


## List of Random ZIP codes

In [32]:
zips_to_test = pd.read_csv('./Data/random_1000_zips.csv')

In [33]:
zips_to_test.head()

Unnamed: 0,STATE,zipcode
0,AL,35016
1,AL,35071
2,AL,35210
3,AL,35674
4,AL,35677


In [34]:
zips_to_test.shape

(1000, 2)

In [37]:
zips_to_test.sample(50)

Unnamed: 0,STATE,zipcode
987,WI,54441
507,MT,59414
315,KY,41862
555,NJ,7927
43,CA,90062
299,KY,40068
445,MN,56216
0,AL,35016
953,WA,99344
280,IA,52755


In [36]:
zips_to_test['zipcode'] = zips_to_test.zipcode.map(lambda x: '0'+str(x) if x <= 9999 else x)

In [38]:
zips_to_test.zipcode = zips_to_test.zipcode.astype(str)

In [39]:
tl_zips = zips_to_test[500:]
tl_zips.shape

(500, 2)

In [40]:
tl_zips.head()

Unnamed: 0,STATE,zipcode
500,MO,65733
501,MO,65768
502,MO,65772
503,MT,59001
504,MT,59074


In [41]:
tl_zips.sample(50)

Unnamed: 0,STATE,zipcode
562,NJ,8081
949,WA,98828
736,PA,16037
526,NE,68958
930,WA,98031
833,TX,76011
554,NJ,7821
606,NY,12917
591,NY,12095
620,NY,13862


# API Pulls

In [49]:
cats = ['restaurants','shopping','localservices']

In [61]:
tl_zips.head()

Unnamed: 0,STATE,zipcode
500,MO,65733
501,MO,65768
502,MO,65772
503,MT,59001
504,MT,59074


In [62]:
first_20 = api_pull(tl_zips.zipcode[:20], cats, limit=100)



In [64]:
first_20.shape

(60, 14)

In [65]:
first_20

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,65733.0,Protem,MO,restaurants,0,0,0,0,0,0,0,0,36.501162,-92.804624
1,65733.0,Protem,MO,shopping,0,0,0,0,0,0,0,0,,
2,65733.0,Protem,MO,localservices,0,0,0,0,0,0,0,0,,
3,65768.0,,,restaurants,0,0,0,0,0,0,0,0,,
4,65768.0,,,shopping,0,0,0,0,0,0,0,0,,
5,65768.0,,,localservices,0,0,0,0,0,0,0,0,,
6,65772.0,Washburn,MO,restaurants,1,2,0,0,0,0,0,0,36.587369,-93.964885
7,65772.0,Washburn,MO,shopping,0,0,0,0,0,0,0,0,,
8,65772.0,Washburn,MO,localservices,0,0,0,0,0,0,0,0,,
9,59001.0,Absarokee,MT,restaurants,1,8,1,8,0,0,0,0,45.507814,-109.443925


In [66]:
tl_zip_df = first_20

In [67]:
second_80 = api_pull(tl_zips.zipcode[20:100], cats, limit=100)



In [69]:
second_80.shape[0]

240

In [70]:
second_80.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,68812.0,Amherst,NE,restaurants,1,4,0,0,0,0,0,0,40.83758,-99.27096
1,68812.0,Amherst,NE,shopping,0,0,0,0,0,0,0,0,,
2,68812.0,Amherst,NE,localservices,0,0,0,0,0,0,0,0,,
3,68850.0,Lexington,NE,restaurants,9,136,3,71,0,0,0,0,40.77036,-99.742391
4,68850.0,Lexington,NE,shopping,1,1,1,6,1,5,0,0,40.765793,-99.741132


In [71]:
tl_zip_df = tl_zip_df.append(second_80, ignore_index=True)
tl_zip_df.index = range(tl_zip_df.shape[0])

In [72]:
tl_zip_df

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,65733.0,Protem,MO,restaurants,0,0,0,0,0,0,0,0,36.501162,-92.804624
1,65733.0,Protem,MO,shopping,0,0,0,0,0,0,0,0,,
2,65733.0,Protem,MO,localservices,0,0,0,0,0,0,0,0,,
3,65768.0,,,restaurants,0,0,0,0,0,0,0,0,,
4,65768.0,,,shopping,0,0,0,0,0,0,0,0,,
5,65768.0,,,localservices,0,0,0,0,0,0,0,0,,
6,65772.0,Washburn,MO,restaurants,1,2,0,0,0,0,0,0,36.587369,-93.964885
7,65772.0,Washburn,MO,shopping,0,0,0,0,0,0,0,0,,
8,65772.0,Washburn,MO,localservices,0,0,0,0,0,0,0,0,,
9,59001.0,Absarokee,MT,restaurants,1,8,1,8,0,0,0,0,45.507814,-109.443925


In [73]:
first_20.to_csv('zips_500_519.csv', index=False)

In [74]:
tl_zip_100_199 = api_pull(tl_zips.zipcode[100:200], cats, limit=100)



In [75]:
tl_zip_100_199

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,12531.0,Holmes,NY,restaurants,1,11,0,0,0,0,0,0,41.518810,-73.696010
1,12531.0,Holmes,NY,shopping,0,0,0,0,0,0,0,0,41.522060,-73.688500
2,12531.0,Holmes,NY,localservices,0,0,0,0,0,0,0,0,41.527733,-73.646484
3,12563.0,Patterson,NY,restaurants,3,30,9,499,0,0,0,0,41.500319,-73.584300
4,12563.0,Patterson,NY,shopping,0,0,5,18,1,3,0,0,41.509317,-73.585370
5,12563.0,Patterson,NY,localservices,0,0,0,0,0,0,0,0,41.499751,-73.588353
6,12592.0,,,restaurants,0,0,0,0,0,0,0,0,,
7,12592.0,,,shopping,0,0,0,0,0,0,0,0,,
8,12592.0,,,localservices,0,0,0,0,0,0,0,0,,
9,12732.0,Eldred,NY,restaurants,0,0,2,15,0,0,0,0,41.527291,-74.885411


In [76]:
tl_zip_df = tl_zip_df.append(tl_zip_100_199, ignore_index=True)
tl_zip_df.index = range(tl_zip_df.shape[0])

In [77]:
tl_zip_df

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,65733.0,Protem,MO,restaurants,0,0,0,0,0,0,0,0,36.501162,-92.804624
1,65733.0,Protem,MO,shopping,0,0,0,0,0,0,0,0,,
2,65733.0,Protem,MO,localservices,0,0,0,0,0,0,0,0,,
3,65768.0,,,restaurants,0,0,0,0,0,0,0,0,,
4,65768.0,,,shopping,0,0,0,0,0,0,0,0,,
5,65768.0,,,localservices,0,0,0,0,0,0,0,0,,
6,65772.0,Washburn,MO,restaurants,1,2,0,0,0,0,0,0,36.587369,-93.964885
7,65772.0,Washburn,MO,shopping,0,0,0,0,0,0,0,0,,
8,65772.0,Washburn,MO,localservices,0,0,0,0,0,0,0,0,,
9,59001.0,Absarokee,MT,restaurants,1,8,1,8,0,0,0,0,45.507814,-109.443925


In [78]:
tl_zip_200_500 = api_pull(tl_zips.zipcode[200:], cats, limit=100)



In [79]:
tl_zip_200_500

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,73722.0,Byron,OK,restaurants,0,0,0,0,0,0,0,0,36.900241,-98.358069
1,73722.0,Byron,OK,shopping,0,0,0,0,0,0,0,0,,
2,73722.0,Byron,OK,localservices,0,0,0,0,0,0,0,0,,
3,73742.0,Hennessey,OK,restaurants,2,9,2,7,0,0,0,0,36.105977,-97.898997
4,73742.0,Hennessey,OK,shopping,0,0,0,0,0,0,0,0,36.051781,-97.899981
5,73742.0,Hennessey,OK,localservices,0,0,0,0,0,0,0,0,,
6,73835.0,,,restaurants,0,0,0,0,0,0,0,0,,
7,73835.0,,,shopping,0,0,0,0,0,0,0,0,,
8,73835.0,,,localservices,0,0,0,0,0,0,0,0,,
9,73841.0,Fort Supply,OK,restaurants,0,0,0,0,0,0,0,0,36.573359,-99.569699


In [80]:
tl_zip_df = tl_zip_df.append(tl_zip_200_500, ignore_index=True)
tl_zip_df.index = range(tl_zip_df.shape[0])

In [81]:
tl_zip_df.shape

(1500, 14)

In [148]:
%%time
first_500 = api_pull(zips_to_test.zipcode[:500], cats, limit=100)



CPU times: user 1min 38s, sys: 1.66 s, total: 1min 39s
Wall time: 35min 47s


In [149]:
first_500.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,35016.0,Arab,AL,restaurants,4,19,7,55,0,0,0,0,34.333496,-86.500985
1,35016.0,Arab,AL,shopping,2,2,1,1,0,0,0,0,34.332486,-86.5051
2,35016.0,Arab,AL,localservices,0,0,0,0,0,0,0,0,34.344083,-86.487672
3,35071.0,,,restaurants,11,99,10,275,0,0,0,0,33.650618,-86.818072
4,35071.0,,,shopping,1,8,12,33,1,4,0,0,33.653353,-86.816955


In [150]:
first_500.shape

(1500, 14)

In [151]:
tl_zip_df.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long,db_city
0,65733,Protem,MO,restaurants,0,0,0,0,0,0,0,0,36.501162,-92.804624,Protem
1,65733,Protem,MO,shopping,0,0,0,0,0,0,0,0,36.52,-92.85,Protem
2,65733,Protem,MO,localservices,0,0,0,0,0,0,0,0,36.52,-92.85,Protem
3,65768,Vanzant,MO,restaurants,0,0,0,0,0,0,0,0,36.96,-92.3,Vanzant
4,65768,Vanzant,MO,shopping,0,0,0,0,0,0,0,0,36.96,-92.3,Vanzant


In [89]:
tl_zip_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 14 columns):
zip         1500 non-null object
city        984 non-null object
state       984 non-null object
cat         1500 non-null object
pr_1        1500 non-null object
rv_1        1500 non-null object
pr_2        1500 non-null object
rv_2        1500 non-null object
pr_3        1500 non-null object
rv_3        1500 non-null object
pr_4        1500 non-null object
rv_4        1500 non-null object
avg_lat     945 non-null float64
avg_long    945 non-null float64
dtypes: float64(2), object(12)
memory usage: 164.1+ KB


In [155]:
st_ref_dict = {zips_to_test.zipcode[i]: zips_to_test.STATE[i] for i in range(1000)}

In [156]:
st_ref_dict

{'35016': 'AL',
 '35071': 'AL',
 '35210': 'AL',
 '35674': 'AL',
 '35677': 'AL',
 '35806': 'AL',
 '35972': 'AL',
 '36088': 'AL',
 '36093': 'AL',
 '36426': 'AL',
 '36441': 'AL',
 '36444': 'AL',
 '36529': 'AL',
 '36585': 'AL',
 '36605': 'AL',
 '36618': 'AL',
 '36784': 'AL',
 '36877': 'AL',
 '99578': 'AK',
 '99621': 'AK',
 '99826': 'AK',
 '85003': 'AZ',
 '85021': 'AZ',
 '85209': 'AZ',
 '85335': 'AZ',
 '86331': 'AZ',
 '86406': 'AZ',
 '86413': 'AZ',
 '71646': 'AR',
 '71837': 'AR',
 '71956': 'AR',
 '72068': 'AR',
 '72135': 'AR',
 '72415': 'AR',
 '72425': 'AR',
 '72450': 'AR',
 '72459': 'AR',
 '72569': 'AR',
 '72577': 'AR',
 '72633': 'AR',
 '72940': 'AR',
 '90039': 'CA',
 '90047': 'CA',
 '90062': 'CA',
 '91042': 'CA',
 '91325': 'CA',
 '91768': 'CA',
 '91790': 'CA',
 '91792': 'CA',
 '92056': 'CA',
 '92067': 'CA',
 '92121': 'CA',
 '92230': 'CA',
 '92242': 'CA',
 '92411': 'CA',
 '92532': 'CA',
 '92571': 'CA',
 '92704': 'CA',
 '92831': 'CA',
 '92865': 'CA',
 '93105': 'CA',
 '93117': 'CA',
 '93243'

In [104]:
tl_zip_df.zip = tl_zip_df.zip.str.split('.', expand=True)[0]

In [152]:
first_500.zip = first_500.zip.str.split('.', expand=True)[0]

In [107]:
tl_zip_df.zip = tl_zip_df.zip.map(lambda x: '0'+str(x) if int(x) <= 9999 else x)

In [153]:
first_500.zip = first_500.zip.map(lambda x: '0'+str(x) if int(x) <= 9999 else x)

In [108]:
tl_zip_df.state = tl_zip_df.zip.map(lambda x: st_ref_dict[x])

In [157]:
first_500.state = first_500.zip.map(lambda x: st_ref_dict[x])

In [109]:
tl_zip_df.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,65733,Protem,MO,restaurants,0,0,0,0,0,0,0,0,36.501162,-92.804624
1,65733,Protem,MO,shopping,0,0,0,0,0,0,0,0,,
2,65733,Protem,MO,localservices,0,0,0,0,0,0,0,0,,
3,65768,,MO,restaurants,0,0,0,0,0,0,0,0,,
4,65768,,MO,shopping,0,0,0,0,0,0,0,0,,


In [158]:
first_500.head()

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long
0,35016,Arab,AL,restaurants,4,19,7,55,0,0,0,0,34.333496,-86.500985
1,35016,Arab,AL,shopping,2,2,1,1,0,0,0,0,34.332486,-86.5051
2,35016,Arab,AL,localservices,0,0,0,0,0,0,0,0,34.344083,-86.487672
3,35071,,AL,restaurants,11,99,10,275,0,0,0,0,33.650618,-86.818072
4,35071,,AL,shopping,1,8,12,33,1,4,0,0,33.653353,-86.816955


ZIP Code Database: https://www.unitedstateszipcodes.org/zip-code-database/

In [113]:
zip_db = pd.read_csv('./zip_code_database.csv', dtype={'zip':str})

In [114]:
zip_db.head()

Unnamed: 0,zip,type,decommissioned,primary_city,acceptable_cities,unacceptable_cities,state,county,timezone,area_codes,world_region,country,latitude,longitude,irs_estimated_population_2015
0,501,UNIQUE,0,Holtsville,,I R S Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,562
1,544,UNIQUE,0,Holtsville,,Irs Service Center,NY,Suffolk County,America/New_York,631,,US,40.81,-73.04,0
2,601,STANDARD,0,Adjuntas,,"Colinas Del Gigante, Jard De Adjuntas, Urb San...",PR,Adjuntas Municipio,America/Puerto_Rico,787939,,US,18.16,-66.72,0
3,602,STANDARD,0,Aguada,,"Alts De Aguada, Bo Guaniquilla, Comunidad Las ...",PR,Aguada Municipio,America/Puerto_Rico,787939,,US,18.38,-67.18,0
4,603,STANDARD,0,Aguadilla,Ramey,"Bda Caban, Bda Esteves, Bo Borinquen, Bo Ceiba...",PR,Aguadilla Municipio,America/Puerto_Rico,787,,US,18.43,-67.15,0


In [120]:
city_ref_dict = {zip_db.zip[i]: zip_db.primary_city[i] for i in range(zip_db.shape[0])}

In [136]:
lat_map = {zip_db.zip[i]: zip_db.latitude[i] for i in range(zip_db.shape[0])}
long_map = {zip_db.zip[i]: zip_db.longitude[i] for i in range(zip_db.shape[0])}

In [121]:
tl_zip_df['db_city'] = tl_zip_df.zip.map(lambda x: city_ref_dict[x])


In [159]:
first_500['db_city'] = first_500.zip.map(lambda x: city_ref_dict[x])

In [126]:
tl_zip_df[tl_zip_df.city.notnull()&(tl_zip_df.city.str.lower().str.strip()!=tl_zip_df.db_city.str.lower().str.strip())]

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long,db_city
96,89412,Black Rock City,NV,restaurants,3,82,0,0,0,0,1,2,40.949669,-119.442716,Gerlach
97,89412,Black Rock City,NV,shopping,0,0,0,0,0,0,0,0,,,Gerlach
98,89412,Black Rock City,NV,localservices,0,0,0,0,0,0,0,0,40.652578,-119.352507,Gerlach
141,7078,Millburn,NJ,restaurants,4,173,4,454,1,61,0,0,40.721942,-74.326457,Short Hills
142,7078,Millburn,NJ,shopping,1,10,7,67,6,26,1,4,40.724981,-74.33261,Short Hills
143,7078,Millburn,NJ,localservices,0,0,1,9,0,0,0,0,40.724539,-74.326581,Short Hills
150,7444,Pequannock,NJ,restaurants,10,267,7,955,0,0,0,0,40.966394,-74.289717,Pompton Plains
151,7444,Pequannock,NJ,shopping,3,19,10,90,4,29,0,0,40.96918,-74.288886,Pompton Plains
152,7444,Pequannock,NJ,localservices,0,0,0,0,0,0,0,0,40.968186,-74.291872,Pompton Plains
192,8812,Green Brook Township,NJ,restaurants,21,674,21,1371,0,0,0,0,40.595225,-74.474226,Dunellen


In [130]:
tl_zip_df.loc[tl_zip_df[tl_zip_df.city.isnull()].index, 'city'] = tl_zip_df[tl_zip_df.city.isnull()]['db_city']

In [160]:
first_500.loc[first_500[first_500.city.isnull()].index, 'city'] = first_500[first_500.city.isnull()]['db_city']

In [143]:
tl_zip_df.loc[tl_zip_df[tl_zip_df.avg_lat.isnull()].index, 'avg_lat'] = tl_zip_df[tl_zip_df.avg_lat.isnull()].zip.map(lambda x: lat_map[x])
tl_zip_df.loc[tl_zip_df[tl_zip_df.avg_long.isnull()].index, 'avg_long'] = tl_zip_df[tl_zip_df.avg_long.isnull()].zip.map(lambda x: long_map[x])


In [161]:
first_500.loc[first_500[first_500.avg_lat.isnull()].index, 'avg_lat'] = first_500[first_500.avg_lat.isnull()].zip.map(lambda x: lat_map[x])
first_500.loc[first_500[first_500.avg_long.isnull()].index, 'avg_long'] = first_500[first_500.avg_long.isnull()].zip.map(lambda x: long_map[x])


In [144]:
tl_zip_df.sample(50)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long,db_city
607,73835,Camargo,OK,shopping,0,0,0,0,0,0,0,0,36.01,-99.28,Camargo
1424,26273,Huttonsville,WV,localservices,0,0,0,0,0,0,0,0,38.71,-79.97,Huttonsville
340,13673,Philadelphia,NY,shopping,0,0,0,0,0,0,0,0,44.15024,-75.70847,Philadelphia
193,8812,Green Brook Township,NJ,shopping,3,11,16,150,5,54,2,14,40.595055,-74.485794,Dunellen
895,57472,Selby,SD,shopping,0,0,0,0,0,0,0,0,45.5,-100.03,Selby
401,27258,Haw River,NC,localservices,0,0,0,0,0,0,0,0,36.033836,-79.352387,Haw River
778,17815,Bloomsburg,PA,shopping,3,34,4,10,3,13,4,16,41.007759,-76.454785,Bloomsburg
1027,76627,Blum,TX,shopping,0,0,0,0,1,14,0,0,32.14283,-97.397438,Blum
882,57317,Bonesteel,SD,restaurants,0,0,0,0,0,0,0,0,43.07,-98.94,Bonesteel
1251,24011,Roanoke,VA,restaurants,12,708,24,1913,2,230,1,85,37.271495,-79.940044,Roanoke


In [145]:
tl_zip_df.isnull().sum()

zip         0
city        0
state       0
cat         0
pr_1        0
rv_1        0
pr_2        0
rv_2        0
pr_3        0
rv_3        0
pr_4        0
rv_4        0
avg_lat     0
avg_long    0
db_city     0
dtype: int64

In [162]:
first_500.sample(50)

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long,db_city
1452,64660,Mendon,MO,restaurants,0,0,0,0,0,0,0,0,39.59,-93.13,Mendon
294,95988,Willows,CA,restaurants,10,336,6,606,0,0,0,0,39.524096,-122.212628,Willows
348,81501,Grand Junction,CO,restaurants,25,686,41,3962,3,376,0,0,39.073968,-108.557067,Grand Junction
405,32430,Clarksville,FL,restaurants,0,0,0,0,0,0,0,0,30.43,-85.18,Clarksville
673,46229,Indianapolis,IN,shopping,12,67,21,118,8,40,1,7,39.776632,-85.987275,Indianapolis
1272,49632,Falmouth,MI,restaurants,1,9,0,0,0,0,0,0,44.243549,-85.08503,Falmouth
193,93268,Taft,CA,shopping,1,4,2,9,1,5,0,0,35.144874,-119.462652,Taft
1309,55362,Monticello,MN,shopping,3,7,4,5,2,4,1,4,45.296198,-93.791848,Monticello
412,32445,Malone,FL,shopping,0,0,0,0,0,0,0,0,30.95,-85.16,Malone
619,62244,Fults,IL,shopping,0,0,1,2,0,0,0,0,38.22966,-90.18426,Fults


In [147]:
tl_zip_df[tl_zip_df.zip=='68144']

Unnamed: 0,zip,city,state,cat,pr_1,rv_1,pr_2,rv_2,pr_3,rv_3,pr_4,rv_4,avg_lat,avg_long,db_city
42,68144,Omaha,NE,restaurants,41,1077,24,1892,1,173,0,0,41.235192,-96.119771,Omaha
43,68144,Omaha,NE,shopping,9,101,47,288,8,74,3,10,41.231707,-96.122419,Omaha
44,68144,Omaha,NE,localservices,1,5,6,46,0,0,0,0,41.231717,-96.118715,Omaha


In [146]:
tl_zip_df.to_csv('api_data_501_1000.csv', index=False)

In [164]:
api_1000 = first_500.append(tl_zip_df, ignore_index=True)

In [165]:
api_1000.to_csv('api_1000.csv', index=False)