## New York City Data: Restaurants with Operating Licenses

### Here, we import, cleanse and form a Pandas dataframe ('manhattan_restaurants) for the further analysis of the New York City data on stree cafes in Manhattan with operating licenses. We use the stree cafe data because the City has not published in readily available form any data regarding restaurants in general.  In so doing, we implicitly assume that areas in Manhattan that experience rapid development of restaurants of the type envisioned in this project also experience rapid development of sidewalk cafes (usually if not always as an integral part of the restaurant).


In [1]:
import pandas as pd
import numpy as np
import requests
from pandas.io.json import json_normalize

In [6]:
df=pd.read_csv('/users/richardkornblith/NYCHR/Data_for_NYCHR/mansc_lic_csv.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1633 entries, 0 to 1632
Data columns (total 18 columns):
License Type                     1633 non-null object
License Expiration Date          1633 non-null object
License Status                   1633 non-null object
License Creation Date            1633 non-null object
Industry                         1633 non-null object
Business Name                    1633 non-null object
Address Building                 1633 non-null object
Address Street Name              1633 non-null object
Secondary Address Street Name    16 non-null object
Address City                     1633 non-null object
Address ZIP                      1633 non-null int64
Address Borough                  1633 non-null object
Community Board                  1609 non-null float64
Council District                 1609 non-null float64
Census Tract                     1586 non-null float64
Longitude                        1626 non-null float64
Latitude                    

In [7]:
# We clean up the building numbers to facilitate using the USCB Geocode API
df.iloc[1599,6] = '54'
df.iloc[1631,6] = '83'
df.iloc[1632,6] = '176'


#### We need to obtain any missing census tracts in df.  For this, we will use the geocoding API provided by the USCB.  For this, we first isolate the instances in 'df' having tracts to be found into a new dataframe, 'tract_tbf'.  We preserve the original index numbers to facilitate finalization.

In [13]:
tract_tbf = df[df['Census Tract'].isnull()]
tract_tbf.reset_index(drop=False, inplace=True)
tract_tbf.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 19 columns):
index                            47 non-null int64
License Type                     47 non-null object
License Expiration Date          47 non-null object
License Status                   47 non-null object
License Creation Date            47 non-null object
Industry                         47 non-null object
Business Name                    47 non-null object
Address Building                 47 non-null object
Address Street Name              47 non-null object
Secondary Address Street Name    1 non-null object
Address City                     47 non-null object
Address ZIP                      47 non-null int64
Address Borough                  47 non-null object
Community Board                  46 non-null float64
Council District                 46 non-null float64
Census Tract                     0 non-null float64
Longitude                        47 non-null float64
Latitude     


#### To find the missing census tracts, we use the USCB API for geocoding geographicals.  Documentation may be found at 'https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.pdf'.  For convenience, we will isolate the three columns of tract_tbf needed for this process.  We then iterate through slimmed_tbf to obtain the missing tracts and insert them into a cleansed dataframe 'manhattan_restaurants'.  


In [9]:
slimmed_tbf = tract_tbf[['index', 'Address Building', 'Address Street Name']]
slimmed_tbf.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 3 columns):
index                  47 non-null int64
Address Building       47 non-null object
Address Street Name    47 non-null object
dtypes: int64(1), object(2)
memory usage: 1.2+ KB


In [10]:
# This iteration takes some time: be patient!

manhattan_restaurants = df
for i in range(len(slimmed_tbf)):
    index = slimmed_tbf.loc[i,'index']
    Street = slimmed_tbf.loc[i,'Address Building']+' '+slimmed_tbf.loc[i,'Address Street Name']
    City = 'New York'
    State = 'NY'
    url="https://geocoding.geo.census.gov/geocoder/geographies/address?street={}\
        &city={}&state={}&benchmark=Public_AR_Census2010&vintage=Census2010_Census2010&\
        layers=14&format=json".format(Street, City, State)
    tract = requests.get(url).json()
    geog = tract['result']['addressMatches'][0]['geographies']['Census Blocks']
    census_tr = json_normalize(geog)
    tract_found = census_tr['TRACT']

In [12]:
print(manhattan_restaurants.info())
# manhattan_restaurants.tail(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1633 entries, 0 to 1632
Data columns (total 18 columns):
License Type                     1633 non-null object
License Expiration Date          1633 non-null object
License Status                   1633 non-null object
License Creation Date            1633 non-null object
Industry                         1633 non-null object
Business Name                    1633 non-null object
Address Building                 1633 non-null object
Address Street Name              1633 non-null object
Secondary Address Street Name    16 non-null object
Address City                     1633 non-null object
Address ZIP                      1633 non-null int64
Address Borough                  1633 non-null object
Community Board                  1609 non-null float64
Council District                 1609 non-null float64
Census Tract                     1586 non-null float64
Longitude                        1626 non-null float64
Latitude                    