## Locate and Optimize Food Service Delivery Area
### Data Science Week 2 Capstone Assignment

## Table of Contents
- Introduction
    - [Business Problem](#bus_prob)
    - [Data Requirements and Description](#data_req)
- Methodology
    - [Data Collection Approach](#data_col)
    - [Data Correction](#data_cor)
    - [Data Preparation](#data_cor)
    - [Data Understanding](#data_col)
    - [Modeling and Evaluation](#modeling)
- [Results](#results)
- [Discussion](#discussion)
- [Conclusion](#conclusion)

<a id='bus_prob'></a>
### Business Problem
Considering factors such as time of day, traffic patterns, clustering of food providers, and population density helps optimize the delivery area: efficient routes and well-chosen service coverage areas reduce driver cost and time and shorten customer delivery times.  
The audience for this data report is investors, existing or planned businesses, and logistics providers.
  
#### Data Science Questions
1. What are the key features in the datasets available from Foursquare and the New York City neighborhood data that would determine an optimal coverage area for a food delivery service?
2. How are the features that were identified used to optimize the service area?

<a id='data_req'></a>
### Data Requirements and Description
#### The following datasets are available:
#### NYC Neighborhood Data: 
This dataset covers the 5 boroughs and 306 neighborhoods of New York City. To segment and explore the neighborhoods, the analysis needs a dataset that lists each borough, the neighborhoods within it, and the latitude and longitude coordinates of each neighborhood.  
The NY dataset exists for free on the web and will be downloaded from this link: https://geo.nyu.edu/catalog/nyu_2451_34572
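
As a minimal sketch of how one record in that file maps to a table row, the snippet below parses a single GeoJSON feature. The sample feature is hand-written for illustration: it mirrors the `properties.borough`, `properties.name`, and `geometry.coordinates` fields read by the loading cell in this notebook, using Wakefield's coordinates. The helper name `feature_to_row` is hypothetical.

```python
import json

# Hand-written sample with the same shape as one entry of newyork_data['features'];
# GeoJSON stores point coordinates as [longitude, latitude].
sample_feature = json.loads("""
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
  "properties": {"name": "Wakefield", "borough": "Bronx"}
}
""")

def feature_to_row(feature):
    """Flatten one GeoJSON feature into the four columns used in the analysis."""
    lon, lat = feature["geometry"]["coordinates"]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }

row = feature_to_row(sample_feature)
print(row)
```

Unpacking the coordinate pair as `lon, lat` makes the GeoJSON ordering explicit, which is an easy place to swap the two values by accident.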
#### Foursquare Data: 
Location-based venue information, provided by the free tier of the Foursquare API. This data is combined with the NYC data, and clustering analytics are used to drive the analysis.
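
One way to assemble such a request without hard-coding credentials in the notebook is sketched below. It assumes the credentials live in environment variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (hypothetical names). The endpoint and query parameters are the ones used by the collection cells in this notebook, and the helper `build_explore_url` is illustrative only.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lon, category_id, client_id, client_secret,
                      version="20180724", radius=500, limit=100):
    """Build a venues/explore URL; urlencode escapes the parameter values."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Assumed variable names; fall back to placeholders so the sketch runs offline.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

url = build_explore_url(40.894705, -73.847201, "4d4b7105d754a06374d81259",
                        client_id, client_secret)
print(url.split("?")[0])  # print only the endpoint, not the credentials
```

Keeping the secret out of the source and out of printed output means the notebook can be shared without exposing the account.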

<a id='data_col'></a>
## Methodology
### Data Collection Approach

In [302]:
# Initialize the environment
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Set up the Foursquare credentials for API Collection
CLIENT_ID = 'KG4G5VV5QWE0LGO112E5YD4VR4FFXPTCQCXZANEYK5QY5BDO' # your Foursquare ID
CLIENT_SECRET = 'NW22DVEESJEY1CSQ2G4TNDQS4PD515WA2HKXRU201UQZI4C1' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Foursquare credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
print('Libraries imported.')

Foursquare credentials:
CLIENT_ID: KG4G5VV5QWE0LGO112E5YD4VR4FFXPTCQCXZANEYK5QY5BDO
CLIENT_SECRET:NW22DVEESJEY1CSQ2G4TNDQS4PD515WA2HKXRU201UQZI4C1
Libraries imported.


### Load the NY data into a DataFrame from the source URL

In [303]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [304]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

<a id='data_cor'></a>
### Data Correction

In [305]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# collect one row per neighborhood, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [306]:
neighborhoods

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
| 5 | Bronx | Kingsbridge | 40.881687 | -73.902818 |
| 6 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 7 | Bronx | Woodlawn | 40.898273 | -73.867315 |
| 8 | Bronx | Norwood | 40.877224 | -73.879391 |
| 9 | Bronx | Williamsbridge | 40.881039 | -73.857446 |


In [307]:
# Prime the Foursquare collection API
address = '102 North End Ave, New York, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

search_query = 'Italian'
radius = 500
print(search_query + ' .... OK!')

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
print(url)
# Foursquare API Call
results = requests.get(url).json()

40.7149555 -74.0153365
Italian .... OK!
https://api.foursquare.com/v2/venues/search?client_id=KG4G5VV5QWE0LGO112E5YD4VR4FFXPTCQCXZANEYK5QY5BDO&client_secret=NW22DVEESJEY1CSQ2G4TNDQS4PD515WA2HKXRU201UQZI4C1&ll=40.7149555,-74.0153365&v=20180604&query=Italian&radius=500&limit=30


In [308]:
# !pip install pyproj
import math
import pyproj

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    """Return the raw Foursquare 'explore' items near (lat, lon), or [] if the call fails."""
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        # Each item is a dict whose 'venue' entry holds the name, categories, and location
        return requests.get(url).json()['response']['groups'][0]['items']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        return []

# New York City lies in UTM zone 18N (EPSG:32618);
# always_xy=True keeps the (lon, lat) argument order
_to_xy = pyproj.Transformer.from_crs('EPSG:4326', 'EPSG:32618', always_xy=True)
_to_lonlat = pyproj.Transformer.from_crs('EPSG:32618', 'EPSG:4326', always_xy=True)

def lonlat_to_xy(lon, lat):
    """Project WGS84 longitude/latitude to UTM x/y in metres."""
    return _to_xy.transform(lon, lat)

def xy_to_lonlat(x, y):
    """Convert UTM x/y in metres back to WGS84 longitude/latitude."""
    return _to_lonlat.transform(x, y)

def calc_xy_distance(x1, y1, x2, y2):
    """Euclidean distance between two projected points, in metres."""
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
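
As a quick sanity check on distances, the great-circle (haversine) formula can be applied directly to latitude/longitude pairs without a projection. The sketch below is an illustration rather than part of the original pipeline: `haversine_m` is a hypothetical helper, and the Earth radius of 6,371 km is the usual mean-radius approximation. The two points are the Wakefield neighborhood center and a venue from the sample output in this notebook, for which Foursquare reported a distance of 136 m.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Wakefield center and Pitman Deli, taken from the tables in this notebook
d = haversine_m(40.894705, -73.847201, 40.894149, -73.845748)
print(round(d))
```

The result should land within a few metres of the reported value, which suggests the `distance` field is measured from the query point passed in `ll`.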

In [309]:
address = '102 North End Ave, New York, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
ny_center=[latitude,longitude]
print(latitude, longitude)

40.7149555 -74.0153365


In [310]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=ny_center, zoom_start=10)

In [311]:
# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

<a id='modeling'></a>
## Modeling and Evaluation

In [312]:
# Check the dataframe
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [313]:
# Data Validation
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


In [314]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

# Collect one row per venue, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
radius = 500
limit = 100
numNB = 15  # number of neighborhoods sampled
rows = []
for i in range(numNB):
    nb = neighborhoods.iloc[i]
    proximity_results = get_venues_near_location(
        nb['Latitude'], nb['Longitude'], food_category, CLIENT_ID, CLIENT_SECRET, radius, limit)
    for restaurant in proximity_results:
        venue = restaurant['venue']
        rows.append({'category': venue['categories'][0]['shortName'],
                     'distance': venue['location']['distance'],
                     'neighborhood': nb['Neighborhood'],
                     'neighborhood_latitude': nb['Latitude'],
                     'neighborhood_longitude': nb['Longitude'],
                     'venue_latitude': venue['location']['lat'],
                     'venue_longitude': venue['location']['lng'],
                     'venue_name': venue['name']})

# Keys are listed alphabetically so the column order matches the output shown below
venues = pd.DataFrame(rows)

In [315]:
topVenueCount=venues['neighborhood'].value_counts()
venues.head()

| | category | distance | neighborhood | neighborhood_latitude | neighborhood_longitude | venue_latitude | venue_longitude | venue_name |
|---|---|---|---|---|---|---|---|---|
| 0 | Caribbean | 479.0 | Wakefield | 40.894705 | -73.847201 | 40.898276 | -73.850381 | Cooler Runnings Jamaican Restaurant Inc |
| 1 | Donuts | 498.0 | Wakefield | 40.894705 | -73.847201 | 40.890459 | -73.849089 | Dunkin' |
| 2 | Sandwiches | 480.0 | Wakefield | 40.894705 | -73.847201 | 40.890656 | -73.849192 | SUBWAY |
| 3 | Food | 136.0 | Wakefield | 40.894705 | -73.847201 | 40.894149 | -73.845748 | Pitman Deli |
| 4 | Food Truck | 428.0 | Wakefield | 40.894705 | -73.847201 | 40.892293 | -73.84323 | Baychester Avenue Food Truck |


In [316]:
# Count venues per category (first five categories shown)
catGroups=venues.groupby(['category']).agg('count').head()
catGroups

| category | distance | neighborhood | neighborhood_latitude | neighborhood_longitude | venue_latitude | venue_longitude | venue_name |
|---|---|---|---|---|---|---|---|
| African | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| American | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| BBQ | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Bagels | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Bakery | 12 | 12 | 12 | 12 | 12 | 12 | 12 |


In [317]:
print('Total number of restaurants:', len(venues))
# print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 262
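
The commented-out average can be recovered from figures already in this notebook: 262 venues were returned for the 15 neighborhoods sampled (`numNB`). Below is a minimal sketch of that arithmetic, followed by the same count-per-group idea that `value_counts()` applies, shown on a few hand-written neighborhood labels. The labels in `sample` are made up for illustration and are not the collected data.

```python
from collections import Counter

total_venues = 262   # printed by the cell above
sampled = 15         # numNB in the collection loop
average = total_venues / sampled
print('Average venues per sampled neighborhood: {:.1f}'.format(average))

# Same idea as value_counts(): count venues per neighborhood, most common first.
sample = ['Wakefield', 'Norwood', 'Wakefield', 'Riverdale', 'Norwood', 'Wakefield']
counts = Counter(sample)
top_two = [name for name, _ in counts.most_common(2)]
print(top_two)
```

Dividing by the number of sampled neighborhoods includes any neighborhood that returned no venues; dividing by the number of distinct neighborhoods in `venues` would exclude them.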


In [318]:
# Map the restaurants in the sample; red marks venues more than 400 m from the neighborhood center

map_ny_res = folium.Map(location=[40.870185,-73.885512], zoom_start=13)
folium.Marker(ny_center, popup='NY').add_to(map_ny_res)
currentNeighborhood=''

for index,row in venues.sort_values(by='neighborhood').iterrows():
    lat = row['venue_latitude']; lon = row['venue_longitude']
    distance = row['distance']
    color = 'red' if distance>400 else 'blue'

    # Add one marker per neighborhood center the first time the neighborhood appears
    if currentNeighborhood != row['neighborhood']:
        currentNeighborhood = row['neighborhood']
        neighCenter=[row['neighborhood_latitude'],row['neighborhood_longitude']]
        folium.Marker(neighCenter, popup=currentNeighborhood).add_to(map_ny_res)
        
    folium.CircleMarker([lat, lon], 
                        radius=3, 
                        color=color, 
                        fill=True, 
                        fill_color=color, 
                        fill_opacity=1).add_to(map_ny_res)
map_ny_res


In [319]:
maxDistance = 400  # metres from the neighborhood center

# Keep venues in the three neighborhoods with the most venues
# (`venues` is assumed to be the intended source dataframe)
top_neighborhoods = topVenueCount.index[:3]
good_res_countdf = venues[venues['neighborhood'].isin(top_neighborhoods)]

# Of those, keep the venues farther than maxDistance from the neighborhood center
df_good_locations = good_res_countdf[good_res_countdf['distance'] > maxDistance]

# [latitude, longitude] pairs for the heatmap
restaurant_latlons = df_good_locations[['venue_latitude', 'venue_longitude']].values.tolist()


## Overlay a heatmap of the top neighborhoods to show order of magnitude by distance and venue quantity

In [320]:
from folium import plugins
from folium.plugins import HeatMap

good_latitudes = df_good_locations['venue_latitude'].values
good_longitudes = df_good_locations['venue_longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]
roi_center=[40.870185,-73.885512]
map_ny = folium.Map(location=[40.870185,-73.885512], zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_ny)
HeatMap(restaurant_latlons).add_to(map_ny)
folium.Marker(roi_center).add_to(map_ny)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_ny)
currentNeighborhood=""

for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_ny) 
map_ny

<a id='results'></a>
## Results
The heatmap gives a clear indication of zones with a low number of restaurants in the vicinity of the neighborhoods with the most venues.

The next step clusters those locations to create centers of zones containing good locations. Those zones, their centers, and their addresses are the final result of the analysis. 
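
To make the clustering step concrete, here is a minimal pure-Python sketch of Lloyd's k-means iteration on a handful of made-up points. It is not the scikit-learn implementation used in the notebook. The function `kmeans`, the sample points, and the deterministic initialization (the first `k` points as starting centers) are assumptions chosen to keep the example short and reproducible.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on 2-D points; starts from the first k points."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(range(k),
                          key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2)
            groups[nearest].append((x, y))
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append((sum(p[0] for p in group) / len(group),
                                    sum(p[1] for p in group) / len(group)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers

# Two made-up groups of (x, y) points, roughly around (0, 0) and (10, 10).
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, k=2)
print(centers)
```

Each returned center is the mean of the points assigned to it, which is why the centers tend to sit in the middle of dense groups of venues.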

In [321]:
from sklearn.cluster import KMeans
import pyproj
import math

# Cannot request more clusters than there are points
good_latlons = df_good_locations[['venue_latitude', 'venue_longitude']].values
number_of_clusters = min(15, len(good_latlons))

# Cluster directly on (latitude, longitude) in degrees, an approximation that is
# reasonable at neighborhood scale
kmeans = KMeans(n_clusters=number_of_clusters, n_init=10, random_state=0).fit(good_latlons)

# Centers come back as (lat, lon); store them as (lon, lat) for the loops below
cluster_centers = [(cc[1], cc[0]) for cc in kmeans.cluster_centers_]

map_ny = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_ny)
HeatMap(restaurant_latlons).add_to(map_ny)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_ny)
folium.Marker(roi_center).add_to(map_ny)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_ny) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_ny)
map_ny

<a id='discussion'></a>
## Discussion
The clusters group most of the candidate locations, and each cluster center sits near the middle of a zone containing top location candidates.

The addresses of those cluster centers are a good starting point for exploring the neighborhoods and finding the best possible location based on neighborhood specifics.
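
A minimal sketch of turning cluster centers into addresses follows. The lookup is passed in as a function so the example runs offline; `describe_centers` and `fake_reverse` are hypothetical names, and the stand-in lookup simply formats the coordinates. With geopy, a real lookup could be built on `Nominatim(user_agent=...).reverse`, which accepts a `(latitude, longitude)` pair and returns a location with an `address` attribute, or `None` when nothing is found. Treat that wiring as an assumption to verify against the geopy version in use.

```python
def describe_centers(centers_lonlat, reverse):
    """Map (lon, lat) cluster centers to address strings via a reverse-geocoding callable."""
    addresses = []
    for lon, lat in centers_lonlat:
        result = reverse(lat, lon)  # the lookup takes latitude first
        addresses.append(result if result else 'address not found')
    return addresses

def fake_reverse(lat, lon):
    """Offline stand-in for a geocoder; returns a label instead of a street address."""
    return 'near {:.4f}, {:.4f}'.format(lat, lon)

# Centers use (lon, lat) order, as in the notebook's cluster_centers list.
sample_centers = [(-73.8855, 40.8702), (-73.8472, 40.8947)]
for line in describe_centers(sample_centers, fake_reverse):
    print(line)
```

Public geocoding services are usually rate limited, so a real lookup would need to pause between requests.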

The same result without the heatmap, using shaded areas to indicate clusters:

In [322]:
map_ny = folium.Map(location=roi_center, zoom_start=14)
folium.Marker(roi_center).add_to(map_ny)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_ny)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_ny)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=False).add_to(map_ny) 
map_ny

<a id='conclusion'></a>
## Conclusion
This concludes the analysis. The clustering indicates high-density venue neighborhoods in close proximity, and the heat map and blue circles show the relative relationship between the sampled neighborhoods.

Building on this analysis, a broader city-wide review can provide the groundwork for prioritizing delivery services with maximum ROI potential.