# Capstone Project - Manhattan Food Delivery Site Selection

### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

This is a data-driven site selection analysis for food delivery start-ups in Manhattan. Its goal is to identify locations from which a delivery site can reach popular merchants quickly and with lower operational risk.

The analysis rests on three assumptions:

1) Customers prefer highly rated merchants.

2) A food delivery order cannot be canceled once the food has been picked up from the merchant.

3) The distance between the delivery site and the highly rated merchants is the dominant factor in operational risk.

Under these assumptions, we group the highly rated merchants and, for each group, look for the point with the smallest overall distance to the merchants in that group. Those points are the candidate locations for food delivery sites.
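One detail behind the assumptions above: the point that minimizes the sum of squared straight-line distances to a group is its centroid (the coordinate-wise mean), while the point that minimizes the plain sum of distances is the geometric median, which is usually nearby but not identical. The following is a minimal sketch on invented planar coordinates, not data from this project; the helper names are hypothetical.

```python
# Toy check: the centroid minimizes the sum of squared distances to a group.
# The coordinates below are invented for illustration only.

def centroid(points):
    """Coordinate-wise mean of a list of (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sum_sq_dist(site, points):
    """Sum of squared Euclidean distances from `site` to every point."""
    return sum((site[0] - x) ** 2 + (site[1] - y) ** 2 for x, y in points)

merchants = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 6.0)]
best = centroid(merchants)
best_cost = sum_sq_dist(best, merchants)

# Any small shift away from the centroid increases the cost.
shifted_costs = [
    sum_sq_dist((best[0] + dx, best[1] + dy), merchants)
    for dx, dy in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
]
print(best, best_cost, min(shifted_costs))
```

Every shifted candidate costs more than the centroid, which is why cluster centroids are a natural stand-in for "closest to everything in the group" when squared distance is the measure.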

## Data <a name="data"></a>

With that goal in mind, we need two kinds of data:

1) New York City neighborhood data

2) Location data for food, drink and grocery venues in Manhattan

We use the Foursquare location data API to collect the venue data, then apply centroid-based clustering to find the center of each group of venues. K-means works iteratively to assign each data point to one of K groups based on the features provided, so that points are clustered by feature similarity. The algorithm's output is the centroids of the K clusters, which we use as the best-fit positions for delivery sites.
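The iterative assignment described above can be sketched in a few lines of plain Python. This is a simplified illustration of the assign-then-update loop on invented points, not the scikit-learn implementation used later in the notebook; the function names and the toy data are assumptions.

```python
# Minimal k-means (Lloyd's algorithm) on invented 2-D points.
import random

def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2 + (point[1] - centers[i][1]) ** 2,
    )

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, g in enumerate(groups):
            if g:
                new_centers.append((sum(x for x, _ in g) / len(g),
                                    sum(y for _, y in g) / len(g)))
            else:
                new_centers.append(centers[i])  # leave an empty cluster's center in place
        if new_centers == centers:  # no center moved, so the loop has converged
            break
        centers = new_centers
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

On these two well-separated groups the loop settles on one center per group, each at the mean of its three points.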

### 1. Neighborhood Candidates

Before we get the data and start exploring it, let's install and import all the dependencies we will need.

In [1]:
!conda install -c conda-forge geopy --yes 

Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.3.9   |       hecc5488_0         146 KB  conda-forge
    certifi-2019.3.9           |           py36_0         149 KB  conda-forge
    conda-4.6.8                |           py36_0         876 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    geopy-1.18.1               |             py_0          51 KB  conda-forge
    openssl-1.1.1b             |       h14c3975_1         4.0 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         5.2 MB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::g

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.



In [13]:
!pip install shapely

Collecting shapely
  Downloading https://files.pythonhosted.org/packages/38/b6/b53f19062afd49bb5abd049aeed36f13bf8d57ef8f3fa07a5203531a0252/Shapely-1.6.4.post2-cp36-cp36m-manylinux1_x86_64.whl (1.5MB)
Installing collected packages: shapely
Successfully installed shapely-1.6.4.post2


In [3]:
!pip install pyproj



In [4]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

from geopy.geocoders import Nominatim 

import requests 
from pandas.io.json import json_normalize 

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium 

print('Libraries imported.')

Libraries imported.


We can get the New York City neighborhood dataset from https://geo.nyu.edu/catalog/nyu_2451_34572, or simply download it with a wget command.

In [5]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


Then we load the data and define a new variable that includes all the relevant data.

In [6]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

newyork_neighborhoods = newyork_data['features']

Next, we convert this data into a pandas dataframe.

In [7]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in newyork_neighborhoods:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

nyc_neighborhoods = pd.DataFrame(rows, columns=column_names)

nyc_neighborhoods.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


We need to make sure this dataframe contains all 5 boroughs and 306 neighborhoods.

In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(nyc_neighborhoods['Borough'].unique()),
        nyc_neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


We need to slice the original dataframe and create a new dataframe of the Manhattan data.

In [9]:
manhattan_data = nyc_neighborhoods[nyc_neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
print(manhattan_data.shape)
manhattan_data.head()

(40, 4)


|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


Let's get the geographical coordinates of Manhattan.

In [10]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7900869, -73.9598295.


Now, let's visualize Manhattan and its neighborhoods on a map.

In [11]:
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [14]:
import shapely.geometry

import pyproj

import math

def lonlat_to_xy(lon, lat):
    """Project WGS84 longitude/latitude (degrees) to UTM X/Y (metres)."""
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    """Inverse of lonlat_to_xy: UTM X/Y (metres) back to longitude/latitude."""
    proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

print('Coordinate transformation check')
print('-------------------------------')
print('Manhattan longitude={}, latitude={}'.format(longitude, latitude))
x, y = lonlat_to_xy(longitude, latitude)
print('Manhattan UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Manhattan longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Manhattan longitude=-73.9598295, latitude=40.7900869
Manhattan UTM X=-5809016.115084005, Y=9864005.670041664
Manhattan longitude=-73.95982949999961, latitude=40.7900868999989
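The round trip above confirms that the two functions invert each other, but it does not confirm that zone 33 suits Manhattan. UTM zones are 6 degrees wide and zone 33 is centred on 15°E (central Europe), so the X/Y values shown here lie far outside that zone and distances computed from them are likely to be distorted. The small dependency-free sketch below (helper names are hypothetical) derives the standard zone number from a longitude.

```python
import math

def utm_zone(longitude_deg):
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees.

    Ignores the special-case zones around Norway and Svalbard.
    """
    return int(math.floor((longitude_deg + 180.0) / 6.0)) % 60 + 1

def utm_central_meridian(zone):
    """Longitude of the central meridian of a UTM zone, in degrees."""
    return zone * 6 - 183

manhattan_lon = -73.9598295  # longitude printed in the check above
zone = utm_zone(manhattan_lon)
offset_from_zone_33 = abs(manhattan_lon - utm_central_meridian(33))

print('UTM zone for Manhattan:', zone)
print('Central meridian of that zone:', utm_central_meridian(zone))
print('Degrees between Manhattan and the zone 33 meridian:', offset_from_zone_33)
```

For Manhattan this yields zone 18, whose central meridian is 75°W. Passing `zone=18` to `pyproj.Proj` would be the corresponding change; the figures shown in this notebook were produced with zone 33.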


Let's now use the Google Maps Geocoding API to get the approximate address of a location.

In [15]:
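import os

def get_address(api_key, latitude, longitude, verbose=False):
    """Reverse-geocode a coordinate pair; return None if the lookup fails."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # Network failure, non-JSON body, or an empty/unexpected result set
        return None

# Credential redacted: supply your own key (the environment variable name is a placeholder)
google_api_key = os.environ.get('GOOGLE_API_KEY')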
addr = get_address(google_api_key, latitude, longitude)
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(latitude, longitude, addr))

Reverse geocoding check
-----------------------
Address of [40.7900869, -73.9598295] is: 97th St Transverse, New York, NY 10029, USA


### 2. Foursquare

Now that we have our neighborhood candidates, let's use the Foursquare API to get information on the food, drink and grocery stores in each neighborhood so that we can explore and segment them.

In [16]:
import os

# Credentials redacted: read them from environment variables (the variable names are placeholders)
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET')
VERSION = '20190305'
limit=100


food_category = '4d4b7105d754a06374d81259'


def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            food_category,
            radius, 
            limit)
            
       
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
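The request and the response handling in `getNearbyVenues` can be tried offline before spending API quota. The sketch below builds the same explore URL with `urllib.parse.urlencode` and walks a hand-written response of the shape the function indexes into. The credentials are placeholders, the sample response is heavily trimmed, and the second venue is invented to show one way of handling a venue that has no category.

```python
from urllib.parse import urlencode

# Placeholder credentials; real values would come from your own Foursquare account.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20190305",
    "ll": "40.876551,-73.91066",               # Marble Hill center from the Manhattan table
    "categoryId": "4d4b7105d754a06374d81259",  # food category used in this notebook
    "radius": 500,
    "limit": 100,
}
url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Hypothetical, trimmed response; the second venue is invented for illustration.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Arturo's",
               "location": {"lat": 40.874412, "lng": -73.910271},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "No Category Cafe",
               "location": {"lat": 40.875, "lng": -73.911},
               "categories": []}},
]}]}}

rows = []
for item in sample["response"]["groups"][0]["items"]:
    venue = item["venue"]
    categories = venue.get("categories") or []
    # Fall back to None instead of raising IndexError when a venue has no category.
    category = categories[0]["name"] if categories else None
    rows.append((venue["name"], venue["location"]["lat"], venue["location"]["lng"], category))

print(url)
print(rows)
```

Letting `urlencode` assemble the query string also takes care of escaping, for example the comma inside the `ll` value.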


In [17]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [18]:
print(manhattan_venues.shape)
manhattan_venues.head(20)

(2851, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place |
| 1 | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner |
| 2 | Marble Hill | 40.876551 | -73.91066 | Land & Sea Restaurant | 40.877885 | -73.905873 | Seafood Restaurant |
| 3 | Marble Hill | 40.876551 | -73.91066 | Dunkin' Donuts | 40.876993 | -73.906507 | Donut Shop |
| 4 | Marble Hill | 40.876551 | -73.91066 | Parrilla Latina | 40.877473 | -73.906073 | Steakhouse |
| 5 | Marble Hill | 40.876551 | -73.91066 | Boston Market | 40.87743 | -73.905412 | American Restaurant |
| 6 | Marble Hill | 40.876551 | -73.91066 | Subway Sandwiches | 40.874667 | -73.909586 | Sandwich Place |
| 7 | Marble Hill | 40.876551 | -73.91066 | SUBWAY | 40.878493 | -73.905385 | Sandwich Place |
| 8 | Marble Hill | 40.876551 | -73.91066 | Subway | 40.87772 | -73.90538 | Sandwich Place |
| 9 | Marble Hill | 40.876551 | -73.91066 | Hernandez Grocery | 40.875897 | -73.912591 | Deli / Bodega |


Let's find out how many unique categories the returned venues fall into.

In [19]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 123 unique categories.


In [20]:
df_venues=pd.DataFrame(manhattan_venues, columns=['Venue', 'Venue Longitude', 'Venue Latitude'])
df_venues.rename(columns={'Venue Longitude':'Longitude','Venue Latitude':'Latitude'}, inplace=True)
print(df_venues.shape)
df_venues.head(20)

(2851, 3)


|   | Venue | Longitude | Latitude |
|---|---|---|---|
| 0 | Arturo's | -73.910271 | 40.874412 |
| 1 | Tibbett Diner | -73.908937 | 40.880404 |
| 2 | Land & Sea Restaurant | -73.905873 | 40.877885 |
| 3 | Dunkin' Donuts | -73.906507 | 40.876993 |
| 4 | Parrilla Latina | -73.906073 | 40.877473 |
| 5 | Boston Market | -73.905412 | 40.87743 |
| 6 | Subway Sandwiches | -73.909586 | 40.874667 |
| 7 | SUBWAY | -73.905385 | 40.878493 |
| 8 | Subway | -73.90538 | 40.87772 |
| 9 | Hernandez Grocery | -73.912591 | 40.875897 |
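Because each query covers a 500 m radius around a neighborhood center, a venue that lies within range of two centers may be returned twice, which would give it double weight in the clustering. The sketch below shows one way to drop such repeats in plain Python; the duplicated row is made up for illustration. With pandas, `df_venues.drop_duplicates(subset=['Venue', 'Longitude', 'Latitude'])` would be the equivalent call.

```python
# Rows are (venue, longitude, latitude); the repeated "Arturo's" row is a made-up duplicate.
rows = [
    ("Arturo's", -73.910271, 40.874412),
    ("Tibbett Diner", -73.908937, 40.880404),
    ("Arturo's", -73.910271, 40.874412),
    ("Subway", -73.90538, 40.87772),
]

seen = set()
unique_rows = []
for row in rows:
    if row not in seen:  # identical name and coordinates count as one venue
        seen.add(row)
        unique_rows.append(row)

print(len(rows), len(unique_rows))
```

Whether repeats actually occur in the 2851 rows above is not checked in this notebook, so this step is a precaution rather than a confirmed fix.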


Let's take a look at these stores on a map.

In [46]:
venues=np.array(df_venues[['Longitude','Latitude']])

map_venues= folium.Map(location=[40.7900869, -73.9598295], zoom_start=12)
for lon, lat in venues:
    folium.Circle([lat, lon], radius=50, color='green', fill=False).add_to(map_venues)
map_venues

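Next we project each venue's longitude and latitude to Cartesian X/Y coordinates and store them next to the original columns. The resulting dataframe is called `df_onehot` in the code, although no one-hot encoding is involved.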

In [36]:
xy = df_venues.apply(lambda x:lonlat_to_xy(x['Longitude'],x['Latitude']), axis =1)
df=pd.DataFrame(xy, columns=['t'])
df_xy=pd.DataFrame(df['t'].tolist(), index=df.index)

df_onehot=pd.concat([df_venues, df_xy], axis=1)
df_onehot.columns=['Venue', 'Longitude', 'Latitude', 'X', 'Y']
df_onehot.head(20)

|   | Venue | Longitude | Latitude | X | Y |
|---|---|---|---|---|---|
| 0 | Arturo's | -73.910271 | 40.874412 | -5794565.0 | 9858039.0 |
| 1 | Tibbett Diner | -73.908937 | 40.880404 | -5793547.0 | 9857897.0 |
| 2 | Land & Sea Restaurant | -73.905873 | 40.877885 | -5793962.0 | 9857492.0 |
| 3 | Dunkin' Donuts | -73.906507 | 40.876993 | -5794115.0 | 9857569.0 |
| 4 | Parrilla Latina | -73.906073 | 40.877473 | -5794032.0 | 9857515.0 |
| 5 | Boston Market | -73.905412 | 40.87743 | -5794037.0 | 9857430.0 |
| 6 | Subway Sandwiches | -73.909586 | 40.874667 | -5794520.0 | 9857952.0 |
| 7 | SUBWAY | -73.905385 | 40.878493 | -5793857.0 | 9857432.0 |
| 8 | Subway | -73.90538 | 40.87772 | -5793988.0 | 9857427.0 |
| 9 | Hernandez Grocery | -73.912591 | 40.875897 | -5794323.0 | 9858344.0 |


We now have the food venues that Foursquare returned around each Manhattan neighborhood center, which this analysis treats as the highly rated merchants. This concludes the data gathering phase. The data is ready for the analysis that produces the report on optimal locations for food delivery sites.

## Methodology <a name="methodology"></a>

In this project we focus on finding the locations in Manhattan that are closest to the top-rated food stores.

First, we collected the required data: the location and category of every top-rated food store in Manhattan.

Then we create clusters of those locations (using k-means clustering) to identify general zones / neighborhoods / addresses that can serve as a starting point for the final 'street level' exploration and search for an optimal site location.

## Analysis <a name="analysis"></a>

Next we use K-means clustering to segment the stores. Manhattan is commonly divided into three parts (uptown, midtown and downtown), so we set k to 3.

In [23]:

from sklearn.cluster import KMeans


number_of_clusters = 3

good_xys=np.array(df_onehot[['X','Y']])
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]


print(cluster_centers)


[(-73.99777276603686, 40.7253024238906), (-73.94447316867348, 40.82876801991279), (-73.97163749562283, 40.76367885294065)]
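The choice of k = 3 rests on the uptown/midtown/downtown split. One common way to sanity-check such a choice is to compare the within-cluster sum of squared distances (the quantity scikit-learn's `KMeans` exposes as `inertia_`) across several values of k and look for the point where the improvement levels off. The following is a minimal sketch on invented points with hand-picked centers, not a result from this notebook.

```python
# Compare within-cluster sum of squares (inertia) for candidate center sets.
# Points and centers are invented; they are not the notebook's venues.

def inertia(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers)
    return total

# Three loose groups of three points each.
points = [(0, 0), (0, 2), (2, 0), (10, 0), (10, 2), (12, 0), (5, 10), (5, 12), (7, 10)]

# Hand-picked centers for k = 1, 2, 3 (roughly the means of the merged groups).
candidates = {
    1: [(5.67, 4.0)],
    2: [(5.67, 0.67), (5.67, 10.67)],
    3: [(0.67, 0.67), (10.67, 0.67), (5.67, 10.67)],
}

results = {k: inertia(points, centers) for k, centers in candidates.items()}
for k, value in results.items():
    print(k, round(value, 1))
```

With the real data, the same comparison could be made by fitting `KMeans` for a range of `n_clusters` values on `good_xys` and reading `inertia_` after each fit.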


In [31]:
df_centers=pd.DataFrame(cluster_centers, columns=['Longitude', 'Latitude'])

df_finals=pd.DataFrame(df_centers, columns=['Latitude', 'Longitude'])
df_finals.head()

|   | Latitude | Longitude |
|---|---|---|
| 0 | 40.725302 | -73.997773 |
| 1 | 40.828768 | -73.944473 |
| 2 | 40.763679 | -73.971637 |


These are the best-fit locations for food delivery sites. Next we look up their addresses and show them on the map.

In [49]:
print('==================================================================')
print('Addresses of positions of area recommended for food delivery sites')
print('==================================================================\n')
for lon, lat in cluster_centers:
    addr2 = get_address(google_api_key, lat, lon)   
    print('Address of [{}, {}] is: {}'.format(lat, lon, addr2))

Addresses of positions of area recommended for food delivery sites

Address of [40.7253024238906, -73.99777276603686] is: 166 Mercer St, New York, NY 10012, USA
Address of [40.82876801991279, -73.94447316867348] is: 1833 Amsterdam Ave, New York, NY 10031, USA
Address of [40.76367885294065, -73.97163749562283] is: 59th & Madison, New York, NY 10022, USA
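To gauge how far a recommended site is from a given venue, the great-circle distance can be computed directly from latitude and longitude, independent of any map projection. The sketch below uses the haversine formula with a mean Earth radius; the function name is hypothetical, and the two coordinate pairs are copied from the outputs above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of mean Earth radius."""
    r = 6371.0088
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Second cluster center (1833 Amsterdam Ave) and the first venue row (Arturo's).
center = (40.82876801991279, -73.94447316867348)
venue = (40.874412, -73.910271)
d = haversine_km(center[0], center[1], venue[0], venue[1])
print(round(d, 2), "km")
```

Averaging this distance over all venues assigned to a center would give a simple per-site summary of how close each recommended location is to the merchants it serves.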


In [47]:
map_finals= folium.Map(location=[40.7900869, -73.9598295], zoom_start=12)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=200, color='red', fill=True).add_to(map_finals)
map_finals

## Results and Discussion <a name="results"></a>

Our analysis clusters more than 2,800 top-rated food stores in Manhattan and finds the centers of these clusters. We first collect the location of every store and project it into X/Y coordinates (the `df_onehot` dataframe). We then use the K-means algorithm to cluster the stores. Following the three areas of Manhattan (uptown, midtown and downtown), we set K to 3 and obtain three cluster centers. These centers are the recommended best-fit positions for food delivery sites.

## Conclusion <a name="conclusion"></a>

This project rests on a few business assumptions and focuses on finding best-fit positions for start-ups' delivery sites. The final decision on optimal locations will be made by our start-up customers, based on the specific characteristics of the locations in each recommended zone and on additional factors such as the cost of each location, proximity to major roads and real estate availability.