# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Introduction: Business Problem</a>

2. <a href="#item2">Data</a>

3. <a href="#item3">Methodology</a>

4. <a href="#item4">Analysis</a>

5. <a href="#item5">Results and Discussions</a> 

6. <a href="#item6">Conclusion</a> 
</font>
</div>

<a id='item1'></a>

### 1. Introduction

#### 1.1. Background

New York City (NYC) is the most populous city in the US. With an estimated 2018 population of 8,398,748 distributed over about 302.6 square miles (784 km²), it is also the most densely populated major city in the United States. [\[1\]](https://en.wikipedia.org/wiki/New_York_City) Each year, more than 60 million visitors from all over the world come to New York, generating billions of dollars in overall economic impact. Its food culture is as diverse as its immigrant history. As of 2019, there were 27,043 restaurants in the city, up from 24,865 in 2017.[\[1\]](https://en.wikipedia.org/wiki/New_York_City) 

#### 1.2. Problem Description

As the figures above suggest, the restaurant business can be highly profitable but is also competitive. Starting a successful new restaurant in NYC requires thorough study and a smart strategy. This project develops a method that uses data analysis and machine learning to answer one question for NYC restaurant entrepreneurs: based on Foursquare location data, which neighborhoods are the preferred locations for starting a new restaurant?

To simplify the problem, I focus on pizza restaurants specifically. The same method could be applied to other restaurant types too.

<a id='item2'></a>

### 2. Data

Based on the description of our problem, the factors that will influence our decision include:

- the number of existing pizza restaurants in the neighborhood
- the number of existing direct competitor restaurants in the neighborhood
- the number of existing indirect competitor restaurants in the neighborhood
- the number of venues in other groups, such as sports venues and nightlife venues

The following data sources are needed to extract or generate the required information:

- a list of NYC neighborhood names compiled by [NYU](https://geo.nyu.edu/catalog/nyu_2451_34572)
- the number, type and location of venues in every neighborhood, obtained using the Foursquare API


#### 2.2. Data Preparation

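The cells below import the required libraries, download the neighborhood data from the NYU website, and retrieve the venue data through the Foursquare API.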

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!pip install geopy 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium
import folium # map rendering library

import urllib

from bs4 import BeautifulSoup

from selenium import webdriver

from lxml import html

import os
print('Libraries imported.')

Libraries imported.


#### Web-scrape data from the NYU website

In [19]:
url = 'https://geo.nyu.edu/catalog/nyu_2451_34572'

cwd = os.getcwd()
print(cwd)

preferences = {'download.default_directory': cwd}
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_experimental_option('prefs',preferences)

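# Selenium 4 takes the driver path through a Service object (the executable_path argument was removed)
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service('/Users/mr.x/Desktop/chromedriver'), options=options)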

browser.get(url)

/Users/mr.x/Dropbox/study/coursera/IBM_DS/Capstone/Coursera_Capstone/ny_restaurant


In [20]:
innerHTML = browser.page_source #returns the inner HTML as a string

soup = BeautifulSoup(innerHTML,'lxml')

soup.find('a',{'data-download-path':'/download/nyu-2451-34572?type=geojson'})

<a class="btn btn-primary btn-block download download-generated" data-download="trigger" data-download-id="nyu-2451-34572" data-download-path="/download/nyu-2451-34572?type=geojson" data-download-type="geojson" href="">Export</a>

In [21]:
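from selenium.webdriver.common.by import By  # Selenium 4 locator API (find_element_by_xpath was removed)

# set the implicit wait before locating elements so that it applies to both lookups
browser.implicitly_wait(10)
browser.find_element(By.XPATH, '//*[@id="sidebar"]/div[3]/ul/li[3]/div[2]/a').click()
browser.find_element(By.XPATH, '//*[@id="main-flashes"]/div/div/a').click()
browser.quit()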

In [22]:
with open('nyu-2451-34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)

In [23]:
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
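
Note that GeoJSON stores point coordinates as `[longitude, latitude]`, the reverse of the `(latitude, longitude)` order used later for Folium and Foursquare. A minimal sketch of pulling the needed fields out of one feature, using the record shown above (the helper name `feature_to_row` is illustrative and not part of the notebook):

```python
# One feature, copied from the output above and trimmed to the fields that are used.
feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

def feature_to_row(feature):
    """Hypothetical helper: flatten one GeoJSON point feature into a dict."""
    lon, lat = feature['geometry']['coordinates']  # GeoJSON order is [lon, lat]
    props = feature['properties']
    return {'Borough': props['borough'],
            'Neighborhood': props['name'],
            'Latitude': lat,
            'Longitude': lon}

row = feature_to_row(feature)
print(row)
```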

In [84]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

# take a look at the shape and the head of the dataset
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)
neighborhoods.head()

The dataframe has 5 boroughs and 306 neighborhoods.


| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [85]:
# some neighborhoods share a name but lie in different boroughs
dup_ngb = neighborhoods[neighborhoods.Neighborhood.duplicated()].Neighborhood.tolist()
dup_ngb

['Murray Hill', 'Sunnyside', 'Bay Terrace', 'Chelsea']
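
`duplicated()` flags only the second and later occurrences of a name, so the list above contains each repeated name once. To see every row that shares a name, together with its borough, `duplicated(keep=False)` can be used instead. A small sketch on made-up rows (the toy dataframe is an assumption for illustration, not the notebook's data):

```python
import pandas as pd

# Toy stand-in for the `neighborhoods` dataframe.
toy = pd.DataFrame({
    'Borough': ['Manhattan', 'Queens', 'Bronx', 'Staten Island', 'Manhattan'],
    'Neighborhood': ['Murray Hill', 'Murray Hill', 'Wakefield', 'Chelsea', 'Chelsea'],
})

# keep=False marks every member of a duplicated group, not just the repeats.
all_dups = toy[toy['Neighborhood'].duplicated(keep=False)]
print(all_dups.sort_values('Neighborhood'))

# The default (keep='first') marks only the later occurrences.
repeats = toy[toy['Neighborhood'].duplicated()]['Neighborhood'].tolist()
print(repeats)
```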

In [86]:
#find the geographical coordinates of New York City using Nominatim
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


#### Folium Map

In [87]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

#### Foursquare

In [89]:
# Foursquare credentials and API version.
# The credentials are read from environment variables so that secrets are neither stored
# nor printed in the notebook; the variable names used here are placeholders.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials loaded.')

Credentials loaded.


In [90]:
# define a function to retrieve venue information using FourSquare API
def getNearbyVenues(names, latitudes, longitudes, radius=500,LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name'],
                v['venue']['categories'][0]['id']) for v in results])

    # flatten the per-neighborhood lists into a single list of rows
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue ID']
    
    return(nearby_venues)
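
The list comprehension in `getNearbyVenues` assumes that every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. The sketch below shows one way to guard against that. The helper name `extract_rows` and the sample items are hypothetical; only the nesting of keys follows the code above.

```python
def extract_rows(name, lat, lng, items):
    """Hypothetical helper: turn Foursquare 'items' into flat tuples,
    skipping venues that have no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # no category to classify the venue by
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name'],
                     categories[0]['id']))
    return rows

# Made-up items with the same key layout as the response parsed above.
sample_items = [
    {'venue': {'name': 'Pizza A', 'location': {'lat': 40.1, 'lng': -73.9},
               'categories': [{'name': 'Pizza Place', 'id': 'cat-1'}]}},
    {'venue': {'name': 'Unlabelled', 'location': {'lat': 40.2, 'lng': -73.8},
               'categories': []}},
]

rows = extract_rows('Toy Neighborhood', 40.0, -74.0, sample_items)
print(rows)
```

Whether to skip such venues or keep them under a placeholder category is a design choice; skipping keeps the later category-based grouping simple.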

In [91]:
# retrieve venue information for all neighborhoods
ny_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )
print('Finished')

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [99]:
# take a look at the shape and head of the dataframe
print(ny_venues.shape)
print(len(ny_venues.Neighborhood.unique()))
ny_venues.head()


(9823, 8)
300


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue ID |
|---|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop | 4bf58dd8d48988d1d0941735 |
| 1 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop | 4bf58dd8d48988d1c9941735 |
| 2 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy | 4bf58dd8d48988d10f951735 |
| 3 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy | 4bf58dd8d48988d10f951735 |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop | 4bf58dd8d48988d148941735 |
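
Since the request uses `radius=500` (meters), each venue should lie within roughly 500 m of its neighborhood's coordinates. A quick sanity check with the haversine formula, using the first row of the table above (the function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions of this sketch):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000):
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Wakefield's coordinates and the venue 'Lollipops Gelato' from the table above.
dist = haversine_m(40.894705, -73.847201, 40.894123, -73.845892)
print(round(dist), 'm')
```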


<a id='item3'></a>

### 3. Methodology

In this project we direct our efforts toward identifying NYC neighborhoods with more opportunities, better transportation and lower competition for pizza restaurants. We can achieve this using the venue category and venue ID information obtained through the Foursquare API. 

In the first step, we collected the required **data: the location and type (category) of the venues in each neighborhood**. We also **noticed that some neighborhoods share a name but are located in different boroughs**.

In the second step, we first handle the 'duplicate' neighborhood names by appending the borough name to each of them. We then explore, find and count the '**total competition (both direct and indirect)**' in each neighborhood, as well as the '**total opportunities and total transportation venues**'. Specifically, we count venues in **7** further groups: **nightlife, residence, hotel, sports, school, theaterMuseum and transportation**. 

In the third and final step, we focus on the most promising neighborhoods. We use **k-means clustering** to group the neighborhoods into **3 clusters** based on the 10 venue counts we have put together (pizza places, direct competitors, indirect competitors and the 7 groups above). We then use a **Folium map** to show the clusters.

<a id='item4'></a>

### 4. Analysis

#### Handling 'duplicate' neighborhoods in different boroughs

In [100]:
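# rows for neighborhoods whose name occurs in more than one borough
df_dup_ngb = neighborhoods[neighborhoods.Neighborhood.isin(dup_ngb)]
# use the (latitude, longitude) pair as a key that identifies each neighborhood
coord_list = list(zip(df_dup_ngb.Latitude, df_dup_ngb.Longitude))
df_dup_ngb = df_dup_ngb.assign(**{'Neighborhood Coordinates': coord_list})
# make the names unique by appending the borough
df_dup_ngb['Neighborhood'] = df_dup_ngb['Neighborhood'] + '_' + df_dup_ngb['Borough']
df_dup_ngb = df_dup_ngb.drop(['Borough', 'Latitude', 'Longitude'], axis=1)
df_dup_ngb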

| | Neighborhood | Neighborhood Coordinates |
|---|---|---|
| 115 | Murray Hill_Manhattan | (40.748303077252174, -73.97833207924127) |
| 116 | Chelsea_Manhattan | (40.744034706747975, -74.00311633472813) |
| 140 | Sunnyside_Queens | (40.74017628351924, -73.92691617561577) |
| 175 | Bay Terrace_Queens | (40.782842806245554, -73.7768022262158) |
| 180 | Murray Hill_Queens | (40.764126122614066, -73.81276269135866) |
| 220 | Sunnyside_Staten Island | (40.61276015756489, -74.0971255217853) |
| 235 | Bay Terrace_Staten Island | (40.55398800858462, -74.13916622175768) |
| 244 | Chelsea_Staten Island | (40.59472602746295, -74.1895604551969) |


In [101]:
ny_venues['Neighborhood Coordinates'] = list(zip(ny_venues['Neighborhood Latitude'],ny_venues['Neighborhood Longitude']))
ny_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue ID | Neighborhood Coordinates |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop | 4bf58dd8d48988d1d0941735 | (40.89470517661, -73.84720052054902) |
| 1 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop | 4bf58dd8d48988d1c9941735 | (40.89470517661, -73.84720052054902) |
| 2 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy | 4bf58dd8d48988d10f951735 | (40.89470517661, -73.84720052054902) |
| 3 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy | 4bf58dd8d48988d10f951735 | (40.89470517661, -73.84720052054902) |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop | 4bf58dd8d48988d148941735 | (40.89470517661, -73.84720052054902) |


In [102]:
# attach the borough-qualified names by matching on the coordinate key
df_merge = ny_venues.merge(df_dup_ngb, how='left', on='Neighborhood Coordinates')

# keep the original name wherever no borough-qualified name was matched
df_merge['Neighborhood_y'] = df_merge['Neighborhood_y'].fillna(df_merge['Neighborhood_x'])

df_merge.drop(['Neighborhood_x', 'Neighborhood Coordinates'], axis=1, inplace=True)

df_merge.rename({'Neighborhood_y': 'Neighborhood'}, axis=1, inplace=True)

print(len(df_merge.Neighborhood.unique()))
# confirm that the duplicate neighborhoods have been renamed
df_merge[df_merge.Neighborhood == 'Murray Hill_Queens']

304


| | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue ID | Neighborhood |
|---|---|---|---|---|---|---|---|---|
| 7033 | 40.764126 | -73.812763 | Hahm Ji Bach - 함지박 | 40.763022 | -73.815042 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7034 | 40.764126 | -73.812763 | Coffee Factory | 40.763125 | -73.814341 | Coffee Shop | 4bf58dd8d48988d1e0931735 | Murray Hill_Queens |
| 7035 | 40.764126 | -73.812763 | Mapo BBQ | 40.762309 | -73.81488 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7036 | 40.764126 | -73.812763 | Kum Sung Chik Naengmyun | 40.763122 | -73.815091 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7037 | 40.764126 | -73.812763 | Geo Si Gi Restaurant | 40.764865 | -73.811983 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7038 | 40.764126 | -73.812763 | Northern Sushi | 40.764717 | -73.811235 | Japanese Restaurant | 4bf58dd8d48988d111941735 | Murray Hill_Queens |
| 7039 | 40.764126 | -73.812763 | NY Puppy Club | 40.765407 | -73.817102 | Pet Service | 5032897c91d4c4b30a586d69 | Murray Hill_Queens |
| 7040 | 40.764126 | -73.812763 | Mad For Chicken | 40.763426 | -73.807724 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7041 | 40.764126 | -73.812763 | SGD Tofu House | 40.762125 | -73.815532 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
| 7042 | 40.764126 | -73.812763 | Mr. Tofu | 40.764841 | -73.812266 | Korean Restaurant | 4bf58dd8d48988d113941735 | Murray Hill_Queens |
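
The renaming above relies on a left merge keyed on the coordinate tuple: rows that match a duplicated neighborhood receive the borough-qualified name, and all other rows keep their original name. A self-contained sketch of the same idea on toy data (all values below are invented for illustration):

```python
import pandas as pd

# Toy venue table: two different neighborhoods share the name 'Chelsea'.
venues = pd.DataFrame({
    'Neighborhood': ['Chelsea', 'Chelsea', 'Wakefield'],
    'Neighborhood Coordinates': [(40.74, -74.00), (40.59, -74.19), (40.89, -73.85)],
    'Venue': ['Cafe A', 'Deli B', 'Pizza C'],
})

# Lookup table with borough-qualified names for the duplicated neighborhoods only.
lookup = pd.DataFrame({
    'Neighborhood': ['Chelsea_Manhattan', 'Chelsea_Staten Island'],
    'Neighborhood Coordinates': [(40.74, -74.00), (40.59, -74.19)],
})

merged = venues.merge(lookup, how='left', on='Neighborhood Coordinates')
# Overlapping column names get the suffixes _x (left table) and _y (right table).
merged['Neighborhood'] = merged['Neighborhood_y'].fillna(merged['Neighborhood_x'])
merged = merged.drop(columns=['Neighborhood_x', 'Neighborhood_y'])
print(merged)
```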


In [104]:
#assign the merged result back to the ny_venues dataframe
ny_venues = df_merge

ny_venues.head()

| | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue ID | Neighborhood |
|---|---|---|---|---|---|---|---|---|
| 0 | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop | 4bf58dd8d48988d1d0941735 | Wakefield |
| 1 | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop | 4bf58dd8d48988d1c9941735 | Wakefield |
| 2 | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy | 4bf58dd8d48988d10f951735 | Wakefield |
| 3 | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy | 4bf58dd8d48988d10f951735 | Wakefield |
| 4 | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop | 4bf58dd8d48988d148941735 | Wakefield |


#### Explore and group venues into direct and indirect competitors

In [105]:
# take a look at all the venue categories
ny_venues['Venue Category'].unique()

array(['Dessert Shop', 'Ice Cream Shop', 'Pharmacy', 'Donut Shop',
       'Gas Station', 'Sandwich Place', 'Deli / Bodega', 'Laundromat',
       'Pizza Place', 'Discount Store', 'Mattress Store', 'Bagel Shop',
       'Grocery Store', 'Fast Food Restaurant', 'Restaurant',
       'Bus Station', 'Chinese Restaurant', 'Gift Shop',
       'Basketball Court', 'Park', 'Baseball Field',
       'Caribbean Restaurant', 'Diner', 'Seafood Restaurant',
       'Bowling Alley', 'Bus Stop', 'Food & Drink Shop', 'Platform',
       'Metro Station', 'Convenience Store', 'Juice Bar', 'Intersection',
       'Plaza', 'River', 'Bank', 'Food Truck', 'Home Service', 'Gym',
       'Playground', 'Gourmet Shop', 'Latin American Restaurant',
       'Burger Joint', 'Pub', 'Beer Bar', 'Warehouse Store',
       'Spanish Restaurant', 'Coffee Shop', 'Wings Joint',
       'Mexican Restaurant', 'Bar', 'Bakery', 'Trail', 'Supermarket',
       'Candy Store', 'Rental Car Location', 'Thrift / Vintage Store',
       'Breakfas

We need to consider all the venues that could affect the pizza business. First, let's consider the competitors of a pizza restaurant. We distinguish three kinds: other pizza places, direct competitors and indirect competitors. Direct competitors are substitutes for a pizza restaurant, such as fast food restaurants, which share similar characteristics like taste, service or price. Indirect competitors are restaurants whose food or service is differentiated from that of pizza restaurants.
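
To make the three groups concrete, the sketch below assigns a category name to one of them with simple string rules. It is a simplification under stated assumptions: the rules (`'Pizza Place'` is pizza; `'Diner'`, `'Fast Food Restaurant'` and any `'... Joint'` are direct; any other `'... Restaurant'` or `'... Place'` is indirect) and the function name `competitor_group` are illustrative, whereas the notebook builds its lists from the categories actually observed in the data.

```python
def competitor_group(category):
    """Illustrative rule-based grouping of a Foursquare category name."""
    if category == 'Pizza Place':
        return 'pizza'
    if category in ('Diner', 'Fast Food Restaurant') or category.endswith('Joint'):
        return 'direct'
    if category.endswith('Restaurant') or category.endswith('Place'):
        return 'indirect'
    return 'other'

for cat in ['Pizza Place', 'Burger Joint', 'Korean Restaurant', 'Sandwich Place', 'Pharmacy']:
    print(cat, '->', competitor_group(cat))
```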

In [106]:
#define a function that lists the venue categories whose name contains a given word
def findCategoryName(word):
    mask = ny_venues['Venue Category'].str.contains(word, na=False)
    # select the matching rows directly, so no NaN placeholder has to be removed afterwards
    return ny_venues.loc[mask, 'Venue Category'].unique().tolist()

In [108]:
#find all restaurants with 'Restaurant' in the category name
all_res = findCategoryName('Restaurant')
all_res

['Fast Food Restaurant',
 'Restaurant',
 'Chinese Restaurant',
 'Caribbean Restaurant',
 'Seafood Restaurant',
 'Latin American Restaurant',
 'Spanish Restaurant',
 'Mexican Restaurant',
 'American Restaurant',
 'Italian Restaurant',
 'Indian Restaurant',
 'Sushi Restaurant',
 'Thai Restaurant',
 'French Restaurant',
 'African Restaurant',
 'Greek Restaurant',
 'Paella Restaurant',
 'Asian Restaurant',
 'Peruvian Restaurant',
 'South American Restaurant',
 'South Indian Restaurant',
 'Middle Eastern Restaurant',
 'Arepa Restaurant',
 'Eastern European Restaurant',
 'Japanese Restaurant',
 'Southern / Soul Food Restaurant',
 'Comfort Food Restaurant',
 'Caucasian Restaurant',
 'Dim Sum Restaurant',
 'New American Restaurant',
 'Vietnamese Restaurant',
 'Mediterranean Restaurant',
 'Shabu-Shabu Restaurant',
 'Hotpot Restaurant',
 'Russian Restaurant',
 'Polish Restaurant',
 'Korean Restaurant',
 'Turkish Restaurant',
 'Cajun / Creole Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Ramen

In [109]:
#find all food stores with 'Joint' in the category name
all_joint = findCategoryName('Joint')

all_joint

['Burger Joint',
 'Wings Joint',
 'Fried Chicken Joint',
 'BBQ Joint',
 'Hot Dog Joint',
 'Mac & Cheese Joint']

In [110]:
#find all food stores with 'Place' in the category name
all_place = findCategoryName('Place')
all_place.remove('Pizza Place')
all_place.pop(-1)
all_place

['Sandwich Place',
 'Soup Place',
 'Taco Place',
 'Snack Place',
 'Salad Place',
 'Burrito Place',
 'Poke Place']

In [111]:
#list of direct competitors
dir_comp = ['Diner']
dir_comp.extend(all_joint)
dir_comp.append(all_res[0])
dir_comp

['Diner',
 'Burger Joint',
 'Wings Joint',
 'Fried Chicken Joint',
 'BBQ Joint',
 'Hot Dog Joint',
 'Mac & Cheese Joint',
 'Fast Food Restaurant']

In [112]:
#list of indirect competitors
indir_comp = []
indir_comp.extend(all_res[1:])
indir_comp.extend(all_place)
indir_comp

['Restaurant',
 'Chinese Restaurant',
 'Caribbean Restaurant',
 'Seafood Restaurant',
 'Latin American Restaurant',
 'Spanish Restaurant',
 'Mexican Restaurant',
 'American Restaurant',
 'Italian Restaurant',
 'Indian Restaurant',
 'Sushi Restaurant',
 'Thai Restaurant',
 'French Restaurant',
 'African Restaurant',
 'Greek Restaurant',
 'Paella Restaurant',
 'Asian Restaurant',
 'Peruvian Restaurant',
 'South American Restaurant',
 'South Indian Restaurant',
 'Middle Eastern Restaurant',
 'Arepa Restaurant',
 'Eastern European Restaurant',
 'Japanese Restaurant',
 'Southern / Soul Food Restaurant',
 'Comfort Food Restaurant',
 'Caucasian Restaurant',
 'Dim Sum Restaurant',
 'New American Restaurant',
 'Vietnamese Restaurant',
 'Mediterranean Restaurant',
 'Shabu-Shabu Restaurant',
 'Hotpot Restaurant',
 'Russian Restaurant',
 'Polish Restaurant',
 'Korean Restaurant',
 'Turkish Restaurant',
 'Cajun / Creole Restaurant',
 'Vegetarian / Vegan Restaurant',
 'Ramen Restaurant',
 'Tapas Res

In [113]:
#check whether each venue is a Pizza, direct or indirect competitor
ny_venues['DirectCompetitor'] = ny_venues['Venue Category'].isin(dir_comp)
ny_venues['IndirectCompetitor'] = ny_venues['Venue Category'].isin(indir_comp)
ny_venues['Pizza'] = ny_venues['Venue Category'].isin(['Pizza Place'])

In [128]:
# Foursquare category IDs for each venue group (the 'Venue ID' column holds each venue's primary category ID)
nightlife = ['4d4b7105d754a06376d81259','4bf58dd8d48988d116941735','56aa371ce4b08b9a8d57356c','4bf58dd8d48988d117941735',
            '4bf58dd8d48988d11e941735','4bf58dd8d48988d118941735','4bf58dd8d48988d1d8941735','4bf58dd8d48988d119941735',
            '4bf58dd8d48988d1d5941735','4bf58dd8d48988d120941735','4bf58dd8d48988d11b941735','4bf58dd8d48988d11c941735',
            '4bf58dd8d48988d11d941735','4bf58dd8d48988d122941735','4bf58dd8d48988d123941735','50327c8591d4c4b30a586d5d',
            '4bf58dd8d48988d121941735','53e510b7498ebcb1801b55d4','4bf58dd8d48988d11f941735','4bf58dd8d48988d11a941735',
            '4bf58dd8d48988d1d6941735']

residence = ['4e67e38e036454776db1fb3a','5032891291d4c4b30a586d68','4bf58dd8d48988d103941735','4f2a210c4b9023bd5841ed28',
            '4d954b06a243a5684965b473','52f2ab2ebcbc57f1066b8b55']

hotel = ['4bf58dd8d48988d1fa931735','4bf58dd8d48988d1f8931735','4f4530a74b9074f6e4fb0100','4bf58dd8d48988d1ee931735',
        '4bf58dd8d48988d132951735','5bae9231bedf3950379f89cb','4bf58dd8d48988d1fb931735','4bf58dd8d48988d12f951735',
        '56aa371be4b08b9a8d5734e1']

transportation = ['4bf58dd8d48988d1fc931735','4bf58dd8d48988d1fd931735','4f2a23984b9023bd5841ed2c','4e74f6cabd41c4836eac4c31',
                 '56aa371be4b08b9a8d57353e','52f2ab2ebcbc57f1066b8b53','4bf58dd8d48988d1ef941735','53fca564498e1a175f32528b',
                 '4bf58dd8d48988d130951735','4f4530164b9074f6e4fb00ff','4bf58dd8d48988d129951735','4f4531504b9074f6e4fb0102',
                 '4bf58dd8d48988d12a951735','54541b70498ea6ccd0204bff','4f04b25d2fb6e1c99f3db0c0','52f2ab2ebcbc57f1066b8b4f',
                 '4bf58dd8d48988d1fe931735','4bf58dd8d48988d12b951735','4e4c9077bd41f78e849722f9','4d4b7105d754a06379d81259']

school = ['4bf58dd8d48988d198941735','4bf58dd8d48988d199941735','4bf58dd8d48988d1a8941735','4bf58dd8d48988d1a6941735',
          '4bf58dd8d48988d1ae941735','4bf58dd8d48988d13b941735','58daa1558bbb0b01f18ec200','4bf58dd8d48988d13d941735',
         '4f04b10d2fb6e1c99f3db0be']

sports = ['4bf58dd8d48988d184941735','4bf58dd8d48988d18c941735','4bf58dd8d48988d18b941735','4e39a891bd410d7aed40cbc2',
         '4f4528bc4b90abdf24c9de85','4d4b7105d754a06377d81259','4bf58dd8d48988d1e1941735','52e81612bcbc57f1066b7a2b',
         '52e81612bcbc57f1066b7a2f','56aa371be4b08b9a8d57351a','4bf58dd8d48988d175941735','52f2ab2ebcbc57f1066b8b49',
         '52f2ab2ebcbc57f1066b8b47','503289d391d4c4b30a586d6a','4bf58dd8d48988d105941735','4bf58dd8d48988d176941735',
         '4bf58dd8d48988d101941735','4bf58dd8d48988d102941735','52e81612bcbc57f1066b7a2e','4e39a956bd410d7aed40cbc3',
         '4eb1bf013b7b6f98df247e07','52e81612bcbc57f1066b7a2d']

theaterMuseum = ['4bf58dd8d48988d17f941735','4bf58dd8d48988d17e941735','4bf58dd8d48988d181941735','4bf58dd8d48988d18f941735',
                   '4bf58dd8d48988d190941735','4bf58dd8d48988d1ac941735',]


In [131]:
ny_venues['nightlife'] = ny_venues['Venue ID'].isin(nightlife)
ny_venues['residence'] = ny_venues['Venue ID'].isin(residence)
ny_venues['hotel'] = ny_venues['Venue ID'].isin(hotel)
ny_venues['transportation'] = ny_venues['Venue ID'].isin(transportation)
ny_venues['school'] = ny_venues['Venue ID'].isin(school)
ny_venues['sports'] = ny_venues['Venue ID'].isin(sports)
ny_venues['theaterMuseum'] = ny_venues['Venue ID'].isin(theaterMuseum)
ny_venues.head()

| | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Venue ID | Neighborhood | DirectCompetitor | IndirectCompetitor | Pizza | nightlife | residence | hotel | transportation | school | sports | theaterMuseum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop | 4bf58dd8d48988d1d0941735 | Wakefield | False | False | False | False | False | False | False | False | False | False |
| 1 | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop | 4bf58dd8d48988d1c9941735 | Wakefield | False | False | False | False | False | False | False | False | False | False |
| 2 | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy | 4bf58dd8d48988d10f951735 | Wakefield | False | False | False | False | False | False | False | False | False | False |
| 3 | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy | 4bf58dd8d48988d10f951735 | Wakefield | False | False | False | False | False | False | False | False | False | False |
| 4 | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop | 4bf58dd8d48988d148941735 | Wakefield | False | False | False | False | False | False | False | False | False | False |


In [121]:
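def sumTotal(name):
    # count venues per neighborhood for each value (False/True) of the flag column
    ny_direct = ny_venues.groupby(['Neighborhood',name]).count()['Venue Category']
    # cumulative sum within a neighborhood: the False row holds n_False, the True row n_False + n_True
    ny_direct = ny_direct.groupby('Neighborhood').cumsum().reset_index(level=1)
    # weight False rows by -1 and True rows by +1, so that the sum below leaves n_True
    ny_direct[name] = np.where(ny_direct[name]==0,-1,1)
    df_dir = ny_direct[name]*ny_direct['Venue Category']
    # neighborhoods without any True row give a negative sum, which is set to 0
    ny_direct= df_dir.groupby('Neighborhood').sum().apply(lambda x: 0 if x<0 else x)
    ny_direct.rename('Num_of_{}'.format(name),inplace= True)
    return ny_direct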
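
Because each flag column is boolean, the number of flagged venues per neighborhood can also be obtained by summing the flag within each group, since `True` counts as 1 and `False` as 0. The sketch below shows this on toy data. It is offered as a possible simplification and assumes the goal is just the per-neighborhood count of `True` values; the name `count_flag` and the toy rows are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Venue Category': ['Pizza Place', 'Bar', 'Pizza Place', 'Gym', 'Park', 'Pizza Place'],
})
toy['Pizza'] = toy['Venue Category'].isin(['Pizza Place'])

def count_flag(df, name):
    """Count the rows with a True flag in each neighborhood."""
    counts = df.groupby('Neighborhood')[name].sum().astype(int)
    return counts.rename('Num_of_{}'.format(name))

ny_pizza_toy = count_flag(toy, 'Pizza')
print(ny_pizza_toy)
```

Neighborhoods without any flagged venue still appear in the result with a count of 0, so the series can be concatenated column-wise in the same way as the outputs of `sumTotal`.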

In [133]:
ny_direct = sumTotal('DirectCompetitor')
ny_indirect = sumTotal('IndirectCompetitor')
ny_pizza = sumTotal('Pizza')
ny_nightlife = sumTotal('nightlife')
ny_residence = sumTotal('residence')
ny_hotel = sumTotal('hotel')
ny_transportation = sumTotal('transportation')
ny_sports = sumTotal('sports')
ny_school = sumTotal('school')
ny_theaterMuseum = sumTotal('theaterMuseum')
#confirm there are 304 neighborhoods
print('There are {} neighborhoods in ny_indirect dataframe'.format(ny_indirect.shape[0]))
print('There are {} neighborhoods in ny_direct dataframe'.format(ny_direct.shape[0]))
print('There are {} neighborhoods in ny_pizza dataframe'.format(ny_pizza.shape[0]))
print('There are {} neighborhoods in ny_nightlife dataframe'.format(ny_nightlife.shape[0]))
print('There are {} neighborhoods in ny_residence dataframe'.format(ny_residence.shape[0]))
print('There are {} neighborhoods in ny_hotel dataframe'.format(ny_hotel.shape[0]))
print('There are {} neighborhoods in ny_transportation dataframe'.format(ny_transportation.shape[0]))
print('There are {} neighborhoods in ny_sports dataframe'.format(ny_sports.shape[0]))
print('There are {} neighborhoods in ny_school dataframe'.format(ny_school.shape[0]))
print('There are {} neighborhoods in ny_theaterMuseum dataframe'.format(ny_theaterMuseum.shape[0]))

There are 304 neighborhoods in ny_indirect dataframe
There are 304 neighborhoods in ny_direct dataframe
There are 304 neighborhoods in ny_pizza dataframe
There are 304 neighborhoods in ny_nightlife dataframe
There are 304 neighborhoods in ny_residence dataframe
There are 304 neighborhoods in ny_hotel dataframe
There are 304 neighborhoods in ny_transportation dataframe
There are 304 neighborhoods in ny_sports dataframe
There are 304 neighborhoods in ny_school dataframe
There are 304 neighborhoods in ny_theaterMuseum dataframe


In [196]:
df_ny = pd.concat([ny_direct,ny_indirect,ny_pizza,ny_nightlife,ny_residence,ny_hotel,ny_transportation,ny_sports
                  , ny_school,ny_theaterMuseum],axis=1).reset_index()
df_ny.head(10)

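| | Neighborhood | Num_of_DirectCompetitor | Num_of_IndirectCompetitor | Num_of_Pizza | Num_of_nightlife | Num_of_residence | Num_of_hotel | Num_of_transportation | Num_of_sports | Num_of_school | Num_of_theaterMuseum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allerton | 2 | 3 | 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | Annadale | 1 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Arden Heights | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | Arlington | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 4 | Arrochar | 0 | 6 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 0 |
| 5 | Arverne | 0 | 3 | 1 | 0 | 0 | 1 | 3 | 0 | 0 | 0 |
| 6 | Astoria | 1 | 40 | 1 | 13 | 0 | 0 | 0 | 5 | 0 | 0 |
| 7 | Astoria Heights | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 8 | Auburndale | 1 | 4 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 9 | Bath Beach | 4 | 18 | 2 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |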


#### Cluster neighborhoods into three clusters using k-means

In [197]:
# set number of clusters
kclusters = 3

df_ny_clustering = df_ny.drop('Neighborhood', axis=1)  # axis must be passed by keyword in pandas 2.x

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_ny_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 2, 1, 1, 0], dtype=int32)
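
k-means relies on Euclidean distance, so columns with large counts (for example, indirect competitors) can dominate the clustering. One common option, which is not used in this notebook, is to standardize the features first. A sketch on synthetic counts (the data and variable names are made up; `n_init=10` is set explicitly because the default differs between scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic venue counts: 30 neighborhoods x 3 features on very different scales.
X = np.column_stack([
    rng.poisson(20, size=30),   # e.g. indirect competitors
    rng.poisson(2, size=30),    # e.g. pizza places
    rng.poisson(1, size=30),    # e.g. hotels
])

# Standardize each column to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)

kmeans_scaled = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_scaled)
print(kmeans_scaled.labels_[:10])
print(kmeans_scaled.inertia_)
```

Whether scaling is appropriate depends on the question: unscaled counts let busy neighborhoods stand out by sheer volume, while scaled features weight every venue group equally.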

In [211]:
#add cluster labels
try:
    df_ny.insert(0, 'Cluster Labels', kmeans.labels_)
    
except ValueError:
    df_ny.drop('Cluster Labels', axis=1, inplace=True)
    df_ny.insert(0, 'Cluster Labels', kmeans.labels_)
    
df_ny.head(10)

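| | Cluster Labels | Neighborhood | Num_of_DirectCompetitor | Num_of_IndirectCompetitor | Num_of_Pizza | Num_of_nightlife | Num_of_residence | Num_of_hotel | Num_of_transportation | Num_of_sports | Num_of_school | Num_of_theaterMuseum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allerton | 2 | 3 | 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | Annadale | 1 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | Arden Heights | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | Arlington | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| 4 | 1 | Arrochar | 0 | 6 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 0 |
| 5 | 1 | Arverne | 0 | 3 | 1 | 0 | 0 | 1 | 3 | 0 | 0 | 0 |
| 6 | 2 | Astoria | 1 | 40 | 1 | 13 | 0 | 0 | 0 | 5 | 0 | 0 |
| 7 | 1 | Astoria Heights | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 8 | 1 | Auburndale | 1 | 4 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 9 | 0 | Bath Beach | 4 | 18 | 2 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |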


In [200]:
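# positions of neighborhoods whose name appears in more than one borough
# (assumes `neighborhoods` has a default 0..n-1 index, so labels equal positions)
lt = neighborhoods[neighborhoods['Neighborhood'].isin(dup_ngb)].index.tolist()
df_ngb = neighborhoods.copy()
for i in lt:
    # append the borough (column 0) to the neighborhood name (column 1)
    df_ngb.iloc[i,1] = df_ngb.iloc[i,1]+'_'+df_ngb.iloc[i,0]
df_ngb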

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
5,Bronx,Kingsbridge,40.881687,-73.902818
6,Manhattan,Marble Hill,40.876551,-73.91066
7,Bronx,Woodlawn,40.898273,-73.867315
8,Bronx,Norwood,40.877224,-73.879391
9,Bronx,Williamsbridge,40.881039,-73.857446
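
The renaming loop above makes duplicated neighborhood names unique by appending the borough. As a hedged alternative, the same result can be sketched without a loop by using a boolean mask. The four-row `neighborhoods` frame below is a hypothetical sample, and this version detects duplicates with `duplicated(keep=False)` instead of relying on a precomputed `dup_ngb` list.

```python
import pandas as pd

# Hypothetical sample; the real `neighborhoods` frame has 300+ rows.
neighborhoods = pd.DataFrame({
    "Borough": ["Bronx", "Queens", "Manhattan", "Queens"],
    "Neighborhood": ["Wakefield", "Murray Hill", "Murray Hill", "Astoria"],
})

df_ngb = neighborhoods.copy()
# Flag every name that occurs more than once (keep=False marks all copies).
is_dup = df_ngb["Neighborhood"].duplicated(keep=False)
# Append the borough to the duplicated names only.
df_ngb.loc[is_dup, "Neighborhood"] = (
    df_ngb.loc[is_dup, "Neighborhood"] + "_" + df_ngb.loc[is_dup, "Borough"]
)
print(df_ngb["Neighborhood"].tolist())
```

The mask-based version does not depend on the index being a default range, which is the one assumption the positional loop needs.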


In [217]:
#merge two dataframe
ny_merged = df_ngb.join(df_ny.set_index('Neighborhood'),how='inner', on='Neighborhood')
print(ny_merged.shape[0])
print(ny_merged['Cluster Labels'].unique())
ny_merged.head()

304
[1 0 2]


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,Num_of_DirectCompetitor,Num_of_IndirectCompetitor,Num_of_Pizza,Num_of_nightlife,Num_of_residence,Num_of_hotel,Num_of_transportation,Num_of_sports,Num_of_school,Num_of_theaterMuseum
0,Bronx,Wakefield,40.894705,-73.847201,1,0,1,0,0,0,0,0,0,0,0
1,Bronx,Co-op City,40.874294,-73.829939,1,1,2,1,0,0,0,2,1,0,0
2,Bronx,Eastchester,40.887556,-73.827806,1,3,5,1,0,0,0,6,0,0,0
3,Bronx,Fieldston,40.895437,-73.905643,1,0,0,0,0,0,0,0,0,0,0
4,Bronx,Riverdale,40.890834,-73.912585,1,0,0,0,0,0,0,1,1,0,0
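
An inner join silently drops any neighborhood whose name does not match on both sides, so it can be worth checking for unmatched rows before trusting the row count. The sketch below uses `DataFrame.merge` with `indicator=True` on two small hypothetical frames (`left` and `right` stand in for `df_ngb` and `df_ny`); it is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical frames standing in for df_ngb and df_ny.
left = pd.DataFrame({"Neighborhood": ["Wakefield", "Astoria", "Murray Hill_Queens"]})
right = pd.DataFrame({
    "Neighborhood": ["Wakefield", "Astoria", "Murray Hill"],
    "Num_of_Pizza": [0, 1, 2],
})

# An outer merge keeps every row; the _merge column records where each came from.
check = left.merge(right, on="Neighborhood", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Neighborhood", "_merge"]]
print(unmatched)
```

An empty `unmatched` frame would confirm that the inner join loses nothing.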


In [218]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# labels run from 0 to kclusters-1, so rainbow[cluster-1] wraps label 0 to the last color
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

#### Cluster 1 (label 0)

In [222]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[0,1] + list(range(5, ny_merged.shape[1]))]]


Unnamed: 0,Borough,Neighborhood,Num_of_DirectCompetitor,Num_of_IndirectCompetitor,Num_of_Pizza,Num_of_nightlife,Num_of_residence,Num_of_hotel,Num_of_transportation,Num_of_sports,Num_of_school,Num_of_theaterMuseum
5,Bronx,Kingsbridge,7,13,6,8,0,1,1,0,0,0
8,Bronx,Norwood,2,9,4,0,0,0,1,0,0,0
13,Bronx,Bedford Park,6,9,3,2,0,0,3,1,0,0
16,Bronx,Fordham,7,13,4,0,0,0,1,5,0,0
30,Bronx,Parkchester,1,11,3,0,0,0,0,2,0,0
37,Bronx,Pelham Bay,4,11,0,2,0,0,1,3,0,0
39,Bronx,Edgewater Park,0,8,2,3,0,0,0,0,0,0
47,Brooklyn,Bensonhurst,0,9,2,1,0,0,0,0,0,0
48,Brooklyn,Sunset Park,2,8,3,0,0,0,0,3,0,0
51,Brooklyn,Brighton Beach,3,14,0,1,0,0,0,0,0,0


#### Cluster 2 (label 1)

In [223]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[0,1] + list(range(5, ny_merged.shape[1]))]]



Unnamed: 0,Borough,Neighborhood,Num_of_DirectCompetitor,Num_of_IndirectCompetitor,Num_of_Pizza,Num_of_nightlife,Num_of_residence,Num_of_hotel,Num_of_transportation,Num_of_sports,Num_of_school,Num_of_theaterMuseum
0,Bronx,Wakefield,0,1,0,0,0,0,0,0,0,0
1,Bronx,Co-op City,1,2,1,0,0,0,2,1,0,0
2,Bronx,Eastchester,3,5,1,0,0,0,6,0,0,0
3,Bronx,Fieldston,0,0,0,0,0,0,0,0,0,0
4,Bronx,Riverdale,0,0,0,0,0,0,1,1,0,0
6,Manhattan,Marble Hill,1,5,1,0,0,0,0,4,0,0
7,Bronx,Woodlawn,0,3,2,2,0,0,2,0,0,0
9,Bronx,Williamsbridge,0,3,0,2,0,0,0,0,0,0
10,Bronx,Baychester,2,3,1,0,0,0,1,0,0,0
11,Bronx,Pelham Parkway,0,8,2,0,0,0,3,0,0,0


#### Cluster 3 (label 2)

In [224]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[0,1] + list(range(5, ny_merged.shape[1]))]]


Unnamed: 0,Borough,Neighborhood,Num_of_DirectCompetitor,Num_of_IndirectCompetitor,Num_of_Pizza,Num_of_nightlife,Num_of_residence,Num_of_hotel,Num_of_transportation,Num_of_sports,Num_of_school,Num_of_theaterMuseum
34,Bronx,Belmont,4,27,9,2,0,0,2,1,0,0
46,Brooklyn,Bay Ridge,2,34,5,7,0,1,0,1,0,0
49,Brooklyn,Greenpoint,1,18,7,17,0,1,1,3,0,0
59,Brooklyn,Prospect Heights,1,23,2,15,0,0,0,1,0,0
62,Brooklyn,Bushwick,1,21,3,8,1,0,1,1,0,0
64,Brooklyn,Brooklyn Heights,3,23,3,4,0,0,0,11,0,1
65,Brooklyn,Cobble Hill,2,28,4,10,0,0,0,4,0,1
66,Brooklyn,Carroll Gardens,2,23,5,12,0,0,0,2,0,1
69,Brooklyn,Fort Greene,0,25,3,6,0,0,0,3,0,1
84,Brooklyn,Clinton Hill,3,38,6,7,0,0,1,4,0,0


<a id='item5'></a>

### 5. Results and Discussion

Our analysis shows that cluster 3 neighborhoods (light green dots on the Leaflet map) face a lot of competition, both direct and indirect, but also have better transportation and more opportunities (more residents, tourists, and activities). Most cluster 3 neighborhoods are in Manhattan, Brooklyn, and Queens. This fits our expectation: because Manhattan, Brooklyn, and Queens are more developed than the other boroughs, they should have more competition, better transportation, and more opportunities.

In cluster 2 neighborhoods, there is little competition but, at the same time, few opportunities and worse transportation.

In cluster 1 neighborhoods, there is a moderate level of competition along with moderate opportunities and transportation.

Because a pizza restaurant caters to customers who want lower prices and a moderate level of service, cluster 1 might be the better choice. In particular, it might be a good idea to pick a Brooklyn or Queens neighborhood in cluster 1. Many Brooklyn and Queens neighborhoods already fall in cluster 3, so the developed areas in these two boroughs are likely to expand gradually into more of their neighborhoods. Thus, considering future opportunities, a Brooklyn or Queens neighborhood in cluster 1 is recommended.
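
The cluster descriptions above can be checked numerically by averaging each count column per cluster. The sketch below does this on six rows copied from the cluster tables (two per cluster), so the `sample` frame is only an illustrative subset and its averages are not the full-data results. Note that label 0 corresponds to cluster 1, label 1 to cluster 2, and label 2 to cluster 3.

```python
import pandas as pd

# Six rows copied from the cluster tables above (two per cluster), not the full data:
# Kingsbridge, Norwood (label 0); Wakefield, Co-op City (label 1); Belmont, Bay Ridge (label 2).
sample = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1, 2, 2],
    "Num_of_DirectCompetitor": [7, 2, 0, 1, 4, 2],
    "Num_of_IndirectCompetitor": [13, 9, 1, 2, 27, 34],
    "Num_of_transportation": [1, 1, 0, 2, 2, 0],
})

# Mean venue count per cluster for every feature column.
profile = sample.groupby("Cluster Labels").mean()
print(profile)
```

Running the same `groupby` on the numeric columns of `ny_merged` would show, for every feature, whether the "little", "moderate", and "a lot" descriptions hold across all 304 neighborhoods.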

<a id='item6'></a>

### 6. Conclusion

The purpose of this project was to identify neighborhoods that offer more opportunities, better transportation, and lower competition for pizza restaurants. By counting the venues in each neighborhood for the 9 categories we created from Foursquare data, we were able to group the neighborhoods into three clusters. Cluster 1 represents neighborhoods with a moderate level of competition, transportation, and opportunities. Cluster 2 represents neighborhoods with little competition but also few opportunities and poor transportation. Cluster 3 represents neighborhoods with fierce competition but also more opportunities and better transportation.

The final decision on the optimal pizza restaurant location will be made by stakeholders, based on the specific characteristics of the neighborhoods and locations in each cluster. Each cluster has its own characteristics, and cluster 1 might be the better choice. A more in-depth study might be needed to decide the final location.