# Capstone Project - Battle of the Neighborhoods (Part 2)

### Table of Contents
* Introduction: Business Problem
* Data
* Methodology
* Analysis
* Results

### Introduction

This notebook is part of the Capstone Project for IBM's Data Science Professional Certificate. For this project, I explore a hypothetical request regarding Japanese restaurants in New York City. As of 2011, about 20,000 Japanese were living in New York City, and around 45,000 in total were living in the Greater New York City area. Most of this population consists of students, artists and expatriate business workers, who are typically posted in the US for three to five years. While the Japanese in New York form a smaller demographic than either the Chinese or Korean populations, the popularity of Japanese cuisine suggests steady demand, both from the largest Japanese community on the East Coast and from the general public, tourists and locals alike.

I will identify the locations of existing Japanese restaurants in and around New York (extending out to Bergen County, NJ and Long Island). This data will be used to scope out locations with a lower concentration of Japanese restaurants.

**Note: This project can be further expanded by clustering the Japanese population in the state of New York by city and plotting the population clusters against the restaurant clusters. This would lead to a much more accurate prediction, but it is not done in this report due to time constraints.**

#### Target Audience

This report and its findings will be useful to anyone who wants a detailed overview of Japanese cuisine within New York City bounds, and to anyone who wishes to see how data can give prospective small business owners and restaurateurs the information they need to make informed decisions.

### Data

Based on the definition of our problem, the factors that will influence our decision are:

* Number of existing Japanese restaurants in the New York City area.
* Location (latitude and longitude) of Japanese restaurants.
* Concentration of the Japanese population in New York and the surrounding neighborhoods (again, not used due to time constraints).

The following data sources will be needed to extract/generate the required information:

* Geographic coordinates of New York, NY and Scarsdale, NY will be obtained using Geopy Nominatim.
* The number of restaurants in every neighborhood, along with their type and location, will be obtained using the Foursquare API.

**Note: Since the Foursquare venues/search endpoint used below returns at most 50 results per request, I run a second search centered on Scarsdale, NY to get more coverage. Scarsdale has a sizable Japanese community made up of first- and second-generation households, as opposed to communities in the New York metropolitan area, which normally consist of expatriates assigned to live there for only three to five years.**

#### Let's Begin - Importing Required Libraries

In [2]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming JSON into a pandas dataframe
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Folium and geopy installed')
print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Folium and geopy installed
Libraries imported.


#### Now, let's Analyze and Cluster Japanese Restaurants around New York - Obtain Geographical Coordinates for New York and Scarsdale Using Geopy

In [3]:
address = 'New York, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude_nyc = location.latitude
longitude_nyc = location.longitude
print('The geographical coordinates of New York are {}, {}.'.format(latitude_nyc, longitude_nyc))

The geographical coordinates of New York are 40.7127281, -74.0060152.


In [4]:
address = 'Scarsdale, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude_scr = location.latitude
longitude_scr = location.longitude
print('The geographical coordinates of Scarsdale are {}, {}.'.format(latitude_scr, longitude_scr))

The geographical coordinates of Scarsdale are 40.9690798, -73.7635316.


#### Define Foursquare Credentials

In [5]:
CLIENT_ID = 'HFIG5BPXXFM0ISH1XNCAY5K52B5FPFCBBI51Z04UT13U412X' # your Foursquare ID
CLIENT_SECRET = 'CBSE1R4EOGTXSC2J1HKW4R3TJMYHXIFQGLY4PHIA0ZAE1O0A' # your Foursquare Secret
VERSION = '20200804' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HFIG5BPXXFM0ISH1XNCAY5K52B5FPFCBBI51Z04UT13U412X
CLIENT_SECRET:CBSE1R4EOGTXSC2J1HKW4R3TJMYHXIFQGLY4PHIA0ZAE1O0A
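
Hard-coding and printing API credentials exposes them to anyone who reads the notebook. A minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical choices for this sketch, not something Foursquare requires:

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None


def mask(secret, visible=4):
    """Keep only the first few characters of a secret readable when printing."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_SECRET: " + mask(client_secret))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment, and the secret would never appear in the saved output.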


#### The Fun Part - Search for Japanese Restaurants Using Foursquare API

* Foursquare API venues/search endpoint
* search_query = 'Japanese'
* category = '4d4b7105d754a06374d81259' # Food

**Note: Refer to http://developer.foursquare.com/docs/api-reference/venues/search/ for documentation on Foursquare venue/search.**

In [6]:
# Foursquare venues->search -> New York, NY

LIMIT = 100 # number of venues requested (Foursquare returns a maximum of 50 responses)
search_query = 'Japanese'
radius = 100000 # search radius in meters
category = '4d4b7105d754a06374d81259' # Food

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&query={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_query,
    category,
    latitude_nyc, 
    longitude_nyc,
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/search?&client_id=HFIG5BPXXFM0ISH1XNCAY5K52B5FPFCBBI51Z04UT13U412X&client_secret=CBSE1R4EOGTXSC2J1HKW4R3TJMYHXIFQGLY4PHIA0ZAE1O0A&v=20200804&query=Japanese&categoryId=4d4b7105d754a06374d81259&ll=40.7127281,-74.0060152&radius=100000&limit=100'
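
The URL above is assembled with `str.format`, which works here because none of the values contain characters that need escaping. As a hedged alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which handles escaping and avoids the stray `&` after the `?`. The function name `build_search_url` and the placeholder credentials are assumptions made for this sketch:

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(client_id, client_secret, version, query, category_id,
                     lat, lng, radius, limit):
    """Assemble a venues/search URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "query": query,
        "categoryId": category_id,
        "ll": f"{lat},{lng}",  # latitude,longitude as a single parameter
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


url = build_search_url(
    "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",  # placeholders, not real credentials
    "20200804", "Japanese", "4d4b7105d754a06374d81259",
    40.7127281, -74.0060152, 100000, 50,
)

# Round-trip check: parsing the URL recovers the original coordinate pair
print(parse_qs(urlsplit(url).query)["ll"])
```

Passing the same dictionary as `params=` to `requests.get` is another option and leaves the encoding to the requests library.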

In [7]:
# get request for results
results = requests.get(url).json()
results  # display results

{'meta': {'code': 200, 'requestId': '5f4a64f7d7237d7d56dc833c'},
 'response': {'venues': [{'id': '4b78ac62f964a52047dd2ee3',
    'name': 'Zutto Japanese American Pub',
    'location': {'address': '77 Hudson St',
     'crossStreet': 'Harrison St and Worth St.',
     'lat': 40.7185655837561,
     'lng': -74.00891273768076,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.7185655837561,
       'lng': -74.00891273768076},
      {'label': 'entrance', 'lat': 40.718459, 'lng': -74.009006}],
     'distance': 694,
     'postalCode': '10013',
     'cc': 'US',
     'city': 'New York',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['77 Hudson St (Harrison St and Worth St.)',
      'New York, NY 10013',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d1d1941735',
      'name': 'Noodle House',
      'pluralName': 'Noodle Houses',
      'shortName': 'Noodles',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/ramen_',


#### Get Relevant Part of JSON and Transform it into a pandas Dataframe

In [8]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_nyc = json_normalize(venues)

# check number of results returned
df_nyc.shape

(50, 25)
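
The dotted column names used later in this notebook, such as `location.lat`, come from flattening the nested JSON: each nested key is joined to its parent with a dot. The sketch below mimics that naming in plain Python for one abbreviated venue taken from the response above. `flatten` is a hypothetical helper written for illustration and is not how pandas implements `json_normalize`:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the prefix along
            flat.update(flatten(value, column, sep))
        else:
            # Scalars and lists are stored unchanged under the dotted name
            flat[column] = value
    return flat


# Abbreviated venue record copied from the search response above
venue = {
    "id": "4b78ac62f964a52047dd2ee3",
    "name": "Zutto Japanese American Pub",
    "location": {
        "address": "77 Hudson St",
        "lat": 40.7185655837561,
        "lng": -74.00891273768076,
        "city": "New York",
        "state": "NY",
    },
}

row = flatten(venue)
print(sorted(row))
```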

#### Second Verse, Same as the First...

In [9]:
# Foursquare venues->search -> Scarsdale, NY

LIMIT = 100 # number of venues requested (Foursquare returns a maximum of 50 responses)
search_query = 'Japanese'
radius = 100000 # search radius in meters
category = '4d4b7105d754a06374d81259' # Food

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&query={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_query,
    category,
    latitude_scr, 
    longitude_scr,
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/search?&client_id=HFIG5BPXXFM0ISH1XNCAY5K52B5FPFCBBI51Z04UT13U412X&client_secret=CBSE1R4EOGTXSC2J1HKW4R3TJMYHXIFQGLY4PHIA0ZAE1O0A&v=20200804&query=Japanese&categoryId=4d4b7105d754a06374d81259&ll=40.9690798,-73.7635316&radius=100000&limit=100'

In [10]:
# get request for results
results = requests.get(url).json()

results  # display results

{'meta': {'code': 200, 'requestId': '5f4a648a539a1348c565f2ef'},
 'response': {'venues': [{'id': '516e9e7ee4b00e4457f937ec',
    'name': 'Gyu-Kaku Japanese BBQ',
    'location': {'address': '159 Main Street',
     'lat': 41.03243055533406,
     'lng': -73.7688248686135,
     'labeledLatLngs': [{'label': 'display',
       'lat': 41.03243055533406,
       'lng': -73.7688248686135}],
     'distance': 7066,
     'postalCode': '10601',
     'cc': 'US',
     'city': 'White Plains',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['159 Main Street',
      'White Plains, NY 10601',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d111941735',
      'name': 'Japanese Restaurant',
      'pluralName': 'Japanese Restaurants',
      'shortName': 'Japanese',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/japanese_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1598711077',
    'hasPerk': False},
   {'id'

In [11]:
# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
df_scr = json_normalize(venues)

# check number of results returned
df_scr.shape

(50, 25)

#### Append the Restaurant Search Results for New York and Scarsdale and Remove Duplicate/Overlapping Entries

In [12]:
# Combine the two result sets (DataFrame.append was removed in pandas 2.0)
df_append = pd.concat([df_nyc, df_scr])

# Remove venues returned by both searches, keeping the first occurrence
df_nyc_scr = df_append.drop_duplicates(subset=['id'])
print ("No of unique Japanese restaurants identified around New York : " + str(df_nyc_scr.shape[0]))

No of unique Japanese restaurants identified around New York : 71
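
Each search returned 50 rows, so the combined frame holds 100 rows. Since 71 unique ids remain, 29 venues appeared in both searches. The sketch below shows, in plain Python, the keep-first rule that `drop_duplicates(subset=['id'])` applies by default. The function `drop_duplicate_ids` and the toy rows are made up for illustration and are not real Foursquare data:

```python
def drop_duplicate_ids(rows):
    """Keep the first row seen for each id and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


# Toy stand-ins for the two search results (ids are invented)
nyc_rows = [{"id": "a", "name": "Venue A"}, {"id": "b", "name": "Venue B"}]
scr_rows = [{"id": "b", "name": "Venue B"}, {"id": "c", "name": "Venue C"}]

combined = drop_duplicate_ids(nyc_rows + scr_rows)
print([row["id"] for row in combined])
```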


#### Select only Relevant Information and Check the Results

In [15]:
# take an explicit copy so that a column can be added to df_rest later
df_rest = df_nyc_scr[['name','location.address','location.city','location.state','location.lat','location.lng']].copy()
df_rest

| | name | location.address | location.city | location.state | location.lat | location.lng |
|---|---|---|---|---|---|---|
| 0 | Zutto Japanese American Pub | 77 Hudson St | New York | NY | 40.718566 | -74.008913 |
| 1 | Sumo Japanese Cuisine | 104 John St | New York | NY | 40.707714 | -74.006226 |
| 2 | EN Japanese Brasserie | 435 Hudson St | New York | NY | 40.730297 | -74.006957 |
| 3 | Emperor Japanese Tapas Shabu | 96 Bowery | New York | NY | 40.717581 | -73.995330 |
| 4 | Pinklady Japanese Cheese Tart | 11 Mott St | New York | NY | 40.713986 | -73.998741 |
| ... | ... | ... | ... | ... | ... | ... |
| 30 | Kumo Japanese Bistro | 22 S Main St | Pearl River | NY | 41.058524 | -74.021720 |
| 32 | T & O Thai and Japanese Restaurant | 140 Jericho Tpke Syosset 11791 | Syosset | NY | 40.809472 | -73.511359 |
| 43 | Sakura Japanese Restaurant | 371 Franklin Ave | Wyckoff | NJ | 41.009682 | -74.170503 |
| 48 | Matsuya Japanese Steak House | 490 Market St | Saddle Brook | NJ | 40.897249 | -74.100824 |


This concludes the data-gathering phase of this project.

### Methodology

Now that I have the location details for the Japanese restaurants found around New York, I will cluster them using the K-Means clustering algorithm. This gives an idea of how Japanese restaurants are concentrated across New York and the Greater New York area, and helps identify locations with a lower density of Japanese restaurants. Combined with the population density of Japanese residents in the area (not included in this report), the clusters could then indicate good locations for opening a new Japanese restaurant.
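
To make the clustering step concrete, here is a minimal sketch of the basic K-Means loop (Lloyd's algorithm) in plain Python, run on six coordinates taken from the table above. It is an illustration under stated assumptions, not what scikit-learn does internally: the starting centroids are picked by hand, only two clusters are used, and distances are plain Euclidean distances on latitude and longitude degrees. The function name `kmeans` is likewise chosen for this sketch:

```python
def kmeans(points, centroids, iterations=10):
    """Basic Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: give each point the index of its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# (latitude, longitude) pairs copied from the restaurant table
points = [
    (40.718566, -74.008913),  # Zutto Japanese American Pub, New York
    (40.707714, -74.006226),  # Sumo Japanese Cuisine, New York
    (40.730297, -74.006957),  # EN Japanese Brasserie, New York
    (41.058524, -74.021720),  # Kumo Japanese Bistro, Pearl River
    (41.009682, -74.170503),  # Sakura Japanese Restaurant, Wyckoff
    (40.897249, -74.100824),  # Matsuya Japanese Steak House, Saddle Brook
]

# Hand-picked starting centroids (an assumption of this sketch)
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
```

With these starting points the three Manhattan restaurants end up in one cluster and the three suburban ones in the other.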

#### Visualize Japanese Restaurants in New York using Folium

In [19]:
venues_map = folium.Map(location=[latitude_nyc, longitude_nyc], zoom_start=10) # generate map centred around New York

# add a red circle marker to New York
folium.features.CircleMarker(
    [latitude_nyc, longitude_nyc],
    radius=10,
    color='red',
    popup='New York',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the Japanese restaurants as blue circle markers
for lat, lng, label in zip(df_rest['location.lat'], df_rest['location.lng'], df_rest['location.city']):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

#### Cluster Restaurants using K-Means Clustering Algorithm

In [20]:
df_lat_lng = df_nyc_scr[['location.lat', 'location.lng']]

#### K-Means Clustering

In [21]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_lat_lng)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:71]

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 0,
       2, 0, 0, 4, 4, 0, 3, 3, 3, 1, 3, 1, 1, 1, 4, 1, 3, 0, 3, 0, 1, 0,
       3, 1, 0, 0, 1], dtype=int32)
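
Since the goal is to find areas with fewer Japanese restaurants, it helps to tally how many restaurants fall into each cluster. A small sketch using the standard library's `Counter`, with the labels copied from the output above:

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above (71 restaurants)
labels = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 0,
    2, 0, 0, 4, 4, 0, 3, 3, 3, 1, 3, 1, 1, 1, 4, 1, 3, 0, 3, 0, 1, 0,
    3, 1, 0, 0, 1,
]

cluster_sizes = Counter(labels)

# List clusters from the fewest restaurants to the most
for cluster, size in sorted(cluster_sizes.items(), key=lambda item: item[1]):
    print(f"Cluster {cluster}: {size} restaurants")
```

Cluster 2 holds 37 of the 71 restaurants, while the other four clusters hold between 7 and 10 each.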

#### Add Cluster Labels to the Dataframe df_rest

In [22]:
# add clustering labels 
df_rest.insert(0, 'Cluster Labels', kmeans.labels_)

#### Visualize Clusters

In [23]:
# create map
map_clusters = folium.Map(location=[latitude_nyc, longitude_nyc], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_rest['location.lat'], df_rest['location.lng'], df_rest['location.city'], df_rest['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Results

The cluster map shows that Japanese restaurants are concentrated in Midtown and Downtown Manhattan, Queens, Westchester, and Bergen County, NJ. This makes sense, as there are established Japanese communities in these areas.