Introduction 

Cities around the world differ demographically and culturally. However, cosmopolitan business centers are diverse, and their localities can be grouped into clusters that share similar venues and places.

People often move to a different city for work. When they do, they look for localities similar to the one they live in, or localities with places that interest them.

In this project, localities in the central business districts of New York City and Toronto are clustered and compared. A similar comparison can be made for other cities around the world.

The objective of this project is to find the most popular venues in a city, categorize them, and cluster the localities in the city according to the categories of their popular venues. These clusters are then compared with those of other cities.

This project is part of the Coursera Capstone course for the IBM Data Science Professional Certificate. It aims to showcase the skills learnt during the course, including data cleaning and formatting, data visualization, maps, and clustering.

 

Source of Data 

In this report, New York and Toronto are considered for the comparison of localities. The data used in the project has been collected from multiple sources on the internet.

The dataset with location details of the localities of New York has been taken from the geospatial data available at the NYU Spatial Data Repository: https://geo.nyu.edu/catalog/nyu_2451_34572.

The dataset of the localities of Toronto has been taken from Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

The coordinates of the Toronto localities are taken from the dataset available at http://cocl.us/Geospatial_data. The geopy.geocoders package is used to look up the coordinates of the city centres.

Details of the popular venues in each locality have been obtained using the Foursquare API.

Localities across the two cities are compared based on the category of the popular venues at each location. Clustering has been used to group similar localities together. 

Methodology 

Location data from the sources mentioned above was cleaned and formatted to obtain the locality, borough, postal code and location coordinates (latitude and longitude).

The location coordinates were used to retrieve the popular venues in and around each location using the Foursquare API.

The locations were clustered using k-means clustering, and the resulting clusters were analyzed and classified according to the categories of popular venues in each location.

 

In [1]:
import bs4 as bs
import urllib.request
import pandas as pd

# download the Wikipedia page and parse it
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').read()
soup = bs.BeautifulSoup(source,'lxml')

table = soup.find('table')
table_rows = table.find_all('tr')

# collect the non-empty cell texts of every table row
res = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)


df = pd.DataFrame(res, columns=["PostalCode", "Borough", "Neighbourhood"])

df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [2]:
df=df[df.Borough != "Not assigned"]

df

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [3]:
df[df.Neighbourhood == "Not assigned"]
# No rows are returned: every row that has a Borough also has a Neighbourhood

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [4]:
df = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
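
The `groupby` step matters when a postal code appears on several rows, one per neighbourhood. A minimal sketch with made-up rows (the split of M5A into two rows is hypothetical, for illustration only) shows how the neighbourhood names end up in a single comma-separated string:

```python
import pandas as pd

# hypothetical input: postal code M5A listed once per neighbourhood
rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Victoria Village"],
})

# one row per (PostalCode, Borough); neighbourhood names joined with ", "
combined = (
    rows.groupby(["PostalCode", "Borough"])["Neighbourhood"]
        .apply(", ".join)
        .reset_index()
)
print(combined)
```

Note that `groupby` sorts the result by postal code, so M4A comes before M5A in the output.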


In [5]:
#!conda install -c conda-forge geopy --yes

#!conda install -c conda-forge geocoder --yes

The location coordinates are taken from the CSV file available at https://cocl.us/Geospatial_data.

In [6]:
#Getting latitude and longitude data and merging on the postal code to get the required dataframe
df2 = pd.read_csv("https://cocl.us/Geospatial_data")

# match rows by postal code rather than by row position
df3 = df.merge(df2, left_on='PostalCode', right_on='Postal Code', how='inner')

df3.drop(['Postal Code'], axis =1, inplace = True)

df3

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
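
Attaching coordinates to localities is only correct when each row is matched on its postal code; a purely positional join would silently mis-assign coordinates if the two tables were ordered differently. A minimal sketch of a key-based join, using two rows from the tables here in deliberately different orders (the column names `PostalCode` and `Postal Code` follow the two datasets):

```python
import pandas as pd

# two rows from the tables above, deliberately listed in different orders
localities = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# match on the postal code so that row order does not matter
merged = localities.merge(coords, left_on="PostalCode", right_on="Postal Code")
merged = merged.drop(columns="Postal Code")

# every locality should have found its coordinates
assert len(merged) == len(localities)
print(merged)
```

The length check is a cheap guard: if a postal code were missing from the coordinates file, the inner join would drop that locality and the assertion would fail.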


In [7]:
#!conda install -c conda-forge folium=0.5.0 --yes

In [8]:
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [9]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [10]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [11]:
dttoronto_data = df3[df3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dttoronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [71]:
dttoronto_data

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752


In [12]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6563221, -79.3809161.


In [13]:
# create map of Downtown Toronto using latitude and longitude values
map_dttoronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(dttoronto_data['Latitude'], dttoronto_data['Longitude'], dttoronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dttoronto)  
    
map_dttoronto

In [73]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


In [15]:
dttoronto_data.loc[0, 'Neighbourhood']

'Rosedale'

In [16]:
neighborhood_latitude = dttoronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dttoronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dttoronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rosedale are 43.6795626, -79.37752940000001.


In [17]:
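radius = 500 # search radius in metres
LIMIT = 30

# note: this query is centred on the Downtown Toronto coordinates (latitude, longitude)
# returned by the geocoder, not on the Rosedale coordinates printed in the previous cell
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)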
url

'https://api.foursquare.com/v2/venues/explore?client_id=REDACTED&client_secret=REDACTED&ll=43.6563221,-79.3809161&v=20180605&radius=500&limit=30'
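
Displaying the request URL also displays the API credentials it contains. A minimal sketch of one way to avoid this, assuming the credentials are supplied through environment variables (the names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical) and that only a redacted copy of the URL is ever printed:

```python
import os
from urllib.parse import urlencode

# hypothetical variable names; set them in the shell before starting the notebook
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

base = "https://api.foursquare.com/v2/venues/explore?"
params = {
    "client_id": client_id,
    "client_secret": client_secret,
    "v": "20180605",
    "ll": "43.6563221,-79.3809161",  # Downtown Toronto, as in the URL above
    "radius": 500,
    "limit": 30,
}
url = base + urlencode(params)  # the URL that would be sent to the API

# print a redacted copy so the secret never ends up in saved notebook output
redacted = dict(params, client_id="REDACTED", client_secret="REDACTED")
print(base + urlencode(redacted))
```

`urlencode` percent-encodes the comma in the `ll` value; this is ordinary URL encoding of the same coordinates.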

In [18]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5faf4626f5f91049c0f1baa7'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Bay Street Corridor',
  'headerFullLocation': 'Bay Street Corridor, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 108,
  'suggestedBounds': {'ne': {'lat': 43.6608221045, 'lng': -79.37470788695488},
   'sw': {'lat': 43.651822095499995, 'lng': -79.3871243130451}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '57eda381498ebe0e6ef40972',
       'name': 'UNIQLO ユニクロ',
       'location': {'address': '220 Yonge St',
        'crossStreet': 'at Dundas St W',
        'lat': 43.65591027779457,
        'lng': -79.38064099181345,
        'labeledLatLngs

In [19]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
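
The function accepts both flattened rows (key `venue.categories`) and raw venue records (key `categories`). A small usage sketch, with plain dicts standing in for dataframe rows and only the keys the function reads; the third record is made up to show the empty-list case:

```python
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# plain dicts stand in for dataframe rows; a pandas Series is indexed the same way
flattened_row = {'venue.name': 'UNIQLO ユニクロ', 'venue.categories': [{'name': 'Clothing Store'}]}
raw_venue = {'name': 'Yonge-Dundas Square', 'categories': [{'name': 'Plaza'}]}
uncategorised = {'venue.name': 'Example venue', 'venue.categories': []}  # made-up record

print(get_category_type(flattened_row))  # Clothing Store
print(get_category_type(raw_venue))      # Plaza
print(get_category_type(uncategorised))  # None
```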

In [20]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,UNIQLO ユニクロ,Clothing Store,43.65591,-79.380641
1,Silver Snail Comics,Comic Shop,43.657031,-79.381403
2,Ed Mirvish Theatre,Theater,43.655102,-79.379768
3,Yonge-Dundas Square,Plaza,43.656054,-79.380495
4,CF Toronto Eaton Centre,Shopping Mall,43.654447,-79.380952


In [21]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

30 venues were returned by Foursquare.


In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
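
The expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A minimal sketch of a more defensive extraction step; the helper name `venue_rows` and the `'Uncategorised'` placeholder are hypothetical, and the second sample venue is made up:

```python
def venue_rows(name, lat, lng, items):
    """Flatten the 'items' list of one explore response into tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a placeholder instead of raising IndexError
        category = categories[0]['name'] if categories else 'Uncategorised'
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# sample items shaped like the API response; only the fields read above are included
items = [
    {'venue': {'name': 'Rosedale Park',
               'location': {'lat': 43.682328, 'lng': -79.378934},
               'categories': [{'name': 'Playground'}]}},
    {'venue': {'name': 'Example venue',  # made-up venue without a category
               'location': {'lat': 43.68, 'lng': -79.38},
               'categories': []}},
]

rows = venue_rows('Rosedale', 43.679563, -79.377529, items)
for row in rows:
    print(row)
```

Whether to keep or drop uncategorised venues is a design choice; keeping them under a placeholder preserves the venue count per neighbourhood.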

In [23]:
dttoronto_venues = getNearbyVenues(names=dttoronto_data['Neighbourhood'],
                                   latitudes=dttoronto_data['Latitude'],
                                   longitudes=dttoronto_data['Longitude']
                                  )

print(dttoronto_venues.shape)

dttoronto_venues.head()

Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Queen's Park, Ontario Provincial Government
(516, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"St. James Town, Cabbagetown",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


In [24]:
dttoronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,30,30,30,30,30,30
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,30,30,30,30,30,30
Christie,16,16,16,16,16,16
Church and Wellesley,30,30,30,30,30,30
"Commerce Court, Victoria Hotel",30,30,30,30,30,30
"First Canadian Place, Underground city",30,30,30,30,30,30
"Garden District, Ryerson",30,30,30,30,30,30
"Harbourfront East, Union Station, Toronto Islands",30,30,30,30,30,30
"Kensington Market, Chinatown, Grange Park",30,30,30,30,30,30


In [25]:
# one hot encoding
dttoronto_onehot = pd.get_dummies(dttoronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dttoronto_onehot['Neighbourhood'] = dttoronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [dttoronto_onehot.columns[-1]] + list(dttoronto_onehot.columns[:-1])
dttoronto_onehot = dttoronto_onehot[fixed_columns]

dttoronto_onehot.head()
dttoronto_onehot.shape

(516, 155)

In [26]:
dttoronto_grouped = dttoronto_onehot.groupby('Neighbourhood').mean().reset_index()
dttoronto_grouped

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,Art Gallery,...,Thai Restaurant,Theater,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0625,0.0625,0.0625,0.125,0.125,0.0625,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.033333
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.033333,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,...,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.066667,0.033333,0.0
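
Each value in the grouped table is the share of a neighbourhood's venues that fall into a category. A minimal sketch on a tiny table: the four Rosedale rows are the ones listed earlier, while the two Christie rows are made up for illustration:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["Rosedale"] * 4 + ["Christie"] * 2,  # Christie rows are made up
    "Venue Category": ["Playground", "Park", "Park", "Trail", "Café", "Park"],
})

# one indicator column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# the mean of the indicators is the relative frequency of each category
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

Because every venue has exactly one category here, the frequencies in each row sum to 1, which is why neighbourhoods with different venue counts remain comparable.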


In [27]:
num_top_venues = 5

for hood in dttoronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dttoronto_grouped[dttoronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.10
1  Seafood Restaurant  0.07
2      Farmers Market  0.07
3            Beer Bar  0.07
4        Cocktail Bar  0.07


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0   Airport Lounge  0.12
1  Airport Service  0.12
2          Airport  0.06
3              Bar  0.06
4            Plane  0.06


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.23
1  Italian Restaurant  0.07
2                Café  0.07
3          Comic Shop  0.03
4     Bubble Tea Shop  0.03


----Christie----
           venue  freq
0  Grocery Store  0.25
1           Café  0.19
2           Park  0.12
3    Coffee Shop  0.06
4      Nightclub  0.06


----Church and Wellesley----
             venue  freq
0      Men's Store  0.07
1      Coffee Shop  0.07
2              Pub  0.03
3         Creperie  0.03
4  Bubble Tea Shop  0.03


----Comm

In [28]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
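
A small usage sketch of this function on a single row. The frequencies are the Christie values printed in the previous output, restricted to four categories:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the Neighbourhood entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

row = pd.Series({"Neighbourhood": "Christie", "Café": 0.19, "Coffee Shop": 0.06,
                 "Grocery Store": 0.25, "Park": 0.12})

top3 = return_most_common_venues(row, 3)
print(list(top3))  # ['Grocery Store', 'Café', 'Park']
```

The function relies on the neighbourhood name being the first entry of the row. When several categories share the same frequency, their relative order depends on the sort and should not be read as a ranking.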

In [29]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = dttoronto_grouped['Neighbourhood']

for ind in np.arange(dttoronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dttoronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Beer Bar,Farmers Market,Seafood Restaurant,Cocktail Bar,Bakery,Museum,Liquor Store,Restaurant,Creperie
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport,Bar,Harbor / Marina,Coffee Shop,Rental Car Location,Boutique,Sculpture Garden,Boat or Ferry
2,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Yoga Studio,Spa,Modern European Restaurant,Miscellaneous Shop,Middle Eastern Restaurant,Park,Bubble Tea Shop
3,Christie,Grocery Store,Café,Park,Athletics & Sports,Baby Store,Coffee Shop,Nightclub,Candy Store,Restaurant,Italian Restaurant
4,Church and Wellesley,Coffee Shop,Men's Store,Ramen Restaurant,Dance Studio,Creperie,Salon / Barbershop,Bookstore,Restaurant,Breakfast Spot,Bubble Tea Shop
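
The suffix logic used for the column names gives correct labels for 1 to 10, but it would produce "21th" if more than 20 top venues were requested. A minimal sketch of a general helper; the function name `ordinal` is hypothetical:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1])   # 1st Most Common Venue
print(columns[10])  # 10th Most Common Venue
```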


In [30]:
# set number of clusters
kclusters = 5

dttoronto_grouped_clustering = dttoronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dttoronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 3, 0, 4, 1, 1, 1, 1, 1, 1])
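
The number of clusters is fixed by hand at 5. One common way to sanity-check such a choice is the silhouette score. The sketch below is not part of the original analysis: it uses made-up two-dimensional points rather than the venue frequencies, and simply picks the candidate `k` with the highest score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# made-up points forming three tight groups (stand-ins for frequency vectors)
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [0.0, 5.0], [0.1, 5.0], [0.0, 5.1],
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 means better separated

best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

With only a handful of neighbourhoods, as in the Downtown Toronto data, such scores are noisy and are better treated as a hint than as a rule.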

In [31]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dttoronto_merged = dttoronto_data

# join dttoronto_data with the cluster labels and most common venues of each neighbourhood
dttoronto_merged = dttoronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

dttoronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Playground,Trail,Yoga Studio,Comfort Food Restaurant,Deli / Bodega,Dance Studio,Creperie,Cosmetics Shop,Concert Hall
1,M4X,Downtown Toronto,"St. James Town, Cabbagetown",43.667967,-79.367675,1,Coffee Shop,Café,Italian Restaurant,Restaurant,Bakery,Playground,Caribbean Restaurant,Butcher,Pub,Bank
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Men's Store,Ramen Restaurant,Dance Studio,Creperie,Salon / Barbershop,Bookstore,Restaurant,Breakfast Spot,Bubble Tea Shop
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,Yoga Studio,Performing Arts Venue,Café,Pub,Mexican Restaurant
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Café,Coffee Shop,Clothing Store,Theater,Comic Shop,Shopping Mall,Japanese Restaurant,Sandwich Place,Bookstore,Ramen Restaurant


In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dttoronto_merged['Latitude'], dttoronto_merged['Longitude'], dttoronto_merged['Neighbourhood'], dttoronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [33]:
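# note: range(7, ...) starts at the "2nd Most Common Venue" column, so the
# "1st Most Common Venue" column (index 6) is not shown in this and the following cluster tables
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 0, dttoronto_merged.columns[[2] + list(range(7, dttoronto_merged.shape[1]))]]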

Unnamed: 0,Neighbourhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"Regent Park, Harbourfront",Bakery,Park,Theater,Breakfast Spot,Yoga Studio,Performing Arts Venue,Café,Pub,Mexican Restaurant
7,Central Bay Street,Café,Italian Restaurant,Yoga Studio,Spa,Modern European Restaurant,Miscellaneous Shop,Middle Eastern Restaurant,Park,Bubble Tea Shop
18,"Queen's Park, Ontario Provincial Government",Yoga Studio,Creperie,Smoothie Shop,Bar,Distribution Center,Sandwich Place,Diner,Mexican Restaurant,Italian Restaurant


The first cluster in Downtown Toronto, cluster 0, consists of localities with venues such as bakeries, cafés, parks, theatres, yoga studios, spas and bars. These are places of relaxation.

In [34]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 1, dttoronto_merged.columns[[2] + list(range(7, dttoronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"St. James Town, Cabbagetown",Café,Italian Restaurant,Restaurant,Bakery,Playground,Caribbean Restaurant,Butcher,Pub,Bank
2,Church and Wellesley,Men's Store,Ramen Restaurant,Dance Studio,Creperie,Salon / Barbershop,Bookstore,Restaurant,Breakfast Spot,Bubble Tea Shop
4,"Garden District, Ryerson",Coffee Shop,Clothing Store,Theater,Comic Shop,Shopping Mall,Japanese Restaurant,Sandwich Place,Bookstore,Ramen Restaurant
5,St. James Town,Farmers Market,Café,Restaurant,Japanese Restaurant,Coffee Shop,Latin American Restaurant,BBQ Joint,Creperie,Middle Eastern Restaurant
6,Berczy Park,Beer Bar,Farmers Market,Seafood Restaurant,Cocktail Bar,Bakery,Museum,Liquor Store,Restaurant,Creperie
8,"Richmond, Adelaide, King",Steakhouse,Coffee Shop,Gym / Fitness Center,Concert Hall,Restaurant,Plaza,Lounge,Monument / Landmark,Seafood Restaurant
9,"Harbourfront East, Union Station, Toronto Islands",Park,Café,Plaza,Salad Place,Ice Cream Shop,Skating Rink,Basketball Stadium,Dessert Shop,Japanese Restaurant
10,"Toronto Dominion Centre, Design Exchange",Café,Restaurant,Japanese Restaurant,Asian Restaurant,Sandwich Place,Pub,Beer Bar,Concert Hall,Bakery
11,"Commerce Court, Victoria Hotel",Coffee Shop,Hotel,Gastropub,American Restaurant,Restaurant,Japanese Restaurant,Museum,Pub,Deli / Bodega
12,"University of Toronto, Harbord",Bookstore,Sandwich Place,Bar,Bakery,Japanese Restaurant,Yoga Studio,Italian Restaurant,Noodle House,College Arts Building


The second cluster, cluster 1, consists of localities with venues such as farmers markets, steakhouses, restaurants, coffee shops and bars. These are eateries and places for dining out.

In [35]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 2, dttoronto_merged.columns[[2] + list(range(7, dttoronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,Playground,Trail,Yoga Studio,Comfort Food Restaurant,Deli / Bodega,Dance Studio,Creperie,Cosmetics Shop,Concert Hall


The third cluster, cluster 2, consists of a locality with a playground, trail, yoga studio, dance studio, etc., indicating a place for exercise and physical fitness.

In [36]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 3, dttoronto_merged.columns[[2] + list(range(7, dttoronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport,Bar,Harbor / Marina,Coffee Shop,Rental Car Location,Boutique,Sculpture Garden,Boat or Ferry


The fourth cluster, cluster 3, consists of an airport and the places around it. It may be classified as a transportation hub.

In [37]:
dttoronto_merged.loc[dttoronto_merged['Cluster Labels'] == 4, dttoronto_merged.columns[[2] + list(range(7, dttoronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Christie,Café,Park,Athletics & Sports,Baby Store,Coffee Shop,Nightclub,Candy Store,Restaurant,Italian Restaurant


The fifth cluster, cluster 4, has a park and a sports centre along with a nightclub and restaurants. It may be classified as a place of recreation and relaxation.
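
The cluster descriptions are read off the tables by eye. One way to make that reading more systematic is to count how often each category appears across the localities of a cluster. In the sketch below the lists are copied from the cluster 0 table, and the counting approach itself is an assumption rather than part of the original analysis:

```python
from collections import Counter

# venue categories listed for the three cluster 0 localities
cluster_0 = {
    "Regent Park, Harbourfront": [
        "Bakery", "Park", "Theater", "Breakfast Spot", "Yoga Studio",
        "Performing Arts Venue", "Café", "Pub", "Mexican Restaurant"],
    "Central Bay Street": [
        "Café", "Italian Restaurant", "Yoga Studio", "Spa",
        "Modern European Restaurant", "Miscellaneous Shop",
        "Middle Eastern Restaurant", "Park", "Bubble Tea Shop"],
    "Queen's Park, Ontario Provincial Government": [
        "Yoga Studio", "Creperie", "Smoothie Shop", "Bar",
        "Distribution Center", "Sandwich Place", "Diner",
        "Mexican Restaurant", "Italian Restaurant"],
}

# count in how many localities of the cluster each category appears
counts = Counter(cat for cats in cluster_0.values() for cat in cats)
shared = [(cat, n) for cat, n in counts.most_common() if n > 1]
print(shared)
```

Here "Yoga Studio" is the only category listed for all three localities, which fits the reading of cluster 0 as a group of places for relaxation.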

Let us compare the localities of Toronto with those of New York City.
The next step is to cluster the localities of New York and compare the two cities.
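
Comparing localities across the two cities requires the venue-category columns to line up, since each city returns its own set of categories. The sketch below is a minimal illustration under assumptions: the tables, locality names and values are made up, and combining both cities into one table is one possible approach, not necessarily the one used in this project:

```python
import pandas as pd

# made-up frequency tables; real ones would come from the grouped dataframes
toronto = pd.DataFrame({"Coffee Shop": [0.2, 0.0], "Park": [0.0, 0.5]},
                       index=["Toronto A", "Toronto B"])
new_york = pd.DataFrame({"Coffee Shop": [0.1], "Pizza Place": [0.3]},
                        index=["New York A"])

# use the union of categories so both cities share the same columns
all_categories = sorted(set(toronto.columns) | set(new_york.columns))
toronto_aligned = toronto.reindex(columns=all_categories, fill_value=0.0)
new_york_aligned = new_york.reindex(columns=all_categories, fill_value=0.0)

# one table with a common feature space for both cities
combined = pd.concat([toronto_aligned, new_york_aligned])
print(combined)
```

A category that is missing in one city is filled with a frequency of 0, which matches its meaning: none of the returned venues there fell into that category.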

A JSON file with data on the neighbourhoods has been downloaded from https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

In [38]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [39]:
neighborhoods_data = newyork_data['features']
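
Each entry of `features` describes one neighbourhood. The sketch below uses a hypothetical, trimmed-down record that contains only the fields read in the cells that follow, filled with the Wakefield values from the resulting table. Real records carry further fields that are not shown here:

```python
# hypothetical, trimmed-down feature: only the fields the notebook reads are shown
feature = {
    "properties": {"borough": "Bronx", "name": "Wakefield"},
    "geometry": {"coordinates": [-73.847201, 40.894705]},  # [longitude, latitude]
}

borough = feature["properties"]["borough"]
name = feature["properties"]["name"]
lon, lat = feature["geometry"]["coordinates"]  # longitude comes first

print(borough, name, lat, lon)  # Bronx Wakefield 40.894705 -73.847201
```

The coordinate order is the main pitfall: the list holds longitude first, whereas the dataframe columns and the map calls use latitude first.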

In [40]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [70]:
# collect one row per neighbourhood and build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585
...,...,...,...,...
607,Manhattan,Hudson Yards,40.756658,-74.000111
608,Queens,Hammels,40.587338,-73.805530
609,Queens,Bayswater,40.611322,-73.765968
610,Queens,Queensbridge,40.756091,-73.945631
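
The loop above reads index 1 of each coordinate pair as the latitude because the features store points as [longitude, latitude]. The following is a minimal sketch of that flattening step, assuming a hand-written feature shaped like the entries in `newyork_data['features']`. The helper name `feature_to_row` and the `sample_features` list are illustrative and are not part of the notebook.

```python
# Illustrative only: one hand-written feature shaped like the entries in
# newyork_data['features']; values copied from the first row of the table above.
sample_features = [
    {
        "properties": {"borough": "Bronx", "name": "Wakefield"},
        "geometry": {"coordinates": [-73.847201, 40.894705]},  # [lon, lat]
    }
]


def feature_to_row(feature):
    """Flatten one feature into a dict with the four columns used above."""
    lon, lat = feature["geometry"]["coordinates"][:2]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


rows = [feature_to_row(f) for f in sample_features]
print(rows[0])
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to obtain the table shown above.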


In [42]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


In [43]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

In [44]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [72]:
manhattan_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688
5,Manhattan,Manhattanville,40.816934,-73.957385
6,Manhattan,Central Harlem,40.815976,-73.943211
7,Manhattan,East Harlem,40.792249,-73.944182
8,Manhattan,Upper East Side,40.775639,-73.960508
9,Manhattan,Yorkville,40.77593,-73.947118


In [45]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [46]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

In [47]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


In [48]:
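# format the Foursquare URL for getting the locality data
# note: latitude and longitude here are the Manhattan coordinates from the
# geocoder cell, not the Marble Hill values printed in the previous cell
radius = 500
LIMIT = 30

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url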


'https://api.foursquare.com/v2/venues/explore?client_id=QSLWZH04AWBFELTDFF5DBBSI4DDDBMKOXL4PAJ3ZZPOX21FU&client_secret=E1VUT00QDZ2FAFLYJNFNSOVIJIFAOH1TCHZIYW2ZTH32Q3JF&ll=40.7896239,-73.9598939&v=20180605&radius=500&limit=30'
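
The same request URL can also be assembled from a parameter dictionary, which keeps the query string readable and takes care of escaping. This is only a sketch and not the notebook's method: `build_explore_url` is a hypothetical helper, and the credentials are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng, version,
                      radius=500, limit=30):
    """Return the explore URL for one coordinate pair (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" unescaped, as in the URL above
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


# Placeholder credentials; real keys should be kept out of shared notebooks
url = build_explore_url("YOUR_ID", "YOUR_SECRET",
                        40.7896239, -73.9598939, "20180605")
print(url)
```

Keeping the credentials in variables that are never displayed avoids exposing them in saved cell output.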

In [49]:
results = requests.get(url).json()

In [50]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [51]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Central Park Tennis Center,Tennis Court,40.789313,-73.961862
1,North Meadow,Park,40.792027,-73.959853
2,East Meadow,Field,40.79016,-73.955498
3,Central Park - North Meadow Recreation Center,Playground,40.790939,-73.960304
4,Oldest Tree in Central Park,Park,40.789188,-73.957867


In [52]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
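
`getNearbyVenues` indexes `v['venue']['categories'][0]` directly, so a venue returned without any category would raise an `IndexError`. A defensive variant could look like the sketch below. The helper `venue_to_tuple` and the sample items are illustrative assumptions, shaped like the `items` list in the response.

```python
def venue_to_tuple(name, lat, lng, item):
    """Flatten one response item; the category is None when the list is empty."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (name, lat, lng, venue["name"],
            venue["location"]["lat"], venue["location"]["lng"], category)


# Hand-made items shaped like results['response']['groups'][0]['items']
sample_items = [
    {"venue": {"name": "North Meadow",
               "location": {"lat": 40.792027, "lng": -73.959853},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",  # hypothetical venue with no category
               "location": {"lat": 40.79, "lng": -73.96},
               "categories": []}},
]

records = [venue_to_tuple("Example Hood", 40.7896, -73.9599, item)
           for item in sample_items]
for record in records:
    print(record)
```

Rows with a missing category can then be dropped or labelled before the one-hot encoding step.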

In [53]:
# fetch the nearby venues for every Manhattan neighborhood
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


In [54]:
manhattan_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Battery Park City,30,30,30,30,30,30
Carnegie Hill,30,30,30,30,30,30
Central Harlem,30,30,30,30,30,30
Chelsea,30,30,30,30,30,30
Chinatown,30,30,30,30,30,30
Civic Center,30,30,30,30,30,30
Clinton,30,30,30,30,30,30
East Harlem,30,30,30,30,30,30
East Village,30,30,30,30,30,30
Financial District,30,30,30,30,30,30


In [55]:
print('There are {} unique categories.'.format(len(manhattan_venues['Venue Category'].unique())))

There are 238 unique categories.


In [56]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Marble Hill,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [57]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Waterfront,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Carnegie Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0
2,Central Harlem,0.0,0.0,0.066667,0.066667,0.0,0.0,0.033333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chelsea,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0
4,Chinatown,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Civic Center,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333
6,Clinton,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,East Harlem,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,East Village,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,...,0.033333,0.0,0.0,0.0,0.066667,0.0,0.066667,0.0,0.0,0.0
9,Financial District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0
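
Averaging the one-hot rows per neighborhood is equivalent to dividing each category's count by the number of venues fetched for that neighborhood. With 30 venues, a single venue contributes 1/30, which is the 0.033333 seen in the table. The sketch below shows the same calculation without pandas, using made-up venues.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, category) pairs, for illustration only
venue_rows = [
    ("A", "Park"), ("A", "Park"), ("A", "Café"), ("A", "Gym"),
    ("B", "Café"), ("B", "Bakery"),
]

# Count categories per neighborhood
counts = defaultdict(Counter)
for hood, category in venue_rows:
    counts[hood][category] += 1

# Relative frequency = category count / venues in the neighborhood,
# the same number that groupby('Neighborhood').mean() yields on one-hot rows
frequencies = {
    hood: {cat: n / sum(cat_counts.values()) for cat, n in cat_counts.items()}
    for hood, cat_counts in counts.items()
}

print(frequencies["A"])
print(round(1 / 30, 6))  # contribution of one venue out of 30
```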


In [58]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Battery Park City----
           venue  freq
0  Memorial Site  0.10
1           Park  0.10
2          Plaza  0.07
3            Gym  0.07
4     Food Court  0.07


----Carnegie Hill----
                  venue  freq
0                   Gym  0.07
1    Italian Restaurant  0.07
2           Coffee Shop  0.07
3  Gym / Fitness Center  0.07
4           Pizza Place  0.07


----Central Harlem----
                 venue  freq
0   African Restaurant  0.07
1  American Restaurant  0.07
2    French Restaurant  0.07
3       Cosmetics Shop  0.07
4   Chinese Restaurant  0.07


----Chelsea----
            venue  freq
0         Theater  0.07
1           Hotel  0.07
2  Ice Cream Shop  0.07
3     Coffee Shop  0.07
4      Taco Place  0.03


----Chinatown----
                venue  freq
0  Chinese Restaurant  0.10
1                 Spa  0.07
2        Noodle House  0.07
3      Sandwich Place  0.07
4              Museum  0.03


----Civic Center----
                  venue  freq
0                Bakery  0.07


In [59]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [60]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Battery Park City,Memorial Site,Park,Food Court,Plaza,Coffee Shop,Gym,Building,Scenic Lookout,Sandwich Place,Gourmet Shop
1,Carnegie Hill,Pizza Place,Bookstore,Coffee Shop,Gym,Gym / Fitness Center,Italian Restaurant,Café,Dance Studio,Restaurant,Shipping Store
2,Central Harlem,French Restaurant,African Restaurant,American Restaurant,Chinese Restaurant,Cosmetics Shop,Bagel Shop,Music Venue,Library,Gym / Fitness Center,Ethiopian Restaurant
3,Chelsea,Theater,Hotel,Ice Cream Shop,Coffee Shop,Chinese Restaurant,Beer Bar,Café,Taco Place,New American Restaurant,Sushi Restaurant
4,Chinatown,Chinese Restaurant,Spa,Sandwich Place,Noodle House,Indie Movie Theater,Roof Deck,Boutique,Bubble Tea Shop,Spanish Restaurant,Salon / Barbershop
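
Many categories share the same frequency, so the order among tied categories depends on the sort and can differ between runs of the ranking (compare the Battery Park City rows in the two outputs above). The sketch below ranks categories with an explicit alphabetical tie-break and builds the ordinal column labels with a helper that also handles values such as 11th and 21st. Both `ordinal` and `top_categories` are hypothetical helpers, not functions from the notebook.

```python
def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... including 11th-13th and 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(freqs, n):
    """Top-n categories by frequency; ties are broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


# Frequencies taken from the Battery Park City printout above
battery_park = {"Memorial Site": 0.10, "Park": 0.10, "Plaza": 0.07,
                "Gym": 0.07, "Food Court": 0.07}

for rank, name in enumerate(top_categories(battery_park, 3), start=1):
    print("{} Most Common Venue: {}".format(ordinal(rank), name))
```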


In [61]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 3, 3, 2, 3, 2, 4, 1, 3, 2])
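
To make the clustering step less of a black box, the sketch below implements the basic k-means loop (assign each point to the nearest centroid, then recompute each centroid as the mean of its members) on a few made-up frequency vectors. It is a plain illustration of the idea and not scikit-learn's implementation, which adds refinements such as a smarter initialization. The function names and the sample points are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: index of the closest centroid for every point
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # update step: mean of the members; an empty cluster keeps its centroid
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up venue-frequency vectors: two "park-heavy" and two "dining-heavy" rows
points = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.9]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why the clusters in this report are interpreted by inspecting their most common venues.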

In [62]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,1,Gym,Sandwich Place,Discount Store,Coffee Shop,Yoga Studio,Diner,Pizza Place,Steakhouse,Supplement Shop,Seafood Restaurant
1,Manhattan,Chinatown,40.715618,-73.994279,3,Chinese Restaurant,Spa,Sandwich Place,Noodle House,Indie Movie Theater,Roof Deck,Boutique,Bubble Tea Shop,Spanish Restaurant,Salon / Barbershop
2,Manhattan,Washington Heights,40.851903,-73.9369,3,Café,Deli / Bodega,Wine Shop,Park,Bakery,Scenic Lookout,Latin American Restaurant,Market,Coffee Shop,Tapas Restaurant
3,Manhattan,Inwood,40.867684,-73.92121,3,Mexican Restaurant,Wine Bar,Deli / Bodega,Park,Frozen Yogurt Shop,Restaurant,Bakery,Café,Yoga Studio,Latin American Restaurant
4,Manhattan,Hamilton Heights,40.823604,-73.949688,2,Coffee Shop,Yoga Studio,Mexican Restaurant,Caribbean Restaurant,Cocktail Bar,Café,Latin American Restaurant,Mediterranean Restaurant,Japanese Restaurant,Burger Joint


In [69]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [64]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,Stuyvesant Town,Park,Baseball Field,Coffee Shop,Farmers Market,Gas Station,Boat or Ferry,Bistro,Gym / Fitness Center,Bar,Cocktail Bar


The first cluster in Manhattan, cluster 0, has a single neighbourhood whose venues include a park, a baseball field, a coffee shop, a boat or ferry, a gym, etc. So, this cluster may be defined as a sports center.

In [65]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Marble Hill,Gym,Sandwich Place,Discount Store,Coffee Shop,Yoga Studio,Diner,Pizza Place,Steakhouse,Supplement Shop,Seafood Restaurant
7,East Harlem,Mexican Restaurant,Thai Restaurant,Bakery,Latin American Restaurant,New American Restaurant,Taco Place,Street Art,Steakhouse,Café,French Restaurant
10,Lenox Hill,Gym,Burger Joint,Thai Restaurant,Lingerie Store,College Academic Building,Restaurant,Gift Shop,Taco Place,Coffee Shop,Cocktail Bar
15,Midtown,Hotel,Sporting Goods Shop,Tailor Shop,Clothing Store,Bookstore,French Restaurant,Chinese Restaurant,Salad Place,Szechuan Restaurant,Cycle Studio
16,Murray Hill,Japanese Restaurant,Hotel,Burger Joint,Coffee Shop,Jewish Restaurant,Tea Room,Sushi Restaurant,Bar,Grocery Store,Restaurant
20,Lower East Side,Chinese Restaurant,Japanese Restaurant,Ramen Restaurant,Coffee Shop,Café,Art Gallery,Filipino Restaurant,Bubble Tea Shop,French Restaurant,Mexican Restaurant
31,Noho,Wine Shop,Ice Cream Shop,Cocktail Bar,Rock Club,Coffee Shop,French Restaurant,Indie Movie Theater,Deli / Bodega,Bookstore,Boutique
33,Midtown South,Korean Restaurant,Fried Chicken Joint,Hotel,Coffee Shop,Building,Clothing Store,Lingerie Store,Grocery Store,Leather Goods Store,Snack Place
34,Sutton Place,Gym / Fitness Center,Gym,Grocery Store,Italian Restaurant,Beer Garden,Indian Restaurant,Deli / Bodega,Greek Restaurant,Gourmet Shop,Steakhouse
38,Flatiron,Cycle Studio,Japanese Restaurant,Furniture / Home Store,Thai Restaurant,Coffee Shop,Salad Place,Russian Restaurant,Gift Shop,Donut Shop,Tapas Restaurant


The second cluster, cluster 1, consists of places like gyms and yoga studios along with sporting goods shops, bookstores, etc. So, the cluster may be defined as places for leisure and exercise.

In [66]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Hamilton Heights,Coffee Shop,Yoga Studio,Mexican Restaurant,Caribbean Restaurant,Cocktail Bar,Café,Latin American Restaurant,Mediterranean Restaurant,Japanese Restaurant,Burger Joint
5,Manhattanville,Seafood Restaurant,Coffee Shop,Mexican Restaurant,Italian Restaurant,Ramen Restaurant,Bar,Gastropub,Sushi Restaurant,Climbing Gym,Boutique
9,Yorkville,Italian Restaurant,Wine Shop,Coffee Shop,Deli / Bodega,Gym,Park,Bagel Shop,Diner,Liquor Store,Sushi Restaurant
11,Roosevelt Island,Deli / Bodega,Cosmetics Shop,Restaurant,Farmers Market,Bubble Tea Shop,Food & Drink Shop,Dry Cleaner,Supermarket,Dog Run,School
17,Chelsea,Theater,Hotel,Ice Cream Shop,Coffee Shop,Chinese Restaurant,Beer Bar,Café,Taco Place,New American Restaurant,Sushi Restaurant
18,Greenwich Village,Italian Restaurant,Sushi Restaurant,French Restaurant,Cosmetics Shop,Clothing Store,Café,Bagel Shop,Coffee Shop,Sandwich Place,New American Restaurant
21,Tribeca,Men's Store,Wine Shop,Spa,American Restaurant,Greek Restaurant,Park,Dog Run,Café,Salad Place,Coffee Shop
24,West Village,Italian Restaurant,Cocktail Bar,Coffee Shop,American Restaurant,Gourmet Shop,Theater,French Restaurant,Candy Store,Speakeasy,Mediterranean Restaurant
26,Morningside Heights,Park,American Restaurant,Bookstore,Coffee Shop,Burger Joint,Food Truck,Café,Mexican Restaurant,Farmers Market,Frozen Yogurt Shop
28,Battery Park City,Memorial Site,Park,Food Court,Plaza,Coffee Shop,Gym,Building,Scenic Lookout,Sandwich Place,Gourmet Shop


The third cluster, cluster 2, has coffee shops, cafés, a karaoke bar and cocktail bars along with spas, parks, etc. The cluster may be defined as places for outings and relaxation.

In [67]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Chinatown,Chinese Restaurant,Spa,Sandwich Place,Noodle House,Indie Movie Theater,Roof Deck,Boutique,Bubble Tea Shop,Spanish Restaurant,Salon / Barbershop
2,Washington Heights,Café,Deli / Bodega,Wine Shop,Park,Bakery,Scenic Lookout,Latin American Restaurant,Market,Coffee Shop,Tapas Restaurant
3,Inwood,Mexican Restaurant,Wine Bar,Deli / Bodega,Park,Frozen Yogurt Shop,Restaurant,Bakery,Café,Yoga Studio,Latin American Restaurant
6,Central Harlem,French Restaurant,African Restaurant,American Restaurant,Chinese Restaurant,Cosmetics Shop,Bagel Shop,Music Venue,Library,Gym / Fitness Center,Ethiopian Restaurant
8,Upper East Side,Italian Restaurant,Hotel,American Restaurant,Bakery,Pet Store,Sculpture Garden,Salad Place,Sandwich Place,Coffee Shop,Chocolate Shop
12,Upper West Side,Bakery,American Restaurant,Italian Restaurant,Bar,Movie Theater,Bagel Shop,Seafood Restaurant,Greek Restaurant,Ramen Restaurant,Juice Bar
19,East Village,Pizza Place,Korean Restaurant,Wine Bar,Vietnamese Restaurant,Gift Shop,Coffee Shop,Swiss Restaurant,Beer Store,Juice Bar,Beer Bar
22,Little Italy,Ice Cream Shop,Café,Wine Bar,Sandwich Place,Coffee Shop,Cocktail Bar,Chinese Restaurant,Clothing Store,Snack Place,Bakery
23,Soho,Clothing Store,Shoe Store,Sporting Goods Shop,Men's Store,Boutique,Italian Restaurant,Tea Room,Dance Studio,Salon / Barbershop,Miscellaneous Shop
25,Manhattan Valley,Bar,Coffee Shop,Playground,Pizza Place,Yoga Studio,Bubble Tea Shop,Caribbean Restaurant,Cosmetics Shop,Park,Ethiopian Restaurant


The fourth cluster, cluster 3, has eateries such as restaurants, bakeries, bars, pizza places, etc. So, the cluster may be defined as a favourite area for dining.

In [68]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,Lincoln Square,Theater,Indie Movie Theater,Concert Hall,Performing Arts Venue,Plaza,Mexican Restaurant,Library,Movie Theater,Cycle Studio,Circus
14,Clinton,Theater,Gym / Fitness Center,Sporting Goods Shop,Café,Supermarket,French Restaurant,Sports Bar,Mediterranean Restaurant,Building,Movie Theater
39,Hudson Yards,American Restaurant,Gym / Fitness Center,Hotel,Comedy Club,Camera Store,Building,Music School,Furniture / Home Store,Supermarket,Cocktail Bar


The fifth cluster, cluster 4, has venues like theaters, concert halls, performing arts venues and comedy clubs. So, the localities grouped under this cluster are places of recreation.

Discussion 

 

It is observed that the location clusters in the two cities are very similar. For example, Rosedale in Toronto is very similar to Stuyvesant Town in New York. 

Similarly, clusters dominated by dining places and eateries appear in both cities. 

Both cities also have clusters with places of recreation, entertainment and live performance. 

Localities with coffee shops, markets, gyms, spas, etc. are likewise found in both cities. 

Thus, people living in these cosmopolitan cities, or migrating from one to the other, may find localities matching their interests in both cities. 
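
The comparison in this section is qualitative. One possible way to quantify it, offered as a suggestion and not something done in this project, is to compare the venue-frequency profiles of two localities with cosine similarity. In the sketch below the function name and all three profiles are made up for illustration; they are not values from the notebook.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity of two {category: frequency} dictionaries."""
    categories = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(c, 0.0) * profile_b.get(c, 0.0) for c in categories)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up profiles for illustration only
park_heavy_toronto = {"Park": 0.5, "Playground": 0.25, "Trail": 0.25}
park_heavy_new_york = {"Park": 0.4, "Baseball Field": 0.2, "Coffee Shop": 0.4}
dining_new_york = {"Italian Restaurant": 0.5, "Bakery": 0.5}

print(round(cosine_similarity(park_heavy_toronto, park_heavy_new_york), 3))
print(cosine_similarity(park_heavy_toronto, dining_new_york))
```

Profiles that share popular categories score closer to 1, while profiles with no category in common score 0. Missing categories are treated as zero frequency, so the two cities do not need identical category lists.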

Conclusion 

Data science can be applied to everyday decisions. As shown in this project, places of interest may be suggested to tourists or newcomers based on their preferences or on the locality where they currently live. 

Cosmopolitan cities across the globe may differ culturally and politically, but people travelling or migrating to these cities are likely to find places of interest that are quite similar to those in their home cities. 