# The objective of this project is to compare the neighborhoods of Manhattan and Toronto and to find which Toronto neighborhoods are similar to Manhattan's, to help people moving from Manhattan to Toronto.
# To achieve this, we extract venue data for the neighborhoods of both Manhattan and Toronto using Foursquare and feed the combined data from both locations to a clustering algorithm (k-means). The expectation is that similar neighborhoods from the two cities will fall in the same cluster, which lets us identify them by their similarity. We will accomplish this using the following steps:

## 1. Download all dependencies

In [94]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

from bs4 import BeautifulSoup # BeautifulSoup libraries for web scraping

! pip install folium==0.5.0
import folium # map rendering library



## 2. Getting Manhattan Data

### 2.1. Download New York Data

In [95]:
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

### 2.2. Load and explore New York Data

In [96]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

All the relevant data is in the `features` key, which holds a list of neighborhoods. We therefore define a new variable that points to this list.

In [97]:
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]  # checking first item in the list

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
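
The record above stores its point as `[longitude, latitude]`, which is the GeoJSON ordering and is easy to swap by accident. The following is a minimal sketch of pulling the four fields of interest out of one feature. `feature_to_record` is a hypothetical helper name, and `sample_feature` is a trimmed copy of the record shown above.

```python
# Hypothetical helper: flatten one GeoJSON feature into a plain record.
def feature_to_record(feature):
    props = feature['properties']
    # GeoJSON points are ordered [longitude, latitude]
    lon, lat = feature['geometry']['coordinates']
    return {
        'Borough': props['borough'],
        'Neighborhood': props['name'],
        'Latitude': lat,
        'Longitude': lon,
    }


# Trimmed copy of the first feature shown above
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

record = feature_to_record(sample_feature)
print(record)
```

Unpacking the pair into named variables keeps the latitude/longitude order explicit at the one place where it matters.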

### 2.3. Transform the data into a _pandas_ dataframe

In [98]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one record per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow when called in a loop)
records = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    records.append({'Borough': borough,
                    'Neighborhood': neighborhood_name,
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

# instantiate the dataframe from the collected records
ny_neighborhoods = pd.DataFrame(records, columns=column_names)

In [99]:
ny_neighborhoods.head()  # examining resulting dataframe

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


### 2.4. Extracting only Manhattan data from NY data

In [100]:
man_neighborhood = ny_neighborhoods[ny_neighborhoods.Borough=='Manhattan'].reset_index(drop=True)
man_neighborhood

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688
5,Manhattan,Manhattanville,40.816934,-73.957385
6,Manhattan,Central Harlem,40.815976,-73.943211
7,Manhattan,East Harlem,40.792249,-73.944182
8,Manhattan,Upper East Side,40.775639,-73.960508
9,Manhattan,Yorkville,40.77593,-73.947118


In [101]:
print('The dataframe has {} neighborhoods.'.format(man_neighborhood.shape[0])) # No. of Manhattan neighborhoods

The dataframe has 40 neighborhoods.


## 3. Obtaining Toronto Data

### 3.1. Web scraping Canadian postal code data using Beautiful Soup

In [102]:
response = requests.get(
	url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M",
)
print(response.status_code)

200


In [103]:
soup = BeautifulSoup(response.content,'html.parser')
My_table = soup.find('table',{'class':'wikitable sortable'}) #Find wikitable on the url

### 3.2. Extract Data and fill the Dataframe

In [104]:
# Initiate empty arrays to hold data to be extracted
codes = []
neighborhoods = []
boroughs = []

# Extract data for each column and fill the respective column arrays
rows = My_table.find_all('tr')  # Find all rows in the table
for row in rows:
    cells = row.find_all('td')  # Find columns
    # Ignore first row as it has headers
    if len(cells) > 1:     
        borough = cells[1]
        boroughs.append(borough.text.strip())
        
        code = cells[0]
        codes.append(code.text.strip())
            
        neighborhood = cells[2]
        neighborhoods.append(neighborhood.text.strip())

In [105]:
# instantiate and fill the dataframe
df = pd.DataFrame()
df['Postal Code']=codes
df['Borough'] = boroughs
df['Neighborhood'] = neighborhoods

### 3.3. Remove rows where Borough is 'Not assigned'

In [106]:
df1 = df[df.Borough != 'Not assigned']
print('The dataframe has {} rows and {} columns.'.format(
        df1.shape[0],df1.shape[1]))

The dataframe has 103 rows and 3 columns.


### 3.4. Add Latitude and Longitude coordinates

In [107]:
df_postal = pd.read_csv('https://cocl.us/Geospatial_data') # Read coordinates from csv file
df_postal.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### 3.5. Merge the two dataframes: 1) data scraped from Wikipedia and 2) coordinates from the CSV file

In [108]:
df_merged = pd.merge(left=df1, right=df_postal, how='left', left_on='Postal Code', right_on='Postal Code')
df_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
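
A left merge keeps every scraped row even when no coordinates exist for its postal code, in which case the latitude and longitude come back as NaN. The sketch below shows one way to check for that using the `indicator` option of `pd.merge`. The two small dataframes are made-up stand-ins for the scraped table and the coordinates file, and `M9Z` is an invented postal code.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file
scraped = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],   # 'M9Z' is a made-up code
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# indicator=True adds a '_merge' column saying where each row came from
checked = pd.merge(scraped, coords, how='left', on='Postal Code', indicator=True)

# Rows marked 'left_only' found no coordinates and carry NaN values
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code received coordinates.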


### 3.6. Select only boroughs that contain 'Toronto' in their names and drop the Postal Code column

In [109]:
toronto_boroughs = df_merged[df_merged['Borough'].astype(str).str.contains('Toronto')].reset_index(drop=True)
del toronto_boroughs['Postal Code']
toronto_boroughs.head(10)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,Downtown Toronto,St. James Town,43.651494,-79.375418
4,East Toronto,The Beaches,43.676357,-79.293031
5,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,Downtown Toronto,Christie,43.669542,-79.422564
8,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


## 4. Plotting Manhattan and Toronto Neighborhoods

### 4.1. Use geopy library to get the latitude and longitude values of Manhattan

In [110]:
! pip install geopy
from geopy.geocoders import Nominatim



In [111]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
man_latitude = location.latitude
man_longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(man_latitude, man_longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


### 4.2. Use geopy library to get the latitude and longitude values of Toronto

In [112]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
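
Both lookups above assume the geocoder finds the address. geopy's `geocode` returns `None` when there is no match, so reading `.latitude` would then fail with an `AttributeError`. The sketch below is one possible guard. `geocode_or_raise` is a hypothetical helper, and the `Fake*` classes are stand-ins so the example runs without network access.

```python
# Hypothetical wrapper around any geocoder object exposing .geocode(address)
def geocode_or_raise(geolocator, address):
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


# Stand-in classes so the sketch runs offline (not part of geopy)
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


class FakeGeolocator:
    def __init__(self, known):
        self.known = known

    def geocode(self, address):
        # Mirrors the "return None when nothing is found" behaviour
        return self.known.get(address)


fake = FakeGeolocator({'Toronto': FakeLocation(43.6534817, -79.3839347)})
toronto_coords = geocode_or_raise(fake, 'Toronto')

try:
    geocode_or_raise(fake, 'Atlantis')
    lookup_failed = False
except ValueError:
    lookup_failed = True

print(toronto_coords, lookup_failed)
```

With a real `Nominatim` instance the same helper could be called as `geocode_or_raise(geolocator, 'Toronto')`.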


### 4.3. Create a map of Manhattan neighborhoods

In [113]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[man_latitude, man_longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(man_neighborhood['Latitude'], man_neighborhood['Longitude'], man_neighborhood['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

### 4.4. Create a map of Toronto neighborhoods

In [114]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood,borough in zip(toronto_boroughs['Latitude'], toronto_boroughs['Longitude'], toronto_boroughs['Neighborhood'], toronto_boroughs['Borough']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 5. Use Foursquare to explore nearby venues for both Manhattan and Toronto

### 5.1. Combine the two dataframes to concatenate Manhattan and Toronto data

In [115]:
data = pd.concat([man_neighborhood, toronto_boroughs]).reset_index(drop=True)
data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688
5,Manhattan,Manhattanville,40.816934,-73.957385
6,Manhattan,Central Harlem,40.815976,-73.943211
7,Manhattan,East Harlem,40.792249,-73.944182
8,Manhattan,Upper East Side,40.775639,-73.960508
9,Manhattan,Yorkville,40.77593,-73.947118


### 5.2. Define Foursquare Credentials and Version

In [116]:
CLIENT_ID = 'Z3NOC52B4OR0PVANWGQB320XCOIXEJPI1JVKKKRYWSHT0NPA' # your Foursquare ID
CLIENT_SECRET = 'NL2W1LIROL4IEV555ZNN5PDPCO4EQBKD1KPVY30U2AM0JEIX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

### 5.3. Function to obtain category for each venue

In [117]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
    

In [118]:
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    man_latitude, 
    man_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()  #Send the GET request and examine the results
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Central Park Tennis Center,Tennis Court,40.789313,-73.961862
1,North Meadow,Park,40.792027,-73.959853
2,East Meadow,Field,40.79016,-73.955498
3,Central Park - North Meadow Recreation Center,Playground,40.790939,-73.960304
4,Oldest Tree in Central Park,Park,40.789188,-73.957867


In [119]:
def get_categories():
    try:
        with open("categories.json") as data:
            categories = json.load(data)
    except IOError:
        url = 'https://api.foursquare.com/v2/venues/categories'
        params = {
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "v": VERSION,
        }
        categories = requests.get(url, params=params).json()["response"]["categories"]
    return categories

# recursively collect the name of a node and of all its descendants into `categories`
def collect_categories(node, categories):
    categories.append(node["name"])
    if not node["categories"]:
        return
    for sub_node in node['categories']:
        collect_categories(sub_node, categories)
        
# build one dictionary mapping each top-level category to the list of names in its subtree
categories_list = {}
for i in get_categories():
    categories = []
    collect_categories(i, categories)
    categories_list[i["name"]] = categories
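
The dictionary built above maps each top-level category to every name in its subtree, so finding the parent of a venue category means scanning those lists. A reverse lookup that maps each name directly to its top-level ancestor makes that step a single dictionary access. The sketch below uses a small made-up tree in the same nested shape (`name` plus a `categories` list); `iter_names` and `parent_of` are hypothetical names.

```python
# Toy tree in the same nested shape as the category data used above:
# each node has a "name" and a (possibly empty) "categories" list.
toy_tree = [
    {"name": "Food", "categories": [
        {"name": "Pizza Place", "categories": []},
        {"name": "Asian Restaurant", "categories": [
            {"name": "Sushi Restaurant", "categories": []},
        ]},
    ]},
    {"name": "Shop & Service", "categories": [
        {"name": "Pharmacy", "categories": []},
    ]},
]


def iter_names(node):
    """Yield the name of a node and of every descendant."""
    yield node["name"]
    for child in node.get("categories", []):
        yield from iter_names(child)


# Map every category name straight to its top-level ancestor
parent_of = {}
for top in toy_tree:
    for name in iter_names(top):
        # setdefault keeps the first parent if a name appears twice
        parent_of.setdefault(name, top["name"])

print(parent_of["Sushi Restaurant"])
```

With such a mapping, a general category column could be derived with `Series.map(parent_of)`, which yields NaN for any name missing from the tree instead of silently dropping it.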


### 5.4. Function to obtain info for nearby venues for each neighborhood within 500m radius

In [120]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
                             'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

venues = getNearbyVenues(names=data['Neighborhood'],
                         latitudes=data['Latitude'],
                         longitudes=data['Longitude'])
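
The function above indexes straight into `["response"]['groups'][0]['items']` and into `categories[0]`, so a response without groups or a venue without a category would raise an exception partway through the loop. The sketch below separates the parsing step and makes it tolerant of both cases. `extract_venue_rows` is a hypothetical helper, the payloads are hand-written in the shape the function above relies on, and the error payload is only an assumption about what a failed request might look like.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response into a list of row tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written payloads in the shape the function above indexes into
ok_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': "Arturo's",
               'location': {'lat': 40.874412, 'lng': -73.910271},
               'categories': [{'name': 'Pizza Place'}]}},
    {'venue': {'name': 'Unlabelled Spot',   # made-up venue
               'location': {'lat': 40.875, 'lng': -73.909},
               'categories': []}},
]}]}}
error_payload = {'meta': {'code': 429}, 'response': {}}  # assumed error shape

good_rows = extract_venue_rows('Marble Hill', 40.876551, -73.91066, ok_payload)
empty_rows = extract_venue_rows('Marble Hill', 40.876551, -73.91066, error_payload)
print(len(good_rows), len(empty_rows))
```

Keeping the parsing separate from the HTTP call also makes it possible to test the row extraction without contacting the API.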

In [121]:
venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop


In [122]:
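venueCat1 = []
for venue_category in venues["Venue Category"]:
    # append exactly one value per venue so the list stays aligned with the rows
    general_category = None
    for key in categories_list.keys():
        if venue_category in categories_list[key]:
            general_category = key
            break
    venueCat1.append(general_category)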

venues["General Venue Category"] = venueCat1
venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,General Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place,Food
1,Marble Hill,40.876551,-73.91066,Bikram Yoga,40.876844,-73.906204,Yoga Studio,Outdoors & Recreation
2,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner,Food
3,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop,Food
4,Marble Hill,40.876551,-73.91066,Starbucks,40.877531,-73.905582,Coffee Shop,Food
5,Marble Hill,40.876551,-73.91066,Astral Fitness & Wellness Center,40.876705,-73.906372,Gym,Outdoors & Recreation
6,Marble Hill,40.876551,-73.91066,Starbucks,40.873755,-73.908613,Coffee Shop,Food
7,Marble Hill,40.876551,-73.91066,Rite Aid,40.875467,-73.908906,Pharmacy,Shop & Service
8,Marble Hill,40.876551,-73.91066,Blink Fitness,40.877271,-73.905595,Gym,Outdoors & Recreation
9,Marble Hill,40.876551,-73.91066,T.J. Maxx,40.877232,-73.905042,Department Store,Shop & Service


In [123]:
venues.groupby('Neighborhood').count() #check how many venues were returned for each neighborhood

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,General Venue Category
Battery Park City,66,66,66,66,66,66,66
Berczy Park,55,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16,16
Carnegie Hill,86,86,86,86,86,86,86
Central Bay Street,68,68,68,68,68,68,68
Central Harlem,45,45,45,45,45,45,45
Chelsea,100,100,100,100,100,100,100
Chinatown,100,100,100,100,100,100,100


In [124]:
venues.shape

(4827, 8)

In [125]:
print('There are {} unique categories and {} general categories.'.format(len(venues['Venue Category'].unique()),(len(venues['General Venue Category'].unique())))) #find out how many unique categories can be curated from all the returned venues

There are 368 unique categories and 9 general categories.


### 5.5. Analyze Each Neighborhood

In [126]:
# one hot encoding
onehot = pd.get_dummies(venues[['General Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,Neighborhood,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Marble Hill,0,0,1,0,0,0,0,0,0
1,Marble Hill,0,0,0,0,1,0,0,0,0
2,Marble Hill,0,0,1,0,0,0,0,0,0
3,Marble Hill,0,0,1,0,0,0,0,0,0
4,Marble Hill,0,0,1,0,0,0,0,0,0


### 5.6. Group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [127]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped

Unnamed: 0,Neighborhood,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Battery Park City,0.060606,0.0,0.272727,0.045455,0.257576,0.060606,0.0,0.19697,0.106061
1,Berczy Park,0.090909,0.0,0.509091,0.090909,0.054545,0.0,0.0,0.236364,0.018182
2,"Brockton, Parkdale Village, Exhibition Place",0.086957,0.0,0.478261,0.130435,0.086957,0.0,0.0,0.173913,0.043478
3,"Business reply mail Processing Centre, South C...",0.0,0.0,0.25,0.0625,0.25,0.0,0.0,0.375,0.0625
4,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0625,0.0625,0.125,0.0,0.0,0.0625,0.6875
5,Carnegie Hill,0.034884,0.0,0.476744,0.069767,0.127907,0.023256,0.0,0.255814,0.011628
6,Central Bay Street,0.014706,0.0,0.779412,0.014706,0.044118,0.014706,0.0,0.117647,0.014706
7,Central Harlem,0.088889,0.0,0.533333,0.088889,0.088889,0.044444,0.0,0.155556,0.0
8,Chelsea,0.08,0.01,0.48,0.06,0.1,0.03,0.0,0.22,0.02
9,Chinatown,0.04,0.0,0.63,0.07,0.02,0.0,0.0,0.23,0.01


In [128]:
grouped.shape

(79, 10)
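
Why the mean gives a frequency: each one-hot column holds a 1 for venues in that category and a 0 otherwise, so its mean within a neighborhood is the share of that neighborhood's venues falling in the category. A small sketch on made-up data (the `toy*` names are hypothetical):

```python
import pandas as pd

# Made-up venues: neighborhood "A" has 3 Food and 1 Shop, "B" has 1 of each
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'General Venue Category': ['Food', 'Food', 'Food', 'Shop & Service',
                               'Food', 'Shop & Service'],
})

# One indicator column per category, one row per venue
toy_onehot = pd.get_dummies(toy['General Venue Category']).astype(int)
toy_onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# The mean of 0/1 indicators is the fraction of venues in each category
toy_grouped = toy_onehot.groupby('Neighborhood').mean()
print(toy_grouped)
```

Each row of the result sums to 1, which puts neighborhoods with very different venue counts on the same scale before clustering.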

### 5.7. Function to sort the venues in descending order

In [129]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
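
As an illustration of what this function returns, the sketch below applies the same idea to a single made-up row. It uses `Series.nlargest` as a compact alternative to sorting and slicing; this is a swapped-in technique, not the function defined above, and `toy_row` is hypothetical data.

```python
import pandas as pd

# One row in the same layout as `grouped`: a name followed by frequencies
toy_row = pd.Series({
    'Neighborhood': 'Toyville',        # made-up neighborhood
    'Food': 0.50,
    'Shop & Service': 0.30,
    'Nightlife Spot': 0.15,
    'Residence': 0.05,
})

# Skip the name, make the values numeric, and keep the 3 largest labels
top3 = toy_row.iloc[1:].astype(float).nlargest(3).index.tolist()
print(top3)
```

The result is the list of column labels, not the frequencies, which is what gets written into the "Most Common Venue" columns.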

### 5.8. Create a new dataframe and display the top 5 general categories for each neighborhood

In [130]:
import numpy as np 
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Battery Park City,Food,Outdoors & Recreation,Shop & Service,Travel & Transport,Professional & Other Places
1,Berczy Park,Food,Shop & Service,Nightlife Spot,Arts & Entertainment,Outdoors & Recreation
2,"Brockton, Parkdale Village, Exhibition Place",Food,Shop & Service,Nightlife Spot,Outdoors & Recreation,Arts & Entertainment
3,"Business reply mail Processing Centre, South C...",Shop & Service,Outdoors & Recreation,Food,Travel & Transport,Nightlife Spot
4,"CN Tower, King and Spadina, Railway Lands, Har...",Travel & Transport,Outdoors & Recreation,Shop & Service,Nightlife Spot,Food
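
The column labels above rely on a three-item suffix list with a fallback to "th", which is sufficient for the five ranks used here but would produce labels such as "21th" for larger values. A minimal sketch of a general suffix rule, with `ordinal` as a hypothetical helper name:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 6)]
print(labels)
print(ordinal(11), ordinal(22), ordinal(103))
```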


## 6. Clustering

### 6.1. Run _k_-means to cluster the neighborhoods

In [131]:
# set number of clusters
kclusters = 4

# Drop the Neighborhood column from features
grouped_clustering = grouped.drop(columns='Neighborhood')

# Set k-means parameters
k_means = KMeans(init = "k-means++", n_clusters = kclusters, n_init = 25)

# run k-means clustering
k_means.fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
k_means.labels_

array([2, 0, 0, 2, 3, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
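
The number of clusters is fixed at 4 above, and since no `random_state` is set, the integer labels can differ from run to run. One common way to sanity-check the choice of k is to compare silhouette scores across several values. The sketch below does this on synthetic two-blob data, not on the neighborhood features; the variable names and the range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: two tight, well-separated blobs (not the neighborhood data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

scores = {}
for k in range(2, 6):
    # random_state fixes the initialisation so reruns give the same labels
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette separates the data most cleanly
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The same loop could be run on `grouped_clustering` to see whether 4 clusters is a reasonable choice for the actual data.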

### 6.2. Create a new dataframe that includes the cluster as well as the top 5 general categories for each neighborhood

In [132]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', k_means.labels_)

merged_data = data

# merge grouped with 'data' to add latitude/longitude for each neighborhood
merged_data = merged_data.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

merged_data.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Manhattan,Marble Hill,40.876551,-73.91066,0,Food,Shop & Service,Outdoors & Recreation,Arts & Entertainment,Travel & Transport
1,Manhattan,Chinatown,40.715618,-73.994279,0,Food,Shop & Service,Nightlife Spot,Arts & Entertainment,Outdoors & Recreation
2,Manhattan,Washington Heights,40.851903,-73.9369,0,Food,Shop & Service,Outdoors & Recreation,Nightlife Spot,Travel & Transport
3,Manhattan,Inwood,40.867684,-73.92121,0,Food,Shop & Service,Nightlife Spot,Outdoors & Recreation,Travel & Transport
4,Manhattan,Hamilton Heights,40.823604,-73.949688,0,Food,Nightlife Spot,Shop & Service,Outdoors & Recreation,Professional & Other Places


### 6.3. Visualize clusters on the Manhattan map

In [133]:
# create map
map_clusters = folium.Map(location=[man_latitude, man_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged_data['Latitude'], merged_data['Longitude'], merged_data['Neighborhood'], merged_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 6.4. Visualize clusters on the Toronto map

In [134]:
# create map
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged_data['Latitude'], merged_data['Longitude'], merged_data['Neighborhood'], merged_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 7. Examining clusters

### Cluster 1

In [135]:
merged_data.loc[merged_data['Cluster Labels'] == 0, merged_data.columns[[0] +[1]+ list(range(5, merged_data.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Manhattan,Marble Hill,Food,Shop & Service,Outdoors & Recreation,Arts & Entertainment,Travel & Transport
1,Manhattan,Chinatown,Food,Shop & Service,Nightlife Spot,Arts & Entertainment,Outdoors & Recreation
2,Manhattan,Washington Heights,Food,Shop & Service,Outdoors & Recreation,Nightlife Spot,Travel & Transport
3,Manhattan,Inwood,Food,Shop & Service,Nightlife Spot,Outdoors & Recreation,Travel & Transport
4,Manhattan,Hamilton Heights,Food,Nightlife Spot,Shop & Service,Outdoors & Recreation,Professional & Other Places
5,Manhattan,Manhattanville,Food,Shop & Service,Outdoors & Recreation,Nightlife Spot,Travel & Transport
6,Manhattan,Central Harlem,Food,Shop & Service,Outdoors & Recreation,Nightlife Spot,Arts & Entertainment
7,Manhattan,East Harlem,Food,Shop & Service,Arts & Entertainment,Outdoors & Recreation,Nightlife Spot
8,Manhattan,Upper East Side,Food,Shop & Service,Outdoors & Recreation,Arts & Entertainment,Travel & Transport
9,Manhattan,Yorkville,Food,Shop & Service,Outdoors & Recreation,Nightlife Spot,Professional & Other Places


### Cluster 2

In [136]:
merged_data.loc[merged_data['Cluster Labels'] == 1, merged_data.columns[[0] + [1] + list(range(5, merged_data.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
59,Central Toronto,Roselawn,Outdoors & Recreation,Arts & Entertainment,Travel & Transport,Shop & Service,Residence
69,Central Toronto,"Moore Park, Summerhill East",Outdoors & Recreation,Travel & Transport,Shop & Service,Residence,Professional & Other Places
73,Downtown Toronto,Rosedale,Outdoors & Recreation,Travel & Transport,Shop & Service,Residence,Professional & Other Places


### Cluster 3

In [137]:
merged_data.loc[merged_data['Cluster Labels'] == 2, merged_data.columns[[0] + [1] + list(range(5, merged_data.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
11,Manhattan,Roosevelt Island,Outdoors & Recreation,Food,Shop & Service,Residence,Professional & Other Places
13,Manhattan,Lincoln Square,Food,Outdoors & Recreation,Arts & Entertainment,Shop & Service,College & University
28,Manhattan,Battery Park City,Food,Outdoors & Recreation,Shop & Service,Travel & Transport,Professional & Other Places
37,Manhattan,Stuyvesant Town,Outdoors & Recreation,Shop & Service,Travel & Transport,Nightlife Spot,Food
39,Manhattan,Hudson Yards,Food,Outdoors & Recreation,Shop & Service,Travel & Transport,Nightlife Spot
44,East Toronto,The Beaches,Outdoors & Recreation,Shop & Service,Nightlife Spot,Travel & Transport,Residence
47,Downtown Toronto,Christie,Shop & Service,Food,Outdoors & Recreation,Nightlife Spot,Travel & Transport
49,West Toronto,"Dufferin, Dovercourt Village",Shop & Service,Food,Nightlife Spot,Outdoors & Recreation,Arts & Entertainment
60,Central Toronto,Davisville North,Outdoors & Recreation,Shop & Service,Food,Travel & Transport,Arts & Entertainment
61,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",Outdoors & Recreation,Shop & Service,Food,Travel & Transport,Residence


### Cluster 4

In [138]:
merged_data.loc[merged_data['Cluster Labels'] == 3, merged_data.columns[[0] + [1]+ list(range(5, merged_data.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
58,Central Toronto,Lawrence Park,Travel & Transport,Professional & Other Places,Outdoors & Recreation,Shop & Service,Residence
72,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",Travel & Transport,Outdoors & Recreation,Shop & Service,Nightlife Spot,Food
