# Week 3 Lab - Segmenting and Clustering Neighborhoods in Toronto

## 1. Scrape borough data from the Wikipedia page and store it in a data frame

In [1]:
import pandas as pd
import urllib.request
from lxml import html

### Load the html and get the "table" element

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_str = urllib.request.urlopen(url).read()
html_element = html.fromstring(html_str)

In [3]:
table_rows = html_element.xpath("//table[contains(@class, 'wikitable')]/tbody/tr")
print("Each row has", len(table_rows[0]), "children")
print("Children html", html.tostring(table_rows[0], pretty_print=True))

Each row has 3 children
Children html b'<tr>\n<th>Postal Code\n</th>\n<th>Borough\n</th>\n<th>Neighborhood\n</th>\n</tr>\n\n'


### Create data frame to store the scraped data.

In [4]:
column_names = ["PostalCode", "Borough", "Neighborhood"] 
neighborhoods = pd.DataFrame(columns=column_names)

If a postal code doesn't have a borough assigned, exclude it from the data frame.

In [5]:
rows = []
for row in table_rows[1:]:
    if len(row) == 3:
        postal = row[0].text.rstrip()
        borough = row[1].text.rstrip()
        neighbor = row[2].text.rstrip()

        if borough != "Not assigned":
            rows.append({"PostalCode": postal,
                         "Borough": borough,
                         "Neighborhood": neighbor})

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the data frame once.
neighborhoods = pd.DataFrame(rows, columns=column_names)



In [110]:
neighborhoods.shape

(103, 5)
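
The filtering rule above can be tried without any network access. The snippet below is a minimal sketch, assuming each table row has already been reduced to a tuple of three cell texts. `clean_rows` and `sample_rows` are hypothetical names, and the sample values are illustrative only.

```python
def clean_rows(raw_rows):
    """Strip trailing whitespace and drop rows whose borough is 'Not assigned'."""
    cleaned = []
    for postal, borough, neighborhood in raw_rows:
        postal, borough, neighborhood = (
            postal.rstrip(), borough.rstrip(), neighborhood.rstrip())
        if borough != "Not assigned":
            cleaned.append({"PostalCode": postal,
                            "Borough": borough,
                            "Neighborhood": neighborhood})
    return cleaned


# Illustrative cell texts; the trailing "\n" mimics the raw HTML cells.
sample_rows = [
    ("M1A\n", "Not assigned\n", "Not assigned\n"),
    ("M3A\n", "North York\n", "Parkwoods\n"),
    ("M4A\n", "North York\n", "Victoria Village\n"),
]
records = clean_rows(sample_rows)
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`.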

## 2. Get the latitude and longitude for every postal code

### Build a helper function to use geocoder package

The geocoder package does not always return a valid result, so the helper keeps retrying until it gets a latitude and longitude pair.

In [132]:
import geocoder

def getLatLong(postal_cd):
    # initialize the result to None
    lat_lng_coords = None

    # geocoder can return None, so retry until coordinates come back
    # (note: this loops forever if the service never answers)
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_cd))
        lat_lng_coords = g.latlng

    return lat_lng_coords
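
Because the loop above never gives up, a bounded retry may be safer. The following is a minimal sketch, not the notebook's method: `retry_lookup`, `max_attempts`, `delay_seconds` and the stand-in `fake_lookup` are all hypothetical names introduced for illustration.

```python
import time


def retry_lookup(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds * attempt)  # wait a little longer each time
    raise RuntimeError(
        "no coordinates for {!r} after {} attempts".format(query, max_attempts))


# Stand-in for the geocoding call: returns None twice, then answers.
calls = {"count": 0}


def fake_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.753259, -79.329656]


coords = retry_lookup(fake_lookup, "M3A, Toronto, Ontario")
print(coords, calls["count"])
```

With the helper above, `lookup` could be something like `lambda q: geocoder.google(q).latlng`.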

### Use the helper function

For every postal code, call the helper function to get the latitude and longitude.  Store results in two separate lists, <i>latitudes</i> and <i>longitudes</i>.  After all the postal codes have been processed, add the lists as two new columns to the <b>neighborhoods</b> data frame.

In [9]:
latitudes = []
longitudes = []

for postalCode in neighborhoods["PostalCode"]:
    lat_long = getLatLong(postalCode)
    latitudes.append(lat_long[0])
    longitudes.append(lat_long[1])
    
neighborhoods["Latitude"] = latitudes
neighborhoods["Longitude"] = longitudes
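
The two-list bookkeeping above can also be written as a transpose. This is only a sketch of the idea on two coordinate pairs that appear in this notebook's tables; `pairs` is a hypothetical name.

```python
# Two (latitude, longitude) pairs taken from this notebook's output tables.
pairs = [(43.753259, -79.329656), (43.725882, -79.315572)]

# zip(*pairs) transposes the list of pairs into one tuple per coordinate.
latitudes, longitudes = (list(column) for column in zip(*pairs))
print(latitudes)
print(longitudes)
```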

Make sure the two columns are added.

In [10]:
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


## 3. Explore the neighborhoods on a map

### Get the latitude and longitude for Toronto, Canada

Use the latitude and longitude of Toronto as the center point of the map and display each postal code as a circle marker.  We will use the <b>Nominatim</b> geocoder class from the geopy library.

In [11]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = "Toronto Canada"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
toronto_lat = location.latitude
toronto_long = location.longitude
print('The geographical coordinates of Toronto, Canada are {}, {}.'.format(toronto_lat, toronto_long))

The geographical coordinates of Toronto, Canada are 43.6534817, -79.3839347.


### Display the map with folium

Use Toronto's latitude and longitude as the center point, then show each postal code as a circle marker on the map.  If you click on a circle marker, it shows the neighborhoods and the borough for that postal code.

In [12]:
import folium

print(toronto_lat, toronto_long)
map_toronto = folium.Map(location=[toronto_lat, toronto_long], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

43.6534817 -79.3839347


## 4. Find a borough to explore

The data frame holds <b>103</b> postal codes in Toronto, scraped from the Wikipedia page above.  The next cell finds the neighborhoods that appear under more than one postal code.  I will take the borough of the neighborhood that covers the most postal codes and explore it with the Foursquare API.

In [131]:
from collections import Counter

all_neighborhoods = []

for neighbor in neighborhoods["Neighborhood"]:
    all_neighborhoods = all_neighborhoods + neighbor.split(",")
    
counter = Counter(all_neighborhoods)
non_unique_neighborhoods = ["{}-{}".format(counter[k], k) for k in counter.keys() if counter[k] > 1]
non_unique_neighborhoods.sort(reverse=True)
    
print("unique neighborhoods count:", 
      len(counter.keys()), 
      "total neighborhoods count across all postal codes:",
      len(all_neighborhoods))
print("neighborhoods in multiple postal codes:",
       non_unique_neighborhoods)



unique neighborhoods count: 209 total neighborhoods count across all postal codes: 217
neighborhoods in multiple postal codes: ['4-Downsview', '3-Willowdale', '2-St. James Town', '2-Runnymede', '2-Don Mills']
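
One thing worth checking in the cell above: `split(",")` keeps the space that follows each comma, so a name can be counted separately with and without a leading space. The sketch below uses neighborhood strings that appear in this notebook to show the effect. Stripping each piece is a suggested variation, not what the cell above does, and applying it could change the counts printed above.

```python
from collections import Counter

names = [
    "Lawrence Manor, Lawrence Heights",
    "Bedford Park, Lawrence Manor East",
    "Bathurst Manor, Wilson Heights, Downsview North",
    "Downsview",
]

# Same splitting as the cell above: pieces after a comma keep their leading space.
raw_parts = [part for name in names for part in name.split(",")]

# Variation: strip whitespace so identical names compare equal.
stripped_parts = [part.strip() for part in raw_parts]

print(repr(raw_parts[1]))
print(Counter(stripped_parts).most_common(3))
```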


<b>Downsview</b> is the neighborhood that covers the most postal codes.  Now let's find the borough for those postal codes.

In [130]:
neighborhoods[neighborhoods["Neighborhood"].str.contains("Downsview")]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
28,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259
40,M3K,North York,Downsview,43.737473,-79.464763
46,M3L,North York,Downsview,43.739015,-79.506944
53,M3M,North York,Downsview,43.728496,-79.495697
60,M3N,North York,Downsview,43.761631,-79.520999


From the results above, this notebook will explore the <b>North York</b> borough.

## 5. Create a data frame with North York data only and display the neighborhoods from this borough on the map

In [133]:
north_york = neighborhoods.loc[neighborhoods["Borough"] == "North York"]
north_york.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
7,M3B,North York,Don Mills,43.745906,-79.352188
10,M6B,North York,Glencairn,43.709577,-79.445073


In [15]:
address = "North York, Toronto"

geolocator = Nominatim(user_agent="north_york_explorer")
location = geolocator.geocode(address)
north_york_lat = location.latitude
north_york_long = location.longitude
print('The geographical coordinates of North York, Toronto, Canada are {}, {}.'.format(north_york_lat, north_york_long))

The geographical coordinates of North York, Toronto, Canada are 43.7543263, -79.44911696639593.


In [16]:
map_north_york_toronto = folium.Map(location=[north_york_lat, north_york_long], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(north_york['Latitude'], north_york['Longitude'], north_york['Borough'], north_york['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york_toronto)  
    
map_north_york_toronto

## 6. Use Foursquare API to explore the North York borough and prepare data for clustering

In [28]:
CLIENT_ID = "xxxxx"
CLIENT_SECRET = "xxxxxx"
VERSION = "20200525" 

### Use the first neighborhood in the list in North York and explore the venues.

In [141]:
print("The first neighborhood in North York is", north_york.loc[0, 'Neighborhood'])

The first neighborhood in North York is Parkwoods


In [137]:
neighborhood_latitude = north_york.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = north_york.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = north_york.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


### The response should contain at most 10 venues within a 500-meter radius centered around the Parkwoods neighborhood.

Build the request:

In [143]:
LIMIT = 10
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=xxxxx&client_secret=xxxxxx&v=20200524&ll=43.7532586,-79.3296565&radius=500&limit=10'
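
As an alternative to a long format string, the query string can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`. This is a sketch under the same endpoint and parameters as above. `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL from a dictionary of query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value readable instead of %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


example_url = build_explore_url("xxxxx", "xxxxxx", "20200525",
                                43.7532586, -79.3296565, 500, 10)
print(example_url)
```

`requests.get` also accepts a `params` dictionary, which avoids building the string by hand.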

Call Foursquare:

In [138]:
import requests

results = requests.get(url).json()

Parse the response and extract the data we want to explore:

In [162]:
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on
    # whether the columns have been renamed yet
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
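
The category lookup can also be done on the raw items before they are flattened. The sketch below is an illustration, not the notebook's code: `first_category_name` is a hypothetical helper, and `sample_items` is made-up data shaped like the fields the notebook reads from the response.

```python
def first_category_name(item):
    """Return the first category name of an explore item, or None if it has none."""
    categories = item.get("venue", {}).get("categories", [])
    return categories[0]["name"] if categories else None


# Made-up items shaped like the fields the notebook reads from the response.
sample_items = [
    {"venue": {"name": "Brookbanks Park", "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed venue", "categories": []}},
]
category_names = [first_category_name(item) for item in sample_items]
print(category_names)
```

Returning `None` for an empty category list avoids an `IndexError` on venues that carry no category.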


In [164]:
print('{} venues were returned by Foursquare for the Parkwoods neighborhood.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare for the Parkwoods neighborhood.


### Now for every neighborhood in North York, get the nearby venues by following the same steps as for Parkwoods.

In [168]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [169]:
north_york_venues = getNearbyVenues(names=north_york['Neighborhood'],
                                   latitudes=north_york['Latitude'],
                                   longitudes=north_york['Longitude']
                                  )
print("finished")

finished


In [177]:
print(north_york_venues.shape)
north_york_venues.head()

(126, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [178]:
north_york_venues.Neighborhood.value_counts()

Don Mills                                          16
Downsview                                          14
Bathurst Manor, Wilson Heights, Downsview North    10
Fairview, Henry Farm, Oriole                       10
Willowdale, Willowdale East                        10
Lawrence Manor, Lawrence Heights                   10
Bedford Park, Lawrence Manor East                  10
Willowdale, Willowdale West                         7
Northwood Park, York University                     7
Victoria Village                                    5
Hillcrest Village                                   5
Glencairn                                           4
Bayview Village                                     4
York Mills West                                     3
North Park, Maple Leaf Park, Upwood Park            3
Parkwoods                                           2
Humberlea, Emery                                    2
Humber Summit                                       2
York Mills, Silver Hills    

In [179]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 76 unique categories.


### Create a data frame with all the possible venue categories as columns.  Each row is a venue returned for a North York neighborhood.  The column matching the venue's category is marked 1 and every other column is marked 0.

In [181]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.shape

(126, 77)

#### Next, let's group rows by neighborhood by taking the mean of the frequency of occurrence of each category.

In [183]:
north_york_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()
north_york_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Arts & Crafts Store,Athletics & Sports,Bakery,Bank,Bar,Baseball Field,Bike Shop,...,Salon / Barbershop,Shopping Mall,Snack Place,Sporting Goods Shop,Steakhouse,Sushi Restaurant,Tea Room,Thai Restaurant,Toy / Game Store,Vietnamese Restaurant
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.1,0.0,0.0
3,Don Mills,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0625,...,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0
4,Downsview,0.0,0.071429,0.0,0.071429,0.0,0.071429,0.0,0.071429,0.0,...,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0
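
Averaging 0/1 columns within a neighborhood gives the share of that neighborhood's venues in each category. The plain-Python sketch below reproduces this on the first five venue rows shown earlier. Because it uses only those five rows, its numbers differ from the full table above. `venue_rows`, `counts` and `frequencies` are names chosen for the example.

```python
from collections import Counter, defaultdict

# (neighborhood, venue category) pairs from the first five venue rows shown earlier.
venue_rows = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Victoria Village", "Hockey Arena"),
    ("Victoria Village", "Coffee Shop"),
    ("Victoria Village", "Portuguese Restaurant"),
]

counts = defaultdict(Counter)
for neighborhood, category in venue_rows:
    counts[neighborhood][category] += 1

# Dividing each count by the neighborhood's venue total gives the same number
# as averaging that category's 0/1 column within the neighborhood.
frequencies = {
    neighborhood: {cat: n / sum(cat_counts.values()) for cat, n in cat_counts.items()}
    for neighborhood, cat_counts in counts.items()
}
print(frequencies["Parkwoods"])
```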


### Confirm the new size

In [184]:
north_york_grouped.shape

(20, 77)

There are 20 neighborhoods and 76 venue categories (the 77th column holds the neighborhood name).

### Print each neighborhood with the top 5 most common venues

In [186]:
num_top_venues = 5

for hood in north_york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = north_york_grouped[north_york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
            venue  freq
0     Coffee Shop   0.2
1      Restaurant   0.1
2  Ice Cream Shop   0.1
3   Deli / Bodega   0.1
4            Bank   0.1


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Bank  0.25
2                 Café  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                     venue  freq
0               Restaurant   0.1
1                     Café   0.1
2  Comfort Food Restaurant   0.1
3          Thai Restaurant   0.1
4              Coffee Shop   0.1


----Don Mills----
                  venue  freq
0   Japanese Restaurant  0.12
1        Clothing Store  0.06
2  Caribbean Restaurant  0.06
3                  Café  0.06
4            Restaurant  0.06


----Downsview----
                venue  freq
0       Grocery Store  0.14
1                Park  0.14
2            Bus Stop  0.07
3        Liquor Store  0.0

### Now put that into a dataframe, with top 10 most common venues for each neighborhood

In [50]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [187]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = north_york_grouped['Neighborhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Ice Cream Shop,Bridal Shop,Sushi Restaurant,Bank,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Diner,Comfort Food Restaurant
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Vietnamese Restaurant,Diner,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega
2,"Bedford Park, Lawrence Manor East",Café,Thai Restaurant,Coffee Shop,Sushi Restaurant,Juice Bar,Comfort Food Restaurant,Italian Restaurant,Restaurant,Indian Restaurant,Pub
3,Don Mills,Japanese Restaurant,Baseball Field,Caribbean Restaurant,Discount Store,Italian Restaurant,Concert Hall,Bike Shop,Restaurant,Café,Gym / Fitness Center
4,Downsview,Grocery Store,Park,Baseball Field,Gym / Fitness Center,Hotel,Liquor Store,Bus Stop,Food Truck,Snack Place,Airport
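
The `indicators` list with a fallback to "th" is enough for ten columns. If more columns were ever needed, a small helper could produce the suffixes, including cases such as 11th and 21st. The `ordinal` function below is a hypothetical helper sketched for that purpose, not part of the notebook.

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # covers 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns[:4])
```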


## 7. Clustering Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [235]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

north_york_grouped_clustering = north_york_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(north_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 3, 0, 0, 0, 4, 3])
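
To make the two alternating steps of *k*-means concrete, here is a deliberately simplified pure-Python sketch. It is not scikit-learn's implementation and differs from `KMeans`, for example in how the starting centroids are chosen: this toy version simply uses the first `k` points. The function name `simple_kmeans` and the toy points are assumptions made for the example.

```python
import math


def simple_kmeans(points, k, iterations=20):
    """Toy k-means: alternate an assignment step and a centroid update step."""
    # Deterministic start for the sketch: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point gets the label of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups, interleaved.
toy_points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (1.0, 0.9), (0.0, 0.1), (0.9, 1.0)]
labels, centroids = simple_kmeans(toy_points, k=2)
print(labels)
```

In the notebook, each point is one row of `north_york_grouped_clustering`, that is, one neighborhood's vector of category frequencies.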

In [236]:
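# drop the labels left over from a previous run so the insert below can be re-executed
neighborhoods_venues_sorted = neighborhoods_venues_sorted.drop(columns=["Cluster Labels"], errors="ignore")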

In [237]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = north_york

# join the sorted venues onto north_york to add latitude/longitude for each neighborhood
north_york_merged = north_york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

north_york_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3,Park,Food & Drink Shop,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,French Restaurant,Hockey Arena,Coffee Shop,Portuguese Restaurant,Intersection,Fast Food Restaurant,Falafel Restaurant,Event Space,Dog Run,Discount Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0,Clothing Store,Vietnamese Restaurant,Furniture / Home Store,Boutique,Coffee Shop,Event Space,Accessories Store,Snack Place,Convenience Store,Chocolate Shop
7,M3B,North York,Don Mills,43.745906,-79.352188,0,Japanese Restaurant,Baseball Field,Caribbean Restaurant,Discount Store,Italian Restaurant,Concert Hall,Bike Shop,Restaurant,Café,Gym / Fitness Center
10,M6B,North York,Glencairn,43.709577,-79.445073,0,Bakery,Japanese Restaurant,Italian Restaurant,Pub,Vietnamese Restaurant,Deli / Bodega,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store
13,M3C,North York,Don Mills,43.7259,-79.340923,0,Japanese Restaurant,Baseball Field,Caribbean Restaurant,Discount Store,Italian Restaurant,Concert Hall,Bike Shop,Restaurant,Café,Gym / Fitness Center
27,M2H,North York,Hillcrest Village,43.803762,-79.363452,0,Golf Course,Mediterranean Restaurant,Fast Food Restaurant,Pool,Dog Run,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping
28,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.754328,-79.442259,0,Coffee Shop,Ice Cream Shop,Bridal Shop,Sushi Restaurant,Bank,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Diner,Comfort Food Restaurant
33,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,0,Movie Theater,Shopping Mall,Burger Joint,Chocolate Shop,Restaurant,Salon / Barbershop,Clothing Store,Bakery,Toy / Game Store,Tea Room
34,M3J,North York,"Northwood Park, York University",43.76798,-79.487262,0,Furniture / Home Store,Falafel Restaurant,Caribbean Restaurant,Massage Studio,Miscellaneous Shop,Bar,Coffee Shop,Fast Food Restaurant,Event Space,Dog Run


In [238]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[north_york_lat, north_york_long], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(north_york_merged['Latitude'], north_york_merged['Longitude'], north_york_merged['Neighborhood'], north_york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 8. Examine Clusters

In [254]:
list(range(5, north_york_merged.shape[1]))
north_york_merged.shape[1]

north_york_merged.columns[[0, 1, 2]]

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')
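
The positional selection used in this section, `[0, 1, 2] + list(range(5, n))`, keeps the first three columns and everything from the cluster label onward, skipping the latitude and longitude. The sketch below replays that indexing on a plain list of the column names from the merged table. `merged_columns` is a name chosen for the example.

```python
# Column names of north_york_merged, copied from the merged table header.
merged_columns = (
    ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude", "Cluster Labels"]
    + ["{} Most Common Venue".format(s) for s in
       ["1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th", "10th"]]
)

positions = [0, 1, 2] + list(range(5, len(merged_columns)))
selected = [merged_columns[i] for i in positions]
print(selected)
```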

#### Cluster 1

In [255]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 0, north_york_merged.columns[[0, 1, 2] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M4A,North York,Victoria Village,0,French Restaurant,Hockey Arena,Coffee Shop,Portuguese Restaurant,Intersection,Fast Food Restaurant,Falafel Restaurant,Event Space,Dog Run,Discount Store
3,M6A,North York,"Lawrence Manor, Lawrence Heights",0,Clothing Store,Vietnamese Restaurant,Furniture / Home Store,Boutique,Coffee Shop,Event Space,Accessories Store,Snack Place,Convenience Store,Chocolate Shop
7,M3B,North York,Don Mills,0,Japanese Restaurant,Baseball Field,Caribbean Restaurant,Discount Store,Italian Restaurant,Concert Hall,Bike Shop,Restaurant,Café,Gym / Fitness Center
10,M6B,North York,Glencairn,0,Bakery,Japanese Restaurant,Italian Restaurant,Pub,Vietnamese Restaurant,Deli / Bodega,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store
13,M3C,North York,Don Mills,0,Japanese Restaurant,Baseball Field,Caribbean Restaurant,Discount Store,Italian Restaurant,Concert Hall,Bike Shop,Restaurant,Café,Gym / Fitness Center
27,M2H,North York,Hillcrest Village,0,Golf Course,Mediterranean Restaurant,Fast Food Restaurant,Pool,Dog Run,Deli / Bodega,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping
28,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",0,Coffee Shop,Ice Cream Shop,Bridal Shop,Sushi Restaurant,Bank,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Diner,Comfort Food Restaurant
33,M2J,North York,"Fairview, Henry Farm, Oriole",0,Movie Theater,Shopping Mall,Burger Joint,Chocolate Shop,Restaurant,Salon / Barbershop,Clothing Store,Bakery,Toy / Game Store,Tea Room
34,M3J,North York,"Northwood Park, York University",0,Furniture / Home Store,Falafel Restaurant,Caribbean Restaurant,Massage Studio,Miscellaneous Shop,Bar,Coffee Shop,Fast Food Restaurant,Event Space,Dog Run
39,M2K,North York,Bayview Village,0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Vietnamese Restaurant,Diner,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega


<b>Observation:</b>
If we examine the rows (neighborhoods) in this cluster closely, almost every row has at least one venue that could host an event for hundreds of people.  Take the <i>Victoria Village</i> neighborhood as an example: it has a hockey arena and an event space.  Another example is the <i>Don Mills</i> neighborhood, which covers two postal codes (M3B and M3C); both list a baseball field.

#### Cluster 2

In [256]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 1, north_york_merged.columns[[0, 1, 2] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,M2M,North York,"Willowdale, Newtonbrook",1,Piano Bar,Vietnamese Restaurant,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega,Discount Store


<b>Observation:</b>
This is the only neighborhood/postal code combination in this cluster.  The top five venues are all food or drink related.

#### Cluster 3

In [257]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 2, north_york_merged.columns[[0, 1, 2] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,M2L,North York,"York Mills, Silver Hills",2,Cafeteria,Vietnamese Restaurant,Discount Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega,Diner,Dog Run


<b>Observation:</b>
This is also the only neighborhood/postal code combination in this cluster.  The top 10 venues are a mixture of restaurants, retail stores and event space.

#### Cluster 4

In [258]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 3, north_york_merged.columns[[0, 1, 2] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,3,Park,Food & Drink Shop,Diner,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega
40,M3K,North York,Downsview,3,Grocery Store,Park,Baseball Field,Gym / Fitness Center,Hotel,Liquor Store,Bus Stop,Food Truck,Snack Place,Airport
46,M3L,North York,Downsview,3,Grocery Store,Park,Baseball Field,Gym / Fitness Center,Hotel,Liquor Store,Bus Stop,Food Truck,Snack Place,Airport
49,M6L,North York,"North Park, Maple Leaf Park, Upwood Park",3,Park,Bakery,Construction & Landscaping,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Deli / Bodega,Diner
53,M3M,North York,Downsview,3,Grocery Store,Park,Baseball Field,Gym / Fitness Center,Hotel,Liquor Store,Bus Stop,Food Truck,Snack Place,Airport
57,M9M,North York,"Humberlea, Emery",3,Construction & Landscaping,Baseball Field,Vietnamese Restaurant,Dog Run,Comfort Food Restaurant,Concert Hall,Convenience Store,Deli / Bodega,Diner,Discount Store
60,M3N,North York,Downsview,3,Grocery Store,Park,Baseball Field,Gym / Fitness Center,Hotel,Liquor Store,Bus Stop,Food Truck,Snack Place,Airport
66,M2P,North York,York Mills West,3,Park,Bank,Convenience Store,Discount Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Deli / Bodega,Diner


<b>Observation:</b>
Most neighborhoods in this cluster list a park among their most common venues.

#### Cluster 5

In [244]:
north_york_merged.loc[north_york_merged['Cluster Labels'] == 4, north_york_merged.columns[[1] + list(range(5, north_york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,North York,4,Pizza Place,Shopping Mall,Vietnamese Restaurant,Diner,Coffee Shop,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Deli / Bodega


<b>Observation:</b>
This is also the only neighborhood/postal code combination in this cluster.  The top 10 venues are a mixture of quick grab-and-go food services, a shopping mall and potentially professional services.

## 9. Conclusion

The selected borough <b>North York</b> has only 20 neighborhoods, so with 5 clusters there is potentially not enough data to group the neighborhoods clearly based on the venues returned from Foursquare.  If the same analysis were run on a different borough or on multiple boroughs, it might produce more conclusive results than the analysis above.  To summarize, the neighborhoods in North York can be divided into 5 clusters:

Cluster 1: venues for event spaces <br/>
Cluster 2: restaurants <br/>
Cluster 3: mixtures <br/>
Cluster 4: parks <br/>
Cluster 5: shopping