This notebook contains the material for the peer assignment: Segmenting and Clustering Neighborhoods in Toronto. Each question is given in a markdown cell, with the material for the question listed in the cells below it.

In [1]:
#Set up and import necessary packages

import pandas as pd
import numpy as np
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests
from pandas import json_normalize # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

PART ONE

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
To create the above dataframe:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [2]:
#PART ONE

# Import data as a pandas dataframe
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
PC_can = pd.read_html(url)[0]

PC_can.replace("Not assigned", np.nan, inplace = True) #replace "Not assigned" with NaN for easier processing
PC_can.dropna(subset=["Borough"], axis=0, inplace=True) #drop rows from dataframe with no assigned borough
PC_can.reset_index(drop=True, inplace=True) # reset index
PC_can.head() #visualize dataframe

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- This is already the case in the above dataframe
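
For reference, a minimal sketch of how the combining step could be done if the table still listed one neighbourhood per row. The `raw` dataframe and its values are a toy stand-in, not the scraped data.

```python
import pandas as pd

# Toy rows in a one-neighbourhood-per-row layout (illustrative values only)
raw = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Victoria Village"],
})

# Collapse duplicate postal codes, joining the neighbourhood names with ", "
combined = (
    raw.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

`sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.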

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [3]:
PC_can["Neighborhood"].isna().value_counts() #Check for any missing "Neighborhood" 

False    103
Name: Neighborhood, dtype: int64

Since there are no missing values in the "Neighborhood" column, no replacement is needed. Finally, use the .shape attribute to print the number of rows of the dataframe.

In [4]:
PC_can.shape #print the number of rows (and columns) in the dataframe.

(103, 3)
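
Had any neighbourhoods been missing, the rule stated earlier (use the borough name as the neighbourhood) could be applied along these lines. This is only a sketch on a toy dataframe; the `toy` name and its rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing neighbourhood (illustrative values only)
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": [np.nan, "Islington Avenue"],
})

# Fill each missing neighbourhood with the borough from the same row
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
print(toy)
```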

PART TWO

Use the Geocoder package or the csv file to create the following dataframe.

In [5]:
url_coords = "https://cocl.us/Geospatial_data" # CSV file where location data is held
PC_can_loc = pd.read_csv(url_coords) #read in data to pandas dataframe
PC_can_loc.head() #visualise data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
PC_2 = PC_can.join(PC_can_loc.set_index('Postal Code'), on = "Postal Code") #Join dataframes based on postal code
PC_2.head() #visualise new dataframe

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
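
The join above silently leaves NaN coordinates for any postal code that is absent from the CSV. A small sketch of how that could be checked, using toy dataframes (`left`, `right`) in which one postal code is deliberately left without coordinates:

```python
import pandas as pd

# Toy inputs: "M5A" has no entry in the coordinate table (on purpose)
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate="one_to_one" raises if either side has duplicate postal codes
merged = left.merge(right, on="Postal Code", how="left", validate="one_to_one")

# Rows that did not find a match end up with NaN coordinates
missing = merged[merged["Latitude"].isna()]
print(missing)
```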


PART THREE

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

Step 1: Create a "toronto_data" dataframe which contains only boroughs with "Toronto" in the name

In [8]:
#Step 1: Extract Toronto Boroughs Only
borough_list = PC_2['Borough'].unique() #Find list of unique borough names

#Keep only those containing the string "Toronto"
toronto_boroughs = [borough for borough in borough_list if "Toronto" in str(borough)]

#Filter such that only boroughs in the list are included in the dataframe, then reset the index
toronto_data = PC_2[PC_2['Borough'].isin(toronto_boroughs)].reset_index(drop=True)
toronto_data.head() #visualise


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Step 2: Visualise the locations of each of the neighborhoods in the selected boroughs.

In [18]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto



The cell below defines two functions taken directly from the lab notebook on "Segmenting and Clustering Neighbourhoods in New York City". The first function collects the venues within 500 m of the centre of each neighbourhood (given as a latitude and longitude) and adds them to a dataframe. The "print" line has been removed to keep the output short.

The second function returns the names of the top 'x' most common venue categories for a neighbourhood.

In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
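
To make the second function concrete, here is a small usage sketch on a hand-made row. The function body is copied from the cell above; the `row` values are invented for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name), then rank categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# One toy row: neighbourhood name followed by category frequencies
row = pd.Series({"Neighborhood": "Example", "Bakery": 0.1, "Coffee Shop": 0.6, "Park": 0.3})

top_two = return_most_common_venues(row, 2)
print(top_two)
```

The function returns category names, not frequencies, so ties and near-ties are not visible in the result.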

Step 3: Locate venues within each neighbourhood

In [19]:
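#Foursquare credentials, read from environment variables so the secrets are not stored in the notebook
#(the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders; set them before running)
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # my Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # my Foursquare Secret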
VERSION = '20180604'

LIMIT = 100
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )
print(toronto_venues.shape)
toronto_venues.head()

(1614, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub
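
The venue extraction in getNearbyVenues indexes `categories[0]` directly, which would raise an IndexError for a venue with an empty category list. Whether the API ever returns such a venue is an assumption here, not something observed in this notebook. A defensive sketch, run against a hand-written stand-in for the response items (`sample_items` and `extract_venue` are hypothetical names):

```python
# Hand-written stand-in for the "items" list of a response; not real API output
sample_items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]

def extract_venue(item):
    """Return (name, lat, lng, category), using None when no category is listed."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (venue["name"], venue["location"]["lat"], venue["location"]["lng"], category)

rows = [extract_venue(item) for item in sample_items]
print(rows)
```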


Step 4: One-hot encoding to determine the frequency of occurrence of each category of venue within each neighbourhood.

In [59]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
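
A toy illustration of what the encoding and grouping produce: each cell of the grouped dataframe is the fraction of a neighbourhood's venues that fall into that category. The `venues` dataframe below is made up, and `dtype=float` is passed only so the result is the same across pandas versions.

```python
import pandas as pd

# Toy venue list: neighbourhood A has 2 coffee shops and 1 park, B has 1 park
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the 0/1 indicators is the share of venues in each category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```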

Step 5: Pick out the top 10 most commonly occurring types of venue in each neighbourhood, for easier viewing

In [81]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions after the 3rd fall back to the "th" suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Café,Bakery,Beer Bar,Cheese Shop,Restaurant,Basketball Stadium,Beach
1,"Brockton, Parkdale Village, Exhibition Place",Café,Coffee Shop,Breakfast Spot,Gym,Bakery,Convenience Store,Performing Arts Venue,Pet Store,Climbing Gym,Restaurant
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Park,Auto Workshop,Comic Shop,Pizza Place,Recording Studio,Butcher,Restaurant,Burrito Place,Brewery
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Sculpture Garden,Airport,Airport Food Court,Airport Terminal,Harbor / Marina,Boutique,Bar,Boat or Ferry
4,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Café,Salad Place,Thai Restaurant,Department Store,Bar,Burger Joint
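
The column labels above come from indexing a three-item suffix list, which is correct for 1 to 10 but would give "21th" if num_top_venues were raised past 20. A small sketch of a more general labelling helper; `ordinal` is a hypothetical name, not part of the lab code.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```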


Step 6: Run k-means clustering with 4 clusters to group neighbourhoods.

In [82]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # positional axis argument is not accepted by newer pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Diner,Sushi Restaurant,Gym,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint,Distribution Center
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Clothing Store,Coffee Shop,Bubble Tea Shop,Café,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Lingerie Store,Tea Room,Pizza Place
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Gastropub,Restaurant,Cocktail Bar,American Restaurant,Japanese Restaurant,Italian Restaurant,Clothing Store,Cosmetics Shop
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Asian Restaurant,Health Food Store,Pub,Trail,Distribution Center,Department Store,Dessert Shop,Diner,Discount Store,Doner Restaurant
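
To show what the clustering step does with the frequency table, here is a sketch on a toy version with four neighbourhoods and two categories. The data is invented, and `n_init=10` is set explicitly only because the default differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frequency table: A and B are coffee-heavy, C and D are park-heavy
toy_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Coffee Shop": [0.8, 0.7, 0.1, 0.0],
    "Park": [0.2, 0.3, 0.9, 1.0],
})

# Cluster on the numeric columns only
features = toy_grouped.drop(columns="Neighborhood")
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Attach one label per row, in the same row order as the input
toy_grouped.insert(0, "Cluster Labels", toy_kmeans.labels_)
print(toy_grouped)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful.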


Step 7: Visualise the neighbourhood clusters

In [88]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
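
The colour lookup used for the markers can be tried without folium. A minimal sketch that builds one hex colour per cluster and indexes it by label; `palette`, `cluster_labels` and `marker_colours` are illustrative names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow colour per cluster, converted to hex strings
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Example labels as produced by k-means (values 0 .. kclusters-1)
cluster_labels = [0, 1, 2, 3]
marker_colours = [palette[label] for label in cluster_labels]
print(marker_colours)
```

Indexing with the label itself keeps cluster 0 on the first colour of the palette.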

Step 8: Examine the cluster groups to determine how clustering has been applied.

In [84]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Theater,Cosmetics Shop,Shoe Store,Restaurant
1,Downtown Toronto,0,Coffee Shop,Diner,Sushi Restaurant,Gym,Park,Mexican Restaurant,Italian Restaurant,Hobby Shop,Fried Chicken Joint,Distribution Center
2,Downtown Toronto,0,Clothing Store,Coffee Shop,Bubble Tea Shop,Café,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Lingerie Store,Tea Room,Pizza Place
3,Downtown Toronto,0,Café,Coffee Shop,Gastropub,Restaurant,Cocktail Bar,American Restaurant,Japanese Restaurant,Italian Restaurant,Clothing Store,Cosmetics Shop
4,East Toronto,0,Asian Restaurant,Health Food Store,Pub,Trail,Distribution Center,Department Store,Dessert Shop,Diner,Discount Store,Doner Restaurant
5,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Seafood Restaurant,Café,Bakery,Beer Bar,Cheese Shop,Restaurant,Basketball Stadium,Beach
6,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Café,Salad Place,Thai Restaurant,Department Store,Bar,Burger Joint
7,Downtown Toronto,0,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Diner,Restaurant,Baby Store,Candy Store,Nightclub
8,Downtown Toronto,0,Coffee Shop,Restaurant,Café,Gym,Thai Restaurant,Hotel,Deli / Bodega,Bar,Salad Place,Bookstore
9,West Toronto,0,Pharmacy,Bakery,Supermarket,Café,Grocery Store,Brewery,Bank,Park,Music Venue,Bar


In [85]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,1,Park,Women's Store,Deli / Bodega,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run
33,Downtown Toronto,1,Park,Playground,Trail,Dance Studio,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant,Dog Run


In [86]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,2,Pool,Home Service,Garden,Women's Store,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop,Doner Restaurant


In [87]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Central Toronto,3,Jewelry Store,Trail,Mexican Restaurant,Sushi Restaurant,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Donut Shop


Analysis: The majority of neighbourhoods have been placed in cluster 0, with two neighbourhoods in cluster 1 and only one neighbourhood in each of clusters 2 and 3.

Cluster 0 contains neighbourhoods which are dominated by food and drink outlets such as restaurants, bars, supermarkets, cafes and coffee shops.

Cluster 1 contains neighbourhoods in which parks are the most common venue type, alongside other public spaces such as playgrounds, event spaces and dance studios.

Cluster 2 contains a single neighbourhood in which pools are the most common venue. Pools do not appear among the top venues of any other neighbourhood shown above, which sets this one apart.

Cluster 3 contains the only neighbourhood with jewelry stores in the top 5 most common venues, making it unique and deserving of its own cluster. The neighbourhood in cluster 3 contains a high proportion of non-food stores in general.
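
The cluster sizes referred to in the analysis can be read off directly rather than by inspecting each table. A sketch using made-up labels; in the notebook the input would be the 'Cluster Labels' column of toronto_merged.

```python
import pandas as pd

# Illustrative labels only, not the real clustering result
toy_labels = pd.Series([0, 0, 0, 1, 1, 2, 3], name="Cluster Labels")

# Number of neighbourhoods per cluster, ordered by cluster label
sizes = toy_labels.value_counts().sort_index()
print(sizes)
```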