# Segmenting and Clustering Neighbourhoods in Toronto

## Introduction

In this notebook, we will complete the Week 3 peer-graded assignment for the Applied Data Science Capstone course. The project requires us to segment and cluster neighbourhoods in Toronto using the data available on the following Wikipedia page.

[List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

First, we will scrape the postal code data from the Wikipedia page using the BeautifulSoup package and clean it. Then, we will use the pgeocode package to add geographical coordinates to each postal code. Next, we will use the Foursquare API to get venue data for each of these neighbourhoods. Finally, we will build a k-means model that uses the venue categories of each neighbourhood to create clusters of similar locations.

## Table of Contents

1. Importing libraries and initial setup
2. Web scraping Toronto neighbourhood data
3. Data cleaning
4. Adding geographical coordinates
5. Analyzing neighbourhoods using Foursquare API

## 1. Importing libraries and initial setup

In [373]:
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
import pgeocode
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from pandas import json_normalize
from sklearn.cluster import KMeans

## 2. Web scraping Toronto neighbourhood data

To scrape data from the Wikipedia page, we will first write a function that takes an HTML table as input and returns a pandas dataframe.

In [374]:
def readDataframeFromHTML(htmlTable):
    # Collect the text of every header/data cell, one list per table row
    htmlRows = htmlTable.find_all("tr")
    dataRows = []
    for tr in htmlRows:
        # Match <th> and <td> by exact name; a regex such as "(th|td)" would also match <thead>
        htmlCells = tr.find_all(["th", "td"])
        drow = [td.text.replace("\n", "") for td in htmlCells]
        if len(drow) > 0:
            dataRows.append(drow)

    # The first row holds the column names
    df = pd.DataFrame(dataRows[1:], columns=dataRows[0])
    return df


Now, we will fetch the HTML of the Wikipedia page using the **requests** library. Next, we will use the **BeautifulSoup** library to parse the HTML and retrieve the table containing the Postal Codes, Boroughs and Neighbourhoods for postal codes starting with M. We will pass this HTML table to our function **readDataframeFromHTML** to get a pandas dataframe.

In [375]:
wikiURL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
htmlPage = requests.get(wikiURL)
soup = BeautifulSoup(htmlPage.text, "html.parser")
htmlTable = soup.find("table", attrs={"class":"wikitable"})
df = readDataframeFromHTML(htmlTable)
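
The row-by-row extraction can be tried without network access. The sketch below is not the BeautifulSoup code used in this notebook: it uses only the standard library `html.parser` module, and both the `TableRowParser` class and the sample HTML are made up for illustration. It shows the same idea of collecting the `th`/`td` text of each `tr` and treating the first row as the header.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Made-up sample with the same three columns as the Wikipedia table
sampleHTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sampleHTML)
header, *records = parser.rows
print(header)
print(records)
```

In the notebook itself, BeautifulSoup remains the more convenient choice because it tolerates the irregular markup found on real pages.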

## 3. Data cleaning

Now that we have the neighbourhood data, we will clean it by removing rows where **Borough** is *Not assigned*. For rows where **Neighbourhood** is *Not assigned*, we will set the **Neighbourhood** to the same value as the **Borough**. We will also check that no two rows share the same **Postal Code**, as we will use these codes to get the geographical coordinates at a later stage.

In [376]:
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)
df.loc[df["Neighbourhood"] == "Not assigned", "Neighbourhood"] =  df.loc[df["Neighbourhood"] == "Not assigned", "Borough"]

duplicateCodes = df.groupby(by="Postal Code").count().reset_index(drop=True)
print("Number of rows with duplicate Postal Codes = " + str(duplicateCodes[duplicateCodes["Borough"] > 1].shape[0]))

Number of rows with duplicate Postal Codes = 0
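
To see the three cleaning rules in action, here is a minimal sketch on a made-up dataframe (the postal codes, boroughs and neighbourhoods below are placeholders, not real data). It uses `duplicated(keep=False)` as one possible alternative to the `groupby` count, because it returns the offending rows themselves rather than only their number.

```python
import pandas as pd

# Placeholder rows covering each case the cleaning step has to handle
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A", "X3A"],
    "Borough": ["Not assigned", "Borough A", "Borough B", "Borough B"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Hood 1", "Hood 2"],
})

# Rule 1: drop rows without a borough
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: fall back to the borough name when the neighbourhood is missing
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Rule 3: list every row whose postal code appears more than once
dupes = cleaned[cleaned["Postal Code"].duplicated(keep=False)]

print(cleaned)
print(dupes)
```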


Finally, we will check the size of the dataframe.

In [377]:
print(df.shape)
df.head()

(103, 3)


|  | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## 4. Adding geographical coordinates

In order to analyze each neighbourhood, we first need their geographical coordinates. There are multiple libraries in Python that can be used for this purpose. We will be using the **pgeocode** library which allows us to set the country location and then pass the postal code to get the desired coordinates.

We will write a simple function that takes the postal code as input and returns a dictionary of Latitude, Longitude for it.

In [378]:
# Build the Canadian postal-code lookup once and reuse it for every query
geoObject = pgeocode.Nominatim("CA")

def getLatLongData(postalCode):
    location = geoObject.query_postal_code(postalCode)
    coordinates = {"Latitude": location.latitude, "Longitude": location.longitude}
    return coordinates

We will then call the function for each of the postal codes in our dataframe. Finally, we will add the latitude and longitude data to the original dataframe.

In [379]:
allCoords = df["Postal Code"].map(getLatLongData)
coordsDF = pd.DataFrame(allCoords.to_list())
df = pd.concat([df, coordsDF], axis = 1)
print(df.shape)
df.head()

(103, 5)


|  | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 |
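
The map-then-concat pattern can be checked offline with a stand-in for the geocoder. In the sketch below, `lookupLatLong` and `SAMPLE_COORDS` are hypothetical replacements for the pgeocode call, seeded with two coordinate pairs from the output above; a code that is not in the lookup returns NaN, which is how a failed lookup would surface in the dataframe. Note that `pd.concat(..., axis=1)` aligns on the index, so both frames need the same default index.

```python
import pandas as pd

# Hypothetical stand-in for the pgeocode lookup, using values from the output above
SAMPLE_COORDS = {
    "M3A": (43.7545, -79.33),
    "M4A": (43.7276, -79.3148),
}

def lookupLatLong(postalCode):
    # Unknown codes yield NaN coordinates instead of raising
    lat, lng = SAMPLE_COORDS.get(postalCode, (float("nan"), float("nan")))
    return {"Latitude": lat, "Longitude": lng}

codes = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M7R"]})

# A Series of dicts becomes one column per key
coords = pd.DataFrame(codes["Postal Code"].map(lookupLatLong).to_list())

# Column-wise concat relies on both frames sharing the same 0..n-1 index
result = pd.concat([codes, coords], axis=1)
print(result)
```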


We also need to check that we have updated all the coordinates correctly, i.e. there are no missing values in the data.

In [380]:
df[df["Latitude"].isna() | df["Longitude"].isna()]

|  | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 76 | M7R | Mississauga | Canada Post Gateway Processing Centre | NaN | NaN |


Since the coordinates were not available for one postal code, we will update these values using the CSV file provided as part of the assignment.

[http://cocl.us/Geospatial_data](https://cocl.us/Geospatial_data)

In [381]:
df.loc[df["Postal Code"] == "M7R", "Latitude"] = 43.6369
df.loc[df["Postal Code"] == "M7R", "Longitude"] = -79.6158
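
Instead of typing the values by hand, the gaps could also be filled from a lookup frame. The following is only a sketch: `fallback` stands in for the assignment CSV, its column names are assumed, and it contains just the one M7R row quoted above. `combine_first` keeps every value that is already present and fills only the missing ones.

```python
import pandas as pd

main = pd.DataFrame({
    "Postal Code": ["M3A", "M7R"],
    "Latitude": [43.7545, None],
    "Longitude": [-79.33, None],
})

# Hypothetical stand-in for the assignment CSV (column names are assumed)
fallback = pd.DataFrame({
    "Postal Code": ["M7R"],
    "Latitude": [43.6369],
    "Longitude": [-79.6158],
})

# Existing values win; only NaN cells are taken from the fallback
filled = (
    main.set_index("Postal Code")
        .combine_first(fallback.set_index("Postal Code"))
        .reset_index()
)
print(filled)
```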

The assignment also states that we only need to analyze the boroughs whose names contain *Toronto*. We will therefore filter our dataframe for these boroughs and recheck the shape.

In [382]:
torontoData = df[df["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(torontoData.shape)
torontoData.head()

(40, 5)


|  | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.6513 | -79.3756 |
| 4 | M4E | East Toronto | The Beaches | 43.6784 | -79.2941 |


## 5. Analyzing neighbourhoods using Foursquare API

In [383]:
address = 'Toronto, CA'
geolocator = Nominatim(user_agent="foursquare_api")
location = geolocator.geocode(address)
torontoLatitude = location.latitude
torontoLongitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(torontoLatitude, torontoLongitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [384]:
fig = folium.Figure(width=800, height=600)

# create map of Toronto using latitude and longitude values
mapToronto = folium.Map(location=[torontoLatitude, torontoLongitude], zoom_start=11, control_scale = True)

# add markers to map
for lat, lng, borough, neigh in zip(torontoData['Latitude'], torontoData['Longitude'], torontoData['Borough'], torontoData['Neighbourhood']):
    label = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label, parse_html=True)
    folium.Circle([lat, lng], radius=1000, popup=label, color='none', fill=True, fill_color='#3186cc', fill_opacity=0.5).add_to(mapToronto)

fig.add_child(mapToronto)
fig

It looks like the neighbourhood with postal code **M7Y** was assigned incorrect geographical coordinates. We will manually update these values from the CSV file provided as part of the assignment.

In [385]:
torontoData[torontoData["Neighbourhood"] == "Business reply mail Processing Centre, South Central Letter Processing Plant Toronto"]


|  | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 39 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.7804 | -79.2505 |


In [386]:
torontoData.loc[torontoData["Postal Code"] == "M7Y", "Latitude"] = 43.6627439
torontoData.loc[torontoData["Postal Code"] == "M7Y", "Longitude"] = -79.321558
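
The misplaced marker was spotted by eye on the map. A distance check is one way to surface such rows automatically. The sketch below computes the great-circle (haversine) distance from the Toronto centre found earlier to two of the points listed above; the 15 km cut-off is an assumption chosen for illustration, not a value given by the assignment.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversineKm(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

torontoCentre = (43.6534817, -79.3839347)

# Coordinates as returned by the geocoder, before the manual correction
candidates = {
    "M4E": (43.6784, -79.2941),
    "M7Y": (43.7804, -79.2505),
}

MAX_DISTANCE_KM = 15  # assumed threshold, tune to the area being studied

distances = {
    code: haversineKm(*torontoCentre, lat, lng)
    for code, (lat, lng) in candidates.items()
}
suspects = [code for code, dist in distances.items() if dist > MAX_DISTANCE_KM]
print(suspects)
```

A flagged row is only a candidate for review; a legitimate neighbourhood far from the centre would be flagged as well.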


In [387]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders chosen for this notebook.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID") # your Foursquare ID
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET") # your Foursquare Secret
ACCESS_TOKEN = os.environ.get("FOURSQUARE_ACCESS_TOKEN") # your Foursquare Access Token
VERSION = '20180604'

In [388]:
# function that extracts the category of the venue
def getVenueCategory(row):
    # the column is named 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categoriesList = row['categories']
    except KeyError:
        categoriesList = row['venue.categories']

    if len(categoriesList) == 0:
        return None
    else:
        return categoriesList[0]['name']

Since Toronto covers a large area (630.2 sq. km), and quite a few neighbourhoods have an area of more than 1-2 sq. km, we will set **neighRadius** to 1000 metres. This will result in substantial overlap between the neighbourhoods in denser areas like *Downtown Toronto*, but will better capture the characteristics of other neighbourhoods.
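
The choice of radius can be checked with a little arithmetic. The sketch below assumes a circular search area and uses the city area quoted above together with the 40 postal codes kept after filtering; the coverage figure is an upper bound because it ignores the overlap between circles.

```python
from math import pi

neighRadiusKm = 1.0                     # 1000 m search radius
searchAreaKm2 = pi * neighRadiusKm ** 2
print(round(searchAreaKm2, 2))          # 3.14

torontoAreaKm2 = 630.2
numNeighbourhoods = 40

# Share of the city covered if no two search circles overlapped
coveredShare = numNeighbourhoods * searchAreaKm2 / torontoAreaKm2
print(round(coveredShare, 2))
```

Each search circle is therefore somewhat larger than a 1-2 sq. km neighbourhood, which is why neighbouring circles overlap where the postal codes are close together.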

In [389]:
def getVenuesForNeighbourhood(neighLatitude, neighLongitude, neighRadius = 1000, maxLimit = 100):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        neighLatitude, 
        neighLongitude, 
        neighRadius, 
        maxLimit)
    try:
        results = requests.get(url).json()
        # Convert the results json into a dataframe
        venues = results["response"]["groups"][0]["items"]
        nearbyVenues = json_normalize(venues) # flatten JSON
        filteredColumns = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
        nearbyVenues = nearbyVenues.loc[:, filteredColumns]
        # Replace the list of category dictionaries with the name of the first category
        nearbyVenues["venue.categories"] = nearbyVenues.apply(getVenueCategory, axis=1)
        # Set the column names by keeping only the last part of each dotted name
        nearbyVenues.columns = [col.split(".")[-1] for col in nearbyVenues.columns]
        return nearbyVenues
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # The request failed, the body was not valid JSON, or the expected fields were missing
        return None
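
The flattening step is easier to follow on a small payload. The sample below is made up: it only mimics the fields that the function above reads (`response.groups[0].items[*].venue`) and is not a real Foursquare response. It shows how `json_normalize` turns nested keys into dotted column names and how a venue without categories ends up with a missing value.

```python
import pandas as pd

# Made-up payload shaped like the fields read by getVenuesForNeighbourhood
sampleResults = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "categories": [{"name": "Café"}],
                   "location": {"lat": 43.65, "lng": -79.38}}},
        {"venue": {"name": "Example Park",
                   "categories": [],
                   "location": {"lat": 43.66, "lng": -79.37}}},
    ]}]}
}

items = sampleResults["response"]["groups"][0]["items"]

# Nested dictionaries become dotted column names such as "venue.location.lat"
flat = pd.json_normalize(items)
keep = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
flat = flat.loc[:, keep].copy()

# Keep the name of the first category, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].map(
    lambda cats: cats[0]["name"] if cats else None
)

# Keep only the last part of each dotted name
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```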
    

In [390]:
def getCompleteVenueList(names, latitudes, longitudes):
    venueFrames = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        neighVenues = getVenuesForNeighbourhood(lat, lng)
        if neighVenues is None:
            print("No venues were found for {}.".format(name))
        else:
            neighVenues.insert(loc=0, column="neigh", value=name)
            venueFrames.append(neighVenues)
    # Concatenate once (DataFrame.append was removed in pandas 2.0) and rebuild the index
    venuesDF = pd.concat(venueFrames, ignore_index=True)
    venuesDF.columns = ["Neighbourhood", "Venue", "Category", "Venue Latitude", "Venue Longitude"]
    return venuesDF



In [391]:
allVenues = getCompleteVenueList(names=torontoData["Neighbourhood"], latitudes=torontoData["Latitude"], longitudes=torontoData["Longitude"])
print("We found {} venues across {} neighbourhoods.".format(allVenues.shape[0], len(allVenues["Neighbourhood"].unique())))

We found 3286 venues across 40 neighbourhoods.


In [392]:
allVenues.loc[:,["Neighbourhood", "Venue"]].groupby("Neighbourhood").count().sort_values(by="Venue", ascending=False).reset_index()

|  | Neighbourhood | Venue |
|---|---|---|
| 0 | Berczy Park | 100 |
| 1 | High Park, The Junction South | 100 |
| 2 | Toronto Dominion Centre, Design Exchange | 100 |
| 3 | The Danforth West, Riverdale | 100 |
| 4 | The Annex, North Midtown, Yorkville | 100 |
| 5 | Studio District | 100 |
| 6 | Stn A PO Boxes | 100 |
| 7 | St. James Town | 100 |
| 8 | Richmond, Adelaide, King | 100 |
| 9 | Regent Park, Harbourfront | 100 |


In [393]:
print("There are {} unique categories.".format(len(allVenues["Category"].unique())))

There are 285 unique categories.


In [394]:
venuesOnehot = pd.get_dummies(allVenues[["Category"]], prefix="", prefix_sep="")
venuesOnehot.insert(loc=0, column="Neighbourhood", value=allVenues["Neighbourhood"])
print(venuesOnehot.shape)
venuesOnehot.head()

(3286, 286)


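|  | Neighbourhood | Accessories Store | Adult Boutique | African Restaurant | Airport | Airport Food Court | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wings Joint | Women's Store | Yoga Studio | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Regent Park, Harbourfront | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Regent Park, Harbourfront | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Regent Park, Harbourfront | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Regent Park, Harbourfront | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Regent Park, Harbourfront | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |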


In [395]:
venuesGrouped = venuesOnehot.groupby("Neighbourhood").mean().reset_index()
print(venuesGrouped.shape)
venuesGrouped.head()

(40, 286)


|  | Neighbourhood | Accessories Store | Adult Boutique | African Restaurant | Airport | Airport Food Court | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Whisky Bar | Wine Bar | Wings Joint | Women's Store | Yoga Studio | Zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berczy Park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | ... | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Brockton, Parkdale Village, Exhibition Place | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Business reply mail Processing Centre, South C... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.020408 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | CN Tower, King and Spadina, Railway Lands, Har... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | ... | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 | 0.0 |
| 4 | Central Bay Street | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | ... | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 |
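
The values in this table are category frequencies: the mean of a one-hot column within a neighbourhood equals the share of that neighbourhood's venues in the category. That is why a neighbourhood with 100 venues shows multiples of 0.01, while a value such as 0.020408 is consistent with 1 venue out of 49. A minimal sketch on made-up data (neighbourhoods "A" and "B" are placeholders):

```python
import pandas as pd

# Made-up venues: four in neighbourhood A, two in neighbourhood B
toyVenues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Category": ["Coffee Shop", "Coffee Shop", "Park", "Café", "Park", "Pub"],
})

# One indicator column per category
onehot = pd.get_dummies(toyVenues[["Category"]], prefix="", prefix_sep="")
onehot.insert(loc=0, column="Neighbourhood", value=toyVenues["Neighbourhood"])

# The mean of each indicator is the category's share within the neighbourhood
freq = onehot.groupby("Neighbourhood").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with very different venue counts become comparable before clustering.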


In [396]:
numTopVenues = 5

for hood in venuesGrouped["Neighbourhood"]:
    print("----" + hood + "----")
    temp = venuesGrouped[venuesGrouped["Neighbourhood"] == hood].T.reset_index()
    temp.columns = ["venue", "freq"]
    temp = temp.iloc[1:]
    temp["freq"] = temp["freq"].astype(float)
    temp = temp.round({"freq": 2})
    print(temp.sort_values("freq", ascending=False).reset_index(drop=True).head(numTopVenues))
    print("\n")

----Berczy Park----
                 venue  freq
0          Coffee Shop  0.11
1                 Café  0.07
2  Japanese Restaurant  0.04
3                Hotel  0.04
4   Seafood Restaurant  0.04


----Brockton, Parkdale Village, Exhibition Place----
                    venue  freq
0              Restaurant  0.05
1                     Bar  0.05
2             Coffee Shop  0.05
3                    Café  0.05
4  Furniture / Home Store  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venue  freq
0                Park  0.10
1         Pizza Place  0.06
2             Brewery  0.06
3  Italian Restaurant  0.04
4         Coffee Shop  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.06
2  Yoga Studio  0.06
3          Gym  0.05
4         Park  0.05


----Central Bay Street----
                venue  f

In [397]:
def getMostCommonVenues(row, numTopVenues):
    rowCategories = row.iloc[1:]
    rowCategoriesSorted = rowCategories.sort_values(ascending=False)
    return rowCategoriesSorted.index.values[0:numTopVenues]

In [409]:
numTopVenues = 10
indicators = ["st", "nd", "rd"]
columns = ["Neighbourhood"]
for ind in np.arange(numTopVenues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the "th" suffix
        columns.append("{}th Most Common Venue".format(ind+1))

neighVenuesSorted = pd.DataFrame(columns=columns)
neighVenuesSorted["Neighbourhood"] = venuesGrouped["Neighbourhood"]

for ind in np.arange(venuesGrouped.shape[0]):
    neighVenuesSorted.iloc[ind, 1:] = getMostCommonVenues(venuesGrouped.iloc[ind, :], numTopVenues)
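
The same ranking can be expressed with `Series.nlargest`, which avoids sorting every category. The sketch below runs on a made-up frequency table; the `ordinal` helper is hypothetical and only matters if more than 20 columns are requested, where a fixed "th" fallback would produce labels such as "21th".

```python
import pandas as pd

# Made-up frequency table with one row per neighbourhood
freq = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Café": [0.2, 0.0],
    "Coffee Shop": [0.5, 0.1],
    "Park": [0.3, 0.3],
    "Pub": [0.0, 0.6],
})

def ordinal(n):
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, 11th, 21st)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

numTop = 2
scores = freq.set_index("Neighbourhood")

# For each row, keep the names of the numTop most frequent categories
top = scores.apply(
    lambda row: pd.Series(row.nlargest(numTop).index.to_list()), axis=1
)
top.columns = [f"{ordinal(i + 1)} Most Common Venue" for i in range(numTop)]
print(top)
```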

In [410]:
# set number of clusters
kclusters = 5
venuesGroupedClustering = venuesGrouped.drop(columns="Neighbourhood")
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(venuesGroupedClustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 2, 0, 0, 0, 0, 0, 2, 2])
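
The number of clusters is fixed at 5 here. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below is not part of the assignment: it runs on synthetic points with three obvious groups, so that the expected answer is known. In this notebook, `venuesGroupedClustering` would take the place of `points`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 30 points each
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

# Fit k-means for several k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

bestK = max(scores, key=scores.get)
print(scores)
print(bestK)
```

On real venue data the scores are usually much closer together, so the result is a hint rather than a rule.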

In [411]:
# add clustering labels
neighVenuesSorted.insert(loc=0, column="Cluster Labels", value=kmeans.labels_)
mergedDF = torontoData
# merge torontoData with neighVenuesSorted to add the cluster label and top venues for each neighbourhood
mergedDF = mergedDF.join(neighVenuesSorted.set_index("Neighbourhood"), on="Neighbourhood", how="inner")
mergedDF.head() # check the last columns!

|  | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 | 0 | Coffee Shop | Park | Restaurant | Café | Theater | Bakery | Gastropub | Diner | Breakfast Spot | Pub |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 | 0 | Coffee Shop | Park | Café | Sushi Restaurant | Italian Restaurant | Boutique | Hotel | Clothing Store | Japanese Restaurant | Pizza Place |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 | 0 | Coffee Shop | Gastropub | Japanese Restaurant | Café | Seafood Restaurant | Theater | Italian Restaurant | Hotel | Middle Eastern Restaurant | Bakery |
| 3 | M5C | Downtown Toronto | St. James Town | 43.6513 | -79.3756 | 0 | Coffee Shop | Café | Restaurant | Seafood Restaurant | Gastropub | Italian Restaurant | Theater | Bakery | Art Gallery | Plaza |
| 4 | M4E | East Toronto | The Beaches | 43.6784 | -79.2941 | 2 | Pub | Coffee Shop | Pizza Place | Burger Joint | Grocery Store | Bar | Breakfast Spot | Health Food Store | Caribbean Restaurant | Japanese Restaurant |
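
Before mapping the clusters, it can help to look at their sizes and at how they spread across boroughs. The sketch below rebuilds only the five rows shown above, so the counts describe this sample and not the full result.

```python
import pandas as pd

# The five rows shown above, reduced to the columns needed here
sample = pd.DataFrame({
    "Borough": ["Downtown Toronto"] * 4 + ["East Toronto"],
    "Neighbourhood": [
        "Regent Park, Harbourfront",
        "Queen's Park, Ontario Provincial Government",
        "Garden District, Ryerson",
        "St. James Town",
        "The Beaches",
    ],
    "Cluster Labels": [0, 0, 0, 0, 2],
})

# Number of neighbourhoods per cluster
sizes = sample["Cluster Labels"].value_counts().sort_index()
print(sizes)

# Boroughs as rows, clusters as columns
byBorough = pd.crosstab(sample["Borough"], sample["Cluster Labels"])
print(byBorough)
```

Applied to `mergedDF`, the same two calls would show whether one cluster dominates and whether clusters follow borough boundaries.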


In [412]:
# create map
fig = folium.Figure(width=800, height=600)
mapClusters = folium.Map(location=[torontoLatitude, torontoLongitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colorsArray = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colorsArray]

# add markers to the map
for lat, lon, poi, cluster in zip(mergedDF["Latitude"], mergedDF["Longitude"], mergedDF["Neighbourhood"], mergedDF["Cluster Labels"]):
    label = folium.Popup(str(poi) + " Cluster " + str(cluster), parse_html=True)
    # cluster labels run from 0 to kclusters-1, so they index the colour list directly
    folium.Circle([lat, lon], radius=1000, color='none', fill=True, fill_color=rainbow[cluster], fill_opacity=0.7).add_to(mapClusters)
    folium.CircleMarker([lat, lon], radius=1, popup=label, color='#000', fill=True, fill_color="#000").add_to(mapClusters)

fig.add_child(mapClusters)
fig