# Segmenting and Clustering Neighbourhoods in Toronto - Part 2

## Introduction

In this notebook, we will complete the Week 3 peer-graded assignment for the Applied Data Science Capstone course. The project requires us to segment and cluster neighbourhoods in Toronto using data from the following Wikipedia page.

[List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

First, we will scrape the postal code data from the Wikipedia page using the BeautifulSoup package and clean it. Then, we will use the pgeocode package to add geographical coordinates to each postal code. Next, we will use the Foursquare API to get data for each of these neighbourhoods. Finally, we will build a model that uses the details of each neighbourhood to create clusters of similar locations.

## Table of Contents

1. Importing libraries and initial setup
2. Web scraping Toronto neighbourhood data
3. Data cleaning
4. Adding geographical coordinates
5. Visualizing neighbourhoods using Folium
6. Analyzing neighbourhoods using Foursquare API
7. Final analysis

## 1. Importing libraries and initial setup

In [12]:
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
import pgeocode
from geopy.geocoders import Nominatim

## 2. Web scraping Toronto neighbourhood data

To scrape data from the Wikipedia page, we will first write a function that takes an HTML table as input and returns a pandas dataframe.

In [13]:
def readDataframeFromHTML(htmlTable):
    # Collect the text of every header/data cell, one list per table row.
    dataRows = []
    for tr in htmlTable.find_all("tr"):
        # Match only <th> and <td> tags by exact name.
        htmlCells = tr.find_all(["th", "td"])
        drow = [cell.text.replace("\n", "") for cell in htmlCells]
        if len(drow) > 0:
            dataRows.append(drow)

    # The first row holds the column names; the remaining rows are data.
    df = pd.DataFrame(dataRows[1:], columns=dataRows[0])
    return df
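
As a quick offline check of this row-by-row approach, the sketch below parses a tiny hand-written table instead of the live Wikipedia page. The HTML snippet and the name `table_to_dataframe` are illustrative assumptions, not part of the assignment; the function mirrors the logic above but strips whitespace with `get_text(strip=True)`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_dataframe(html_table):
    """Turn a parsed <table> element into a DataFrame (first row = header)."""
    rows = []
    for tr in html_table.find_all("tr"):
        # Keep the stripped text of each header or data cell in this row.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows[1:], columns=rows[0])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_df = table_to_dataframe(soup.find("table", attrs={"class": "wikitable"}))
print(sample_df)
```

Testing against a fixed snippet like this makes it easier to tell a parsing bug apart from a change in the Wikipedia page layout.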


Now, we will fetch the Wikipedia page HTML using the **requests** library. Next, we will use the **BeautifulSoup** library to parse the HTML and retrieve the table containing the Postal Codes, Boroughs and Neighbourhoods. We will pass this HTML table to our function **readDataframeFromHTML** to get a pandas dataframe.

In [14]:
wikiURL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
htmlPage = requests.get(wikiURL)
soup = BeautifulSoup(htmlPage.text, "html.parser")
htmlTable = soup.find("table", attrs={"class":"wikitable"})
df = readDataframeFromHTML(htmlTable)
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## 3. Data cleaning

Now that we have the neighbourhood data, we will clean it by removing rows where **Borough** is *Not assigned*. For rows where **Neighbourhood** is *Not assigned*, we will set the **Neighbourhood** to the same value as the **Borough**. We will also verify that no two rows share the same **Postal Code**, as we will be using these codes to get the geographical coordinates at a later stage.

In [15]:
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)
df.loc[df["Neighbourhood"] == "Not assigned", "Neighbourhood"] =  df.loc[df["Neighbourhood"] == "Not assigned", "Borough"]

duplicateCodes = df.groupby(by="Postal Code").count().reset_index(drop=True)
print("Number of rows with duplicate Postal Codes = " + str(duplicateCodes[duplicateCodes["Borough"] > 1].shape[0]))

Number of rows with duplicate Postal Codes = 0
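
The three cleaning rules can be tried on a small hand-made frame. This is a minimal sketch: the row with postal code `M0X` is hypothetical, and `duplicated()` is used here as an alternative to the `groupby` count above, not as the notebook's own method.

```python
import pandas as pd

# Toy data; the "M0X" row is invented purely to exercise the fallback rule.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M0X"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows whose borough is not assigned.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: fall back to the borough name when the neighbourhood is not assigned.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Rule 3: count rows whose postal code already appeared earlier in the frame.
n_duplicates = int(cleaned["Postal Code"].duplicated().sum())

print(cleaned)
print("Duplicate postal codes:", n_duplicates)
```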


Finally, we will check the size of the dataframe.

In [16]:
print(df.shape)
df.head()

(103, 3)


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## 4. Adding geographical coordinates

To analyze each neighbourhood, we first need its geographical coordinates. Several Python libraries can be used for this purpose. We will use the **pgeocode** library, which lets us select a country and then look up the coordinates for a postal code.

We will write a simple function that takes a postal code as input and returns a dictionary containing its latitude and longitude.

In [17]:
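# Build the Canadian postal code lookup once rather than on every call.
geoObject = pgeocode.Nominatim("CA")

def getLatLongData(postalCode):
    # The query returns a pandas Series; unknown codes yield NaN coordinates.
    location = geoObject.query_postal_code(postalCode)
    coordinates = {"Latitude": location.latitude, "Longitude": location.longitude}
    return coordinates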

We will then call the function for each of the postal codes in our dataframe. Finally, we will add the latitude and longitude data to the original dataframe.

In [18]:
allCoords = df["Postal Code"].map(getLatLongData)
coordsDF = pd.DataFrame(allCoords.to_list())
df = pd.concat([df, coordsDF], axis = 1)
print(df.shape)
df.head()

(103, 5)


| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7545 | -79.33 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 |
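
The map-then-concat pattern used above can be sketched without pgeocode, which needs to download its data. Here a hypothetical in-memory dictionary, `FAKE_LOOKUP`, stands in for the real query; the two coordinate pairs are copied from the output above, and everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the pgeocode query (no download required).
FAKE_LOOKUP = {
    "M3A": (43.7545, -79.33),
    "M4A": (43.7276, -79.3148),
}


def lookup_coordinates(postal_code):
    """Return a dict of coordinates, with NaN for unknown postal codes."""
    nan = float("nan")
    lat, lon = FAKE_LOOKUP.get(postal_code, (nan, nan))
    return {"Latitude": lat, "Longitude": lon}


codes = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M7R"]})

# map() yields a Series of dicts; a list of dicts becomes one column per key.
coords = pd.DataFrame(codes["Postal Code"].map(lookup_coordinates).to_list())

# concat(axis=1) aligns on the index, so both frames need matching indexes.
result = pd.concat([codes, coords], axis=1)

# Rows for which the lookup returned no coordinates.
missing_rows = result[result["Latitude"].isna() | result["Longitude"].isna()]
print(result)
print(missing_rows)
```

Because `pd.concat` aligns rows by index label, this only lines up correctly when the original dataframe has a clean 0..n-1 index, which is why the earlier `reset_index(drop=True)` matters.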


We also need to check that we have updated all the coordinates correctly, i.e. there are no missing values in the data.

In [19]:
df[df["Latitude"].isna() | df["Longitude"].isna()]

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 76 | M7R | Mississauga | Canada Post Gateway Processing Centre | NaN | NaN |


Since coordinates were not available for one postal code, we will fill in these values using the CSV file provided as part of the assignment.

[https://cocl.us/Geospatial_data](https://cocl.us/Geospatial_data)

In [20]:
df.loc[df["Postal Code"] == "M7R", "Latitude"] = 43.6369
df.loc[df["Postal Code"] == "M7R", "Longitude"] = -79.6158
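
Instead of typing the values by hand, the gaps could be filled programmatically from the CSV. The sketch below is one possible approach under stated assumptions: the inline text stands in for the downloaded file, its column names (`Postal Code`, `Latitude`, `Longitude`) are assumed, and the single row repeats the values used in the cell above.

```python
import io

import pandas as pd

# Stand-in for the assignment's CSV file; the column names are an assumption.
CSV_TEXT = """Postal Code,Latitude,Longitude
M7R,43.6369,-79.6158
"""
reference = pd.read_csv(io.StringIO(CSV_TEXT)).set_index("Postal Code")

frame = pd.DataFrame({
    "Postal Code": ["M3A", "M7R"],
    "Latitude": [43.7545, float("nan")],
    "Longitude": [-79.33, float("nan")],
})

for column in ["Latitude", "Longitude"]:
    # Look up each postal code in the reference table (NaN when absent) ...
    fallback = frame["Postal Code"].map(reference[column])
    # ... and use it only where the existing value is missing.
    frame[column] = frame[column].fillna(fallback)

print(frame)
```

This leaves coordinates that were already present untouched and scales to any number of missing postal codes.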

The assignment only requires us to analyze the boroughs whose names contain *Toronto*, so we will filter the dataframe for these boroughs and recheck its shape.

In [21]:
torontoData = df[df["Borough"].str.contains("Toronto")].reset_index(drop=True)
print(torontoData.shape)
torontoData.head()

(40, 5)


| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6641 | -79.3889 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3783 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.6513 | -79.3756 |
| 4 | M4E | East Toronto | The Beaches | 43.6784 | -79.2941 |
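
For reference, the substring filter can be illustrated on four borough values taken from the outputs above. This is only a sketch: `regex=False` and `na=False` are optional arguments added here to make the match a literal substring test and to treat missing borough names as non-matches.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M4E", "M7R"],
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "Mississauga"],
})

# Literal substring match; rows with a missing borough would count as False.
mask = boroughs["Borough"].str.contains("Toronto", regex=False, na=False)
toronto_only = boroughs[mask].reset_index(drop=True)
print(toronto_only)
```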
