# Segmenting and Clustering Neighbourhoods in Toronto - Part 1

## Introduction

In this notebook, we will complete the Week 3 peer-graded assignment for the Applied Data Science Capstone course. The project requires us to segment and cluster neighbourhoods in Toronto using the data available on the following Wikipedia page:

[List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

First, we will scrape the postal code data from the Wikipedia page using the BeautifulSoup package and clean it. Then, we will use the Geocoder package to add geographical coordinates to each neighbourhood. Next, we will use the Foursquare API to get data for each of these neighbourhoods. Finally, we will build a model that will use the details of each neighbourhood to create clusters of similar locations. 

## Table of Contents

1. Importing libraries and initial setup
2. Web scraping Toronto neighbourhood data
3. Data cleaning
4. Adding geographical coordinates
5. Visualizing neighbourhoods using Folium
6. Analyzing neighbourhoods using Foursquare API
7. Final analysis

## 1. Importing libraries and initial setup

In [1]:
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup


## 2. Web scraping Toronto neighbourhood data

To scrape data from the Wikipedia page, we will first write a function that takes an HTML table as input and returns a pandas dataframe.

In [2]:
def readDataframeFromHTML(htmlTable):
    # Collect the text of every header/data cell, one list per table row
    htmlRows = htmlTable.find_all("tr")
    dataRows = []
    for tr in htmlRows:
        # Match only <th> and <td> tags (a list is stricter than a regex search)
        htmlCells = tr.find_all(["th", "td"])
        drow = [cell.text.replace("\n", "") for cell in htmlCells]
        if len(drow) > 0:
            dataRows.append(drow)

    # The first row holds the column names; the remaining rows are the data
    df = pd.DataFrame(dataRows[1:], columns=dataRows[0])
    return df
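
To make the row-by-row conversion easier to follow, here is a minimal sketch that applies the same idea to a small inline table instead of the live Wikipedia page. The `SAMPLE_HTML` string and the helper name `table_to_dataframe` are assumptions made for illustration and are not part of the assignment.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table (one header row, two data rows) standing in for the Wikipedia page
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_dataframe(html_table):
    """Convert a parsed <table> into a DataFrame, using the first row as the header."""
    rows = []
    for tr in html_table.find_all("tr"):
        # Gather the stripped text of every header or data cell in this row
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows[1:], columns=rows[0])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_df = table_to_dataframe(soup.find("table", attrs={"class": "wikitable"}))
print(sample_df)
```

Testing the parser on a small, fixed table like this makes it possible to check the header handling without depending on the current layout of the Wikipedia page.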


Now, we will fetch the HTML of the Wikipedia page using the **requests** library. Next, we will use the **BeautifulSoup** library to parse it and retrieve the HTML table containing the Postal Codes, Boroughs and Neighbourhoods of Toronto. We will pass this table to our function **readDataframeFromHTML** to get a pandas dataframe.

In [3]:
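wikiURL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
htmlPage = requests.get(wikiURL)
htmlPage.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(htmlPage.text, "html.parser")
# The first table with class "wikitable" holds the postal code data
htmlTable = soup.find("table", attrs={"class":"wikitable"})
df = readDataframeFromHTML(htmlTable)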

## 3. Data cleaning

Now that we have the data for the Toronto neighbourhoods, we will clean it by removing rows where **Borough** is *Not assigned*. For rows where **Neighbourhood** is *Not assigned*, we will set the **Neighbourhood** to the same value as the **Borough**. We will also check that no two rows share the same **Postal Code**, as we will use these codes to look up the geographical coordinates at a later stage.

In [4]:
# Drop rows whose borough is not assigned
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Where the neighbourhood is not assigned, fall back to the borough name
notAssigned = df["Neighbourhood"] == "Not assigned"
df.loc[notAssigned, "Neighbourhood"] = df.loc[notAssigned, "Borough"]

# Count the postal codes that appear in more than one row
duplicateCodes = df.groupby(by="Postal Code").count().reset_index(drop=True)
print("Number of rows with duplicate Postal Codes = " + str(duplicateCodes[duplicateCodes["Borough"] > 1].shape[0]))

Number of rows with duplicate Postal Codes = 0
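
The three cleaning rules can also be checked in isolation on a toy dataframe. The sketch below uses made-up rows (the `toy` data, including its borough values, is hypothetical and not taken from the Wikipedia table) and counts duplicates with `Series.duplicated`, which is an alternative to the `groupby` approach rather than the method used in this notebook.

```python
import pandas as pd

# Hypothetical rows chosen to exercise each cleaning rule
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: drop rows without an assigned borough
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighbourhood is missing, reuse the borough name
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Rule 3: count rows whose postal code already appeared in an earlier row
n_dupes = int(cleaned["Postal Code"].duplicated().sum())

print(cleaned)
print("Duplicate postal codes:", n_dupes)
```

Storing the boolean mask in a variable avoids evaluating the same comparison twice and keeps the assignment line short.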


Finally, we will check the size of the dataframe and preview its first five rows.

In [5]:
print(df.shape)
df.head()

(103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
