# Segmenting and Clustering Neighborhoods in Toronto

By Thibault D.

## Table of Contents
 
1. Wikipedia Scraping
2. Geolocalization
3. Exploration and Clustering

In [160]:
# import libraries
# url fetch
import requests
from pandas import json_normalize

# scraping
# !conda install beautifulsoup4
from bs4 import BeautifulSoup

# data
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import numpy as np

# geolocalization
from geopy.geocoders import Nominatim
import geocoder

# plot
import folium
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# KMeans clustering
from sklearn.cluster import KMeans

# PART 1: START  
__________________

## 1. Wikipedia Scraping

In this section, we use the BeautifulSoup library to extract the table containing the list of neighborhoods in Toronto.
We follow these steps:
1. Create a soup object that contains the webpage data.
2. Retrieve the subset of HTML code which contains the table data.
3. Extract the headers from the table.
4. Extract the content of the table.

**Step 1: Create a soup object that contains the webpage data.**

In [161]:
# url to be scraped
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [162]:
# GET request
response = requests.get(URL)
response.raise_for_status()  # stop early if the page could not be fetched
data = response.text

# convert the response body to soup
soup = BeautifulSoup(data, "lxml")

In [165]:
# display content of soup
print(soup.prettify()[:5000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":876823784,"wgRevisionId":876823784,"wgArticleId":539066,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

Inspecting the page source shows that the data is contained in a **table**: the headers are stored in **th** tags, while the data is stored in **td** tags.
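
As a minimal sketch of how **th** and **td** cells map to headers and rows, the snippet below parses a small hand-written HTML fragment with BeautifulSoup. The fragment is an assumption that only mimics the layout of the Wikipedia table, the names `SAMPLE_HTML` and `parse_table` are hypothetical, and the built-in `html.parser` is used here instead of `lxml` to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the layout of the Wikipedia table (illustrative only)
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

def parse_table(html):
    """Return (headers, rows) for the first 'wikitable sortable' table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    # header cells are th tags
    headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        # data cells are td tags
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # the header row has no td cells and is skipped
            rows.append(cells)
    return headers, rows

headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

The `strip()` call matters because the last cell of each row ends with a newline in the page source.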

**Step 2: Retrieve the subset of HTML code which contains the table data.**

In [8]:
# extract the table
match = soup.find('table',class_='wikitable sortable')
print(match)

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

**Step 3: Extract the headers from the table.**

In [9]:
# fetch column names
headers = soup.find('table',class_='wikitable sortable').find('tbody').find_all('th')
columns = [head.text.strip() for head in headers]
columns

['Postcode', 'Borough', 'Neighbourhood']

In [10]:
# create new dataframe used to store the table data
zip_canada = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])

**Step 4: Extract the content of the table.**

In [11]:
# fetch table rows (tr)
data_rows = soup.find('table', class_='wikitable sortable').find('tbody').find_all('tr')

# collect the cells (td) of each row; the header row has no td and is skipped
records = []
for data_row in data_rows:
    data_split = data_row.find_all('td')

    if len(data_split) > 0:
        records.append({'PostalCode': data_split[0].text.strip(),
                        'Borough': data_split[1].text.strip(),
                        'Neighborhood': data_split[2].text.strip()})

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
zip_canada = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])

zip_canada.head(11)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


## Data cleanup

In this section, we process the data and eliminate invalid records. The following steps are applied:
1. Delete rows where the Borough is defined as **"Not assigned"**
2. Concatenate neighborhoods with the same PostalCode
3. Replace unassigned Neighborhood values with the Borough name
4. Display the shape of the cleaned DataFrame
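
The steps above can be sketched on a small hand-made DataFrame. The `toy` data is an assumption chosen to mirror a few rows of the scraped table, and the names `step1` and `step2` are hypothetical.

```python
import pandas as pd

# Toy rows mirroring the scraped table (illustrative values only)
toy = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop rows without a borough
step1 = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. one row per postal code, neighborhoods joined by ', '
step2 = (step1.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())

# 3. fall back to the borough name when the neighborhood is unassigned
unassigned = step2['Neighborhood'] == 'Not assigned'
step2.loc[unassigned, 'Neighborhood'] = step2.loc[unassigned, 'Borough']

# 4. inspect the result
print(step2)
print(step2.shape)
```

On this toy input, the two M5A rows collapse into one and the M7A row takes its borough name as neighborhood.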

**Step 1: Delete rows where the Borough is defined as "Not assigned"**

In [12]:
# Step 1
clean_df1 = zip_canada[zip_canada['Borough']!='Not assigned'].reset_index(drop=True)
clean_df1.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


**Step 2: Concatenate neighborhoods with the same PostalCode**

In [13]:
# group by PostalCode and Borough, then concatenate the Neighborhoods.
clean_df2 = clean_df1.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()

We verify that there are no longer any duplicates in the 'PostalCode' column.

In [14]:
print('Checking for duplicates...')
# duplicated() flags every repeated PostalCode; any() is True if at least one exists
print('Are there PostalCode duplicates?', clean_df2['PostalCode'].duplicated().any())

Checking for duplicates...
Are there PostalCode duplicates? False
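
As a hedged aside, the sketch below shows two ways to express a duplicate check with pandas on a hypothetical `codes` Series. It also illustrates a pitfall worth keeping in mind when writing such checks: on an integer, `~` is a bitwise NOT and is evaluated before `==`, so it does not behave like a logical negation.

```python
import pandas as pd

codes = pd.Series(['M5A', 'M5A', 'M6A'])  # hypothetical postal codes

has_duplicates = bool(codes.duplicated().any())  # 'M5A' appears twice
all_unique = codes.is_unique                     # the opposite question

# Caution: on an integer, ~ is a bitwise NOT and binds tighter than ==
max_count = int(codes.value_counts().max())      # highest number of occurrences
bitwise = (~max_count == 1)                      # compares the bitwise NOT of the count with 1
logical = not (max_count == 1)                   # the intended logical negation

print(has_duplicates, all_unique, bitwise, logical)
```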


In [15]:
clean_df2.tail(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 93 | M9A | Etobicoke | Islington Avenue |
| 94 | M9B | Etobicoke | Cloverdale, Islington, Martin Grove, Princess ... |
| 95 | M9C | Etobicoke | Bloordale Gardens, Eringate, Markland Wood, Ol... |
| 96 | M9L | North York | Humber Summit |
| 97 | M9M | North York | Emery, Humberlea |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, Martin Grove Gardens, Richv... |
| 101 | M9V | Etobicoke | Albion Gardens, Beaumond Heights, Humbergate, ... |
| 102 | M9W | Etobicoke | Northwest |


**Step 3: Replace unassigned Neighborhood with the Borough name**

We first list rows where the Neighborhood contains "Not assigned".

In [16]:
clean_df2[clean_df2['Neighborhood'].str.contains('Not assigned')]

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Not assigned |


Only one record contains an unassigned Neighborhood name. We replace it.

In [17]:
# boolean mask of the rows whose Neighborhood is unassigned
unassigned = clean_df2['Neighborhood'].str.contains('Not assigned')
clean_df2.loc[unassigned, 'Neighborhood'] = clean_df2.loc[unassigned, 'Borough']

We verify that the data is now clean:

In [18]:
print('Checking for unassigned Neighborhood...')
# any() is True if at least one Neighborhood still contains 'Not assigned'
print('Are there unassigned neighborhood?', clean_df2['Neighborhood'].str.contains('Not assigned').any())

Checking for unassigned Neighborhood...
Are there unassigned neighborhood? False


**Step 4: Verification**

In [19]:
print("There are {} records in the DataFrame".format(clean_df2.shape[0]))

There are 103 records in the DataFrame


In [20]:
print("The shape of the DataFrame is:")
print(clean_df2.shape)

The shape of the DataFrame is:
(103, 3)


# PART 1: END  
__________________  
# PART 2: START  

## 2. Geolocalization

In this section, we retrieve the latitude and longitude coordinates of each neighborhood. We loop through every row in our DataFrame and geocode its postal code to obtain the latitude and longitude.

In [21]:
zip_canada = pd.DataFrame(columns = list(clean_df2.columns)+['Latitude','Longitude'])
zip_canada

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|


To geolocalize the neighborhoods, we use the ArcGIS service instead of *geocoder.google*. In this run, ArcGIS proved more reliable: it returned coordinates on the first call for every postal code (103 calls for 103 records).

Geocoder Documentation:  
https://media.readthedocs.org/pdf/geocoder/latest/geocoder.pdf
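
The geocoding loop in the next cell retries until the service answers. As a hedged sketch of the same idea with a cap on the number of attempts, the helper below takes any lookup callable, so that a postal code that never resolves cannot block the run. The name `geocode_with_retry`, its parameters, and the stand-in lookup with made-up coordinates are all assumptions for illustration; in the notebook the lookup would be something like `lambda a: geocoder.arcgis(a).latlng`.

```python
import time

def geocode_with_retry(lookup, address, max_attempts=5, pause=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out.

    lookup is any callable returning [lat, lng] or None.
    Returns (coords_or_None, number_of_calls).
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(address)
        if coords is not None:
            return coords, attempt
        time.sleep(pause)  # optional pause between attempts
    return None, max_attempts

# Stand-in for the geocoding service: fails once, then answers (made-up coordinates)
answers = iter([None, [43.0, -79.0]])
coords, calls = geocode_with_retry(lambda address: next(answers),
                                   'M1B, Toronto, Ontario,Canada')
print(coords, calls)
```

Returning the number of calls keeps the API-call counter from the original loop available to the caller.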

In [22]:
# set counter of API calls
api_calls = 0
records = []

for postalcode, borough, neighborhood in zip(clean_df2['PostalCode'], clean_df2['Borough'], clean_df2['Neighborhood']):
    # initialize the coordinates to None
    lat_lng_coords = None

    # loop until the service returns coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario,Canada'.format(postalcode))
        lat_lng_coords = g.latlng
        api_calls += 1

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    records.append({
        'PostalCode': postalcode,
        'Borough': borough,
        'Neighborhood': neighborhood,
        'Latitude': latitude,
        'Longitude': longitude
    })

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
zip_canada = pd.DataFrame(records, columns=zip_canada.columns)

print('All locations have been retrieved.')
print('{} calls to the API were made.'.format(api_calls))

All locations have been retrieved.
103 calls to the API were made.


In [23]:
zip_canada.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.81165 | -79.195561 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.785605 | -79.158701 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.76569 | -79.175299 |
| 3 | M1G | Scarborough | Woburn | 43.768216 | -79.21761 |
| 4 | M1H | Scarborough | Cedarbrae | 43.769608 | -79.23944 |
| 5 | M1J | Scarborough | Scarborough Village | 43.743085 | -79.232172 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.72626 | -79.26367 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.713213 | -79.28491 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.723575 | -79.234976 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.69669 | -79.260069 |


In [24]:
# save csv file
zip_canada.to_csv("./zip_canada.csv")

# PART 2: END  