# Segmenting and Clustering Neighbourhoods in Toronto

### Section: #1 - Explore and cluster neighbourhoods in Toronto

Capstone project - Battle of the Neighbourhoods. This project compares Toronto neighbourhoods and identifies similar ones by clustering on location data.

This project uses web scraping to retrieve data from the Wikipedia page listing Canadian postal codes that begin with M.

The scraped data is then cleansed in preparation for clustering.

We import a file containing the geospatial locations and merge it with the postal code data, which lets us visualise the data on a map of the area.

We then cluster and plot the data over the map.

The clustering is carried out with K-Means and the clusters are plotted using the Folium library.

We map the data across Toronto and then focus the clustering on boroughs whose names contain 'Toronto'.

Install and import the required libraries.

In [1]:
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib # parser requested by the BeautifulSoup call below
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 


from IPython.display import display_html

# transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Set up the reference to the Wikipedia page listing Canadian postal codes beginning with M.

Read in the web page and define a Beautiful Soup object for manipulation.

In [2]:
canadian_postcodes_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
canadian_postcodes = requests.get(canadian_postcodes_url).text
soup = BeautifulSoup(canadian_postcodes, "html5lib")


Let's find the first table within the web page.

In [3]:
table_contents = []
canadian_postcodes_table = soup.find("table")

Let's take a look at the raw HTML table.

In [4]:
canadian_postcodes_table

<table cellpadding="2" cellspacing="0" rules="all" style="width:100%;">

<tbody><tr>
<td style="width:11%;">
<p>M1A<br/><span style="font-size:85%;">Not assigned</span>
</p>
</td>
<td style="width:11%;">
<p>M2A<br/><span style="font-size:85%;">Not assigned</span>
</p>
</td>
<td style="width:11%;">
<p>M3A<br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>)</span>
</p>
</td>
<td style="width:11%;">
<p>M4A<br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>)</span>
</p>
</td>
<td style="width:11%;">
<p>M5A<br/><span style="font-size:85%;"><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a><br/>(<a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a> / <a href="/wiki/Harbourfront,_Toronto" title="Harbourfront, Toronto">Harbourf

We're going to wrangle the data into a dataframe consisting of 3 columns:

PostalCode, Borough, Neighborhood

We're also going to skip the cells marked 'Not assigned', combine the neighbourhoods that share a postal code into a single comma-separated entry, and remove any duplicate rows.

In [5]:
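for row in canadian_postcodes_table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass # ignore postal codes that have no borough assigned
    else:
        # the first three characters of the cell are the postal code
        cell['PostalCode'] = row.p.text[:3]
        # the borough is the text before the opening parenthesis
        cell['Borough'] = (row.span.text).split('(')[0]
        # the neighbourhoods are the text inside the parentheses, with ' /' separators turned into commas
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)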

# print(table_contents)

# build the dataframe and tidy borough names that were run together during scraping
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

# remove duplicate rows - some postal code areas have multiple neighbourhoods, e.g. M5A
postcode_data = df.drop_duplicates()

#check de-duped
#postcode_data.loc[df['PostalCode'] == 'M5A']
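
The chained string handling in the loop can be hard to follow in a single expression. As a minimal sketch, assuming each cell's text has the form `Borough(Neighbourhood / Neighbourhood)` as in the raw table above, the same split can be written as a small helper. The name `parse_cell_text` is hypothetical and not part of the notebook, and it leaves out the extra handling the notebook applies to stray closing parentheses.

```python
def parse_cell_text(span_text):
    """Split 'Borough(Neighbourhood / Neighbourhood)' into (borough, neighbourhoods)."""
    # everything before the first '(' is the borough name
    borough, _, rest = span_text.partition('(')
    # drop the closing ')' and turn ' /' separators into commas
    neighbourhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighbourhoods


print(parse_cell_text('Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_cell_text('North York(Parkwoods)'))
```

Keeping the parsing in one function makes it possible to check it against a few known cells before running it over the whole table.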



View the data.

In [6]:
postcode_data

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Answer for Section 1 - In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

Print the number of rows and columns in the data.

In [7]:
postcode_data.shape

(103, 3)

### Section: #2 - Combining the Postal Code data with Geospatial data


The Geospatial_Coordinates CSV file is read from the notebook project assets. The Geocoder package was tried first but returned errors, so the CSV file is used instead.

This cell is hidden; the data is read into a dataframe named geospatial_data, which is referenced later in the notebook.

In [8]:
# The code was removed by Watson Studio for sharing.

Let's review the geospatial data.

In [9]:
geospatial_data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Print the number of rows and columns in the data.

In [10]:
geospatial_data.shape

(103, 3)

We're going to merge (join) the postal code data and the geospatial data on PostalCode so that each neighbourhood row gains its latitude and longitude.

In [11]:
geospatial_data.rename(columns={'Postal Code':'PostalCode'},inplace=True)
merged_geospatial_data = pd.merge(postcode_data,geospatial_data,on='PostalCode')
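
An inner join silently drops any postal code that has no match in the other table. As a hedged sketch of one way to check for that, the `indicator` and `validate` options of `pd.merge` can flag unmatched or duplicated keys. The two small frames below are illustrative only: the coordinates are copied from the tables in this notebook, and M5A is deliberately left out of the coordinates to show what a gap would look like.

```python
import pandas as pd

postcodes = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
# M5A is omitted here on purpose (an illustrative gap, not a real one)
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Latitude': [43.753259, 43.725882, 43.806686],
    'Longitude': [-79.329656, -79.315572, -79.194353],
})
coords = coords.rename(columns={'Postal Code': 'PostalCode'})

# a left join keeps every postal code; '_merge' records where each row was found
checked = pd.merge(postcodes, coords, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If the list of unmatched codes is empty, the inner join used in the notebook loses nothing.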


Answer for Section 2 - Output the dataframe as per the example.

In [12]:
merged_geospatial_data.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


### Section: #3 - Explore and cluster the neighborhoods in Toronto


Let's do some quick analysis of the data: how many boroughs and neighbourhoods does it contain?

In [13]:
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(merged_geospatial_data['Borough'].unique()),
        merged_geospatial_data.shape[0]
    )
)


The dataframe has 15 boroughs and 103 neighbourhoods.
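
Note that `merged_geospatial_data.shape[0]` counts rows, that is postal codes, and a single row can list several neighbourhoods separated by commas. A minimal sketch of counting the individual names instead, using three rows copied from the table above; `count_neighbourhoods` is a hypothetical helper, not part of the notebook.

```python
# (postal code, neighbourhood cell) pairs copied from the merged table
rows = [
    ('M3A', 'Parkwoods'),
    ('M5A', 'Regent Park, Harbourfront'),
    ('M6A', 'Lawrence Manor, Lawrence Heights'),
]


def count_neighbourhoods(rows):
    """Count distinct neighbourhood names across comma-separated cells."""
    names = set()
    for _postal_code, cell in rows:
        for name in cell.split(','):
            names.add(name.strip())
    return len(names)


print(len(rows), 'postal codes,', count_neighbourhoods(rows), 'neighbourhoods')
```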


Use the geopy library to get the latitude and longitude of Toronto.

In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.
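
`geolocator.geocode` returns `None` when the service finds no match, in which case reading `location.latitude` would fail. One possible fallback, sketched here under the assumption that centring the map on the mean of the data points is acceptable, is shown below; `map_centre` is a hypothetical helper and the coordinates are three Downtown Toronto rows from the tables in this notebook.

```python
from statistics import mean


def map_centre(location, latitudes, longitudes):
    """Return (lat, lon) from a geocoding result, or the mean of the data if there is none."""
    if location is not None:
        return location.latitude, location.longitude
    return mean(latitudes), mean(longitudes)


# coordinates for M5A, M5B and M5C
lats = [43.654260, 43.657162, 43.651494]
lons = [-79.360636, -79.378937, -79.375418]

# passing None simulates a geocoding call that found nothing
centre = map_centre(None, lats, lons)
print(centre)
```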


Let's visualise the data over a map of Toronto, Ontario.

In [15]:
toronto_data = merged_geospatial_data

In [16]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's segment and cluster only the neighbourhoods whose borough name contains 'Toronto'.

In [17]:
toronto_boroughs_data = merged_geospatial_data[merged_geospatial_data['Borough'].str.contains('Toronto',regex=False)]
toronto_boroughs_data


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
35,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106


In [18]:
# create map of Toronto using latitude and longitude values
map_toronto_boroughs = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_boroughs_data['Latitude'], toronto_boroughs_data['Longitude'], toronto_boroughs_data['Borough'], toronto_boroughs_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_boroughs)  
    
map_toronto_boroughs

Define the K-Means clustering for the Toronto boroughs, using latitude and longitude as the features.

In [19]:
k=5
# cluster on latitude and longitude only
toronto_clustering = toronto_boroughs_data.drop(columns=['PostalCode','Borough','Neighborhood'])
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
# add the cluster label of each row as the first column
toronto_boroughs_data.insert(0, 'Cluster Labels', kmeans.labels_)
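
To make the clustering step less of a black box, here is a minimal pure-Python sketch of the K-Means idea: assign each point to its nearest centroid, move each centroid to the mean of its members, and repeat until the assignments stop changing. It is an illustration under simplifying assumptions, not what scikit-learn does internally: it starts from the first k points rather than a more careful initialisation, it uses k=2 on six rows from the table above, and like the notebook it treats latitude and longitude as plain plane coordinates. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared straight-line distance between two (lat, lon) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, max_iter=100):
    """Plain K-Means: returns (labels, centroids)."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    labels = None
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# (latitude, longitude) for M5A, M5B, M5C, M4E, M4J, M6H from the table above
points = [
    (43.654260, -79.360636), (43.657162, -79.378937), (43.651494, -79.375418),
    (43.676357, -79.293031), (43.685347, -79.338106), (43.669005, -79.442259),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the starting centroids, which is one reason the notebook fixes `random_state` when calling `KMeans`.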

View the clustered table.

In [20]:
toronto_boroughs_data

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,2,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,4,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
35,3,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106


Visualise the clusters over the map.

In [21]:
# create map
map_toronto_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters: one colour per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloured by cluster label (labels run from 0 to k-1)
for lat, lon, neighbourhood, cluster in zip(toronto_boroughs_data['Latitude'], toronto_boroughs_data['Longitude'], toronto_boroughs_data['Neighborhood'], toronto_boroughs_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_toronto_clusters)
       
map_toronto_clusters
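
K-Means labels run from 0 to k-1, so a palette with k entries can be indexed by the label directly. A minimal sketch of building such a palette with only the standard library's `colorsys`, offered as an alternative to the matplotlib colormap used above rather than a replacement for it; `cluster_palette` is a hypothetical helper.

```python
import colorsys


def cluster_palette(k):
    """Return k distinct hex colours, evenly spaced around the hue wheel."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(5)
# palette[label] gives the colour for a cluster label in 0..4
print(palette)
```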