## Toronto Neighbourhood Clustering

In [31]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
doc = r.text

soup = BeautifulSoup(doc, 'html.parser')

data = []
table = soup.find('table', attrs = {'class' : 'wikitable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
    
df = pd.DataFrame(data)
df.drop([0], inplace=True)
df.columns = ['PostalCode','Borough','Neighbourhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
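
The `df.drop([0], inplace=True)` call above is needed because the first row of the table holds `th` header cells, so `row.find_all('td')` returns an empty list for it. A minimal sketch of the same idea, assuming a small hand-written HTML fragment (made up for illustration, not the actual Wikipedia markup) and skipping the empty row during parsing instead of dropping it afterwards:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-row table standing in for the Wikipedia page
html = """
<table class="wikitable">
  <tbody>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "wikitable"})

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped here
        rows.append(cells)

toy = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhood"])
print(toy)
```

Filtering inside the loop means the row index starts at 0 and no separate drop step is required.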


Great! We scraped the HTML, used BeautifulSoup to extract the table, and converted it to a pandas dataframe. Before using this dataframe for data analysis, we need to clean the data. We have to

1) Ignore cells whose borough is Not assigned.

2) Join neighbourhoods that share the same postal code.

3) If a neighbourhood is Not assigned, use the borough name instead.

4) Use the .shape attribute to print the number of rows in the dataframe.
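
The four steps can be sketched on a toy frame. The rows below are hypothetical and only mimic the layout of the scraped table; the fallback in step 3 is one possible way to write it, not necessarily how this notebook handles it.

```python
import pandas as pd

# Hypothetical rows in the same layout as the scraped table
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# 1) drop rows whose borough is not assigned
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 3) fall back to the borough name where the neighbourhood is not assigned
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# 2) join neighbourhoods that share a postal code
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

# 4) number of rows and columns
print(clean.shape)
```

Step 3 runs before step 2 here so that a placeholder value never ends up inside a joined string.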

In [46]:
# neighbourhoods that remain once rows with an unassigned borough are filtered out
print(list(df[df['Borough'] != 'Not assigned']['Neighbourhood']))
print(df[df['Borough'] != 'Not assigned'].groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index())

# the transformed dataframe has the same dimensions as the csv file, which suggests it is correct
# the first print shows that no neighbourhood is 'Not assigned' once we filter out the unassigned boroughs
# so it is safe to store the result in a new dataframe and check its .shape

df1 = df[df['Borough'] != 'Not assigned'].groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df1.shape

['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights', "Queen's Park, Ontario Provincial Government", 'Islington Avenue, Humber Valley Village', 'Malvern, Rouge', 'Don Mills', 'Parkview Hill, Woodbine Gardens', 'Garden District, Ryerson', 'Glencairn', 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale', 'Rouge Hill, Port Union, Highland Creek', 'Don Mills', 'Woodbine Heights', 'St. James Town', 'Humewood-Cedarvale', 'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood', 'Guildwood, Morningside, West Hill', 'The Beaches', 'Berczy Park', 'Caledonia-Fairbanks', 'Woburn', 'Leaside', 'Central Bay Street', 'Christie', 'Cedarbrae', 'Hillcrest Village', 'Bathurst Manor, Wilson Heights, Downsview North', 'Thorncliffe Park', 'Richmond, Adelaide, King', 'Dufferin, Dovercourt Village', 'Scarborough Village', 'Fairview, Henry Farm, Oriole', 'Northwood Park, York University', 'East Toronto, Broadview North (Old East York)', '

(103, 3)

Great! The data has been grouped by postal code and the "Not assigned" rows have been removed. We can proceed to the next step of appending the latitude and longitude to the dataframe.

In [54]:
latlng = pd.read_csv('/tmp/mozilla_ro0/Geospatial_Coordinates.csv')
print(latlng)

# columns are aligned by row index, so this relies on both frames being sorted by postal code
df1['Latitude'] = latlng['Latitude']
df1['Longitude'] = latlng['Longitude']
df1.tail()

    Postal Code   Latitude  Longitude
0           M1B  43.806686 -79.194353
1           M1C  43.784535 -79.160497
2           M1E  43.763573 -79.188711
3           M1G  43.770992 -79.216917
4           M1H  43.773136 -79.239476
..          ...        ...        ...
98          M9N  43.706876 -79.518188
99          M9P  43.696319 -79.532242
100         M9R  43.688905 -79.554724
101         M9V  43.739416 -79.588437
102         M9W  43.706748 -79.594054

[103 rows x 3 columns]


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
102,M9W,Etobicoke,"Northwest, West Humber - Clairville",43.706748,-79.594054
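
Copying the coordinate columns across works here because both frames happen to be sorted by postal code. A sketch of a more defensive variant, assuming a three-row excerpt of the values shown above with the coordinate rows deliberately shuffled, is to merge on the postal code itself:

```python
import pandas as pd

# Toy excerpt; the frame names are hypothetical and coords is out of order on purpose
areas = pd.DataFrame({
    "PostalCode": ["M9N", "M9P", "M9R"],
    "Borough": ["York", "Etobicoke", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M9R", "M9N", "M9P"],
    "Latitude": [43.688905, 43.706876, 43.696319],
    "Longitude": [-79.554724, -79.518188, -79.532242],
})

# Match rows by key rather than by position; validate guards against duplicate codes
merged = areas.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With `how="left"`, a postal code missing from the coordinates file shows up as NaN instead of silently shifting the other rows.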


Great! Although geocoder was unable to query Google, we were able to use the associated csv file to add latitude and longitude to our dataframe. Now, we will proceed to train a clustering model on this data to group the neighbourhoods. We keep only the boroughs whose name contains the word 'Toronto'.

In [60]:
dft = df1[df1['Borough'].str.contains("Toronto")]
dft

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


This reveals that there are only 4 major boroughs: East, West, Central, and Downtown Toronto. Therefore, we will run the K-means clustering algorithm with k=4 centroids and create a map to visualize the neighbourhood clusters.

In [61]:
import folium
from sklearn.cluster import KMeans
print('libraries imported')

#downtown latitude longitude coordinates
latitude = 43.662301
longitude = -79.389494
map_toronto = folium.Map(location = [latitude,longitude], zoom_start=10)

#add markers
for lat, lng, borough, neighbourhood in zip(dft['Latitude'],dft['Longitude'],dft['Borough'],dft['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,  # attach the label so it is shown when a marker is clicked
        color='blue',
        fill=True,
        fill_color='#3016cc',
        fill_opacity=0.6).add_to(map_toronto)

map_toronto

libraries imported


Great! We are able to use folium to plot the Toronto neighbourhoods on a map. The anchor shape suggests there are upper, left, right, and central parts of Toronto. We will now fit KMeans with 4 centroids and look at the cluster assigned to each neighbourhood.

In [69]:
#set number of clusters
kclusters = 4

#make sure we only have the relevant numeric data
#axis is passed by keyword because recent pandas versions no longer accept it positionally
toronto_group_cluster = dft.drop(['Neighbourhood','PostalCode','Borough'], axis=1)

#initiate K-means algorithm
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_group_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_


array([3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3], dtype=int32)
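
To make the fitted attributes easier to read, here is a small sketch on made-up coordinates (four tight groups of three points, not the notebook's data). Besides `labels_`, it prints `cluster_centers_` and `n_iter_`; setting `n_init=10` explicitly is a choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (latitude, longitude) pairs: four tight groups of three points each
rng = np.random.default_rng(0)
centres = np.array([
    [43.65, -79.38],
    [43.70, -79.40],
    [43.67, -79.30],
    [43.65, -79.45],
])
points = np.vstack([c + rng.normal(scale=0.002, size=(3, 2)) for c in centres])

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

print(model.labels_)           # one cluster id per row, in input order
print(model.cluster_centers_)  # fitted centroids as (latitude, longitude)
print(model.n_iter_)           # iterations used by the best of the n_init runs
```

The cluster ids themselves are arbitrary: only the grouping is meaningful, so the same partition can come back with different numbers under a different seed.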

Great! We were able to find the cluster each point is in. Now we will create a new dataframe that includes the cluster.

In [77]:
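#dft.insert(0, 'Cluster', kmeans.labels_)
dft.head()

#The commented line above inserts the cluster labels as the first column.
#It is commented out because the Cluster column is already present in dft.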

Unnamed: 0,Cluster,PostalCode,Borough,Neighbourhood,Latitude,Longitude,clus
37,3,M4E,East Toronto,The Beaches,43.676357,-79.293031,1.0
41,3,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,
42,3,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,
43,3,M4M,East Toronto,Studio District,43.659526,-79.340923,
44,2,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,
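
`DataFrame.insert` raises a `ValueError` when the column already exists, which is probably why the line had to be commented out after its first run. A sketch of a variant that can be re-run, assuming toy stand-ins for `df1` and `kmeans.labels_` (both hypothetical here):

```python
import pandas as pd

# Hypothetical stand-ins for df1 and kmeans.labels_
df1_toy = pd.DataFrame({
    "PostalCode": ["M4E", "M4K", "M4N", "M1B"],
    "Borough": ["East Toronto", "East Toronto", "Central Toronto", "Scarborough"],
})
labels = [3, 3, 2]

# .copy() makes the filtered frame independent of df1_toy
subset = df1_toy[df1_toy["Borough"].str.contains("Toronto")].copy()

# Plain assignment overwrites an existing column, so running it twice is harmless
subset["Cluster"] = labels
subset["Cluster"] = labels

print(subset)
```

Working on an explicit copy also avoids pandas warning about writing to a slice of another dataframe.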


Alright, now we can use folium to plot the neighbourhoods coloured by cluster and compare the clusters with the boroughs of Toronto.

In [80]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#add markers to the map
for lat, lon, poi, cluster in zip(dft['Latitude'],dft['Longitude'],dft['Neighbourhood'],dft['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster-1 wraps to the last colour for cluster 0, so each cluster keeps its own colour
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

Good! The yellow colour is a little difficult to make out, but as we can see, the neighbourhoods of the various boroughs have been partitioned via K-means into their respective clusters!
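
A quick way to check how closely the clusters follow the boroughs, without relying on the map colours, is a cross-tabulation. The sketch below uses hypothetical boroughs and labels, not the notebook's results; on the real data the same call would take `dft['Borough']` and `dft['Cluster']`.

```python
import pandas as pd

# Hypothetical boroughs and cluster labels, for illustration only
toy = pd.DataFrame({
    "Borough": ["East Toronto", "East Toronto", "Central Toronto",
                "Central Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Cluster": [3, 3, 2, 2, 1, 2],
})

# Rows are boroughs, columns are cluster ids, cells count neighbourhoods
table = pd.crosstab(toy["Borough"], toy["Cluster"])
print(table)
```

A borough whose counts sit in a single column matches one cluster exactly; counts spread over several columns show where K-means, which only sees coordinates, cuts across borough boundaries.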