Segmenting and Clustering Neighbourhoods in Toronto

The main purpose of this notebook is to explore and cluster the neighbourhoods in Toronto.

Importing the libraries needed

In [5]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup

In [6]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [7]:
soup=BeautifulSoup(source,'lxml')

In [8]:
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


In [10]:
from IPython.display import display_html
tab = str(soup.table)
#display_html(tab,raw=True)

In [12]:
dfs = pd.read_html(tab)
df=dfs[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [24]:
# Dropping the rows where Borough is 'Not assigned'
df= df[df.Borough != 'Not assigned']

In [25]:
# Combining the neighbourhoods that share a postal code
df = df.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)

In [26]:
# Replacing 'Not assigned' neighbourhood names with the name of the borough
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned',df['Borough'], df['Neighbourhood'])
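
The three cleaning steps above (drop unassigned boroughs, join neighbourhoods per postal code, fall back to the borough name) can be tried on a small frame without scraping anything. This is a minimal sketch; the rows in `raw` are illustrative values made up for the example, not the scraped table, and the names `raw` and `clean` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative only).
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# 1. Drop rows that have no borough.
clean = raw[raw['Borough'] != 'Not assigned']

# 2. One row per postal code, neighbourhoods joined with ', '.
clean = (clean.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())

# 3. Use the borough name where the neighbourhood is still 'Not assigned'.
clean['Neighbourhood'] = np.where(clean['Neighbourhood'] == 'Not assigned',
                                  clean['Borough'], clean['Neighbourhood'])

print(clean)
```

Selecting the 'Neighbourhood' column before `agg` keeps the join from being applied to any other string column that might be present.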


In [28]:
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [29]:
df.shape

(103, 3)

Importing the CSV file containing the latitude and longitude of each postal code

In [40]:
temp_df=df.groupby('Postal Code')['Neighbourhood'].apply(lambda x: "%s" % ', '.join(x))
temp_df=temp_df.reset_index(drop=False)
temp_df.rename(columns={'Neighbourhood':'Neighbourhood_joined'},inplace=True)

In [41]:
df_merge = pd.merge(df, temp_df, on='Postal Code')

Merging the two tables to get the latitude and longitude of each neighbourhood

In [42]:
df_merge.drop(['Neighbourhood'],axis=1,inplace=True)


In [43]:
df_merge.drop_duplicates(inplace=True)

In [44]:
df_merge.rename(columns={'Neighbourhood_joined':'Neighbourhood'},inplace=True)

In [45]:
df_merge.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [46]:
df_merge.shape

(103, 3)
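
The shape is (103, 3) both before and after cells In [40] to In [44], which suggests each postal code already appeared once after the earlier groupby. A quick way to check that before re-grouping is sketched below; the three rows are copied from the output above, and `already_unique` and `unchanged` are names chosen for this example.

```python
import pandas as pd

# Stand-in for df after the earlier groupby (rows copied from the output above).
df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# True when no postal code is repeated.
already_unique = df['Postal Code'].is_unique

# Re-grouping a frame with unique postal codes leaves the neighbourhoods as they were.
regrouped = (df.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
               .agg(', '.join)
               .reset_index())
unchanged = regrouped['Neighbourhood'].tolist() == df['Neighbourhood'].tolist()

print(already_unique, unchanged)
```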

In [47]:
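import geocoder  # third-party package; it is not imported earlier in this notebook

def get_geocode(postal_code):
    # Query the geocoder until it returns coordinates.
    # Note: this loops forever if the service never returns a result.
    lat_lng_coords = None
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude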
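
The lookup above retries without a limit. One possible variant, sketched here under assumptions, caps the number of attempts and takes the lookup function as a parameter so it can be exercised without network access. `geocode_with_retry`, `flaky_lookup`, `max_attempts` and `delay_seconds` are hypothetical names for this example and are not part of the geocoder package.

```python
import time

def geocode_with_retry(lookup, postal_code, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a failed lookup

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.676357, -79.293031]

result = geocode_with_retry(flaky_lookup, 'M4E')
print(result, calls['n'])
```

With a real service, `lookup` could be a small wrapper that returns the `latlng` attribute of the geocoder response.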

In [54]:
geo_df=pd.read_csv('http://cocl.us/Geospatial_data')

In [55]:
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [56]:
# Both tables already use the column name 'Postal Code', so no rename is needed before merging
geo_merged = pd.merge(geo_df, df_merge, on='Postal Code')

In [57]:
geo_merged.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighbourhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
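
The default inner merge silently drops postal codes that appear in only one table. A hedged sketch of how the merge could be checked uses the `validate` and `indicator` options of `pd.merge`; the frames below are small stand-ins, and the code 'M9Z' with zero coordinates is made up to show an unmatched row.

```python
import pandas as pd

geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],   # 'M9Z' is a made-up code with no match
    'Latitude': [43.806686, 43.784535, 0.0],
    'Longitude': [-79.194353, -79.160497, 0.0],
})
hoods = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})

# validate='one_to_one' raises MergeError if either side repeats a postal code;
# indicator=True adds a '_merge' column recording which table each row came from.
checked = pd.merge(geo, hoods, on='Postal Code', how='outer',
                   validate='one_to_one', indicator=True)
unmatched = checked.loc[checked['_merge'] != 'both', 'Postal Code'].tolist()

print(unmatched)
```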


Clustering and plotting the neighbourhoods whose borough name contains 'Toronto'

In [66]:
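# .copy() makes df an independent frame, so adding a column later does not touch geo_merged
df = geo_merged[geo_merged['Borough'].str.contains('Toronto', regex=False)].copy()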

In [67]:
df

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighbourhood
37,M4E,43.676357,-79.293031,East Toronto,The Beaches
41,M4K,43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale"
42,M4L,43.668999,-79.315572,East Toronto,"India Bazaar, The Beaches West"
43,M4M,43.659526,-79.340923,East Toronto,Studio District
44,M4N,43.72802,-79.38879,Central Toronto,Lawrence Park
45,M4P,43.712751,-79.390197,Central Toronto,Davisville North
46,M4R,43.715383,-79.405678,Central Toronto,"North Toronto West, Lawrence Park"
47,M4S,43.704324,-79.38879,Central Toronto,Davisville
48,M4T,43.689574,-79.38316,Central Toronto,"Moore Park, Summerhill East"
49,M4V,43.686412,-79.400049,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest..."
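
The filter keeps a row when its borough name contains the literal substring 'Toronto'. A minimal sketch of that behaviour on the borough names seen in this notebook; the Series `boroughs` is built by hand for the example.

```python
import pandas as pd

boroughs = pd.Series(['North York', 'Downtown Toronto', 'East Toronto',
                      'Scarborough', 'Central Toronto', 'Etobicoke'])

# regex=False treats 'Toronto' as a plain substring rather than a pattern.
# The match is case-sensitive by default.
mask = boroughs.str.contains('Toronto', regex=False)
kept = boroughs[mask].tolist()

print(kept)
```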


Visualizing all the Neighbourhoods of the above data frame using Folium

In [77]:
import folium # plotting library
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)



In [78]:
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

In [79]:
map_toronto

See the README for the map (it does not render on GitHub).
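
The map centre above is a hard-coded coordinate pair. One alternative, sketched here as an assumption rather than what the notebook does, is to derive the centre from the data by averaging the coordinates. The five rows are copied from the filtered table above, and `coords` and `centre` are names chosen for the example.

```python
import pandas as pd

# First five coordinate pairs from the filtered table above.
coords = pd.DataFrame({
    'Latitude': [43.676357, 43.679557, 43.668999, 43.659526, 43.72802],
    'Longitude': [-79.293031, -79.352188, -79.315572, -79.340923, -79.38879],
})

# Mean latitude and longitude, in the [lat, lon] order used for the map location.
centre = [coords['Latitude'].mean(), coords['Longitude'].mean()]

print(centre)
```

The resulting pair could be passed as `location=` to `folium.Map` in place of the literal values.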

KMeans clustering

In [81]:
from sklearn.cluster import KMeans
k=5
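# Keep only Latitude and Longitude as clustering features
toronto_clustering = df.drop(['Postal Code','Borough','Neighbourhood'], axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
kmeans.labels_
# Adds the 'Cluster Labels' column used below; run it once only, since insert raises ValueError if the column already exists
#df.insert(0, 'Cluster Labels', kmeans.labels_)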

array([4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3, 3, 1, 1, 1, 2, 2, 2, 1, 0, 1, 1, 0, 0, 0, 2, 4], dtype=int32)
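
To make the clustering step less of a black box, here is a minimal NumPy sketch of the basic k-means (Lloyd) iteration on ten coordinate pairs copied from the filtered table above. It is an illustration only. scikit-learn's KMeans uses k-means++ initialisation by default, so its labels will not necessarily match this sketch; `lloyd_kmeans` and its parameters are hypothetical names.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: assign to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the centroids that are returned.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Latitude/longitude pairs copied from the table above.
points = np.array([
    [43.676357, -79.293031], [43.679557, -79.352188],
    [43.668999, -79.315572], [43.659526, -79.340923],
    [43.728020, -79.388790], [43.712751, -79.390197],
    [43.715383, -79.405678], [43.704324, -79.388790],
    [43.689574, -79.383160], [43.686412, -79.400049],
])
labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
print(centroids)
```

Both the sketch and the notebook measure distance in raw degrees. At Toronto's latitude a degree of longitude covers less ground than a degree of latitude, so this is an approximation of geographic distance.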

In [82]:
df

Unnamed: 0,Cluster Labels,Postal Code,Latitude,Longitude,Borough,Neighbourhood
37,4,M4E,43.676357,-79.293031,East Toronto,The Beaches
41,4,M4K,43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale"
42,4,M4L,43.668999,-79.315572,East Toronto,"India Bazaar, The Beaches West"
43,4,M4M,43.659526,-79.340923,East Toronto,Studio District
44,2,M4N,43.72802,-79.38879,Central Toronto,Lawrence Park
45,2,M4P,43.712751,-79.390197,Central Toronto,Davisville North
46,2,M4R,43.715383,-79.405678,Central Toronto,"North Toronto West, Lawrence Park"
47,2,M4S,43.704324,-79.38879,Central Toronto,Davisville
48,2,M4T,43.689574,-79.38316,Central Toronto,"Moore Park, Summerhill East"
49,2,M4V,43.686412,-79.400049,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest..."


In [86]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)
import matplotlib.cm as cm
import matplotlib.colors as colors
# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

In [87]:
map_clusters

See the README for the cluster map.
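
A note on the colour lookup `rainbow[cluster-1]`: KMeans labels start at 0, so label 0 maps to index -1, which Python reads as the last list element. Every cluster still gets a distinct colour, but shifted by one position. The sketch below shows the effect with a hypothetical five-colour palette; these hex values are placeholders, not the exact output of `cm.rainbow`.

```python
# Hypothetical palette standing in for the `rainbow` list.
palette = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
k = len(palette)

# Lookup as written in the notebook: label 0 wraps around to the last colour.
shifted = [palette[c - 1] for c in range(k)]

# Direct lookup: label c gets colour c.
direct = [palette[c] for c in range(k)]

print(shifted)
print(direct)
```

Using `palette[c]` directly avoids relying on negative indexing, at the cost of assigning different colours than the map described in the README.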