<h1>Segmentation and clustering Toronto</h1>

<h3>To see the result of a commented-out command, remove the leading # and run the cell.</h3>

We use requests and BeautifulSoup to scrape the table of Toronto postal codes from Wikipedia, then pass it to pandas. Run the 'df' command to see the entire dataset.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
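# newer pandas versions deprecate passing literal HTML to read_html, so wrap it in StringIO
from io import StringIO
df_can = pd.read_html(StringIO(str(table)))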


The original shape of the dataframe is (288,3).

In [2]:
df = df_can[0]

df.columns = df.iloc[0]
df = df[1:]

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


In [3]:
df.shape

(288, 3)

Now we begin cleaning the dataset. First, we remove the records that have no Borough assigned. Running the 'df.shape' command afterwards gives (211, 3).

In [4]:
# .copy() makes the filtered frame independent, so later assignments do not act on a view
df = df[df.Borough != 'Not assigned'].copy()

#df.shape

Next, where a Postcode has a Borough but its Neighbourhood is 'Not assigned', we set the Neighbourhood to the Borough name.

In [5]:
# only Neighbourhood needs filling; Borough is already set on these rows
mask = df.Neighbourhood == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']


df.shape

(211, 3)
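
The two cleaning steps above can be sketched on a small hypothetical frame (the rows below are made up for illustration and are not taken from the scraped table). The sketch assumes that only the Neighbourhood column needs to be filled, and it copies the filtered frame before assigning to it.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Step 1: drop rows without a Borough; .copy() keeps later assignments unambiguous
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Step 2: where the Neighbourhood is missing, reuse the Borough name
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Only the first step removes rows; the second changes values in place, which is why the shape stays the same after it.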

Below we can see all rows that share a postcode with another row, i.e. postcodes that list multiple neighbourhoods. Because of 'keep=False', this list includes the first entry for each such postcode as well as the repeats. Run the 'dt' command to see this subset, or the 'dt.shape' command to count the rows involved (first entries and repeats together).

In [6]:
dt = df[df.duplicated('Postcode', keep=False)]

dt.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
12,M1B,Scarborough,Rouge


In [7]:
dt.shape

(165, 3)
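
To make the effect of 'keep=False' concrete, here is a small sketch on hypothetical rows. It compares the default behaviour of `duplicated` with the variant used above.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M7A'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights', "Queen's Park"],
})

# Default keep='first': only the later repeats of a Postcode are flagged
repeats_only = toy[toy.duplicated('Postcode')]

# keep=False: every row whose Postcode occurs more than once is flagged
all_shared = toy[toy.duplicated('Postcode', keep=False)]

print(len(repeats_only), len(all_shared))
```

With the default setting the first 'M5A' row would be missing from the subset, so the count would understate how many rows take part in the merge that follows.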

Here we merge the rows that share a Postcode, joining their Neighbourhood names with commas. Running the 'df_mod' command shows the final output.

The last (commented) command returns an empty dataframe, because no Postcode appears more than once after the merge.

In [8]:
df_mod = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()

df_mod.head()

#df_mod[df_mod.duplicated('Postcode', keep=False)]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


The final shape of the dataframe is as below:

In [9]:
df_mod.shape

(103, 3)
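
The merge step can be sketched on a few hypothetical rows. Note that `groupby` sorts the group keys by default, which would explain why the merged output above starts at M1B rather than at the first row of the scraped table.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ', '
merged = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)

# After merging, each Postcode should appear exactly once
assert merged['Postcode'].is_unique
print(merged)
```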

<h2>Part 2</h2>

In [10]:
#!wget -q -O 'toronto_data.csv' https://cocl.us/Geospatial_data

df2 = pd.read_csv('toronto_data.csv')

df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
df3 = pd.merge(df_mod, df2, how='inner', left_on = 'Postcode', right_on='Postal Code')

df4 = df3.drop(['Postal Code'], axis=1)

df4.shape

(103, 5)
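
An inner merge silently drops postcodes that have no coordinates. The sketch below, on hypothetical rows, shows one way to check for that before trusting the row count: a left merge with `indicator=True` exposes unmatched keys, and `validate='one_to_one'` raises if either side has repeated keys. The postcode 'M9Z' and its borough are invented for the example.

```python
import pandas as pd

# Hypothetical sample rows; 'M9Z' is an invented code with no coordinates
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# Left merge keeps every postcode; the _merge column records where each row came from
check = pd.merge(left, right, how='left', left_on='Postcode',
                 right_on='Postal Code', indicator=True, validate='one_to_one')

# Postcodes that would be lost by an inner merge
unmatched = check.loc[check['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

In the notebook the merged frame keeps all 103 rows, so every postcode found a match and this check would return an empty list.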

<h2>Part 3</h2>

In [12]:
#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [13]:
#!conda install -c conda-forge folium=0.5.0 --yes 

In [14]:
import folium # map rendering library

In [15]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [16]:
neighborhoods = df4

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  # parse_html belongs to Popup, not CircleMarker

map_toronto

In [17]:
tor_data = neighborhoods[neighborhoods['Borough'].str.contains('Toronto')].reset_index(drop=True)

tor_data.shape

(38, 5)
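
The borough filter relies on a substring match. A minimal sketch on hypothetical rows shows what it keeps; `regex=False` and `na=False` are optional additions that are not in the cell above.

```python
import pandas as pd

# Hypothetical sample rows, including one with a missing Borough
toy = pd.DataFrame({
    'Borough': ['East Toronto', 'North York', 'Downtown Toronto', None],
    'Neighbourhood': ['The Beaches', 'Parkwoods', 'Harbourfront', 'Unknown'],
})

# regex=False treats 'Toronto' as plain text; na=False counts a missing Borough as no match
mask = toy['Borough'].str.contains('Toronto', regex=False, na=False)
toronto_only = toy[mask].reset_index(drop=True)

print(toronto_only['Borough'].tolist())
```

Resetting the index matters later: the cluster labels are attached by position, so the filtered frame should be numbered from 0 without gaps.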

In [18]:
tor_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [19]:
map_toronto_only = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Borough'], tor_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto_only)  # parse_html belongs to Popup, not CircleMarker

map_toronto_only

<h1>Clustering</h1>

In [21]:
from sklearn.cluster import KMeans

In [22]:
kclusters = 4

toronto_loc = tor_data.drop(['Postcode','Borough','Neighbourhood'], axis=1)

toronto_loc.head()

Unnamed: 0,Latitude,Longitude
0,43.676357,-79.293031
1,43.679557,-79.352188
2,43.668999,-79.315572
3,43.659526,-79.340923
4,43.72802,-79.38879


In [23]:
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_loc)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 2, 2, 2, 2, 2, 2], dtype=int32)
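
To make the fitting step less opaque, here is a NumPy-only sketch of Lloyd's iteration, the assign-then-update loop that k-means is built on. It is not scikit-learn's implementation, which by default uses a smarter initialisation (k-means++). The function name `kmeans_sketch` and the six coordinates are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Plain Lloyd's iteration with random initial centres (illustrative only)."""
    rng = np.random.default_rng(seed)
    # pick k distinct points as the starting centres
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every centre, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # assign each point to its nearest centre
        # move each centre to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centres stopped moving, so the assignment is stable
        centers = new_centers
    return labels, centers

# Hypothetical (latitude, longitude) pairs forming two well-separated groups
pts = np.array([[43.65, -79.38], [43.66, -79.39], [43.65, -79.40],
                [43.72, -79.29], [43.73, -79.30], [43.72, -79.28]])
labels, centers = kmeans_sketch(pts, k=2)
print(labels)
```

The label numbers themselves are arbitrary: which group is called 0 depends on the initial centres. Only the grouping carries meaning, which is also true of the `kmeans.labels_` array above.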

In [24]:
tor_data.insert(0, 'Cluster Labels', kmeans.labels_)

In [25]:
tor_data.head()

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,0,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,0,M4M,East Toronto,Studio District,43.659526,-79.340923
4,2,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
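
One practical caveat with `insert`: it raises a `ValueError` if the column already exists, so re-running that cell fails. The sketch below, on hypothetical rows, shows the failure and a re-runnable alternative, and then counts how many rows fall in each cluster. The `labels` list is a stand-in for `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical sample rows; labels stands in for kmeans.labels_
toy = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C'],
                    'Latitude': [43.67, 43.68, 43.73]})
labels = [0, 0, 2]

toy.insert(0, 'Cluster Labels', labels)

# Re-running the same insert fails because the column already exists
try:
    toy.insert(0, 'Cluster Labels', labels)
    rerun_ok = True
except ValueError:
    rerun_ok = False

# Plain assignment overwrites the existing column, so it is safe to re-run
toy['Cluster Labels'] = labels

# Number of rows per cluster, ordered by label
sizes = toy['Cluster Labels'].value_counts().sort_index()
print(rerun_ok, sizes.to_dict())
```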


In [26]:
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

In [27]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(tor_data['Latitude'], tor_data['Longitude'], tor_data['Neighbourhood'], tor_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # labels run from 0 to kclusters-1, so index the palette directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters