<h2>Segmenting and Clustering Neighbourhoods in Toronto</h2>

This project scrapes the Wikipedia page listing the postal codes of Canada that begin with M, then cleans and processes the data for clustering. Clustering is carried out with K-Means, and the clusters are plotted using the Folium library. The boroughs whose names contain 'Toronto' are plotted first, then clustered and plotted again.



<h3>All three tasks (<i>web scraping</i>, <i>cleaning</i> and <i>clustering</i>) are implemented in the same notebook for ease of evaluation.</h3>

<h3>Installing and Importing the required Libraries</h3>

In [1]:
!pip install beautifulsoup4
!pip install lxml
!conda install -c conda-forge folium=0.5.0 --yes
#!conda install -c conda-forge geopy --yes

import random  # library for random number generation

import requests  # library to handle HTTP requests
import numpy as np  # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
# flattens JSON into a DataFrame (pandas >= 1.0; older versions use pandas.io.json)
from pandas import json_normalize

from bs4 import BeautifulSoup  # HTML parsing for web scraping
from geopy.geocoders import Nominatim  # converts an address into latitude and longitude values

# helpers for displaying images and HTML in the notebook
from IPython.display import HTML, Image, display_html

import folium  # map plotting library
from sklearn.cluster import KMeans  # K-Means clustering
import matplotlib.cm as cm  # colour maps for the cluster markers
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Traceback (most recent call last):
  File "/anaconda3/bin/pip", line 7, in <module>
    from pip._internal import main
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/__init__.py", line 40, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/autocompletion.py", line 8, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/main_parser.py", line 12, in <module>
    from pip._internal.commands import (
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/__init__.py", line 6, in <module>
    from pip._internal.commands.completion import CompletionCommand
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/commands/completion.py", line 6, in <module>
    from pip._internal.cli.base_command import Command
  File "/anaconda3/lib/python3.7/site-packages/pip/_internal/cli/base_comman

<h3>Scraping the Wikipedia page for the table of postal codes of Canada</h3>

The BeautifulSoup library is used to scrape the table from the Wikipedia page. The page title is printed to confirm that the page was retrieved successfully, and the table of postal codes is then displayed.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)  # sanity check: the title shows which page was fetched

# soup.table is the first <table> element on the page
tab = str(soup.table)
display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal Code,Borough,Neighborhood
M1A,Not assigned,
M2A,Not assigned,
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"
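The same kind of table can also be walked row by row with BeautifulSoup itself. The following is a minimal sketch on a small inline HTML snippet so that it runs offline; the `sample_html` string and the `table_to_rows` helper are made up for illustration and are not part of the notebook, and the built-in `html.parser` backend is used here instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first table in `html` as a list of rows of cell text."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.table.find_all('tr'):
        # header cells are <th>, data cells are <td>
        cells = tr.find_all(['th', 'td'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

rows = table_to_rows(sample_html)
for row in rows:
    print(row)
```

Each row comes back as a plain list of strings, which could then be passed to `pd.DataFrame(rows[1:], columns=rows[0])` if a DataFrame is wanted.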


<h3>The HTML table is converted to a pandas DataFrame for cleaning and preprocessing.</h3>

In [115]:
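from io import StringIO

# read_html returns a list of DataFrames; newer pandas expects a file-like object
dfs = pd.read_html(StringIO(tab), header=0)
df = dfs[0]
df.head()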

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<h3>Data preprocessing and cleaning</h3>

In [116]:
df.columns

Index(['Postal Code', 'Borough', 'Neighborhood'], dtype='object')

In [121]:
# Dropping the rows where Borough is 'Not assigned'
df1 = df[df.Borough != 'Not assigned']

# Combining the neighbourhoods that share the same postal code
df2 = df1.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

# Replacing the name of the neighbourhoods which are 'Not assigned' with names of Borough
df2['Neighborhood'] = np.where(df2['Neighborhood'] == 'Not assigned',df2['Borough'], df2['Neighborhood'])

df2

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [122]:
# Shape of data frame
df2.shape

(103, 3)
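To make the three cleaning steps easier to follow, here is a small sketch that applies them to made-up rows. The `toy` frame and its values are assumptions chosen to exercise each step (an unassigned borough, a postal code split over two rows, and an unassigned neighbourhood); they are not taken from the scraped table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows in the same layout as the scraped table
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                     'Harbourfront', 'Not assigned'],
})

# 1. drop rows without a borough
assigned = toy[toy['Borough'] != 'Not assigned']

# 2. one row per postal code, neighbourhood names joined with ', '
cleaned = (
    assigned.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

# 3. fall back to the borough name where the neighbourhood is unassigned
cleaned['Neighborhood'] = np.where(
    cleaned['Neighborhood'] == 'Not assigned',
    cleaned['Borough'],
    cleaned['Neighborhood'],
)
print(cleaned)
```

In this sketch the two `M5A` rows collapse into one, and the `M7A` row takes its borough name as its neighbourhood.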

<h3>Importing the CSV file containing the latitudes and longitudes for various neighbourhoods in Canada</h3>

In [143]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.81,-79.19
1,M1C,43.78,-79.16
2,M1E,43.76,-79.19
3,M1G,43.77,-79.22
4,M1H,43.77,-79.24


<h3>Merging the two tables for getting the Latitudes and Longitudes for various neighbourhoods in Canada</h3>

In [146]:
lat_lon.rename(columns={'Postal Code':'Postcode'},inplace=True)
df2.rename(columns={'Postal Code':'Postcode'},inplace=True)
lat_lon.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.81,-79.19
1,M1C,43.78,-79.16
2,M1E,43.76,-79.19
3,M1G,43.77,-79.22
4,M1H,43.77,-79.24


In [147]:
# Join the two tables on the shared 'Postcode' column (inner join by default)
df3 = pd.merge(df2, lat_lon, on='Postcode')
df3.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75,-79.33
1,M4A,North York,Victoria Village,43.73,-79.32
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65,-79.36
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72,-79.46
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66,-79.39
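An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on made-up frames, is a left join with `indicator=True`; the `areas` and `coords` frames and the `M9Z` code are hypothetical and only serve to produce one unmatched row.

```python
import pandas as pd

# Hypothetical sample frames; 'M9Z' deliberately has no coordinates
areas = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
coords = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Latitude': [43.75, 43.73],
    'Longitude': [-79.33, -79.32],
})

# A left join keeps every area; indicator=True adds a '_merge' column
# that says whether each row found a match in `coords`
checked = pd.merge(areas, coords, on='Postcode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.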


<h2>From here on, the notebook covers the clustering and plotting of the neighbourhoods whose Borough contains 'Toronto'</h2>

<h3>Selecting all the rows of the data frame whose Borough contains 'Toronto'.</h3>

In [148]:
df4 = df3[df3['Borough'].str.contains('Toronto',regex=False)]
df4

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65,-79.36
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66,-79.39
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.66,-79.38
15,M5C,Downtown Toronto,St. James Town,43.65,-79.38
19,M4E,East Toronto,The Beaches,43.68,-79.29
20,M5E,Downtown Toronto,Berczy Park,43.64,-79.37
24,M5G,Downtown Toronto,Central Bay Street,43.66,-79.39
25,M6G,Downtown Toronto,Christie,43.67,-79.42
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.65,-79.38
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.67,-79.44


<h3>Visualizing all the Neighbourhoods of the above data frame using Folium</h3>

In [151]:
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# one circle marker per neighbourhood, with a "neighbourhood, borough" popup
for lat, lng, borough, neighborhood in zip(df4['Latitude'], df4['Longitude'], df4['Borough'], df4['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

<h3>The map might not be visible on GitHub. Check out the README for the map.</h3>
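The map is centred on hard-coded coordinates. One alternative, sketched here on made-up sample points, is to derive the centre from the data being plotted; `sample` and `center` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical coordinates standing in for the Latitude/Longitude columns
sample = pd.DataFrame({
    'Latitude': [43.65, 43.66, 43.68],
    'Longitude': [-79.36, -79.39, -79.29],
})

# The mean of each column gives a rough centre for the plotted points
center = [sample['Latitude'].mean(), sample['Longitude'].mean()]
print(center)

# The result could then be passed on, e.g. folium.Map(location=center, zoom_start=10)
```

The mean is only a rough centre; for widely scattered points the midpoint of the bounding box may frame the map better.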

<h3>Using KMeans clustering to cluster the neighbourhoods</h3>

In [153]:
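k = 5
# cluster on the coordinates only
toronto_clustering = df4.drop(columns=['Postcode', 'Borough', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# df4 is a slice of df3, so copy it before adding a column
df4 = df4.copy()
df4.insert(0, 'Cluster Labels', kmeans.labels_)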


In [154]:
df4

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65,-79.36
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66,-79.39
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.66,-79.38
15,0,M5C,Downtown Toronto,St. James Town,43.65,-79.38
19,4,M4E,East Toronto,The Beaches,43.68,-79.29
20,0,M5E,Downtown Toronto,Berczy Park,43.64,-79.37
24,0,M5G,Downtown Toronto,Central Bay Street,43.66,-79.39
25,3,M6G,Downtown Toronto,Christie,43.67,-79.42
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.65,-79.38
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.67,-79.44


In [156]:
# create map
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloured by cluster label
for lat, lon, neighbourhood, cluster in zip(df4['Latitude'], df4['Longitude'], df4['Neighborhood'], df4['Cluster Labels']):
    label = folium.Popup('{} Cluster {}'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

<h3>The map might not be visible on GitHub. Check out the README for the map.</h3>