# Segmenting and Clustering the Neighborhoods in Toronto

*by: Sabrina El Mouhib*

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto.

The Toronto neighborhood data comes from a Wikipedia page that lists the postal codes, boroughs, and neighbourhoods we need. You will scrape that page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.

# 1. Download The Required Libraries 


In [1]:
# installing beautiful soup for pulling data out of HTML files 
!pip install beautifulsoup4
!pip install lxml



In [2]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests
from bs4 import BeautifulSoup
from IPython.display import display_html

# Scraping the data from the Wikipedia page


In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(source.text, 'lxml')
table = soup.find('table', class_ = 'wikitable sortable') 
# print the title of the page to make sure the right page has been scraped 
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


In [4]:
# render the scraped Wikipedia table as HTML inside the notebook
table = str(soup.table)
display_html(table,raw=True)

Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"


In [5]:
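# convert the HTML table to a DataFrame (read_html returns a list of tables)
from io import StringIO  # newer pandas versions deprecate passing a raw HTML string directly
dfs = pd.read_html(StringIO(table))
df = dfs[0]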
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


### Cleaning and Preprocessing The Data 

In [6]:
# drop the rows whose Borough is 'Not assigned'
df_clean=df[df['Borough']!='Not assigned']
df_clean

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
# combine neighborhoods with the same postal code 
df_clean2= df_clean.groupby(['Postal Code','Borough'],sort=False).agg(','.join)
df2=df_clean2.reset_index()
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
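
The rows shown above each carry a single postal code, so the effect of the grouping step is not visible in the output. As a minimal sketch on hypothetical toy data (the duplicated `M5A` rows are invented for illustration), this is how grouping on the key columns and joining the names collapses repeated postal codes into one row:

```python
import pandas as pd

# Hypothetical toy data: postal code M5A appears on two separate rows.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on the key columns and join the neighbourhood names within each group.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Selecting the `Neighbourhood` column before aggregating keeps the join limited to the text column, which avoids errors if numeric columns are added to the frame later.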


In [8]:
# Replacing the name of the neighbourhoods which are 'Not assigned' with names of Borough
df2['Neighbourhood'] = np.where(df2['Neighbourhood'] == 'Not assigned',df2['Borough'], df2['Neighbourhood'])
df2

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
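
None of the ten rows displayed above has an unassigned neighbourhood, so the replacement is easier to see on a small example. The sketch below uses hypothetical rows (the `Queen's Park` borough entry is only illustrative) to show how `np.where` falls back to the borough name:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where the neighbourhood is 'Not assigned', take the borough name instead;
# otherwise keep the existing neighbourhood.
toy["Neighbourhood"] = np.where(
    toy["Neighbourhood"] == "Not assigned",
    toy["Borough"],
    toy["Neighbourhood"],
)
print(toy)
```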


In [9]:
df2.shape

(103, 3)

### Create dataframe with Latitude and Longitude values for postal codes

In [12]:
# read the Latitude and Longitude values from the CSV file
lat_lon_values = pd.read_csv('http://cocl.us/Geospatial_data')
lat_lon_values.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
# merge the two tables to get the final DataFrame
df_final = pd.merge(df2,lat_lon_values,on='Postal Code')
df_final.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
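
`pd.merge` defaults to an inner join, so any postal code missing from the coordinates file would be dropped without warning. One way to check for that, sketched here on hypothetical data (the code `M9Z` is invented and has no coordinates), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned table and the coordinates file.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how='left' keeps every postal code from the cleaned table,
# indicator=True adds a '_merge' column recording where each row was found,
# validate='one_to_one' raises if a postal code is duplicated on either side.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   indicator=True, validate="one_to_one")

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If the list of unmatched codes is empty, the inner join used above loses no rows.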


In [15]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
print('folium installed')


### Create map of Toronto using Folium and add neighborhoods as markers to the map

In [18]:
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# Geographical coordinates of Toronto, needed to center the map of its neighbourhoods
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto")
location = geolocator.geocode(address)
latitude_toronto = location.latitude
longitude_toronto = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_toronto, longitude_toronto))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [39]:
import matplotlib.cm as cm
import matplotlib.colors as colors
from IPython.display import Image 
from IPython.core.display import HTML 

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude_toronto,longitude_toronto], zoom_start=10)

# add neighborhoods as markers to map
for lat, lng, borough, Neighbourhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The map may not be visible on GitHub. Please refer to the README.

### Cluster neighborhoods using k-means



In [41]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [61]:
# number of clusters chosen is 5
k = 5
# keep only Latitude and Longitude as clustering features
toronto_clustering = df_final.drop(['Postal Code','Borough','Neighbourhood'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_


array([0, 0, 1, 4, 1, 3, 2, 0, 0, 1, 4, 3, 2, 0, 0, 1, 1, 3, 2, 0, 1, 1,
       2, 0, 1, 1, 2, 4, 4, 0, 1, 1, 2, 4, 4, 0, 1, 1, 2, 4, 4, 0, 1, 1,
       0, 4, 3, 0, 1, 3, 3, 2, 4, 3, 0, 4, 3, 3, 0, 4, 3, 4, 4, 3, 3, 2,
       4, 4, 1, 3, 3, 2, 4, 4, 1, 1, 3, 3, 2, 1, 1, 3, 2, 1, 1, 2, 1, 1,
       3, 3, 2, 1, 1, 3, 3, 2, 1, 1, 3, 1, 0, 3, 3], dtype=int32)
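
The notebook fixes the number of clusters at 5 without stating how it was chosen. A common heuristic, sketched below on synthetic coordinates (the three blob centers and the noise scale are assumptions made for illustration, not the Toronto data), is to fit k-means for a range of `k` and look for the point where the inertia stops dropping sharply:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic latitude/longitude points around three hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.77, -79.25], [43.70, -79.52]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia
# (the within-cluster sum of squared distances) of each fit.
inertias = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(n_clusters, round(inertia, 4))
```

On real data the curve is usually less clear-cut than on synthetic blobs, so the "elbow" is a guide rather than a rule.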

In [62]:
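# insert the cluster labels into the data frame
# errors='ignore' lets the cell run whether or not label columns from an earlier run exist
data_frame = df_final.drop(columns=['Labels','Cluster Labels'], errors='ignore')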
data_frame.insert(0,'Label Cluster',kmeans.labels_)
data_frame

Unnamed: 0,Label Cluster,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,0,M3A,North York,Parkwoods,43.753259,-79.329656
1,0,M4A,North York,Victoria Village,43.725882,-79.315572
2,1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,4,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,3,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,2,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,0,M3B,North York,Don Mills,43.745906,-79.352188
8,0,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
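
Once each row has a label, a per-cluster summary is a quick way to check that the clusters are geographically sensible. The sketch below uses a hypothetical five-row frame with the same column names as `data_frame`; the coordinates are rounded, illustrative values:

```python
import pandas as pd

# Hypothetical labelled rows with the same column names as data_frame.
toy = pd.DataFrame({
    "Label Cluster": [0, 0, 1, 1, 2],
    "Latitude": [43.75, 43.73, 43.65, 43.66, 43.81],
    "Longitude": [-79.33, -79.32, -79.36, -79.39, -79.19],
})

# Count the rows in each cluster and compute the mean coordinates,
# which approximate the cluster centers found by k-means.
summary = toy.groupby("Label Cluster").agg(
    size=("Latitude", "size"),
    mean_lat=("Latitude", "mean"),
    mean_lon=("Longitude", "mean"),
)
print(summary)
```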


In [64]:
# create map of Toronto using latitude and longitude values
map_clusters = folium.Map(location=[latitude_toronto,longitude_toronto], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(data_frame['Latitude'], data_frame['Longitude'], data_frame['Neighbourhood'], data_frame['Label Cluster']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Please refer to README for the map 
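
The color scheme in the cell above builds an intermediate `ys` list only to obtain its length, and indexes the palette with `cluster-1`, so label 0 wraps around to the last color. A simpler equivalent, offered as a sketch rather than as the notebook's own method, is to sample `k` colors directly and index them by the label itself:

```python
import numpy as np
from matplotlib import cm, colors

k = 5  # same number of clusters as in the notebook

# Sample k evenly spaced colors from the rainbow colormap and convert to hex.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Each cluster label 0..k-1 maps directly to one color.
for cluster in range(k):
    print(cluster, palette[cluster])
```

Either indexing gives every cluster a distinct color; indexing by the label directly just makes the mapping easier to read in a legend.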