## Toronto Neighborhoods Segmenting and Clustering

#### *Author: Mohammad Sayeb*

#### Let's import the relevant modules

In [16]:
import pandas as pd  # DataFrames for efficient tabular data manipulation
import numpy as np  # multidimensional arrays and matrices
import plotly  # visualization tool
import matplotlib  # visualization tool
import matplotlib.pyplot as plt  # plotting interface
import requests  # fetching web pages over HTTP
import bs4  # Beautiful Soup library for website scraping
from bs4 import BeautifulSoup  # scraping tool
import lxml  # parser backend pd.read_html uses to turn HTML tables into DataFrames
import geocoder  # converts addresses to latitude/longitude

## Section 1

We scrape the table data from the Wikipedia page and load it into a DataFrame.

In [17]:
URL='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

In [18]:
table = soup.find_all('table')

In [19]:
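from io import StringIO  # newer pandas versions expect a file-like object instead of a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]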

In [20]:
df.shape

(180, 3)
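To make the table-to-DataFrame step easier to follow, here is a minimal sketch that does the same conversion by hand on a small inline HTML snippet. The snippet `SAMPLE_HTML` and the helper `first_table_to_frame` are hypothetical names introduced for illustration; the snippet is not the real Wikipedia markup, and the notebook itself relies on `pd.read_html` rather than this manual approach.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia markup).
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def first_table_to_frame(html):
    """Parse the first <table> in `html` into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Collect the text of every header/data cell, one list per table row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


sample_df = first_table_to_frame(SAMPLE_HTML)
print(sample_df.shape)
```

Selecting a single table with `soup.find("table")` keeps the input to the conversion unambiguous when a page contains several tables.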

Ignore the rows that don't have an assigned borough

In [21]:
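# .copy() makes df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[df['Borough'] != 'Not assigned'].copy()
print(df.shape)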
df.head()

(103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Now we would like to put the neighborhoods that belong to the same postal code in the same row, separated by commas. The first step is to check whether any postal code appears more than once.

In [22]:
# compare rows on the postal code only; subset=None would flag only fully identical rows
duplicate_boolean = df.duplicated(subset=['Postal Code'], keep='first')
duplicate_boolean[duplicate_boolean]

Series([], dtype: bool)

There are no duplicated postal codes, so we don't need to combine the neighbourhoods that share a postal code into one comma-separated row.
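Had the check found repeated postal codes, the combining step could look roughly like the sketch below. The three-row `toy` frame is made-up illustration data, not the scraped table, and grouping on both `Postal Code` and `Borough` assumes that each postal code maps to a single borough.

```python
import pandas as pd

# Made-up rows in which one postal code appears twice.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Flag every repeat of a postal code after its first occurrence.
has_repeats = toy.duplicated(subset=["Postal Code"], keep="first").any()

# Join the neighbourhoods of each postal code into one comma-separated string.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

After the aggregation each postal code occupies exactly one row, which is the shape the rest of the notebook expects.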

If a row has a borough but a 'Not assigned' neighbourhood, then the neighbourhood will be set to the same value as the borough.

In [23]:
df[df['Neighbourhood']=='Not assigned']

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|


There are no rows with a 'Not assigned' neighbourhood.
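For completeness, this is a small sketch of how the borough-as-neighbourhood rule could be applied if such rows did exist. The `toy` frame and its values are invented for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Invented rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask of the rows whose neighbourhood is missing.
missing = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only.
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```

Using `.loc` with a mask changes only the affected rows and leaves every other neighbourhood untouched.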

In [24]:
print('the data frame has {} rows and {} columns'.format(df.shape[0], df.shape[1]))
df.shape

the data frame has 103 rows and 3 columns


(103, 3)

## Section 2

Now let's try to get the latitude and longitude for each neighbourhood.

In [25]:
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


We use geocoder to get the longitude and latitude for each postal code. 

In [11]:
latitude = []
longitude = []
for postal_code in df['Postal Code']:
    # start with no coordinates for this postal code
    lat_lng_coords = None
    # keep querying until the geocoder returns coordinates
    # (note: this never terminates if a postal code cannot be resolved)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    # g.latlng is a [latitude, longitude] pair
    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])
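Because the `while` loop above retries without any limit, one possible variation caps the number of attempts. The sketch below is not the notebook's method: `geocode_with_retries`, `max_attempts`, `delay_seconds`, and the stub `flaky_lookup` are hypothetical names, and the lookup function is passed in so the example runs without network access.

```python
import time


def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable that returns a [lat, lng] pair or None.
    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None


# Stub lookup for illustration: fails twice, then returns fixed coordinates.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.75245, -79.32991]


coords = geocode_with_retries(flaky_lookup, "M3A, Toronto, Ontario")
print(coords, calls["count"])
```

With the real service, the lookup could be supplied as `lambda q: geocoder.arcgis(q).latlng`, and a `None` result would then mark a postal code that needs manual attention instead of stalling the cell.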

In [26]:
df['Latitude'] = latitude
df['longitude'] = longitude
df.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | longitude |
|---|---|---|---|---|---|
| 2 | M3A | North York | Parkwoods | 43.75245 | -79.32991 |
| 3 | M4A | North York | Victoria Village | 43.73057 | -79.31306 |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65512 | -79.36264 |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72327 | -79.45042 |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66253 | -79.39188 |


In [27]:
# renumber the rows from 0 and discard the old index in one step
df.reset_index(drop=True, inplace=True)

In [28]:
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945
