# Segmenting and Clustering Neighborhoods in Toronto VR

## Question 1

1.1: Importing basic libraries for the exercise

In [2]:
import numpy as np 
import pandas as pd 
import requests 

1.2: Scraping the Wikipedia page that lists the postal codes of Canada (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)


In [3]:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import csv

In [4]:
# Build an SSL context that skips certificate checks (note: requests.get below does not use it)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

In [5]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')

In [6]:
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # drop the header row: it has no <td> cells, so its columns are all None
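
The row-by-row extraction above can be tried offline on a tiny table. This is a minimal sketch, assuming a hand-written HTML snippet (`html`) that only imitates the Wikipedia markup, and using the built-in `html.parser` instead of `lxml`; the names `toy` and `rows` are illustrative. It skips the header row while looping rather than filtering it out afterwards.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, written by hand for illustration.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

toy = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhood"])
print(toy)
```

Skipping empty rows in the loop means no all-`None` row is created, so the later `isnull()` filter would not be needed in this variant.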


Check the first and last rows, and inspect the structure to verify that the result is a pandas dataframe:

In [7]:
print(df.head(10))
print('***')
print(df.tail(10))
df.info()
df.shape

   PostalCode           Borough     Neighbourhood
1         M1A      Not assigned      Not assigned
2         M2A      Not assigned      Not assigned
3         M3A        North York         Parkwoods
4         M4A        North York  Victoria Village
5         M5A  Downtown Toronto      Harbourfront
6         M6A        North York  Lawrence Heights
7         M6A        North York    Lawrence Manor
8         M7A  Downtown Toronto      Queen's Park
9         M8A      Not assigned      Not assigned
10        M9A      Queen's Park      Not assigned
***
    PostalCode       Borough             Neighbourhood
278        M4Z  Not assigned              Not assigned
279        M5Z  Not assigned              Not assigned
280        M6Z  Not assigned              Not assigned
281        M7Z  Not assigned              Not assigned
282        M8Z     Etobicoke  Kingsway Park South West
283        M8Z     Etobicoke                 Mimico NW
284        M8Z     Etobicoke        The Queensway West
285   

(287, 3)

Some cells have a borough but a "Not assigned" neighbourhood (for example M9A in the output above, whose borough is Queen's Park), so I assign the borough as the neighbourhood, as you can see in the next cell:

In [11]:
df.loc[df['Neighbourhood']=="Not assigned",'Neighbourhood']=df.loc[df['Neighbourhood']=="Not assigned",'Borough']
df2 = df.reset_index()
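
The same replacement can be checked on a small frame. This is a minimal sketch, assuming a two-row hand-made dataframe (`toy`) and using `Series.mask` as an equivalent to the `.loc` assignment above:

```python
import pandas as pd

# Small hand-made frame for illustration only.
toy = pd.DataFrame({
    "PostalCode": ["M3A", "M9A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the neighbourhood is "Not assigned", take the value from Borough instead.
missing = toy["Neighbourhood"] == "Not assigned"
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```

Rows whose neighbourhood is already assigned are left unchanged.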

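After this we can remove duplicate entries within the Borough column and verify that we obtained the dataframe required in Question 1. Note that the pattern used below also strips whitespace, so borough names lose their spaces in the output (for example "NorthYork").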

In [16]:
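# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
df2['Borough'] = df2['Borough'].str.replace(r'nan|[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r",{2,}", ",", regex=True)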
df2.head(10)


Unnamed: 0,level_0,index,PostalCode,Borough,Neighbourhood
0,0,1,M1A,Notassigned,Not assigned
1,1,2,M2A,Notassigned,Not assigned
2,2,3,M3A,NorthYork,Parkwoods
3,3,4,M4A,NorthYork,Victoria Village
4,4,5,M5A,DowntownToronto,Harbourfront
5,5,6,M6A,NorthYork,Lawrence Heights
6,6,7,M6A,NorthYork,Lawrence Manor
7,7,8,M7A,DowntownToronto,Queen's Park
8,8,9,M8A,Notassigned,Not assigned
9,9,10,M9A,Queen'sPark,Queen's Park
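
The output above shows borough names without spaces ("NorthYork") because `\s` in the pattern removes all whitespace. If keeping the spaces is preferred, one possible alternative is sketched below. `dedupe_csv` is a hypothetical helper, not part of the original notebook, and the sample strings are invented for illustration.

```python
import pandas as pd

def dedupe_csv(value: str) -> str:
    """Drop repeated comma-separated entries, keeping first-seen order and inner spaces."""
    parts = [p.strip() for p in value.split(",")]
    # dict.fromkeys keeps insertion order, so the result is deterministic
    return ",".join(dict.fromkeys(p for p in parts if p))

toy = pd.Series(["North York,North York", "Downtown Toronto", "Etobicoke, York,Etobicoke"])
cleaned = toy.apply(dedupe_csv)
print(cleaned.tolist())
```

Unlike `set`, `dict.fromkeys` preserves the order in which entries first appear, so repeated runs give the same string.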


We obtain information about the dataframe in the next cell:

In [24]:
df2.info()
df2.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 5 columns):
level_0          287 non-null int64
index            287 non-null int64
PostalCode       287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: int64(2), object(3)
memory usage: 11.3+ KB


(287, 5)

## Question 2 (reading the CSV file option to get the geographical coordinates)

I load pandas and read the CSV from the website, then look at the first ten rows to verify that it is what I need:

In [26]:
import pandas as pd 
data2 = pd.read_csv("http://cocl.us/Geospatial_data") 
data2.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


I rename the postal code column of the dataframe "data2" so that it matches the column name in the dataframe from Question 1 and the two can be merged:

In [27]:
data2.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
data2.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
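
The load-and-rename step can be reproduced without a download. This is a minimal sketch, assuming the first two rows shown above are inlined as text (`csv_text`); `geo` is an illustrative name.

```python
import io
import pandas as pd

# First rows of the geospatial file as shown above, inlined so no download is needed.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

geo = pd.read_csv(io.StringIO(csv_text))
# rename returns a new frame here; the inplace=True variant above modifies data2 directly
geo = geo.rename(columns={"Postal Code": "PostalCode"})
print(geo.columns.tolist())
```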


I merge the two dataframes and verify the result:

In [29]:
# Inner join on the shared PostalCode column; sort=True orders the result by key
data = pd.merge(data2, df2, how='inner', on='PostalCode', sort=True)

data.head(10)

Unnamed: 0,PostalCode,Latitude,Longitude,level_0,index,Borough,Neighbourhood
0,M1B,43.806686,-79.194353,10,11,Scarborough,Rouge
1,M1B,43.806686,-79.194353,11,12,Scarborough,Malvern
2,M1C,43.784535,-79.160497,26,27,Scarborough,Highland Creek
3,M1C,43.784535,-79.160497,27,28,Scarborough,Rouge Hill
4,M1C,43.784535,-79.160497,28,29,Scarborough,Port Union
5,M1E,43.763573,-79.188711,41,42,Scarborough,Guildwood
6,M1E,43.763573,-79.188711,42,43,Scarborough,Morningside
7,M1E,43.763573,-79.188711,43,44,Scarborough,West Hill
8,M1G,43.770992,-79.216917,52,53,Scarborough,Woburn
9,M1H,43.773136,-79.239476,61,62,Scarborough,Cedarbrae


I arrange the columns to obtain a dataframe as shown in the example, and check that it worked by looking at the first rows:

In [30]:
cols = data.columns.tolist()
cols

new_column_order = ['PostalCode',
 'Borough',
 'Neighbourhood',
 'Latitude',
 'Longitude']
new_column_order

data = data[new_column_order]
data.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Rouge,43.806686,-79.194353
1,M1B,Scarborough,Malvern,43.806686,-79.194353
2,M1C,Scarborough,Highland Creek,43.784535,-79.160497
3,M1C,Scarborough,Rouge Hill,43.784535,-79.160497
4,M1C,Scarborough,Port Union,43.784535,-79.160497
5,M1E,Scarborough,Guildwood,43.763573,-79.188711
6,M1E,Scarborough,Morningside,43.763573,-79.188711
7,M1E,Scarborough,West Hill,43.763573,-79.188711
8,M1G,Scarborough,Woburn,43.770992,-79.216917
9,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
