## Segmenting and Clustering Neighbourhoods in Toronto

The code below extracts the website's HTML document and synthesises the table specified in the assignment. First, we import the libraries needed for the task:

In [43]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import requests # library to handle requests

from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


The next step downloads the page and parses its HTML into a BeautifulSoup object named 'soup'.

We then find the table elements within the HTML with find_all() and keep the first one. Finally, we read that table into a dataframe with pandas' read_html() and inspect its shape.

In [44]:
from io import StringIO # lets read_html take the HTML as a file-like object

wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text # Fetch the page and keep its HTML as text

soup = BeautifulSoup(wiki_html,'lxml') # Parse the HTML into a BeautifulSoup object using the lxml parser
#print(soup.prettify()) # 'Prettify' will let us view how the tags are nested in the document
table = soup.find_all('table')[0] # The first table on the page holds the postal codes
df = pd.read_html(StringIO(str(table)))[0] # read_html returns a list of dataframes; take the first
df.shape

(288, 3)

The data above now needs some preprocessing. We:
- Remove rows that have 'Not assigned' in the Borough field.
- Replace 'Not assigned' in the Neighbourhood field with the Borough of that row.
- Group by Postcode and Borough, joining the Neighbourhood values of each group with ','.

In [55]:
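df = df.loc[df['Borough'] != 'Not assigned' ,:].copy() # keep rows with an assigned borough; copy so the next assignment does not warn
df['Neighbourhood'] = df['Neighbourhood'].where(df['Neighbourhood'] != 'Not assigned', df['Borough']) # fall back to the borough name row by row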
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
print('The dataframe shape for Q1 is:',df.shape)

The dataframe shape for Q1 is: (103, 3)
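
As a quick sanity check, the three preprocessing steps can be tried on a small hand-made table. This is only a sketch: the rows in `raw` are invented for illustration and are not taken from the scraped data, and `raw` and `clean` are hypothetical names.

```python
import pandas as pd

# Toy input, invented for illustration only (not the scraped Wikipedia table)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                      'Regent Park', 'Not assigned'],
})

# Step 1: drop rows without a borough; .copy() gives an independent dataframe
clean = raw.loc[raw['Borough'] != 'Not assigned'].copy()

# Step 2: where the neighbourhood is missing, use the borough of the same row
clean['Neighbourhood'] = clean['Neighbourhood'].where(
    clean['Neighbourhood'] != 'Not assigned', clean['Borough'])

# Step 3: one row per (Postcode, Borough), neighbourhoods joined with ','
clean = (clean.groupby(['Postcode', 'Borough'])['Neighbourhood']
              .apply(','.join)
              .reset_index())

print(clean)
```

The unassigned M1A row disappears, the two M5A rows collapse into one, and M7A takes its borough name as its neighbourhood.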


### Lat & Long Extraction
We now read the CSV data at the link provided to get the latitude and longitude coordinates for each postal code.

In [46]:
latlng_df = pd.read_csv('https://cocl.us/Geospatial_data')
latlng_df

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


Inspecting the two dataframes shows that their postal code columns match row for row, so we can append the coordinate columns directly without any further manipulation.

In [56]:
df['Latitude'] = latlng_df['Latitude']
df['Longitude'] = latlng_df['Longitude']
df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
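
Assigning the coordinate columns directly relies on both dataframes listing the postal codes in the same order. If that could not be guaranteed, one alternative would be to join on the postal code itself. The sketch below is an assumption about how that might look, not the approach used in this notebook: `hoods`, `coords` and `merged` are hypothetical names, and the coordinate rows are shuffled on purpose to show that the order no longer matters.

```python
import pandas as pd

# Small illustrative frames; the coordinate rows are deliberately out of order
hoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Left join on the postal code; validate raises if a code is duplicated
merged = hoods.merge(
    coords,
    left_on='Postcode',
    right_on='Postal Code',
    how='left',
    validate='one_to_one',
).drop(columns='Postal Code')  # the key now appears only once, as 'Postcode'

# A missing latitude would mean a postal code had no match in coords
assert merged['Latitude'].notna().all()

print(merged)
```

A left join keeps every row of the neighbourhood table in its original order, so any unmatched postal code would show up as a missing coordinate rather than silently shifting the values.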
