<h2>1. Crawl the list of Toronto neighborhoods</h2>

In [3]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

Fetch the Wikipedia page and parse the HTML document

In [4]:
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(res.content, 'html.parser')

In [5]:
data = []
table = soup.find('table', attrs={'class':'wikitable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if len(cols) > 0:  # skip the header row, which has <th> cells and no <td>
        # Keep the stripped text of every cell that has content
        cols = [ele.text.strip() for ele in cols if ele is not None and len(ele) > 0]
        data.append([ele for ele in cols if ele])  # drop empty strings
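The row loop above can be tried offline on a small table. The following is a minimal sketch, assuming a hand-written HTML snippet (`sample_html` is illustrative and is not the live Wikipedia markup); it shows why rows without `<td>` cells, such as the header, are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only.
sample_html = """
<table class="wikitable">
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', attrs={'class': 'wikitable'})

sample_rows = []
for tr in sample_table.find_all('tr'):
    cells = tr.find_all('td')  # the header row uses <th>, so it yields no <td>
    if cells:
        sample_rows.append([cell.get_text(strip=True) for cell in cells])

print(sample_rows)
```

Only the two data rows end up in `sample_rows`; the header row is dropped because it contains no `<td>` elements.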

Generate the dataframe

In [6]:
df = pd.DataFrame(data)
df.columns = ['postcode', 'borough', 'neighborhood']
df.head()

Unnamed: 0,postcode,borough,neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
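Because the scraping step drops empty cells, a row could in principle arrive with fewer than three values, and the column assignment would then fail or misalign. A minimal sketch, assuming toy rows in the same shape as `data` (the names `sample_data` and `column_names` are illustrative), that checks row lengths before naming the columns at construction time:

```python
import pandas as pd

# Hypothetical rows in the same shape as `data` above.
sample_data = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
]
column_names = ['postcode', 'borough', 'neighborhood']

# Fail early if any scraped row does not have exactly one value per column
bad_rows = [row for row in sample_data if len(row) != len(column_names)]
assert not bad_rows, bad_rows

# Passing columns= names the columns in the same step as building the frame
sample_df = pd.DataFrame(sample_data, columns=column_names)
print(sample_df)
```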


Filter out the "Not assigned" boroughs

In [7]:
df_2 = df[df.borough != 'Not assigned'].copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning

Replace "Not assigned" neighborhoods using their boroughs

In [8]:
mask = df_2.neighborhood == 'Not assigned'
df_2.loc[mask, 'neighborhood'] = df_2.loc[mask, 'borough']
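Assigning into a DataFrame that was produced by filtering another one is the pattern that can trigger pandas' SettingWithCopyWarning. A minimal sketch on toy data (the rows in `toy` are illustrative, not scraped) of taking an explicit copy before the `.loc` assignment:

```python
import pandas as pd

toy = pd.DataFrame({
    'postcode': ['M1A', 'M7A', 'M3A'],
    'borough': ['Not assigned', "Queen's Park", 'North York'],
    'neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# .copy() makes the filtered frame independent of `toy`
toy_clean = toy[toy['borough'] != 'Not assigned'].copy()

# Fill unassigned neighborhoods with the borough name of the same row
mask = toy_clean['neighborhood'] == 'Not assigned'
toy_clean.loc[mask, 'neighborhood'] = toy_clean.loc[mask, 'borough']

print(toy_clean)
```

The original `toy` frame is left unchanged, and only the rows selected by `mask` are rewritten in `toy_clean`.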


Check the data after cleaning

In [9]:
df_2.head()

Unnamed: 0,postcode,borough,neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [10]:
print('check the number of rows before and after data cleaning')
print('before cleaning: {} rows'.format(len(df)))
print('after cleaning: {} rows'.format(len(df_2)))

check the number of rows before and after data cleaning
before cleaning: 288 rows
after cleaning: 211 rows
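Row counts show that rows were removed, but not that the remaining rows are clean. A minimal sketch of an explicit check, assuming the same three column names as above (`find_cleaning_problems` is a hypothetical helper, not part of pandas):

```python
import pandas as pd

def find_cleaning_problems(frame):
    """Return a list of problems found in a cleaned frame (empty if none)."""
    problems = []
    for column in ('borough', 'neighborhood'):
        if (frame[column] == 'Not assigned').any():
            problems.append(f"'Not assigned' still present in {column}")
    if frame.isna().any().any():
        problems.append('missing values present')
    return problems

# Toy frames for illustration only
toy_ok = pd.DataFrame({
    'postcode': ['M3A'],
    'borough': ['North York'],
    'neighborhood': ['Parkwoods'],
})
toy_dirty = pd.DataFrame({
    'postcode': ['M7A'],
    'borough': ["Queen's Park"],
    'neighborhood': ['Not assigned'],
})

print(find_cleaning_problems(toy_ok))
print(find_cleaning_problems(toy_dirty))
```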


Confirm that a single postal code can map to several neighborhoods

In [96]:
df_2[df_2.postcode == 'M5A'].head()

Unnamed: 0,postcode,borough,neighborhood
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park


Combine the neighborhoods that share a postal code and borough into one comma-separated row

In [54]:
df_group = df_2.groupby(['postcode', 'borough'])['neighborhood'].apply(','.join).reset_index()

In [25]:
df_group.head()

Unnamed: 0,postcode,borough,neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
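The grouping step can be checked on a few rows by hand. A minimal sketch on toy data taken from the M5A and M6A rows shown earlier, using `agg` with `','.join` as one possible way to write the aggregation, and then confirming that each postal code appears once:

```python
import pandas as pd

toy = pd.DataFrame({
    'postcode': ['M5A', 'M5A', 'M6A'],
    'borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# One row per (postcode, borough); neighborhoods joined with commas
toy_group = (
    toy.groupby(['postcode', 'borough'])['neighborhood']
       .agg(','.join)
       .reset_index()
)

# After grouping, every postal code should be unique
assert toy_group['postcode'].is_unique

print(toy_group)
```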


<h2>2. Merge the latitude/longitude coordinates into the neighborhoods</h2>

In [1]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
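When a geocoding service returns nothing for a postal code, one option is to retry a bounded number of times before falling back to another data source. The sketch below is only an illustration of that idea: `lookup_with_retries` and `flaky_lookup` are hypothetical names, and the real geocoder call is deliberately not shown, so any function that returns a `(lat, lon)` tuple or `None` can be plugged in.

```python
import time

def lookup_with_retries(lookup, postal_code, attempts=3, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for attempt in range(attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # pause before the next try
    return None  # caller should fall back to another source

# Stand-in for a real geocoding call: fails twice, then succeeds
calls = {'n': 0}

def flaky_lookup(postal_code):
    calls['n'] += 1
    return (43.806686, -79.194353) if calls['n'] >= 3 else None

result = lookup_with_retries(flaky_lookup, 'M1B')
print(result, calls['n'])
```

If every attempt returns `None`, the helper returns `None` as well, which is the signal to use a fallback such as a prepared coordinates file.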


Download the geospatial data as a CSV file, because geocoder could not return the latitude and longitude

In [35]:
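url = 'https://cocl.us/Geospatial_data'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open('Geospatial_Coordinates.csv', 'wb') as f:  # the with-block closes the file
    f.write(r.content)  # 2891 bytes in the original run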

In [50]:
df_geo = pd.read_csv('Geospatial_Coordinates.csv')

In [51]:
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [52]:
df_geo.columns = ['postal_code', 'lat', 'lon']
df_geo.set_index(['postal_code'], inplace=True)
df_geo.head()

postal_code,lat,lon
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
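Assigning `columns` positionally works only while the CSV keeps the same column order. A minimal sketch of an alternative that renames by the original header names, assuming the three headers shown in the output above and reading from an in-memory string (`csv_text`) instead of the downloaded file:

```python
import io
import pandas as pd

# First rows of the coordinates file, copied from the output above
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

geo = pd.read_csv(io.StringIO(csv_text))

# Rename by name, so a reordered CSV would not silently mislabel columns
geo = geo.rename(columns={
    'Postal Code': 'postal_code',
    'Latitude': 'lat',
    'Longitude': 'lon',
}).set_index('postal_code')

print(geo)
```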


Join the geospatial data to the neighborhood dataframe

In [55]:
df_group.rename(columns={'postcode':'postal_code'}, inplace=True)
df_group.set_index('postal_code', inplace=True)
df_group.head()

postal_code,borough,neighborhood
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


In [57]:
df_3 = df_group.join(df_geo, how='inner')

In [58]:
df_3.head()

postal_code,borough,neighborhood,lat,lon
M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476
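An inner join silently drops any postal code that has no coordinates. A minimal sketch, on toy frames where the M1H coordinates are deliberately left out, of using `merge` with `how='left'` and `indicator=True` to list the postal codes that would be lost:

```python
import pandas as pd

# Toy frames indexed by postal code; M1H is deliberately missing from coords
hoods = pd.DataFrame(
    {'borough': ['Scarborough', 'Scarborough'],
     'neighborhood': ['Woburn', 'Cedarbrae']},
    index=pd.Index(['M1G', 'M1H'], name='postal_code'),
)
coords = pd.DataFrame(
    {'lat': [43.770992], 'lon': [-79.216917]},
    index=pd.Index(['M1G'], name='postal_code'),
)

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
checked = hoods.merge(coords, how='left',
                      left_index=True, right_index=True, indicator=True)
missing = checked.index[checked['_merge'] == 'left_only'].tolist()

print(missing)
```

An empty `missing` list would mean the inner join keeps every neighborhood row.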
