In [1]:
# The code was removed by Watson Studio for sharing.

<h1 style="text-align: center">Battle: Neighborhoods in Toronto - data preparation</h1>

<h2>1. Prepare required Toronto neighborhoods data</h2>
<h3>1.1 Scrape Toronto postal codes wiki page</h3>
<p>This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract the table of postal codes and load it into a pandas DataFrame.</p>

In [2]:
!pip install beautifulsoup4
!pip install lxml

tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.


In [3]:
# import required modules
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

In [4]:
# assign link to Wiki page to variable
wiki_Toronto_postal_codes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

I will use **BeautifulSoup** together with *lxml* to scrape the Wikipedia page.

In [5]:
# get page text and parse using BeautifulSoup
source = requests.get(wiki_Toronto_postal_codes).text
soup = BeautifulSoup(source, 'lxml')

First I will find the **table** HTML tag, then every table row (**tr**) and every **td** cell within each row, using nested loops.
The resulting list of lists (one list per row) is then used to create a pandas DataFrame.


In [6]:
# table with data
pcode_table = soup.find('table',{'class':'wikitable sortable'})
table_data = []
# find all table rows
for tr in pcode_table.find_all('tr'):
    row = []
    # find all cells within row
    for td in tr.find_all('td'):
        # append extracted and trimmed cell text into row data  
        row.append(td.get_text(strip=True))
    # skip the row if it is empty (the header row has no td cells)
    if len(row):
        table_data.append(row)
# create data frame from list of lists
df_wiki = pd.DataFrame(data=table_data, columns=['PostalCode', 'Borough', 'Neighborhood'])
# filter out rows with Borough equal to 'Not assigned'
df_wiki = df_wiki[df_wiki.Borough != 'Not assigned']
df_wiki.head(15)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
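
To show why the `len(row)` check drops the header row, here is a minimal sketch on a tiny inline table. The HTML snippet, the `demo_` names, and the `html.parser` backend are my assumptions for illustration; the real Wikipedia markup is larger and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table, not the actual Wikipedia markup.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# 'html.parser' is the standard-library backend, used here so the sketch does not need lxml.
demo_soup = BeautifulSoup(html, 'html.parser')
demo_table = demo_soup.find('table', {'class': 'wikitable sortable'})

demo_rows = []
for tr in demo_table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # The header row holds only <th> cells, so its list of <td> cells is empty.
    if cells:
        demo_rows.append(cells)

print(demo_rows)
```

Only the two data rows end up in `demo_rows`; the header row is skipped without any special handling.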


For rows where *Neighborhood* is 'Not assigned', we use the *Borough* name instead.

In [7]:
df_wiki.loc[df_wiki['Neighborhood'] == 'Not assigned','Neighborhood'] = df_wiki['Borough']
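
As a small worked example of this replacement, the sketch below applies the same idea to a two-row toy frame. The `demo` frame and the `mask` variable are illustrative assumptions; the values are taken from the output shown earlier.

```python
import pandas as pd

# Toy frame with one unassigned neighborhood (values copied from the table above).
demo = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Islington Avenue'],
})

# Boolean mask selecting only the rows that need a replacement.
mask = demo['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for the masked rows only.
demo.loc[mask, 'Neighborhood'] = demo.loc[mask, 'Borough']

print(demo)
```

Rows that already have a neighborhood are left untouched, because the assignment is restricted to the masked rows.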

I will group the data frame by **PostalCode** and **Borough**, and use *apply* to join the neighborhoods that share a postal code into one comma-separated string.

In [8]:
df_grouped = df_wiki.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda neighborhoods: ', '.join(neighborhoods)).to_frame().reset_index()
df_grouped.head(30)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
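
The grouping step can be checked on a toy frame. This is a minimal sketch, assuming three hand-picked rows from the data above; it uses `agg(', '.join)` as an equivalent spelling of the `apply` with a lambda.

```python
import pandas as pd

# Toy frame: M1B appears twice, M1G once (values copied from the tables above).
demo = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Join all neighborhoods that share a (PostalCode, Borough) pair into one string.
demo_grouped = (
    demo.groupby(['PostalCode', 'Borough'])['Neighborhood']
        .agg(', '.join)
        .reset_index()
)

print(demo_grouped)
```

The three input rows collapse into two, one per postal code, with the M1B neighborhoods joined in their original order.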


In [9]:
# df_grouped shape
print('Data frame shape: ', df_grouped.shape)
df_grouped.head()

Data frame shape:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [12]:
# df_grouped.to_csv('neighborhood_toronto_grouped.csv')
project.save_data(data=df_grouped.to_csv(index=False),file_name='neighborhood_toronto_grouped.csv',overwrite=True)

{'asset_id': '6a283801-00ef-40da-b7bb-df7ba4298b39',
 'bucket_name': 'courseracapstone-donotdelete-pr-2afewjmpcmomni',
 'file_name': 'neighborhood_toronto_grouped.csv',
 'message': 'File saved to project storage.'}

<h3>1.2 Get the latitude and the longitude coordinates of each postal code in data frame</h3>

<b>Note:</b> Because geocoder, Nominatim, and Nominatim with RateLimiter all returned unstable results, I will use the CSV file from https://cocl.us/Geospatial_data instead.

In [12]:
csv_url = 'https://cocl.us/Geospatial_data'
df_location = pd.read_csv(csv_url)
# rename column
df_location.rename(index=str, columns={'Postal Code': 'PostalCode'}, inplace=True)
df_location.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
# save to project
project.save_data(data=df_location.to_csv(index=False),file_name='geospatial_data_toronto.csv',overwrite=True)

{'asset_id': 'b467b3ab-2ec3-48dd-966a-5c73b33842bd',
 'bucket_name': 'courseracapstone-donotdelete-pr-2afewjmpcmomni',
 'file_name': 'geospatial_data_toronto.csv',
 'message': 'File saved to project storage.'}

<h3>1.3 Merge Toronto neighborhood data with geospatial_data</h3>

In [15]:
# merge datasets on PostalCode column value
df_grouped_merged = pd.merge(df_grouped, df_location, on='PostalCode')
df_grouped_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
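
A default inner merge silently drops postal codes that have no coordinates. One way to check for that is sketched below, assuming toy frames `left` and `right` and a made-up postal code `M1X` that has no match; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy inputs; 'M1X' is a hypothetical code with no coordinates.
left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1X'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
right = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every row of `left`; validate raises if keys are duplicated,
# and indicator adds a '_merge' column that records where each row came from.
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes present in `left` but missing from `right`.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postal code received coordinates, which is consistent with the row count staying at 103 after the merge in this notebook.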


In [16]:
print('Shape of df_grouped_merged', df_grouped_merged.shape)
# save to project
project.save_data(data=df_grouped_merged.to_csv(index=False),file_name='neighborhood_toronto_geospatial_data.csv',overwrite=True)

Shape of df_grouped_merged (103, 5)


{'asset_id': '2f1db118-8aa8-426b-b82f-026ff11db168',
 'bucket_name': 'courseracapstone-donotdelete-pr-2afewjmpcmomni',
 'file_name': 'neighborhood_toronto_geospatial_data.csv',
 'message': 'File saved to project storage.'}

<h2>Summary</h2>
<p>I've prepared CSV files with Toronto neighborhood data for later use in the Coursera Capstone assignment.</p>
<ul>
    <li>neighborhood_toronto_grouped.csv - <b>PostalCode, Borough, Neighborhood</b> cleaned and grouped by PostalCode</li>
    <li>geospatial_data_toronto.csv - <b>PostalCode, Latitude, Longitude</b> for Toronto postal codes</li>
    <li>neighborhood_toronto_geospatial_data.csv - <b>PostalCode, Borough, Neighborhood, Latitude, Longitude</b> for each Toronto postal code</li>
</ul>
<p><b>Note:</b> The <b>neighborhood_toronto_geospatial_data.csv</b> will be used as initial data for Toronto neighborhoods.</p>