# Getting the Locations of Neighborhoods in Toronto

This notebook is part of the capstone project for the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) course.
In this notebook I look up the location (latitude and longitude) of each neighborhood in Toronto.

In [1]:
import requests
import lxml.html as html
import pandas as pd

Downloading and parsing the wiki page:

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = requests.get(wiki_url)
wiki_doc = html.fromstring(wiki_page.content)

Finding the table element with an XPath expression.
NOTE: The XPath depends on the current layout of the page, so changes to the page will require changes to this code.

In [3]:
wiki_table = wiki_doc.xpath('//*[@id="mw-content-text"]/div/table[1]')

# Let's make sure we found the right table
import re

re_postal_code = r'\nM\d[A-Z]\n'
re_postal_code_flags = re.MULTILINE | re.UNICODE
if len(wiki_table) == 0 or re.search(re_postal_code, wiki_table[0].text_content(), re_postal_code_flags) is None:
    raise Exception('Could not find the table of postal codes. Consider updating the XPath.')

postal_codes_table = wiki_table[0]
postal_codes_table

<Element table at 0x290e2f2f098>
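
To show what the sanity check above actually tests, here is a minimal sketch that runs the same pattern against two short strings. Both strings are hypothetical stand-ins for `wiki_table[0].text_content()`, not real page content.

```python
import re

# Same pattern as in the cell above: a line that holds only a postal code
# prefix (the letter M, a digit, an uppercase letter).
re_postal_code = r'\nM\d[A-Z]\n'

# Hypothetical snippets standing in for wiki_table[0].text_content()
good_text = 'Postal Code\nM3A\nNorth York\nParkwoods\n'
bad_text = 'Province\nOntario\nCapital\nToronto\n'

# re.search returns None when nothing matches, so compare with "is not None"
found_good = re.search(re_postal_code, good_text) is not None
found_bad = re.search(re_postal_code, bad_text) is not None

print(found_good, found_bad)
```

The flags are left out here on purpose: `re.MULTILINE` only changes how `^` and `$` behave, and the pattern uses neither, while `re.UNICODE` is already the default for `str` patterns in Python 3.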

Getting location data from the provided CSV file. I tried the geocoder package first, but it was too slow to be practical.

In [4]:
df_locations = pd.read_csv('https://cocl.us/Geospatial_data')
df_locations.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
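
Since the coordinates are later looked up one postal code at a time, it may help to index the table by postal code first. The following is only a sketch: it uses a few rows copied from the output above in place of the downloaded file, and `find_coords` is a hypothetical helper name, not part of the notebook.

```python
import io
import pandas as pd

# A few rows copied from the output above, standing in for the downloaded CSV
sample_csv = io.StringIO(
    'Postal Code,Latitude,Longitude\n'
    'M1B,43.806686,-79.194353\n'
    'M1C,43.784535,-79.160497\n'
    'M1E,43.763573,-79.188711\n'
)
df_sample = pd.read_csv(sample_csv)

# Index by postal code so each lookup is a label access
# instead of a boolean-mask filter over the whole frame
coords = df_sample.set_index('Postal Code')

def find_coords(postal_code):
    """Return (latitude, longitude), or (None, None) for an unknown code."""
    if postal_code not in coords.index:
        return (None, None)
    row = coords.loc[postal_code]
    return (float(row['Latitude']), float(row['Longitude']))

print(find_coords('M1C'))
print(find_coords('M9Z'))  # hypothetical code that is not in the sample
```

Returning `(None, None)` for an unknown code is one possible choice; raising an exception would be just as reasonable if missing coordinates should stop the run.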


Creating the dataframe. I build a dictionary of columns first and then create the dataframe from it. The latitude and longitude are looked up for each postal code while the rows are processed.

In [5]:
rows = postal_codes_table.findall('.//tr')
if len(rows) == 0:
    raise Exception('Could not find any rows in the table')

EXPECTED_COLS_NUM = 3
table_dict = { 'PostalCode': [], 'Borough': [], 'Neighborhood': [], 'Latitude': [], 'Longitude': [] }

for row in rows:
    cols = row.findall('.//td')
    num_cols = len(cols)

    # Skip rows without td elements (like the header)
    if num_cols == 0:
        continue

    # Make sure we always have the expected number of columns
    if num_cols != EXPECTED_COLS_NUM:
        raise Exception('Expected exactly {} columns but got {}.'.format(EXPECTED_COLS_NUM, num_cols))

    borough = cols[1].text_content().strip()

    # Ignore rows without borough as per the task description
    if borough == 'Not assigned':
        continue

    neighborhoods = cols[2].text_content().strip().split(' / ')

    # Make neighborhood same as borough if the former isn't specified
    if len(neighborhoods) == 0 or neighborhoods[0] == 'Not assigned':
        neighborhoods = [borough]

    postal_code = cols[0].text_content().strip()
    table_dict['PostalCode'].append(postal_code)
    table_dict['Borough'].append(borough)
    table_dict['Neighborhood'].append(', '.join(neighborhoods))

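    # Getting location data; fail loudly if the postal code is missing from the CSV
    loc = df_locations.loc[df_locations['Postal Code'] == postal_code]
    if loc.empty:
        raise Exception('No location data for postal code {}.'.format(postal_code))
    table_dict['Latitude'].append(loc['Latitude'].values[0])
    table_dict['Longitude'].append(loc['Longitude'].values[0])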

df = pd.DataFrame.from_dict(table_dict)
df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
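
The row rules used in the loop (skip unassigned boroughs, fall back to the borough name, join neighborhoods with commas) can also be sketched as a small standalone function that works on plain strings. `normalize_row` is a hypothetical name, and unlike the loop it also treats an empty neighborhood cell as unassigned, which is my assumption rather than something the task requires.

```python
def normalize_row(postal_code, borough, neighborhood_cell):
    """Return (postal_code, borough, neighborhood), or None if the row is skipped."""
    borough = borough.strip()

    # Rows without a borough are ignored
    if borough == 'Not assigned':
        return None

    # Neighborhoods in one cell are separated by ' / ' in the scraped table
    neighborhoods = [n.strip() for n in neighborhood_cell.strip().split(' / ')]

    # Fall back to the borough when no neighborhood is given
    # (the empty-cell case is an added assumption)
    if neighborhoods == [''] or neighborhoods[0] == 'Not assigned':
        neighborhoods = [borough]

    return (postal_code.strip(), borough, ', '.join(neighborhoods))


print(normalize_row('M5A\n', 'Downtown Toronto\n', 'Regent Park / Harbourfront\n'))
print(normalize_row('M1A\n', 'Not assigned\n', 'Not assigned\n'))
# Hypothetical input: a borough with no neighborhood listed
print(normalize_row('M7A\n', "Queen's Park\n", 'Not assigned\n'))
```

Keeping the rules in a function like this makes them easy to check on a few hand-written rows before running them against the whole table.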
