# IBM Data Science Certification: Applied Data Science Capstone

This notebook will be used for the Applied Data Science Capstone.

## Read Canadian Postal Codes

1. Use the pandas [`read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) function to read all the tables from the Wikipedia page [List of postal codes of Canada: M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The function returns a list of data frames, and the postal code table is the first element (index = 0) of that list.

1. After extracting the first table, get all the rows that have an *assigned* Borough (i.e. a Borough not equal to `Not assigned`).

1. According to the lab, a postal code could be listed multiple times, in which case the neighborhood values should be combined into a comma-separated list. *However*, it appears that the table has been updated and the neighborhoods are now listed with `/` as a separator. Let's check...

1. Lastly, I'll stick with the `/` separator since it appears to be unique within the list. I'm assuming that the `Garden District, Ryerson` entry is a single place rather than two, especially since the Garden District is where Ryerson University is located.
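
For reference, the combining step that the lab describes could be sketched as follows. This is only an illustration on a toy data frame with made-up duplicate rows (the variable names `toy` and `combined` are not part of the notebook); it groups by postal code and borough and joins the neighborhood names with a comma.

```python
import pandas as pd

# Toy data with a deliberately repeated postal code (illustrative only)
toy = pd.DataFrame({
    "Postal code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# One row per postal code, with neighborhoods joined into a comma-separated string
combined = (
    toy.groupby(["Postal code", "Borough"], as_index=False, sort=False)["Neighborhood"]
       .agg(", ".join)
)
print(combined)
```

If the check below finds no repeated postal codes, this step is a no-op and can be skipped.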

In [98]:
# The code was removed by Watson Studio for sharing.

In [106]:
%%capture
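# Install the geocoder package and import the libraries used below
!pip install geocoder
import pandas as pd
import numpy as np
import geocoder

def geocode_it(postal_code):
    """Return [latitude, longitude] for a Toronto postal code using the Google geocoder."""
    # api_key is not defined in the visible cells; presumably it is set in the hidden cell above
    g = geocoder.google(f'{postal_code}, Toronto, Ontario', key=api_key)
    return g.latlng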


In [100]:
# Read every table on the page; the postal code table is the first one
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
tables = pd.read_html(url)
postal_codes_raw = tables[0]

# Keep only the rows with an assigned borough
pcs = postal_codes_raw[postal_codes_raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# Count rows per postal code to see whether any code appears more than once
grps = pcs.groupby('Postal code').count()
print(f"Checking for any postal code listed more than once: {len(grps[grps['Borough'] > 1])}")


Checking for any postal code listed more than once: 0


Let's take a look at the resulting data frame:

In [101]:
pcs.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |


## Number of rows

In [102]:
pcs.shape

(103, 3)

## GeoCode It

Add latitude and longitude to the data frame. First add the coordinates as a list (that is the form in which Google returns them), then split those values into separate latitude and longitude columns.

In [103]:
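# Geocode each postal code; each result is a [lat, lon] list
pcs['LatLon'] = pcs['Postal code'].apply(geocode_it)
# Split the list column into separate latitude and longitude columns
pcs[['lat', 'lon']] = pd.DataFrame(pcs['LatLon'].tolist(), index=pcs.index)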

In [104]:
pcs.head(10)

| | Postal code | Borough | Neighborhood | LatLon | lat | lon |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | [43.7532586, -79.3296565] | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | [43.72588229999999, -79.3155716] | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront | [43.6542599, -79.36063589999999] | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights | [43.718518, -79.4647633] | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government | [43.6623015, -79.3894938] | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | [43.6678556, -79.5322424] | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern / Rouge | [43.8066863, -79.1943534] | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | [43.7459058, -79.352188] | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill / Woodbine Gardens | [43.7063972, -79.30993699999999] | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | [43.6571618, -79.3789371] | 43.657162 | -79.378937 |
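
Every lookup succeeded here, but a geocoding request can fail. The sketch below shows one way the split into `lat` and `lon` could tolerate that, assuming a failed lookup comes back as `None` or an empty list (this is an assumption about the geocoder's behavior, not something verified in this notebook). The helper `split_latlon` and the postal code `M0X` are hypothetical and used only for illustration.

```python
import pandas as pd

def split_latlon(frame, source="LatLon"):
    """Return a copy with 'lat' and 'lon' columns; missing coordinates become NaN."""
    out = frame.copy()
    nan_pair = (float("nan"), float("nan"))
    # Replace None or empty results with a NaN pair so every row has two values
    pairs = [p if p else nan_pair for p in out[source]]
    out[["lat", "lon"]] = pd.DataFrame(pairs, index=out.index, columns=["lat", "lon"])
    return out

# Toy frame: two real rows from the table above plus one made-up failed lookup
toy = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M0X"],
    "LatLon": [[43.7532586, -79.3296565], [43.72588229999999, -79.3155716], None],
})
result = split_latlon(toy)
print(result[["Postal code", "lat", "lon"]])
```

Rows with missing coordinates can then be found with `result[result['lat'].isna()]` and retried or dropped.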
