# IBM Data Science Professional Certificate
# Applied Data Science Capstone
## Week 3
## Segmenting and Clustering Neighborhoods in Toronto

## Part 1 - Parsing Toronto postal codes from Wikipedia
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

* The dataframe will consist of three columns: `PostalCode`, `Borough`, and `Neighborhood`
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.
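The cleaning rules listed above can be sketched in pandas on a small hand-made frame. This is only an illustration under stated assumptions: the rows below are hypothetical sample data (including the made-up `M0X` / `Example Borough` entry), laid out one neighborhood per row, and the variable names `raw` and `clean` are not part of the notebook.

```python
import pandas as pd

# Hypothetical sample rows (not scraped data), one neighborhood per row
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M0X"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Example Borough"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows that have an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the name of its borough
mask = clean["Neighborhood"] == "Not assigned"
clean.loc[mask, "Neighborhood"] = clean.loc[mask, "Borough"]

# Rule 3: combine rows that share a postal code into one comma-separated row
clean = (
    clean.groupby(["PostalCode", "Borough"], as_index=False, sort=False)["Neighborhood"]
    .agg(", ".join)
)

print(clean)
print(clean.shape)  # (rows, columns) of the cleaned frame
```

If the scraped table already lists several neighborhoods in one cell, as the output further down suggests, the grouping step leaves the rows unchanged.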

In [18]:
import numpy as np
import pandas as pd

import requests
import urllib.request
from urllib.request import urlopen

import time
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = urlopen(url) 
soup = BeautifulSoup(html, 'html.parser')

Now we will scrape the table we need from the source page. The webpage contains multiple tables; we need the one with the `wikitable sortable` class.

In [19]:
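from io import StringIO

# locate the postal code table by its class attribute
table = soup.find('table', attrs={'class':'wikitable sortable'})
# parse the table's HTML into a dataframe; wrapping the string in StringIO
# avoids the literal-HTML deprecation warning in newer pandas versions
df = pd.read_html(StringIO(str(table)))[0]
df.head()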

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


As per bullet point two, we need to ignore the *Not assigned* rows. Also, according to the exercise requirements, the *Postal Code* column must be named without a space.

In [26]:
# drop the rows with Not assigned values
df_dropna = df[df.Borough != 'Not assigned'].reset_index(drop=True)

# rename the Postal Code column
df_dropna.rename(columns={'Postal Code' : 'PostalCode'}, inplace=True)
df = df_dropna
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [22]:
df.shape

(103, 3)

## Part 2 - Get the latitude and the longitude coordinate

We now have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the latitude and longitude coordinates of each neighborhood.

As instructed, we will use the Geocoder Python package for this task: https://geocoder.readthedocs.io/index.html.
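
A minimal sketch of how the coordinates might be attached to the dataframe is given here. Several things are assumptions rather than part of the notebook: the ArcGIS provider (`geocoder.arcgis`), the `"<postal code>, Toronto, Ontario"` query format, the helper names `arcgis_latlng` and `add_coordinates`, and the `Latitude` / `Longitude` column names. The lookup function is injectable so that the logic can be tried offline; the demo uses a stand-in dictionary with placeholder coordinates, not real geocoding results.

```python
import pandas as pd


def arcgis_latlng(postal_code):
    """Return [lat, lng] for a Toronto postal code, or an empty/None value on failure.

    Assumption: the ArcGIS provider of the geocoder package is used.
    """
    import geocoder  # imported lazily so the offline demo runs without the package

    result = geocoder.arcgis(f"{postal_code}, Toronto, Ontario")
    return result.latlng


def add_coordinates(df, lookup=arcgis_latlng):
    """Return a copy of df with Latitude and Longitude columns added."""
    # A failed lookup (None or empty list) becomes a pair of missing values
    coords = [lookup(code) or (None, None) for code in df["PostalCode"]]
    out = df.copy()
    out["Latitude"] = [pair[0] for pair in coords]
    out["Longitude"] = [pair[1] for pair in coords]
    return out


# Offline demo: a hypothetical lookup table with placeholder coordinates
fake_coords = {"M3A": [43.75, -79.33]}
demo = pd.DataFrame({"PostalCode": ["M3A", "M4A"]})
demo_out = add_coordinates(demo, lookup=fake_coords.get)
print(demo_out)
```

With network access and the package installed, `add_coordinates(df)` would call the real provider once per row. Geocoding services can fail or rate-limit, so rows with missing coordinates should be checked before further use.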