# Segmenting and Clustering Neighbourhoods in Toronto - Part 01
## Applied Data Science Capstone 
### IBM Data Science Professional Certificate

In [2]:
from bs4 import BeautifulSoup
import lxml, requests
import pandas as pd

In [3]:
url    = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup   = BeautifulSoup(source, 'lxml')

### Approach for Generating the Dataframe

* The Wikipedia page appears to contain 3 HTML 'table' tags, but we're interested in the first one
* This table contains each postal code, its assigned borough, and neighbourhood/s as 3 cells in a row
* Some postal codes are not assigned a borough, but there don't appear to be any that are assigned a borough without a neighbourhood
* We use BeautifulSoup's find_all method to find all of the rows in the table (via the 'tr' tag) and loop over them
* For each row, we then find all of its cells (via the 'td' tag)
* If exactly 3 cells are found (header rows use the 'th' tag, so they yield no 'td' cells and are skipped), we extract the text of each cell, strip the trailing newline, and replace the " /" separator between neighbourhoods with a comma
* Finally, unless the borough is "Not assigned", we add the row to the postal codes dataframe
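
The steps above can be tried offline on a small hand-written table before running them against the live page. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` and `parse_postal_rows` are hypothetical names introduced for illustration, the sample rows only imitate the layout of the Wikipedia table, and the built-in `html.parser` is used instead of `lxml` so the snippet has no extra parser dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample that imitates the layout of the Wikipedia table:
# one header row ('th' cells) followed by data rows with 3 'td' cells each.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>"
    "<tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>"
    "<tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>"
    "<tr><td>M6A\n</td><td>North York\n</td>"
    "<td>Lawrence Manor / Lawrence Heights\n</td></tr>"
    "</table>"
)

def parse_postal_rows(html):
    """Return one dict per row that has 3 cells and an assigned borough."""
    table = BeautifulSoup(html, "html.parser").find_all("table")[0]
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 3:  # header rows have no 'td' cells
            continue
        postal_code = cells[0].text.replace("\n", "")
        borough = cells[1].text.replace("\n", "")
        neighborhood = cells[2].text.replace(" /", ",").replace("\n", "")
        if "Not assigned" in borough:  # skip unassigned postal codes
            continue
        rows.append({"PostalCode": postal_code,
                     "Borough": borough,
                     "Neighborhood": neighborhood})
    return rows

rows = parse_postal_rows(SAMPLE_HTML)
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(df)
```

Collecting the rows in a list and building the dataframe once at the end avoids growing the dataframe row by row inside the loop.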

In [8]:
postalCodesTBL = soup.find_all("table")[0] # The page has several tables; we want the first one
postalCodeRows = [] # One dict per assigned postal code

for row in postalCodesTBL.find_all("tr"): # All of the rows in our table
    cells = row.find_all("td")
    if len(cells)==3:
        postalCode   = cells[0].text.replace("\n","")
        borough      = cells[1].text.replace("\n","")
        neighborhood = cells[2].text.replace(" /", ",").replace("\n","")
        if borough.find("Not assigned")==-1:
            postalCodeRows.append({"PostalCode":   postalCode, 
                                   "Borough":      borough, 
                                   "Neighborhood": neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe once from the list
postalCodesDF = pd.DataFrame(postalCodeRows, columns=["PostalCode", "Borough", "Neighborhood"])
postalCodesDF.head()
postalCodesDF.to_csv("postalCodes.csv", index=False)

In [9]:
postalCodesDF.shape

(103, 3)