# Segmenting and Clustering Neighbourhoods in Toronto
## Applied Data Science Capstone 
### IBM Data Science Professional Certificate

In [54]:
from bs4 import BeautifulSoup
import lxml, requests
import pandas as pd

In [55]:
url    = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup   = BeautifulSoup(source, 'lxml')

### Approach for Generating the Dataframe

* The Wikipedia page appears to contain 4 HTML 'table' tags, but we're interested in the first one
* After selecting this table, we find all of the 'td' tags (i.e. standard table cells) and loop through them
* Each cell holds a paragraph, and the postal code sits inside a 'b' (i.e. bold) tag within it, so we pull this out
* If the postal code has an assigned borough, the borough will be inside an 'a' (i.e. hyperlink) tag, and so will any neighborhoods
* We find all of the 'a' tags in the cell. If there are none, the postal code has no borough and we skip the cell
* Assuming that some hyperlinks were found, we assign the borough to be the text from the first one
* We then create a list of neighborhoods from the remaining hyperlinks
* We then define the neighborhood to be either a comma-separated list of the neighborhoods that were found, or the borough if no neighborhoods were found
* Finally we append a row to the postal codes dataframe
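
The borough and neighborhood rule from the steps above can be isolated as a small pure function, which makes it easy to check without touching the web page. This is only a minimal sketch: `build_row` and the placeholder inputs are hypothetical names introduced for illustration and are not part of the notebook.

```python
def build_row(postal_code, link_texts):
    """Apply the borough/neighborhood rule to the hyperlink texts of one cell.

    Returns a row dict, or None when the cell has no hyperlinks
    (i.e. the postal code has no assigned borough).
    """
    if not link_texts:
        return None
    borough = link_texts[0]          # Assumption: the first hyperlink is the borough
    neighborhoods = link_texts[1:]   # Any remaining hyperlinks are neighborhoods
    # Fall back to the borough when no neighborhoods were found
    neighborhood = ", ".join(neighborhoods) if neighborhoods else borough
    return {"PostalCode": postal_code,
            "Borough": borough,
            "Neighborhood": neighborhood}


# Placeholder inputs, not real data from the page
print(build_row("X1A", []))
print(build_row("X2A", ["Borough A"]))
print(build_row("X3A", ["Borough A", "Neighborhood 1", "Neighborhood 2"]))
```

The three calls cover the three cases described in the list: no borough, a borough without neighborhoods, and a borough with several neighborhoods.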

In [113]:
postalCodesTBL = soup.find_all("table")[0] # There are a few tables, we want the first one
rows = []                                  # Collect rows in a list; DataFrame.append was removed in pandas 2.0

for cell in postalCodesTBL.find_all("td"): # All of the standard cells in our table
    postalCode = cell.p.b.text             # The postal code is in the bold text part of the cell paragraph
    hyperlinks = cell.p.span.find_all("a") # If there is an assigned borough there will be a hyperlink for it
    if len(hyperlinks)>0:                  # Cells without hyperlinks have no assigned borough and are skipped
        borough       = hyperlinks[0].text # We're assuming that the borough is always the first hyperlink
        neighborhoods = [link.text for link in hyperlinks[1:]] # Any subsequent hyperlinks are neighborhoods
        neighborhood  = ', '.join(neighborhoods) if len(neighborhoods)>0 else borough
        rows.append({"PostalCode":   postalCode,
                     "Borough":      borough,
                     "Neighborhood": neighborhood})

# Build the dataframe once, after the loop, from the collected rows
postalCodesDF = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
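
The loop relies on every cell having the shape `td > p > b` plus `td > p > span`. The sketch below shows that navigation on a small hand-written fragment and skips cells where a tag is missing instead of raising an `AttributeError`. The fragment, the placeholder codes and names, and the helper `extract_cells` are all hypothetical. The fragment only mimics the layout the loop assumes, so the live page may differ. The built-in `html.parser` is used here so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the cell layout the loop assumes (placeholder data)
SAMPLE_HTML = """
<table>
  <tr>
    <td><p><b>X1A</b><span>Not assigned</span></p></td>
    <td><p><b>X2A</b><span><a href="#">Borough A</a> (<a href="#">Neighborhood 1</a>)</span></p></td>
    <td>A cell without the expected tags</td>
  </tr>
</table>
"""


def extract_cells(html):
    """Return a list of (postal_code, [hyperlink texts]) for well-formed cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    results = []
    for cell in table.find_all("td"):
        paragraph = cell.find("p")
        if paragraph is None:
            continue                      # No paragraph: not a postal code cell
        bold = paragraph.find("b")
        span = paragraph.find("span")
        if bold is None or span is None:
            continue                      # Layout differs from what we assume
        links = [a.get_text(strip=True) for a in span.find_all("a")]
        results.append((bold.get_text(strip=True), links))
    return results


for code, links in extract_cells(SAMPLE_HTML):
    print(code, links)
```

A cell with an empty list of links corresponds to a postal code with no assigned borough, which the main loop leaves out of the dataframe.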

In [111]:
postalCodesDF.shape

(101, 3)