# Segmenting and Clustering Neighbourhoods in Toronto - Part 01
## Applied Data Science Capstone 
### IBM Data Science Professional Certificate

In [11]:
from bs4 import BeautifulSoup
import lxml, requests
import pandas as pd

In [64]:
url    = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
soup   = BeautifulSoup(source, 'lxml')

### Approach for Generating the Dataframe

* The Wikipedia page appears to contain 2 HTML 'table' tags, but we're interested in the first one
* This table contains each postal code, its assigned borough, and neighbourhood/s as 3 cells in a row
* We use BeautifulSoup's `find_all` method to find all of the rows in the table (via the 'tr' tag) and loop over them
* For each row we then find all of the cells (via the 'td' tag)
* If exactly 3 cells are found (which skips the header row, since its cells use the 'th' tag) we extract the text of each cell, strip the newlines, and replace the " /" separators between neighbourhoods with commas so the data is in the correct format for insertion into the dataframe
* Rows whose borough is "Not assigned" are skipped
* Finally we append a row to the postal codes dataframe
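
The steps above can be tried offline on a small inline table. This is only a minimal sketch: `SAMPLE_HTML` and `parse_rows` are hypothetical names introduced for illustration, the sample markup merely imitates the three-cell row layout described in the list, and it uses the built-in `html.parser` backend instead of `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the three-cell row layout described above.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
  <tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park / Harbourfront
</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per table row that has an assigned borough."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[0]  # first table on the page
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 3:
            continue  # header rows use 'th', so they contain no 'td' cells
        # Strip the trailing newline that each cell's text carries.
        postal_code, borough, neighborhood = (
            cell.text.replace("\n", "") for cell in cells
        )
        if "Not assigned" in borough:
            continue  # skip postal codes without a borough
        records.append({
            "PostalCode": postal_code,
            "Borough": borough,
            "Neighborhood": neighborhood.replace(" /", ","),
        })
    return records


records = parse_rows(SAMPLE_HTML)
print(records)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)`, which avoids growing a dataframe one row at a time.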

In [177]:
postalCodesTBL = soup.find_all("table")[0] # There are 2 tables, we want the first one
rows = []                                  # Collect rows in a list; DataFrame.append was removed in pandas 2.0

for row in postalCodesTBL.find_all("tr"): # All of the rows in our table
    cells = row.find_all("td")            # Header rows use 'th' cells, so they yield no 'td' cells
    if len(cells)==3:
        postalCode   = cells[0].text.replace("\n","")
        borough      = cells[1].text.replace("\n","")
        neighborhood = cells[2].text.replace(" /", ",").replace("\n","")
        if borough.find("Not assigned")==-1: # Skip postal codes without an assigned borough
            rows.append({"PostalCode":   postalCode, 
                         "Borough":      borough, 
                         "Neighborhood": neighborhood})

# Build the dataframe once from the collected rows, then save it
postalCodesDF = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
postalCodesDF.to_csv("postalCodes.csv", index=False)

In [141]:
postalCodesDF.head()

| Unnamed: 0 | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 6 | M1B | Scarborough | Malvern, Rouge |
| 12 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 18 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 22 | M1G | Scarborough | Woburn |
| 26 | M1H | Scarborough | Cedarbrae |


### Original Approach for Generating the Dataframe

*As an interesting side note, the Wikipedia page appears to have changed while I was writing the code to extract the postal codes/boroughs/neighbourhoods and create the dataframe!*

* The Wikipedia page appears to contain 4 HTML 'table' tags, but we're interested in the first one
* This table contains the postal code, borough, and neighbourhoods in a single cell and lays out the cells in a 2D array format
* After selecting this table, we find all of the 'td' tags (i.e. standard table cells) and loop through them
* In the cell there is a paragraph that contains the postal code inside a 'b' tag, so we pull this out
* If the postal code has an assigned borough, it will be inside an 'a' (i.e. hyperlink) tag and similar for the neighborhoods
* We find all of the 'a' tags and loop over them. If a postal code doesn't have a borough we won't enter this loop
* Assuming that some hyperlinks were found, we assign the borough to be the text from the first one
* We then create a list of neighborhoods from the remaining hyperlinks
* We then conditionally define the neighborhood to be either a comma separated list of neighborhoods that were found, or the borough, if no neighborhoods were found
* Finally we append a row to the postal codes dataframe
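
The original approach can be sketched the same way on inline markup. The old page layout is no longer available, so `SAMPLE_HTML` below is a hypothetical reconstruction based only on the description in the list above (postal code in a 'b' tag, borough and neighbourhoods as 'a' tags inside a 'span'); `parse_cells` is likewise an illustrative name, and `html.parser` is used in place of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the old one-cell-per-postal-code layout.
SAMPLE_HTML = """
<table>
  <tr>
    <td><p><b>M1A</b><span>Not assigned</span></p></td>
    <td><p><b>M3A</b><span><a href="#">North York</a>
        (<a href="#">Parkwoods</a>)</span></p></td>
    <td><p><b>M7A</b><span><a href="#">Queen's Park</a></span></p></td>
  </tr>
</table>
"""


def parse_cells(html):
    """Return one dict per cell that links to an assigned borough."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[0]
    records = []
    for cell in table.find_all("td"):
        postal_code = cell.p.b.text           # postal code is the bold text
        links = cell.p.span.find_all("a")     # borough first, then neighbourhoods
        if not links:
            continue                          # no hyperlink means no assigned borough
        borough = links[0].text
        neighborhoods = [link.text for link in links[1:]]
        # Fall back to the borough name when no neighbourhoods are listed.
        neighborhood = ", ".join(neighborhoods) if neighborhoods else borough
        records.append({
            "PostalCode": postal_code,
            "Borough": borough,
            "Neighborhood": neighborhood,
        })
    return records


records = parse_cells(SAMPLE_HTML)
print(records)
```

In this sample the unassigned cell is dropped because it has no hyperlinks, and the cell with only a borough link reuses the borough as its neighbourhood.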

In [14]:
postalCodesTBL = soup.find_all("table")[0] # There are a few tables, we want the first one
rows = []                                  # Collect rows in a list; DataFrame.append was removed in pandas 2.0

for cell in postalCodesTBL.find_all("td"): # All of the standard cells in our table
    postalCode = cell.p.b.text             # The postal code is in the bold text part of the cell paragraph
    hyperlinks = cell.p.span.find_all("a") # If there is an assigned borough there will be a hyperlink for it
    if len(hyperlinks)>0:                  # If there are neighborhoods assigned, these will be the subsequent hyperlinks
        borough       = hyperlinks[0].text # We're assuming that the borough is always the first hyperlink
        neighborhoods = [link.text for link in hyperlinks[1:]]
        neighborhood  = ', '.join(neighborhoods) if len(neighborhoods)>0 else borough
        rows.append({"PostalCode":   postalCode, 
                     "Borough":      borough, 
                     "Neighborhood": neighborhood})

# Build the dataframe once from the collected rows
postalCodesDF = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td>
<td>M4B
</td>
<td>East York
</td>
<td>Parkview Hill / Woodbine Gardens
</td>
<td>M5B
</td>
<td>Downtown Toronto
</td>
<td>Garden District, Ryerson
</td>
<td>M6B
</td>
<td>North York
</td>
<td>Glencairn
</td>
<td>M7B
</td>
<td>Not assigned
</td>
<td>
</td>

In [57]:
postalCodesDF.shape

(103, 3)