### Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

We first import the modules this notebook depends on.

In [297]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

Now we fetch the HTML of the Wikipedia page and pass it to a BeautifulSoup object.
We print the first 250 characters of the prettified HTML.

In [298]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
print(soup.prettify()[:250])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className


Locate the data table and find its rows.
We show the first 5 rows of the table HTML.

In [299]:
# Select the first sortable wikitable, then look for rows inside that table only
table = soup.findAll("table", {"class": "wikitable sortable"})[0]
tbody = table.find("tbody")
trs = tbody.findAll("tr")
print(trs[:5])

[<tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>, <tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>, <tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>, <tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>]
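
To make the row structure concrete without a network request, here is a minimal sketch that extracts cell text from a small inline copy of the table markup. It swaps BeautifulSoup for the standard library's `html.parser`, so it is an illustration of the idea rather than the method used in this notebook. The `RowCollector` class and `sample_html` string are names made up for this example.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> also lands here
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Assumption: a hand-copied fragment shaped like the rows printed above
sample_html = """
<table class="wikitable sortable"><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</tbody></table>
"""

collector = RowCollector()
collector.feed(sample_html)
rows = collector.rows
for row in rows:
    print(row)
```

Unlike the `.find("a")` approach used in this notebook, this sketch concatenates all text inside a cell, so a cell with several links would yield their combined text.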


Now, we go through each row of the data table and store it in a list called `data`.

In [300]:
col_postalcode = 'PostalCode'
col_borough = 'Borough'
col_neighborhood = 'Neighborhood'
not_assigned = 'Not assigned'
data = []

# Iterate through the rows
for tr in trs:
    tds = tr.findAll("td")
    # Three td elements, one for each column value
    if len(tds) == 3:
        postalcode = tds[0].text.strip()
        borough = tds[1].find("a")
        neighborhood = tds[2].find("a")
        
        # Sometimes a row has a hyperlink (<a> tag) and sometimes it doesn't.
        # Check if the .find("a") above returned None. If so, strip text from td instead.
        borough = borough.text.strip() if borough is not None else tds[1].text.strip()
        neighborhood = neighborhood.text.strip() if neighborhood is not None else tds[2].text.strip()
        
        # Skip if borough is not assigned
        if borough == not_assigned:
            continue
        
        # Use borough name for neighborhood if neighborhood is not assigned
        if neighborhood == not_assigned:
            neighborhood = borough
        
        # Add to data
        data.append({
            col_postalcode: postalcode,
            col_borough: borough,
            col_neighborhood: neighborhood
        })
    else:
        continue
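
The cleaning rules in the loop above can also be isolated into a small function, which makes them easier to test on their own. The following is a minimal sketch under that assumption. `clean_row` is a hypothetical helper, not part of the notebook, and the `M9X` row is invented purely to exercise the "use the borough name" rule.

```python
NOT_ASSIGNED = "Not assigned"


def clean_row(postalcode, borough, neighborhood):
    """Return a cleaned record, or None if the row should be skipped."""
    postalcode = postalcode.strip()
    borough = borough.strip()
    neighborhood = neighborhood.strip()

    # Rows without an assigned borough are dropped
    if borough == NOT_ASSIGNED:
        return None

    # An unassigned neighborhood falls back to the borough name
    if neighborhood == NOT_ASSIGNED:
        neighborhood = borough

    return {
        "PostalCode": postalcode,
        "Borough": borough,
        "Neighborhood": neighborhood,
    }


raw_rows = [
    ("M1A", "Not assigned", "Not assigned\n"),
    ("M3A", "North York", "Parkwoods\n"),
    ("M9X", "Example Borough", "Not assigned\n"),  # hypothetical row
]

cleaned = [rec for rec in (clean_row(*row) for row in raw_rows) if rec is not None]
for rec in cleaned:
    print(rec)
```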

Create a dataframe using the rows in `data`.
We sort by PostalCode and show the top 10 rows.

In [301]:
df = pd.DataFrame(data, columns=[col_postalcode, col_borough, col_neighborhood])
df.sort_values(col_postalcode).head(10)

|    | PostalCode | Borough     | Neighborhood   |
|----|------------|-------------|----------------|
| 8  | M1B        | Scarborough | Rouge          |
| 9  | M1B        | Scarborough | Malvern        |
| 23 | M1C        | Scarborough | Port Union     |
| 22 | M1C        | Scarborough | Rouge Hill     |
| 21 | M1C        | Scarborough | Highland Creek |
| 33 | M1E        | Scarborough | Guildwood      |
| 34 | M1E        | Scarborough | Morningside    |
| 35 | M1E        | Scarborough | West Hill      |
| 39 | M1G        | Scarborough | Woburn         |
| 43 | M1H        | Scarborough | Cedarbrae      |


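We see that a single postal code can map to several neighborhoods (for example, M1B covers both Rouge and Malvern).
We group these rows so that each postal code appears once, with its neighborhoods joined by commas.
We sort by PostalCode and show the first 5 rows.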

In [302]:
# Join the neighborhoods of each (PostalCode, Borough) pair into one comma-separated string
df = df.groupby([col_postalcode, col_borough])[col_neighborhood]\
    .agg(', '.join)\
    .reset_index()

df.columns=[col_postalcode, col_borough, col_neighborhood]
df.sort_values(col_postalcode).head()

|   | PostalCode | Borough     | Neighborhood                           |
|---|------------|-------------|----------------------------------------|
| 0 | M1B        | Scarborough | Rouge, Malvern                         |
| 1 | M1C        | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E        | Scarborough | Guildwood, Morningside, West Hill      |
| 3 | M1G        | Scarborough | Woburn                                 |
| 4 | M1H        | Scarborough | Cedarbrae                              |
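
To illustrate what the grouping step computes, here is a minimal plain-Python sketch of the same idea: collect neighborhoods per (PostalCode, Borough) key, then join them with commas. It uses a dictionary instead of pandas, so it is an analogy and not the notebook's implementation. The sample records are taken from the rows shown earlier.

```python
from collections import defaultdict

records = [
    {"PostalCode": "M1B", "Borough": "Scarborough", "Neighborhood": "Rouge"},
    {"PostalCode": "M1B", "Borough": "Scarborough", "Neighborhood": "Malvern"},
    {"PostalCode": "M1G", "Borough": "Scarborough", "Neighborhood": "Woburn"},
]

# Collect neighborhoods under their (PostalCode, Borough) key, keeping input order
grouped = defaultdict(list)
for rec in records:
    grouped[(rec["PostalCode"], rec["Borough"])].append(rec["Neighborhood"])

# One output row per key, sorted by key, with neighborhoods joined by ", "
merged = [
    {"PostalCode": postalcode, "Borough": borough, "Neighborhood": ", ".join(names)}
    for (postalcode, borough), names in sorted(grouped.items())
]

for row in merged:
    print(row)
```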


We save the final dataframe and print its shape.

In [303]:
df.to_csv(os.path.join('..', 'data', 'week3_a.csv'), index=False)
df.shape

(103, 3)
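
If the grouping worked as intended, each postal code should appear in exactly one row. A quick sanity check along those lines might look like the sketch below. `duplicated_postal_codes` is a hypothetical helper, and the two small lists stand in for the dataframe before and after grouping.

```python
from collections import Counter


def duplicated_postal_codes(records):
    """Return the sorted postal codes that occur in more than one record."""
    counts = Counter(rec["PostalCode"] for rec in records)
    return sorted(code for code, n in counts.items() if n > 1)


# Stand-in for the data before grouping: M1B appears twice
ungrouped_rows = [
    {"PostalCode": "M1B", "Neighborhood": "Rouge"},
    {"PostalCode": "M1B", "Neighborhood": "Malvern"},
    {"PostalCode": "M1G", "Neighborhood": "Woburn"},
]

# Stand-in for the data after grouping: one row per postal code
grouped_rows = [
    {"PostalCode": "M1B", "Neighborhood": "Rouge, Malvern"},
    {"PostalCode": "M1G", "Neighborhood": "Woburn"},
]

print(duplicated_postal_codes(ungrouped_rows))
print(duplicated_postal_codes(grouped_rows))
```

An empty list for the grouped data indicates that no postal code is repeated.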