# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
This notebook extracts and processes the table of postcodes from the Wikipedia page at: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

We ignore postcodes with no assigned borough, merge all neighbourhoods that share a postcode into a single row, and use the borough name as the neighbourhood wherever the neighbourhood has not been assigned a separate name.


In [27]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

### First, download the wiki page and parse it into an object tree

In [28]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; the pandas.io.json import path is deprecated)
from bs4 import BeautifulSoup # library to parse HTML

#!wget -q -O 'canadian_postcodes.html' url
#print('Data downloaded!')
#with open('canadian_postcodes.html') as wiki_page:

wiki_page = requests.get(url).text
    
soup = BeautifulSoup(wiki_page,'lxml')
#print(soup.prettify())

### Find the table part of the page and extract the headers
3.1 The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [29]:
My_table = soup.find('table',{'class':'wikitable sortable'})
#print(My_table)

ths = My_table.find_all('th') # find_all is the current name; findAll is a deprecated alias
headers=[]
for th in ths:
    headers.append(th.text.strip())
headers

['Postcode', 'Borough', 'Neighbourhood']
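
The header extraction can be tried offline on a small inline table. The following is a minimal sketch, assuming a hand-written HTML snippet (`sample_html` and the `sample_*` names are hypothetical, not the real page) and Python's built-in `html.parser` in place of `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A class string containing a space matches the exact attribute value.
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# get_text(strip=True) drops the trailing newline inside the last header cell.
sample_headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]
print(sample_headers)
```

The list comprehension is equivalent to the loop with `append` used in the cell above.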

### Now add the data
* 3.2 Only process the cells that have an assigned borough. _Ignore cells with a borough that is Not assigned._
* 3.3 More than one neighborhood can exist in one postal code area. _For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table._
* 3.4 If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. _So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park._
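
As an illustration of rules 3.2 to 3.4, here is a minimal pandas sketch on a few hand-made rows. The sample data and the names `raw`, `assigned`, `unnamed` and `merged` are assumptions made for this example; it is an alternative to the row-by-row loop used in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical sample rows mimicking the raw Wikipedia table.
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# 3.2: keep only rows that have an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# 3.4: an unassigned neighbourhood takes the borough name.
unnamed = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[unnamed, "Neighbourhood"] = assigned.loc[unnamed, "Borough"]

# 3.3: merge neighbourhoods sharing a postcode into one comma-separated row.
# sort=False keeps postcodes in order of first appearance.
merged = (
    assigned.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Applying 3.4 before 3.3 matters here: otherwise the literal text "Not assigned" could end up inside a merged neighbourhood string.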

In [30]:
rows = My_table.find_all('tr')
data=[]
postcode_list={}  # maps each postcode to its row dict (the same object stored in data)
# skip the first row: it holds the headers, which we have already processed
rows=rows[1:]
for row in rows:
    tds = row.find_all('td')
    # guard against malformed rows that do not have all three cells
    if len(tds) < 3:
        continue
    postcode = tds[0].text.strip()
    borough = tds[1].text.strip()
    neigh = tds[2].text.strip()
    
    # 3.4 If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
    if (neigh == "Not assigned"):
        neigh = borough

    # 3.2 Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    if (borough != 'Not assigned'):
        # 3.3 If we already have an entry for this postcode, append the neighbourhood to the existing entry
        if (postcode in postcode_list):
            postcode_list[postcode]["Neighbourhood"] += ", " + neigh
        else:
            postcode_dict = {
                "Postcode": postcode, 
                "Borough": borough, 
                "Neighbourhood": neigh
            }
            data.append(postcode_dict)
            postcode_list[postcode] = postcode_dict
        
df = pd.DataFrame(data, columns=headers)
df.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


### Now check the shape of the dataframe
3.6 In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [31]:
df.shape

(103, 3)
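
Beyond the shape, the cleaning rules themselves can be checked directly. The sketch below assumes a hypothetical helper, `check_postcode_frame`, that is not part of the original notebook, and runs it on a small hand-made frame; the same call could be applied to `df`.

```python
import pandas as pd

def check_postcode_frame(frame):
    """Hypothetical helper: assert the cleaning rules hold, then return the shape."""
    # 3.3: every postcode appears exactly once after merging.
    assert frame["Postcode"].is_unique, "duplicate postcodes remain"
    # 3.2: no unassigned boroughs survive.
    assert not (frame["Borough"] == "Not assigned").any()
    # 3.4: no neighbourhood is left unassigned.
    assert not (frame["Neighbourhood"] == "Not assigned").any()
    return frame.shape

# Hand-made sample in the same layout as the cleaned table.
sample = pd.DataFrame(
    {
        "Postcode": ["M5A", "M7A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighbourhood": ["Harbourfront, Regent Park", "Queen's Park"],
    }
)
print(check_postcode_frame(sample))
```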