# Segmentation of neighborhoods in Toronto


## 1. Objective
Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in its table of postal codes, and transform that data into a pandas dataframe.


### Note: **the data on this page may differ from the example shown in the image**, since the page is updated over time.

## 2. Scraping data from the wiki

In [1]:
import urllib.request
from bs4 import BeautifulSoup 
import pandas as pd
print("All packages are loaded!")

All packages are loaded!


In this example, it would be simpler to copy the table from the wiki page and process it directly. I chose BeautifulSoup instead because it will be useful in my later projects.

I rely on a few assumptions, based on what I found on the wiki page:
1. The data is stored in a table tag of the HTML page.
2. Each row of data is stored under a tr tag, and its cells may contain newline characters.

In [2]:
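# Download the page and parse it with Python's built-in HTML parser
with urllib.request.urlopen('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M') as f:
    scrapper = BeautifulSoup(f, 'html.parser')
    # find() returns the first <table> on the page
    table = scrapper.find('table').tbody
    all_row = table.find_all('tr')
    # Skip the first row, which is the header
    all_row = all_row[1:]
    # Strip trailing whitespace (including newlines) from every cell
    data = [[x.text.rstrip() for x in row.find_all('td')] for row in all_row]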

The *data* object is a list of lists that contains all the rows of the table.
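To see the two assumptions at work without hitting the network, here is a minimal sketch that runs the same extraction on a tiny inline HTML snippet. The snippet `SAMPLE_HTML` and the helper `extract_rows` are hypothetical names made up for illustration; the real page has many more rows and its markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (illustration only).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return the text of every <td> cell, one list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Skip the first row: it holds <th> header cells, not <td> data cells.
    rows = table.find_all("tr")[1:]
    # rstrip() removes the trailing newline that some cells carry.
    return [[cell.get_text().rstrip() for cell in row.find_all("td")] for row in rows]


sample_data = extract_rows(SAMPLE_HTML)
print(sample_data)
```

Because `find_all('tr')` searches all descendants, this sketch does not depend on whether the table markup contains a `tbody` element.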

## 3. Prepare the dataframe

In [3]:
# Column names for the dataframe
header = ['PostalCode', 'Borough', 'Neighborhood']

In [4]:
pandas_table = pd.DataFrame(data, columns = header)
pandas_table.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Filter out rows whose Borough is not assigned

In [6]:
pandas_table = pandas_table[pandas_table['Borough'] != 'Not assigned']
pandas_table.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |


### Group the neighborhoods that share a postal code

In [7]:
grb_tbl = pandas_table.groupby(by=['PostalCode']).agg(lambda x: ','.join(set(x))).reset_index()

In [8]:
grb_tbl.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Port Union,Rouge Hill,Highland Creek |
| 2 | M1E | Scarborough | West Hill,Guildwood,Morningside |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
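One caveat about the grouping cell: a Python set has no guaranteed iteration order, so the order of the joined neighborhoods can vary from run to run. The sketch below shows one possible alternative that keeps the first-seen order. The sample rows and the helper `join_unique` are hypothetical and only mirror the shape of *pandas_table*; taking the first Borough per group assumes each postal code belongs to a single borough.

```python
import pandas as pd

# Hypothetical rows in the same shape as pandas_table.
sample = pd.DataFrame(
    [
        ["M1B", "Scarborough", "Rouge"],
        ["M1B", "Scarborough", "Malvern"],
        ["M1G", "Scarborough", "Woburn"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)


def join_unique(values):
    """Join distinct values with commas, keeping first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return ",".join(dict.fromkeys(values))


grouped = sample.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": join_unique}
)
print(grouped)
```

With this variant, repeated runs on the same input produce the same strings, which makes the output easier to compare and test.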


In [12]:
pandas_table[pandas_table == 'Not assigned'].sum()

PostalCode      0.0
Borough         0.0
Neighborhood    0.0
dtype: float64

As the output above shows, no remaining cell contains 'Not assigned': every row has a borough and at least one valid neighborhood, so we do not need to fill the neighborhood with the borough name.
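For reference, here is a minimal sketch of a more direct way to count the placeholder cells, and of the fallback that would apply if a neighborhood were unassigned. The sample rows are hypothetical and chosen so that the fallback has something to do; they are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical rows: the second one has a borough but no neighborhood.
sample = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# The comparison yields booleans; sum() counts the True values per column.
counts = (sample == "Not assigned").sum()
print(counts)

# Fallback: where the neighborhood is unassigned, reuse the borough name.
missing = sample["Neighborhood"] == "Not assigned"
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Counting the boolean mask gives integer counts per column, which reads more directly than summing a masked dataframe.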

Finally, we check the shape of the grouped dataframe.

In [14]:
grb_tbl.shape

(103, 3)

The grouped dataframe has 103 rows and 3 columns.