# Part 1. Scraping and cleaning the dataframe
The code in this notebook scrapes the table of postal codes from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The resulting dataframe consists of three columns: Postalcode, Borough, and Neighborhood.

For the analysis that follows, the data need to be cleaned following the steps below:

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
2. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.
3. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
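
The three rules above can be tried out on a small hand-made table before touching the scraped data. The sketch below is illustrative only: the rows are invented to exercise each rule (the M5A split mirrors the example in the list), and the column names are assumed to match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical toy rows, invented to exercise each rule; this is not scraped data.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],      # rule 1: dropped
        ["M5A", "Downtown Toronto", "Harbourfront"],  # rule 2: merged with the next row
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],      # rule 3: neighborhood taken from borough
    ],
    columns=["Postalcode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows with an assigned borough.
cleaned = toy[toy["Borough"] != "Not assigned"]

# Rule 2: one row per postal code, neighborhoods joined with a comma.
cleaned = (
    cleaned.groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# Rule 3: fall back to the borough when the neighborhood is not assigned.
mask = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[mask, "Neighborhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```

Passing `sort=False` to `groupby` keeps the postal codes in the order they first appear, which makes the result easier to compare with the source table.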


In [1]:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

## 1. Scraping table
1.1. Scrape the table with the class `wikitable sortable`.  
1.2. Store each value in the table in lists and make them into a dataframe.

In [4]:
#1.1. Scrape the table with the class, wikitable sortable. 
weburl = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(weburl,'lxml')
wiki_table = soup.find('table',{'class':'wikitable sortable'})
wiki_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

In [5]:
# 1.2. Store each value in the table in lists and make them into a dataframe.
row_list = []

for tr in wiki_table.find_all('tr'):
    # Strip surrounding whitespace, including the newline that ends each cell.
    cells = [td.text.strip() for td in tr.find_all('td')]
    # The header row contains only <th> cells, so it yields an empty list; skip it.
    if cells:
        row_list.append(cells)

df = pd.DataFrame(row_list, columns=["Postalcode", "Borough", "Neighborhood"])
df.head(5)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
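
The row-extraction step can also be tried offline, without requesting the live page. The sketch below is a minimal example under stated assumptions: the HTML string is a hand-written stand-in shaped like the Wikipedia table (not the real page), `toy_df` is a name introduced only for this illustration, and Python's built-in `html.parser` is used instead of `lxml` so no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical inline snippet shaped like the Wikipedia table; not the live page.
html = """
<table class="wikitable sortable">
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M4A
</td><td>North York
</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) removes the newline that ends each cell.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

toy_df = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighborhood"])
print(toy_df)
```

Stripping whitespace explicitly does not depend on every cell ending in exactly one newline, which slicing off the last character would assume.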


## 2. Cleaning the dataframe
2.1. Filter out the rows whose borough is Not assigned.  
2.2. Combine rows that share a postal code into one row, with the neighborhoods separated by commas.  
2.3. If a row has a borough but a Not assigned neighborhood, set the neighborhood to the borough.  

In [6]:
# 2.1. Filter out the rows whose borough is Not assigned.
df_borough_assigned = df[df["Borough"] != "Not assigned"]
# 2.2. Combine rows that share a postal code into one row, with the neighborhoods separated by commas.
df_borough_assigned = df_borough_assigned.groupby(['Postalcode', 'Borough'], sort=False).agg(', '.join).reset_index()
# 2.3. If a row has a borough but a Not assigned neighborhood, set the neighborhood to the borough.
not_assigned = df_borough_assigned['Neighborhood'] == 'Not assigned'
df_borough_assigned.loc[not_assigned, 'Neighborhood'] = df_borough_assigned.loc[not_assigned, 'Borough']
df_borough_assigned.head(5)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The Wikipedia page already lists the neighborhoods merged by postal code and borough, so step 2.2 has nothing left to combine.

In [7]:
# The shape of the dataframe
df_borough_assigned.shape

(103, 3)
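
Beyond the shape, a few quick checks can confirm that the cleaning rules actually hold. The sketch below is one possible way to do this, not part of the original notebook: `check_cleaned` is a hypothetical helper, and `sample` is a small invented stand-in for `df_borough_assigned`, assumed to use the same column names.

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates one of the cleaning rules."""
    assert (frame["Borough"] != "Not assigned").all(), "unassigned borough left"
    assert (frame["Neighborhood"] != "Not assigned").all(), "unassigned neighborhood left"
    assert frame["Postalcode"].is_unique, "duplicate postal code"


# Hypothetical stand-in for df_borough_assigned, built from rows shown earlier.
sample = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
    }
)

check_cleaned(sample)  # passes silently when all three rules hold
print(sample.shape)
```

In the notebook itself, the same helper could be called as `check_cleaned(df_borough_assigned)`.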