### 1. Install the lxml parser

In [20]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.1
Note: you may need to restart the kernel to use updated packages.


### 2. Import all required libraries. Assign all the values from the Wikipedia table to a pandas dataframe.

In [122]:
# This page helped me a lot: 
# https://simpleanalytical.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas

import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')  # name the parser explicitly; lxml was installed in step 1

# print(soup.prettify()) # This will return HTML code for the whole wikipedia page: "List_of_postal_codes_of_Canada:_M" 

table = soup.find('table', class_='wikitable sortable')

A = []
B = []
C = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    # Skip rows that do not have exactly three <td> cells (e.g. the header row)
    if len(cells) == 3:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))

# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
# All values from lists A, B, and C are assigned to the dataframe.

df_extracted = pd.DataFrame(A,columns=['PostalCode'])
df_extracted['Borough']=B
df_extracted['Neighborhood']=C

df_extracted.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
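
The row-extraction loop can be tried offline against a small inline table. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML` is a hypothetical stand-in for the Wikipedia page (not its real markup), `extract_rows` is a helper name that does not appear in the notebook, and the parser is Python's built-in `html.parser` so the example does not depend on lxml. It also uses `get_text(strip=True)` instead of `find(text=True)`, so whitespace is removed during extraction.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page (not the real markup).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def extract_rows(html):
    """Return a dataframe with one record per three-cell table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable sortable")
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # The header row contains <th> cells only, so it is skipped here.
        if len(cells) == 3:
            records.append([cell.get_text(strip=True) for cell in cells])
    return pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])


sample_df = extract_rows(SAMPLE_HTML)
print(sample_df)
```

Collecting each row as one record, rather than filling three parallel lists, keeps the values of a row together and makes it harder for the columns to drift out of alignment.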


### 3. Remove the '\n' in each cell from the dataframe:

In [123]:
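# Remove the trailing '\n' from each cell of the dataframe.
# Work on a copy so that df_extracted keeps the raw scraped values.

df = df_extracted.copy()
df['PostalCode'] = df['PostalCode'].str.rstrip('\n')
df['Borough'] = df['Borough'].str.rstrip('\n')
df['Neighborhood'] = df['Neighborhood'].str.rstrip('\n')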

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
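
A positional slice such as `.str[0:-1]` removes the last character whether or not it is a newline, whereas `.str.strip()` only removes surrounding whitespace. The following sketch shows the difference on a small made-up frame; the name `raw` and its values are assumptions for illustration, not data from the scraped table.

```python
import pandas as pd

# Made-up sample; the last Neighborhood value deliberately has no trailing newline.
raw = pd.DataFrame({
    "PostalCode": ["M3A\n", "M4A\n"],
    "Borough": ["North York\n", "North York\n"],
    "Neighborhood": ["Parkwoods\n", "Victoria Village"],
})

# Strip surrounding whitespace from every column in one pass.
cleaned = raw.apply(lambda col: col.str.strip())

# The positional slice cuts a real character when no newline is present.
sliced = raw["Neighborhood"].str[0:-1]

print(cleaned)
print(sliced.tolist())
```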


### 4. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [131]:
# Keep only the rows with an assigned borough; drop rows whose borough is 'Not assigned'.

# Filtering and resetting the index in one expression avoids modifying a slice of df in place.
df_assigned = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df_assigned.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


### 5. More than one neighborhood can exist in one postal code area. For example, M5A has two neighborhoods: Harbourfront and Regent Park. Rows that share a postal code are combined into one row, with the neighborhoods separated by a comma, as shown in row 2 of the table above:

In [125]:
# The Wikipedia page has since been updated and already lists the combined neighborhoods in a single row.
# This step is therefore not required: the dataframe is already in the desired form.
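
If the source table did list a postal code on several rows, the rows could be merged with a group-by. This is a minimal sketch on made-up data, assuming one neighborhood per row; the frame `rows` and the result name `combined` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up input in which M5A appears on two rows.
rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group rows that share a postal code and borough, then join their neighborhoods.
combined = (
    rows.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the groups in their order of first appearance, which matches the row order of the source table.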

### 6. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough:

In [195]:
# Rows without an assigned borough have already been removed. What remains is to copy the
# Borough value into any row whose Neighborhood is missing, i.e. an empty string or NaN.

# Let's first check whether there are any unassigned neighborhoods:

tot = 0

for i, b in df_assigned['Neighborhood'].items():
    # pd.isna catches real NaN/None values, which a comparison with the string "NaN" does not
    if pd.isna(b) or b == "":
        tot = tot + 1
print("Total unassigned neighborhoods: ", tot)

Total unassigned neighborhoods:  0


##### Since there are no unassigned neighborhoods, we do not have to replace any neighborhood value with its equivalent Borough value.
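
Had the check found missing values, the replacement could be done with a boolean mask. The sketch below uses a made-up frame named `sample` and treats an empty string, the text `Not assigned`, and real missing values as unassigned; that definition is an assumption chosen for illustration.

```python
import pandas as pd

# Made-up sample with one empty and one missing neighborhood.
sample = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York", "Scarborough"],
    "Neighborhood": ["Queen's Park", "", None],
})

# Mark rows whose neighborhood is missing, empty, or the literal 'Not assigned'.
missing = sample["Neighborhood"].isna() | sample["Neighborhood"].isin(["", "Not assigned"])

# Copy the borough into the neighborhood column for those rows only.
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```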

### 7. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe:

In [197]:
df_final = df_assigned # We now know that the dataframe is final. We give it the name df_final

df_final.shape

(103, 3)
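
`shape` is an attribute that holds a `(rows, columns)` tuple, so the row count alone is its first element. A short sketch on a made-up two-row frame (the name `frame` is illustrative):

```python
import pandas as pd

# Made-up frame with the same three columns as the final dataframe.
frame = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Unpack the (rows, columns) tuple returned by the shape attribute.
n_rows, n_cols = frame.shape
print(f"{n_rows} rows, {n_cols} columns")
```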