## Using BeautifulSoup to webscrape the Wikipedia article "List of postal codes of Canada"

First, import necessary libraries.

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen

Assign the article's address to the variable `url`.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

Open the URL with `urlopen()` from `urllib.request` and assign the response to the variable `html`.

In [3]:
html = urlopen(url)
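
Some servers respond differently to the default Python user agent, and a bare `urlopen()` call has no timeout. The following is a minimal sketch of a more defensive download, not part of the original notebook: the helper names `build_request` and `fetch_html`, the user-agent string, and the 10-second timeout are all illustrative assumptions.

```python
from urllib.request import Request, urlopen

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


def build_request(url, user_agent='postal-code-tutorial/0.1'):
    """Return a Request with an explicit User-Agent header (hypothetical value)."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes, closing the connection afterwards."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# html = fetch_html(URL)  # uncomment to perform the actual download
```

The bytes returned by `fetch_html` can be passed to `BeautifulSoup` in the same way as the response object.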

Parse the HTML using BeautifulSoup and assign the result to the variable `soup`.

In [4]:
soup = BeautifulSoup(html, 'html.parser')

In the article, right-click the table and choose "Inspect" to look at the HTML. You will see that the table is a `<table>` element whose class is `wikitable sortable`.

You can use the `soup.find_all()` method to get all tables with the class `wikitable sortable` from the article. Assign the result to the variable `tables`.

In [5]:
tables = soup.find_all('table',{'class':'wikitable sortable'})
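
Passing the string `'wikitable sortable'` matches the class attribute as written, so a table whose classes appear in a different order would be missed. As a small sketch on made-up HTML (the `sample_html` snippet is an assumption for illustration, not the real article markup), a CSS selector via `select()` is one way to require both classes regardless of order:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables carry both classes, in different orders.
sample_html = """
<table class="wikitable sortable"><tr><td>M3A</td></tr></table>
<table class="sortable wikitable"><tr><td>M4A</td></tr></table>
<table class="navbox"><tr><td>other</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match on the class attribute string, as in the cell above.
exact = sample_soup.find_all('table', {'class': 'wikitable sortable'})

# CSS selector: the table must have both classes, in any order.
by_selector = sample_soup.select('table.wikitable.sortable')

print(len(exact), len(by_selector))
```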

After doing this, create three empty lists, one for each column.

Each row is listed under `<tr>`, so use a for loop using `find_all('tr')` to get all the rows from the table.

Similarly, each cell is listed under `<td>`, so use a for loop again with `find_all('td')` to get all the cells in each row.

Use `if len(cells) > 1` to skip the first row of the table, which contains the column headers: its cells are `<th>` elements, so `find_all('td')` returns an empty list for that row.

Finally, append the text content of each cell to the list for its column, using the cell's index position within the row. You will also need to strip the text to remove the trailing `\n` from each cell.

In [6]:
PostalCode = []
Borough = []
Neighborhood = []

for table in tables:
    rows = table.find_all('tr')
    
    for row in rows:
        cells = row.find_all('td')
        
        if len(cells) > 1:
        
            postalcode = cells[0]
            PostalCode.append(postalcode.text.strip())
            
            borough = cells[1]
            Borough.append(borough.text.strip())
            
            neighborhood = cells[2]
            Neighborhood.append(neighborhood.text.strip())
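
The same row-by-row logic can be tried on a tiny hand-written table before running it against the live page. This is a sketch under assumptions: the sample markup and the helper `extract_rows` are hypothetical, and the helper checks for at least three cells so that indexing the third column cannot fail on a short row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the article's table.
sample_soup = BeautifulSoup("""
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
""", 'html.parser')


def extract_rows(table, n_columns=3):
    """Return one tuple of stripped cell texts per data row of `table`."""
    records = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        # The header row holds <th> cells, so `cells` is empty there and it is skipped.
        if len(cells) >= n_columns:
            records.append(tuple(cell.text.strip() for cell in cells[:n_columns]))
    return records


records = extract_rows(sample_soup.find('table'))
print(records)
```

Collecting whole rows as tuples is an alternative to keeping three parallel lists; either form can be handed to `pd.DataFrame`.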

Now you can create the dataframe using `pd.DataFrame`. Zip the three lists, now filled with the cell data from the table, so that each tuple becomes one row, and name the columns accordingly.

In [7]:
df = pd.DataFrame(
    list(zip(PostalCode, Borough, Neighborhood)),
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)
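
To see what the zip-based construction does, here is a small sketch with two made-up rows (the short sample lists are assumptions for illustration). It also builds the same frame from a dictionary of columns, which is another common way to call `pd.DataFrame`.

```python
import pandas as pd

# Tiny stand-ins for the three scraped lists.
postal_codes = ['M3A', 'M4A']
boroughs = ['North York', 'North York']
neighborhoods = ['Parkwoods', 'Victoria Village']

columns = ['PostalCode', 'Borough', 'Neighborhood']

# Row-wise: each zipped tuple becomes one row.
from_zip = pd.DataFrame(list(zip(postal_codes, boroughs, neighborhoods)), columns=columns)

# Column-wise: each dictionary key becomes one column.
from_dict = pd.DataFrame({
    'PostalCode': postal_codes,
    'Borough': boroughs,
    'Neighborhood': neighborhoods,
})

print(from_zip.equals(from_dict))
```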

In [8]:
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


We only want the rows with assigned boroughs, so drop the rows where the Borough column has the value 'Not assigned'. Reset the index afterwards so the remaining rows are numbered consecutively from 0.

In [9]:
df1 = df[~df['Borough'].str.contains('Not assigned')].reset_index(drop=True)
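
The cell above filters with `str.contains`, which matches substrings. An exact comparison is a possible alternative when the placeholder is always the literal string 'Not assigned'. A minimal sketch on a made-up three-row frame (the `sample` data is an assumption for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows whose Borough is anything other than the literal 'Not assigned',
# then renumber the surviving rows from 0.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)

print(assigned)
```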

Check that the first 10 rows of the dataframe look correct using the `head()` method.

In [10]:
df1.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


Finally, check the shape of the dataframe using the `shape` attribute, which returns a tuple of (rows, columns).

In [11]:
df1.shape

(103, 3)