### AIM:
Use a Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.
1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re

### Getting the Postal Codes data from Wikipedia

**We can do this by using BeautifulSoup to scrape the required data from Wikipedia.**  
**First, though, we must inspect the source of the webpage to find out where and how the data is stored. The data sits in a ```<table>``` tag, and each row is a ```<tr>``` tag with 3 ```<td>``` tags nested inside it.**  
**The table also has a class assigned to it: ```class="wikitable"```.**  
**Now that we know where to look for the data, let's start.**
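
The structure described above can be tried out without touching the network. The following is only a sketch on a made-up HTML fragment (the `html` string and the `rows` list are illustrative, not part of the notebook); it reads the table row by row, so each `<tr>` yields its own list of three values.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the layout of the Wikipedia table.
html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = []
for tr in table.find_all('tr'):
    # get_text(strip=True) returns the cell text without surrounding whitespace.
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row only has <th> cells, so it gives an empty list
        rows.append(cells)

print(rows)
```

Reading one `<tr>` at a time keeps the three values of a row together, which is an alternative to collecting every `<td>` into one flat list.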

In [2]:
# Retrieving the Page HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

Find the table tag with class wikitable.

In [3]:
table = soup.find('table', class_ = 'wikitable')

Find all the ```<td>``` tags and store their values in a list.  
Then we convert that list to a NumPy array and replace the empty values (```''```) with ```'NA'``` for ease of understanding.  
We reshape the array to 180x3, that is 180 rows and 3 columns, which is the actual shape of the data.  
Once the array is reshaped, it is converted to a DataFrame.
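
The row count of 180 only holds for as long as the page lists exactly that many postal codes. As a hedged alternative, the sketch below uses a made-up six-item list (`cells` is a hypothetical name, not from the notebook) and lets NumPy infer the number of rows by passing `-1`.

```python
import numpy as np

# Hypothetical stand-in for the scraped values: 2 rows x 3 columns, flattened.
cells = ['M1A', 'Not assigned', '', 'M3A', 'North York', 'Parkwoods']

arr = np.array(cells)
# Replace empty strings with the placeholder 'NA', as the notebook does.
arr = np.where(arr == '', 'NA', arr)
# -1 asks NumPy to work out the row count; reshape raises ValueError
# if the number of values is not a multiple of 3.
rows = arr.reshape(-1, 3)
print(rows.shape)
```

With this form, a change in the number of table rows does not require editing the reshape call, and a malformed scrape fails loudly instead of silently misaligning columns.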

In [4]:
tableData = table.find_all('td')
temp = []
for value in tableData:
    string = str(value.string).strip('\n')
    temp.append(string)
temp

['M1A',
 'Not assigned',
 '',
 'M2A',
 'Not assigned',
 '',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Regent Park / Harbourfront',
 'M6A',
 'North York',
 'Lawrence Manor / Lawrence Heights',
 'M7A',
 'Downtown Toronto',
 "Queen's Park / Ontario Provincial Government",
 'M8A',
 'Not assigned',
 '',
 'M9A',
 'Etobicoke',
 'Islington Avenue',
 'M1B',
 'Scarborough',
 'Malvern / Rouge',
 'M2B',
 'Not assigned',
 '',
 'M3B',
 'North York',
 'Don Mills',
 'M4B',
 'East York',
 'Parkview Hill / Woodbine Gardens',
 'M5B',
 'Downtown Toronto',
 'Garden District, Ryerson',
 'M6B',
 'North York',
 'Glencairn',
 'M7B',
 'Not assigned',
 '',
 'M8B',
 'Not assigned',
 '',
 'M9B',
 'Etobicoke',
 'West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale',
 'M1C',
 'Scarborough',
 'Rouge Hill / Port Union / Highland Creek',
 'M2C',
 'Not assigned',
 '',
 'M3C',
 'North York',
 'Don Mills',
 'M4C',
 'East York',
 'W

In [5]:
temp = np.array(temp)
temp = np.where(temp=='','NA',temp)
data = np.reshape(temp,(180,3))
print("Shape of the Data:",data.shape)
dataFrame = pd.DataFrame(data=data, columns=['PostalCode','Borough','Neighborhood'])
dataFrame.head()

Shape of the Data: (180, 3)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


In [6]:
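# One cell had no single text child, so value.string was None and str() stored it as 'None'; fill it in by hand.
dataFrame['Neighborhood'] = dataFrame['Neighborhood'].replace(
    to_replace='None', value='CN Tower / King and Spadina / Railway Lands '
    '/ Harbourfront West / Bathurst / Quay / South Niagara / Island airport')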
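
The replacement above targets the text `'None'`. That value comes from BeautifulSoup: `.string` is `None` for a tag that has more than one child, and `str()` turns it into the literal text `'None'`. The sketch below shows this on a made-up fragment (the `html` string is illustrative only); `get_text()` is an alternative that the notebook does not use.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second cell mixes plain text with a nested <a> tag.
html = ('<table><tr><td>M5V</td>'
        '<td>CN Tower / <a href="#">King and Spadina</a>\n</td></tr></table>')
simple, nested = BeautifulSoup(html, 'html.parser').find_all('td')

print(simple.string)              # a cell with one text child has a .string
print(nested.string)              # several children, so .string is None
print(str(nested.string))         # str() turns that into the text 'None'
print(nested.get_text().strip())  # get_text() joins all nested text instead
```

Using `get_text()` when collecting the cells would probably remove the need for a hand-written replacement, though that has not been checked against the live page here.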

**As per the guidelines, we will drop the rows with ```Borough='Not assigned'```.**

In [7]:
dataFrame.drop(dataFrame.Borough.loc[dataFrame.Borough=='Not assigned'].index,inplace = True,
               axis=0)
dataFrame.reset_index(drop=True, inplace=True)

**Notice how all the rows with Neighborhood='NA' were also removed. The reason is that only the rows with Borough='Not assigned' had Neighborhood='NA'.**

We also have a '/' instead of a ',' to separate the neighborhoods; let's fix that.

In [8]:
# Vectorized string replacement avoids chained assignment through .iloc,
# which triggers SettingWithCopyWarning and may not modify the DataFrame.
dataFrame['Neighborhood'] = (dataFrame['Neighborhood']
                             .str.replace(' /', ',', regex=False)
                             .str.replace('"', '', regex=False))
dataFrame.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
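
Requirements 3 and 4 from the aim never come into play in the steps above, because the scraped table appears to list each postal code once, with its neighborhoods already joined. For a source laid out with one neighborhood per row, a minimal sketch on made-up rows (`raw`, `df`, and `combined` are hypothetical names, not part of the notebook) could look like this:

```python
import pandas as pd

# Made-up rows in the one-neighborhood-per-row layout described in the aim.
raw = pd.DataFrame({
    'PostalCode':   ['M5A', 'M5A', 'M7A', 'M1A'],
    'Borough':      ['Downtown Toronto', 'Downtown Toronto', "Queen's Park", 'Not assigned'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Not assigned', 'Not assigned'],
})

# Requirement 2: keep only rows with an assigned borough.
df = raw[raw['Borough'] != 'Not assigned'].copy()

# Requirement 4: an unassigned neighborhood takes the borough's name.
missing = df['Neighborhood'] == 'Not assigned'
df.loc[missing, 'Neighborhood'] = df.loc[missing, 'Borough']

# Requirement 3: merge rows sharing a postal code into one comma-separated row.
combined = (df.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .agg(', '.join)
              .reset_index())
print(combined)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result; this assumes a postal code never spans two boroughs.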


In [9]:
dataFrame.shape

(103, 3)

**Our data is now ready to be exported to a CSV file.**

In [10]:
dataFrame.to_csv('postalCodes_scraped.csv')
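
By default `to_csv` also writes the row index as an extra column with an empty header, which pandas labels `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. The sketch below uses an in-memory buffer and a made-up one-row frame (`df` is a stand-in, not the notebook's `dataFrame`) to compare the two.

```python
import io
import pandas as pd

# Small made-up frame standing in for the cleaned data.
df = pd.DataFrame({'PostalCode': ['M3A'],
                   'Borough': ['North York'],
                   'Neighborhood': ['Parkwoods']})

# Default: the index is written as a first column with an empty header.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the three data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

reloaded_default = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_clean = pd.read_csv(io.StringIO(without_index.getvalue()))
print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```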