# Notebook for Scraping Postal Codes From Wikipedia

### Library Imports:

In [3]:
import pandas as pd

### Link to the wiki page:

In [4]:
link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

### Scraping Data:

In [5]:
# load the first table into a dataframe
postal_df = pd.read_html(link)[0]
postal_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
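
Picking the table by position (`[0]`) depends on the page layout staying the same. As a hedged alternative, the sketch below selects the table by its text using the `match` argument of `pd.read_html`. The helper name `load_postal_table` and the inline `SAMPLE_HTML` page are made up for illustration, and, like the cell above, it assumes an HTML parser such as `lxml` is installed.

```python
import io
import pandas as pd

# Hypothetical page with two tables; only the second holds postal codes.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Legend</th></tr></thead>
  <tbody><tr><td>Some unrelated note</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr></thead>
  <tbody>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
"""


def load_postal_table(source, match="Postal code"):
    """Return the first table whose text matches `match` (hypothetical helper)."""
    # read_html returns a list of DataFrames, one per matching table.
    tables = pd.read_html(source, match=match)
    return tables[0]
```

Calling `load_postal_table(io.StringIO(SAMPLE_HTML))` exercises the helper offline; `load_postal_table(link)` would be the equivalent call against the wiki page. If no table matches, `pd.read_html` raises a `ValueError`.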


### Drop the not-assigned values.

In [6]:
# keep only the rows with an assigned borough; dropna() would likely also work here,
# since the rows with an unassigned borough show NaN in 'Neighborhood'
postal_df = postal_df[postal_df['Borough'] != 'Not assigned']
postal_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
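
To make the two ways of dropping unassigned rows concrete, here is a minimal sketch on a made-up sample that mimics the first rows of the scraped table. The `sample` frame and the variable names are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the first rows of the scraped table.
sample = pd.DataFrame({
    "Postal code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

# Approach used above: compare against the literal placeholder string.
by_string = sample[sample["Borough"] != "Not assigned"]

# Alternative: turn the placeholder into a real missing value, then drop it.
not_assigned = sample["Borough"] == "Not assigned"
cleaned = sample.assign(Borough=sample["Borough"].mask(not_assigned))
by_dropna = cleaned.dropna(subset=["Borough"])

print(by_string["Postal code"].tolist())
print(by_dropna["Postal code"].tolist())
```

The string comparison is the more direct of the two; the `mask` plus `dropna` variant can be handy when later steps expect real missing values rather than a placeholder.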


### Pre-processing:

In [7]:
# reset the index and discard the old one
postal_df = postal_df.reset_index(drop=True)
postal_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


The wiki page uses ' / ' to separate multiple neighborhoods that share a postal code.  
Replace each ' /' with ',' to meet the assignment requirements.

In [8]:
# replace ' /' with ',' in the 'Neighborhood' column (literal match, not a regex)
postal_df['Neighborhood'] = postal_df['Neighborhood'].str.replace(' /', ',', regex=False)
postal_df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
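
As a small, self-contained check of this replacement, the sketch below runs it on made-up values, including one missing entry. The `neighborhoods` series is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical values; the last one has no neighborhood at all.
neighborhoods = pd.Series(
    ["Parkwoods", "Regent Park / Harbourfront", None],
    name="Neighborhood",
)

# Vectorized literal replacement; missing entries are left as missing.
cleaned = neighborhoods.str.replace(" /", ",", regex=False)
print(cleaned.tolist())
```

The `.str` accessor skips missing values, whereas calling Python's `str.replace` on each element through `.apply` would raise an `AttributeError` on the missing entry.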


### Assigning the borough name to the Not Assigned Neighborhoods:

In [9]:
# Let's check whether any neighborhood is missing (NaN):
postal_df.isna().sum()

Postal code     0
Borough         0
Neighborhood    0
dtype: int64

***No neighborhood is missing (every count above is 0), so there is no need to copy over the borough names.***
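
Had any neighborhood been missing, the borough name could be copied over roughly as follows. This is a minimal sketch on made-up rows; the `sample` frame is hypothetical, and it treats both real missing values and the literal string 'Not assigned' as unassigned.

```python
import pandas as pd

# Hypothetical rows: one regular, one missing, one with the placeholder string.
sample = pd.DataFrame({
    "Postal code": ["X1A", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Neighborhood": ["Some Place", None, "Not assigned"],
})

# Rows whose neighborhood is missing or carries the placeholder.
unassigned = sample["Neighborhood"].isna() | (sample["Neighborhood"] == "Not assigned")

# Where the condition holds, take the value from 'Borough' instead.
sample["Neighborhood"] = sample["Neighborhood"].mask(unassigned, sample["Borough"])
print(sample)
```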

### Data Shape:

In [10]:
postal_df.shape

(103, 3)

So the dataframe contains 103 rows and 3 columns.

### Store the dataframe into a csv file:

In [12]:
# write the dataframe without the index column
postal_df.to_csv('postal_data.csv', index=False)
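
One way to confirm that the file was written as intended is to read it back and compare. The sketch below does this on a small made-up frame in a temporary directory, so it does not touch `postal_data.csv`; the `frame` contents are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a comma inside a value to exercise CSV quoting.
frame = pd.DataFrame({
    "Postal code": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postal_data.csv")
    # Write without the index, then read the file back.
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored["Neighborhood"].tolist())
```

Values that contain commas are quoted on write, so they should come back as single fields rather than being split into extra columns.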