# IBM Data Science Capstone Week 3 Assignment
---
This notebook contains the assignment for week 3 of the IBM Data Science Capstone Project on Coursera. It was created by Tim de Zwart.

### Part 1

First off, the necessary packages are imported.

In [1]:
import pandas as pd

The list of postal codes of Canada is scraped from the provided Wikipedia page and loaded into a pandas dataframe. Check **[here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)** for the list of postal codes. The `pd.read_html` function returns a list of the dataframes found on the page, so we first check which one is the right one.

In [2]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

for df in dfs:
    print(df.head())

  Postal code           Borough                Neighborhood
0         M1A      Not assigned                         NaN
1         M2A      Not assigned                         NaN
2         M3A        North York                   Parkwoods
3         M4A        North York            Victoria Village
4         M5A  Downtown Toronto  Regent Park / Harbourfront
                                                  0   \
0                                                NaN   
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
2                                                 NL   
3                                                  A   

                                                  1   \
0                              Canadian postal codes   
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
2                                                 NS   
3                                                  B   

                                                  2    3    4    5    6    7  

---
It turns out the first table is the one with the boroughs and the postal codes, which is the one that we need. This is therefore assigned to the dataframe `canada_pc`.

In [3]:
canada_pc = dfs[0]
canada_pc.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
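Selecting `dfs[0]` relies on the order of the tables on the page, which may change when the page is edited. As a minimal sketch of an alternative, the table could be picked by its column names instead. The helper `pick_table` and the miniature `example_tables` list are hypothetical and stand in for the real list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        # Column labels may be non-strings (e.g. integers), so compare as strings
        if required.issubset({str(col) for col in table.columns}):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")

# Hypothetical stand-in for the list returned by pd.read_html
example_tables = [
    pd.DataFrame({0: ["NL", "A"], 1: ["NS", "B"]}),
    pd.DataFrame({
        "Postal code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

selected = pick_table(example_tables, ["Postal code", "Borough", "Neighborhood"])
print(selected)
```

In the notebook itself this would amount to calling `pick_table(dfs, ["Postal code", "Borough", "Neighborhood"])` in place of `dfs[0]`.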


The table is now cleaned according to the instructions in the assignment. The following tasks are performed:
* Only rows with an assigned borough are kept. Rows where the borough has a value of **Not assigned** are dropped.
* Postal codes with multiple neighborhoods currently list them with " / " in between (see for example Regent Park / Harbourfront in the output above). These neighborhoods will now be separated by commas.

In [4]:
# Drop the postal codes that are not assigned to a borough
canada_pc.drop(canada_pc[canada_pc['Borough']=="Not assigned"].index, inplace=True)

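# Replace the " / " separator with a comma and a space.
# Assigning the result back avoids a chained in-place call on a column,
# which newer pandas versions warn about or ignore.
canada_pc['Neighborhood'] = canada_pc['Neighborhood'].str.replace(" / ", ", ", regex=False)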

# Show cleaned dataframe
canada_pc.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
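The two cleaning steps can also be sketched as a small function that returns a cleaned copy and leaves its input untouched, which makes the step easy to rerun. The function name `clean_postal_codes` is hypothetical, and the three sample rows are copied from the output shown earlier.

```python
import pandas as pd

def clean_postal_codes(raw):
    """Return a cleaned copy: drop unassigned boroughs, use commas between neighborhoods."""
    # Keep only rows whose borough is assigned; .copy() avoids modifying `raw`
    cleaned = raw[raw["Borough"] != "Not assigned"].copy()
    # Replace the literal " / " separator with ", "
    cleaned["Neighborhood"] = cleaned["Neighborhood"].str.replace(" / ", ", ", regex=False)
    return cleaned

sample = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront"],
})

cleaned_sample = clean_postal_codes(sample)
print(cleaned_sample)
```

Note that the row labels of the returned dataframe still refer to the original positions, just as in the notebook output above.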


Finally, the assignment states that there might be postal codes that are assigned to a borough, but are not assigned to a neighborhood. In order to check this, we see if there are any rows in the cleaned dataframe where the `Neighborhood` column is **Not assigned**.

In [5]:
canada_pc[canada_pc['Neighborhood']=="Not assigned"]

Unnamed: 0,Postal code,Borough,Neighborhood
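In the first output of this notebook, unassigned neighborhoods appear as `NaN` rather than as the text **Not assigned**, and an equality comparison does not match `NaN`. A slightly broader check, sketched below, covers both cases. The helper `missing_neighborhood_mask` and the postal codes `M0X` and `M0Y` are invented for illustration. The fallback of copying the borough name into an empty neighborhood is an assumption about the desired handling, not something stated in this notebook.

```python
import pandas as pd

def missing_neighborhood_mask(df):
    """Boolean mask of rows whose neighborhood is NaN or the text 'Not assigned'."""
    return df["Neighborhood"].isna() | (df["Neighborhood"] == "Not assigned")

sample = pd.DataFrame({
    "Postal code": ["M3A", "M0X", "M0Y"],  # M0X and M0Y are invented for illustration
    "Borough": ["North York", "Example Borough", "Example Borough"],
    "Neighborhood": ["Parkwoods", None, "Not assigned"],
})

mask = missing_neighborhood_mask(sample)
print(sample[mask])

# Assumption: fall back to the borough name when no neighborhood is given
filled = sample.copy()
filled.loc[mask, "Neighborhood"] = filled.loc[mask, "Borough"]
print(filled)
```

Applied to `canada_pc`, an empty result from `canada_pc[missing_neighborhood_mask(canada_pc)]` would confirm that every remaining postal code has a neighborhood.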


As can be seen above, the returned dataframe is empty, which means there are no postal codes assigned to boroughs but not to neighborhoods. The only thing that remains now is to reset the index to start at 0 again, and to check the shape of the cleaned dataframe.

In [6]:
canada_pc.reset_index(drop=True, inplace=True)
canada_pc.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
print("The shape of the cleaned dataframe is", canada_pc.shape)

The shape of the cleaned dataframe is (103, 3)
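Beyond the shape, a few quick sanity checks can confirm that the cleaning behaved as intended. The sketch below uses a hypothetical helper, `summarize_cleaned`, on three rows copied from the output above; it is an illustrative addition rather than part of the assignment.

```python
import pandas as pd

def summarize_cleaned(df):
    """Collect a few sanity-check figures for a cleaned postal code dataframe."""
    return {
        "rows": len(df),
        "unique_postal_codes": df["Postal code"].nunique(),
        "unassigned_boroughs": int((df["Borough"] == "Not assigned").sum()),
        # True when the index runs 0, 1, 2, ... as after reset_index(drop=True)
        "index_is_default": df.index.equals(pd.RangeIndex(len(df))),
    }

sample = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

summary = summarize_cleaned(sample)
print(summary)
```

If `rows` and `unique_postal_codes` differ for `canada_pc`, some postal code appears on more than one row and may need to be merged.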
