### Week 3 Capstone

**(1) Imports**

In [32]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

**(2) Scrape Wikipedia Link**  
Assumptions made:
- The table has 3 columns.
- The first 3 'th' elements on the page are the column names of the table.
    - This is needed only because I use the 'find_all' function, which returns every 'th' on the page. An approach that reads the headers from the table element itself would not need this assumption.
- The first 'td' after the last row of the table is blank.
    - This is needed only because I use the 'find_all' function of the 'soup', which returns every 'td' on the page. Another way of finding the end of the table, for example looking for the closing `</tbody></table>` tags, would not need this assumption.
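
One way to drop both assumptions is to select the table element first and walk its rows, so cells from the rest of the page never enter the result. The sketch below runs on an inline HTML snippet. The snippet, the `wikitable` class lookup, and the helper name `parse_table` are assumptions for illustration, not the verified markup of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page HTML: one data table plus an unrelated table.
SAMPLE_HTML = """
<table class="wikitable">
<tbody>
<tr><th>Postcode
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</tbody>
</table>
<table><tr><td></td><td>unrelated navigation cell</td></tr></table>
"""


def parse_table(html):
    """Return (header, rows) for the first table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    header, rows = [], []
    for tr in table.find_all("tr"):
        th_cells = [th.get_text(strip=True) for th in tr.find_all("th")]
        td_cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if th_cells:
            header = th_cells      # header row
        elif td_cells:
            rows.append(td_cells)  # data row
    return header, rows


header, rows = parse_table(SAMPLE_HTML)
print(header)
print(rows)
```

Because the search is scoped to one table, the number of columns and the end of the table come from the markup rather than from fixed offsets.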

In [112]:
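page_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_response = requests.get(page_link, timeout=5)
page_response.raise_for_status() # Stop here if the request failed (4xx/5xx) instead of parsing an error page
page_content = BeautifulSoup(page_response.content, "html.parser")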

In [58]:
# Column names are the text of the first 3 <th> elements on the page
col_names = [th.text.replace('\n', '') for th in page_content.find_all("th")[:3]]
df = pd.DataFrame(columns=col_names)
# find_all returns every <td> on the page as one flat list; every 3 consecutive cells form one row
tbl_content = [c.text for c in page_content.find_all("td")]
for i in range(len(tbl_content)//3):
    # Hack: the first <td> after the last row of the table is blank, which marks the end of the table
    if tbl_content[i*3] == '':
        break
    df.loc[len(df)] = [tbl_content[i*3], tbl_content[i*3+1], tbl_content[i*3+2].replace('\n', '')]
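
Growing the dataframe one row at a time gets slow on larger tables. A possible alternative, assuming the same flat list of cell strings, is to slice the list into rows and build the dataframe once. The sample `cells` list and the helper name `cells_to_frame` are hypothetical, and unlike the loop above this sketch strips newlines from every cell, not only the last column.

```python
import pandas as pd


def cells_to_frame(cells, col_names):
    """Group a flat list of cell strings into rows of len(col_names) cells.

    Stops at the first row whose leading cell is blank, the same
    end-of-table marker used in the loop above.
    """
    n = len(col_names)
    rows = []
    for start in range(0, len(cells) - n + 1, n):
        row = [c.replace('\n', '') for c in cells[start:start + n]]
        if row[0] == '':
            break
        rows.append(row)
    return pd.DataFrame(rows, columns=col_names)


# Made-up cells: two table rows followed by a blank-led trailing group.
cells = ['M1A', 'Not assigned', 'Not assigned\n',
         'M3A', 'North York', 'Parkwoods\n',
         '', 'trailing cell', 'trailing cell']
df_demo = cells_to_frame(cells, ['Postcode', 'Borough', 'Neighbourhood'])
print(df_demo)
```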

**(3) Process raw dataframe**

In [60]:
# Filter out entries with Borough = Not assigned
df = df[df['Borough']!='Not assigned']

In [66]:
# If a row has a borough but a 'Not assigned' neighbourhood, use the borough as the neighbourhood
df_sub_has_neighborhood = df[df['Neighbourhood']!='Not assigned']
# .copy() makes the subset an independent dataframe, so the assignment below does not trigger SettingWithCopyWarning
df_sub_no_neighborhood = df[df['Neighbourhood']=='Not assigned'].copy()
df_sub_no_neighborhood['Neighbourhood'] = df_sub_no_neighborhood['Borough']
df = pd.concat([df_sub_has_neighborhood, df_sub_no_neighborhood])
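
The same fallback can also be written without splitting the dataframe: a boolean mask with `.loc` assigns the borough in place and keeps the original row order. A minimal sketch on made-up rows (`df_demo` and its values are illustrative, not scraped data):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Postcode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Not assigned'],
})

# Work on an explicit copy so the assignment targets this frame, not a slice of another one.
df_demo = df_demo.copy()

# Rows whose neighbourhood is missing get the borough name instead.
missing = df_demo['Neighbourhood'] == 'Not assigned'
df_demo.loc[missing, 'Neighbourhood'] = df_demo.loc[missing, 'Borough']
print(df_demo)
```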


In [92]:
# Combine rows that share a Postcode: keep one Borough and join the Neighbourhood values with commas
df_clean = (
    df.groupby('Postcode')
    .agg({'Borough': 'min', 'Neighbourhood': ','.join})
    .reset_index()
)
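
To see what this aggregation does, here is a small sketch on made-up rows: the two 'M5A' rows collapse into one, with their neighbourhoods joined by a comma. `df_demo` and its values are illustrative only.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One output row per Postcode: smallest Borough string, comma-joined Neighbourhoods.
grouped = (
    df_demo.groupby('Postcode')
    .agg({'Borough': 'min', 'Neighbourhood': ','.join})
    .reset_index()
)
print(grouped)
```

Taking the minimum of 'Borough' is only a way to pick one value per group. It assumes every row of a postcode carries the same borough.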

In [95]:
# Rename the columns to match the names given in the assignment instructions
df_clean.rename(columns={'Postcode':'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)

**(4) Output**

In [111]:
df_clean.shape

(103, 3)