# Coursera Capstone Project - Segmenting and Clustering part 3

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Scraping the Wikipedia page

First we download the given Wikipedia page and convert it to a `BeautifulSoup` object. 

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)
r.raise_for_status()  # fail early if the page could not be downloaded

In [3]:
soup = BeautifulSoup(r.text, "html.parser")

Let's find the first occurrence of the `<table>` tag, which happens to be the table we're interested in. 

In [4]:
table_html = soup.find("table")

Here we grab all the rows in the table, except for the header, which we can safely discard. 

In [5]:
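# Collect the <td> cells of every row, then drop the first (header) row.
table_data = [tr.find_all("td") for tr in table_html.find_all("tr")]
table_data = table_data[1:]
# Keep only the stripped text of each cell.
table_data = [[cell.get_text(strip=True) for cell in row] for row in table_data]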

In [6]:
table_data[:5]

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront']]
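
The row extraction above can be tried on a small inline table. This is only a sketch: the HTML fragment is made up for illustration and is not the actual Wikipedia markup. It assumes the header cells are `<th>` elements, in which case `find_all("td")` returns nothing for that row, which is one reason it can be dropped.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like a simple wiki table (not the real page).
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of <td> cells per <tr>; the header row has only <th> cells.
rows = [tr.find_all("td") for tr in soup.find("table").find_all("tr")]
header_cells = rows[0]

# Skip the header row and keep the stripped text of each remaining cell.
data = [[cell.get_text(strip=True) for cell in row] for row in rows[1:]]

print(len(header_cells))
print(data)
```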

Now we can convert this nested list into a proper pandas dataframe. 

In [7]:
toronto_fsa = pd.DataFrame.from_records(table_data, columns=["PostalCode", "Borough", "Neighborhood"])

In [8]:
toronto_fsa.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [9]:
toronto_fsa.shape

(288, 3)

## Cleaning the table

Let's remove all rows with a `Not assigned` borough. 

In [10]:
toronto_fsa = toronto_fsa[toronto_fsa["Borough"] != "Not assigned"]
# Reassign instead of modifying in place, so later assignments act on an independent frame.
toronto_fsa = toronto_fsa.reset_index(drop=True)
toronto_fsa.shape

(211, 3)

In [11]:
toronto_fsa.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


We should also fill in any `Not assigned` neighborhood with the name of its borough:

In [12]:
toronto_fsa[toronto_fsa["Neighborhood"] == "Not assigned"]

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 6 | M7A | Queen's Park | Not assigned |


Since there is just one such row, we simply fill it in by hand rather than writing something more programmatic, because we are lazy developers.

In [13]:
toronto_fsa.at[6, "Neighborhood"] = toronto_fsa.at[6, "Borough"]
toronto_fsa.loc[6]

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 6, dtype: object
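
For a table with more than one unassigned neighborhood, the same fix could be applied with a boolean mask instead of a hard-coded row label. The following is a minimal sketch on a made-up two-row frame; the name `mask` and the sample rows are illustrative only.

```python
import pandas as pd

# Small stand-in frame; the real notebook would use toronto_fsa instead.
df = pd.DataFrame({
    "PostalCode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Lawrence Heights", "Not assigned"],
})

# Select every row whose neighborhood is unassigned...
mask = df["Neighborhood"] == "Not assigned"

# ...and copy the borough name into the neighborhood column for those rows only.
df.loc[mask, "Neighborhood"] = df.loc[mask, "Borough"]

print(df)
```

This does the same thing as the manual assignment, but it does not depend on knowing the row index in advance.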

Now we can group together neighborhoods that have the same postal code. 

In [14]:
toronto_fsa["new_neighborhood"] = toronto_fsa.groupby("PostalCode")["Neighborhood"].transform(lambda x: ", ".join(x))

In [15]:
toronto_fsa.head(10)

| | PostalCode | Borough | Neighborhood | new_neighborhood |
|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | Parkwoods |
| 1 | M4A | North York | Victoria Village | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront | Harbourfront, Regent Park |
| 3 | M5A | Downtown Toronto | Regent Park | Harbourfront, Regent Park |
| 4 | M6A | North York | Lawrence Heights | Lawrence Heights, Lawrence Manor |
| 5 | M6A | North York | Lawrence Manor | Lawrence Heights, Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue | Islington Avenue |
| 8 | M1B | Scarborough | Rouge | Rouge, Malvern |
| 9 | M1B | Scarborough | Malvern | Rouge, Malvern |


Now we just have to remove the original `Neighborhood` column as well as all redundant rows. 

In [16]:
toronto_fsa = toronto_fsa.drop(["Neighborhood"], axis=1)
toronto_fsa = toronto_fsa.rename({"new_neighborhood": "Neighborhood"}, axis=1)
toronto_fsa = toronto_fsa.drop_duplicates(["PostalCode", "Borough"])
toronto_fsa.reset_index(drop=True, inplace=True)

In [17]:
toronto_fsa.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
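
As an alternative to the transform, drop, rename, and de-duplicate sequence, the same result can probably be reached in a single aggregation. This is a sketch on a made-up three-row frame, not the notebook's actual method; `merged` is an illustrative name, and `sort=False` is used so the groups keep their order of first appearance.

```python
import pandas as pd

# Small stand-in frame with one postal code that spans two neighborhoods.
df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group by postal code and borough, then join each group's neighborhoods
# into one comma-separated string. reset_index turns the group keys back
# into ordinary columns.
merged = (
    df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(merged)
```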


Our dataframe is now ready to be processed, and we shall save it for future use. 

In [18]:
toronto_fsa.to_csv("data/toronto_fsa.csv", index=False)
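
Note that `to_csv` does not create missing folders, so the call above assumes a `data/` directory already exists. A minimal sketch of creating it first with `pathlib`; a temporary directory stands in for the project root here, and `root` and `out_path` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame; the real notebook would save toronto_fsa.
df = pd.DataFrame({
    "PostalCode": ["M3A"],
    "Borough": ["North York"],
    "Neighborhood": ["Parkwoods"],
})

# A temporary directory plays the role of the project root in this sketch.
root = Path(tempfile.mkdtemp())
out_path = root / "data" / "toronto_fsa.csv"

# Create the parent folder if needed; exist_ok avoids an error on re-runs.
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded)
```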

In [19]:
toronto_fsa.shape

(103, 3)