### Segmenting and Clustering Neighborhoods in Toronto:
1. Scrape the postal code table from Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
2. Fill "Not assigned" neighbourhoods with the borough name.
3. Drop rows whose borough is "Not assigned".

In [7]:
from requests import get

wiki = get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

#### Steps taken to get content from the wiki
- Fetched the page with the code above.
- Parsed the response with BeautifulSoup and printed it to get an idea of where the table was.
- Installed lxml, which pandas `read_html` uses to parse HTML.
- Realized that `pd.read_html` alone is enough: the first DataFrame it returns is the postal code table.
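
As a minimal sketch of that last step, the snippet below runs `pd.read_html` on a small inline table instead of the live page, so it works offline. The three rows are a hypothetical excerpt in the same layout as the Wikipedia table, and lxml is assumed to be installed as noted above.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row excerpt in the same layout as the Wikipedia table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds;
# na_values turns the "Not assigned" placeholder into NaN while parsing.
tables = pd.read_html(StringIO(html), na_values=["Not assigned"])
sample_df = tables[0]
print(sample_df)
```

Converting the placeholder to NaN at parse time is what lets the later cleanup rely on `fillna` and `dropna` rather than string comparisons.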

In [31]:
from bs4 import BeautifulSoup as soup

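# Exploratory only: locate the table rows by hand.
# The cells that follow rely on pd.read_html instead of table_rows.
parsed_wiki = soup(wiki.text, "lxml")
table_rows = parsed_wiki.body.table.tbody.find_all("tr")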

In [29]:
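from io import StringIO

import pandas as pd

# Wrap the HTML in StringIO: passing a literal HTML string is deprecated in recent pandas.
# na_values converts the "Not assigned" placeholder to NaN while parsing.
postal_code_dfs = pd.read_html(StringIO(wiki.text), na_values=["Not assigned"])
postal_code_df = postal_code_dfs[0]  # the first table on the page is the postal code table
postal_code_df.head(10)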

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,
9,M8A,,


#### Cleaning up the data
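- Fill in missing Neighbourhood values with the Borough name.
- Remove rows that still contain NA values (postcodes with no borough).
- Group by Postcode and Borough, joining the neighbourhood names of each group with commas.
- Print out the shape of the resulting dataframe.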
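
The same steps can be sketched end to end on a toy frame. The four rows below are hypothetical stand-ins for the scraped data, with `None` playing the role of "Not assigned"; the names `toy_df` and `toy_grouped` are illustrative only.

```python
import pandas as pd

# Hypothetical rows mirroring the scraped table; None marks "Not assigned".
toy_df = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": [None, "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": [None, "Harbourfront", "Regent Park", None],
})

# 1. Use the borough name where the neighbourhood is missing.
toy_df["Neighbourhood"] = toy_df["Neighbourhood"].fillna(toy_df["Borough"])

# 2. Drop rows that still have a missing value (no borough at all).
toy_df = toy_df.dropna()

# 3. Collapse to one row per (Postcode, Borough), joining neighbourhood names.
toy_grouped = (
    toy_df.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(toy_grouped)
print(toy_grouped.shape)
```

The order matters: filling neighbourhoods before dropping NA rows keeps postcodes such as M7A, which have a borough but no neighbourhood.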

In [30]:
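# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated and may not update the parent frame in recent pandas.
postal_code_df["Neighbourhood"] = postal_code_df["Neighbourhood"].fillna(postal_code_df["Borough"])
postal_code_df.head(10)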

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
9,M8A,,


In [32]:
postal_code_df.dropna(inplace=True)
postal_code_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [44]:
grouped_df = postal_code_df.groupby(["Postcode", "Borough"]).aggregate(", ".join).reset_index()
grouped_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [45]:
grouped_df.shape

(103, 3)
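
Beyond the shape, a couple of cheap invariant checks can confirm the cleanup did what was intended. The helper below is a hypothetical sketch, not part of the original notebook: `check_cleaned` and the two-row `example` frame are assumptions for illustration, and the uniqueness check presumes each postcode belongs to a single borough.

```python
import pandas as pd


def check_cleaned(df: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned frame breaks an expected invariant."""
    # No "Not assigned" placeholders (parsed as NaN) should survive the cleanup.
    assert not df.isna().any().any(), "missing values remain"
    # Assumption: one row per postcode after grouping.
    assert df["Postcode"].is_unique, "a postcode appears more than once"


# Hypothetical two-row stand-in for grouped_df.
example = pd.DataFrame({
    "Postcode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Woburn"],
})
check_cleaned(example)  # passes silently
```

In the notebook this would be called as `check_cleaned(grouped_df)`.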