## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

#### Scrape the Wikipedia page that lists the postal codes, boroughs, and neighborhoods of Toronto

In [85]:
# import necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [86]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# download the raw HTML of the page
url_req = requests.get(url).text
# parse the html document with BeautifulSoup (kept for inspection; the dataframe is built with pandas)
toronto_html = BeautifulSoup(url_req, "lxml")

In [88]:
# generate the toronto dataframe from the html; read_html returns a list of tables, and the first one is used here
toronto_df = pd.read_html(url, header=0)[0]
toronto_df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


The dataframe will consist of three columns: PostCode, Borough, and Neighborhood:

In [89]:
# rename columns as needed
toronto_df = toronto_df.rename(columns={"Postcode":"PostCode", "Neighbourhood":"Neighborhood"})
toronto_df

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


Keep only the rows that have an assigned borough, and drop the rows whose borough is Not assigned:

In [90]:
# drop rows whose Borough is "Not assigned", then reset the index without keeping the old one
toronto_df = toronto_df[toronto_df["Borough"] != "Not assigned"]
toronto_df = toronto_df.reset_index(drop=True)
toronto_df

Unnamed: 0,PostCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West
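
The filtering step can be tried in isolation on a few hand-written rows. This is a minimal sketch, not the notebook's own cell: `sample_df` and `assigned_df` are illustrative names, and the rows are a small made-up subset shaped like the scraped table.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative subset only).
sample_df = pd.DataFrame({
    "PostCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Boolean mask: True for rows whose borough is assigned.
has_borough = sample_df["Borough"] != "Not assigned"

# Keep the matching rows and renumber them from 0, discarding the old index.
assigned_df = sample_df[has_borough].reset_index(drop=True)
print(assigned_df)
```

Passing `drop=True` to `reset_index` avoids creating an extra `index` column that would otherwise have to be deleted afterwards.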


More than one neighborhood can exist in one postal code area. Let's combine them into a single row per postal code, with the neighborhood names separated by commas:

In [91]:
# join the neighborhoods that share a postal code and borough into one comma-separated string
toronto_df = toronto_df.groupby(["PostCode", "Borough"])["Neighborhood"].agg(", ".join).reset_index()
toronto_df

Unnamed: 0,PostCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
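
To see what the grouping does, here is a small sketch on toy data. The names `sample_df` and `grouped_df` are assumptions made for illustration; the rows mimic the M6A case above, where two neighborhoods share one postal code.

```python
import pandas as pd

# Two neighborhoods share postal code M6A; M3A has only one.
sample_df = pd.DataFrame({
    "PostCode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Group by postal code and borough, then join each group's
# neighborhood names into one comma-separated string.
grouped_df = (
    sample_df.groupby(["PostCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped_df)
```

Because `groupby` sorts the group keys by default, the result is ordered by postal code rather than by the order of the original rows.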


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [92]:
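# assign through .loc: writing to the rows yielded by iterrows() may only change a copy
not_assigned = toronto_df["Neighborhood"] == "Not assigned"
toronto_df.loc[not_assigned, "Neighborhood"] = toronto_df.loc[not_assigned, "Borough"]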

In [103]:
# Let's check the Queen's Park row in the table
toronto_df[toronto_df.Borough == "Queen's Park"]

Unnamed: 0,PostCode,Borough,Neighborhood
85,M7A,Queen's Park,Queen's Park
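
The fallback rule can also be written as a single vectorized expression. The sketch below uses toy data and `Series.mask`, which replaces values where a condition is true; `sample_df` and `missing` are illustrative names, and this is one possible approach rather than the only one.

```python
import pandas as pd

# One row has a borough but no assigned neighborhood.
sample_df = pd.DataFrame({
    "PostCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# True where the neighborhood is missing.
missing = sample_df["Neighborhood"] == "Not assigned"

# Where the condition holds, take the value from Borough instead.
sample_df["Neighborhood"] = sample_df["Neighborhood"].mask(missing, sample_df["Borough"])
print(sample_df)
```

Assigning back to the column changes the dataframe itself, whereas the rows yielded by `iterrows()` can be copies, so writing to them is not guaranteed to have any effect.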


Use the .shape attribute to print the number of rows and columns of the dataframe.

In [102]:
toronto_df.shape

(103, 3)