# Segmenting and Clustering Neighborhoods in Toronto 

## This section contains the code cells for the first part of the assignment: scraping the Toronto postal code data from Wikipedia

### Read the table from Wikipedia and assign it to a dataframe

In [2]:
import pandas as pd

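wikiUrl = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# read_html returns a list with one dataframe per HTML table; take the first table
data = pd.read_html(wikiUrl)[0]
# Use consistent column names for the rest of the notebook
data.columns = ["PostalCode","Borough","Neighborhood"]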
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Drop rows whose Borough is "Not assigned"

In [3]:
data = data[data["Borough"] != "Not assigned"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


### Combine neighborhoods that share a postal code

#### We can achieve this by grouping the dataframe on postal code, keeping the first borough of each group, and joining the neighborhood names with commas.

In [4]:
borough_function = lambda b: b.iloc[0]
neighborhood_function = lambda n: ", ".join(n)
agg_functions = {"Borough": borough_function, "Neighborhood": neighborhood_function}
groupedData = data.groupby(by="PostalCode").aggregate(agg_functions)
data = groupedData.reset_index()[data.columns]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
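
The grouping step can be checked on a small frame before running it on the full table. The sketch below uses three made-up rows (the `toy` frame is not part of the scraped data) and the string alias `"first"` in place of the `iloc[0]` lambda. This is an assumption about equivalence that holds when the borough column has no missing values.

```python
import pandas as pd

# Toy rows (made up for illustration) with one repeated postal code
toy = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One row per postal code: first borough, comma-joined neighborhoods
combined = (
    toy.groupby("PostalCode")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)
print(combined)
```

The two `M6A` rows collapse into a single row whose neighborhood is the joined string, while `M3A` passes through unchanged.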


### If a neighborhood is "Not assigned", use the borough as the neighborhood

In [5]:
data[data["Neighborhood"] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood
93,M9A,Queen's Park,Not assigned


#### The postal code M9A has a borough but no neighborhood, so we replace the neighborhood with the borough.

In [7]:
# Assign through .loc: writing to the row yielded by iterrows() may only change a copy
not_assigned = data["Neighborhood"] == "Not assigned"
data.loc[not_assigned, "Neighborhood"] = data.loc[not_assigned, "Borough"]

In [8]:
data.iloc[90:95]

Unnamed: 0,PostalCode,Borough,Neighborhood
90,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
91,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So..."
92,M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen..."
93,M9A,Queen's Park,Queen's Park
94,M9B,Etobicoke,"Cloverdale, Islington, Martin Grove, Princess ..."
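
The borough-for-neighborhood rule can also be written as a single column expression. The following is a minimal sketch on two made-up rows (the `toy` frame and the `assigned` name are illustrative only), using `Series.where`, which keeps a value where the condition is true and takes the replacement elsewhere.

```python
import pandas as pd

# Made-up rows for illustration; the second one lacks a neighborhood
toy = pd.DataFrame({
    "PostalCode": ["M8X", "M9A"],
    "Borough": ["Etobicoke", "Queen's Park"],
    "Neighborhood": ["The Kingsway", "Not assigned"],
})

# Keep the neighborhood where it is assigned, otherwise take the borough
assigned = toy["Neighborhood"] != "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].where(assigned, toy["Borough"])
print(toy)
```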


#### Shape of the dataframe (rows, columns)

In [9]:
data.shape

(103, 3)
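
Beyond the shape, the cleaning rules themselves can be asserted. The helper below is a hypothetical sketch (`check_cleaned` is not part of the assignment) that checks the three rules applied above: unique postal codes, no unassigned boroughs, and no unassigned neighborhoods. It is shown on a small made-up sample rather than the scraped table.

```python
import pandas as pd

def check_cleaned(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if a cleaning rule is violated."""
    assert list(df.columns) == ["PostalCode", "Borough", "Neighborhood"]
    # Each postal code should appear exactly once after grouping
    assert df["PostalCode"].is_unique
    # No placeholder values should remain in either column
    assert not (df["Borough"] == "Not assigned").any()
    assert not (df["Neighborhood"] == "Not assigned").any()

# Made-up sample in the cleaned layout
sample = pd.DataFrame({
    "PostalCode": ["M1B", "M9A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighborhood": ["Rouge, Malvern", "Queen's Park"],
})
check_cleaned(sample)
print("all checks passed")
```

Calling `check_cleaned(data)` at the end of the notebook would apply the same checks to the full dataframe.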