# Segmenting and Clustering Neighborhoods in Toronto

For the Toronto neighborhood data, a Wikipedia page holds all the information needed to explore and cluster the neighborhoods in Toronto. You will need to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.

We use *pandas*, one of Python's most popular libraries, to build the dataframe we need.

In [48]:
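import pandas as pd
# "url" points to the Wikipedia page that contains the postal code table
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# read_html returns a list of all tables found on the page; [0] selects the first one
df = pd.read_html(url, header=0)[0]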
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In the dataframe above, some rows contain *NaN* values. What we actually need to remove are the rows whose borough is *Not assigned*. In this table those are the same rows in which the neighborhood is *NaN*, so the pandas method *dropna*, which removes every row that contains a missing value, also removes the *Not assigned* rows.

In [50]:
df = df.dropna()
df.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
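
The *dropna* call above works only because every *Not assigned* row in this table also has a missing neighborhood. As a minimal sketch of a more explicit alternative, the rows can be filtered on the *Borough* column directly. The `sample` frame below is a small hand-made stand-in for the scraped table, not the real data, and the name `assigned` is only illustrative.

```python
import pandas as pd

# Hand-made stand-in for the scraped table (illustrative values only)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": [None, None, "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned, whether or not NaN is present
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned)
```

This version states the intent directly and would keep a row that has an assigned borough but a missing neighborhood, which *dropna* would discard.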


Although we now have the dataframe we need, one small problem remains: the *index* still has gaps where rows were dropped. We fix this with _DataFrame.reset_index(inplace=True,drop=True)_, which renumbers the rows from 0 and discards the old index.

In [51]:
df.reset_index(inplace = True, drop = True) 
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
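
Dropping the rows and resetting the index can also be chained into a single expression that returns a new dataframe instead of modifying one in place. The sketch below assumes a small hand-made `sample` frame rather than the scraped table; `cleaned` is a hypothetical name.

```python
import pandas as pd

# Hand-made stand-in for the scraped table (illustrative values only)
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M8A", "M9A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "Etobicoke"],
    "Neighborhood": [None, "Parkwoods", None, "Islington Avenue"],
})

# Drop rows with missing values, then renumber the index from 0;
# drop=True discards the old index instead of keeping it as a column
cleaned = sample.dropna().reset_index(drop=True)
print(cleaned)
```

Without `drop=True`, the old index would be kept as an extra column named `index`.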


Finally, we rename one of the columns to match the desired output.

In [40]:
df = df.rename(columns={'Postal Code': 'PostalCode'})

#### The *DataFrame.shape* attribute gives the number of _rows_ and columns of our dataframe as a tuple.

In [41]:
df.shape

(103, 3)
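
Because *shape* is a plain tuple of (rows, columns), it can be unpacked when only one of the two numbers is needed. A minimal sketch on a hand-made three-row frame (not the scraped data; the variable names are illustrative):

```python
import pandas as pd

# Hand-made stand-in for the cleaned table (illustrative values only)
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# shape is an attribute, not a method, so it is read without parentheses
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```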