# Segmenting and Clustering Neighborhoods in Toronto Part 1

In this notebook, we first scrape the Wikipedia page "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" to get the data.
The data we will use includes the postal codes, boroughs, and neighborhoods of Toronto.

Then we clean and wrangle the data.

In [82]:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

First, we need to fetch the page's HTML from its URL so that we can extract the data from Wikipedia.

In [83]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wikipedia_page = requests.get(url).text
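
The request above assumes the page always loads. As a minimal sketch, the fetch could be wrapped so that a slow or failed response surfaces early. The helper name `fetch_html`, the 10-second default timeout, and the injectable `get` parameter are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising on HTTP errors.

    `get` is a hypothetical injection point so the helper can be
    exercised without network access.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `wikipedia_page = fetch_html(url)` would stand in for the direct `requests.get(url).text` call.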

In [84]:
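# Parse the page as HTML and take the first table on it
soup = BeautifulSoup(wikipedia_page, 'html.parser')
table = soup.find('table')
# Empty dataframe with the three columns we want to fill
column_name = ['PostalCode', 'Borough', 'Neighbourhood']
df = pd.DataFrame(columns=column_name)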

To get a clean dataframe, we strip the whitespace from each cell and keep only the rows that have exactly three cells.

In [85]:
for tr in table.find_all('tr'):
    data=[]
    for td in tr.find_all('td'):
        data.append(td.text.strip())
    if len(data)==3:
        df.loc[len(df)] = data
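
The same row extraction can be tried offline on a small table. The sketch below collects the rows in a list and builds the dataframe once, instead of appending row by row. The HTML snippet and the helper name `table_to_frame` are made up for illustration; the real Wikipedia table may be laid out differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""


def table_to_frame(html, columns):
    """Parse the first <table> in `html` into a dataframe with `columns`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Strip surrounding whitespace from every data cell
        cells = [td.text.strip() for td in tr.find_all("td")]
        # The header row holds <th> cells only, so it yields no <td> cells
        if len(cells) == len(columns):
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


sample_df = table_to_frame(sample_html, ["PostalCode", "Borough", "Neighbourhood"])
```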

In [86]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [87]:
# Drop rows whose Borough or Neighbourhood is 'Not assigned'
df = df[df.Borough != 'Not assigned']
df = df[df.Neighbourhood != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
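
The two filters can also be written as a single boolean mask. This is a minimal sketch on made-up sample rows, not the scraped data. It also resets the index, which the cell above does not do.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M4A", "North York", "Victoria Village"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# True only for rows where neither column holds the placeholder value
assigned = (sample[["Borough", "Neighbourhood"]] != "Not assigned").all(axis=1)
cleaned = sample[assigned].reset_index(drop=True)
```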


In [88]:
# Check whether any 'Not assigned' values remain for boroughs
df.loc[df.Borough == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [89]:
# Check whether any 'Not assigned' values remain for neighbourhoods
df.loc[df.Neighbourhood == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood


As the empty results above show, the data no longer contains any 'Not assigned' values. Now we can group the neighbourhoods by postal code and drop duplicates.

In [90]:
# Group neighbourhoods that share the same PostalCode into one comma-separated string
grouped_df = df.groupby('PostalCode')['Neighbourhood'].apply(','.join)
grouped_df = grouped_df.reset_index(drop=False)
grouped_df.rename(columns={'Neighbourhood':'joined_neighbourhood'},inplace=True)

In [91]:
# Merge the two dataframes to build the final version
df_merged = pd.merge(df,grouped_df,on='PostalCode')

In [92]:
df_merged.drop(['Neighbourhood'],axis=1,inplace=True)
df_merged.drop_duplicates(inplace=True)

In [93]:
df_merged.rename(columns={'joined_neighbourhood':'Neighbourhood'},inplace=True)

In [94]:
df_merged.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
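
The group, merge, drop, and rename steps above can also be sketched as a single aggregation. The sample rows below are made up, with one postal code split across two rows to show the join. The sketch assumes each postal code belongs to one borough, and it joins with ", " as a stylistic choice.

```python
import pandas as pd

# Hypothetical sample rows; M5A is split across two rows to show the join
sample = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M5A", "M5A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Regent Park", "Harbourfront"],
    }
)

# Group on both key columns so Borough survives, then join the names
combined = (
    sample.groupby(["PostalCode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column without a separate merge, at the cost of producing two rows if a postal code ever appeared under two boroughs.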


In [95]:
# Shape of merged dataframe
df_merged.shape

(103, 3)
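
The shape alone does not tell us whether the cleaning worked. As a rough sketch, a few checks could be bundled into a helper and run on the merged dataframe. The name `validate_postal_frame` and the choice of checks are assumptions, not part of the original notebook.

```python
import pandas as pd


def validate_postal_frame(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Each postal code should appear in exactly one row after grouping
    if not frame["PostalCode"].is_unique:
        problems.append("duplicate postal codes")
    # No placeholder values should be left in either text column
    if (frame[["Borough", "Neighbourhood"]] == "Not assigned").any().any():
        problems.append("'Not assigned' values remain")
    # No cell should be missing
    if frame.isna().any().any():
        problems.append("missing values")
    return problems
```

Calling `validate_postal_frame(df_merged)` would then return an empty list when the dataframe is in the expected state.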