## Toronto Neighbourhood Clustering

In [31]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
doc = r.text

soup = BeautifulSoup(doc, 'html.parser')

data = []
# Assumes the first table with class 'wikitable' is the postal code table
table = soup.find('table', attrs={'class': 'wikitable'})
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # keep only non-empty cells

df = pd.DataFrame(data)
# The header row holds <th> cells only, so it was parsed as an empty row; drop it
df.drop([0], inplace=True)
df.columns = ['PostalCode', 'Borough', 'Neighbourhood']
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Great! We scraped the HTML and used BeautifulSoup to extract the table into a pandas DataFrame. Before using this DataFrame for analysis, we need to clean the data. We have to:

1) Ignore cells whose borough is Not assigned.

2) Combine neighbourhoods that share the same postal code into a single row.

3) If a neighbourhood is Not assigned, use the borough name as the neighbourhood.

4) Use the .shape attribute to print the number of rows in the DataFrame.
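
The four steps above can be sketched on a small, made-up table. This is only an illustration under stated assumptions: the `sample` rows are hypothetical (they are not the scraped data), and the `", "` separator is one reasonable choice rather than a requirement.

```python
import pandas as pd

# Hypothetical sample rows (not the scraped data), one case per cleaning rule.
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M7A", "Queen's Park", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# 1) Drop rows whose borough is not assigned.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# 3) Where the neighbourhood is not assigned, fall back to the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# 2) Combine neighbourhoods that share a postal code into one string.
cleaned = (
    cleaned.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

# 4) shape is an attribute, not a method: (number of rows, number of columns).
print(cleaned.shape)
```

Step 3 is applied before step 2 here so that a fallback borough name would also be included in the joined string if a postal code had mixed rows.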

In [46]:
print(list(df[df['Borough'] != 'Not assigned']['Neighbourhood']))  # no 'Not assigned' values remain
print(df[df['Borough'] != 'Not assigned'].groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index())

# The transformed dataframe has the same dimensions as the csv file, which suggests the result is correct.
# The first print shows that no neighbourhood is 'Not assigned' once rows with an unassigned borough are filtered out,
# so it is now safe to build the grouped dataframe and check its .shape.

df1 = df[df['Borough'] != 'Not assigned'].groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df1.shape

['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights', "Queen's Park, Ontario Provincial Government", 'Islington Avenue, Humber Valley Village', 'Malvern, Rouge', 'Don Mills', 'Parkview Hill, Woodbine Gardens', 'Garden District, Ryerson', 'Glencairn', 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale', 'Rouge Hill, Port Union, Highland Creek', 'Don Mills', 'Woodbine Heights', 'St. James Town', 'Humewood-Cedarvale', 'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood', 'Guildwood, Morningside, West Hill', 'The Beaches', 'Berczy Park', 'Caledonia-Fairbanks', 'Woburn', 'Leaside', 'Central Bay Street', 'Christie', 'Cedarbrae', 'Hillcrest Village', 'Bathurst Manor, Wilson Heights, Downsview North', 'Thorncliffe Park', 'Richmond, Adelaide, King', 'Dufferin, Dovercourt Village', 'Scarborough Village', 'Fairview, Henry Farm, Oriole', 'Northwood Park, York University', 'East Toronto, Broadview North (Old East York)', '

(103, 3)

Great! The data has been grouped and the "Not assigned" rows have been removed. We can now proceed to the next step: appending the latitude and longitude to the dataframe.
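
One possible way to append coordinates is a left merge on the postal code. The sketch below is an assumption about how that step could look, not the method used later: the `coords` table, its `Postal Code` column name, and the latitude and longitude values are all hypothetical placeholders.

```python
import pandas as pd

# A few rows shaped like the cleaned dataframe.
postal = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
    }
)

# Hypothetical coordinate table; the values are placeholders, not real coordinates.
coords = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M3A"],
        "Latitude": [43.65, 43.75],
        "Longitude": [-79.36, -79.33],
    }
)

# A left merge keeps every postal code from `postal`, in its original order;
# codes missing from `coords` get NaN for Latitude and Longitude.
merged = postal.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code"
).drop(columns="Postal Code")

print(merged)
```

A left merge is used so that postal codes without a match stay visible as missing values instead of being dropped silently.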