# Segmenting and Clustering Neighborhoods in Toronto
#### A peer-graded assignment for the Applied Data Science Capstone course

### Task 1 : Web Scraping

Obtain the Canadian postal codes beginning with M (the Toronto area) from this page : https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M<br>
Use web scraping to convert the table on that page into a dataframe.

In [43]:
import pandas as pd
import requests 
from bs4 import BeautifulSoup as bs

In [44]:
#store the url
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#Fetch the contents of the web page with requests
page = requests.get(url)
print("Page : ",page)

#Parse the HTML; naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
htmlContent = bs(page.text, "html.parser")
# print("htmlContent : ",htmlContent)

#Find our data: the first <table> with class "wikitable sortable"
#attrs must be a dict ({"class": ...}), not a set ({"class", ...})
postalCodeTable = htmlContent.findAll('table',attrs={"class": "wikitable sortable"})[0]

#Loop through postalCodeTable: read the column names from the header row, then collect each data row in a list
data = []
columns = []
headerRow = postalCodeTable.find('tr')
for th in headerRow.findAll('th'):
    columns.append(th.text.strip())

for row in postalCodeTable.findAll('tr'):
    value = []
    for td in row.findAll('td'):
        value.append(td.text.strip())
    #Skip the header row (it has no <td> cells) and rows whose Borough is 'Not assigned'
    if len(value) > 0 and value[1] != 'Not assigned':
        #If Neighbourhood is 'Not assigned', use the Borough name as the Neighbourhood
        if value[2] == 'Not assigned':
            value[2] = value[1]
        data.append(value)

#put data and columns into a dataframe using pandas
postalCodeDF = pd.DataFrame(data,columns=columns)
print("postalCodeDF - top 5 rows : ",postalCodeDF.head())
print("postalCodeDF - bottom 5 rows : ",postalCodeDF.tail())
postalCodeDF

Page :  <Response [200]>
postalCodeDF - top 5 rows :    Postcode           Borough     Neighbourhood
0      M3A        North York         Parkwoods
1      M4A        North York  Victoria Village
2      M5A  Downtown Toronto      Harbourfront
3      M5A  Downtown Toronto       Regent Park
4      M6A        North York  Lawrence Heights
postalCodeDF - bottom 5 rows :      Postcode    Borough             Neighbourhood
206      M8Z  Etobicoke  Kingsway Park South West
207      M8Z  Etobicoke                 Mimico NW
208      M8Z  Etobicoke        The Queensway West
209      M8Z  Etobicoke     Royal York South West
210      M8Z  Etobicoke            South of Bloor


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
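
The two cleaning rules applied in the loop above (drop rows whose Borough is 'Not assigned', and fall back to the Borough when the Neighbourhood is 'Not assigned') can also be written as vectorized pandas operations. This is a minimal sketch on made-up rows, not the scraped table; the `raw` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the scraped table
raw = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: keep only rows with an assigned Borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: where the Neighbourhood is unassigned, use the Borough name instead
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighbourhood"] = cleaned.loc[unassigned, "Borough"]

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Filtering first and then assigning through `.loc` keeps both rules as separate, readable steps and avoids indexing each row by position.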


Use the groupby function to merge the neighborhoods that share a postal code into a single row, joined by commas.

In [45]:
#Merge Neighbourhoods with same postal code
df = postalCodeDF.groupby(['Postcode','Borough'], as_index=False).agg({"Neighbourhood" : ",".join})
df = df.rename(columns={'Postcode' : 'PostalCode','Neighbourhood' : 'Neighborhood'})
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [46]:
df.shape

(103, 3)
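
If the grouping worked, each postal code should appear exactly once in the result. As a hedged sanity check, the same `groupby` pattern can be run on a few rows and the result tested for duplicate codes. The `sample` frame below is illustrative data copied from the first rows shown earlier, not the full table.

```python
import pandas as pd

# Three illustrative rows: M5A appears twice, M3A once
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Same pattern as above: one row per (Postcode, Borough), neighbourhoods joined by commas
merged = sample.groupby(["Postcode", "Borough"], as_index=False).agg({"Neighbourhood": ",".join})

print(merged)
# True when no postal code is repeated after grouping
print(merged["Postcode"].is_unique)
```

On the real data, the equivalent check would be `df['PostalCode'].is_unique`.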

### Task 2 : Find the Latitude and Longitude of the Postal Codes
To get the latitude and longitude of each postal code, I have used the CSV file from the link : http://cocl.us/Geospatial_data


In [47]:
geospatial_data_url = "http://cocl.us/Geospatial_data"
geospatial_df = pd.read_csv(geospatial_data_url)
geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


To merge this dataframe with the postal code dataframe, the key column must have the same name in both. So we rename the 'Postal Code' column to 'PostalCode' to match the column in 'df'.

In [48]:
#Rename the column 'Postal Code' of geospatial_df to 'PostalCode'
geospatial_df = geospatial_df.rename(columns={'Postal Code':'PostalCode'})
geospatial_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [49]:
#Merge df and geospatial_df into a single dataframe on the shared 'PostalCode' column
location_df = pd.merge(df,geospatial_df,how='inner',on='PostalCode')
location_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
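
An inner merge silently drops any postal code that is missing from either dataframe. One way to check for that, sketched here on hypothetical rows (the `left` and `right` frames, the code `M9Z`, and "Example Borough" are made up), is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from. Passing `validate="one_to_one"` additionally raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Hypothetical inputs: M9Z has no coordinates in the right-hand frame
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# The outer merge keeps unmatched rows; _merge is 'both', 'left_only' or 'right_only'
check = pd.merge(left, right, how="outer", on="PostalCode",
                 indicator=True, validate="one_to_one")

# Rows that an inner merge would have dropped
unmatched = check[check["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame on the real data would indicate that the inner merge lost no rows.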
