# Segmenting and Clustering Neighborhoods in Toronto

### In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto.

### Part 1: creating a dataframe by scraping the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [41]:
#import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
#webpage we want to scrape
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
#download the html content
results=requests.get(url)

In [4]:
#check that the html was downloaded properly; a successful request returns the status code 200
results.status_code

200

In [5]:
#parsing
soup = BeautifulSoup(results.content,'html.parser')

In [6]:
#extracting the table: look for the 'table' tag, either on the website (right-click, then Inspect) or in the soup variable we just created
#in my case I got: <table class="wikitable sortable">

#the tag is 'table' and the attribute to match is the class, passed as class_
stat_table = soup.find_all('table',class_='wikitable sortable')

In [7]:
#check how many tables were found; there is only one, so it is the one we are looking for
len(stat_table)

1

In [8]:
#checking the type of stat_table
type(stat_table)

bs4.element.ResultSet

In [9]:
#find_all returns a ResultSet (a list-like object), but we need a single Tag, so we take the first element
stat_table = stat_table[0]
type(stat_table)

bs4.element.Tag

In [10]:
#creating dataframe with the stat_table 
#in the html code 'tr' represents the rows, 'th' the headers and 'td' the cells of the table

#create lists to store the table
data = []
columns = []

#filling every row
for i,row in enumerate(stat_table.find_all('tr')):
    section = []
    #filling every cell
    for cell in row.find_all(['th','td']):
        section.append(cell.text.rstrip())
    
    #make first row of data (index=0) the header
    if (i == 0):
        columns = section
    else:
        data.append(section)

#convert list into pandas DataFrame
df_can = pd.DataFrame(data = data,columns = columns)
df_can.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
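
To see why the loop above strips each cell, here is a minimal sketch of the same row-by-row parsing on a tiny inline table. The `sample_html` string is a made-up stand-in for the Wikipedia page (an assumption, not the real markup), so it runs without a network call. The last cell of each row ends with a newline, which is what stripping removes.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-row table standing in for the Wikipedia page
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

# find returns the first matching Tag directly (no indexing needed)
table = BeautifulSoup(sample_html, 'html.parser').find('table', class_='wikitable sortable')

# one list of cleaned cell texts per <tr>
rows = []
for tr in table.find_all('tr'):
    rows.append([cell.get_text().strip() for cell in tr.find_all(['th', 'td'])])

# first row holds the headers, the rest is data
sample_df = pd.DataFrame(rows[1:], columns=rows[0])
print(sample_df)
```

Without the strip, the last column name would be `'Neighbourhood\n'` and later lookups such as `df['Neighbourhood']` would fail.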


In [11]:
#dropping any Borough that is not assigned
df_can = df_can[df_can['Borough'] != 'Not assigned']
df_can.reset_index(drop=True,inplace=True)
df_can.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [44]:
#rows that share a postcode are combined into one row, with the neighborhoods separated by a comma, using .groupby():
df_grouped = df_can.groupby(['Postcode', 'Borough'], as_index=False, sort=False).agg(','.join)
df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned
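
The grouping step is easier to follow on a few rows. This is a small sketch on a hand-made `toy` dataframe (the rows are copied from the outputs above, the variable names are my own), showing that `sort=False` keeps the postcodes in their original order and that `','.join` concatenates the neighborhoods of each group.

```python
import pandas as pd

# A few rows in the same shape as df_can
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights', 'Parkwoods'],
})

# same call as in the notebook: one row per (Postcode, Borough) pair
toy_grouped = toy.groupby(['Postcode', 'Borough'], as_index=False, sort=False).agg(','.join)
print(toy_grouped)
```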


In [72]:
#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough:
df_grouped['Neighbourhood'] = np.where(df_grouped['Neighbourhood'] == 'Not assigned', df_grouped['Borough'], df_grouped['Neighbourhood'])
df_grouped.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [40]:
#shape of the dataframe
df_grouped.shape

(103, 3)

### Part 2: adding the geographical coordinates

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. We will use the Geocoder Python package: https://geocoder.readthedocs.io/index.html.

The problem with this package is that a call for a given postal code does not always succeed on the first try: it can return None, and then return the coordinates when you make the same call again. So, to make sure that you get the coordinates for all of the neighborhoods, you can run a while loop for each postal code that repeats the call until it returns a result.
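
A plain `while` loop never ends if the service keeps returning None, so a capped retry may be safer. This is only a sketch: `fetch_with_retries` and `flaky_lookup` are hypothetical names of my own, and `flaky_lookup` is a stand-in that fails twice before succeeding, not a real geocoder call.

```python
import time

def fetch_with_retries(lookup, postal_code, max_attempts=5, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        # wait before the next attempt (0 seconds by default)
        time.sleep(delay_seconds)
    # give up instead of looping forever
    return None

# Hypothetical stand-in for a geocoding call: returns None twice, then coordinates
calls = {'count': 0}

def flaky_lookup(postal_code):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return (43.753259, -79.329656)

result = fetch_with_retries(flaky_lookup, 'M3A')
print(result, calls['count'])
```

In real use, `lookup` would wrap whatever geocoding call you rely on and return its latitude/longitude pair, or None on failure.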

### I first tried the geocoder package, but the calls took too long to return any coordinates, so I will use the available .csv file instead

In [73]:
#!pip install geocoder
#import geocoder

#function
#def get_geocoder(postal_code):
    # initialize your variable to None
    #ll_coords = None
    
    # loop until you get the coordinates
    #while(ll_coords is None):
        #g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        #ll_coords = g.latlng
        
    #latitude = ll_coords[0]
    #longitude = ll_coords[1]
    #return latitude,longitude

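#the function takes a single postal code, so call it once per row instead of passing the whole array
#coords = [get_geocoder(pc) for pc in df_grouped['Postcode']]
#df_grouped['Latitude'], df_grouped['Longitude'] = zip(*coords)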
#df_grouped.head()

### Using the .csv file:

In [88]:
#using the csv file
ll_coords=pd.read_csv('http://cocl.us/Geospatial_data')
ll_coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [94]:
#rename columns so both dataframes match
ll_coords.rename(columns={'Postal Code':'Postcode'},inplace=True)
ll_coords.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [95]:
#merging the two dataframes based on the column 'Postcode'
df_coords=pd.merge(df_grouped, ll_coords, on='Postcode')
df_coords.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
