<div style="background: yellow; color: blue; font-size: 2rem;"><h1>Segmenting and Clustering Neighborhoods in Toronto</h1></div>

<span style="padding: 10px; background: white; color: red; font-size: 1.5rem;">by <b>Santanu Sikder</b></span>

<h1><u>Part-1:</u> Scraping the data (table) from the Wikipedia page and preparing the dataframe</h1>

In [42]:
# Install lxml for reading HTML tables using pandas' read_html method
!pip install lxml




In [2]:
# Import the necessary modules first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import folium
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

I'll use pandas' **read_html** function to read every table on the given page into a list of dataframes.

Then I'll assign the first dataframe in the list, which holds the postal code table we want, to **dfMain**.

In [15]:
dfMain = pd.read_html("http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

# Check if the dataframe has been read in successfully from the table
dfMain.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [16]:
# Rename the column "Postal Code" to "PostalCode"
dfMain.rename(columns = {"Postal Code" : "PostalCode"}, inplace = True)
# Preview the dataframe
dfMain.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [17]:
# Optionally export the dataframe to a local CSV file
# dfMain.to_csv("canada_postal_codes.csv")

In [40]:
# Drop the rows where the Borough is "Not assigned"
# The same result can be obtained by filtering the dataframe with a boolean mask, but I'll use the drop method
df = dfMain.drop(dfMain[dfMain["Borough"] == "Not assigned"].index, axis = 0).reset_index(drop = True)
# Preview the dataframe
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
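
For comparison, here is a minimal sketch of the mask-based alternative mentioned in the comment above. The `sample` dataframe is a hand-built stand-in that reuses the first five rows from the earlier preview; the names `sample`, `mask`, `filtered` and `dropped` are illustrative and not part of the notebook.

```python
import pandas as pd

# Illustrative stand-in for dfMain: the first five rows from the preview above
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Regent Park, Harbourfront"],
})

# Boolean mask: True for the rows whose Borough is assigned
mask = sample["Borough"] != "Not assigned"
filtered = sample[mask].reset_index(drop=True)

# The drop-based approach used in the notebook, applied to the same sample
dropped = sample.drop(
    sample[sample["Borough"] == "Not assigned"].index
).reset_index(drop=True)

# Both approaches should produce the same dataframe
print(filtered.equals(dropped))
```

The mask version avoids the intermediate index lookup, but on a table of this size the choice is mostly a matter of taste.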


The scenarios described in the 3rd and 4th points of the 3rd step of the first part of the assignment instructions **do not occur** in the data: there is no need to combine Neighbourhood names with commas (the Wikipedia page already does this), and no row has an assigned Borough with a Neighbourhood of "Not assigned" (as of August 2020).

Therefore, I'll skip these two unnecessary steps and move on.
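
A quick way to confirm that both scenarios are absent is sketched below. It assumes the two scenarios are (a) a postal code spread over several rows and (b) an assigned Borough whose Neighbourhood is "Not assigned". The `sample` dataframe is a hand-built stand-in using the five rows previewed above; in the notebook the same checks would run against `df`.

```python
import pandas as pd

# Illustrative stand-in for df: the five rows shown in the preview above
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village",
                      "Regent Park, Harbourfront",
                      "Lawrence Manor, Lawrence Heights",
                      "Queen's Park, Ontario Provincial Government"],
})

# Scenario (a): count postal codes that appear on more than one row
duplicated_codes = int(sample["PostalCode"].duplicated().sum())

# Scenario (b): count rows whose Neighbourhood is still "Not assigned"
unassigned_neighbourhoods = int((sample["Neighbourhood"] == "Not assigned").sum())

print(duplicated_codes, unassigned_neighbourhoods)
```

If either count were non-zero, the corresponding clean-up step from the instructions would be needed after all.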

In [41]:
# Print out the shape of the dataframe
df.shape

(103, 3)

This is **THE END OF PART - 1**

<h1><u>Part-2:</u> Creating the dataframe containing the latitudes and longitudes of the postal codes</h1>

I'll use the CSV file whose link has been provided in the instructions for this assignment to create the required dataframe.

In [44]:
dfLatLong = pd.read_csv("geospatial_coordinates.csv").rename(columns = {"Postal Code" : "PostalCode"})
# Preview the dataframe
dfLatLong

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


I'll set the PostalCode column as the **index** of both df and dfLatLong, so that the next step can match rows in the two dataframes by postal code.

In [48]:
df.set_index("PostalCode", inplace = True)
dfLatLong.set_index("PostalCode", inplace = True)
# Preview df
df.head()

PostalCode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [49]:
# Preview dfLatLong
dfLatLong.head()

PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


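I'll now insert the latitudes and longitudes from the above dataframe into df. Because both dataframes are indexed by PostalCode, pandas aligns the rows by postal code rather than by position.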

In [50]:
df[["Latitude", "Longitude"]] = dfLatLong
# Preview the dataframe
df.head()

PostalCode,Borough,Neighbourhood,Latitude,Longitude
M3A,North York,Parkwoods,43.753259,-79.329656
M4A,North York,Victoria Village,43.725882,-79.315572
M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
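
The same index-aligned combination can also be written with `DataFrame.join`, which makes the matching explicit. The sketch below is an alternative, not the method used in this notebook; `boroughs` and `coords` are hand-built stand-ins for df and dfLatLong, with the coordinate rows deliberately listed in a different order.

```python
import pandas as pd

# Illustrative stand-in for df, indexed by PostalCode
boroughs = pd.DataFrame(
    {
        "Borough": ["North York", "North York", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Victoria Village",
                          "Regent Park, Harbourfront"],
    },
    index=pd.Index(["M3A", "M4A", "M5A"], name="PostalCode"),
)

# Illustrative stand-in for dfLatLong, with rows in a different order
coords = pd.DataFrame(
    {
        "Latitude": [43.65426, 43.753259, 43.725882],
        "Longitude": [-79.360636, -79.329656, -79.315572],
    },
    index=pd.Index(["M5A", "M3A", "M4A"], name="PostalCode"),
)

# A left join keeps every row of boroughs and matches coordinates on the index
joined = boroughs.join(coords, how="left")

# Count postal codes that found no coordinates in the lookup table
missing = int(joined["Latitude"].isna().sum())

print(joined)
print(missing)
```

Counting the missing values afterwards is a cheap check that every postal code was found in the coordinates file.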


Finally, I'll reset the index to obtain the desired dataframe.

In [51]:
df.reset_index(inplace = True)
# Preview the dataframe
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


This is **THE END OF PART - 2**