# Segmenting and Clustering Neighborhoods in Toronto
## Applied Data Science Capstone - Week 3
### Ver 1.0, Dated Sunday, 02-Aug-2020
### Debesh Roy

In this assignment, we will **Explore, Segment, and Cluster** the neighborhoods in the city of Toronto. 
For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto.

We need to **Scrape the Wikipedia page and Wrangle the data, Clean it, and then read it into a pandas dataframe** so that it is in a structured format.

In [1]:
# Import the libraries

# Need to use pandas dataframe
import pandas as pd

# Need to import the library we will be using to connect to the Wikipedia page and fetch the contents of that page:
import urllib.request

# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

In [2]:
# Next we specify the URL of the Wikipedia page we are looking to scrape:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)

# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

In [3]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables = soup.find_all("table")
#all_tables

Looking through the output of "all_tables", we can see that the class attribute of our chosen table is "wikitable sortable". We can use this to make BeautifulSoup return only this particular table, and keep it in a variable called "right_table":

In [4]:
right_table = soup.find('table', class_='wikitable sortable')
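As a minimal offline sketch of how the `class_` filter picks one table out of several, the snippet below parses a small, made-up HTML string (not the real Wikipedia markup). It assumes the built-in "html.parser" instead of "lxml" so that it runs without the extra dependency; the names `sample_html`, `sample_soup` and `sample_table` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: two tables, only one carries the target classes.
sample_html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A class_ value with a space is compared against the full class attribute string.
sample_table = sample_soup.find("table", class_="wikitable sortable")

# BeautifulSoup exposes the class attribute as a list of individual class names.
print(sample_table["class"])
print(len(sample_table.find_all("tr")))
```

Note that matching on the full string depends on the order of the class names, so this only works while the page keeps the attribute exactly as "wikitable sortable".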

Loop through the rows

We have to start looping through the rows to get the data for every Postal Code in the table.

There are three columns in our table that we want to scrape the data from, so we will set up three empty lists (PC, BO and NH) to store our data in.

In [5]:
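# Lists for the three columns: Postal Code, Borough and Neighbourhood
PC = []
BO = []
NH = []
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    # Only rows with exactly three <td> cells hold data; other rows are skipped
    if len(cells) == 3:
        PC.append(cells[0].find(text=True).strip())
        BO.append(cells[1].find(text=True).strip())
        NH.append(cells[2].find(text=True).strip())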
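To illustrate why the `len(cells) == 3` check drops the header row, here is a small self-contained sketch on made-up HTML. It is an assumption-laden variant, not the cell above: it uses "html.parser" and `get_text(strip=True)`, which collects all the text in a cell rather than only the first text node, and the names `sample_html` and `rows` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up table: one header row (<th> cells only) and two data rows.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td> M1A </td><td> Not assigned </td><td> Not assigned </td></tr>
  <tr><td> M3A </td><td> North York </td><td> Parkwoods </td></tr>
</table>
"""

rows = []
for row in BeautifulSoup(sample_html, "html.parser").find_all("tr"):
    cells = row.find_all("td")
    # The header row has no <td> cells, so len(cells) is 0 and it is skipped.
    if len(cells) == 3:
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting one list per row, instead of three parallel lists, keeps the values of a row together, which can make it easier to spot a malformed row.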

Now we'll use pandas to create a dataframe from these lists, assigning each of PC, BO and NH to a column named after the corresponding source table column, i.e. Postal Code, Borough and Neighbourhood.

In [6]:
df_M = pd.DataFrame(PC,columns=['Postal Code'])
df_M['Borough'] = BO
df_M['Neighbourhood'] = NH

# Drop the rows whose Borough is 'Not assigned' (copy to avoid modifying a view)
df_M = df_M[df_M['Borough'] != 'Not assigned'].copy()
# Where the Neighbourhood is 'Not assigned', use the Borough name instead
df_M.loc[df_M.Neighbourhood == 'Not assigned', "Neighbourhood"] = df_M.Borough
df_M

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
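The two cleaning rules (drop rows with no Borough, and fall back to the Borough name when the Neighbourhood is missing) can be sketched on a toy dataframe. All values below are made up for illustration, and the names `toy`, `cleaned` and `missing` are hypothetical.

```python
import pandas as pd

# Made-up rows covering the three cases the cleaning step has to handle.
toy = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Borough A", "Borough B"],
    "Neighbourhood": ["Not assigned", "Neighbourhood A", "Not assigned"],
})

# Rule 1: keep only the rows that have an assigned Borough.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the Neighbourhood is missing, reuse the Borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

The filtered rows keep their original index labels, which is why the index in a cleaned table can have gaps.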


We tried to use the Geocoder package but were not able to get the coordinates of each postal code, so we use the CSV file of coordinates instead, accessed through http://cocl.us/Geospatial_data

In [7]:
!wget -O Geospatial_Coordinates.csv http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
df_PC = pd.read_csv('Geospatial_Coordinates.csv')
df_PC.head()

--2020-08-02 22:30:30--  http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Resolving cocl.us (cocl.us)... 119.81.168.75, 119.81.168.76, 161.202.50.39
Connecting to cocl.us (cocl.us)|119.81.168.75|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv [following]
--2020-08-02 22:30:30--  https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
Connecting to cocl.us (cocl.us)|119.81.168.75|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-08-02 22:30:32--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 103.116.4.197
Connecting to ibm.box.com (ibm.box.com)|103.116.4.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# Now we need to join both dataframes on Postal Code to get the Latitude and Longitude

df_M.set_index(['Postal Code'], inplace=True)
df_PC.set_index(['Postal Code'], inplace=True)

df_Total = df_M.join(df_PC).reset_index()
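A join like this silently leaves NaN coordinates for any postal code that is missing from the coordinates file, so it can be worth checking for unmatched rows. The sketch below is one possible way to do that, using `merge` instead of `set_index` plus `join`. The two matched coordinate pairs are taken from the final table in this notebook, while the code "M9Z" and its borough label are made up to force a mismatch; the names `left`, `right`, `merged` and `unmatched` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M5A", "M7A", "M9Z"],  # "M9Z" is a made-up code
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Hypothetical Borough"],
})
right = pd.DataFrame({
    "Postal Code": ["M5A", "M7A"],
    "Latitude": [43.65426, 43.662301],
    "Longitude": [-79.360636, -79.389494],
})

# A left merge keeps every row of `left`; validate raises if a key is duplicated.
merged = left.merge(right, on="Postal Code", how="left", validate="one_to_one")

# Rows without a match end up with NaN in the coordinate columns.
unmatched = merged[merged["Latitude"].isna()]
print(unmatched)
```

If `unmatched` is empty, every postal code received a coordinate pair.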

Let's count the number of postal codes in each Borough

In [9]:
# Let's see the Borough-wise count using groupby
df_Total.groupby(['Borough'])['Postal Code'].count()

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Scarborough         17
West Toronto         6
York                 5
Name: Postal Code, dtype: int64
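As a small illustration of what the group-by count computes, the sketch below runs it on a made-up dataframe and compares the result with `value_counts`, which should give the same numbers here because no postal code is missing. The postal codes and the names `toy`, `counts` and `alt_counts` are hypothetical.

```python
import pandas as pd

# Made-up rows: three postal codes in one borough, one in another.
toy = pd.DataFrame({
    "Postal Code": ["A1", "A2", "A3", "B1"],
    "Borough": ["East Toronto", "East Toronto", "York", "East Toronto"],
})

# Count the non-null postal codes per borough, largest group first.
counts = toy.groupby("Borough")["Postal Code"].count().sort_values(ascending=False)

# value_counts counts the rows per borough directly.
alt_counts = toy["Borough"].value_counts()

print(counts)
```

The two approaches differ only when "Postal Code" contains missing values, since `count` ignores them while `value_counts` on "Borough" does not look at that column at all.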

In [10]:
# Now keep only those rows whose Borough name contains 'Toronto'
df_Final = df_Total[df_Total['Borough'].str.contains('Toronto')]

In [11]:
# Reindex so that the row labels start from zero
df_Final = df_Final.reset_index(drop=True)

In [12]:
df_Final.shape

(39, 5)
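The filtering and reindexing steps can be sketched together on a toy dataframe. The borough names are taken from the counts above, but the postal codes and the names `toy` and `subset` are made up; passing `regex=False` and `na=False` is an assumption about how one might want a plain substring match that treats missing values as non-matches.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["P1", "P2", "P3", "P4"],  # made-up codes
    "Borough": ["Downtown Toronto", "North York", "East Toronto", "Etobicoke"],
})

# Plain substring match on the borough name; missing values count as False.
is_toronto = toy["Borough"].str.contains("Toronto", regex=False, na=False)

# Keep the matching rows and renumber them from zero.
subset = toy[is_toronto].reset_index(drop=True)

print(subset)
```

As a consistency check on the real data, the four boroughs whose names contain "Toronto" have 9, 19, 5 and 6 postal codes in the borough counts, which sums to the 39 rows reported by `df_Final.shape`.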

In [13]:
df_Final

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
