# IBM Applied Data Science Capstone
## Peer-graded Assignment
## Segmenting and Clustering Neighborhoods in Toronto
### Sidclay da Silva
### June 2020
---

### Introduction

This notebook contains the peer-graded assignment for Week 3 of the IBM Applied Data Science Capstone course on Coursera, which requires exploring, segmenting, and clustering the neighborhoods in the city of Toronto. In short, the assignment consists of three main tasks:

1. Build a dataframe with the Toronto Postal Codes from a web page.
1. Include the coordinates for each neighborhood in the dataframe.
1. Explore, cluster and display the neighborhood clusters on a map.

Most of the code could be grouped into fewer cells for a shorter notebook, but the objective is to clarify each step, so the code has been broken up with Markdown explanations.
The tool chosen for this assignment was a Jupyter Notebook running on JupyterLab. The notebook will be available in a GitHub repository so that peers can grade it.

---

### Task 1 - Build a dataframe with the Toronto Postal Codes from a web page

Import the required libraries. For this task, the __Requests__ library will be used to send the web request, and __BeautifulSoup__ to parse the data from the web page.

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

Send a request to the provided URL and check whether the data was loaded successfully.

In [4]:
# send a request to the URL and store the response
raw = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# check if data was loaded [the response is truthy for any status code below 400; 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200


Parse the raw data using __BeautifulSoup__. The _Wikipedia_ page contains more tables than the required __Toronto Postal Code table__, but only that one will be loaded. The __table__ tag will be used to pull the tables out of the parsed data, and __index 0__ will be used to select the first table, which is the one required for this assignment.

In [5]:
# parse the raw data
par = BeautifulSoup(raw.text, 'html.parser')

# load only the first table from parsed data [tag 'table' / index 0]
# find_all is the current name of the method; findAll is a legacy alias
par_table = par.find_all('table')[0]
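
Selecting the table by position works as long as the page layout stays the same. As a hedged alternative, a table can also be selected by one of its attributes. The sketch below runs on a small inline HTML snippet rather than on the live page; the class names __nav__ and __codes__ and the sample rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# tiny stand-in for a page with two tables [hypothetical markup]
html = """
<table class="nav"><tr><td>menu</td></tr></table>
<table class="codes">
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# select by position [same idea as index 0 in the notebook]
first_table = soup.find_all('table')[0]

# select by attribute [less dependent on the order of the tables]
codes_table = soup.find('table', class_='codes')

# get_text(strip=True) drops the surrounding whitespace and newlines
sample_headers = [th.get_text(strip=True) for th in codes_table.find_all('th')]
print(sample_headers)
```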

Check the number of columns and their headers. The headers will be used to name the dataframe columns; the __th__ tag will be used to select them when running a loop.

In [6]:
# print the number of columns and the columns' headers
print('The source table has {} columns'.format(len(par_table.find_all('th'))))
par_table.find_all('th')

The source table has 3 columns


[<th>Postal Code
 </th>,
 <th>Borough
 </th>,
 <th>Neighborhood
 </th>]

Store the column headers in a list.

In [7]:
# define an empty list object
headers = list()

# run a loop to append the headers to the list [tag 'th']
for h in par_table.find_all('th'):
    headers.append(h.get_text())

# check the headers
headers

['Postal Code\n', 'Borough\n', 'Neighborhood\n']

Unfortunately, *get_text()* also returned unwanted characters, such as __'\n'__. They will be removed, along with the blank spaces between words, so that the headers can be used as dataframe column names.

In [8]:
# run a loop to remove the '\n' from headers
for i, h in enumerate(headers):
    headers[i] = h.replace('\n','')

# run a loop to remove the blank spaces between words
for i, h in enumerate(headers):
    headers[i] = h.replace(' ','')

# check the clean headers
headers

['PostalCode', 'Borough', 'Neighborhood']
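
The two loops can also be written as a single list comprehension. This is a minimal sketch that uses the raw header list shown earlier as input; *raw_headers* and *clean_headers* are illustrative names that are not part of the notebook.

```python
# raw headers as returned by get_text() in the notebook
raw_headers = ['Postal Code\n', 'Borough\n', 'Neighborhood\n']

# strip() removes the trailing '\n', replace() removes the inner spaces
clean_headers = [h.strip().replace(' ', '') for h in raw_headers]

print(clean_headers)
```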

Create an empty dataframe using the table headers as column names.

In [9]:
pcode = pd.DataFrame(columns=headers)
pcode

Unnamed: 0,PostalCode,Borough,Neighborhood


Before populating the dataframe with the postal code data, check how many rows the table contains, excluding the header. The __tr__ tag will be used for this.

In [10]:
# print the number of rows the table contains
print('The source table has {} rows'.format(len(par_table.find_all('tr'))-1))

The source table has 180 rows


The dataframe can be populated by running a nested loop. The first level iterates over the rows, identified by the __tr__ tag; for each row, the second level iterates over the columns, identified by the __td__ tag. The cell values of each row are collected in a temporary list and, once the row has been read, the list is added to the data for the dataframe. If the Borough is not assigned, the complete row is ignored.

In [11]:
# collect the valid rows here and build the dataframe once at the end
# [DataFrame.append was removed in pandas 2.0]
rows = list()

# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # skip the first row [headers]
    if i > 0:
        # create an empty list
        d = list()
        
        # run a loop by column for the current row [tag 'td']
        for column in row.find_all('td'):
            # append the text of current cell to the list, already removing the '\n'
            d.append(column.get_text().replace('\n',''))

        # if Borough is not 'not assigned' then keep the row
        if d[1].lower()!='not assigned':
            rows.append(d)

# store the collected rows into the dataframe
pcode = pd.DataFrame(rows, columns=pcode.columns)

# inform when it is finished
print('Dataframe populated.')

Dataframe populated.


Check the first 10 observations in the dataframe.

In [12]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Quick summary of the data.

In [13]:
print('There are {} unique postal codes.'.format(len(pcode['PostalCode'].unique())))
print('There are {} unique boroughs.'.format(len(pcode['Borough'].unique())))

There are 103 unique postal codes.
There are 10 unique boroughs.


Check how many observations the dataframe contains.

In [15]:
pcode.shape

(103, 3)

This completes __Task 1__.

---

### Task 2 - Include the coordinates for each neighborhood in the dataframe

Import the required library. For this task, __pgeocode__ will be used to get the coordinates (latitude and longitude).

In [16]:
import pgeocode

Get the coordinates for each neighborhood in the Toronto Postal Code dataframe. This will be accomplished using __query_postal_code__ from __pgeocode__, which only requires a list of target postal codes as input. Its output is a pandas DataFrame containing, among other fields, the latitude and longitude.

In [17]:
# create the pgeocode lookup object for Canada [country code 'ca']
geol = pgeocode.Nominatim('ca')

# get the geo data
loct = geol.query_postal_code(pcode['PostalCode'].tolist())

# inform when it is finished
print('Coordinates loaded.')

Coordinates loaded.


Check the first 10 returned observations.

In [18]:
loct.head(10)

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
0,M3A,CA,North York (York Heights / Victoria Village / ...,Ontario,ON,North York,,,,43.7545,-79.33,1.0
1,M4A,CA,North York (Sweeney Park / Wigmore Park),Ontario,ON,,,,,43.7276,-79.3148,6.0
2,M5A,CA,Downtown Toronto (Regent Park / Port of Toronto),Ontario,ON,Toronto,8133394.0,,,43.6555,-79.3626,6.0
3,M6A,CA,North York (Lawrence Manor / Lawrence Heights),Ontario,ON,North York,,,,43.7223,-79.4504,6.0
4,M7A,CA,Queen's Park Ontario Provincial Government,Ontario,ON,,,,,43.6641,-79.3889,
5,M9A,CA,Etobicoke (Islington Avenue),Ontario,ON,Etobicoke,,,,43.6662,-79.5282,6.0
6,M1B,CA,Scarborough (Malvern / Rouge River),Ontario,ON,Scarborough,,,,43.8113,-79.193,6.0
7,M3B,CA,Don Mills North,Ontario,ON,Don Mills,,,,43.745,-79.359,4.0
8,M4B,CA,East York (Parkview Hill / Woodbine Gardens),Ontario,ON,East York,,,,43.7063,-79.3094,6.0
9,M5B,CA,Downtown Toronto (Ryerson),Ontario,ON,Toronto,8133394.0,,,43.6572,-79.3783,6.0


Update the Toronto Postal Code dataframe with the coordinates. This is done by simply adding the two returned columns, latitude and longitude, to the end of the Toronto Postal Code dataframe. Both dataframes have the same default index and row order, so the values line up row by row.

In [19]:
# add the two returned columns to the dataframe
pcode['Latitude'] = loct.latitude
pcode['Longitude'] = loct.longitude

print('Coordinates added to the dataframe.')

Coordinates added to the dataframe.
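
Assigning the columns directly relies on both dataframes having the same index and row order. If that cannot be guaranteed, merging on the postal code is a safer option. The sketch below uses two small made-up frames, *codes* and *geo*, in place of *pcode* and *loct*; only the column names mirror the notebook.

```python
import pandas as pd

# stand-in for the postal code dataframe
codes = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})

# stand-in for the lookup result, deliberately in a different row order
geo = pd.DataFrame({
    'postal_code': ['M5A', 'M3A', 'M4A'],
    'latitude': [43.6555, 43.7545, 43.7276],
    'longitude': [-79.3626, -79.33, -79.3148],
})

# rename the lookup columns, then join on the postal code
geo = geo.rename(columns={'postal_code': 'PostalCode',
                          'latitude': 'Latitude',
                          'longitude': 'Longitude'})
merged = codes.merge(geo, on='PostalCode', how='left')

print(merged)
```

A left merge keeps every row of the postal code frame, so codes without coordinates would still show up with missing values.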


Check the first 10 observations in the dataframe.

In [20]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.6662,-79.5282
6,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193
7,M3B,North York,Don Mills,43.745,-79.359
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.7063,-79.3094
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783


Check whether any coordinates are missing.

In [21]:
pcode[np.isnan(pcode['Latitude'])]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,M7R,Mississauga,Canada Post Gateway Processing Centre,,


For postal code __M7R__, __pgeocode__ could not find the coordinates. __M7R__ is the postal code of the __Canada Post Gateway Processing Centre__; it will be dropped from the postal code dataframe.

In [23]:
didx = pcode[np.isnan(pcode['Latitude'])]['Neighborhood'].index.tolist()
pcode = pcode.drop(index=didx)
pcode.shape

(102, 5)
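
The same filtering can also be expressed with *dropna*. This is a minimal sketch on a made-up three-row frame; *sample* is a hypothetical stand-in for *pcode*.

```python
import numpy as np
import pandas as pd

# small frame with one row that has no coordinates
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M7R', 'M4A'],
    'Latitude': [43.7545, np.nan, 43.7276],
    'Longitude': [-79.33, np.nan, -79.3148],
})

# drop every row where either coordinate is missing
cleaned = sample.dropna(subset=['Latitude', 'Longitude'])

print(cleaned.shape)
```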

This completes __Task 2__.

---

### Task 3 - Explore, cluster and display the neighborhood clusters on a map

Import the required libraries. __Folium__ will be used to create the maps, and __Matplotlib Colors__ and __Pyplot__ will be used to handle the colors.

In [24]:
import folium
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

A __color table by Borough__ will be created, so that each Neighborhood can be plotted on the Toronto map in the color of its Borough. The color list will be created using __Pyplot__, which returns each color as a list of four values (RGBA format). These will be converted to hexadecimal (HEX format) using __Matplotlib Colors__, so that the colors can be used in the Folium map.

In [25]:
# create a list of unique boroughs
boroughs = pcode['Borough'].unique()

# create a color list by borough [RGBA format]
# tab10 is the chosen color map 
trgb = plt.cm.tab10(np.linspace(0, 1, len(boroughs)))

# convert the color from RGBA to HEX
# RGBA format contains 4 positions, to convert RGB to HEX only the first 3 positions are taken
thex = list()
for i in range(len(trgb)):
    thex.append(mcolors.rgb2hex(trgb[i][:3]))

# create a temporary list to connect boroughs and colors
tmplist = list()
for b, c in zip(boroughs, thex):
    tmplist.append([b,c])

# create a dataframe from the temporary list
colortable = pd.DataFrame(columns=['Borough','Color'], data=tmplist)

# show the borough color table
colortable

Unnamed: 0,Borough,Color
0,North York,#1f77b4
1,Downtown Toronto,#ff7f0e
2,Etobicoke,#2ca02c
3,Scarborough,#d62728
4,East York,#8c564b
5,York,#e377c2
6,East Toronto,#7f7f7f
7,West Toronto,#bcbd22
8,Central Toronto,#17becf
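
For reference, the RGBA to HEX conversion amounts to scaling each of the first three channels from the 0-1 range to 0-255 and writing it as two hexadecimal digits. The helper below is a plain-Python sketch of that idea, not Matplotlib's own implementation; *rgba_to_hex* is a hypothetical name.

```python
def rgba_to_hex(rgba):
    """Convert an (R, G, B, A) sequence of floats in [0, 1] to '#rrggbb'."""
    # the alpha channel [4th position] is ignored
    r, g, b = (round(v * 255) for v in rgba[:3])
    return '#{:02x}{:02x}{:02x}'.format(r, g, b)

# first color of the table above, written as RGBA floats
blue = (31 / 255, 119 / 255, 180 / 255, 1.0)
print(rgba_to_hex(blue))
```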


Merge the Toronto Postal Code dataframe with the Borough Color Table into a new dataframe, using the Borough feature as the key.

In [26]:
# merge postal code and color table data frames
dfmap = pd.merge(pcode, colortable, on='Borough')

# show the first 10 observations
dfmap.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
0,M3A,North York,Parkwoods,43.7545,-79.33,#1f77b4
1,M4A,North York,Victoria Village,43.7276,-79.3148,#1f77b4
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504,#1f77b4
3,M3B,North York,Don Mills,43.745,-79.359,#1f77b4
4,M6B,North York,Glencairn,43.7081,-79.4479,#1f77b4
5,M3C,North York,Don Mills,43.7334,-79.3329,#1f77b4
6,M2H,North York,Hillcrest Village,43.8015,-79.3577,#1f77b4
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.7535,-79.4472,#1f77b4
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.7801,-79.3479,#1f77b4
9,M3J,North York,"Northwood Park, York University",43.7694,-79.4921,#1f77b4
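
When only one column is being attached, *map* with a dictionary is an alternative to the merge that keeps the original row order. This is a hedged sketch with made-up rows; *sample* and *color_by_borough* are illustrative names.

```python
import pandas as pd

# small stand-in for the postal code dataframe
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M4A'],
    'Borough': ['North York', 'Downtown Toronto', 'North York'],
})

# borough -> color lookup [values taken from the color table above]
color_by_borough = {'North York': '#1f77b4', 'Downtown Toronto': '#ff7f0e'}

# map() looks up each borough; unknown boroughs would become NaN
sample['Color'] = sample['Borough'].map(color_by_borough)

print(sample)
```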


Create a Toronto map showing the neighborhoods. Initially the map was centered on the Toronto coordinates, but the neighborhoods in the top right corner were out of view, while an area without any neighborhoods was visible at the bottom. For this reason, the map is centered using the coordinates of the neighborhoods in the Toronto Postal Code dataframe, so that all of them are visible.

In [27]:
# calculate mid point for latitude and longitude
tlat = round((min(pcode['Latitude'])+max(pcode['Latitude']))/2, 4)
tlng = round((min(pcode['Longitude'])+max(pcode['Longitude']))/2, 4)

# print calculated coordinates
print('Latitude: {} , Longitude: {}'.format(tlat, tlng))

Latitude: 43.7181 , Longitude: -79.3736
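
Note that the center used here is the midpoint of the bounding box around the points, not their mean. A small plain-Python sketch of the same calculation follows; *bbox_center* is a hypothetical helper and the sample coordinates are made up.

```python
def bbox_center(lats, lngs, ndigits=4):
    """Return the midpoint of the bounding box around the given points."""
    lat = round((min(lats) + max(lats)) / 2, ndigits)
    lng = round((min(lngs) + max(lngs)) / 2, ndigits)
    return lat, lng

# three made-up points
center = bbox_center([43.60, 43.70, 43.80], [-79.50, -79.30, -79.20])
print(center)
```

Because only the extremes matter, a dense group of points on one side does not pull the center toward it.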


Create the Toronto map using the calculated coordinates. The __Folium__ library will be used to create it.

In [28]:
toronto_map = folium.Map(location=[tlat, tlng], zoom_start=11)

Create the neighborhood points on the map, each with its borough-specific color and a label.

In [29]:
# run a loop through the dataframe creating the points (circle marks)
for lat, lng, borough, neighborhood, cl in zip(dfmap['Latitude'], dfmap['Longitude'], dfmap['Borough'], dfmap['Neighborhood'], dfmap['Color']):
    # create the label
    label = '{}; Borough: {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    
    # create the circle mark
    folium.CircleMarker([lat, lng],
                        radius = 5,
                        popup = label,
                        color = cl,
                        fill = True,
                        fill_color = cl,
                        fill_opacity = 0.3,
                        parse_html = False).add_to(toronto_map)

Show the map.

In [37]:
toronto_map

Looking at the map, with each borough in its own color, it is easy to notice a gray circle among the red ones. The gray color represents the borough __East Toronto__. Check what the Toronto Postal Code dataframe contains for it.

In [31]:
# check observations for East Toronto
dfmap[dfmap['Borough']=='East Toronto']

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
82,M4E,East Toronto,The Beaches,43.6784,-79.2941,#7f7f7f
83,M4K,East Toronto,"The Danforth West, Riverdale",43.6803,-79.3538,#7f7f7f
84,M4L,East Toronto,"India Bazaar, The Beaches West",43.6693,-79.3155,#7f7f7f
85,M4M,East Toronto,Studio District,43.6561,-79.3406,#7f7f7f
86,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.7804,-79.2505,#7f7f7f


Judging by the postal code pattern, __M7Y__ does not seem to fit the East Toronto borough, and its coordinates are also some distance away from the others, as the map shows. Check the borough represented by the red color, __Scarborough__.

In [32]:
# check observations for Scarborough
dfmap[dfmap['Borough']=='Scarborough']

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
55,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193,#d62728
56,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564,#d62728
57,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7678,-79.1866,#d62728
58,M1G,Scarborough,Woburn,43.7712,-79.2144,#d62728
59,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,#d62728
60,M1J,Scarborough,Scarborough Village,43.7464,-79.2323,#d62728
61,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.7298,-79.2639,#d62728
62,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.7122,-79.2843,#d62728
63,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.7247,-79.2312,#d62728
64,M1N,Scarborough,"Birch Cliff, Cliffside West",43.6952,-79.2646,#d62728


Judging by the postal code pattern, __M7Y__ does not fit Scarborough either. This is quite strange; a deeper look into the Toronto postal code methodology might be needed to understand it, but that is outside the scope of this assignment, so it will be left as it is.
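
To put a number on how far __M7Y__ sits from the rest of East Toronto, the great-circle distance to the mean position of the other four East Toronto rows can be computed. This is a sketch using the haversine formula with a mean Earth radius of 6371 km, which is an assumption of the example; *haversine_km* is a hypothetical helper, and the coordinates are copied from the East Toronto table above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# East Toronto rows other than M7Y [M4E, M4K, M4L, M4M]
others = [(43.6784, -79.2941), (43.6803, -79.3538),
          (43.6693, -79.3155), (43.6561, -79.3406)]
m7y = (43.7804, -79.2505)

# mean position of the other four rows
mean_lat = sum(p[0] for p in others) / len(others)
mean_lng = sum(p[1] for p in others) / len(others)

distance = haversine_km(mean_lat, mean_lng, m7y[0], m7y[1])
print('M7Y is about {:.1f} km from the other East Toronto points'.format(distance))
```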

For the __clustering__ assignment, the borough __Downtown Toronto__ will be used. Create a new dataframe containing only Downtown Toronto.

In [33]:
# copy the observations from Downtown Toronto to a new dataframe
dtdata = dfmap[dfmap['Borough']=='Downtown Toronto']

# reset the index
dtdata = dtdata.reset_index(drop=True)

# print how many observations there are
print('There are {} neighborhoods in Downtown Toronto.'.format(dtdata.shape[0]))

# print the first 10 observations
dtdata.head(10)

There are 19 neighborhoods in Downtown Toronto.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Color
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626,#ff7f0e
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6641,-79.3889,#ff7f0e
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3783,#ff7f0e
3,M5C,Downtown Toronto,St. James Town,43.6513,-79.3756,#ff7f0e
4,M5E,Downtown Toronto,Berczy Park,43.6456,-79.3754,#ff7f0e
5,M5G,Downtown Toronto,Central Bay Street,43.6564,-79.386,#ff7f0e
6,M6G,Downtown Toronto,Christie,43.6683,-79.4205,#ff7f0e
7,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6496,-79.3833,#ff7f0e
8,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.623,-79.3936,#ff7f0e
9,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.6469,-79.3823,#ff7f0e


Take a closer look at Downtown Toronto and its neighborhoods. Calculate the coordinates to center the map using the neighborhood coordinates.

In [34]:
# calculate mid point for latitude and longitude
tlat = round((min(dtdata['Latitude'])+max(dtdata['Latitude']))/2, 4)
tlng = round((min(dtdata['Longitude'])+max(dtdata['Longitude']))/2, 4)

# print calculated coordinates
print('Latitude: {} , Longitude: {}'.format(tlat, tlng))

Latitude: 43.6529 , Longitude: -79.3915


Create the Downtown Toronto map using the calculated coordinates and add the neighborhood points to the map.

In [35]:
# create the map
dt_map = folium.Map(location=[tlat, tlng], zoom_start=13)

# run a loop through the dataframe creating the points (circle marks)
for lat, lng, neighborhood, cl in zip(dtdata['Latitude'], dtdata['Longitude'], dtdata['Neighborhood'], dtdata['Color']):
    # create the label
    label = folium.Popup(neighborhood, parse_html=True)
    
    # create the circle mark
    folium.CircleMarker([lat, lng],
                        radius = 5,
                        popup = label,
                        color = cl,
                        fill = True,
                        fill_color = cl,
                        fill_opacity = 0.3,
                        parse_html = False).add_to(dt_map)

# show the map
dt_map

Explore Downtown Toronto. The __Foursquare API__ will be used to explore the borough. To send requests to the Foursquare API, a __client id__ and a __client secret__ are required; they will be defined as sensitive code so that they remain secret. The version of the API should also be defined, but it is not secret.

In [39]:
# create a variable for Foursquare API version
VERSION = '20180605' # handout: exclude
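
One common way to keep the credentials out of the notebook is to read them from environment variables. The sketch below assumes the variable names __FOURSQUARE_CLIENT_ID__ and __FOURSQUARE_CLIENT_SECRET__, which are made up for this example, as is the helper *load_credentials*.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    # hypothetical variable names, chosen for this example only
    names = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env[names[0]], env[names[1]]

# demonstration with a plain dict instead of the real environment
client_id, client_secret = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'my-id',
    'FOURSQUARE_CLIENT_SECRET': 'my-secret',
})
print('Credentials loaded, client id length:', len(client_id))
```

Calling *load_credentials()* without arguments would read from the real environment and fail early if a variable is not set.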

Still working on it.