# IBM Applied Data Science Capstone
## Peer-graded Assignment
## Segmenting and Clustering Neighborhoods in Toronto
### Sidclay da Silva
### June 2020
---

### Introduction

This notebook contains the Week 3 peer-graded assignment for the IBM Applied Data Science Capstone course on Coursera, which requires exploring, segmenting, and clustering the neighborhoods in the city of Toronto. In short, the assignment consists of three main tasks:

1. Build a dataframe with the Toronto Postal Codes from a web page.
1. Include the coordinates for each neighborhood in the dataframe.
1. Explore, cluster and display the neighborhood clusters on a map.

Most of the code could be grouped into fewer cells to make the notebook shorter, but the objective is to clarify each step, so the code has been broken up with Markdown explanations.
The tool chosen for this assignment was a Jupyter Notebook running a Python 3.6 kernel on IBM Watson Studio. The notebook will be available in a GitHub repository so that peers can grade it.

---

### Task 1 - Build a dataframe with the Toronto Postal Codes from a web page

Import the required libraries. For this task the __Requests__ library will be used to send the web request, and __BeautifulSoup__ to parse the data from the web.

In [1]:
import pandas as pd           # perform data analysis
import numpy as np            # handle data as vector
import requests               # make web request
from bs4 import BeautifulSoup # parse content from web

Send a request to the provided URL and check whether the data was successfully loaded.

In [2]:
# send a request to the URL and store the response
raw = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# check if data was loaded [status 200 means success]
if raw:
    print('Data loaded, status', raw.status_code)
else:
    print('Error loading data', raw.status_code)

Data loaded, status 200
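
The check above relies on the fact that a __Requests__ response object is truthy when its status code is below 400. The following is a minimal sketch of that behavior. The responses are built by hand, an assumption made only so the example runs without network access, and `describe` is a hypothetical helper name.

```python
import requests


def describe(response):
    """Return a status message using the same truthiness test as the cell above."""
    if response:
        return 'Data loaded, status {}'.format(response.status_code)
    return 'Error loading data {}'.format(response.status_code)


# hand-built responses [assumption: stand-ins for real requests.get() results]
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(describe(ok))       # Data loaded, status 200
print(describe(missing))  # Error loading data 404
```

In a real request it may also be worth passing a `timeout` argument to `requests.get` so that the cell does not wait indefinitely on a slow connection.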


Parse the raw data from the web using __BeautifulSoup__. The provided _Wikipedia_ page contains more tables than the required __Toronto Postal Code table__, but only that table will be loaded. The __tag table__ will be used to select only the tables from the parsed data, and __index 0__ will be used to select the first table, which is the one required for this assignment.

In [3]:
# parse the raw data
par = BeautifulSoup(raw.text, 'html.parser')

# load only the first table from parsed data [tag 'table' / index 0]
par_table = par.find_all('table')[0]
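
Selecting a table by position can be illustrated on a small inline HTML snippet. This is only a sketch: the HTML string below is made up for the example and is not the real page content. Because the layout of a live page can change, it may be worth confirming that the table at index 0 is still the expected one before relying on it.

```python
from bs4 import BeautifulSoup

# made-up HTML with two tables [assumption: stands in for the downloaded page]
html = """
<table><tr><th>Postal Code</th></tr><tr><td>M3A</td></tr></table>
<table><tr><th>Other</th></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every <table> in document order; index 0 is the first one
tables = soup.find_all('table')
first = tables[0]

print(len(tables))                   # 2
print(first.find('th').get_text())   # Postal Code
```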

Check the number of columns and their headers. The headers will be used to name the dataframe columns; the __tag th__ will be used to select them when running a loop.

In [4]:
# print the number of columns and the columns' headers
print('The source table has {} columns'.format(len(par_table.find_all('th'))))
par_table.find_all('th')

The source table has 3 columns


[<th>Postal Code
 </th>, <th>Borough
 </th>, <th>Neighborhood
 </th>]

Store the columns' headers in a list.

In [5]:
# define an empty list object
headers = list()

# run a loop to append the headers to the list [tag 'th']
for h in par_table.find_all('th'):
    headers.append(h.get_text())

# check the headers
headers

['Postal Code\n', 'Borough\n', 'Neighborhood\n']

Unfortunately, *get_text()* also returned unwanted characters, such as __'\n'__. They will be removed, along with the blank spaces between words, so that the headers can be used as dataframe column names.

In [6]:
# run a loop to remove the '\n' from headers
for i, h in enumerate(headers):
    headers[i] = h.replace('\n','')

# run a loop to remove the blank spaces between words
for i, h in enumerate(headers):
    headers[i] = h.replace(' ','')

# check the clean headers
headers

['PostalCode', 'Borough', 'Neighborhood']
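
The same cleaning can also be done in one pass. The sketch below is an alternative, not the approach used in this notebook: it splits each header on any whitespace (spaces and newlines alike) and joins the pieces back together. The names `raw_headers` and `clean_headers` are chosen only for this example.

```python
# headers as returned by get_text() in the cell above
raw_headers = ['Postal Code\n', 'Borough\n', 'Neighborhood\n']

# str.split() with no argument splits on any run of whitespace,
# so joining the pieces removes both the '\n' and the blank spaces
clean_headers = [''.join(h.split()) for h in raw_headers]

print(clean_headers)  # ['PostalCode', 'Borough', 'Neighborhood']
```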

Create an empty dataframe using the table headers as column names.

In [7]:
# create an empty dataframe with the clean headers as column names
pcode = pd.DataFrame(columns=headers)
pcode

Unnamed: 0,PostalCode,Borough,Neighborhood


Before populating the dataframe with the postal code data, first check how many rows the table contains, excluding the header. The __tag tr__ will be used for this.

In [8]:
# print the number of rows the table contains
print('The source table has {} rows'.format(len(par_table.find_all('tr'))-1))

The source table has 180 rows


The dataframe can be populated by running a nested loop. The first level runs by row, identified by the __tag tr__; for each row, the second level runs by column, identified by the __tag td__. The data is stored temporarily in a list and, after each row has been read, the list is stored into the dataframe. If the Borough is not assigned, the complete row is ignored.

In [9]:
# run a loop by row [tag 'tr']
for i, row in enumerate(par_table.find_all('tr')):
    # skip the first row [headers]
    if i > 0:
        # create an empty list
        d = list()
        
        # run a loop by column for the current row [tag 'td']
        for column in row.find_all('td'):
            # append the text of current cell to the list, already removing the '\n'
            d.append(column.get_text().replace('\n',''))

        # if Borough is not 'not assigned' then store the list into the dataframe
        # [.loc with the next index adds the row; DataFrame.append was removed in pandas 2.0]
        if d[1].lower()!='not assigned':
            pcode.loc[len(pcode)] = d
            
# inform when it is finished
print('Dataframe populated.')

Dataframe populated.
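
Adding rows one at a time makes the dataframe grow on every iteration. A common alternative, sketched below, is to collect the accepted rows in a plain list and build the dataframe once at the end. The `rows` list contains a few hypothetical sample rows standing in for the parsed table cells, and `sample` is a name used only in this example.

```python
import pandas as pd

headers = ['PostalCode', 'Borough', 'Neighborhood']

# hypothetical sample rows [assumption: stand-ins for the parsed <td> values]
rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]

# keep only the rows whose Borough is assigned
kept = [r for r in rows if r[1].lower() != 'not assigned']

# build the dataframe in a single step
sample = pd.DataFrame(kept, columns=headers)

print(sample.shape)  # (2, 3)
```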


Check the first 10 observations in the dataframe.

In [10]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Quick summary of the data.

In [11]:
print('There are {} unique postal codes and {} unique boroughs.'.format(len(pcode['PostalCode'].unique()), len(pcode['Borough'].unique())))

There are 103 unique postal codes and 10 unique boroughs.


Check how many observations the dataframe contains.

In [12]:
pcode.shape[0]

103
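
Since the number of unique postal codes (103) equals the number of observations (103), each postal code appears exactly once in the dataframe. A minimal sketch of how this can be checked directly follows; it uses a small made-up dataframe named `sample` rather than the real data.

```python
import pandas as pd

# small made-up dataframe [assumption: stands in for the real pcode dataframe]
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})

# nunique() counts the distinct values in a column
n_codes = sample['PostalCode'].nunique()
n_boroughs = sample['Borough'].nunique()

# is_unique is True when no value in the column is repeated
codes_are_unique = sample['PostalCode'].is_unique

print(n_codes, n_boroughs, codes_are_unique)  # 3 2 True
```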

This completes __Task 1__.

---

### Task 2 - Include the coordinates for each neighborhood in the dataframe

Import the required libraries. For this task __GeoPy Nominatim__ will be used to get the coordinates (latitude and longitude).

In [13]:
# import Nominatim
from geopy.geocoders import Nominatim # get coordinates from address

Get the coordinates for each neighborhood in the Toronto Postal Code dataframe. This is accomplished by running a loop through the dataframe using the feature __Neighborhood as search key__. Some observations combine more than one neighborhood in a single string; in this case only the first name is used for the search, which a string split can handle.
The coordinates will be stored temporarily in two separate lists, Latitude and Longitude.

In [14]:
# create empty lists to store the coordinates
latlist = list()
lnglist = list()

# define the user agent [done once, outside the loop]
geol = Nominatim(user_agent='course_assignment')

# run a loop on the postal code dataframe using the Neighborhood as search key
for c in pcode['Neighborhood']:
    # set a variable with '[Neighborhood], Toronto, Canada'
    # split is used to get only the first name when there are more than one combined
    addr = c.split(',',1)[0]+', Toronto, Canada'

    # get the geo data
    loct = geol.geocode(addr)
    
    # check if geocoder has returned any data
    if loct is not None:
        # store the coordinates into the coordinates lists
        latlist.append(loct.latitude)
        lnglist.append(loct.longitude)
    else:
        # store NaN into the coordinates lists
        latlist.append(np.nan)
        lnglist.append(np.nan)

# inform when it is finished
print('Coordinates loaded.')

Coordinates loaded.
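
The lookup logic can be sketched without calling the real service. In the example below, `build_address` and `geocode_all` are hypothetical helper names, and `fake_geocode` is a made-up stand-in for `Nominatim.geocode` with invented coordinates, so that the sketch runs offline. The pause between calls is included because the usage policy of the public Nominatim service asks clients to stay at or below one request per second.

```python
import math
import time
from collections import namedtuple


def build_address(neighborhood, city='Toronto, Canada'):
    """Keep only the first name of a combined neighborhood string."""
    first = neighborhood.split(',', 1)[0].strip()
    return '{}, {}'.format(first, city)


def geocode_all(neighborhoods, geocode, delay=1.0):
    """Look up each address with `geocode`, pausing `delay` seconds between calls.

    `geocode` is any callable returning an object with .latitude and .longitude,
    or None when nothing is found.
    """
    coords = []
    for i, name in enumerate(neighborhoods):
        if i > 0:
            time.sleep(delay)  # throttle consecutive requests
        hit = geocode(build_address(name))
        if hit is None:
            coords.append((math.nan, math.nan))
        else:
            coords.append((hit.latitude, hit.longitude))
    return coords


# made-up geocoder with invented coordinates [assumption: offline stand-in]
Point = namedtuple('Point', ['latitude', 'longitude'])


def fake_geocode(address):
    known = {'Malvern, Toronto, Canada': Point(43.8, -79.2)}
    return known.get(address)


result = geocode_all(['Malvern, Rouge', 'Nowhere'], fake_geocode, delay=0)
print(build_address('Malvern, Rouge'))  # Malvern, Toronto, Canada
```

GeoPy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` that serves a similar throttling purpose when working with a real geocoder.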


Update the Toronto Postal Code dataframe with the coordinates. This is done by simply adding the two lists as new columns at the end of the dataframe.

In [15]:
# add the two lists to the dataframe
pcode['Latitude'] = latlist
pcode['Longitude'] = lnglist

print('Coordinates added to the dataframe.')

Coordinates added to the dataframe.
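
Assigning a list as a column only works when the list has one value per row, and any failed lookups end up as NaN. The sketch below shows both checks on a small made-up dataframe; the name `sample`, the 'Nowhere' entry and the choice of rows are assumptions made only for this example.

```python
import math

import pandas as pd

# small made-up dataframe [assumption: stands in for the real pcode dataframe]
sample = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Victoria Village', 'Nowhere']})

# one coordinate per row; NaN marks a neighborhood that was not found
lat = [43.7588, 43.732658, math.nan]
lng = [-79.320197, -79.311189, math.nan]

# the lists must match the dataframe length before they are assigned
assert len(lat) == len(lng) == len(sample)

sample['Latitude'] = lat
sample['Longitude'] = lng

# count the observations left without coordinates
missing = int(sample['Latitude'].isna().sum())
print(missing)  # 1
```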


Check the first 10 observations in the dataframe.

In [16]:
pcode.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.660706,-79.360457
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.722079,-79.437507
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.659659,-79.39034
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.679484,-79.538909
6,M1B,Scarborough,"Malvern, Rouge",43.809196,-79.221701
7,M3B,North York,Don Mills,43.775347,-79.345944
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.653482,-79.383935
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.6565,-79.377114


This completes __Task 2__.

---

### Task 3 - Explore, cluster and display the neighborhood clusters on a map

Working on it.

---