## Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import requests
import pandas as pd
from lxml import etree
from bs4 import BeautifulSoup as bsoup
import os

In [2]:
# The code was removed by Watson Studio for sharing.

In [3]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [4]:
raw_random_wikipedia_page = requests.get(wikipedia_link)

In [5]:
page = raw_random_wikipedia_page.text
#print(page)

In [6]:
soup = bsoup(page, "lxml")
tablePostalCode = soup.find_all("table")[0]
rows = tablePostalCode.find_all("tr")
listPostalCode = []
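for row in rows:
    tds = row.find_all("td")
    for td in tds:
        # Skip cells whose postal code is marked "Not assigned"
        if (td.p.span.find("i") is not None) and ("Not assigned" in td.p.span.i.text):
            continue
        spanText = td.p.span.text
        code = td.p.b.text
        aTags = td.find_all("a")
        # Skip cells without any link, since the borough cannot be identified
        if not aTags:
            continue
        # Assumption: the first <a> tag holds the borough name
        borough = aTags[0].text
        if len(aTags) > 1:
            # Neighborhoods: text after the first "(" up to the final character (assumed to be ")")
            neighborhood = spanText[spanText.find("(") + 1: len(spanText) - 1]
        else:
            # No separate neighborhood listed: reuse the borough name
            neighborhood = borough
        listPostalCode.append({"PostalCode": code, "Borough": borough, "Neighborhood": neighborhood})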
#listPostalCode

In [7]:
columns = ["PostalCode", "Borough", "Neighborhood"]
dfPostalCode = pd.DataFrame.from_records(data=listPostalCode, columns=columns)
dfPostalCode.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Queen's Park | Queen's Park |


Explain my work and any assumptions I made:
First, I used the etree module from lxml to scrape the postal code table, then pd.read_html to convert the HTML content into a data frame, as in the code below:

root = etree.XML(page)

tableTag = etree.XPath("//table")

postalCodeTorontoElems = tableTag(root)[0]

listDfTorontoPostalCode = pd.read_html(etree.tostring(postalCodeTorontoElems,method='html'))

listCol = list(listDfTorontoPostalCode[0].columns)

print(listDfTorontoPostalCode[0].shape)

dfTorontoPostalCode = listDfTorontoPostalCode[0]

I realized that in some cases it was difficult to tell the borough from the neighborhood, because the cells in the Wikipedia postal code table did not all share the same format/structure, while each cell in the data frame held only text with no spaces. Examples are cell(1,7), cell(3,3), cell(2,8), and cell(3,8).

Thus, I used etree to read and separate each element that I was interested in, but that was not a good approach.

Then I was really lucky to see the note in this assignment. I used the BeautifulSoup library, as the note recommended. It is really easy to use and made the task much quicker!

I assumed that the first a tag represents the borough, and that all text inside "()" represents the list of neighborhoods in that borough.
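
The parenthesis rule can be illustrated on plain strings. This is only a minimal sketch, not the notebook's parser: `split_cell_text` is a hypothetical helper, it takes the borough from the text before "(" rather than from the first a tag, and the sample strings are made up in the shape the rule expects.

```python
def split_cell_text(span_text):
    """Split 'Borough(Neighborhoods)' into (borough, neighborhoods).

    Hypothetical helper: assumes the neighborhoods, if any, sit between
    the first "(" and a trailing ")".
    """
    text = span_text.strip()
    start = text.find("(")
    if start == -1 or not text.endswith(")"):
        # No parenthesized part: fall back to the borough name
        return text, text
    borough = text[:start].strip()
    neighborhoods = text[start + 1:-1].strip()
    return borough, neighborhoods


# Illustrative inputs, not copied from the live page
print(split_cell_text("North York(Parkwoods)"))
print(split_cell_text("Queen's Park"))
```

When a cell has no parentheses, the sketch reuses the borough name as the neighborhood, mirroring the fallback in the parsing cell above.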

In [8]:
dfPostalCode.shape

(101, 3)

## Make calls to the Google Geocoding API to get the latitude and longitude coordinates of the postal codes in the dataframe

In [23]:
# The code was removed by Watson Studio for sharing.

In [24]:
def updateLatLngForTorontoPostalCode(dfPostalCode):
    # URL and API_KEY are not defined in the visible cells; they are presumably set in the hidden cell above.
    # The 'Latitude' and 'Longitude' columns must already exist in dfPostalCode.
    if os.path.isfile('TorontoPostalCode.csv'):
        # Reuse the cached results instead of calling the API again
        print("Load data from csv")
        dfPostalCode = pd.read_csv('TorontoPostalCode.csv')
    else:
        print("Make request to Google Geocoding API")
        for row in range(dfPostalCode.shape[0]):
            #print("At index: {0}, Postal Code: {1}".format(row, dfPostalCode.iloc[row,0]))
            url = URL.format(API_KEY, dfPostalCode.iloc[row,0])
            #print("Get url: {0}".format(url))
            response = requests.get(url).json() # get response
            if response["status"] == "ZERO_RESULTS":
                # Leave the existing coordinates untouched when nothing is found
                print("NO RESULT at row index {0}: row".format(row))
                continue
            else:
                geographical_data = response["results"][0]["geometry"]["location"] # get geographical coordinates
                latitude = geographical_data['lat']
                longitude = geographical_data['lng']
                dfPostalCode.iloc[row, dfPostalCode.columns.get_loc('Latitude')] = latitude
                dfPostalCode.iloc[row, dfPostalCode.columns.get_loc('Longitude')] = longitude
    return dfPostalCode
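
The function above only checks for "ZERO_RESULTS", so any other non-"OK" status (such as "OVER_QUERY_LIMIT" or "REQUEST_DENIED") would reach the `results[0]` lookup with an empty list. A minimal sketch of a more defensive lookup is shown below. `extract_lat_lng` is a hypothetical helper, and the sample payloads are hand-written in the shape the function above reads, not real API responses.

```python
def extract_lat_lng(response):
    """Return (lat, lng) from a Geocoding API JSON response, or None.

    Hypothetical helper: any status other than "OK" (for example
    "ZERO_RESULTS" or "OVER_QUERY_LIMIT") is treated as no result.
    """
    if response.get("status") != "OK" or not response.get("results"):
        return None
    location = response["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample payloads, for illustration only
ok = {"status": "OK",
      "results": [{"geometry": {"location": {"lat": 43.75, "lng": -79.33}}}]}
empty = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(ok))
print(extract_lat_lng(empty))
```

Returning `None` instead of raising lets the calling loop decide whether to skip the row, retry, or log the status.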

In [17]:
dfPostalCode = updateLatLngForTorontoPostalCode(dfPostalCode)

Make request to Google Geocoding API
NO RESULT at row index 1: row
NO RESULT at row index 2: row
NO RESULT at row index 3: row
NO RESULT at row index 4: row
NO RESULT at row index 5: row
NO RESULT at row index 6: row
NO RESULT at row index 7: row
NO RESULT at row index 9: row
NO RESULT at row index 10: row
NO RESULT at row index 11: row
NO RESULT at row index 12: row
NO RESULT at row index 15: row
NO RESULT at row index 16: row
NO RESULT at row index 17: row
NO RESULT at row index 18: row
NO RESULT at row index 19: row
NO RESULT at row index 21: row
NO RESULT at row index 22: row
NO RESULT at row index 23: row
NO RESULT at row index 24: row
NO RESULT at row index 25: row
NO RESULT at row index 26: row
NO RESULT at row index 29: row
NO RESULT at row index 32: row
NO RESULT at row index 33: row
NO RESULT at row index 37: row
NO RESULT at row index 38: row
NO RESULT at row index 41: row
NO RESULT at row index 42: row
NO RESULT at row index 44: row
NO RESULT at row index 48: row
NO RESULT 

In [19]:
dfPostalCode.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 32.363577 | -90.152413 |
| 1 | M4A | North York | Victoria Village | 0.0 | 0.0 |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront | 0.0 | 0.0 |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights | 0.0 | 0.0 |
| 4 | M7A | Queen's Park | Queen's Park | 0.0 | 0.0 |
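
The first row above (32.363577, -90.152413) lies far from Toronto, which is near 43.65° N, 79.38° W, and the remaining rows keep what appears to be a 0.0 placeholder. A simple plausibility check can flag both cases. This is a minimal sketch: `looks_like_toronto` is a hypothetical helper, and the bounding box is a rough assumption, not an official city boundary.

```python
# Rough bounding box around Toronto (assumed values, for illustration only)
LAT_MIN, LAT_MAX = 43.5, 43.9
LNG_MIN, LNG_MAX = -79.7, -79.1


def looks_like_toronto(lat, lng):
    """Return True if the point falls inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX


# Values taken from the output above
print(looks_like_toronto(32.363577, -90.152413))  # False
print(looks_like_toronto(0.0, 0.0))               # False
```

Rows that fail such a check are candidates for a retry with a more specific query, for example one that includes the city and country along with the postal code.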


In [20]:
# Back up the results to reuse later instead of calling the Google Geocoding API again
fnPostCodeCSV = 'TorontoPostalCode.csv'
dfPostalCode.to_csv(fnPostCodeCSV, sep=',', encoding='utf-8', index=False)

In [22]:
# Check csv again
dfTest = pd.read_csv('TorontoPostalCode.csv')
dfTest.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 32.363577 | -90.152413 |
| 1 | M4A | North York | Victoria Village | 0.0 | 0.0 |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront | 0.0 | 0.0 |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights | 0.0 | 0.0 |
| 4 | M7A | Queen's Park | Queen's Park | 0.0 | 0.0 |
