## Segmenting and Clustering Neighborhoods in Toronto

### Part 1

Load the libraries required for this task.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import requests

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data in the table of postal codes, and transform that data into a pandas dataframe like the one shown below:

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

Let's now define some helper functions to automate the extraction of HTML elements.

In [3]:
def find_all(text, sub):
    """ find all occurrences of substring in larger string 
    and return their indices relative to the text string
    
    Key arguments
    -------------
    text:      text string
    sub:       substring to be found in text
    
    Return
    -------
    indices:   indices of the found occurrences relative to text
    """
    occurrence_indices = []
    index = 0
    while index < len(text):
        index = text.find(sub, index)
        if index == -1:
            break
        occurrence_indices.append(index)
        index += 1  # step past this match; overlapping matches are still found
    return occurrence_indices

def getValue(text, b_element, e_element):
    """ get the text snippet between a begin marker and an end marker
    Key arguments:
    -------------
    text:       text string
    b_element:  begin html element <element>
    e_element:  end html element </element>
    
    return:
    ------
    text snippet if element found in text, otherwise return original text
    """
    start = text.find(b_element)
    if start == -1:
        return text
    start += len(b_element)
    # search for the end marker only after the begin marker
    end = text.find(e_element, start)
    return text[start:end] if end != -1 else text[start:]
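
To see what the two helpers return, the sketch below applies them to a small, made-up HTML fragment. The fragment is hypothetical, and the helpers are re-defined in compact form only so that the example runs on its own.

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub in text."""
    indices, index = [], text.find(sub)
    while index != -1:
        indices.append(index)
        index = text.find(sub, index + 1)
    return indices


def getValue(text, b_element, e_element):
    """Return the text between b_element and e_element, or text if b_element is absent."""
    start = text.find(b_element)
    if start == -1:
        return text
    start += len(b_element)
    return text[start:text.find(e_element, start)]


# Hypothetical fragment, shaped like one table row of the page
row = '<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td></tr>'

cell_starts = find_all(row, "<td>")
cell_ends = find_all(row, "</td>")
# Slice out the content between each <td> and its matching </td>
cells = [row[b + len("<td>"):e] for b, e in zip(cell_starts, cell_ends)]
borough = getValue(cells[1], '">', "</a>")
print(cell_starts, cells[0], borough)
```

A cell without a hyperlink contains no '">' marker, so getValue returns it unchanged.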
    

The following function extracts the data from the table on the Wikipedia page.

In [4]:
def getData(text):
    """ get Data from html code
    Key argument:
    ------------
    text:      text from which the data will be extracted (table text)
    
    return:
    -------
    pd.DataFrame with the columns PostalCode, Borough and Neighborhood
    """
    # pair each opening <tr> with its closing </tr>; [1:] skips the header row
    rows = zip(find_all(text, "<tr>")[1:], find_all(text, "</tr>")[1:])
    data = []
    for row_b, row_e in rows:
        row_data = text[row_b:row_e]
        sub_indices = zip(find_all(row_data, "<td>"), find_all(row_data, "</td>"))
        # keep only the cell content between <td> and </td>
        data.append([row_data[sub_b + len("<td>"):sub_e] for sub_b, sub_e in sub_indices])
    return pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
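
String searching works as long as the tags appear exactly as `<tr>` and `<td>` without attributes. As a possible alternative (not the approach used in this notebook), the sketch below collects table cells with the standard library's `html.parser`. The class name `TableRowParser` and the sample table are hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._cell = None     # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows without <td> cells (the header here) are skipped
                self.rows.append(self._row)
            self._row = None


# Hypothetical two-row table, shaped like the one on the page
sample = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
<td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample)
print(parser.rows)
```

Because the parser only keeps text nodes, the hyperlink markup and trailing newlines are removed in the same pass, and the resulting list of rows can be passed to pd.DataFrame in the same way as in getData.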

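Cut the HTML of the postal code table out of the page: it starts at the first sortable wikitable and ends where the next (multicol) table begins.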

In [5]:
table_start = '<table class="wikitable sortable">'
table_end = '<table class="multicol" role="presentation" style="border-collapse: collapse; padding: 0; border: 0; background:transparent; width:100%;">'
tabledata = page[page.find(table_start) + len(table_start):page.find(table_end)]

Build the dataframe from the table HTML.

In [6]:
postcode_can = getData(tabledata)

Some clean-up is needed, since some Borough and Neighborhood cells contain hyperlinks: keep only the link text and remove newlines.

In [7]:
postcode_can["Borough"] = [getValue(iString, '">', "</a>") for iString in postcode_can["Borough"]]
postcode_can["Neighborhood"] = [getValue(iString, '">', "</a>").replace("\n", "") for iString in postcode_can["Neighborhood"]]
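
As a hedged alternative to slicing between '">' and "</a>", the sketch below removes anything that looks like an HTML tag with a regular expression. The helper name `strip_tags` and the sample cells are hypothetical, and a regular expression is only a rough tool for HTML, adequate here because the cells are short and simple.

```python
import re

# Matches "<", then any characters except ">", then ">"
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(cell):
    """Remove anything that looks like an HTML tag and trim whitespace."""
    return TAG_PATTERN.sub("", cell).strip()


# Hypothetical cell contents: one hyperlink, one plain text with a newline
cells = ['<a href="/wiki/Scarborough" title="Scarborough">Scarborough</a>', "Not assigned\n"]
cleaned = [strip_tags(c) for c in cells]
print(cleaned)
```

Unlike getValue, this keeps all text in a cell, so a cell with a link followed by plain text would retain both parts.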

Only process the cells that have an assigned Borough. Ignore cells with a borough that is Not assigned.

In [8]:
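# .copy() makes the filtered frame independent, so later column assignments do not raise SettingWithCopyWarning
postcode_can = postcode_can[postcode_can["Borough"] != "Not assigned"].copy()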

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [9]:
postcode_can['Neighborhood'] = postcode_can.groupby(['PostalCode'])['Neighborhood'].transform(lambda x: ', '.join(x))
postcode_can = postcode_can[['PostalCode', "Borough", "Neighborhood"]].drop_duplicates()
postcode_can.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 6 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge, Malvern |
| 14 | M3B | North York | Don Mills North |
| 15 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 17 | M5B | Downtown Toronto | Ryerson, Garden District |
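
The combination of transform and drop_duplicates can also be written as a single aggregation. The following is a minimal sketch on a small, hypothetical frame with the same column names; it is an alternative formulation, not the code used above.

```python
import pandas as pd

# Hypothetical rows in the shape produced by the scraping step
df = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# One row per (PostalCode, Borough); sort=False keeps groups in order of first appearance
combined = (
    df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Grouping on both PostalCode and Borough keeps the borough column in the result without a separate de-duplication step. The original row labels are replaced by a fresh 0..n-1 index.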


In the last cell of this part, use the .shape attribute to print the number of rows of the dataframe.

In [10]:
postcode_can.shape

(103, 3)

### Part 2

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each postal code in order to use the Foursquare location data. The coordinates are read from the file Geospatial_Coordinates.csv and joined to the dataframe on the postal code.

In [11]:
def getGeomSpatialData():
    data = pd.read_csv("Geospatial_Coordinates.csv")
    return data

In [12]:
geomspatial = getGeomSpatialData()
geomspatial.set_index("Postal Code", inplace = True)
postcode_can.set_index("PostalCode", inplace = True)
postcode_can.join(geomspatial)

| PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| M5A | Downtown Toronto | Harbourfront, Regent Park | 43.654260 | -79.360636 |
| M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| M7A | Queen's Park | Not assigned | 43.662301 | -79.389494 |
| M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |
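
The join above is displayed but not stored. The sketch below shows, on two small stand-in frames (hypothetical, using two rows from the output above), how the result might be kept in a variable and checked for postal codes that found no coordinates.

```python
import pandas as pd

# Hypothetical stand-ins for postcode_can and the coordinates file
postcodes = pd.DataFrame(
    {"Borough": ["North York", "Scarborough"], "Neighborhood": ["Parkwoods", "Rouge, Malvern"]},
    index=pd.Index(["M3A", "M1B"], name="PostalCode"),
)
coordinates = pd.DataFrame(
    {"Latitude": [43.806686, 43.753259], "Longitude": [-79.194353, -79.329656]},
    index=pd.Index(["M1B", "M3A"], name="Postal Code"),
)

# DataFrame.join aligns on the index and defaults to a left join,
# so the row order of `postcodes` is preserved
located = postcodes.join(coordinates)

# Postal codes without a match would show up as NaN coordinates
missing = located[located["Latitude"].isna()]
print(located)
print(len(missing))
```

The index names of the two frames differ ("PostalCode" and "Postal Code"), which does not matter for the join: rows are matched on index values, not index names.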
