## Segmenting and Clustering Neighborhoods in Toronto

### Part 1: Obtain the Toronto postal codes from Wikipedia

Load the libraries required for this task.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
%matplotlib inline

Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in the table of postal codes, and transform it into a pandas dataframe with the columns PostalCode, Borough, and Neighborhood.

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

Next, define some helper functions to automate the extraction of HTML elements.

In [3]:
def find_all(text, sub):
    """Find all occurrences of a substring in a larger string
    and return their indices relative to the text string.

    Arguments
    ---------
    text:      text string
    sub:       substring to be found in text

    Returns
    -------
    indices:   start indices of the occurrences found, relative to text
    """
    occurrences_index = []
    index = 0
    while index < len(text):
        index = text.find(sub, index)
        if index == -1:
            break
        occurrences_index.append(index)
        index += 1  # move one character on, so overlapping matches are found too
    return occurrences_index

def getValue(text, b_element, e_element):
    """Return the text between a beginning marker and an ending marker.

    Arguments
    ---------
    text:       text string
    b_element:  beginning marker, e.g. the end of an opening tag ('">')
    e_element:  ending marker, e.g. a closing tag ('</a>')

    Returns
    -------
    The snippet between the first b_element and the next e_element,
    or the original text if b_element is not found.
    """
    start = text.find(b_element)
    if start == -1:
        return text
    start += len(b_element)
    # look for the ending marker only after the beginning marker
    end = text.find(e_element, start)
    return text[start:end] if end != -1 else text[start:]
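
As a cross-check for find_all, the same indices can be obtained with the standard re module. The sketch below is not part of the original notebook; the name find_all_regex and the sample strings are made up for illustration. A zero-width lookahead is used so that overlapping occurrences are reported, which matches the behaviour of stepping the search position forward by one character.

```python
import re

def find_all_regex(text, sub):
    """Return the start index of every (possibly overlapping) occurrence of sub in text."""
    # A zero-width lookahead matches at each position where sub begins,
    # so overlapping occurrences are reported as well.
    pattern = re.compile("(?=" + re.escape(sub) + ")")
    return [match.start() for match in pattern.finditer(text)]

sample_row = "<tr>a</tr><tr>b</tr>"               # hypothetical HTML fragment
tr_starts = find_all_regex(sample_row, "<tr>")    # [0, 10]
overlapping = find_all_regex("aaa", "aa")         # [0, 1]
print(tr_starts, overlapping)
```

re.escape keeps characters such as "<" or "." in the substring from being read as regular-expression syntax.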
    

The following function extracts the data from the table on the Wikipedia page.

In [4]:
def getData(text):
    """Extract the table data from HTML code.

    Arguments
    ---------
    text:      HTML text of the table from which the data is extracted

    Returns
    -------
    pd.DataFrame with the columns PostalCode, Borough, and Neighborhood
    """
    # skip the first <tr>...</tr> pair, which is expected to be the header row
    rows = zip(find_all(text, "<tr>")[1:], find_all(text, "</tr>")[1:])
    data = []
    for row_b, row_e in rows:
        row_data = text[row_b:row_e]
        # pair each opening <td> with its closing </td> inside the row
        sub_indices = zip(find_all(row_data, "<td>"), find_all(row_data, "</td>"))
        data.append([row_data[sub_b + len("<td>"):sub_e] for sub_b, sub_e in sub_indices])
    return pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
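
String searching for "<td>" works only as long as the cell tags carry no attributes. As a hedged alternative, the sketch below collects the cell texts with the standard library's html.parser instead. It is an illustration, not the notebook's method: the class name TableCellParser and the small sample_html fragment are assumptions made up for this example.

```python
from html.parser import HTMLParser

class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # text inside nested tags such as <a> also ends up here
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows without <td> cells (the header) are skipped
                self.rows.append(self._row)
            self._row = None

# hypothetical fragment shaped like the postal code table
sample_html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a>
</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(sample_html)
sample_rows = parser.rows
print(sample_rows)
```

Because the parser only keeps text nodes, the hyperlink markup and trailing newlines are already gone from the collected cells.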

Extract the table HTML from the page:

In [5]:
table_start = '<table class="wikitable sortable">'
table_end = '<table class="multicol" role="presentation" style="border-collapse: collapse; padding: 0; border: 0; background:transparent; width:100%;">'
tabledata = page[page.find(table_start) + len(table_start):page.find(table_end)]

Build the dataframe from the table HTML:

In [6]:
postcode_can = getData(tabledata)

Some clean-up is needed, since cells in Borough and Neighborhood may contain hyperlinks rather than plain text:

In [7]:
postcode_can["Borough"] = [getValue(iString, '">', "</a>") for iString in postcode_can["Borough"]]
postcode_can["Neighborhood"] = [getValue(iString, '">', "</a>").replace("\n", "") for iString in postcode_can["Neighborhood"]]
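
getValue returns the text of the first link in a cell. If a cell could contain several links or text outside a link, a more general clean-up would be to drop every tag. The following is a minimal sketch under that assumption; strip_tags, TAG_PATTERN, and the two sample cells are hypothetical names and values, not taken from the notebook.

```python
import re

# matches any HTML tag, e.g. <a href="..."> or </a>
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(cell):
    """Remove every HTML tag from a cell and trim surrounding whitespace."""
    return TAG_PATTERN.sub("", cell).strip()

linked_cell = '<a href="/wiki/North_York" title="North York">North York</a>\n'
plain_cell = "Not assigned\n"

print(strip_tags(linked_cell))  # North York
print(strip_tags(plain_cell))   # Not assigned
```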

Only process the cells that have an assigned borough. Ignore cells whose borough is "Not assigned".

In [8]:
# take a copy so that the column assignment further down does not act on a view
postcode_can = postcode_can[postcode_can["Borough"] != "Not assigned"].copy()

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [9]:
postcode_can['Neighborhood'] = postcode_can.groupby(['PostalCode'])['Neighborhood'].transform(lambda x: ', '.join(x))
postcode_can = postcode_can[['PostalCode', "Borough", "Neighborhood"]].drop_duplicates()
postcode_can.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 6 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge, Malvern |
| 14 | M3B | North York | Don Mills North |
| 15 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 17 | M5B | Downtown Toronto | Ryerson, Garden District |
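
The combination of rows that share a postal code can also be written as a single groupby aggregation. The sketch below illustrates this on a three-row sample frame built from the M5A example above. It is an alternative formulation, not the notebook's code, and the names sample and combined are assumptions.

```python
import pandas as pd

# small sample: M5A appears twice, once per neighborhood
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# one row per (PostalCode, Borough); the neighborhoods are joined with ", "
combined = (
    sample.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

With sort=False the groups keep the order in which the postal codes first appear, so the result lines up with the source table.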


Finally, use the .shape attribute to show the number of rows and columns of the dataframe.

In [10]:
postcode_can.shape

(103, 3)

### Part 2: Get the location data (latitude, longitude) of each neighborhood

Now that we have a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data. The coordinates are read from the file Geospatial_Coordinates.csv and joined to the dataframe on the postal code.

In [11]:
def getGeomSpatialData():
    """Read the geospatial data (postal code, latitude, longitude) from the CSV file."""
    data = pd.read_csv("Geospatial_Coordinates.csv")
    return data

In [12]:
geomspatial = getGeomSpatialData()
geomspatial.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
postcode_can = postcode_can.set_index('PostalCode').join(geomspatial.set_index('PostalCode'))
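
A left join silently leaves NaN coordinates for postal codes that are missing from the CSV file. The sketch below shows one way to detect such rows, using merge with its indicator and validate parameters. The two sample frames are assumptions, and the coordinate values in them are placeholders rather than real locations.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})

# placeholder coordinates, for illustration only; M5A is left out on purpose
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [1.0, 2.0],
    "Longitude": [-1.0, -2.0],
})

merged = codes.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",  # raises if a postal code is duplicated on either side
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)  # ['M5A']
```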

### Part 3: Explore and cluster the neighborhoods in Toronto

In [13]:
def getCityNeighborhoods(data, city):
    """Return the positional indices of the rows whose Borough contains the city name.

    The result is the tuple returned by np.where and can be passed to .iloc.
    """
    return np.where(data["Borough"].str.find(city) != -1)

In [14]:
neighbourhoods_index = getCityNeighborhoods(postcode_can, "Toronto")

In [15]:
toronto_df = postcode_can.iloc[neighbourhoods_index] # toronto neighbourhoods
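
The same selection can be expressed with a boolean mask, which avoids the detour through positional indices. This is a sketch on a made-up four-row sample, not the notebook's code; the names sample, mask, and toronto_sample are assumptions.

```python
import pandas as pd

sample = pd.DataFrame(
    {"Borough": ["North York", "Downtown Toronto", "East Toronto", "Scarborough"]},
    index=["M3A", "M5A", "M4E", "M1B"],
)

# True for every row whose borough name contains the literal text "Toronto"
mask = sample["Borough"].str.contains("Toronto", regex=False)
toronto_sample = sample[mask]
print(toronto_sample)
```

regex=False treats the search term as plain text, so a city name containing characters such as "." would not be interpreted as a pattern.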

Visualization using folium

In [16]:
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [17]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one circle marker per neighborhood to the map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    

In [18]:
map_toronto