# Segmenting and Clustering Neighborhoods in Toronto

This notebook is a worked example of geocoding and clustering locations in Python. The goal is to cluster Toronto neighborhoods, identified by postal code, according to the venues found in them. Techniques used include web scraping, cleaning and restructuring data frames, merging data, making API calls, K-means clustering, and map visualization.

## Part 1. Scraping the postal codes from Wikipedia

Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 
This Wikipedia page lists the postal codes beginning with M, which correspond to areas within the city of Toronto in the province of Ontario.
Canadian postal codes have the format AXA XAX, where A represents a letter and X represents a digit from 0 to 9. The first three characters denote the forward sortation area (FSA) and the last three characters denote the local delivery unit (LDU). The table on this page lists only the first three characters.
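
To make the FSA/LDU split concrete, here is a minimal sketch that separates a full postal code into its two halves. The pattern follows only the letter/digit layout described above and does not check which letters are actually valid; the function name `split_postal_code` and the sample code `M5A 1A1` are illustrative, not part of the notebook.

```python
import re

# Simplified pattern: letter-digit-letter, optional space, digit-letter-digit.
# It checks the layout only, not which letters are actually in use.
POSTAL_CODE = re.compile(r"^([A-Z]\d[A-Z]) ?(\d[A-Z]\d)$")

def split_postal_code(code):
    """Return (FSA, LDU) for a full postal code, or None if the layout does not match."""
    match = POSTAL_CODE.match(code.strip().upper())
    return match.groups() if match else None

# Illustrative inputs (hypothetical codes, chosen only to exercise the pattern)
print(split_postal_code("M5A 1A1"))
print(split_postal_code("m5a1a1"))
print(split_postal_code("12345"))
```

Only the first element of the returned pair (the FSA) is needed to match the rows of the scraped table.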

Library used: BeautifulSoup

__Installing libraries:__

In [90]:
! pip install beautifulsoup4
! pip install lxml
! pip install folium

__Load libraries and scrape the whole page, then pick out the table from loaded html file.__

Use the BeautifulSoup library to parse the HTML page, then use its `find` method to select the embedded table.

In [2]:
# Loading libraries
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests

# Getting the webpage html code
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Loading html
soup = BeautifulSoup(wiki, 'lxml')
print(soup.title) # Check if source codes are successfully loaded

# Pull out the table
table = soup.find('table', class_='wikitable sortable')


<title>List of postal codes of Canada: M - Wikipedia</title>


**Pull out table headers and contents and store the table into a pandas data frame.**

Tasks executed in this section:
1. Pull out names of the columns.
2. Find out number of rows and build the data frame.
3. Fill the table with scraped information.
4. Clean up unwanted strings.

In [17]:
# Pull out table header
column=[]
for x in table.find_all('th'):
    column.append(x.get_text())

print(column) # Check column names

# Find out total rows including the headers
row_count=0
for x in table.find_all('tr'):
    row_count+=1
print(row_count)
row_count-=1 # Adjust for table content rows

# Set up dataframe
postal = pd.DataFrame(columns=column, index=range(0, row_count))

# Fill in table contents
row_marker = 0
header = True
for row in table.find_all('tr'):
    if header: # Skipping header
        header = False
    else:
        column_marker = 0
        cells = row.find_all('td')
        for cell in cells: # 'cell' avoids shadowing the header list named 'column'
            postal.iat[row_marker,column_marker] = cell.get_text()
            column_marker += 1
        row_marker += 1

# Fixing '\n': rename the header and remove newline characters from the cell text
postal.rename(columns = {'Neighbourhood\n':'Neighborhood'}, inplace = True)
postal['Neighborhood'] = postal['Neighborhood'].str.replace('\n', '', regex=False)
print(postal.shape)
postal.tail()

['Postcode', 'Borough', 'Neighbourhood\n']
289
(288, 3)


| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |
| 287 | M9Z | Not assigned | Not assigned |
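
As a dependency-free illustration of the same row-by-row extraction, the sketch below uses the standard library's `html.parser` in place of BeautifulSoup. The class name `TableRows` and the inline `sample_html` are assumptions made for this example; the sample mimics the layout of the scraped table but is not the real page.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Joining and stripping removes the trailing '\n' seen in the scraped cells
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Hand-written sample in the shape of the wiki table (not the real page)
sample_html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

Collecting rows into a list first and building the data frame once at the end avoids having to count rows and pre-allocate the frame.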


__Cleaning the table to generate unique postcode lines with assigned borough names__

In this section, I clean and restructure the table further.
Tasks executed:
1. Remove lines without an assigned borough name.
2. If neighborhood name is not assigned, use the borough name as the neighborhood name.
3. Combine rows with the same postal code and join the neighborhood names with commas.

In [18]:
# Removing lines with boroughs "Not assigned".
print("Borough not assigned rows: ", (postal.Borough == 'Not assigned').sum()) # Check number of rows in question
postal.drop(postal.index[postal['Borough'] == 'Not assigned'], inplace = True)
print(postal.shape) # Result check

# Assigning borough name as neighbourhood name if neighbourhood name is not assigned
print("Neighborhood not assigned rows: ", (postal.Neighborhood == 'Not assigned').sum()) # Check number of rows without assigned neighbourhood
postal.loc[postal['Neighborhood'] == 'Not assigned', 'Neighborhood'] = postal.loc[postal['Neighborhood'] == 'Not assigned', 'Borough']
print("Neighborhood not assigned rows: ", (postal.Neighborhood == 'Not assigned').sum())


Borough not assigned rows:  77
(211, 3)
Neighborhood not assigned rows:  1
Neighborhood not assigned rows:  0


Reshape the data to combine neighborhoods that share a postal code.

In [91]:
# Stack on postal codes and combining neighbourhoods
coderow = len(postal.Postcode.unique()) # Check number of unique postal codes
print("Total number of unique postal codes: ", coderow)
stack = pd.DataFrame(columns=list(postal.columns), index=range(0, coderow))

code_array = postal.Postcode.unique() # Unique postal codes
row_marker = 0
# Fill the new frame with the borough and the joined unique neighborhoods of each postal code.
for code in code_array:
    stack.iat[row_marker, 0] = code
    stack.iat[row_marker, 1] = ', '.join(postal.loc[postal['Postcode'] == code, 'Borough'].unique())
    stack.iat[row_marker, 2] = ', '.join(postal.loc[postal['Postcode'] == code, 'Neighborhood'].unique())
    row_marker += 1

# Check results
print("Shape of the final table is: ", stack.shape)
stack.head()

Total number of unique postal codes:  103
Shape of the final table is:  (103, 3)


| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
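
The three cleaning tasks can also be written without an explicit loop. The following is a sketch on a small hand-made frame (the rows are a toy subset, not the scraped data), using `groupby` in place of the cell-by-cell fill. It assumes each postal code belongs to a single borough and, unlike the loop above, does not de-duplicate neighborhood names.

```python
import pandas as pd

# Toy rows in the shape of the scraped table (illustrative subset)
raw = pd.DataFrame({
    "Postcode": ["M9Z", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows without an assigned borough
clean = raw[raw["Borough"] != "Not assigned"].copy()

# 2. Fall back to the borough name where the neighborhood is not assigned
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# 3. One row per postal code, neighborhood names joined with commas
combined = (
    clean.groupby(["Postcode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```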


__The target table is generated with 103 rows of postal codes with corresponding boroughs and neighborhoods.__

__End of part 1__

## Part 2. Getting Latitude and Longitude

The optimal method would be to call the Geocoding API provided by Google. However, because its usage terms were changing at the time of writing (2019), I use a pre-built CSV file to get the latitude and longitude of the locations in the previous table.

Tasks in this section:
1. Load the CSV file with location information.
2. Merge it with the table built in Part 1.

In [47]:
# Load CSV file with latitude and longitude information
geotable = pd.read_csv('https://cocl.us/Geospatial_data')
geotable.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [48]:
geotable.set_index('Postal Code', inplace = True) # Making postal codes index, allowing for Excel vlookup type merge.

In [50]:
# Merge latitude and longitude columns based on postal codes.
merge = postal.join(geotable, on=['Postcode'])
geo = pd.DataFrame(columns=list(merge.columns), index=range(0, coderow))

code_array = postal.Postcode.unique() # Unique postal codes
row_marker = 0

# Fill the new frame with the borough, joined unique neighborhoods, and coordinates of each postal code.
for code in code_array:
    geo.iat[row_marker, 0] = code
    geo.iat[row_marker, 1] = ', '.join(merge.loc[merge['Postcode'] == code, 'Borough'].unique())
    geo.iat[row_marker, 2] = ', '.join(merge.loc[merge['Postcode'] == code, 'Neighborhood'].unique())
    # np.asscalar has been removed from recent NumPy releases; every row of a postal code
    # carries the same coordinates after the join, so take the first matching value.
    geo.iat[row_marker, 3] = merge.loc[merge['Postcode'] == code, 'Latitude'].iloc[0]
    geo.iat[row_marker, 4] = merge.loc[merge['Postcode'] == code, 'Longitude'].iloc[0]
    row_marker += 1

print("Shape of the final table is: ", geo.shape)
geo.head()

Shape of the final table is:  (103, 5)


| | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7533 | -79.3297 |
| 1 | M4A | North York | Victoria Village | 43.7259 | -79.3156 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.6543 | -79.3606 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.7185 | -79.4648 |
| 4 | M7A | Queen's Park | Queen's Park | 43.6623 | -79.3895 |
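
Because the Part 1 table already has one row per postal code, the lookup can be done in a single left merge. This is a sketch under that assumption, on two toy rows whose values are copied from the outputs above; the frame names `codes`, `coords`, and `located` are illustrative.

```python
import pandas as pd

# One row per postal code, as produced at the end of Part 1 (toy subset)
codes = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Coordinates keyed by postal code, deliberately in a different order
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.7259, 43.7533],
    "Longitude": [-79.3156, -79.3297],
})

# A left merge keeps every row (and the row order) of the left table
located = codes.merge(
    coords, how="left", left_on="Postcode", right_on="Postal Code"
).drop(columns="Postal Code")
print(located)
```

A postal code with no match in the coordinate file would keep its row and get missing values in the coordinate columns, which makes gaps easy to spot.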


This ends part 2.

## Part 3. Mapping the neighborhoods

In this part, up to 100 venues for each neighborhood are collected from Foursquare. I then use K-means clustering to group neighborhoods with similar venues into the same cluster. Finally, the neighborhoods are marked on a map, with neighborhoods in the same cluster shown in the same color.

Pick out boroughs that contain "Toronto".

In [92]:
# Toronto Only
toronto = geo[geo['Borough'].str.contains("Toronto")]

Enter the Foursquare API credentials. The keys have been removed for security reasons.

In [61]:

CLIENT_ID = 'Foursquare ID' # Foursquare ID
CLIENT_SECRET = 'Foursquare Secret' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

Define a function to collect venue information.
The function first generates an API request URL from each neighborhood's coordinates. Then, from the returned JSON, the name, location, and category of each venue are stored together with the neighborhood name.

In [75]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL (LIMIT is a global that must be set before the function is called)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
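
The two halves of the function above (building the URL, then flattening the JSON) can be tried offline. The sketch below is illustrative only: `build_explore_url` and `extract_venues` are hypothetical helper names, and `sample` is a hand-written payload in the nested shape the function above indexes into, not real API output.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=500, limit=100):
    """Assemble the explore request URL with URL-encoded parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def extract_venues(payload, name, lat, lng):
    """Flatten one response into rows, tolerating missing groups or categories."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload (assumption: same nesting as used in getNearbyVenues)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}}
]}]}}

url = build_explore_url(43.65426, -79.360636, "ID", "SECRET", "20180605")
rows = extract_venues(sample, "Harbourfront, Regent Park", 43.65426, -79.360636)
empty = extract_venues({"response": {}}, "Empty", 0.0, 0.0)
print(url)
print(rows)
print(empty)
```

Passing the limit as a parameter, rather than reading a global, keeps the helper usable on its own, and the guarded lookups mean a neighborhood with no results yields an empty list instead of an exception.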

Make the calls for the Toronto neighborhoods and store the returned venues. The search radius is 500 m, and at most 100 venues are returned per neighborhood.

In [76]:
LIMIT = 100

toronto_venues = getNearbyVenues(names=toronto['Neighborhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

print(toronto_venues.shape)
toronto_venues.head()

Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
Fir

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Harbourfront, Regent Park",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Harbourfront, Regent Park",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Harbourfront, Regent Park",43.65426,-79.360636,Toronto Cooper Koo Family Cherry St YMCA Centre,43.653191,-79.357947,Gym / Fitness Center
3,"Harbourfront, Regent Park",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Harbourfront, Regent Park",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


Check the total number of venue categories.

In [78]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 238 uniques categories.


Create dummy variables for each venue category to prepare for the evaluation.

In [97]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

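# add neighborhood column back to dataframe
# Note: a venue category named 'Neighborhood' appears to exist, so this assignment
# overwrites that dummy column in place instead of appending a new last column.
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move the last column to the first position; because of the overwrite above, the last
# column is 'Yoga Studio' rather than 'Neighborhood', as the output below shows
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]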

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Calculate the weight (relative frequency) of each venue category in each neighborhood.

In [98]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.066667,0.066667,0.066667,0.133333,0.133333,0.133333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
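
The mean of a one-hot column within a neighborhood is simply the share of that neighborhood's venues falling in the category. A minimal pure-Python sketch of the same quantity, on toy pairs rather than the real Foursquare results (the names `venues`, `counts`, and `category_weights` are illustrative):

```python
from collections import Counter, defaultdict

# Toy (neighborhood, venue category) pairs, not real Foursquare results
venues = [
    ("Berczy Park", "Coffee Shop"),
    ("Berczy Park", "Coffee Shop"),
    ("Berczy Park", "Cocktail Bar"),
    ("Rosedale", "Park"),
]

# Count how often each category occurs in each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

def category_weights(counter):
    """Convert raw counts to shares that sum to 1 within a neighborhood."""
    total = sum(counter.values())
    return {category: n / total for category, n in counter.items()}

weights = {hood: category_weights(c) for hood, c in counts.items()}
top_two = counts["Berczy Park"].most_common(2)
print(weights)
print(top_two)
```

Because the weights are shares, a neighborhood with 5 venues and one with 100 venues are compared on the same scale.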


Define a function to return top n most common (with highest weight) venue categories in each neighborhood.

In [99]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


Show top 10 most common venue categories in each neighborhood

In [104]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # only 'st', 'nd', 'rd' are listed; later ranks use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,American Restaurant,Steakhouse,Thai Restaurant,Cosmetics Shop,Hotel,Restaurant,Burger Joint
1,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Café,Cheese Shop,Farmers Market,Steakhouse,Bakery,Beer Bar,Italian Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Yoga Studio,Italian Restaurant,Convenience Store,Pet Store,Climbing Gym,Restaurant,Caribbean Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Pizza Place,Gym / Fitness Center,Recording Studio,Restaurant,Butcher,Burrito Place,Brewery,Skate Park
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Terminal,Airport Service,Boat or Ferry,Sculpture Garden,Bar,Boutique,Plane,Airport Gate,Airport Food Court
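
The column labels above rely on a three-element suffix list with a fallback to "th", which is enough for ten columns. If more columns were ever needed, a small helper such as the hypothetical `ordinal` below would also handle cases like 11th and 21st; it is a sketch, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Build the same style of column labels as in the cell above
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```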


__Use K-means algorithm to cluster similar neighborhoods__

The number of clusters is fixed in advance (6 in this case), and that many initial centers are chosen. In each iteration, the distance from every data point to every center is calculated, and each point is assigned to its closest center. The mean of each cluster then becomes the new center for the next iteration. This process repeats until the result converges.
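
The iteration described above can be sketched in a few lines of NumPy. This is a bare-bones illustration on six toy points, with starting centers drawn at random from the data; it is not how scikit-learn's `KMeans` is implemented, which has its own initialization and stopping rules. The function name `kmeans` and its parameters are assumptions for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New center = mean of the assigned points; keep the old center if a cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return labels, centers

# Two well-separated toy groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful, which is why the map below uses the labels solely to pick a color.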

In [105]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 6

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

Adding cluster label to each neighborhood. Neighborhoods with the same cluster label are in the same cluster.

In [106]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto

# join the cluster labels and top venues onto the Toronto table, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns.

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.6543,-79.3606,0,Coffee Shop,Bakery,Park,Pub,Mexican Restaurant,Breakfast Spot,Theater,Restaurant,Café,Electronics Store
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3789,0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Middle Eastern Restaurant,Ramen Restaurant,Restaurant,Fast Food Restaurant,Diner,Tea Room
15,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754,0,Café,Coffee Shop,Hotel,Restaurant,Gastropub,Bakery,Breakfast Spot,Cocktail Bar,Clothing Store,Cosmetics Shop
19,M4E,East Toronto,The Beaches,43.6764,-79.293,0,Health Food Store,Other Great Outdoors,Pub,Trail,Falafel Restaurant,Event Space,Ethiopian Restaurant,Farmers Market,Diner,Fast Food Restaurant
20,M5E,Downtown Toronto,Berczy Park,43.6448,-79.3733,0,Coffee Shop,Cocktail Bar,Seafood Restaurant,Café,Cheese Shop,Farmers Market,Steakhouse,Bakery,Beer Bar,Italian Restaurant


__Visualization using Folium library__

Neighborhoods are marked on the map of Toronto. Similar neighborhoods are labeled with the same color.

In [107]:
# Load folium library for mapping
import folium
map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=12)

# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
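
The only requirement for the marker colors is that each cluster label maps to one distinct color. As a small sketch of that mapping, the hypothetical helper `cluster_palette` below builds evenly spaced hex colors with the standard library's `colorsys` instead of the matplotlib colormap, and indexes the palette directly by label.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters hex colors evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            int(round(r * 255)), int(round(g * 255)), int(round(b * 255))))
    return palette

palette = cluster_palette(6)

# Toy cluster labels: the first two neighborhoods share a cluster
labels = [0, 0, 3, 5]
marker_colors = [palette[label] for label in labels]
print(palette)
print(marker_colors)
```

The resulting strings can be passed wherever a hex color is accepted, such as the `color` and `fill_color` arguments used in the cell above.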