# Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import requests
import numpy as np
import pandas as pd

import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
from bs4 import BeautifulSoup

## Part 1: parsing wiki into a data frame and cleaning data

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and load it into a pandas dataframe.

In [11]:
# parse the Wikipedia page and collect the rows of the postal code table
wiki_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_html, 'lxml')
table = soup.find('table', attrs={'class': 'wikitable'})
table_rows = table.find_all('tr')
rows = []

for tr in table_rows:
    cells = tr.find_all('td')
    # keep the stripped text of every non-empty cell
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:  # rows without <td> cells (such as the header) are skipped
        rows.append(row)

In [12]:
# create data frame
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

#### The dataframe consists of three columns: PostalCode, Borough, and Neighborhood. 
Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [15]:
df.drop(df[df.Borough == "Not assigned"].index, inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge
11,M3B,North York,Don Mills
12,M4B,East York,Parkview Hill / Woodbine Gardens
13,M5B,Downtown Toronto,"Garden District, Ryerson"


#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 

In [17]:
# a single .loc call avoids chained assignment, which may silently write to a copy
df.loc[df.Neighborhood == "Not assigned", "Neighborhood"] = df.Borough
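
The two cleaning rules above can be tried on a toy table. This is only a minimal sketch: the rows below are illustrative and are not taken from the scraped table, and the code `M9Z` is made up for the example.

```python
import pandas as pd

# Illustrative rows only; they are not taken from the scraped table.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M9Z", "Etobicoke", "Not assigned"],  # made-up postal code
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule 1: keep only rows whose borough is assigned.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: an unassigned neighborhood inherits the borough name.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```

Filtering with a boolean mask returns a new dataframe, so it can be used instead of `drop(..., inplace=True)` when the original table should stay untouched.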

#### More than one neighborhood can exist in one postal code area, so rows that share a postal code are combined into one row

In [21]:
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()
df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West
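
In the output shown earlier, neighborhood names already appear slash-separated within one cell, so the groupby mainly matters when a postal code is listed on several rows. The sketch below shows that case on illustrative rows (the split of Regent Park and Harbourfront into two rows is an assumption made for the example).

```python
import pandas as pd

# Illustrative rows: one postal code listed twice with different neighborhoods.
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M1G"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
        "Neighborhood": ["Regent Park", "Harbourfront", "Woburn"],
    }
)

# Join all neighborhood names that share a postal code and borough.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

Note that `groupby` sorts by the grouping keys, which is why the combined table starts at `M1B` rather than `M3A`.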


#### Use the .shape attribute to print the number of rows and columns of the dataframe.

In [22]:
df.shape

(103, 3)

## Part 2: getting geo coordinates for neighborhoods

In [24]:
# read data from csv file
data = pd.read_csv("http://cocl.us/Geospatial_data") 
data.rename(columns = {'Postal Code':'PostalCode'}, inplace = True)
data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [25]:
df = df.merge(data, on='PostalCode')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
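
`merge` performs an inner join by default, so a postal code with no coordinates would be dropped without any warning. A minimal sketch of how the join could be checked, using two rows from the output above plus a made-up code `M9Z` that has no coordinates:

```python
import pandas as pd

hoods = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is made up for the example
        "Neighborhood": ["Malvern / Rouge", "Rouge Hill / Port Union / Highland Creek", "Hypothetical"],
    }
)
coords = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# how="left" keeps every neighborhood, indicator=True records where each row
# came from, and validate raises if a postal code is duplicated on either side.
checked = hoods.merge(
    coords, on="PostalCode", how="left", indicator=True, validate="one_to_one"
)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.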


## Part 3: exploring and clustering the neighborhoods in Toronto

In [28]:
address = 'Toronto'

geolocator = Nominatim(user_agent="trnt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
location

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Location(Toronto, Golden Horseshoe, Ontario, M5H 2N2, Canada, (43.6534817, -79.3839347, 0.0))

In [35]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

![Toronto neighborhoods](img/1.png)

### Now I'm going to explore and cluster Toronto's downtown neighborhoods

In [36]:
td = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
td.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,St. James Town / Cabbagetown,43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [37]:
address = 'Downtown Toronto, Toronto'

geolocator = Nominatim(user_agent="trnt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6541737, -79.38081164513409.


In [41]:
map_td = folium.Map(location=[latitude, longitude], zoom_start=13)

for lat, lng, label in zip(td['Latitude'], td['Longitude'], td['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_td)
    
map_td

![downtown](img/2.png)

In [103]:
# define Foursquare credentials
CLIENT_ID = 'my_secret'
CLIENT_SECRET = 'my_secret'
VERSION = '20180605'

In [47]:

def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)
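
The function above indexes straight into the JSON, so a failed request or a venue without categories would raise an exception. The sketch below shows one way the parsing step could tolerate such cases. `extract_venues` is a hypothetical helper name, and the sample payload is hand-written to mirror only the keys that `getNearbyVenues` reads; it is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style payload into row tuples, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written sample that mirrors the keys used in getNearbyVenues.
sample = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Sample Park",
                               "location": {"lat": 43.68, "lng": -79.37},
                               "categories": [{"name": "Park"}]}},
                    {"venue": {"name": "Uncategorized Spot",
                               "location": {"lat": 43.67, "lng": -79.38},
                               "categories": []}},
                ]
            }
        ]
    }
}

rows = extract_venues(sample, "Rosedale", 43.679563, -79.377529)
print(rows)
print(extract_venues({}, "Rosedale", 43.679563, -79.377529))  # empty payload gives no rows
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without network access or credentials.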

In [48]:
td_venues = getNearbyVenues(names=td['Neighborhood'],
                                   latitudes=td['Latitude'],
                                   longitudes=td['Longitude']
                                  )

In [52]:
td_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,St. James Town / Cabbagetown,43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


In [53]:
td_venues.shape

(1280, 7)

As we can see, 1280 venues were returned.

In [54]:
td_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst\n Quay / South Niagara / Island airport,16,16,16,16,16,16
Central Bay Street,77,77,77,77,77,77
Christie,19,19,19,19,19,19
Church and Wellesley,79,79,79,79,79,79
Commerce Court / Victoria Hotel,100,100,100,100,100,100
First Canadian Place / Underground city,100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
Harbourfront East / Union Station / Toronto Islands,100,100,100,100,100,100
Kensington Market / Chinatown / Grange Park,76,76,76,76,76,76


In [56]:
print('There are {} unique categories.'.format(len(td_venues['Venue Category'].unique())))

There are 206 unique categories.


In [58]:
# one hot encoding
td_onehot = pd.get_dummies(td_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
td_onehot['Neighborhood'] = td_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [td_onehot.columns[-1]] + list(td_onehot.columns[:-1])
td_onehot = td_onehot[fixed_columns]
td_onehot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
td_onehot.shape

(1280, 206)

In [61]:
td_grouped = td_onehot.groupby('Neighborhood').mean().reset_index()
td_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0
1,CN Tower / King and Spadina / Railway Lands / ...,0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.012987,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,...,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,0.0,0.012987,0.0
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.025316,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,...,0.012658,0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Commerce Court / Victoria Hotel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
6,First Canadian Place / Underground city,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.02,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.01,0.0
8,Harbourfront East / Union Station / Toronto Is...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
9,Kensington Market / Chinatown / Grange Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039474,0.0,0.052632,0.013158,0.0
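
Averaging the one-hot columns per neighborhood gives the share of each venue category among that neighborhood's venues. As a hedged alternative sketch, `pd.crosstab` with `normalize="index"` computes the same shares in one step and keeps the neighborhood in the index, which sidesteps any name clash if a venue category happened to be called "Neighborhood". The neighborhoods `A` and `B` and their venues are toy data.

```python
import pandas as pd

# Toy venue list; neighborhoods "A" and "B" are placeholders.
venues = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "A", "B", "B"],
        "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Café", "Park", "Trail"],
    }
)

# Route used in the notebook: one-hot encode, then average per neighborhood.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
manual = onehot.groupby(venues["Neighborhood"]).mean()

# One-step alternative: row-normalized contingency table.
freq = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

print(freq.round(2))
```

Each row of `freq` sums to 1, so the values can be read directly as frequencies.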


#### Let's print each neighborhood along with the top 3 most common venues

In [63]:
num_top_venues = 3

for hood in td_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = td_grouped[td_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2  Seafood Restaurant  0.04


----CN Tower / King and Spadina / Railway Lands / Harbourfront West / Bathurst
 Quay / South Niagara / Island airport----
              venue  freq
0    Airport Lounge  0.12
1   Airport Service  0.12
2  Airport Terminal  0.12


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.18
1  Italian Restaurant  0.05
2        Burger Joint  0.04


----Christie----
           venue  freq
0  Grocery Store  0.21
1           Café  0.16
2           Park  0.11


----Church and Wellesley----
                 venue  freq
0  Japanese Restaurant  0.06
1          Coffee Shop  0.06
2              Gay Bar  0.05


----Commerce Court / Victoria Hotel----
         venue  freq
0  Coffee Shop  0.12
1   Restaurant  0.07
2         Café  0.07


----First Canadian Place / Underground city----
         venue  freq
0  Coffee Shop  0.12
1         Café  0.07
2 

In [94]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = td_grouped['Neighborhood']

for ind in np.arange(td_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(td_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant
1,CN Tower / King and Spadina / Railway Lands / ...,Airport Lounge,Airport Service,Airport Terminal
2,Central Bay Street,Coffee Shop,Italian Restaurant,Thai Restaurant
3,Christie,Grocery Store,Café,Park
4,Church and Wellesley,Japanese Restaurant,Coffee Shop,Gay Bar
5,Commerce Court / Victoria Hotel,Coffee Shop,Restaurant,Café
6,First Canadian Place / Underground city,Coffee Shop,Café,Restaurant
7,"Garden District, Ryerson",Clothing Store,Coffee Shop,Japanese Restaurant
8,Harbourfront East / Union Station / Toronto Is...,Coffee Shop,Aquarium,Hotel
9,Kensington Market / Chinatown / Grange Park,Café,Coffee Shop,Vietnamese Restaurant
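
The ranking step can also be written with `Series.nlargest`, as sketched below on a toy frequency table. `ordinal` and `top_venues` are hypothetical helper names. Ties in frequency are broken by column order, which may be why the printed list and the table above disagree on the third venue for Berczy Park.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def top_venues(freq, n):
    """Return one row per neighborhood with its n most frequent categories."""
    cols = [f"{ordinal(i)} Most Common Venue" for i in range(1, n + 1)]
    ranked = freq.apply(
        lambda row: pd.Series(row.nlargest(n).index.tolist(), index=cols),
        axis=1,
    )
    return ranked.reset_index()


# Toy frequency table; neighborhoods "A" and "B" are placeholders.
freq = pd.DataFrame(
    {"Coffee Shop": [0.5, 0.1], "Park": [0.3, 0.6], "Café": [0.2, 0.3]},
    index=pd.Index(["A", "B"], name="Neighborhood"),
)

top = top_venues(freq, 2)
print(top)
```

Computing the suffix from the number removes the need for the `try`/`except` around the `indicators` list when more than three venues are requested.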


#### We can see that in almost every neighborhood the most common venue is a coffee shop

Now let's do some clustering

In [95]:
kclusters = 3
td_grouped_clustering = td_grouped.drop(columns='Neighborhood')  # positional axis argument was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(td_grouped_clustering)

In [96]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
td_merged = td
# join the cluster labels and top venues onto td, which holds latitude/longitude for each neighborhood
td_merged = td_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
td_merged.dropna(inplace=True)
td_merged = td_merged.astype({'Cluster Labels': 'int32'})
td_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,0,Park,Trail,Playground
1,M4X,Downtown Toronto,St. James Town / Cabbagetown,43.667967,-79.367675,1,Coffee Shop,Restaurant,Pub
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Japanese Restaurant,Coffee Shop,Gay Bar
3,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636,1,Coffee Shop,Bakery,Park
4,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Clothing Store,Coffee Shop,Japanese Restaurant


In [97]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(td_merged['Latitude'], td_merged['Longitude'], td_merged['Neighborhood'], td_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

![clusters](img/3.png)
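
In the map code above, `ys` is used only for its length, and `rainbow[cluster-1]` maps cluster 0 to the last color through negative indexing. A simpler sketch that yields the same set of colors, with each cluster mapped to the color at its own index:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 3

# One evenly spaced rainbow color per cluster, converted to hex strings.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Explicit cluster -> color lookup, so cluster 0 gets the first color.
color_of = {cluster: palette[cluster] for cluster in range(kclusters)}

print(color_of)
```

With this mapping the marker color would be `color_of[cluster]`; the colors on the map would shift by one position compared with the image above.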

As we can see, almost all the neighborhoods fall into the same cluster. To bring out some differences, I've split the data into 3 clusters.
Let's examine them now.
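
The number of clusters was picked by hand here. One common way to sanity-check such a choice is to compare the k-means inertia (within-cluster sum of squares) for several values of k. The sketch below does this on synthetic points, not on the venue data, so the numbers say nothing about Toronto.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs of 20 points each (an assumption for the demo).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for k = 1..5 and record the inertia of each fit.
inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

The value of k after which the inertia stops dropping sharply is a reasonable candidate. Running the same loop on `td_grouped_clustering` would show whether 3 is a natural choice for the downtown data.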

In [99]:
td_merged.loc[td_merged['Cluster Labels'] == 0, td_merged.columns[[2] + list(range(5, td_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Rosedale,0,Park,Trail,Playground


This cluster contains only one neighborhood, and its most common venues are a park, a trail and a playground.

In [101]:
td_merged.loc[td_merged['Cluster Labels'] == 1, td_merged.columns[[2] + list(range(5, td_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
1,St. James Town / Cabbagetown,1,Coffee Shop,Restaurant,Pub
2,Church and Wellesley,1,Japanese Restaurant,Coffee Shop,Gay Bar
3,Regent Park / Harbourfront,1,Coffee Shop,Bakery,Park
4,"Garden District, Ryerson",1,Clothing Store,Coffee Shop,Japanese Restaurant
5,St. James Town,1,Coffee Shop,Café,Restaurant
6,Berczy Park,1,Coffee Shop,Cocktail Bar,Restaurant
7,Central Bay Street,1,Coffee Shop,Italian Restaurant,Thai Restaurant
8,Richmond / Adelaide / King,1,Coffee Shop,Restaurant,Café
9,Harbourfront East / Union Station / Toronto Is...,1,Coffee Shop,Aquarium,Hotel
10,Toronto Dominion Centre / Design Exchange,1,Coffee Shop,Café,Hotel


The second cluster is the biggest one, and in almost all of its neighborhoods coffee shops and cafés are among the most common venues. (People love to eat and drink there :) )

In [102]:
td_merged.loc[td_merged['Cluster Labels'] == 2, td_merged.columns[[2] + list(range(5, td_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
14,CN Tower / King and Spadina / Railway Lands / ...,2,Airport Lounge,Airport Service,Airport Terminal


The airport with its facilities was marked as the third cluster. Here we find lounges and other common airport services.

## Conclusion

To sum up, Toronto has a lot of boroughs, and downtown isn't the most interesting of them.
As expected, the most common venues there are typical for office workers and businessmen: coffee shops and restaurants.