# Segmenting and Clustering Neighborhoods in Toronto
## Capstone Project 2

### Part One

Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
import requests
import json

### Getting the data

Read the required Wikipedia page into a pandas DataFrame

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]

In [3]:
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### Formatting the data

Rename the `Postal code` column to `Postal Code` to match the structure specified in the assignment

In [4]:
df.rename(columns={'Postal code': 'Postal Code'}, inplace=True)

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
df = df[df['Borough'] != 'Not assigned'].copy()  # copy, so later column assignments do not act on a view

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [6]:
# missing neighborhoods are NaN (a float), not the string 'NaN', so fill them from the Borough column
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
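
A minimal sketch of this step on toy data, assuming empty cells arrive as float `NaN` (which is how pandas usually represents them). The rows are made up for illustration. It shows that a comparison against the string `'NaN'` never matches a missing value, while `fillna` with a second column fills the gap row by row.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with one missing neighborhood.
toy = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', np.nan],
})

# A missing cell is a float NaN, so comparing with the string 'NaN' finds nothing.
string_matches = int((toy['Neighborhood'] == 'NaN').sum())

# fillna with a Series takes the replacement from the same row of that Series.
toy['Neighborhood'] = toy['Neighborhood'].fillna(toy['Borough'])
print(string_matches)
print(toy)
```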

Combine rows that share the same postal code and borough. The result is only displayed here; `df` itself is left unchanged.

In [7]:
df.groupby(['Postal Code','Borough'], sort = False).agg(lambda x: ', '.join(x))

Postal Code,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
M6A,North York,Lawrence Manor / Lawrence Heights
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...
M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
M4Y,Downtown Toronto,Church and Wellesley
M7Y,East Toronto,Business reply mail Processing CentrE
M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
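
A small sketch of the same grouping on toy data, in case a postal code does appear on several rows. The three rows are hypothetical. Note that `groupby` returns a new object, so the result has to be assigned to be kept.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighborhoods.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Victoria Village'],
})

# Join the neighborhood names of each (postal code, borough) pair.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```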


Replace the '/' separators in the Neighborhood lists with commas

In [8]:
df['Neighborhood'] = df.apply(lambda x: ', '.join(n.strip() for n in x['Neighborhood'].split('/')), axis=1)
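
As an alternative sketch, the same replacement can be written with the vectorized string methods of pandas instead of a row-wise `apply`. The sample values are taken from the table above; the regular expression is my own choice and swallows the whitespace around each slash.

```python
import pandas as pd

names = pd.Series(['Regent Park / Harbourfront', 'Parkwoods', 'Malvern / Rouge'])

# Replace every slash, together with surrounding whitespace, by ', '.
cleaned = names.str.replace(r'\s*/\s*', ', ', regex=True).str.strip()
print(cleaned.tolist())
```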

Let's see what we've got so far

In [9]:
df.head(20)

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [10]:
df.shape

(103, 3)

### Part Two

I am using the supplied CSV file to extend the DataFrame with the geospatial coordinates

In [11]:
geo_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


The number of rows matches.

Let's combine the two data frames

In [12]:
df = df.join(geo_df.set_index('Postal Code'), on='Postal Code')
df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
165,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
168,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
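
Matching row counts alone do not prove that every postal code found its coordinates. A hedged sketch of a stricter check, using `merge` with its `validate` and `indicator` options on three rows copied from the tables above:

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
geo = pd.DataFrame({
    'Postal Code': ['M5A', 'M3A', 'M4A'],
    'Latitude': [43.654260, 43.753259, 43.725882],
    'Longitude': [-79.360636, -79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from.
checked = left.merge(geo, on='Postal Code', how='left',
                     validate='one_to_one', indicator=True)

# Rows without coordinates would be marked 'left_only'.
unmatched = checked[checked['_merge'] != 'both']
checked = checked.drop(columns='_merge')
print(len(unmatched))
```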


### Part Three

Reduce the data to the boroughs of Toronto

In [13]:
toronto_df = df[df['Borough'].str.contains('Toronto')]
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
30,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [14]:
toronto_df.shape

(39, 5)

Since we will only work with the Toronto data from here on, let's calculate the geospatial center point of the dataset (the midpoint of its latitude and longitude ranges) to help center the map.

In [15]:
latitude = (toronto_df['Latitude'].min() + toronto_df['Latitude'].max())/2
longitude = (toronto_df['Longitude'].min() + toronto_df['Longitude'].max())/2
zoom_start=11
print('Latitude: {}. Longitude: {}'.format(latitude, longitude))

Latitude: 43.6784836. Longitude: -79.38874055
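
For comparison, a short sketch of the two obvious ways to pick a map center, using the five Toronto rows shown above. The midpoint of the bounding box (used in this notebook) depends only on the extreme points, while the mean is pulled toward wherever most points lie. The helper names are hypothetical.

```python
import pandas as pd

pts = pd.DataFrame({
    'Latitude': [43.65426, 43.662301, 43.657162, 43.651494, 43.676357],
    'Longitude': [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
})

def bbox_center(frame):
    """Midpoint of the latitude and longitude ranges."""
    lat = (frame['Latitude'].min() + frame['Latitude'].max()) / 2
    lng = (frame['Longitude'].min() + frame['Longitude'].max()) / 2
    return lat, lng

def mean_center(frame):
    """Arithmetic mean of all coordinates."""
    return frame['Latitude'].mean(), frame['Longitude'].mean()

box_lat, box_lng = bbox_center(pts)
avg_lat, avg_lng = mean_center(pts)
print(box_lat, box_lng)
print(avg_lat, avg_lng)
```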


Let's visualize the selected items on the map

In [16]:

toronto_map = folium.Map(location=[latitude, longitude], zoom_start=zoom_start)

# add markers to map
for lat, lng, label in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

#### Define FourSquare credentials and version

In [17]:
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20200411' # Foursquare API version

I am going to reuse the function from the previous lab to query Foursquare data, but with the postal code as the reference instead of the neighborhood name. The neighborhood coordinates are omitted from the results, since they are already available in the _toronto_df_ DataFrame.

In [18]:
def getNearbyVenues(postal_codes, latitudes, longitudes, radius=500, LIMIT=50):
    
    venues_list=[]
    for postal_code, lat, lng in zip(postal_codes, latitudes, longitudes):
        print(postal_code)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            postal_code,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
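
Two parts of the function above can be sketched in isolation, without touching the network: building the query string and turning a response into rows. This is only an illustration. The helper names, the placeholder credentials, and the sample payload are hypothetical; the payload mirrors just the fields the function reads. I am also assuming that a venue may come back with an empty `categories` list, in which case the category is stored as `None` instead of raising an `IndexError`.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=50):
    """Build the explore URL with URL-encoded query parameters."""
    query = urlencode({
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    })
    return 'https://api.foursquare.com/v2/venues/explore?' + query

def extract_rows(postal_code, payload):
    """Turn one decoded JSON response into a list of row tuples."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((postal_code, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Hypothetical payload with the same nesting as the fields used above.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Roselle Desserts',
               'location': {'lat': 43.653447, 'lng': -79.362017},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Unnamed Venue',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

url = build_explore_url('ID', 'SECRET', '20200411', 43.65426, -79.360636)
rows = extract_rows('M5A', sample_payload)
print(url)
print(rows)
```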

In [19]:
toronto_venues = getNearbyVenues(  postal_codes=toronto_df['Postal Code'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

M5A
M7A
M5B
M5C
M4E
M5E
M5G
M6G
M5H
M6H
M5J
M6J
M4K
M5K
M6K
M4L
M5L
M4M
M4N
M5N
M4P
M5P
M6P
M4R
M5R
M6R
M4S
M5S
M6S
M4T
M5T
M4V
M5V
M4W
M5W
M4X
M5X
M4Y
M7Y


It seems to be a good idea to store the downloaded data now. After running the following cell, I will comment out the `pickle.dump` call to avoid overwriting the backup on a subsequent run.

In [20]:
import pickle
#pickle.dump(toronto_venues, open("toronto_venues_pc.p", "wb"))
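
Instead of commenting the line in and out, the backup could be guarded by a check for the file. A minimal sketch of that idea; the helper `load_or_create`, the stand-in download function, and the temporary file path are all hypothetical.

```python
import os
import pickle
import tempfile

def load_or_create(path, create):
    """Return the pickled object at path, creating it with create() if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    data = create()
    with open(path, 'wb') as handle:
        pickle.dump(data, handle)
    return data

calls = []

def expensive_download():
    # Stand-in for the Foursquare query; counts how often it runs.
    calls.append(1)
    return {'M5A': ['Roselle Desserts', 'Tandem Coffee']}

cache_path = os.path.join(tempfile.mkdtemp(), 'venues_cache.p')
first = load_or_create(cache_path, expensive_download)   # downloads and stores
second = load_or_create(cache_path, expensive_download)  # reads the backup
print(len(calls))
```

With this pattern a rerun of the notebook reuses the stored data, and deleting the file forces a fresh download.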

In [21]:
toronto_venues

Unnamed: 0,Postal Code,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
...,...,...,...,...,...
1186,M7Y,The Ten Spot,43.664815,-79.324213,Spa
1187,M7Y,Toronto Yoga Mamas,43.664824,-79.324335,Yoga Studio
1188,M7Y,TTC Stop #03049,43.664470,-79.325145,Light Rail Station
1189,M7Y,Greenwood Cigar & Variety,43.664538,-79.325379,Smoke Shop


Transform the DataFrame into a matrix of postal codes and venue categories, where each cell holds the relative frequency of that category among the venues of the postal code.  
_Same as in the course lab_

In [22]:
# one-hot encode the venue categories
onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the Postal Code column back to the dataframe
onehot['Postal Code'] = toronto_venues['Postal Code']

# move the Postal Code column to the first position
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

# group by postal code; the mean of each indicator column is the category's relative frequency
grouped = onehot.groupby('Postal Code').mean().reset_index()

grouped.shape

(39, 217)
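
To make the weighting concrete, here is a toy version of the same transformation with six made-up venues. The mean of a 0/1 indicator column within a group is simply the share of venues in that category, so every row sums to 1.

```python
import pandas as pd

# Hypothetical venues for two postal codes.
venues = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M5A', 'M5A', 'M4E', 'M4E'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Spa',
                       'Trail', 'Pub'],
})

# One indicator column per category, then the per-postal-code mean.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
freq = onehot.groupby(venues['Postal Code']).mean()
print(freq)
```

Here M5A has four venues, two of them coffee shops, so its `Coffee Shop` weight is 0.5.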

Check if the resulting dimensions are reasonable

In [23]:
assert toronto_df.shape[0] == grouped.shape[0]
assert len(toronto_venues['Venue Category'].unique())+1 == grouped.shape[1]

In [24]:
grouped.head()

Unnamed: 0,Postal Code,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.025,0.0,0.0,0.025
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Transform this data into a table that stores the most common venue categories per postal code.  
_Taken from the course lab._

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd'] + ['th']*(num_top_venues-3)

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = grouped['Postal Code']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Trail,Neighborhood,Pub,Health Food Store,Yoga Studio,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,M4K,Greek Restaurant,Italian Restaurant,Coffee Shop,Ice Cream Shop,Furniture / Home Store,Restaurant,Bookstore,Diner,Sports Bar,Indian Restaurant
2,M4L,Sandwich Place,Park,Pizza Place,Brewery,Liquor Store,Burrito Place,Board Shop,Restaurant,Italian Restaurant,Ice Cream Shop
3,M4M,Café,Coffee Shop,Gastropub,Bakery,American Restaurant,Brewery,Yoga Studio,Diner,Ice Cream Shop,Seafood Restaurant
4,M4N,Park,Swim School,Bus Line,Yoga Studio,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
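
In the output above, postal codes with only a few venues appear to have their lower ranks filled with categories of weight zero (for example the repeated "Donut Shop, Doner Restaurant, Dog Run" tail). A possible variation, sketched on hypothetical weights, keeps only categories that actually occur; the helper name `top_categories` is my own.

```python
import pandas as pd

# Hypothetical frequency matrix: rows are postal codes, columns are categories.
freq = pd.DataFrame(
    {'Bakery': [0.3, 0.0], 'Coffee Shop': [0.5, 0.0], 'Pub': [0.0, 0.4],
     'Spa': [0.2, 0.0], 'Trail': [0.0, 0.6]},
    index=pd.Index(['M5A', 'M4E'], name='Postal Code'),
)

def top_categories(row, n):
    """Return up to n category names with non-zero weight, heaviest first."""
    present = row[row > 0]
    return present.sort_values(ascending=False).index[:n].tolist()

top = {code: top_categories(row, 3) for code, row in freq.iterrows()}
print(top)
```

The result is a shorter list for sparse postal codes rather than a fixed-width table, which is a trade-off against the tabular layout used here.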


#### Clustering the neighborhoods

In [26]:
# set number of clusters
kclusters = 5

# drop the label column so that only the numeric category weights are clustered
grouped_clustering = grouped.drop(columns='Postal Code')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# merge the results with the original dataset
merged = toronto_df.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
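
A toy illustration of what k-means does with such frequency rows, and of a quick look at the cluster sizes, which is worth doing since several clusters in the tables below contain a single postal code. The four rows are made up, and `n_init=10` is set explicitly only so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical weights for three categories; rows 0-1 and rows 2-3 are similar.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Number of rows assigned to each cluster.
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```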

In [27]:
merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Coffee Shop,Bakery,Pub,Park,Mexican Restaurant,Breakfast Spot,Café,Spa,Shoe Store,Restaurant
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Coffee Shop,Diner,Yoga Studio,Burrito Place,Juice Bar,Boutique,Café,Distribution Center,Discount Store,Beer Bar
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,0,Coffee Shop,Café,Bookstore,Restaurant,Clothing Store,Cosmetics Shop,Theater,Italian Restaurant,Ramen Restaurant,Hotel
22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,0,Café,Coffee Shop,Park,Farmers Market,Beer Bar,Cocktail Bar,Bakery,Cosmetics Shop,Japanese Restaurant,Restaurant
30,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Trail,Neighborhood,Pub,Health Food Store,Yoga Studio,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


In [28]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=zoom_start)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
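
The color scheme only needs one distinct color per cluster label. A reduced sketch of that part, assuming labels run from 0 to `kclusters - 1` as k-means produces them; `to_hex` is used here in place of `rgb2hex`.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5

# Sample the rainbow colormap at kclusters evenly spaced positions
# and convert each RGBA value to a hex string that folium accepts.
palette = [colors.to_hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Look up the color for each cluster label directly by index.
color_by_label = {label: palette[label] for label in range(kclusters)}
print(color_by_label)
```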

#### Analysing the clusters

In [29]:
merged.loc[merged['Cluster Labels'] == 0, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Downtown Toronto,0,Coffee Shop,Bakery,Pub,Park,Mexican Restaurant,Breakfast Spot,Café,Spa,Shoe Store,Restaurant
6,Downtown Toronto,0,Coffee Shop,Diner,Yoga Studio,Burrito Place,Juice Bar,Boutique,Café,Distribution Center,Discount Store,Beer Bar
13,Downtown Toronto,0,Coffee Shop,Café,Bookstore,Restaurant,Clothing Store,Cosmetics Shop,Theater,Italian Restaurant,Ramen Restaurant,Hotel
22,Downtown Toronto,0,Café,Coffee Shop,Park,Farmers Market,Beer Bar,Cocktail Bar,Bakery,Cosmetics Shop,Japanese Restaurant,Restaurant
31,Downtown Toronto,0,Coffee Shop,Beer Bar,Bakery,Cocktail Bar,Farmers Market,Seafood Restaurant,Cheese Shop,Café,Restaurant,Park
40,Downtown Toronto,0,Coffee Shop,Italian Restaurant,Spa,Burger Joint,Middle Eastern Restaurant,Gym / Fitness Center,Japanese Restaurant,Café,Bubble Tea Shop,Hotel
41,Downtown Toronto,0,Grocery Store,Café,Park,Baby Store,Diner,Restaurant,Italian Restaurant,Athletics & Sports,Candy Store,Coffee Shop
49,Downtown Toronto,0,Coffee Shop,Café,Pizza Place,Steakhouse,American Restaurant,Restaurant,Bar,Asian Restaurant,Hotel,Gym / Fitness Center
50,West Toronto,0,Bakery,Pharmacy,Middle Eastern Restaurant,Brewery,Café,Recording Studio,Bar,Supermarket,Bank,Brazilian Restaurant
58,Downtown Toronto,0,Coffee Shop,Aquarium,Plaza,Café,Hotel,Park,Bar,History Museum,IT Services,Salad Place


In [30]:
merged.loc[merged['Cluster Labels'] == 1, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,Central Toronto,1,Pool,Garden,Yoga Studio,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store


In [31]:
merged.loc[merged['Cluster Labels'] == 2, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
93,Central Toronto,2,Park,Swim School,Bus Line,Yoga Studio,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
103,Central Toronto,2,Park,Jewelry Store,Trail,Sushi Restaurant,Bus Line,Deli / Bodega,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
147,Downtown Toronto,2,Park,Playground,Trail,Yoga Studio,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [32]:
merged.loc[merged['Cluster Labels'] == 3, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
129,Central Toronto,3,Restaurant,Yoga Studio,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store


In [33]:
merged.loc[merged['Cluster Labels'] == 4, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,East Toronto,4,Trail,Neighborhood,Pub,Health Food Store,Yoga Studio,Dance Studio,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


### Conclusion

I see multiple problems with the above approach:
  * Treating the 10th most common venue category as equally important as the 1st obviously distorts the results
  * Using the minimum, maximum, and distribution of the weights might give additional insights, e.g. about the diversity of a neighborhood
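
The second point could be explored along these lines. This is only a sketch on hypothetical weights, and Shannon entropy is my own choice of diversity measure, not something taken from the course material: it is 0 when all venues fall in one category and grows as the weights spread out.

```python
import numpy as np
import pandas as pd

# Hypothetical frequency rows: A is mixed, B has a single category.
freq = pd.DataFrame(
    {'Coffee Shop': [0.5, 1.0], 'Bakery': [0.25, 0.0], 'Spa': [0.25, 0.0]},
    index=['A', 'B'],
)

def shannon_entropy(row):
    """Entropy of the non-zero weights of one row (natural logarithm)."""
    p = row[row > 0].to_numpy(dtype=float)
    return float(-(p * np.log(p)).sum())

summary = pd.DataFrame({
    'max_weight': freq.max(axis=1),
    'n_categories': (freq > 0).sum(axis=1),
    'entropy': freq.apply(shannon_entropy, axis=1),
})
print(summary)
```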
  
But for this assignment, that's it.