<h1>Toronto neighborhoods clustering</h1>

Firstly I import all libraries that will be needed during the project.

In [1]:
import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim
import requests
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
import re
print('Libraries imported.')

Libraries imported.


<h2>1. Create dataset</h2>

Now I read Wiki page into <i>wiki1</i>, then find the needed table using regular expression and finally handle the header and all rows using regular expressions. As result I get the table from Wikipedia page in a form of DataFrame <i>df</i>.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
re1 = "<table[^>]*><tbody><tr><th>Postcode.*M9Z</td><td>Not assigned</td><td>Not assigned</td></tr></tbody></table>"
wiki1 = requests.get(url).text.replace('\n','')
table = re.search(re1, wiki1).group(0)
columns = re.findall("<th>([^<>]*)</th>",table)
rows = re.findall("(<tr><td>(?:(?!<tr>).)*</tr>)", table)
df = pd.DataFrame(columns=(columns+['Link']))
for row in rows:
    re2 = '<td>([^<>]*)</td><td>(?:<a[^<>]*>)*([^<>]*)(?:</a>)*</td><td>(?:<a[^<>]*href\s*=\s*"([^"]*)"[^>]*>)*([^<>]*)(?:</a>)*</td>'
    values = re.findall(re2, row)
    norm_values = [tuple((values[0][0],values[0][1],values[0][3],values[0][2]))]
    df = pd.concat([df,pd.DataFrame(norm_values,columns=df.columns.values)],axis=0)
df.reset_index(drop=True,inplace=True)

Next I convert the table in accordance to the task.

In [3]:
df2 = df[df['Borough']!='Not assigned'].drop('Link',axis=1)
df2.columns = ['PostalCode','Borough','Neighborhood']
df2['Neighborhood'].replace(to_replace='Not assigned', value=df2['Borough'], inplace=True)
PostalCodes = df2.groupby('PostalCode').agg({'Borough': 'first',
                                             'Neighborhood': lambda x: ', '.join(x)})
PostalCodes.reset_index(inplace=True)
PostalCodes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [4]:
PostalCodes.shape

(103, 3)

In [5]:
coord = pd.read_csv('Geospatial_Coordinates.csv')
coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [6]:
PostalCodes['Latitude'] = coord.loc[coord['Postal Code']==PostalCodes['PostalCode'],'Latitude']
PostalCodes['Longitude'] = coord.loc[coord['Postal Code']==PostalCodes['PostalCode'],'Longitude']
PostalCodes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


So now we have coordinates of the postal codes that represent one or several neighborhoods. Let's visualize in on the map.

In [7]:
address = 'Toronto'
geolocator = Nominatim(user_agent="bs")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [8]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, code, borough in zip(PostalCodes['Latitude'], PostalCodes['Longitude'],
                                   PostalCodes['PostalCode'], PostalCodes['Borough']):
    label = '{}, {}'.format(code, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
map_toronto

Because map output desappears after sharing on GitHub I'm going to firstly save all maps and provide you a link to open these maps.

In [9]:
from IPython.core.display import display, HTML
map_toronto.save('map1.html')
display(HTML('<a href="http://www.aspanda.ru/map1.html">Map1</a>'))

<h2>2. Explore the postal code zones</h2>

In [10]:
# Credentials were removed before notebook import
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20190415'
LIMIT = 100

In [11]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

I decided to cluster not neighborhoods but postal codes. And I used not the same approach that was applied in study case.<br>
So firstly I stack all venues got from foursquare located within the certain radius from the centers of postal codes. But in order to match the venue with postal code I don't use information from initial query, I use the postal code provided with the response. Because the area of postal code zones can vary a lot I use 3 different radii and then drop appeared duplicates.

In [12]:
rads = list([500,1400,4000]) #3 different radii to apply
LIMIT = 100

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.postalCode', 'venue.location.lat', 'venue.location.lng']
#Additionally to name, category, latitude and longitude I need also postal code

#Create the DataFrame that will contains all found venues 
all_venues = pd.DataFrame(columns=filtered_columns)
for code, lat, lng in zip(PostalCodes['PostalCode'],PostalCodes['Latitude'],PostalCodes['Longitude']):
    for radius in rads:
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, radius, LIMIT)
        results = requests.get(url).json()
        venues = results['response']['groups'][0]['items']
        norm_venues = json_normalize(venues) # flatten JSON
        if norm_venues.shape[0] > 0:
            all_venues = pd.concat([all_venues, norm_venues.loc[:, filtered_columns]],axis=0)

# filter the category for each row
all_venues['venue.categories'] = norm_venues.apply(get_category_type, axis=1)

# clean columns
all_venues.columns = [col.split(".")[-1] for col in all_venues.columns]

all_venues.head()

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,name,categories,postalCode,lat,lng
0,Wendy's,Grocery Store,,43.807448,-79.199056
0,Images Salon & Spa,Grocery Store,M1B 3W3,43.802283,-79.198565
1,Canadiana exhibit,Hotel,,43.817962,-79.193374
2,Caribbean Wave,Steakhouse,M1B,43.798558,-79.195777
3,Staples Morningside,Hotel,M1B 5N7,43.800285,-79.196607


In [13]:
all_venues.drop_duplicates(inplace=True) #drop duplicates
all_venues.dropna(inplace=True, axis=0) #drop rows with missed values
all_venues['postalCode'] = all_venues['postalCode'].str[:3] #slice first 3 letters from postal code
all_venues.reset_index(drop=True, inplace=True)

mask = all_venues['postalCode'].isin(PostalCodes['PostalCode'])
all_venues.drop(all_venues.index.values[~mask], axis=0, inplace=True)
#drop all rows that have postal code that it's not in predefined list of postal codes
#it can be wrong values or values outside Toronro or something else

all_venues.shape

(12126, 5)

In [14]:
print('There are {} uniques categories.'.format(len(all_venues['categories'].unique())))

There are 50 uniques categories.


<h2>3. Analyze the postal code zones</h2>

So now I'm looking for share of different categories of venues for each postal code.

In [15]:
# one hot encoding
toronto_onehot = pd.get_dummies(all_venues[['categories']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['PostalCode'] = all_venues['postalCode'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,PostalCode,Airport,Airport Lounge,American Restaurant,Asian Restaurant,Automotive Shop,Bakery,Bank,Beer Garden,Breakfast Spot,...,Shoe Repair,Skating Rink,Smoothie Shop,Steakhouse,Sushi Restaurant,Tea Room,Thai Restaurant,Theme Park,Train Station,Wings Joint
0,M1B,0.0,0.021505,0.043011,0.021505,0.010753,0.010753,0.0,0.010753,0.010753,...,0.0,0.010753,0.010753,0.053763,0.021505,0.0,0.0,0.0,0.021505,0.032258
1,M1C,0.0,0.0,0.043478,0.0,0.0,0.043478,0.0,0.0,0.0,...,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.008475,0.076271,0.033898,0.042373,0.0,0.016949,0.016949,0.016949,0.008475,...,0.016949,0.025424,0.008475,0.059322,0.008475,0.016949,0.008475,0.0,0.016949,0.016949
3,M1G,0.0,0.083333,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.010753,0.043011,0.010753,0.032258,0.0,0.032258,0.021505,0.0,0.0,...,0.010753,0.010753,0.010753,0.075269,0.032258,0.021505,0.010753,0.0,0.021505,0.0


And form the table of most common venues for each postal code:

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for i in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[i, 1:] = return_most_common_venues(toronto_grouped.iloc[i, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Coffee Shop,Hotel,Chinese Restaurant,Steakhouse,American Restaurant,Mediterranean Restaurant,Diner,Rental Car Location,Fast Food Restaurant,Wings Joint
1,M1C,Hotel,Coffee Shop,Chinese Restaurant,Diner,Japanese Restaurant,Garden,Rental Car Location,Restaurant,Racecourse,Bakery
2,M1E,Coffee Shop,Hotel,Airport Lounge,Steakhouse,Fast Food Restaurant,Asian Restaurant,Mediterranean Restaurant,American Restaurant,Diner,Caribbean Restaurant
3,M1G,Hotel,Coffee Shop,Paper / Office Supplies Store,Airport Lounge,Garden,Automotive Shop,Steakhouse,Sandwich Place,Deli / Bodega,Wings Joint
4,M1H,Hotel,Coffee Shop,Chinese Restaurant,Steakhouse,Fast Food Restaurant,Airport Lounge,Diner,Racecourse,Bakery,Asian Restaurant


<h2>4. Cluster the postal code zones</h2>

Run Kmeans to cluster the postal codes into 5 clusters.

In [18]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=17).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 4, 3, 4, 1, 1, 0, 3, 0, 3], dtype=int32)

Merge the table of postal codes with most common venues and table with cluster labels

In [19]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = PostalCodes.loc[PostalCodes['PostalCode'].isin(neighborhoods_venues_sorted['PostalCode']),:]

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('PostalCode'), on='PostalCode')

toronto_merged.head(12) # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Coffee Shop,Hotel,Chinese Restaurant,Steakhouse,American Restaurant,Mediterranean Restaurant,Diner,Rental Car Location,Fast Food Restaurant,Wings Joint
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,4,Hotel,Coffee Shop,Chinese Restaurant,Diner,Japanese Restaurant,Garden,Rental Car Location,Restaurant,Racecourse,Bakery
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,3,Coffee Shop,Hotel,Airport Lounge,Steakhouse,Fast Food Restaurant,Asian Restaurant,Mediterranean Restaurant,American Restaurant,Diner,Caribbean Restaurant
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Hotel,Coffee Shop,Paper / Office Supplies Store,Airport Lounge,Garden,Automotive Shop,Steakhouse,Sandwich Place,Deli / Bodega,Wings Joint
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Hotel,Coffee Shop,Chinese Restaurant,Steakhouse,Fast Food Restaurant,Airport Lounge,Diner,Racecourse,Bakery,Asian Restaurant
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476,1,American Restaurant,Steakhouse,Bakery,Burger Joint,Hotel,Fast Food Restaurant,Asian Restaurant,Coffee Shop,Airport Lounge,Japanese Restaurant
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029,0,Hotel,Steakhouse,Coffee Shop,Chinese Restaurant,Diner,Fast Food Restaurant,Garden,Asian Restaurant,Mediterranean Restaurant,Sushi Restaurant
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577,3,Coffee Shop,Hotel,Steakhouse,Airport Lounge,Fast Food Restaurant,Chinese Restaurant,Rental Car Location,American Restaurant,Grocery Store,Sandwich Place
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476,0,Hotel,Coffee Shop,Diner,Burger Joint,Garden,Steakhouse,Chinese Restaurant,Rental Car Location,Train Station,American Restaurant
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848,3,Coffee Shop,Hotel,Steakhouse,Sandwich Place,Fast Food Restaurant,Diner,Airport Lounge,Bakery,Shoe Repair,Caribbean Restaurant


Visualize the clusters

In [20]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [21]:
map_clusters.save('map2.html')
display(HTML('<a href="http://www.aspanda.ru/map2.html">Map2</a>'))

<h2>5. Colorize the map</h2>

Additionally I want to fully colorize our map.<br>
I form the DataFrame that contains only latitude and longitude of each venue as well as the cluster of the postal code of this venue

In [22]:
coord_x_cluster = all_venues
coord_x_cluster = coord_x_cluster.join(toronto_merged[['PostalCode','Cluster Labels']].set_index('PostalCode'), 
                                       on='postalCode', how='left')
coord_x_cluster.drop(['name','categories','postalCode'],axis=1,inplace=True)
coord_x_cluster.head()

Unnamed: 0,lat,lng,Cluster Labels
0,43.802283,-79.198565,1
1,43.798558,-79.195777,1
2,43.800285,-79.196607,1
3,43.800106,-79.198258,1
4,43.802008,-79.19808,1


Initialize and fit the model that match the cluster to coordinates. I use the nearest neighbor for it.

In [23]:
from sklearn.neighbors import KNeighborsClassifier

In [24]:
x = coord_x_cluster[['lat','lng']].values
y = coord_x_cluster['Cluster Labels'].values
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

And finally make the visualization

In [25]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10, min_zoom=10, max_zoom=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#define minimal and maximal value of latitude and longitude
lat_min = toronto_merged['Latitude'].min()
lat_max = toronto_merged['Latitude'].max()
lng_min = toronto_merged['Longitude'].min()
lng_max = toronto_merged['Longitude'].max()

#create the ranges of latitude and longitude and define the distance between the values
lng_range = np.linspace(lng_min, lng_max, 36)
lat_range = np.linspace(lat_min, lat_max, 24)
lng_size_one = (lng_max - lng_min)/35
lat_size_one = (lat_max - lat_min)/23

#create the list of points, I will use hexagonal markers, so I form corresponding net
points = list()
for lng in lng_range[0::2]:
    for lat in lat_range:
        points.append([lat,lng])
for lng in lng_range[1::2]:
    for lat in (lat_range[1:]+lat_range[:-1])/2:
        points.append([lat,lng])

#remove points that are too far from Toronto (actually from the venues)
points_to_remove=list()
for point in points:
    if (coord_x_cluster.loc[(coord_x_cluster['lat']>point[0]-lat_size_one)&(coord_x_cluster['lat']<point[0]+lat_size_one)&(coord_x_cluster['lng']>point[1]-lng_size_one)&(coord_x_cluster['lng']<point[1]+lng_size_one),:].size) == 0 :
        points_to_remove.append(point)
for point in points_to_remove:
    points.remove(point)
    
#make the prediction of class for each point in the list
np_points = np.array(points)
point_cluster = model.predict(np_points)

#make the hexagonal marker for each point
for lat, lon, cluster in zip(np_points[:,0],np_points[:,1],point_cluster):
    color_int = int(cluster)-1
    folium.RegularPolygonMarker(
        [lat, lon],
        radius=5,
        number_of_sides=6,
        color=rainbow[color_int],
        opacity=0.3,
        fill_color=rainbow[color_int],
        fill_opacity=0.44).add_to(map_clusters)
       
map_clusters

In [26]:
map_clusters.save('map3.html')
display(HTML('<a href="http://www.aspanda.ru/map3.html">Map3</a>'))

Seems to me it looks nice, maybe not very informatively because I lose the popups but colorful and beautiful. Thank you for reading this notebook.