# Clustering Toronto Neighbourhoods
#### Part 4: Clustering and visualising the neighbourhoods

This notebook clusters Toronto neighbourhoods based on the similarity of nearby venue categories and visualises them using folium.

## Load libraries

In [1]:
# Third-party libraries
import requests # HTTP requests
import numpy as np # Arrays
import pandas as pd # Data structures

import folium # Visualising interactive maps

import matplotlib.pyplot as plt # Plotting simple maps
import matplotlib.cm as cm # Colourmaps
import matplotlib.colors as colors # Converting colours to hex strings
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler # Min Max Scaling for features

from sklearn.cluster import KMeans # KMeans clustering model

## Load Datasets

Here, the geographical information created in part 1 and the venue categories information created in part 2 are loaded.

In [2]:
# Toronto Neighbourhoods geographical information
tor_boro = pd.read_csv('tor_boro.csv')  

# Count of venue categories within 500m, 1000m and 2000m radii of Toronto neighbourhoods
# Stored in a dict keyed by radius for ease of use
R = [500, 1000, 2000]
toronto_venues = {r: pd.read_csv(f'toronto_venues_{r}.csv') for r in R}

## Define function

mapClusters fits a K-Means model for a given number of clusters and search radius, then maps the clustered neighbourhoods with folium.

In [3]:
def mapClusters(k: int, r: int) -> folium.Map:
    '''
    This function uses folium to map clustered neighbourhoods, which are distinguished
    by colour.
    
    K-Means clustering is fitted inside this function, so the scaled Toronto venues
    data (scaled_features, built from the getNearbyVenueCats output) must be
    defined before it is called.
    
    The input k is the number of clusters, while r is the search radius (in metres)
    used in the getNearbyVenueCats function.
    '''
    
    toronto_clusters = pd.DataFrame(toronto_venues[r].Neighbourhood).merge(
        tor_boro[['Neighbourhood', 'Latitude', 'Longitude']])

    #Get central coordinates for Toronto map
    latitude = tor_boro.Latitude.mean()
    longitude = tor_boro.Longitude.mean()

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

    # set color scheme for the clusters: k evenly spaced colours from the rainbow colourmap
    colors_array = cm.rainbow(np.linspace(0, 1, k))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    kmeans = KMeans(n_clusters=k, random_state=0).fit(scaled_features[r])
    
    # add markers to the map, coloured by cluster label (labels run from 0 to k-1)
    for lat, lon, poi, cluster in zip(toronto_clusters['Latitude'], toronto_clusters['Longitude'],
                                      toronto_clusters['Neighbourhood'], kmeans.labels_):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters

## Analysing Neighbourhoods

Now that we have a count of nearby venues, grouped by category, for each of the 39 Toronto neighbourhoods, we can proceed to cluster the neighbourhoods.

First we scale the venue category counts.

In [4]:
# Scale every category count to the range [0, 1], excluding the first column
scaled_features = {r: MinMaxScaler().fit_transform(toronto_venues[r].iloc[:, 1:]) for r in R}
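
As a small illustration of what this scaling step does, the sketch below applies MinMaxScaler to a made-up table of venue counts. The numbers are hypothetical and are not taken from the Toronto data; the point is only that each column is rescaled independently so that its smallest value becomes 0 and its largest becomes 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical venue counts: rows are neighbourhoods, columns are categories
counts = np.array([
    [0.0, 10.0, 4.0],
    [5.0, 30.0, 2.0],
    [10.0, 20.0, 0.0],
])

# Each column is scaled on its own: (x - column min) / (column max - column min)
scaled = MinMaxScaler().fit_transform(counts)
print(scaled)
```

Because every category ends up on the same 0 to 1 scale, a category with large raw counts should not dominate the distances that K-Means works with.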

## K-Means Clustering

K-Means clustering will be used to cluster neighbourhoods based on the similarity of their nearby venue category counts.

Silhouette coefficients for K-means models given various values of K, as well as their respective silhouette plots, were used to determine which combinations of radii and clusters should be investigated further. This can be seen in part 3 of the project.

The K-Means models are created and fitted to the scaled data in the mapClusters function.
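
For readers without part 3 to hand, the following is a minimal sketch of how mean silhouette scores could be compared across values of K. It is not the code from part 3: the data here are synthetic (two artificial blobs), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled features: two well-separated blobs in 3 dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.2, scale=0.03, size=(20, 3)),
    rng.normal(loc=0.8, scale=0.03, size=(20, 3)),
])

# Fit K-Means for several values of K and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest mean silhouette score is the candidate to investigate
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data the same loop would be run once per search radius, which is how combinations of radius and K can be shortlisted.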

## Mapping clusters

The mapClusters function was used to produce maps of the Toronto neighbourhoods given a specified number of clusters and search radius.

#### 500m, 2 cluster map

In [5]:
# set number of clusters and radius
k = 2
r = 500

mapClusters(k,r)

#### 2000m, 2 cluster map

In [6]:
# set number of clusters and radius
k = 2
r = 2000

mapClusters(k,r)

#### 1000m, 4 cluster map

In [7]:
# set number of clusters and radius
k = 4
r = 1000

mapClusters(k,r)

## Observations

#### 2 Clusters

As seen in part 3, the 500m radius, 2 cluster solution had the highest mean silhouette score, indicating the most cohesive and best-separated clusters of the combinations tested.

Similarly, for 2000m, the 2 cluster solution had the highest silhouette coefficient for that search radius.

Although these results do have relatively well-defined clusters, a 2 cluster solution does not give us much useful insight. As can be seen in the visualisations, with 2 clusters the neighbourhoods are simply split between those close to the centre and those further out. It is quite intuitive that neighbourhoods nearer to a city centre will have more venues nearby than those further away, and using K-Means clustering to confirm this is perhaps a little superfluous.

In fact, the only difference between the 500m and 2000m maps is that the central cluster covers a larger area for the larger search radius, again as we would expect.

We should, however, note that these clusters are not perfectly circular on the map. This suggests that, even close to the city centre, some neighbourhoods have a higher density of venues than others, and that venue density does not decline uniformly with distance from the centre.

#### 4 Clusters

The 1000m, 4 cluster solution produced a somewhat less predictable output. The outer neighbourhoods have been split with no obvious geographic pattern, while the central neighbourhoods appear to have been split into a group nearer to the university and a group nearer to the train station.

We should note, however, that this solution has the lowest silhouette score of those shown above, meaning that these clusters are likely less well defined than those of the other solutions. If we were to investigate the data points within these clusters, we would find that some neighbourhoods within the same cluster are not particularly similar. Checking the silhouette plot from part 3, this is particularly evident in clusters 2 and 3.
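
One way to carry out this kind of per-cluster check is with per-sample silhouette values. The sketch below is only an illustration under stated assumptions: it uses random synthetic data in place of the scaled 1000m features, so its numbers say nothing about the Toronto clusters themselves.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled features: 39 rows, 6 hypothetical categories
rng = np.random.default_rng(1)
X = rng.random((39, 6))

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per neighbourhood; negative values suggest a poor fit
sample_scores = silhouette_samples(X, labels)

# Summarise each cluster: size, mean silhouette, and count of negative values
for c in np.unique(labels):
    in_cluster = labels == c
    print(c, in_cluster.sum(),
          round(sample_scores[in_cluster].mean(), 3),
          int((sample_scores[in_cluster] < 0).sum()))
```

Clusters with a low mean or many negative values are the ones whose members are least alike.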


## Conclusions

While we have gained some insights into what types of venues are near the neighbourhoods around Toronto, it is worth noting that there are some limitations to this method and the solutions obtained.

Firstly, the data does not fall into strongly separated clusters. The mean silhouette scores are not particularly high, and the silhouette plots showed that it was hard to group neighbourhoods into clusters of similar size and separation (in the feature space).

One issue with the dataset may be the 100-result limit imposed by Foursquare. It is highly likely that many venues were missing for most of the inner Toronto neighbourhoods, potentially skewing results.
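
A rough way to see which neighbourhoods may be affected is to flag those whose total venue count reaches the limit. The sketch below assumes that each row's category counts sum to the number of venues returned for that neighbourhood, which may not hold for the real files; the table and its column names are hypothetical.

```python
import pandas as pd

RESULT_LIMIT = 100  # result cap mentioned above

# Hypothetical category counts per neighbourhood (not the real Toronto data)
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Food': [60, 10, 70],
    'Shop & Service': [40, 5, 30],
})

# Total venues per neighbourhood, assuming the counts cover every returned venue
totals = venues.drop(columns='Neighbourhood').sum(axis=1)

# Neighbourhoods at the cap probably had further venues that were not returned
capped = venues.loc[totals >= RESULT_LIMIT, 'Neighbourhood'].tolist()
print(capped)
```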

Another issue that could be addressed in future tests is feature selection. The presence of a very small number of 'College & University', 'Professional & Other Places' and 'Residence' venues in very few neighbourhoods made it very hard to cluster these neighbourhoods properly. When these categories were removed, the mean silhouette score increased in most cases and could even exceed 0.50.
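
A comparison of this kind could be sketched as follows. This is not the code used to obtain the result above: the table is synthetic, the helper mean_silhouette and the 'Food' and 'Shop & Service' columns are assumptions made for illustration, and the scores it prints carry no meaning for the Toronto data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def mean_silhouette(features: pd.DataFrame, k: int) -> float:
    """Scale the features, fit K-Means and return the mean silhouette score."""
    scaled = MinMaxScaler().fit_transform(features)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    return float(silhouette_score(scaled, labels))


# Synthetic stand-in for one toronto_venues table (39 hypothetical rows)
rng = np.random.default_rng(0)
venues = pd.DataFrame({
    'Neighbourhood': [f'N{i}' for i in range(39)],
    'Food': rng.integers(0, 40, 39),
    'Shop & Service': rng.integers(0, 30, 39),
    'College & University': (rng.random(39) < 0.1).astype(int),
    'Residence': (rng.random(39) < 0.1).astype(int),
})

# Sparse categories to drop before clustering
sparse_categories = ['College & University', 'Residence']

all_features = venues.drop(columns='Neighbourhood')
reduced = all_features.drop(columns=sparse_categories)

# Compare the mean silhouette score with and without the sparse categories
before = mean_silhouette(all_features, k=2)
after = mean_silhouette(reduced, k=2)
print(before, after)
```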

Including more features, such as crime rate, public transport accessibility, average purchase/rental cost, nearby schools and so on, may also reveal more similarities and dissimilarities that could be useful for accurate clustering.