# Clustering Toronto Neighbourhoods
#### Part 5: Exploration and Conclusions
 
In this final notebook of the project, we explore the 1000 metre, 5 cluster solution.

Finally, all observations and notes are summarised.

In [1]:
# Third party libraries
import pandas as pd # Data structures

from sklearn.preprocessing import MinMaxScaler # Min Max Scaling for features

from sklearn.cluster import KMeans # KMeans clustering model

## Load Datasets

Here, the geographical information created in part 1 and the venue categories information created in part 2 are loaded.

In [3]:
# Toronto Neighbourhoods geographical information
tor_boro = pd.read_csv('tor_boro.csv')  

# Count of venue categories within 1000m radius of Toronto Neighbourhoods
# Stored in a dict for ease of use
R=[1000]
toronto_venues = {r:pd.read_csv('toronto_venues_'+str(r)+'.csv') for r in R}

## Define function

clusterdf is used for obtaining a dataframe of Neighbourhood names as well as their geographical location and cluster number.

In [4]:
def clusterdf(k: int,r: int) -> pd.DataFrame: 
    '''
    Return a dataframe containing Neighbourhood names together with their
    geographical location and cluster number.
    
    KMeans clustering is fitted inside this function on the scaled Toronto
    venues data (scaled_features), which must be defined before the function
    is called. The venue data is obtained from the getNearbyVenueCats function.
    
    The input k is the number of clusters to use, while r is the radius
    (in metres) used in the getNearbyVenueCats function.
    '''
    kmeans = KMeans(n_clusters=k, random_state=0).fit(scaled_features[r])
      
    return pd.concat(
        [pd.DataFrame(toronto_venues[r].Neighbourhood).merge(
            tor_boro[['Neighbourhood', 'Latitude', 'Longitude']]),
         pd.Series(kmeans.labels_, name='Cluster')],
        axis=1)


### Cluster information

Below, we dive a little deeper into the clusters in the 1000m, 5 cluster solution.

First, we must scale our datasets again.

In [5]:
# Scale every column except the first (the Neighbourhood name)
scaled_features = {
    r: MinMaxScaler().fit_transform(toronto_venues[r].iloc[:, 1:])
    for r in R
}
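To illustrate what the scaling step does, here is a minimal sketch on a small made-up dataframe (the neighbourhood names and counts are hypothetical, not the real Toronto data). Each venue category column is rescaled independently to the range 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy counts standing in for toronto_venues[r]
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Food': [10, 30, 50],
    'Nightlife Spot': [0, 4, 8],
})

# Scale every column except the first (the neighbourhood name)
scaled = MinMaxScaler().fit_transform(toy.iloc[:, 1:])

# Each column now has minimum 0 and maximum 1
print(scaled)
```

Because each column is scaled separately, a rare category such as Residence carries as much weight in the distance calculation as a common one such as Food.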

### Exploration of 1000 metre, 5 cluster solution

Below we create a dataframe of the mean number of venues in each category across the neighbourhoods in each cluster, for the 5 cluster, 1000m search radius solution.

In [17]:
# set number of clusters and radius
k = 5
r = 1000

# Get clusters information as DataFrame
toronto_clusters = clusterdf(k,r)

# Create clusters dict for cluster analysis
clusters = {n:list(toronto_clusters.Neighbourhood.loc[toronto_clusters.Cluster == n]) 
            for n in range(k)}

# Create list of mean venue category counts for the neighbourhoods in each cluster
clusters_cat_list = [pd.Series(toronto_venues[r][toronto_venues[r].Neighbourhood.isin(
    clusters[n])].reset_index(drop=True).drop('Neighbourhood',axis=1).mean(
    ).sort_values(ascending=False),name = 'Cluster '+str(n))
 for n in range(k)]

clusters_cat_df = pd.DataFrame(clusters_cat_list).T

clusters_cat_df

| Venue category | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|---|
| Food | 55.8 | 18.222222 | 59.0 | 44.0 | 49.909091 |
| Shop & Service | 17.1 | 5.888889 | 20.5 | 13.857143 | 17.545455 |
| Arts & Entertainment | 8.2 | 0.555556 | 2.0 | 12.571429 | 2.272727 |
| Nightlife Spot | 7.1 | 0.888889 | 7.0 | 6.285714 | 8.181818 |
| Outdoors & Recreation | 6.5 | 7.111111 | 9.0 | 12.571429 | 9.272727 |
| Travel & Transport | 2.8 | 0.444444 | 1.5 | 9.428571 | 0.818182 |
| Professional & Other Places | 1.8 | 0.111111 | 0.0 | 1.285714 | 0.454545 |
| College & University | 0.7 | 0.222222 | 0.0 | 0.0 | 0.090909 |
| Residence | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
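The same kind of per-cluster mean table could also be built with a merge followed by a groupby. The sketch below is an alternative to the list comprehension used above, not the method this notebook ran, and it uses hypothetical toy data in place of `toronto_venues[r]` and `toronto_clusters`.

```python
import pandas as pd

# Hypothetical toy data standing in for toronto_venues[r]
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Food': [50, 60, 10, 20],
    'Shop & Service': [20, 10, 4, 6],
})

# Hypothetical cluster labels standing in for toronto_clusters
labels = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Cluster': [0, 0, 1, 1],
})

# Attach the cluster label to each neighbourhood, then average per cluster
merged = venues.merge(labels, on='Neighbourhood')
cluster_means = (
    merged.drop(columns='Neighbourhood')
          .groupby('Cluster')
          .mean()
          .T                       # categories as rows, clusters as columns
          .add_prefix('Cluster ')  # column labels become 'Cluster 0', 'Cluster 1'
)

print(cluster_means)
```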


We can also display the mean total number of venues per neighbourhood in each cluster:

In [21]:
clusters_cat_df.sum(numeric_only=True)

Cluster 0    100.000000
Cluster 1     33.444444
Cluster 2    100.000000
Cluster 3    100.000000
Cluster 4     88.545455
dtype: float64

From this we can see that all neighbourhoods in clusters 0, 2 and 3 have 100 venues (the default limit from the Foursquare API). Clusters 1 (mean = 33.4 venues) and 4 (mean = 88.5 venues) clearly have a lower density of nearby venues.
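One way to check directly which neighbourhoods hit the venue limit is to sum the category counts per row and compare against the cap. This is a minimal sketch on hypothetical toy counts; the `LIMIT` constant simply restates the 100 venue limit mentioned above.

```python
import pandas as pd

LIMIT = 100  # the venue limit mentioned above

# Hypothetical toy counts standing in for toronto_venues[r]
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Food': [60, 20, 70],
    'Shop & Service': [40, 10, 18],
})

# Total venues per neighbourhood, then flag the ones at the cap
totals = venues.set_index('Neighbourhood').sum(axis=1)
at_cap = totals >= LIMIT

print(totals)
print(at_cap)
```

Neighbourhoods flagged here may have more venues nearby than the data shows, so their category counts are best read as proportions of a truncated sample.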

We can also view the number of neighbourhoods in each cluster:

In [25]:
[len(clusters[n]) for n in clusters]

[10, 9, 2, 7, 11]

Note that these values can also be seen as the widths of the clusters in the silhouette plots from Part 3. Clusters 0, 1, 3 and 4 are relatively similar in size.
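The cluster sizes can also be read straight from the label column with `value_counts`. The sketch below uses a hypothetical toy dataframe with the same column names as `toronto_clusters`.

```python
import pandas as pd

# Hypothetical toy labels with the same columns as toronto_clusters
toy_clusters = pd.DataFrame({
    'Neighbourhood': list('ABCDE'),
    'Cluster': [0, 1, 0, 2, 0],
})

# Number of neighbourhoods per cluster, ordered by cluster number
sizes = toy_clusters['Cluster'].value_counts().sort_index()

print(sizes)
```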

As mentioned previously, we also observed in the silhouette plot that clusters 0 and 4 contain neighbourhoods that aren't particularly similar. We can view the venue counts for the neighbourhoods in cluster 0:

In [None]:
toronto_venues[1000][toronto_venues[1000].Neighbourhood.isin(
    clusters[0])]

| Index | Neighbourhood | Arts & Entertainment | College & University | Food | Nightlife Spot | Outdoors & Recreation | Professional & Other Places | Residence | Shop & Service | Travel & Transport |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Brockton, Parkdale Village, Exhibition Place | 9 | 0 | 49 | 10 | 9 | 2 | 0 | 20 | 1 |
| 4 | Central Bay Street | 10 | 1 | 52 | 8 | 5 | 1 | 0 | 18 | 5 |
| 6 | Church and Wellesley | 4 | 0 | 58 | 5 | 6 | 2 | 0 | 20 | 5 |
| 13 | Garden District, Ryerson | 11 | 0 | 60 | 4 | 8 | 1 | 0 | 12 | 4 |
| 17 | Kensington Market, Chinatown, Grange Park | 6 | 1 | 54 | 12 | 4 | 1 | 0 | 21 | 1 |
| 23 | Queen's Park, Ontario Provincial Government | 6 | 1 | 62 | 5 | 4 | 2 | 0 | 17 | 3 |
| 24 | Regent Park, Harbourfront | 8 | 0 | 55 | 6 | 11 | 3 | 0 | 16 | 1 |
| 29 | St. James Town | 10 | 0 | 55 | 7 | 7 | 1 | 0 | 15 | 5 |
| 34 | The Annex, North Midtown, Yorkville | 10 | 1 | 55 | 5 | 8 | 3 | 0 | 17 | 1 |
| 38 | University of Toronto, Harbord | 8 | 3 | 58 | 9 | 3 | 2 | 0 | 15 | 2 |



# Final Observations

While we have gained some insights into what types of venues are near the neighbourhoods around Toronto, it is worth noting that there are some limitations to the solutions.

Firstly, the data is not organised in a very clustered manner. The silhouette scores are not particularly high.
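For reference, average silhouette scores across several values of k can be computed with `sklearn.metrics.silhouette_score`. The sketch below is an illustration only: it uses random, unstructured features (39 rows and 9 columns, chosen to match the shape of the 1000m data) rather than the real scaled venue data, and the `n_init=10` setting is an assumption, not a value taken from this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical features in [0, 1]: uniform noise with no cluster structure
rng = np.random.default_rng(0)
X = rng.random((39, 9))

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

Scores close to 1 indicate well separated clusters, while scores near 0 indicate overlapping ones, so comparing the real data against a baseline like this gives a rough sense of how much structure is present.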



One limitation of the 500m dataset, 5 cluster case may be that, due to the small search radius, some Neighbourhoods have very few results (the toronto_venues[500] dataframe can be explored for more information).

Issue with 100 limit?

Higher k -> single neighbourhood clusters? (4 in 10 cluster solution and a cluster of 20)
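A quick way to check for single neighbourhood clusters at a given k is to count the labels with `np.bincount`. This is a sketch on random stand-in features, so the counts it prints say nothing about the Toronto data; the `n_init=10` setting is likewise an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scaled venue features
rng = np.random.default_rng(0)
X = rng.random((39, 9))

# For each k, count how many clusters contain exactly one point
singletons = {}
for k in (5, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sizes = np.bincount(labels, minlength=k)
    singletons[k] = int((sizes == 1).sum())

print(singletons)
```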


Higher limit with more features (crime rate, public transport, average cost, nearby schools and so on)