# Assignment of Clustering Toronto Neighbourhoods
### The following is the first section of the assignment. 
### This section downloads the dataset and creates a dataframe with a comma-separated list of neighbourhoods for each Postcode

##### The following code imports all necessary libraries

In [2]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import matplotlib.cm as cm
import matplotlib.colors as colors
!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if folium is already installed
import folium # map rendering library
import requests
from bs4 import BeautifulSoup
# import k-means from clustering stage
from sklearn.cluster import KMeans



##### The following code loads the data from the Wikipedia page into a dataframe called df

In [3]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))[0]

##### The following removes the rows from the dataframe where Borough is equal to 'Not assigned'

In [4]:
df_remove_na = df[df.Borough != 'Not assigned']
df_remove_na = df_remove_na.reset_index(drop=True)

##### The following sets Neighbourhood to the Borough value wherever Neighbourhood is equal to 'Not assigned'

In [5]:
# work on a copy so the assignment below does not touch df_remove_na
df_assign_neighbour = df_remove_na.copy()
df_assign_neighbour.loc[(df_assign_neighbour.Neighbourhood == 'Not assigned'),'Neighbourhood'] = df_assign_neighbour.Borough

##### The following generates a dataframe called grouped, which is grouped by Postcode and Borough. A loop runs over every distinct Postcode and Borough pair. Inside this loop, each row of the original dataframe is checked for the same Postcode; on a match, its Neighbourhood value is appended to the variable Neighbour, separated by a comma. The result is one comma-separated Neighbourhood string per distinct Postcode and Borough

In [8]:
grouped = df_assign_neighbour.groupby(['Postcode','Borough'],as_index=False).count()

grouped_total = grouped.shape[0]
ungrouped_total = df_assign_neighbour.shape[0]
for i in range(0, grouped_total):
    Neighbour = ''
    for j in range(0, ungrouped_total):
        if grouped.iloc[i,0] == df_assign_neighbour.iloc[j,0]:
            if Neighbour != '':
                Neighbour = Neighbour +','
            Neighbour = Neighbour + df_assign_neighbour.iloc[j,2] 
    grouped.iloc[i,2]=Neighbour
grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
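
The nested loop above compares every grouped row with every original row. As a more compact alternative, the sketch below lets pandas do the concatenation with `groupby` and `agg`. It assumes the same column names (`Postcode`, `Borough`, `Neighbourhood`); the frame `demo` is made-up data for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same columns as df_assign_neighbour
demo = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Join the neighbourhood names of each (Postcode, Borough) group with commas
grouped_demo = (
    demo.groupby(['Postcode', 'Borough'])['Neighbourhood']
        .agg(','.join)
        .reset_index()
)
print(grouped_demo)
```

Applied to `df_assign_neighbour`, the same three chained calls should give one row per Postcode without the explicit loops.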


##### The following shows the shape (number of rows and columns) of the dataframe

In [9]:
grouped.shape

(103, 3)

### The following is the second section of the assignment. 
### This section is to assign latitudes and longitudes to the postal codes and generate a data frame 

##### We get the latitude and longitude data from the link provided and generate a dataframe called df_lat_lon_data

In [15]:
df_lat_lon_data = pd.read_csv("http://cocl.us/Geospatial_data")
df_lat_lon_data.columns=['Postcode','Latitude','Longitude']
df_lat_lon_data.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### We create a dataframe by merging the grouped data frame and df_lat_lon_data dataframe on the common column called Postcode. This generates a data frame called df_complete which has all the necessary information 

In [16]:
df_complete = pd.merge(grouped, df_lat_lon_data, on='Postcode', how='inner')
df_complete.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
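
An inner merge silently drops any Postcode that has no coordinates. The sketch below shows one way to check for that, using the `indicator` and `validate` options of `pd.merge` on made-up frames; the postcode `M9Z` is hypothetical and deliberately has no match.

```python
import pandas as pd

# Hypothetical toy frames; 'M9Z' deliberately has no coordinates
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# A left merge with indicator=True keeps every postcode and records its origin;
# validate='one_to_one' raises if a postcode is duplicated on either side
checked = pd.merge(left, right, on='Postcode', how='left',
                   validate='one_to_one', indicator=True)

# Postcodes that an inner merge would have dropped
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```

If `missing` is empty for the real data, the inner merge used above loses nothing.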


### The following is the third section of the assignment. 
### In this section we focus on those Boroughs with Toronto as part of their names, perform clustering and generate a map

##### We merge df_assign_neighbour (one row per Neighbourhood) with df_lat_lon_data on Postcode, keep only the rows whose Borough name contains Toronto, and store the result in a dataframe called df_Toronto

In [17]:
df_Toronto_dim = pd.merge(df_assign_neighbour, df_lat_lon_data, on='Postcode', how='inner')
df_Toronto = df_Toronto_dim[df_Toronto_dim['Borough'].str.contains('Toronto')]
df_Toronto = df_Toronto.reset_index(drop=True)
df_Toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,Ryerson,43.657162,-79.378937
3,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
4,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418


##### The following gets the latitude and longitude of Toronto city 

In [18]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
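
The geocoder returns `None` when it cannot resolve an address, in which case `location.latitude` raises an `AttributeError`. The sketch below guards against that. The helper name `geocode_or_default` and the stand-in `fake_geocode` are assumptions made for this example, so it runs without network access.

```python
def geocode_or_default(geocode, address, default):
    """Return (latitude, longitude) for address, or default if nothing is found.

    `geocode` is any callable with the same contract as Nominatim(...).geocode:
    it returns an object with .latitude and .longitude, or None.
    """
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoder for illustration only (hypothetical, not part of geopy)
class _FakeLocation:
    latitude, longitude = 43.653963, -79.387207


def fake_geocode(address):
    return _FakeLocation() if address == 'Toronto, ON' else None


print(geocode_or_default(fake_geocode, 'Toronto, ON', default=(0.0, 0.0)))
print(geocode_or_default(fake_geocode, 'Nowhere', default=(0.0, 0.0)))
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`.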


##### The following creates a map of Toronto with markers for each Borough and Neighbourhood

In [19]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Toronto['Latitude'], df_Toronto['Longitude'], df_Toronto['Borough'], df_Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto 

##### In the following section we will concentrate on "Central Toronto" Borough 

In [20]:
df_Central_Toronto = df_Toronto[df_Toronto.Borough == 'Central Toronto']
df_Central_Toronto = df_Central_Toronto.reset_index(drop=True)
df_Central_Toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197
3,M5P,Central Toronto,Forest Hill North,43.696948,-79.411307
4,M5P,Central Toronto,Forest Hill West,43.696948,-79.411307
5,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
6,M5R,Central Toronto,The Annex,43.67271,-79.405678
7,M5R,Central Toronto,North Midtown,43.67271,-79.405678
8,M5R,Central Toronto,Yorkville,43.67271,-79.405678
9,M4S,Central Toronto,Davisville,43.704324,-79.38879


##### The following creates a map of the Central Toronto area with a marker for each Neighbourhood

In [21]:
address = 'Central Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto are {}, {}.'.format(latitude, longitude))

# create map of Central Toronto using latitude and longitude values
map_ctoronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Central_Toronto['Latitude'], df_Central_Toronto['Longitude'], df_Central_Toronto['Borough'], df_Central_Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ctoronto)

map_ctoronto

The geographical coordinates of Central Toronto are 43.653963, -79.387207.


##### We get the credentials to access Foursquare

In [22]:
CLIENT_ID = '4YTPVVOGWFM5YMTDRXF3K2T02CHTVLQVLKCZJXOHK2Y2PI0E' # your Foursquare ID
CLIENT_SECRET = 'HWBBFRVWMYSSLDND4WIXXPQITS14GF3A5ZSI4IOE0TI4KBRN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 4YTPVVOGWFM5YMTDRXF3K2T02CHTVLQVLKCZJXOHK2Y2PI0E
CLIENT_SECRET:HWBBFRVWMYSSLDND4WIXXPQITS14GF3A5ZSI4IOE0TI4KBRN
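
Hard-coding and printing the secret exposes it to anyone who can read the notebook. A minimal sketch of an alternative is to read both values from environment variables and mask the secret when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not names required by Foursquare.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping such as os.environ.

    The key names used here are assumptions made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing))


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Illustration with a plain dict of placeholder values instead of the real environment
client_id, client_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print('CLIENT_SECRET: ' + mask(client_secret))
```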


##### We now explore the first Neighbourhood of Central Toronto

In [23]:
df_Central_Toronto.loc[0, 'Neighbourhood']

'Lawrence Park'

##### We now build the URL that requests venues near the first Neighbourhood, within a radius of 500 metres and with a limit of 100 entries

In [24]:
LIMIT = 100 # limit of number of venues returned by Foursquare API


radius = 500 # define radius


url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    df_Central_Toronto.loc[0, 'Latitude'], 
    df_Central_Toronto.loc[0, 'Longitude'], 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=4YTPVVOGWFM5YMTDRXF3K2T02CHTVLQVLKCZJXOHK2Y2PI0E&client_secret=HWBBFRVWMYSSLDND4WIXXPQITS14GF3A5ZSI4IOE0TI4KBRN&v=20180605&ll=43.7280205,-79.3887901&radius=500&limit=100'
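
The same URL can be assembled from a parameter dictionary, which keeps each value next to its name and leaves the escaping to the standard library. This is a sketch only: the function name `build_explore_url` is hypothetical and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL used above from a parameter dict."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter unescaped
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')


# Placeholder credentials for illustration only
demo_url = build_explore_url('ID', 'SECRET', '20180605', 43.7280205, -79.3887901, 500, 100)
print(demo_url)
```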

##### We send the request to this URL and parse the response as JSON

In [25]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e4046bb9239352901841637'},
 'response': {'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.7325205045, 'lng': -79.3825744605273},
   'sw': {'lat': 43.7235204955, 'lng': -79.3950057394727}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '50e6da19e4b0d8a78a0e9794',
       'name': 'Lawrence Park Ravine',
       'location': {'address': '3055 Yonge Street',
        'crossStreet': 'Lawrence Avenue East',
        'lat': 43.72696303913755,
        'lng': -79.39438246708775,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.72696303913755,
          'lng': -79.39438246708775}],
        'distance': 465,
        'c

##### The following function returns the category name of a venue, given the venue's information

In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

##### The following transforms the JSON response into a dataframe

In [28]:

from pandas.io.json import json_normalize # transform JSON into a pandas dataframe (on pandas >= 1.0, pd.json_normalize is the preferred entry point)
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Lawrence Park Ravine,Park,43.726963,-79.394382
1,Lake,Lake,43.72791,-79.386857
2,Zodiac Swim School,Swim School,43.728532,-79.38286
3,TTC Bus #162 - Lawrence-Donway,Bus Line,43.728026,-79.382805


##### The following function collects the nearby venues for each Neighbourhood, given its name, latitude and longitude

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
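
The list comprehension above indexes `categories[0]` directly, so a venue with an empty category list would raise an `IndexError`. The sketch below flattens the same item structure while tolerating that case. The helper name `extract_venue_rows` and the second demo venue are hypothetical.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten explore items into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written items in the shape the function above expects;
# the second venue is made up to exercise the empty-category case
demo_items = [
    {'venue': {'name': 'Lawrence Park Ravine',
               'location': {'lat': 43.72696303913755, 'lng': -79.39438246708775},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled place',
               'location': {'lat': 43.7279, 'lng': -79.3868},
               'categories': []}},
]
rows = extract_venue_rows('Lawrence Park', 43.72802, -79.38879, demo_items)
print(rows)
```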

##### In the following we get the list of all nearby venues for each Neighbourhood of Central Toronto

In [30]:
Central_Toronto_venues = getNearbyVenues(names=df_Central_Toronto['Neighbourhood'],
                                   latitudes=df_Central_Toronto['Latitude'],
                                   longitudes=df_Central_Toronto['Longitude']
                                  )


Lawrence Park
Roselawn
Davisville North
Forest Hill North
Forest Hill West
North Toronto West
The Annex
North Midtown
Yorkville
Davisville
Moore Park
Summerhill East
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West


##### We list the values of the dataframe containing details of Central Toronto venues

In [31]:
Central_Toronto_venues

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.728020,-79.388790,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.728020,-79.388790,Lake,43.727910,-79.386857,Lake
2,Lawrence Park,43.728020,-79.388790,Zodiac Swim School,43.728532,-79.382860,Swim School
3,Lawrence Park,43.728020,-79.388790,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
4,Roselawn,43.711695,-79.416936,Dr.Paul Hodges MIP,43.710634,-79.415810,Health & Beauty Service
5,Roselawn,43.711695,-79.416936,Rosalind's Garden Oasis,43.712189,-79.411978,Garden
6,Roselawn,43.711695,-79.416936,Aquatics Academy Inc.,43.709951,-79.412127,Pool
7,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
8,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
9,Davisville North,43.712751,-79.390197,Homeway Restaurant & Brunch,43.712641,-79.391557,Breakfast Spot


##### We get the count of entries for each Neighbourhood

In [32]:
Central_Toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Davisville,35,35,35,35,35,35
Davisville North,7,7,7,7,7,7
Deer Park,14,14,14,14,14,14
Forest Hill North,4,4,4,4,4,4
Forest Hill SE,14,14,14,14,14,14
Forest Hill West,4,4,4,4,4,4
Lawrence Park,4,4,4,4,4,4
Moore Park,4,4,4,4,4,4
North Midtown,21,21,21,21,21,21
North Toronto West,23,23,23,23,23,23


##### In the following, we create one column per category and, for each venue, mark its category with 1 and every other category with 0

In [33]:
# one hot encoding
central_Toronto_onehot = pd.get_dummies(Central_Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
central_Toronto_onehot['Neighbourhood'] = Central_Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [central_Toronto_onehot.columns[-1]] + list(central_Toronto_onehot.columns[:-1])
central_Toronto_onehot = central_Toronto_onehot[fixed_columns]

central_Toronto_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Asian Restaurant,BBQ Joint,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Supermarket,Sushi Restaurant,Swim School,Tennis Court,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,Lawrence Park,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Roselawn,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### We group the data by Neighbourhood and, for each category, compute the mean of the indicator values, i.e. the frequency of that category among the Neighbourhood's venues

In [34]:
central_Toronto_grouped = central_Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
central_Toronto_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Asian Restaurant,BBQ Joint,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Supermarket,Sushi Restaurant,Swim School,Tennis Court,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.057143,0.0,...,0.0,0.057143,0.0,0.0,0.028571,0.028571,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Deer Park,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0
3,Forest Hill North,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0
4,Forest Hill SE,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0
5,Forest Hill West,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0
6,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Moore Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
8,North Midtown,0.047619,0.0,0.047619,0.0,0.0,0.047619,0.0,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0
9,North Toronto West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.0,0.043478
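
To see why the mean of the one-hot columns is a frequency, here is a small sketch on made-up venues. The neighbourhood names `A` and `B` and the three categories are hypothetical.

```python
import pandas as pd

# Hypothetical venues: two neighbourhoods, three categories
venues_demo = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Cafe', 'Gym', 'Cafe', 'Cafe'],
})

# One indicator column per category (dtype=int keeps the columns numeric)
onehot_demo = pd.get_dummies(venues_demo['Venue Category'], dtype=int)
onehot_demo.insert(0, 'Neighbourhood', venues_demo['Neighbourhood'])

# The mean of 0/1 indicators is the share of venues in each category
freq_demo = onehot_demo.groupby('Neighbourhood').mean().reset_index()
print(freq_demo)
```

Neighbourhood `A` has two parks among four venues, so its `Park` value is 0.5, and the values in each row sum to 1.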


##### In the following function, we sort the categories in descending order and list the top n venues

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
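
As an illustration of what such a function returns, the sketch below applies an equivalent helper to a single made-up row. The name `top_categories` is hypothetical, and it uses `nlargest` rather than a full sort, which is a swapped-in technique and not the function defined above.

```python
import pandas as pd


def top_categories(row, num_top_venues):
    """Names of the num_top_venues categories with the highest frequency in row."""
    frequencies = row.drop('Neighbourhood').astype(float)
    return frequencies.nlargest(num_top_venues).index.tolist()


# Hypothetical row in the same layout as central_Toronto_grouped
row_demo = pd.Series({'Neighbourhood': 'Demo', 'Park': 0.1, 'Cafe': 0.5, 'Gym': 0.4})
print(top_categories(row_demo, 2))
```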

##### In the following code, we list the 10 most common venue categories for each Neighbourhood of Central Toronto

In [44]:

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = central_Toronto_grouped['Neighbourhood']

for ind in np.arange(central_Toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(central_Toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Sandwich Place,Dessert Shop,Pizza Place,Sushi Restaurant,Gym,Coffee Shop,Café,Italian Restaurant,Restaurant,Farmers Market
1,Davisville North,Hotel,Gym,Sandwich Place,Department Store,Park,Food & Drink Shop,Breakfast Spot,Gym / Fitness Center,Grocery Store,Greek Restaurant
2,Deer Park,Pub,Coffee Shop,Sports Bar,Vietnamese Restaurant,Light Rail Station,Liquor Store,Fried Chicken Joint,Pizza Place,Restaurant,American Restaurant
3,Forest Hill North,Bus Line,Trail,Jewelry Store,Sushi Restaurant,Gourmet Shop,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gas Station
4,Forest Hill SE,Pub,Coffee Shop,Sports Bar,Vietnamese Restaurant,Light Rail Station,Liquor Store,Fried Chicken Joint,Pizza Place,Restaurant,American Restaurant
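
The column names above come from a three-element suffix list with a fallback to `th`, which is correct for the ten columns used here. If `num_top_venues` were ever raised past 20, a small helper could handle the general case. The function `ordinal` below is a hypothetical sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[1], '|', columns[-1])
```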


##### The following code runs k-means to generate the clusters (here, 3) and prints the cluster label assigned to each of the first 10 rows

In [45]:

# set number of clusters
kclusters = 3

cToronto_grouped_clustering = central_Toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cToronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 2, 0, 2, 0, 1, 0, 0], dtype=int32)
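
The label numbers themselves are arbitrary; what matters is which rows share a label. A toy sketch with made-up frequency vectors shows this. The array `X` is hypothetical data, and `n_init` is set explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: rows 0-1 are park-heavy, rows 2-3 are cafe-heavy
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = model.labels_

# Rows with similar venue profiles end up with the same label
print(labels)
```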

##### In the following code, the Cluster labels are added as a column to the finalized data frame

In [46]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

cToronto_merged = df_Central_Toronto

# join the per-neighbourhood cluster labels and top venues onto the dataframe that holds latitude/longitude
cToronto_merged = cToronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

cToronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Bus Line,Lake,Park,Swim School,Gourmet Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gas Station,Yoga Studio
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Health & Beauty Service,Garden,Pool,Yoga Studio,Indian Restaurant,History Museum,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Gym,Sandwich Place,Department Store,Park,Food & Drink Shop,Breakfast Spot,Gym / Fitness Center,Grocery Store,Greek Restaurant
3,M5P,Central Toronto,Forest Hill North,43.696948,-79.411307,2,Bus Line,Trail,Jewelry Store,Sushi Restaurant,Gourmet Shop,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gas Station
4,M5P,Central Toronto,Forest Hill West,43.696948,-79.411307,2,Bus Line,Trail,Jewelry Store,Sushi Restaurant,Gourmet Shop,Food & Drink Shop,Fried Chicken Joint,Furniture / Home Store,Garden,Gas Station
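
A Neighbourhood for which no venues were returned has no row in the grouped data, so after a join like this one its `Cluster Labels` value would be `NaN` and the column would become floating point. The sketch below, on made-up frames, shows one way to detect and handle that before the labels are used as list indices. All frame names here are hypothetical.

```python
import pandas as pd

# Hypothetical frames: 'C' returned no venues, so it has no cluster label
neighbourhoods = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
labels_demo = pd.DataFrame({'Neighbourhood': ['A', 'B'], 'Cluster Labels': [0, 2]})

merged_demo = neighbourhoods.join(labels_demo.set_index('Neighbourhood'), on='Neighbourhood')

# The unmatched row gets NaN, which turns the whole column into floats
unlabelled = merged_demo[merged_demo['Cluster Labels'].isna()]

# Drop unlabelled rows and restore integer labels
merged_clean = merged_demo.dropna(subset=['Cluster Labels']).copy()
merged_clean['Cluster Labels'] = merged_clean['Cluster Labels'].astype(int)
print(merged_clean)
```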


##### Lastly, we create a cluster map using the finalized dataframe

In [47]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(cToronto_merged['Latitude'], cToronto_merged['Longitude'], cToronto_merged['Neighbourhood'], cToronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters