# IBM Coursera Capstone Week 3 Assignment
## Clustering Neighbourhoods in Toronto
### Oct 19, 2019

### Part 1 - Cleaning neighbourhood data

#### Import required libraries

In [1]:
import pandas as pd
import numpy as np
import urllib.request, urllib.parse, urllib.error
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

#### Scrape the table from Wikipedia and turn it into a dataframe

In [2]:
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Data cleanup
#### 1. Remove postcodes with "Not assigned" boroughs

In [3]:
df = df[df.Borough != 'Not assigned'].copy()  # copy so later assignments do not target a view
# df.reset_index(inplace=True)
# df.drop('index',axis = 1)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


#### 2. If a cell has a borough but a 'Not assigned' neighbourhood, set the neighbourhood to the borough name

In [4]:
for i, row in df.iterrows():
    # fall back to the borough name when no neighbourhood is assigned
    if row['Neighbourhood'] == 'Not assigned':
        df.loc[i, 'Neighbourhood'] = row['Borough']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
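The row-by-row loop above can also be expressed as one vectorised assignment with a boolean mask. This is a minimal sketch on a hand-made frame (`sample` is a hypothetical stand-in for `df`, and its second row is invented to exercise the rule), not the notebook's own code:

```python
import pandas as pd

# Illustrative stand-in for df; the second row is made up for the example
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Boolean mask of rows whose neighbourhood is missing
mask = sample['Neighbourhood'] == 'Not assigned'

# Copy the borough name into those rows only
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']
print(sample)
```

On a table of this size the speed difference is negligible; the main benefit is that the intent fits on one line.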


#### 3. Merge neighbourhoods with the same postcode

In [5]:
df_2 = df.groupby(by=['Postcode','Borough']).agg(lambda x: ', '.join(x))
df_2.reset_index(inplace=True)
df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
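To see what the `groupby` and `', '.join` step does, here is a small sketch on three rows taken from the earlier output. The frame name `rows` is an assumption made for the example:

```python
import pandas as pd

# Three rows copied from the cleaned table shown earlier
rows = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ', '
merged = (
    rows.groupby(['Postcode', 'Borough'])['Neighbourhood']
        .agg(', '.join)
        .reset_index()
)
print(merged)
```

The two M5A rows collapse into a single row whose neighbourhood is "Harbourfront, Regent Park".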


#### Number of rows in the cleaned up dataframe

In [6]:
df_2.shape

(103, 3)

### Part 2 - Retrieve coordinates for each neighbourhood

#### Fetch the csv

In [7]:
!wget -q -O 'toneighbourhood_location.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [8]:
location_df = pd.read_csv('toneighbourhood_location.csv')
location_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Create two empty columns for Lat and Long

In [9]:
df_2['Latitude'] = ''
df_2['Longitude'] = ''

df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",,
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",,
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


#### For loop to match Lat Long into df_2

In [10]:
# Latitude and Longitude were created as empty columns in the previous cell.
# Copy each postal code's coordinates into the matching row of df_2.
for i, loc_row in location_df.iterrows():
    for j, hood_row in df_2.iterrows():
        if hood_row['Postcode'] == loc_row['Postal Code']:
            # .loc avoids chained assignment such as df_2['Latitude'][j] = ...
            df_2.loc[j, 'Latitude'] = loc_row['Latitude']
            df_2.loc[j, 'Longitude'] = loc_row['Longitude']

df_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.8067,-79.1944
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.7845,-79.1605
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7636,-79.1887
3,M1G,Scarborough,Woburn,43.771,-79.2169
4,M1H,Scarborough,Cedarbrae,43.7731,-79.2395
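Matching coordinates by postal code is a join, so `DataFrame.merge` can do the same work as the nested loops. This is a hedged sketch using two small stand-in frames (`neighbourhoods` and `locations` are hypothetical names, with values copied from the outputs above):

```python
import pandas as pd

# Small stand-ins for df_2 and location_df
neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})
locations = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],  # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Left join on the postal code, then drop the duplicate key column
merged = neighbourhoods.merge(
    locations, how='left', left_on='Postcode', right_on='Postal Code'
).drop(columns='Postal Code')
print(merged)
```

A left join keeps every neighbourhood row even when no coordinates are found, and the coordinate columns stay numeric instead of starting life as empty strings.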


#### Do the same for df so that the neighbourhoods are not merged for clustering.

In [11]:
df['Latitude'] = ''
df['Longitude'] = ''

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,,
3,M4A,North York,Victoria Village,,
4,M5A,Downtown Toronto,Harbourfront,,
5,M5A,Downtown Toronto,Regent Park,,
6,M6A,North York,Lawrence Heights,,


In [12]:
for i, loc_row in location_df.iterrows():
    for j, hood_row in df.iterrows():
        if hood_row['Postcode'] == loc_row['Postal Code']:
            # .loc avoids chained assignment such as df['Latitude'][j] = ...
            df.loc[j, 'Latitude'] = loc_row['Latitude']
            df.loc[j, 'Longitude'] = loc_row['Longitude']
        
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.7533,-79.3297
3,M4A,North York,Victoria Village,43.7259,-79.3156
4,M5A,Downtown Toronto,Harbourfront,43.6543,-79.3606
5,M5A,Downtown Toronto,Regent Park,43.6543,-79.3606
6,M6A,North York,Lawrence Heights,43.7185,-79.4648


### Part 3 - Cluster neighbourhoods in Toronto

#### Import required libraries

The assignment is to explore and cluster the neighbourhoods in Toronto. This notebook works only with boroughs whose names contain the word "Toronto" and replicates the analysis previously done on the New York City data.

In [13]:
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.9.11  |       hecc5488_0         144 KB  conda-forge
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1.20.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

#### Retrieve Toronto's coordinates using geopy

In [15]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Visualize all neighbourhoods in GTA

In [16]:
# create map of Toronto using latitude and longitude values
map_gta = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_2['Latitude'], df_2['Longitude'], df_2['Borough'], df_2['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_gta)  
    
map_gta

#### Define FourSquare Credentials

In [17]:
CLIENT_ID = 'YUI2MPDATWZOO5AYH0QVYIGA2KXNMRZ2AIEZNZUXRW21VFV2' # your Foursquare ID
CLIENT_SECRET = 'QYDEO5VQ0J2EZMK5HQMIDXBRFGCASXJKRZVYYDRRD5CDGXTZ' # your Foursquare Secret
VERSION = '20191019' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YUI2MPDATWZOO5AYH0QVYIGA2KXNMRZ2AIEZNZUXRW21VFV2
CLIENT_SECRET:QYDEO5VQ0J2EZMK5HQMIDXBRFGCASXJKRZVYYDRRD5CDGXTZ
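Hard-coding and printing the credentials exposes them to anyone who can read the notebook. One common alternative is to read them from environment variables. The sketch below assumes the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are hypothetical and not defined anywhere in this notebook:

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    The variable names used here are an assumption made for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret

# Usage with a stand-in mapping instead of the real environment
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'demo-id',
    'FOURSQUARE_CLIENT_SECRET': 'demo-secret',
}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_foursquare_credentials(demo_env)
print('Credentials loaded:', bool(CLIENT_ID_DEMO and CLIENT_SECRET_DEMO))
```

Printing only whether the values were loaded, rather than the values themselves, keeps the secret out of the saved cell output.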


### Cluster Toronto neighbourhoods into 5 clusters using k-means

#### Keep only neighbourhoods in boroughs containing the word "Toronto"

In [41]:
df.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)
toronto_data = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.6543,-79.3606
1,M5A,Downtown Toronto,Regent Park,43.6543,-79.3606
2,M5B,Downtown Toronto,Ryerson,43.6572,-79.3789
3,M5B,Downtown Toronto,Garden District,43.6572,-79.3789
4,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754


#### Function that extracts the category of the venue

In [19]:
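def get_category_type(row):
    # the categories may be stored under either key, depending on the response layout
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # return the name of the first category, or None if the venue has none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']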

#### getNearbyVenues function to fetch venues within 500 m of each neighbourhood in Toronto

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
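The function above indexes `v['venue']['categories'][0]` directly, which raises an `IndexError` for a venue that has no category. Below is a minimal sketch of a more tolerant row extractor. The helper name `extract_venues` and the sample payload are hypothetical; the payload is hand-written in the shape the function above reads, not real API output:

```python
def extract_venues(name, lat, lng, items):
    """Flatten response items into rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when the venue carries no category at all
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Hand-written items for illustration only
sample_items = [
    {'venue': {'name': 'Cafe A',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = extract_venues('Harbourfront', 43.65426, -79.360636, sample_items)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept before the one-hot encoding step, depending on the analysis.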

In [43]:
LIMIT = 100

toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])

Harbourfront
Regent Park
Ryerson
Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide
King
Richmond
Dovercourt Village
Dufferin
Harbourfront East
Toronto Islands
Union Station
Little Portugal
Trinity
The Danforth West
Riverdale
Design Exchange
Toronto Dominion Centre
Brockton
Exhibition Place
Parkdale Village
The Beaches West
India Bazaar
Commerce Court
Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North
Forest Hill West
High Park
The Junction South
North Toronto West
The Annex
North Midtown
Yorkville
Parkdale
Roncesvalles
Davisville
Harbord
University of Toronto
Runnymede
Swansea
Moore Park
Summerhill East
Chinatown
Grange Park
Kensington Market
Deer Park
Forest Hill SE
Rathnelly
South Hill
Summerhill West
CN Tower
Bathurst Quay
Island airport
Harbourfront West
King and Spadina
Railway Lands
South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown
St. James Town
First Canadian Place
Underground city


#### Look at the dataframe and the size of dataframe generated

In [44]:
print(toronto_venues.shape)
toronto_venues.head()

(3307, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Toronto Cooper Koo Family Cherry St YMCA Centre,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


#### Number of venues around each neighbourhood

In [45]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Adelaide,100,100,100,100,100,100
Bathurst Quay,16,16,16,16,16,16
Berczy Park,56,56,56,56,56,56
Brockton,23,23,23,23,23,23
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16
CN Tower,16,16,16,16,16,16
Cabbagetown,43,43,43,43,43,43
Central Bay Street,86,86,86,86,86,86
Chinatown,100,100,100,100,100,100
Christie,16,16,16,16,16,16


#### Number of unique venue categories

In [46]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 240 unique categories.


#### One hot encoding

In [47]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

#### Size of toronto_onehot data frame

In [49]:
toronto_onehot.shape
toronto_onehot

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Group by Toronto neighborhoods

In [50]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store
0,Adelaide,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.030000,...,0.000000,0.00000,0.00,0.010000,0.000000,0.000000,0.000000,0.010000,0.000000,0.01
1,Bathurst Quay,0.000000,0.000000,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
2,Berczy Park,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.000000,0.00000,0.00,0.017857,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
3,Brockton,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
4,Business Reply Mail Processing Centre 969 Eastern,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
5,CN Tower,0.000000,0.000000,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
6,Cabbagetown,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
7,Central Bay Street,0.011628,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.011628,...,0.000000,0.00000,0.00,0.011628,0.000000,0.011628,0.000000,0.011628,0.000000,0.00
8,Chinatown,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.010000,0.00000,0.00,0.050000,0.000000,0.000000,0.040000,0.010000,0.000000,0.00
9,Christie,0.000000,0.000000,0.0000,0.0000,0.0000,0.000,0.0000,0.000,0.000000,...,0.000000,0.00000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00
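Taking the mean of one-hot columns per neighbourhood yields the share of venues in each category, which is why the values in the table above lie between 0 and 1. Here is a small sketch on invented data (the frame `venues` and its rows are assumptions made for illustration):

```python
import pandas as pd

# Toy venue list; the rows are invented for illustration
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bakery', 'Bakery', 'Park', 'Bar', 'Park', 'Park'],
})

# One 0/1 indicator column per category, as in the one-hot step
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Neighbourhood A has four venues, two of them bakeries, so its Bakery value is 0.5; each row sums to 1.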


#### Top 5 venues in each neighborhood

In [51]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2       Steakhouse  0.04
3  Thai Restaurant  0.04
4              Bar  0.04


----Bathurst Quay----
              venue  freq
0   Airport Service  0.19
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3          Boutique  0.06
4   Harbor / Marina  0.06


----Berczy Park----
            venue  freq
0     Coffee Shop  0.07
1    Cocktail Bar  0.05
2        Beer Bar  0.04
3  Farmers Market  0.04
4     Cheese Shop  0.04


----Brockton----
            venue  freq
0            Café  0.13
1  Breakfast Spot  0.09
2     Coffee Shop  0.09
3    Climbing Gym  0.04
4   Grocery Store  0.04


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.12
1          Comic Shop  0.06
2       Auto Workshop  0.06
3                Park  0.06
4    Recording Studio  0.06


----CN Tower----
              venue  freq
0   Airport Service  0.19
1    Airport Lounge  0.12

#### Sort venues in decreasing frequency

In [52]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Top 10 venues for each neighborhood

In [53]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' suffix beyond the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Coffee Shop,Café,Thai Restaurant,Steakhouse,Bar,Burger Joint,Restaurant,Sushi Restaurant,Asian Restaurant,Hotel
1,Bathurst Quay,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
2,Berczy Park,Coffee Shop,Cocktail Bar,Café,Seafood Restaurant,Cheese Shop,Farmers Market,Steakhouse,Bakery,Beer Bar,Indian Restaurant
3,Brockton,Café,Coffee Shop,Breakfast Spot,Grocery Store,Bakery,Bar,Stadium,Burrito Place,Sandwich Place,Restaurant
4,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Pizza Place,Auto Workshop,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Comic Shop,Park,Recording Studio


#### Cluster neighborhoods

Use k-means to cluster the neighbourhoods into 5 clusters.

In [54]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 2, 3, 3, 3, 2, 3, 3, 3, 3], dtype=int32)
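`KMeans` is an unsupervised clustering method: it needs no labels and groups rows by how close their feature vectors are. This differs from k-nearest neighbours, which is a supervised classifier. The following sketch uses a toy matrix `X` (invented values standing in for category frequencies) to show the behaviour:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency rows: two neighbourhoods dominated by the first category,
# two dominated by the second (values invented for illustration)
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# n_init is set explicitly because its default differs across scikit-learn versions
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, so rows 0 and 1 share one label and rows 2 and 3 share the other.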

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [57]:
# add clustering labels (the guard lets the cell be re-run without a ValueError)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join neighborhoods_venues_sorted onto toronto_data to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,Harbourfront,43.6543,-79.3606,3,Coffee Shop,Pub,Park,Bakery,Café,Restaurant,Theater,Mexican Restaurant,Breakfast Spot,Greek Restaurant
1,M5A,Downtown Toronto,Regent Park,43.6543,-79.3606,3,Coffee Shop,Pub,Park,Bakery,Café,Restaurant,Theater,Mexican Restaurant,Breakfast Spot,Greek Restaurant
2,M5B,Downtown Toronto,Ryerson,43.6572,-79.3789,3,Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Café,Middle Eastern Restaurant,Ice Cream Shop,Pizza Place,Sporting Goods Shop,Tea Room
3,M5B,Downtown Toronto,Garden District,43.6572,-79.3789,3,Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Café,Middle Eastern Restaurant,Ice Cream Shop,Pizza Place,Sporting Goods Shop,Tea Room
4,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754,3,Coffee Shop,Café,Restaurant,Italian Restaurant,Bakery,Breakfast Spot,Hotel,Gastropub,Clothing Store,Park


Visualize the clusters.

In [58]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examining each cluster

#### Cluster 1

In [60]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
49,Central Toronto,0,Playground,Summer Camp,Restaurant,Trail,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Department Store,Doner Restaurant
50,Central Toronto,0,Playground,Summer Camp,Restaurant,Trail,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Department Store,Doner Restaurant


#### Cluster 2

In [61]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,Central Toronto,1,Park,Lawyer,Swim School,Bus Line,Women's Store,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
34,Central Toronto,1,Trail,Park,Sushi Restaurant,Jewelry Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
35,Central Toronto,1,Trail,Park,Sushi Restaurant,Jewelry Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Women's Store
66,Downtown Toronto,1,Park,Playground,Trail,Building,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


#### Cluster 3

In [62]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
59,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
60,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
61,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
62,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
63,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
64,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court
65,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Sculpture Garden,Coffee Shop,Boat or Ferry,Bar,Airport Gate,Airport Food Court


#### Cluster 4

In [63]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,3,Coffee Shop,Pub,Park,Bakery,Café,Restaurant,Theater,Mexican Restaurant,Breakfast Spot,Greek Restaurant
1,Downtown Toronto,3,Coffee Shop,Pub,Park,Bakery,Café,Restaurant,Theater,Mexican Restaurant,Breakfast Spot,Greek Restaurant
2,Downtown Toronto,3,Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Café,Middle Eastern Restaurant,Ice Cream Shop,Pizza Place,Sporting Goods Shop,Tea Room
3,Downtown Toronto,3,Coffee Shop,Clothing Store,Cosmetics Shop,Italian Restaurant,Café,Middle Eastern Restaurant,Ice Cream Shop,Pizza Place,Sporting Goods Shop,Tea Room
4,Downtown Toronto,3,Coffee Shop,Café,Restaurant,Italian Restaurant,Bakery,Breakfast Spot,Hotel,Gastropub,Clothing Store,Park
5,East Toronto,3,Health Food Store,Trail,Pub,Pizza Place,Greek Restaurant,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
6,Downtown Toronto,3,Coffee Shop,Cocktail Bar,Café,Seafood Restaurant,Cheese Shop,Farmers Market,Steakhouse,Bakery,Beer Bar,Indian Restaurant
7,Downtown Toronto,3,Coffee Shop,Italian Restaurant,Café,Ice Cream Shop,Sandwich Place,Burger Joint,Juice Bar,Salad Place,Bubble Tea Shop,Spa
8,Downtown Toronto,3,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Diner,Restaurant,Nightclub,Baby Store,Coffee Shop
9,Downtown Toronto,3,Coffee Shop,Café,Thai Restaurant,Steakhouse,Bar,Burger Joint,Restaurant,Sushi Restaurant,Asian Restaurant,Hotel


#### Cluster 5

In [64]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Central Toronto,4,Garden,Pool,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


### Insights:

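#### Boroughs with "Toronto" in their names are mostly located in the central part of the Greater Toronto Area.
#### Cluster 4 (label 3) contains the most neighbourhoods, and their most common venues are mostly food-related, such as coffee shops, cafés and restaurants.
#### Cluster 3 (label 2) is an outlier: it groups the neighbourhoods around Billy Bishop Airport, where airport-related venues dominate.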
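
Cluster sizes can be checked by counting the labels. The sketch below tallies only the first ten labels printed earlier in the notebook, so it illustrates the method rather than reproducing the full result:

```python
from collections import Counter

# First ten labels, copied from the kmeans.labels_[0:10] output shown earlier
first_ten_labels = [3, 2, 3, 3, 3, 2, 3, 3, 3, 3]

# Count how many neighbourhoods fall under each label, largest first
counts = Counter(first_ten_labels).most_common()
print(counts)  # [(3, 8), (2, 2)]
```

On the full result, `toronto_merged['Cluster Labels'].value_counts()` gives the same kind of tally for every neighbourhood.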
