## Segmenting and Clustering Toronto Neighborhoods

For this project, I will explore, segment, and cluster neighborhoods in Toronto based on the most common category of the venues around each neighborhood. 

The data for this project comes from a table on Wikipedia. Because the source is a single table, I will use the Pandas library to scrape it and clean it up, then look up the location (latitude and longitude) of each neighborhood with the Geopy package. To visualize the neighborhoods, I will use Folium to build a map of Toronto and mark each neighborhood on it.

To explore each neighborhood, I will make API calls to Foursquare to retrieve information about the venues around it. I will convert the JSON response into a dataframe, extract the 10 most common types of venues for each neighborhood, and use K-Means clustering to group neighborhoods by their common venue types. Finally, I will examine each cluster. Because the location data is limited and all the neighborhoods are in Toronto, the differences between clusters can be hard to recognize.

Techniques used: _web-table scraping, data clean-up and manipulation using Pandas, making API calls to Foursquare to retrieve location data, map visualization using Folium, K-Means clustering_.

### Scrape the table from Wikipedia and convert it into a dataframe

In [216]:
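import json

import numpy as np  # used later for np.arange and np.linspace
import pandas as pd
import requests  # used later for the Foursquare API calls
import folium
from geopy.geocoders import Nominatim
from pandas import json_normalize  # the pandas.io.json import path is deprecated
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors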

In [92]:
#Read Wikipedia Table with pandas
wikiurl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
canada = pd.read_html(wikiurl)[0]

# Keep only the rows that have an assigned borough, then renumber the index from 0
canada = canada[canada['Borough'] != 'Not assigned']
canada = canada.reset_index(drop=True)
canada

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
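
The filtering step above can be tried without network access. The following is a minimal sketch, assuming a small hand-made table in place of the scraped one; the `M1A` and `M2A` rows and the helper name `drop_unassigned` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the scraped Wikipedia table (no network needed).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})


def drop_unassigned(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a real borough and renumber the index from 0."""
    kept = df[df["Borough"] != "Not assigned"]
    return kept.reset_index(drop=True)


clean = drop_unassigned(raw)
print(clean)
```

Using `reset_index(drop=True)` avoids creating and then deleting a temporary `index` column.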


In [93]:
# Check whether any neighbourhood still has a 'Not assigned' value
# and print the shape of the dataframe
print('Not assigned' in canada['Neighbourhood'].tolist())
print('The shape of the canada dataframe:', canada.shape)

False
The shape of the canada dataframe: (103, 3)


### Create a map of Toronto and mark every neighbourhood in the `canada` dataframe

In [94]:
## To place a marker for each neighbourhood, I first need its coordinates.
## Some rows list several neighbourhoods under 'Neighbourhood'. The geocoder takes
## a single address per request, so I keep only the first neighbourhood of each row.

canada['Neigh_for_location'] = canada['Neighbourhood'].str.split(',').str[0]
canada.head()

| | Postal Code | Borough | Neighbourhood | Neigh_for_location |
|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | Parkwoods |
| 1 | M4A | North York | Victoria Village | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | Regent Park |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | Queen's Park |


In [96]:
## Loop through the neighbourhoods to find their coordinates.
## geocode() returns None when it cannot resolve an address, and a request can also
## fail, so unresolved rows get (0, 0) as a placeholder to be fixed by hand later.
latitude = []
longitude = []

# Create the geocoder once instead of on every iteration
geolocator = Nominatim(user_agent="toronto_explorer")

for neighbourhood in canada['Neigh_for_location']:
    try:
        location = geolocator.geocode(neighbourhood + ', Toronto')
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    except Exception:
        latitude.append(0)
        longitude.append(0)
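
A variation on the loop above is sketched below. It takes the geocoding function as a parameter, records `None` rather than 0 for unresolved addresses, and pauses between requests (the public Nominatim service asks for at most one request per second). The helper name `geocode_all` and the fake geocoder are assumptions made for illustration; in the notebook, `geolocator.geocode` would be passed instead.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def geocode_all(
    names: Iterable[str],
    geocode: Callable[[str], Optional[object]],
    suffix: str = ", Toronto",
    delay_seconds: float = 1.0,
) -> List[Tuple[Optional[float], Optional[float]]]:
    """Return one (latitude, longitude) pair per name, or (None, None) if unresolved."""
    coords = []
    for name in names:
        try:
            location = geocode(name + suffix)
        except Exception:
            location = None  # treat a failed request like an unresolved address
        if location is None:
            coords.append((None, None))
        else:
            coords.append((location.latitude, location.longitude))
        time.sleep(delay_seconds)  # stay under the service's rate limit
    return coords


# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Parkwoods, Toronto": FakeLocation(43.7588, -79.320197)}
    return known.get(address)


demo = geocode_all(["Parkwoods", "Nowhere"], fake_geocode, delay_seconds=0)
print(demo)
```

Storing `None` instead of 0 makes the unresolved rows easy to find with `isna()` and avoids accidentally plotting a marker at latitude 0, longitude 0.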

In [107]:
## Identify rows without latitude and longitude to manually input values
canada['latitude'] = latitude
canada['longitude'] = longitude
canada.loc[canada['latitude'] == 0]

## Searching online and looking for coordinates for five neighbourhoods
## Only five rows are missing, so manually inputting values is not a big deal
canada.loc[21,'latitude'] = 43.6899
canada.loc[21,'longitude'] = -79.4552
canada.loc[56,'latitude'] = 43.6855
canada.loc[56,'longitude'] = -79.4704
canada.loc[76,'latitude'] = 43.5890
canada.loc[76,'longitude'] = -79.6441
canada.loc[92,'latitude'] = 43.6548
canada.loc[92,'longitude'] = -79.3883
canada.loc[100,'latitude'] = 43.6912
canada.loc[100,'longitude'] = -79.3417

## Verifying for those rows
canada.loc[canada['latitude'] == 0]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Neigh_for_location,latitude,longitude
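
The manual corrections above could also be kept in one dictionary and applied in a loop, which makes them easier to review. This is a minimal sketch on a toy dataframe; the row indices 1 and 2 are illustrative, and the coordinate values are two of the pairs entered above.

```python
import pandas as pd

# Toy dataframe: rows 1 and 2 failed geocoding and hold the 0 placeholder.
df = pd.DataFrame({
    "latitude": [43.7588, 0.0, 0.0],
    "longitude": [-79.320197, 0.0, 0.0],
})

# Manual corrections keyed by row index: (latitude, longitude).
manual_coords = {
    1: (43.6899, -79.4552),
    2: (43.6855, -79.4704),
}

for idx, (lat, lon) in manual_coords.items():
    df.loc[idx, ["latitude", "longitude"]] = [lat, lon]

# Verify that no placeholder rows remain.
remaining = df[df["latitude"] == 0]
print(len(remaining))
```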


In [113]:
## Using Folium to mark all the neighborhoods
map_toronto = folium.Map(location=[43.6534817, -79.3839347], zoom_start=10)


for lat, lng, borough, neighborhood in zip(canada['latitude'], canada['longitude'], canada['Borough'], canada['Neigh_for_location']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  

map_toronto

### Explore the neighborhoods using Foursquare API

#### Use Parkwoods as an example and retrieve the venues around it from Foursquare

In [123]:
## Prepare client information to make API calls to Foursquare
## Credentials are read from environment variables instead of being hard-coded in the notebook
## (the environment variable names below are arbitrary)
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare API Client ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare API Client Secret
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN'] # Foursquare OAuth access token
VERSION = '20200605' # Foursquare API version
radius = 500 # to find venues within 500 meters of the specified neighborhood
LIMIT = 30 # A default Foursquare API limit value

In [119]:
canada.head()

| | Postal Code | Borough | Neighbourhood | Neigh_for_location | latitude | longitude |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | Parkwoods | 43.7588 | -79.320197 |
| 1 | M4A | North York | Victoria Village | Victoria Village | 43.732658 | -79.311189 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | Regent Park | 43.660706 | -79.360457 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | Lawrence Manor | 43.722079 | -79.437507 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | Queen's Park | 43.659659 | -79.39034 |


In [198]:
len(canada.Neigh_for_location.unique())

95

In [124]:
Parklati = canada.loc[0,'latitude']
Parklong = canada.loc[0,'longitude']
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, Parklati, Parklong,ACCESS_TOKEN, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=43.7587999,-79.3201966&oauth_token=<ACCESS_TOKEN>&v=20200605&radius=500&limit=30'
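
The same request URL can be assembled from a dictionary of query parameters, which keeps each parameter visible and handles escaping. The sketch below uses only the standard library; the helper name `build_explore_url` and the placeholder credential strings are assumptions for illustration, and the `oauth_token` parameter is left out for brevity.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20200605", radius=500, limit=30):
    """Build a Foursquare v2 'explore' URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials, not real ones.
url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 43.7588, -79.320197)
print(url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is equivalent to the literal comma once the server decodes the query string.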

In [166]:
# Send the GET request and examine the results
# The response body is JSON, which .json() parses into a Python dictionary
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '601610d0c431252311f9ff34'},
 'notifications': [{'type': 'notificationTray', 'item': {'unreadCount': 0}}],
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 17,
  'suggestedBounds': {'ne': {'lat': 43.7632999045, 'lng': -79.31397776399336},
   'sw': {'lat': 43.7542998955, 'lng': -79.32641543600664}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b8991cbf964a520814232e3',
       'name': "Allwyn's Bakery",
       'location': {'address': '81 Underhill drive',
        'lat': 43.75984035203157,
        'lng': -

In [142]:
# Helper that extracts the primary category name of a venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened dataframe names this column 'venue.categories'
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

In [146]:
# Now I can clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Allwyn's Bakery | Caribbean Restaurant | 43.75984 | -79.324719 |
| 1 | LCBO | Liquor Store | 43.757774 | -79.314257 |
| 2 | Shoppers Drug Mart | Pharmacy | 43.760857 | -79.324961 |
| 3 | Petro-Canada | Gas Station | 43.75795 | -79.315187 |
| 4 | TD Canada Trust | Bank | 43.757569 | -79.314976 |


In [147]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

17 venues were returned by Foursquare.
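
To see what the flattening step does, it helps to run it on a tiny hand-written list shaped like the `items` list in the response. The sketch below assumes that shape and reuses the first two venues from the table above; it is an illustration, not a replay of the real API response.

```python
import pandas as pd

# Two items shaped like response['groups'][0]['items'], trimmed to the fields used here.
items = [
    {"venue": {"name": "Allwyn's Bakery",
               "categories": [{"name": "Caribbean Restaurant"}],
               "location": {"lat": 43.75984, "lng": -79.324719}}},
    {"venue": {"name": "LCBO",
               "categories": [{"name": "Liquor Store"}],
               "location": {"lat": 43.757774, "lng": -79.314257}}},
]

# Nested keys become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)

# Keep the first category name of each venue.
flat["venue.categories"] = flat["venue.categories"].apply(lambda cats: cats[0]["name"])

flat = flat[["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]]

# Strip the dotted prefixes so the columns read name, categories, lat, lng.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```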


**Create a function to repeat the same process to all the neighborhoods in Toronto**

In [149]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
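        # return only relevant information for each nearby venue
        # (a venue can have no category, so fall back to None instead of raising IndexError)
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'] if v['venue']['categories'] else None) for v in results])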

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
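
The function above assumes every response contains at least one group. A more defensive way to read the payload is sketched below; the helper names `extract_items` and `venue_rows` are hypothetical, and the sample payloads are hand-written to mimic the response structure shown earlier.

```python
def extract_items(payload):
    """Return the list of recommended items, or [] if the payload has none."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])


def venue_rows(name, lat, lng, items):
    """Turn raw items into (neighborhood, lat, lng, venue, vlat, vlng, category) tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payloads: one normal, one with no groups (e.g. an error response).
ok_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "LCBO", "categories": [{"name": "Liquor Store"}],
               "location": {"lat": 43.757774, "lng": -79.314257}}},
]}]}}
empty_payload = {"meta": {"code": 400}}

rows = venue_rows("Parkwoods", 43.7588, -79.320197, extract_items(ok_payload))
no_rows = venue_rows("Parkwoods", 43.7588, -79.320197, extract_items(empty_payload))
print(rows, no_rows)
```

With this approach a neighborhood that returns nothing contributes zero rows instead of stopping the whole loop with a `KeyError` or `IndexError`.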

In [188]:
toronto_venues = getNearbyVenues(names=canada['Neigh_for_location'],
                                   latitudes=canada['latitude'],
                                   longitudes=canada['longitude']
                                  )

Parkwoods
Victoria Village
Regent Park
Lawrence Manor
Queen's Park
Islington Avenue
Malvern
Don Mills
Parkview Hill
Garden District
Glencairn
West Deane Park
Rouge Hill
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate
Guildwood
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor
Thorncliffe Park
Richmond
Dufferin
Scarborough Village
Fairview
Northwood Park
East Toronto
Harbourfront East
Little Portugal
Kennedy Park
Bayview Village
Downsview
The Danforth West
Toronto Dominion Centre
Brockton
Golden Mile
York Mills
Downsview
India Bazaar
Commerce Court
North Park
Humber Summit
Cliffside
Willowdale
Downsview
Studio District
Bedford Park
Del Ray
Humberlea
Birch Cliff
Willowdale
Downsview
Lawrence Park
Roselawn
Runnymede
Weston
Dorset Park
York Mills West
Davisville North
Forest Hill North & West
High Park
Westmount
Wexford
Willowdale
North Toronto West
The Annex
Parkdale
Canada Post Gateway P

In [189]:
# Checking the resulting dataframe's shape
print(f'The resulting dataframe\'s shape is {toronto_venues.shape}')

The resulting dataframe's shape is (1841, 7)


In [190]:
toronto_venues.head()
# Find out how many unique categories there are among all the returned venues
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique()))

There are 250 unique categories.


### Analyzing neighborhoods in Toronto

**Compute the frequency of each venue category and create a dataframe showing the 10 most common types of venues for each neighborhood**

In [194]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues['Venue Category'], prefix="", prefix_sep="")

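# One venue category is itself named 'Neighborhood'; drop that column (if present)
# so it does not clash with the neighborhood-name column added next
toronto_onehot = toronto_onehot.drop(columns='Neighborhood', errors='ignore')
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 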

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

Unnamed: 0,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Aquarium,Art Gallery,Art Museum,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1836,Mimico NW,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1837,Mimico NW,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1838,Mimico NW,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1839,Mimico NW,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [195]:
toronto_onehot.shape

(1841, 250)

Group rows by neighborhood and take the mean frequency of occurrence of each category.

In [196]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Aquarium,Art Gallery,Art Museum,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.066667,0.0,0.0,0.0,0.0,0.0
1,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
2,Bathurst Manor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.066667,0.033333,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
4,Bedford Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,Willowdale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
90,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.050000,0.0,0.0,0.0,0.0,0.0
91,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
92,York Mills,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
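
The one-hot encoding and the grouped mean are easier to follow on a toy example. In the sketch below, the neighborhoods `A` and `B` and their venues are made up; the point is that the mean of the 0/1 columns within a neighborhood equals each category's share of that neighborhood's venues.

```python
import pandas as pd

# Made-up venues: neighborhood A has 3 venues, B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Park", "Cafe", "Cafe", "Bank"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of the 0/1 indicators = share of each category per neighborhood.
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1 across the category columns, so neighborhoods with very different venue counts become comparable.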


In [200]:
## Print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.20
1          Food Court  0.13
2    Asian Restaurant  0.13
3         Coffee Shop  0.07
4              Bakery  0.07


----Alderwood----
         venue  freq
0  Pizza Place   0.4
1          Gym   0.2
2          Pub   0.2
3  Coffee Shop   0.2
4  Music Venue   0.0


----Bathurst Manor----
               venue  freq
0  Korean Restaurant  0.13
1        Coffee Shop  0.07
2        Video Store  0.07
3     Ice Cream Shop  0.07
4          Gift Shop  0.03


----Bayview Village----
             venue  freq
0             Bank  0.15
1   Breakfast Spot  0.08
2      Pizza Place  0.08
3   Sandwich Place  0.08
4  Bubble Tea Shop  0.08


----Bedford Park----
                             venue  freq
0             Gym / Fitness Center   1.0
1                              ATM   0.0
2                Mobile Phone Shop   0.0
3       Modern European Restaurant   0.0
4  Molecular Gastronomy Restaurant   0.0


----Berczy Park----
...
4         Burger Joint  0.03


----The Danforth West----
           venue  freq
0  Grocery Store  0.07
1       Pharmacy  0.07
2    Coffee Shop  0.07
3       Bus Line  0.07
4    Pizza Place  0.04


----The Kingsway----
                venue  freq
0  Italian Restaurant  0.07
1                Bank  0.07
2        Dessert Shop  0.07
3      Breakfast Spot  0.07
4                 Pub  0.07


----Thorncliffe Park----
                venue  freq
0   Afghan Restaurant  0.08
1         Coffee Shop  0.08
2      Sandwich Place  0.08
3   Indian Restaurant  0.08
4  Turkish Restaurant  0.08


----Toronto Dominion Centre----
                 venue  freq
0                 Café  0.17
1          Coffee Shop  0.10
2  Japanese Restaurant  0.07
3           Restaurant  0.07
4               Bakery  0.03


----University of Toronto----
                 venue  freq
0                 Café  0.17
1            Bookstore  0.10
2                 Park  0.07
3  Japanese Restaurant  0.07
4                Hotel  0.03


---

In [201]:
# Define a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
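
One thing to keep in mind with this function is that it always returns `num_top_venues` names, even when a neighborhood has fewer categories than that. The remaining slots are then filled with categories whose frequency is 0. A possible variant that returns only categories that actually occur is sketched below; the name `top_venues_nonzero` is hypothetical, and the sample row mirrors the Bedford Park output above, trimmed to three categories.

```python
import pandas as pd


def top_venues_nonzero(row: pd.Series, num_top_venues: int) -> list:
    """Return up to num_top_venues categories, skipping those with zero frequency."""
    freqs = row.iloc[1:].astype(float)  # skip the 'Neighborhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False, kind="stable")
    return list(freqs.index[:num_top_venues])


row = pd.Series({
    "Neighborhood": "Bedford Park",
    "ATM": 0.0,
    "Gym / Fitness Center": 1.0,
    "Yoga Studio": 0.0,
})

top = top_venues_nonzero(row, 3)
print(top)
```

The returned list can be shorter than `num_top_venues`, so a caller that fills fixed-width columns would need to pad it, for example with `None`.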

In [204]:
# Create the new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions 4 and above use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Food Court,Asian Restaurant,Korean Restaurant,Cantonese Restaurant,Coffee Shop,Train Station,Bakery,Vietnamese Restaurant,Hong Kong Restaurant
1,Alderwood,Pizza Place,Gym,Pub,Coffee Shop,Event Space,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Elementary School,Falafel Restaurant
2,Bathurst Manor,Korean Restaurant,Video Store,Ice Cream Shop,Coffee Shop,Gift Shop,Grocery Store,Sandwich Place,Café,Paper / Office Supplies Store,Restaurant
3,Bayview Village,Bank,Fish Market,Pizza Place,Outdoor Supply Store,Sandwich Place,Fast Food Restaurant,Bubble Tea Shop,Breakfast Spot,Pet Store,Sporting Goods Shop
4,Bedford Park,Gym / Fitness Center,Yoga Studio,Donut Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market,Falafel Restaurant


### Cluster neighborhoods into 4 clusters with K-Means and visualize them with Folium

In [234]:
# Drop Neighborhood column before clustering
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=4, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

# add clustering labels
neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_
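
The number of clusters is fixed at 4 above. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic data (three well-separated blobs generated with NumPy), not on the Toronto frequency matrix, so it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the frequency matrix: three tight, well-separated blobs.
X = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(20, 2))
    for center in (0.0, 1.0, 2.0)
])

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real venue data the scores are usually much closer together, so this is a guide rather than a definitive answer.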

In [239]:
# merge the cluster labels and top venues into canada, which holds latitude/longitude for each neighborhood
toronto_merged = canada.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neigh_for_location')
toronto_merged.head() 
# One neighborhood returned no venues, so its cluster label is null after the join.
# Filling it with 0 lets the labels be cast to integers, but note that this places
# that neighborhood in cluster 0 even though it was never clustered.
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].fillna(0).astype('int32')
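
An alternative to filling the missing label with 0 is to set the unclustered rows aside, so they do not end up in a real cluster. The sketch below shows the idea on a two-row toy example; Parkview Hill is the neighborhood at row 8, which appears with empty venue columns in the Cluster 1 table further down.

```python
import pandas as pd

hoods = pd.DataFrame({"Neigh_for_location": ["Parkwoods", "Parkview Hill"]})
labels = pd.DataFrame({"Neighborhood": ["Parkwoods"], "Cluster Labels": [2]})

# Left join: neighborhoods without venues get NaN as their cluster label.
merged = hoods.join(labels.set_index("Neighborhood"), on="Neigh_for_location")

# Keep the unclustered rows separately instead of assigning them to cluster 0.
unlabelled = merged[merged["Cluster Labels"].isna()]
labelled = merged.dropna(subset=["Cluster Labels"]).copy()
labelled["Cluster Labels"] = labelled["Cluster Labels"].astype("int32")

print(labelled)
print(unlabelled)
```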


In [240]:
toronto_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Postal Code             103 non-null    object 
 1   Borough                 103 non-null    object 
 2   Neighbourhood           103 non-null    object 
 3   Neigh_for_location      103 non-null    object 
 4   latitude                103 non-null    float64
 5   longitude               103 non-null    float64
 6   Cluster Labels          103 non-null    int32  
 7   1st Most Common Venue   102 non-null    object 
 8   2nd Most Common Venue   102 non-null    object 
 9   3rd Most Common Venue   102 non-null    object 
 10  4th Most Common Venue   102 non-null    object 
 11  5th Most Common Venue   102 non-null    object 
 12  6th Most Common Venue   102 non-null    object 
 13  7th Most Common Venue   102 non-null    object 
 14  8th Most Common Venue   102 non-null    ob

In [243]:
# create map
map_clusters = folium.Map(location=[43.6534817, -79.3839347], zoom_start=11)

# set color scheme for the clusters (one color per cluster)
kclusters = 4
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['Neigh_for_location'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine characteristics of clusters

**Cluster 1**

In [245]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,East York,-79.321907,0,,,,,,,,,,
17,Etobicoke,-79.576516,0,Park,Yoga Studio,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
21,York,-79.4552,0,Women's Store,Park,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
34,North York,-79.50448,0,Park,Baseball Field,Yoga Studio,Eastern European Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant
70,Etobicoke,-79.521043,0,Pizza Place,Park,Gas Station,Donut Shop,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market,Falafel Restaurant
77,Etobicoke,-79.556346,0,Park,Yoga Studio,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
82,Scarborough,-79.297795,0,Convenience Store,Caribbean Restaurant,Park,Gas Station,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
91,Downtown Toronto,-79.380746,0,Park,Food Truck,Bike Trail,Construction & Landscaping,Playground,Fish & Chips Shop,Filipino Restaurant,Field,Dumpling Restaurant,Fast Food Restaurant
95,Scarborough,-79.165837,0,Fast Food Restaurant,Park,Caribbean Restaurant,Yoga Studio,Dumpling Restaurant,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Farmers Market
101,Etobicoke,-79.494334,0,Park,Metro Station,Spa,American Restaurant,River,Event Space,Falafel Restaurant,Egyptian Restaurant,Electronics Store,Elementary School


**Cluster 2**

In [246]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
89,Etobicoke,-79.314538,1,Playground,Yoga Studio,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
90,Scarborough,-79.314538,1,Playground,Yoga Studio,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market


**Cluster 3**

In [247]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,-79.320197,2,ATM,Bank,Pizza Place,Discount Store,Coffee Shop,Chinese Restaurant,Caribbean Restaurant,Electronics Store,Shopping Mall,Bus Line
1,North York,-79.311189,2,Spa,Middle Eastern Restaurant,Park,Thai Restaurant,Bus Line,Yoga Studio,Falafel Restaurant,Egyptian Restaurant,Electronics Store,Elementary School
2,Downtown Toronto,-79.360457,2,Coffee Shop,Restaurant,Thai Restaurant,Performing Arts Venue,Auto Dealership,Park,Electronics Store,Food Truck,Beer Store,Sushi Restaurant
3,North York,-79.437507,2,Bank,Kids Store,Doctor's Office,Electronics Store,Park,Event Space,Eastern European Restaurant,Egyptian Restaurant,Elementary School,Yoga Studio
4,Downtown Toronto,-79.390340,2,Coffee Shop,Café,Park,Spa,Discount Store,Thai Restaurant,Italian Restaurant,Portuguese Restaurant,Fried Chicken Joint,Bookstore
...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,Downtown Toronto,-79.381692,2,Café,Restaurant,Coffee Shop,American Restaurant,Seafood Restaurant,Pizza Place,Gym,Steakhouse,Gastropub,Gluten-free Restaurant
98,Etobicoke,-79.511333,2,Breakfast Spot,Italian Restaurant,Pub,Sushi Restaurant,Bank,Coffee Shop,Dessert Shop,Pharmacy,French Restaurant,Mobile Phone Shop
99,Downtown Toronto,-79.372792,2,Coffee Shop,Grocery Store,Pizza Place,Pie Shop,Bistro,Library,Breakfast Spot,Market,Filipino Restaurant,Metro Station
100,East Toronto,-79.341700,2,Breakfast Spot,Bus Line,Gourmet Shop,Bakery,Beer Store,Pub,Cheese Shop,Park,Performing Arts Venue,Indian Restaurant


**Cluster 4**

In [248]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Scarborough,-79.130499,3,Train Station,Yoga Studio,Dumpling Restaurant,Flower Shop,Fish Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Farmers Market
18,Scarborough,-79.198229,3,Train Station,Storage Facility,Baseball Field,Falafel Restaurant,Egyptian Restaurant,Electronics Store,Elementary School,Event Space,Yoga Studio,Dumpling Restaurant


From a quick review of these four clusters, the K-Means model appears to have grouped the neighborhoods reasonably well. In Cluster 1, Park is the most common venue category for half of the neighborhoods and appears in the top three for every neighborhood that returned venues. Fish Market and Filipino Restaurant appear often in the 5th through 7th positions, but these lower-ranked entries should be read with caution: many of these neighborhoods have only a few venues, so the remaining slots are likely filled by categories with a frequency of zero, as in the Bedford Park output above. In Cluster 4, Train Station is the most common venue type for both neighborhoods. However, as I mentioned at the start, it can be hard to cluster neighborhoods within a single city, because the data is limited and neighborhoods in one city can be quite similar. Covering a larger area, or increasing the radius and limit in the API calls, might produce more clearly defined clusters.
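
A quick way to back up this kind of review is to compute, for each cluster, its size and its most frequent top venue. The sketch below assumes a small sample with the same column names as `toronto_merged`; the four rows are taken from the Cluster 2 and Cluster 4 tables above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Cluster Labels": [1, 1, 3, 3],
    "1st Most Common Venue": ["Playground", "Playground",
                              "Train Station", "Train Station"],
})

# Number of neighborhoods in each cluster.
sizes = sample["Cluster Labels"].value_counts().sort_index()

# Most frequent '1st Most Common Venue' within each cluster.
summary = sample.groupby("Cluster Labels")["1st Most Common Venue"].agg(
    lambda s: s.mode().iloc[0]
)

print(sizes)
print(summary)
```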

Thank you for checking out this project! 