## Segmenting and Clustering - Neighborhoods in Toronto

This notebook contains an analysis of the neighbourhoods in the city of toronto.
Data from several sources are retrieved and combined to develop a profile for each of the neighbourhoods based on the distribution of venues located nearby. These profiles are used to group neighbourhoods into similar clusters and displayed on a map.   

#### __*Question 1*__

#### __Data Extraction__

Retrieve a listing of neighbourhoods in Ontario, Canada identified by postal codes

In [44]:
###
### Retrieve list of canadian neighbourhoods by postal code and borough
###

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
df_ca = pd.read_html(url, match="Postal Code") [0]
df_ca


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [45]:
###
### Use BeautifulSoup API to retrieve list of neighbourhoods in Canada by postal code from Wikipedia
###

import requests
from bs4 import BeautifulSoup 
from IPython.display import display_html

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
#print(soup.title)

from IPython.display import display_html
tab = str(soup.table)
#display_html(tab,raw=True)

df_ca = pd.read_html(tab)[0]
df_ca

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


#### __Data Preparation__ 

Remove rows with missing borough values and fill missing neighbourhood values with the associated borough value.

In [88]:
###
### Data quality checks
###

# drop cells with a borough that is Not assigned

df = df_ca[df_ca["Borough"] != "Not assigned"]
print("Error count - (Borough == 'Not Assigned'): \t\t", df[df["Borough"] == "Not Assigned"].shape[0])

# set neighbourhood to borough if neighbourhood is Not Assigned

for i in range(df.shape[0]):
   if df.iloc[i]["Neighbourhood"] == 'Not Assigned':
      df.iloc[i]["Neighbourhood"] = df.iloc[i]["Borough"]
    
print("Error count - (Neighbourhood == 'Not Assigned'): \t", df[df["Neighbourhood"]=="Not Assigned"].shape[0], "\n")

# join description for duplicate neighbourhoods with the same postal code and borough values

df = df.groupby(by=["Postal Code", "Borough"]).agg({'Neighbourhood': ', '.join}).reset_index()

df


Error count - (Borough == 'Not Assigned'): 		 0
Error count - (Neighbourhood == 'Not Assigned'): 	 0 



Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [91]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "df")
print("Dataframe shape: \t", df.shape)
print("Number of rows: \t", df.shape[0])
print("Boroughs: \t\t", len(df["Borough"].unique()))
print("Neighbourhoods: \t", len(df["Neighbourhood"].unique()))

print("\nQuestion 1 Answer: \t", "Dataframe contains {} rows and {} columns".format(df.shape[0],df.shape[1]))


Dataframe: 		 df
Dataframe shape: 	 (103, 3)
Number of rows: 	 103
Boroughs: 		 10
Neighbourhoods: 	 99

Question 1 Answer: 	 Dataframe contains 103 rows and 3 columns


__Observation:__ 

There are 103 rows in the dataframe corresponding to 103 distinct postal codes but only 99 distinct neighbourhoods. For the borough of North York, the neighbourhood *Don Mills* is duplicated 2 times and *Downsview* is duplicated 4 times each with distinct postal codes. As a result, the unique count of neighbourhoods is 99 and not 103. This descrepancy does not affect the subsequent analysis as only boroughs in the city of Toronto are considered and the city of North York is filtered out.

In [90]:
file_name="neighbourhoods.csv"
df.to_csv(file_name, encoding='utf-8', index=False)

#### __*Question 2*__

#### __Data Enhancement__

Extend the neighbourhood dataframe to include latitide and longitude coordinates for each neighbourhood

In [51]:
# Install geocoder if not installed

#!pip install geocoder  

In [68]:
###
### Retrieve latitude and logitude coordinates for each neighbourhood in Ontario
###

df_ll = pd.read_csv("http://cocl.us/Geospatial_data")
df_ll


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [69]:
###
### Add latitude and longitude to the neighbourhoods using an outer left join on postal code
###

df = pd.merge(df, df_ll, on="Postal Code", how="inner")
df


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [70]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "df")
print("Dataframe shape: \t", df.shape)
print("Number of rows: \t", df.shape[0])
print("Boroughs: \t\t", len(df["Borough"].unique()))
print("Neighbourhoods: \t", len(df["Neighbourhood"].unique()))

print("\nQuestion 2 Answer: \t", "Dataframe contains {} rows and {} columns".format(df.shape[0],df.shape[1]))
 

Dataframe: 		 df
Dataframe shape: 	 (103, 5)
Number of rows: 	 103
Boroughs: 		 10
Neighbourhoods: 	 99

Question 2 Answer: 	 Dataframe contains 103 rows and 5 columns


#### __*Question 3*__

#### __Data Preparation__

In [55]:
# install geopy if not already installed

#!conda install -c conda-forge geopy --yes  

In [71]:
###
### Retrieve the latitude and longitude values for the city of Toronto
###

from geopy.geocoders import Nominatim 

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
lat_to = location.latitude
long_to = location.longitude

print('Toronto - [latitude, longitude]: ({}, {})'.format(lat_to, long_to))


Toronto - [latitude, longitude]: (43.6534817, -79.3839347)


In [72]:
# filter the neighbourhood dataframe for boroughs that are located in the city of toronto

df_to = df[df["Borough"].str.contains("Toronto")].sort_values(["Borough", "Postal Code"]).reset_index(drop=True)
df_to


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
5,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049
6,M5N,Central Toronto,Roselawn,43.711695,-79.416936
7,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
9,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529


In [73]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "df_to")
print("Dataframe shape: \t", df_to.shape)
print("Number of rows: \t", df_to.shape[0])
print("Boroughs: \t\t", len(df_to["Borough"].unique()))
print("Neighbourhoods: \t", len(df_to["Neighbourhood"].unique()))


Dataframe: 		 df_to
Dataframe shape: 	 (39, 5)
Number of rows: 	 39
Boroughs: 		 4
Neighbourhoods: 	 39


#### __Data Visualization__

#### _Display map of Toronto neighbourhoods using the Folium API_

In [244]:
# install folium if not installed

#!pip install folium    

In [74]:
###
### Create map of Toronto using latitude and longitude values
###

import folium

map_to = folium.Map(location=[lat_to, long_to], titles="Toronto Neighbourhoods", zoom_start=12)

# add neighbourhood markers to map

for lat, long, borough, nh in zip(df_to['Latitude'], df_to['Longitude'], df_to['Borough'], df_to['Neighbourhood']):

   folium.CircleMarker(
      [lat, long],
      popup = folium.Popup('{}, {}'.format(nh, borough), parse_html=True),
      color = 'blue',
      fill = True,
      fill_color = '#3186cc',
      fill_opacity = 0.7,
      parse_html = False).add_to(map_to)  
    
map_to

#### __Data Exploration__

Explore the Toronto neighbourhoods using the FourSquare API

In [75]:
###
### Define Foursquare credentials and version
###

import requests

CLIENT_ID = '2KCKB0CN1JSW0DRVY0VIVLJASZS0UY5XBFN0HKH2QZK5UHEF' # your Foursquare ID
CLIENT_SECRET = 'FAFGJMGKW1EZZEMLOJUHOHIA3O4HJJDO424O5JWOJ2QUTWOI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('FourSquare Credentails: \n')
print('CLIENT_ID: \t', CLIENT_ID)
print('CLIENT_SECRET: \t', CLIENT_SECRET)


FourSquare Credentails: 

CLIENT_ID: 	 2KCKB0CN1JSW0DRVY0VIVLJASZS0UY5XBFN0HKH2QZK5UHEF
CLIENT_SECRET: 	 FAFGJMGKW1EZZEMLOJUHOHIA3O4HJJDO424O5JWOJ2QUTWOI


Retrieve listing of nearby venuee for each neighbourhood

In [78]:
###
### Function to retrieve nearby venues for each of the neighburhoods
###

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        print(name)    
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
            'Neighbourhood Latitude', 
            'Neighbourhood Longitude', 
            'Venue', 
            'Venue Latitude', 
            'Venue Longitude', 
            'Venue Category']
    
    return(nearby_venues)

In [79]:
###
### Retrieve venues located in the neighbourhoods using the FourSquare API
###

venues_to = getNearbyVenues(names=df_to['Neighbourhood'], latitudes=df_to['Latitude'], longitudes=df_to['Longitude'])
venues_to


Lawrence Park
Davisville North
North Toronto West, Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Queen's Park, Ontario Provincial Government
The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Business reply mail Processing Centre, South Central Letter 

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.728020,-79.388790,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.728020,-79.388790,Dim Sum Deluxe,43.726953,-79.394260,Dim Sum Restaurant
2,Lawrence Park,43.728020,-79.388790,Zodiac Swim School,43.728532,-79.382860,Swim School
3,Lawrence Park,43.728020,-79.388790,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
4,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
...,...,...,...,...,...,...,...
1603,"Runnymede, Swansea",43.651571,-79.484450,Cards 'N' Such,43.650497,-79.480778,Post Office
1604,"Runnymede, Swansea",43.651571,-79.484450,(The New) Moksha Yoga Bloor West,43.648658,-79.485242,Yoga Studio
1605,"Runnymede, Swansea",43.651571,-79.484450,The Coffee Bouquets,43.648785,-79.485940,Coffee Shop
1606,"Runnymede, Swansea",43.651571,-79.484450,Supplements Plus,43.650512,-79.479262,Supplement Shop


In [92]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "venues_to")
print("Dataframe shape: \t", venues_to.shape)
print("Number of rows: \t", venues_to.shape[0])
print("Neighbourhoods: \t", len(venues_to["Neighbourhood"].unique()))
print("Venue Categories: \t", len(venues_to["Venue Category"].unique()))
print("Venues: \t\t", len(venues_to["Venue"].unique()))


Dataframe: 		 venues_to
Dataframe shape: 	 (1608, 7)
Number of rows: 	 1608
Neighbourhoods: 	 39
Venue Categories: 	 238
Venues: 		 1039


In [89]:
# Print count of venues grouped by neighbourhood

venues_to.groupby(["Neighbourhood"]).size().reset_index(name="Venues")


Unnamed: 0,Neighbourhood,Venues
0,Berczy Park,56
1,"Brockton, Parkdale Village, Exhibition Place",22
2,"Business reply mail Processing Centre, South C...",15
3,"CN Tower, King and Spadina, Railway Lands, Har...",16
4,Central Bay Street,61
5,Christie,15
6,Church and Wellesley,80
7,"Commerce Court, Victoria Hotel",100
8,Davisville,32
9,Davisville North,9


In [93]:
###
### Onehot encode the venue categories as numeric column attributes
###

onehot_to = pd.get_dummies(venues_to[['Venue Category']], prefix="", prefix_sep="")
onehot_to["Neighbourhood"] = venues_to["Neighbourhood"] 

cols = [onehot_to.columns[-1]] + list(onehot_to.columns[:-1])
onehot_to = onehot_to[cols]
 
onehot_to

Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1603,"Runnymede, Swansea",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1604,"Runnymede, Swansea",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1605,"Runnymede, Swansea",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1606,"Runnymede, Swansea",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [94]:
###
### Print the number of rows in the encoded dataframe
###

print("Dataframe: \t\t", "onehot_to")
print("Dataframe shape: \t", onehot_to.shape)
print("Number of rows: \t", onehot_to.shape[0])
print("Neighbourhoods: \t", len(onehot_to["Neighbourhood"].unique()))
print("Venue Categories: \t",onehot_to.shape[1]-1)
print("Venues: \t\t", (onehot_to.sum(axis=1,numeric_only=True)).sum(axis=0))


Dataframe: 		 onehot_to
Dataframe shape: 	 (1608, 239)
Number of rows: 	 1608
Neighbourhoods: 	 39
Venue Categories: 	 238
Venues: 		 1608


In [95]:
###
### Compute the average frequency of venues in each category for each of the neighbourhoods
###

grouped_to = onehot_to.groupby('Neighbourhood').mean().reset_index()
grouped_to


Unnamed: 0,Neighbourhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0625,0.0625,0.0625,0.125,0.0625,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016393,0.0,0.016393,0.0,0.016393,0.0,0.016393
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [96]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "grouped_to")
print("Dataframe shape: \t", grouped_to.shape)
print("Number of rows: \t", grouped_to.shape[0])
print("Neighbourhoods: \t", len(grouped_to["Neighbourhood"].unique()))
print("Venue Categories: \t",grouped_to.shape[1]-1)


Dataframe: 		 grouped_to
Dataframe shape: 	 (39, 239)
Number of rows: 	 39
Neighbourhoods: 	 39
Venue Categories: 	 238


In [97]:
###
### Print count of venues summarized by neighbourhood and venue category 
###

count_to = venues_to.groupby(["Neighbourhood", "Venue Category"]).size().reset_index(name="Count") 
count_to = count_to.sort_values(by=["Neighbourhood", "Count"], ascending=[True, False], axis=0).reset_index(drop=True)
count_to


Unnamed: 0,Neighbourhood,Venue Category,Count
0,Berczy Park,Coffee Shop,5
1,Berczy Park,Cocktail Bar,3
2,Berczy Park,Bakery,2
3,Berczy Park,Beer Bar,2
4,Berczy Park,Cheese Shop,2
...,...,...,...
1081,"University of Toronto, Harbord",Sandwich Place,1
1082,"University of Toronto, Harbord",Sushi Restaurant,1
1083,"University of Toronto, Harbord",Theater,1
1084,"University of Toronto, Harbord",Video Game Store,1


In [98]:
###
### Print the top venue categories by neighbourhood sorted by mean frequency 
###

pd.options.display.float_format = '{:.2f}'.format # display 2 decimals for floats

n_cat = 5  # maximum number of venue categories
for h in grouped_to['Neighbourhood']:
    print("\n---- "+h+" ----")
    df_t = grouped_to[grouped_to['Neighbourhood']==h].T[1:].reset_index().set_axis(['Venue','Freq'],axis=1)
    df_t = df_t.sort_values(by=["Freq","Venue"],ascending=[False,True]).reset_index(drop=True).head(n_cat)
    print (df_t)



---- Berczy Park ----
          Venue Freq
0   Coffee Shop 0.09
1  Cocktail Bar 0.05
2        Bakery 0.04
3      Beer Bar 0.04
4   Cheese Shop 0.04

---- Brockton, Parkdale Village, Exhibition Place ----
            Venue Freq
0            Café 0.14
1  Breakfast Spot 0.09
2     Coffee Shop 0.09
3          Bakery 0.05
4             Bar 0.05

---- Business reply mail Processing Centre, South Central Letter Processing Plant Toronto ----
                Venue Freq
0  Light Rail Station 0.13
1             Brewery 0.07
2       Burrito Place 0.07
3             Butcher 0.07
4          Comic Shop 0.07

---- CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport ----
                Venue Freq
0      Airport Lounge 0.12
1    Airport Terminal 0.12
2             Airport 0.06
3  Airport Food Court 0.06
4        Airport Gate 0.06

---- Central Bay Street ----
                Venue Freq
0         Coffee Shop 0.18
1                Café 0.07
2  Itali

In [100]:
###
### Display the top venue categories for each neighbourhood in toronto
###

# create dataframe for the top venue categories by neighbourhood

n_cat = 10                  # number of categories in the ranking
cols = ["Neighbourhood"]    # list of column attributes 
for i in range(1,n_cat+1):
    cols.append("Top {}".format(i))
top_to = pd.DataFrame(columns=cols)
top_to["Neighbourhood"] = grouped_to["Neighbourhood"]

# Retrieve the top venue categories and update dataframe for each of the neighbourhoods

for n in range(len(top_to)):
    top_to.iloc[n, 1:] = grouped_to.iloc[n, 1:].sort_values(ascending=False).index.values[0:n_cat]

top_to


Unnamed: 0,Neighbourhood,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,Berczy Park,Coffee Shop,Cocktail Bar,Beer Bar,Farmers Market,Seafood Restaurant,Bakery,Cheese Shop,Restaurant,Bistro,Fish Market
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Grocery Store,Furniture / Home Store,Burrito Place,Convenience Store,Stadium,Restaurant,Italian Restaurant
2,"Business reply mail Processing Centre, South C...",Light Rail Station,Park,Restaurant,Brewery,Butcher,Garden,Garden Center,Burrito Place,Fast Food Restaurant,Farmers Market
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Terminal,Airport,Bar,Harbor / Marina,Coffee Shop,Rental Car Location,Sculpture Garden,Boutique,Boat or Ferry
4,Central Bay Street,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Bubble Tea Shop,Salad Place,Burger Joint,Middle Eastern Restaurant,Ramen Restaurant,Portuguese Restaurant
5,Christie,Grocery Store,Café,Park,Coffee Shop,Italian Restaurant,Restaurant,Candy Store,Baby Store,Nightclub,Doner Restaurant
6,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Gay Bar,Fast Food Restaurant,Pub,Hotel,Yoga Studio,Mediterranean Restaurant
7,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Hotel,Café,American Restaurant,Italian Restaurant,Gym,Seafood Restaurant,Cocktail Bar,Japanese Restaurant
8,Davisville,Dessert Shop,Sandwich Place,Sushi Restaurant,Café,Gym,Italian Restaurant,Pizza Place,Coffee Shop,Indoor Play Area,Brewery
9,Davisville North,Gym / Fitness Center,Park,Department Store,Dance Studio,Hotel,Food & Drink Shop,Breakfast Spot,Gym,Sandwich Place,Ethiopian Restaurant


In [101]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "top_to")
print("Dataframe shape: \t", top_to.shape)
print("Number of rows: \t", top_to.shape[0])
print("Neighbourhoods: \t", len(top_to["Neighbourhood"].unique()))
 

Dataframe: 		 top_to
Dataframe shape: 	 (39, 11)
Number of rows: 	 39
Neighbourhoods: 	 39


#### __Modeling__

Cluster the neighbourhoods in toronto based on the venue categories using the KMeans clustering model 

In [102]:
###
### Apply the KMeans model to cluster the neighbourhoods in Toronto
###

from sklearn.cluster import KMeans

n_groups = 5
cluster_to = grouped_to.drop(["Neighbourhood"], axis=1)

# define and fit KMeans model
model = KMeans(n_clusters=n_groups, random_state=0).fit(cluster_to)

# print cluster labels generated for each row in the dataframe
model.labels_


array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 2, 0, 2,
       2, 2, 2, 2, 0, 4, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2])

#### __Data Preparation__

In [103]:
###
### Add KMeans clusteing results to the dataframe
###

df_all = df_to.join(top_to.set_index("Neighbourhood"), on="Neighbourhood")

# add cluster labels from model execution

df_all.insert(3, 'Cluster Labels', model.labels_)
df_all


Unnamed: 0,Postal Code,Borough,Neighbourhood,Cluster Labels,Latitude,Longitude,Top 1,Top 2,Top 3,Top 4,Top 5,Top 6,Top 7,Top 8,Top 9,Top 10
0,M4N,Central Toronto,Lawrence Park,2,43.73,-79.39,Park,Bus Line,Swim School,Dim Sum Restaurant,Yoga Studio,Distribution Center,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant
1,M4P,Central Toronto,Davisville North,2,43.71,-79.39,Gym / Fitness Center,Park,Department Store,Dance Studio,Hotel,Food & Drink Shop,Breakfast Spot,Gym,Sandwich Place,Ethiopian Restaurant
2,M4R,Central Toronto,"North Toronto West, Lawrence Park",2,43.72,-79.41,Coffee Shop,Clothing Store,Yoga Studio,Bagel Shop,Chinese Restaurant,Diner,Restaurant,Café,Salon / Barbershop,Mexican Restaurant
3,M4S,Central Toronto,Davisville,2,43.7,-79.39,Dessert Shop,Sandwich Place,Sushi Restaurant,Café,Gym,Italian Restaurant,Pizza Place,Coffee Shop,Indoor Play Area,Brewery
4,M4T,Central Toronto,"Moore Park, Summerhill East",2,43.69,-79.38,Restaurant,Playground,Tennis Court,Lawyer,Yoga Studio,Donut Shop,Distribution Center,Dog Run,Doner Restaurant,Eastern European Restaurant
5,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",2,43.69,-79.4,Coffee Shop,Pizza Place,Light Rail Station,Restaurant,Fried Chicken Joint,Supermarket,Sushi Restaurant,Bank,Bagel Shop,Pub
6,M5N,Central Toronto,Roselawn,2,43.71,-79.42,Health & Beauty Service,Fast Food Restaurant,Garden,Fish & Chips Shop,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
7,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",2,43.7,-79.41,Trail,Park,Sushi Restaurant,Jewelry Store,Eastern European Restaurant,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Yoga Studio
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",2,43.67,-79.41,Sandwich Place,Café,Coffee Shop,Pizza Place,BBQ Joint,Pharmacy,Pub,Cheese Shop,Donut Shop,Middle Eastern Restaurant
9,M4W,Downtown Toronto,Rosedale,2,43.68,-79.38,Park,Trail,Playground,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio


In [104]:
###
### Print the number of rows in the dataframe
###

print("Dataframe: \t\t", "df_all")
print("Dataframe shape: \t", df_all.shape)
print("Number of rows: \t", df_all.shape[0])
print("Neighbourhoods: \t", len(df_all["Neighbourhood"].unique()))


Dataframe: 		 df_all
Dataframe shape: 	 (39, 16)
Number of rows: 	 39
Neighbourhoods: 	 39


#### __Data Visualization__

In [105]:
###
### Display a map of toronto showing the neighbourhoods colored by the cluster grouping 
###

# import matplotlib and associated plotting modules

import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

# create map of neighbourhoods by clusters using latitude and longitude values

map_to = folium.Map(location=[lat_to, long_to], titles="Toronto Neighbourhoods", zoom_start=12)

# set color scheme for the clusters

x = np.arange(n_groups)
ys = [i + x + (i*x)**2 for i in range(n_groups)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# setup a color scheme for the clusters
#colors = ['green', 'purple', 'orange', 'red','blue', 'black', 'beige']

# add neighbourhood markers to map
 
for lat, long, nh, cluster in zip(df_all['Latitude'], df_all['Longitude'], df_to['Neighbourhood'], df_all["Cluster Labels"]):

   folium.CircleMarker(
      [lat, long],
      radius=5,
      popup = folium.Popup('Cluster ' + str(cluster+1) + "\n" + nh, parse_html=True),
      color = rainbow[int(cluster-1)],
      fill = True,
      fill_color = rainbow[int(cluster-1)],
      fill_opacity = 0.8,
      parse_html = False).add_to(map_to)  

map_to

__Observation:__

The majority of the defined neighbourhoods in the city of Toronto (34 out of 39) are similar and grouped into a single cluster based on the distribution of venues located in the neighbourhood. The main city core in Toronto along Yonge street is comprised of a mix of businesses, shopping, coffee shops, restaurants, entertainment and related venues.There are other neighbourhoods in cluster 2 in the east end of the city along Queen street such as the Beaches that contain an urban mix of retail and business venues. Shopping centers such as Dufferin Mall and high traffic areas like High Park are also included in the main cluster 2.