<a> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTTYOPfdkntp9QtXcPkvO7JsGnU9GunyWgQ8TvyHBySz38_j0UAYQ" align="center" ></a>

<h1 align="center"> Segmentation and Clustering Neighborhoods in Toronto(CAN)</h1>

## Summary

In this notebook, we will explore, segment, and cluster the neighborhoods in the city of Toronto. The neighborhood data is not readily available on the internet in a ready-to-use form, and this is part of what makes data science interesting: each project can be challenging in its own way.

In this particular case, we will extract the data from a Wikipedia page that holds all the information we need to explore and cluster the neighborhoods in Toronto. We will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe.

We will use the BeautifulSoup Python Library. If you are interested, here is the library's main documentation page (<a href="https://beautiful-soup-4.readthedocs.io/en/latest/">BeautifulSoup Documentation</a>).

To segment the different neighborhoods in Toronto, we will apply k-means clustering. The objective of this segmentation is to analyze and create a profile for each cluster, based on the most common venue categories in each cluster.
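To make the k-means idea concrete, here is a minimal sketch of its two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. It is a toy, pure-Python illustration on made-up 2-D points. The function name `kmeans_sketch`, the starting centroids, and the data are all assumptions for illustration; the notebook itself relies on scikit-learn's `KMeans` later on.

```python
import math

def kmeans_sketch(points, centroids, n_iter=10):
    """Toy Lloyd-style k-means on 2-D points; returns (labels, centroids)."""
    labels = []
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans_sketch(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

In the notebook, the "points" are neighborhoods and the coordinates are venue-category frequencies rather than positions on a plane.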

### Importing Libraries

In [1]:
# library for data analysis
import pandas as pd
# library to handle data in a vectorized manner
import numpy as np
# library to handle requests
import requests
# function that transforms a JSON structure into a pandas dataframe
# (pandas >= 1.0 exposes it at the top level; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

from bs4 import BeautifulSoup
import csv
import matplotlib.pyplot as plt

In [2]:
!pip install folium
import folium # map rendering library



In [3]:
!pip install geopy



In [4]:
!pip install geocoder
import geocoder



In [5]:
from geopy.geocoders import Nominatim

#### Function to count how many times a given tag occurs in an HTML document

In [6]:
def count_tags(tag_name, html):
    # name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(html, 'lxml')
    return len(soup.find_all(tag_name))

## __Part I - Segmenting and Clustering Neighborhoods in Toronto__  

### Scrape the data from the Wikipedia page using BeautifulSoup

I first tried to scrape the table using BeautifulSoup alone, but I could not get it to work the way I wanted. If you have tips, you are more than welcome to share them with me :)

In [7]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source,'lxml')
# count the 'td' tags in the HTML source
td_count = count_tags('td',source)

# find all 'table' tags with the wikitable class, since we want to scrape the table contents
table = soup.find_all('table',{'class':'wikitable sortable'})
# find_all returns a list of matches; we use the first one
info=table[0]
#print(info)
    
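# write one CSV row per table row, using the csv module imported above
with open('neigh_toronto.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in info.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all(['th', 'td'])]
        if cells:
            writer.writerow(cells)

# read back the same file that was just written (the original read a .txt file that was never created)
df = pd.read_csv('neigh_toronto.csv')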

I then tried a simpler implementation that uses both the BeautifulSoup library and the pandas library.

In [8]:
# link of the wikipedia page containing the needed table
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup= BeautifulSoup(source,'lxml')
# If you want to see the source code
#print(soup.prettify())

df=pd.read_html(source) 
df=df[0]
df[:15]

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


## Pre-processing and Cleaning the Dataset

First, for every row whose Neighbourhood is 'Not assigned' but whose Borough is assigned, we copy the Borough value into the Neighbourhood column.

In [9]:
df['Neighbourhood'] = np.where((df['Neighbourhood']=='Not assigned') & (df['Borough']!=df['Neighbourhood']),df['Borough'],df['Neighbourhood'])
df[:10]
# Row 8 (M7A) shows that the replacement worked

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
9,M8A,Not assigned,Not assigned
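The conditional fill above can be checked on a tiny hand-made frame. This sketch assumes `pandas` and `numpy` are installed; the three sample rows and the name `needs_fill` are illustrative and only mirror the shape of the scraped table.

```python
import numpy as np
import pandas as pd

# illustrative rows: one fully unassigned, one with a borough only, one complete
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M7A', 'M3A'],
    'Borough': ['Not assigned', "Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# copy the borough into the neighbourhood only where the borough itself is assigned
needs_fill = (sample['Neighbourhood'] == 'Not assigned') & (sample['Borough'] != 'Not assigned')
sample['Neighbourhood'] = np.where(needs_fill, sample['Borough'], sample['Neighbourhood'])

print(sample['Neighbourhood'].tolist())
```

Rows where both columns are 'Not assigned' are left untouched here, which is why they still have to be dropped in a separate step.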


#### Getting rid of the Not assigned rows

In [10]:
# rows 0, 1, 9, etc. still have 'Not assigned' values

df.drop(df[df.Borough=="Not assigned"].index,inplace=True)
df.drop(df[df.Neighbourhood=="Not assigned"].index,inplace=True)
df[:10]

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


#### Display the duplicates

In [11]:
df.Postcode.value_counts() 

M9V    8
M8Y    8
M5V    7
M4V    5
M9B    5
M8Z    5
M9C    4
M6M    4
M9R    4
M1V    4
M3H    3
M6L    3
M8X    3
M1E    3
M8V    3
M1M    3
M1P    3
M5T    3
M1T    3
M1L    3
M1C    3
M1K    3
M6K    3
M5J    3
M5R    3
M5H    3
M2J    3
M2L    2
M5P    2
M4X    2
      ..
M6E    1
M7Y    1
M1W    1
M1H    1
M1J    1
M3L    1
M7A    1
M7R    1
M2K    1
M9N    1
M9P    1
M2R    1
M3A    1
M9A    1
M1X    1
M4P    1
M9W    1
M4G    1
M6G    1
M1G    1
M4H    1
M4R    1
M5C    1
M6C    1
M5G    1
M4J    1
M4M    1
M1S    1
M4Y    1
M9L    1
Name: Postcode, Length: 103, dtype: int64

#### Several neighborhoods can share one postal code, so we need to group them.

## __Part I - Final answer (Group the neighborhoods by postal code)__

In [50]:
df_final = df.groupby(['Postcode','Borough'], sort = False).agg(lambda x: ', '.join(x)).reset_index()

# Rename Neighbourhood and Postcode to the spellings used in the rest of the notebook
df_final.rename(columns={'Neighbourhood':'Neighborhood', 'Postcode':'PostalCode'},inplace=True)
df_final[:21]

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
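The grouping step above can be tried on a few rows in isolation. This is a small sketch, assuming `pandas` is installed; the sample rows are copied from the table shown earlier, and selecting the `Neighbourhood` column before aggregating is a stylistic choice rather than the notebook's exact call.

```python
import pandas as pd

sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# sort=False keeps the postal codes in their order of first appearance
grouped = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(grouped['Neighbourhood'].tolist())
```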


In [51]:
df_final.shape

(103, 3)

In [52]:
df_final['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

## __Part II - Segmenting and Clustering Neighborhoods in Toronto__

#### First we have to fetch the coordinates of every postal code 

In [53]:
postal_code=df_final['PostalCode'].reset_index(drop=True)
lat = []
long = []

for code in postal_code:
    lat_long=None
    
    # retry until the geocoder returns coordinates
    while(lat_long is None):
        g = geocoder.arcgis('{},Toronto,Ontario'.format(code))
        lat_long = g.latlng
    
    # append only after the lookup succeeded (lat_long is no longer None)
    lat.append(lat_long[0])
    long.append(lat_long[1])

latitude = np.array(lat)
longitude = np.array(long)

In [54]:
# just checking if they match
latitude.shape

(103,)

In [55]:
longitude.shape

(103,)
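The lookup loop above has no upper bound on the number of attempts per postal code. Below is a hedged sketch of a bounded retry. Here `lookup` stands in for a call such as `geocoder.arcgis(...).latlng`, and `geocode_with_retry`, `fake_lookup`, `max_tries`, `pause`, and the coordinates are hypothetical names and values chosen so the example runs offline.

```python
import time

def geocode_with_retry(lookup, query, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for _ in range(max_tries):
        lat_long = lookup(query)
        if lat_long is not None:
            return lat_long
        time.sleep(pause)  # brief pause before asking the service again
    return None  # give up; the caller decides how to handle the missing value

# hypothetical stand-in for the real geocoder: fails twice, then succeeds
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.38]

result = geocode_with_retry(fake_lookup, 'M5A,Toronto,Ontario')
print(result, calls['n'])
```

Returning `None` after the last attempt makes a failed lookup visible in the data instead of blocking the notebook.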

### Insert the latitude and longitude in the dataframe 

In [56]:
df_final['Latitude']=latitude
df_final['Longitude']=longitude

## __Part II - Final answer__

#### To simplify the analysis, we drop the PostalCode column and keep only the boroughs whose name ends with the word Toronto

In [58]:
df_final.drop(['PostalCode'],axis=1,inplace=True)
df_final.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,North York,Parkwoods,43.75244,-79.329271
1,North York,Victoria Village,43.730421,-79.31332
2,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
3,North York,"Lawrence Heights, Lawrence Manor",43.723125,-79.451589
4,Queen's Park,Queen's Park,43.661102,-79.391035


In [59]:
# keep only the boroughs ending in 'Toronto' and renumber the rows from 0
df_final=df_final[df_final['Borough'].str.endswith('Toronto')].reset_index(drop=True)
df_final.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264
1,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818
2,Downtown Toronto,St. James Town,43.65121,-79.375481
3,East Toronto,The Beaches,43.676845,-79.295225
4,Downtown Toronto,Berczy Park,43.64516,-79.373675


## __Part III - Segmenting and Clustering Neighborhoods in Toronto__  

Fetching the central location of the city of Toronto

In [60]:
g = geocoder.arcgis('Toronto,Ontario')
            
ll_toronto = g.latlng
print(ll_toronto)
            
lat_tor = ll_toronto[0]
long_tor = ll_toronto[1]

[43.648690000000045, -79.38543999999996]


Using folium to visualize the neighborhoods

In [134]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lat_tor, long_tor], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_final['Latitude'], df_final['Longitude'], df_final['Borough'], df_final['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Use Foursquare API to explore and cluster the neighborhoods in Toronto

In [62]:
CLIENT_ID = 'Y13NNMPP52FI1IBFHIVAG2YIXMVXSKYJLUSJALPPJKMXLL0V' # your Foursquare ID
CLIENT_SECRET = 'L1Y15FVS0F2OZYDUIZU0AGVMUW3GBRNNE4RPKCGCYXLSJLND' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: Y13NNMPP52FI1IBFHIVAG2YIXMVXSKYJLUSJALPPJKMXLL0V
CLIENT_SECRET:L1Y15FVS0F2OZYDUIZU0AGVMUW3GBRNNE4RPKCGCYXLSJLND


### Try and understand how to explore a single neighborhood

In [63]:
df_final.loc[0,'Neighborhood']

'Harbourfront, Regent Park'

In [64]:
neighborhood_latitude=df_final.loc[0,'Latitude']
neighborhood_longitude=df_final.loc[0,'Longitude']
neighborhood_name=df_final.loc[0,'Neighborhood']

print('The neighborhood {} location is ({} , {})'.format(neighborhood_name,neighborhood_latitude,neighborhood_longitude))

The neighborhood Harbourfront, Regent Park location is (43.65512000000007 , -79.36263979699999)


#### Let's start by getting the top 100 venues in the selected neighborhood, within a radius of 500m

In [65]:
# initialize the needed variables
LIMIT=100
radius=500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    neighborhood_latitude,
    neighborhood_longitude,
    radius,
    LIMIT)
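As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is only a sketch: the credential values are placeholders, the coordinates are the rounded values printed above, and passing the same dictionary as `requests.get(url, params=...)` would achieve a similar effect.

```python
from urllib.parse import urlencode

# placeholder values; substitute your own credentials and coordinates
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',
    'll': '{},{}'.format(43.65512, -79.36264),
    'radius': 500,
    'limit': 100,
}
# safe=',' keeps the comma in the ll value readable instead of percent-encoding it
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(url)
```

Keeping the parameters in one dictionary also avoids the stray `?&` at the start of the hand-built query string.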

In [66]:
results=requests.get(url).json()
#results

#### We can see that all of the important info is in the items key, but first we need to select the most important columns

In [67]:
# function that extracts the category of the venue
def get_category(row):
    try:
        category = row['categories']
    except KeyError:
        # after json_normalize the column is named 'venue.categories'
        category = row['venue.categories']
        
    if len(category) == 0:
        return None
    else:
        return category[0]['name']

In [68]:
# fetch the important info from the json file
venues = results ['response']['groups'][0]['items'] 
# transform venues into a dataframe
df_filt = json_normalize(venues)
filter_col = ['venue.name','venue.categories']+[col for col in df_filt.columns if col.startswith('venue.location.')] + ['venue.id']
df_filtered = df_filt.loc[:,filter_col]

df_filtered['venue.categories'] = df_filtered.apply(get_category,axis=1) 
df_filtered.columns = [column.split('.')[-1] for column in df_filtered.columns]
df_filtered.head()
    

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,neighborhood,postalCode,state,id
0,Roselle Desserts,Bakery,362 King St E,CA,Toronto,Canada,Trinity St,192,"[362 King St E (Trinity St), Toronto ON M5A 1K...","[{'label': 'display', 'lat': 43.65344672305267...",43.653447,-79.362017,,M5A 1K9,ON,54ea41ad498e9a11e9e13308
1,Tandem Coffee,Coffee Shop,368 King St E,CA,Toronto,Canada,at Trinity St,186,"[368 King St E (at Trinity St), Toronto ON, Ca...","[{'label': 'display', 'lat': 43.65355870959944...",43.653559,-79.361809,,,ON,53b8466a498e83df908c3f21
2,Figs Breakfast & Lunch,Breakfast Spot,344 Queen St. E.,CA,Toronto,Canada,at Parliament St.,162,"[344 Queen St. E. (at Parliament St.), Toronto...","[{'label': 'display', 'lat': 43.65567455427388...",43.655675,-79.364503,,M5A 1S8,ON,4af59046f964a520e0f921e3
3,Body Blitz Spa East,Spa,497 King Street East,CA,Toronto,Canada,btwn Sackville St and Sumach St,226,[497 King Street East (btwn Sackville St and S...,"[{'label': 'display', 'lat': 43.65473505045365...",43.654735,-79.359874,,M5A 1L9,ON,50760559e4b0e8c7babe2497
4,Cocina Economica,Mexican Restaurant,114 Berkeley St,CA,Toronto,Canada,btwn Queen & Richmond,243,"[114 Berkeley St (btwn Queen & Richmond), Toro...","[{'label': 'display', 'lat': 43.65495889022676...",43.654959,-79.365657,,,ON,5542ab36498e2f92a8c248f2


In [69]:
print('{} venues returned by Foursquare in neighborhood {} within a {} radius'.format(df_filtered.shape[0],df_final['Neighborhood'][0],radius))

23 venues returned by Foursquare in neighborhood Harbourfront, Regent Park within a 500 radius


#### Clean the dataframe

In [70]:
df_nearby = df_filtered[['name','categories','address','lat','lng','id']]
df_nearby.head()

Unnamed: 0,name,categories,address,lat,lng,id
0,Roselle Desserts,Bakery,362 King St E,43.653447,-79.362017,54ea41ad498e9a11e9e13308
1,Tandem Coffee,Coffee Shop,368 King St E,43.653559,-79.361809,53b8466a498e83df908c3f21
2,Figs Breakfast & Lunch,Breakfast Spot,344 Queen St. E.,43.655675,-79.364503,4af59046f964a520e0f921e3
3,Body Blitz Spa East,Spa,497 King Street East,43.654735,-79.359874,50760559e4b0e8c7babe2497
4,Cocina Economica,Mexican Restaurant,114 Berkeley St,43.654959,-79.365657,5542ab36498e2f92a8c248f2


### Now we will implement an iterative function that explores all of Toronto's neighborhoods

In [71]:
def getNeighborhoods(names,lats,longs):
    
    l = []
    venues_list = []
    for name,lat,lng in zip(names,lats,longs):
        # debug
        #print(name)
        
        # initialize the needed variables
        # (note: the radius here is 1000 m, versus 500 m in the single-neighborhood example)
        LIMIT=100
        radius=1000

        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                radius,
                LIMIT)
        
        results2 = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # get the relevant information for each venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results2])
        
    for venue_list in venues_list:
        #print(venue_list)
        for item in venue_list:
            #print(item)
            l.append(item)
                
    venues_list = pd.DataFrame(l)        
    
    return(venues_list)

In [72]:
toronto_venues = getNeighborhoods(names = df_final['Neighborhood'],
                                   lats = df_final['Latitude'],
                                   longs = df_final['Longitude']
                                  )

### Analyzing the data

In [73]:
print(toronto_venues.shape)
toronto_venues.head()

(3187, 7)


Unnamed: 0,0,1,2,3,4,5,6
0,"Harbourfront, Regent Park",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Harbourfront, Regent Park",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Harbourfront, Regent Park",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Harbourfront, Regent Park",43.65512,-79.36264,Rooster Coffee,43.6519,-79.365609,Coffee Shop
4,"Harbourfront, Regent Park",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


In [93]:
toronto_venues.columns = ['Neighborhood','Neighborhood Latitude','Neighborhood Longitude','Venue Name','Venue Latitude','Venue Longitude','Venue Category']
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue Latitude,Venue Longitude,Venue Category
0,"Harbourfront, Regent Park",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Harbourfront, Regent Park",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Harbourfront, Regent Park",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Harbourfront, Regent Park",43.65512,-79.36264,Rooster Coffee,43.6519,-79.365609,Coffee Shop
4,"Harbourfront, Regent Park",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


#### Determine the number of unique categories of the returned venues 

In [75]:
print('There are {} unique categories'.format(len(toronto_venues['Venue Category'].unique())))

There are 288 unique categories


In [76]:
toronto_venues.shape

(3187, 7)

## Analyze Each Neighborhood

#### First we will one-hot encode the venue categories into a dummy dataframe, so that we can compare the types of venues across neighborhoods

In [77]:
# One hot encoding
toronto_hot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="",prefix_sep="")

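# insert the column neighborhood in the first position
ngh=toronto_venues['Neighborhood']
# a venue category that is itself named 'Neighborhood' would clash with the column inserted next,
# so drop that dummy column if it exists
toronto_hot.drop(labels=['Neighborhood'],axis=1,inplace=True,errors='ignore')
toronto_hot.insert(0,'Neighborhood',ngh)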

In [78]:
toronto_hot.shape

(3187, 288)

#### How many different venues exist and what is their category

In [79]:
#print('There are {} unique venues and this is the list of different venues in all considered neighborhoods:\n \n{}'.format(len(toronto_venues['Venue Category'].unique()),toronto_venues['Venue Category'].unique()))

In [80]:
# get the number of appearances for each venue category (best method)
venues_freq=toronto_venues.groupby('Venue Category').size().sort_values(ascending=False)
venues_freq

Venue Category
Coffee Shop                      235
Café                             182
Italian Restaurant                92
Park                              84
Bakery                            84
Restaurant                        80
Pizza Place                       74
Bar                               71
Hotel                             66
Japanese Restaurant               55
Sushi Restaurant                  52
Pub                               51
Gym                               51
Gastropub                         48
Vegetarian / Vegan Restaurant     47
Sandwich Place                    44
American Restaurant               43
Thai Restaurant                   40
Grocery Store                     38
Breakfast Spot                    34
Theater                           34
Diner                             32
Beer Bar                          31
Mexican Restaurant                30
Burger Joint                      30
Steakhouse                        29
Indian Restaurant      

#### Get the number of bars in all neighborhoods

In [81]:
venues_freq= pd.DataFrame(venues_freq).reset_index()
venues_freq.columns=['Category','# Venues']
# note: this case-insensitive substring match also catches names such as 'Barbershop'
df_bar=venues_freq[venues_freq.Category.str.contains('Bar',case=False)]
df_bar.head()

Unnamed: 0,Category,# Venues
7,Bar,71
22,Beer Bar,31
40,Cocktail Bar,21
72,Salon / Barbershop,11
82,Juice Bar,9
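A plain substring match on 'Bar' also picks up 'Salon / Barbershop' in the output above. One possible refinement, shown here with the standard `re` module on category names taken from that output, is to require 'bar' as a whole word. The names `bar_pattern` and `bars_only` are illustrative; the same pattern string could be passed to `Series.str.contains`, which treats its argument as a regular expression by default.

```python
import re

categories = ['Bar', 'Beer Bar', 'Cocktail Bar', 'Salon / Barbershop', 'Juice Bar']

# \b marks a word boundary, so 'Barbershop' no longer matches
bar_pattern = re.compile(r'\bbar\b', flags=re.IGNORECASE)
bars_only = [name for name in categories if bar_pattern.search(name)]
print(bars_only)
```

Whether 'Juice Bar' should count as a bar is a separate judgment call that a pattern alone cannot settle.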


#### Group the rows by neighborhood and take the mean frequency of each venue category

In [211]:
toronto_hot = toronto_hot.groupby('Neighborhood').mean().reset_index()
toronto_hot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04
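The grouped means above can be read as the share of each category among a neighborhood's venues. The following toy example, which assumes `pandas` is installed and uses made-up neighborhoods 'A' and 'B', sketches why averaging one-hot columns yields those shares. Passing `dtype=float` is an addition here so the result does not depend on the pandas version's default dummy type.

```python
import pandas as pd

# made-up venues: neighborhood A has four venues, neighborhood B has two
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bar', 'Park', 'Bar', 'Bar'],
})

# one column per category, with 1.0 in the column matching the venue's category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the fraction of rows that are 1
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

For neighborhood A, two of the four venues are coffee shops, so the 'Coffee Shop' column averages to 0.5.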


In [95]:
num_top_venues = 5

for hood in toronto_hot['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_hot[toronto_hot['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
                 venue  freq
0          Coffee Shop  0.09
1                 Café  0.08
2                Hotel  0.07
3  Japanese Restaurant  0.04
4              Theater  0.04


----Berczy Park----
         venue  freq
0  Coffee Shop  0.10
1         Café  0.06
2        Hotel  0.05
3   Restaurant  0.04
4     Beer Bar  0.04


----Brockton, Exhibition Place, Parkdale Village----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.07
2          Bar  0.05
3       Bakery  0.04
4   Restaurant  0.04


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0         Coffee Shop  0.08
1                Café  0.06
2               Hotel  0.06
3  Italian Restaurant  0.03
4         Pizza Place  0.03


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                venue  freq
0         Coffee Shop  0.11
1          Restaurant  0.05
2  Italian Restaurant  0.05
3      

### Create a dataframe with these values sorted

In [149]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [226]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_hot['Neighborhood']

for ind in np.arange(toronto_hot.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_hot.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Hotel,Japanese Restaurant,Theater,American Restaurant,Sushi Restaurant,Bakery,Steakhouse,Beer Bar
1,Berczy Park,Coffee Shop,Café,Hotel,Japanese Restaurant,Beer Bar,Restaurant,Park,Bakery,Cocktail Bar,Farmers Market
2,"Brockton, Exhibition Place, Parkdale Village",Café,Coffee Shop,Bar,Bakery,Restaurant,Gift Shop,Furniture / Home Store,New American Restaurant,Italian Restaurant,Ice Cream Shop
3,Business Reply Mail Processing Centre 969 Eastern,Coffee Shop,Café,Hotel,Italian Restaurant,Bar,American Restaurant,Pizza Place,Concert Hall,Japanese Restaurant,Steakhouse
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Coffee Shop,Restaurant,Italian Restaurant,Yoga Studio,Bar,Sandwich Place,Park,Spa,Gym,Bakery
5,"Cabbagetown, St. James Town",Park,Japanese Restaurant,Diner,Café,Pool,Gastropub,Bistro,Caribbean Restaurant,Jewelry Store,Bakery
6,Central Bay Street,Coffee Shop,Clothing Store,Café,Italian Restaurant,Vegetarian / Vegan Restaurant,Ramen Restaurant,Japanese Restaurant,Plaza,Middle Eastern Restaurant,Theater
7,"Chinatown, Grange Park, Kensington Market",Café,Vegetarian / Vegan Restaurant,Bakery,Yoga Studio,Dessert Shop,Bar,Mexican Restaurant,Coffee Shop,Furniture / Home Store,Ice Cream Shop
8,Christie,Korean Restaurant,Café,Coffee Shop,Grocery Store,Japanese Restaurant,Park,Indian Restaurant,Cocktail Bar,Diner,Ethiopian Restaurant
9,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Pizza Place,Dance Studio,Gym,Mediterranean Restaurant,Men's Store,Gastropub
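The column labels above come from a three-item suffix list with a fallback to 'th'. That works for the ten columns used here, but it would label a 21st column as '21th'. A small helper, hypothetical and not part of the original notebook, sketches a more general way to build the same labels.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column names as in the notebook, built with the helper
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[1:4], columns[-1])
```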


### Cluster the Neighborhoods

Clustering using the k-means method (5 clusters)

In [205]:
# import k-means from scikit-learn
from sklearn.cluster import KMeans
toronto_clustering = toronto_hot.drop('Neighborhood', axis=1)
# set the number of clusters
kclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(toronto_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 3, 0, 3, 3, 0, 3, 3, 0])

Merging the cluster dataframe with the top-10 venues of each neighborhood 

In [206]:
# add clustering labels
# (run this cell only once; re-running raises an error because the column already exists)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_final

# join the sorted venues and cluster labels onto df_final, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,"Harbourfront, Regent Park",43.65512,-79.36264,3,Coffee Shop,Café,Restaurant,Park,Theater,Italian Restaurant,Bakery,Gym / Fitness Center,Breakfast Spot,Thai Restaurant
1,Downtown Toronto,"Ryerson, Garden District",43.657363,-79.37818,0,Coffee Shop,Gastropub,Clothing Store,Middle Eastern Restaurant,Italian Restaurant,Tea Room,Diner,Café,Fast Food Restaurant,Department Store
2,Downtown Toronto,St. James Town,43.65121,-79.375481,0,Coffee Shop,Café,Restaurant,Bakery,Gastropub,Hotel,Breakfast Spot,Italian Restaurant,American Restaurant,Cosmetics Shop
3,East Toronto,The Beaches,43.676845,-79.295225,3,Pub,Coffee Shop,Park,Bar,Breakfast Spot,Beach,Bakery,Burger Joint,Restaurant,Thai Restaurant
4,Downtown Toronto,Berczy Park,43.64516,-79.373675,0,Coffee Shop,Café,Hotel,Japanese Restaurant,Beer Bar,Restaurant,Park,Bakery,Cocktail Bar,Farmers Market


In [207]:
print(toronto_merged.shape)
print(neighborhoods_venues_sorted.shape)

(38, 15)
(38, 12)


#### Plot the clustered neighborhoods

In [203]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[lat_tor, long_tor], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
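Note that the marker color is looked up with `rainbow[cluster-1]`, so cluster 0 wraps around to the last color in the list through Python's negative indexing. The sketch below uses made-up color names as stand-ins for the five hex codes produced by `cm.rainbow`, purely to show which list position each cluster label lands on.

```python
# stand-in names for the five colors, in palette order (illustrative only)
palette = ['color_0', 'color_1', 'color_2', 'color_3', 'color_4']

for cluster in range(len(palette)):
    # cluster - 1 is -1 for cluster 0, which selects the last palette entry
    print(cluster, '->', palette[cluster - 1])

mapping = {cluster: palette[cluster - 1] for cluster in range(len(palette))}
```

Each cluster still gets a distinct color, but the colors are shifted by one position relative to the cluster labels.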

## __Part III - Final answer__

#### __Examine the clusters__

#### __Cluster 0__

This is the second biggest cluster (__red__), and the main venues are __coffee shops__ and __cafés__. I think the main reason for this high density of coffee shops and cafés is that these neighborhoods lie between the __University of Toronto__ and __Toronto Union Station__.

In [204]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,"Ryerson, Garden District",Coffee Shop,Gastropub,Clothing Store,Middle Eastern Restaurant,Italian Restaurant,Tea Room,Diner,Café,Fast Food Restaurant,Department Store
2,St. James Town,Coffee Shop,Café,Restaurant,Bakery,Gastropub,Hotel,Breakfast Spot,Italian Restaurant,American Restaurant,Cosmetics Shop
4,Berczy Park,Coffee Shop,Café,Hotel,Japanese Restaurant,Beer Bar,Restaurant,Park,Bakery,Cocktail Bar,Farmers Market
5,Central Bay Street,Coffee Shop,Clothing Store,Café,Italian Restaurant,Vegetarian / Vegan Restaurant,Ramen Restaurant,Japanese Restaurant,Plaza,Middle Eastern Restaurant,Theater
7,"Adelaide, King, Richmond",Coffee Shop,Café,Hotel,Japanese Restaurant,Theater,American Restaurant,Sushi Restaurant,Bakery,Steakhouse,Beer Bar
12,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Hotel,Steakhouse,Concert Hall,Italian Restaurant,Lounge,American Restaurant,Restaurant,Deli / Bodega
15,"Commerce Court, Victoria Hotel",Coffee Shop,Hotel,Café,Gastropub,Restaurant,Concert Hall,Beer Bar,Japanese Restaurant,Steakhouse,Bakery
33,Stn A PO Boxes 25 The Esplanade,Coffee Shop,Café,Hotel,Italian Restaurant,Bar,American Restaurant,Pizza Place,Concert Hall,Japanese Restaurant,Steakhouse
35,"First Canadian Place, Underground city",Coffee Shop,Hotel,Café,Japanese Restaurant,Steakhouse,Restaurant,Theater,American Restaurant,Deli / Bodega,Thai Restaurant
36,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Pizza Place,Dance Studio,Gym,Mediterranean Restaurant,Men's Store,Gastropub


#### __Cluster 1__

One of the smallest clusters (__purple__). The main venue in the Rosedale neighborhood is __Rosedale Park__.

In [195]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,Rosedale,Park,Italian Restaurant,Trail,Sporting Goods Shop,Bank,Beer Store,Gourmet Shop,Candy Store,Building,Athletics & Sports


#### __Cluster 2__

A small cluster (__light blue__). It seems to be a residential area.

In [196]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Roselawn,Pharmacy,Café,Bank,Trail,Skating Rink,Dumpling Restaurant,Dog Run,Doner Restaurant,Donut Shop,Dry Cleaner


#### __Cluster 3__

This is the biggest cluster __(green)__. The most common venues are restaurants.

In [197]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Harbourfront, Regent Park",Coffee Shop,Café,Restaurant,Park,Theater,Italian Restaurant,Bakery,Gym / Fitness Center,Breakfast Spot,Thai Restaurant
3,The Beaches,Pub,Coffee Shop,Park,Bar,Breakfast Spot,Beach,Bakery,Burger Joint,Restaurant,Thai Restaurant
6,Christie,Korean Restaurant,Café,Coffee Shop,Grocery Store,Japanese Restaurant,Park,Indian Restaurant,Cocktail Bar,Diner,Ethiopian Restaurant
8,"Dovercourt Village, Dufferin",Bar,Bakery,Café,Coffee Shop,Park,Pharmacy,Clothing Store,Beer Store,Ice Cream Shop,Cocktail Bar
10,"Little Portugal, Trinity",Bar,Café,Pizza Place,Cocktail Bar,Coffee Shop,Bakery,Asian Restaurant,Restaurant,Italian Restaurant,Men's Store
11,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Café,Pub,Bakery,Italian Restaurant,Fast Food Restaurant,Pizza Place,Sandwich Place,Grocery Store
13,"Brockton, Exhibition Place, Parkdale Village",Café,Coffee Shop,Bar,Bakery,Restaurant,Gift Shop,Furniture / Home Store,New American Restaurant,Italian Restaurant,Ice Cream Shop
14,"The Beaches West, India Bazaar",Indian Restaurant,Coffee Shop,Beach,Park,Pub,Pizza Place,Brewery,Café,Grocery Store,Gym
16,Studio District,Coffee Shop,Pizza Place,Café,Italian Restaurant,Bakery,Bar,American Restaurant,Park,Sandwich Place,French Restaurant
17,Lawrence Park,Café,Coffee Shop,Pharmacy,Bus Line,Park,Restaurant,Trail,Bookstore,College Gym,Gym / Fitness Center


#### __Cluster 4__

This is the __orange cluster__. It is near Billy Bishop Toronto City Airport, so it makes sense that several of the most common venues are airport-related.

In [198]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,"Harbourfront East, Toronto Islands, Union Station",Harbor / Marina,Bar,Airport Lounge,Airport Terminal,Burger Joint,Boutique,Nudist Beach,Music Venue,Sculpture Garden,Café
