## Toronto Neighborhood/Borough Analysis

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about data science is that each project can be challenging in its own way, so you need to be agile and sharpen your ability to learn new libraries and tools quickly, depending on the project.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

In [1]:
#imports

import pandas as pd
import numpy as np

In [2]:
# lxml is the parser passed to BeautifulSoup below
!pip install beautifulsoup4 lxml



In [3]:
from bs4 import BeautifulSoup as bs
import requests as req

In [6]:
page = req.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(page,"lxml")
soup.table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park / Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor / Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park / Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern / Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>
</td></tr>
<tr>
<td>M3B
</td>
<td>North York
</td>
<td>Don Mills
</td></tr>
<tr>
<td>M4B
</td>
<td>Ea

In [9]:
headers = []
rows =[]
first_row = True

for n in soup.table.find_all('tr'):
    if first_row :
        headers = [header.text.split('\n')[0] for header in n.find_all('th')]
        first_row = False
    else :
        row = [value.text.split('\n')[0] for value in n.find_all('td')]
        rows.append(row)
print('Column names ', headers)
print('rows ', rows[:10])

Column names  ['Postal code', 'Borough', 'Neighborhood']
rows  [['M1A', 'Not assigned', ''], ['M2A', 'Not assigned', ''], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Regent Park / Harbourfront'], ['M6A', 'North York', 'Lawrence Manor / Lawrence Heights'], ['M7A', 'Downtown Toronto', "Queen's Park / Ontario Provincial Government"], ['M8A', 'Not assigned', ''], ['M9A', 'Etobicoke', 'Islington Avenue'], ['M1B', 'Scarborough', 'Malvern / Rouge']]


In [28]:
df_can = pd.DataFrame(data=rows, columns=headers)
df_can.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


In [29]:
df_can.shape

(180, 3)

In [30]:
#cleaning data

df_can.replace('Not assigned', np.nan, inplace=True)

df_can=df_can[pd.notnull(df_can['Borough'])]
df_can.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [31]:
df_can['Borough'].isnull().any()

False

In [32]:
print(df_can['Neighborhood'].isnull().any())
print(df_can['Postal code'].isnull().any())

False
False
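
Note that `isnull()` only flags real missing values. A cell scraped as an empty string (as in the 'Not assigned' rows of the raw table) would not be reported. The following is a minimal sketch on made-up values, not on the scraped data, of a check that catches both cases:

```python
import pandas as pd

# Made-up values: a real name, an empty string, a whitespace-only cell, a missing value
neighborhoods = pd.Series(["Parkwoods", "", "  ", None])

# isnull() only sees the genuinely missing value
is_null = neighborhoods.isnull()

# Treat empty or whitespace-only strings as missing too
is_blank = is_null | neighborhoods.fillna("").str.strip().eq("")

print(is_null.tolist())
print(is_blank.tolist())
```

In the notebook, the same expression could be applied to `df_can['Neighborhood']` before relying on the `False` above.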


In [33]:
df_can.shape

(103, 3)

In [34]:
coordinates = pd.read_csv('http://cocl.us/Geospatial_data')
coordinates.shape

(103, 3)

In [35]:
print('Actual data columns names: ', df_can.columns)
print('Actual coordinates names: ', coordinates.columns)
df_can.columns = ['PostalCode', 'Borough', 'Neighborhood']
coordinates.columns = ['PostalCode', 'Latitude','Longitude']
print('Modified data columns names: ', df_can.columns)
print('Modified coordinates names: ', coordinates.columns)

Actual data columns names:  Index(['Postal code', 'Borough', 'Neighborhood'], dtype='object')
Actual coordinates names:  Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')
Modified data columns names:  Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')
Modified coordinates names:  Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')


In [36]:
coordinates=coordinates[pd.notnull(coordinates['Latitude'])]
coordinates=coordinates[pd.notnull(coordinates['Longitude'])]

In [37]:
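df_can.sort_values('PostalCode', inplace=True)
coordinates.sort_values('PostalCode',inplace=True)

# Caution: column assignment aligns on index labels, not on row order or
# PostalCode, so sorting does not pair the rows up. Index labels that are
# missing from `coordinates` produce NaN (checked in the next cells).
df_can['Latitude'] = coordinates['Latitude']
df_can['Longitude'] = coordinates['Longitude']
df_can.head()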

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
9,M1B,Scarborough,Malvern / Rouge,43.692657,-79.264848
18,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.778517,-79.346556
27,M1E,Scarborough,Guildwood / Morningside / West Hill,43.7259,-79.340923
36,M1G,Scarborough,Woburn,43.695344,-79.318389
45,M1H,Scarborough,Cedarbrae,43.712751,-79.390197
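
Because `df_can` kept its original row labels after filtering while `coordinates` is labelled 0 to 102, the assignment above pairs rows by label rather than by postal code. A key-based merge is one way to avoid that. The sketch below uses small made-up frames (the coordinate values are placeholders, not real data) to show the difference:

```python
import pandas as pd

# Toy frame that kept a non-contiguous index after filtering
left = pd.DataFrame(
    {"PostalCode": ["M3A", "M4A", "M1B"],
     "Borough": ["North York", "North York", "Scarborough"]},
    index=[2, 3, 9],
)

# Toy coordinates with a fresh 0..n-1 index (placeholder values)
right = pd.DataFrame(
    {"PostalCode": ["M1B", "M3A", "M4A"],
     "Latitude": [1.0, 3.0, 4.0],
     "Longitude": [-1.0, -3.0, -4.0]}
)

# Column assignment aligns on index labels and ignores PostalCode
by_index = left.copy()
by_index["Latitude"] = right["Latitude"]

# A merge matches rows on the postal code itself
by_key = left.merge(right, on="PostalCode", how="left")

print(by_index)
print(by_key)
```

In `by_index`, M3A receives the latitude that belongs to M4A and the other two rows are NaN; in `by_key`, every postal code gets its own coordinates. In the notebook this would correspond to `df_can.merge(coordinates, on='PostalCode')`, which would change the row counts reported below.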


In [38]:
df_can.isnull().any()

PostalCode      False
Borough         False
Neighborhood    False
Latitude         True
Longitude        True
dtype: bool

In [39]:
print('Previous shape : ',df_can.shape)
df_can=df_can[pd.notnull(df_can['Latitude'])]
df_can=df_can[pd.notnull(df_can['Longitude'])]
print('new shape w/o null ', df_can.shape)


Previous shape :  (103, 5)
new shape w/o null  (68, 5)


In [41]:
df_can.isnull().any()

PostalCode      False
Borough         False
Neighborhood    False
Latitude        False
Longitude       False
dtype: bool


## Part 3: Exploration and Clustering of the data

### a) Exploring and visualizing the boroughs

To explore the data, we will follow these steps:

- Get the Toronto city coordinates using Geopy.
- Visualize a map with the Toronto boroughs marked on it using Folium.

For this task, we will need the following packages:


In [45]:
# !pip install geopy
from geopy.geocoders import Nominatim

In [46]:
import folium

In [47]:
address = 'Toronto, Ontario'

locator = Nominatim(user_agent='toronto_exploration')
location = locator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print("coordinates are : {},{}".format(latitude,longitude))

coordinates are : 43.6534817,-79.3839347


In [49]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add one marker per postal code, labelled with its neighborhoods and borough
for lat, lng, borough, neighborhood in zip(df_can['Latitude'], df_can['Longitude'], df_can['Borough'], df_can['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#ff6464',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [50]:
print('Unique boroughs : ',len(df_can['Borough'].unique()))

Unique boroughs :  9


Because the dataset provides coordinates per postal code rather than per neighborhood, we will simplify the analysis: we explore the venues near each group of neighborhoods that shares a postal code and cluster those groups directly, instead of individual neighborhoods.

More explicitly, the problem to be analysed is the following:

For the groups of Toronto neighborhoods that share a postal code (specifically those with 'Toronto' in their borough name, as the instructions advise), we will use the Foursquare API to retrieve the most popular venues within an 800-meter radius (at most 100 per group), and then cluster the groups by their most common venue categories.

Having explained that, we may proceed with the collection of data about venues using Foursquare API.
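
As a rough illustration, the query described above can be written as a URL using only the standard library. The endpoint and parameter names mirror the request made later in this notebook; the credentials are placeholders, and the coordinates are the Toronto coordinates printed earlier, used here only as an example query point.

```python
from urllib.parse import urlencode

# Placeholder credentials (hypothetical values, not real keys)
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",                   # API version date used in this notebook
    "ll": "43.6534817,-79.3839347",    # latitude,longitude of the query point
    "radius": 800,                     # search radius in meters
    "limit": 100,                      # maximum number of venues to return
}

# Keep the comma in `ll` unescaped so the URL stays readable
url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")
print(url)
```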

### b) Defining Foursquare Credentials, Version and some functions we'll need later

In [51]:
import os

# Read the Foursquare credentials from the environment instead of hard-coding them.
# The variable names are an assumption; export them before starting the notebook.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [52]:


# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']



In [59]:
# Function that gets up to 100 venues near each postal code's coordinates
def getNearbyVenues(postal_codes, latitudes, longitudes, radius=800):

    LIMIT = 100
    url = 'https://api.foursquare.com/v2/venues/explore'
    venues_list = []

    for code, lat, lng in zip(postal_codes, latitudes, longitudes):
        print('Querying venues from: ', code)

        # Build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT}

        # Make the GET request and stop early on HTTP errors
        response = req.get(url, params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # Keep only the relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                code,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                # Some venues may have no category; avoid an IndexError
                categories[0]['name'] if categories else None))

    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Postal Code',
        'Neighborhood Group Latitude',
        'Neighborhood Group Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues


### c) Setting up the Neighborhood groups to be analysed and retrieving their closest venues

As we said earlier, we will only select the postal codes whose borough names contain the word 'Toronto'.

Let's select them into a separate DataFrame:


In [56]:
codes_selected = df_can[df_can.Borough.str.contains('Toronto',case=False)]
print('Boroughs that contain Toronto : ', codes_selected.shape)
codes_selected.head()

Boroughs that contain Toronto :  (21, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
30,M4E,East Toronto,The Beaches,43.737473,-79.464763
66,M4K,East Toronto,The Danforth West / Riverdale,43.662696,-79.400049
75,M4L,East Toronto,India Bazaar / The Beaches West,43.669542,-79.422564
84,M4M,East Toronto,Studio District,43.651571,-79.48445
93,M4N,Central Toronto,Lawrence Park,43.667856,-79.532242


In [60]:
all_venues = getNearbyVenues(postal_codes=codes_selected['PostalCode'],
                             latitudes=codes_selected['Latitude'],
                             longitudes=codes_selected['Longitude'])

Querying venues from:  M4E
Querying venues from:  M4K
Querying venues from:  M4L
Querying venues from:  M4M
Querying venues from:  M4N
Querying venues from:  M4P
Querying venues from:  M5A
Querying venues from:  M5B
Querying venues from:  M5C
Querying venues from:  M5E
Querying venues from:  M5G
Querying venues from:  M5H
Querying venues from:  M5J
Querying venues from:  M5K
Querying venues from:  M5L
Querying venues from:  M5N
Querying venues from:  M6G
Querying venues from:  M6H
Querying venues from:  M6J
Querying venues from:  M6K
Querying venues from:  M7A


In [62]:
print('Data shape: ', all_venues.shape)
all_venues.head()

Data shape:  (1010, 7)


Unnamed: 0,Postal Code,Neighborhood Group Latitude,Neighborhood Group Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.737473,-79.464763,Toronto Downsview Airport (YZD),43.738883,-79.470111,Airport
1,M4E,43.737473,-79.464763,Chef 47,43.730483,-79.466422,Turkish Restaurant
2,M4E,43.737473,-79.464763,Pupusa Loka,43.73048,-79.466845,Latin American Restaurant
3,M4E,43.737473,-79.464763,Forget Me Not Cafe Vietnamese Resto-Bar,43.730492,-79.46653,Vietnamese Restaurant
4,M4E,43.737473,-79.464763,Subway,43.731803,-79.462773,Sandwich Place
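
As a sanity check on the 800-meter radius, the distance between a query point and a returned venue can be computed with the haversine formula. This is a sketch using only the standard library and the first row of the table above; the Earth radius of 6,371 km is the usual spherical approximation, and `haversine_m` is a helper name introduced here for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

# Query point for M4E and the first venue listed for it
d = haversine_m(43.737473, -79.464763, 43.738883, -79.470111)
print(round(d), d <= 800)
```

Applied row by row to `all_venues`, the same function would show how far each venue lies from its query point.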


The data retrieved from the Foursquare API contains 1010 venues. We save it to a CSV file to avoid repeating the API calls.

In [63]:
all_venues.to_csv('venues.csv')

In [64]:
all_venues.groupby('Postal Code').count()

Postal Code,Neighborhood Group Latitude,Neighborhood Group Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,14,14,14,14,14,14
M4K,100,100,100,100,100,100
M4L,64,64,64,64,64,64
M4M,67,67,67,67,67,67
M4N,9,9,9,9,9,9
M4P,2,2,2,2,2,2
M5A,22,22,22,22,22,22
M5B,23,23,23,23,23,23
M5C,79,79,79,79,79,79
M5E,9,9,9,9,9,9



### d) Preprocessing the venues data

Next we apply one-hot encoding to the venue categories and then average the encoded rows per postal code, which gives the relative frequency of each venue category for each neighborhood group:
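
To make the transformation concrete, here is a minimal sketch on a made-up four-row table (the postal codes `A` and `B` and the categories are invented for illustration). It one-hot encodes the category column and then takes the mean per postal code:

```python
import pandas as pd

# Made-up venues: three in postal code A, one in postal code B
toy = pd.DataFrame({
    "Postal Code": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One column per category, 1 where the venue belongs to it
one_hot = pd.get_dummies(toy["Venue Category"], dtype=int)
one_hot.insert(0, "Postal Code", toy["Postal Code"])

# The mean of the 0/1 columns is the share of venues in each category
freq = one_hot.groupby("Postal Code").mean()
print(freq)
```

For postal code `A`, two of three venues are cafes, so the `Cafe` frequency is 2/3 and the `Park` frequency is 1/3; each row of `freq` sums to 1.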


In [65]:
# Applying OH Encoding by 'creating dummies'
venues_one_hot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")

# Add the 'Postal Code' column back to dataset
venues_one_hot['Postal Code'] = all_venues['Postal Code'] 

# Move postal code column to the first column
fixed_columns = [venues_one_hot.columns[-1]] + list(venues_one_hot.columns[:-1])
venues_one_hot = venues_one_hot[fixed_columns]

venues_one_hot.head()

Unnamed: 0,Postal Code,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,Arepa Restaurant,...,Tunnel,Turkish Restaurant,Udon Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M4E,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M4E,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [66]:
venues_grouped = venues_one_hot.groupby('Postal Code').mean().reset_index()
venues_grouped

Unnamed: 0,Postal Code,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,Arepa Restaurant,...,Tunnel,Turkish Restaurant,Udon Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,M4E,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,0.0,0.01
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,...,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,0.0,0.0,0.0
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M5A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455
7,M5B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M5C,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,0.0
9,M5E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0


In [67]:


num_top_venues = 5

for code in venues_grouped['Postal Code']:
    print("------------ "+code+" ------------")
    temp = venues_grouped[venues_grouped['Postal Code'] == code].T.reset_index()
    temp.columns = ['Venue Cat','Freq']
    temp = temp.iloc[1:]
    temp['Freq'] = temp['Freq'].astype(float)
    temp = temp.round({'Freq': 2})
    print(temp.sort_values('Freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')



------------ M4E ------------
                   Venue Cat  Freq
0                Coffee Shop  0.21
1                    Airport  0.07
2                       Park  0.07
3  Latin American Restaurant  0.07
4             Sandwich Place  0.07


------------ M4K ------------
                       Venue Cat  Freq
0                           Café  0.10
1                    Coffee Shop  0.04
2                     Restaurant  0.04
3                         Bakery  0.04
4  Vegetarian / Vegan Restaurant  0.04


------------ M4L ------------
           Venue Cat  Freq
0      Grocery Store  0.12
1  Korean Restaurant  0.09
2        Coffee Shop  0.08
3               Café  0.05
4              Diner  0.05


------------ M4M ------------
     Venue Cat  Freq
0  Coffee Shop  0.10
1         Café  0.07
2  Pizza Place  0.06
3          Pub  0.04
4       Bakery  0.04


------------ M4N ------------
       Venue Cat  Freq
0       Pharmacy  0.22
1           Café  0.11
2     Playground  0.11
3  Grocery Store  

In [68]:


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]



In [69]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create the columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create the new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Postal Code'] = venues_grouped['Postal Code']

for ind in np.arange(venues_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Coffee Shop,Airport,Food Court,Sandwich Place,Middle Eastern Restaurant,Latin American Restaurant,Turkish Restaurant,Chinese Restaurant,Gas Station,Pizza Place
1,M4K,Café,Vegetarian / Vegan Restaurant,Restaurant,Bakery,Coffee Shop,Italian Restaurant,Pizza Place,Bar,Bookstore,Park
2,M4L,Grocery Store,Korean Restaurant,Coffee Shop,Diner,Café,Indian Restaurant,Japanese Restaurant,Park,Pizza Place,Cocktail Bar
3,M4M,Coffee Shop,Café,Pizza Place,Bakery,Italian Restaurant,Pub,Park,Gastropub,Sushi Restaurant,Bank
4,M4N,Pharmacy,Café,Playground,Shopping Mall,Bank,Skating Rink,Grocery Store,Park,College Theater,Deli / Bodega
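
The column names above rely on a three-element suffix list with a fallback to 'th', which works for 1 to 10 but would produce '21th' or '22th' if `num_top_venues` were raised past 20. A small helper, sketched here under the hypothetical name `ordinal`, handles the general case:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    # 11, 12 and 13 (and 111, 112, ...) always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Build the same column list as in the cell above
columns = ["Postal Code"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns)
```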



### e) Clustering the neighborhood groups using K-Means



In [86]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [87]:
venue_cluster = venues_grouped.drop(columns='Postal Code')
venue_cluster.head()

Unnamed: 0,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Aquarium,Arepa Restaurant,Art Gallery,...,Tunnel,Turkish Restaurant,Udon Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071429,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,0.0,0.01
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015625,...,0.0,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [94]:
# Number of clusters to form
k_clusters = 10

# Run K-Means clustering algorithm
model = KMeans(n_clusters=k_clusters, random_state=0).fit(venue_cluster)

# Check cluster labels generated for each row in the dataframe
model.labels_[0:10]

array([8, 6, 6, 6, 5, 1, 2, 2, 6, 4])
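
A quick way to see how the rows are spread across clusters is to count the labels. The sketch below uses only the ten labels printed above, so it covers part of the data; in the notebook, `model.labels_` and `k_clusters` would take the place of the hard-coded values.

```python
import numpy as np

# The ten labels printed above (only part of the full label array)
labels = np.array([8, 6, 6, 6, 5, 1, 2, 2, 6, 4])

# Count how many rows fall into each of the 10 clusters
sizes = np.bincount(labels, minlength=10)
print(dict(enumerate(sizes.tolist())))
```

Clusters with a single member, or none, are a hint that the chosen number of clusters may be high for the amount of data.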

In [95]:
# Add clustering labels (drop a stale column first so the cell can be re-run)
if 'Cluster Labels' in venues_sorted.columns:
    venues_sorted = venues_sorted.drop(columns='Cluster Labels')
venues_sorted.insert(0, 'Cluster Labels', model.labels_)

# Work on a copy so that codes_selected is left untouched
venues_merged = codes_selected.copy()

# Merge venues_sorted with the DF with the Postal Codes selected
# to add latitude/longitude for each neighborhood group
venues_merged = venues_merged.join(venues_sorted.set_index('Postal Code'), on='PostalCode')

venues_merged.head()

In [90]:
# Create a Folium map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set the color scheme for the clusters (one color per cluster label)
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, poi, cluster in zip(venues_merged['Latitude'], venues_merged['Longitude'], venues_merged['PostalCode'], venues_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [96]:
venues_merged.loc[venues_merged['Cluster Labels'] == 0, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
76,Downtown Toronto,Commerce Court / Victoria Hotel,0,Park,Coffee Shop,Bakery,Portuguese Restaurant,Bar,Café,Gym,Pharmacy,Camera Store,Music Venue


In [97]:
venues_merged.loc[venues_merged['Cluster Labels'] == 1, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
102,Central Toronto,Davisville North,1,Lounge,Rental Car Location,Deli / Bodega,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Discount Store,Diner,Dim Sum Restaurant


In [99]:
venues_merged.loc[venues_merged['Cluster Labels'] == 2, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Downtown Toronto,Regent Park / Harbourfront,2,Coffee Shop,Bakery,Indian Restaurant,Gym / Fitness Center,Hakka Restaurant,Gas Station,Fried Chicken Joint,Flower Shop,Music Store,Convenience Store
13,Downtown Toronto,"Garden District, Ryerson",2,Pharmacy,Shopping Mall,Coffee Shop,Intersection,Pizza Place,Thai Restaurant,Sandwich Place,Rental Car Location,Italian Restaurant,Bank
6,Downtown Toronto,Queen's Park / Ontario Provincial Government,2,Coffee Shop,Grocery Store,Light Rail Station,Department Store,Discount Store,Sandwich Place,Bus Line,Bus Station,Fast Food Restaurant,Bank


In [101]:
venues_merged.loc[venues_merged['Cluster Labels'] == 3, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,West Toronto,Dufferin / Dovercourt Village,3,Park,Trail,Grocery Store,Playground,Candy Store,Bank,Deli / Bodega,Donut Shop,Doner Restaurant,Dog Run


In [102]:
venues_merged.loc[venues_merged['Cluster Labels'] == 4, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
31,Downtown Toronto,Berczy Park,4,Park,Shopping Mall,Bank,Spa,Pizza Place,Moving Target,Grocery Store,Vietnamese Restaurant,Deli / Bodega,Discount Store


In [103]:
venues_merged.loc[venues_merged['Cluster Labels'] == 5, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
93,Central Toronto,Lawrence Park,5,Pharmacy,Café,Playground,Shopping Mall,Bank,Skating Rink,Grocery Store,Park,College Theater,Deli / Bodega


In [104]:
venues_merged.loc[venues_merged['Cluster Labels'] == 6, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
66,East Toronto,The Danforth West / Riverdale,6,Café,Vegetarian / Vegan Restaurant,Restaurant,Bakery,Coffee Shop,Italian Restaurant,Pizza Place,Bar,Bookstore,Park
75,East Toronto,India Bazaar / The Beaches West,6,Grocery Store,Korean Restaurant,Coffee Shop,Diner,Café,Indian Restaurant,Japanese Restaurant,Park,Pizza Place,Cocktail Bar
84,East Toronto,Studio District,6,Coffee Shop,Café,Pizza Place,Bakery,Italian Restaurant,Pub,Park,Gastropub,Sushi Restaurant,Bank
22,Downtown Toronto,St. James Town,6,Coffee Shop,Pizza Place,Korean Restaurant,Ramen Restaurant,Sushi Restaurant,Sandwich Place,Fast Food Restaurant,Restaurant,Café,Japanese Restaurant
40,Downtown Toronto,Central Bay Street,6,Café,Coffee Shop,Pizza Place,Park,Gastropub,Greek Restaurant,Pub,Beer Bar,Breakfast Spot,Fast Food Restaurant
49,Downtown Toronto,Richmond / Adelaide / King,6,Coffee Shop,Sushi Restaurant,Italian Restaurant,Thai Restaurant,Grocery Store,Gym,Restaurant,Spa,Pub,Pizza Place
58,Downtown Toronto,Harbourfront East / Union Station / Toronto Is...,6,Coffee Shop,Café,Hotel,Theater,Japanese Restaurant,American Restaurant,Restaurant,Breakfast Spot,Clothing Store,Italian Restaurant
67,Downtown Toronto,Toronto Dominion Centre / Design Exchange,6,Café,Bar,Vegetarian / Vegan Restaurant,Coffee Shop,Art Gallery,Dessert Shop,Mexican Restaurant,Tea Room,Gaming Cafe,Bakery
41,Downtown Toronto,Christie,6,Greek Restaurant,Coffee Shop,Pub,Café,Fast Food Restaurant,Italian Restaurant,Ice Cream Shop,Pizza Place,Breakfast Spot,Bookstore
59,West Toronto,Little Portugal / Trinity,6,Coffee Shop,Hotel,Boat or Ferry,Park,Japanese Restaurant,Brewery,Gym,Restaurant,Music Venue,Sandwich Place


In [105]:
venues_merged.loc[venues_merged['Cluster Labels'] == 7, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,Central Toronto,Roselawn,7,Pizza Place,Gym,Hotel,Convenience Store,Café,Restaurant,Mexican Restaurant,Bank,Theater,Coffee Shop


In [106]:
venues_merged.loc[venues_merged['Cluster Labels'] == 8, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,East Toronto,The Beaches,8,Coffee Shop,Airport,Food Court,Sandwich Place,Middle Eastern Restaurant,Latin American Restaurant,Turkish Restaurant,Chinese Restaurant,Gas Station,Pizza Place


In [107]:
venues_merged.loc[venues_merged['Cluster Labels'] == 9, venues_merged.columns[[1,2] + list(range(5, venues_merged.shape[1]))]]

Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
68,West Toronto,Brockton / Parkdale Village / Exhibition Place,9,Harbor / Marina,Boat or Ferry,Airport Service,Airport Lounge,Airport Terminal,Rental Car Location,Sculpture Garden,Coffee Shop,Music Venue,Plane
