# Segmenting and Clustering Neighborhoods in Toronto

This notebook is divided into three parts.  
Part 1: Number of rows of the dataframe  
Part 2: Creating the given dataframe using the CSV file  
Part 3: Clustering Neighborhoods in Toronto

## Part 1: Number of rows of the dataframe

Importing Required Modules

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Obtaining HTML data from the Wikipedia page

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url).text
html_data

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of postal codes of Canada: M - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"821b4139-c7e1-45f8-a276-44c37182b800","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":1029579868,"wgRevisionId":1029579868,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communications in Ont

Creating Beautiful Soup Object

In [3]:
soup = BeautifulSoup(html_data, 'html.parser')

Creating a list from soup object

In [4]:
table_contents=[] # Creating an Empty List
table=soup.find('table') # Finding Table
for row in table.findAll('td'): # Finding Table Data
    cell = {}
    if row.span.text=='Not assigned': # Dropping rows with a borough that is not assigned
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] # Extracting Postal Code 
        cell['Borough'] = (row.span.text).split('(')[0] #Extracting boroughs
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ') #Extracting Neighborhoods
        table_contents.append(cell) #Appending to the list
print(table_contents)

[{'PostalCode': 'M3A', 'Borough': 'North York', 'Neighborhood': 'Parkwoods'}, {'PostalCode': 'M4A', 'Borough': 'North York', 'Neighborhood': 'Victoria Village'}, {'PostalCode': 'M5A', 'Borough': 'Downtown Toronto', 'Neighborhood': 'Regent Park, Harbourfront'}, {'PostalCode': 'M6A', 'Borough': 'North York', 'Neighborhood': 'Lawrence Manor, Lawrence Heights'}, {'PostalCode': 'M7A', 'Borough': "Queen's Park", 'Neighborhood': 'Ontario Provincial Government'}, {'PostalCode': 'M9A', 'Borough': 'Etobicoke', 'Neighborhood': 'Islington Avenue'}, {'PostalCode': 'M1B', 'Borough': 'Scarborough', 'Neighborhood': 'Malvern, Rouge'}, {'PostalCode': 'M3B', 'Borough': 'North York', 'Neighborhood': 'Don Mills North'}, {'PostalCode': 'M4B', 'Borough': 'East York', 'Neighborhood': 'Parkview Hill, Woodbine Gardens'}, {'PostalCode': 'M5B', 'Borough': 'Downtown Toronto', 'Neighborhood': 'Garden District, Ryerson'}, {'PostalCode': 'M6B', 'Borough': 'North York', 'Neighborhood': 'Glencairn'}, {'PostalCode': 'M9
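
The string handling in the loop above can be tested in isolation. The following is a minimal sketch, assuming each cell's text has the form `Borough(Neighborhood / Neighborhood)` as the parsing code implies; the helper name `split_borough_cell` is hypothetical and not part of the notebook.

```python
def split_borough_cell(text):
    """Split 'Borough(Hood A / Hood B)' into (borough, 'Hood A, Hood B')."""
    # Everything before the first '(' is the borough.
    borough, _, rest = text.partition('(')
    # Drop closing parentheses and turn ' /' separators into commas.
    neighborhood = rest.replace(')', ' ').replace(' /', ',').strip()
    return borough.strip(), neighborhood


print(split_borough_cell('North York(Parkwoods)'))
print(split_borough_cell('Downtown Toronto(Regent Park / Harbourfront)'))
```

Keeping the parsing in a small function makes it easier to check edge cases before running it over the whole table.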

Creating dataframe df with the obtained list

In [5]:
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df.rename(columns={'PostalCode': 'Postal Code'}, inplace=True)
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


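Grouping by postal code. Each postal code already appears only once in the scraped table, so `groupby(...).head()` leaves the dataframe unchanged, as the identical output below shows.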

In [6]:
df = df.groupby(['Postal Code']).head()
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Ensuring no "Not assigned" neighborhoods are in the dataframe.

In [7]:
df.Neighborhood.str.count("Not assigned").sum()

0

In [8]:
print("Number of rows in the dataframe:",len(df.index))
print("Number of columns in the dataframe:",len(df.columns))

Number of rows in the dataframe: 103
Number of columns in the dataframe: 3


There are 103 rows in the dataframe.

## Part 2: Creating the given dataframe using the CSV file

In [9]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv"
data = pd.read_csv(url)
data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [10]:
print(data.shape==df.shape)

True
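
Matching shapes do not by themselves guarantee that both dataframes contain the same postal codes. A minimal sketch of a stricter check, written against plain sequences so it runs without the notebook's data; the helper name `key_mismatches` is hypothetical.

```python
def key_mismatches(left_keys, right_keys):
    """Return the keys found only on the left and only on the right."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# In the notebook this could be called as
# key_mismatches(df['Postal Code'], data['Postal Code']).
only_left, only_right = key_mismatches(['M3A', 'M4A', 'M5A'], ['M4A', 'M5A', 'M1B'])
print(only_left, only_right)
```

Two empty lists mean an inner join on `Postal Code` will not drop any rows.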


Both dataframes have the same shape, so they can be joined on `Postal Code`.  
First, checking the column types of both dataframes:

In [11]:
df.dtypes

Postal Code     object
Borough         object
Neighborhood    object
dtype: object

In [12]:
data.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

Using inner join

In [13]:
neighborhoods = df.join(data.set_index('Postal Code'), on='Postal Code', how='inner')
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto Business,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


#### The dataframe has been created as required.

## Part 3: Clustering Neighborhoods in Toronto

Importing necessary modules

In [14]:
from geopy.geocoders import Nominatim
import folium

In [15]:
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

In [16]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Creating a map of Toronto with neighborhoods superimposed on top

In [17]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], 
                                           neighborhoods['Longitude'],
                                          neighborhoods['Borough'],
                                          neighborhoods['Neighborhood']):
    label= '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Simplifying the map by segmenting and clustering only the neighborhoods in North York

In [18]:
north_york_data = neighborhoods[neighborhoods['Borough'] == 'North York'].reset_index(drop=True)
north_york_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


Getting geographical coordinates of North York

In [19]:
address = 'North York, Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


Visualizing North York with the neighborhoods in it.

In [20]:
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(north_york_data['Latitude'], north_york_data['Longitude'], north_york_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  
    
map_north_york

Defining Foursquare Credentials & Version

In [21]:
CLIENT_ID = 'U3RQE2KIF2JZQVAGKDG3ME3BNOFGAWZYW2FCBN0CT1X4RFL3' # your Foursquare ID
CLIENT_SECRET = 'XMBGQX5QFZN3U3LOE3JFKZI4MNWB4W5P1BXICWZKJ2ZEHVGN' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: U3RQE2KIF2JZQVAGKDG3ME3BNOFGAWZYW2FCBN0CT1X4RFL3
CLIENT_SECRET:XMBGQX5QFZN3U3LOE3JFKZI4MNWB4W5P1BXICWZKJ2ZEHVGN
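
Hard-coding and printing API secrets makes them visible to anyone who can read the notebook. A minimal sketch of an alternative, assuming the credentials are stored in environment variables; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables."""
    client_id = env.get('FOURSQUARE_CLIENT_ID')          # hypothetical variable name
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret


# Example with an explicit mapping instead of the real environment.
print(load_credentials({'FOURSQUARE_CLIENT_ID': 'my-id',
                        'FOURSQUARE_CLIENT_SECRET': 'my-secret'}))
```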


#### Exploring the first neighborhood in the dataframe

Getting neighborhood's name

In [22]:
north_york_data.loc[0, 'Neighborhood']

'Parkwoods'

Getting the neighborhood's latitude and longitude values

In [23]:
neighborhood_latitude = north_york_data.loc[0, 'Latitude']
neighborhood_longitude = north_york_data.loc[0, 'Longitude']
neighborhood_name = north_york_data.loc[0, 'Neighborhood']
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


#### Getting the top 100 venues within a 500-meter radius of Parkwoods

In [24]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=U3RQE2KIF2JZQVAGKDG3ME3BNOFGAWZYW2FCBN0CT1X4RFL3&client_secret=XMBGQX5QFZN3U3LOE3JFKZI4MNWB4W5P1BXICWZKJ2ZEHVGN&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'
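
The same request URL can also be assembled from a dictionary of parameters, which avoids a long format string and handles escaping. This is a sketch using only the standard library; the helper name `build_explore_url` and the placeholder credentials are assumptions, and the parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build a venues/explore request URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # urlencode escapes the comma as %2C
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605',
                                43.7532586, -79.3296565, 500, 100)
print(example_url)
```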

In [25]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60e01f66d298e558b620afff'},
 'response': {'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e6696b6d16433b9ffff47c3',
       'name': 'KFC',
       'location': {'lat': 43.75438666345904,
        'lng': -79.3330206627504,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.75438666345904,
          'lng': -79.3330206627504}],
        'distance': 298,
        'cc': 'CA',
        'country': 'Canada',
        'format
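
The `distance` of 298 reported for KFC in the response above can be cross-checked against the 500-meter radius with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6,371 km; how Foursquare computes its own distance is not stated in the response, so small differences are expected.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


parkwoods = (43.7532586, -79.3296565)          # neighborhood coordinates used in the request
kfc = (43.75438666345904, -79.3330206627504)   # venue coordinates from the response
distance = haversine_m(*parkwoods, *kfc)
print(round(distance), 'meters; within radius:', distance <= 500)
```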

Function that extracts the category of the venue

In [26]:
def get_category_type(row):
    # the categories may sit under either key, depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [27]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,KFC,Fast Food Restaurant,43.754387,-79.333021
1,Brookbanks Park,Park,43.751976,-79.33214
2,649 Variety,Convenience Store,43.754513,-79.331942
3,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [28]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


#### Repeating the same process to all neighborhoods in North York

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
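
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue with an empty category list. A minimal sketch of a more defensive extraction, assuming each item has the `venue` structure shown in the earlier response; the helper name `venue_record` is hypothetical.

```python
def venue_record(item):
    """Extract (name, lat, lng, category) from one item, tolerating missing categories."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)


sample_item = {'venue': {'name': 'KFC',
                         'location': {'lat': 43.754387, 'lng': -79.333021},
                         'categories': [{'name': 'Fast Food Restaurant'}]}}
print(venue_record(sample_item))
```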

Running the above function on each neighborhood and creating a new dataframe called `north_york_venues`

In [30]:
north_york_venues = getNearbyVenues(names=north_york_data['Neighborhood'],
                                   latitudes=north_york_data['Latitude'],
                                   longitudes=north_york_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills North
Glencairn
Don Mills South
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview East
York Mills, Silver Hills
Downsview West
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview Central
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale South
Downsview Northwest
York Mills West
Willowdale West


Checking how many venues were returned for each neighborhood

In [31]:
north_york_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
Don Mills North,5,5,5,5,5,5
Don Mills South,19,19,19,19,19,19
Downsview Central,3,3,3,3,3,3
Downsview East,3,3,3,3,3,3
Downsview Northwest,5,5,5,5,5,5
Downsview West,5,5,5,5,5,5
"Fairview, Henry Farm, Oriole",69,69,69,69,69,69


#### Finding out how many unique categories can be curated from all the returned venues


In [32]:
print('There are {} unique categories.'.format(len(north_york_venues['Venue Category'].unique())))

There are 101 unique categories.


### Analyzing each neighborhood

In [33]:
# one hot encoding
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Supermarket,Supplement Shop,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category

In [34]:
north_york_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()
north_york_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Supermarket,Supplement Shop,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.095238,...,0.047619,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.04,0.0,0.04,0.0,0.04,0.0,0.0,0.04
3,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills South,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,...,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview East,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Fairview, Henry Farm, Oriole",0.014493,0.0,0.014493,0.0,0.0,0.014493,0.0,0.014493,0.028986,...,0.0,0.014493,0.0,0.014493,0.0,0.014493,0.028986,0.014493,0.0,0.028986
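
Because every venue row carries a single 1 in its one-hot encoding, the group mean of a category column equals the share of that neighborhood's venues in the category. A sketch of the same computation with the standard library, using the four Parkwoods categories returned earlier; the helper name `category_frequencies` is hypothetical.

```python
from collections import Counter


def category_frequencies(categories):
    """Share of venues per category, equivalent to the mean of one-hot columns."""
    counts = Counter(categories)
    total = len(categories)
    return {category: count / total for category, count in counts.items()}


parkwoods_categories = ['Fast Food Restaurant', 'Park',
                        'Convenience Store', 'Food & Drink Shop']
print(category_frequencies(parkwoods_categories))
```

With four venues in four different categories, each category accounts for a quarter of the neighborhood's venues.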


Confirming the new size. Only 23 of the 24 North York neighborhoods appear, because one of them returned no venues.

In [35]:
north_york_grouped.shape

(23, 102)

Printing each neighborhood along with the top 5 most common venues

In [36]:
num_top_venues = 5

for hood in north_york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = north_york_grouped[north_york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
         venue  freq
0         Bank  0.10
1  Coffee Shop  0.10
2     Pharmacy  0.05
3  Bridal Shop  0.05
4  Gas Station  0.05


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0      Sandwich Place  0.08
1  Italian Restaurant  0.08
2         Coffee Shop  0.08
3       Women's Store  0.04
4        Cupcake Shop  0.04


----Don Mills North----
                  venue  freq
0                   Gym   0.2
1  Caribbean Restaurant   0.2
2                  Café   0.2
3          Dessert Shop   0.2
4   Japanese Restaurant   0.2


----Don Mills South----
                venue  freq
0                 Gym  0.11
1          Restaurant  0.11
2         Coffee Shop  0.11
3  Dim Sum Restaurant  0.05
4          Beer Store  0.05


----

Writing a function to sort the venues in descending order

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Creating a _pandas_ dataframe and displaying the top 10 venues for each neighborhood

In [38]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = north_york_grouped['Neighborhood']

for ind in np.arange(north_york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(north_york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Pharmacy,Bridal Shop,Gas Station,Ice Cream Shop,Fried Chicken Joint,Diner,Middle Eastern Restaurant,Deli / Bodega
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Accessories Store,Liquor Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,"Bedford Park, Lawrence Manor East",Sandwich Place,Italian Restaurant,Coffee Shop,Women's Store,Cupcake Shop,Butcher,Café,Pizza Place,Pharmacy,Liquor Store
3,Don Mills North,Gym,Caribbean Restaurant,Café,Dessert Shop,Japanese Restaurant,Accessories Store,Lounge,Movie Theater,Mobile Phone Shop,Miscellaneous Shop
4,Don Mills South,Gym,Restaurant,Coffee Shop,Dim Sum Restaurant,Beer Store,Sandwich Place,Clothing Store,Sporting Goods Shop,Supermarket,Bike Shop
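
The column labels above rely on a `try`/`except` around a three-element suffix list, which works for 1 to 10 but would produce labels such as "21th" for larger values. A sketch of a general ordinal helper, offered as an alternative rather than the notebook's method; the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 11 <= n % 100 <= 13:
        suffix = 'th'  # 11th, 12th, 13th are exceptions
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns)
```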


### Clustering Neighborhoods

Running k-means to cluster the neighborhoods into 5 clusters

In [39]:
# set number of clusters
kclusters = 5

north_york_grouped_clustering = north_york_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(north_york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
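
To illustrate what the clustering step does with the frequency vectors, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration behind k-means. It is not scikit-learn's implementation: it starts from the first `k` points for reproducibility, whereas `KMeans` uses k-means++ initialization by default. The function name `simple_kmeans` is hypothetical.

```python
def simple_kmeans(points, k, iterations=10):
    """Cluster points with Lloyd's algorithm, seeding centroids with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
toy_labels, toy_centroids = simple_kmeans(toy_points, k=2)
print(toy_labels, toy_centroids)
```

In the notebook each point would be one neighborhood's row of category frequencies rather than a 2-D coordinate.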

In [40]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = north_york_data

# merge north_york_data with neighborhoods_venues_sorted to add cluster labels and top venues for each neighborhood
north_york_merged = north_york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

north_york_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,3.0,Park,Convenience Store,Food & Drink Shop,Fast Food Restaurant,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Hockey Arena,Financial or Legal Service,Portuguese Restaurant,Coffee Shop,Accessories Store,Liquor Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Miscellaneous Shop,Furniture / Home Store,Event Space,Coffee Shop,Gift Shop,Bar
3,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Gym,Caribbean Restaurant,Café,Dessert Shop,Japanese Restaurant,Accessories Store,Lounge,Movie Theater,Mobile Phone Shop,Miscellaneous Shop
4,M6B,North York,Glencairn,43.709577,-79.445073,0.0,Bakery,Sushi Restaurant,Park,Japanese Restaurant,Accessories Store,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant


In [41]:
# drop neighborhoods that have no cluster label because no venues were returned for them
north_york_merged_nonan = north_york_merged.dropna(subset=['Cluster Labels'])

Visualizing resulting clusters

In [42]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
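# iterate over the NaN-free dataframe for every column so coordinates and labels stay aligned
for lat, lon, poi, cluster in zip(north_york_merged_nonan['Latitude'], north_york_merged_nonan['Longitude'], north_york_merged_nonan['Neighborhood'], north_york_merged_nonan['Cluster Labels']):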
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examining each of the clusters

In [43]:
north_york_merged_nonan.loc[north_york_merged_nonan['Cluster Labels'] == 0, north_york_merged_nonan.columns[[1] + list(range(5, north_york_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Hockey Arena,Financial or Legal Service,Portuguese Restaurant,Coffee Shop,Accessories Store,Liquor Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant
2,North York,0.0,Clothing Store,Accessories Store,Boutique,Vietnamese Restaurant,Miscellaneous Shop,Furniture / Home Store,Event Space,Coffee Shop,Gift Shop,Bar
3,North York,0.0,Gym,Caribbean Restaurant,Café,Dessert Shop,Japanese Restaurant,Accessories Store,Lounge,Movie Theater,Mobile Phone Shop,Miscellaneous Shop
4,North York,0.0,Bakery,Sushi Restaurant,Park,Japanese Restaurant,Accessories Store,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant
5,North York,0.0,Gym,Restaurant,Coffee Shop,Dim Sum Restaurant,Beer Store,Sandwich Place,Clothing Store,Sporting Goods Shop,Supermarket,Bike Shop
6,North York,0.0,Golf Course,Fast Food Restaurant,Mediterranean Restaurant,Pool,Dog Run,Hockey Arena,Juice Bar,Miscellaneous Shop,Middle Eastern Restaurant,Grocery Store
7,North York,0.0,Bank,Coffee Shop,Pharmacy,Bridal Shop,Gas Station,Ice Cream Shop,Fried Chicken Joint,Diner,Middle Eastern Restaurant,Deli / Bodega
8,North York,0.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Women's Store,Jewelry Store,Japanese Restaurant,Bank,Electronics Store,Toy / Game Store
9,North York,0.0,Coffee Shop,Furniture / Home Store,Caribbean Restaurant,Miscellaneous Shop,Bar,Massage Studio,Liquor Store,Park,Movie Theater,Mobile Phone Shop
10,North York,0.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Accessories Store,Liquor Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant


In [44]:
north_york_merged_nonan.loc[north_york_merged_nonan['Cluster Labels'] == 1, north_york_merged_nonan.columns[[1] + list(range(5, north_york_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,North York,1.0,Baseball Field,Accessories Store,Jewelry Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio,Luggage Store


In [45]:
north_york_merged_nonan.loc[north_york_merged_nonan['Cluster Labels'] == 2, north_york_merged_nonan.columns[[1] + list(range(5, north_york_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,North York,2.0,Park,Jewelry Store,Movie Theater,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio,Luggage Store,Lounge


In [46]:
north_york_merged_nonan.loc[north_york_merged_nonan['Cluster Labels'] == 3, north_york_merged_nonan.columns[[1] + list(range(5, north_york_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,3.0,Park,Convenience Store,Food & Drink Shop,Fast Food Restaurant,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio
22,North York,3.0,Convenience Store,Park,Accessories Store,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio,Luggage Store


In [47]:
north_york_merged_nonan.loc[north_york_merged_nonan['Cluster Labels'] == 4, north_york_merged_nonan.columns[[1] + list(range(5, north_york_merged_nonan.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,North York,4.0,Furniture / Home Store,Pizza Place,Accessories Store,Jewelry Store,Mobile Phone Shop,Miscellaneous Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Massage Studio,Luggage Store


The neighborhoods of North York, Toronto, have been clustered successfully.