<h1>Segmenting and Clustering Neighborhoods in Toronto, Canada - Urvashi M.</h1>

<h2>Introduction</h2>

<p>In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information. Unlike New York, however, the neighborhood data is not readily available on the internet. What is interesting about data science is that each project can be challenging in its own way, so you need to be agile and able to learn new libraries and tools quickly as each project demands.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.</p>

<h2>1. Capture and prepare Toronto neighborhood data from Wikipedia (web scraping) for analysis</h2>

In [1]:
# install the Beautiful Soup library for web scraping (html5lib is the parser used below)
!pip install beautifulsoup4 html5lib



In [2]:
#install Folium for maps
!conda install -c conda-forge folium=0.5.0 --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [3]:
from bs4 import BeautifulSoup  # for web scraping
import requests  # to get() data from the URI
from IPython.display import display_html

import pandas as pd
from pandas import json_normalize  # transform JSON into a pandas dataframe

import numpy as np
import json # library to handle JSON files

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# map rendering library
import folium

print('Libraries imported.')

Libraries imported.


<h3>1.1 Download Toronto neighborhood data from the Wikipedia page using the Beautiful Soup Python library</h3>

In [4]:
# List of postal codes of Toronto, Canada: three-character codes beginning with 'M'
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')
# Examine table contents
print(soup.title)
postalCodeTable = str(soup.table)
display_html(postalCodeTable, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


0,1,2,3,4,5,6,7,8
M1A Not assigned,M2A Not assigned,M3A North York (Parkwoods),M4A North York (Victoria Village),M5A Downtown Toronto (Regent Park / Harbourfront),M6A North York (Lawrence Manor / Lawrence Heights),M7A Queen's Park (Ontario Provincial Government),M8A Not assigned,M9A Etobicoke (Islington Avenue)
M1B Scarborough (Malvern / Rouge),M2B Not assigned,M3B North York (Don Mills) North,M4B East York (Parkview Hill / Woodbine Gardens),"M5B Downtown Toronto (Garden District, Ryerson)",M6B North York (Glencairn),M7B Not assigned,M8B Not assigned,M9B Etobicoke (West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale)
M1C Scarborough (Rouge Hill / Port Union / Highland Creek),M2C Not assigned,M3C North York (Don Mills) South (Flemingdon Park),M4C East York (Woodbine Heights),M5C Downtown Toronto (St. James Town),M6C York (Humewood-Cedarvale),M7C Not assigned,M8C Not assigned,M9C Etobicoke (Eringate / Bloordale Gardens / Old Burnhamthorpe / Markland Wood)
M1E Scarborough (Guildwood / Morningside / West Hill),M2E Not assigned,M3E Not assigned,M4E East Toronto (The Beaches),M5E Downtown Toronto (Berczy Park),M6E York (Caledonia-Fairbanks),M7E Not assigned,M8E Not assigned,M9E Not assigned
M1G Scarborough (Woburn),M2G Not assigned,M3G Not assigned,M4G East York (Leaside),M5G Downtown Toronto (Central Bay Street),M6G Downtown Toronto (Christie),M7G Not assigned,M8G Not assigned,M9G Not assigned
M1H Scarborough (Cedarbrae),M2H North York (Hillcrest Village),M3H North York (Bathurst Manor / Wilson Heights / Downsview North),M4H East York (Thorncliffe Park),M5H Downtown Toronto (Richmond / Adelaide / King),M6H West Toronto (Dufferin / Dovercourt Village),M7H Not assigned,M8H Not assigned,M9H Not assigned
M1J Scarborough (Scarborough Village),M2J North York (Fairview / Henry Farm / Oriole),M3J North York (Northwood Park / York University),M4J East York East Toronto (The Danforth East),M5J Downtown Toronto (Harbourfront East / Union Station / Toronto Islands),M6J West Toronto (Little Portugal / Trinity),M7J Not assigned,M8J Not assigned,M9J Not assigned
M1K Scarborough (Kennedy Park / Ionview / East Birchmount Park),M2K North York (Bayview Village),M3K North York (Downsview) East (CFB Toronto),M4K East Toronto (The Danforth West / Riverdale),M5K Downtown Toronto (Toronto Dominion Centre / Design Exchange),M6K West Toronto (Brockton / Parkdale Village / Exhibition Place),M7K Not assigned,M8K Not assigned,M9K Not assigned
M1L Scarborough (Golden Mile / Clairlea / Oakridge),M2L North York (York Mills / Silver Hills),M3L North York (Downsview) West,M4L East Toronto (India Bazaar / The Beaches West),M5L Downtown Toronto (Commerce Court / Victoria Hotel),M6L North York (North Park / Maple Leaf Park / Upwood Park),M7L Not assigned,M8L Not assigned,M9L North York (Humber Summit)
M1M Scarborough (Cliffside / Cliffcrest / Scarborough Village West),M2M North York (Willowdale / Newtonbrook),M3M North York (Downsview) Central,M4M East Toronto (Studio District),M5M North York (Bedford Park / Lawrence Manor East),M6M York (Del Ray / Mount Dennis / Keelsdale and Silverthorn),M7M Not assigned,M8M Not assigned,M9M North York (Humberlea / Emery)


<h3>1.2 Preprocess the data and build the pandas dataframe</h3>

In [5]:
# Build a dataframe from the HTML table above.
# Columns: PostalCode, Borough, Neighborhood. Cells whose borough is 'Not assigned' are skipped.
# Neighborhoods that share a postal code stay in one row as comma-separated values.
table_contents = []
table = soup.find('table')
for row in table.find_all('td'):
    cell = {}
    if row.span.text == 'Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /', ',')).replace(')', ' ')).strip(' ')
        table_contents.append(cell)

#print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

# Replacing the name of the neighbourhoods which are 'Not assigned' with names of Borough
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned',df['Borough'], df['Neighborhood'])

df  # examine the dataframe

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
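
The string handling in the cell above can be hard to follow, so here is a minimal sketch of the same idea applied to plain cell text, assuming each cell reads like "M5A Downtown Toronto (Regent Park / Harbourfront)". The helper name `parse_cell` and the regular expression are illustrative assumptions, not part of the notebook, and cells with more than one pair of parentheses would need extra handling.

```python
import re

# Hypothetical pattern: postal code, borough, optional "(neighborhoods)" part.
CELL_PATTERN = re.compile(
    r'^(?P<code>[A-Z]\d[A-Z])\s*(?P<borough>[^(]*?)\s*(?:\((?P<hoods>.*)\))?\s*$'
)

def parse_cell(text):
    """Return a dict for an assigned cell, or None for 'Not assigned'."""
    match = CELL_PATTERN.match(text.strip())
    if match is None or match.group('borough') == 'Not assigned':
        return None
    # Fall back to the borough name when no neighborhood is listed
    hoods = match.group('hoods') or match.group('borough')
    return {
        'PostalCode': match.group('code'),
        'Borough': match.group('borough'),
        'Neighborhood': ', '.join(part.strip() for part in hoods.split('/')),
    }

print(parse_cell('M5A Downtown Toronto (Regent Park / Harbourfront)'))
print(parse_cell('M1A Not assigned'))
```

Keeping the parsing in one small function makes it easy to try individual cells by hand before running it over the whole table.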


<h3>1.3 Shape of the final Dataframe</h3>

In [6]:
# Print the shape of this dataframe 
df.shape

(103, 3)

<h2>2. Geocoding of Toronto Postal Codes/Neighborhoods</h2>

<h3>2.1 Import geocodes in CSV file for Toronto Neighborhoods</h3>
<p>Load the geocodes from the CSV file at https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
using the pandas read_csv function.</p>

In [7]:
#download csv file
csv_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
geocodes = pd.read_csv(csv_url)
geocodes.head() #examine first 5 rows

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>2.2 Geocode Toronto Borough / Neighborhood data by merging dataframes using postal codes</h3>

In [8]:
# Rename the join key 'Postal Code' in the geocodes dataframe to 'PostalCode', matching the Toronto borough dataframe 'df'
geocodes.rename(columns={'Postal Code':'PostalCode'},inplace=True)
geocodes.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Merge on 'PostalCode'
df1 = pd.merge(df, geocodes, on='PostalCode')
df1.head()  # display the geocoded Toronto borough/neighborhood dataset

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
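
By default `pd.merge` performs an inner join, so a postal code without coordinates would be dropped without any warning. The sketch below shows one way to check for that, using a left join with `indicator=True`. The two small frames are illustrative stand-ins: the matching rows reuse values shown above, while 'M0X' and 'Example Borough' are made up for the demonstration.

```python
import pandas as pd

# Illustrative frames; 'M0X' is a made-up code with no coordinates.
boroughs = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every borough row; indicator=True adds a '_merge' column
checked = pd.merge(boroughs, coords, on='PostalCode', how='left', indicator=True)

# Rows marked 'left_only' found no match in the coordinates frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If the list comes back empty on the real data, the inner join used in the notebook loses nothing.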


<h2>3. Segmentation and Clustering of Neighborhoods in Toronto, Canada</h2>

<h3>3.1 Retrieving all the Neighborhoods that contain 'Toronto' in their Borough name</h3>
<p>Slice the rows whose Borough name contains 'Toronto'.</p>

In [10]:
df_toronto_in_borough = df1[df1['Borough'].str.contains('Toronto', regex=False)]
# reset the index for this subset; the old index is kept as an 'index' column
df_toronto_in_borough = df_toronto_in_borough.reset_index()
print("Total Neighborhoods containing 'Toronto' in the Borough name = ", df_toronto_in_borough.shape[0])
df_toronto_in_borough

Total Neighborhoods containing 'Toronto' in the Borough name =  39


Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,19,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
8,31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
9,35,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106


<h3>3.2 Plot the sliced Neighborhood data using Folium</h3>
<p><i>*** PS - If the map is not visible in the notebook shared on GitHub, please refer to the Readme file or the Map_toronto file.</i></p>

<b>Center the map on Downtown Toronto.</b>

In [11]:
# position the map on Downtown Toronto, the first record in the dataframe
latitude = df_toronto_in_borough.loc[0,'Latitude']
longitude = df_toronto_in_borough.loc[0,'Longitude']
print('The geographical coordinates of Toronto city in Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto city in Canada are 43.6542599, -79.3606359.


In [12]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood, postalCode in zip(df_toronto_in_borough['Latitude'], df_toronto_in_borough['Longitude'], df_toronto_in_borough['Borough'], df_toronto_in_borough['Neighborhood'],df_toronto_in_borough['PostalCode']):
    label = '{}, {},{}'.format(neighborhood, borough, postalCode)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

<h4>3.3 Analyzing the Neighborhoods using the Foursquare API</h4>

In [2]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20210624' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


<h4>Let's explore the venues around the first neighborhood in Toronto.</h4>

In [39]:
neighborhood_latitude = df_toronto_in_borough.loc[0,'Latitude']
neighborhood_longitude = df_toronto_in_borough.loc[0,'Longitude']
neighborhood_name = df_toronto_in_borough.loc[0,'Neighborhood']
print('Latitude and Longitude for {} are {},{}'.format(neighborhood_name,neighborhood_latitude,neighborhood_longitude))

Latitude and Longitude for Regent Park, Harbourfront are 43.6542599,-79.3606359


<h4>Now, let's get the top 100 venues in this neighborhood within a radius of 500 meters</h4>

In [40]:
radius = 500  # search radius in meters

# build the URL for the Foursquare API call
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#display URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&v=20210624&ll=43.6542599,-79.3606359&radius=500&limit=100'
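
Formatting the query string by hand is easy to get wrong once there are seven placeholders. As a sketch under the assumption that the endpoint and parameter names are exactly those used above, the same URL can be assembled from a dictionary with the standard library. The function name `build_explore_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from a dict of query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the 'll' value unescaped
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

example_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20210624',
                                43.6542599, -79.3606359, 500, 100)
print(example_url)
```

Because the parameters live in one dictionary, each value sits next to its name, and the placeholder credentials never need to be replaced in a long format string.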

<h4>Send the GET request and examine the results of the Foursquare API call</h4>

In [42]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60d4d4bae1ca847e912edc9d'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 43,
  'suggestedBounds': {'ne': {'lat': 43.6587599045, 'lng': -79.3544279001486},
   'sw': {'lat': 43.6497598955, 'lng': -79.36684389985142}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '53b8466a498e83df908c3f21',
       'name': 'Tandem Coffee',
       'location': {'address': '368 King St E',
        'crossStreet': 'at Trinity St',
        'lat': 43.65355870959944,
        'lng': -79.36180945913513,
        'labeledLatLngs': [{'label': 'display',
 

<h4>Define a function called 'get_category_type' to examine the categories of results</h4>

In [43]:
def get_category_type(row):
    # the category list sits under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<h4>Cleanse and process the JSON result into a pandas dataframe</h4>

In [44]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Tandem Coffee,Coffee Shop,43.653559,-79.361809
1,Roselle Desserts,Bakery,43.653447,-79.362017
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Impact Kitchen,Restaurant,43.656369,-79.35698


<h4>Total number of venues around the neighborhood returned by the Foursquare API</h4>

In [48]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

43 venues were returned by Foursquare.


<h3>3.3 Explore all the Neighborhoods in Toronto</h3>

<h4>Define a function 'getNearbyVenues' to retrieve venue data from the Foursquare API.</h4>

In [49]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
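
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. The following is a minimal sketch of a more tolerant extraction step, assuming the item layout shown in the JSON output earlier. The helper name `extract_venue_rows` and the second sample venue are made up for illustration.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten explore 'items' into row tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None rather than raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Sample items shaped like the response above; the second venue is hypothetical
sample_items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]
rows = extract_venue_rows('Regent Park, Harbourfront', 43.65426, -79.360636, sample_items)
print(rows)
```

Rows with a `None` category can then be dropped or kept, depending on whether uncategorized venues should count toward a neighborhood's total.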

<h4>Process the JSON results from Foursquare into a pandas dataframe</h4>

In [57]:
# all the venues in the Toronto boroughs
toronto_venues = getNearbyVenues(names=df_toronto_in_borough['Neighborhood'],
                                   latitudes=df_toronto_in_borough['Latitude'],
                                   longitudes=df_toronto_in_borough['Longitude']
                                  )

Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canadian Place, U

In [58]:
print(toronto_venues.shape)
print('{} total venues in Toronto'.format(toronto_venues.shape[0]))
toronto_venues.head()

(1494, 7)
1494 total venues in Toronto


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


<h4>Now, let's find the number of venues in each neighborhood of Toronto</h4>

In [59]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,47,47,47,47,47,47
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",15,15,15,15,15,15
Central Bay Street,62,62,62,62,62,62
Christie,15,15,15,15,15,15
Church and Wellesley,68,68,68,68,68,68
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,27,27,27,27,27,27
Davisville North,9,9,9,9,9,9
"Dufferin, Dovercourt Village",14,14,14,14,14,14


<h4>Number of unique venue categories across all returned venues</h4>

In [60]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 218 unique categories.


<h3>3.4 Analyzing each Neighborhood in Toronto</h3>

In [64]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's examine the shape of the new dataframe.

In [65]:
toronto_onehot.shape

(1494, 218)

Now, let's group rows by neighborhood and take the mean of each one-hot category column, which gives the share of a neighborhood's venues that fall in each category.

In [68]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.042553,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.066667,0.066667,0.133333,0.2,0.066667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.016129,0.0
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.014706,0.014706,0.014706,0.0,0.0,0.0,0.0,0.0,0.014706,...,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014706
6,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
7,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Then examine the new size.

In [70]:
toronto_grouped.shape

(39, 218)
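
To see why the grouped means can be read as frequencies, here is a small pure-Python sketch of the same arithmetic. The neighborhoods 'A' and 'B' and their venues are made-up stand-ins for rows of `toronto_venues`.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, category) pairs standing in for toronto_venues rows
venues = [
    ('A', 'Coffee Shop'), ('A', 'Coffee Shop'), ('A', 'Bakery'), ('A', 'Park'),
    ('B', 'Park'), ('B', 'Trail'),
]

counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Dividing each count by the neighborhood's total gives the same number as
# the mean of that category's 0/1 one-hot column within the neighborhood.
frequencies = {
    hood: {cat: n / sum(cat_counts.values()) for cat, n in cat_counts.items()}
    for hood, cat_counts in counts.items()
}
print(frequencies)
```

Neighborhood 'A' has four venues, two of them coffee shops, so its coffee shop frequency is 2 / 4 = 0.5. That is the kind of value stored in each cell of `toronto_grouped`.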

<h4>So, what are the top five most common venues for each Neighborhood in Toronto?</h4>

In [73]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0  Sandwich Place  0.06
1    Cocktail Bar  0.06
2          Bakery  0.06
3     Coffee Shop  0.06
4  Farmers Market  0.04


----Brockton, Parkdale Village, Exhibition Place----
            venue  freq
0  Breakfast Spot  0.09
1  Sandwich Place  0.09
2     Coffee Shop  0.09
3            Café  0.09
4          Bakery  0.05


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
             venue  freq
0  Airport Service  0.20
1   Airport Lounge  0.13
2         Boutique  0.07
3            Plane  0.07
4          Airport  0.07


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.16
1      Sandwich Place  0.10
2    Sushi Restaurant  0.06
3  Italian Restaurant  0.05
4                Café  0.05


----Christie----
           venue  freq
0  Grocery Store  0.27
1           Café  0.20
2           Park  0.13
3     Restaurant  0.07
4     Baby Store  0.07


----Church and

Let's put this data into a pandas dataframe. First, define a function that sorts a row's categories in descending order of frequency.

In [74]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [77]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Sandwich Place,Cocktail Bar,Bakery,Coffee Shop,Farmers Market,Beer Bar,Seafood Restaurant,Vegetarian / Vegan Restaurant,Steakhouse,Bistro
1,"Brockton, Parkdale Village, Exhibition Place",Breakfast Spot,Sandwich Place,Coffee Shop,Café,Bakery,Italian Restaurant,Furniture / Home Store,Climbing Gym,Bar,Restaurant
2,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Boutique,Plane,Airport,Airport Food Court,Airport Terminal,Harbor / Marina,Bar,Rental Car Location
3,Central Bay Street,Coffee Shop,Sandwich Place,Sushi Restaurant,Italian Restaurant,Café,Japanese Restaurant,Pizza Place,Bank,Burger Joint,Salad Place
4,Christie,Grocery Store,Café,Park,Restaurant,Baby Store,Coffee Shop,Italian Restaurant,Bank,Nightclub,Yoga Studio
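
The column-naming loop above works for the ten venues used here, but it would label a 21st column "21th" if `num_top_venues` were ever raised that far. A small helper, sketched below under the hypothetical name `ordinal`, handles the general case.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column names as the loop above, built in one expression
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```

For the ten columns used in this notebook the result matches the original loop, so the helper is a drop-in alternative rather than a required change.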


<h2>4. Clustering Neighborhoods in Toronto using K-Means</h2>

Run k-means to cluster the neighborhoods into 5 clusters.

In [79]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
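
The first ten labels are all 1, which suggests that one cluster may hold most of the neighborhoods. A quick way to check is to count the labels per cluster. The sketch below uses a made-up list in place of `kmeans.labels_`. With the real array, something like `Counter(kmeans.labels_.tolist())` should give the actual sizes.

```python
from collections import Counter

# Made-up labels standing in for kmeans.labels_ (one entry per neighborhood)
labels = [1, 1, 1, 0, 1, 2, 1, 0, 1, 3, 1, 4]

cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print('Cluster {}: {} neighborhoods'.format(cluster, size))
```

A very uneven split is worth knowing about before interpreting the clusters, since the small ones may consist of only a few unusual neighborhoods.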

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [80]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto_in_borough

# merge toronto_merged with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Café,Wine Shop,Chocolate Shop,Performing Arts Venue,French Restaurant,Mexican Restaurant
1,9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1,Coffee Shop,Sandwich Place,Clothing Store,Café,Bank,Cosmetics Shop,Hotel,Japanese Restaurant,Pizza Place,Ramen Restaurant
2,15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Italian Restaurant,Cocktail Bar,Café,Restaurant,Clothing Store,Cosmetics Shop,Beer Bar,Gym,Gastropub
3,19,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Health Food Store,Grocery Store,Pub,Yoga Studio,Moroccan Restaurant,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant
4,20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Sandwich Place,Cocktail Bar,Bakery,Coffee Shop,Farmers Market,Beer Bar,Seafood Restaurant,Vegetarian / Vegan Restaurant,Steakhouse,Bistro


Finally, let's visualize the resulting clusters

In [81]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
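
The marker loop looks colors up with `rainbow[cluster-1]`. Cluster labels start at 0, so label 0 becomes index -1, which Python reads as the last element of the list. The sketch below compares that offset lookup with direct indexing, using a hypothetical five-color palette in place of `rainbow`.

```python
# Hypothetical five-color palette standing in for the 'rainbow' list
palette = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']

# Offset indexing, as in the loop above: label 0 wraps around to the last color
offset_colors = {label: palette[label - 1] for label in range(5)}

# Direct indexing: label k gets the k-th color
direct_colors = {label: palette[label] for label in range(5)}

print(offset_colors)
print(direct_colors)
```

Both lookups give every cluster its own color, so the map is correct either way. Direct indexing is simply easier to match against the cluster numbers shown in the popups.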

<h2>5. Examine Toronto Neighborhood Clusters</h2>

<h3>5.1 Examine Cluster 1: Postal Codes M4N, M5P, M4W (mostly parks)</h3>

In [84]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,M4N,-79.38879,0,Park,Bus Line,Dim Sum Restaurant,Swim School,Yoga Studio,Monument / Landmark,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station
21,M5P,-79.411307,0,Park,Trail,Jewelry Store,Sushi Restaurant,Yoga Studio,Molecular Gastronomy Restaurant,Market,Martial Arts School,Mediterranean Restaurant,Men's Store
33,M4W,-79.377529,0,Park,Playground,Trail,Yoga Studio,Monument / Landmark,Market,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station
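
The column selection `toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]` picks columns by position, which is why the output shows `PostalCode` and `Longitude` but not `Latitude`. The sketch below replays that positional logic on a plain list of column names. The column order is inferred from the rows printed in this notebook; the name of the first column is a placeholder and `Borough` is an assumed name.

```python
# Column order inferred from the printed rows; the first entry is a placeholder
# and 'Borough' is an assumed column name.
ranks = ['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th']
columns = ['(extra index column)', 'PostalCode', 'Borough', 'Neighborhood',
           'Latitude', 'Longitude', 'Cluster Labels']
columns += ['{} Most Common Venue'.format(rank) for rank in ranks]

# Same positions as in the notebook cell: column 1, then column 5 onward
positions = [1] + list(range(5, len(columns)))
selected = [columns[i] for i in positions]
skipped = [name for name in columns if name not in selected]

print(selected[:3])  # the first three columns that are kept
print(skipped)       # the columns left out of the cluster tables
```

Under these assumptions, starting the range at 4 instead of 5 would keep `Latitude` in the output as well.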


<h3>5.2 Cluster 2 - Urban Toronto</h3>

In [85]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,-79.360636,1,Coffee Shop,Park,Bakery,Pub,Café,Wine Shop,Chocolate Shop,Performing Arts Venue,French Restaurant,Mexican Restaurant
1,M5B,-79.378937,1,Coffee Shop,Sandwich Place,Clothing Store,Café,Bank,Cosmetics Shop,Hotel,Japanese Restaurant,Pizza Place,Ramen Restaurant
2,M5C,-79.375418,1,Coffee Shop,Italian Restaurant,Cocktail Bar,Café,Restaurant,Clothing Store,Cosmetics Shop,Beer Bar,Gym,Gastropub
3,M4E,-79.293031,1,Health Food Store,Grocery Store,Pub,Yoga Studio,Moroccan Restaurant,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant
4,M5E,-79.373306,1,Sandwich Place,Cocktail Bar,Bakery,Coffee Shop,Farmers Market,Beer Bar,Seafood Restaurant,Vegetarian / Vegan Restaurant,Steakhouse,Bistro
5,M5G,-79.387383,1,Coffee Shop,Sandwich Place,Sushi Restaurant,Italian Restaurant,Café,Japanese Restaurant,Pizza Place,Bank,Burger Joint,Salad Place
6,M6G,-79.422564,1,Grocery Store,Café,Park,Restaurant,Baby Store,Coffee Shop,Italian Restaurant,Bank,Nightclub,Yoga Studio
7,M5H,-79.384568,1,Coffee Shop,Café,Sandwich Place,Gym,Clothing Store,Restaurant,Sushi Restaurant,Steakhouse,Cosmetics Shop,Concert Hall
8,M6H,-79.442259,1,Pet Store,Liquor Store,Bakery,Brewery,Supermarket,Music Venue,Café,Bar,Middle Eastern Restaurant,Pool
10,M5J,-79.381752,1,Coffee Shop,Café,Hotel,Aquarium,Pizza Place,Scenic Lookout,Sports Bar,Sporting Goods Shop,Brewery,Gym


<h3>5.3 Cluster 3 - Postal Code M4J with Film Studios</h3>

In [86]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,M4J,-79.338106,2,Park,Film Studio,Convenience Store,Yoga Studio,Monument / Landmark,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant


<h3>5.4 Cluster 4 - Postal Code M5N with Pool and Garden</h3>

In [87]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,M5N,-79.416936,3,Pool,Garden,Yoga Studio,Moroccan Restaurant,Market,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant


<h3>5.5 Cluster 5 - Postal Code M4T for Sports (Tennis Courts)</h3>

In [88]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,M4T,-79.38316,4,Tennis Court,Park,Moroccan Restaurant,Market,Martial Arts School,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant
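
To compare the clusters at a glance, the rows shown in sections 5.1 onward can be tallied by cluster label. This is a small sketch using only the postal codes and labels printed above; it counts the rows that are shown, which may be fewer than the full contents of `toronto_merged`. On the real dataframe, `toronto_merged.groupby('Cluster Labels').size()` would give the corresponding counts.

```python
from collections import Counter

# (PostalCode, Cluster Labels) pairs copied from the output rows in section 5
rows = [
    ('M4N', 0), ('M5P', 0), ('M4W', 0),
    ('M5A', 1), ('M5B', 1), ('M5C', 1), ('M4E', 1), ('M5E', 1),
    ('M5G', 1), ('M6G', 1), ('M5H', 1), ('M6H', 1), ('M5J', 1),
    ('M4J', 2),
    ('M5N', 3),
    ('M4T', 4),
]

# count how many of the listed postal codes fall under each label
sizes = Counter(label for _, label in rows)

for label in sorted(sizes):
    # the section headings number clusters from 1, the labels start at 0
    print('Cluster', label + 1, '- rows shown:', sizes[label])
```

Among the rows shown, one cluster holds most of the postal codes and three clusters contain a single postal code each.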


Thank you for taking the time to review this work!