# Peer-graded Assignment: Segmenting and Clustering Neighbourhoods in Toronto #

##### In this assignment, you will be required to explore, segment, and cluster the neighbourhoods in the city of Toronto. However, unlike New York, the neighbourhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

##### For the Toronto neighbourhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighbourhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

##### Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighbourhoods in the city of Toronto.

##### Your submission will be a link to your Jupyter Notebook on your Github repository. 

##### Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

In [1]:
#Section for Importing Libraries
import pandas as pd

In [2]:
#reading in the data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)

In [3]:
#checking to see what df looks like
df

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

<i> It seems that three tables were read into df, which is a list of DataFrames rather than a single DataFrame.
Let's check each one of them. </i>

In [4]:
df[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [5]:
df[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


In [6]:
df[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


<i> It seems that df[0] has the data that we need, so we will extract it and work on this dataset. </i>
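
Selecting the table by position works for the page as scraped here, but the order of tables on a web page can change. Below is a minimal sketch of a more defensive selection, assuming the postal-code table is the only one that carries the three expected column names. `pick_table` and the sample frames are hypothetical stand-ins for the list returned by `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('no table with columns {}'.format(required_columns))

# stand-ins for the list returned by pd.read_html (hypothetical data)
tables = [
    pd.DataFrame({0: ['NL', 'A'], 1: ['NS', 'B']}),
    pd.DataFrame({'Postal Code': ['M3A'], 'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

postal_table = pick_table(tables, ['Postal Code', 'Borough', 'Neighbourhood'])
print(postal_table)
```

`pd.read_html` also accepts a `match` argument that restricts the result to tables containing a given text, which may be enough on its own.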

In [7]:
postal_code_M = df[0]
postal_code_M

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


##### To create the required dataframe:

##### The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood ✓

In [8]:
postal_code_M.rename(columns={'Postal Code': 'PostalCode'}, inplace = True)

##### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. ✓ 

<i> Check to see which rows have borough as "not assigned" </i>

In [9]:
postal_code_M[postal_code_M['Borough'] == 'Not assigned']

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
7,M8A,Not assigned,Not assigned
10,M2B,Not assigned,Not assigned
15,M7B,Not assigned,Not assigned
...,...,...,...
174,M4Z,Not assigned,Not assigned
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned


<i> Therefore the above 77 rows will have to be removed from the dataset.<br>
Let's create a new dataframe to contain the new set. </i>

In [10]:
#Remove "Not assigned" from the DataFrame, reset index and drop column 'index'
M_with_borough = postal_code_M[postal_code_M['Borough'] != 'Not assigned'].reset_index(drop=True)
M_with_borough

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


##### More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighbourhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighbourhoods separated with a comma as shown in row 11 in the above table. ✓ 
<i>Check to see if there are any duplicates in PostalCode </i>

In [11]:
M_with_borough[M_with_borough.duplicated(['PostalCode'])]

Unnamed: 0,PostalCode,Borough,Neighbourhood


<i>Apparently, no two rows have the same Postal Code. <br>
Let's check M5A to make sure the requirement is indeed met.</i>

In [12]:
M_with_borough[M_with_borough['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<i> It seems that neighbourhoods that share the same Postal Code are already merged into one row, separated by commas. Moving on... </i>

##### If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough.✓ 

In [13]:
postal_code_M[(postal_code_M['Neighbourhood'] == 'Not assigned') & (postal_code_M['Borough'] != 'Not assigned')]

Unnamed: 0,PostalCode,Borough,Neighbourhood


<i> Hence, there are no cells that have a borough but a Not assigned neighbourhood. </i>
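
The scraped table already satisfied the last two requirements, so no code was needed for them. For reference, here is a minimal sketch of how all three cleaning rules could be applied if the page listed one row per neighbourhood. The sample rows are hypothetical and are not taken from the scraped data.

```python
import pandas as pd

# hypothetical sample with one row per neighbourhood
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. a "Not assigned" neighbourhood takes the borough name
mask = clean['Neighbourhood'] == 'Not assigned'
clean.loc[mask, 'Neighbourhood'] = clean.loc[mask, 'Borough']

# 3. merge rows that share a postal code, joining neighbourhoods with a comma
clean = (clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())
print(clean)
```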

##### Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.✓ 

In [14]:
M_with_borough

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


##### In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe. ✓ 

In [15]:
M_with_borough.shape

(103, 3)

### Assignment Part 2
<i> The package that was suggested in the assignment questionnaire did not yield any results, so I will use the Google API method to retrieve the data needed. </i>

In [16]:
import re
# read the secrets file once and close it afterwards
with open("himitsu.txt", "r") as himitsu:
    API = re.search(r'(?<=Google API:)\S+', himitsu.read())[0]
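
The same file is read again later for the Foursquare credentials. As a sketch only, all keys could be parsed in one pass, assuming every line of the file has the form `label:value` with no space after the colon (which is what the lookbehind pattern above requires). `parse_secrets` and the sample text are hypothetical; the real contents of himitsu.txt are not shown in this notebook.

```python
import re

def parse_secrets(text):
    """Map each 'label:value' line to a dictionary entry."""
    return dict(re.findall(r'^([^:\n]+):(\S+)', text, flags=re.MULTILINE))

# hypothetical file contents, for illustration only
sample = ('Google API:abc123\n'
          'Foursquare CLIENT_ID:id42\n'
          'Foursquare CLIENT_SECRET:s3cr3t\n')

secrets = parse_secrets(sample)
print(secrets['Google API'])
```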

In [19]:
from geopy.geocoders import GoogleV3
import time

# work on a copy so that M_with_borough is left untouched
Tor_borough_xy = M_with_borough.copy()

geolocator = GoogleV3(api_key=API)
# start with empty numeric columns
Tor_borough_xy['Longitude'] = float('nan')
Tor_borough_xy['Latitude'] = float('nan')

for x in range(len(Tor_borough_xy)):
    time.sleep(0.25) # short delay between requests, useful for large DataFrames
    geocode_result = geolocator.geocode('{}, Toronto, Ontario'.format(Tor_borough_xy.loc[x, 'PostalCode']))
    # .loc writes to the DataFrame itself and avoids chained assignment
    Tor_borough_xy.loc[x, 'Latitude'] = geocode_result.latitude
    Tor_borough_xy.loc[x, 'Longitude'] = geocode_result.longitude
    print(str(x+1) + ' Out Of '+ str(len(Tor_borough_xy)) + ' --> ' + geocode_result.address)

Tor_borough_xy

1 Out Of 103 --> North York, ON M3A, Canada
2 Out Of 103 --> North York, ON M4A, Canada
3 Out Of 103 --> Toronto, ON M5A, Canada
4 Out Of 103 --> North York, ON M6A, Canada
5 Out Of 103 --> North York, ON M7A, Canada
6 Out Of 103 --> Etobicoke, ON M9A, Canada
7 Out Of 103 --> Scarborough, ON M1B, Canada
8 Out Of 103 --> North York, ON M3B, Canada
9 Out Of 103 --> Toronto, ON M4B, Canada
10 Out Of 103 --> Toronto, ON M5B, Canada
11 Out Of 103 --> Toronto, ON M6B, Canada
12 Out Of 103 --> Etobicoke, ON M9B, Canada
13 Out Of 103 --> Scarborough, ON M1C, Canada
14 Out Of 103 --> Toronto, ON M3C, Canada
15 Out Of 103 --> Toronto, ON M4C, Canada
16 Out Of 103 --> Toronto, ON M5C, Canada
17 Out Of 103 --> Toronto, ON M6C, Canada
18 Out Of 103 --> Etobicoke, ON M9C, Canada
19 Out Of 103 --> Scarborough, ON M1E, Canada
20 Out Of 103 --> Toronto, ON M4E, Canada
21 Out Of 103 --> Toronto, ON M5E, Canada
22 Out Of 103 --> Toronto, ON M6E, Canada
23 Out Of 103 --> Scarborough, ON M1G, Canada
24 Out

Unnamed: 0,PostalCode,Borough,Neighbourhood,Longitude,Latitude
0,M3A,North York,Parkwoods,-79.3297,43.7533
1,M4A,North York,Victoria Village,-79.3156,43.7259
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",-79.3606,43.6543
3,M6A,North York,"Lawrence Manor, Lawrence Heights",-79.4648,43.7185
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",-79.3895,43.6623
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",-79.5069,43.6537
99,M4Y,Downtown Toronto,Church and Wellesley,-79.3832,43.6659
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",-79.3216,43.6627
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",-79.4985,43.6363
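
The loop above assumes every postal code can be geocoded. geopy geocoders return `None` when nothing matches, so a lookup that fails would raise an error partway through. Below is a minimal sketch of a more tolerant helper, written so it can be run offline: `add_coordinates`, `FakeLocation`, and the `known` lookup table are hypothetical, and in the notebook the `geocode` argument would be `geolocator.geocode`.

```python
from collections import namedtuple
import pandas as pd

def add_coordinates(frame, geocode, template='{}, Toronto, Ontario'):
    """Return a copy of frame with Latitude/Longitude columns filled by geocode."""
    latitudes, longitudes = [], []
    for code in frame['PostalCode']:
        result = geocode(template.format(code))
        # guard against a missing result before reading its attributes
        latitudes.append(result.latitude if result is not None else None)
        longitudes.append(result.longitude if result is not None else None)
    out = frame.copy()
    out['Latitude'] = latitudes
    out['Longitude'] = longitudes
    return out

# hypothetical stand-in for geolocator.geocode, so the sketch needs no API key
FakeLocation = namedtuple('FakeLocation', ['address', 'latitude', 'longitude'])
known = {'M3A, Toronto, Ontario':
         FakeLocation('North York, ON M3A, Canada', 43.7533, -79.3297)}

sample = pd.DataFrame({'PostalCode': ['M3A', 'M0X']})
located = add_coordinates(sample, known.get)
print(located)
```

Rows that could not be geocoded end up with missing coordinates and can be inspected or retried afterwards.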


### Assignment Part 3
##### Explore and cluster the neighbourhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

<i> Let's look up boroughs that contain the word Toronto </i>

In [20]:
Toronto_boroughs = Tor_borough_xy[Tor_borough_xy['Borough'].str.contains('Toronto')]
Toronto_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Longitude,Latitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",-79.3606,43.6543
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",-79.3895,43.6623
9,M5B,Downtown Toronto,"Garden District, Ryerson",-79.3789,43.6572
15,M5C,Downtown Toronto,St. James Town,-79.3754,43.6515
19,M4E,East Toronto,The Beaches,-79.293,43.6764


<i>It seems there are multiple boroughs that have the word Toronto.
Let's find all unique values</i>
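
A minimal sketch of that check, using a hypothetical sample with the same `Borough` column; on the real data the same two calls would be applied to `Toronto_boroughs`.

```python
import pandas as pd

# hypothetical sample, not the scraped data
sample = pd.DataFrame({'Borough': ['North York', 'Downtown Toronto', 'East Toronto',
                                   'Downtown Toronto', 'Etobicoke', 'West Toronto']})

toronto_only = sample[sample['Borough'].str.contains('Toronto')]
unique_boroughs = toronto_only['Borough'].unique()   # distinct names, in order of appearance
counts = toronto_only['Borough'].value_counts()      # number of rows per borough
print(unique_boroughs)
print(counts)
```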

#### Toronto Borough

In [21]:
Toronto_data =  Toronto_boroughs.reset_index(drop=True)
Toronto_data

Unnamed: 0,PostalCode,Borough,Neighbourhood,Longitude,Latitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",-79.3606,43.6543
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",-79.3895,43.6623
2,M5B,Downtown Toronto,"Garden District, Ryerson",-79.3789,43.6572
3,M5C,Downtown Toronto,St. James Town,-79.3754,43.6515
4,M4E,East Toronto,The Beaches,-79.293,43.6764
5,M5E,Downtown Toronto,Berczy Park,-79.3733,43.6448
6,M5G,Downtown Toronto,Central Bay Street,-79.3874,43.658
7,M6G,Downtown Toronto,Christie,-79.4226,43.6695
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",-79.3846,43.6506
9,M6H,West Toronto,"Dufferin, Dovercourt Village",-79.4423,43.669


<i>Let's get all the tools ready</i>

In [22]:
import numpy as np # library to handle data in a vectorized manner
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
print('Libraries imported.')

Libraries imported.


##### Use the geopy library to get the latitude and longitude values of Toronto

In [23]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Toronto, ON are 43.6534817, -79.3839347.


##### Let's visualize the Toronto boroughs and the neighbourhoods in them.

In [24]:
# create map of Toronto using latitude and longitude values
map_DT_Tor = folium.Map(location=[latitude, longitude], zoom_start=11.5)

# add markers to map
for lat, lng, borough, neighbourhood in zip(Toronto_data['Latitude'], Toronto_data['Longitude'], Toronto_data['Borough'], Toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_DT_Tor)  
    
map_DT_Tor

##### Read in Foursquare Credentials and Version


In [25]:
# read the secrets file once and close it afterwards
with open("himitsu.txt", "r") as himitsu:
    secrets = himitsu.read()
CLIENT_ID = re.search(r'(?<=Foursquare CLIENT_ID:)\S+', secrets)[0]
CLIENT_SECRET = re.search(r'(?<=Foursquare CLIENT_SECRET:)\S+', secrets)[0]
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

#### Let's explore Toronto boroughs in our dataframe.

In [26]:
#define function to grab nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
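
The function above indexes `categories[0]` for every venue, which would raise an `IndexError` for a venue that has no category. Below is a minimal sketch of a more defensive flattening step. The payload shape is inferred only from the keys the function accesses, and `venue_rows` and the sample `items` are hypothetical.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into (neighbourhood, ..., category) tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a placeholder instead of raising IndexError
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# hypothetical payload shaped like response['groups'][0]['items']
items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.6536, 'lng': -79.3618},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.6540, 'lng': -79.3600},
               'categories': []}},
]

rows = venue_rows('Regent Park, Harbourfront', 43.65426, -79.360636, items)
print(rows)
```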

#### Now write the code to run the above function on each neighbourhood and create a new dataframe called _Tor_venues_.

In [27]:
Tor_venues = getNearbyVenues(names=Toronto_data['Neighbourhood'],
                                   latitudes=Toronto_data['Latitude'],
                                   longitudes=Toronto_data['Longitude']
                                  )

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

#### Let's check the size of the resulting dataframe

In [28]:
print(Tor_venues.shape)
Tor_venues.head()

(1624, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


<i>Let's check how many venues were returned for each neighbourhood</i>

In [29]:
Tor_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,55,55,55,55,55,55
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,68,68,68,68,68,68
Christie,16,16,16,16,16,16
Church and Wellesley,75,75,75,75,75,75
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,33,33,33,33,33,33
Davisville North,9,9,9,9,9,9


#### Let's find out how many unique categories can be curated from all the returned venues

In [30]:
print('There are {} unique categories.'.format(len(Tor_venues['Venue Category'].unique())))

There are 237 unique categories.


### Analyze Each Neighbourhood

In [31]:
# one hot encoding
Tor_onehot = pd.get_dummies(Tor_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
Tor_onehot['Neighbourhood'] = Tor_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = ['Neighbourhood'] + list(Tor_onehot.columns[:-1])
Tor_onehot = Tor_onehot[fixed_columns]

Tor_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<i>And let's examine the new dataframe size.</i>

In [32]:
Tor_onehot = Tor_onehot.loc[:,~Tor_onehot.columns.duplicated()] #remove duplicated columns
Tor_onehot.shape

(1624, 238)

#### Next, let's group rows by neighbourhood, taking the mean of the frequency of occurrence of each category

In [33]:
Tor_grouped = Tor_onehot.groupby('Neighbourhood').mean().reset_index()
Tor_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.125,0.0625,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.014706,0.0,0.014706
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,...,0.013333,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,0.0,0.026667
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
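
To see why the mean of the one-hot columns gives a frequency, here is a small worked sketch on hypothetical data: neighbourhood A has three venues (two cafés and one park), so its row should come out as 2/3 and 1/3.

```python
import pandas as pd

# hypothetical venues table with the same two columns used above
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# one indicator column per category (1 if the venue belongs to it, else 0)
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# the mean of a 0/1 column is the share of venues in that category
grouped = onehot.groupby('Neighbourhood').mean()
print(grouped)
```

Each row of the result sums to 1, which is what makes neighbourhoods with very different venue counts comparable.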


#### Let's confirm the new size

In [34]:
Tor_grouped.shape

(39, 238)

#### Let's print each neighbourhood along with the top 10 most common venues

#### Let's put that into a _pandas_ dataframe

<i>First, let's write a function to sort the venues in descending order.</i>

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
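
A quick usage sketch on a single hypothetical row laid out like a row of `Tor_grouped` (the neighbourhood name first, then one frequency per category). The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the Neighbourhood label
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# hypothetical row, not taken from the real data
row = pd.Series({'Neighbourhood': 'Sample', 'Bar': 0.2, 'Coffee Shop': 0.5, 'Park': 0.3})

top_two = return_most_common_venues(row, 2)
print(list(top_two))
```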

<i>Now let's create the new dataframe and display the top 10 venues for each neighbourhood.</i>

In [37]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = Tor_grouped['Neighbourhood']

for ind in np.arange(Tor_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Tor_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Beer Bar,Farmers Market,Restaurant,Cheese Shop,Basketball Stadium,Sporting Goods Shop
1,"Brockton, Parkdale Village, Exhibition Place",Café,Nightclub,Coffee Shop,Breakfast Spot,Grocery Store,Intersection,Bar,Bakery,Italian Restaurant,Climbing Gym
2,"Business reply mail Processing Centre, South C...",Gym / Fitness Center,Auto Workshop,Comic Shop,Park,Pizza Place,Recording Studio,Restaurant,Butcher,Burrito Place,Brewery
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Boutique,Plane,Airport,Airport Food Court,Airport Gate,Airport Terminal,Bar,Harbor / Marina
4,Central Bay Street,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Salad Place,Department Store,Thai Restaurant,Burger Joint,Bubble Tea Shop,Japanese Restaurant
5,Christie,Grocery Store,Café,Park,Coffee Shop,Restaurant,Athletics & Sports,Italian Restaurant,Candy Store,Baby Store,Nightclub
6,Church and Wellesley,Coffee Shop,Japanese Restaurant,Gay Bar,Sushi Restaurant,Restaurant,Pub,Men's Store,Mediterranean Restaurant,Hotel,Yoga Studio
7,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Hotel,Café,Gym,American Restaurant,Seafood Restaurant,Deli / Bodega,Japanese Restaurant,Bakery
8,Davisville,Sandwich Place,Dessert Shop,Pizza Place,Café,Italian Restaurant,Gym,Coffee Shop,Sushi Restaurant,Pharmacy,Indian Restaurant
9,Davisville North,Gym / Fitness Center,Sandwich Place,Park,Department Store,Breakfast Spot,Dance Studio,Hotel,Dog Run,Food & Drink Shop,Distribution Center


## Cluster Neighbourhoods

<i>Run _k_-means to cluster the neighbourhoods into 5 clusters.</i>

In [38]:
# set number of clusters
kclusters = 5

Tor_grouped_clustering = Tor_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Tor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
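
The label numbers themselves are arbitrary; only which rows share a label matters. Below is a minimal sketch on hypothetical frequency profiles, assuming two clearly separated groups, to show what `labels_` contains. `n_init` is set explicitly here because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical profiles: rows 0-1 are coffee-heavy, rows 2-3 are park-heavy
profiles = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.9],
                     [0.2, 0.8]])

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(profiles)
labels = model.labels_  # one integer label per row, in row order
print(labels)
```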

<i>Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighbourhood.</i>

In [39]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = Toronto_data

# merge Toronto_data with neighbourhoods_venues_sorted to add the cluster label and top venues for each neighbourhood
Toronto_merged = Toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",-79.3606,43.6543,2,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Yoga Studio,Event Space,Restaurant
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",-79.3895,43.6623,2,Coffee Shop,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Restaurant,Café,Portuguese Restaurant,Chinese Restaurant,Park
2,M5B,Downtown Toronto,"Garden District, Ryerson",-79.3789,43.6572,2,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Bookstore,Pizza Place,Middle Eastern Restaurant
3,M5C,Downtown Toronto,St. James Town,-79.3754,43.6515,2,Coffee Shop,Café,Cocktail Bar,Restaurant,Gastropub,American Restaurant,Beer Bar,Gym,Moroccan Restaurant,Department Store
4,M4E,East Toronto,The Beaches,-79.293,43.6764,3,Pub,Health Food Store,Trail,Neighborhood,Yoga Studio,Distribution Center,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant
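
One thing to watch with this join: a neighbourhood for which no venues were returned would be missing from `neighbourhoods_venues_sorted`, so its `Cluster Labels` would become NaN and the whole column would turn into floats. That did not appear to happen in the rows shown here, but a minimal sketch of the check, on hypothetical frames, looks like this:

```python
import pandas as pd

# hypothetical frames: 'C' has no venues, so it is absent from the sorted table
data = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C']})
sorted_venues = pd.DataFrame({'Neighbourhood': ['A', 'B'],
                              'Cluster Labels': [0, 1]})

joined = data.join(sorted_venues.set_index('Neighbourhood'), on='Neighbourhood')
print(joined['Cluster Labels'].isna().sum(), 'neighbourhood(s) without a cluster')

# drop unmatched rows and restore integer labels before using them as indices
cleaned = (joined.dropna(subset=['Cluster Labels'])
                 .astype({'Cluster Labels': int}))
print(cleaned)
```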


<i>Finally, let's visualize the resulting clusters</i>

In [44]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
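
In the cell above, `rainbow[cluster-1]` sends label 0 to the last colour through negative indexing, so the five clusters still get five distinct colours. A simpler sketch of the same idea, indexing the palette by the label directly, is shown below. `palette` and `colour_for` are hypothetical names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5
# one evenly spaced colour per cluster label, converted to hex strings for folium
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

def colour_for(cluster):
    """Look up the colour by label directly instead of label - 1."""
    return palette[int(cluster)]

print(palette)
```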

## Examine Clusters

In [45]:
from IPython.display import display
for k in range(kclusters):
    print('Cluster =', k+1)
    display(Toronto_merged.loc[Toronto_merged['Cluster Labels'] == k, Toronto_merged.columns[[2] + list(range(5, Toronto_merged.shape[1]))]])


Cluster = 1


Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,0,Park,Bus Line,Swim School,Dim Sum Restaurant,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
21,"Forest Hill North & West, Forest Hill Road Park",0,Park,Jewelry Store,Trail,Sushi Restaurant,Yoga Studio,Dessert Shop,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
33,Rosedale,0,Park,Playground,Trail,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


Cluster = 2


Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park, Summerhill East",1,Playground,Trail,Yoga Studio,Department Store,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


Cluster = 3


Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Regent Park, Harbourfront",2,Coffee Shop,Pub,Bakery,Park,Breakfast Spot,Café,Theater,Yoga Studio,Event Space,Restaurant
1,"Queen's Park, Ontario Provincial Government",2,Coffee Shop,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Restaurant,Café,Portuguese Restaurant,Chinese Restaurant,Park
2,"Garden District, Ryerson",2,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Japanese Restaurant,Cosmetics Shop,Hotel,Bookstore,Pizza Place,Middle Eastern Restaurant
3,St. James Town,2,Coffee Shop,Café,Cocktail Bar,Restaurant,Gastropub,American Restaurant,Beer Bar,Gym,Moroccan Restaurant,Department Store
5,Berczy Park,2,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Beer Bar,Farmers Market,Restaurant,Cheese Shop,Basketball Stadium,Sporting Goods Shop
6,Central Bay Street,2,Coffee Shop,Café,Sandwich Place,Italian Restaurant,Salad Place,Department Store,Thai Restaurant,Burger Joint,Bubble Tea Shop,Japanese Restaurant
7,Christie,2,Grocery Store,Café,Park,Coffee Shop,Restaurant,Athletics & Sports,Italian Restaurant,Candy Store,Baby Store,Nightclub
8,"Richmond, Adelaide, King",2,Coffee Shop,Café,Hotel,Restaurant,Gym,Bar,Thai Restaurant,Clothing Store,Concert Hall,Salad Place
9,"Dufferin, Dovercourt Village",2,Bakery,Pharmacy,Grocery Store,Park,Brewery,Middle Eastern Restaurant,Bank,Music Venue,Supermarket,Bar
10,"Harbourfront East, Union Station, Toronto Islands",2,Coffee Shop,Aquarium,Hotel,Café,Fried Chicken Joint,Scenic Lookout,Brewery,Restaurant,Sporting Goods Shop,Music Venue


Cluster = 4


Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,The Beaches,3,Pub,Health Food Store,Trail,Neighborhood,Yoga Studio,Distribution Center,Dim Sum Restaurant,Diner,Discount Store,Doner Restaurant


Cluster = 5


Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,4,Music Venue,Garden,Yoga Studio,Dessert Shop,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
