Coursera IBM Capstone
==
## Segmenting and Clustering Neighborhoods in Toronto
### By: Ted Hartnell
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.

## Step 1: Load Dependencies

In [1]:
import pandas as pd
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

import numpy as np

In [2]:
pip install wikipedia

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import the Python Wikipedia Crawl library
import wikipedia
wikipedia.summary("List of postal codes of Canada: M", sentences=2)

'This is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area.'

In [4]:
# Import BeautifulSoup to parse the crawled HTML
try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

In [5]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Import the Geocoder Library to lookup Latitude and Longitude of Toronto Neighborhoods
import geocoder

In [7]:
# Import library to handle REST requests
import requests

In [8]:
pip install folium

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Import Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans

# Import Map Rendering Library
import folium

In [10]:
# Have the user manually enter the Foursquare Credentials so they are not stored in the raw Jupyter notebook
# Clear Cell Outputs after running!!!
FOURSQUARE_CLIENT_ID = input("[1/3] Enter your FourSquare Client ID:")
FOURSQUARE_CLIENT_SECRET = input("[2/3] Enter your FourSquare Client Secret:")
FOURSQUARE_VERSION = input("[3/3] Enter your FourSquare Version Number:")

## Step 2: Crawl Wikipedia Toronto Neighborhoods

In [11]:
# Get the link to the 'List of Neighborhoods in Toronto' Wikipedia Page and test it
wikipediaNeighborhoods = wikipedia.page("List_of_postal_codes_of_Canada:_M")
print( wikipediaNeighborhoods.title )
print( wikipediaNeighborhoods.url )

List of postal codes of Canada: M
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [12]:
# Get the HTML content from the 'List of Neighborhoods in Toronto' Wikipedia Page as a String
htmlNeighborhoods = wikipediaNeighborhoods.html()

# Check the HTML content was collected
print( htmlNeighborhoods[0:100] )

<div class="mw-parser-output"><p>This is a list of <a href="/wiki/Postal_codes_in_Canada" title="Pos


In [13]:
# Parse the HTML so it can be searched by BeautifulSoup
parsedNeighborhoods = BeautifulSoup(htmlNeighborhoods)

In [14]:
# Find the list of Neighborhoods at the bottom of the 'List of Neighborhoods in Toronto' Wikipedia Page
find_allNeighborhoods = parsedNeighborhoods.find('table', class_='wikitable').find_all('tr', class_='')
print('Found {} potential Neighborhoods in the HTML'.format( len(find_allNeighborhoods) ))

# Check that the first 5 neighborhoods contains useful data
find_allNeighborhoods[:5]

Found 289 potential Neighborhoods in the HTML


[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>]

In [15]:
# Define the dataframe columns for the Neighborhoods
columnsNeighborhoods = ['PostalCode', 'Borough', 'Neighborhood', 'URL', 'Title', 'Latitude', 'Longitude'] 

# Instantiate the dataframe
neighborhoods = pd.DataFrame(columns=columnsNeighborhoods)

In [16]:
# Try reading in the neighborhoods from earlier - necessary as it takes a very long time to parse neighborhood data
neighborhoods = pd.read_csv("Toronto Neighborhoods 013 Tab-Delineated.csv", sep='\t')
neighborhoods.drop(neighborhoods.columns[0], axis=1, inplace=True) # Drop the ID column

In [17]:
# Convert crawled neighborhood HTML data into structured DataFrames
# Note that this routine can be run several times to retry latitude/longitude lookups as they very often fail

# Limit the number of neighborhoods that can be added each time this cell is run
MAX_NEIGHBORHOODS = 300
countNeighborhoods = 0

for row in find_allNeighborhoods:
    
    countNeighborhoods += 1
    if countNeighborhoods > MAX_NEIGHBORHOODS:
        break
    
    # Get the list of rows found within each crawled table fragment
    tag = row.findAll('td')

    # Extract the Postal Code
    try:
        postalcodeNeighborhood = tag[0].get_text().strip()
    except:
        print("Row {} doesn't have any Neighborhood information - Skip".format(countNeighborhoods))
        continue
    
    # Extract the Borough - if the Borough is 'Not assigned' then skip the row
    try:
        boroughNeighborhood = tag[1].get_text().strip()
    except:
        print("Row {} doesn't have a Borough - Skip".format(countNeighborhoods))
        continue

    if boroughNeighborhood == 'Not assigned':
        print("Row {} Borough is marked as 'Not Assigned' - Skip".format(countNeighborhoods))
        continue
        
    # Extract the Name of the Neighborhood and clean it up
    try:
        nameNeighborhood = tag[2].get_text().strip()
    except:
        print("Row {} doesn't have a Neighborhood - Skip".format(countNeighborhoods))
        continue

    nameNeighborhood = nameNeighborhood.replace(", Toronto","").replace(" (Toronto)","").replace(" (neighbourhood)","").replace(" (page does not exist)","")
    
    # If the Neighborhood Name is 'Not assigned' then use the Borough Name instead
    if nameNeighborhood == 'Not assigned':
        nameNeighborhood = boroughNeighborhood

    # Extract the URL for the Neighborhood (or Borough)
    try:
        urlNeighborhood = tag[2].a['href'].strip() # Try getting the URL for the Neighborhood
    except:
        try:
            urlNeighborhood = tag[1].a['href'].strip() # Else try getting the URL for the Borough
        except:
            urlNeighborhood = ''

    # Extract the Title for the Neighborhood - the Title often contains more information so is used for looking up the Latitude and Longitude
    try:
        titleNeighborhood = tag[2].a['title'].strip() # Try getting the Title for the Neighborhood
    except:
        try:
            titleNeighborhood = tag[1].a['title'].strip() # Else try getting the Title for the Borough
        except:
            titleNeighborhood = ''

    titleNeighborhood = titleNeighborhood.replace(", Toronto","").replace(" (Toronto)","").replace(" (neighbourhood)","").replace(" (page does not exist)","")
    titleNeighborhood = titleNeighborhood + ', Toronto, Ontario' # Use the Neighborhood Title to find the latitude and longitude
            
    # Check that the Neighborhood Name being added to the DataFrame is unique - if the Name is not unique then it has already been added to the DataFrame (recall that this routine can be run many times in order to ensure all location data is loaded)
    if nameNeighborhood in neighborhoods["Neighborhood"].tolist():
        print('The Neighborhood of {} has already been loaded into the DataFrame'.format(nameNeighborhood))
        continue
    
    latitudeNeighborhood = ''
    longitudeNeighborhood = ''

    # (optional) Lookup the Latitude and Longitude of the Neighborhood
    if True:
        try:
            locationNeighborhood = geocoder.arcgis('{}, Toronto, Ontario'.format(nameNeighborhood))
            latitudeNeighborhood = locationNeighborhood.latlng[0]
            longitudeNeighborhood = locationNeighborhood.latlng[1]
            print('{}: The geograpical coordinate of {} is {}, {}.'.format(postalcodeNeighborhood, titleNeighborhood, latitudeNeighborhood, longitudeNeighborhood))
        except:
            print('{}: The geograpical coordinates of {} failed'.format(postalcodeNeighborhood, titleNeighborhood))
            latitudeNeighborhood = ''
            longitudeNeighborhood = ''
            continue

    print('{}: Adding Neighborhood {}'.format(postalcodeNeighborhood, titleNeighborhood))
    neighborhoods = neighborhoods.append({
                                            'PostalCode': postalcodeNeighborhood,
                                            'Borough': boroughNeighborhood,
                                            'Neighborhood': nameNeighborhood,
                                            'URL': urlNeighborhood,
                                            'Title': titleNeighborhood,
                                            'Latitude': latitudeNeighborhood,
                                            'Longitude': longitudeNeighborhood
                                            }, ignore_index=True)

# Note that not all of the locations of the Neighborhood names from the Wikipedia page can be found by the geolocator
print('Finished adding {} Neighborhoods from {} crawled rows.'.format(len(neighborhoods), len(find_allNeighborhoods), ))

Row 1 doesn't have any Neighborhood information - Skip
Row 2 Borough is marked as 'Not Assigned' - Skip
Row 3 Borough is marked as 'Not Assigned' - Skip
The Neighborhood of Parkwoods has already been loaded into the DataFrame
The Neighborhood of Victoria Village has already been loaded into the DataFrame
M5A: The geograpical coordinate of Harbourfront, Toronto, Ontario is 43.63950995556605, -79.38315993782102.
M5A: Adding Neighborhood Harbourfront, Toronto, Ontario
The Neighborhood of Regent Park has already been loaded into the DataFrame
The Neighborhood of Lawrence Heights has already been loaded into the DataFrame
The Neighborhood of Lawrence Manor has already been loaded into the DataFrame
The Neighborhood of Queen's Park has already been loaded into the DataFrame
Row 11 Borough is marked as 'Not Assigned' - Skip
The Neighborhood of Islington Avenue has already been loaded into the DataFrame
The Neighborhood of Rouge has already been loaded into the DataFrame
The Neighborhood of Ma

In [18]:
# Check that the neighborhoods have been loaded correctly
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,URL,Title,Latitude,Longitude
0,M3A,North York,Parkwoods,/wiki/Parkwoods,"Parkwoods, Toronto, Ontario",43.686575,-79.409993
1,M4A,North York,Victoria Village,/wiki/Victoria_Village,"Victoria Village, Toronto, Ontario",43.73154,-79.31428
2,M5A,Downtown Toronto,Regent Park,/wiki/Regent_Park,"Regent Park, Toronto, Ontario",43.66069,-79.36031
3,M6A,North York,Lawrence Heights,/wiki/Lawrence_Heights,"Lawrence Heights, Toronto, Ontario",43.72357,-79.43711
4,M6A,North York,Lawrence Manor,/wiki/Lawrence_Manor,"Lawrence Manor, Toronto, Ontario",43.72292,-79.43131


In [19]:
# Save the neighborhoods as a tab-delineated CSV file
neighborhoods.to_csv("Toronto Neighborhoods 013 Tab-Delineated.csv", sep='\t')

In [20]:
# Get the Location of the geographic center of Toronto to center the map
nameLocation = "Leaside, Toronto, Ontario" # The geographic center is not "Toronto, Canada"
locationToronto = geocoder.arcgis(nameLocation)
print('The geograpical coordinate of the city center at {} is {}, {}.'.format(nameLocation, locationToronto.latlng[0], locationToronto.latlng[1]))

The geograpical coordinate of the city center at Leaside, Toronto, Ontario is 43.70023937855272, -79.3510651247247.


In [21]:
# Create map of Toronto using {latitude, longitude} values and Neighborhood label
mapToronto = folium.Map(location=[locationToronto.latlng[0], locationToronto.latlng[1]], zoom_start=11)

# Add markers to Map
for latitude, longitude, postalcode, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['PostalCode'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '[' + postalcode +']: ' + neighborhood + ', ' + borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mapToronto)  
    
mapToronto

### Toronto Static Map of Neighborhoods
![Toronto Static Map of Neighborhoods](Toronto_Neighborhood_Map_001.png "Toronto Static Map of Neighborhoods")

## Step 3: Explore Toronto Neighborhoods

In [22]:
# Create a function to collect nearby venues from all the neighborhoods in Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):

    # Count the number of neighborhoods being explored
    countNeighborhoods = 0

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # Update the user on which neighborhood is being explored
        countNeighborhoods += 1
        print('{} of {}: Exploring venues nearby {}.'.format(countNeighborhoods, len(names), name))

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            FOURSQUARE_CLIENT_ID, 
            FOURSQUARE_CLIENT_SECRET, 
            FOURSQUARE_VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # Continue if Error with processing the GET request
        try:
            # make the GET request
            results = requests.get(url, timeout=10).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except: 
          pass        

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [23]:
# Limit the number of neighborhoods that can be explored
MAX_NEIGHBORHOODS = 300

# Select which neighborhoods to explore
#selectNeighborhoods = neighborhoods.loc[neighborhoods['Borough'] == 'Downtown Toronto']
#selectNeighborhoods = neighborhoods.loc[neighborhoods['Borough'].str.contains("Toronto")]
selectNeighborhoods = neighborhoods.copy()

# Call Foursquare to explore nearby venues in each Neighborhood
venuesToronto = getNearbyVenues(names=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Neighborhood'],
                                   latitudes=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Latitude'],
                                   longitudes=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Longitude']
                                  )

1 of 209: Exploring venues nearby Parkwoods.
2 of 209: Exploring venues nearby Victoria Village.
3 of 209: Exploring venues nearby Regent Park.
4 of 209: Exploring venues nearby Lawrence Heights.
5 of 209: Exploring venues nearby Lawrence Manor.
6 of 209: Exploring venues nearby Queen's Park.
7 of 209: Exploring venues nearby Islington Avenue.
8 of 209: Exploring venues nearby Rouge.
9 of 209: Exploring venues nearby Malvern.
10 of 209: Exploring venues nearby Don Mills North.
11 of 209: Exploring venues nearby Woodbine Gardens.
12 of 209: Exploring venues nearby Parkview Hill.
13 of 209: Exploring venues nearby Ryerson.
14 of 209: Exploring venues nearby Garden District.
15 of 209: Exploring venues nearby Glencairn.
16 of 209: Exploring venues nearby Cloverdale.
17 of 209: Exploring venues nearby Islington.
18 of 209: Exploring venues nearby Martin Grove.
19 of 209: Exploring venues nearby Princess Gardens.
20 of 209: Exploring venues nearby West Deane Park.
21 of 209: Exploring venue

165 of 209: Exploring venues nearby Harbourfront West.
166 of 209: Exploring venues nearby King and Spadina.
167 of 209: Exploring venues nearby Railway Lands.
168 of 209: Exploring venues nearby South Niagara.
169 of 209: Exploring venues nearby Humber Bay Shores.
170 of 209: Exploring venues nearby Mimico South.
171 of 209: Exploring venues nearby New Toronto.
172 of 209: Exploring venues nearby Albion Gardens.
173 of 209: Exploring venues nearby Beaumond Heights.
174 of 209: Exploring venues nearby Humbergate.
175 of 209: Exploring venues nearby Jamestown.
176 of 209: Exploring venues nearby Mount Olive.
177 of 209: Exploring venues nearby Silverstone.
178 of 209: Exploring venues nearby South Steeles.
179 of 209: Exploring venues nearby Thistletown.
180 of 209: Exploring venues nearby L'Amoreaux West.
181 of 209: Exploring venues nearby Rosedale.
182 of 209: Exploring venues nearby Stn A PO Boxes 25 The Esplanade.
183 of 209: Exploring venues nearby Alderwood.
184 of 209: Exploring

In [24]:
# Check how many venues were found in each neighborhood
venuesToronto.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelaide,100,100,100,100,100,100
Agincourt,16,16,16,16,16,16
Agincourt North,31,31,31,31,31,31
Albion Gardens,6,6,6,6,6,6
Alderwood,5,5,5,5,5,5
Bathurst Manor,4,4,4,4,4,4
Bathurst Quay,25,25,25,25,25,25
Bayview Village,3,3,3,3,3,3
Beaumond Heights,7,7,7,7,7,7
Bedford Park,37,37,37,37,37,37


In [25]:
# Check how many unique categories were found
print('There are {} uniques categories.'.format(len(venuesToronto['Venue Category'].unique())))

There are 317 uniques categories.


In [26]:
# Prepare to Cluster the data
onehotToronto = pd.get_dummies(venuesToronto[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
onehotToronto['Neighborhood'] = venuesToronto['Neighborhood'] 

# Move neighborhood column to the first column
onehotColumns = [onehotToronto.columns[-1]] + list(onehotToronto.columns[:-1])
onehotToronto = onehotToronto[onehotColumns]

# Check what the data looks like
print( onehotToronto.shape )
onehotToronto.head()

(4878, 317)


Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Group the Explored Venues by Neighborhood to see the percentage of each Venue Category

groupedToronto = onehotToronto.groupby('Neighborhood').mean().reset_index()
groupedToronto

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,...,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Adelaide,0.020000,0.01,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
1,Agincourt,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.062500,0.000000,0.000000,0.0,0.000000,0.000000
2,Agincourt North,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.032258,0.000000,0.032258,0.000000,0.000000,0.0,0.032258,0.000000
3,Albion Gardens,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
4,Alderwood,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
5,Bathurst Manor,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
6,Bathurst Quay,0.000000,0.00,0.000000,0.000000,0.0,0.04,0.0,0.04,0.04,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
7,Bayview Village,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
8,Beaumond Heights,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000
9,Bedford Park,0.000000,0.00,0.000000,0.000000,0.0,0.00,0.0,0.00,0.00,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000


In [29]:
# Limit the number of neighborhoods that can be shown
MAX_NEIGHBORHOODS = 5

# Print each neighborhood along with the top 10 most common venues
COUNT_TOP_VENUES = 10

for hood in groupedToronto['Neighborhood'].head(MAX_NEIGHBORHOODS):
    print("----"+hood+"----")
    temp = groupedToronto[groupedToronto['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(COUNT_TOP_VENUES))
    print('\n')

----Adelaide----
                    venue  freq
0             Coffee Shop  0.06
1      Italian Restaurant  0.06
2              Restaurant  0.05
3                     Bar  0.05
4                     Spa  0.04
5  Furniture / Home Store  0.04
6                  Bakery  0.03
7                    Café  0.03
8                     Gym  0.03
9             Yoga Studio  0.02


----Agincourt----
                    venue  freq
0      Chinese Restaurant  0.31
1        Asian Restaurant  0.12
2             Coffee Shop  0.06
3              Food Court  0.06
4       Korean Restaurant  0.06
5    Cantonese Restaurant  0.06
6              Restaurant  0.06
7  Peking Duck Restaurant  0.06
8                  Bakery  0.06
9    Hong Kong Restaurant  0.06


----Agincourt North----
                venue  freq
0              Bakery  0.06
1  Chinese Restaurant  0.06
2      Ice Cream Shop  0.06
3           Juice Bar  0.03
4       Movie Theater  0.03
5                Bank  0.03
6      Clothing Store  0.03
7        

In [30]:
# Sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [31]:
# Create a new DataFrame with the Top-10 Venues for each Neighborhood
COUNT_TOP_VENUES = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(COUNT_TOP_VENUES):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sortedNeighborhoodVenues = pd.DataFrame(columns=columns)
sortedNeighborhoodVenues['Neighborhood'] = groupedToronto['Neighborhood']

for ind in np.arange(groupedToronto.shape[0]):
    sortedNeighborhoodVenues.iloc[ind, 1:] = return_most_common_venues(groupedToronto.iloc[ind, :], COUNT_TOP_VENUES)

sortedNeighborhoodVenues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Adelaide,Italian Restaurant,Coffee Shop,Restaurant,Bar,Spa,Furniture / Home Store,Café,Bakery,Gym,Caribbean Restaurant
1,Agincourt,Chinese Restaurant,Asian Restaurant,Food Court,Bakery,Vietnamese Restaurant,Peking Duck Restaurant,Coffee Shop,Hong Kong Restaurant,Cantonese Restaurant,Restaurant
2,Agincourt North,Bakery,Chinese Restaurant,Ice Cream Shop,Coffee Shop,Beer Store,Taco Place,Supermarket,Bank,Japanese Restaurant,Sporting Goods Shop
3,Albion Gardens,Baseball Field,Filipino Restaurant,Gas Station,Liquor Store,Beer Store,Bank,Women's Store,Doner Restaurant,Donut Shop,Drugstore
4,Alderwood,Pizza Place,Pharmacy,Coffee Shop,Dance Studio,Convenience Store,Women's Store,Eastern European Restaurant,Doctor's Office,Dog Run,Doner Restaurant


## Step 4: Cluster Toronto Neighborhoods

In [50]:
# Run k-means to cluster the neighborhood into clusters
COUNT_K_MEANS_CLUSTERS = 4

clusteredTorontoGroups = groupedToronto.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=COUNT_K_MEANS_CLUSTERS, random_state=0).fit(clusteredTorontoGroups)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:300]

array([2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 3, 2, 2,
       2, 2, 2, 2, 1, 3, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3,
       2, 1, 2, 3, 3, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 0, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 1, 2, 2, 1, 3,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 3, 2, 2, 2, 2, 2, 2,
       1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 3, 3,
       2, 2, 2, 2, 1, 1, 2, 2])

In [51]:
# Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

# Add clustering labels
labeledSortedNeighborhoodVenues = sortedNeighborhoodVenues.copy()

labeledSortedNeighborhoodVenues.insert(0, 'Cluster Labels', kmeans.labels_)

# Extend the original Neighborhood Data
mergedToronto = selectNeighborhoods.copy()

# Merge the grouped data with the neighborhood data to add latitude/longitude for each neighborhood
mergedToronto = mergedToronto.join(labeledSortedNeighborhoodVenues.set_index('Neighborhood'), on='Neighborhood')

mergedToronto.head() # Check the merged columns

Unnamed: 0,PostalCode,Borough,Neighborhood,URL,Title,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,/wiki/Parkwoods,"Parkwoods, Toronto, Ontario",43.686575,-79.409993,1.0,Park,Café,Bagel Shop,Vegetarian / Vegan Restaurant,Juice Bar,Sushi Restaurant,Coffee Shop,Burger Joint,Italian Restaurant,Japanese Restaurant
1,M4A,North York,Victoria Village,/wiki/Victoria_Village,"Victoria Village, Toronto, Ontario",43.73154,-79.31428,3.0,Food Stand,Park,Women's Store,Diner,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
2,M5A,Downtown Toronto,Regent Park,/wiki/Regent_Park,"Regent Park, Toronto, Ontario",43.66069,-79.36031,2.0,Coffee Shop,Thai Restaurant,Pet Store,Animal Shelter,Electronics Store,Fast Food Restaurant,Beer Store,Sushi Restaurant,Food Truck,Restaurant
3,M6A,North York,Lawrence Heights,/wiki/Lawrence_Heights,"Lawrence Heights, Toronto, Ontario",43.72357,-79.43711,3.0,Park,Accessories Store,Women's Store,Eastern European Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
4,M6A,North York,Lawrence Manor,/wiki/Lawrence_Manor,"Lawrence Manor, Toronto, Ontario",43.72292,-79.43131,2.0,Women's Store,Department Store,Liquor Store,Mexican Restaurant,Shoe Store,Electronics Store,Kids Store,Bus Stop,Supermarket,Sandwich Place


### Map of Clusters

In [52]:
# Visualize the Clusters

# Create a map
map_clusters = folium.Map(location=[locationToronto.latlng[0], locationToronto.latlng[1]], zoom_start=11)

# set color scheme for the clusters
x = np.arange(COUNT_K_MEANS_CLUSTERS)
ys = [i + x + (i*x)**2 for i in range(COUNT_K_MEANS_CLUSTERS)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mergedToronto['Latitude'], mergedToronto['Longitude'], mergedToronto['Neighborhood'], mergedToronto['Cluster Labels']):
    try:
        idCluster = int(cluster)
    except:
        idCluster = 0
    label = folium.Popup(str(poi) + ' Cluster ' + str(idCluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[idCluster-1],
        fill=True,
        fill_color=rainbow[idCluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Toronto Static Map of Neighborhood Clusters
![Toronto Static Map of Neighborhood Clusters](Toronto_Neighborhood_Map_002.png "Toronto Static Map of Neighborhood Clusters")

## Step 5: Examine Toronto Clusters

### Cluster 0

In [55]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 0, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

Unnamed: 0,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
44,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
69,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
82,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
87,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
96,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
109,North York,43.72019,-79.499915,0.0,Pizza Place,Pharmacy,Vietnamese Restaurant,Bakery,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop
175,Etobicoke,43.747074,-79.594665,0.0,Pizza Place,Women's Store,Dumpling Restaurant,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore


### Cluster 1

In [56]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 1, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

Unnamed: 0,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,43.686575,-79.409993,1.0,Park,Café,Bagel Shop,Vegetarian / Vegan Restaurant,Juice Bar,Sushi Restaurant,Coffee Shop,Burger Joint,Italian Restaurant,Japanese Restaurant
7,Scarborough,43.80766,-79.17405,1.0,Pizza Place,Bus Station,Candy Store,Park,Event Space,Donut Shop,Dive Bar,Doctor's Office,Falafel Restaurant,Exhibit
19,Etobicoke,43.65297,-79.55742,1.0,Park,Convenience Store,Movie Theater,Hotel,Women's Store,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
20,Scarborough,43.78948,-79.17614,1.0,Park,Pharmacy,Bus Station,Baseball Field,Eastern European Restaurant,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
29,Etobicoke,43.661689,-79.578263,1.0,Pizza Place,Coffee Shop,Park,Convenience Store,Chinese Restaurant,Event Service,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run
34,Scarborough,43.76343,-79.1782,1.0,Convenience Store,Tea Room,Construction & Landscaping,Park,Restaurant,Women's Store,Drugstore,Dive Bar,Doctor's Office,Dog Run
37,York,43.68857,-79.45483,1.0,Women's Store,Market,Pharmacy,Park,Fast Food Restaurant,Gym,Dessert Shop,Diner,Discount Store,Dive Bar
41,Downtown Toronto,43.673016,-79.42208,1.0,Park,Coffee Shop,Candy Store,Café,Italian Restaurant,Beer Store,Nightclub,Japanese Restaurant,Diner,Grocery Store
43,North York,43.76378,-79.45477,1.0,Playground,Park,Convenience Store,Baseball Field,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
55,North York,43.693643,-79.401812,1.0,Hotel,Gym / Fitness Center,Skating Rink,Park,Women's Store,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant


### Cluster 2

In [57]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 2, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

Unnamed: 0,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,43.66069,-79.36031,2.0,Coffee Shop,Thai Restaurant,Pet Store,Animal Shelter,Electronics Store,Fast Food Restaurant,Beer Store,Sushi Restaurant,Food Truck,Restaurant
4,North York,43.72292,-79.43131,2.0,Women's Store,Department Store,Liquor Store,Mexican Restaurant,Shoe Store,Electronics Store,Kids Store,Bus Stop,Supermarket,Sandwich Place
5,Queen's Park,43.66253,-79.39017,2.0,Coffee Shop,Gym,Sushi Restaurant,Persian Restaurant,Portuguese Restaurant,Sandwich Place,Bar,Italian Restaurant,Café,Theater
6,Etobicoke,43.722687,-79.558689,2.0,Coffee Shop,Convenience Store,Clothing Store,Women's Store,Electronics Store,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
8,Scarborough,43.80977,-79.22084,2.0,Pharmacy,Pizza Place,Sandwich Place,Gym / Fitness Center,Grocery Store,Fast Food Restaurant,Convenience Store,Skating Rink,Bubble Tea Shop,Donut Shop
9,North York,43.699339,-79.337722,2.0,Seafood Restaurant,Public Art,Women's Store,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
10,East York,43.70557,-79.30059,2.0,Wings Joint,Bus Station,Coffee Shop,Restaurant,Grocery Store,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant
12,Downtown Toronto,43.648829,-79.402486,2.0,Coffee Shop,Bar,Furniture / Home Store,Pizza Place,Café,Dessert Shop,Park,Ramen Restaurant,Asian Restaurant,Italian Restaurant
13,Downtown Toronto,43.65794,-79.37562,2.0,Coffee Shop,Hotel,Café,Middle Eastern Restaurant,Pizza Place,Sandwich Place,Fast Food Restaurant,Diner,Clothing Store,Japanese Restaurant
14,North York,43.713271,-79.42463,2.0,Café,Pharmacy,Pet Store,Dessert Shop,Bank,Dumpling Restaurant,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop


### Cluster 3

In [58]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 3, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

Unnamed: 0,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,43.73154,-79.31428,3.0,Food Stand,Park,Women's Store,Diner,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
3,North York,43.72357,-79.43711,3.0,Park,Accessories Store,Women's Store,Eastern European Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
22,Scarborough,43.77897,-79.13109,3.0,Park,Train Station,Women's Store,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
30,Etobicoke,43.63391,-79.56948,3.0,Park,Women's Store,Dim Sum Restaurant,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
32,Scarborough,43.74953,-79.18992,3.0,Park,Hotel,Women's Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
39,East York,43.700239,-79.351065,3.0,Park,Indian Restaurant,Bridge,Women's Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
42,North York,43.80303,-79.35346,3.0,Park,Residential Building (Apartment / Condo),Women's Store,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
48,Downtown Toronto,43.706307,-79.514355,3.0,Convenience Store,Park,Women's Store,Dumpling Restaurant,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
65,Scarborough,43.73629,-79.2732,3.0,Park,Women's Store,Dim Sum Restaurant,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore
94,North York,43.7873,-79.40983,3.0,Park,Women's Store,Dim Sum Restaurant,Discount Store,Dive Bar,Doctor's Office,Dog Run,Doner Restaurant,Donut Shop,Drugstore


## The End.