Coursera IBM Capstone
==
## Segmenting and Clustering Neighborhoods in Toronto
### By: Ted Hartnell
In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

## Step 1: Load Dependencies

In [1]:
import pandas as pd
from pandas.io.json import json_normalize # transform a JSON file into a pandas dataframe

import numpy as np

In [2]:
pip install wikipedia

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import the Python Wikipedia Crawl library
import wikipedia
wikipedia.summary("List of postal codes of Canada: M", sentences=2)

'This is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area.'

In [4]:
# Import BeautifulSoup to parse the crawled HTML
try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

In [5]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Import the Geocoder Library to look up the Latitude and Longitude of Toronto Neighborhoods
import geocoder

In [7]:
# Import library to handle REST requests
import requests

In [8]:
pip install folium

Note: you may need to restart the kernel to use updated packages.


In [9]:
# Import Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans

# Import Map Rendering Library
import folium

In [10]:
# Have the user manually enter the Foursquare Credentials so they are not stored in the raw Jupyter notebook
# Clear Cell Outputs after running!!!
FOURSQUARE_CLIENT_ID = input("[1/3] Enter your FourSquare Client ID:")
FOURSQUARE_CLIENT_SECRET = input("[2/3] Enter your FourSquare Client Secret:")
FOURSQUARE_VERSION = input("[3/3] Enter your FourSquare Version Number:")
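
`input()` echoes what is typed into the cell output, which is why the cell above asks for the outputs to be cleared. A possible alternative is `getpass.getpass` from the standard library, which hides the typed value. The sketch below is only an illustration: `read_foursquare_credentials` and its `prompt` parameter are hypothetical names, not part of the original notebook.

```python
import getpass


def read_foursquare_credentials(prompt=getpass.getpass):
    """Prompt for the three Foursquare values without echoing them.

    `prompt` is injectable so the function can be exercised without a terminal.
    """
    labels = [
        ("client_id", "[1/3] Enter your Foursquare Client ID: "),
        ("client_secret", "[2/3] Enter your Foursquare Client Secret: "),
        ("version", "[3/3] Enter your Foursquare Version Number: "),
    ]
    credentials = {}
    for key, message in labels:
        value = prompt(message).strip()
        if not value:
            raise ValueError("{} must not be empty".format(key))
        credentials[key] = value
    return credentials
```

Calling `read_foursquare_credentials()` with no arguments would prompt three times; the returned dictionary could then feed the `FOURSQUARE_*` variables.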

## Step 2: Crawl Wikipedia Toronto Neighborhoods

In [53]:
### This can also work
#
# import pandas as pd 
# import wikipedia as wp
# from bs4 import BeautifulSoup
# 
# html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
# df = pd.read_html(html, header = 0)[0]
#
###

In [11]:
# Get the 'List of postal codes of Canada: M' Wikipedia page and check its title and URL
wikipediaNeighborhoods = wikipedia.page("List_of_postal_codes_of_Canada:_M")
print( wikipediaNeighborhoods.title )
print( wikipediaNeighborhoods.url )

List of postal codes of Canada: M
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [12]:
# Get the HTML content of the Wikipedia page as a string
htmlNeighborhoods = wikipediaNeighborhoods.html()

# Check the HTML content was collected
print( htmlNeighborhoods[0:100] )

<div class="mw-parser-output"><p>This is a list of <a href="/wiki/Postal_codes_in_Canada" title="Pos


In [13]:
# Parse the HTML so it can be searched by BeautifulSoup
parsedNeighborhoods = BeautifulSoup(htmlNeighborhoods, 'html.parser') # Name the parser explicitly to avoid the bs4 "no parser specified" warning

In [14]:
# Find the rows of the postal code table on the Wikipedia page
find_allNeighborhoods = parsedNeighborhoods.find('table', class_='wikitable').find_all('tr', class_='')
print('Found {} potential Neighborhoods in the HTML'.format( len(find_allNeighborhoods) ))

# Check that the first 5 rows contain useful data
find_allNeighborhoods[:5]

Found 289 potential Neighborhoods in the HTML


[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>, <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>, <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>, <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
 </td></tr>]

In [15]:
# Define the dataframe columns for the Neighborhoods
columnsNeighborhoods = ['PostalCode', 'Borough', 'Neighborhood', 'URL', 'Title', 'Latitude', 'Longitude'] 

# Instantiate the dataframe
neighborhoods = pd.DataFrame(columns=columnsNeighborhoods)

In [16]:
# Try reading in the neighborhoods from earlier - necessary as it takes a very long time to parse neighborhood data
neighborhoods = pd.read_csv("Toronto Neighborhoods 013 Tab-Delineated.csv", sep='\t')
neighborhoods.drop(neighborhoods.columns[0], axis=1, inplace=True) # Drop the ID column
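
The cell above raises an error on a first run, when the cached file does not exist yet. The following is a minimal sketch of a guarded load, under the assumption that the cache is written without an index column (unlike the notebook's file, which keeps one and drops it after loading). `load_cached_neighborhoods`, `save_neighborhoods`, and the sample row are hypothetical.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['PostalCode', 'Borough', 'Neighborhood', 'URL', 'Title', 'Latitude', 'Longitude']


def load_cached_neighborhoods(path, columns=COLUMNS):
    """Return the cached dataframe, or an empty one with the expected columns."""
    if not os.path.exists(path):
        return pd.DataFrame(columns=columns)
    return pd.read_csv(path, sep='\t')


def save_neighborhoods(frame, path):
    """Write a tab-separated file without the index column."""
    frame.to_csv(path, sep='\t', index=False)


# Round trip in a temporary directory with one made-up row
with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, 'neighborhoods.tsv')
    empty = load_cached_neighborhoods(cache_path)
    sample = pd.DataFrame(
        [['M0X', 'Example Borough', 'Example', '', '', 43.0, -79.0]],
        columns=COLUMNS,
    )
    save_neighborhoods(sample, cache_path)
    restored = load_cached_neighborhoods(cache_path)
```

Because the file is saved with `index=False`, there is no ID column to drop after reading it back.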

In [17]:
# Convert crawled neighborhood HTML data into structured DataFrames
# Note that this routine can be run several times to retry latitude/longitude lookups as they very often fail

# Limit the number of neighborhoods that can be added each time this cell is run
MAX_NEIGHBORHOODS = 300
countNeighborhoods = 0

for row in find_allNeighborhoods:
    
    countNeighborhoods += 1
    if countNeighborhoods > MAX_NEIGHBORHOODS:
        break
    
    # Get the list of cells found within each crawled table row
    tag = row.find_all('td')

    # Extract the Postal Code
    try:
        postalcodeNeighborhood = tag[0].get_text().strip()
    except:
        print("Row {} doesn't have any Neighborhood information - Skip".format(countNeighborhoods))
        continue
    
    # Extract the Borough - if the Borough is 'Not assigned' then skip the row
    try:
        boroughNeighborhood = tag[1].get_text().strip()
    except:
        print("Row {} doesn't have a Borough - Skip".format(countNeighborhoods))
        continue

    if boroughNeighborhood == 'Not assigned':
        print("Row {} Borough is marked as 'Not Assigned' - Skip".format(countNeighborhoods))
        continue
        
    # Extract the Name of the Neighborhood and clean it up
    try:
        nameNeighborhood = tag[2].get_text().strip()
    except:
        print("Row {} doesn't have a Neighborhood - Skip".format(countNeighborhoods))
        continue

    nameNeighborhood = nameNeighborhood.replace(", Toronto","").replace(" (Toronto)","").replace(" (neighbourhood)","").replace(" (page does not exist)","")
    
    # If the Neighborhood Name is 'Not assigned' then use the Borough Name instead
    if nameNeighborhood == 'Not assigned':
        nameNeighborhood = boroughNeighborhood

    # Extract the URL for the Neighborhood (or Borough)
    try:
        urlNeighborhood = tag[2].a['href'].strip() # Try getting the URL for the Neighborhood
    except:
        try:
            urlNeighborhood = tag[1].a['href'].strip() # Else try getting the URL for the Borough
        except:
            urlNeighborhood = ''

    # Extract the Title for the Neighborhood - the Title often contains more information so is used for looking up the Latitude and Longitude
    try:
        titleNeighborhood = tag[2].a['title'].strip() # Try getting the Title for the Neighborhood
    except:
        try:
            titleNeighborhood = tag[1].a['title'].strip() # Else try getting the Title for the Borough
        except:
            titleNeighborhood = ''

    titleNeighborhood = titleNeighborhood.replace(", Toronto","").replace(" (Toronto)","").replace(" (neighbourhood)","").replace(" (page does not exist)","")
    titleNeighborhood = titleNeighborhood + ', Toronto, Ontario' # Use the Neighborhood Title to find the latitude and longitude
            
    # Check that the Neighborhood Name being added to the DataFrame is unique - if the Name is not unique then it has already been added to the DataFrame (recall that this routine can be run many times in order to ensure all location data is loaded)
    if nameNeighborhood in neighborhoods["Neighborhood"].tolist():
        print('The Neighborhood of {} has already been loaded into the DataFrame'.format(nameNeighborhood))
        continue
    
    latitudeNeighborhood = ''
    longitudeNeighborhood = ''

    # (optional) Look up the Latitude and Longitude of the Neighborhood
    if True:
        try:
            locationNeighborhood = geocoder.arcgis('{}, Toronto, Ontario'.format(nameNeighborhood))
            latitudeNeighborhood = locationNeighborhood.latlng[0]
            longitudeNeighborhood = locationNeighborhood.latlng[1]
            print('{}: The geographical coordinate of {} is {}, {}.'.format(postalcodeNeighborhood, titleNeighborhood, latitudeNeighborhood, longitudeNeighborhood))
        except Exception:
            print('{}: The geographical coordinates of {} failed'.format(postalcodeNeighborhood, titleNeighborhood))
            latitudeNeighborhood = ''
            longitudeNeighborhood = ''
            continue

    print('{}: Adding Neighborhood {}'.format(postalcodeNeighborhood, titleNeighborhood))
    # DataFrame.append was removed in pandas 2.0; pd.concat works on older and newer versions
    neighborhoods = pd.concat([neighborhoods, pd.DataFrame([{
                                            'PostalCode': postalcodeNeighborhood,
                                            'Borough': boroughNeighborhood,
                                            'Neighborhood': nameNeighborhood,
                                            'URL': urlNeighborhood,
                                            'Title': titleNeighborhood,
                                            'Latitude': latitudeNeighborhood,
                                            'Longitude': longitudeNeighborhood
                                            }])], ignore_index=True)

# Note that not all of the locations of the Neighborhood names from the Wikipedia page can be found by the geolocator
print('Finished adding {} Neighborhoods from {} crawled rows.'.format(len(neighborhoods), len(find_allNeighborhoods)))

Row 1 doesn't have any Neighborhood information - Skip
Row 2 Borough is marked as 'Not Assigned' - Skip
Row 3 Borough is marked as 'Not Assigned' - Skip
The Neighborhood of Parkwoods has already been loaded into the DataFrame
The Neighborhood of Victoria Village has already been loaded into the DataFrame
The Neighborhood of Harbourfront has already been loaded into the DataFrame
The Neighborhood of Regent Park has already been loaded into the DataFrame
The Neighborhood of Lawrence Heights has already been loaded into the DataFrame
The Neighborhood of Lawrence Manor has already been loaded into the DataFrame
The Neighborhood of Queen's Park has already been loaded into the DataFrame
Row 11 Borough is marked as 'Not Assigned' - Skip
The Neighborhood of Islington Avenue has already been loaded into the DataFrame
The Neighborhood of Rouge has already been loaded into the DataFrame
The Neighborhood of Malvern has already been loaded into the DataFrame
Row 15 Borough is marked as 'Not Assign
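
The loop above relies on rerunning the cell to retry failed coordinate lookups. A retry could also happen inside the loop. The sketch below shows one way to do that with the standard library only; `lookup_with_retry` and `fake_lookup` are hypothetical names, and the coordinates are made up.

```python
import time


def lookup_with_retry(lookup, query, attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-empty result or attempts run out.

    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = lookup(query)
        except Exception:
            result = None
        if result:
            return result
        if attempt < attempts:
            time.sleep(delay_seconds)
    return None


# Stand-in for a geocoder that fails twice before answering (made-up coordinates)
calls = {'count': 0}


def fake_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return (43.7, -79.4)


coordinates = lookup_with_retry(fake_lookup, 'Example, Toronto, Ontario')
```

In the notebook, `lookup` could plausibly be a small function that returns `geocoder.arcgis(query).latlng`, the attribute the loop already reads. That wiring is an assumption and is not tested here.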

In [18]:
# Check that the neighborhoods have been loaded correctly from the crawled results
print( neighborhoods.shape )
neighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,URL,Title,Latitude,Longitude
0,M3A,North York,Parkwoods,/wiki/Parkwoods,"Parkwoods, Toronto, Ontario",43.686575,-79.409993
1,M4A,North York,Victoria Village,/wiki/Victoria_Village,"Victoria Village, Toronto, Ontario",43.73154,-79.31428
2,M5A,Downtown Toronto,Regent Park,/wiki/Regent_Park,"Regent Park, Toronto, Ontario",43.66069,-79.36031
3,M6A,North York,Lawrence Heights,/wiki/Lawrence_Heights,"Lawrence Heights, Toronto, Ontario",43.72357,-79.43711
4,M6A,North York,Lawrence Manor,/wiki/Lawrence_Manor,"Lawrence Manor, Toronto, Ontario",43.72292,-79.43131


In [19]:
# Save the neighborhoods as a tab-delimited CSV file
neighborhoods.to_csv("Toronto Neighborhoods 013 Tab-Delineated.csv", sep='\t')

In [20]:
# Group the Neighborhoods by Postal Code so that the Neighborhoods are concatenated
groupNeighborhoodNames = neighborhoods.groupby(['PostalCode','Borough'], as_index=True)['Neighborhood'].apply(', '.join).reset_index()

# Group the Neighborhoods by Postal Code so that the Location is averaged
groupNeighborhoodLocations = neighborhoods.groupby(['PostalCode','Borough'], as_index=True)[['Latitude','Longitude']].mean() # Select several columns with a list; the bare tuple form is no longer accepted by recent pandas

# Merge together the final Grouped Neighborhoods
neighborhoods = pd.merge(groupNeighborhoodNames, groupNeighborhoodLocations, on=['PostalCode','Borough'], how='inner')

print( neighborhoods.shape )
neighborhoods.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.808715,-79.197445
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785203,-79.146583
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76517,-79.191117
3,M1G,Scarborough,Woburn,43.7673,-79.22823
4,M1H,Scarborough,Cedarbrae,43.747728,-79.235174
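
The two `groupby` calls and the merge above can also be written as a single named aggregation, available in pandas 0.25 and later. This is a sketch on made-up rows, not the notebook's data; the postal codes and names are placeholders.

```python
import pandas as pd

# Made-up rows in the same shape as the crawled data
rows = pd.DataFrame({
    'PostalCode': ['M0A', 'M0A', 'M0B'],
    'Borough': ['Example Borough', 'Example Borough', 'Other Borough'],
    'Neighborhood': ['First', 'Second', 'Third'],
    'Latitude': [43.0, 44.0, 45.0],
    'Longitude': [-79.0, -80.0, -81.0],
})

# One pass: join the names and average the coordinates per postal code
grouped = (
    rows.groupby(['PostalCode', 'Borough'], as_index=False)
        .agg(
            Neighborhood=('Neighborhood', ', '.join),
            Latitude=('Latitude', 'mean'),
            Longitude=('Longitude', 'mean'),
        )
)
```

Here `grouped` has one row per postal code, with the two `M0A` names joined into a single string and their coordinates averaged.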


In [21]:
# Get the Location of the geographic center of Toronto to center the map
nameLocation = "Leaside, Toronto, Ontario" # The geographic center is not "Toronto, Canada"
locationToronto = geocoder.arcgis(nameLocation)
print('The geographical coordinate of the city center at {} is {}, {}.'.format(nameLocation, locationToronto.latlng[0], locationToronto.latlng[1]))

The geographical coordinate of the city center at Leaside, Toronto, Ontario is 43.70023937855272, -79.3510651247247.


In [22]:
# Create map of Toronto using {latitude, longitude} values and Neighborhood label
mapToronto = folium.Map(location=[locationToronto.latlng[0], locationToronto.latlng[1]], zoom_start=11)

# Add markers to Map
for latitude, longitude, postalcode, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['PostalCode'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '[' + postalcode +']: ' + neighborhood + ', ' + borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mapToronto)  
    
mapToronto

### Toronto Static Map of Neighborhoods
![Toronto Static Map of Neighborhoods](Toronto_Neighborhood_Map_003.png "Toronto Static Map of Neighborhoods")

## Step 3: Explore Toronto Neighborhoods

In [60]:
# Create a function to collect nearby venues from all the neighborhoods in Toronto
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):

    # Count the number of neighborhoods being explored
    countNeighborhoods = 0

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # Update the user on which neighborhood is being explored
        countNeighborhoods += 1
        print('{} of {}: Exploring venues nearby {}.'.format(countNeighborhoods, len(names), name))

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            FOURSQUARE_CLIENT_ID, 
            FOURSQUARE_CLIENT_SECRET, 
            FOURSQUARE_VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # Continue if Error with processing the GET request
        try:
            # make the GET request
            results = requests.get(url, timeout=10).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
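        except Exception as error:
            # Skip this neighborhood; print only the error type because the message may contain the request URL and its credentials
            print('  Skipped {} ({})'.format(name, type(error).__name__))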

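    # Flatten the per-neighborhood lists into one row per venue
    # Naming the columns in the constructor also works when no venues were found
    columnsVenues = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list], columns=columnsVenues)
    
    return nearby_venues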
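
The list comprehension in `getNearbyVenues` assumes a particular nesting in the JSON response, and a venue with an empty category list would raise an error that discards the whole neighborhood. The sketch below flattens a hand-written payload that mirrors only the keys the function reads. `extract_venues` and every value in `sample_payload` are hypothetical.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, skipping venues with no category."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # keep the other venues instead of failing the whole neighborhood
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows


# Made-up payload containing only the keys used above
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {
                    "name": "Example Cafe",
                    "location": {"lat": 43.001, "lng": -79.001},
                    "categories": [{"name": "Coffee Shop"}],
                }},
                {"venue": {
                    "name": "Uncategorized Place",
                    "location": {"lat": 43.002, "lng": -79.002},
                    "categories": [],
                }},
            ]
        }]
    }
}

rows = extract_venues(sample_payload, "Example", 43.0, -79.0)
```

Only the first venue survives, because the second has no category to report.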

In [61]:
# Limit the number of neighborhoods that can be explored
MAX_NEIGHBORHOODS = 3

# Select which neighborhoods to explore
#selectNeighborhoods = neighborhoods.loc[neighborhoods['Borough'] == 'Downtown Toronto']
#selectNeighborhoods = neighborhoods.loc[neighborhoods['Borough'].str.contains("Toronto")]
selectNeighborhoods = neighborhoods.copy()

# Call Foursquare to explore nearby venues in each Neighborhood
venuesToronto = getNearbyVenues(names=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Neighborhood'],
                                   latitudes=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Latitude'],
                                   longitudes=selectNeighborhoods.head(MAX_NEIGHBORHOODS)['Longitude']
                                  )

1 of 3: Exploring venues nearby Rouge, Malvern.
2 of 3: Exploring venues nearby Highland Creek, Rouge Hill, Port Union.
3 of 3: Exploring venues nearby Guildwood, Morningside, West Hill.


In [56]:
venuesToronto.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.808715,-79.197445,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge, Malvern",43.808715,-79.197445,Eco Painting,43.807465,-79.201223,Paintball Field
2,"Rouge, Malvern",43.808715,-79.197445,FASTSIGNS,43.807882,-79.201968,Business Service
3,"Rouge, Malvern",43.808715,-79.197445,Interprovincial Group,43.80563,-79.200378,Print Shop
4,"Highland Creek, Rouge Hill, Port Union",43.785203,-79.146583,Centennial Park,43.786257,-79.148776,Park


In [40]:
# Check how many venues were found in each neighborhood
venuesToronto.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",22,22,22,22,22,22
Agincourt,16,16,16,16,16,16
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",6,6,6,6,6,6
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",6,6,6,6,6,6
"Alderwood, Long Branch",4,4,4,4,4,4
"Bathurst Manor, Downsview North, Wilson Heights",1,1,1,1,1,1
Bayview Village,3,3,3,3,3,3
"Bedford Park, Lawrence Manor East",13,13,13,13,13,13
Berczy Park,100,100,100,100,100,100
"Birch Cliff, Cliffside West",4,4,4,4,4,4


In [26]:
# Check how many unique categories were found
print('There are {} unique categories.'.format(len(venuesToronto['Venue Category'].unique())))

There are 281 unique categories.


In [27]:
# Prepare to Cluster the data
onehotToronto = pd.get_dummies(venuesToronto[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
onehotToronto['Neighborhood'] = venuesToronto['Neighborhood'] 

# Move neighborhood column to the first column
onehotColumns = [onehotToronto.columns[-1]] + list(onehotToronto.columns[:-1])
onehotToronto = onehotToronto[onehotColumns]

# Check what the data looks like
print( onehotToronto.shape )
onehotToronto.head()

(2433, 281)


Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,...,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
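
The shape printed above has 281 columns for 281 categories even though a `Neighborhood` column was added. That would be the case if one venue category is itself called `Neighborhood` and the assignment overwrote its dummy column, which would also explain why `Yoga Studio` rather than `Neighborhood` ends up first. This is an inference from the output, not something the notebook states. The sketch below, on made-up data, shows one way to avoid such a collision; the renamed column label is hypothetical.

```python
import pandas as pd

# Made-up venues where one category shares its name with the key column
venues = pd.DataFrame({
    'Neighborhood': ['Example A', 'Example A', 'Example B'],
    'Venue Category': ['Park', 'Neighborhood', 'Park'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Rename a category that would clash with the key column
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})

# insert() places the key column first and raises if the name already exists
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

grouped = onehot.groupby('Neighborhood').mean().reset_index()
```

Using `insert` instead of plain assignment means a remaining name clash fails loudly rather than silently replacing a column.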


In [28]:
# Group the explored venues by Neighborhood; the mean of each one-hot column is the fraction of that neighborhood's venues in the category

groupedToronto = onehotToronto.groupby('Neighborhood').mean().reset_index()
groupedToronto

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,...,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,Agincourt,0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.062500,0.000000,0.000000,0.000000,0.000000
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.166667
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,"Alderwood, Long Branch",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,"Bathurst Manor, Downsview North, Wilson Heights",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,Bayview Village,0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,"Bedford Park, Lawrence Manor East",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,Berczy Park,0.000000,0.00,0.000000,0.000000,0.0,0.010000,0.000000,0.0,0.0,...,0.000000,0.000000,0.010000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,"Birch Cliff, Cliffside West",0.000000,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,...,0.000000,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [29]:
# Limit the number of neighborhoods that can be shown
MAX_NEIGHBORHOODS = 5

# Print each neighborhood along with the top 10 most common venues
COUNT_TOP_VENUES = 10

for hood in groupedToronto['Neighborhood'].head(MAX_NEIGHBORHOODS):
    print("----"+hood+"----")
    temp = groupedToronto[groupedToronto['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(COUNT_TOP_VENUES))
    print('\n')

----Adelaide, King, Richmond----
                       venue  freq
0                Supermarket  0.09
1                   Pharmacy  0.09
2                Coffee Shop  0.09
3      Portuguese Restaurant  0.05
4                     Bakery  0.05
5                       Pool  0.05
6                       Café  0.05
7       Brazilian Restaurant  0.05
8  Middle Eastern Restaurant  0.05
9              Grocery Store  0.05


----Agincourt----
                    venue  freq
0      Chinese Restaurant  0.31
1        Asian Restaurant  0.12
2       Korean Restaurant  0.06
3    Cantonese Restaurant  0.06
4   Vietnamese Restaurant  0.06
5  Peking Duck Restaurant  0.06
6              Food Court  0.06
7                  Bakery  0.06
8             Coffee Shop  0.06
9    Hong Kong Restaurant  0.06


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
                     venue  freq
0        Indian Restaurant  0.50
1            Women's Store  0.17
2         Asian Restaurant  0.17
3          

In [30]:
# Sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [31]:
# Create a new DataFrame with the Top-10 Venues for each Neighborhood
COUNT_TOP_VENUES = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(COUNT_TOP_VENUES):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
sortedNeighborhoodVenues = pd.DataFrame(columns=columns)
sortedNeighborhoodVenues['Neighborhood'] = groupedToronto['Neighborhood']

for ind in np.arange(groupedToronto.shape[0]):
    sortedNeighborhoodVenues.iloc[ind, 1:] = return_most_common_venues(groupedToronto.iloc[ind, :], COUNT_TOP_VENUES)

sortedNeighborhoodVenues.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Pharmacy,Supermarket,Coffee Shop,Diner,Brazilian Restaurant,Café,Fast Food Restaurant,Music Venue,Liquor Store,Middle Eastern Restaurant
1,Agincourt,Chinese Restaurant,Asian Restaurant,Restaurant,Peking Duck Restaurant,Cantonese Restaurant,Korean Restaurant,Bakery,Coffee Shop,Food Court,Hong Kong Restaurant
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Indian Restaurant,Women's Store,Asian Restaurant,Grocery Store,Creperie,Costume Shop,Food Stand,Food Court,Food & Drink Shop,Flower Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Spa,Flower Shop,Park,Rest Area,Restaurant,Pizza Place,Historic Site,Doctor's Office,Fish Market,Fish & Chips Shop
4,"Alderwood, Long Branch",Convenience Store,Gym,Burmese Restaurant,Construction & Landscaping,Creperie,Dance Studio,Food Truck,Food Stand,Food Court,Food & Drink Shop
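
The row-by-row loop above can also be expressed as a single array operation. The following sketch uses `numpy.argsort` on made-up frequencies; `top_categories` and the `Top N` column labels are hypothetical and differ from the notebook's column names.

```python
import numpy as np
import pandas as pd

# Made-up category frequencies for two neighborhoods
grouped = pd.DataFrame({
    'Neighborhood': ['Example A', 'Example B'],
    'Bakery': [0.2, 0.0],
    'Cafe': [0.5, 0.1],
    'Park': [0.3, 0.9],
})


def top_categories(frame, count):
    """Return one row per neighborhood with its `count` most frequent categories."""
    frequencies = frame.set_index('Neighborhood')
    # Sort each row by descending frequency; a stable sort keeps column order for ties
    order = np.argsort(-frequencies.to_numpy(), axis=1, kind='stable')[:, :count]
    names = frequencies.columns.to_numpy()[order]
    labels = ['Top {}'.format(rank + 1) for rank in range(names.shape[1])]
    return pd.DataFrame(names, index=frequencies.index, columns=labels).reset_index()


top_two = top_categories(grouped, 2)
```

With either approach, categories whose frequency is zero still fill the later positions when a neighborhood has fewer distinct categories than requested. That is likely why runs of alphabetically adjacent categories appear in the later columns of the table above.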


## Step 4: Cluster Toronto Neighborhoods

In [47]:
# Run k-means to group the neighborhoods into clusters
COUNT_K_MEANS_CLUSTERS = 7

clusteredTorontoGroups = groupedToronto.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=COUNT_K_MEANS_CLUSTERS, random_state=0).fit(clusteredTorontoGroups)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:300]

array([1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 6, 0, 1, 2, 1, 1, 1, 6, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 3, 1, 1, 1])
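
Most of the labels above are 1, so a single cluster holds the bulk of the neighborhoods. A quick way to quantify that imbalance is to count labels per cluster. The sketch below uses a short, made-up label array as a stand-in for `kmeans.labels_`.

```python
import numpy as np

# Hypothetical label array standing in for kmeans.labels_
labels = np.array([1, 1, 3, 1, 0, 1, 2, 1])
cluster_count = 4

# Number of neighborhoods assigned to each cluster id
sizes = np.bincount(labels, minlength=cluster_count)

# Cluster with the most members and the share of all rows it holds
largest = int(sizes.argmax())
share_of_largest = sizes[largest] / labels.size
```

A very high share for one cluster is a hint that a different number of clusters, or different features, may be worth trying.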

In [48]:
# Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

# Add clustering labels
labeledSortedNeighborhoodVenues = sortedNeighborhoodVenues.copy()

labeledSortedNeighborhoodVenues.insert(0, 'Cluster Labels', kmeans.labels_)

# Extend the original Neighborhood Data
mergedToronto = selectNeighborhoods.copy()

# Merge the grouped data with the neighborhood data to add latitude/longitude for each neighborhood
mergedToronto = mergedToronto.join(labeledSortedNeighborhoodVenues.set_index('Neighborhood'), on='Neighborhood')

mergedToronto.head() # Check the merged columns

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.808715,-79.197445,,,,,,,,,,,
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785203,-79.146583,6.0,Gym,Jewelry Store,Park,Women's Store,Field,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76517,-79.191117,1.0,Pizza Place,Spa,Intersection,Medical Center,Fried Chicken Joint,Fast Food Restaurant,Greek Restaurant,Thrift / Vintage Store,Rental Car Location,Breakfast Spot
3,M1G,Scarborough,Woburn,43.7673,-79.22823,3.0,Indian Restaurant,Bakery,American Restaurant,Fish & Chips Shop,Ethiopian Restaurant,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Field
4,M1H,Scarborough,Cedarbrae,43.747728,-79.235174,0.0,Playground,Grocery Store,Flower Shop,Park,Women's Store,Fast Food Restaurant,Ethiopian Restaurant,Exhibit,Falafel Restaurant,Farmers Market


### Map of Clusters

In [49]:
# Visualize the Clusters

# Create a map
map_clusters = folium.Map(location=[locationToronto.latlng[0], locationToronto.latlng[1]], zoom_start=11)

# set color scheme for the clusters
x = np.arange(COUNT_K_MEANS_CLUSTERS)
ys = [i + x + (i*x)**2 for i in range(COUNT_K_MEANS_CLUSTERS)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mergedToronto['Latitude'], mergedToronto['Longitude'], mergedToronto['Neighborhood'], mergedToronto['Cluster Labels']):
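    try:
        idCluster = int(cluster)
    except (TypeError, ValueError):
        # Neighborhoods without a cluster label (NaN after the join) are drawn as cluster 0
        idCluster = 0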
    label = folium.Popup(str(poi) + ' Cluster ' + str(idCluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[idCluster-1],
        fill=True,
        fill_color=rainbow[idCluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
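
The loop above draws neighborhoods that have no cluster label as cluster 0, which makes them indistinguishable from real members of that cluster on the map. One option is a separate fallback color. This sketch uses only the standard library; `cluster_color`, the palette, and the gray fallback are hypothetical choices.

```python
import math


def cluster_color(cluster, palette, fallback='#808080'):
    """Return the palette color for a cluster label, or `fallback` when the label is missing."""
    if cluster is None:
        return fallback
    if isinstance(cluster, float) and math.isnan(cluster):
        return fallback
    index = int(cluster)
    if 0 <= index < len(palette):
        return palette[index]
    return fallback


palette = ['#ff0000', '#00ff00', '#0000ff']  # made-up colors for three clusters
```

In the map cell, `color=cluster_color(cluster, rainbow)` could replace `rainbow[idCluster-1]`. Note that this indexes the palette by the label itself rather than by the label minus one.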

### Toronto Static Map of Neighborhood Clusters
![Toronto Static Map of Neighborhood Clusters](Toronto_Neighborhood_Map_004.png "Toronto Static Map of Neighborhood Clusters")

## Step 5: Examine Toronto Clusters

### Cluster 0

In [35]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 0, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,North York,0.0,Music Venue,Baseball Field,Fish & Chips Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Women's Store,Ethiopian Restaurant
91,Etobicoke,0.0,Baseball Field,Women's Store,Fish & Chips Shop,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish Market,Ethiopian Restaurant


### Cluster 1

In [36]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 1, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Scarborough | 1.0 | Gym | Jewelry Store | Park | Women's Store | Field | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Filipino Restaurant |
| 4 | Scarborough | 1.0 | Playground | Grocery Store | Flower Shop | Park | Women's Store | Fast Food Restaurant | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market |
| 17 | North York | 1.0 | Park | Residential Building (Apartment / Condo) | Women's Store | Eastern European Restaurant | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field |
| 23 | North York | 1.0 | Athletics & Sports | Bus Stop | Supermarket | Park | Women's Store | Filipino Restaurant | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field |
| 24 | North York | 1.0 | Mobile Phone Shop | Park | Women's Store | Filipino Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field | Fish & Chips Shop |
| 27 | North York | 1.0 | Park | Convenience Store | Liquor Store | Electronics Store | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field | Filipino Restaurant |
| 34 | North York | 1.0 | Food Stand | Park | Women's Store | Field | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Filipino Restaurant |
| 38 | East York | 1.0 | Park | Bridge | Indian Restaurant | Women's Store | Filipino Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field |
| 42 | East Toronto | 1.0 | Tennis Court | Park | Electronics Store | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field | Women's Store |
| 100 | Etobicoke | 1.0 | Gym | Sculpture Garden | Supplement Shop | Park | Women's Store | Fast Food Restaurant | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market |


### Cluster 2

In [37]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 2, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | North York | 2.0 | Electronics Store | Women's Store | Fish & Chips Shop | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field | Filipino Restaurant | Fish Market |


### Cluster 3

In [38]:
mergedToronto.loc[mergedToronto['Cluster Labels'] == 3, mergedToronto.columns[[1] + list(range(5, mergedToronto.shape[1]))]].head(10)

| | Borough | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Scarborough | 3.0 | Pizza Place | Spa | Intersection | Medical Center | Fried Chicken Joint | Fast Food Restaurant | Greek Restaurant | Thrift / Vintage Store | Rental Car Location | Breakfast Spot |
| 3 | Scarborough | 3.0 | Indian Restaurant | Bakery | American Restaurant | Fish & Chips Shop | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field |
| 5 | Scarborough | 3.0 | Coffee Shop | Pharmacy | Pizza Place | Café | Theater | Fried Chicken Joint | Butcher | Seafood Restaurant | Women's Store | Fast Food Restaurant |
| 6 | Scarborough | 3.0 | Burger Joint | Plaza | Bakery | Indian Chinese Restaurant | Intersection | Italian Restaurant | General Entertainment | Filipino Restaurant | Park | Pharmacy |
| 7 | Scarborough | 3.0 | Coffee Shop | Intersection | Diner | Convenience Store | Soccer Field | Fast Food Restaurant | Fish & Chips Shop | Falafel Restaurant | Farmers Market | Field |
| 8 | Scarborough | 3.0 | Fast Food Restaurant | Pizza Place | Furniture / Home Store | Liquor Store | Discount Store | Park | Sandwich Place | Coffee Shop | Burger Joint | Pharmacy |
| 9 | Scarborough | 3.0 | Skating Rink | Café | College Stadium | General Entertainment | Filipino Restaurant | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field | Women's Store |
| 10 | Scarborough | 3.0 | Indian Restaurant | Wings Joint | Chinese Restaurant | Latin American Restaurant | Field | Ethiopian Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant |
| 11 | Scarborough | 3.0 | Electronics Store | Miscellaneous Shop | Food Court | Auto Garage | Filipino Restaurant | Exhibit | Falafel Restaurant | Farmers Market | Fast Food Restaurant | Field |
| 12 | Scarborough | 3.0 | Chinese Restaurant | Asian Restaurant | Restaurant | Peking Duck Restaurant | Cantonese Restaurant | Korean Restaurant | Bakery | Coffee Shop | Food Court | Hong Kong Restaurant |
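
The cells in this step repeat one selection and change only the cluster label. A minimal sketch of wrapping that selection in a helper is shown here. The function name `cluster_view`, its parameters, and the `demo` dataframe are all hypothetical. The column layout of `demo` (Borough at position 1, cluster label at position 5, venue columns after it) is an assumption inferred from the `columns[[1] + list(range(5, ...))]` indexing used above, and the rows are invented.

```python
import pandas as pd

def cluster_view(df, label, label_col="Cluster Labels", first_col=5, n=10):
    """Return Borough plus the columns from `first_col` onward for one cluster."""
    # Same positional selection as the notebook cells: column 1, then 5..end.
    cols = df.columns[[1] + list(range(first_col, df.shape[1]))]
    return df.loc[df[label_col] == label, cols].head(n)

# Hypothetical stand-in for mergedToronto; all values are invented.
demo = pd.DataFrame({
    "Postal Code": ["X1A", "X1B", "X1C"],
    "Borough": ["North York", "Scarborough", "Scarborough"],
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [0.0, 0.0, 0.0],
    "Longitude": [0.0, 0.0, 0.0],
    "Cluster Labels": [0.0, 3.0, 3.0],
    "1st Most Common Venue": ["Park", "Coffee Shop", "Bakery"],
})

# Show every cluster present in the data, skipping rows with no label.
for label in sorted(demo["Cluster Labels"].dropna().unique()):
    print(f"Cluster {int(label)}")
    print(cluster_view(demo, label))
```

Selecting columns by position keeps the helper close to the original cells, but it breaks silently if the column order changes; selecting by name would be sturdier.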


## The End.