# IBM Data Science Professional Capstone Project: Scraping, Parsing Table into Pandas DF for Neighborhood Clustering

### In this three-part series of notebooks, we scrape from a Wikipedia article a table of postal codes, cities and neighborhoods in and around Toronto; we clean the data as necessary, geolocate each neighborhood and gather information regarding venues local to that neighborhood; finally, we perform a KMeans cluster analysis to identify neighborhoods sharing similarities and we visualize the result in the form of a tagged Folium map. 

### Part III of III

## Notebook I:  The Neighborhoods DataFrame (Pandas)

#### Import necessary libraries for our analysis

In [1]:
import numpy as np # library to handle data in a vectorized manner

# install beautifulsoup4 if it is not already installed on your system
import bs4  # beautifulsoup4 will be used for parsing the scraped HTML
from bs4 import BeautifulSoup

import requests # library to handle requests
import json 

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe


import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not yet installed on your system
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not yet installed on your system
import folium # map rendering library

print('Libraries imported.') 


Libraries imported.


#### Retrieve the Wikipedia article containing the Canadian regional Postal Code table; retrieve and stringify the HTML document using BeautifulSoup; and do a preliminary manipulation of the object to isolate the table

In [2]:
       
def parse_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup
    
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
souper = parse_url(url)
type(souper)
     

bs4.BeautifulSoup

#### A quick inspection of the object reveals the table of interest (delimited by 'tbody' tags).  The table comprises rows (delimited by 'tr' tags); the first row holds the column headings (delimited by 'th' tags) and the remaining rows hold the data elements (delimited by 'td' tags).  Isolating the rows, we find a single row of column headings followed by 289 rows of data.  Using Pandas, the BeautifulSoup Postal Code table object will be moved into a DataFrame ('pc_df') for further processing.  Finally, as requested in the exercise, we will delete the rows for which the Borough is 'Not assigned' and, for any Neighbourhood that is 'Not assigned', substitute the name of its Borough.

In [3]:
raw_table = souper.tbody
rows = raw_table.find_all('tr')
n_rows = len(rows)

header_row = rows[0].find_all('th')
n_cols = len(header_row)
col_names = list()
[col_names.extend(header_row[col]) for col in range(n_cols)]
col_names[-1] = col_names[-1][:-1]

pc_df= pd.DataFrame(columns=col_names, index=range(0,len(rows)))
for i in range(1,n_rows):
    row_data = rows[i].find_all('td')
    row_data = [row_data[0].text, row_data[1].text, (row_data[2].text)[:-1]]
    for j in range(0,3):
        pc_df.iat[i,j] = row_data[j]
pc_df = pc_df[pc_df.Postcode.notnull()]
pc_df=pc_df.reset_index(drop=True)
pc_clean = pc_df[pc_df.Borough != 'Not assigned']
pc_clean=pc_clean.reset_index(drop=True)

# A spot check shows that at least one Borough has an unassigned Neighbourhood:  we will go through the dataframe and 
# assign to that and any other similarly unassigned Neighborhood the name of its respective Borough:

for indx in range(0,len(pc_clean)):
    if pc_clean.loc[indx,'Neighbourhood']=='Not assigned': 
        pc_clean.loc[indx,'Neighbourhood']=pc_clean.loc[indx,'Borough']
        print(pc_clean.loc[indx,'Neighbourhood'], ' had a "Not assigned" neighborhood, which now has been assigned the name of the Borough.')
        
print('pc_clean.shape is: ',pc_clean.shape)


Queen's Park  had a "Not assigned" neighborhood, which now has been assigned the name of the Borough.
pc_clean.shape is:  (212, 3)
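The same two cleaning steps (dropping rows whose Borough is 'Not assigned', then filling unassigned Neighbourhoods with the Borough name) can also be written without an explicit loop. The following is only a minimal sketch on a small hand-made frame: the rows in `toy` are illustrative and are not the scraped table, and `toy_clean` is a name introduced here.

```python
import pandas as pd

# Illustrative rows only; the column names follow pc_df above.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Not assigned'],
})

# Step 1: drop rows with no assigned Borough.
toy_clean = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Step 2: where the Neighbourhood is unassigned, substitute the Borough name.
unassigned = toy_clean['Neighbourhood'] == 'Not assigned'
toy_clean.loc[unassigned, 'Neighbourhood'] = toy_clean.loc[unassigned, 'Borough']

print(toy_clean)
```

Working with a boolean mask avoids iterating over index labels and behaves the same way regardless of how the index is numbered.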


#### As requested in the exercise, for each Postal Code/Borough pair having multiple Neighborhoods, we will collapse that group of Neighborhoods into a single entry by concatenating their names. As a preliminary step, however, we will check whether any Postal Code/Neighborhood pair has more than one Borough: if it did, we would subgroup the Neighborhoods of that Postcode by Borough in order to avoid losing that information.  We find in the following that no Neighborhood within any Postcode has been assigned more than one Borough.


In [4]:
pc_clean_groupby_neighbourhood=pc_clean.groupby(by=['Neighbourhood'],axis=0)
print(pc_clean_groupby_neighbourhood.size().sum(), pc_clean.shape[0])

212 212
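Since the sizes of the groups produced by a groupby on a column without missing values always add up to the number of rows, a count of distinct Boroughs per Postcode may be a more direct test of the question posed above. A minimal sketch, assuming the same column names as 'pc_clean' and using illustrative rows (`toy`, `boroughs_per_postcode` and `ambiguous` are names introduced here):

```python
import pandas as pd

# Illustrative rows only; the column names follow pc_clean above.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights', 'Lawrence Manor'],
})

# Number of distinct Boroughs recorded for each Postcode.
boroughs_per_postcode = toy.groupby('Postcode')['Borough'].nunique()

# Postcodes that map to more than one Borough (empty if the mapping is unique).
ambiguous = boroughs_per_postcode[boroughs_per_postcode > 1]
print('Postcodes with more than one Borough:', list(ambiguous.index))
```

An empty list here would support grouping by Postcode alone; a non-empty list would call for the subgrouping by Borough mentioned above.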


#### Now, we group 'pc_clean' by Borough and Postcode, resulting in a group of neighborhoods ('grp_membs') for each tuple (Borough, Postcode); then we construct a Python list, each member in which is a dictionary comprising a Postal Code, a Borough and the concatenated group of neighborhood names assigned to that Postcode/Borough pair.  The resulting list is converted to a Pandas DataFrame, <i>'canada_nhds'</i>:

In [5]:
# For Notebook II, we also collect a three-tuple comprising Borough, PostalCode, and a single Neighbourhood ('Target') for each
# Borough/PostalCode pair:  this facilitates formulating get requests to the Google Maps Geocoder API, below.

pc_clean_grouped = pc_clean.groupby(by=['Borough', 'Postcode'], axis=0)
frame=[]
first_neigh=[]
for name, group in pc_clean_grouped:
    grp_membs = pc_clean_grouped.get_group(name)
    neighs = ''
    for i in range(len(grp_membs)):
        neighs = neighs + grp_membs.iloc[i,2] +', '
        if i==0: 
            first_neigh.append({'Borough':list(name)[0], 'PostalCode':list(name)[1], 'Target': neighs[:-2]})
            
    neighs = neighs[:-2]
    frame.append({'Borough':list(name)[0],'PostalCode':list(name)[1], 'Neighbourhoods':neighs})

frame_df= pd.DataFrame(frame)
firstn_df = pd.DataFrame(first_neigh)
firstn_df.iloc[100,2]= 'Silverthorn'   #See note(*), below.

canada_nhds=frame_df[['PostalCode', 'Borough', 'Neighbourhoods']]
canada_nhds.head()
    

| | PostalCode | Borough | Neighbourhoods |
|---|---|---|---|
| 0 | M4N | Central Toronto | Lawrence Park |
| 1 | M4P | Central Toronto | Davisville North |
| 2 | M4R | Central Toronto | North Toronto West |
| 3 | M4S | Central Toronto | Davisville |
| 4 | M4T | Central Toronto | Moore Park, Summerhill East |
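The collapsing of Neighbourhoods per Postcode/Borough pair can also be expressed with a single groupby aggregation. This is a sketch of an alternative, not the method used in the cell above; `toy` and `collapsed` are names introduced here, and the three rows reuse neighbourhood names from the output shown.

```python
import pandas as pd

# Three rows in the shape of pc_clean (one Neighbourhood per row).
toy = pd.DataFrame({
    'Postcode': ['M4T', 'M4T', 'M4S'],
    'Borough': ['Central Toronto', 'Central Toronto', 'Central Toronto'],
    'Neighbourhood': ['Moore Park', 'Summerhill East', 'Davisville'],
})

# Join the Neighbourhood names of each (Postcode, Borough) group with ', '.
collapsed = (
    toy.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
       .agg(', '.join)
       .rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighbourhoods'})
)
print(collapsed)
```

Note that this version does not build the separate 'Target' column of first neighbourhoods; that would need an additional `.first()` aggregation on the same groups.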


#### <i> *Note:  This resolves a later ambiguity regarding the lat/lng of Del Ray. In the course of this examination, it became clear that the Google Maps API response for Del Ray was erroneous: it returned location information for the neighborhood of Del Ray, Virginia, USA, rather than Del Ray, Toronto. For that reason, I substituted another of the neighborhoods in the same cell for purposes of determining the coordinates, below.</i>

#### As requested in the exercise, the shape of the final dataframe, "canada_nhds", is determined:

In [6]:
print('canada_nhds.shape :', canada_nhds.shape)

canada_nhds.shape : (103, 3)


## Notebook II:  Obtaining Latitude, Longitude for Each PostalCode/Borough/Neighbourhoods Tuple in 'canada_nhds'

#### In this notebook, I use the Google Maps Geocoder API to obtain a latitude and longitude for each of the PostalCode/Borough/Neighbourhoods entries in canada_nhds.  Although the exercise suggested the Python module 'geocoder', after inspecting its documentation and experimenting with its code I was unable to formulate a location request to which the API would respond correctly.  The API itself, on the other hand, was easy to address directly and worked well for the purpose.  Accordingly, the desired data was obtained by calling the Google Maps Geocoder API directly.


In [7]:
import requests


In [8]:
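# @hidden_cell
import os
KEY = os.environ['GOOGLE_MAPS_API_KEY']  # read the key from the environment instead of hard-coding it (the variable name is an assumption)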


In the cell above, I have (hopefully) hidden the API key 'KEY', which I use in the calls to googleapis.

#### In the following cell, I build a list of dictionaries ('local_info').  For each row of firstn_df, I construct an address from 'Target', Borough and PostalCode, call the geocode API of Google Maps, pull the latitude and longitude out of the response, and store them in a dictionary together with the row's PostalCode, Borough and Neighbourhoods.  That list is then used to form the final dataframe, 'can_neigh_locs', for further exploration in Notebook III.

In [9]:
local_info=[]
geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json'

for i in range (0,canada_nhds.shape[0]):
    loc_search = "{}, {}, {}".format(firstn_df.loc[i,'Target'], firstn_df.loc[i, 'Borough'], firstn_df.loc[i,'PostalCode'])
    # pass the address and key as query parameters so that requests URL-encodes them
    response = requests.get(geocode_url, params={'address': loc_search, 'key': KEY}).json()
    # take the coordinates of the first match returned
    location = response['results'][0]['geometry']['location']
    local_info.append({'PostalCode':firstn_df.loc[i,'PostalCode'],'Borough':firstn_df.loc[i,'Borough'], 
                    'Neighbourhoods': canada_nhds.loc[i,'Neighbourhoods'], 
                    'Latitude': location['lat'], 
                    'Longitude': location['lng']})

local_info_df = pd.DataFrame(local_info)
can_neigh_locs = local_info_df[['PostalCode', 'Borough', 'Neighbourhoods', 'Latitude', 'Longitude']]
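The cell above indexes straight into `response['results'][0]`, which raises an error if a lookup returns no match. A small helper along the following lines could make the decomposition step more tolerant. It is a sketch only: `extract_lat_lng` is a hypothetical name, the two sample dictionaries are hand-written stand-ins for API responses, and it assumes the response carries a `status` field next to `results`.

```python
def extract_lat_lng(geocode_json):
    """Return (lat, lng) from a geocode-style response dict, or (None, None)."""
    results = geocode_json.get('results', [])
    # Assumption: a successful lookup is flagged with status 'OK'.
    if geocode_json.get('status') != 'OK' or not results:
        return None, None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-written sample responses (not real API output).
ok_response = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 43.806686, 'lng': -79.194353}}}],
}
empty_response = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_lat_lng(ok_response))
print(extract_lat_lng(empty_response))
```

Rows that come back as `(None, None)` can then be inspected by hand, much as was done for the Del Ray entry.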

In [162]:
expl=can_neigh_locs.sort_values(['PostalCode'], ascending=True).reset_index(drop=True)
print(expl.shape)
expl.head()

(103, 5)


| | PostalCode | Borough | Neighbourhoods | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.78658 | -79.188292 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.752743 | -79.192777 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [11]:
print('latitude max/min: ', can_neigh_locs['Latitude'].max(), can_neigh_locs['Latitude'].min(),\
      '\nlongitude max/min: ', can_neigh_locs['Longitude'].max(), can_neigh_locs['Longitude'].min())

latitude max/min:  43.8152522 43.6035413 
longitude max/min:  -79.165842 -79.6156513


## Notebook III:  Exploring and Clustering the Neighborhoods in Toronto

### Introduction

#### In this Notebook III I explore the neighborhoods of the City of Toronto in more detail.  As a preliminary step, I visualize the dataframe developed in Notebooks I and II using the Folium mapping service to get a sense of the geographical locations of the neighborhoods:

In [120]:
# create map of Toronto centered on the latitude and longitude stored in row 53 of can_neigh_locs:

M5A_location = can_neigh_locs.loc[53,['Latitude','Longitude']]
map_toronto = folium.Map(location=[M5A_location['Latitude'],M5A_location['Longitude']], zoom_start=11) 
map_toronto

In [121]:
# Now let's add neighborhood markers to the map

for lat, lng, borough, neighbourhoods, zone in zip(can_neigh_locs['Latitude'], can_neigh_locs['Longitude'], can_neigh_locs['Borough'],\
                                                  can_neigh_locs['Neighbourhoods'], can_neigh_locs['PostalCode']):
    label = '{} [{}, {}]'.format(neighbourhoods, borough, zone )
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Based upon the above, I researched and determined that the Canada Post Gateway Processing Centre is in Mississauga, outside of Toronto, and is not relevant to my further inquiry. I decided to do some high-level evaluation of the remaining areas before focusing my exploration more locally.  Using Foursquare's 'Venues' API, I explored the venues within each of twenty-eight geographical cells in a 4x7 grid covering all of the previously identified neighborhoods and decided to focus on a smaller rectangular area, which is described after the grid map below.

#### Define Foursquare Credentials and Version

In [122]:
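# # @hidden_cell
import os

# Read the Foursquare credentials from the environment instead of hard-coding them
# (the environment variable names are assumptions).
CLIENT_ID=os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET=os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION=20180605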


In [123]:
from itertools import product

#### Since Foursquare offers in its 'Places' API an option of exploring venues within a rectangular geographic cell determined by the latitude and longitude of the southwest and northeast corners, let's create the grid and store the sw/ne corner coordinates for each cell in a list of dictionaries ('cell_coords'):

In [124]:
lat_northmost = can_neigh_locs['Latitude'].max()
lat_southmost = can_neigh_locs['Latitude'].min()
lng_westmost = can_neigh_locs['Longitude'].min()
lng_eastmost = can_neigh_locs['Longitude'].max()
print('lat_southmost: ', lat_southmost, 'lng_westmost: ', lng_westmost,
'\nlat_northmost: ', lat_northmost, 'lng_eastmost: ', lng_eastmost)

lat_southmost:  43.6035413 lng_westmost:  -79.6156513 
lat_northmost:  43.8152522 lng_eastmost:  -79.165842


In [125]:
grid_rows=15
grid_cols=25
cornerpoint=[]
xx=np.linspace(lat_southmost, lat_northmost, grid_rows+1)
yy=np.linspace(lng_westmost, lng_eastmost, grid_cols+1)
print('lat gridmarks: ', xx, '\n\nlng gridmarks: ', yy)

lat gridmarks:  [43.6035413  43.61765536 43.63176942 43.64588348 43.65999754 43.6741116
 43.68822566 43.70233972 43.71645378 43.73056784 43.7446819  43.75879596
 43.77291002 43.78702408 43.80113814 43.8152522 ] 

lng gridmarks:  [-79.6156513  -79.59765893 -79.57966656 -79.56167418 -79.54368181
 -79.52568944 -79.50769707 -79.4897047  -79.47171232 -79.45371995
 -79.43572758 -79.41773521 -79.39974284 -79.38175046 -79.36375809
 -79.34576572 -79.32777335 -79.30978098 -79.2917886  -79.27379623
 -79.25580386 -79.23781149 -79.21981912 -79.20182674 -79.18383437
 -79.165842  ]


In [126]:
corners = product(xx, yy)
for corner in corners:
    cornerpoint.append(corner)
# cornerpoint
i=0
j=0
cell_coords=[]
while i <=(grid_cols)*(grid_rows)+2:
    
    if (i+1)%(grid_cols+1)==0:
        i=i+1
    else:
#         print('cell number: ',j)
#         print('sw, ne corner indices of cornerpoint: ', cornerpoint[i], cornerpoint[i+grid_cols+2])
        sw_lat = cornerpoint[i][0]
        sw_lng = cornerpoint[i][1]
        ne_lat = cornerpoint[i+grid_cols+2][0]
        ne_lng = cornerpoint[i+grid_cols+2][1]
        cell_coords.append({'cell no.': j, 'sw': (sw_lat, sw_lng), 'ne':(ne_lat, ne_lng)})
        i=i+1
        j=j+1
cell_coords[2]['sw']
len(cell_coords)

364
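A 15 x 25 grid has 375 cells, whereas the loop above reports 364; its stopping condition appears to end partway through the last row of cells. The following sketch builds the cells from row and column indices directly, so no flat index arithmetic is needed. `build_cells` is a hypothetical helper introduced here, and the bounds passed to it are the max/min values printed earlier.

```python
from itertools import product

import numpy as np


def build_cells(lat_min, lat_max, lng_min, lng_max, n_rows, n_cols):
    """Return one dict per grid cell with its sw and ne corner coordinates."""
    lats = np.linspace(lat_min, lat_max, n_rows + 1)
    lngs = np.linspace(lng_min, lng_max, n_cols + 1)
    cells = []
    # Walk the grid row by row; cell (r, c) spans gridmarks r..r+1 and c..c+1.
    for cell_no, (r, c) in enumerate(product(range(n_rows), range(n_cols))):
        cells.append({'cell no.': cell_no,
                      'sw': (lats[r], lngs[c]),
                      'ne': (lats[r + 1], lngs[c + 1])})
    return cells


cells = build_cells(43.6035413, 43.8152522, -79.6156513, -79.165842, 15, 25)
print(len(cells))
```

The cell numbering runs west to east within a row and south to north across rows, matching the order in which 'cornerpoint' is generated.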

#### Let's start by redrawing the map of Toronto, superimposing the corners of the geographical cells on the prior map; then let's explore each of the cells a little bit:

In [127]:
# split the corner points into separate latitude and longitude lists
cornerlats = [pt[0] for pt in cornerpoint]
cornerlngs = [pt[1] for pt in cornerpoint]
for i in range(len(cornerlats)):
    lat=cornerlats[i]
    lng=cornerlngs[i]
#     print(lat,lng)
    label = '{}, {}'.format(lat,lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=2,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto) 
map_toronto  # Here the grid corners are indicated in red: click on any one of them to see its coordinates

### Based upon a brief review of public information on Toronto, and looking at the grid, I first focused on the rectangular area bounded by the waterfront on the south, Spadina Road on the west, Bloor Street West on the north and Yonge Street on the east.  However, that area provided insufficient diversification for my study, so I've expanded it to the rectangular area with its southwest corner at coordinates (43.6315, -79.4725) and its northeast corner at coordinates (43.6890, -79.3640).  It stretches all of the way from the waterside to portions of York and so should cover a variety of socioeconomic and ethnic strata.  I first will find the locations in our dataframe 'can_neigh_locs' within that area and then explore a few of the venues close to those neighborhoods. One caveat: in order to minimize overlapping search areas, I will keep the searches close to the coordinates found for each dataframe record.  In each case, that will be close to the first neighborhood of the group served by each postal code:  this is obviously a bit arbitrary but should still provide useful information.


In [130]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### From the prior investigation of Manhattan neighborhoods, we know that all the information for venues near the neighborhoods in each PostalCode/Borough area can be obtained from a call to the 'venues/explore' endpoint of the Foursquare service.  We also know that the desired information will be under the *items* key of the results from that call.  I will use the same approach here as was used for the prior investigation.  Since each PostalCode in our dataframe 'can_neigh_locs' uniquely identifies a Borough and a group of one or more neighborhoods, and has associated with it geographic coordinates based upon a neighborhood within its group, we will call Foursquare venues/explore for each PostalCode within our area of interest.

In [131]:
def getNearbyVenues(names, latitudes, longitudes, radius, limit):
    
    venues_list=[]
    i=0
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
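The flattening step inside `getNearbyVenues` can be tried offline on a hand-written list shaped like the *items* list described above. This is a sketch under assumptions: `venue_rows` is a hypothetical helper, `sample_items` is not real Foursquare output (the second venue is invented to show an empty category list), and a missing category is returned as `None` instead of raising an `IndexError`.

```python
def venue_rows(postal_code, lat, lng, items):
    """Flatten explore-style items into tuples matching the nearby_venues columns."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Guard against venues that carry no category at all.
        category = categories[0]['name'] if categories else None
        rows.append((postal_code, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-written stand-in for response['groups'][0]['items'].
sample_items = [
    {'venue': {'name': 'Cranberries',
               'location': {'lat': 43.667843, 'lng': -79.369407},
               'categories': [{'name': 'Diner'}]}},
    {'venue': {'name': 'Hypothetical Venue',
               'location': {'lat': 43.6680, 'lng': -79.3690},
               'categories': []}},
]

rows = venue_rows('M4X', 43.667967, -79.367675, sample_items)
for row in rows:
    print(row)
```

Separating the parsing from the HTTP call also makes it possible to cache the raw responses and re-run the analysis without spending further API quota.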

In [132]:
radius=250
limit=50
min_pc_lat = 43.6315
max_pc_lat = 43.6890
min_pc_lng = -79.4725
max_pc_lng = -79.3640

In [133]:
interesting_pc = can_neigh_locs[((can_neigh_locs['Latitude'] >= min_pc_lat) & (can_neigh_locs['Latitude'] <= max_pc_lat)) &\
                                ((can_neigh_locs['Longitude'] >= min_pc_lng) & (can_neigh_locs['Longitude'] <= max_pc_lng))]
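The bounding-box filter above can be written a little more compactly with `Series.between`, which is inclusive at both ends by default. A minimal sketch on two rows taken from the outputs shown in this notebook (`toy`, `in_box` and `inside` are names introduced here):

```python
import pandas as pd

# Two rows in the shape of can_neigh_locs: one inside the box, one outside.
toy = pd.DataFrame({
    'PostalCode': ['M4W', 'M1B'],
    'Latitude': [43.679563, 43.806686],
    'Longitude': [-79.377529, -79.194353],
})

# Same bounds as min_pc_lat/max_pc_lat and min_pc_lng/max_pc_lng.
in_box = (toy['Latitude'].between(43.6315, 43.6890)
          & toy['Longitude'].between(-79.4725, -79.3640))
inside = toy[in_box]
print(inside)
```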


#### Based upon the geographic limitations that I've imposed (discussed above), we will be looking at the neighborhoods in twenty-eight PostalCodes:

In [134]:
print(type(interesting_pc),interesting_pc.shape)
interesting_pc.sort_values('PostalCode').reset_index(drop=True)

<class 'pandas.core.frame.DataFrame'> (28, 5)


| | PostalCode | Borough | Neighbourhoods | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 |
| 1 | M4X | Downtown Toronto | Cabbagetown, St. James Town | 43.667967 | -79.367675 |
| 2 | M4Y | Downtown Toronto | Church and Wellesley | 43.66586 | -79.38316 |
| 3 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.640552 | -79.378937 |
| 4 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657658 | -79.378802 |
| 5 | M5C | Downtown Toronto | St. James Town | 43.670867 | -79.373306 |
| 6 | M5E | Downtown Toronto | Berczy Park | 43.647985 | -79.375225 |
| 7 | M5G | Downtown Toronto | Central Bay Street | 43.657298 | -79.384364 |
| 8 | M5H | Downtown Toronto | Adelaide, King, Richmond | 43.647713 | -79.390892 |
| 9 | M5J | Downtown Toronto | Harbourfront East, Toronto Islands, Union Station | 43.640552 | -79.378937 |


#### Let's do the call!!

In [135]:
first_venues=pd.DataFrame(getNearbyVenues(interesting_pc['PostalCode'],interesting_pc['Latitude'],interesting_pc['Longitude'], radius, limit))

In [136]:
print(first_venues.shape)
first_venues.head()

(683, 7)


| | PostalCode | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | M5R | 43.67271 | -79.405678 | Country Style | 43.674527 | -79.407143 | Coffee Shop |
| 1 | M4W | 43.679563 | -79.377529 | Park Drive Reservation Lands | 43.679822 | -79.377787 | Park |
| 2 | M4W | 43.679563 | -79.377529 | Mooredale House | 43.678631 | -79.380091 | Building |
| 3 | M4X | 43.667967 | -79.367675 | Cranberries | 43.667843 | -79.369407 | Diner |
| 4 | M4X | 43.667967 | -79.367675 | Butter Chicken Factory | 43.667072 | -79.369184 | Indian Restaurant |


#### To help understand the results, I've first grouped them by venue and location.  This helps to examine the extent of geographic overlap of samples but also reveals significant duplication of listings for particular venues, evident from location coordinates that are very close but not exactly identical (e.g., two Pizza Pizzas within a few thousandths of a degree, or the large number of Tim Hortons locations apparently neighboring one another).  A more detailed study of this, and its significance in evaluating the reliability of the Foursquare database, would be of value.
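One rough way to flag such near-duplicate listings is to round the venue coordinates and count venues that share a name and a rounded location. The sketch below is an assumption-laden illustration, not part of the original analysis: the coordinates in `toy` are invented, and rounding to three decimals is an arbitrary choice that can miss pairs falling on either side of a rounding boundary.

```python
import pandas as pd

# Invented rows in the shape of first_venues (coordinates are illustrative).
toy = pd.DataFrame({
    'PostalCode': ['M5K', 'M5L', 'M5X'],
    'Venue': ['Pizza Pizza', 'Pizza Pizza', 'Cranberries'],
    'Venue Latitude': [43.64712, 43.64748, 43.667843],
    'Venue Longitude': [-79.38118, -79.38142, -79.369407],
})

# Round coordinates to three decimals (a cell of roughly a hundred metres).
rounded = toy.assign(lat_r=toy['Venue Latitude'].round(3),
                     lng_r=toy['Venue Longitude'].round(3))

# Count listings that share a venue name and a rounded location.
counts = rounded.groupby(['Venue', 'lat_r', 'lng_r']).size()
suspects = counts[counts > 1]
print(suspects)
```

Whether a flagged pair is a genuine duplicate or two nearby branches still has to be judged by hand.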

#### Let's look at the results on a venue by venue basis:

In [137]:
# first_venues.groupby(['Venue','Venue Latitude','Venue Longitude', 'PostalCode']).count()

#### Not surprisingly, PostalCodes whose neighborhoods are in very close proximity show a strong correlation:  to the extent relevant to future analysis, I will treat M5K, M5L and M5X as a single zone.

In [138]:
# first_venues.groupby(['Venue Category']).count().sort_values(['PostalCode'], ascending=False)

In [139]:
# pc_frequency = first_venues.sort_values(['PostalCode']).groupby(['PostalCode']).count()
# pc_frequency

In [140]:
print('There are {} unique categories.'.format(len(first_venues['Venue Category'].unique())))

There are 158 unique categories.


## 3. Analyze Each PostalCode/Borough

#### In this portion of the exercise, I will closely follow the work that was done in the Manhattan neighborhood study.  For this, I will continue to look only at the PostalCode/Borough areas and related neighbourhoods that were studied in the preceding section.

In [141]:
# one hot encoding
toronto_onehot = pd.get_dummies(first_venues[['Venue Category']], prefix="", prefix_sep="")

# add PostalCode column back to dataframe
toronto_onehot['PostalCode'] = first_venues['PostalCode'] 

# move PostalCode column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(683, 159)


Unnamed: 0,PostalCode,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Café,Cheese Shop,Chinese Restaurant,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Convention Center,Creperie,Cuban Restaurant,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Dumpling Restaurant,Ethiopian Restaurant,Event Space,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Flea Market,Flower Shop,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gaming Cafe,Garden,Gastropub,General Entertainment,General Travel,Gluten-free Restaurant,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,History Museum,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Light Rail Station,Lounge,Mac & Cheese Joint,Malay Restaurant,Market,Martial Arts Dojo,Mattress Store,Mediterranean Restaurant,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,Neighborhood,New American Restaurant,Nightclub,Noodle House,Optical Shop,Other Great Outdoors,Outdoor Sculpture,Park,Performing Arts Venue,Pet Store,Pharmacy,Piano Bar,Pizza Place,Plaza,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Seafood Restaurant,Shipping Store,Shopping Mall,Soup Place,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Sushi Restaurant,Taco Place,Tailor Shop,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M5R,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M4W,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,M4W,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,M4X,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,M4X,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by PostalCode, taking the mean of the frequency of occurrence of each category

In [142]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,PostalCode,Adult Boutique,American Restaurant,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Café,Cheese Shop,Chinese Restaurant,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Convention Center,Creperie,Cuban Restaurant,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Dumpling Restaurant,Ethiopian Restaurant,Event Space,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Flea Market,Flower Shop,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Furniture / Home Store,Gaming Cafe,Garden,Gastropub,General Entertainment,General Travel,Gluten-free Restaurant,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,History Museum,Hotel,Hotel Bar,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Light Rail Station,Lounge,Mac & Cheese Joint,Malay Restaurant,Market,Martial Arts Dojo,Mattress Store,Mediterranean Restaurant,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Movie Theater,Museum,Music Venue,Neighborhood,New American Restaurant,Nightclub,Noodle House,Optical Shop,Other Great Outdoors,Outdoor Sculpture,Park,Performing Arts Venue,Pet Store,Pharmacy,Piano Bar,Pizza Place,Plaza,Pub,Ramen Restaurant,Record Shop,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,School,Seafood Restaurant,Shipping Store,Shopping Mall,Soup Place,Souvlaki Shop,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Storage Facility,Sushi Restaurant,Taco Place,Tailor Shop,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M4W,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4X,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.05,0.05,0.0,0.0,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M4Y,0.028571,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.028571,0.0,0.085714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.028571,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.028571,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.0,0.028571,0.0,0.028571,0.0,0.0,0.028571,0.028571,0.0,0.0
3,M5A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.266667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.066667,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M5B,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.02381,0.095238,0.0,0.02381,0.0,0.071429,0.0,0.095238,0.02381,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.02381,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.071429,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.02381,0.02381,0.02381,0.0,0.0,0.02381,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.02381,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Let's confirm the new size

In [143]:
toronto_grouped.shape

(27, 159)

#### Let's print each PostalCode along with the top 5 most common venues

In [144]:
num_top_venues = 5

for pcode in toronto_grouped['PostalCode']:
    print("----"+pcode+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == pcode].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4W----
                     venue  freq
0                 Building   0.5
1                     Park   0.5
2           Adult Boutique   0.0
3  New American Restaurant   0.0
4      Monument / Landmark   0.0


----M4X----
            venue  freq
0      Restaurant  0.15
1            Café  0.10
2     Pizza Place  0.10
3     Coffee Shop  0.10
4  Ice Cream Shop  0.05


----M4Y----
                 venue  freq
0         Burger Joint  0.09
1  Japanese Restaurant  0.06
2                 Park  0.03
3          Coffee Shop  0.03
4   Salon / Barbershop  0.03


----M5A----
           venue  freq
0    Coffee Shop  0.27
1            Gym  0.07
2     Steakhouse  0.07
3  Boat or Ferry  0.07
4           Lake  0.03


----M5B----
                       venue  freq
0                       Café  0.10
1                Coffee Shop  0.10
2  Middle Eastern Restaurant  0.07
3             Clothing Store  0.07
4             Sandwich Place  0.05


----M5C----
                 venue  freq
0             Pharmacy  0

#### Let's put that into a *pandas* dataframe

First, let's write a function that sorts a row's venue categories in descending order of frequency and returns the names of the top ones.

In [145]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
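
One caveat is visible in the listings above: M4W has only two categories with a non-zero frequency (Park and Building), so any longer "top N" list for that row is padded with categories whose frequency is 0, in whatever order the sort leaves the ties. The sketch below is a hypothetical variant that keeps only categories that actually occur. The name `top_nonzero_venues` and the toy row are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues category names with frequency > 0, most frequent first."""
    freqs = row.iloc[1:].astype(float)   # skip the leading PostalCode entry
    freqs = freqs[freqs > 0]             # drop categories that never occur in this row
    return list(freqs.sort_values(ascending=False).index[:num_top_venues])

# toy row standing in for one row of toronto_grouped (values are made up)
toy_row = pd.Series({"PostalCode": "X0X", "Bakery": 0.0, "Café": 0.25, "Park": 0.75, "Zoo": 0.0})
top = top_nonzero_venues(toy_row, 3)
print(top)
```

Because the result can be shorter than `num_top_venues`, it would need padding (for example with `None`) before being written into a fixed-width dataframe row.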

Now let's create the new dataframe and display the top 10 venues for each PostalCode.

In [146]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions past the 3rd have no entry in indicators, so use 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
PostalCode_venues_sorted = pd.DataFrame(columns=columns)
PostalCode_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    PostalCode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

PostalCode_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Park,Building,Yoga Studio,Ethiopian Restaurant,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space
1,M4X,Restaurant,Café,Coffee Shop,Pizza Place,Bakery,Market,Beer Store,Diner,Indian Restaurant,Japanese Restaurant
2,M4Y,Burger Joint,Japanese Restaurant,Adult Boutique,Piano Bar,Salon / Barbershop,Breakfast Spot,Restaurant,Bubble Tea Shop,Ramen Restaurant,Ethiopian Restaurant
3,M5A,Coffee Shop,Gym,Boat or Ferry,Steakhouse,Japanese Restaurant,Bubble Tea Shop,Food Court,Fast Food Restaurant,Lake,Pizza Place
4,M5B,Café,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Movie Theater,Sandwich Place,College Rec Center,Burrito Place,Restaurant,Ramen Restaurant
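
The column labels above come from a three-element suffix list with a fallback to "th". That is correct for the first 20 positions, but position 21 would be labelled "21th". A minimal sketch of a general helper, assuming longer lists might be wanted later (the function name `ordinal` is hypothetical):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:              # 11th, 12th, 13th and the other teens take 'th'
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# rebuild the same column labels used for the sorted-venues dataframe
columns = ["PostalCode"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[:4])
```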


## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 10 clusters.

In [147]:
# set number of clusters
kclusters = 10

# keep only the numeric frequency columns; recent pandas versions require axis as a keyword
toronto_grouped_clustering = toronto_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([4, 6, 6, 0, 6, 7, 6, 6, 6, 0, 6, 6, 1, 6, 6, 6, 6, 6, 9, 5, 6, 8,
       2, 6, 6, 3, 0], dtype=int32)
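
A quick tally of the labels shows how unevenly the postal codes are distributed across the clusters. The sketch below simply copies the label array printed above into a plain list and counts it; in the notebook itself the same count could be taken from `kmeans.labels_` directly.

```python
from collections import Counter

# labels copied from the kmeans.labels_ output above
labels = [4, 6, 6, 0, 6, 7, 6, 6, 6, 0, 6, 6, 1, 6, 6, 6, 6, 6, 9, 5, 6, 8,
          2, 6, 6, 3, 0]

sizes = Counter(labels)
for cluster, count in sorted(sizes.items()):
    print(f"cluster {cluster}: {count} postal code(s)")
```

Sixteen of the 27 postal codes fall into cluster 6, three into cluster 0, and the remaining eight clusters hold one postal code each.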

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [148]:
# add clustering labels

PostalCode_venues_sorted.drop(['Cluster Labels'], axis = 1, inplace = True, errors = 'ignore')
PostalCode_venues_sorted.insert(0,'Cluster Labels', kmeans.labels_)

pc_merged = interesting_pc
pc_merged = pc_merged.join(PostalCode_venues_sorted.set_index('PostalCode'), on='PostalCode')

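# postal codes with no venue data have no label after the join; fillna(0) files them under cluster 0
pc_merged['Cluster Labels'] = pc_merged['Cluster Labels'].fillna(0).astype('int32', errors='raise')
# note: info() prints its summary itself and returns None, hence the trailing "None" in the output
print('pc_merged final',pc_merged.info())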
pc_merged

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 8 to 101
Data columns (total 16 columns):
PostalCode                28 non-null object
Borough                   28 non-null object
Neighbourhoods            28 non-null object
Latitude                  28 non-null float64
Longitude                 28 non-null float64
Cluster Labels            28 non-null int32
1st Most Common Venue     27 non-null object
2nd Most Common Venue     27 non-null object
3rd Most Common Venue     27 non-null object
4th Most Common Venue     27 non-null object
5th Most Common Venue     27 non-null object
6th Most Common Venue     27 non-null object
7th Most Common Venue     27 non-null object
8th Most Common Venue     27 non-null object
9th Most Common Venue     27 non-null object
10th Most Common Venue    27 non-null object
dtypes: float64(2), int32(1), object(13)
memory usage: 3.6+ KB
pc_merged final None


Unnamed: 0,PostalCode,Borough,Neighbourhoods,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,1,Coffee Shop,Yoga Studio,Ethiopian Restaurant,Flower Shop,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space
9,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,4,Park,Building,Yoga Studio,Ethiopian Restaurant,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space
10,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,6,Restaurant,Café,Coffee Shop,Pizza Place,Bakery,Market,Beer Store,Diner,Indian Restaurant,Japanese Restaurant
11,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,6,Burger Joint,Japanese Restaurant,Adult Boutique,Piano Bar,Salon / Barbershop,Breakfast Spot,Restaurant,Bubble Tea Shop,Ramen Restaurant,Ethiopian Restaurant
12,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.640552,-79.378937,0,Coffee Shop,Gym,Boat or Ferry,Steakhouse,Japanese Restaurant,Bubble Tea Shop,Food Court,Fast Food Restaurant,Lake,Pizza Place
13,M5B,Downtown Toronto,"Ryerson, Garden District",43.657658,-79.378802,6,Café,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Movie Theater,Sandwich Place,College Rec Center,Burrito Place,Restaurant,Ramen Restaurant
14,M5C,Downtown Toronto,St. James Town,43.670867,-79.373306,7,Pharmacy,Breakfast Spot,Filipino Restaurant,Bar,Food & Drink Shop,Hotel,Cuban Restaurant,Flower Shop,Flea Market,Fish & Chips Shop
15,M5E,Downtown Toronto,Berczy Park,43.647985,-79.375225,6,Restaurant,Cocktail Bar,Hotel,Italian Restaurant,Pub,Beer Bar,Coffee Shop,Breakfast Spot,Creperie,Café
16,M5G,Downtown Toronto,Central Bay Street,43.657298,-79.384364,6,Coffee Shop,Chinese Restaurant,Café,Thai Restaurant,Italian Restaurant,Spa,Furniture / Home Store,Comic Shop,Department Store,Dessert Shop
17,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.647713,-79.390892,6,Coffee Shop,Restaurant,Bar,Hotel,Movie Theater,Gym,Italian Restaurant,Pizza Place,Event Space,Pub
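
The `info()` summary above lists 28 postal codes but only 27 with venue columns, so one postal code has no cluster label after the join, and `fillna(0)` files it under cluster 0 alongside genuinely clustered rows. A minimal sketch of one alternative, assuming a sentinel value of -1 is acceptable for "not clustered". The two small frames are hypothetical stand-ins for `interesting_pc` and `PostalCode_venues_sorted`.

```python
import pandas as pd

# hypothetical miniature versions of the two frames being joined
postal = pd.DataFrame({"PostalCode": ["M5A", "M6E"]})
venues = pd.DataFrame({
    "PostalCode": ["M5A"],
    "Cluster Labels": [0],
    "1st Most Common Venue": ["Coffee Shop"],
})

merged = postal.join(venues.set_index("PostalCode"), on="PostalCode")

# mark unmatched postal codes with -1 instead of silently assigning cluster 0
merged["Cluster Labels"] = merged["Cluster Labels"].fillna(-1).astype("int32")

# rows that were actually clustered
clustered = merged[merged["Cluster Labels"] >= 0]
print(merged)
```

With a sentinel in place, the per-cluster listings and the map can filter on `Cluster Labels >= 0` so that unclustered postal codes are not mistaken for members of a real cluster.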


Finally, let's visualize the resulting clusters on a Folium map.

In [167]:
# create map

map_clusters = folium.Map(location=[M5A_location[0],M5A_location[1]], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pc_merged['Latitude'], pc_merged['Longitude'], pc_merged['PostalCode'], pc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

#### On examination, the features used for clustering provided little differentiation until the value of *k* was raised fairly high (e.g., 10, as shown in the map and in the listings below). At that level, some interesting points do emerge. For example, cluster 0, which included the University of Toronto, a cemetery and two PostalCodes on the shore, emerged as a cluster. From this experience, two points are worth making: (i) the venue categories used by Foursquare are extremely narrow, focusing almost entirely on restaurant and entertainment venues (probably the kinds of venues that the app's users search for and rate, but of little use in differentiating neighborhoods for cluster analysis); and (ii) the bundling of neighborhoods probably reduced the sensitivity of the classification even further. If one wants to find an area with a large number of coffee shops, Foursquare may be the right tool; if one wants to find a neighborhood in which to live, Foursquare probably is not the database to use!
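
The choice of *k* discussed above was made by inspection. One common way to make that comparison more systematic is to sweep *k* and compare silhouette scores. The sketch below illustrates the technique on synthetic data only: the three blobs are an assumption standing in for `toronto_grouped_clustering`, and nothing here reproduces the notebook's results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for the venue-frequency matrix: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # silhouette is in [-1, 1]; values closer to 1 mean tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k by silhouette:", best_k)
```

On real venue data with many single-member clusters the scores are likely to be far less clear-cut, which would itself be a sign that the features carry little structure.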

In [151]:
pc_merged.loc[pc_merged['Cluster Labels'] == 0, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,M5A,0,Coffee Shop,Gym,Boat or Ferry,Steakhouse,Japanese Restaurant,Bubble Tea Shop,Food Court,Fast Food Restaurant,Lake,Pizza Place
18,M5J,0,Coffee Shop,Gym,Boat or Ferry,Steakhouse,Japanese Restaurant,Bubble Tea Shop,Food Court,Fast Food Restaurant,Lake,Pizza Place
74,M7A,0,Coffee Shop,Park,Bubble Tea Shop,Sandwich Place,Sushi Restaurant,Dumpling Restaurant,Filipino Restaurant,Field,Fast Food Restaurant,Event Space
99,M6E,0,,,,,,,,,,


In [152]:
pc_merged.loc[pc_merged['Cluster Labels'] == 1, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M5R,1,Coffee Shop,Yoga Studio,Ethiopian Restaurant,Flower Shop,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space


In [153]:
pc_merged.loc[pc_merged['Cluster Labels'] == 2, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
100,M6M,2,Park,Field,Yoga Studio,Dumpling Restaurant,Flea Market,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Event Space,Ethiopian Restaurant


In [154]:
pc_merged.loc[pc_merged['Cluster Labels'] == 3, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
96,M6R,3,Light Rail Station,Garden,Food & Drink Shop,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space,Ethiopian Restaurant


In [155]:
pc_merged.loc[pc_merged['Cluster Labels'] == 4, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,M4W,4,Park,Building,Yoga Studio,Ethiopian Restaurant,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant,Event Space


In [156]:
pc_merged.loc[pc_merged['Cluster Labels'] == 5, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
92,M6H,5,Park,Coffee Shop,Rental Car Location,Bar,Dumpling Restaurant,Flea Market,Fish & Chips Shop,Filipino Restaurant,Field,Fast Food Restaurant


In [157]:
pc_merged.loc[pc_merged['Cluster Labels'] == 6, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,M4X,6,Restaurant,Café,Coffee Shop,Pizza Place,Bakery,Market,Beer Store,Diner,Indian Restaurant,Japanese Restaurant
11,M4Y,6,Burger Joint,Japanese Restaurant,Adult Boutique,Piano Bar,Salon / Barbershop,Breakfast Spot,Restaurant,Bubble Tea Shop,Ramen Restaurant,Ethiopian Restaurant
13,M5B,6,Café,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Movie Theater,Sandwich Place,College Rec Center,Burrito Place,Restaurant,Ramen Restaurant
15,M5E,6,Restaurant,Cocktail Bar,Hotel,Italian Restaurant,Pub,Beer Bar,Coffee Shop,Breakfast Spot,Creperie,Café
16,M5G,6,Coffee Shop,Chinese Restaurant,Café,Thai Restaurant,Italian Restaurant,Spa,Furniture / Home Store,Comic Shop,Department Store,Dessert Shop
17,M5H,6,Coffee Shop,Restaurant,Bar,Hotel,Movie Theater,Gym,Italian Restaurant,Pizza Place,Event Space,Pub
19,M5K,6,Coffee Shop,Café,Restaurant,Hotel,Deli / Bodega,Gym,Bakery,American Restaurant,Burger Joint,Building
20,M5L,6,Coffee Shop,Café,Restaurant,Hotel,Deli / Bodega,American Restaurant,Gastropub,Gym,Gluten-free Restaurant,Shopping Mall
21,M5S,6,Café,Bakery,Restaurant,Sandwich Place,Bar,Fish & Chips Shop,Chinese Restaurant,Beer Bar,Cheese Shop,Sushi Restaurant
22,M5T,6,Record Shop,Music Venue,Korean Restaurant,Gaming Cafe,Tea Room,Bank,Pizza Place,Coffee Shop,Dumpling Restaurant,Sandwich Place


In [158]:
pc_merged.loc[pc_merged['Cluster Labels'] == 7, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M5C,7,Pharmacy,Breakfast Spot,Filipino Restaurant,Bar,Food & Drink Shop,Hotel,Cuban Restaurant,Flower Shop,Flea Market,Fish & Chips Shop


In [159]:
pc_merged.loc[pc_merged['Cluster Labels'] == 8, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
94,M6K,8,Brewery,Convenience Store,Women's Store,Boutique,Art Gallery,Fast Food Restaurant,Food & Drink Shop,Flower Shop,Flea Market,Fish & Chips Shop


In [160]:
pc_merged.loc[pc_merged['Cluster Labels'] == 9, pc_merged.columns[[0] + list(range(5, pc_merged.shape[1]))]]

Unnamed: 0,PostalCode,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26,M6G,9,Korean Restaurant,Coffee Shop,Grocery Store,Japanese Restaurant,Sandwich Place,Bubble Tea Shop,Café,Rock Climbing Spot,Ramen Restaurant,Pub
