# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

In this assignment, we explore, segment, and cluster the neighborhoods of the city of Toronto. Unlike for New York, however, the neighborhood data is not readily available on the internet, so we need to write our own code to scrape the Wikipedia page

https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

in order to obtain the data in the table of postal codes and transform it into a pandas dataframe. A good package for this purpose is BeautifulSoup. We start by importing the libraries we will use in the following steps.

In [1]:
import pandas as pd
import numpy as np
import os,sys
from bs4 import BeautifulSoup
import requests
import urllib
from urllib.request import urlopen
import json
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  57.19 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  35.37 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  40.57 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  49.00 MB/s


We now look at the table on the Wikipedia page, which consists of the columns Postcode, Borough and Neighbourhood. We immediately see that some postcodes are not assigned to a borough; these rows will be ignored. Some neighbourhoods share the same postcode, and we will list them in the same row, separated by commas. A few boroughs have a "Not assigned" neighbourhood; in this case we give the neighbourhood the same name as the borough.
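
The three rules above can be sketched on a handful of toy rows before applying them to the scraped table. This is only an illustration, assuming the column names shown in the notebook's output; the names `raw`, `clean` and `toy_result` are hypothetical, and the rules are applied in a slightly different order than in the cells that follow.

```python
import pandas as pd

# Toy rows mimicking the scraped table (values copied from the sample output).
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: ignore postcodes that have no borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 3: an unassigned neighbourhood takes the name of its borough.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# Rule 2: one row per postcode, neighbourhoods joined with a comma.
toy_result = (
    clean.groupby("Postcode", as_index=False)
         .agg({"Borough": "first", "Neighbourhood": ", ".join})
)
print(toy_result)
```

Filling the unassigned neighbourhoods before grouping keeps the grouping step free of special cases.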

### Downloading the table from Wikipedia and loading the data in a pandas dataframe

In [2]:
# Download the page
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urlopen(URL)
soup = BeautifulSoup(page, "lxml")
page.close()
 
# Open the required table
fp = open("data.csv","w")
tables = soup.findAll('table')
tab = tables[0]
for tr in tab.tbody.findAll('tr'):
    #print(tr.findAll('th'))
    for th in tr.findAll('th'):
        text = th.getText().strip()+','
        fp.write(text)
    for td in tr.findAll('td'):
        text = td.getText().strip()+','
        fp.write(text)
    fp.write('\n')
fp.close()

# create the pandas dataframe
dfToronto = pd.read_csv('data.csv')
dfToronto.drop('Unnamed: 3',axis=1,inplace = True)
dfToronto.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


### Cleaning the dataframe

In [3]:
# Remove the unassigned postcodes
dfToronto1 = dfToronto[ ~ dfToronto['Borough'].str.contains('Not assigned')]
dfToronto1.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [4]:
# Combine the neighborhoods with the same postcode in the required format
group = dfToronto1.groupby('Postcode')
grouped_neighborhoods = group['Neighbourhood'].apply(lambda x: "%s" % ', '.join(x))
grouped_boroughs = group['Borough'].apply(lambda x: set(x).pop())
dfToronto2 = pd.DataFrame(list(zip(grouped_boroughs.index, grouped_boroughs, grouped_neighborhoods)))
dfToronto2.columns = ['Postcode', 'Borough', 'Neighbourhood']

dfToronto2.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [5]:
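# Give the unassigned neighbourhoods the same name as the borough.
# A boolean mask with .loc writes into dfToronto2 itself; assigning to a row
# obtained with .iloc would only modify a temporary copy.
unassigned = dfToronto2['Neighbourhood'] == 'Not assigned'
dfToronto2.loc[unassigned, 'Neighbourhood'] = dfToronto2.loc[unassigned, 'Borough']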

In [6]:
# Check the number of boroughs and the number of rows of the dataframe
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(dfToronto2['Borough'].unique()),
      dfToronto2.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


### Add the coordinates to the neighbourhood dataframe and draw a map

In [7]:
# Prepare a dataframe from the CSV file of coordinates
coordinates_df = pd.read_csv('http://cocl.us/Geospatial_data')
coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# Join the neighbourhood dataframe with the coordinates, matching on the postcode
df_join = dfToronto2.join(coordinates_df.set_index('Postal Code'), on='Postcode')
df_join.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
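
A join like the one above silently produces missing coordinates when a postcode has no match. The following sketch, on toy data, shows one way to check for that with `DataFrame.merge`. The postcode `M9Z` is made up for illustration and the variable names are hypothetical.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postcode appears more than once on either side;
# indicator adds a "_merge" column telling where each row came from.
merged = postcodes.merge(
    coords,
    how="left",
    left_on="Postcode",
    right_on="Postal Code",
    validate="one_to_one",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", unmatched)
```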


In [9]:
# Prepare a map of Toronto with the postcodes

Toronto_map = folium.Map(location=[43.6540,-79.3872], zoom_start=10)

for location in df_join.itertuples():
    label = 'Postal Code: {};  Borough: {};  Neighborhoods: {}'.format(location[1], location[2], location[3])
    label = folium.Popup(label, parse_html=True)    
    folium.CircleMarker(
        [location[-2], location[-1]],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Toronto_map) 
    folium.Circle(
        radius=500,
        popup=label,
        location=[location[-2], location[-1]],
        color='#3186cc',
        fill=True,
        fill_color='#3186cc'
    ).add_to(Toronto_map) 
    
Toronto_map

### Explore and cluster on a selected borough (Etobicoke)

In [10]:
address = 'Etobicoke, CA'

#geolocator = Nominatim()
#location = geolocator.geocode(address)
#latitude = location.latitude
#longitude = location.longitude
latitude = 43.620495
longitude = -79.513199
print('The geographical coordinates of Etobicoke are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Etobicoke are 43.620495, -79.513199.


We now prepare, as a test, a dataframe restricted to the neighbourhoods of Etobicoke and create the corresponding map.

In [11]:
Etobicoke_data = df_join[df_join['Borough'] == 'Etobicoke'].reset_index(drop=True)
Etobicoke_data

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321
1,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484
2,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
3,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So...",43.636258,-79.498509
4,M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen...",43.628841,-79.520999
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M9B,Etobicoke,"Cloverdale, Islington, Martin Grove, Princess ...",43.650943,-79.554724
7,M9C,Etobicoke,"Bloordale Gardens, Eringate, Markland Wood, Ol...",43.643515,-79.577201
8,M9P,Etobicoke,Westmount,43.696319,-79.532242
9,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
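
The borough centre used for the map is hard-coded because the geocoder call is commented out. As a possible alternative, and only as a rough sketch, the centre could be approximated by averaging the neighbourhood coordinates. The three points below are copied from the first rows of the table above; `points`, `center_lat` and `center_lng` are hypothetical names.

```python
from statistics import mean

# (latitude, longitude) pairs of the first three Etobicoke rows
points = [
    (43.605647, -79.501321),
    (43.602414, -79.543484),
    (43.653654, -79.506944),
]

# A plain average is adequate at city scale, where the Earth's curvature
# has a negligible effect on the result.
center_lat = mean(lat for lat, _ in points)
center_lng = mean(lng for _, lng in points)
print(center_lat, center_lng)
```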


In [12]:
# create map of Etobicoke using latitude and longitude values

map_Etobicoke = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(Etobicoke_data['Latitude'], Etobicoke_data['Longitude'], Etobicoke_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Etobicoke)  
    
map_Etobicoke

### Define Foursquare Credentials and Version

We set the credentials used to access the Foursquare API.

In [13]:
CLIENT_ID = '0UF1Q1XZBGCW4KXK2BIKH3BASZSVHLFPODYQCWXSAYWBRPTJ' # your Foursquare ID
CLIENT_SECRET = 'KPQX0KZA4XSUOPN5KROZTLVUOBVZXYG5SOB5WXZ2U1BLM1HA' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0UF1Q1XZBGCW4KXK2BIKH3BASZSVHLFPODYQCWXSAYWBRPTJ
CLIENT_SECRET:KPQX0KZA4XSUOPN5KROZTLVUOBVZXYG5SOB5WXZ2U1BLM1HA
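
Credentials written directly in a notebook end up in every shared copy of it. A common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, which are chosen here for illustration and are not defined by Foursquare; the helper name is hypothetical as well.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Example with an explicit mapping instead of the real environment
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
)
print("CLIENT_ID loaded:", bool(demo_id))
```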


### Test exploration on the first neighbourhood

In [14]:
Etobicoke_data.loc[0, 'Neighbourhood']

'Humber Bay Shores, Mimico South, New Toronto'

In [15]:
neighbourhood_latitude = Etobicoke_data.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = Etobicoke_data.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = Etobicoke_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))


Latitude and longitude values of Humber Bay Shores, Mimico South, New Toronto are 43.6056466, -79.50132070000001.


### Retrieval of the top 100 venues within 500 meters

We create a GET request URL and examine the result, which contains at most 100 venues within a radius of 500 meters.

In [16]:
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url


'https://api.foursquare.com/v2/venues/explore?&client_id=0UF1Q1XZBGCW4KXK2BIKH3BASZSVHLFPODYQCWXSAYWBRPTJ&client_secret=KPQX0KZA4XSUOPN5KROZTLVUOBVZXYG5SOB5WXZ2U1BLM1HA&v=20180605&ll=43.6056466,-79.50132070000001&radius=500&limit=100'
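
The same URL can also be assembled from a dictionary of parameters, which avoids counting `{}` placeholders by hand. This is a minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the venues/explore URL used in this notebook from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the comma inside "ll" as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605", 43.6056466, -79.5013207)
print(example_url)
```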

In [17]:
#Send the GET request and examine the results

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c419b144c1f671cc3895674'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4b119977f964a520488023e3-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/food_liquor_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d186941735',
         'name': 'Liquor Store',
         'pluralName': 'Liquor Stores',
         'primary': True,
         'shortName': 'Liquor Store'}],
       'id': '4b119977f964a520488023e3',
       'location': {'address': '2762 Lake Shore Blvd W',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'btwn 1st & 2nd St',
        'distance': 408,
        'formattedAddress': ['2762 Lake Shore Blvd W (btwn 1st & 2nd St)',
         'Toronto ON M8V 1H1',
         'Cana

We now define a function that extracts the category of a venue (restaurant, café, etc.).

In [18]:
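def get_category_type(row):
    # The column is called 'categories' or 'venue.categories', depending on
    # how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # Venues without any category yield None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']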
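
The core of this function is picking the first entry of a venue's category list. A stripped-down sketch of that step, with a hypothetical helper name and a toy category list shaped like the one in the JSON above:

```python
def first_category_name(categories):
    """Return the name of the first category, or None for an empty list."""
    return categories[0]["name"] if categories else None


sample_categories = [{"name": "Liquor Store", "primary": True}]
print(first_category_name(sample_categories))
print(first_category_name([]))
```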

We now filter and clean the JSON and transform it into a pandas dataframe.

In [19]:
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the venue category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)


Unnamed: 0,name,categories,lat,lng
0,LCBO,Liquor Store,43.602281,-79.499302
1,New Toronto Fish & Chips,Restaurant,43.601849,-79.503281
2,Delicia Bakery & Pastry,Bakery,43.601403,-79.503012
3,Lucky Dice Restaurant,Café,43.601392,-79.503056
4,McDonald's,Fast Food Restaurant,43.60247,-79.498963
5,Subway,Sandwich Place,43.602382,-79.498275
6,Popeyes Louisiana Kitchen,Fried Chicken Joint,43.602069,-79.4994
7,Shoppers Drug Mart,Pharmacy,43.601611,-79.502164
8,Maple Leaf House,American Restaurant,43.60204,-79.498678
9,Halibut House Fish and Chips Inc.,Seafood Restaurant,43.60196,-79.501147


In [20]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

15 venues were returned by Foursquare.
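
To see what the flattening step does, it can help to run it on a tiny hand-written payload. The sketch below assumes a pandas version that exposes `pd.json_normalize` at the top level (pandas 1.0 or later), which is newer than the import used in the cell above. The two items are abbreviated from the venues listed in the output.

```python
import pandas as pd

items = [
    {"venue": {"name": "LCBO",
               "categories": [{"name": "Liquor Store"}],
               "location": {"lat": 43.602281, "lng": -79.499302}}},
    {"venue": {"name": "Subway",
               "categories": [{"name": "Sandwich Place"}],
               "location": {"lat": 43.602382, "lng": -79.498275}}},
]

# Nested dictionaries become dotted column names; lists are kept as they are.
flat = pd.json_normalize(items)
print(sorted(flat.columns))
```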


### Explore all the neighbourhoods in Etobicoke

After testing a single neighbourhood, we extend the exploration to all the neighbourhoods of Etobicoke.

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
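
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would fail for a venue that comes back without a category. A more defensive variant of the extraction step might look like the sketch below. `extract_venue_rows` is a hypothetical helper and the payload is a toy example, not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of an explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0]["name"] if categories else None,  # tolerate empty lists
        ))
    return rows


toy_items = [
    {"venue": {"name": "LCBO", "location": {"lat": 43.602281, "lng": -79.499302},
               "categories": [{"name": "Liquor Store"}]}},
    {"venue": {"name": "Unnamed spot", "location": {"lat": 43.6, "lng": -79.5},
               "categories": []}},
]
toy_rows = extract_venue_rows("Toy neighbourhood", 43.605647, -79.501321, toy_items)
print(toy_rows)
```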

We create a new dataframe for the whole borough using the function defined above.

In [22]:
Etobicoke_venues = getNearbyVenues(names=Etobicoke_data['Neighbourhood'],
                                   latitudes=Etobicoke_data['Latitude'],
                                   longitudes=Etobicoke_data['Longitude']
                                  )

Humber Bay Shores, Mimico South, New Toronto
Alderwood, Long Branch
The Kingsway, Montgomery Road, Old Mill North
Humber Bay, King's Mill Park, Kingsway Park South East, Mimico NE, Old Mill South, The Queensway East, Royal York South East, Sunnylea
Kingsway Park South West, Mimico NW, The Queensway West, Royal York South West, South of Bloor
Islington Avenue
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Westmount
Kingsview Village, Martin Grove Gardens, Richview Gardens, St. Phillips
Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown
Northwest


In [23]:
# Check the size of the new dataframe
print(Etobicoke_venues.shape)
Etobicoke_venues.head(10)

(71, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,LCBO,43.602281,-79.499302,Liquor Store
1,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,New Toronto Fish & Chips,43.601849,-79.503281,Restaurant
2,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Delicia Bakery & Pastry,43.601403,-79.503012,Bakery
3,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Lucky Dice Restaurant,43.601392,-79.503056,Café
4,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,McDonald's,43.60247,-79.498963,Fast Food Restaurant
5,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Subway,43.602382,-79.498275,Sandwich Place
6,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Popeyes Louisiana Kitchen,43.602069,-79.4994,Fried Chicken Joint
7,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Shoppers Drug Mart,43.601611,-79.502164,Pharmacy
8,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Maple Leaf House,43.60204,-79.498678,American Restaurant
9,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,Halibut House Fish and Chips Inc.,43.60196,-79.501147,Seafood Restaurant


We count the venues returned for each group of neighbourhoods that share a postcode.

In [24]:
Etobicoke_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",10,10,10,10,10,10
"Alderwood, Long Branch",10,10,10,10,10,10
"Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe",6,6,6,6,6,6
"Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park",2,2,2,2,2,2
"Humber Bay Shores, Mimico South, New Toronto",15,15,15,15,15,15
"Humber Bay, King's Mill Park, Kingsway Park South East, Mimico NE, Old Mill South, The Queensway East, Royal York South East, Sunnylea",1,1,1,1,1,1
"Kingsview Village, Martin Grove Gardens, Richview Gardens, St. Phillips",4,4,4,4,4,4
"Kingsway Park South West, Mimico NW, The Queensway West, Royal York South West, South of Bloor",11,11,11,11,11,11
Northwest,2,2,2,2,2,2
"The Kingsway, Montgomery Road, Old Mill North",3,3,3,3,3,3


In [25]:
print('There are {} unique categories.'.format(len(Etobicoke_venues['Venue Category'].unique())))

There are 39 unique categories.


### Analysis of the single neighbourhoods

In [26]:
# one hot encoding
Etobicoke_onehot = pd.get_dummies(Etobicoke_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Etobicoke_onehot['Neighbourhood'] = Etobicoke_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [Etobicoke_onehot.columns[-1]] + list(Etobicoke_onehot.columns[:-1])
Etobicoke_onehot = Etobicoke_onehot[fixed_columns]

Etobicoke_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,Bakery,Bank,Baseball Field,Beer Store,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Pub,Rental Car Location,Restaurant,River,Sandwich Place,Seafood Restaurant,Skating Rink,Social Club,Supplement Shop,Wings Joint
0,"Humber Bay Shores, Mimico South, New Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Humber Bay Shores, Mimico South, New Toronto",0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,"Humber Bay Shores, Mimico South, New Toronto",0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Humber Bay Shores, Mimico South, New Toronto",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,"Humber Bay Shores, Mimico South, New Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
# Checking the size of the new dataframe
Etobicoke_onehot.shape

(71, 40)

In [28]:
# Group the rows by neighbourhood and take the mean of each one-hot column, i.e. the relative frequency of each category
Etobicoke_grouped = Etobicoke_onehot.groupby('Neighbourhood').mean().reset_index()
Etobicoke_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Bakery,Bank,Baseball Field,Beer Store,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Pub,Rental Car Location,Restaurant,River,Sandwich Place,Seafood Restaurant,Skating Rink,Social Club,Supplement Shop,Wings Joint
0,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.1,0.0,0.0,0.0,0.1,0.0,0.1,0.0,0.0,0.0
2,"Bloordale Gardens, Eringate, Markland Wood, Ol...",0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.166667,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cloverdale, Islington, Martin Grove, Princess ...",0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Humber Bay Shores, Mimico South, New Toronto",0.066667,0.066667,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,...,0.0,0.0,0.066667,0.0,0.066667,0.066667,0.0,0.0,0.0,0.0
5,"Humber Bay, King's Mill Park, Kingsway Park So...",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Kingsview Village, Martin Grove Gardens, Richv...",0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Kingsway Park South West, Mimico NW, The Queen...",0.0,0.090909,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.090909,0.090909,0.090909
8,Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"The Kingsway, Montgomery Road, Old Mill North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# New size
Etobicoke_grouped.shape

(11, 40)
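
Why the mean of one-hot columns equals a relative frequency can be checked on a toy example. In the sketch below, neighbourhood `A` has three venues, two of them bakeries, so its `Bakery` value is 2/3. All names and values are made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Bakery", "Bakery", "Café", "Bank"],
})

# One column per category, holding 1.0 where the venue has that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of `freq` sums to 1, which is why neighbourhoods with very few venues show large values such as 0.5 or 1.0 in the table above.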

We now print the 5 most common venue categories for each group of neighbourhoods.

In [30]:
# Now we print the top 5 most common venues per neighbourhood
num_top_venues = 5

for hood in Etobicoke_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Etobicoke_grouped[Etobicoke_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
            venue  freq
0   Grocery Store   0.2
1    Liquor Store   0.1
2      Beer Store   0.1
3        Pharmacy   0.1
4  Sandwich Place   0.1


----Alderwood, Long Branch----
            venue  freq
0     Pizza Place   0.2
1             Gym   0.1
2             Pub   0.1
3    Skating Rink   0.1
4  Sandwich Place   0.1


----Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe----
               venue  freq
0  Convenience Store  0.17
1         Beer Store  0.17
2           Pharmacy  0.17
3               Café  0.17
4        Pizza Place  0.17


----Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park----
                       venue  freq
0                       Bank   0.5
1                Golf Course   0.5
2        American Restaurant   0.0
3                        Pub   0.0
4  Middle Eastern Restaurant   0.0


----Humber Bay Shores, Mimico Sout

### Conversion into a pandas dataframe

To collect these data in a pandas dataframe, we define a function that sorts the venue categories of a row in descending order of frequency.

In [31]:
# A function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
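
A small worked example may make the behaviour of this function easier to follow. The sketch below uses a hypothetical variant, `top_categories`, that drops the name column by label rather than by position, applied to a toy row with made-up frequencies.

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in a row."""
    freqs = row.drop("Neighbourhood").astype(float)
    return freqs.sort_values(ascending=False).index[:n].tolist()


toy_row = pd.Series({
    "Neighbourhood": "Alderwood, Long Branch",
    "Gym": 0.1,
    "Pizza Place": 0.2,
    "Pub": 0.0,
})
print(top_categories(toy_row, 2))
```

When several categories share the same frequency, the default sort does not guarantee their relative order, so tied categories may appear in different positions from one table to another.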

In [32]:
# We apply the function to create the dataframe and show the 10 top venues per group of neighbourhoods
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # past the third position the suffix is 'th' (correct for the 10 columns used here)
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = Etobicoke_grouped['Neighbourhood']

for ind in np.arange(Etobicoke_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Etobicoke_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pizza Place,Coffee Shop,Fast Food Restaurant,Sandwich Place,Beer Store,Liquor Store,Fried Chicken Joint,Pharmacy,Drugstore
1,"Alderwood, Long Branch",Pizza Place,Pub,Dance Studio,Coffee Shop,Pharmacy,Pool,Gym,Sandwich Place,Skating Rink,Beer Store
2,"Bloordale Gardens, Eringate, Markland Wood, Ol...",Pharmacy,Liquor Store,Beer Store,Café,Convenience Store,Pizza Place,Wings Joint,Flower Shop,Fast Food Restaurant,Drugstore
3,"Cloverdale, Islington, Martin Grove, Princess ...",Golf Course,Bank,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
4,"Humber Bay Shores, Mimico South, New Toronto",Café,Gym,Pizza Place,Bakery,Fast Food Restaurant,Flower Shop,Fried Chicken Joint,Liquor Store,Mexican Restaurant,Pharmacy
5,"Humber Bay, King's Mill Park, Kingsway Park So...",Baseball Field,Wings Joint,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
6,"Kingsview Village, Martin Grove Gardens, Richv...",Pizza Place,Bus Line,Mobile Phone Shop,Park,Wings Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio
7,"Kingsway Park South West, Mimico NW, The Queen...",Wings Joint,Supplement Shop,Bakery,Burger Joint,Convenience Store,Discount Store,Fast Food Restaurant,Grocery Store,Gym,Social Club
8,Northwest,Drugstore,Rental Car Location,Wings Joint,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Discount Store,Dance Studio,Convenience Store
9,"The Kingsway, Montgomery Road, Old Mill North",Pool,River,Park,Chinese Restaurant,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
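
The column names above rely on a three-element suffix list with a fallback to `th`, which is enough for 10 columns. If more columns were ever needed, a general ordinal helper could be used instead. The function below is a hypothetical sketch of that idea.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


venue_columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(venue_columns)
```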


### Clustering the neighbourhoods

We run k-means from the scikit-learn library to group the neighbourhoods into 5 clusters.

In [39]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

Etobicoke_data_2 = Etobicoke_data.drop(11)
# set number of clusters
kclusters = 5

Etobicoke_grouped_clustering = Etobicoke_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Etobicoke_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:11] 
#len(kmeans.labels_)#=11
#Etobicoke_data.shape#=(12,5)

(11, 5)
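
To get a feeling for what `labels_` contains, k-means can be run on a tiny matrix of made-up frequency profiles. In this sketch the first two rows are dominated by one category and the last two by another, so two clusters should separate them. The data and names are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is a toy "neighbourhood", each column a venue category frequency
profiles = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# labels_ holds one label per input row, in the same order as the rows
toy_labels = model.labels_
print(toy_labels)
```

The label numbers themselves are arbitrary; only the grouping they induce carries meaning.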

In [40]:
Etobicoke_merged = Etobicoke_data_2

# add clustering labels
Etobicoke_merged['Cluster Labels'] = kmeans.labels_

# join the sorted venue table so that each row also carries its most common venues
Etobicoke_merged = Etobicoke_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Etobicoke_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto",43.605647,-79.501321,0,Café,Gym,Pizza Place,Bakery,Fast Food Restaurant,Flower Shop,Fried Chicken Joint,Liquor Store,Mexican Restaurant,Pharmacy
1,M8W,Etobicoke,"Alderwood, Long Branch",43.602414,-79.543484,0,Pizza Place,Pub,Dance Studio,Coffee Shop,Pharmacy,Pool,Gym,Sandwich Place,Skating Rink,Beer Store
2,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,0,Pool,River,Park,Chinese Restaurant,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
3,M8Y,Etobicoke,"Humber Bay, King's Mill Park, Kingsway Park So...",43.636258,-79.498509,3,Baseball Field,Wings Joint,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
4,M8Z,Etobicoke,"Kingsway Park South West, Mimico NW, The Queen...",43.628841,-79.520999,0,Wings Joint,Supplement Shop,Bakery,Burger Joint,Convenience Store,Discount Store,Fast Food Restaurant,Grocery Store,Gym,Social Club
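
One point worth checking here is row order. `kmeans.labels_` follows the rows of `Etobicoke_grouped`, which is sorted by neighbourhood name, while `Etobicoke_data` is sorted by postcode, so a positional assignment may pair labels with the wrong rows. A possible way around this, sketched on toy data with hypothetical names and made-up labels, is to attach the labels to the grouped table first and then merge on the neighbourhood name.

```python
import pandas as pd

# Grouped table: alphabetical order, one label per row (labels are made up)
grouped = pd.DataFrame({"Neighbourhood": ["Alderwood", "Westmount"]})
toy_labels = [0, 2]

# Location table: postcode order, including a row that has no venues
data = pd.DataFrame({
    "Postcode": ["M9P", "M8W", "M9A"],
    "Neighbourhood": ["Westmount", "Alderwood", "Islington Avenue"],
})

label_table = grouped.assign(**{"Cluster Labels": toy_labels})

# Merging on the name keeps each label with its own neighbourhood;
# rows without venues simply get a missing label.
aligned = data.merge(label_table, on="Neighbourhood", how="left")
print(aligned)
```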


We now visualize the clusters on a map.

In [41]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Etobicoke_merged['Latitude'], Etobicoke_merged['Longitude'], Etobicoke_merged['Neighbourhood'], Etobicoke_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
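
The colour scheme only needs one distinct hex colour per cluster. As an illustration of that idea, the sketch below builds such a palette with the standard library's `colorsys` module in place of matplotlib, and indexes it directly by cluster label. `cluster_colors` is a hypothetical helper.

```python
import colorsys


def cluster_colors(k):
    """Return k visually distinct hex colours, one per cluster label 0..k-1."""
    palette = []
    for i in range(k):
        # Spread the hues evenly around the colour wheel
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_colors(5)
for cluster in range(5):
    print(cluster, palette[cluster])
```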

### Examining clusters

Now we can finally examine the resulting clusters.

#### CLUSTER 1

In [42]:
Etobicoke_merged.loc[Etobicoke_merged['Cluster Labels'] == 0, Etobicoke_merged.columns[[1] + list(range(5, Etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Etobicoke,0,Café,Gym,Pizza Place,Bakery,Fast Food Restaurant,Flower Shop,Fried Chicken Joint,Liquor Store,Mexican Restaurant,Pharmacy
1,Etobicoke,0,Pizza Place,Pub,Dance Studio,Coffee Shop,Pharmacy,Pool,Gym,Sandwich Place,Skating Rink,Beer Store
2,Etobicoke,0,Pool,River,Park,Chinese Restaurant,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
4,Etobicoke,0,Wings Joint,Supplement Shop,Bakery,Burger Joint,Convenience Store,Discount Store,Fast Food Restaurant,Grocery Store,Gym,Social Club
6,Etobicoke,0,Golf Course,Bank,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store
7,Etobicoke,0,Pharmacy,Liquor Store,Beer Store,Café,Convenience Store,Pizza Place,Wings Joint,Flower Shop,Fast Food Restaurant,Drugstore
10,Etobicoke,0,Grocery Store,Pizza Place,Coffee Shop,Fast Food Restaurant,Sandwich Place,Beer Store,Liquor Store,Fried Chicken Joint,Pharmacy,Drugstore


#### CLUSTER 2

In [43]:
Etobicoke_merged.loc[Etobicoke_merged['Cluster Labels'] == 1, Etobicoke_merged.columns[[1] + list(range(5, Etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Etobicoke,1,,,,,,,,,,


#### CLUSTER 3

In [44]:
Etobicoke_merged.loc[Etobicoke_merged['Cluster Labels'] == 2, Etobicoke_merged.columns[[1] + list(range(5, Etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Etobicoke,2,Pizza Place,Chinese Restaurant,Intersection,Sandwich Place,Middle Eastern Restaurant,Coffee Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio


#### CLUSTER 4

In [45]:
Etobicoke_merged.loc[Etobicoke_merged['Cluster Labels'] == 3, Etobicoke_merged.columns[[1] + list(range(5, Etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Etobicoke,3,Baseball Field,Wings Joint,Coffee Shop,Fried Chicken Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio,Convenience Store


#### CLUSTER 5

In [46]:
Etobicoke_merged.loc[Etobicoke_merged['Cluster Labels'] == 4, Etobicoke_merged.columns[[1] + list(range(5, Etobicoke_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Etobicoke,4,Pizza Place,Bus Line,Mobile Phone Shop,Park,Wings Joint,Flower Shop,Fast Food Restaurant,Drugstore,Discount Store,Dance Studio
