# Toronto clustering project

In this project, we aim to build a clustering model to segment Toronto neighborhoods.



## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

## 1. Download and Explore Dataset

We build the code to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in its table of postal codes, and transform that data into a pandas dataframe.

First, we import the libraries used in the scraping and cleaning steps.

In [96]:
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup  # scrape Wikipedia
import pandas as pd

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

To obtain the data in the Wikipedia table of postal codes, we use the **BeautifulSoup** library:

1. identify the URL to scrape and store it in the cururl variable
2. open a request to this URL and save the response in the data variable
3. extract the table tag with class="wikitable sortable", then explore each row and column of this table
4. store the extracted data in a dataframe named df_codes

In [27]:


cururl = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

print('=== URLS Yet To Retrieve:', len(cururl))
 
 
print('RETRIEVING', cururl)
data =  urllib.request.urlopen(cururl).read() 
 
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
tables = soup.find("table",{"class":"wikitable sortable"}) # wikitable sortable jquery-tablesorter
 

=== URLS Yet To Retrieve: 63
RETRIEVING https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


To clean and prepare our dataframe, we use a dictionary, which lets us easily look up each postal code and find the associated borough and neighborhood. We then apply the following conditional tests:

1. check that each borough is different from 'Not assigned'
2. if the neighborhood is equal to 'Not assigned', replace the neighborhood value with the borough value
3. check whether the code is already in the dictionary; if so, concatenate the neighborhood with the existing one, separated by a comma
4. remove surrounding whitespace characters (such as newlines) using strip()
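
The four rules above can be sketched on plain Python tuples, independent of the scraping step. The `clean_rows` helper and the sample `rows` are hypothetical names introduced for illustration; only the rules themselves come from the text.

```python
def clean_rows(rows):
    """Apply the cleaning rules to (code, borough, neighborhood) tuples."""
    postalcodes = {}
    for code, borough, neigh in rows:
        # rule 4: strip surrounding whitespace and newlines
        code, borough, neigh = code.strip(), borough.strip(), neigh.strip()
        # rule 1: skip rows whose borough is not assigned
        if borough == 'Not assigned':
            continue
        # rule 2: an unassigned neighborhood takes the borough name
        if neigh == 'Not assigned':
            neigh = borough
        # rule 3: merge neighborhoods that share a postal code
        if code in postalcodes:
            neigh = postalcodes[code][2] + ',' + neigh
        postalcodes[code] = [code, borough, neigh]
    return postalcodes


# hypothetical sample rows, shaped like the cells of the Wikipedia table
rows = [
    ('M1A', 'Not assigned', 'Not assigned\n'),
    ('M5A', 'Downtown Toronto', 'Harbourfront\n'),
    ('M5A', 'Downtown Toronto', 'Regent Park\n'),
    ('M7A', "Queen's Park", 'Not assigned\n'),
]
cleaned = clean_rows(rows)
print(cleaned)
```

In this sample, M1A is dropped, the two M5A rows are merged into one entry, and M7A takes its borough name as its neighborhood.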
 

In [28]:
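tags_tr = tables.find_all('tr')

# map each postal code to [postalcode, borough, neighborhood]
postalcodes = {}
for tr in tags_tr:
    tds = tr.find_all('td')

    # header rows contain no <td> cells, so skip them
    if tds is not None and len(tds) > 0:
        for code, borough, neigh in zip(tds[0], tds[1], tds[2]):
            if borough != 'Not assigned':
                # use the element's text when the cell content is a tag
                try:
                    borough = borough.text
                except:
                    pass
                # use the link title as the neighborhood name when there is one
                try:
                    neigh = neigh.get('title')
                except:
                    pass
                # fall back to the borough when no neighborhood is assigned
                if neigh.strip() == 'Not assigned':
                    neigh = borough

                # append to the neighborhoods already stored for this code
                try:
                    if code in postalcodes.keys():
                        neigh = postalcodes.get(code)[2] + ',' + neigh
                except:
                    pass

                postalcodes[code] = [code.strip(), borough.strip(), neigh.strip()]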

Finally, we build a dataframe from the scraped data and give proper names to its 3 columns. The resulting dataframe has 103 rows.

In [29]:
df_codes = pd.DataFrame(list(postalcodes.values()),columns=['postalcode','borough','neighborhood'])
df_codes

Unnamed: 0,postalcode,borough,neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront (Toronto),Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Toronto,Malvern, Toronto"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens,Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson,Garden District"


In [30]:
df_codes.shape

(103, 3)

## 2. Explore neighborhoods in Toronto

In order to utilize the Foursquare location data, we need to get the latitude and longitude coordinates of each neighborhood.

#### Install geopy and geocoder to get the latitude and longitude values of Toronto neighborhoods.

In [31]:
!conda install -c conda-forge geopy --yes

Solving environment: done

# All requested packages already installed.



In [32]:
!conda install -c conda-forge geocoder --yes



In [33]:
import geocoder # get coordinates from the Google API; this approach did not work here

def get_coordinates_google(row, max_attempts=10):
    """Look up the latitude and longitude of a postal code with geocoder.google."""
    lat_lng_coords = None
    attempts = 0
    # retry until coordinates are returned, but give up after max_attempts
    # (the original loop never ended when the API kept returning nothing)
    while not lat_lng_coords and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(row['postalcode']))
        lat_lng_coords = g.latlng
        attempts += 1

    if not lat_lng_coords:
        return None, None

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

    print('The geographical coordinates of {} are {}, {}.'.format(row['postalcode'], latitude, longitude))
    return latitude, longitude



In [34]:
import numpy as np
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data

  

In [35]:
geo_data = pd.read_csv('Geospatial_Coordinates.csv')
geo_data.set_index('Postal Code',inplace=True)
geo_data.head()

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [36]:
 geo_data.loc['M3A'].Latitude

43.7532586

In [37]:

def get_coordinates_csv(row):
    """Add the latitude and longitude looked up in geo_data to a df_codes row."""
    postalcode = row['postalcode']

    row['latitude'] = geo_data.loc[postalcode].Latitude
    row['longitude'] = geo_data.loc[postalcode].Longitude

    return row
  


In [38]:
df_codes.loc[:,'postalcode'].head()

0    M3A
1    M4A
2    M5A
3    M6A
4    M7A
Name: postalcode, dtype: object

In [39]:

df_codes = df_codes.apply(get_coordinates_csv,axis=1)


In [40]:
df_codes.head()

Unnamed: 0,postalcode,borough,neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront (Toronto),Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


In [41]:
#df_codes.tail(3)
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_codes['borough'].unique()),
        df_codes.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


In [45]:
!conda install -c conda-forge folium=0.5.0 --yes


In [46]:
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.653908, -79.384293], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_codes['latitude'], df_codes['longitude'], df_codes['neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Define Foursquare Credentials and Version

In [47]:
CLIENT_ID = 'STMPWRRFMWA5HMU4JL4L5NVNEGRHUNYKDTGOIGRBFF3BIMPK' # your Foursquare ID
CLIENT_SECRET = 'EFRKP2DQ5AAZCVPX2RXK4IU3O5S2223VY1XOCSXI4IIDOZ3L' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: STMPWRRFMWA5HMU4JL4L5NVNEGRHUNYKDTGOIGRBFF3BIMPK
CLIENT_SECRET:EFRKP2DQ5AAZCVPX2RXK4IU3O5S2223VY1XOCSXI4IIDOZ3L


### Let's explore the first neighborhood in our dataframe.
Get the neighborhood's name and coordinates.

In [48]:
df_codes.loc[0, 'neighborhood']
neighborhood_latitude = df_codes.loc[0, 'latitude'] # neighborhood latitude value
neighborhood_longitude = df_codes.loc[0, 'longitude'] # neighborhood longitude value

neighborhood_name = df_codes.loc[0, 'neighborhood'] # neighborhood name
print(neighborhood_name, neighborhood_latitude , neighborhood_longitude)

Parkwoods 43.7532586 -79.3296565


Now, let's get the top 100 venues that are in Parkwoods within a radius of 500 meters.
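
One way to assemble such a request, sketched here as an alternative to formatting the string by hand, is `urllib.parse.urlencode`, which escapes every parameter value. The endpoint and parameter names are the ones used in this notebook; the `build_search_url` helper is hypothetical and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Build a Foursquare venues/search URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes values, so the comma in 'll' becomes %2C
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)


# placeholder credentials, not real ones
example_url = build_search_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                               43.7532586, -79.3296565, '20180605')
print(example_url)
```

Keeping the credentials as function arguments also makes it easier to load them from outside the notebook instead of writing them into a cell.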

In [49]:
# build the Foursquare search URL for the first neighborhood
search_query = ''
LIMIT=100
radius = 500
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION,  radius, LIMIT)
url


'https://api.foursquare.com/v2/venues/search?client_id=STMPWRRFMWA5HMU4JL4L5NVNEGRHUNYKDTGOIGRBFF3BIMPK&client_secret=EFRKP2DQ5AAZCVPX2RXK4IU3O5S2223VY1XOCSXI4IIDOZ3L&ll=43.7532586,-79.3296565&v=20180605&radius=500&limit=100'

In [50]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d5aabad66dc060025a1cf72'},
 'response': {'venues': [{'id': '4e8d9dcdd5fbbbb6b3003c7b',
    'name': 'Brookbanks Park',
    'location': {'address': 'Toronto',
     'lat': 43.751976046055574,
     'lng': -79.33214044722958,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.751976046055574,
       'lng': -79.33214044722958}],
     'distance': 245,
     'cc': 'CA',
     'city': 'Toronto',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['Toronto', 'Toronto ON', 'Canada']},
    'categories': [{'id': '4bf58dd8d48988d163941735',
      'name': 'Park',
      'pluralName': 'Parks',
      'shortName': 'Park',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1566223277',
    'hasPerk': False},
   {'id': '4e039defd22d4cebf370894a',
    'name': 'Cassandra Public School',
    'location': {'address': '45 Cassandra Blvd',


In [52]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    


In [53]:
venues = results['response']['venues'] 
    
nearby_venues = json_normalize(venues) # flatten JSON
 
# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Cassandra Public School,School,43.748291,-79.328889
2,CAPREIT Toronto Townhomes - 56 Cassandra Blvd,Building,43.75392,-79.3224
3,Nile Academy,Elementary School,43.752369,-79.332217
4,15 Brookbanks,Residential Building (Apartment / Condo),43.752266,-79.332322


In [54]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.
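
The `distance` field in the response above (245 for Brookbanks Park) appears to be the distance in meters between the query point and the venue. As a rough check, assuming a spherical Earth with a mean radius of 6,371 km, the haversine formula can be applied to the two coordinate pairs printed earlier. The `haversine_m` helper is a hypothetical name introduced for this sketch.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


# Parkwoods query point and the Brookbanks Park venue from the response
parkwoods = (43.7532586, -79.3296565)
brookbanks_park = (43.751976046055574, -79.33214044722958)
distance = haversine_m(*parkwoods, *brookbanks_park)
print(round(distance))
```

The result should land close to the reported 245, which is well inside the 500 meter radius used in the request.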


In [55]:
toronto_grouped = df_codes.groupby('neighborhood')[['latitude', 'longitude']].mean().reset_index()

In [56]:
toronto_grouped

Unnamed: 0,neighborhood,latitude,longitude
0,"Adelaide,King,Richmond",43.650571,-79.384568
1,"Agincourt North,L'Amoreaux East,Milliken, Onta...",43.815252,-79.284577
2,"Agincourt, Toronto",43.794200,-79.262029
3,"Albion Gardens,Beaumond Heights,Humbergate,Mou...",43.739416,-79.588437
4,"Alderwood, Toronto,Long Branch, Toronto",43.602414,-79.543484
5,"Bathurst Manor,Downsview North,Wilson Heights,...",43.754328,-79.442259
6,Bayview Village,43.786947,-79.385975
7,"Bedford Park, Toronto,Lawrence Manor East",43.733283,-79.419750
8,Berczy Park,43.644771,-79.373306
9,"Birch Cliff,Cliffside West",43.692657,-79.264848


Next, we write a function that repeats this request for every neighborhood, cleans the JSON responses, and structures them into a pandas dataframe.

In [58]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [61]:
# fetch the nearby venues for every neighborhood in df_codes


toronto_venues = getNearbyVenues(names=df_codes['neighborhood'],
                                   latitudes=df_codes['latitude'],
                                   longitudes=df_codes['longitude']
                                  )


Parkwoods
Victoria Village
Harbourfront (Toronto),Regent Park
Lawrence Heights,Lawrence Manor
Queen's Park
Islington Avenue
Rouge, Toronto,Malvern, Toronto
Don Mills North
Woodbine Gardens,Parkview Hill
Ryerson,Garden District
Glencairn
Cloverdale,Islington, Toronto,Martin Grove,Princess Gardens,West Deane Park
Highland Creek (Toronto),Rouge Hill,Port Union, Toronto
Flemingdon Park,Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens,Eringate,Markland Wood,Old Burnhamthorpe
Guildwood,Morningside, Toronto,West Hill, Toronto
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn, Toronto
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor,Downsview North,Wilson Heights, Toronto
Thorncliffe Park
Adelaide,King,Richmond
Dovercourt Village,Dufferin
Scarborough Village
Fairview,Henry Farm,Oriole
Northwood Park,York University
East Toronto
Harbourfront East,Toronto Islands,Union Station (Toronto)
Little Portugal, Toronto,Trinity–Bellwoods


In [62]:
print(toronto_venues.shape)
toronto_venues.head()

(2237, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [128]:
toronto_venues[toronto_venues['Neighborhood']=='Upper Rouge']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


In [129]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,King,Richmond",100,100,100,100,100,100
"Agincourt North,L'Amoreaux East,Milliken, Ontario,Steeles East",2,2,2,2,2,2
"Agincourt, Toronto",5,5,5,5,5,5
"Albion Gardens,Beaumond Heights,Humbergate,Mount Olive-Silverstone-Jamestown,Mount Olive-Silverstone-Jamestown,Silverstone, Toronto,South Steeles,Thistletown",10,10,10,10,10,10
"Alderwood, Toronto,Long Branch, Toronto",10,10,10,10,10,10
"Bathurst Manor,Downsview North,Wilson Heights, Toronto",19,19,19,19,19,19
Bayview Village,4,4,4,4,4,4
"Bedford Park, Toronto,Lawrence Manor East",22,22,22,22,22,22
Berczy Park,57,57,57,57,57,57
"Birch Cliff,Cliffside West",4,4,4,4,4,4


In [66]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 279 unique categories.


## 3. Analyze Each Neighborhood

In [68]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.010000,0.000000,0.000000,0.000000,0.000000,0.010000,0.0,0.01
1,"Agincourt North,L'Amoreaux East,Milliken, Onta...",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
2,"Agincourt, Toronto",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
3,"Albion Gardens,Beaumond Heights,Humbergate,Mou...",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.100000,0.000000,0.000000,0.000000,0.0,0.00
4,"Alderwood, Toronto,Long Branch, Toronto",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
5,"Bathurst Manor,Downsview North,Wilson Heights,...",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.052632,0.000000,0.000000,0.000000,0.0,0.00
6,Bayview Village,0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
7,"Bedford Park, Toronto,Lawrence Manor East",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
8,Berczy Park,0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.017544,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00
9,"Birch Cliff,Cliffside West",0.000000,0.0,0.000000,0.000000,0.0000,0.0000,0.000,0.000,0.000,...,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.00


Let's confirm the new size.

In [81]:
toronto_grouped.shape

(100, 279)
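
Grouping the one-hot rows by neighborhood and taking the mean gives, for each neighborhood, the share of its venues that fall in each category. The same idea can be sketched in plain Python, without pandas, on a handful of hypothetical (neighborhood, category) pairs; `category_frequencies` is a name made up for this example.

```python
from collections import Counter, defaultdict

# hypothetical (neighborhood, venue category) pairs
venues = [
    ('Parkwoods', 'Park'),
    ('Parkwoods', 'Fast Food Restaurant'),
    ('Parkwoods', 'Park'),
    ('Parkwoods', 'Food & Drink Shop'),
    ('Victoria Village', 'Coffee Shop'),
    ('Victoria Village', 'Hockey Arena'),
]


def category_frequencies(pairs):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


freqs = category_frequencies(venues)
print(freqs['Parkwoods'])
```

Each neighborhood's shares add up to 1, which is why neighborhoods with very few venues show coarse values such as 0.5 or 0.2 in the frequency tables.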

Let's print each neighborhood along with the top 5 most common venues

In [84]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
             venue  freq
0      Coffee Shop  0.08
1             Café  0.05
2              Bar  0.04
3       Steakhouse  0.04
4  Thai Restaurant  0.04


----Agincourt North,L'Amoreaux East,Milliken, Ontario,Steeles East----
                 venue  freq
0                 Park   0.5
1           Playground   0.5
2          Yoga Studio   0.0
3   Mexican Restaurant   0.0
4  Monument / Landmark   0.0


----Agincourt, Toronto----
                venue  freq
0      Breakfast Spot   0.2
1  Chinese Restaurant   0.2
2      Sandwich Place   0.2
3              Lounge   0.2
4        Skating Rink   0.2


----Albion Gardens,Beaumond Heights,Humbergate,Mount Olive-Silverstone-Jamestown,Mount Olive-Silverstone-Jamestown,Silverstone, Toronto,South Steeles,Thistletown----
                  venue  freq
0         Grocery Store   0.2
1           Pizza Place   0.1
2           Coffee Shop   0.1
3              Pharmacy   0.1
4  Fast Food Restaurant   0.1


----Alderwood, Toronto,Lo

#### Let's put that into a *pandas* dataframe.

First, let's write a function to sort the venues in descending order.

In [85]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [87]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Thai Restaurant,Steakhouse,Bar,Restaurant,Gym,Breakfast Spot,Hotel,American Restaurant
1,"Agincourt North,L'Amoreaux East,Milliken, Onta...",Park,Playground,Dumpling Restaurant,Discount Store,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
2,"Agincourt, Toronto",Lounge,Breakfast Spot,Skating Rink,Chinese Restaurant,Sandwich Place,Women's Store,Drugstore,Dog Run,Doner Restaurant,Donut Shop
3,"Albion Gardens,Beaumond Heights,Humbergate,Mou...",Grocery Store,Beer Store,Sandwich Place,Coffee Shop,Pizza Place,Pharmacy,Fried Chicken Joint,Video Store,Fast Food Restaurant,Empanada Restaurant
4,"Alderwood, Toronto,Long Branch, Toronto",Pizza Place,Pub,Skating Rink,Gym,Coffee Shop,Pharmacy,Athletics & Sports,Pool,Sandwich Place,Dog Run


## 4. Cluster Neighborhoods

Run k-means to cluster the neighborhoods into 5 clusters.
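
For intuition, the assign-and-update loop behind k-means can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the `kmeans` helper, its naive initialisation from the first k points, and the sample `points` are assumptions made for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Cluster points (tuples of floats) with a basic assign/update loop."""
    # naive initialisation: the first k points serve as starting centroids
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# hypothetical venue-frequency vectors for four neighborhoods
points = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

With these four points, the first and third end up in one cluster and the second and fourth in the other. In the notebook, each row of the frequency table plays the role of one such point.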

In [88]:
# import k-means from clustering stage
from sklearn.cluster import KMeans



In [89]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [90]:

kmeans.labels_.shape

(100,)

In [91]:
toronto_grouped_clustering.shape

(100, 278)

In [141]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_codes

# merge df_codes with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='neighborhood')

toronto_merged.head() # check the last columns!


Unnamed: 0,postalcode,borough,neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Fast Food Restaurant,Park,Food & Drink Shop,Eastern European Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Portuguese Restaurant,Pizza Place,Hockey Arena,French Restaurant,Coffee Shop,Women's Store,Discount Store,Dive Bar,Dog Run,Doner Restaurant
2,M5A,Downtown Toronto,"Harbourfront (Toronto),Regent Park",43.65426,-79.360636,0.0,Coffee Shop,Park,Café,Bakery,Breakfast Spot,Mexican Restaurant,Restaurant,Theater,Pub,Ice Cream Shop
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763,0.0,Furniture / Home Store,Clothing Store,Women's Store,Coffee Shop,Event Space,Miscellaneous Shop,Athletics & Sports,Boutique,Vietnamese Restaurant,Accessories Store
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494,0.0,Coffee Shop,Gym,Diner,Park,Japanese Restaurant,Chinese Restaurant,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint


In [146]:
nan_todelete = toronto_merged[toronto_merged['Cluster Labels'].isna()]

In [150]:
toronto_merged=toronto_merged.dropna() 

In [152]:
# create map
latitude =43.653908
longitude = -79.384293
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
     
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Examine Clusters

Now we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can then assign a name to each cluster. I will leave this exercise to you.
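
As a starting point for naming clusters, one option is to count which category most often ranks first within each cluster. The sketch below does this in plain Python on hypothetical (cluster label, venue) pairs shaped like the 'Cluster Labels' and '1st Most Common Venue' columns; `top_venues_per_cluster` is a name introduced for illustration.

```python
from collections import Counter, defaultdict

# hypothetical (cluster label, 1st most common venue) pairs
rows = [
    (0, 'Coffee Shop'), (0, 'Coffee Shop'), (0, 'Café'),
    (2, 'Park'), (2, 'Park'), (2, 'Playground'),
]


def top_venues_per_cluster(pairs, n=1):
    """Return the n most frequent top-ranked venues for every cluster label."""
    by_cluster = defaultdict(Counter)
    for label, venue in pairs:
        by_cluster[label][venue] += 1
    return {label: counter.most_common(n) for label, counter in by_cluster.items()}


summary = top_venues_per_cluster(rows)
print(summary)
```

A cluster dominated by one category, such as parks in this sample, suggests an obvious name; mixed clusters need a closer look at the lower-ranked columns.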

Cluster 1

In [153]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Portuguese Restaurant,Pizza Place,Hockey Arena,French Restaurant,Coffee Shop,Women's Store,Discount Store,Dive Bar,Dog Run,Doner Restaurant
2,Downtown Toronto,0.0,Coffee Shop,Park,Café,Bakery,Breakfast Spot,Mexican Restaurant,Restaurant,Theater,Pub,Ice Cream Shop
3,North York,0.0,Furniture / Home Store,Clothing Store,Women's Store,Coffee Shop,Event Space,Miscellaneous Shop,Athletics & Sports,Boutique,Vietnamese Restaurant,Accessories Store
4,Queen's Park,0.0,Coffee Shop,Gym,Diner,Park,Japanese Restaurant,Chinese Restaurant,Smoothie Shop,Seafood Restaurant,Sandwich Place,Burger Joint
7,North York,0.0,Café,Japanese Restaurant,Caribbean Restaurant,Gym / Fitness Center,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore
8,East York,0.0,Fast Food Restaurant,Pizza Place,Intersection,Athletics & Sports,Gastropub,Bank,Pharmacy,Gym / Fitness Center,Ethiopian Restaurant,Empanada Restaurant
9,Downtown Toronto,0.0,Coffee Shop,Clothing Store,Cosmetics Shop,Café,Middle Eastern Restaurant,Ice Cream Shop,Pizza Place,Bubble Tea Shop,Plaza,Italian Restaurant
12,Scarborough,0.0,Moving Target,Bar,Women's Store,Diner,Dive Bar,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant
13,North York,0.0,Gym,Coffee Shop,Beer Store,Clothing Store,Asian Restaurant,Italian Restaurant,Supermarket,Restaurant,Discount Store,Dim Sum Restaurant
14,East York,0.0,Curling Ice,Park,Skating Rink,Video Store,Bus Stop,Pharmacy,Cosmetics Shop,Beer Store,Athletics & Sports,Dance Studio
