## Wikipedia scrape notebook - Toronto Neighbourhood Clusters

In [1]:
import numpy as np 
import pandas as pd 
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [2]:
wiki_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
# query the website and store the returned HTML in the variable `page`
page = urlopen(wiki_page)
soup = BeautifulSoup(page, 'html.parser') #store in variable `soup`

Now that the Wikipedia page is parsed and stored in `soup`, we can extract the table and convert it into a dataframe.

In [4]:
#extract table and convert into dataframe
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))[0]
df=pd.DataFrame(df)
header = df.iloc[0]
df = df[1:]
df = df.rename(columns = header)
df.head(15)

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned


Drop rows whose Borough is not assigned, replace unassigned neighbourhoods with the borough name, and combine rows that share a Postcode into a single row.

In [5]:
df = df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


In [6]:
df['Neighbourhood'] = df.apply(lambda row: row['Borough'] if (row['Neighbourhood']=='Not assigned') else row['Neighbourhood'],axis=1)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


In [10]:
df_grp = df.groupby(['Postcode','Borough'], sort=False)['Neighbourhood'].apply(','.join).reset_index()
df_grp.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


In [11]:
df_grp.shape

(103, 3)

The dataframe above lists each postal code along with its borough name and neighbourhood names.
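The three cleaning steps can be tried on a small toy frame without scraping anything. This is a minimal sketch under the assumption that the scraped table has the same three columns; the rows are illustrative, and `Series.mask` is used here as a vectorized alternative to the row-wise `apply`.

```python
import pandas as pd

# Toy rows mimicking the scraped table (values are illustrative only)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop rows with no borough; .copy() avoids SettingWithCopyWarning later
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. where the neighbourhood is missing, reuse the borough name
clean['Neighbourhood'] = clean['Neighbourhood'].mask(
    clean['Neighbourhood'] == 'Not assigned', clean['Borough'])

# 3. merge rows that share a postcode into one comma-separated row
grouped = (clean.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
                .apply(','.join)
                .reset_index())
print(grouped)
```

Passing `sort=False` keeps the postcodes in the order they first appear, which matches the order of the scraped table.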

Now let's get the latitude and longitude for the data above.

In [12]:
!pip install geocoder
import geocoder 
!pip install folium
import folium
import geopy
import tqdm
from geopy.geocoders import Nominatim


In [13]:
for index, row in df_grp.iterrows():
    address_1 = row['Neighbourhood'] 
    address_2 = address_1.split(',')[-1]
    address_3 = address_2+","+"Ontario,Canada"
    #print(address_3) #-- It worked

In [14]:
column_names = ['Latitude', 'Longitude'] 
n_hood = pd.DataFrame(columns=column_names)
n_hood.shape

(0, 2)

In [16]:
# Geocode each row by its last-listed neighbourhood; fall back to the borough
# when Nominatim returns no match (location is None).
# The user_agent value is an illustrative placeholder; geopy 2.x requires one to be set.
geolocator = Nominatim(user_agent='toronto_explorer')
coords = []
for index, row in df_grp.iterrows():
    address = row['Neighbourhood'].split(',')[-1] + ',Ontario,Canada'
    location = geolocator.geocode(address)
    if location is None:
        # the neighbourhood could not be geocoded, so use the borough instead
        location = geolocator.geocode(row['Borough'] + ',Ontario,Canada')
    coords.append({'Latitude': location.latitude, 'Longitude': location.longitude})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
n_hood = pd.DataFrame(coords, columns=column_names)

In [17]:
n_hood.head()

Unnamed: 0,Latitude,Longitude
0,43.757846,-79.315975
1,43.732658,-79.311189
2,43.660706,-79.360457
3,43.722079,-79.437507
4,43.65998,-79.390369
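The neighbourhood-then-borough fallback can be exercised without any network access. The sketch below is an illustration only: `fake_geocode`, `KNOWN`, and `locate` are hypothetical names, and the coordinates are made up rather than real geocoder output.

```python
# Hypothetical stand-in for geolocator.geocode: returns (lat, lon) or None.
# The coordinates are illustrative and are not real geocoder output.
KNOWN = {
    'Parkwoods,Ontario,Canada': (43.75, -79.32),
    'North York,Ontario,Canada': (43.77, -79.41),
}

def fake_geocode(address):
    return KNOWN.get(address)

def locate(neighbourhoods, borough, geocode=fake_geocode):
    """Try the last-listed neighbourhood first, then fall back to the borough."""
    primary = neighbourhoods.split(',')[-1] + ',Ontario,Canada'
    fallback = borough + ',Ontario,Canada'
    for address in (primary, fallback):
        result = geocode(address)
        if result is not None:
            return result
    # neither lookup succeeded; return a placeholder so row counts still match
    return (None, None)

print(locate('Parkwoods', 'North York'))
print(locate('Unknown Place', 'North York'))
print(locate('Unknown Place', 'Unknown Borough'))
```

Returning a placeholder instead of skipping the row keeps the coordinate list the same length as the postcode table, which matters when the two are later combined side by side.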


In [20]:
df = pd.concat([df_grp, n_hood[['Latitude', 'Longitude']]], axis=1)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.757846,-79.315975
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.660706,-79.360457
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.722079,-79.437507
4,M7A,Queen's Park,Queen's Park,43.65998,-79.390369


In [21]:
df.shape

(103, 5)
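Combining the two frames with `pd.concat(..., axis=1)` matches rows on their index labels, not on their position. A small sketch with made-up values shows what happens if one coordinate row were missing:

```python
import pandas as pd

left = pd.DataFrame({'Postcode': ['M3A', 'M4A', 'M5A']})
# pretend the lookup for row 1 was skipped, so only labels 0 and 2 exist
coords = pd.DataFrame({'Latitude': [43.75, 43.73], 'Longitude': [-79.32, -79.31]},
                      index=[0, 2])

combined = pd.concat([left, coords], axis=1)
print(combined)
# row 1 gets NaN coordinates rather than values shifted up from row 2
```

Because both frames in the notebook carry a default 0..n-1 index and have the same number of rows, the concatenation lines up one-to-one.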

Let's get the geographical coordinates of Toronto.

In [23]:
print('We have {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)
df.shape
len(df)

We have 11 boroughs and 103 neighborhoods.


103

In [24]:
address = 'Toronto,Canada'
geolocator = Nominatim(user_agent='toronto_explorer')  # placeholder user_agent; geopy 2.x requires one
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


As we did with all of New York City, let's visualize Toronto and the neighborhoods in it.

In [44]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

To simplify the map above, let's segment and cluster only the neighborhoods in Scarborough. We slice the original dataframe to create a new dataframe, `toronto_data`, containing just the Scarborough rows.

In [48]:
# keep only the rows whose borough is Scarborough
toronto_data = df[df['Borough'] == 'Scarborough'].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.775504,-79.134976
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.768914,-79.187291
3,M1G,Scarborough,Woburn,43.759824,-79.225291
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692


In [49]:
address = 'Scarborough, Toronto,Canada'
geolocator = Nominatim(user_agent='toronto_explorer')  # placeholder user_agent; geopy 2.x requires one
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Scarborough are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Scarborough are 43.7626686, -79.2308605092575.


In [52]:
# create map of Scarborough using latitude and longitude values
map_Scarborough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Scarborough)  
    
map_Scarborough

The map above shows the neighbourhoods in Scarborough.

In [26]:
#!pip install folium
from sklearn.cluster import KMeans
import folium # map rendering library
import matplotlib.cm as cm
import matplotlib.colors as colors

With all the necessary data for Toronto retrieved, we next use the Foursquare API to explore the neighborhoods and segment them.

In [27]:
address = 'Toronto,Canada'
geolocator = Nominatim(user_agent='toronto_explorer')  # placeholder user_agent; geopy 2.x requires one
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The coordinates of Toronto are 43.653963, -79.387207.


Define Foursquare Credentials and Version

In [28]:
CLIENT_ID = 'HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO' # your Foursquare ID
CLIENT_SECRET = 'T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ' # your Foursquare Secret
VERSION = '20180924' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO
CLIENT_SECRET:T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ


In [53]:
toronto_data.loc[0, 'Neighbourhood']

'Rouge,Malvern'

Get the neighborhood's latitude and longitude values.

In [54]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge,Malvern are 43.8091955, -79.2217008.


In [55]:
# Now, let's get the top 500 venues that are in Rouge/Malvern within a radius of 1000 meters.
# Also create GET request url 

radius = 1000
LIMIT = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=HDD0TFIAZXBAGTHQFUMMFD1TXISGGMXQPMJNNYE1SIT5Q1FO&client_secret=T5ZUW2WMHRJ2JDWKN4TOX1JITUJQ4UFOQV4E1U3M0GFJ1HVJ&v=20180924&ll=43.8091955,-79.2217008&radius=1000&limit=500'
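As an alternative to formatting the query string by hand, the same request URL can be assembled from a dictionary with the standard library. This is a sketch only: the credential strings are placeholders, and the parameter names are simply those that appear in the URL above.

```python
from urllib.parse import urlencode

# Placeholder credentials: substitute your own Foursquare values
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180924',
    'll': '{},{}'.format(43.8091955, -79.2217008),
    'radius': 1000,
    'limit': 500,
}

# safe=',' keeps the comma inside the ll parameter unescaped
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(url)
```

`requests.get` also accepts a `params` dictionary directly, which avoids building the string at all and keeps the secret out of displayed cell output.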

In [56]:
import requests
import json
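try:
    from pandas import json_normalize  # available at top level in pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas releases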
results = requests.get(url).json()

In [57]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [58]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
len(nearby_venues)
print('{} venues by Foursquare.'.format(nearby_venues.shape[0]))

18 venues by Foursquare.
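The flattening step is easier to follow on a tiny hand-written payload. The sketch below assumes the items have the same nesting as the fields used above (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`); the venues themselves are invented.

```python
import pandas as pd

# Hand-written stand-in for results['response']['groups'][0]['items'];
# the venues and coordinates are invented for illustration
items = [
    {'venue': {'name': 'Cafe A', 'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.80, 'lng': -79.22}}},
    {'venue': {'name': 'Spot B', 'categories': [],
               'location': {'lat': 43.81, 'lng': -79.23}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
flat = flat[['venue.name', 'venue.categories',
             'venue.location.lat', 'venue.location.lng']].copy()

# keep the first category name, or None when the list is empty
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# strip the dotted prefixes, leaving name / categories / lat / lng
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat)
```

The second item shows why the empty-list check matters: indexing `[0]` on a venue with no categories would otherwise raise an `IndexError`.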


### Now let's explore neighborhoods in Scarborough

In [40]:
#function to repeat the same process to all the neighborhoods in Scarborough
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Call the above function on each neighborhood and create a new dataframe called toronto_venues

In [59]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude'])



Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West,Steeles West
Upper Rouge


In [61]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge,Malvern",43.809196,-79.221701,Shoppers Drug Mart,43.809202,-79.22332,Pharmacy
1,"Rouge,Malvern",43.809196,-79.221701,Subway,43.806805,-79.222515,Sandwich Place
2,"Rouge,Malvern",43.809196,-79.221701,Pizza Hut,43.808326,-79.220616,Pizza Place
3,"Rouge,Malvern",43.809196,-79.221701,Pizza Pizza,43.806613,-79.221243,Pizza Place
4,"Rouge,Malvern",43.809196,-79.221701,Shoppers Drug Mart,43.806489,-79.223024,Pharmacy


Let's find out how many unique categories can be curated from all the returned venues.

In [63]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.groupby('Neighborhood').count()

There are 65 unique categories.


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,12,12,12,12,12,12
"Agincourt North,L'Amoreaux East,Milliken,Steeles East",1,1,1,1,1,1
"Birch Cliff,Cliffside West",7,7,7,7,7,7
Cedarbrae,23,23,23,23,23,23
"Clairlea,Golden Mile,Oakridge",5,5,5,5,5,5
"Clarks Corners,Sullivan,Tam O'Shanter",11,11,11,11,11,11
"Cliffcrest,Cliffside,Scarborough Village West",7,7,7,7,7,7
"Dorset Park,Scarborough Town Centre,Wexford Heights",18,18,18,18,18,18
"East Birchmount Park,Ionview,Kennedy Park",4,4,4,4,4,4
"Guildwood,Morningside,West Hill",27,27,27,27,27,27


### Analyze Each Neighborhood

In [64]:
# one hot encoding
toronto_venues_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_venues_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_venues_onehot.columns[-1]] + list(toronto_venues_onehot.columns[:-1])
toronto_venues_onehot = toronto_venues_onehot[fixed_columns]

toronto_venues_onehot.head()

Unnamed: 0,Neighborhood,African Restaurant,Asian Restaurant,Auto Workshop,Bakery,Bank,Bar,Beer Store,Big Box Store,Breakfast Spot,...,Smoke Shop,Smoothie Shop,Sports Bar,Supermarket,Thai Restaurant,Toy / Game Store,Trail,Train Station,Video Game Store,Vietnamese Restaurant
0,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Rouge,Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
toronto_venues_onehot.shape

(184, 66)

Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [66]:
toronto_venues_grouped = toronto_venues_onehot.groupby('Neighborhood').mean().reset_index()
toronto_venues_grouped

Unnamed: 0,Neighborhood,African Restaurant,Asian Restaurant,Auto Workshop,Bakery,Bank,Bar,Beer Store,Big Box Store,Breakfast Spot,...,Smoke Shop,Smoothie Shop,Sports Bar,Supermarket,Thai Restaurant,Toy / Game Store,Trail,Train Station,Video Game Store,Vietnamese Restaurant
0,Agincourt,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Birch Cliff,Cliffside West",0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.043478,0.0
4,"Clairlea,Golden Mile,Oakridge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Clarks Corners,Sullivan,Tam O'Shanter",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0
6,"Cliffcrest,Cliffside,Scarborough Village West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0
7,"Dorset Park,Scarborough Town Centre,Wexford He...",0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,...,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0
8,"East Birchmount Park,Ionview,Kennedy Park",0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Guildwood,Morningside,West Hill",0.0,0.0,0.0,0.0,0.037037,0.0,0.037037,0.0,0.074074,...,0.0,0.037037,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0
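Why the mean of one-hot columns equals a category's share of venues is easiest to see on a toy table. This is a minimal sketch with invented venues; `dtype=int` is passed so the indicator columns are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Invented venues: neighbourhood A has 4, neighbourhood B has 2
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Pizza Place', 'Pizza Place', 'Pharmacy', 'Park', 'Park', 'Park'],
})

# one indicator column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of a 0/1 column is the fraction of venues in that category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

For neighbourhood A, two of four venues are pizza places, so its `Pizza Place` frequency is 0.5; every row of `freq` sums to 1 across the category columns.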


Let's print each neighborhood along with its top 5 most common venues.

In [68]:
num_top_venues = 5

for hood in toronto_venues_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_venues_grouped[toronto_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.33
1   Korean Restaurant  0.08
2       Train Station  0.08
3         Coffee Shop  0.08
4    Asian Restaurant  0.08


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
                venue  freq
0          Playground   1.0
1  African Restaurant   0.0
2      Ice Cream Shop   0.0
3   Korean Restaurant   0.0
4        Liquor Store   0.0


----Birch Cliff,Cliffside West----
            venue  freq
0     Pizza Place  0.29
1     Coffee Shop  0.14
2   Grocery Store  0.14
3  Sandwich Place  0.14
4             Pub  0.14


----Cedarbrae----
                    venue  freq
0    Fast Food Restaurant  0.13
1  Furniture / Home Store  0.09
2             Coffee Shop  0.09
3          Discount Store  0.04
4             Bus Station  0.04


----Clairlea,Golden Mile,Oakridge----
               venue  freq
0     Ice Cream Shop   0.2
1  Convenience Store   0.2
2         Restaurant   0.2
3           Bus Stop   0.2
4             

Function to sort the venues in descending order.

In [69]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create a new dataframe and display the top 10 venues for each neighborhood.

In [73]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_venues_grouped['Neighborhood']

for ind in np.arange(toronto_venues_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_venues_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
len(neighborhoods_venues_sorted)

17
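The column labels rely on a three-item suffix list plus an exception handler, which is enough for the top 10 but would produce labels such as "21th" for longer lists. A small helper (hypothetical, not part of the notebook) can generate the suffix explicitly:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns[1:5])
```

For the first ten positions this yields the same labels as the loop above, from `1st Most Common Venue` through `10th Most Common Venue`.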

In [75]:
toronto_venues_grouped.head()

Unnamed: 0,Neighborhood,African Restaurant,Asian Restaurant,Auto Workshop,Bakery,Bank,Bar,Beer Store,Big Box Store,Breakfast Spot,...,Smoke Shop,Smoothie Shop,Sports Bar,Supermarket,Thai Restaurant,Toy / Game Store,Trail,Train Station,Video Game Store,Vietnamese Restaurant
0,Agincourt,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0
1,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Birch Cliff,Cliffside West",0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.043478,0.043478,0.043478,0.0,...,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.0,0.043478,0.0
4,"Clairlea,Golden Mile,Oakridge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cluster Neighborhoods

Run the k-means algorithm to cluster the neighborhoods into 5 clusters.

In [76]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_venues_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 0, 0, 0, 0, 0, 0, 4, 0], dtype=int32)

Create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.

In [100]:
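toronto_merged = toronto_data.copy()  # copy so toronto_data itself is not modified

# add clustering labels; kmeans.labels_ follows the row order of toronto_venues_grouped
# (sorted by neighbourhood name), while toronto_data keeps its original scraped order
toronto_merged['Cluster Labels'] = kmeans.labels_

# join the top-venue columns from neighborhoods_venues_sorted onto each neighbourhood row
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!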

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.809196,-79.221701,0,Gym / Fitness Center,Fast Food Restaurant,Pharmacy,Pizza Place,Grocery Store,Convenience Store,Sandwich Place,Bubble Tea Shop,Skating Rink,Park
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.775504,-79.134976,1,Park,Vietnamese Restaurant,Clothing Store,Gym,Grocery Store,Greek Restaurant,Furniture / Home Store,Fried Chicken Joint,Food Court,Food & Drink Shop
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.768914,-79.187291,0,Pizza Place,Fast Food Restaurant,Coffee Shop,Breakfast Spot,Pharmacy,Burger Joint,Grocery Store,Gym,Food & Drink Shop,Discount Store
3,M1G,Scarborough,Woburn,43.759824,-79.225291,0,Fast Food Restaurant,Coffee Shop,Grocery Store,Vietnamese Restaurant,Big Box Store,Furniture / Home Store,Indian Restaurant,Discount Store,Paper / Office Supplies Store,Pharmacy
4,M1H,Scarborough,Cedarbrae,43.756467,-79.226692,0,Fast Food Restaurant,Coffee Shop,Furniture / Home Store,Pizza Place,Bus Station,Gym,Video Game Store,Liquor Store,Discount Store,Clothing Store
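Cluster labels come back in the row order of the grouped frequency table, which `groupby` sorts by name, so attaching them by position to a frame in a different order can pair labels with the wrong rows. One way to guard against this, sketched here with made-up labels and a reduced set of names, is to attach the labels to the grouped frame first and then merge on the neighbourhood name:

```python
import pandas as pd

# grouped is sorted by name (as groupby does); data keeps its original order
grouped = pd.DataFrame({'Neighborhood': ['Agincourt', 'Cedarbrae', 'Woburn']})
data = pd.DataFrame({'Neighbourhood': ['Woburn', 'Cedarbrae', 'Agincourt', 'Upper Rouge']})
labels = [2, 0, 1]  # made-up labels, one per row of grouped

# attach each label to the name it was computed for, then match on the name
labelled = grouped.assign(**{'Cluster Labels': labels})
merged = data.merge(labelled, left_on='Neighbourhood', right_on='Neighborhood', how='left')
merged = merged.drop(columns='Neighborhood')
print(merged)
```

With `how='left'`, a neighbourhood that returned no venues (here `Upper Rouge`) keeps its row and gets a missing label instead of causing a length mismatch.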


Finally, let's visualize the resulting clusters.

In [78]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
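The marker colours come from a palette with one entry per cluster. Indexing that palette with `cluster-1` still gives each cluster its own colour, but label 0 picks the last entry through negative indexing. The sketch below shows a direct label-to-colour mapping; it swaps in the standard-library `colorsys` module instead of matplotlib purely to stay dependency-free, and `cluster_palette` is a hypothetical helper.

```python
import colorsys

def cluster_palette(k):
    """Return k evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette

palette = cluster_palette(5)
for label in [0, 1, 4]:
    print(label, palette[label])  # label k uses palette[k]
```

Any list of hex strings built this way can be passed as the `color` and `fill_color` arguments of the circle markers.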

### Examine Clusters

Examine each cluster and determine the discriminating venue categories that distinguish it.
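One way to make "discriminating" concrete is to compare each cluster's average category frequency with the average over all neighbourhoods. The following is a sketch under that assumption, using invented frequencies rather than the notebook's data:

```python
import pandas as pd

# Invented frequencies for four neighbourhoods in two clusters
freq = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1],
    'Coffee Shop': [0.30, 0.20, 0.00, 0.05],
    'Park': [0.00, 0.10, 0.60, 0.50],
    'Pizza Place': [0.20, 0.20, 0.10, 0.15],
})

cluster_means = freq.groupby('Cluster Labels').mean()
overall = freq.drop(columns='Cluster Labels').mean()

# positive values mark categories over-represented in a cluster
lift = cluster_means - overall
for label, row in lift.iterrows():
    print(label, row.sort_values(ascending=False).head(2).to_dict())
```

In this toy data, coffee shops stand out for cluster 0 and parks for cluster 1, which is the kind of signal the cluster names below try to capture by eye.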

Cluster 1 - Downtown Market
---
This cluster is the busiest, like the downtown of any city, with a bit of everything: many eateries and restaurants,
grocery stores, pharmacies, coffee shops, banks, electronics stores, shopping, transit such as bus and train stations, and also
fitness clubs and playgrounds.


In [79]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,0,Gym / Fitness Center,Fast Food Restaurant,Pharmacy,Pizza Place,Grocery Store,Convenience Store,Sandwich Place,Bubble Tea Shop,Skating Rink,Park
2,Scarborough,0,Pizza Place,Fast Food Restaurant,Coffee Shop,Breakfast Spot,Pharmacy,Burger Joint,Grocery Store,Gym,Food & Drink Shop,Discount Store
3,Scarborough,0,Fast Food Restaurant,Coffee Shop,Grocery Store,Vietnamese Restaurant,Big Box Store,Furniture / Home Store,Indian Restaurant,Discount Store,Paper / Office Supplies Store,Pharmacy
4,Scarborough,0,Fast Food Restaurant,Coffee Shop,Furniture / Home Store,Pizza Place,Bus Station,Gym,Video Game Store,Liquor Store,Discount Store,Clothing Store
5,Scarborough,0,Coffee Shop,Pub,Supermarket,Fast Food Restaurant,Chinese Restaurant,Gym,Bank,Auto Workshop,Grocery Store,Greek Restaurant
6,Scarborough,0,Fast Food Restaurant,Asian Restaurant,Grocery Store,Vietnamese Restaurant,Clothing Store,Gym,Greek Restaurant,Furniture / Home Store,Fried Chicken Joint,Food Court
7,Scarborough,0,Park,Ice Cream Shop,Convenience Store,Restaurant,Bus Stop,Vietnamese Restaurant,Fast Food Restaurant,Discount Store,Electronics Store,Fish Market
9,Scarborough,0,Pizza Place,Grocery Store,Auto Workshop,Pub,Coffee Shop,Sandwich Place,Electronics Store,Clothing Store,Convenience Store,Discount Store
12,Scarborough,0,Chinese Restaurant,Cantonese Restaurant,Korean Restaurant,Food Court,Electronics Store,Coffee Shop,Hong Kong Restaurant,Train Station,Asian Restaurant,Bar
13,Scarborough,0,Pizza Place,Coffee Shop,Chinese Restaurant,Fried Chicken Joint,Market,Fast Food Restaurant,Pharmacy,Grocery Store,Thai Restaurant,Big Box Store


Cluster 2 - Eat, Play n Shop Cluster
---
This cluster is also filled with restaurants, grocery stores, fast food, and fitness and playground areas. Unlike the cluster above, it is not a downtown.

In [80]:
#toronto_merged
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,1,Park,Vietnamese Restaurant,Clothing Store,Gym,Grocery Store,Greek Restaurant,Furniture / Home Store,Fried Chicken Joint,Food Court,Food & Drink Shop
11,Scarborough,1,Pizza Place,Grocery Store,Middle Eastern Restaurant,Vietnamese Restaurant,Asian Restaurant,Bakery,Bar,Breakfast Spot,Burger Joint,Café


Cluster 3 - Prime Junction
---
This cluster has restaurants, grocery stores, bus stations, a pharmacy, and a bar.

In [81]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Scarborough,2,Middle Eastern Restaurant,Pizza Place,Grocery Store,Mediterranean Restaurant,Bus Station,Fast Food Restaurant,Pharmacy,Bus Stop,Hookah Bar,Chinese Restaurant


Cluster 4 - PlayArena
---
This cluster is an active area with a playground, a gym, restaurants, grocery stores, fast food, and fitness venues.

In [82]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Scarborough,3,Playground,Vietnamese Restaurant,Gym,Grocery Store,Greek Restaurant,Furniture / Home Store,Fried Chicken Joint,Food Court,Food & Drink Shop,Fish Market


Cluster 5 - Party Place
---
This cluster has pubs, a bank, coffee shops, restaurants, fast food, and fitness venues. Definitely a hangout place!

In [83]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Scarborough,4,Coffee Shop,Pub,Supermarket,Fast Food Restaurant,Chinese Restaurant,Gym,Bank,Auto Workshop,Grocery Store,Greek Restaurant
