# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Web Scraping, Transforming Data into a Pandas Dataframe and Cleaning Data

This is the first part of the week 3 assignment. Our task is to scrape a Wikipedia page with the table of Canadian postal codes that start with M (the Toronto area), clean the data from the table, and transform it into a usable *pandas dataframe*.

The first step is to import the libraries and packages we need:

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
import folium
import json 
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

print('Everything is cool!')

Everything is cool!


The next step is to scrape the data we need and turn it into a pandas dataframe:

In [2]:
res = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find('table')

df = pd.read_html(str(table))
df = df[0]
df.head(11)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


As we can see, some rows have 'Not assigned' in the 'Borough' column. We need to clean the data and keep only the rows with an assigned borough. Let's get rid of the rest:

In [3]:
df = df[df.Borough != 'Not assigned']
df = df.reset_index(drop=True)
df.head(11)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


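(There's an alternative way to get the same result:

df = df.replace('Not assigned', np.nan)
df = df.dropna()

Note that `dropna` removes a row if *any* of its columns is 'Not assigned', so the two methods agree only when an unassigned value always comes with an unassigned borough. Here we'll stick to the first method.)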
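To make the difference between the two cleaning methods concrete, here is a minimal sketch on a toy frame. The three rows are invented for illustration and are not taken from the scraped table; the last one has a borough but no neighbourhood.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Method 1: filter on the Borough column only.
by_filter = toy[toy.Borough != 'Not assigned'].reset_index(drop=True)

# Method 2: turn every 'Not assigned' cell into NaN, then drop incomplete rows.
by_dropna = toy.replace('Not assigned', np.nan).dropna().reset_index(drop=True)

print(by_filter['Postcode'].tolist())   # rows kept by the Borough filter
print(by_dropna['Postcode'].tolist())   # rows kept by replace + dropna
```

On this toy frame the filter keeps M3A and M7A, while `dropna` keeps only M3A, so it is worth checking which behavior we want before swapping one method for the other.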

Now, let's rename the columns, following the suggestions in the assignment description, and combine the rows that share a postal code into one row, with the neighborhood names separated by commas:

In [4]:
df.rename(columns={'Postcode': 'PostalCode',
                   'Neighbourhood': 'Neighborhood'}, inplace=True)

df = df.groupby(['PostalCode', 'Borough'])[
    'Neighborhood'].apply(', '.join).reset_index()

df.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
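As a small illustration of what the `groupby` plus `', '.join` step does, here is a sketch using three rows that appear in the cleaned table earlier (M6A occurs twice, M7A once). The variable names `rows` and `combined` are just for this example.

```python
import pandas as pd

# Three rows from the cleaned table, already using the renamed columns.
rows = pd.DataFrame({
    'PostalCode': ['M6A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# One output row per (PostalCode, Borough) pair; the neighborhood
# names in each group are joined into a single comma-separated string.
combined = (rows.groupby(['PostalCode', 'Borough'])['Neighborhood']
                .apply(', '.join)
                .reset_index())

print(combined)
```

The two M6A rows collapse into one row whose Neighborhood value is "Lawrence Heights, Lawrence Manor", and M7A is left unchanged.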


Great. Now, let's see how many rows our dataframe consists of:

In [5]:
df.shape[0]

103

Good. We have scraped data, turned it into a pandas dataframe, cleaned it and got the dataframe we'll need for the next steps. 

## Part 2: Latitude and Longitude Coordinates of Toronto Neighborhoods

The most convenient way to get the coordinates we need is to use this CSV file, which lists the latitude and longitude of each postal code: https://cocl.us/Geospatial_data.

In [6]:
df_longlat = pd.read_csv('https://cocl.us/Geospatial_data')
df_longlat.head(11)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [7]:
df_longlat.shape[0]

103

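If we compare *df* and *df_longlat*, we can see that both have 103 rows and that the postal codes shown appear in the same order.

OK, now let's add the two columns from the latter dataframe to the first dataframe. Because the values are copied by position, this relies on both dataframes being sorted the same way: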

In [8]:
df = df.assign(Latitude=df_longlat.Latitude.values, Longitude=df_longlat.Longitude.values)
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


Cool. All is good now.
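Copying the coordinates by position works here because both dataframes list the postal codes in the same order. If that ever stops being true, a merge on the postal code is a safer option. The sketch below uses two rows from the tables above and deliberately shuffles the coordinates; the names `left`, `coords` and `merged` are just for this example.

```python
import pandas as pd

# Two rows from the tables above; `coords` is deliberately in a different order.
left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Align the key name, then match rows by postal code instead of by position.
merged = left.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
    validate='one_to_one',  # raises if a postal code is duplicated on either side
)

print(merged)
```

With `how='left'`, a postal code that is missing from the coordinates file would show up as NaN rather than silently shifting every row below it.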

## Part 3: Exploring, Analyzing and Clustering the Neighborhoods in the Borough of *Downtown Toronto*

For starters, let's create a map of the city of Toronto with the neighborhoods superimposed on top:

In [9]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of '+ address + ' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.


In [10]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color='gray',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
    
map_toronto

Great. Now, because *Downtown Toronto* is one of the most popular and widely known parts of Toronto, let's explore this borough.

In [11]:
downtown_toronto = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_toronto.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


Let's see the map with the neighborhoods in Downtown Toronto.

In [12]:
address = 'Downtown Toronto, Toronto'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of ' + address + ' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto, Toronto are 43.6541737, -79.38081164513409.


In [13]:
downtown_map = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, label in zip(downtown_toronto['Latitude'], downtown_toronto['Longitude'], downtown_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color='gray',
        fill_opacity=0.7,
        parse_html=False).add_to(downtown_map)  
    
downtown_map

For our project, we'll need the Foursquare API and our Foursquare credentials.

In [14]:
CLIENT_ID = 'DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F'
CLIENT_SECRET = 'MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE'
VERSION = '20180605'

print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

CLIENT_ID: DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F
CLIENT_SECRET:MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE
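Since the credentials end up in the notebook and its output, one option is to read them from environment variables and mask them when printing. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are assumptions made for this example, not part of the Foursquare API.

```python
import os


def load_credentials(env=os.environ):
    """Read the credentials from environment variables (names are assumptions)."""
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(
            'Set {} before running the notebook'.format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * (len(secret) - visible)


# Example with a stand-in mapping instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'abcd1234',
            'FOURSQUARE_CLIENT_SECRET': 'wxyz5678'}
client_id, client_secret = load_credentials(demo_env)

print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In a real session we would call `load_credentials()` with no argument so that it reads from `os.environ`.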


Now, let's examine some of the coolest neighborhoods in the borough of Downtown Toronto. 

In [15]:
downtown_toronto.loc[13]

PostalCode                                            M5T
Borough                                  Downtown Toronto
Neighborhood    Chinatown, Grange Park, Kensington Market
Latitude                                          43.6532
Longitude                                           -79.4
Name: 13, dtype: object

For our analysis, we'll need the coordinates of the three neighborhoods: Chinatown, Grange Park and Kensington Market.

In [16]:
neighborhood_latitude = downtown_toronto.loc[13, 'Latitude'] 
neighborhood_longitude = downtown_toronto.loc[13, 'Longitude'] 
neighborhood_name = downtown_toronto.loc[13, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Chinatown, Grange Park, Kensington Market are 43.6532057, -79.4000493.


OK, now we are going to use the Foursquare credentials to get the data, in *json* format, about these neighborhoods. 

In [17]:
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    neighborhood_latitude,
    neighborhood_longitude,
    VERSION, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?client_id=DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F&client_secret=MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE&ll=43.6532057,-79.4000493&v=20180605&radius=500&limit=100'
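The same URL can also be assembled from a dictionary of query parameters, which makes it easier to see what each part means. A minimal sketch, where `build_explore_url` is a hypothetical helper and `'YOUR_ID'` / `'YOUR_SECRET'` are placeholders:

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng,
                      version='20180605', radius=500, limit=100):
    """Assemble the explore URL from a dict of query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),   # latitude,longitude of the search center
        'v': version,                     # API version date
        'radius': radius,                 # search radius in meters
        'limit': limit,                   # maximum number of venues returned
    }
    # safe=',' keeps the comma in the ll value unescaped
    return ('https://api.foursquare.com/v2/venues/explore?'
            + urlencode(params, safe=','))


example_url = build_explore_url('YOUR_ID', 'YOUR_SECRET', 43.6532057, -79.4000493)
print(example_url)
```

Apart from the placeholder credentials, this produces the same query string as the `format` call above.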

Let's see the results:

In [18]:
results = requests.get(url).json()
results

       'lng': -79.40268640008452}],
        'distance': 427,
        'postalCode': 'M5T 2M2',
        'cc': 'CA',
        'city': 'Toronto',
        'state': 'ON',
        'country': 'Canada',
        'formattedAddress': ['299 Augusta Avenue',
         'Toronto ON M5T 2M2',
         'Canada']},
       'categories': [{'id': '50327c8591d4c4b30a586d5d',
         'name': 'Brewery',
         'pluralName': 'Breweries',
         'shortName': 'Brewery',
         'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/brewery_',
          'suffix': '.png'},
         'primary': True}],
       'photos': {'count': 0, 'groups': []}},
      'referralId': 'e-0-5993586c31fd14044089f8c5-69'},
     {'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ddbe8697d8b771c0b09b885',
       'name': 'Dim Sum King Seafood Restaurant',
       'location': {'address': '421 Dundas 

We can get all the information we need from the *items* key (under `results['response']['groups'][0]`). Let's first define a function that extracts the category of a venue.

In [19]:
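def get_category_type(row):
    # The column is named 'categories' or 'venue.categories',
    # depending on how the JSON was flattened.
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # A venue can come back without any category.
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']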

Now, let's make sense of the data we got by cleaning json and turning it into a pandas dataframe.

In [20]:
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories',
                    'venue.location.lat', 'venue.location.lng']
                    
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(
    get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(11)

Unnamed: 0,name,categories,lat,lng
0,Kid Icarus,Arts & Crafts Store,43.653933,-79.401719
1,Seven Lives - Tacos y Mariscos,Mexican Restaurant,43.654418,-79.400545
2,Essence of Life Organics,Organic Grocery,43.654111,-79.400431
3,Jimmy's Coffee,Café,43.654493,-79.401311
4,Blackbird Baking Co,Bakery,43.654764,-79.400566
5,FIKA Cafe,Café,43.65356,-79.400402
6,The Moonbean Cafe,Café,43.654147,-79.400182
7,Banh Mi Nguyen Huong,Vietnamese Restaurant,43.653628,-79.398376
8,Little Pebbles,Coffee Shop,43.654883,-79.400264
9,Golden Patty,Caribbean Restaurant,43.654659,-79.401179
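To see what the flattening step does without calling the API, here is a sketch on a hand-made list shaped like `results['response']['groups'][0]['items']`. The two venues are invented, and only the keys used in this notebook are included.

```python
import pandas as pd

# Hand-made stand-in for the 'items' list; the values are invented.
items = [
    {'venue': {'name': 'Venue A',
               'categories': [{'name': 'Café'}],
               'location': {'lat': 43.65, 'lng': -79.40}}},
    {'venue': {'name': 'Venue B',
               'categories': [],            # a venue with no category
               'location': {'lat': 43.66, 'lng': -79.41}}},
]

# Nested dicts become dotted column names; lists are kept as cell values.
flat = pd.json_normalize(items)
print(sorted(flat.columns))

# Keep the name of the first category, or None when the list is empty.
flat['category'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)
print(flat[['venue.name', 'category']])
```

The dotted names such as `venue.location.lat` are why the notebook later keeps only the part after the last dot.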


In [21]:
nearby_venues.shape[0]

86

Great. We got 86 venues within 500 meters of a single postal code area that covers only three neighborhoods of Downtown Toronto.

But we want to explore all the neighborhoods in this borough. So let's do that. 
We'll repeat the same process for all the neighborhoods. 

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
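One thing to watch in the function above: `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A possible way around it is to flatten each response with a small helper that tolerates that case. `venue_rows` is a hypothetical name and the sample items are invented.

```python
def venue_rows(name, lat, lng, items):
    """Flatten one neighborhood's items into tuples, tolerating empty categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when the venue has no category instead of failing.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Invented sample: the second venue has no category.
sample_items = [
    {'venue': {'name': 'Venue A', 'categories': [{'name': 'Park'}],
               'location': {'lat': 43.68, 'lng': -79.37}}},
    {'venue': {'name': 'Venue B', 'categories': [],
               'location': {'lat': 43.67, 'lng': -79.38}}},
]

sample_rows = venue_rows('Rosedale', 43.679563, -79.377529, sample_items)
print(sample_rows)
```

Inside `getNearbyVenues`, the list comprehension could then be replaced by `venues_list.append(venue_rows(name, lat, lng, results))`.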

Cool. Now we can see the names of the different types of venues, alongside their coordinates, that are in this part of Toronto.

In [23]:
downtown_venues = getNearbyVenues(names=downtown_toronto['Neighborhood'],
                                  latitudes=downtown_toronto['Latitude'],
                                  longitudes=downtown_toronto['Longitude']
                                  )
downtown_venues.head(11)

Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Queen's Park


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rosedale,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Rosedale,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Rosedale,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Rosedale,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,"Cabbagetown, St. James Town",43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner
5,"Cabbagetown, St. James Town",43.667967,-79.367675,Butter Chicken Factory,43.667072,-79.369184,Indian Restaurant
6,"Cabbagetown, St. James Town",43.667967,-79.367675,Kingyo Toronto,43.665895,-79.368415,Japanese Restaurant
7,"Cabbagetown, St. James Town",43.667967,-79.367675,Merryberry Cafe + Bistro,43.66663,-79.368792,Café
8,"Cabbagetown, St. James Town",43.667967,-79.367675,F'Amelia,43.667536,-79.368613,Italian Restaurant
9,"Cabbagetown, St. James Town",43.667967,-79.367675,Murgatroid,43.667381,-79.369311,Restaurant


In [24]:
downtown_venues.shape[0]

1313

Wow, it's a big number!

OK, this is good. But let's see how many venues are in each neighborhood:

In [25]:
downtown_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,56,56,56,56,56,56
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",46,46,46,46,46,46
Central Bay Street,81,81,81,81,81,81
"Chinatown, Grange Park, Kensington Market",86,86,86,86,86,86
Christie,18,18,18,18,18,18
Church and Wellesley,86,86,86,86,86,86
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"Design Exchange, Toronto Dominion Centre",100,100,100,100,100,100


Now let's check how many unique categories of venues we have here:

In [26]:
len(downtown_venues['Venue Category'].unique())

204

Now let's enter the final stage of our project: a close analysis of all the neighborhoods in Downtown Toronto. We start by one-hot encoding the venue categories, so that each category becomes its own 0/1 column:

In [27]:
downtown_onehot = pd.get_dummies(
    downtown_venues[['Venue Category']], prefix="", prefix_sep="")
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood']

fixed_columns = [downtown_onehot.columns[-1]] + \
    list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head(11)

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
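Notice that the header of the table above starts with Yoga Studio rather than Neighborhood. One likely explanation (an assumption, since we can't inspect the full column list here) is that one of the venue categories is literally named "Neighborhood", so the assignment overwrites that dummy column instead of appending a new one, and the last category gets moved to the front. The sketch below shows one way to guard against such a name clash on invented toy data.

```python
import pandas as pd

# Toy venues, invented for illustration; one category is literally 'Neighborhood'.
venues = pd.DataFrame({
    'Neighborhood': ['Rosedale', 'Rosedale', 'Christie'],
    'Venue Category': ['Park', 'Neighborhood', 'Café'],
})

onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Rename the clashing category so it is kept as a feature.
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (category)'})

# insert() puts the label column first and raises if the name already exists.
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column per neighborhood is that category's share of venues.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Using `insert(0, ...)` also removes the need to reorder the columns afterwards.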


Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the relative frequency of each category in that neighborhood:

In [28]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.012346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,...,0.0,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0
5,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.034884,0.0,0.05814,0.011628,0.0,0.0,0.0
6,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Church and Wellesley,0.011628,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.011628,0.011628,0.0
8,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
9,"Design Exchange, Toronto Dominion Centre",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0


And the five most common venue categories in each neighborhood are:

In [29]:
num_top_venues = 5
for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood']
                            == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(
        drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0       Coffee Shop  0.07
1        Restaurant  0.05
2              Café  0.04
3   Thai Restaurant  0.04
4  Sushi Restaurant  0.03


----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1              Bakery  0.04
2                Café  0.04
3         Cheese Shop  0.04
4  Seafood Restaurant  0.04


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0   Airport Service  0.18
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3       Coffee Shop  0.06
4          Boutique  0.06


----Cabbagetown, St. James Town----
         venue  freq
0   Restaurant  0.07
1  Coffee Shop  0.07
2         Café  0.07
3          Pub  0.04
4  Pizza Place  0.04


----Central Bay Street----
                venue  freq
0         Coffee Shop  0.16
1  Italian Restaurant  0.05
2      Sandwich Place  0.04
3        Burger Joint  0.04
4      Ice Crea

That's cool. But now let's try something else: creating a pandas dataframe with the 10 most common venue categories in each neighborhood.

In [30]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

In [31]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']


columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # 4th and above
        columns.append('{}th Most Common Venue'.format(ind+1))

In [32]:
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        downtown_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(11)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Restaurant,Café,Thai Restaurant,Sushi Restaurant,Bar,Seafood Restaurant,Gastropub,Lounge,Cosmetics Shop
1,Berczy Park,Coffee Shop,Cheese Shop,Restaurant,Beer Bar,Seafood Restaurant,Farmers Market,Bakery,Cocktail Bar,Café,Greek Restaurant
2,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Rental Car Location,Boat or Ferry,Coffee Shop,Boutique,Bar,Airport Gate
3,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Café,Pub,Italian Restaurant,Pizza Place,Bakery,Chinese Restaurant,Breakfast Spot,Butcher
4,Central Bay Street,Coffee Shop,Italian Restaurant,Juice Bar,Japanese Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Sushi Restaurant,Department Store,Thai Restaurant
5,"Chinatown, Grange Park, Kensington Market",Bar,Vietnamese Restaurant,Café,Coffee Shop,Bakery,Chinese Restaurant,Vegetarian / Vegan Restaurant,Mexican Restaurant,Dumpling Restaurant,Dessert Shop
6,Christie,Grocery Store,Café,Park,Coffee Shop,Diner,Baby Store,Restaurant,Italian Restaurant,Candy Store,Nightclub
7,Church and Wellesley,Coffee Shop,Japanese Restaurant,Burger Joint,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Hotel,Café,Pub
8,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Gastropub,Deli / Bodega,Japanese Restaurant,Italian Restaurant
9,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Restaurant,Hotel,Bakery,Italian Restaurant,Bar,Gastropub,Seafood Restaurant,Japanese Restaurant
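The same ranking can be written without the explicit loop over rows. The sketch below is one possible alternative, not the approach used above: `top_categories` is a hypothetical helper, the toy frequencies are invented, and the column names are simplified to `#1`, `#2`, and so on.

```python
import pandas as pd

# Toy frequency table, invented for illustration.
grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'Coffee Shop': [0.5, 0.1],
    'Park': [0.3, 0.6],
    'Bar': [0.2, 0.3],
})


def top_categories(grouped, n):
    """Return the n most frequent categories per neighborhood (n <= number of categories)."""
    freqs = grouped.set_index('Neighborhood')
    # For each row, take the labels of the n largest values, in descending order.
    top = freqs.apply(lambda row: row.nlargest(n).index.tolist(),
                      axis=1, result_type='expand')
    top.columns = ['#{}'.format(i + 1) for i in range(n)]
    return top.reset_index()


top2 = top_categories(grouped, 2)
print(top2)
```

Ties are resolved by column order, so two categories with the same frequency may appear in a different order than with `sort_values`.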


Lastly, we'll use k-means to group these neighborhoods into 4 clusters, based on their category frequencies.

In [33]:
kclusters = 4
downtown_grouped_clustering = downtown_grouped.drop(columns='Neighborhood')  # positional axis was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(
    downtown_grouped_clustering)
kmeans.labels_[0:10]

array([1, 1, 2, 1, 1, 1, 3, 1, 1, 1])
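The label numbers themselves carry no meaning: they only say which rows ended up together. A toy sketch with invented frequency vectors shows the idea; `n_init` is passed explicitly because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors, invented: two rows dominated by the first
# category and two rows dominated by the last one.
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
toy_labels = toy_kmeans.labels_

# Rows with similar profiles share a label; which label is 0 or 1 is arbitrary.
print(toy_labels)
```

Fixing `random_state` makes the assignment reproducible from run to run, which is why the cell above sets it as well.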

Let's create a pandas dataframe with the cluster labels in it:

In [34]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
downtown_merged = downtown_toronto

downtown_merged = downtown_merged.join(
    neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

downtown_merged.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,0,Park,Playground,Trail,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,1,Coffee Shop,Restaurant,Café,Pub,Italian Restaurant,Pizza Place,Bakery,Chinese Restaurant,Breakfast Spot,Butcher
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,Coffee Shop,Japanese Restaurant,Burger Joint,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Hotel,Café,Pub
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,1,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Restaurant,Theater,Chocolate Shop
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937,1,Coffee Shop,Clothing Store,Japanese Restaurant,Bubble Tea Shop,Middle Eastern Restaurant,Café,Italian Restaurant,Electronics Store,Ramen Restaurant,Pizza Place
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Restaurant,Hotel,Diner,Breakfast Spot,Beer Bar,Clothing Store,Bakery,Cosmetics Shop
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1,Coffee Shop,Cheese Shop,Restaurant,Beer Bar,Seafood Restaurant,Farmers Market,Bakery,Cocktail Bar,Café,Greek Restaurant
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,1,Coffee Shop,Italian Restaurant,Juice Bar,Japanese Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Sushi Restaurant,Department Store,Thai Restaurant
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568,1,Coffee Shop,Restaurant,Café,Thai Restaurant,Sushi Restaurant,Bar,Seafood Restaurant,Gastropub,Lounge,Cosmetics Shop
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,1,Coffee Shop,Aquarium,Hotel,Café,Brewery,Sporting Goods Shop,Italian Restaurant,Restaurant,Fried Chicken Joint,Scenic Lookout


A picture is worth a thousand words, so we want to visualize our results.

In [35]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' +
                         str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
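In the cell above, the `ys` list is only used for its length, and `rainbow[cluster-1]` still gives each cluster its own color because index -1 wraps around to the last entry. A slightly more direct version, sketched here with a hypothetical helper `cluster_palette`, builds one color per label and indexes it by the label itself.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(k):
    """Return k evenly spaced hex colors from the rainbow colormap."""
    return [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]


palette = cluster_palette(4)

# Map each cluster label (0..k-1) straight to its color.
label_to_color = {label: palette[label] for label in range(len(palette))}
print(label_to_color)
```

In the map loop, `color=palette[cluster]` would then replace `color=rainbow[cluster-1]`.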

We can see 4 different colors, which correspond to the 4 clusters we got.

We're almost there. The last step is to show the values for each of these 4 clusters:

In [36]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0,
                     downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Park,Playground,Trail,Department Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center


In [37]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1,
                     downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,1,Coffee Shop,Restaurant,Café,Pub,Italian Restaurant,Pizza Place,Bakery,Chinese Restaurant,Breakfast Spot,Butcher
2,Downtown Toronto,1,Coffee Shop,Japanese Restaurant,Burger Joint,Gay Bar,Restaurant,Sushi Restaurant,Gastropub,Hotel,Café,Pub
3,Downtown Toronto,1,Coffee Shop,Park,Pub,Bakery,Café,Mexican Restaurant,Breakfast Spot,Restaurant,Theater,Chocolate Shop
4,Downtown Toronto,1,Coffee Shop,Clothing Store,Japanese Restaurant,Bubble Tea Shop,Middle Eastern Restaurant,Café,Italian Restaurant,Electronics Store,Ramen Restaurant,Pizza Place
5,Downtown Toronto,1,Coffee Shop,Café,Restaurant,Hotel,Diner,Breakfast Spot,Beer Bar,Clothing Store,Bakery,Cosmetics Shop
6,Downtown Toronto,1,Coffee Shop,Cheese Shop,Restaurant,Beer Bar,Seafood Restaurant,Farmers Market,Bakery,Cocktail Bar,Café,Greek Restaurant
7,Downtown Toronto,1,Coffee Shop,Italian Restaurant,Juice Bar,Japanese Restaurant,Sandwich Place,Ice Cream Shop,Burger Joint,Sushi Restaurant,Department Store,Thai Restaurant
8,Downtown Toronto,1,Coffee Shop,Restaurant,Café,Thai Restaurant,Sushi Restaurant,Bar,Seafood Restaurant,Gastropub,Lounge,Cosmetics Shop
9,Downtown Toronto,1,Coffee Shop,Aquarium,Hotel,Café,Brewery,Sporting Goods Shop,Italian Restaurant,Restaurant,Fried Chicken Joint,Scenic Lookout
10,Downtown Toronto,1,Coffee Shop,Café,Restaurant,Hotel,Bakery,Italian Restaurant,Bar,Gastropub,Seafood Restaurant,Japanese Restaurant


In [38]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2,
                     downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,2,Airport Service,Airport Terminal,Airport Lounge,Harbor / Marina,Rental Car Location,Boat or Ferry,Coffee Shop,Boutique,Bar,Airport Gate


In [39]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 3,
                     downtown_merged.columns[[1] + list(range(5, downtown_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Downtown Toronto,3,Grocery Store,Café,Park,Coffee Shop,Diner,Baby Store,Restaurant,Italian Restaurant,Candy Store,Nightclub


That's it, we're done. Thank you for your patience, and I hope you'll visit Toronto some day.