## Segmenting and Clustering boroughs in Toronto, ON 

### In this notebook, I cluster and segment the boroughs of the city of Toronto. Let's start by importing the libraries.

In [2]:
import numpy as np 

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  24.08 MB/s
geopy-1.18.1-p 100% |################################| Time: 0:00:00  33.43 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  52.83 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  34.43 MB/s
vincent-0.4.4- 100% |###################

### Scraping the Wikipedia data.

In [3]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header = 0)

df = pd.DataFrame(data[0])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Rename Postcode to PostalCode.

In [5]:
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)

In [6]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [7]:
df.shape

(289, 3)

### Removing rows whose Borough value is "Not assigned".

In [8]:
df = df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [9]:
df.shape

(212, 3)

In [10]:
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [11]:
df.shape

(212, 3)
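
The filtering and re-indexing steps above could also be chained into a single expression. The following is a minimal sketch on a made-up three-row frame (the sample rows are illustrative and are not the scraped table):

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of the scraped table.
sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows with an assigned borough and renumber the index from 0.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned.shape)
```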

In [12]:
df.rename(columns={'Postcode': 'PostalCode', 'Borough': 'Borough', 'Neighbourhood': 'Neighbourhood'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [13]:
df = df.groupby(['PostalCode', 'Borough', 'Neighbourhood']).agg({'PostalCode':lambda x: ', '.join(tuple(x.tolist())),

                                     'Neighbourhood':lambda x: ', '.join(tuple(x.tolist()))}
                                   )
df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Neighbourhood,PostalCode
PostalCode,Borough,Neighbourhood,Unnamed: 3_level_1,Unnamed: 4_level_1
M1B,Scarborough,Malvern,Malvern,M1B
M1B,Scarborough,Rouge,Rouge,M1B
M1C,Scarborough,Highland Creek,Highland Creek,M1C
M1C,Scarborough,Port Union,Port Union,M1C
M1C,Scarborough,Rouge Hill,Rouge Hill,M1C


### Merging all Neighbourhood values that share the same Borough, keeping the first PostalCode of each borough.

In [14]:
df = df.groupby('Borough').agg({'PostalCode':'first', 
                             'Neighbourhood': ', '.join 
                              }).reset_index()
df.head(20)

Unnamed: 0,Borough,Neighbourhood,PostalCode
0,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",M4N
1,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",M4W
2,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",M4E
3,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig...",M4B
4,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto, ...",M8V
5,Mississauga,Canada Post Gateway Processing Centre,M7R
6,North York,"Hillcrest Village, Fairview, Henry Farm, Oriol...",M2H
7,Queen's Park,Not assigned,M7A
8,Scarborough,"Malvern, Rouge, Highland Creek, Port Union, Ro...",M1B
9,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",M6H


In [15]:
df.shape

(11, 3)
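
For comparison, if one row per postal code were wanted rather than one row per borough, the same kind of aggregation could be keyed on `PostalCode`. This is only a sketch on a few rows copied from the output above; the names `rows` and `per_code` are introduced here for illustration:

```python
import pandas as pd

# A few rows in the shape produced by the earlier cleaning steps.
rows = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C'],
    'Borough': ['Scarborough'] * 5,
    'Neighbourhood': ['Malvern', 'Rouge', 'Highland Creek', 'Port Union', 'Rouge Hill'],
})

# One row per (PostalCode, Borough); neighbourhood names joined with commas.
per_code = (rows.groupby(['PostalCode', 'Borough'], as_index=False)
                .agg({'Neighbourhood': ', '.join}))
print(per_code)
```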

### Reordering the columns.

In [16]:
columnsList=["PostalCode","Borough", "Neighbourhood"]
df=df.reindex(columns=columnsList)
df.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto..."
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ..."
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind..."
3,M4B,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig..."
4,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto, ..."
5,M7R,Mississauga,Canada Post Gateway Processing Centre
6,M2H,North York,"Hillcrest Village, Fairview, Henry Farm, Oriol..."
7,M7A,Queen's Park,Not assigned
8,M1B,Scarborough,"Malvern, Rouge, Highland Creek, Port Union, Ro..."
9,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,..."


In [17]:
# note: replace() returns a new dataframe; without assignment, df itself keeps 'Not assigned'
df.replace({'Neighbourhood': 'Not assigned'}, {'Neighbourhood': "Queen's Park"}, regex=True)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto..."
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ..."
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind..."
3,M4B,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig..."
4,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto, ..."
5,M7R,Mississauga,Canada Post Gateway Processing Centre
6,M2H,North York,"Hillcrest Village, Fairview, Henry Farm, Oriol..."
7,M7A,Queen's Park,Queen's Park
8,M1B,Scarborough,"Malvern, Rouge, Highland Creek, Port Union, Ro..."
9,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,..."
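
Because `DataFrame.replace` returns a new frame by default, a change only persists if it is assigned back. One possible alternative, sketched on a hypothetical two-row frame named `demo`, is to fill an unassigned neighbourhood from the borough name with a boolean mask:

```python
import pandas as pd

# Hypothetical two-row frame; only the first row needs fixing.
demo = pd.DataFrame({
    'Borough': ["Queen's Park", 'East York'],
    'Neighbourhood': ['Not assigned', 'Parkview Hill'],
})

# Overwrite unassigned neighbourhoods in place with the borough name.
mask = demo['Neighbourhood'] == 'Not assigned'
demo.loc[mask, 'Neighbourhood'] = demo.loc[mask, 'Borough']
print(demo)
```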


### Reading another file containing geospatial data of Toronto with respect to postal codes.

In [18]:
data2 = pd.read_csv('http://cocl.us/Geospatial_data', header = 0)

df2 = pd.DataFrame(data2)
df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Making the first column identical with the previous data frame in order to merge.

In [19]:
df2.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df2.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Merging the two dataframes to create a new dataframe df3.

In [20]:
df3 = pd.merge(df, df2, on="PostalCode")
df3.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",43.72802,-79.38879
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",43.679563,-79.377529
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",43.676357,-79.293031
3,M4B,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig...",43.706397,-79.309937
4,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto, ...",43.605647,-79.501321
5,M7R,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819
6,M2H,North York,"Hillcrest Village, Fairview, Henry Farm, Oriol...",43.803762,-79.363452
7,M7A,Queen's Park,Not assigned,43.662301,-79.389494
8,M1B,Scarborough,"Malvern, Rouge, Highland Creek, Port Union, Ro...",43.806686,-79.194353
9,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",43.669005,-79.442259
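
`pd.merge` performs an inner join by default, so postal codes that are missing from either frame are dropped without warning. The sketch below, on made-up rows (the `M9Z` code and the `Hypothetical` borough are invented for illustration), shows how a left join with `indicator=True` could surface such rows:

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M4N', 'M4W', 'M9Z'],  # 'M9Z' is a made-up code
    'Borough': ['Central Toronto', 'Downtown Toronto', 'Hypothetical'],
})
right = pd.DataFrame({
    'PostalCode': ['M4N', 'M4W'],
    'Latitude': [43.72802, 43.679563],
    'Longitude': [-79.38879, -79.377529],
})

# A left join keeps every row of `left`; the _merge column records the match status.
checked = pd.merge(left, right, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```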


In [21]:
# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)
df3.head(20)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",43.72802,-79.38879
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",43.679563,-79.377529
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",43.676357,-79.293031
3,M4B,East York,"Parkview Hill, Woodbine Gardens, Woodbine Heig...",43.706397,-79.309937
4,M8V,Etobicoke,"Humber Bay Shores, Mimico South, New Toronto, ...",43.605647,-79.501321
5,M7R,Mississauga,Canada Post Gateway Processing Centre,43.636966,-79.615819
6,M2H,North York,"Hillcrest Village, Fairview, Henry Farm, Oriol...",43.803762,-79.363452
7,M7A,Queen's Park,Not assigned,43.662301,-79.389494
8,M1B,Scarborough,"Malvern, Rouge, Highland Creek, Port Union, Ro...",43.806686,-79.194353
9,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",43.669005,-79.442259


### Fetching the geographical coordinates of Toronto city.

In [22]:
address = 'Toronto City'

geolocator = Nominatim(user_agent="tn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.7394839, -79.369314.


### Create map of Toronto using latitude and longitude values.

In [23]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map (iterate over df3; the `neighborhoods` frame instantiated above is empty)
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Consider only boroughs whose names contain "Toronto".

In [24]:
toronto_data = df3[df3['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",43.72802,-79.38879
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",43.679563,-79.377529
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",43.676357,-79.293031
3,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",43.669005,-79.442259


In [25]:
address = 'Toronto'

geolocator = Nominatim(user_agent="tn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of "Toronto" are {}, {}.'.format(latitude, longitude))

The geographical coordinates of "Toronto" are 43.653963, -79.387207.


### Creating a map of Toronto with the boroughs whose names contain "Toronto" superimposed on top.

In [26]:

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [27]:
CLIENT_ID = 'ACP2CI0OP4HWKIAAMMLRATKK2WE1GUO24BOY3HTPTTGGZLBI' # your Foursquare ID
CLIENT_SECRET = 'VFF1A3QNFR5XZWBFJRWSRIDDWXWTLI0UO1FRHIARV0VDBBFM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ACP2CI0OP4HWKIAAMMLRATKK2WE1GUO24BOY3HTPTTGGZLBI
CLIENT_SECRET: VFF1A3QNFR5XZWBFJRWSRIDDWXWTLI0UO1FRHIARV0VDBBFM
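
Hard-coding and printing credentials exposes them to anyone who sees the notebook. A minimal sketch of one alternative follows, assuming the keys are stored in environment variables; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `mask` are hypothetical and not part of the Foursquare API:

```python
import os

# Hypothetical environment variable names; any names would work.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')


def mask(secret, visible=4):
    """Show only the first few characters of a secret and star out the rest."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


print('CLIENT_ID: ' + mask(CLIENT_ID))
print('CLIENT_SECRET: ' + mask(CLIENT_SECRET))
```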


#### Let's explore the first borough in our dataframe.

In [31]:
toronto_data.loc[0, 'Borough']

'Central Toronto'

In [32]:
borough_latitude = toronto_data.loc[0, 'Latitude'] # Borough latitude value
borough_longitude = toronto_data.loc[0, 'Longitude'] # Borough longitude value

borough_name = toronto_data.loc[0, 'Borough'] # Borough name

print('Latitude and longitude values of {} are {}, {}.'.format(borough_name, 
                                                               borough_latitude, 
                                                               borough_longitude))

Latitude and longitude values of Central Toronto are 43.7280205, -79.3887901.


#### Now, let's get the top 100 venues within a radius of 500 meters of the borough's coordinates.

In [33]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    borough_latitude, 
    borough_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=ACP2CI0OP4HWKIAAMMLRATKK2WE1GUO24BOY3HTPTTGGZLBI&client_secret=VFF1A3QNFR5XZWBFJRWSRIDDWXWTLI0UO1FRHIARV0VDBBFM&v=20180605&ll=43.7280205,-79.3887901&radius=500&limit=100'
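
The same query string could also be assembled with the standard library's `urlencode`, which escapes parameter values and avoids a long positional `format` call. This is a sketch only: the helper name `build_explore_url` is introduced here, and the `'ID'` and `'SECRET'` values are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll parameter unescaped, as in the URL above.
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')


example_url = build_explore_url('ID', 'SECRET', '20180605', 43.7280205, -79.3887901, 500, 100)
print(example_url)
```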

In [34]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c76e6f59fb6b74141a53ec5'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-50e6da19e4b0d8a78a0e9794-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d163941735',
         'name': 'Park',
         'pluralName': 'Parks',
         'primary': True,
         'shortName': 'Park'}],
       'id': '50e6da19e4b0d8a78a0e9794',
       'location': {'address': '3055 Yonge Street',
        'cc': 'CA',
        'city': 'Toronto',
        'country': 'Canada',
        'crossStreet': 'Lawrence Avenue East',
        'distance': 465,
        'formattedAddress': ['3055 Yonge Street (Lawrence Avenue East)',
         'Toronto ON',
         'Canada'],
        'labeledLatLngs': [{

### Define a helper that extracts the category of each venue from the results.

In [35]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

### Clean the JSON with the get_category_type function borrowed from the Foursquare lab and structure it into a pandas dataframe.

In [36]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Lawrence Park Ravine,Park,43.726963,-79.394382
1,Dim Sum Deluxe,Dim Sum Restaurant,43.726953,-79.39426
2,Zodiac Swim School,Swim School,43.728532,-79.38286
3,TTC Bus #162 - Lawrence-Donway,Bus Line,43.728026,-79.382805


### Let's see how many venues were returned by Foursquare.

In [37]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


## Explore Boroughs in Toronto

#### Let's create a function to repeat the same process to all the boroughs in Toronto.

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
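
The list comprehension above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A more defensive flattening step might look like the sketch below; `flatten_items` is a hypothetical helper and `sample_items` is a made-up payload shaped like the response shown earlier.

```python
def flatten_items(name, lat, lng, items):
    """Turn Foursquare 'items' into flat tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when the venue carries no category at all.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up payload; the second venue deliberately has an empty category list.
sample_items = [
    {'venue': {'name': 'Lawrence Park Ravine',
               'location': {'lat': 43.726963, 'lng': -79.394382},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed spot',
               'location': {'lat': 43.7270, 'lng': -79.3940},
               'categories': []}},
]
rows = flatten_items('Central Toronto', 43.72802, -79.38879, sample_items)
print(rows)
```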

#### Run the above function on each borough and create a new dataframe called *borough_venues*.

In [39]:

borough_venues = getNearbyVenues(names=toronto_data['Borough'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Central Toronto
Downtown Toronto
East Toronto
West Toronto


In [44]:
print(toronto_data.shape)
toronto_data.head()

(4, 5)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",43.72802,-79.38879
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",43.679563,-79.377529
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",43.676357,-79.293031
3,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",43.669005,-79.442259


### Let's check how many venues were returned for each borough.

In [48]:
borough_venues.groupby('Borough').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Borough,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Central Toronto,4,4,4,4,4,4
Downtown Toronto,4,4,4,4,4,4
East Toronto,4,4,4,4,4,4
West Toronto,21,21,21,21,21,21


#### Let's find out how many unique categories can be curated from all the returned venues.

In [49]:
print('There are {} unique categories.'.format(len(borough_venues['Venue Category'].unique())))

There are 26 unique categories.


## Analyze Each Borough

In [50]:
# one hot encoding
toronto_onehot = pd.get_dummies(borough_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = borough_venues['Borough']

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Art Gallery,Bakery,Bank,Bar,Brewery,Bus Line,Café,Coffee Shop,Dim Sum Restaurant,Discount Store,Fast Food Restaurant,Gym / Fitness Center,Health Food Store,Liquor Store,Middle Eastern Restaurant,Music Venue,Neighborhood,Park,Pharmacy,Playground,Pool,Pub,Supermarket,Swim School,Trail,Wine Shop
0,Central Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,Central Toronto,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Central Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,Central Toronto,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Downtown Toronto,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [51]:
toronto_onehot.shape

(33, 27)

### Next, let's group rows by borough, taking the mean of the frequency of occurrence of each category.

In [52]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Art Gallery,Bakery,Bank,Bar,Brewery,Bus Line,Café,Coffee Shop,Dim Sum Restaurant,Discount Store,Fast Food Restaurant,Gym / Fitness Center,Health Food Store,Liquor Store,Middle Eastern Restaurant,Music Venue,Neighborhood,Park,Pharmacy,Playground,Pool,Pub,Supermarket,Swim School,Trail,Wine Shop
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0
1,Downtown Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0
3,West Toronto,0.047619,0.095238,0.047619,0.047619,0.047619,0.0,0.047619,0.0,0.0,0.095238,0.047619,0.047619,0.0,0.047619,0.047619,0.047619,0.0,0.047619,0.095238,0.0,0.047619,0.0,0.095238,0.0,0.0,0.047619


In [53]:
toronto_grouped.shape

(4, 27)
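
Taking the mean of one-hot columns per borough gives the share of that borough's venues falling in each category. The toy example below illustrates this on invented data (boroughs `A` and `B` are placeholders); `dtype=float` is passed so the result is numeric regardless of the pandas version's default dummy type.

```python
import pandas as pd

# Invented venues: four in borough A, two in borough B.
venues = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Trail', 'Playground', 'Pub', 'Bakery'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Borough', venues['Borough'])

# Mean of each indicator column = relative frequency within the borough.
freq = onehot.groupby('Borough').mean()
print(freq)
```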

#### Let's print each borough along with the top 5 most common venues.

In [54]:
num_top_venues = 5

for hood in toronto_grouped['Borough']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Toronto----
                venue  freq
0         Swim School  0.25
1            Bus Line  0.25
2  Dim Sum Restaurant  0.25
3                Park  0.25
4         Art Gallery  0.00


----Downtown Toronto----
         venue  freq
0         Park  0.50
1        Trail  0.25
2   Playground  0.25
3  Art Gallery  0.00
4       Bakery  0.00


----East Toronto----
               venue  freq
0                Pub  0.25
1        Coffee Shop  0.25
2       Neighborhood  0.25
3  Health Food Store  0.25
4        Art Gallery  0.00


----West Toronto----
            venue  freq
0     Supermarket  0.10
1        Pharmacy  0.10
2          Bakery  0.10
3  Discount Store  0.10
4     Art Gallery  0.05




### Let's put that into a *pandas* dataframe. First, a function that sorts a borough's venue categories in descending order of frequency.

In [55]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
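
Note that a function like this ranks every category, including those with a frequency of zero, so a borough with few distinct categories can end up with a "most common venue" that never occurs there. One possible variation, sketched on an invented row, drops zero-frequency categories before sorting; the name `top` is introduced only for this example.

```python
import pandas as pd

# Invented row in the same layout as a row of the grouped frame.
row = pd.Series({'Borough': 'Downtown Toronto', 'Park': 0.5,
                 'Playground': 0.25, 'Trail': 0.25, 'Wine Shop': 0.0})

# Skip the borough name, convert to floats, and keep only categories that occur.
freqs = row.iloc[1:].astype(float)
top = freqs[freqs > 0].sort_values(ascending=False).index.tolist()
print(top)
```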

### Let's create the new dataframe and display the most common venues for all the boroughs.

In [57]:
num_top_venues = 4

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
borough_venues_sorted = pd.DataFrame(columns=columns)
borough_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    borough_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

borough_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,Central Toronto,Swim School,Bus Line,Park,Dim Sum Restaurant
1,Downtown Toronto,Park,Playground,Trail,Wine Shop
2,East Toronto,Health Food Store,Pub,Neighborhood,Coffee Shop
3,West Toronto,Bakery,Supermarket,Pharmacy,Discount Store
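
The `try`/`except` around the `indicators` list works for small numbers, but an explicit helper makes the suffix rule easier to read and also handles values such as 11 or 21. A small sketch, where `ordinal` is a hypothetical helper name:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# Build the same column labels as above for four top venues.
columns = ['Borough'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 5)]
print(columns)
```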


## Cluster Boroughs

### Run k-means to cluster the boroughs into 4 clusters.

In [59]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Borough', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 1, 3], dtype=int32)
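
When the number of clusters equals the number of rows, as it does here with four boroughs, each row ends up in its own cluster, which matches the four distinct labels above. The sketch below illustrates this on four made-up points; the array `X` is invented and is not the borough data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up, distinct points standing in for four boroughs.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.0, 0.0]])

# With n_clusters equal to the number of rows, every point becomes its own centroid.
model = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X)
print(sorted(model.labels_.tolist()))
print(model.inertia_)
```

With fewer clusters than rows, the labels would start to group similar rows together, which is usually the more informative setting.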

### Let's create a new dataframe that includes the cluster for each borough.

In [60]:
# add clustering labels
borough_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# join borough_venues_sorted with toronto_data to add the cluster label and most common venues for each borough
toronto_merged = toronto_merged.join(borough_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,M4N,Central Toronto,"Lawrence Park, Davisville North, North Toronto...",43.72802,-79.38879,0,Swim School,Bus Line,Park,Dim Sum Restaurant
1,M4W,Downtown Toronto,"Rosedale, Cabbagetown, St. James Town, Church ...",43.679563,-79.377529,2,Park,Playground,Trail,Wine Shop
2,M4E,East Toronto,"The Beaches, Riverdale, The Danforth West, Ind...",43.676357,-79.293031,1,Health Food Store,Pub,Neighborhood,Coffee Shop
3,M6H,West Toronto,"Dovercourt Village, Dufferin, Little Portugal,...",43.669005,-79.442259,3,Bakery,Supermarket,Pharmacy,Discount Store


### Let's visualize the resulting clusters.

In [61]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

### Let's examine each cluster and determine the venue categories that distinguish it. Because each cluster contains exactly one borough, the clusters are named Central Toronto, East Toronto, Downtown Toronto and West Toronto respectively.

### Central Toronto

In [62]:
cluster1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster1.head(3)

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,Central Toronto,0,Swim School,Bus Line,Park,Dim Sum Restaurant


### East Toronto

In [63]:
cluster2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster2.head(3)

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
2,East Toronto,1,Health Food Store,Pub,Neighborhood,Coffee Shop


### Downtown Toronto

In [64]:
cluster3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster3.head(3)

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
1,Downtown Toronto,2,Park,Playground,Trail,Wine Shop


### West Toronto

In [65]:
cluster4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
cluster4.head()

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
3,West Toronto,3,Bakery,Supermarket,Pharmacy,Discount Store
