<h1 align="center" style="font-weight:bold;">Exploring, Segmenting, and Clustering Neighborhoods in Toronto</h1>

<h3 align="Justify" style="font-weight:bold;">Introduction</h3>

<p>In this assignment i am required to explore, segment, and clustering the neighborhoods in the City of Toronto.</p>

Now i use this Notebook to build the code to scrap the following Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

Before get the data and start exploring it, let's download all the dependencies that we will need.

In [2]:
import numpy as np #library to handle data in a vectorized manner
import pandas as pd #library for data analysis
import requests as rqt #library to handle requests 




In [2]:
pip install BeautifulSoup4

Collecting BeautifulSoup4
[?25l  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
[K     |████████████████████████████████| 122kB 8.2MB/s eta 0:00:01
[?25hCollecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.1 soupsieve-2.0.1
Note: you may need to restart the kernel to use updated packages.


<h3 style="font-weight:bold;">Download and Explore Dataset</h3> 

In [3]:
url_wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_rst = rqt.get(url_wiki).text



In [7]:
#Import BeatutifulSoup to pull data out of HTML Page
from bs4 import BeautifulSoup
soup = BeautifulSoup(wiki_rst,'html.parser')

#Now let's try to extract only table from the page 
table = soup.find('table',attrs={'class':'wikitable sortable'})


In [24]:
print(table.tr.text)


Postal Code

Borough

Neighbourhood



<h4 style='font-weight:bold;'>Extracting Data Text from the Table</h4>


In [23]:
#Now I define the table columns
headers =table.findAll('th')
for k,head in enumerate(headers):
    headers[k]=str(headers[k]).replace("<th>","").replace("</th>","").replace("\n","")

#Getting separated Data from table
rows=table.findAll('tr')
rows=rows[1:len(rows)]

#cleaning the data between rows 
for j, row in enumerate(rows): 
    rows[j] = str(rows[j]).replace("\n</td></tr>","").replace("<tr>\n<td>","")

#Making a the Dataframe 
df=pd.DataFrame(rows)
df[headers] = df[0].str.split("</td>\n<td>", n = 2, expand = True) 
df.drop(columns=[0],inplace=True)
   


In [16]:
#  Ignoring cells with a borough that is Not assigned
df = df.drop(df[(df.Borough == "Not assigned")].index)

# The neighborhood will be the same as the borough.If a cell has a borough but a Not assigned neighborhood
df.Neighbourhood.replace("Not assigned", df.Borough, inplace=True)

# copy Borough value to Neighborhood if NaN:
df.Neighbourhood.fillna(df.Borough, inplace=True)

#Eliminating duplicate rows from Dataframe
df=df.drop_duplicates()

#Printing the number of rows of the dataframe
df.shape

(180, 3)

<h4 style='font-weight:bold'>Extracting Titles from Columns</h4>
    

In [12]:
df.update(
    df.Neighbourhood.loc[
        lambda t: t.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))

df.update(
    df.Borough.loc[
        lambda t: t.str.contains('title')
    ].str.extract('title=\"([^\"]*)',expand=False))

In [22]:
#Delete Toronto from Neighbourhood
df.update(
    df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace(", Toronto",""))
df.update(
    df.Neighbourhood.loc[
        lambda x: x.str.contains('Toronto')
    ].str.replace("\(Toronto\)",""))
df


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods
3,M4A\n,North York\n,Victoria Village
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z\n,Not assigned\n,Not assigned\n
176,M6Z\n,Not assigned\n,Not assigned\n
177,M7Z\n,Not assigned\n,Not assigned\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."


In [47]:
#Rename the Postal Code Column to PostalCode
dfRen = df.rename(columns={'Postal Code':'PostalCode','Neighbourhood':'Neighborhood'},inplace=False)

#Combining multiple neighborhoods with the same post code
dfNew = pd.DataFrame({'PostalCode':dfRen.PostalCode.unique()})
dfNew['Borough']=pd.DataFrame(list(set(dfRen['Borough'].loc[dfRen['PostalCode'] == x['PostalCode']])) for i, x in dfNew.iterrows())
dfNew['Neighborhood']=pd.Series(list(set(dfRen['Neighborhood'].loc[dfRen['PostalCode'] == x['PostalCode']])) for i, x in dfNew.iterrows())
dfNew['Neighborhood']=dfNew['Neighborhood'].apply(lambda x: ', '.join(x))

#Removing \n parts from strings in a column 
dfNew['PostalCode'] = dfNew['PostalCode'].map(lambda x: x.rstrip('\n'))
dfNew['Borough'] = dfNew['Borough'].map(lambda x: x.rstrip('\n'))
dfNew = dfNew.drop(dfNew[(dfNew.Borough == "Not assigned")].index)
dfNew.head(10)


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [45]:
dfNew.shape

(103, 3)

<h3 style="font-weight:bold;">Now getting the latitude and the longitude coordinates of each neighborhood</h3>

In [53]:
#Reading the Geo-spacial from a csv file
dfG= pd.read_csv("http://cocl.us/Geospatial_data")
dfG.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
dfG.set_index("PostalCode")
dfNew.set_index("PostalCode")
geoData=pd.merge(dfNew, dfG)
geoData.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


<h3 style="font-weight:bold;">Exploring and clustering the neighborhoods in Toronto</h3>

<h4 style="font-weight:bold;">Using the Geopy Library to get the latitude and longitude values of Toronto</h4>

<p>Importing the libraries to convert an address into latitude and longitude values</p>

In [56]:
!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.datasets.samples_generator import make_blobs

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#Import the library to handle a Json file to trnasform it into a pandas dataframe
from pandas.io.json import json_normalize

#Importing a Folium Library 
import folium # map rendering library
print('Libraries imported')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Libraries imported


In [62]:
#Defining the address to get Latitude and Longitude of Toronto
address='Toronto, ON, Canada'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude_toronto = location.latitude
longitude_toronto = location.longitude
#Print Results
print('The geograpical coordinate of Toronto, ON, Canada are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto, ON, Canada are 43.6534817, -79.3839347.


In [65]:
#Create map of Toronto using latitude and Longitude
map_toronto = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=10)
# add markers to map
for lat, lng, borough, Neighborhood in zip(geoData['Latitude'], geoData['Longitude'], geoData['Borough'], geoData['Neighborhood']):
    label = '{}, {}'.format(Neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<h4 style="font-weight:bold;">Exploring the Neighborhoods and segment them with the Foursquare API</h4>

<p>Defining Foursquare credentials and Version</p>

In [67]:
CLIENT_ID='5OMA151KURBLFP0AZNN3AQICIDDTW5OBILH5PFLNKIZ2CYCO'
CLIENT_SECRET='KBC0MZ10HDABVB04EDP4GARODRYVR4MFEENH2AUROHGCPLTR'
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 5OMA151KURBLFP0AZNN3AQICIDDTW5OBILH5PFLNKIZ2CYCO
CLIENT_SECRET:KBC0MZ10HDABVB04EDP4GARODRYVR4MFEENH2AUROHGCPLTR


<h4 style="font-weight">Exploring the first neighborhood</h4>

In [69]:
geoData.loc[0,'Neighborhood']

'Parkwoods'

<h4 style="font-weight">Defining radius limit of venues to get</h4>

In [70]:
radius=500
LIMIT=100

<p>Get the Neighborhoods'latitude and longitude</p>

In [72]:
# neighborhood longitude value
neighborhood_lat = geoData.loc[0, 'Latitude'] # neighborhood longitude value
neighborhood_long = geoData.loc[0, 'Longitude'] 
# neighborhood name
neighborhood_name = geoData.loc[0, 'Neighborhood'] 

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_lat, 
                                                               neighborhood_long))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


<h4 style="font-weight:bold;">Get the top 100 venues</h4>

In [74]:
#LIMIT of number of venues returned
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_lat, 
    neighborhood_long, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=5OMA151KURBLFP0AZNN3AQICIDDTW5OBILH5PFLNKIZ2CYCO&client_secret=KBC0MZ10HDABVB04EDP4GARODRYVR4MFEENH2AUROHGCPLTR&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'

<h4 style="font-weight:bold;">Send a Get Request and examine results</h4>

In [76]:
results = rqt.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f4c2c0c10aef4346ce2883c'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [77]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [78]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,KFC,Fast Food Restaurant,43.754387,-79.333021
2,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [79]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


<h3 style="font-weight">Explore Neighborhoods in Toronto</h3>

In [80]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id=5OMA151KURBLFP0AZNN3AQICIDDTW5OBILH5PFLNKIZ2CYCO&client_secret=KBC0MZ10HDABVB04EDP4GARODRYVR4MFEENH2AUROHGCPLTR&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = rqt.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Now write the code to run the above function on each neighborhood and create a new dataframe.

In [82]:
toronto_venues = getNearbyVenues(names=geoData['Neighborhood'],
                                   latitudes=geoData['Latitude'],
                                   longitudes=geoData['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

#### Let's check the size of the resulting dataframe

In [83]:
print(toronto_venues.shape)
toronto_venues.head()

(309, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Brookbanks Park,43.751976,-79.33214,Park
4,Victoria Village,43.725882,-79.315572,KFC,43.754387,-79.333021,Fast Food Restaurant


Let's check how many venues were returned for each neighborhood

In [84]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,3,3,3,3,3,3
"Alderwood, Long Branch",3,3,3,3,3,3
"Bathurst Manor, Wilson Heights, Downsview North",3,3,3,3,3,3
Bayview Village,3,3,3,3,3,3
"Bedford Park, Lawrence Manor East",3,3,3,3,3,3
...,...,...,...,...,...,...
"Willowdale, Willowdale West",3,3,3,3,3,3
Woburn,3,3,3,3,3,3
Woodbine Heights,3,3,3,3,3,3
York Mills West,3,3,3,3,3,3


#### Let's find out how many unique categories can be curated from all the returned venues

In [85]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 3 uniques categories.


## Analyze Each Neighborhood

In [86]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,Fast Food Restaurant,Food & Drink Shop,Park
0,Parkwoods,0,0,1
1,Parkwoods,1,0,0
2,Parkwoods,0,1,0
3,Victoria Village,0,0,1
4,Victoria Village,1,0,0


#### And let's examine the new dataframe size.

In [87]:
toronto_onehot.shape

(309, 4)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [88]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Fast Food Restaurant,Food & Drink Shop,Park
0,Agincourt,0.333333,0.333333,0.333333
1,"Alderwood, Long Branch",0.333333,0.333333,0.333333
2,"Bathurst Manor, Wilson Heights, Downsview North",0.333333,0.333333,0.333333
3,Bayview Village,0.333333,0.333333,0.333333
4,"Bedford Park, Lawrence Manor East",0.333333,0.333333,0.333333
...,...,...,...,...
94,"Willowdale, Willowdale West",0.333333,0.333333,0.333333
95,Woburn,0.333333,0.333333,0.333333
96,Woodbine Heights,0.333333,0.333333,0.333333
97,York Mills West,0.333333,0.333333,0.333333


#### Let's confirm the new size

In [89]:
toronto_grouped.shape

(99, 4)

#### Let's print each neighborhood along with the top 5 most common venues

In [90]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Alderwood, Long Branch----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Bathurst Manor, Wilson Heights, Downsview North----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Bayview Village----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Bedford Park, Lawrence Manor East----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Berczy Park----
                  venue  freq
0  Fast Food Restaurant  0.33
1     Food & Drink Shop  0.33
2                  Park  0.33


----Birch Cliff, Cliffside West----
                  venue  freq
0  Fast Foo

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [91]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [93]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Agincourt,Park,Food & Drink Shop,Fast Food Restaurant
1,"Alderwood, Long Branch",Park,Food & Drink Shop,Fast Food Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Park,Food & Drink Shop,Fast Food Restaurant
3,Bayview Village,Park,Food & Drink Shop,Fast Food Restaurant
4,"Bedford Park, Lawrence Manor East",Park,Food & Drink Shop,Fast Food Restaurant


## Cluster Neighborhoods

Run *k*-means to cluster the neighborhood into 5 clusters.

In [94]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

  return_n_iter=True)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 3 venues for each neighborhood.

In [98]:
# add clustering labels
#neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = geoData

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,0,Park,Food & Drink Shop,Fast Food Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,0,Park,Food & Drink Shop,Fast Food Restaurant
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0,Park,Food & Drink Shop,Fast Food Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0,Park,Food & Drink Shop,Fast Food Restaurant
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0,Park,Food & Drink Shop,Fast Food Restaurant


Finally, let's visualize the resulting clusters

In [100]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Examine Clusters

Now, I can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster.

#### Cluster 1

In [101]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,North York,0,Park,Food & Drink Shop,Fast Food Restaurant
1,North York,0,Park,Food & Drink Shop,Fast Food Restaurant
2,Downtown Toronto,0,Park,Food & Drink Shop,Fast Food Restaurant
3,North York,0,Park,Food & Drink Shop,Fast Food Restaurant
4,Downtown Toronto,0,Park,Food & Drink Shop,Fast Food Restaurant
...,...,...,...,...,...
98,Etobicoke,0,Park,Food & Drink Shop,Fast Food Restaurant
99,Downtown Toronto,0,Park,Food & Drink Shop,Fast Food Restaurant
100,East Toronto,0,Park,Food & Drink Shop,Fast Food Restaurant
101,Etobicoke,0,Park,Food & Drink Shop,Fast Food Restaurant


#### Cluster 2

In [102]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue


#### Cluster 3

In [103]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue


#### Cluster 4

In [104]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue


#### Cluster 5

In [105]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
