# Intro

 1. **Problem & background:** An entrepreneur wants to open a coffee shop in Toronto and needs a recommendation on where to open it,
    based on competition, proximity to foot traffic, and population.
    
 2. **Data & usage:**  We will use several data sources to explore neighborhoods and postal codes.
 
 a) 2016 Census of Canada population counts: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/File.cfm?T=1201&SR=1&RPP=9999&PR=0&CMA=0&CSD=0&S=22&O=A&Lang=Eng&OFT=CSV
     -- as local population is a good indicator of local foot traffic
  
  b) Postal codes and their neighborhoods: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  
      -- as people may be more familiar with neighborhoods than with postal codes
      
  c) Foursquare API data (Foursquare developer credentials required) 
  -- We will pull the latitudes and longitudes of all venues within 500 metres of each postal code (as an indication of foot traffic)
  -- We will also pull the categories of these venues and count how many are coffee shops or cafés, to understand the competition in the area. 
  
 3. **Assumptions:**  
  
  a) Venues that are not coffee shops (e.g. parks, businesses, restaurants) create foot traffic for coffee shops  
  b) Coffee shops within the same postcode reduce our foot traffic, as they are competition  
  c) A percentage of the population of a postcode counts as foot traffic for a coffee shop in that postcode
 
  4. **Audience:**  
  An entrepreneur looking to open a coffee shop


# Methodology
 1. We collected data on Toronto neighborhoods by postal code, and population by postal code. 
 2. We then enriched the data with all venues within a 500 metre radius of each postal code.
 3. We also enriched our data by identifying all coffee shops and cafés among those venues.
 4. We estimated the foot traffic for a new coffee shop in each postal code as the sum of: 
    a) venue foot traffic: (total population of Toronto / total venues in Toronto) * non-coffee venues in the postal code / (coffee shops in the postal code + 1) * 1% (an estimate of how much foot traffic we would get from nearby venues, e.g. parks, schools, restaurants)
    b) population foot traffic: total population of the postal code / (coffee shops in the postal code + 1) * 1%
 5. We used a cluster analysis to group postal codes from lowest to highest total foot traffic
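
The estimate in step 4 can be sketched as a small function. This is a minimal sketch, assuming the 1% rate and the "+1" for our own shop that the foot-traffic cell later in this notebook uses; the function name `estimate_foot_traffic` and its parameter names are hypothetical, and the example inputs are made up.

```python
def estimate_foot_traffic(pop, venue_count, coffee_shops,
                          population_per_venue, rate=0.01):
    """Return (population, venue, total) foot traffic for one postal code."""
    # Our shop shares the traffic with the existing coffee shops, hence the +1
    competitors = coffee_shops + 1
    pop_traffic = pop * rate / competitors
    # Only venues that are not coffee shops are assumed to bring traffic
    non_coffee_venues = venue_count - coffee_shops
    venue_traffic = population_per_venue * non_coffee_venues / competitors * rate
    return pop_traffic, venue_traffic, pop_traffic + venue_traffic


# Made-up inputs: 10,000 residents, 50 venues, 4 of them coffee shops
pop_t, venue_t, total_t = estimate_foot_traffic(10_000, 50, 4,
                                                population_per_venue=250)
print(pop_t, venue_t, total_t)
```

More competitors shrink both terms, while more non-coffee venues only raise the venue term.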

   

## Collecting Neighborhood Data for Toronto
Scrape Wikipedia for all of Toronto's postal codes, then join the census population data to each postal code.

In [1]:
# let's import the libraries required for scraping and data processing

#convert data to a dataframe and process it
import pandas as pd
import numpy as np

#web requests and scraping with BeautifulSoup
import requests
import urllib.request
from bs4 import BeautifulSoup # to parse the website files
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
 


In [2]:
#request data from url, sending the User-Agent header defined above
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
request = requests.get(url, headers=headers)

soup = BeautifulSoup(request.text, 'lxml')
# print(soup.prettify()) to find table tags location

In [3]:
#from the prettify output, or by inspecting the page, we know the table we want has class 'wikitable sortable'
table = soup.find('table',{'class':'wikitable sortable'}) # find the table
ths=table.find_all('th') # find the header cells, which hold the headings we want



In [4]:
# Saving the heading we want, requires us to search through the tables for the headings and save as column_name
column_name=[]
for th in ths:
    column_name.append(th.text.strip())
print(column_name)


['Postal Code', 'Borough', 'Neighborhood']


In [5]:
# Search through the table rows for the cells we want and save them as a list of rows
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text.strip())
    output_rows.append(output_row)
# skip the first row, which holds the headings and so has no <td> cells
table=output_rows[1:]

In [6]:
# convert table to DF
df_postcode=pd.DataFrame(table, columns =column_name)

In [7]:
#looking at describe, we can see that there are a lot of postal codes not assigned to Boroughs or Neighborhoods. 
df_postcode.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,180,180,180
unique,180,11,100
top,M4X,Not assigned,Not assigned
freq,1,77,77


In [10]:
#let's change the column name
df_postcode.columns=['Postcode','Borough','Neighborhood']
#let's preview the data
df_postcode.head()


Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


# Joining population data from census 2016
Let's download the CSV and save it as population.csv.

We downloaded the CSV from 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/File.cfm?T=1201&SR=1&RPP=9999&PR=0&CMA=0&CSD=0&S=22&O=A&Lang=Eng&OFT=CSV'
and deleted the header and footer rows we didn't need.
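
Instead of deleting the header and footer rows by hand, the rows can be filtered by the shape of the geographic code. This is a minimal sketch, assuming that data rows start with a three-character code (letter, digit, letter) and that everything else is header or footer text. The helper name `keep_fsa_rows` and the sample text are made up for illustration and are not the real file contents.

```python
import csv
import io
import re

# Letter, digit, letter: the first three characters of a postal code, e.g. M5H
FSA_PATTERN = re.compile(r'^[A-Z]\d[A-Z]$')


def keep_fsa_rows(text):
    """Return only the CSV rows whose first field looks like a postal code prefix."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row and FSA_PATTERN.match(row[0].strip()):
            rows.append(row)
    return rows


sample = '''Some title line
Geographic code,Geographic name,Province or territory
A0A,A0A,Newfoundland and Labrador
M5H,M5H,Ontario

Note: an illustrative footer line
'''
rows = keep_fsa_rows(sample)
print(rows)
```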

In [11]:
#let's read this file into a dataframe
population_df = pd.read_csv('population.csv')
population_df.head()

Unnamed: 0,Geographic code,Geographic name,Province or territory,"Incompletely enumerated Indian reserves and Indian settlements, 2016","Population, 2016","Total private dwellings, 2016","Private dwellings occupied by usual residents, 2016"
0,A0A,A0A,Newfoundland and Labrador,,46587,26155,19426
1,A0B,A0B,Newfoundland and Labrador,,19792,13658,8792
2,A0C,A0C,Newfoundland and Labrador,,12587,8010,5606
3,A0E,A0E,Newfoundland and Labrador,,22294,12293,9603
4,A0G,A0G,Newfoundland and Labrador,,35266,21750,15200


In [12]:
#let's drop the columns we don't need
population_df=population_df[['Geographic code','Province or territory','Population, 2016']]
#let's rename our remaining columns ('State' holds the province or territory)
population_df.columns=['Postcode','State','pop']
#let's also only look at geographic codes starting with 'M' (Toronto)
population_df=population_df[population_df.Postcode.str[0] == 'M']
population_df.shape

(102, 3)

In [13]:
#Let's merge our 2 data frames together under population_df
population_df=pd.merge(df_postcode,population_df,on='Postcode',how='left')

In [14]:
#let's investigate how many postal codes in df_postcode have no population match
print('postal codes without population matched:',population_df.State.isna().sum())

postal codes without population matched: 78
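
pandas can also report unmatched rows directly through the `indicator` option of `merge`, which adds a `_merge` column. The sketch below runs on toy data (the two frames are made up, not the notebook's data) and lists the postal codes that found no population row.

```python
import pandas as pd

# Toy stand-ins for df_postcode and the census table
postcodes = pd.DataFrame({'Postcode': ['M1A', 'M3A', 'M7R'],
                          'Borough': ['Not assigned', 'North York', 'Mississauga']})
population = pd.DataFrame({'Postcode': ['M3A'], 'pop': [1000]})

# indicator=True marks each row as 'both', 'left_only' or 'right_only'
merged = pd.merge(postcodes, population, on='Postcode', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```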


In [15]:
#Let's drop all rows without assigned Boroughs
population_df = population_df[population_df.Borough != 'Not assigned']

In [16]:
#Re-running the postal code match check
print('postal codes without population matched:',population_df.State.isna().sum())

postal codes without population matched: 1


In [17]:
#let's investigate the missing value
df_na = population_df[population_df.isna().any(axis=1)]
df_na.head()

Unnamed: 0,Postcode,Borough,Neighborhood,State,pop
114,M7R,Mississauga,Canada Post Gateway Processing Centre,,


In [18]:
#let's drop that row where State is NaN as well
population_df = population_df.dropna(subset=["State"])


In [19]:
#let's keep only the rows where population is greater than 1000
population_df=population_df[population_df['pop']>1000]

In [20]:
#let's sort by population
population_df.sort_values(by=['pop'],ascending =True, inplace=True)

In [21]:
# Note: lat/long is not in the format Folium uses, so use a different mapping library
# import pgeocode to look up lat/long (run !pip install pgeocode if the module is not found)
import pgeocode
nomi = pgeocode.Nominatim('ca') 

#reset the index and drop the old one
population_df.reset_index(inplace=True, drop=True) 

#loop over the postal codes and collect the lat and long of each one

lat=[]
long=[]

for code in population_df['Postcode']:
    location = nomi.query_postal_code(code)
    lat.append(location.latitude)
    long.append(location.longitude)

In [135]:
#add the lat and long lists to population_df as new columns (same row order as the loop above)
population_df['Latitude']=lat
population_df['Longitude']=long
population_df.head()

Unnamed: 0,Postcode,Total_Venues,Borough,Neighborhood,State,pop,Latitude,Longitude
0,M1B,122,Scarborough,"Malvern, Rouge",Ontario,66108.0,43.6496,-79.3833
1,M1C,118,Scarborough,"Rouge Hill, Port Union, Highland Creek",Ontario,35626.0,43.6513,-79.3756
2,M1E,122,Scarborough,"Guildwood, Morningside, West Hill",Ontario,46943.0,43.739,-79.4692
3,M1G,122,Scarborough,Woburn,Ontario,29690.0,43.75,-79.3978
4,M1H,88,Scarborough,Cedarbrae,Ontario,24383.0,43.6564,-79.386


# Enriching Data with Foursquare to find the businesses that exist in each area

In [23]:
import foursquare_secrets1

In [24]:
#set foursquare dev credentials 
CLIENT_ID = foursquare_secrets1.secrets['CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = foursquare_secrets1.secrets['CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

## Testing out Foursquare

In [30]:
neighborhood_latitude = population_df.loc[0,'Latitude'] # neighborhood latitude value
neighborhood_longitude = population_df.loc[0,'Longitude'] # neighborhood longitude value

neighborhood_name = population_df.loc[0,'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Richmond, Adelaide, King are 43.6496, -79.3833.


In [155]:
#radius is measured in metres, LIMIT is the maximum number of search results
radius = 500
LIMIT=200
search_query = ''
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=43.6496,-79.3833&v=20180605&query=&radius=500&limit=200'
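
Displaying the full URL also displays the credentials in the notebook output. One way to avoid that, sketched below under the assumption that the same query parameters are used, is to keep the parameters in a dict and print a copy with the secrets masked. The helper name `masked_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/search'


def masked_url(params, secret_keys=('client_id', 'client_secret')):
    """Return the request URL with credential values replaced by '***'."""
    safe_params = {k: ('***' if k in secret_keys else v) for k, v in params.items()}
    # Leave '*' and ',' unescaped so the printed URL stays readable
    return BASE_URL + '?' + urlencode(safe_params, safe='*,')


params = {
    'client_id': 'YOUR_CLIENT_ID',          # placeholder, not a real credential
    'client_secret': 'YOUR_CLIENT_SECRET',  # placeholder, not a real credential
    'll': '43.6496,-79.3833',
    'v': '20180605',
    'query': '',
    'radius': 500,
    'limit': 200,
}
print(masked_url(params))
```

The real request can then pass the unmasked dict directly, for example `requests.get(BASE_URL, params=params)`, so the secrets never need to appear in a printed string.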

In [156]:
results = requests.get(url).json()


In [157]:
#number of venues passed back by Foursquare
len(results['response']['venues'])

125

In [469]:
results['response']['venues'][0]

{'id': '4f7387f4e4b053123ff596aa',
 'name': 'Baker & Company',
 'location': {'address': '130 Adelaide Street West',
  'crossStreet': 'York Street',
  'lat': 43.648165652052725,
  'lng': -79.38215476105805,
  'labeledLatLngs': [{'label': 'display',
    'lat': 43.648165652052725,
    'lng': -79.38215476105805}],
  'distance': 184,
  'postalCode': 'M5H 3P5',
  'cc': 'CA',
  'city': 'Toronto',
  'state': 'ON',
  'country': 'Canada',
  'formattedAddress': ['130 Adelaide Street West (York Street)',
   'Toronto ON M5H 3P5',
   'Canada']},
 'categories': [{'id': '4bf58dd8d48988d124941735',
   'name': 'Office',
   'pluralName': 'Offices',
   'shortName': 'Office',
   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/default_',
    'suffix': '.png'},
   'primary': True}],
 'referralId': 'v-1593698662',
 'hasPerk': False}
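
The `distance` field appears to be the distance in metres between the query point and the venue. As a sanity check, it can be approximated with the haversine formula. The sketch below assumes a spherical Earth with a radius of 6,371 km, so it should only be expected to agree with the reported 184 to within a few metres. The function name `haversine_m` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6_371_000):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))


# Query point used above and the coordinates of the first venue returned
d = haversine_m(43.6496, -79.3833, 43.648165652052725, -79.38215476105805)
print(round(d))
```

The same function could be used to confirm that every returned venue really lies within the 500 metre radius.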

In [81]:
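# function that extracts the name of the first category of a venue (None if it has no category)
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the column name used when the JSON is nested under 'venue'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']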

In [103]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a flat pandas dataframe

venues = results['response']['venues']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]
 
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

#drop initial categories
nearby_venues=nearby_venues[['name','venue.categories','location.lat','location.lng']]


# clean columns by removing column name before period
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Baker & Company,Office,43.648166,-79.382155
1,JJ Bean,Coffee Shop,43.649595,-79.383383
2,EY Tower,Office,43.649766,-79.382473
3,Richmond Adelaide Centre,Office,43.649443,-79.383355
4,Starbucks,Coffee Shop,43.649928,-79.383247


In [102]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

125 venues were returned by Foursquare.


In [127]:
nearby_venues.categories.value_counts()

Office                           31
Coffee Shop                       7
Bank                              4
Mobile Phone Shop                 3
Building                          3
Asian Restaurant                  3
Restaurant                        3
Optical Shop                      2
Dentist's Office                  2
Parking                           2
Doctor's Office                   2
Mediterranean Restaurant          2
Pizza Place                       2
Salon / Barbershop                2
Convenience Store                 2
Miscellaneous Shop                2
Travel Agency                     2
Indian Restaurant                 2
Art Gallery                       2
Financial or Legal Service        2
Meeting Room                      2
Arts & Crafts Store               1
Steakhouse                        1
Recruiting Agency                 1
Medical Center                    1
Clothing Store                    1
Newsstand                         1
Fast Food Restaurant        

# Digging into postal codes

In [311]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
  
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['venues']
        
  
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
            v['categories']) for v in results])
        

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
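
If a request fails or the payload lacks the expected keys, indexing with `['response']['venues']` raises an exception and the whole loop stops. A defensive variant of the flattening step might look like the sketch below. `venue_rows` is a hypothetical helper, and the payload shape is assumed to match the example venue shown earlier.

```python
def venue_rows(postcode, lat, lng, payload):
    """Flatten one search response into rows, tolerating missing keys."""
    rows = []
    for v in payload.get('response', {}).get('venues', []):
        location = v.get('location', {})
        categories = v.get('categories') or []
        # Keep the first category name, or 'none' when the venue has no category
        category = categories[0]['name'] if categories else 'none'
        rows.append((postcode, lat, lng, v.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


sample = {'response': {'venues': [
    {'name': 'Baker & Company',
     'location': {'lat': 43.648165652052725, 'lng': -79.38215476105805},
     'categories': [{'name': 'Office'}]},
    {'name': 'Uncategorised venue', 'location': {}, 'categories': []},
]}}
print(venue_rows('M5H', 43.6496, -79.3833, sample))
print(venue_rows('M5H', 43.6496, -79.3833, {'meta': {'code': 429}}))  # -> []
```

Returning an empty list for a bad response lets the remaining postal codes still be processed.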

In [312]:
Toronto_venues = getNearbyVenues(names=population_df['Postcode'],
                                   latitudes=population_df['Latitude'],
                                   longitudes=population_df['Longitude']
                                  )

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


In [313]:
print(Toronto_venues.shape)
len(Toronto_venues)

(10748, 7)


10748

In [314]:
# clean up category for each row: keep the name of the first category, or 'none' if there is no category
categories=[]
for venue_categories in Toronto_venues['Venue Category']:
    if len(venue_categories) == 0:
        categories.append('none')
    else:
        categories.append(venue_categories[0]['name'])
     


In [410]:
#add categories list to Toronto_venues dataframe and drop the old categories column
#errors='ignore' keeps this cell safe to re-run once the old column is already gone
Toronto_venues['categories']=categories
Toronto_venues.drop(columns=['Venue Category'], inplace=True, errors='ignore')

# We will calculate how many venues are within a 500 metre radius of each postal code

In [470]:
Toronto_venues.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,categories
0,M1B,43.6496,-79.3833,Baker & Company,43.648166,-79.382155,Office
1,M1B,43.6496,-79.3833,JJ Bean,43.649595,-79.383383,Coffee Shop
2,M1B,43.6496,-79.3833,Richmond Adelaide Centre,43.649443,-79.383355,Office
3,M1B,43.6496,-79.3833,Starbucks,43.649928,-79.383247,Coffee Shop
4,M1B,43.6496,-79.3833,EY Tower,43.649766,-79.382473,Office


In [471]:
#group venues by postcode and count them; we also drop the columns we don't need and rename Venue to Venue_count
venue_count=Toronto_venues.groupby('Postcode').count().reset_index()
venue_count=venue_count[['Postcode','Venue']]
venue_count.columns=['Postcode','Venue_count']
venue_count.head()

Unnamed: 0,Postcode,Venue_count
0,M1B,125
1,M1C,122
2,M1E,121
3,M1G,46
4,M1H,124


In [472]:
#Let's merge our 2 data frames together under final_df on Postcode
final_df=pd.merge(population_df,venue_count,on='Postcode',how='left')


In [478]:
final_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,State,pop,Latitude,Longitude,Venue_count
0,M1B,Scarborough,"Malvern, Rouge",Ontario,66108.0,43.6496,-79.3833,125
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",Ontario,35626.0,43.6513,-79.3756,122
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",Ontario,46943.0,43.739,-79.4692,121
3,M1G,Scarborough,Woburn,Ontario,29690.0,43.75,-79.3978,46
4,M1H,Scarborough,Cedarbrae,Ontario,24383.0,43.6564,-79.386,124


# Let's measure competition within each Postcode by counting Coffee and Cafe Venues


In [480]:
#venue by neighborhood
Competition_df= Toronto_venues[(Toronto_venues['categories']=='Coffee Shop') | (Toronto_venues['categories']=='Café')]

#group all competition by postcode 
Competition_df=Competition_df.groupby('Postcode').count().reset_index()

#drop columns we don't need
Competition_df=Competition_df[['Postcode', 'Venue']]
                             
#rename competition column
Competition_df.columns=['Postcode', 'Other_coffee_shops']
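
The same count can be written more compactly with `isin` and `size`. The sketch below runs on toy data (the venue names and their categories are made up). Reindexing also gives postcodes without any coffee shop an explicit 0.

```python
import pandas as pd

# Toy stand-in for Toronto_venues
venues = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1B', 'M1C', 'M1E'],
    'Venue': ['A', 'B', 'C', 'D', 'E'],
    'categories': ['Coffee Shop', 'Office', 'Café', 'Bank', 'Coffee Shop'],
})

# True for every venue that competes with a new coffee shop
is_competitor = venues['categories'].isin(['Coffee Shop', 'Café'])

competition = (venues[is_competitor]
               .groupby('Postcode')
               .size()
               .reindex(venues['Postcode'].unique(), fill_value=0)
               .rename('Other_coffee_shops'))
print(competition.to_dict())
```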


In [481]:
#Let's merge this data  together under final_df on Postcode
final_df=pd.merge(final_df,Competition_df,on='Postcode',how='left')


In [527]:
#fill all NaN values in Other_coffee_shops with 0 so we can do arithmetic on the column
final_df['Other_coffee_shops'] = final_df['Other_coffee_shops'].fillna(value=0)

In [528]:
#Let's find out what the population to venue density is for all of Toronto

Toronto_pop= int(final_df['pop'].sum())
print('Toronto has a population of :', Toronto_pop)

#use a new name here so the Toronto_venues dataframe is not overwritten
total_venues= int(venue_count['Venue_count'].sum())
print('Toronto has {} venues'. format(total_venues)) 

Population_per_venue= Toronto_pop/total_venues
print('Population/Venue:', Population_per_venue)

Toronto has a population of : 2732094
Toronto has 10748 venues
Population/Venue: 254.19557126907333


In [536]:
#We will assume 1% of the postal code population becomes foot traffic, shared between the existing coffee shops and ours (hence the +1)
#For simplicity, we will also assume each non-competing venue brings in foot traffic equivalent to 1% of Population_per_venue

foot_traffic= .01
final_df['pop_foot_traffic']=final_df['pop']*foot_traffic/(final_df['Other_coffee_shops']+1)
final_df['venu_foot_traffic']= Population_per_venue *(final_df['Venue_count']-final_df['Other_coffee_shops'])/(final_df['Other_coffee_shops']+1) *foot_traffic
 

In [540]:
final_df['total_foot_traffic']=final_df['pop_foot_traffic']+final_df['venu_foot_traffic']
final_df.sort_values(by=['total_foot_traffic'], ascending= True, inplace=True)

In [549]:
final_df.reset_index(drop=True, inplace=True)
final_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,State,pop,Latitude,Longitude,Venue_count,Other_coffee_shops,pop_foot_traffic,venu_foot_traffic,total_foot_traffic
0,M5E,Downtown Toronto,Berczy Park,Ontario,9118.0,43.6683,-79.4205,106,7.0,11.3975,31.456702,42.854202
1,M3M,North York,Downsview,Ontario,24046.0,43.6684,-79.3689,126,11.0,20.038333,24.360409,44.398742
2,M5H,Downtown Toronto,"Richmond, Adelaide, King",Ontario,2005.0,43.6505,-79.5517,115,4.0,4.01,56.431417,60.441417
3,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",Ontario,10787.0,43.6934,-79.4857,116,5.0,17.978333,47.026181,65.004514
4,M5M,North York,"Bedford Park, Lawrence Manor East",Ontario,25975.0,43.648,-79.4177,120,7.0,32.46875,35.905124,68.373874


# Cluster Analysis

In [456]:
# import k-means for the clustering stage
from sklearn.cluster import KMeans

In [550]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = final_df[['total_foot_traffic']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
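
Because the clustering uses a single feature, k-means here amounts to splitting the postal codes into contiguous bands of foot traffic. The sketch below is a plain-Python version of Lloyd's algorithm in one dimension, meant only to illustrate the idea. It is not scikit-learn's implementation, and the quantile-based starting centres, the function name `kmeans_1d` and the sample values are assumptions made for the example.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values with Lloyd's algorithm; return (labels, centres)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: start from k values spread evenly through the sorted data
    centres = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre
        new_labels = [min(range(k), key=lambda c: abs(v - centres[c]))
                      for v in values]
        if new_labels == labels:
            break  # converged: no value changed cluster
        labels = new_labels
        # Update step: move each centre to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels, centres


traffic = [40, 45, 60, 150, 160, 400, 420]  # made-up foot traffic values
labels, centres = kmeans_1d(traffic, k=3)
print(labels, centres)
```

With real data the label numbers returned by scikit-learn carry no order, which is why the first ten rows above can all share label 3 even though they have the lowest foot traffic.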

In [551]:
# add clustering labels
final_df.insert(0, 'Cluster Labels', kmeans.labels_)


In [591]:
#average each numeric column per cluster, ordered from lowest to highest total foot traffic
summary_table= final_df.groupby('Cluster Labels').mean(numeric_only=True)
summary_table=summary_table.sort_values('total_foot_traffic')
summary_table['legend']=['red','orange','white','blue','green']

In [599]:
summary_table.drop(['Latitude','Longitude'],inplace=True,axis=1)

In [609]:
#formatting our summary table
summary_table['pop']=summary_table['pop'].astype(int)
summary_table['Venue_count']=summary_table['Venue_count'].astype(int)
summary_table['pop_foot_traffic']=summary_table['pop_foot_traffic'].astype(int)
summary_table['venu_foot_traffic']=summary_table['venu_foot_traffic'].astype(int)
summary_table['total_foot_traffic']=summary_table['total_foot_traffic'].astype(int)
summary_table

Unnamed: 0_level_0,pop,Venue_count,Other_coffee_shops,pop_foot_traffic,venu_foot_traffic,total_foot_traffic,legend
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3,22121,117,4.764706,37,54,91,red
0,28129,109,2.5625,79,77,156,orange
4,33704,107,1.5,132,111,243,white
1,41278,104,0.777778,233,158,392,blue
2,33883,121,0.0,338,308,647,green


In [612]:
final_df.groupby('Cluster Labels').count()

Unnamed: 0_level_0,Postcode,Borough,Neighborhood,State,pop,Latitude,Longitude,Venue_count,Other_coffee_shops,pop_foot_traffic,venu_foot_traffic,total_foot_traffic
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,32,32,32,32,32,32,32,32,32,32,32,32
1,9,9,9,9,9,9,9,9,9,9,9,9
2,3,3,3,3,3,3,3,3,3,3,3,3
3,34,34,34,34,34,34,34,34,34,34,34,34
4,18,18,18,18,18,18,18,18,18,18,18,18


# Mapping our Clusters

In [573]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


In [574]:
import folium # map rendering library
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#placement of initial map
address = 'Toronto, On'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [587]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

 
              
# Function to set the marker color for each cluster
# colors run from the lowest average foot traffic (red) to the highest (green)
def color(cluster): 
    if cluster ==3: 
        col = 'red'
    elif cluster ==0: 
        col = 'orange'
    elif cluster ==4: 
        col = 'white'
    elif cluster ==1: 
        col = 'blue'     
    else: 
        col='green'
    return col 
 



# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighborhood'], final_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color(cluster),
        fill=True,
        fill_color=color(cluster),
        fill_opacity=0.7
        ).add_to(map_clusters)
    
legend_name='Cluster'      

map_clusters
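
The color function above hard-codes which cluster label gets which color, but k-means label numbers carry no order and can change between runs. One option is to derive the mapping from the cluster means, as in this sketch. The helper name `colors_by_rank` is hypothetical, and the means are the rounded `total_foot_traffic` values from the summary table.

```python
def colors_by_rank(cluster_means,
                   palette=('red', 'orange', 'white', 'blue', 'green')):
    """Map cluster labels to colors, lowest mean first."""
    ordered = sorted(cluster_means, key=cluster_means.get)
    return dict(zip(ordered, palette))


# Cluster label -> mean total_foot_traffic, taken from the summary table
means = {0: 156, 1: 392, 2: 647, 3: 91, 4: 243}
color_map = colors_by_rank(means)
print(color_map)
```

A lookup such as `color_map[cluster]` could then stand in for the if/elif chain, and it keeps working if the labels come out in a different order.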
