The first step is to import pandas and the other required packages and to load the Wikipedia (HTML) data.
pd.read_html returns a list of tables; the postal code table is the first one, so we select tables[0].

In [1]:
import pandas as pd
import numpy as np
import requests
import json
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors

tables=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df=pd.DataFrame(tables[0])

In [2]:
print(df)

    Postcode           Borough          Neighbourhood
0        M1A      Not assigned           Not assigned
1        M2A      Not assigned           Not assigned
2        M3A        North York              Parkwoods
3        M4A        North York       Victoria Village
4        M5A  Downtown Toronto           Harbourfront
..       ...               ...                    ...
282      M8Z         Etobicoke              Mimico NW
283      M8Z         Etobicoke     The Queensway West
284      M8Z         Etobicoke  Royal York South West
285      M8Z         Etobicoke         South of Bloor
286      M9Z      Not assigned           Not assigned

[287 rows x 3 columns]


Next we select only the rows in the dataframe whose value in the "Borough" column is not "Not assigned". The remaining rows are stored in dataframe df2.

In [3]:
df2=df[df['Borough'] != "Not assigned"]
print(df2)

    Postcode           Borough             Neighbourhood
2        M3A        North York                 Parkwoods
3        M4A        North York          Victoria Village
4        M5A  Downtown Toronto              Harbourfront
5        M6A        North York          Lawrence Heights
6        M6A        North York            Lawrence Manor
..       ...               ...                       ...
281      M8Z         Etobicoke  Kingsway Park South West
282      M8Z         Etobicoke                 Mimico NW
283      M8Z         Etobicoke        The Queensway West
284      M8Z         Etobicoke     Royal York South West
285      M8Z         Etobicoke            South of Bloor

[210 rows x 3 columns]
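The filter above relies on a boolean mask. As a minimal sketch on made-up rows (the `toy` frame and the `assigned` name are illustrative and not part of the notebook), the same selection looks like this:

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose borough is assigned.
mask = toy["Borough"] != "Not assigned"

# .copy() returns an independent frame, so later column assignments
# do not act on a view of the original.
assigned = toy[mask].copy()
print(assigned)
```

Note that the original row labels (here 1 and 2) are kept, which is why df2 above still starts at index 2.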


After this we merge the rows that share the same 'Postcode' and 'Borough' combination. The neighbourhood names of these rows are combined into one string, separated by ','.

In [4]:
df3=df2.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
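To see what the groupby and join do, here is a small sketch on three hand-picked rows (the `toy` and `combined` names are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Each (Postcode, Borough) group is a Series of neighbourhood names;
# ",".join concatenates them into a single string per group.
combined = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .apply(",".join)
       .reset_index()
)
print(combined)
```

The two M6A rows collapse into one row with "Lawrence Heights,Lawrence Manor", and groupby sorts the result by the key columns.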

Next we have to identify the rows with 'Not assigned' in the 'Neighbourhood' column. These need to be replaced with the value in the 'Borough' column. As seen below this involves only 1 row, whose 'Neighbourhood' is replaced by the contents of its 'Borough' field.

In [5]:
print(df3[df3['Neighbourhood']=='Not assigned'])

   Postcode       Borough Neighbourhood
85      M7A  Queen's Park  Not assigned


In [6]:
df3['Neighbourhood']=df3['Neighbourhood'].str.replace("Not assigned","Queen's Park")
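The replacement above hard-codes "Queen's Park", which works because only one row is affected. If more rows were affected, a more general variant could copy the borough from the same row. A sketch on toy data (names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Harbourfront"],
})

# Rows whose neighbourhood is missing.
missing = toy["Neighbourhood"] == "Not assigned"

# Copy the borough of each such row into its neighbourhood field.
toy.loc[missing, "Neighbourhood"] = toy.loc[missing, "Borough"]
print(toy)
```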

In [7]:
print(df3)

    Postcode      Borough                                      Neighbourhood
0        M1B  Scarborough                                      Rouge,Malvern
1        M1C  Scarborough               Highland Creek,Rouge Hill,Port Union
2        M1E  Scarborough                    Guildwood,Morningside,West Hill
3        M1G  Scarborough                                             Woburn
4        M1H  Scarborough                                          Cedarbrae
..       ...          ...                                                ...
98       M9N         York                                             Weston
99       M9P    Etobicoke                                          Westmount
100      M9R    Etobicoke  Kingsview Village,Martin Grove Gardens,Richvie...
101      M9V    Etobicoke  Albion Gardens,Beaumond Heights,Humbergate,Jam...
102      M9W    Etobicoke                                          Northwest

[103 rows x 3 columns]


Using the shape attribute we see that 103 rows remain in the dataset after all transformations have been performed.

In [8]:
df3.shape

(103, 3)

The next part of the assignment is to load the geospatial dataset and merge it with the wiki dataset (df3).
The first step is to load the csv into a dataframe.

In [9]:
geo=pd.read_csv('Geospatial_Coordinates.csv')

In [10]:
df_geo=pd.DataFrame(geo)
print(df_geo)

    Postal Code   Latitude  Longitude
0           M1B  43.806686 -79.194353
1           M1C  43.784535 -79.160497
2           M1E  43.763573 -79.188711
3           M1G  43.770992 -79.216917
4           M1H  43.773136 -79.239476
..          ...        ...        ...
98          M9N  43.706876 -79.518188
99          M9P  43.696319 -79.532242
100         M9R  43.688905 -79.554724
101         M9V  43.739416 -79.588437
102         M9W  43.706748 -79.594054

[103 rows x 3 columns]


The 2 dataframes (df3 and df_geo) are merged with the pandas merge function on the Postcode/Postal Code keys.
We use an inner join, which is the default setting.

In [11]:
df_comb=df3.merge(df_geo, left_on='Postcode', right_on='Postal Code')
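As a sketch of how such a merge can guard itself against duplicated keys, pandas' `validate` argument can be combined with a row count check. The `left`, `right` and `merged` names are illustrative; the coordinates are taken from the geospatial table above.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9W"],
    "Latitude": [43.806686, 43.784535, 43.706748],
    "Longitude": [-79.194353, -79.160497, -79.594054],
})

# how="inner" is the default; validate raises MergeError if either
# key column contains duplicates.
merged = left.merge(
    right,
    left_on="Postcode",
    right_on="Postal Code",
    how="inner",
    validate="one_to_one",
)

# Every left row found a partner, so no rows were lost.
assert len(merged) == len(left)
print(merged)
```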

After the merging of the 2 dataframes we check that we have the same number of rows left from df3 which is indeed the case (103 rows before and after merging). We drop the 'Postal Code' column as the values are the same as the 'Postcode' column.

In [12]:
df_comb=df_comb.drop(columns=['Postal Code'])
print(df_comb)


    Postcode      Borough                                      Neighbourhood  \
0        M1B  Scarborough                                      Rouge,Malvern   
1        M1C  Scarborough               Highland Creek,Rouge Hill,Port Union   
2        M1E  Scarborough                    Guildwood,Morningside,West Hill   
3        M1G  Scarborough                                             Woburn   
4        M1H  Scarborough                                          Cedarbrae   
..       ...          ...                                                ...   
98       M9N         York                                             Weston   
99       M9P    Etobicoke                                          Westmount   
100      M9R    Etobicoke  Kingsview Village,Martin Grove Gardens,Richvie...   
101      M9V    Etobicoke  Albion Gardens,Beaumond Heights,Humbergate,Jam...   
102      M9W    Etobicoke                                          Northwest   

      Latitude  Longitude  
0    43.806

For the sake of simplicity, and because of the limitations of the Foursquare sandbox account, we first inspect the Borough column in the dataset and then select only the boroughs with "Toronto" in their name, i.e. West, East, Downtown and Central Toronto.

In [13]:
df_comb['Borough'].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

In [14]:
df_comb=df_comb[(df_comb.Borough == 'West Toronto') | (df_comb.Borough == 'East Toronto') | (df_comb.Borough == 'Downtown Toronto') | (df_comb.Borough == 'Central Toronto')].reset_index() 
print(df_comb)

    index Postcode           Borough  \
0      37      M4E      East Toronto   
1      41      M4K      East Toronto   
2      42      M4L      East Toronto   
3      43      M4M      East Toronto   
4      44      M4N   Central Toronto   
5      45      M4P   Central Toronto   
6      46      M4R   Central Toronto   
7      47      M4S   Central Toronto   
8      48      M4T   Central Toronto   
9      49      M4V   Central Toronto   
10     50      M4W  Downtown Toronto   
11     51      M4X  Downtown Toronto   
12     52      M4Y  Downtown Toronto   
13     53      M5A  Downtown Toronto   
14     54      M5B  Downtown Toronto   
15     55      M5C  Downtown Toronto   
16     56      M5E  Downtown Toronto   
17     57      M5G  Downtown Toronto   
18     58      M5H  Downtown Toronto   
19     59      M5J  Downtown Toronto   
20     60      M5K  Downtown Toronto   
21     61      M5L  Downtown Toronto   
22     63      M5N   Central Toronto   
23     64      M5P   Central Toronto   


We select the relevant packages/libraries necessary for the map viz and clustering.

In [15]:
import folium
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Let's use the mean latitude and longitude of the selected data points as the centre of the map.

In [51]:
latitude = df_comb['Latitude'].mean()
longitude = df_comb['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(latitude, longitude))


The geographical coordinates of Toronto are 43.66726218421052, -79.38988323421053


Next we are going to cycle through all datapoints and plot them on the folium map in color blue for now

In [52]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_comb.Latitude, df_comb.Longitude, df_comb.Borough, df_comb.Neighbourhood):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next we include the Foursquare identification to apply the Foursquare API

In [19]:
LIMIT = 100
CLIENT_ID = 'GHNHGWRHFKZE5DGOGHZUYBEDFWXFDMAUPW5JVVATOK3YC0ZH' # your Foursquare ID
CLIENT_SECRET = 'YBL2RZCIKT3F5A2KOWQ0ET04D0YII0XE5GRTSLFJOC1UCHAY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


Your credentials:
CLIENT_ID: GHNHGWRHFKZE5DGOGHZUYBEDFWXFDMAUPW5JVVATOK3YC0ZH
CLIENT_SECRET:YBL2RZCIKT3F5A2KOWQ0ET04D0YII0XE5GRTSLFJOC1UCHAY
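Credentials that are typed into a cell and printed end up in every exported copy of the notebook. One possible alternative, shown here as a sketch, is to read them from environment variables. The function name and the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical; they are not defined by Foursquare or by this notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from a mapping of environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demo with a plain dict so the sketch runs without any real secrets.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID loaded:", bool(client_id))
```

In the notebook itself the call would be made without an argument, so that the values come from the real environment.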


Let's obtain the nearby venues within a radius of 500 metres (again we want to limit this because we are using a Foursquare sandbox account) and define the columns we want in our dataset.

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
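The function above indexes straight into the JSON response, so a response without groups or a venue without categories raises an error. Below is a sketch of a more defensive parsing step. The helper name `extract_venue_rows` and the hand-written `fake_payload` are assumptions; the payload only mimics the fields the notebook reads and is not a real API response.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore-style response dict into a list of row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written stand-in for a response (illustrative only).
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Glen Manor Ravine",
               "location": {"lat": 43.676821, "lng": -79.293942},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 43.68, "lng": -79.29},
               "categories": []}},
]}]}}

rows = extract_venue_rows(fake_payload, "The Beaches", 43.676357, -79.293031)
print(rows)
```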

We apply the above function to each neighbourhood and create the dataframe toronto_venues.

In [21]:
toronto_venues = getNearbyVenues(names=df_comb['Neighbourhood'],
                                   latitudes=df_comb['Latitude'],
                                   longitudes=df_comb['Longitude']
                                  )
toronto_venues=pd.DataFrame(toronto_venues)

The Beaches
The Danforth West,Riverdale
The Beaches West,India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Rosedale
Cabbagetown,St. James Town
Church and Wellesley
Harbourfront
Ryerson,Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide,King,Richmond
Harbourfront East,Toronto Islands,Union Station
Design Exchange,Toronto Dominion Centre
Commerce Court,Victoria Hotel
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville
Harbord,University of Toronto
Chinatown,Grange Park,Kensington Market
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place,Underground city
Christie
Dovercourt Village,Dufferin
Little Portugal,Trinity
Brockton,Exhibition Place,Parkdale Village
High Park,The Junction South
Parkdale,Roncesvalles
Runnymede

Show the size of the dataframe

In [22]:
print(toronto_venues.shape)
toronto_venues.head()

(1705, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West,Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Display the number of venues for each neighbourhood

In [23]:
toronto_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide,King,Richmond",100,100,100,100,100,100
Berczy Park,57,57,57,57,57,57
"Brockton,Exhibition Place,Parkdale Village",21,21,21,21,21,21
Business Reply Mail Processing Centre 969 Eastern,18,18,18,18,18,18
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",16,16,16,16,16,16
"Cabbagetown,St. James Town",43,43,43,43,43,43
Central Bay Street,82,82,82,82,82,82
"Chinatown,Grange Park,Kensington Market",96,96,96,96,96,96
Christie,17,17,17,17,17,17
Church and Wellesley,89,89,89,89,89,89


Let's find out how many unique venue categories are included

In [24]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 234 unique categories.


We will use one hot encoding to create dummy variables: each unique venue category becomes a column holding either 0 (not applicable) or 1 (applicable).

Let's first look at the different venue categories and how often each occurs across all Neighbourhoods.

In [25]:
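# toronto_venues['Venue Category'].unique()
# count how often each venue category occurs (the index is in alphabetical order)
toronto_venues_group=toronto_venues['Venue Category'].groupby(toronto_venues['Venue Category']).count()
# sorted copy (by count, descending); the print below shows the unsorted series
toronto_venues_sorted=toronto_venues_group.sort_values(ascending=False)
print(toronto_venues_group)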

Venue Category
Afghan Restaurant         1
Airport                   1
Airport Food Court        1
Airport Gate              1
Airport Lounge            2
                         ..
Vietnamese Restaurant    10
Wine Bar                 10
Wine Shop                 1
Wings Joint               1
Yoga Studio               6
Name: Venue Category, Length: 234, dtype: int64


As we only want to work with restaurants, let's limit the venue categories to those containing the word "Restaurant" and show the first 20. The series is still in alphabetical order, so these are the first 20 by name, not by count.

In [26]:
toronto_venues_group[toronto_venues_group.index.str.contains("Restaurant")].head(20)

Venue Category
Afghan Restaurant               1
American Restaurant            26
Asian Restaurant               14
Belgian Restaurant              1
Brazilian Restaurant            3
Cajun / Creole Restaurant       1
Caribbean Restaurant            6
Chinese Restaurant             14
Colombian Restaurant            1
Comfort Food Restaurant         7
Cuban Restaurant                2
Dim Sum Restaurant              1
Doner Restaurant                1
Dumpling Restaurant             4
Eastern European Restaurant     3
Ethiopian Restaurant            2
Falafel Restaurant              2
Fast Food Restaurant           12
Filipino Restaurant             1
French Restaurant              11
Name: Venue Category, dtype: int64
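The listing above is in alphabetical index order. To rank the restaurant categories by count instead, `nlargest` can be applied after the filter. A sketch with a handful of counts copied from the output above (the `counts` and `top` names are illustrative):

```python
import pandas as pd

counts = pd.Series({
    "Afghan Restaurant": 1,
    "Airport": 1,
    "American Restaurant": 26,
    "Asian Restaurant": 14,
    "French Restaurant": 11,
}, name="Venue Category")

# Keep only categories whose name contains "Restaurant".
restaurants = counts[counts.index.str.contains("Restaurant")]

# The three most frequent restaurant categories.
top = restaurants.nlargest(3)
print(top)
```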

Let's assume as a business case that we want to open a Vegetarian/Vegan Restaurant close to American Restaurants, as we expect that customers of American Restaurants (:)) are the most likely to switch their food preference once they are confronted with the Vegetarian/Vegan Restaurant menu.

In [27]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's see the size of the dataframe

In [28]:
toronto_onehot.shape

(1705, 235)

To prepare for clustering we group by neighbourhood and take the mean of each one hot column, which gives the relative frequency of each category per neighbourhood.

In [29]:
toronto_grouped=toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.0625,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,0.0,...,0.0,0.0,0.0,0.012195,0.0,0.0,0.012195,0.0,0.0,0.012195
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.041667,0.0,0.052083,0.010417,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011236,0.0,0.0,0.0,0.0,0.0,0.0,0.011236,0.0,...,0.0,0.0,0.0,0.0,0.011236,0.011236,0.0,0.011236,0.011236,0.011236


Let's display the shape of the new dataframe

In [30]:
toronto_grouped.shape

(38, 235)
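The one hot encoding and the grouping step can be followed on a tiny example. This is a sketch with invented venues; `venues`, `onehot` and `freq` are illustrative names.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B", "B", "B", "B"],
    "Venue Category": ["Cafe", "Pub", "Cafe", "Cafe", "Park", "Pub"],
})

# One column per category, 1 where the venue belongs to it, else 0.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighbourhood back as the first column.
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Neighbourhood B has 4 venues of which 1 is a park, so its Park frequency is 0.25, and every row of `freq` sums to 1.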

Now we want to identify the American restaurants

In [31]:
toronto_American_Rest=toronto_grouped[['Neighbourhood','American Restaurant']]
toronto_American_Rest=toronto_American_Rest.sort_values(by=['American Restaurant'],ascending=False)
print(toronto_American_Rest.head(10))


                                        Neighbourhood  American Restaurant
13  Deer Park,Forest Hill SE,Rathnelly,South Hill,...             0.062500
33                                    Studio District             0.052632
34                  The Annex,North Midtown,Yorkville             0.047619
10                      Commerce Court,Victoria Hotel             0.040000
14            Design Exchange,Toronto Dominion Centre             0.040000
0                              Adelaide,King,Richmond             0.030000
16              First Canadian Place,Underground city             0.030000
11                                         Davisville             0.026316
37                        The Danforth West,Riverdale             0.023810
31                                     St. James Town             0.020000


Next we group the Neighbourhoods into 4 clusters, using the KMeans clustering algorithm to assign each row in the dataframe to one of the clusters.

In [32]:
# set number of cluster
kclusters = 4

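# keep only the numeric feature column (positional axis arguments to drop are no longer accepted in recent pandas)
toronto_grouped_clustering = toronto_American_Rest.drop(columns='Neighbourhood')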

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 2, 2, 2, 2, 2, 2, 3])
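The label numbers returned by KMeans are arbitrary ids: cluster 0 is not necessarily the cluster with the lowest or highest share. As a sketch on invented frequencies (all names and values below are assumptions), the labels can be renumbered so that they follow the order of the cluster centres:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented one-column feature: share of American restaurants per area.
freq = np.array([0.06, 0.05, 0.055, 0.0, 0.0, 0.01]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)

# Cluster ids sorted by ascending centre value.
order = np.argsort(km.cluster_centers_.ravel())

# rank[old_id] is the position of that cluster in the sorted order.
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

# Relabel: 0 is now the cluster with the lowest centre.
ordered_labels = rank[km.labels_]
print(ordered_labels)
```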

Now we create a dataframe that shows the cluster label for each neighbourhood and join it with the venue data to obtain the latitude and longitude of each Neighbourhood and Venue. We print the size and the head of the dataframe.

In [33]:
Am_merged = toronto_American_Rest.copy()
Am_merged["Cluster Labels"] = kmeans.labels_

Am_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)

Am_merged = Am_merged.join(toronto_venues.set_index("Neighbourhood"), on="Neighbourhood")

print(Am_merged.shape)
Am_merged.head()


(1705, 9)


Unnamed: 0,Neighbourhood,American Restaurant,Cluster Labels,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0625,0,43.686412,-79.400049,LCBO,43.686991,-79.399238,Liquor Store
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0625,0,43.686412,-79.400049,The Market By Longo’s,43.686711,-79.399536,Supermarket
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0625,0,43.686412,-79.400049,Daeco Sushi,43.687838,-79.395652,Sushi Restaurant
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0625,0,43.686412,-79.400049,Union Social Eatery,43.687895,-79.394916,American Restaurant
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.0625,0,43.686412,-79.400049,Tim Hortons,43.687682,-79.39684,Coffee Shop
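Because each neighbourhood has many venues, the join produces one row per venue, which is why the result has 1705 rows and not one row per neighbourhood. A sketch of this one-to-many behaviour on toy frames (all names are illustrative), including a way to get back to one row per neighbourhood, for instance for plotting:

```python
import pandas as pd

per_neighbourhood = pd.DataFrame({
    "Neighbourhood": ["A", "B"],
    "Cluster Labels": [0, 1],
})
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue": ["v1", "v2", "v3", "v4"],
})

# One output row per matching venue: A appears three times, B once.
merged = per_neighbourhood.join(
    venues.set_index("Neighbourhood"), on="Neighbourhood"
)

# Keep the first row of each neighbourhood only.
one_per = merged.drop_duplicates(subset="Neighbourhood")
print(merged)
print(one_per)
```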


Sort the results by cluster label

In [34]:
print(Am_merged.shape)
Am_merged.sort_values(["Cluster Labels"], inplace=True)
Am_merged

(1705, 9)


Unnamed: 0,Neighbourhood,American Restaurant,Cluster Labels,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
13,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.062500,0,43.686412,-79.400049,LCBO,43.686991,-79.399238,Liquor Store
34,"The Annex,North Midtown,Yorkville",0.047619,0,43.672710,-79.405678,Ezra's Pound,43.675153,-79.405858,Café
33,Studio District,0.052632,0,43.659526,-79.340923,RiverRock Cafe,43.661497,-79.340235,Café
33,Studio District,0.052632,0,43.659526,-79.340923,Saulter Street Brewery,43.658412,-79.346392,Brewery
33,Studio District,0.052632,0,43.659526,-79.340923,Gale's Snack Bar,43.658239,-79.339077,Diner
...,...,...,...,...,...,...,...,...,...
6,Central Bay Street,0.012195,3,43.657952,-79.387383,Reds Midtown Tavern,43.659128,-79.382266,Wine Bar
6,Central Bay Street,0.012195,3,43.657952,-79.387383,Sambuca Grill,43.656110,-79.392946,Italian Restaurant
6,Central Bay Street,0.012195,3,43.657952,-79.387383,Toronto Vegetarian Association,43.655953,-79.392854,Office
6,Central Bay Street,0.012195,3,43.657952,-79.387383,Arctic Bites,43.656085,-79.392913,Ice Cream Shop


Let's look at the datatypes of the columns and create the map

In [35]:
print(Am_merged.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1705 entries, 13 to 32
Data columns (total 9 columns):
Neighbourhood              1705 non-null object
American Restaurant        1705 non-null float64
Cluster Labels             1705 non-null int32
Neighbourhood Latitude     1705 non-null float64
Neighbourhood Longitude    1705 non-null float64
Venue                      1705 non-null object
Venue Latitude             1705 non-null float64
Venue Longitude            1705 non-null float64
Venue Category             1705 non-null object
dtypes: float64(5), int32(1), object(3)
memory usage: 126.5+ KB
None


In [53]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Am_merged['Neighbourhood Latitude'], Am_merged['Neighbourhood Longitude'], Am_merged['Neighbourhood'], Am_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
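In the cell above the `ys` list is only used for its length, and `rainbow[cluster-1]` maps cluster 0 to the last colour through negative indexing. A possible simplification, shown as a sketch (the `palette` name and `colour_for` helper are assumptions), is to build one colour per cluster and index it with the label directly:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 4

# One evenly spaced rainbow colour per cluster, as hex strings.
palette = [
    colors.rgb2hex(rgba)
    for rgba in cm.rainbow(np.linspace(0, 1, kclusters))
]


def colour_for(cluster):
    """Return the hex colour for a cluster label in 0..kclusters-1."""
    return palette[cluster]


print([colour_for(c) for c in range(kclusters)])
```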

Examination of the clusters

Cluster 0: first print the entire dataframe, then show only the rows whose Venue Category is American Restaurant, together with their count.

In [37]:
Am_C0=Am_merged.loc[Am_merged['Cluster Labels'] == 0]
print(Am_C0)

                                        Neighbourhood  American Restaurant  \
13  Deer Park,Forest Hill SE,Rathnelly,South Hill,...             0.062500   
34                  The Annex,North Midtown,Yorkville             0.047619   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   
..                                                ...                  ...   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   
33                                    Studio District             0.052632   

    Cluster Labels  Neighbourhood Latitude  Neighbourhood Longi

In [38]:
Am_C0_sel=Am_C0.loc[Am_C0['Venue Category'] == 'American Restaurant']
print(Am_C0_sel)
print(Am_C0_sel.count())

                                        Neighbourhood  American Restaurant  \
33                                    Studio District             0.052632   
34                  The Annex,North Midtown,Yorkville             0.047619   
33                                    Studio District             0.052632   
13  Deer Park,Forest Hill SE,Rathnelly,South Hill,...             0.062500   

    Cluster Labels  Neighbourhood Latitude  Neighbourhood Longitude  \
33               0               43.659526               -79.340923   
34               0               43.672710               -79.405678   
33               0               43.659526               -79.340923   
13               0               43.686412               -79.400049   

                  Venue  Venue Latitude  Venue Longitude       Venue Category  
33      Brooklyn Tavern       43.661937       -79.335938  American Restaurant  
34          Rose & Sons       43.675668       -79.403617  American Restaurant  
33           
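The number of American restaurants can also be obtained for all clusters at once, without filtering cluster by cluster. A sketch on invented rows (the `toy` frame and `summary` name are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 2, 2, 2],
    "Venue Category": [
        "American Restaurant", "Cafe", "Park",
        "American Restaurant", "American Restaurant", "Pub",
    ],
})

# True for venue rows that are American restaurants.
is_american = toy["Venue Category"] == "American Restaurant"

# Summing booleans per cluster counts the True values.
summary = is_american.groupby(toy["Cluster Labels"]).sum()
print(summary)
```

A cluster without any American restaurant, such as cluster 1 here, shows up with a count of 0 and is not silently dropped.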

The same for cluster 1: first print the entire dataframe, then show only the rows whose Venue Category is American Restaurant, together with their count.

In [39]:
Am_C1=Am_merged.loc[Am_merged['Cluster Labels'] == 1]
print(Am_C1)

                                        Neighbourhood  American Restaurant  \
22                                      Lawrence Park                  0.0   
22                                      Lawrence Park                  0.0   
22                                      Lawrence Park                  0.0   
8                                            Christie                  0.0   
21                       High Park,The Junction South                  0.0   
..                                                ...                  ...   
17                 Forest Hill North,Forest Hill West                  0.0   
18                      Harbord,University of Toronto                  0.0   
4   CN Tower,Bathurst Quay,Island airport,Harbourf...                  0.0   
19                                       Harbourfront                  0.0   
19                                       Harbourfront                  0.0   

    Cluster Labels  Neighbourhood Latitude  Neighbourhood Longi

In [40]:
Am_C1_sel=Am_C1.loc[Am_C1['Venue Category'] == 'American Restaurant']
print(Am_C1_sel)
print(Am_C1_sel.count())

Empty DataFrame
Columns: [Neighbourhood, American Restaurant, Cluster Labels, Neighbourhood Latitude, Neighbourhood Longitude, Venue, Venue Latitude, Venue Longitude, Venue Category]
Index: []
Neighbourhood              0
American Restaurant        0
Cluster Labels             0
Neighbourhood Latitude     0
Neighbourhood Longitude    0
Venue                      0
Venue Latitude             0
Venue Longitude            0
Venue Category             0
dtype: int64


And cluster 2:

In [41]:
Am_C2=Am_merged.loc[Am_merged['Cluster Labels'] == 2]
print(Am_C2)

                              Neighbourhood  American Restaurant  \
37              The Danforth West,Riverdale              0.02381   
16    First Canadian Place,Underground city              0.03000   
16    First Canadian Place,Underground city              0.03000   
16    First Canadian Place,Underground city              0.03000   
16    First Canadian Place,Underground city              0.03000   
..                                      ...                  ...   
14  Design Exchange,Toronto Dominion Centre              0.04000   
14  Design Exchange,Toronto Dominion Centre              0.04000   
14  Design Exchange,Toronto Dominion Centre              0.04000   
37              The Danforth West,Riverdale              0.02381   
37              The Danforth West,Riverdale              0.02381   

    Cluster Labels  Neighbourhood Latitude  Neighbourhood Longitude  \
37               2               43.679557               -79.352188   
16               2               43.64842

In [42]:
Am_C2_sel=Am_C2.loc[Am_C2['Venue Category'] == 'American Restaurant']
print(Am_C2_sel)
print(Am_C2_sel.count())

                              Neighbourhood  American Restaurant  \
16    First Canadian Place,Underground city             0.030000   
16    First Canadian Place,Underground city             0.030000   
16    First Canadian Place,Underground city             0.030000   
0                    Adelaide,King,Richmond             0.030000   
11                               Davisville             0.026316   
37              The Danforth West,Riverdale             0.023810   
10            Commerce Court,Victoria Hotel             0.040000   
14  Design Exchange,Toronto Dominion Centre             0.040000   
10            Commerce Court,Victoria Hotel             0.040000   
10            Commerce Court,Victoria Hotel             0.040000   
10            Commerce Court,Victoria Hotel             0.040000   
14  Design Exchange,Toronto Dominion Centre             0.040000   
14  Design Exchange,Toronto Dominion Centre             0.040000   
14  Design Exchange,Toronto Dominion Centre     

Cluster 3

In [43]:
Am_C3=Am_merged.loc[Am_merged['Cluster Labels'] == 3]
print(Am_C3)

                      Neighbourhood  American Restaurant  Cluster Labels  \
9              Church and Wellesley             0.011236               3   
31                   St. James Town             0.020000               3   
32  Stn A PO Boxes 25 The Esplanade             0.010101               3   
32  Stn A PO Boxes 25 The Esplanade             0.010101               3   
32  Stn A PO Boxes 25 The Esplanade             0.010101               3   
..                              ...                  ...             ...   
6                Central Bay Street             0.012195               3   
6                Central Bay Street             0.012195               3   
6                Central Bay Street             0.012195               3   
6                Central Bay Street             0.012195               3   
32  Stn A PO Boxes 25 The Esplanade             0.010101               3   

    Neighbourhood Latitude  Neighbourhood Longitude  \
9                43.665860      

In [44]:
Am_C3_sel=Am_C3.loc[Am_C3['Venue Category'] == 'American Restaurant']
print(Am_C3_sel)
print(Am_C3_sel.count())

                      Neighbourhood  American Restaurant  Cluster Labels  \
32  Stn A PO Boxes 25 The Esplanade             0.010101               3   
30          Ryerson,Garden District             0.010000               3   
9              Church and Wellesley             0.011236               3   
31                   St. James Town             0.020000               3   
31                   St. James Town             0.020000               3   
6                Central Bay Street             0.012195               3   

    Neighbourhood Latitude  Neighbourhood Longitude             Venue  \
32               43.646435               -79.374846              Jump   
30               43.657162               -79.378937              JOEY   
9                43.665860               -79.383160   The Blake House   
31               43.651494               -79.375418     The Gabardine   
31               43.651494               -79.375418  Richmond Station   
6                43.657952   

In summary, the number of American restaurants per cluster is:
Cluster 0: 3 instances,
Cluster 1: 0 instances,
Cluster 2: 16 instances,
Cluster 3: 6 instances.
As a check, the total (25 American restaurants) reconciles with the total number of American restaurants identified before.
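
The per-cluster tally can also be produced in one pass instead of filtering each cluster separately. The following is a minimal sketch, assuming a frame shaped like Am_merged with 'Cluster Labels' and 'Venue Category' columns. The name toy_merged and its rows are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-in for Am_merged; the rows are illustrative, not the notebook's data.
toy_merged = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B", "C", "C", "C"],
    "Cluster Labels": [0, 0, 1, 2, 2, 2],
    "Venue Category": ["American Restaurant", "Cafe", "Park",
                       "American Restaurant", "American Restaurant", "Bar"],
})

# Keep only the American restaurants, then count rows per cluster.
is_american = toy_merged["Venue Category"] == "American Restaurant"
all_clusters = sorted(toy_merged["Cluster Labels"].unique())
per_cluster = (
    toy_merged[is_american]
    .groupby("Cluster Labels")
    .size()
    # Clusters without any American restaurant would otherwise be missing.
    .reindex(all_clusters, fill_value=0)
)

print(per_cluster)
print("total:", per_cluster.sum())
```

Using size() rather than count() gives a single number per cluster, and the reindex step keeps clusters with zero matches visible, so the total can be checked against the overall count directly.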

As a final step, let's explore the dataframe of cluster 2.

In [45]:
print(Am_C2_sel.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 16 to 0
Data columns (total 9 columns):
Neighbourhood              16 non-null object
American Restaurant        16 non-null float64
Cluster Labels             16 non-null int32
Neighbourhood Latitude     16 non-null float64
Neighbourhood Longitude    16 non-null float64
Venue                      16 non-null object
Venue Latitude             16 non-null float64
Venue Longitude            16 non-null float64
Venue Category             16 non-null object
dtypes: float64(5), int32(1), object(3)
memory usage: 1.2+ KB
None


In [50]:
Am_C2_sel.groupby(['Neighbourhood','American Restaurant']).count()

Neighbourhood                            American Restaurant  Cluster Labels  Neighbourhood Latitude  Neighbourhood Longitude  Venue  Venue Latitude  Venue Longitude  Venue Category
Adelaide,King,Richmond                                  0.03               3                       3                        3      3               3                3               3
Commerce Court,Victoria Hotel                           0.04               4                       4                        4      4               4                4               4
Davisville                                          0.026316               1                       1                        1      1               1                1               1
Design Exchange,Toronto Dominion Centre                 0.04               4                       4                        4      4               4                4               4
First Canadian Place,Underground city                   0.03               3                       3                        3      3               3                3               3
The Danforth West,Riverdale                          0.02381               1                       1                        1      1               1                1               1


In conclusion, two neighbourhoods within cluster 2 each have 4 American restaurants: "Commerce Court,Victoria Hotel" and "Design Exchange,Toronto Dominion Centre". Without additional features in the dataset or additional requirements/criteria from the defined business case, there is no basis for preferring one of these two neighbourhoods over the other.
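
The tie can also be read off programmatically. This is a minimal sketch, assuming one row per restaurant as in Am_C2_sel. The frame stand_in is a hypothetical reconstruction built only from the per-neighbourhood counts in the groupby output above, not the original dataframe.

```python
import pandas as pd

# Counts per neighbourhood copied from the groupby output of cluster 2.
counts = {
    "Adelaide,King,Richmond": 3,
    "Commerce Court,Victoria Hotel": 4,
    "Davisville": 1,
    "Design Exchange,Toronto Dominion Centre": 4,
    "First Canadian Place,Underground city": 3,
    "The Danforth West,Riverdale": 1,
}

# Hypothetical stand-in for Am_C2_sel: one row per American restaurant.
stand_in = pd.DataFrame(
    {"Neighbourhood": [name for name, n in counts.items() for _ in range(n)]}
)

# Count restaurants per neighbourhood and keep every neighbourhood tied for the maximum.
per_neighbourhood = stand_in.groupby("Neighbourhood").size()
top = per_neighbourhood[per_neighbourhood == per_neighbourhood.max()]

print(top)
```

Comparing against the maximum, rather than calling idxmax(), keeps both tied neighbourhoods in the result instead of returning only the first one.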