# Applied Data Science Capstone Project

#### Gregory Smith

This notebook contains the Applied Data Science capstone project, completed as part of the Applied Data Science specialization on Coursera.

In [2]:
import pandas as pd
import numpy as np

In [3]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Segmenting and Clustering Neighborhoods in Toronto

### Importing and Cleaning Dataframe

Scraping Toronto postal codes from Wikipedia

In [4]:
import urllib.request

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()

with open('List_of_postal_codes_of_Canada:_M.html', 'w') as fo:
    fo.write(article)

df_toronto_neigh = pd.read_html('List_of_postal_codes_of_Canada:_M.html')[0]

In [5]:
df_toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Dropping postcodes that are unassigned to a borough

In [6]:
df_toronto_neigh = df_toronto_neigh[df_toronto_neigh['Borough']!='Not assigned']
df_toronto_neigh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Creating a new dataframe in which all neighbourhoods that share a postal code and borough are grouped into a single row. This was done by creating an empty dataframe, building a temporary dataframe for each postal code and borough combination, concatenating the neighbourhood names within that temporary dataframe into one row, and concatenating the resulting row onto the new dataframe.

In [7]:
# generating empty dataframe with same columns as 'df_toronto_neigh'
df_toronto_neigh_2 = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])

# enumerating through distinct 'Borough' and 'Postcodes' strings in original dataframe
for i, borough in enumerate(df_toronto_neigh['Borough'].unique()):
    for j, post in enumerate(df_toronto_neigh['Postcode'].unique()):
        # generating a temporary df consisting of the entries of the original dataframe where the current enumerated
        # 'Borough' and 'Postcode' is present
        temp_df = df_toronto_neigh.loc[(df_toronto_neigh['Borough']==borough) & (df_toronto_neigh['Postcode']==post)]
        # while the df has more than one row, append the neighbourhood from the second row to the neighbourhood
        # in the first row, then drop the second row from temp_df
        while temp_df.shape[0]>1:
            temp_df.iloc[0,2]=temp_df.iloc[0,2]+', '+temp_df.iloc[1,2]
            temp_df.drop(temp_df.index[1], inplace=True)
        # append the df row generated during the while loop to the new dataframe
        df_toronto_neigh_2=pd.concat([df_toronto_neigh_2,temp_df])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
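
The repeated `SettingWithCopyWarning` above is raised because `temp_df` is a slice of `df_toronto_neigh` that is then modified in place. As an alternative sketch (not the approach used in this notebook), the same grouping can be expressed with `groupby` and a string join, which avoids both the nested loops and the warning. The small dataframe `df_demo` is made up for illustration and only mirrors the column names of `df_toronto_neigh`.

```python
import pandas as pd

# hypothetical sample mirroring the columns of 'df_toronto_neigh'
df_demo = pd.DataFrame({
    'Postcode': ['M3A', 'M6A', 'M6A', 'M4A'],
    'Borough': ['North York', 'North York', 'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor', 'Victoria Village'],
})

# one row per (Postcode, Borough); neighbourhood names are joined with ', '
# sort=False keeps the groups in order of first appearance
df_grouped = (
    df_demo
    .groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(df_grouped)
```

Note that this orders rows by first appearance of each postal code, whereas the loop above orders them borough by borough, so the row order of the two results can differ.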


In [8]:
df_toronto_neigh_2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
5,M6A,North York,"Lawrence Heights, Lawrence Manor"
13,M3B,North York,Don Mills North
18,M6B,North York,Glencairn


Finding which boroughs do not have an assigned neighborhood

In [9]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Not assigned


For boroughs without an assigned neighbourhood, setting the neighbourhood to the borough name

In [10]:
for i in range(df_toronto_neigh_2.shape[0]):
    if (df_toronto_neigh_2.iloc[i,2]=='Not assigned'):
        df_toronto_neigh_2.iloc[i,2] = df_toronto_neigh_2.iloc[i,1]
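
The row-by-row loop can also be written as a single masked assignment. The following is a minimal sketch on made-up data (`df_demo` is hypothetical and only reuses the column names from this notebook):

```python
import pandas as pd

# hypothetical two-row sample with one unassigned neighbourhood
df_demo = pd.DataFrame({
    'Postcode': ['M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ["Queen's Park", 'Not assigned'],
})

# boolean mask of the rows that need fixing
mask = df_demo['Neighbourhood'] == 'Not assigned'

# copy the borough name into the neighbourhood column for those rows only
df_demo.loc[mask, 'Neighbourhood'] = df_demo.loc[mask, 'Borough']
print(df_demo)
```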

In [11]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [12]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']==df_toronto_neigh_2['Borough']]

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park


Cleaning the dataframe by renaming a column and resetting the indices

In [13]:
df_toronto_neigh_2.rename(columns={'Postcode': 'Postal Code'}, inplace=True)

In [14]:
df_toronto_neigh_2.reset_index(drop=True, inplace=True)

Details of cleaned dataframe 'df_toronto_neigh_2'

In [452]:
df_toronto_neigh_2.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M6A,North York,"Lawrence Heights, Lawrence Manor"
3,M3B,North York,Don Mills North
4,M6B,North York,Glencairn


In [15]:
df_toronto_neigh_2.describe()

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,11,102
top,M5P,North York,Queen's Park
freq,1,24,2


In [16]:
df_toronto_neigh_2.loc[df_toronto_neigh_2['Neighbourhood']=='Queen\'s Park']

Unnamed: 0,Postal Code,Borough,Neighbourhood
25,M7A,Downtown Toronto,Queen's Park
43,M9A,Queen's Park,Queen's Park


The only neighbourhood that appears more than once is Queen's Park. Upon investigating Queen's Park, its postal code is listed as M7A. At first I thought that M9A might be reserved for government buildings within Queen's Park, but that does not appear to be the case, at least not based on my research. I will keep this possible issue in mind when working further on the project.

In [17]:
df_toronto_neigh_2.shape

(103, 3)

### Adding Geographic Coordinates to Data Frame

Creating a dataframe of Toronto postal codes and corresponding latitude and longitude

In [18]:
df_toronto_latlong = pd.read_csv('https://cocl.us/Geospatial_data')
df_toronto_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Joining 'df_toronto_neigh_2' and 'df_toronto_latlong' on the 'Postal Code' column

In [19]:
df_toronto_neigh_2 = df_toronto_neigh_2.merge(df_toronto_latlong, on='Postal Code')
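
By default `merge` performs an inner join, so any postal code missing from the coordinates table would be dropped without notice. A hedged sketch of how this could be checked, using made-up frames (`neigh`, `latlong`, and the postal code `M0X` are hypothetical): `how='left'` keeps every neighbourhood row, `validate='one_to_one'` raises if a postal code is duplicated on either side, and `indicator=True` adds a `_merge` column that marks unmatched rows.

```python
import pandas as pd

# hypothetical inputs; 'M0X' deliberately has no coordinates
neigh = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Made-up Area'],
})
latlong = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# left join that keeps all neighbourhoods and records where each row came from
merged = neigh.merge(latlong, on='Postal Code', how='left',
                     validate='one_to_one', indicator=True)

# postal codes that found no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```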

In [20]:
df_toronto_neigh_2.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
3,M3B,North York,Don Mills North,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073
5,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923
6,M2H,North York,Hillcrest Village,43.803762,-79.363452
7,M3H,North York,"Bathurst Manor, Downsview North, Wilson Heights",43.754328,-79.442259
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
9,M3J,North York,"Northwood Park, York University",43.76798,-79.487262


### Cluster Analysis of Toronto Neighbourhoods

Setting the Foursquare account parameters and the 'version' value

In [21]:
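import datetime
import os

# read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = datetime.datetime.now().strftime("%Y%m%d")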

Defining function to find venues near geographic coordinates

In [22]:
import requests
from pandas.io.json import json_normalize 

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
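
The list comprehension in `getNearbyVenues` indexes directly into the nested JSON, so a venue with an empty `categories` list would raise an `IndexError`. Below is a small offline sketch of the same flattening step with that case skipped. The `sample_items` list is hand-written for illustration, not a real API response, and contains only the keys the function above reads; `flatten_items` is a hypothetical helper name.

```python
import pandas as pd

# hypothetical response fragment; only the keys that getNearbyVenues reads are included
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

def flatten_items(name, lat, lng, items):
    """Return one tuple per venue, skipping venues that have no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        if not categories:
            continue
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows

rows = flatten_items('Parkwoods', 43.753259, -79.329656, sample_items)
venues = pd.DataFrame(rows, columns=['Neighbourhood',
                                     'Neighbourhood Latitude',
                                     'Neighbourhood Longitude',
                                     'Venue',
                                     'Venue Latitude',
                                     'Venue Longitude',
                                     'Venue Category'])
print(venues)
```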

Setting a limit of 10 venue results per neighborhood and finding the venues in each Toronto neighborhood

In [23]:
LIMIT=10
df_toronto_neigh_venues = getNearbyVenues(names=df_toronto_neigh_2['Neighbourhood'],
                                          latitudes=df_toronto_neigh_2['Latitude'],
                                          longitudes=df_toronto_neigh_2['Longitude']
                                          )

Parkwoods
Victoria Village
Lawrence Heights, Lawrence Manor
Don Mills North
Glencairn
Flemingdon Park, Don Mills South
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
CFB Toronto, Downsview East
Silver Hills, York Mills
Downsview West
Downsview, North Park, Upwood Park
Humber Summit
Newtonbrook, Willowdale
Downsview Central
Bedford Park, Lawrence Manor East
Emery, Humberlea
Willowdale South
Downsview Northwest
York Mills West
Willowdale West
Harbourfront
Queen's Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25

In [25]:
df_toronto_neigh_venues.shape

(689, 7)

In [26]:
df_toronto_neigh_venues.head(10)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,GreenWin pool,43.756232,-79.333842,Pool
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
5,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
6,Victoria Village,43.725882,-79.315572,Memories of Africa,43.726602,-79.312427,Grocery Store
7,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,Roots,43.718221,-79.466776,Boutique
8,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,Kitchen Stuff Plus (Clearance Outlet),43.719096,-79.462675,Furniture / Home Store
9,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,Lac Vien Vietnamese Restaurant,43.721259,-79.468472,Vietnamese Restaurant


Creating a one-hot encoded dataframe that maps each venue to its category. For example, the venue at index 0, Brookbanks Park, has a value of 1 in the Park column and 0 in every other category column.

In [27]:
# one hot encoding
toronto_onehot = pd.get_dummies(df_toronto_neigh_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = df_toronto_neigh_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head(20)

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,...,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,"Lawrence Heights, Lawrence Manor",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,"Lawrence Heights, Lawrence Manor",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Lawrence Heights, Lawrence Manor",0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
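
To make the encoding concrete, here is a minimal sketch on three made-up venue rows (the frame `venues` is hypothetical). Each row ends up with exactly one 1, in the column of its category. `dtype=int` is passed so the result holds 0/1 rather than booleans, which is the default in recent pandas versions.

```python
import pandas as pd

# hypothetical three-venue sample
venues = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Venue Category': ['Park', 'Pool', 'Coffee Shop'],
})

# one column per category, 1 where the venue belongs to that category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)

# put the neighbourhood name back as the first column
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])
print(onehot)
```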


In [28]:
toronto_onehot.shape

(689, 175)

In [29]:
toronto_onehot.iloc[0][toronto_onehot.iloc[0]==1]

Park    1
Name: 0, dtype: object

Grouping venues by neighborhood

In [30]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Terminal,American Restaurant,Arts & Crafts Store,Asian Restaurant,...,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.1,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
1,Agincourt,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
4,"Alderwood, Long Branch",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
5,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
6,Bayview Village,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
7,"Bedford Park, Lawrence Manor East",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
8,Berczy Park,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.1,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
9,"Birch Cliff, Cliffside West",0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.00,0.0,0.000000,0.0,0.0,0.0,0.0,0.00,0.0
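
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues in each category, so each row sums to 1. A small sketch on a made-up one-hot frame (`onehot` below is hypothetical):

```python
import pandas as pd

# hypothetical one-hot rows: two venues in Parkwoods, one in Victoria Village
onehot = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Parkwoods', 'Victoria Village'],
    'Coffee Shop': [0, 0, 1],
    'Park': [1, 0, 0],
    'Pool': [0, 1, 0],
})

# mean of 0/1 indicators = fraction of the neighbourhood's venues in each category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Because the venue query is capped by `LIMIT`, these fractions are based on at most that many venues per neighbourhood.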


In [31]:
toronto_grouped.shape

(99, 175)

Printing each neighbourhood with the 5 most common venue types

In [32]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
          venue  freq
0    Steakhouse   0.2
1  Concert Hall   0.1
2          Café   0.1
3     Speakeasy   0.1
4   Opera House   0.1


----Agincourt----
                       venue  freq
0             Breakfast Spot   0.2
1                     Lounge   0.2
2             Clothing Store   0.2
3  Latin American Restaurant   0.2
4               Skating Rink   0.2


----Agincourt North, L'Amoreaux East, Milliken, Steeles East----
               venue  freq
0               Park   0.5
1         Playground   0.5
2  Accessories Store   0.0
3              Motel   0.0
4     Medical Center   0.0


----Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown----
                  venue  freq
0         Grocery Store  0.22
1            Beer Store  0.11
2   Fried Chicken Joint  0.11
3          Liquor Store  0.11
4  Fast Food Restaurant  0.11


----Alderwood, Long Branch----
                venue  freq
0         Pizza

Creating a new dataframe with the 10 most common venues in each neighbourhood

In [33]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [34]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Steakhouse,Opera House,Coffee Shop,Concert Hall,Vegetarian / Vegan Restaurant,Speakeasy,Plaza,Café,Hotel,Dance Studio
1,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Clothing Store,Construction & Landscaping,Convenience Store,Concert Hall,Drugstore,Dog Run
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Playground,Park,Yoga Studio,Dance Studio,Drugstore,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pharmacy,Pizza Place,Fried Chicken Joint,Sandwich Place,Beer Store,Liquor Store,Fast Food Restaurant,Yoga Studio,Deli / Bodega
4,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Dance Studio,Sandwich Place,Gym,Pharmacy,Athletics & Sports,Skating Rink,Pub,Discount Store


Clustering the neighbourhoods into 10 clusters

In [35]:
from sklearn.cluster import KMeans

kclusters = 10
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_toronto_neigh_2
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,6.0,Park,Food & Drink Shop,Pool,Drugstore,Dog Run,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop,Department Store
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Grocery Store,Coffee Shop,Portuguese Restaurant,Hockey Arena,Yoga Studio,Dance Studio,Dog Run,Discount Store,Diner,Dim Sum Restaurant
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,2.0,Clothing Store,Accessories Store,Arts & Crafts Store,Furniture / Home Store,Event Space,Coffee Shop,Boutique,Women's Store,Vietnamese Restaurant,American Restaurant
3,M3B,North York,Don Mills North,43.745906,-79.352188,0.0,Café,Gym / Fitness Center,Japanese Restaurant,Basketball Court,Baseball Field,Caribbean Restaurant,Dessert Shop,Eastern European Restaurant,Drugstore,Dog Run
4,M6B,North York,Glencairn,43.709577,-79.445073,2.0,Japanese Restaurant,Metro Station,Pub,Bakery,Yoga Studio,Deli / Bodega,Dog Run,Discount Store,Diner,Dim Sum Restaurant
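
The `Cluster Labels` values above are floats (6.0, 0.0) rather than integers. A likely cause is that `join` is a left join: `df_toronto_neigh_2` has 103 rows while `toronto_grouped` has 99, so neighbourhoods without venue data get a missing label and the column is upcast to float. The sketch below reproduces this on made-up data; the frames and the neighbourhood names 'A' to 'E' are hypothetical, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# made-up frequency table: two park-heavy and two coffee-heavy neighbourhoods
grouped = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Park': [1.0, 0.9, 0.0, 0.1],
    'Coffee Shop': [0.0, 0.1, 1.0, 0.9],
})

# cluster on the category frequencies only
features = grouped.drop(columns='Neighbourhood')
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = pd.DataFrame({'Neighbourhood': grouped['Neighbourhood'],
                       'Cluster Labels': kmeans.labels_})

# 'E' has no venue data, so the left join leaves its label missing (NaN)
all_neigh = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C', 'D', 'E']})
merged = all_neigh.join(labels.set_index('Neighbourhood'), on='Neighbourhood')
print(merged)
```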


Mapping the clusters

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
from geopy.geocoders import Nominatim
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: / 

In [44]:
# finding latitude and longitude of Toronto
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode('Toronto, ON')
latitude = location.latitude
longitude = location.longitude
print(location, latitude, longitude)

NameError: name 'Nominatim' is not defined

In [41]:
# creating map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# adding markers to the map
markers_colors = []
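for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    # neighbourhoods without venue data have no cluster label after the join; skip them
    if pd.isna(cluster):
        continue
    # labels are stored as floats after the join, so convert before using them as list indices
    cluster = int(cluster)
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)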
       
map_clusters

NameError: name 'latitude' is not defined
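
Independently of the `NameError`, `Cluster Labels` holds floats after the join and may hold missing values, so it cannot be used directly as an index into the `rainbow` list. A minimal sketch of a colour lookup that tolerates both cases follows; `palette`, `cluster_color`, and the grey default are hypothetical names and choices, not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 10

# one hex colour per cluster, evenly spaced along the rainbow colormap
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

def cluster_color(cluster, default='#808080'):
    """Return the hex colour for a cluster label; missing labels get a grey default."""
    if cluster is None or np.isnan(cluster):
        return default
    return palette[int(cluster)]

print(cluster_color(6.0), cluster_color(float('nan')))
```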

Let's see what the cluster with label 4 looks like (the fifth cluster, since labels start at 0)

In [441]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Victoria Village,Intersection,Financial or Legal Service,Coffee Shop,Pizza Place,Hockey Arena,Portuguese Restaurant,Yoga Studio,Electronics Store,Eastern European Restaurant,Drugstore
2,"Lawrence Heights, Lawrence Manor",Clothing Store,Accessories Store,Boutique,Furniture / Home Store,Event Space,Coffee Shop,Women's Store,Vietnamese Restaurant,Airport Terminal,Discount Store
7,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Diner,Deli / Bodega,Restaurant,Middle Eastern Restaurant,Bank,Sushi Restaurant,Ice Cream Shop,Bridal Shop,Dance Studio
9,"Northwood Park, York University",Coffee Shop,Bar,Caribbean Restaurant,Furniture / Home Store,Massage Studio,Yoga Studio,Diner,Event Space,Empanada Restaurant,Electronics Store
18,"Bedford Park, Lawrence Manor East",Coffee Shop,Comfort Food Restaurant,Indian Restaurant,Restaurant,Italian Restaurant,Thai Restaurant,Café,Pub,Sushi Restaurant,Electronics Store
25,Queen's Park,Coffee Shop,Yoga Studio,Italian Restaurant,Gym,Creperie,Portuguese Restaurant,Burrito Place,Park,Garden Center,Garden
29,Central Bay Street,Coffee Shop,Park,Modern European Restaurant,Bubble Tea Shop,Gastropub,Japanese Restaurant,Italian Restaurant,Sushi Restaurant,Electronics Store,Eastern European Restaurant
33,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Gym,Gastropub,Hotel,Beer Bar,Restaurant,Pub,Café,Yoga Studio,Dessert Shop
43,Queen's Park,Coffee Shop,Yoga Studio,Italian Restaurant,Gym,Creperie,Portuguese Restaurant,Burrito Place,Park,Garden Center,Garden
47,Woburn,Coffee Shop,Korean Restaurant,Yoga Studio,Dessert Shop,Event Space,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Dog Run


In [486]:
int(toronto_merged['Cluster Labels'])

TypeError: cannot convert the series to <class 'int'>
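
The built-in `int()` only accepts a single value, which is why it fails on a whole Series. To convert the column, `astype` can be used instead. The sketch below uses a made-up Series with one missing label; pandas' nullable `'Int64'` dtype keeps the missing entry, while `dropna().astype(int)` discards it first.

```python
import numpy as np
import pandas as pd

# hypothetical labels as they might look after the join: floats with one missing value
labels = pd.Series([6.0, 0.0, np.nan, 2.0], name='Cluster Labels')

# nullable integer dtype: keeps the missing entry as <NA>
as_nullable = labels.astype('Int64')

# plain integers: missing entries must be dropped first
as_int = labels.dropna().astype(int)

print(as_nullable)
print(as_int)
```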

In [1]:
# print the neighbourhoods of each cluster in turn (labels 0 to 9)
for i in range(10):
    print(toronto_merged.loc[toronto_merged['Cluster Labels'] == i, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]])

NameError: name 'toronto_merged' is not defined