<a id='#scrapping'></a>
# Scraping

Importing libraries for BeautifulSoup

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Extracting HTML from the URL

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

wikipedia_html = requests.get(url)
soup = BeautifulSoup(wikipedia_html.content, 'html.parser')

Extracting table from HTML and converting to dataframe

In [3]:
table = soup.find_all('table', class_='wikitable sortable')

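# wrap the HTML in StringIO; recent pandas versions deprecate passing a literal HTML string
from io import StringIO
df_toronto = pd.read_html(StringIO(str(table)))[0]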
df_toronto.rename(columns = {'Neighbourhood':'Neighborhood'}, inplace = True)

Data cleansing: removing rows whose Borough is 'Not assigned'.

In [4]:
df_toronto = df_toronto[df_toronto.Borough != 'Not assigned']

Checking the Neighborhood column for 'Not assigned'

In [5]:
df_toronto[df_toronto.Neighborhood == 'Not assigned'].count()

Postal Code     0
Borough         0
Neighborhood    0
dtype: int64

There are no 'Not assigned' neighborhoods, so no action is required. Next, a spot check for a duplicated postal code, using M5A.

In [6]:
df_toronto[df_toronto['Postal Code'] == 'M5A'].count()

Postal Code     1
Borough         1
Neighborhood    1
dtype: int64

The spot check returns a single row for M5A, so that code is not duplicated. No further cleansing is applied, and the data is ready for the next stage of analysis.
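The check above looks at a single postal code (M5A). A minimal sketch of a check across every postal code at once, using `duplicated` and `is_unique` on a small made-up frame (the rows and the name `sample` are illustrative, not the scraped data):

```python
import pandas as pd

# Illustrative rows only; the real frame is df_toronto from the cells above.
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
})

# keep=False flags every row whose postal code appears more than once
dupes = sample[sample['Postal Code'].duplicated(keep=False)]
n_duplicate_rows = len(dupes)

# is_unique is True only when no postal code repeats
all_unique = sample['Postal Code'].is_unique
print(n_duplicate_rows, all_unique)
```

Run against `df_toronto`, an empty `dupes` frame would confirm that every postal code appears once.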

In [7]:
print('The final shape of the data is', df_toronto.shape)

The final shape of the data is (103, 3)


<a id='#latlong'></a>
# Latitude/Longitude

Now that we have finalized our data, it's time to get the corresponding latitude and longitude for each postal code. For this purpose, we will use the geocoder package to look up the location details.

In [8]:
import geocoder

In [9]:
# Defining function for retrieving latitude/longitude based on post code

def get_latitude_longitude(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    
    return latitude, longitude

Loop through the whole dataset, pass each postal code to the function above, and retrieve the latitude and longitude. The retrieved values are then stored in the same dataframe.

In [10]:
for i, row in df_toronto.iterrows():
    postal_code = row['Postal Code']
    
    #Function call
    #lat, long = get_latitude_longitude(postal_code)
    
    #Appending to dataframe
    #df_toronto.at[i, 'Latitude'] = lat
    #df_toronto.at[i, 'Longitude'] = long

Under ideal circumstances, the code above (once uncommented) would connect to the GeoCoder API, get the latitude and longitude for each postal code, and write them to new columns. Since the API still could not be reached after waiting 3 hours, we will use the provided CSV sheet instead.
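The `while` loop in `get_latitude_longitude` never exits if the service keeps returning nothing. A minimal sketch of a bounded retry, assuming a generic lookup callable; `lookup_with_retries` and `fake_lookup` are hypothetical names, and no real geocoding service is called here:

```python
import time

def lookup_with_retries(lookup, query, max_attempts=5, delay_seconds=1.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before trying again
    return None  # the caller decides how to handle a failed lookup


# Stand-in for a geocoding call: fails twice, then returns coordinates.
calls = {'count': 0}

def fake_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.65426, -79.360636]

coords = lookup_with_retries(fake_lookup, 'M5A, Toronto, Ontario', delay_seconds=0)
print(coords, calls['count'])
```

With a cap on attempts, a dead service produces a `None` that can be logged or skipped instead of a notebook cell that runs for hours.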

In [11]:
#URL for CSV containing latitude/Longitudes
csv_url = 'https://cocl.us/Geospatial_data'

#Populating into dataframe
df_lat_long = pd.read_csv(csv_url)
df_lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merging two dataframes to finalize dataset for analysis

In [12]:
df_toronto = pd.merge(df_toronto, df_lat_long, on='Postal Code')
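A minimal sketch of the same kind of merge with the join key and matching assumptions spelled out. It uses a left join (the cell above uses the default inner join) so that unmatched codes stay visible; the frames are small made-up samples and `M9Z` is an invented code:

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],   # 'M9Z' is a made-up code with no match
    'Borough': ['North York', 'North York', 'Example'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if either side has repeated keys;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(left, right, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['Postal Code'].tolist())
```

An empty `unmatched` frame on the real data would show that every postal code received coordinates.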

In [13]:
df_toronto.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


<a id='#analysis'></a>
# Analysis / Foursquare / Exploration / Clustering

With the dataframe finalized, we proceed with data exploration, visualization and analysis.

For this purpose, we will use only a subset of the data: the boroughs whose name contains 'York'.

In [14]:
df_york = df_toronto[df_toronto['Borough'].str.contains('York')]
df_york.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
10,M6B,North York,Glencairn,43.709577,-79.445073
13,M3C,North York,Don Mills,43.7259,-79.340923
14,M4C,East York,Woodbine Heights,43.695344,-79.318389
16,M6C,York,Humewood-Cedarvale,43.693781,-79.428191
21,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512


In [15]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_york['Borough'].unique()),
        df_york.shape[0]
    )
)

The dataframe has 3 boroughs and 34 neighborhoods.


Importing libraries

In [16]:
# library to handle requests
import requests

# transform JSON file into a pandas dataframe
from pandas import json_normalize

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# map rendering library
import folium 

Use the geopy library to get the latitude and longitude values of Toronto.

In [17]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="cn_explorer")

location = geolocator.geocode(address)

latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.


Create a map of Toronto with neighborhoods superimposed on top.

In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_york['Latitude'], df_york['Longitude'], df_york['Borough'], df_york['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define Foursquare Credentials and Version

In [19]:
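# @hidden_cell
import os

# Credentials are read from the environment rather than hard-coded;
# the variable names below are assumptions, use whatever your setup defines.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version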

Let's explore the first neighborhood in our dataframe.

In [20]:
neighborhood_latitude = df_york.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_york.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df_york.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


Setting URL for API call

In [21]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, 500, 100)

Send the GET request and examine the results

In [22]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f2e9531f39d6963a4de9599'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [23]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [24]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,649 Variety,Convenience Store,43.754513,-79.331942
2,Variety Store,Food & Drink Shop,43.751974,-79.333114
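To see where the dotted column names such as `venue.location.lat` come from, here is a minimal sketch of `pd.json_normalize` on a hand-written payload shaped like one entry of the `items` list (the payload is trimmed to the fields used above and is not a full API response):

```python
import pandas as pd

# One hand-written item, reduced to the fields used in this notebook
items = [
    {'venue': {'name': 'Brookbanks Park',
               'categories': [{'name': 'Park'}],
               'location': {'lat': 43.751976, 'lng': -79.33214}}},
]

flat = pd.json_normalize(items)

# Nested dicts become dotted column names; lists are left as list objects
column_names = sorted(flat.columns)
categories_cell = flat.loc[0, 'venue.categories']
print(column_names)
print(categories_cell)
```

Because the `categories` list is not flattened, a helper such as `get_category_type` is still needed to pull the first category name out of each cell.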


And how many venues were returned by Foursquare?

In [25]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


### Explore Neighborhoods

Let's create a function that repeats the same process for all the neighborhoods whose borough name contains 'York'.

In [26]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
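In the function above, `v['venue']['categories'][0]['name']` raises an `IndexError` for a venue whose category list is empty, and the URL is assembled by hand. A minimal sketch of two helpers that address both points, without making any network call; `build_explore_url` and `venue_rows` are hypothetical names, and the sample items are made up:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with urlencode instead of manual string formatting."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def venue_rows(name, lat, lng, items):
    """Turn one neighborhood's items list into rows, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up items: one with a category, one without
items = [
    {'venue': {'name': 'Brookbanks Park', 'categories': [{'name': 'Park'}],
               'location': {'lat': 43.751976, 'lng': -79.33214}}},
    {'venue': {'name': 'Uncategorised Venue', 'categories': [],
               'location': {'lat': 43.75, 'lng': -79.33}}},
]
url = build_explore_url('id', 'secret', '20180605', 43.7532586, -79.3296565)
rows = venue_rows('Parkwoods', 43.753259, -79.329656, items)
print(url)
print(rows)
```

`urlencode` percent-encodes the comma in `ll`, which the server decodes back to the original value.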

Now run the above function on each neighborhood and create a new dataframe called *york_venues*.

In [27]:
# maximum number of venues to request per neighborhood
LIMIT = 100
york_venues = getNearbyVenues(names = df_york['Neighborhood'],
                                   latitudes = df_york['Latitude'],
                                   longitudes = df_york['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Parkview Hill, Woodbine Gardens
Glencairn
Don Mills
Woodbine Heights
Humewood-Cedarvale
Caledonia-Fairbanks
Leaside
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Del Ray, Mount Dennis, Keelsdale and Silverthorn
Humberlea, Emery
Willowdale, Willowdale East
Downsview
Runnymede, The Junction North
Weston
York Mills West
Willowdale, Willowdale West


Let's check the size of the resulting dataframe

In [28]:
print(york_venues.shape)
york_venues.head()

(333, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


Let's check how many venues were returned for each neighborhood

In [29]:
york_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",27,27,27,27,27,27
Caledonia-Fairbanks,4,4,4,4,4,4
"Del Ray, Mount Dennis, Keelsdale and Silverthorn",4,4,4,4,4,4
Don Mills,28,28,28,28,28,28
Downsview,13,13,13,13,13,13
"East Toronto, Broadview North (Old East York)",4,4,4,4,4,4
"Fairview, Henry Farm, Oriole",64,64,64,64,64,64
Glencairn,4,4,4,4,4,4


Let's find out how many unique categories can be curated from all the returned venues

In [30]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 124 unique categories.


### Analyze Each Neighborhood

In [31]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [32]:
york_onehot.shape

(333, 125)

Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [33]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bagel Shop,Bakery,...,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Women's Store,Yoga Studio
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,...,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0
4,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Don Mills,0.0,0.0,0.0,0.035714,0.0,0.035714,0.035714,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview,0.0,0.076923,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"East Toronto, Broadview North (Old East York)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Fairview, Henry Farm, Oriole",0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0,0.03125,...,0.0,0.015625,0.015625,0.0,0.015625,0.0,0.0,0.0,0.015625,0.0
9,Glencairn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
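To make the frequencies concrete, here is a minimal sketch of the one-hot encoding and grouped mean on five venue rows taken from the *york_venues* preview above. The names `venues`, `onehot` and `grouped` are local to this sketch, and `dtype=float` is passed to `get_dummies` only to keep the indicator columns numeric:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Parkwoods', 'Parkwoods',
                     'Victoria Village', 'Victoria Village'],
    'Venue Category': ['Park', 'Convenience Store', 'Food & Drink Shop',
                       'Hockey Arena', 'Coffee Shop'],
})

# one indicator column per category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# the mean of 0/1 indicators is the share of venues in each category
grouped = onehot.groupby('Neighborhood').mean().reset_index()

parkwoods_park = grouped.loc[grouped['Neighborhood'] == 'Parkwoods', 'Park'].iloc[0]
row_sums = grouped.drop(columns='Neighborhood').sum(axis=1)
print(grouped)
```

Each row of `grouped` sums to 1, so a value such as 0.25 in the table above means one venue in four falls in that category.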


Confirming new size

In [34]:
york_grouped.shape

(29, 125)

Let's print each neighborhood along with the top 5 most common venues

In [35]:
num_top_venues = 5

for hood in york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
            venue  freq
0            Bank  0.09
1     Coffee Shop  0.09
2     Bridal Shop  0.04
3  Sandwich Place  0.04
4      Restaurant  0.04


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1   Chinese Restaurant  0.25
2                 Café  0.25
3                 Bank  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.11
1      Sandwich Place  0.07
2          Restaurant  0.07
3         Coffee Shop  0.07
4         Pizza Place  0.04


----Caledonia-Fairbanks----
           venue  freq
0           Park  0.50
1  Women's Store  0.25
2           Pool  0.25
3  Jewelry Store  0.00
4  Movie Theater  0.00


----Del Ray, Mount Dennis, Keelsdale and Silverthorn----
               venue  freq
0        Coffee Shop  0.25
1  Convenience Store  0.25
2     Sandwich Place  0.25
3     Discount Store  0.25
4  Accessories Store  0.00

Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [37]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Shopping Mall,Restaurant,Bridal Shop,Pizza Place,Pharmacy,Pet Store,Middle Eastern Restaurant,Gas Station
1,Bayview Village,Japanese Restaurant,Chinese Restaurant,Café,Bank,Yoga Studio,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Restaurant,Sandwich Place,Coffee Shop,Sushi Restaurant,Breakfast Spot,Hobby Shop,Pizza Place,Pharmacy,Grocery Store
3,Caledonia-Fairbanks,Park,Women's Store,Pool,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
4,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",Discount Store,Coffee Shop,Convenience Store,Sandwich Place,Yoga Studio,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Dance Studio
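The `try`/`except` around `indicators[ind]` produces correct column names for the 10 venues used here, but it would label an 11th column correctly only by accident and a 21st as "21th". A minimal sketch of a general ordinal helper (`ordinal` is a hypothetical name, not part of the notebook):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'          # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column layout as in the cell above, built without try/except
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```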


### Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [38]:
# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 3, 1, 3, 3, 3, 1, 3, 1])
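For intuition about what the `KMeans` call is doing, here is a minimal NumPy sketch of Lloyd's algorithm, the basic assign-then-update loop behind *k*-means. It is not scikit-learn's implementation, which by default adds k-means++ initialisation and other refinements. The function name `lloyd_kmeans` and the six sample rows are made up for illustration:

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared distance from every point to every centroid, shape (n, k)
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):            # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                       # assignments have stopped changing
        centroids = new_centroids
    return labels, centroids

# Two made-up frequency profiles: park-heavy rows and cafe-heavy rows
points = np.array([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
                   [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]])
labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is also why the cluster numbers in this notebook carry no ordering.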

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [39]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = df_york

# join df_york with neighborhoods_venues_sorted to add the cluster label and top venues for each neighborhood
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

york_merged # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Convenience Store,Food & Drink Shop,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice
1,M4A,North York,Victoria Village,43.725882,-79.315572,2.0,Financial or Legal Service,French Restaurant,Coffee Shop,Pizza Place,Portuguese Restaurant,Hockey Arena,Cosmetics Shop,Department Store,Deli / Bodega,Dance Studio
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,3.0,Furniture / Home Store,Clothing Store,Accessories Store,Event Space,Boutique,Coffee Shop,Gift Shop,Vietnamese Restaurant,Food Truck,Dim Sum Restaurant
7,M3B,North York,Don Mills,43.745906,-79.352188,3.0,Gym,Café,Coffee Shop,Japanese Restaurant,Restaurant,Beer Store,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Caribbean Restaurant
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,2.0,Pizza Place,Intersection,Gastropub,Pharmacy,Athletics & Sports,Café,Bank,Gym / Fitness Center,Dance Studio,Dim Sum Restaurant
10,M6B,North York,Glencairn,43.709577,-79.445073,1.0,Park,Sushi Restaurant,Japanese Restaurant,Pub,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Diner,Curling Ice
13,M3C,North York,Don Mills,43.7259,-79.340923,3.0,Gym,Café,Coffee Shop,Japanese Restaurant,Restaurant,Beer Store,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Caribbean Restaurant
14,M4C,East York,Woodbine Heights,43.695344,-79.318389,3.0,Skating Rink,Park,Video Store,Beer Store,Bus Stop,Athletics & Sports,Curling Ice,Dance Studio,Department Store,Diner
16,M6C,York,Humewood-Cedarvale,43.693781,-79.428191,3.0,Tennis Court,Trail,Hockey Arena,Field,Yoga Studio,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega,Dance Studio
21,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512,1.0,Park,Women's Store,Pool,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice


Dropping rows whose cluster label is NaN (neighborhoods with no match in the venue table) from the merged dataframe

In [40]:
york_merged.dropna(subset = ["Cluster Labels"], inplace = True)
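The float labels (1.0, 2.0, ...) in the merged table come from the join: a neighborhood with no matching row gets NaN, which forces the column to float. A minimal sketch on made-up rows showing the effect and an optional cast back to integers after dropping the NaN rows ('No Venues Here' is an invented neighborhood):

```python
import pandas as pd

neighborhoods = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Glencairn', 'No Venues Here']})
labels = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Glencairn'],
                       'Cluster Labels': [1, 1]})

joined = neighborhoods.join(labels.set_index('Neighborhood'), on='Neighborhood')
dtype_after_join = str(joined['Cluster Labels'].dtype)   # NaN forces a float column

# drop unmatched rows, then restore integer labels
cleaned = joined.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(dtype_after_join)
print(cleaned)
```

With integer labels, later code can index a colour list with `cluster` directly rather than calling `int(cluster)` each time.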

Finally, let's visualize the resulting clusters

In [41]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Exploring Clusters

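Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster; this is left as an exercise. Note that the headings below are numbered from 1, while the cluster labels in the code start at 0, so Cluster 1 corresponds to label 0.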
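One way to start naming clusters is to count which category most often appears as the top venue within each one. A minimal sketch using `pd.crosstab`, with the first-most-common venues for labels 1 and 2 copied from the tables in this section; the frame name `top` is local to the sketch:

```python
import pandas as pd

top = pd.DataFrame({
    'Cluster Labels': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    '1st Most Common Venue': ['Park', 'Park', 'Park', 'Intersection', 'Park',
                              'Financial or Legal Service', 'Pizza Place',
                              'Pizza Place', 'Pizza Place', 'Bank'],
})

# rows: cluster label, columns: category, cells: how often it ranks first
table = pd.crosstab(top['Cluster Labels'], top['1st Most Common Venue'])

# the category that ranks first most often in each cluster
dominant = table.idxmax(axis=1)
print(table)
print(dominant)
```

On this sample the cluster with label 1 is dominated by parks and the cluster with label 2 by pizza places, which suggests candidate names for them.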

#### Cluster 1

In [42]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,0.0,Food Service,Baseball Field,Yoga Studio,Distribution Center,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio


#### Cluster 2

In [43]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,1.0,Park,Convenience Store,Food & Drink Shop,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice
10,North York,1.0,Park,Sushi Restaurant,Japanese Restaurant,Pub,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Diner,Curling Ice
21,York,1.0,Park,Women's Store,Pool,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
35,East York,1.0,Intersection,Convenience Store,Park,Distribution Center,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Dance Studio
66,North York,1.0,Park,Convenience Store,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice,Dance Studio


#### Cluster 3

In [44]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,2.0,Financial or Legal Service,French Restaurant,Coffee Shop,Pizza Place,Portuguese Restaurant,Hockey Arena,Cosmetics Shop,Department Store,Deli / Bodega,Dance Studio
8,East York,2.0,Pizza Place,Intersection,Gastropub,Pharmacy,Athletics & Sports,Café,Bank,Gym / Fitness Center,Dance Studio,Dim Sum Restaurant
50,North York,2.0,Pizza Place,Gym,Yoga Studio,Discount Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice
63,York,2.0,Pizza Place,Convenience Store,Grocery Store,Brewery,Yoga Studio,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Curling Ice
72,North York,2.0,Bank,Coffee Shop,Pharmacy,Pizza Place,Yoga Studio,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega,Dance Studio


#### Cluster 4

In [45]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,North York,3.0,Furniture / Home Store,Clothing Store,Accessories Store,Event Space,Boutique,Coffee Shop,Gift Shop,Vietnamese Restaurant,Food Truck,Dim Sum Restaurant
7,North York,3.0,Gym,Café,Coffee Shop,Japanese Restaurant,Restaurant,Beer Store,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Caribbean Restaurant
13,North York,3.0,Gym,Café,Coffee Shop,Japanese Restaurant,Restaurant,Beer Store,Sporting Goods Shop,Clothing Store,Chinese Restaurant,Caribbean Restaurant
14,East York,3.0,Skating Rink,Park,Video Store,Beer Store,Bus Stop,Athletics & Sports,Curling Ice,Dance Studio,Department Store,Diner
16,York,3.0,Tennis Court,Trail,Hockey Arena,Field,Yoga Studio,Dim Sum Restaurant,Dessert Shop,Department Store,Deli / Bodega,Dance Studio
23,East York,3.0,Coffee Shop,Sporting Goods Shop,Furniture / Home Store,Burger Joint,Bank,Mexican Restaurant,Department Store,Pet Store,Dessert Shop,Grocery Store
27,North York,3.0,Dog Run,Golf Course,Pool,Fast Food Restaurant,Mediterranean Restaurant,Yoga Studio,Curling Ice,Department Store,Deli / Bodega,Dance Studio
28,North York,3.0,Coffee Shop,Bank,Shopping Mall,Restaurant,Bridal Shop,Pizza Place,Pharmacy,Pet Store,Middle Eastern Restaurant,Gas Station
29,East York,3.0,Sandwich Place,Indian Restaurant,Yoga Studio,Bank,Burger Joint,Bus Line,Coffee Shop,Discount Store,Fast Food Restaurant,Grocery Store
33,North York,3.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar,Japanese Restaurant,Chinese Restaurant,Bakery,Bank,Cosmetics Shop


#### Cluster 5

In [46]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
52,North York,4.0,Park,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
64,York,4.0,Park,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Convenience Store,Cosmetics Shop,Curling Ice,Dance Studio
