# IBM Data Science Capstone Week 3 Assignment
---
This notebook contains the assignment for week 3 of the IBM Data Science Capstone Project on Coursera. It was created by Tim de Zwart.

### Part 1

First off, the necessary packages are imported.

In [1]:
import pandas as pd

The list of Canadian postal codes that start with "M" (the Toronto region) is scraped from the provided Wikipedia page and loaded into a pandas dataframe. Check **[here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)** for the list of postal codes. The `pd.read_html` function returns a list of the dataframes found on the page, so we will first check which one is the right one.

In [2]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

for df in dfs:
    print(df.head())

  Postal code           Borough                Neighborhood
0         M1A      Not assigned                         NaN
1         M2A      Not assigned                         NaN
2         M3A        North York                   Parkwoods
3         M4A        North York            Victoria Village
4         M5A  Downtown Toronto  Regent Park / Harbourfront
                                                  0   \
0                                                NaN   
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
2                                                 NL   
3                                                  A   

                                                  1   \
0                              Canadian postal codes   
1  NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...   
2                                                 NS   
3                                                  B   

                                                  2    3    4    5    6    7  

---
It turns out the first table is the one with the boroughs and the postal codes, which is the one that we need. This is therefore assigned to the dataframe `canada_pc`.

In [3]:
canada_pc = dfs[0]
canada_pc.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
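
Selecting the table by position (`dfs[0]`) stops working if the page layout changes. As a minimal sketch, assuming the column names shown above, the table could instead be picked by its columns. `pick_table` is a hypothetical helper, and the toy tables stand in for the real `dfs` list:

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError("no table with columns {}".format(sorted(required)))

# Toy stand-ins for the list returned by pd.read_html
tables = [
    pd.DataFrame({0: ["nav"], 1: ["Canadian postal codes"]}),
    pd.DataFrame({"Postal code": ["M3A"],
                  "Borough": ["North York"],
                  "Neighborhood": ["Parkwoods"]}),
]

selected = pick_table(tables, ["Postal code", "Borough", "Neighborhood"])
print(selected.columns.tolist())
```

With the real data, the call would be `pick_table(dfs, [...])` in place of `dfs[0]`.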


The table is now cleaned according to the instructions in the assignment. The following tasks are performed:
* Only cells with an assigned borough are processed. Cells where the borough has a value of **Not assigned** are dropped.
* Postal codes that have multiple neighborhoods assigned to them are currently listed with a " / " in between (see for example Regent Park / Harbourfront in the cell above). These neighborhoods will now be separated by commas.

In [4]:
# Drop the postal codes that are not assigned to a borough
canada_pc.drop(canada_pc[canada_pc['Borough']=="Not assigned"].index, inplace=True)

# Replace the " / " separator with a comma (assign back instead of a chained inplace call)
canada_pc['Neighborhood'] = canada_pc['Neighborhood'].str.replace(" / ", ", ", regex=False)

# Show cleaned dataframe
canada_pc.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Finally, the assignment states that there might be postal codes that are assigned to a borough, but are not assigned to a neighborhood. In order to check this, we see if there are any rows in the cleaned dataframe where the `Neighborhood` column is **Not assigned**.

In [5]:
canada_pc[canada_pc['Neighborhood']=="Not assigned"]

Unnamed: 0,Postal code,Borough,Neighborhood
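
Had the check returned any rows, they would need a neighborhood name. One common choice, assumed here rather than taken from this notebook, is to fall back to the borough name. A minimal sketch on invented toy data:

```python
import pandas as pd

# Invented rows for illustration; the second one has no neighborhood assigned
toy = pd.DataFrame({
    "Postal code": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Regent Park, Harbourfront", "Not assigned"],
})

# Rows with a borough but no usable neighborhood name
missing = toy["Neighborhood"].isna() | (toy["Neighborhood"] == "Not assigned")

# Copy the borough name into the neighborhood column for those rows only
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```

The mask also covers `NaN`, since the scraped table shows missing neighborhoods as `NaN` rather than as the string **Not assigned**.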


As can be seen above, the returned dataframe is empty, which means there are no postal codes assigned to boroughs but not to neighborhoods. The only thing that remains now is to reset the index to start at 0 again, and to check the shape of the cleaned dataframe.

In [6]:
canada_pc.reset_index(drop=True, inplace=True)
canada_pc.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
print("The shape of the cleaned dataframe is", canada_pc.shape)

The shape of the cleaned dataframe is (103, 3)


### Part 2

Now it's time to get the latitude and longitude coordinates of each neighborhood in order to obtain the Foursquare data. We will first try this with the `geocoder` package and its Google provider, as described in the assignment. We first import the package.

In [8]:
import geocoder

As described in the assignment, it can be difficult to obtain the coordinates through the `geocoder` package, since it will return **None** many times before actually giving the coordinate values. Therefore, we will use a `while` loop to keep trying until a value other than **None** is returned. We do this for every postal code in the column `Postal code`.

*Note: this is a markdown cell so it won't run when rerunning the code. In order to try the code, copy it to a code cell and run it.*

```python
# Create empty lists to store latitude and longitude values of postal codes
latitude = []
longitude = []

# Loop through all the postal codes
for pc in canada_pc['Postal code'].values:
    # Start with None in order to keep trying
    lat_lng_coords = None
    
    # Use the geocoder package until coordinates are retrieved
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(pc))
        lat_lng_coords = g.latlng
    
    # Append latitude and longitude lists for current postal codes
    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])
```
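
The `while` loop above never terminates if the service keeps returning **None**. A minimal sketch of a bounded alternative follows. `geocode_with_retries` and its parameters are hypothetical names, and a fake lookup stands in for the real service so the example runs offline:

```python
import time

def geocode_with_retries(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a failed lookup

# Stand-in for a flaky geocoding service: fails twice, then succeeds
responses = iter([None, None, [43.806686, -79.194353]])
fake_lookup = lambda query: next(responses)

result = geocode_with_retries(fake_lookup, "M1B, Toronto, Ontario")
print(result)
```

With the real package, `lookup` could be `lambda q: geocoder.google(q).latlng`, which mirrors the call used in the loop above.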

As was predicted in the assignment, the `geocoder` package unfortunately takes too much time to return the coordinates. Therefore we will "cheat" and import the provided .csv file with the coordinates.

In [9]:
coords = pd.read_csv('http://cocl.us/Geospatial_data')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The `Postal Code` column in this dataframe is capitalized differently than in the postal codes dataframe. In order to merge them seamlessly, the column is renamed to `Postal code`, and the dataframe is then merged with the postal codes dataframe. The new dataframe is saved as `toronto_data`.

In [10]:
coords.rename(columns={'Postal Code': 'Postal code'}, inplace=True)

toronto_data = pd.merge(canada_pc, coords, on='Postal code')
toronto_data.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
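
The default inner merge silently drops postal codes that have no match in the coordinates file. A minimal sketch of how such rows could be detected, using toy frames and the `how`, `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy frames; "M5A" deliberately has no coordinates
left = pd.DataFrame({"Postal code": ["M3A", "M4A", "M5A"],
                     "Borough": ["North York", "North York", "Downtown Toronto"]})
right = pd.DataFrame({"Postal code": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# how="left" keeps every postal code, validate raises if a key is duplicated,
# and indicator adds a _merge column showing where each row came from
checked = pd.merge(left, right, on="Postal code", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.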


### Part 3

For the final part of the assignment we will retrieve the data of the Toronto neighborhoods from Foursquare, and cluster them in a way that shows which neighborhoods are comparable. We first import the necessary packages. Comments were added to show what each package is used for.

In [11]:
import numpy as np

# Handle and normalize JSON files
import json
from pandas import json_normalize

# Convert address to latitude and longitude
from geopy.geocoders import Nominatim

# Handle requests
import requests

# Plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Clustering algorithm
from sklearn.cluster import KMeans

# Making maps
import folium

print('Libraries imported.')

Libraries imported.


First, let's retrieve the geographical coordinates of Toronto using `geopy`.

In [12]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Now that we have the geographical coordinates, we can use the `folium` package to create a map of Toronto with all the neighborhoods superimposed on top. This will give a clear image of where the neighborhoods are located.

In [13]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_toronto)

map_toronto

This is a very nice overview, but it is very crowded. Since this is just a practice assignment, let's focus on the boroughs that include the name "Toronto" in order to narrow down the amount of data to be analysed.

In [14]:
toronto_boroughs = toronto_data[toronto_data['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_boroughs.head(15)

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


Now let's show the map of Toronto with just the "Toronto"-boroughs.

In [15]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighborhood in zip(toronto_boroughs['Latitude'], toronto_boroughs['Longitude'], toronto_boroughs['Borough'], toronto_boroughs['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_toronto)

map_toronto

Now that we have the geographical data of all the neighborhoods in the "Toronto"-boroughs, we can use the Foursquare API to retrieve the venues per neighborhood (*note: we are actually querying per postal code, and multiple neighborhoods can share the same postal code, but "neighborhood" is easier to remember*). First, let's set the credentials and the API version for the GET request (*note: I am using my client_id and client_secret; for reproduction or testing please use **your own** id and secret*).

In [16]:
CLIENT_ID = "IMJYBRUN14HSL2VZYUJ0ZO5V1LLRNIEWWQREUJLOEATESNDF"
CLIENT_SECRET = "5K2ECNE5N4BLN5NUASD0VIWPJWAGOJSGIPYW4MKND4Y5CMSN"
VERSION = '20180605'
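
Hard-coded credentials end up in every shared copy of the notebook. A minimal sketch of reading them from environment variables instead. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions for illustration, not part of the Foursquare API:

```python
import os

def load_foursquare_credentials(environ=os.environ):
    """Read the client id and secret from environment variables."""
    try:
        return environ["FOURSQUARE_CLIENT_ID"], environ["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable {}".format(missing)) from None

# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

In the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would then replace the two literal strings.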

For every (set of) neighborhood(s) we will retrieve the first 100 venues in a 500 meter radius.

In [17]:
LIMIT = 100
radius = 500

A function is defined in order to retrieve the venues for every neighborhood. The process is:
* For every neighborhood an API request URL is set up using the defined data.
* The GET request retrieves a JSON file with all the available information about the venues.
* The relevant information is returned from the JSON files. In order to see how this is done, print one of the JSON files and take a look, or refer to the Coursera course.
* Store every venue for every neighborhood in a dataframe.

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)  # keep track of how far along we are
        
        # API request URL is created, calling upon global variables defined before
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # Make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # Filter the relevant information
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    
    # Store in dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']
    
    return nearby_venues
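
The function above indexes straight into the response, so a failed request or a venue without a category raises an exception partway through the loop. A minimal sketch of a more defensive extraction step, assuming the same `response` → `groups` → `items` layout. `extract_venues` is a hypothetical helper, and the sample payload (including "Unnamed Spot") is hand-written for illustration:

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into (neighborhood, ..., category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written payload in the shape the function above indexes into
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tandem Coffee",
               "location": {"lat": 43.653559, "lng": -79.361809},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Regent Park, Harbourfront", 43.65426, -79.360636)
print(rows[0][-1], rows[1][-1])

# A payload without a "response" key yields no rows instead of a KeyError
print(extract_venues({"meta": {"code": 429}}, "x", 0, 0))
```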

Run the function in order to create the dataframe with venues.

In [19]:
toronto_venues = getNearbyVenues(names=toronto_boroughs['Neighborhood'],
                                 latitudes=toronto_boroughs['Latitude'],
                                 longitudes=toronto_boroughs['Longitude']
                                )

print("\nFinished.")

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst  Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town

Check if the dataframe looks good.

In [20]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


That looks how it's supposed to look! We can check how many venues were returned for each neighborhood.

In [21]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,56,56,56,56,56,56
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
Business reply mail Processing CentrE,18,18,18,18,18,18
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,73,73,73,73,73,73
Christie,18,18,18,18,18,18
Church and Wellesley,77,77,77,77,77,77
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,31,31,31,31,31,31
Davisville North,7,7,7,7,7,7


There seem to be some neighborhoods with very few venues. Since we want to live in a lively neighborhood with a lot of venues, let's drop all the neighborhoods with fewer than 20 venues. We use a temporary dataframe `df` for the transformation. If we then check the number of venues per neighborhood, we see that only the neighborhoods with at least 20 venues are left.

In [22]:
df = toronto_venues

df = df[df.groupby('Neighborhood')['Neighborhood'].transform('count').ge(20)]
df.reset_index(drop=True, inplace=True)

toronto_venues = df

toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berczy Park,56,56,56,56,56,56
"Brockton, Parkdale Village, Exhibition Place",22,22,22,22,22,22
Central Bay Street,73,73,73,73,73,73
Church and Wellesley,77,77,77,77,77,77
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,31,31,31,31,31,31
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"High Park, The Junction South",24,24,24,24,24,24

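The filter relies on `transform('count')`, which returns one value per row (the size of that row's group) rather than one value per group. A minimal sketch on toy data, with a threshold of 2 instead of 20:

```python
import pandas as pd

# Toy venues: neighborhood "A" has three venues, "B" has one
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue": ["v1", "v2", "v3", "v4"],
})

# One group size per row, aligned with the original index
group_sizes = venues.groupby("Neighborhood")["Neighborhood"].transform("count")
print(group_sizes.tolist())

# Keep only rows whose neighborhood has at least 2 venues
kept = venues[group_sizes.ge(2)].reset_index(drop=True)
print(kept["Neighborhood"].unique().tolist())
```

Because the result is aligned row by row, it can be used directly as a boolean mask on the original dataframe.
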

In order to analyze each neighborhood, we will use one-hot encoding on the venue categories. We can use pandas' `get_dummies` function for this.

In [23]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add the neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# Move the neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Take the mean occurrence to see how frequent each venue category is per neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head(10)

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.017857,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.013699,0.0,0.013699,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.013699,0.0,0.0,0.013699,0.0,0.0,0.0
3,Church and Wellesley,0.025974,0.012987,0.012987,0.0,0.0,0.0,0.012987,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0
4,"Commerce Court, Victoria Hotel",0.0,0.0,0.04,0.0,0.0,0.01,0.0,0.01,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
5,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.03,0.0,0.0,0.01,0.0,0.03,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
7,"Garden District, Ryerson",0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.05,0.01,0.0,0.0,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
9,"High Park, The Junction South",0.0,0.0,0.0,0.041667,0.0,0.0,0.041667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
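
The values in this table are shares: the mean of a 0/1 column within a neighborhood is the fraction of that neighborhood's venues in that category. A minimal sketch on toy data:

```python
import pandas as pd

# Toy venues: "A" has four venues, "B" has two
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bar", "Park", "Bar", "Bar"],
})

# One indicator column per category, cast to float for averaging
onehot = pd.get_dummies(venues["Venue Category"]).astype(float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Per-neighborhood mean of each indicator column
shares = onehot.groupby("Neighborhood").mean()
print(shares.loc["A", "Coffee Shop"], shares.loc["B", "Bar"])
```

Each row of `shares` sums to 1, which makes neighborhoods with different venue counts comparable.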


We create a function that sorts the venue categories of a neighborhood by frequency in descending order. The top 10 venue categories for each neighborhood are then stored in a dataframe, which will later be used to describe the clusters.

In [24]:
# Create the function for sorting the venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# Create a dataframe where the top 10 venues for each neighborhood are displayed
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    # Make sure that the right indicator is used: 1st, 2nd, 3rd, then 4th-10th
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Beer Bar,Bakery,Italian Restaurant,Restaurant,Café,Seafood Restaurant,Farmers Market,Cheese Shop,Cocktail Bar
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Coffee Shop,Bar,Burrito Place,Restaurant,Climbing Gym,Stadium,Italian Restaurant,Intersection
2,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Japanese Restaurant,Middle Eastern Restaurant,Spa,Thai Restaurant,Bubble Tea Shop,Ice Cream Shop
3,Church and Wellesley,Coffee Shop,Gay Bar,Japanese Restaurant,Restaurant,Sushi Restaurant,Yoga Studio,Burger Joint,Pub,Hotel,Café
4,"Commerce Court, Victoria Hotel",Coffee Shop,Café,Restaurant,Hotel,American Restaurant,Gym,Italian Restaurant,Seafood Restaurant,Japanese Restaurant,Deli / Bodega
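
The `indicators` list above covers 1 to 10, but would produce "11st" style errors only if it were extended naively beyond that range. A minimal sketch of a general suffix helper. `ordinal` is a hypothetical name, not part of the notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are exceptions to the last-digit rule
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i))
                              for i in range(1, 11)]
print(columns[1], columns[4], columns[10])
```

This builds the same column names as the loop above without relying on exception handling.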


Now the *k-means* algorithm can be used in order to cluster the neighborhoods. To do this, we use the `scikit-learn` module that was imported earlier. We will cluster the neighborhoods into 5 clusters. As can be seen, we have five different labels (0-4).

In [25]:
# Amount of clusters
n_clusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=10).fit(toronto_grouped_clustering)

print(kmeans.labels_)

[1 1 1 1 1 4 1 1 1 2 4 1 1 1 1 1 1 2 1 1 1 1 3 0 1 2]
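
To see how the neighborhoods are distributed over the clusters, the labels printed above can be tallied. A minimal sketch using only the standard library, with the labels copied from that output:

```python
from collections import Counter

# Labels copied from the printed kmeans.labels_ output
labels = [1, 1, 1, 1, 1, 4, 1, 1, 1, 2, 4, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 0, 1, 2]

cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print("Cluster {}: {} neighborhoods".format(cluster, size))
```

Cluster 1 holds 19 of the 26 neighborhoods, while clusters 0 and 3 contain a single neighborhood each.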


A new dataframe called `toronto_merged` is created to store the neighborhood data, its cluster label and the most common venues. This dataframe is then used to visualise the clusters on a map.

In [26]:
# Insert the cluster labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# Merge the dataframe with the toronto_boroughs dataframe
toronto_merged = toronto_boroughs
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# Drop the NaN values (since only neighborhoods with >= 20 venues were analysed, not every neighborhood has a label)
toronto_merged.dropna(axis=0, inplace=True)
toronto_merged.reset_index(drop=True, inplace=True)
toronto_merged.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Park,Pub,Bakery,Mexican Restaurant,Café,Breakfast Spot,Farmers Market,Event Space,Electronics Store
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Diner,Yoga Studio,Bar,Sandwich Place,Café,Juice Bar,Burrito Place,Burger Joint,Mexican Restaurant
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1.0,Coffee Shop,Clothing Store,Bubble Tea Shop,Café,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Fast Food Restaurant,Tea Room
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1.0,Coffee Shop,Café,Cocktail Bar,Hotel,American Restaurant,Restaurant,Italian Restaurant,Beer Bar,Seafood Restaurant,Diner
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,1.0,Coffee Shop,Beer Bar,Bakery,Italian Restaurant,Restaurant,Café,Seafood Restaurant,Farmers Market,Cheese Shop,Cocktail Bar
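
The cluster labels appear as `1.0` rather than `1` because the join introduces `NaN` for neighborhoods without a label, which forces the column to float. A minimal sketch on toy data of casting the labels back to integers after the `NaN` rows are dropped:

```python
import pandas as pd

# Toy frames: neighborhood "B" was filtered out before clustering
boroughs = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
clustered = pd.DataFrame({"Neighborhood": ["A", "C"], "Cluster Labels": [1, 0]})

merged = boroughs.join(clustered.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].tolist())  # the NaN for "B" makes this a float column

# Drop unlabeled rows, then restore the integer dtype
merged = merged.dropna(subset=["Cluster Labels"]).reset_index(drop=True)
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged["Cluster Labels"].tolist())
```

With integer labels, the `int(cluster)` conversions in the mapping code are no longer needed.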


Finally, let's create a map to visualise the resulting clusters. We use the same method as before, using `folium` with labels.

In [27]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set the color schemes for the clusters
x = np.arange(n_clusters)
ys = [i + x + (i*x)**2 for i in range(n_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The most populated cluster is clearly cluster 1. Let's look at what types of venues are in the neighborhoods in cluster 1.

In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1.0,Coffee Shop,Park,Pub,Bakery,Mexican Restaurant,Café,Breakfast Spot,Farmers Market,Event Space,Electronics Store
1,Downtown Toronto,1.0,Coffee Shop,Diner,Yoga Studio,Bar,Sandwich Place,Café,Juice Bar,Burrito Place,Burger Joint,Mexican Restaurant
2,Downtown Toronto,1.0,Coffee Shop,Clothing Store,Bubble Tea Shop,Café,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Fast Food Restaurant,Tea Room
3,Downtown Toronto,1.0,Coffee Shop,Café,Cocktail Bar,Hotel,American Restaurant,Restaurant,Italian Restaurant,Beer Bar,Seafood Restaurant,Diner
4,Downtown Toronto,1.0,Coffee Shop,Beer Bar,Bakery,Italian Restaurant,Restaurant,Café,Seafood Restaurant,Farmers Market,Cheese Shop,Cocktail Bar
5,Downtown Toronto,1.0,Coffee Shop,Café,Italian Restaurant,Sandwich Place,Japanese Restaurant,Middle Eastern Restaurant,Spa,Thai Restaurant,Bubble Tea Shop,Ice Cream Shop
6,Downtown Toronto,1.0,Coffee Shop,Café,Gym,Restaurant,Bakery,Deli / Bodega,Thai Restaurant,Hotel,Sushi Restaurant,Bookstore
7,Downtown Toronto,1.0,Coffee Shop,Aquarium,Restaurant,Italian Restaurant,Café,Hotel,Scenic Lookout,Brewery,Sporting Goods Shop,Fried Chicken Joint
8,West Toronto,1.0,Bar,Restaurant,Vietnamese Restaurant,Men's Store,Coffee Shop,Café,Asian Restaurant,Yoga Studio,Brewery,Cuban Restaurant
10,Downtown Toronto,1.0,Coffee Shop,Hotel,Café,Restaurant,Seafood Restaurant,American Restaurant,Italian Restaurant,Gastropub,Bar,Bakery


We can see that most of the neighborhoods in this cluster are located in Downtown Toronto, and that there are lots of coffee shops, bars and cafés, as well as restaurants. Seems like there is a lot of fun to be had in those neighborhoods!

### This concludes this notebook, thanks for reading!