## Notebook for week 3 - Data Science Capstone Project

This notebook is the second one required by the *Data Science Capstone Project* course. It deals with the city of Toronto, Canada, and the venues located there. We will cluster the boroughs of Toronto according to the frequency of venue types within them, and at the end we will see a map reflecting that clustering.

The work is divided into ***four*** major parts:

1. **Web Scraping**
2. **Obtaining the coordinates**
3. **Getting venue data**
4. **Data Analysis**

### Part 0. Importing the required libraries.

In [1]:
# For web scraping
from urllib.request import urlopen
from urllib.error import HTTPError
import requests
from bs4 import BeautifulSoup

# For data manipulation and analysis
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

# For data visualization
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

### Part 1. Web Scraping.

In [2]:
# Fetches the page and scrapes the HTML table
def getTable(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        # Returns only the text from the table
        table = bs.body.table.get_text()
    except AttributeError as e:
        return None
    return table

# Transforms the html strings to a list 
def clean_html(html: str) -> list:
    dirty_output = html.split('\n')
    # final result
    cleaned_output = []
    # drops the empty strings left by the split
    for item in dirty_output:
        if item:
            cleaned_output.append(item)
    return cleaned_output        

# Transforms the html to a final dict 
# that will be used to create the DataFrame
def html_to_dict(html: str) -> dict:
    items = clean_html(html)
    # Fragile: assumes the first three items are the column headers (true for this table)
    output = {items[0]: [], items[1]: [], items[2]: []}
    # We'll add every first, second and third to the respective key in output
    for index in range(3, len(items), 3):
        # Filters out any entry without a Borough
        if items[index + 1] != 'Not assigned':
            output[items[0]].append(items[index])
            output[items[1]].append(items[index + 1])
            output[items[2]].append(items[index + 2])
    return output   

### Considerations:

   - Wikipedia already groups each Postal Code with all of its Neighbourhoods in a single row.
   - Wikipedia has since edited the table, so it no longer contains rows with a named Borough but a 'Not assigned' Neighbourhood.
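
The row-grouping step in `html_to_dict` can be tried offline on a hand-written sample. The sketch below is only an illustration, assuming the table text flattens to three header cells followed by three cells per row. The helper name `rows_from_cells` and the `M1A` sample row are hypothetical and merely stand in for the scraped text.

```python
def rows_from_cells(cells, width=3, skip_value='Not assigned'):
    """Group a flat list of cell strings into a dict of columns.

    The first `width` cells are taken as headers; rows whose second
    cell equals `skip_value` are dropped, mirroring html_to_dict.
    """
    headers = cells[:width]
    columns = {header: [] for header in headers}
    # The upper bound stops before a trailing partial row, so slicing never overruns
    for start in range(width, len(cells) - width + 1, width):
        row = cells[start:start + width]
        if row[1] != skip_value:
            for header, value in zip(headers, row):
                columns[header].append(value)
    return columns

# Hand-written stand-in for the text returned by getTable (illustrative only)
sample_text = (
    "Postal Code\nBorough\nNeighbourhood\n"
    "M1A\nNot assigned\nNot assigned\n"
    "M3A\nNorth York\nParkwoods\n"
    "M4A\nNorth York\nVictoria Village\n"
)
cells = [line for line in sample_text.split('\n') if line]
table = rows_from_cells(cells)
print(table['Postal Code'])
```

Unlike the original loop, this variant tolerates a cell count that is not a multiple of three, which may matter if the page layout changes.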

In [3]:
# Here we'll store the final product
table_dict = None
html_table = getTable('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
if html_table is None:
    # Shown when getTable could not
    # fetch or parse the table
    print('Table could not be found')
else:
    table_dict = html_to_dict(html_table)

# Transforms dict to Pandas DataFrame    
table_df = pd.DataFrame(table_dict)
table_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [72]:
table_df.shape

(103, 3)

### Part 2. Getting the coordinates.

In [4]:
coor = pd.read_csv('Geospatial_Coordinates.csv')
coor.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [5]:
data = table_df.join(coor.set_index('Postal Code'), on='Postal Code')
data.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [6]:
data['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

Here we verify that the new DataFrame has the same Postal Codes as the *table_df* DataFrame:

In [7]:
all(data['Postal Code'].unique() == table_df['Postal Code'].unique())

True
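
Because `DataFrame.join` performs a left join by default, the check above holds regardless of what the coordinates file contains. A complementary check, sketched here on toy data, is to look for postal codes that received no coordinates. The frames `left` and `right` are small stand-ins for `table_df` and `coor`, and the code `M9Z` is a made-up example of an unmatched entry.

```python
import pandas as pd

# Toy stand-ins for table_df and coor (M9Z is hypothetical and has no coordinates)
left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Etobicoke']})
right = pd.DataFrame({'Postal Code': ['M4A', 'M3A'],
                      'Latitude': [43.725882, 43.753259],
                      'Longitude': [-79.315572, -79.329656]})

joined = left.join(right.set_index('Postal Code'), on='Postal Code')

# A left join always keeps every left-hand postal code ...
same_codes = list(joined['Postal Code']) == list(left['Postal Code'])
# ... so the more telling check is whether any coordinates are missing
missing = joined.loc[joined['Latitude'].isna(), 'Postal Code'].tolist()
print(same_codes, missing)
```

An empty `missing` list on the real data would confirm that every postal code was matched.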

### Part 3. Getting venue data.

In [27]:
from IPython.display import HTML
from IPython.display import display

# Taken from https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook
tag = HTML('''<script>
code_show=true; 
function code_toggle() {
    if (code_show){
        $('div.cell.code_cell.rendered.selected div.input').hide();
    } else {
        $('div.cell.code_cell.rendered.selected div.input').show();
    }
    code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
Here are the API credentials <a href="javascript:code_toggle()">.</a>.''')
display(tag)

############### Write code below ##################

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

The next function will be the one making the API requests to Foursquare.

In [9]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # Creates the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, 
            lat, lng, radius, 
            LIMIT)
            
        # Makes the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Returns only relevant information for each nearby venue
        for v in results:
            venues_list.append([(
                name, lat, lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name'])])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                             'Borough Latitude', 
                             'Borough Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']

    return nearby_venues
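
The request URL in `getNearbyVenues` is assembled with a long format string. As a minimal sketch of an alternative, the query string can be built from a dictionary with the standard library. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the explore URL from a parameter dict (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 43.753259, -79.329656)
print(url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand.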

In [10]:
toronto_venues = getNearbyVenues(names=data['Borough'],
                                   latitudes=data['Latitude'],
                                   longitudes=data['Longitude']
                                  )
# Prints done when the
# function has finished
print("Done.")

Done.


Now we can see the information provided by the API.

In [11]:
toronto_venues.head()

Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,North York,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,North York,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,North York,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,North York,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
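
`getNearbyVenues` indexes straight into the response (`response`, `groups`, `items`, `venue`, `categories`) and would raise if a venue came back without a category. The sketch below shows one defensive way to read the same layout, exercised on a hand-written sample rather than a live request. The function name `extract_venues` and the second sample venue are hypothetical.

```python
def extract_venues(response_json):
    """Pull (name, lat, lng, category) tuples out of an explore-style response.

    Venues without any category get None instead of raising IndexError.
    """
    groups = response_json.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hand-written sample in the layout getNearbyVenues expects;
# the second venue is made up to show the missing-category case
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}
rows = extract_venues(sample)
print(rows)
```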


### Part 4. Data Analysis.

First of all, we can check the number of times each borough appears in the DataFrame.

In [12]:
# Shows the number of entries
# by Borough in the DataFrame
toronto_venues.groupby('Borough').count()

Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Toronto,104,104,104,104,104,104
Downtown Toronto,1248,1248,1248,1248,1248,1248
East Toronto,119,119,119,119,119,119
East York,79,79,79,79,79,79
Etobicoke,74,74,74,74,74,74
Mississauga,13,13,13,13,13,13
North York,241,241,241,241,241,241
Scarborough,90,90,90,90,90,90
West Toronto,153,153,153,153,153,153
York,20,20,20,20,20,20


We use one-hot encoding because the clustering algorithm can only work with numerical values; averaging the encoded rows per borough then gives the frequency of each venue category.

In [13]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,North York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009615,...,0.0,0.0,0.0,0.0,0.009615,0.0,0.0,0.0,0.0,0.009615
1,Downtown Toronto,0.0,0.000801,0.000801,0.000801,0.000801,0.001603,0.001603,0.000801,0.013622,...,0.002404,0.0,0.011218,0.001603,0.004006,0.0,0.00641,0.0,0.0,0.005609
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02521,...,0.0,0.0,0.0,0.0,0.0,0.0,0.008403,0.0,0.0,0.016807
3,East York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.012658,0.0,0.0,0.0,0.012658
4,Etobicoke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0
5,Mississauga,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North York,0.004149,0.0,0.004149,0.0,0.0,0.0,0.0,0.0,0.008299,...,0.0,0.0,0.0,0.004149,0.008299,0.0,0.0,0.0,0.016598,0.0
7,Scarborough,0.011111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011111,...,0.011111,0.0,0.0,0.0,0.011111,0.0,0.0,0.0,0.0,0.0
8,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.013072,0.0,0.013072,0.0,0.006536,0.0,0.0,0.013072
9,York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0
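
To make the meaning of these numbers concrete, here is a small sketch of the same two steps (one-hot encoding, then a per-borough mean) on a toy table. The venue rows are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy venue list (illustrative values, not API results)
venues = pd.DataFrame({
    'Borough': ['York', 'York', 'York', 'York', 'Etobicoke', 'Etobicoke'],
    'Venue Category': ['Park', 'Park', 'Pizza Place', 'Bank', 'Pizza Place', 'Coffee Shop'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Borough', venues['Borough'])

# The mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Borough').mean()
row_sums = freq.sum(axis=1)
print(freq)
```

Two of York's four toy venues are parks, so its `Park` entry is 0.5, and each row sums to 1 because every venue contributes exactly one category.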


With the help of the Scikit-Learn package we'll run the clustering algorithm.

In [15]:
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Borough')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 0, 4, 3, 2, 0, 2, 1])
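
For readers unfamiliar with what k-means does to the frequency vectors, the following is a from-scratch sketch of the basic iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not scikit-learn's implementation, which by default adds k-means++ initialisation and other refinements. The function name `lloyd_kmeans` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def lloyd_kmeans(points, centers, n_iter=10):
    """Plain k-means iterations with caller-supplied starting centers."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members):  # keep the old center if a cluster ends up empty
                centers[k] = members.mean(axis=0)
    return labels, centers

# Toy frequency vectors (rows: boroughs, columns: two venue categories)
freqs = [[0.50, 0.10], [0.45, 0.15], [0.05, 0.60], [0.10, 0.55]]
labels, centers = lloyd_kmeans(freqs, centers=[freqs[0], freqs[2]])
print(labels)
```

Rows with similar category frequencies end up with the same label, which is the sense in which boroughs in one cluster are "similar".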

In [16]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Café,Park,Pizza Place,Sushi Restaurant,Restaurant,Dessert Shop,Bagel Shop,Italian Restaurant
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
2,East Toronto,Coffee Shop,Greek Restaurant,Brewery,Italian Restaurant,Restaurant,Park,Ice Cream Shop,Pizza Place,American Restaurant,Bakery
3,East York,Coffee Shop,Bank,Intersection,Burger Joint,Sandwich Place,Sporting Goods Shop,Pizza Place,Park,Pharmacy,Indian Restaurant
4,Etobicoke,Pizza Place,Coffee Shop,Sandwich Place,Pharmacy,Grocery Store,Gym,Fast Food Restaurant,Liquor Store,Café,Bakery
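
The column names above rely on a three-item suffix list with a fallback to `th`, which is enough for ten columns. As a hedged alternative, a small helper can produce the suffix for any positive integer. The name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and the rest of the teens) always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Borough'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```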


In [18]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = data

# join data with neighborhoods_venues_sorted to add the cluster label and top venues for each postal code
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place,Restaurant,Park,Bank,Grocery Store,Café
1,M4A,North York,Victoria Village,43.725882,-79.315572,2,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place,Restaurant,Park,Bank,Grocery Store,Café
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,2,Coffee Shop,Clothing Store,Japanese Restaurant,Pizza Place,Sandwich Place,Restaurant,Park,Bank,Grocery Store,Café
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,2,Coffee Shop,Café,Restaurant,Hotel,Japanese Restaurant,Italian Restaurant,Bakery,Park,Gym,Seafood Restaurant




To create a map we first have to locate Toronto. For that we'll use ***geopy's Nominatim*** geocoder.

In [19]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [70]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], 
                                  toronto_merged['Longitude'], 
                                  toronto_merged['Borough'], 
                                  toronto_merged['Cluster Labels']):
    
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

**Since folium maps might not render on GitHub, here is a PNG image of the map**

!['Toronto Clusters Map'](./toronto_map.png)
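
In the map cell, marker colours are looked up with `rainbow[cluster-1]` even though the cluster labels start at 0. This still gives each label its own colour, because index -1 wraps around to the last list entry. The sketch below illustrates that with a placeholder palette; the hex values are made up and are not the actual colormap output.

```python
# Placeholder palette with one entry per cluster (hypothetical hex values)
palette = ['#aa0000', '#00aa00', '#0000aa', '#aaaa00', '#00aaaa']

# Mirrors the lookup in the map cell: label 0 wraps around to the last entry
shifted = {label: palette[label - 1] for label in range(len(palette))}
# A direct lookup gives an equally one-to-one mapping without relying on wrap-around
direct = {label: palette[label] for label in range(len(palette))}

print(shifted[0], direct[0])
```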

### Part 5. Clusters.

Here we can see what the different clusters are made of.

In [29]:
toronto_cluster1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [59]:
toronto_cluster1["Borough"].value_counts().to_frame()

Unnamed: 0,Borough
Scarborough,17
East York,5


In [60]:
toronto_cluster1["1st Most Common Venue"].value_counts().to_frame()

Unnamed: 0,1st Most Common Venue
Bakery,17
Coffee Shop,5


In [30]:
toronto_cluster2 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [52]:
toronto_cluster2["Borough"].value_counts().to_frame()

Unnamed: 0,Borough
York,5


In [61]:
toronto_cluster2["1st Most Common Venue"].value_counts().to_frame()

Unnamed: 0,1st Most Common Venue
Park,5


In [31]:
toronto_cluster3 =toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [53]:
toronto_cluster3["Borough"].value_counts().to_frame()

Unnamed: 0,Borough
North York,24
Downtown Toronto,19
Central Toronto,9
West Toronto,6
East Toronto,5


In [62]:
toronto_cluster3["1st Most Common Venue"].value_counts().to_frame()

Unnamed: 0,1st Most Common Venue
Coffee Shop,57
Café,6


In [32]:
toronto_cluster4 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [54]:
toronto_cluster4["Borough"].value_counts().to_frame()

Unnamed: 0,Borough
Mississauga,1


In [63]:
toronto_cluster4["1st Most Common Venue"].value_counts().to_frame()

Unnamed: 0,1st Most Common Venue
Coffee Shop,1


In [33]:
toronto_cluster5 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

In [55]:
toronto_cluster5["Borough"].value_counts().to_frame()

Unnamed: 0,Borough
Etobicoke,12


In [64]:
toronto_cluster5["1st Most Common Venue"].value_counts().to_frame()

Unnamed: 0,1st Most Common Venue
Pizza Place,12


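It is clear that the most varied cluster is the third one, which covers the greatest number of boroughs (five). The second, fourth and fifth clusters each cover a single borough, and the fourth is the smallest, with only one postal code (Mississauga). We can also see some similarities in the most common venue. Coffee Shop ranks first in the fourth cluster and for most of the third (57 postal codes, against 6 for Café), and it also appears in the first cluster (East York), although Bakery dominates there (Scarborough). The second and fifth clusters each have a specific top venue type that does not repeat elsewhere (Park and Pizza Place, respectively).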
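
The per-cluster cells in this part repeat the same two `value_counts` calls. As a sketch of a more compact approach, the same summary can be produced in one pass with `groupby`. The frame `merged` is a small stand-in for `toronto_merged` with only a few illustrative rows, so its counts do not match the real outputs.

```python
import pandas as pd

# Small stand-in for toronto_merged; rows are illustrative, not the full data
merged = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 1, 4, 4],
    'Borough': ['Scarborough', 'Scarborough', 'East York', 'York', 'Etobicoke', 'Etobicoke'],
    '1st Most Common Venue': ['Bakery', 'Bakery', 'Coffee Shop', 'Park', 'Pizza Place', 'Pizza Place'],
})

# One pass over all clusters instead of one cell per cluster
summary = {
    label: {
        'boroughs': group['Borough'].value_counts().to_dict(),
        'top_venue': group['1st Most Common Venue'].value_counts().to_dict(),
    }
    for label, group in merged.groupby('Cluster Labels')
}
for label, info in summary.items():
    print(label, info)
```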