## Coursera IBM Data Science Capstone Project

By Tamela Maciel, June 2020

This Jupyter notebook completes the capstone project for Coursera's IBM Data Science Professional Certificate.

It uses the Foursquare API, BeautifulSoup for web scraping, and various Python libraries to gather location data for the city of Toronto, then compares and clusters its neighborhoods.

### Import libraries

In [1]:
import pandas as pd  #database wrangling
import numpy as np #linear algebra
import string
from bs4 import BeautifulSoup #webscraping

from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

import requests # library to handle requests

import geocoder # convert a postcode into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### Part 1 - Create Database of Toronto Neighborhoods
Use BeautifulSoup to scrape postcodes, boroughs, and neighborhoods from this Wikipedia table:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Step 1. Request the HTML from the page URL

In [33]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser') # name the parser explicitly to avoid a warning and parser-dependent results

Step 2. Find the first table in the parsed HTML

In [34]:
table = soup.find('table')

Step 3. Iterate through rows of table (denoted by 'tr' tags), and append to 'neighborhood_df' dataframe if the borough is assigned a name

In [35]:
# column names for the resulting dataframe
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# read in data using the find_all() and get_text() functions from Beautiful Soup
rows = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) < 3:
        continue  # skip rows without three 'td' cells, such as the header row
    postcode = columns[0].get_text().rstrip('\n')
    borough = columns[1].get_text().rstrip('\n')
    neighborhood = columns[2].get_text().rstrip('\n')
    if borough != 'Not assigned':
        rows.append({'PostalCode': postcode,
                     'Borough': borough,
                     'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhood_df = pd.DataFrame(rows, columns=column_names)

neighborhood_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [36]:
print("There are {} rows in the 'neighborhood_df' dataframe".format(neighborhood_df.shape[0]))

There are 103 rows in the 'neighborhood_df' dataframe


### Part 2 - Get neighborhood lat, long using geocoder

After a lot of trial and error with geocoder and with geopy.geocoders, I was unable to get latitude and longitude values consistently for all neighborhoods: most postcodes repeatedly return 'None'.

So I will read the coordinates from a CSV file instead.

In [20]:
### OLD CODE
#postal_code='M1B'
## initialize your variable to None
#lat_lng_coords = None
#
## loop until you get the coordinates
##while(lat_lng_coords is None):
#g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#lat_lng_coords = g.latlng
#
#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#print(latitude,longitude)

In [37]:
lat_lng_file=pd.read_csv("Geospatial_Coordinates.csv")

In [38]:
lat_lng_file=lat_lng_file.rename(columns={'Postal Code':'PostalCode'})
lat_lng_file.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [39]:
neighborhood_df=neighborhood_df.merge(lat_lng_file, on="PostalCode")
neighborhood_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
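
The merge above uses pandas' default inner join, so a postcode missing from the coordinates file would be dropped without any message. The row count staying at 103 suggests that did not happen here, but a check along the following lines could make it explicit. This is a minimal sketch on made-up data: `left` and `coords` are hypothetical stand-ins for `neighborhood_df` and `lat_lng_file`, and 'M9Z' is an invented postcode.

```python
import pandas as pd

# Hypothetical stand-ins for neighborhood_df and lat_lng_file
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postcode; indicator=True adds a '_merge' column,
# and validate raises MergeError if a postcode is duplicated on either side
checked = left.merge(coords, on='PostalCode', how='left',
                     indicator=True, validate='one_to_one')

# Postcodes with no coordinates are flagged as 'left_only'
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

If `missing` comes back empty, the inner join and the left join give the same rows.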


In [40]:
print("There are now {} rows and {} columns in the 'neighborhood_df' dataframe".format(neighborhood_df.shape[0],neighborhood_df.shape[1]))

There are now 103 rows and 5 columns in the 'neighborhood_df' dataframe


### Part 3 - Cluster neighborhoods by type of venue using the Foursquare API and k-means

**Step 1 - define client credentials for the Foursquare API**

In [41]:
CLIENT_ID = 'LRZS1Q1R12WGF52OQPOSHLS4CC3NNRZYTSKECAHUTTYW0TMR' # your Foursquare ID
CLIENT_SECRET = 'SYHT1RYKRHP5VV3ARJC01GTXMM0ERHOLZXGPL4TYGOPHS1PB' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LRZS1Q1R12WGF52OQPOSHLS4CC3NNRZYTSKECAHUTTYW0TMR
CLIENT_SECRET:SYHT1RYKRHP5VV3ARJC01GTXMM0ERHOLZXGPL4TYGOPHS1PB
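
Credentials typed into a cell and printed end up in every shared copy of the notebook. One alternative is to read them from environment variables and print only a masked form. The sketch below assumes that approach. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are made up for this example, not something Foursquare requires.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables.

    The variable names used here are assumptions for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None

def mask(value, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return value[:visible] + '*' * (len(value) - visible)

# Example with a fake environment; a real run would call load_foursquare_credentials()
client_id, client_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'ABCD1234', 'FOURSQUARE_CLIENT_SECRET': 'WXYZ5678'})
print('CLIENT_ID: ' + mask(client_id))
```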


**Step 2 - get venues around each distinct postcode.**  

Each postcode can cover more than one neighborhood but has only one set of coordinates, so venue data can only be gathered and clustered per postcode, not per neighborhood.

I include all 103 postcodes in my cluster analysis.

Because the same borough name appears under several postcodes, I distinguish between them by appending the postcode to the borough name, e.g. 'Downtown Toronto - M5G'. These are called 'areas' in the resulting dataframe.
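
The labelling scheme can be illustrated with a few borough and postcode pairs taken from the table in Part 1. This is only a sketch: `pairs`, `borough_counts`, and `areas` are names introduced here, while the notebook itself builds the labels inside `getNearbyVenues`.

```python
from collections import Counter

# A few (borough, postcode) pairs from the table in Part 1
pairs = [('North York', 'M3A'), ('North York', 'M4A'),
         ('Downtown Toronto', 'M5A'), ('Downtown Toronto', 'M7A'),
         ('Etobicoke', 'M9A')]

# Borough names repeat, so they cannot identify a row on their own
borough_counts = Counter(borough for borough, _ in pairs)

# Appending the postcode gives one unique label per row
areas = [borough + ' - ' + postcode for borough, postcode in pairs]

print(borough_counts['North York'])   # 2
print(len(set(areas)) == len(areas))  # True
```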

In [42]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [44]:
# function to get up to 100 venues within 500 metres of each area's latitude and longitude
def getNearbyVenues(boroughs, postcodes, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for borough, postcode, lat, lng in zip(boroughs, postcodes, latitudes, longitudes):
        name=borough+' - '+postcode
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Area', 
                  'Area Latitude', 
                  'Area Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
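
The request URL in the function above is assembled with a long format string. As an alternative sketch, the same query can be built from a dictionary, which keeps each parameter next to its name and escapes the values. The endpoint and parameter names come from the function above. The helper `build_explore_url` and the placeholder credentials are assumptions made for this example.

```python
from urllib.parse import urlencode

# Endpoint taken from the function above
EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from a dict instead of a long format string."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes each value, e.g. the comma in 'll' becomes %2C
    return EXPLORE_URL + '?' + urlencode(params)

# Placeholder credentials, for illustration only
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.753259, -79.329656)
print(url)
```

With the requests library, the same dictionary can be passed as `params=` to `requests.get`, which does the encoding itself.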

In [66]:
raw_venues = getNearbyVenues(boroughs=neighborhood_df['Borough'],
                                   postcodes=neighborhood_df['PostalCode'],
                                   latitudes=neighborhood_df['Latitude'],
                                   longitudes=neighborhood_df['Longitude']
                                  )
raw_venues.head()

North York - M3A
North York - M4A
Downtown Toronto - M5A
North York - M6A
Downtown Toronto - M7A
Etobicoke - M9A
Scarborough - M1B
North York - M3B
East York - M4B
Downtown Toronto - M5B
North York - M6B
Etobicoke - M9B
Scarborough - M1C
North York - M3C
East York - M4C
Downtown Toronto - M5C
York - M6C
Etobicoke - M9C
Scarborough - M1E
East Toronto - M4E
Downtown Toronto - M5E
York - M6E
Scarborough - M1G
East York - M4G
Downtown Toronto - M5G
Downtown Toronto - M6G
Scarborough - M1H
North York - M2H
North York - M3H
East York - M4H
Downtown Toronto - M5H
West Toronto - M6H
Scarborough - M1J
North York - M2J
North York - M3J
East York - M4J
Downtown Toronto - M5J
West Toronto - M6J
Scarborough - M1K
North York - M2K
North York - M3K
East Toronto - M4K
Downtown Toronto - M5K
West Toronto - M6K
Scarborough - M1L
North York - M2L
North York - M3L
East Toronto - M4L
Downtown Toronto - M5L
North York - M6L
North York - M9L
Scarborough - M1M
North York - M2M
North York - M3M
East Toronto - 

Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York - M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,North York - M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,North York - M3A,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,North York - M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,North York - M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


**Step 3 - explore and wrangle venue data so that it's ready to cluster**

In [67]:
print(raw_venues.shape)
raw_venues.head()

(2129, 7)


Unnamed: 0,Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York - M3A,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,North York - M3A,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,North York - M3A,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,North York - M4A,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,North York - M4A,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [68]:
#check how many venues per area
venues_grouped=raw_venues.groupby('Area').count()
venues_grouped.head(20)

Area,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Toronto - M4N,3,3,3,3,3,3
Central Toronto - M4P,8,8,8,8,8,8
Central Toronto - M4R,21,21,21,21,21,21
Central Toronto - M4S,33,33,33,33,33,33
Central Toronto - M4T,4,4,4,4,4,4
Central Toronto - M4V,17,17,17,17,17,17
Central Toronto - M5N,1,1,1,1,1,1
Central Toronto - M5P,5,5,5,5,5,5
Central Toronto - M5R,21,21,21,21,21,21
Downtown Toronto - M4W,4,4,4,4,4,4


The number of venues per area varies widely, from 1 to 100. For the purposes of clustering, we want enough venues to build a profile of each area.

So let's drop all areas that have fewer than 5 venues.
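
The notebook does this in two steps, first listing the areas to drop and then removing them. For comparison, here is a sketch of a one-pass version of the same filter on a made-up miniature of `raw_venues`. The frame `venues` and the constant `MIN_VENUES` are hypothetical names used only in this example.

```python
import pandas as pd

# Hypothetical miniature of raw_venues: area 'A' has 5 venues, area 'B' has 2
venues = pd.DataFrame({'Area': ['A'] * 5 + ['B'] * 2,
                       'Venue': ['v{}'.format(i) for i in range(7)]})

MIN_VENUES = 5  # same threshold as in the text

# Attach each row's per-area venue count, then keep rows that meet the threshold
venue_counts = venues.groupby('Area')['Venue'].transform('count')
kept = venues[venue_counts >= MIN_VENUES].reset_index(drop=True)

print(kept['Area'].unique().tolist())
```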

In [69]:
areas_to_drop=venues_grouped[venues_grouped['Venue']<5].index.tolist()
print(len(areas_to_drop))

35


This gives 35 areas to drop. Remove them from the venue data:

In [72]:
toronto_venues=raw_venues.set_index('Area')
toronto_venues=toronto_venues.drop(areas_to_drop,axis=0)
toronto_venues=toronto_venues.reset_index()
print("There are now {} rows and {} columns in the 'toronto_venues' dataframe".format(toronto_venues.shape[0],toronto_venues.shape[1]))

There are now 2024 rows and 7 columns in the 'toronto_venues' dataframe
