## Part 1: Scrape the Toronto neighborhood data from Wikipedia

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0
import folium
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
# Scrape data from Wikipedia regarding postal codes of Toronto neighborhoods
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').content
soup = BeautifulSoup(html_data, 'html5lib')

In [3]:
# Turn HTML data into dataframe using example code provided in the lab
table_contents=[]
table=soup.find('table')
for row in table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
tor_df=pd.DataFrame(table_contents)
tor_df['Borough']=tor_df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
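
The string handling in the loop above can be easier to check when it lives in a small function. The following is a minimal sketch, assuming each cell's span text has the form `Borough(Neighborhood / Neighborhood)`; `parse_cell` is a hypothetical helper name, not part of the lab code, and unlike the loop it also strips whitespace from the borough.

```python
def parse_cell(postal_text, span_text):
    """Split one table cell into postal code, borough, and neighborhood.

    Assumes span_text looks like 'Borough(Neighborhood / Neighborhood)'.
    """
    # Everything before the first '(' is the borough; the rest lists neighborhoods
    borough, _, rest = span_text.partition('(')
    # Drop closing parentheses and turn ' /' separators into commas
    neighborhood = rest.replace(')', ' ').replace(' /', ',').strip()
    return {
        'PostalCode': postal_text[:3],
        'Borough': borough.strip(),
        'Neighborhood': neighborhood,
    }


# Illustrative input shaped like one cell of the Wikipedia table
example = parse_cell('M5A\n', 'Downtown Toronto(Regent Park / Harbourfront)')
print(example)
```

Inside the scraping loop this could be called as `parse_cell(row.p.text, row.span.text)` for every cell that is not marked 'Not assigned'.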

In [4]:
tor_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [5]:
tor_df.shape

(103, 3)

## Part 2: Getting latitude & longitude coordinates

In [6]:
# Download the CSV file from the assignment site and save it to a dataframe
geo_df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv")
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Merge the dataframes together with a database-style join on the Postal Code column
nb_df = pd.merge(tor_df, geo_df, left_on="PostalCode", right_on="Postal Code")
nb_df.drop(columns=["Postal Code"], inplace=True)
nb_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [8]:
nb_df.shape

(103, 5)
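
The default inner join silently drops any postal code that has no coordinates, so the matching shapes above are the only evidence that nothing was lost. As a hedged sketch on toy data (the frames below are made-up stand-ins for `tor_df` and `geo_df`, and 'M9Z' is an invented code), a left join with `indicator=True` makes unmatched rows visible, and `validate='one_to_one'` raises an error if a postal code is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for tor_df and geo_df; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# Keep every row of the left frame and record where each row came from
checked = pd.merge(left, right, left_on='PostalCode', right_on='Postal Code',
                   how='left', indicator=True, validate='one_to_one')

# Rows tagged 'left_only' found no coordinates in the right frame
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```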

## Part 3: Clustering the neighborhoods

In [9]:
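import os

# Read the Foursquare credentials from environment variables instead of hard-coding them
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumed convention)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret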
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [10]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
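
The function above indexes straight into the JSON response, so one neighborhood with an empty or malformed response stops the whole loop. The sketch below separates request parameters from response parsing and tolerates missing groups or categories. It assumes the response shape used above (`response` → `groups` → `items`); `build_explore_params` and `extract_venue_rows` are hypothetical helper names, and the sample payload is invented for illustration.

```python
def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=500, limit=100):
    """Return the query parameters for the venues/explore endpoint as a dict."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }


def extract_venue_rows(name, lat, lng, payload):
    """Turn one parsed JSON response into a list of venue tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Some venues may have no category; record None instead of failing
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Invented payload that mimics the structure the function above relies on
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'KFC',
               'location': {'lat': 43.754387, 'lng': -79.333021},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Parkwoods', 43.753259, -79.329656, sample_payload)
print(rows)
```

With these helpers, the request itself could be made as `requests.get('https://api.foursquare.com/v2/venues/explore', params=params, timeout=10)`, which lets `requests` handle URL encoding and avoids waiting indefinitely on a stalled connection.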

In [11]:
# Get list of venues in Toronto neighborhoods
toronto_venues = getNearbyVenues(names=nb_df['Neighborhood'], latitudes=nb_df['Latitude'], longitudes=nb_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East
The Danforth

In [12]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [13]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 274 unique categories.


To cluster the neighborhoods with k-means by how frequently each venue category occurs, we first have to one-hot encode the "Venue Category" column.

In [14]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Fill neighborhood dummy values with actual values from venue table and move it to first position
toronto_onehot["Neighborhood"] = toronto_venues["Neighborhood"]
nf_column = toronto_onehot.pop("Neighborhood")

toronto_onehot.insert(0, "Neighborhood", nf_column)
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# Transform table: group by neighborhood and take the mean of each one-hot column, i.e. the frequency of occurrence of each venue category
toronto_grouped = toronto_onehot.groupby("Neighborhood").mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
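
To see why the mean of the one-hot columns is a frequency, here is a small worked example on invented data (neighborhoods 'A' and 'B' and their venues are made up). Each venue contributes a single 1 in its category column, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category, and each row sums to 1.

```python
import pandas as pd

# Invented venues: neighborhood 'A' has four venues, 'B' has two
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank', 'Park', 'Café'],
})

# One column per category, with 1.0 where the venue belongs to that category
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of each 0/1 column is the fraction of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```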


In [16]:
# Build dataframe of most popular venue categories for when clustering has been done
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Breakfast Spot,Lounge,Skating Rink,Mobile Phone Shop,Moroccan Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Accessories Store
1,"Alderwood, Long Branch",Pizza Place,Playground,Pharmacy,Sandwich Place,Gym,Pub,Coffee Shop,Mobile Phone Shop,Motel,Moroccan Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Bridal Shop,Mobile Phone Shop,Pizza Place,Shopping Mall,Fried Chicken Joint,Supermarket,Sushi Restaurant,Sandwich Place
3,Bayview Village,Japanese Restaurant,Bank,Café,Chinese Restaurant,Accessories Store,Monument / Landmark,Movie Theater,Motel,Moroccan Restaurant,Molecular Gastronomy Restaurant
4,"Bedford Park, Lawrence Manor East",Sandwich Place,Coffee Shop,Italian Restaurant,Pharmacy,Pub,Pizza Place,Butcher,Café,Liquor Store,Sushi Restaurant
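
Because the ranking always fills ten slots, a neighborhood with fewer than ten distinct categories may have its trailing slots filled by categories whose frequency is zero. The following is a minimal sketch of a variant that keeps only categories that actually occur; `top_categories` is a hypothetical helper, and the frequencies are invented.

```python
import pandas as pd


def top_categories(freq_row, n):
    """Return up to n category names with nonzero frequency, highest first."""
    nonzero = freq_row[freq_row > 0]
    return nonzero.sort_values(ascending=False).index[:n].tolist()


# Invented frequency row for one neighborhood (category -> share of venues)
row = pd.Series({'Bank': 0.3, 'Café': 0.2, 'Park': 0.5, 'Zoo': 0.0})
top = top_categories(row, 5)
print(top)
```

Applied to a row of `toronto_grouped`, the call would be `top_categories(toronto_grouped.iloc[ind, 1:].astype(float), num_top_venues)`; the result can be shorter than `num_top_venues`, so it would need padding before being written into a fixed-width table.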


In [17]:
# Run k-means clustering on the neighborhoods
k = 5

toronto_grouped_clustering = toronto_grouped.drop(columns="Neighborhood")

kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_grouped_clustering)

# Insert cluster labels into the sorted venue table
neighborhoods_venues_sorted.insert(0, "Cluster Label", kmeans.labels_)
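
Before interpreting the clusters, it can help to check how many neighborhoods fall into each one, since a cluster with a single member says little about neighborhoods in general. The sketch below counts members per label; `cluster_sizes` is a hypothetical helper and the label list is invented, not the notebook's actual result.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return {cluster label: number of members}, ordered by label."""
    counts = Counter(int(label) for label in labels)
    return dict(sorted(counts.items()))


# Invented labels standing in for kmeans.labels_
sizes = cluster_sizes([1, 1, 0, 1, 2, 1, 4, 3, 1])
print(sizes)
```

In the notebook, the call would be `cluster_sizes(kmeans.labels_)`.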

In [18]:
# Merge cluster labels back into the original table with latitude/longitude data
toronto_merged = nb_df

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index("Neighborhood"), on="Neighborhood")

# Drop rows for neighborhoods that didn't have any venues returned from Foursquare
toronto_merged.dropna(inplace=True)

# Convert cluster labels to int for plotting
toronto_merged = toronto_merged.astype({"Cluster Label": "int"})

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2,Fast Food Restaurant,Food & Drink Shop,Park,Miscellaneous Shop,Moroccan Restaurant,Monument / Landmark,Molecular Gastronomy Restaurant,Modern European Restaurant,Mobile Phone Shop,Mexican Restaurant
1,M4A,North York,Victoria Village,43.725882,-79.315572,1,Pizza Place,Portuguese Restaurant,Hockey Arena,Coffee Shop,French Restaurant,Accessories Store,Modern European Restaurant,Motel,Moroccan Restaurant,Monument / Landmark
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1,Coffee Shop,Pub,Bakery,Café,Park,Theater,Restaurant,Sushi Restaurant,Breakfast Spot,Spa
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1,Clothing Store,Accessories Store,Vietnamese Restaurant,Miscellaneous Shop,Furniture / Home Store,Coffee Shop,Boutique,Performing Arts Venue,Park,Mediterranean Restaurant
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,1,Coffee Shop,Café,Diner,Yoga Studio,Theater,Mexican Restaurant,Smoothie Shop,Spa,Fried Chicken Joint,Sushi Restaurant


In [19]:
address = 'Toronto, ON, CA'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [20]:
# Re-use mapping code from Manhattan lab to plot the neighborhoods
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Determining the characteristics of each cluster

In this last section, I take the most frequent value in each cluster's 1st, 2nd, and 3rd most common venue columns to get a general idea of what kind of neighborhoods the cluster contains.

In [21]:
# Define a helper function that will print the most frequent 1st, 2nd, and 3rd most common venues in the cluster
def top3(df):
    common_venues = []
    for indicator in ["1st", "2nd", "3rd"]:
        common_venues.append(df[f"{indicator} Most Common Venue"].value_counts().idxmax())
        
    return common_venues

# Print top 3 venue types for each of the 5 clusters
for clst_idx in range(0,5):
    clst_df = toronto_merged.loc[toronto_merged['Cluster Label'] == clst_idx, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]
    print(f"Cluster {clst_idx} Top Venue Types: {top3(clst_df)}")

Cluster 0 Top Venue Types: ['Park', 'Accessories Store', 'Miscellaneous Shop']
Cluster 1 Top Venue Types: ['Coffee Shop', 'Coffee Shop', 'Café']
Cluster 2 Top Venue Types: ['Convenience Store', 'Park', 'Park']
Cluster 3 Top Venue Types: ['Construction & Landscaping', 'Accessories Store', 'Mobile Phone Shop']
Cluster 4 Top Venue Types: ['Playground', 'Mobile Phone Shop', 'Movie Theater']
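
The same per-cluster summary can also be computed in one step with a group-by. This is a hedged sketch on invented data (the toy table below is not the notebook's output, and `most_frequent` is a hypothetical helper); it returns one row per cluster, which is easier to compare side by side than printed lists.

```python
import pandas as pd

# Invented stand-in for toronto_merged with two clusters
toy = pd.DataFrame({
    'Cluster Label': [0, 0, 0, 1, 1],
    '1st Most Common Venue': ['Park', 'Park', 'Bank', 'Coffee Shop', 'Coffee Shop'],
    '2nd Most Common Venue': ['Bank', 'Café', 'Café', 'Café', 'Café'],
})


def most_frequent(values):
    """Return the value that occurs most often in a column."""
    return values.value_counts().idxmax()


venue_columns = ['1st Most Common Venue', '2nd Most Common Venue']
summary = toy.groupby('Cluster Label')[venue_columns].agg(most_frequent)
print(summary)
```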


Based on the information above, here are the observations I have about each of the 5 clusters.

### Cluster 0:

Parks are the most frequent top venue in this cluster, followed by accessories stores, so I would assume that the parks in these neighborhoods are geared more toward hiking than picnicking.


### Cluster 1:

Venues in these neighborhoods are dominated by coffee shops and cafes, which points to neighborhoods that primarily serve working professionals.


### Cluster 2: 

I would think that neighborhoods in this cluster primarily cater to parkgoers, who are more likely to make a quick stop at a convenience store to pick up something for a picnic than to go to a sit-down restaurant.


### Cluster 3:

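The top venue types for this cluster are Construction & Landscaping, Accessories Store, and Mobile Phone Shop, which suggests neighborhoods oriented around trade services and small retail rather than dining or recreation.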


### Cluster 4:

Neighborhoods in this cluster are focused on serving the families that come to the local playgrounds.