# Rob's Capstone Project

This notebook will contain my work for the [capstone project](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome) for the IBM Data Science Specialization certificate.
### Week 5

**Name**: Robert Barrimond

**Date**: May 20, 2021

**REVIEWER PLEASE NOTE**
I do _not_ comment my code with Markdown. As an SRE (Site Reliability Engineer), I do what application developers should do: document code _in the code_ and, wherever possible, through the code itself. That said, SREs are also called to be data scientists, so I use Markdown to "tell the story" as that first overview course taught me so many months ago. I hope this assignment is easy to follow and grade!

---

## Problem Statement
I've decided to see whether it's worth opening a Cambodian restaurant somewhere in Toronto. I noticed from the previous assignments that the city is quite cosmopolitan and would welcome such a restaurant. The real problem is _where_ to locate it. My strategy will be to narrow the list of venues to Asian restaurants, get premium data on just those venues, e.g. number of likes and rating, and use that to produce better clusters that make my decision easier.


In [159]:
#
# Import the necessary modules
#

# Data analysis and transformation
import pandas as pd
import numpy as np

# REST API access
import requests

# File access
import os
from os import path
import pickle

# Geocoders
import geopy
from geopy.geocoders import Nominatim

# Regex
import re

# Progress bars
from tqdm import tqdm

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Mapping
import folium


## Retrieve and Clean Venue Data

Because it took [some work](https://github.com/rbarrimond/Coursera_Capstone/blob/5afe9c180839529c96ee71c8a5fae69746b9f4c3/toronto-kmeans-clustering.ipynb) to build a clean dataframe of FSAs (forward sortation areas, the first three characters of a Canadian postal code) from Wikipedia, I'll omit that work here and simply read the pickle from disk. Next I use the Foursquare API `search` endpoint to get all the nearby venues. Once I retrieve these, I'll sift out all the restaurants and flag Asian restaurants as a feature to use in my clustering analysis. This is an improvement over what was done in Week 4: I learned a good bit about getting information from the API up front and using it to correct the FSAs. This time I'm going to clean *all* the FSAs for *all* venues, not just restaurants. Then I'll enrich just the Asian restaurants for more detailed analysis.

In [2]:
# Load the FSA data from previous work
fsa_df = pd.read_pickle('fsa_df.pkl')
fsa_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [3]:
# FourSquare credentials
CLIENT_ID = '0II4MXQK5GVKQA3YIZRXT3D0KWBAKEH2BCCYRWIK4H0DS5XH' # your Foursquare ID
CLIENT_SECRET = 'ZBOCOGUCP2AXOQNAFSPX05IAXWAPBNUBC2FTAGYJV4DDS3AA' # your Foursquare Secret
ACCESS_TOKEN = 'ESEQDUIWNVRAS11OKDDXMICGNUPXLZCHPVHCZ53OTT2LQWBS' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [4]:
# This function takes a sequence of names, lats and longs and produces a dataframe of nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    nearby_venues = pd.DataFrame()

    with tqdm(total=len(names)) as pbar:
        for name, lat, lng in zip(names, latitudes, longitudes):
                
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/search'
            payload = {
                'client_id': CLIENT_ID,
                'client_secret': CLIENT_SECRET,
                'v': VERSION,
                'll': '{},{}'.format(lat,lng),
                'radius': radius,
                'limit': LIMIT
            }
        
            # make the GET request, raise exception if an error
            r = requests.get(url, params=payload)
            r.raise_for_status()
            
            # create dataframe, return only relevant information for each nearby venue
            results = pd.json_normalize(r.json()["response"]['venues'])
            results.rename(columns={
                                    'id': 'Venue ID',
                                    'name': 'Venue', 
                                    'location.lat': 'Venue Latitude', 
                                    'location.lng': 'Venue Longitude',
                                    'location.address': 'Venue Address',
                                    'location.postalCode': 'Venue Postal Code'
                                }, inplace=True)
            results['Venue Category'] = results['categories'].loc[ results['categories'].notna() ].apply(lambda x: x[0]['name'] if len(x) > 0 else None)
            results['Neighborhood'] = name
            results['Neighborhood Latitude'] = lat
            results['Neighborhood Longitude'] = lng

            columns = ['Neighborhood', 
                        'Neighborhood Latitude', 
                        'Neighborhood Longitude', 
                        'Venue ID',
                        'Venue', 
                        'Venue Latitude', 
                        'Venue Longitude',
                        'Venue Address',
                        'Venue Postal Code',
                        'Venue Category']

            venues_list.append(results[columns])
            pbar.update()
            
    # Concatenate once at the end; DataFrame.append was removed in pandas 2.0
    return pd.concat(venues_list, ignore_index=True) if venues_list else nearby_venues

In [5]:
# Retrieve all the venues in Toronto
toronto_venues = getNearbyVenues(fsa_df['PostalCode'], fsa_df['Latitude'], fsa_df['Longitude'])

# Quick cleanup
toronto_venues.drop_duplicates(subset=['Venue ID'], inplace=True, ignore_index=True)
toronto_venues.dropna(axis='index', subset=['Venue Category'], inplace=True)
toronto_venues.reset_index(drop=True, inplace=True)
toronto_venues['Venue Postal Code'] = toronto_venues['Venue Postal Code'].str.upper()
toronto_venues

100%|██████████| 103/103 [00:24<00:00,  4.25it/s]


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category
0,M3A,43.753259,-79.329656,4e8d9dcdd5fbbbb6b3003c7b,Brookbanks Park,43.751976,-79.332140,Toronto,,Park
1,M3A,43.753259,-79.329656,4f3a69f9e4b024185be5a99b,17 Brookbanks Drive,43.752266,-79.332322,15 Brookbanks Dr.,M3A 2S9,Residential Building (Apartment / Condo)
2,M3A,43.753259,-79.329656,4dcc586845dd853165f01864,Tailor Made,43.741513,-79.319707,,,Laundry Service
3,M3A,43.753259,-79.329656,4b5ce8b2f964a520654a29e3,Shoppers Drug Mart,43.754171,-79.358057,1859 Leslie St,M3B 2M1,Pharmacy
4,M3A,43.753259,-79.329656,5e111e7e9316a70007fb9653,Subway,43.760334,-79.326906,"1277 York Mills Road, Unit F1-2, Bldg F",M3A 1Z5,Sandwich Place
...,...,...,...,...,...,...,...,...,...,...
8701,M8Z,43.628841,-79.520999,5a201e8c9411f219b8c6806a,Reel Espresso Bar,43.629726,-79.528760,777 Kipling Ave,M8Z 5Z4,Coffee Shop
8702,M8Z,43.628841,-79.520999,5e6f65d3a441c20008b7b3fe,Landscape Coffee Roaster,43.633419,-79.522599,195 Norseman St,M8Z 0E9,Coffee Shop
8703,M8Z,43.628841,-79.520999,5a580180da5e5645319e258a,Kerry's Place Autism Services,43.628693,-79.518080,,M8Z 2G6,Social Club
8704,M8Z,43.628841,-79.520999,4b4b2c24f964a520ad9326e3,Esso,43.623736,-79.515545,1000 The Queensway,M8Z 1P7,Gas Station


As we can see, there are some mismatches in the data. The FSAs that we initially set as `Neighborhood` don't always match the official FSA in `Venue Postal Code`. The good news is that most of them are pretty close, which bodes well for adjusting the `Neighborhood` column to reflect the "true" FSA. The first thing to do is clean the column and adjust the known FSAs.
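To make the idea concrete before running it on the full dataset, here is a minimal sketch on a made-up three-row frame. The `toy` dataframe and its values are hypothetical; only the column names and the `^(\w{3})` pattern come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: three venues that were all queried from FSA M3A
toy = pd.DataFrame({
    'Neighborhood': ['M3A', 'M3A', 'M3A'],
    'Venue Postal Code': ['M3A 2S9', 'M3B 2M1', None],
})

# The FSA is the first three characters of the postal code
venue_fsa = toy['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False)

# A venue is mismatched when it has a postal code whose FSA differs from Neighborhood
has_code = toy['Venue Postal Code'].notna()
mismatch = has_code & (venue_fsa != toy['Neighborhood'])

# Trust the venue's own postal code over the FSA we queried from
toy.loc[mismatch, 'Neighborhood'] = venue_fsa[mismatch]
print(toy)
```

Rows without a postal code are left untouched here; they need a different source for the FSA.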

In [15]:
# Clean up invalid values for the FSA
toronto_venues.loc[ ~toronto_venues['Venue Postal Code'].str.fullmatch(r'(\w{3}(?:\s{1}\w{3})?).*$', na=False), 'Venue Postal Code' ] = np.nan

# Create masks to filter the dataframe
postal_code_mask = pd.notna(toronto_venues['Venue Postal Code'])
postal_code_match_mask = (toronto_venues['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == toronto_venues['Neighborhood'])
address_mask = pd.notna(toronto_venues['Venue Address'])

# Venues with mismatched postal codes
toronto_venues[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[~postal_code_match_mask]

Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
0,Brookbanks Park,M3A,,Toronto
2,Tailor Made,M3A,,
3,Shoppers Drug Mart,M3A,M3B 2M1,1859 Leslie St
6,Pheasant Run Golf Course,M3A,,
7,Joey,M3A,,
...,...,...,...,...
8696,vinces shop of un-rektage,M8Z,,
8697,Hemisphere Freight & Brokerage,M8Z,,
8698,QBC,M8Z,,950 Islington Avenue
8700,TTC Stop #2726,M8Z,,Islington


In [16]:
# Adjust Neighborhood to known FSA
mask = postal_code_mask & ~postal_code_match_mask
toronto_venues.loc[ mask, 'Neighborhood'] = toronto_venues.loc[mask, 'Venue Postal Code'].str.extract(r'^(\w{3})', expand=False)

# Reset masks
postal_code_mask = pd.notna(toronto_venues['Venue Postal Code'])
postal_code_match_mask = (toronto_venues['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == toronto_venues['Neighborhood'])
address_mask = pd.notna(toronto_venues['Venue Address'])

# Check results
toronto_venues[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[~postal_code_match_mask]

Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
0,Brookbanks Park,M3A,,Toronto
2,Tailor Made,M3A,,
6,Pheasant Run Golf Course,M3A,,
7,Joey,M3A,,
12,Mrs. Claus' Sweatshop,M3A,,Cassandra Blvd.
...,...,...,...,...
8695,Valassis,M8Z,,
8696,vinces shop of un-rektage,M8Z,,
8697,Hemisphere Freight & Brokerage,M8Z,,
8698,QBC,M8Z,,950 Islington Avenue


Next we use Nominatim to do a reverse lookup of each venue that has no postal code set and update its postal code from the result.
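The extraction step can be tried offline first. The sketch below applies the same regular expression used in the next cell to a hand-written address string; the `sample` address and the `extract_postal_code` helper are hypothetical and only imitate the comma-separated shape of a Nominatim address.

```python
import re

# Same pattern as in the lookup cell: capture the postal code that follows "Ontario, "
POSTAL_CODE = re.compile(r"Ontario, (\w{3}(?:\s{1}\w{3})?).*$")

def extract_postal_code(address):
    """Return the postal code found after 'Ontario, ', or None if the pattern does not match."""
    m = POSTAL_CODE.search(address)
    return m.group(1) if m else None

# Hypothetical address string, not real Nominatim output
sample = "Brookbanks Park, North York, Toronto, Ontario, M3A 2S9, Canada"

postal_code = extract_postal_code(sample)
fsa = postal_code[0:3]
print(postal_code, fsa)
```

One caveat with this pattern: `\w{3}` accepts any three word characters, so an address that ends in "Ontario, Canada" with no postal code would yield `Can`. Results are worth a spot check.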

In [27]:
# Find remaining venues that need to be adjusted
postal_code_mask = pd.notna(toronto_venues['Venue Postal Code'])
postal_code_match_mask = (toronto_venues['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == toronto_venues['Neighborhood'])
address_mask = pd.notna(toronto_venues['Venue Address'])
mask = ~postal_code_match_mask & ~postal_code_mask

# Do a reverse geocode lookup and extract postal code from address found
p = re.compile(r"Ontario, (\w{3}(?:\s{1}\w{3})?).*$")
geolocator = Nominatim(user_agent="robs_ba_explorer")
for index in tqdm(toronto_venues[mask].index):
    location = geolocator.reverse("{}, {}".format(toronto_venues.loc[index, 'Venue Latitude'], toronto_venues.loc[index, 'Venue Longitude']))
    # Skip venues where the lookup fails or the address has no recognizable postal code
    m = p.search(location.address) if location is not None else None
    if m is None:
        continue
    toronto_venues.loc[index, 'Venue Postal Code'] = m.group(1)
    toronto_venues.loc[index, 'Neighborhood'] = m.group(1)[0:3]
toronto_venues[mask]

100%|██████████| 2739/2739 [22:49<00:00,  2.00it/s]


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category
2,M3A,43.753259,-79.329656,4dcc586845dd853165f01864,Tailor Made,43.741513,-79.319707,,M3A 1C6,Laundry Service
6,M3A,43.753259,-79.329656,4c4c83c646240f47898fe7f4,Pheasant Run Golf Course,43.758386,-79.337191,,M3A 3L6,Golf Course
7,M3A,43.753259,-79.329656,4bda3d363904a59320d5459e,Joey,43.753441,-79.321640,,M3A 2M8,Burger Joint
19,M3A,43.753259,-79.329656,5c141c8f396de0002cb02757,TTC Bus 995 York Mills Express,43.760417,-79.328885,,M3A 1Z5,Bus Line
23,M3A,43.753259,-79.329656,4eda7f23722e1da30263657a,Broadlands Skating Rink,43.746689,-79.322678,,M3A 2P5,Skating Rink
...,...,...,...,...,...,...,...,...,...,...
8693,M8Z,43.628841,-79.520999,4de44e4818385df2b0518036,Childrens Aid,43.630245,-79.515001,Chartwell Avenue,M8Z 4G6,Coworking Space
8694,M8Z,43.628841,-79.520999,5001a0c7e4b0946791d18e46,Coin Op Kar Wash,43.626837,-79.527294,,M8Z 2G9,Building
8695,M8Z,43.628841,-79.520999,4e85b87a0aafb44008406499,Valassis,43.627049,-79.519157,,M8Z 2G6,Coworking Space
8696,M8Z,43.628841,-79.520999,570951a9498e4f01dbbe556f,vinces shop of un-rektage,43.627149,-79.527908,,M8Z 2G9,Medical Center


In [29]:
# Fix the lat, long of the Neighborhoods
pd.reset_option("max_rows")
fix_lat_long = fsa_df.set_index('PostalCode')[ ['Latitude', 'Longitude'] ]
toronto_venues = toronto_venues.join(fix_lat_long, on='Neighborhood')
toronto_venues = toronto_venues.drop(columns=[ 'Neighborhood Latitude', 'Neighborhood Longitude'])
toronto_venues = toronto_venues.rename(columns={
    'Latitude': 'Neighborhood Latitude',
    'Longitude': 'Neighborhood Longitude'
})
cols = toronto_venues.columns.to_list()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
toronto_venues = toronto_venues[cols]
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category
0,M3A,43.753259,-79.329656,4e8d9dcdd5fbbbb6b3003c7b,Brookbanks Park,43.751976,-79.332140,Toronto,M5H,Park
1,M3A,43.753259,-79.329656,4f3a69f9e4b024185be5a99b,17 Brookbanks Drive,43.752266,-79.332322,15 Brookbanks Dr.,M3A 2S9,Residential Building (Apartment / Condo)
2,M3A,43.753259,-79.329656,4dcc586845dd853165f01864,Tailor Made,43.741513,-79.319707,,M3A 1C6,Laundry Service
3,M3B,43.745906,-79.352188,4b5ce8b2f964a520654a29e3,Shoppers Drug Mart,43.754171,-79.358057,1859 Leslie St,M3B 2M1,Pharmacy
4,M3A,43.753259,-79.329656,5e111e7e9316a70007fb9653,Subway,43.760334,-79.326906,"1277 York Mills Road, Unit F1-2, Bldg F",M3A 1Z5,Sandwich Place
...,...,...,...,...,...,...,...,...,...,...
8701,M8Z,43.628841,-79.520999,5a201e8c9411f219b8c6806a,Reel Espresso Bar,43.629726,-79.528760,777 Kipling Ave,M8Z 5Z4,Coffee Shop
8702,M8Z,43.628841,-79.520999,5e6f65d3a441c20008b7b3fe,Landscape Coffee Roaster,43.633419,-79.522599,195 Norseman St,M8Z 0E9,Coffee Shop
8703,M8Z,43.628841,-79.520999,5a580180da5e5645319e258a,Kerry's Place Autism Services,43.628693,-79.518080,,M8Z 2G6,Social Club
8704,M8Z,43.628841,-79.520999,4b4b2c24f964a520ad9326e3,Esso,43.623736,-79.515545,1000 The Queensway,M8Z 1P7,Gas Station


## Tag Asian Restaurants as a Feature Using the Foursquare API

In [162]:
# Explore the types of restaurants
restaurants = toronto_venues.loc[ toronto_venues['Venue Category'].str.contains('restaurant', case=False, regex=True) ].copy()
restaurants.reset_index(drop=True, inplace=True)
types = restaurants['Venue Category'].str.replace("Restaurant", "").str.strip().unique()
sorted(types)

['',
 'Afghan',
 'African',
 'American',
 'Argentinian',
 'Asian',
 'Bangladeshi',
 'Belgian',
 'Brazilian',
 'Burmese',
 'Cajun / Creole',
 'Cambodian',
 'Cantonese',
 'Caribbean',
 'Chinese',
 'Comfort Food',
 'Cuban',
 'Dim Sum',
 'Dumpling',
 'Eastern European',
 'English',
 'Ethiopian',
 'Falafel',
 'Fast Food',
 'Filipino',
 'French',
 'German',
 'Gluten-free',
 'Greek',
 'Hakka',
 'Halal',
 'Hungarian',
 'Indian',
 'Italian',
 'Japanese',
 'Jewish',
 'Korean',
 'Korean BBQ',
 'Latin American',
 'Mediterranean',
 'Mexican',
 'Middle Eastern',
 'Modern European',
 'Moroccan',
 'New American',
 'North Indian',
 'Pakistani',
 'Peking Duck',
 'Persian',
 'Portuguese',
 'Ramen',
 'Seafood',
 'Shanghai',
 'South Indian',
 'Spanish',
 'Sri Lankan',
 'Sushi',
 'Swiss',
 'Szechuan',
 'Tapas',
 'Thai',
 'Theme',
 'Tibetan',
 'Turkish',
 'Vegetarian / Vegan',
 'Vietnamese']

In [163]:
# Set the Asian categories based on the above and create a cleaned restaurants dataframe that calls out asian restaurants
asian_categories = [ 'Asian', 'Cambodian', 'Cantonese', 'Chinese', 'Dim Sum', 'Dumpling', 'Filipino', 'Hakka', 'Japanese', 'Korean', 'Korean BBQ', 'Peking Duck', 'Ramen', 'Shanghai', 'Sushi', 'Szechuan', 'Thai', 'Tibetan', 'Vietnamese']
restaurants['Asian'] = restaurants['Venue Category'].str.contains("|".join(asian_categories), case=False, regex=True)

In [164]:
# How many Asian restaurants
restaurants.groupby(by=['Asian']).count()['Venue ID']

Asian
False    544
True     264
Name: Venue ID, dtype: int64

### Pull Full Data from Foursquare for Each Asian Restaurant
NOTE: The `/venues/{id}` endpoint is a premium API, so I cache the results to a file and only call Foursquare when the cache is missing.
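The cache-or-fetch pattern can be sketched as a small helper. `load_or_fetch`, `fake_fetch`, and the demo file name are hypothetical and not part of this notebook; the stand-in fetch function takes the place of the real API calls so the sketch runs without credentials.

```python
import os
import pickle
import tempfile

def load_or_fetch(cache_path, fetch):
    """Return the pickled data at cache_path if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    data = fetch()
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data

# Stand-in for the premium API call; records how often it is invoked
calls = []
def fake_fetch():
    calls.append(1)
    return {'venue-1': {'rating': 8.1}}

cache_file = os.path.join(tempfile.mkdtemp(), 'demo_cache.pkl')
first = load_or_fetch(cache_file, fake_fetch)   # cache miss: calls fake_fetch
second = load_or_fetch(cache_file, fake_fetch)  # cache hit: reads the pickle
print(len(calls))
```

Deleting the cache file forces a fresh pull, which is the only way to pick up new data with this approach.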


In [165]:
# Read in venue data either from cache or from Foursquare
asian_venue_data = {}
if path.exists('asian_venue_data.pkl'):
    with open('asian_venue_data.pkl', 'rb') as f:
        asian_venue_data = pickle.load(f)
else:
    payload = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION
        }

    for venue_id in tqdm(restaurants.loc[ restaurants['Asian'] == True, 'Venue ID']):
        try:
            url = 'https://api.foursquare.com/v2/venues/{}'.format(venue_id)
            r = requests.get(url, params=payload)
            r.raise_for_status()
            asian_venue_data[venue_id] = r.json()['response']
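        except (requests.RequestException, KeyError):
            # Skip venues whose request fails or returns no 'response'; they are left out of the cache
            continue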
       
    with open('asian_venue_data.pkl', 'wb') as f:
        pickle.dump(asian_venue_data, f, pickle.HIGHEST_PROTOCOL)


In [166]:
# Create a dataframe for the venue data
# Normalize each cached response and concatenate once (DataFrame.append was removed in pandas 2.0)
asian_venue_data_df = pd.concat(
    [pd.json_normalize(data) for data in tqdm(asian_venue_data.values())],
    ignore_index=True
)
asian_venue_data_df.set_index('venue.id', inplace=True)
asian_venue_data_df

100%|██████████| 264/264 [00:05<00:00, 44.05it/s]


venue.id,venue.name,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,venue.location.cc,venue.location.country,venue.location.formattedAddress,venue.canonicalUrl,venue.categories,venue.verified,...,venue.page.user.type,venue.page.user.tips.count,venue.page.user.lists.groups,venue.page.user.bio,venue.location.neighborhood,venue.venuePage.id,venue.storeId,venue.page.user.venue.id,venue.parent.location.neighborhood,venue.page.pageInfo.description
4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,"[{'label': 'display', 'lat': 43.65414810180664...",CA,Canada,[Canada],https://foursquare.com/v/huayu-kitchen/4f73a47...,"[{'id': '4bf58dd8d48988d145941735', 'name': 'C...",False,...,,,,,,,,,,
5ab3d9f875a6ea3a7ddc4d2b,Thai Express,43.661630,-79.387340,"[{'label': 'display', 'lat': 43.66163, 'lng': ...",CA,Canada,"[76 Grenville St, Toronto ON M5S 1B2, Canada]",https://foursquare.com/v/thai-express/5ab3d9f8...,"[{'id': '4bf58dd8d48988d149941735', 'name': 'T...",False,...,,,,,,,,,,
59a86be58d1070397a5101be,Sushi Shop,43.661620,-79.387636,"[{'label': 'display', 'lat': 43.66162, 'lng': ...",CA,Canada,"[76 Grenville St, Woman's College Hospital, To...",https://foursquare.com/v/sushi-shop/59a86be58d...,"[{'id': '4bf58dd8d48988d1d2941735', 'name': 'S...",True,...,,,,,,,,,,
4c61c478edd320a1835bab29,Bella's Lechon,43.801291,-79.198378,"[{'label': 'display', 'lat': 43.80129149338062...",CA,Canada,"[1139 Morningside Ave, Unit 23, Toronto ON M1B...",https://foursquare.com/v/bellas-lechon/4c61c47...,"[{'id': '4eb1bd1c3b7b55596b4a748f', 'name': 'F...",False,...,,,,,,,,,,
4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,43.800982,-79.200233,"[{'label': 'display', 'lat': 43.80098159718747...",CA,Canada,"[3-1158 Morningside Ave (Sheppard Ave), Toront...",https://foursquare.com/v/charleys-exotic-cuisi...,"[{'id': '4bf58dd8d48988d145941735', 'name': 'C...",False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56b2cd62498e0819ad42f567,My Little Dumplings,43.664504,-79.325709,"[{'label': 'display', 'lat': 43.66450364048565...",CA,Canada,"[1372 Queen St E (at Greenwood Ave), Toronto O...",https://foursquare.com/v/my-little-dumplings/5...,"[{'id': '4bf58dd8d48988d108941735', 'name': 'D...",False,...,,,,,"Leslieville, Toronto, ON",,,,,
5d34d6dfe57689000792bf15,Hakka Fire,43.693030,-79.315832,"[{'label': 'display', 'lat': 43.69303, 'lng': ...",CA,Canada,"[1235 Woodbine Avenue (Lumsden Avenue), Toront...",https://foursquare.com/v/hakka-fire/5d34d6dfe5...,"[{'id': '52af3ac83cf9994f4e043bf3', 'name': 'H...",False,...,,,,,,,,,,
4b1711a6f964a520cbc123e3,Federick Restaurant,43.774697,-79.241142,"[{'label': 'display', 'lat': 43.77469659057996...",CA,Canada,"[1920 Ellesmere Rd (at Bellamy Rd. N), Scarbor...",https://foursquare.com/v/federick-restaurant/4...,"[{'id': '52af3ac83cf9994f4e043bf3', 'name': 'H...",False,...,,,,,,,,,,
5a5814ebe679bc7b2fa7cca4,Tuk Tuk Canteen,43.650806,-79.450520,"[{'label': 'display', 'lat': 43.650806, 'lng':...",CA,Canada,"[397 Roncesvalles Ave, Toronto ON M6R 2N1, Can...",https://foursquare.com/v/tuk-tuk-canteen/5a581...,"[{'id': '52e81612bcbc57f1066b7a03', 'name': 'C...",False,...,,,,,Roncesvalles Village,,,,,


In [167]:
#
# Join key data fields to the restaurants dataframe and clean up the joined columns
#
key_venue_cols = [
    'venue.stats.tipCount',
    'venue.price.tier',
    'venue.rating',
    'venue.likes.count'
]
restaurants = restaurants.join(asian_venue_data_df[key_venue_cols], on='Venue ID')

# Cleanup columns
restaurants.rename(columns={
    'venue.stats.tipCount': 'Venue Tip Count',
    'venue.price.tier': 'Venue Price Tier',
    'venue.rating': 'Venue Rating',
    'venue.likes.count': 'Venue Likes'
}, inplace=True)

restaurants['Venue Price Tier'] = restaurants['Venue Price Tier'].astype(pd.Int64Dtype())
restaurants['Venue Tip Count'] = restaurants['Venue Tip Count'].astype(pd.Int64Dtype())
restaurants['Venue Likes'] = restaurants['Venue Likes'].astype(pd.Int64Dtype())

# Drop invalid neighborhoods
restaurants.dropna(subset=['Neighborhood Latitude', 'Neighborhood Longitude'], inplace=True)

restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Asian,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M3A,43.753259,-79.329656,4b8991cbf964a520814232e3,Allwyn's Bakery,43.759840,-79.324719,81 Underhill drive,M3A 1Z5,Caribbean Restaurant,False,,,,
1,M3A,43.753259,-79.329656,4e6696b6d16433b9ffff47c3,KFC,43.754387,-79.333021,,M3A 2S3,Fast Food Restaurant,False,,,,
2,M4A,43.725882,-79.315572,4d689350b6f46dcb77ee15b2,The Frig,43.727051,-79.317418,,M4A 1K2,French Restaurant,False,,,,
3,M4A,43.725882,-79.315572,4f3ecce6e4b0587016b6f30d,Portugril,43.725819,-79.312785,1733 Eglinton Avenue East,M4A 1J8,Portuguese Restaurant,False,,,,
4,M4A,43.725882,-79.315572,51c1d125498ef8fda0942e6c,Vinnia Meats,43.730465,-79.307520,1050 Birchmount Ave,M1P 4N4,German Restaurant,False,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,M7Y,43.662744,-79.321558,4ad9ebdcf964a520e61b21e3,Chick-n-Joy,43.665181,-79.321403,1483 Queen St. E,M4L 1E2,Fast Food Restaurant,False,,,,
804,M8Z,43.628841,-79.520999,4aec9552f964a52007c921e3,McDonald's,43.630007,-79.518041,1001 Islington Ave,M8Z 4P8,Fast Food Restaurant,False,,,,
805,M8Z,43.628841,-79.520999,4c6d5881e13db60c516ed8b1,Lakeshore Super Submarine,43.627321,-79.529354,2939 Lakeshore Blvd West,M8Z 5G5,Fast Food Restaurant,False,,,,
806,M8Z,43.628841,-79.520999,509ee7d8e4b03075378182a4,Ricco's Plum Tomato,43.632760,-79.518120,,M8Z 2R4,Italian Restaurant,False,,,,


In [168]:
# Save the data to file
with open('restaurants.pkl', 'wb') as f:
    pickle.dump(restaurants, f, pickle.HIGHEST_PROTOCOL)

## Analyze Restaurants
### Categorizing Each Postal Code by All Restaurants with Emphasis on Asian
By taking the mean of the one-hot encoded categories across each postal code, you get the fraction of that postal code's restaurants that fall into each venue category. The advantage of this is that the data is already scaled from zero to one. This is the final result we want: each postal code segmented by its makeup of venue categories.
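A small worked example shows why the mean of one-hot columns is a fraction. The `toy` frame below is made up for illustration; only the column names match this notebook.

```python
import pandas as pd

# Hypothetical toy data: four restaurants in M1B and one in M1C
toy = pd.DataFrame({
    'Neighborhood': ['M1B', 'M1B', 'M1B', 'M1B', 'M1C'],
    'Venue Category': ['Thai Restaurant', 'Fast Food Restaurant',
                       'Fast Food Restaurant', 'Italian Restaurant',
                       'Thai Restaurant'],
})

# One column per category, with a 1 where the restaurant belongs to it
one_hot = pd.get_dummies(toy['Venue Category'], dtype=int)
one_hot['Neighborhood'] = toy['Neighborhood']

# The mean of a 0/1 column within a group is the share of rows equal to 1
fractions = one_hot.groupby('Neighborhood').mean()
print(fractions)
```

In M1B two of the four restaurants are fast food, so that cell is 0.5, and each row of category fractions sums to one. In the real frame the extra `Asian` flag sits alongside the categories, so rows there sum to more than one.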

In [173]:
if path.exists('restaurants.pkl'):
    with open('restaurants.pkl', 'rb') as f:
        restaurants = pickle.load(f)

restaurant_one_hot = pd.get_dummies(restaurants['Venue Category'], prefix='', prefix_sep='')

# add neighborhood columns back to dataframe
restaurant_one_hot['Neighborhood'] = restaurants['Neighborhood'] 
restaurant_one_hot['Asian'] = restaurants['Asian'].astype(int)

# move neighborhood column to the first column
fixed_columns = list(restaurant_one_hot.columns[-2:]) + list(restaurant_one_hot.columns[:-2])
restaurant_one_hot = restaurant_one_hot[fixed_columns]

restaurant_clusters = restaurant_one_hot.groupby('Neighborhood').mean()
restaurant_clusters.head(10)

Neighborhood,Asian,Afghan Restaurant,African Restaurant,American Restaurant,Argentinian Restaurant,Asian Restaurant,Bangladeshi Restaurant,Belgian Restaurant,Brazilian Restaurant,Burmese Restaurant,...,Sushi Restaurant,Swiss Restaurant,Szechuan Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Tibetan Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
M1B,0.1875,0.0,0.0625,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1C,0.285714,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1E,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1G,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1H,0.277778,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0
M1J,0.25,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1K,0.444444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1L,0.25,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1M,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0
M1N,0.333333,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.0


In [174]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(restaurant_clusters)

# check cluster labels generated for each row in the dataframe
print('Cluster Labels:\n', kmeans.labels_)
restaurant_clusters.insert(0, 'Cluster', kmeans.labels_)
restaurant_clusters.head(10)

Cluster Labels:
 [0 0 0 0 0 0 4 0 0 0 0 0 4 0 4 0 0 4 4 4 0 4 4 0 3 3 4 4 4 4 4 4 0 0 0 0 0
 4 4 0 0 0 0 0 3 0 1 0 0 0 1 4 0 1 0 4 0 0 0 0 0 0 0 4 0 0 4 4 2 4 4 3 4 0
 0 0 0 4 3 0 0 0 4 0 4 0 2 3 0 4 4 3 3 4 0 1 1 0 4 0 0 0]


Neighborhood,Cluster,Asian,Afghan Restaurant,African Restaurant,American Restaurant,Argentinian Restaurant,Asian Restaurant,Bangladeshi Restaurant,Belgian Restaurant,Brazilian Restaurant,...,Sushi Restaurant,Swiss Restaurant,Szechuan Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Tibetan Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
M1B,0,0.1875,0.0,0.0625,0.0,0.0,0.0625,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1C,0,0.285714,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1E,0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1G,0,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1H,0,0.277778,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0
M1J,0,0.25,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1K,4,0.444444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1L,0,0.25,0.0,0.0,0.0,0.0,0.083333,0.083333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
M1M,0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0
M1N,0,0.333333,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.0


In [175]:
restaurants = restaurants.join(restaurant_clusters['Cluster'], on='Neighborhood')
restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Asian,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes,Cluster
0,M3A,43.753259,-79.329656,4b8991cbf964a520814232e3,Allwyn's Bakery,43.759840,-79.324719,81 Underhill drive,M3A 1Z5,Caribbean Restaurant,False,,,,,3
1,M3A,43.753259,-79.329656,4e6696b6d16433b9ffff47c3,KFC,43.754387,-79.333021,,M3A 2S3,Fast Food Restaurant,False,,,,,3
2,M4A,43.725882,-79.315572,4d689350b6f46dcb77ee15b2,The Frig,43.727051,-79.317418,,M4A 1K2,French Restaurant,False,,,,,0
3,M4A,43.725882,-79.315572,4f3ecce6e4b0587016b6f30d,Portugril,43.725819,-79.312785,1733 Eglinton Avenue East,M4A 1J8,Portuguese Restaurant,False,,,,,0
4,M4A,43.725882,-79.315572,51c1d125498ef8fda0942e6c,Vinnia Meats,43.730465,-79.307520,1050 Birchmount Ave,M1P 4N4,German Restaurant,False,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,M7Y,43.662744,-79.321558,4ad9ebdcf964a520e61b21e3,Chick-n-Joy,43.665181,-79.321403,1483 Queen St. E,M4L 1E2,Fast Food Restaurant,False,,,,,3
804,M8Z,43.628841,-79.520999,4aec9552f964a52007c921e3,McDonald's,43.630007,-79.518041,1001 Islington Ave,M8Z 4P8,Fast Food Restaurant,False,,,,,3
805,M8Z,43.628841,-79.520999,4c6d5881e13db60c516ed8b1,Lakeshore Super Submarine,43.627321,-79.529354,2939 Lakeshore Blvd West,M8Z 5G5,Fast Food Restaurant,False,,,,,3
806,M8Z,43.628841,-79.520999,509ee7d8e4b03075378182a4,Ricco's Plum Tomato,43.632760,-79.518120,,M8Z 2R4,Italian Restaurant,False,,,,,3


In [178]:
address = 'Toronto, Ontario'

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, indexing the colors directly by cluster label
for lat, lon, poi, cluster in zip(restaurants['Neighborhood Latitude'], restaurants['Neighborhood Longitude'], restaurants['Neighborhood'], restaurants['Cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [179]:
map_clusters

In [189]:
restaurant_clusters.describe()

Unnamed: 0,Cluster,Asian,Afghan Restaurant,African Restaurant,American Restaurant,Argentinian Restaurant,Asian Restaurant,Bangladeshi Restaurant,Belgian Restaurant,Brazilian Restaurant,...,Sushi Restaurant,Swiss Restaurant,Szechuan Restaurant,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Tibetan Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
count,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,...,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0
mean,1.539216,0.294828,0.007298,0.018227,0.037128,0.000891,0.037746,0.000817,0.001401,0.00098,...,0.041633,0.001961,0.000577,0.001401,0.022054,0.0007,0.000545,0.004606,0.0142,0.031587
std,1.832985,0.227801,0.040787,0.103993,0.144663,0.009001,0.081352,0.008251,0.014145,0.009901,...,0.086154,0.019803,0.005824,0.014145,0.053461,0.007072,0.005501,0.029292,0.043504,0.118969
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.145604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,0.428571,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,...,0.06746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,4.0,1.0,0.333333,1.0,1.0,0.090909,0.5,0.083333,0.142857,0.1,...,0.444444,0.2,0.058824,0.142857,0.333333,0.071429,0.055556,0.25,0.2,1.0


In [197]:
restaurant_clusters.loc[ :, [True, True] + list(restaurant_clusters.iloc[:,2:].max() == 1.0) ]

Neighborhood,Cluster,Asian,African Restaurant,American Restaurant,Fast Food Restaurant,Italian Restaurant,Japanese Restaurant,Restaurant,Vietnamese Restaurant
M1B,0,0.187500,0.062500,0.000000,0.187500,0.000000,0.0,0.187500,0.0
M1C,0,0.285714,0.142857,0.000000,0.142857,0.285714,0.0,0.000000,0.0
M1E,0,0.200000,0.000000,0.000000,0.100000,0.000000,0.1,0.300000,0.0
M1G,0,0.222222,0.000000,0.000000,0.111111,0.000000,0.0,0.000000,0.0
M1H,0,0.277778,0.000000,0.000000,0.055556,0.055556,0.0,0.111111,0.0
...,...,...,...,...,...,...,...,...,...
M9N,0,0.222222,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0
M9P,4,0.428571,0.000000,0.142857,0.000000,0.000000,0.0,0.000000,0.0
M9R,0,0.333333,0.000000,0.333333,0.333333,0.000000,0.0,0.000000,0.0
M9V,0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0
