# Rob's Capstone Project

This notebook will contain my work for the [capstone project](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome) for the IBM Data Science Specialization certificate.
### Week 5

**Name**: Robert Barrimond

**Date**: May 20, 2021

**REVIEWER PLEASE NOTE**
I do _not_ comment my code with Markdown. As an SRE (Site Reliability Engineer), I do as application developers should do: document code _in the code_ and, wherever possible, by the code itself. Having said that, SREs are also called to be data scientists. So, I use Markdown to "tell the story" as that first overview course taught me so many months ago. I hope this assignment is easy to follow and grade!

---

## Problem Statement
I've decided to see if it's worth pursuing opening a Cambodian restaurant somewhere in Toronto. I noticed from the previous assignments that the city is quite cosmopolitan and would welcome such a restaurant. The real problem is _where_ to locate it. My strategy will be to narrow the list of venues to Asian restaurants, get premium data on just those venues, e.g. number of likes and rating, and use that to produce better clusters that can make my decision easier.


In [1]:
#
# Import the necessary modules
#

# Data analysis and transformation
import pandas as pd
import numpy as np

# REST API access
import requests

# File access
import os
from os import path
import pickle

# Geocoders
import geopy
from geopy.geocoders import Nominatim

# Regex
import re

# Progress bars
from tqdm import tqdm


## Retrieve and Clean Data for Analysis
### Tag Asian Restaurants as a Feature Using the Foursquare API

Because it took [some work](https://github.com/rbarrimond/Coursera_Capstone/blob/5afe9c180839529c96ee71c8a5fae69746b9f4c3/toronto-kmeans-clustering.ipynb) to build a clean dataframe of FSAs (forward sortation areas, the first three characters of a Canadian postal code) from Wikipedia, I'll omit that work here and simply read the pickle from disk. Next I use the Foursquare API venue `search` endpoint to get all the nearby venues. Once I retrieve these, I'll sift out all the restaurants and tag Asian restaurants as a feature to use in my clustering analysis. This is an improvement over what was done in Week 4: I learned a good bit about getting information from the API up front and using it to correct the FSAs.
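
As a rough illustration of the sifting step, here is a minimal, pandas-free sketch. It assumes each venue's `categories` field is a list of dicts with a `name` key, which is how `getNearbyVenues` reads it further down. The helper names `primary_category` and `is_restaurant` are hypothetical and are not part of the notebook.

```python
from typing import Any, Dict, List, Optional


def primary_category(categories: Optional[List[Dict[str, Any]]]) -> Optional[str]:
    """Return the name of the first category, or None when there is none."""
    if not categories:
        return None
    return categories[0].get("name")


def is_restaurant(category_name: Optional[str]) -> bool:
    """Case-insensitive check for the word 'restaurant' in a category name."""
    return category_name is not None and "restaurant" in category_name.lower()


# Hypothetical venue records shaped like the fields used in this notebook
venues = [
    {"name": "Thai Express", "categories": [{"name": "Thai Restaurant"}]},
    {"name": "Brookbanks Park", "categories": [{"name": "Park"}]},
    {"name": "Uncategorized venue", "categories": []},
]

restaurants = [v["name"] for v in venues if is_restaurant(primary_category(v["categories"]))]
print(restaurants)
```

Venues with an empty category list simply drop out of the restaurant subset, which mirrors the `dropna` on `Venue Category` in the cleanup cell.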

In [2]:
# Load the FSA data from previous work
fsa_df = pd.read_pickle('fsa_df.pkl')
fsa_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [3]:
# Foursquare credentials
CLIENT_ID = '0II4MXQK5GVKQA3YIZRXT3D0KWBAKEH2BCCYRWIK4H0DS5XH' # your Foursquare ID
CLIENT_SECRET = 'ZBOCOGUCP2AXOQNAFSPX05IAXWAPBNUBC2FTAGYJV4DDS3AA' # your Foursquare Secret
ACCESS_TOKEN = 'ESEQDUIWNVRAS11OKDDXMICGNUPXLZCHPVHCZ53OTT2LQWBS' # your FourSquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [4]:
# This function takes a sequence of names, lats and longs and produces a dataframe of nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    nearby_venues = pd.DataFrame()

    with tqdm(total=len(names)) as pbar:
        for name, lat, lng in zip(names, latitudes, longitudes):
                
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/search'
            payload = {
                'client_id': CLIENT_ID,
                'client_secret': CLIENT_SECRET,
                'v': VERSION,
                'll': '{},{}'.format(lat,lng),
                'radius': radius,
                'limit': LIMIT
            }
        
            # make the GET request, raise exception if an error
            r = requests.get(url, params=payload)
            r.raise_for_status()
            
            # create dataframe, return only relevant information for each nearby venue
            results = pd.json_normalize(r.json()["response"]['venues'])
            results.rename(columns={
                                    'id': 'Venue ID',
                                    'name': 'Venue', 
                                    'location.lat': 'Venue Latitude', 
                                    'location.lng': 'Venue Longitude',
                                    'location.address': 'Venue Address',
                                    'location.postalCode': 'Venue Postal Code'
                                }, inplace=True)
            results['Venue Category'] = results['categories'].loc[ results['categories'].notna() ].apply(lambda x: x[0]['name'] if len(x) > 0 else None)
            results['Neighborhood'] = name
            results['Neighborhood Latitude'] = lat
            results['Neighborhood Longitude'] = lng

            columns = ['Neighborhood', 
                        'Neighborhood Latitude', 
                        'Neighborhood Longitude', 
                        'Venue ID',
                        'Venue', 
                        'Venue Latitude', 
                        'Venue Longitude',
                        'Venue Address',
                        'Venue Postal Code',
                        'Venue Category']

            # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
            nearby_venues = pd.concat([nearby_venues, results[columns]], ignore_index=True)
            pbar.update()
            
    return nearby_venues

In [5]:
# Retrieve all the venues in Toronto
toronto_venues = getNearbyVenues(fsa_df['PostalCode'], fsa_df['Latitude'], fsa_df['Longitude'])

# Quick cleanup
toronto_venues.drop_duplicates(subset=['Venue ID'], inplace=True, ignore_index=True)
toronto_venues.dropna(axis='index', subset=['Venue Category'], inplace=True)
toronto_venues.reset_index(drop=True, inplace=True)
toronto_venues['Venue Postal Code'] = toronto_venues['Venue Postal Code'].str.upper()
toronto_venues

100%|██████████| 103/103 [00:19<00:00,  5.19it/s]


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category
0,M3A,43.753259,-79.329656,4e8d9dcdd5fbbbb6b3003c7b,Brookbanks Park,43.751976,-79.332140,Toronto,,Park
1,M3A,43.753259,-79.329656,4f3a69f9e4b024185be5a99b,17 Brookbanks Drive,43.752266,-79.332322,15 Brookbanks Dr.,M3A 2S9,Residential Building (Apartment / Condo)
2,M3A,43.753259,-79.329656,4dcc586845dd853165f01864,Tailor Made,43.741513,-79.319707,,,Laundry Service
3,M3A,43.753259,-79.329656,5e111e7e9316a70007fb9653,Subway,43.760334,-79.326906,"1277 York Mills Road, Unit F1-2, Bldg F",M3A 1Z5,Sandwich Place
4,M3A,43.753259,-79.329656,4bda3d363904a59320d5459e,Joey,43.753441,-79.321640,,,Burger Joint
...,...,...,...,...,...,...,...,...,...,...
8695,M8Z,43.628841,-79.520999,4b4a2c3ff964a520507d26e3,Jim & Maria's No Frills,43.631152,-79.518617,1020 Islington Ave,M8Z 6A4,Grocery Store
8696,M8Z,43.628841,-79.520999,5766bbe1498e063d79da076f,Pavao Meats,43.626930,-79.527209,16 Jutland road,M8C 2G9,Butcher
8697,M8Z,43.628841,-79.520999,4c34d41a16adc928b6d3c59c,Food Depot International,43.627208,-79.527310,14 Jutland Road,M8Z 2G9,Food & Drink Shop
8698,M8Z,43.628841,-79.520999,51e5696b498e9ff78a14be19,Torque Barbell,43.632061,-79.525625,253 Norseman St,M8Z 2R4,Gym / Fitness Center


In [6]:
# Explore the types of restaurants
restaurants = toronto_venues.loc[ toronto_venues['Venue Category'].str.contains('restaurant', case=False, regex=True) ].copy()
types = restaurants['Venue Category'].str.replace("Restaurant", "")
types = sorted(types.unique())
types

['',
 'Afghan ',
 'African ',
 'American ',
 'Argentinian ',
 'Asian ',
 'Bangladeshi ',
 'Belgian ',
 'Burmese ',
 'Cajun / Creole ',
 'Cantonese ',
 'Caribbean ',
 'Chinese ',
 'Comfort Food ',
 'Cuban ',
 'Dim Sum ',
 'Dumpling ',
 'Eastern European ',
 'English ',
 'Ethiopian ',
 'Falafel ',
 'Fast Food ',
 'Filipino ',
 'French ',
 'German ',
 'Gluten-free ',
 'Greek ',
 'Hakka ',
 'Halal ',
 'Hungarian ',
 'Indian ',
 'Italian ',
 'Japanese ',
 'Jewish ',
 'Korean ',
 'Korean BBQ ',
 'Latin American ',
 'Mediterranean ',
 'Mexican ',
 'Middle Eastern ',
 'Modern European ',
 'Moroccan ',
 'New American ',
 'North Indian ',
 'Pakistani ',
 'Peking Duck ',
 'Persian ',
 'Portuguese ',
 'Ramen ',
 'Seafood ',
 'South Indian ',
 'Sri Lankan ',
 'Sushi ',
 'Swiss ',
 'Szechuan ',
 'Tapas ',
 'Thai ',
 'Theme ',
 'Tibetan ',
 'Turkish ',
 'Vegetarian / Vegan ',
 'Vietnamese ']

In [7]:
# Set the Asian categories based on the above and create a cleaned Asian restaurants dataframe
asian_categories = [ 'Asian', 'Cantonese', 'Chinese', 'Dim Sum', 'Dumpling', 'Filipino', 'Japanese', 'Korean', 'Korean BBQ', 'Peking Duck', 'Ramen', 'Sushi', 'Szechuan', 'Taiwanese', 'Thai', 'Tibetan', 'Vietnamese']
asian_restaurants = toronto_venues.loc[ toronto_venues['Venue Category'].str.contains("|".join(asian_categories) + ' restaurant', case=False, regex=True) ].copy()
asian_restaurants.reset_index(drop=True, inplace=True)
asian_restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category
0,M5A,43.654260,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,,Chinese Restaurant
1,M7A,43.662301,-79.389494,5ab3d9f875a6ea3a7ddc4d2b,Thai Express,43.661630,-79.387340,76 Grenville St,M5S 1B2,Thai Restaurant
2,M7A,43.662301,-79.389494,59a86be58d1070397a5101be,Sushi Shop,43.661620,-79.387636,"76 Grenville St, Woman's College Hospital",M5S 1B2,Sushi Restaurant
3,M1B,43.806686,-79.194353,4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,43.800982,-79.200233,3-1158 Morningside Ave,M1B 3A4,Chinese Restaurant
4,M3B,43.745906,-79.352188,53bafb4f498eb927faa3cd9e,Matsuda Japanese Cuisine & Teppanyaki,43.745494,-79.345821,1300 Don Mills Rd #2,M3B 2W6,Japanese Restaurant
...,...,...,...,...,...,...,...,...,...,...
242,M8X,43.653654,-79.506944,4aee0654f964a5206ad121e3,Sushi 2 Go,43.647875,-79.509427,2976 Bloor Street West,M8X 1B9,Sushi Restaurant
243,M8X,43.653654,-79.506944,4b11a52ef964a5204b8123e3,Momiji Sushi Bar & Grill,43.647843,-79.508534,2955 Bloor St. W.,M8X 1B8,Sushi Restaurant
244,M4Y,43.665860,-79.383160,4e36063c8877beb5e9b29c87,Bowl,43.665443,-79.382027,,,Asian Restaurant
245,M4Y,43.665860,-79.383160,5c7a1d6f5bc27d00254a87e1,Dakgogi,43.665093,-79.383521,25 Wellesley St E,M4Y 2S9,Korean Restaurant
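
One thing to watch when joining category names into a regular expression: `|` binds more loosely than concatenation, so a suffix appended after the join attaches only to the last alternative. The sketch below shows the difference using a shortened category list. The label "Chinese Breakfast Place" is used purely as a hypothetical non-restaurant example, and the `matches` helper is not part of the notebook.

```python
import re

categories = ["Asian", "Chinese", "Thai"]  # shortened list, for illustration only

# Without grouping, this reads as: Asian | Chinese | "Thai restaurant"
ungrouped = re.compile("|".join(categories) + " restaurant", re.IGNORECASE)

# With a non-capturing group, the suffix applies to every alternative
grouped = re.compile(
    "(?:" + "|".join(map(re.escape, categories)) + ") restaurant",
    re.IGNORECASE,
)


def matches(pattern, label):
    """True when the pattern is found anywhere in the label."""
    return pattern.search(label) is not None


for label in ["Thai Restaurant", "Chinese Restaurant", "Chinese Breakfast Place"]:
    print(label, matches(ungrouped, label), matches(grouped, label))
```

Only the grouped pattern rejects the non-restaurant label, so it is the safer choice when the venue list contains non-restaurant categories that begin with a cuisine name.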


### Pull Full Data from Foursquare for Each Asian Restaurant
NOTE: The venue details endpoint (`/venues/{venue_id}`) is a premium API. As a result, I cache the results to a file.
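
The cache-or-fetch idea can be sketched as a small reusable helper. This is only an illustration under stated assumptions: `load_or_fetch` and its `fetch_one` callback are hypothetical names, and the callback stands in for whatever function performs the premium API call.

```python
import os
import pickle
from typing import Any, Callable, Dict, Iterable


def load_or_fetch(cache_path: str,
                  ids: Iterable[str],
                  fetch_one: Callable[[str], Any]) -> Dict[str, Any]:
    """Return cached results if the cache file exists; otherwise fetch and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    results: Dict[str, Any] = {}
    for item_id in ids:
        try:
            results[item_id] = fetch_one(item_id)
        except Exception as exc:
            # Keep going, but leave a trace of what failed
            print("skipped {}: {}".format(item_id, exc))

    with open(cache_path, "wb") as f:
        pickle.dump(results, f, pickle.HIGHEST_PROTOCOL)
    return results
```

Logging the skipped IDs, rather than silently passing, makes it easier to explain later why some venues have no rating or like counts.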


In [8]:
# Read in venue data either from cache or from Foursquare
asian_venue_data = {}
if path.exists('asian_venue_data.pkl'):
    with open('asian_venue_data.pkl', 'rb') as f:
        asian_venue_data = pickle.load(f)
else:
    payload = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION
        }

    with tqdm(total=len(asian_restaurants['Venue ID'])) as pbar:
        for venue_id in asian_restaurants['Venue ID']:
            try:
                url = 'https://api.foursquare.com/v2/venues/{}'.format(venue_id)
                r = requests.get(url, params=payload)
                r.raise_for_status()

                asian_venue_data[venue_id] = r.json()['response']
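            except (requests.RequestException, KeyError, ValueError):
                # Skip venues whose details cannot be fetched; they show up as NaN after the join
                pass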
            pbar.update()
       
    with open('asian_venue_data.pkl', 'wb') as f:
        pickle.dump(asian_venue_data, f, pickle.HIGHEST_PROTOCOL)


In [9]:
# Create a dataframe for the venue data
# DataFrame.append was removed in pandas 2.0, so normalize each response and concatenate once
asian_venue_data_df = pd.concat(
    [pd.json_normalize(asian_venue_data[venue_id]) for venue_id in tqdm(asian_venue_data)],
    ignore_index=True
)
asian_venue_data_df.set_index('venue.id', inplace=True)
asian_venue_data_df

100%|██████████| 242/242 [00:09<00:00, 26.59it/s]


venue.id,venue.name,venue.contact.phone,venue.contact.formattedPhone,venue.location.address,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,venue.location.postalCode,venue.location.cc,venue.location.city,...,venue.page.user.firstName,venue.page.user.countryCode,venue.page.user.type,venue.page.user.tips.count,venue.page.user.lists.groups,venue.page.user.bio,venue.location.neighborhood,venue.parent.location.neighborhood,venue.storeId,venue.page.pageInfo.description
5ab3d9f875a6ea3a7ddc4d2b,Thai Express,4169219222,(416) 921-9222,76 Grenville St,43.661630,-79.387340,"[{'label': 'display', 'lat': 43.66163, 'lng': ...",M5S 1B2,CA,Toronto,...,,,,,,,,,,
59a86be58d1070397a5101be,Sushi Shop,4169299777,(416) 929-9777,"76 Grenville St, Woman's College Hospital",43.661620,-79.387636,"[{'label': 'display', 'lat': 43.66162, 'lng': ...",M5S 1B2,CA,Toronto,...,,,,,,,,,,
4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,4162828608,(416) 282-8608,3-1158 Morningside Ave,43.800982,-79.200233,"[{'label': 'display', 'lat': 43.80098159718747...",M1B 3A4,CA,Toronto,...,,,,,,,,,,
4dd05e15ae603b786d5f1a34,977 Cafe,,,,43.706999,-79.310286,"[{'label': 'display', 'lat': 43.70699928709498...",,CA,Toronto,...,,,,,,,,,,
4f949f04e4b03c10544badb2,New East Garden,,,"2889 St Clair Ave E, East York, ON M4B 1N5",43.707505,-79.303303,"[{'label': 'display', 'lat': 43.70750479649608...",M4B 1N5,CA,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4aee0654f964a5206ad121e3,Sushi 2 Go,4162365909,(416) 236-5909,2976 Bloor Street West,43.647875,-79.509427,"[{'label': 'display', 'lat': 43.64787495176972...",M8X 1B9,CA,Toronto,...,,,,,,,,,,
4b11a52ef964a5204b8123e3,Momiji Sushi Bar & Grill,4162322320,(416) 232-2320,2955 Bloor St. W.,43.647843,-79.508534,"[{'label': 'display', 'lat': 43.64784304189297...",M8X 1B8,CA,Etobicoke,...,,,,,,,,,,
4e36063c8877beb5e9b29c87,Bowl,,,,43.665443,-79.382027,"[{'label': 'display', 'lat': 43.66544342041015...",,CA,,...,,,,,,,,,,
5c7a1d6f5bc27d00254a87e1,Dakgogi,,,25 Wellesley St E,43.665093,-79.383521,"[{'label': 'display', 'lat': 43.665093, 'lng':...",M4Y 2S9,CA,Toronto,...,,,,,,,,,,


In [10]:
#
# Join key data fields to the restaurants data frame and identify the dupes
#
key_venue_cols = [
    'venue.stats.tipCount',
    'venue.price.tier',
    'venue.rating',
    'venue.likes.count'
]
asian_restaurants = asian_restaurants.join(asian_venue_data_df[key_venue_cols], on='Venue ID')

# Cleanup
asian_restaurants.rename(columns={
    'venue.stats.tipCount': 'Venue Tip Count',
    'venue.price.tier': 'Venue Price Tier',
    'venue.rating': 'Venue Rating',
    'venue.likes.count': 'Venue Likes'
}, inplace=True)

asian_restaurants['Venue Price Tier'] = asian_restaurants['Venue Price Tier'].astype(pd.Int64Dtype())
asian_restaurants['Venue Tip Count'] = asian_restaurants['Venue Tip Count'].astype(pd.Int64Dtype())
asian_restaurants['Venue Likes'] = asian_restaurants['Venue Likes'].astype(pd.Int64Dtype())
asian_restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M5A,43.654260,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,,Chinese Restaurant,,,,
1,M7A,43.662301,-79.389494,5ab3d9f875a6ea3a7ddc4d2b,Thai Express,43.661630,-79.387340,76 Grenville St,M5S 1B2,Thai Restaurant,0,2,6.4,0
2,M7A,43.662301,-79.389494,59a86be58d1070397a5101be,Sushi Shop,43.661620,-79.387636,"76 Grenville St, Woman's College Hospital",M5S 1B2,Sushi Restaurant,0,2,,1
3,M1B,43.806686,-79.194353,4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,43.800982,-79.200233,3-1158 Morningside Ave,M1B 3A4,Chinese Restaurant,1,1,,1
4,M3B,43.745906,-79.352188,53bafb4f498eb927faa3cd9e,Matsuda Japanese Cuisine & Teppanyaki,43.745494,-79.345821,1300 Don Mills Rd #2,M3B 2W6,Japanese Restaurant,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242,M8X,43.653654,-79.506944,4aee0654f964a5206ad121e3,Sushi 2 Go,43.647875,-79.509427,2976 Bloor Street West,M8X 1B9,Sushi Restaurant,7,2,6.5,2
243,M8X,43.653654,-79.506944,4b11a52ef964a5204b8123e3,Momiji Sushi Bar & Grill,43.647843,-79.508534,2955 Bloor St. W.,M8X 1B8,Sushi Restaurant,14,2,7.8,36
244,M4Y,43.665860,-79.383160,4e36063c8877beb5e9b29c87,Bowl,43.665443,-79.382027,,,Asian Restaurant,0,2,,0
245,M4Y,43.665860,-79.383160,5c7a1d6f5bc27d00254a87e1,Dakgogi,43.665093,-79.383521,25 Wellesley St E,M4Y 2S9,Korean Restaurant,0,2,,0


In [11]:
# 
# Create masks to filter the dataframe
#

postal_code_mask = pd.notna(asian_restaurants['Venue Postal Code'])
postal_code_match_mask = (asian_restaurants['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == asian_restaurants['Neighborhood'])
address_mask = pd.notna(asian_restaurants['Venue Address'])

# Venues with mismatched postal codes
pd.set_option('display.max_rows', 200)
asian_restaurants[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[~postal_code_match_mask]

Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
0,Huayu Kitchen,M5A,,
1,Thai Express,M7A,M5S 1B2,76 Grenville St
2,Sushi Shop,M7A,M5S 1B2,"76 Grenville St, Woman's College Hospital"
5,977 Cafe,M4B,,
7,Michi Roll and Sushi,M5B,,113 Bond Street
9,Miyako Sushi Restaurant,M6B,,572 Marlee Ave
10,Li Cheng Restaurant,M6B,,529 Marlee Avenue
11,Tambayan,M6B,,541 Marlee Ave
12,Miyako sushi,M6B,,
13,Spoon and Fork,M9B,M9C 5M1,5555 Eglinton Ave W


As we can see, there are a lot of mismatches in the data. The FSAs that we initially set as `Neighborhood` don't match the official FSA in `Venue Postal Code`. The good news is that most of them are pretty close, which bodes well for adjusting the `Neighborhood` column to reflect the "true" FSA. The first step is to adjust the venues whose postal code is known.
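
The correction rule is simple enough to sketch on its own: when a venue has a postal code, its first three characters win; otherwise the original FSA is kept. The helper names `fsa_of` and `corrected_neighborhood` are hypothetical, and the sample values are rows from the mismatch table above.

```python
import re
from typing import Optional

FSA_RE = re.compile(r"^(\w{3})")  # same pattern the notebook uses to pull the FSA


def fsa_of(postal_code: Optional[str]) -> Optional[str]:
    """Return the first three characters of a postal code, or None if unavailable."""
    if not postal_code:
        return None
    match = FSA_RE.match(postal_code.strip().upper())
    return match.group(1) if match else None


def corrected_neighborhood(neighborhood: str, postal_code: Optional[str]) -> str:
    """Prefer the FSA from the venue's postal code; fall back to the original FSA."""
    fsa = fsa_of(postal_code)
    return fsa if fsa is not None else neighborhood


print(corrected_neighborhood("M7A", "M5S 1B2"))  # Thai Express
print(corrected_neighborhood("M9B", "M9C 5M1"))  # Spoon and Fork
print(corrected_neighborhood("M5A", None))       # Huayu Kitchen, no postal code yet
```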

In [12]:
# Adjust Neighborhood to known FSA

mask = postal_code_mask & ~postal_code_match_mask
asian_restaurants.loc[ mask, 'Neighborhood'] = asian_restaurants.loc[mask, 'Venue Postal Code'].str.extract(r'^(\w{3})', expand=False)

# Reset masks
postal_code_mask = pd.notna(asian_restaurants['Venue Postal Code'])
postal_code_match_mask = (asian_restaurants['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == asian_restaurants['Neighborhood'])
address_mask = pd.notna(asian_restaurants['Venue Address'])

# Check results
asian_restaurants[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[~postal_code_match_mask]

Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
0,Huayu Kitchen,M5A,,
5,977 Cafe,M4B,,
7,Michi Roll and Sushi,M5B,,113 Bond Street
9,Miyako Sushi Restaurant,M6B,,572 Marlee Ave
10,Li Cheng Restaurant,M6B,,529 Marlee Avenue
11,Tambayan,M6B,,541 Marlee Ave
12,Miyako sushi,M6B,,
14,Far East Chinese Food,M9B,,137 Martin Grove Road
16,Taste Buddies,M1C,,5532 Lawrence Ave
17,W Sushi,M1C,,235 Edinburgh rd


Now we are left with venues that have no postal code but _do_ have an address. Time to engage Nominatim to see if the geocoder can get us the postal code.
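
Pulling a postal code out of a geocoder's display address is the fragile part of this step, so here is a minimal, network-free sketch of it. Two assumptions are mine and not the notebook's: the sample address string is a made-up illustration of a display address, and the pattern is swapped for a stricter one based on the Canadian postal code shape (letter-digit-letter, then optionally digit-letter-digit). The `geocode` argument is any callable that returns an object with an `.address` attribute or `None`, as geopy's `Nominatim.geocode` does.

```python
import re
from typing import Any, Callable, Optional

# Stricter than matching any three word characters: requires the A1A (1A1) shape
POSTAL_CODE_RE = re.compile(r"\b([A-Z]\d[A-Z])(?: ?(\d[A-Z]\d))?\b")


def extract_postal_code(display_address: str) -> Optional[str]:
    """Return 'A1A 1A1', or just the FSA 'A1A', or None if nothing matches."""
    match = POSTAL_CODE_RE.search(display_address.upper())
    if match is None:
        return None
    fsa, ldu = match.group(1), match.group(2)
    return "{} {}".format(fsa, ldu) if ldu else fsa


def lookup_postal_code(address: str, geocode: Callable[[str], Any]) -> Optional[str]:
    """Geocode a street address in Toronto and extract its postal code, if any."""
    location = geocode(address + ", Toronto, ON")
    if location is None:
        return None
    return extract_postal_code(location.address)


# Hypothetical display address, for illustration only
print(extract_postal_code("113, Bond Street, Toronto, Ontario, M5B 1Y2, Canada"))
```

Requiring the letter-digit-letter shape avoids picking up the first three letters of whatever word happens to follow the province name when the address carries no postal code.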

In [13]:
# Fix a spurious row
asian_restaurants.loc[ asian_restaurants['Venue'] == 'Lucky Hakka', 'Venue Postal Code'] = np.nan

# Use the Nominatim geocoder to get the postal code data
geolocator = Nominatim(user_agent="robs_ba_explorer")
mask = ~postal_code_mask & ~postal_code_match_mask & address_mask
postal_codes = pd.Series(dtype='object')
for address in tqdm(asian_restaurants.loc[ mask, 'Venue Address']):
    location = geolocator.geocode(address + ", Toronto, ON")
    if location is not None:
        postal_codes[address] = location.address
postal_codes = postal_codes.str.extract(r"Ontario, (\w{3}(?:\s{1}\w{3})?).*$")
postal_codes

100%|██████████| 79/79 [00:39<00:00,  2.01it/s]


Unnamed: 0,0
113 Bond Street,M5B 1Y2
572 Marlee Ave,M6B 2A2
529 Marlee Avenue,M6B 2A2
541 Marlee Ave,M6B 2A2
137 Martin Grove Road,M9B 4N3
5532 Lawrence Ave,M6B 2A2
900 Don Mills Road,M3C 2H2
120 Church St.,M5C 2G3
120 Church St,M5C 2G3
961 Eglington West,M6E 2H8


In [14]:
# Loop through and set the Postal Code 
for index in tqdm(asian_restaurants.loc[mask].index):
    try:
        asian_restaurants.loc[index, 'Venue Postal Code'] = postal_codes.loc[asian_restaurants.loc[index, 'Venue Address'], 0]
    except KeyError:
        pass

# Reset masks
postal_code_mask = pd.notna(asian_restaurants['Venue Postal Code'])
postal_code_match_mask = (asian_restaurants['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == asian_restaurants['Neighborhood'])
address_mask = pd.notna(asian_restaurants['Venue Address'])
mask = ~postal_code_match_mask & address_mask

# Check results
asian_restaurants[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[mask]

100%|██████████| 79/79 [00:00<00:00, 3432.83it/s]


Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
16,Taste Buddies,M1C,M6B 2A2,5532 Lawrence Ave
17,W Sushi,M1C,,235 Edinburgh rd
26,The Thai Grill,M6C,M6E 2H8,961 Eglington West
34,Korean Grill House,M1G,M5B 1R8,369 Yonge Street
36,Lucky Hakka,,,"3774 Lawrence Avenue East,"
40,ND sushi and grill,M4G,,101-214 laird dr
47,Bungeoppang Stall,M6G,,"PAT Central, 675 Bloor Street"
70,Aji Sushi,M3J,,1325 Finch Street West
77,Bazara,M6J,M6G 1M4,188 Ossington
81,Ikki sushi,M1K,M1N 1T9,2328 Kingston rd


In [15]:
# Update the Neighborhood to match the FSA in the Postal Code
mask = postal_code_mask & ~postal_code_match_mask & address_mask
asian_restaurants.loc[ mask, 'Neighborhood'] = asian_restaurants.loc[mask, 'Venue Postal Code'].str.extract(r'^(\w{3})', expand=False)
asian_restaurants[['Venue','Neighborhood','Venue Postal Code', 'Venue Address']].loc[mask]

Unnamed: 0,Venue,Neighborhood,Venue Postal Code,Venue Address
16,Taste Buddies,M6B,M6B 2A2,5532 Lawrence Ave
26,The Thai Grill,M6E,M6E 2H8,961 Eglington West
34,Korean Grill House,M5B,M5B 1R8,369 Yonge Street
77,Bazara,M6G,M6G 1M4,188 Ossington
81,Ikki sushi,M1N,M1N 1T9,2328 Kingston rd
87,Tokyo Sushi (on Bayview),M4G,M4G 3B5,1614 Bayview Ave.
108,Yogi Noodle Delight,M1W,M1W 3Y1,325 Bamburgh Circle
110,Pho Tien Phat,M3M,M3M 1V1,2133 Jane Street
115,O Sushi,M4C,M4C 3J6,6 Coxwell
118,Bento Nouveau,M5H,M5H 1H1,40 King St West


In [16]:
# Find remaining venues that need to be adjusted
postal_code_mask = pd.notna(asian_restaurants['Venue Postal Code'])
postal_code_match_mask = (asian_restaurants['Venue Postal Code'].str.extract(r'^(\w{3})', expand=False) == asian_restaurants['Neighborhood'])
address_mask = pd.notna(asian_restaurants['Venue Address'])
mask = ~postal_code_match_mask & ~postal_code_mask

asian_restaurants.loc[mask]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M5A,43.65426,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,,Chinese Restaurant,,,,
5,M4B,43.706397,-79.309937,4dd05e15ae603b786d5f1a34,977 Cafe,43.706999,-79.310286,,,Asian Restaurant,1.0,2.0,,0.0
12,M6B,43.709577,-79.445073,51a8f217498e902d972c817e,Miyako sushi,43.707717,-79.447597,,,Asian Restaurant,0.0,2.0,,0.0
17,M1C,43.784535,-79.160497,506481cae4b01f36bfda395f,W Sushi,43.77567,-79.16444,235 Edinburgh rd,,Japanese Restaurant,0.0,2.0,,0.0
19,M3C,43.7259,-79.340923,5686b77338faf7478eb6c6aa,Asian Legend,43.726591,-79.342188,,,Dim Sum Restaurant,4.0,2.0,6.5,13.0
36,,43.770992,-79.216917,4b64765ff964a52028b52ae3,Lucky Hakka,43.76247,-79.214164,"3774 Lawrence Avenue East,",,Chinese Restaurant,9.0,1.0,6.6,4.0
40,M4G,43.70906,-79.363452,4bf2c38577b4c92887a26a1c,ND sushi and grill,43.711486,-79.363887,101-214 laird dr,,Sushi Restaurant,4.0,2.0,,1.0
42,M4G,43.70906,-79.363452,4f9483bce4b0ab5f0acfe249,Mikado,43.70924,-79.36398,,,Japanese Restaurant,0.0,2.0,,0.0
46,M6G,43.669542,-79.422564,4d2d046a853ff04de86ec5da,Gobo sushi,43.670783,-79.421287,,,Japanese Restaurant,3.0,2.0,,2.0
47,M6G,43.669542,-79.422564,4f063b710e61b14c291f5fe6,Bungeoppang Stall,43.668123,-79.420242,"PAT Central, 675 Bloor Street",,Korean Restaurant,1.0,2.0,,0.0


In [17]:
# Do a reverse geocode lookup and extract postal code from address found
p = re.compile(r"Ontario, (\w{3}(?:\s{1}\w{3})?).*$")
for index in tqdm(asian_restaurants[mask].index):
    location = geolocator.reverse("{}, {}".format(asian_restaurants.loc[index, 'Venue Latitude'], asian_restaurants.loc[index, 'Venue Longitude']))
    # Skip venues the geocoder cannot resolve or whose address lacks a postal code
    m = p.search(location.address) if location is not None else None
    if m is None:
        continue
    asian_restaurants.loc[index, 'Venue Postal Code'] = m.group(1)
    asian_restaurants.loc[index, 'Neighborhood'] = m.group(1)[0:3]
asian_restaurants[mask]

100%|██████████| 55/55 [00:27<00:00,  2.02it/s]


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M5A,43.65426,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,M5A 1H7,Chinese Restaurant,,,,
5,M4B,43.706397,-79.309937,4dd05e15ae603b786d5f1a34,977 Cafe,43.706999,-79.310286,,M4B 2V7,Asian Restaurant,1.0,2.0,,0.0
12,M6B,43.709577,-79.445073,51a8f217498e902d972c817e,Miyako sushi,43.707717,-79.447597,,M6B 3L3,Asian Restaurant,0.0,2.0,,0.0
17,M1E,43.784535,-79.160497,506481cae4b01f36bfda395f,W Sushi,43.77567,-79.16444,235 Edinburgh rd,M1E 2P9,Japanese Restaurant,0.0,2.0,,0.0
19,M3C,43.7259,-79.340923,5686b77338faf7478eb6c6aa,Asian Legend,43.726591,-79.342188,,M3C 2H2,Dim Sum Restaurant,4.0,2.0,6.5,13.0
36,M1G,43.770992,-79.216917,4b64765ff964a52028b52ae3,Lucky Hakka,43.76247,-79.214164,"3774 Lawrence Avenue East,",M1G 1R6,Chinese Restaurant,9.0,1.0,6.6,4.0
40,M4G,43.70906,-79.363452,4bf2c38577b4c92887a26a1c,ND sushi and grill,43.711486,-79.363887,101-214 laird dr,M4G 3W2,Sushi Restaurant,4.0,2.0,,1.0
42,M4G,43.70906,-79.363452,4f9483bce4b0ab5f0acfe249,Mikado,43.70924,-79.36398,,M4G 3W2,Japanese Restaurant,0.0,2.0,,0.0
46,M6G,43.669542,-79.422564,4d2d046a853ff04de86ec5da,Gobo sushi,43.670783,-79.421287,,M6G 3B9,Japanese Restaurant,3.0,2.0,,2.0
47,M6G,43.669542,-79.422564,4f063b710e61b14c291f5fe6,Bungeoppang Stall,43.668123,-79.420242,"PAT Central, 675 Bloor Street",M6G 3B9,Korean Restaurant,1.0,2.0,,0.0


In [18]:
# Fix the lat, long of the Neighborhoods
pd.reset_option("display.max_rows")
fix_lat_long = fsa_df.set_index('PostalCode')[ ['Latitude', 'Longitude'] ]
asian_restaurants = asian_restaurants.join(fix_lat_long, on='Neighborhood')
asian_restaurants = asian_restaurants.drop(columns=[ 'Neighborhood Latitude', 'Neighborhood Longitude'])
asian_restaurants = asian_restaurants.rename(columns={
    'Latitude': 'Neighborhood Latitude',
    'Longitude': 'Neighborhood Longitude'
})
cols = asian_restaurants.columns.to_list()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
asian_restaurants = asian_restaurants[cols]
asian_restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M5A,43.654260,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,M5A 1H7,Chinese Restaurant,,,,
1,M5S,43.662696,-79.400049,5ab3d9f875a6ea3a7ddc4d2b,Thai Express,43.661630,-79.387340,76 Grenville St,M5S 1B2,Thai Restaurant,0,2,6.4,0
2,M5S,43.662696,-79.400049,59a86be58d1070397a5101be,Sushi Shop,43.661620,-79.387636,"76 Grenville St, Woman's College Hospital",M5S 1B2,Sushi Restaurant,0,2,,1
3,M1B,43.806686,-79.194353,4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,43.800982,-79.200233,3-1158 Morningside Ave,M1B 3A4,Chinese Restaurant,1,1,,1
4,M3B,43.745906,-79.352188,53bafb4f498eb927faa3cd9e,Matsuda Japanese Cuisine & Teppanyaki,43.745494,-79.345821,1300 Don Mills Rd #2,M3B 2W6,Japanese Restaurant,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242,M8X,43.653654,-79.506944,4aee0654f964a5206ad121e3,Sushi 2 Go,43.647875,-79.509427,2976 Bloor Street West,M8X 1B9,Sushi Restaurant,7,2,6.5,2
243,M8X,43.653654,-79.506944,4b11a52ef964a5204b8123e3,Momiji Sushi Bar & Grill,43.647843,-79.508534,2955 Bloor St. W.,M8X 1B8,Sushi Restaurant,14,2,7.8,36
244,M4Y,43.665860,-79.383160,4e36063c8877beb5e9b29c87,Bowl,43.665443,-79.382027,,M4Y 1H1,Asian Restaurant,0,2,,0
245,M4Y,43.665860,-79.383160,5c7a1d6f5bc27d00254a87e1,Dakgogi,43.665093,-79.383521,25 Wellesley St E,M4Y 2S9,Korean Restaurant,0,2,,0


In [19]:
# Check for the dupes
asian_restaurants.loc[asian_restaurants.duplicated(subset='Venue ID', keep=False)].sort_values(by='Venue ID')

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes


In [20]:
# Final cleaned dataframe
asian_restaurants.drop_duplicates(ignore_index=True, inplace=True)
asian_restaurants

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue ID,Venue,Venue Latitude,Venue Longitude,Venue Address,Venue Postal Code,Venue Category,Venue Tip Count,Venue Price Tier,Venue Rating,Venue Likes
0,M5A,43.654260,-79.360636,4f73a473e4b0c1f445d21c78,Huayu Kitchen,43.654148,-79.357826,,M5A 1H7,Chinese Restaurant,,,,
1,M5S,43.662696,-79.400049,5ab3d9f875a6ea3a7ddc4d2b,Thai Express,43.661630,-79.387340,76 Grenville St,M5S 1B2,Thai Restaurant,0,2,6.4,0
2,M5S,43.662696,-79.400049,59a86be58d1070397a5101be,Sushi Shop,43.661620,-79.387636,"76 Grenville St, Woman's College Hospital",M5S 1B2,Sushi Restaurant,0,2,,1
3,M1B,43.806686,-79.194353,4c706524df6b8cfab244b84d,Charley's Exotic Cuisine,43.800982,-79.200233,3-1158 Morningside Ave,M1B 3A4,Chinese Restaurant,1,1,,1
4,M3B,43.745906,-79.352188,53bafb4f498eb927faa3cd9e,Matsuda Japanese Cuisine & Teppanyaki,43.745494,-79.345821,1300 Don Mills Rd #2,M3B 2W6,Japanese Restaurant,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242,M8X,43.653654,-79.506944,4aee0654f964a5206ad121e3,Sushi 2 Go,43.647875,-79.509427,2976 Bloor Street West,M8X 1B9,Sushi Restaurant,7,2,6.5,2
243,M8X,43.653654,-79.506944,4b11a52ef964a5204b8123e3,Momiji Sushi Bar & Grill,43.647843,-79.508534,2955 Bloor St. W.,M8X 1B8,Sushi Restaurant,14,2,7.8,36
244,M4Y,43.665860,-79.383160,4e36063c8877beb5e9b29c87,Bowl,43.665443,-79.382027,,M4Y 1H1,Asian Restaurant,0,2,,0
245,M4Y,43.665860,-79.383160,5c7a1d6f5bc27d00254a87e1,Dakgogi,43.665093,-79.383521,25 Wellesley St E,M4Y 2S9,Korean Restaurant,0,2,,0


In [21]:
# Save the data to file
with open('asian_restaurants.pkl', 'wb') as f:
    pickle.dump(asian_restaurants, f, pickle.HIGHEST_PROTOCOL)