<font color=red><b>Tom's Slow Smokes BBQ: Where in Toronto?</b></font>

<u>Introduction</u><br>
Tom has always dreamed of opening his very own barbecue restaurant. He has contracted with my company to manage the selection process for the eventual location. We've been told that his desired location is Toronto, Canada, due to family connections (and funding) available there. Tom wants to avoid stiff competition, such as opening two blocks away from a similar restaurant, but he also does not want to open in a space that is not very "hip" or "cool"; he wants to make a profit, after all.

We've worked with Tom to develop a high-level assessment of the Toronto area. First, we will look for neighborhoods with a high number of BBQ restaurants; this will tell us which neighborhoods to avoid and will also shed light on community characteristics to seek out elsewhere. We will then match these "BBQ Communities" with similar communities that do not have many BBQ restaurants, and finally present the candidate neighborhoods to Tom for final selection.

<u>Data</u><br>
The data used in this report comes from three sources:<br>
 - <b>Wikipedia</b>: Supplies the definitions of neighborhoods and their respective boroughs within Toronto
 - <b>Google Geocoder</b>: Assigns latitude and longitude coordinates to neighborhoods for use in Foursquare (NOTE: Coursera has provided latitude and longitude results for Toronto, which are used here for convenience)
 - <b>Foursquare</b>: Provides the ability to search for BBQ restaurants, as well as other community characteristics when it comes time to match

<u>Methodology</u><br>
Please see the commented Python code below for detailed methodology explanation and replicable process.

In [1]:
#Import required libraries
import numpy as np
import pandas as pd
from pandas import json_normalize # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import json
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
from bs4 import BeautifulSoup

In [2]:
#Collect Wikipedia table of Boroughs and Neighborhoods in Toronto and convert into convenient DataFrame

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')

rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # keep the non-empty cell texts; 'cell' avoids shadowing the loop variable 'tr'
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        rows.append(row)
        
df = pd.DataFrame(rows, columns=["Post Code", "Borough", "Neighborhood"])
df = df.drop(df[df['Borough']=='Not assigned'].index)
df.loc[df['Neighborhood']=='Not assigned', 'Neighborhood'] = df['Borough']

def squish(group):
     return pd.Series(dict(Borough = group['Borough'].max(), 
                        Neighborhood =  ', '.join(group['Neighborhood'])))

squished = df.groupby('Post Code').apply(squish).reset_index()
squished.head()

Unnamed: 0,Post Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
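
The `squish` helper above collapses all rows that share a postal code into one row. As a rough sketch of the same idea, the snippet below uses `groupby(...).agg(...)` on a small hand-made frame. The rows are illustrative only (not the scraped table), and it keeps the first borough per group rather than the maximum, which is an assumption that each postal code maps to a single borough.

```python
import pandas as pd

# Illustrative rows only; the real frame comes from the Wikipedia table.
toy = pd.DataFrame({
    "Post Code": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postal code: keep the first borough, join the neighborhood names.
toy_squished = (
    toy.groupby("Post Code", as_index=False)
       .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(toy_squished)
```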


In [3]:
#Append latitude & longitude values to Neighborhoods; use a Geocoder for other geographies
geocodes = pd.read_csv('http://cocl.us/Geospatial_data')
squished = squished.merge(geocodes, left_on = 'Post Code', right_on='Postal Code')
squished.head(1)

Unnamed: 0,Post Code,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
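
The merge above uses pandas' default inner join, which silently drops any postal code that has no match in the coordinates file. A minimal sketch of how one might check for such gaps, using toy frames (the code `M9Z` is made up for the demonstration):

```python
import pandas as pd

# Toy stand-ins for 'squished' and 'geocodes'; 'M9Z' is a made-up code with no match.
left = pd.DataFrame({"Post Code": ["M1B", "M1C", "M9Z"]})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join with indicator=True adds a '_merge' column that flags unmatched rows.
checked = left.merge(right, left_on="Post Code", right_on="Postal Code",
                     how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Post Code"].tolist()
print(unmatched)
```

An empty list here would mean every postal code received coordinates.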


In [4]:
#Get latitude and longitude for Toronto overall
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent='toronto_bbq_explorer') # recent geopy versions require a user_agent; this name is an arbitrary placeholder
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto are 43.653963, -79.387207.


In [5]:
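#Prepare Foursquare credentials; this step may be skipped if you are not publicly sharing your notebook
with open(r'C:\Users\Tom\Documents\foursquare_cred.txt') as file:
    lines = file.readlines()
    # strip() removes the trailing newline that readlines() keeps, which would otherwise end up in the request URL
    CLIENT_ID, CLIENT_SECRET = lines[0].strip(), lines[1].strip()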

VERSION = '20180605' # Foursquare API version
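
The credentials file is assumed to hold the client ID on its first line and the client secret on its second. A small sketch of a loader that makes that assumption explicit and fails loudly on a malformed file; `load_credentials` is a hypothetical helper, and the values written to the temporary file are placeholders, not real keys.

```python
import os
import tempfile

def load_credentials(path):
    """Read a two-line credentials file and return (client_id, client_secret)."""
    with open(path) as handle:
        # strip() drops the newline at the end of each line; blank lines are skipped
        lines = [line.strip() for line in handle if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected client id on line 1 and client secret on line 2")
    return lines[0], lines[1]

# Demonstration with a throwaway file and placeholder values (not real keys).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("EXAMPLE_ID\nEXAMPLE_SECRET\n")
demo_id, demo_secret = load_credentials(tmp.name)
os.remove(tmp.name)
print(demo_id, demo_secret)
```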

In [6]:
#We will now leverage Foursquare to provide location results (restaurants, theaters, etc.) within each Neighborhood
LIMIT = 100 # limit of number of venues returned by Foursquare API
def getNearbyVenues(names, latitudes, longitudes, radius=800):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
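
`getNearbyVenues` indexes straight into `["response"]["groups"][0]["items"]` and `['categories'][0]`, so a response with no groups or a venue with no category would raise an error. The sketch below shows one possible defensive parser. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mimic the structure the function indexes; it is not real API output.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a placeholder when a venue carries no category
        categories = venue.get("categories") or [{}]
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows

# Hand-written payload mimicking the structure indexed in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tim Hortons",
               "location": {"lat": 43.802, "lng": -79.198169},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.8, "lng": -79.2},
               "categories": []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({"response": {}}))  # a response without groups yields an empty list
```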

In [8]:
toronto_venues = getNearbyVenues(names=squished['Neighborhood'],
                                   latitudes=squished['Latitude'],
                                   longitudes=squished['Longitude']
                                  )

In [9]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Images Salon & Spa,43.802283,-79.198565,Spa
1,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
2,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.802008,-79.19808,Fast Food Restaurant
3,"Rouge, Malvern",43.806686,-79.194353,Harvey's,43.800106,-79.198258,Fast Food Restaurant
4,"Rouge, Malvern",43.806686,-79.194353,Tim Hortons,43.802,-79.198169,Coffee Shop


In [10]:
# We will first process the data and transform the categorical variables into dummy 0/1 columns in order to accommodate the
# K-Means clustering used below
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head(1)

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
#This dataframe will be aggregated by Neighborhood so that we can view the relative popularity of location types
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head(3)

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
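
To make the one-hot and group-mean steps concrete, here is a toy example with made-up neighborhoods `A` and `B`. Because each venue contributes a single 1 in its category column, the per-neighborhood mean is simply the share of venues in that category.

```python
import pandas as pd

# Made-up venues: four in neighborhood A, two in neighborhood B.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "BBQ Joint", "Park", "Park", "Park"],
})

# One 0/1 column per category, then put the neighborhood name back in front.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of 0/1 indicators is the fraction of venues in each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

In this toy frame, half of A's venues are coffee shops and a quarter are BBQ joints, while all of B's venues are parks.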


In [12]:
# We will now further process this dataframe to make it more easily interpretable, by evaluating the top location types in each
# Neighborhood.  Ultimately, this will give us context on the top BBQ communities.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']

for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third fall outside 'indicators' and take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(3)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Café,Coffee Shop,American Restaurant,Steakhouse,Hotel,Gastropub,Concert Hall,Theater,Japanese Restaurant,Bar
1,Agincourt,Chinese Restaurant,Restaurant,Motorcycle Shop,Malay Restaurant,Skating Rink,Mediterranean Restaurant,Shanghai Restaurant,Breakfast Spot,Discount Store,Sandwich Place
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Chinese Restaurant,Pizza Place,Park,Hobby Shop,Bubble Tea Shop,Caribbean Restaurant,Korean Restaurant,Noodle House,Fast Food Restaurant,BBQ Joint
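
The column labels above rely on a three-item suffix list with "th" as the fallback, which is correct for ranks 1 through 10. If `num_top_venues` were ever raised past 20, ranks such as 21 would be labelled "21th". A small sketch of a general suffix helper follows; `ordinal` and `rank_columns` are hypothetical names used only for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th through 20th (and 110th through 120th, etc.) always take 'th'
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Rebuild the same ten column labels used in the notebook.
rank_columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(rank_columns[1], "|", rank_columns[-1])
```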


In [13]:
# Before turning our attention to BBQ, we will finish clustering neighborhoods based on location characteristics

kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood') # keyword form; the positional axis argument was removed in pandas 2.0
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(toronto_grouped_clustering)
kmeans.labels_[0:10] 

array([1, 4, 4, 4, 4, 4, 4, 1, 1, 1])

In [14]:
# We bring together the cluster labels, latitude, longitude, as well as the top common venues

# .copy() gives an independent frame, so adding a column does not trigger SettingWithCopyWarning
toronto_merged = squished.loc[squished['Neighborhood'].isin(neighborhoods_venues_sorted['Neighborhood']),:].copy()
# kmeans.labels_ follows the row order of toronto_grouped (sorted by Neighborhood name), so this
# positional assignment is only reliable if toronto_merged shares that order
toronto_merged['Cluster Labels'] = kmeans.labels_
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood').reset_index().drop('index',axis=1)
toronto_merged.head(3)


Unnamed: 0,Post Code,Borough,Neighborhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353,1,Fast Food Restaurant,Coffee Shop,Auto Workshop,Spa,African Restaurant,Filipino Restaurant,Paper / Office Supplies Store,Hobby Shop,Chinese Restaurant,Construction & Landscaping
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497,4,Breakfast Spot,Italian Restaurant,Burger Joint,Bar,Women's Store,Doner Restaurant,Dim Sum Restaurant,Diner,Discount Store,Dive Bar
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711,4,Pizza Place,Coffee Shop,Fast Food Restaurant,Beer Store,Rental Car Location,Fried Chicken Joint,Sports Bar,Supermarket,Medical Center,Pharmacy
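
The cluster labels come back in the row order of `toronto_grouped`, which `groupby` sorts by neighborhood name, while `squished` is in postal-code order. One way to guarantee that each neighborhood receives its own label is to attach the labels to the frame they were computed from and then merge on the neighborhood key. The sketch below illustrates this with toy frames; the label values are hypothetical.

```python
import pandas as pd

# Toy stand-ins: 'grouped' is sorted by name (like toronto_grouped),
# 'by_postcode' keeps postal-code order (like squished).
grouped = pd.DataFrame({"Neighborhood": ["Agincourt", "Rouge, Malvern", "Woburn"]})
labels = [4, 1, 2]  # hypothetical cluster labels, one per row of 'grouped'

by_postcode = pd.DataFrame({
    "Post Code": ["M1B", "M1G", "M1S"],
    "Neighborhood": ["Rouge, Malvern", "Woburn", "Agincourt"],
})

# Attach labels to the frame they were computed from, then merge on the key.
labelled = grouped.assign(**{"Cluster Labels": labels})
merged = by_postcode.merge(labelled, on="Neighborhood", how="left")
print(merged)
```

Here Agincourt keeps label 4 even though it is last in postal-code order, which a positional assignment would not guarantee.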


In [15]:
# Investigate whether any neighborhood has "BBQ Joint" among its 1st - 10th most common venues. Checking each rank column
# manually shows that the highest rank is 8th, held by record 91 (Kingsway Park SW, Mimico NW...)
toronto_merged[toronto_merged['8th Most Common Venue']=='BBQ Joint']['Neighborhood']

91    Kingsway Park South West, Mimico NW, The Queen...
Name: Neighborhood, dtype: object
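
The check above inspects one rank column at a time. As a sketch of how the manual iteration could be automated, the loop below scans every rank column and lists the matches from best rank to worst. The frame is a toy with made-up neighborhoods `N1` to `N3` and only three rank columns.

```python
import pandas as pd

# Toy ranking table with made-up neighborhoods and three rank columns.
ranked = pd.DataFrame({
    "Neighborhood": ["N1", "N2", "N3"],
    "1st Most Common Venue": ["Park", "Diner", "Pizza Place"],
    "2nd Most Common Venue": ["Bar", "BBQ Joint", "Park"],
    "3rd Most Common Venue": ["BBQ Joint", "Bar", "Gym"],
})

# Rank columns appear in order, so their position gives the rank number.
rank_cols = [c for c in ranked.columns if c.endswith("Most Common Venue")]

hits = []
for position, col in enumerate(rank_cols, start=1):
    for idx in ranked.index[ranked[col] == "BBQ Joint"]:
        hits.append((position, int(idx), ranked.loc[idx, "Neighborhood"]))

hits.sort()  # best (lowest) rank first
print(hits)
```

The first entry of `hits` is the neighborhood where BBQ ranks highest, which plays the role of record 91 in the notebook.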

In [16]:
# Next we will construct a dataframe limited to only the cluster type belonging to record 91 (Kingsway Park SW, Mimico NW...)
# We will also drop record 91 to ensure we do not recommend a location that is already saturated with BBQ
bbq_cluster = toronto_merged.loc[91, 'Cluster Labels']
# Build the mask on the full frame so its index matches the frame being filtered (avoids the reindexing warning)
potential_locations = toronto_merged[(toronto_merged['Cluster Labels'] == bbq_cluster) & (toronto_merged.index != 91)]
potential_locations.shape


(32, 17)

In [17]:
# We will create a map to review options
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(potential_locations['Latitude'], potential_locations['Longitude'], potential_locations['Neighborhood'], potential_locations['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [18]:
# In reviewing this map with Tom, we learned of his interest to be centrally located, and not too close to the water.
# Emery & Humberlea provide a central location near the airports. We will do one last check to ensure there are limited BBQ joints
# nearby
radius=2000 #2km radius
latitude = toronto_merged[toronto_merged['Neighborhood']=='Emery, Humberlea']['Latitude'].mean()
longitude = toronto_merged[toronto_merged['Neighborhood']=='Emery, Humberlea']['Longitude'].mean()
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, 'BBQ', radius, LIMIT)

In [19]:
#Note that no results matching "BBQ" are found within a 2 km radius. It would appear we have found our winner.
results = requests.get(url).json() 
results

{'meta': {'code': 200, 'requestId': '5bb902ed1ed21942883915ed'},
 'response': {'venues': []}}
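
An empty venue list only counts as "no competition" if the request itself succeeded. The sketch below reads both the status code and the venue count from a search-style payload; `count_matches` is a hypothetical helper, and the dictionary is the response printed above, copied as a literal.

```python
def count_matches(payload):
    """Return (status_code, number_of_venues) from a venues/search-style payload."""
    code = payload.get("meta", {}).get("code")
    venues = payload.get("response", {}).get("venues", [])
    return code, len(venues)

# The response printed above, copied as a literal.
results = {'meta': {'code': 200, 'requestId': '5bb902ed1ed21942883915ed'},
           'response': {'venues': []}}

code, n_venues = count_matches(results)
print(code, n_venues)
```

A code of 200 together with zero venues supports the reading that the search ran and found nothing, rather than failing.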

<u>Results/Discussion/Conclusion</u><br>
As shown in the methodology section, the analysis points to the Emery & Humberlea neighborhood as the leading candidate for Tom's BBQ spot. This neighborhood falls in the same cluster as a neighborhood where BBQ is already popular, is located centrally within Toronto, and returned no BBQ venues within a 2 km radius. Future analysis might take into account the availability of building space and the associated prices of those locations.

For now, Tom has what he needs to inquire after this location, and he also has a number of back-up locations if his first selection does not pan out. On a personal note, I look forward to eating some BBQ from Tom just as soon as his restaurant opens!