# Categorising Toronto food and drink establishments according to predicted price brackets

## Introduction
_____________________
The aim of this project is to enrich Toronto's neighbourhood venue data with price estimates, using machine learning techniques learned in the courses of the IBM Data Science Professional Certification. Indicative pricing for eateries around the world is available on platforms such as [TripAdvisor](https://www.Tripadvisor.co.uk), and it could also be mined from menus, either from the venues' websites directly or through the Premium API calls on [Foursquare](https://www.foursquare.com/). I have not used these options for the following reasons:
1. It may involve proprietary data and/or algorithms (in the case of TripAdvisor)
2. It involves premium calls, which are limited in number
3. It is too tedious and time consuming to mine data from individual menus
4. Most importantly, what is the fun in that!

With this project I hope to categorise food and drink establishments in Toronto, namely restaurants, pubs, coffee shops, cafes and dessert shops, into $, $$ and $$$ categories, which stand for bargain, medium-priced and premium outlets. Having such information alongside a venue's location (courtesy of Foursquare) would benefit consumers looking for eateries, as they can first check whether a venue fits their budget or occasion before delving deeper and reading reviews about the place. Travel and tourism websites such as TripAdvisor, tourist information services, hotel price aggregators, home delivery mobile applications, etc. could be interested in such a product.

## Data and assumptions
__________________________________
I searched the internet for data to support the development of such a tool and chanced upon this [website](https://open.toronto.ca/), which has a wide variety of data available on Toronto and the surrounding areas. The following datasets have been sourced from this website:
1. [Neighbourhoods](https://open.toronto.ca/dataset/neighbourhoods/) - containing WGS84 coordinates (latitude and longitude) for all the 140 neighbourhoods in Toronto
2. [Neighbourhood profiles 2016](https://open.toronto.ca/dataset/neighbourhood-profiles/) - containing demographic information including neighbourhood-wise data spread across various ethnicities and age-groups relating to population, education, income, benefits claimed, etc.  
  
A snippet of these datasets is shown below:

In [20]:
import pandas as pd
import numpy as np

In [21]:
df_geo = pd.read_csv('Neighbourhoods.csv')
df_geo = df_geo[['AREA_NAME', 'LATITUDE', 'LONGITUDE']]
df_geo.sort_values(by=['AREA_NAME'], axis=0, inplace=True)
df_geo.reset_index(inplace=True, drop=True)
df_geo.head()

Unnamed: 0,AREA_NAME,LATITUDE,LONGITUDE
0,Agincourt North (129),43.805441,-79.266712
1,Agincourt South-Malvern West (128),43.788658,-79.265612
2,Alderwood (20),43.604937,-79.541611
3,Annex (95),43.671585,-79.404001
4,Banbury-Don Mills (42),43.737657,-79.349718


In [22]:
df_demog = pd.read_csv('neighbourhood-profiles-2016-csv.csv')
df_demog.drop(['_id', 'Data Source', 'Category', 'Topic', 'City of Toronto'], axis=1, inplace=True)
df_demog.head()

Unnamed: 0,Characteristic,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Neighbourhood Number,129,128,20,95,42,34,76,52,49,...,37,7,137,64,60,94,100,97,27,31
1,TSNS2020 Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,"Population, 2016",29113,23757,12054,30526,27695,15873,25797,21396,13154,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,"Population, 2011",30279,21988,11904,29177,26918,15434,19348,17671,13530,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,Population Change 2011-2016,-3.90%,8.00%,1.30%,4.60%,2.90%,2.80%,33.30%,21.10%,-2.80%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%


There isn't much to say about the first dataset, apart from that it complements the main dataset (\#2) by giving the latitude and longitude values for each neighbourhood in the latter. This dataset will be handy when querying Foursquare, as both the geocoder and geocode packages proved wholly unreliable in my trials.

The fields of interest for this study are contained in the 2016 profiles dataset and are shown below:

In [23]:
columns = [7, 2354] # corresponds to population density (people per sq KM) and average after-tax income
df_demog = df_demog.iloc[columns]
# We are going to transpose this dataset in a minute. So renaming the column name in preparation
df_demog.rename(columns={"Characteristic":"Neighbourhood"}, inplace=True)
df_demog.set_index('Neighbourhood', inplace=True)
df_demog

Neighbourhood,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
Population density per square kilometre,3929,3034,2435,10863,2775,3377,14097,4195,3240,4209,...,5820,4007,4345,7838,6722,8541,7162,10708,2086,2451
After-tax income: Average amount ($),26955,27928,39159,80138,51874,37927,43427,41440,38196,85678,...,36093,36713,27341,44594,39565,43054,65356,80555,26651,32904


Now, transposing the dataset to align it with the geographical dataset, casting the data from object to integer and simplifying the column names for ease of use, we get:

In [24]:
df_demog = df_demog.T
columns = ['Population density', 'Average after-tax income']
df_demog.columns = columns
for col in columns:
    df_demog[col]=df_demog[col].str.replace(',','')
    df_demog[col]=df_demog[col].astype(int)
df_demog.head()

Unnamed: 0,Population density,Average after-tax income
Agincourt North,3929,26955
Agincourt South-Malvern West,3034,27928
Alderwood,2435,39159
Annex,10863,80138
Banbury-Don Mills,2775,51874


We are now in a position to get some descriptive statistics about this dataset:

In [25]:
df_demog.describe()

Unnamed: 0,Population density,Average after-tax income
count,140.0,140.0
mean,6261.135714,43346.564286
std,4840.359075,24094.18223
min,1040.0,23786.0
25%,3595.25,29305.75
50%,5071.5,36538.5
75%,7621.25,44805.75
max,44321.0,193454.0


## Assumptions
______________________
The following important assumptions have been made:
1. Average after-tax income is a good indicator of disposable income, which in turn determines how much people are willing to spend on dining and drinking out. A more accurate indicator of disposable income would have been median after-tax income minus costs (including rental and living costs). But since that data wasn't available, I settled for average after-tax income.
2. It is a fair assumption (and this dataset does support it) that well-to-do neighbourhoods have high average after-tax income and low population density. Poorer neighbourhoods, in contrast, have low average after-tax income and higher population density.
3. Another important assumption is that establishments tend to serve their local population closely. Therefore bargain or cheaper eats (or watering holes) are found nearer poorer neighbourhoods and premium outlets near rich neighbourhoods. I don't have data to support this assumption. The point of this project is to hypothesise so and then corroborate or reject it based on TripAdvisor data or Foursquare premium API calls. However, this validation is not done rigorously. Sample data with both supporting and contradicting points may be showcased.

Because of the assumptions made above (a severe one in the case of \#1), I have decided to classify establishments into just three brackets, $, $$ and $$$, corresponding to bargain, medium-priced and premium-priced respectively. Also, note that since this sort of classification doesn't apply to certain venues such as parks, playgrounds, etc., I have restricted myself to restaurants, pubs, cafes, coffee and dessert shops.

Evidence for assumption \#2 is given below. Notice how neighbourhoods with an average income above 100K have a markedly lower mean population density than those with an average income below 40K (about 4,058 versus about 5,948 people per square kilometre).

In [30]:
df_demog[df_demog["Average after-tax income"] > 100000].mean()

Population density            4057.6
Average after-tax income    139513.0
dtype: float64

In [32]:
df_demog[df_demog["Average after-tax income"] < 40000].mean()

Population density           5947.588235
Average after-tax income    31296.152941
dtype: float64

## Approach
**Step 1**: The above data set will be scaled and clustered using K-Means clustering into 3 clusters. Results will be inspected to ensure that we have cluster means corresponding to bargain, medium and premium prices. If these are found, I will introduce a new column called _Price estimate_ which contains one of $, $$, $$$ as values for each Toronto neighbourhood.  

**Step 2**: The 10 most commonly found venues in each neighbourhood are queried using Foursquare Regular API calls. Venues of type restaurant, pub, cafe, coffee shop and dessert shop are kept, and the $, $$, $$$ labels from above are applied to these establishments based on their neighbourhoods. 

**Step 3**: A few of the establishments will be put through a TripAdvisor search and some qualitative comparisons will be made. Again, this exercise is primarily for quenching curiosity rather than for formal validation or for improving the predictions. 

Concluding remarks and recommendations will be made based on findings in step \#3. 
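
To make Step 2 more concrete, here is a minimal sketch of the filtering and labelling logic. It does not call Foursquare: the `venues` rows are hypothetical placeholders standing in for an API response, the keyword list is my own simplistic assumption, and the two price estimates are taken from the results of this report.

```python
import pandas as pd

# Hypothetical venue rows standing in for a Foursquare response (not real data)
venues = pd.DataFrame({
    "Neighbourhood": ["Annex", "Annex", "Alderwood", "Alderwood"],
    "Venue": ["Venue A", "Venue B", "Venue C", "Venue D"],
    "Venue category": ["Italian Restaurant", "Park", "Coffee Shop", "Playground"],
})

# Neighbourhood-level price estimates, as found for these two neighbourhoods in this report
price_by_neighbourhood = pd.Series({"Annex": "$$$", "Alderwood": "$$"})

# Assumed keyword filter for food and drink establishments
KEYWORDS = ["restaurant", "pub", "cafe", "café", "coffee", "dessert"]
pattern = "|".join(KEYWORDS)

# Keep only the categories that mention one of the keywords (case-insensitive)
is_food_or_drink = venues["Venue category"].str.contains(pattern, case=False, regex=True)
food_venues = venues[is_food_or_drink].copy()

# Each remaining venue inherits the price estimate of its neighbourhood
food_venues["Price estimate"] = food_venues["Neighbourhood"].map(price_by_neighbourhood)
print(food_venues)
```

A plain substring match is crude (for example, "pub" would also match a category containing "Public"), so a real run would probably need a curated list of category names instead.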

### Finding the _Price estimate_ clusters
#### Scaling the features

In [35]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_demog)
X = scaler.transform(df_demog)
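
As a quick illustration of what the scaling step does: with its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using x' = (x - min) / (max - min). The sketch below reproduces that by hand for two neighbourhoods, using the column minima and maxima reported by `describe()` earlier. It is only meant to show the arithmetic, not to replace the scaler.

```python
import numpy as np

# Column minima and maxima as reported by df_demog.describe() above
# (order: population density, average after-tax income)
col_min = np.array([1040.0, 23786.0])
col_max = np.array([44321.0, 193454.0])

# Two rows from the dataset: Agincourt North and Annex
rows = np.array([[3929.0, 26955.0],
                 [10863.0, 80138.0]])

# Min-max scaling applied column by column
scaled = (rows - col_min) / (col_max - col_min)
print(scaled.round(3))
```

Because the maxima are far above the bulk of the data (the densest neighbourhood has 44321 people per square kilometre against a median of about 5072), most scaled values end up close to 0.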

#### Finding the _Price estimate_ clusters
Instead of using the elbow method to find the optimal number of clusters, I am fixing the number of clusters to 3 to support the above narrative of bargain, medium and premium price brackets. If you are curious, the elbow method suggests that either 2 or 3 clusters is a good choice in this case. I have left the code and graph out of the main flow to keep the report readable.  
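
For readers who want to try the elbow check themselves, it could be sketched roughly as follows. The helper name `elbow_inertias` is my own, and the `demo` array is only a stand-in built from ten rows printed in this report. In the notebook the scaled matrix `X` would be passed instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(X, k_values, seed=0):
    """Return the K-Means inertia (within-cluster sum of squares) for each k."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias.append(model.inertia_)
    return inertias

# Stand-in data: (population density, average after-tax income) for ten
# neighbourhoods printed in this report, not the full 140-row dataset
demo = np.array([
    [1040, 193454], [5683, 115033], [4380, 142627], [4685, 111586], [4500, 134865],
    [3929, 26955], [3034, 27928], [2435, 39159], [3377, 37927], [3240, 38196],
], dtype=float)

# Min-max scale the stand-in data so both columns carry comparable weight
demo = (demo - demo.min(axis=0)) / (demo.max(axis=0) - demo.min(axis=0))

k_values = range(1, 6)
inertias = elbow_inertias(demo, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 4))
```

Plotting `inertias` against `k_values` and looking for the point where the curve flattens gives the usual elbow picture.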

In [42]:
from sklearn.cluster import KMeans
# Number of clusters fixed to 3 to be able to fit bargain, medium and premium priced brackets
K = 3

kmeanModel = KMeans(n_clusters=K)
kmeanModel.fit(X)

# Inverse transforming the centers to original coordinate system for better interpretability
cluster_centers = scaler.inverse_transform(kmeanModel.cluster_centers_)
print(cluster_centers)
print(kmeanModel.labels_)

[[  4790.85714286 101459.57142857]
 [  5536.93277311  36850.17647059]
 [ 21513.14285714  37559.14285714]]
[1 1 1 0 1 1 2 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 2 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1
 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 0 2 1 1 1 0 1 1 1 1 1 1
 0 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1]


From the above results, the cluster centers, given as (population density, average after-tax income), and the labels can be interpreted as follows:

| Label | Cluster center | Neighbourhood category|
| ------------|:-----:|:-------|
| 0 | (4791, 101460) | Affluent|
| 1 | (5537, 36850) | Middle income|
| 2 | (21513, 37559) | Low income|


Assigning the _Price estimate_ based on these below:

In [43]:
# Label 0 (affluent) gets $$$, label 1 gets $$ and label 2 gets $, following the table above
df_demog['Price estimate'] = ["$" * (3 - label) for label in kmeanModel.labels_]
df_demog.head()

Unnamed: 0,Population density,Average after-tax income,Price estimate
Agincourt North,3929,26955,$$
Agincourt South-Malvern West,3034,27928,$$
Alderwood,2435,39159,$$
Annex,10863,80138,$$$
Banbury-Don Mills,2775,51874,$$
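
One caveat: K-Means numbers its clusters arbitrarily, so a mapping based on the label value only holds for the run shown here. A more defensive variant could derive the bracket from the cluster centres themselves. The sketch below does that under a rule I am assuming from the interpretation table: the centre with the highest income is premium, and of the remaining two the denser one is bargain. The function name `price_brackets` is hypothetical.

```python
import numpy as np

def price_brackets(centers):
    """Map each cluster label to "$", "$$" or "$$$" based on its centre.

    centers: array of shape (3, 2), columns (population density, income).
    Assumed rule: highest income -> "$$$"; of the other two, the denser -> "$".
    """
    centers = np.asarray(centers, dtype=float)
    premium = int(np.argmax(centers[:, 1]))
    rest = [i for i in range(len(centers)) if i != premium]
    bargain = max(rest, key=lambda i: centers[i, 0])
    medium = next(i for i in rest if i != bargain)
    return {premium: "$$$", medium: "$$", bargain: "$"}

# Cluster centres as printed above, already in the original units
centers = np.array([[4790.85714286, 101459.57142857],
                    [5536.93277311, 36850.17647059],
                    [21513.14285714, 37559.14285714]])

mapping = price_brackets(centers)
print(mapping)
```

With such a mapping, the column could be filled with `[mapping[int(label)] for label in kmeanModel.labels_]`, and the result would not depend on how a particular run happens to number the clusters.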


A couple of sanity checks:

In [44]:
df_demog[df_demog["Average after-tax income"] > 100000]

Unnamed: 0,Population density,Average after-tax income,Price estimate
Bridle Path-Sunnybrook-York Mills,1040,193454,$$$
Casa Loma,5683,115033,$$$
Forest Hill South,4380,142627,$$$
Lawrence Park South,4685,111586,$$$
Rosedale-Moore Park,4500,134865,$$$


In [45]:
df_demog[df_demog["Average after-tax income"] < 40000].head()

Unnamed: 0,Population density,Average after-tax income,Price estimate
Agincourt North,3929,26955,$$
Agincourt South-Malvern West,3034,27928,$$
Alderwood,2435,39159,$$
Bathurst Manor,3377,37927,$$
Bayview Woods-Steeles,3240,38196,$$


All looks as expected. Let us examine the count of neighbourhoods of each type:

In [46]:
df_demog.groupby(['Price estimate']).count()

Price estimate,Population density,Average after-tax income
$,7,7
$$,119,119
$$$,14,14


#### Visualizing the clusters
Merging the two datasets for plotting and downstream work

In [51]:
#First removing the Area Code from Area name
df_geo.replace({'AREA_NAME': r'\(.*\)$'}, {'AREA_NAME': ''}, regex=True, inplace=True)
df_geo.head()


Unnamed: 0,AREA_NAME,LATITUDE,LONGITUDE
0,Agincourt North,43.805441,-79.266712
1,Agincourt South-Malvern West,43.788658,-79.265612
2,Alderwood,43.604937,-79.541611
3,Annex,43.671585,-79.404001
4,Banbury-Don Mills,43.737657,-79.349718


In [81]:
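#Strip leading and trailing whitespace from the area names, then reuse them as the index of df_demog
#Note: this assignment is positional, so it assumes both datasets list the neighbourhoods in the same (alphabetical) order
df_geo['AREA_NAME'] = df_geo['AREA_NAME'].str.strip()
df_demog.index = df_geo['AREA_NAME']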

In [89]:
df = df_demog.join(df_geo.set_index('AREA_NAME'))
df.reset_index(inplace=True)
df.rename(columns={'AREA_NAME':'Neighbourhood', 'LATITUDE':'Latitude', 'LONGITUDE':'Longitude'}, inplace=True)
df['Label'] = kmeanModel.labels_
df.head()

Unnamed: 0,Neighbourhood,Population density,Average after-tax income,Price estimate,Latitude,Longitude,Label
0,Agincourt North,3929,26955,$$,43.805441,-79.266712,1
1,Agincourt South-Malvern West,3034,27928,$$,43.788658,-79.265612,1
2,Alderwood,2435,39159,$$,43.604937,-79.541611,1
3,Annex,10863,80138,$$$,43.671585,-79.404001,0
4,Banbury-Don Mills,2775,51874,$$,43.737657,-79.349718,1


In [86]:
df[df['Latitude'].isnull()]


Unnamed: 0,Neighbourhood,Population density,Average after-tax income,Price estimate,Latitude,Longitude
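
Since the index of `df_demog` was overwritten with the names from `df_geo`, the null check above cannot reveal a name mismatch between the two sources. An alternative, sketched here on three rows copied from the tables in this report, is to merge on the cleaned name and let pandas flag unmatched rows through `indicator=True`. The frame names `geo` and `demog` are placeholders for this example only.

```python
import pandas as pd

# A few rows copied from the two tables shown in this report
geo = pd.DataFrame({
    "AREA_NAME": ["Agincourt North (129)", "Alderwood (20)", "Annex (95)"],
    "LATITUDE": [43.805441, 43.604937, 43.671585],
    "LONGITUDE": [-79.266712, -79.541611, -79.404001],
})
demog = pd.DataFrame(
    {"Population density": [3929, 2435, 10863],
     "Average after-tax income": [26955, 39159, 80138]},
    index=pd.Index(["Agincourt North", "Alderwood", "Annex"], name="Neighbourhood"),
)

# Drop the trailing "(number)" and any surrounding whitespace from the area names
geo["Neighbourhood"] = (
    geo["AREA_NAME"].str.replace(r"\s*\(\d+\)\s*$", "", regex=True).str.strip()
)

# Outer merge on the name; the _merge column records where each row was found
merged = demog.reset_index().merge(
    geo.drop(columns="AREA_NAME"), on="Neighbourhood", how="outer", indicator=True
)

# Rows present in only one of the two sources would show up here
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

An empty `unmatched` frame then means that every neighbourhood name was found in both datasets.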


#### Visualising the neighbourhoods as Choropleth maps

In [61]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [117]:
import folium

toronto_geo = r'Neighbourhoods.geojson' # geojson file

# create a map centred on Toronto
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11, tiles='Mapbox Bright')
# generate a choropleth map of the cluster label assigned to each neighbourhood
# (Map.choropleth belongs to older folium releases; newer ones provide folium.Choropleth(...).add_to(toronto_map))
toronto_map.choropleth(
    geo_data=toronto_geo,
    data=df,
    columns=['Neighbourhood', 'Label'],
    key_on='feature.properties.AREA_NAME',
    fill_color='YlOrRd', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Average after-tax income clusters in Toronto (0-High, 1-Medium and 2-Low)'
)

# display map
toronto_map

![Choropleth map of the income clusters in Toronto](https://github.com/suryam08/Coursera_Capstone/blob/master/choropleth.JPG)