<h1>Looking for a place to open a cafe in Toronto</h1>

<h2>Introduction</h2>

Imagine that you are a successful business owner with several cafes in Toronto, and you want to expand into other parts of the city. Let's find the best places to open a new cafe.

To do that we need:

<b>Number of cafes in each neighbourhood</b>

<b>Distance between each neighbourhood and the city center</b>

<b>Average rating of cafes in each neighbourhood</b>


<h2>Data requirements</h2>

We will look for places near the city center, where we expect more people who want to sit in a cafe. To limit competition, we will look for neighbourhoods with a low number of existing cafes and a low average rating among them. We will use the data on Toronto neighbourhoods from Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The Foursquare API will be used to get the number of cafes and their ratings.

<h2>Methodology</h2>

Here we will get the data, clean it, and shape it into the dataframe that we will use in the analysis.

<h3>Creating pandas dataframe</h3>

In [1]:
# Import Libraries

# library to handle data in a vectorized manner
import numpy as np 

# library for data analysis
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# library to handle JSON files
import json 

# For Latitudes and Longitudes
!conda install -c conda-forge geopy --yes 
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# library to handle requests
import requests 

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#For Map
!conda install -c conda-forge folium=0.5.0 --yes 
# map rendering library
import folium 

#For Tables
!pip install lxml
import lxml

import seaborn as sns

import csv
! pip install BeautifulSoup4
from bs4 import BeautifulSoup
import itertools
import os

print('Libraries imported.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


Libraries imported.


In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

df=pd.read_html(url, header=0)[0]

df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


<h2>Data cleaning</h2>

Let's replace "Not assigned" in the Borough column with NaN

In [3]:
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Drop the rows that have no borough

In [4]:
df.dropna(inplace=True)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


Where the neighbourhood is still "Not assigned", let's set it to the name of the borough

In [5]:
for i in range(len(df)):
    if df.iloc[i,2]=='Not assigned':
        df.iloc[i,2]=df.iloc[i,1]

See the result

In [6]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
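
The row-by-row loop above can also be written as a single vectorized assignment. The following is a minimal sketch on a two-row toy frame, assuming the same column names as the scraped table. The `sample` data and the `missing` mask are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows only; the values mirror entries from the table above
sample = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# Boolean mask of rows whose neighbourhood is still unassigned
missing = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into those rows only
sample.loc[missing, "Neighbourhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Selecting by label with `.loc` also avoids depending on the column positions that `iloc` requires.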


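Let's group by postal code

In [7]:
# groupby returns a new object and leaves df unchanged;
# the actual aggregation is done in the next cell
df.groupby(['Postcode'])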
df.head(50)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Queen's Park,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North


Now we will combine the neighbourhoods that share a postcode into one row and remove repeated borough names

In [8]:
df=df[['Postcode','Borough','Neighbourhood']].groupby('Postcode',as_index=False).agg(','.join)
col=df['Borough'].str.split(',').apply(set).str.join(',')
df.update(col)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
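
To see what the aggregation does on a small scale, here is a minimal sketch with named aggregations on three toy rows. It assumes that every postcode belongs to a single borough, so taking the first borough is enough. The notebook instead joins all borough names and then deduplicates them with a set. The `rows` and `merged` names are illustrative.

```python
import pandas as pd

# Toy rows only; two neighbourhoods share the postcode M1B
rows = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# One row per postcode: keep the borough, join the neighbourhood names
merged = rows.groupby("Postcode", as_index=False).agg(
    Borough=("Borough", "first"),
    Neighbourhood=("Neighbourhood", ",".join),
)
print(merged)
```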


In [9]:
df.shape

(103, 3)

<h2>Geocoder</h2>

Now that we have the dataframe, let's find the latitude and longitude of each postal code

Let's import the pgeocode library and start looking up latitudes and longitudes

In [10]:
!pip install pgeocode
import pgeocode as pgeo



In [11]:
# create a geocoder object for Canada
nomi=pgeo.Nominatim("ca")

# query a single postal code
nomi.query_postal_code("M1B")

postal_code                                       M1B
country code                                       CA
place_name        Scarborough (Malvern / Rouge River)
state_name                                    Ontario
state_code                                         ON
county_name                               Scarborough
county_code                                       NaN
community_name                                    NaN
community_code                                    NaN
latitude                                      43.8113
longitude                                     -79.193
accuracy                                            6
Name: 0, dtype: object

In [12]:
# save the output of the query in a variable
a=nomi.query_postal_code("M1C")
# convert to pandas dataframe
df_2=pd.DataFrame(a)
df_2.head()

Unnamed: 0,0
postal_code,M1C
country code,CA
place_name,Scarborough (Rouge Hill / Port Union / Highlan...
state_name,Ontario
state_code,ON


Let's see the result

In [13]:
print("Latitude  :" + str(a.latitude))
print("Longitude :" + str(a.longitude))

Latitude  :43.7878
Longitude :-79.1564


Now we will add additional columns to our main dataframe

In [14]:
df.insert(3, "Latitude","")
df.insert(4, "Longitude","")

In [15]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",,
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,
5,M1J,Scarborough,Scarborough Village,,
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",,
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",,
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",,
9,M1N,Scarborough,"Birch Cliff,Cliffside West",,


Now let's add the latitudes and longitudes to the dataframe

In [16]:
for i in range(len(df)):
    A=nomi.query_postal_code(df.iloc[i,0])
    df.iloc[i,3]=A.latitude
    df.iloc[i,4]=A.longitude

Let's see what we got

In [17]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.8113,-79.193
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.7878,-79.1564
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7678,-79.1866
3,M1G,Scarborough,Woburn,43.7712,-79.2144
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389
5,M1J,Scarborough,Scarborough Village,43.7464,-79.2323
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.7298,-79.2639
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.7122,-79.2843
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.7247,-79.2312
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.6952,-79.2646


<h2>Using the Foursquare API</h2>

Here we will start getting information about existing cafes in different Toronto neighbourhoods

Let's start by defining a function which returns the latitude and longitude for a postal code

In [18]:
def get_coordinates(postal_code):
    nomi=pgeo.Nominatim("ca") #ca: Canada
    info=nomi.query_postal_code(postal_code)
    return [info.latitude , info.longitude]

We will use the CN Tower as Toronto's city center. Let's get its coordinates, approximated by the coordinates of its postal code area, M5V

In [19]:
CN_Tower = get_coordinates("M5V")
print("Coordinates of CN_Tower : ",CN_Tower)

Coordinates of CN_Tower :  [43.6404, -79.3995]


The next step is to define functions which convert longitudes and latitudes into Cartesian (UTM) coordinates and compute the distance between two points

In [20]:
#!conda install -c conda-forge shapely
#!pip install shapely
import shapely.geometry

import math

!pip install pyproj
import pyproj

def lonlat_to_xy(lon,lat):
    proj_latlon=pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm',zone=33,datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x,y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x,y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
    return dist
    
print('Coordinate transformation check')
print('-------------------------------')
print('CN Tower longitude={}, latitude={}'.format(CN_Tower[1], CN_Tower[0]))
x, y = lonlat_to_xy(CN_Tower[0],CN_Tower[1])
print('CN Tower UTM X={}, Y={}'.format(x,y))
la,lo=xy_to_lonlat(x,y)
print('CN Tower longitude={}, latitude={}'.format(lo,la))

Coordinate transformation check
-------------------------------
CN Tower longitude=-79.3995, latitude=43.6404
CN Tower UTM X=1065456.1759511982, Y=-8956640.641693817
CN Tower longitude=-79.39949999999999, latitude=43.640399999999964
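
A note on the check above: the helpers hard-code `zone=33`, which covers central Europe, and `CN_Tower` is stored as [latitude, longitude] while `lonlat_to_xy` expects the longitude first. Both probably contribute to the implausible UTM values printed. The distances used later in this notebook are computed with geopy, so they are not affected. As a minimal sketch, assuming the standard 6-degree UTM zones (the Norway and Svalbard exceptions are ignored), the zone can be derived from the longitude. The function names `utm_zone` and `utm_epsg` are hypothetical and not part of the notebook.

```python
import math

def utm_zone(longitude_deg):
    """Return the standard 6-degree UTM zone number (1-60) for a longitude."""
    return int(math.floor((longitude_deg + 180.0) / 6.0)) % 60 + 1

def utm_epsg(longitude_deg, latitude_deg):
    """Return the EPSG code of the WGS84 / UTM system covering the point."""
    # 326xx are the northern-hemisphere zones, 327xx the southern ones
    base = 32600 if latitude_deg >= 0 else 32700
    return base + utm_zone(longitude_deg)

# CN Tower coordinates as printed by the notebook (latitude, longitude)
cn_tower_lat, cn_tower_lon = 43.6404, -79.3995
zone = utm_zone(cn_tower_lon)
epsg = utm_epsg(cn_tower_lon, cn_tower_lat)
print('UTM zone={}, EPSG={}'.format(zone, epsg))
```

With a recent pyproj, a code obtained this way could be passed to `pyproj.Transformer.from_crs("EPSG:4326", "EPSG:32617", always_xy=True)`, whose `transform` method then takes the longitude first.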


<h2>Creating the map</h2>

Here we will plot our data on a map of Toronto

In [21]:
map_Toronto = folium.Map(location=CN_Tower, zoom_start=12)
folium.Marker(CN_Tower, popup='CN Tower').add_to(map_Toronto)
map_Toronto

In [22]:
df.dropna(inplace=True)
df.shape

(102, 5)

In [23]:
Latitude=np.array(df['Latitude'])
Longitude=np.array(df['Longitude'])
Borough=np.array(df['Borough'])

for lat,long,name in zip(Latitude, Longitude, Borough):
    folium.CircleMarker([lat, long],
                        radius=5, color='red',
                        fill=True, fill_color='red',
                        popup=name,
                        fill_opacity=0.6).add_to(map_Toronto)
map_Toronto

Let's calculate the distance from each postal code to the CN Tower

In [24]:
from geopy import distance

dist_km=[]
for i in range(len(df)):
    coordinates=[df.iloc[i,3],df.iloc[i,4]]
    dist=distance.distance(coordinates,CN_Tower).km
    dist_km.append(dist)
df.insert(5,"Distance (km)",dist_km)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Distance (km)
0,M1B,Scarborough,"Rouge,Malvern",43.8113,-79.193,25.24667
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.7878,-79.1564,25.535002
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7678,-79.1866,22.245142
3,M1G,Scarborough,Woburn,43.7712,-79.2144,20.827534
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,19.247252
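
The cell above relies on geopy, whose `distance.distance` measures along the ellipsoid. For a quick sanity check without geopy, the haversine formula gives a spherical approximation that should agree to within a fraction of a percent at these distances. This is a sketch of a different technique from the one the notebook uses. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Coordinates taken from the tables above: postcode M1B and the CN Tower (M5V)
cn_tower = (43.6404, -79.3995)
m1b = (43.8113, -79.193)
d = haversine_km(*m1b, *cn_tower)
print('M1B to CN Tower: {:.2f} km'.format(d))
```

The result should land close to the 25.2 km that geopy reports for M1B.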


<h2>API calls</h2>

Let's start making calls to Foursquare

In [37]:
CLIENT_ID=''
CLIENT_SECRET=''
VERSION='20180602'

In [26]:
def to_string(lat,lon):
    return str(lat)+','+str(lon)

In [27]:

num_cafes=[]
avg_rating=[]

for i in range(len(df)):
    # search for cafes around the coordinates of this postal code
    url='https://api.foursquare.com/v2/venues/search'
    parameters={'client_id':CLIENT_ID,
        'client_secret':CLIENT_SECRET,
        'v': VERSION,
        'll':to_string(df.iloc[i,3],df.iloc[i,4]),
        'radius':'1000',   # Search for venues in 1000 m radius (1km)
        'query':'Cafe', # Search for cafes
        'limit':'10'    # at most 10 venues are returned per call
        }
    
    r=requests.get(url,params=parameters).json()
    venues=r['response']['venues']
    data_frame=pd.DataFrame(venues)
    
    ratings=[]
    
    # request the details of every venue found to read its rating
    for j in range(len(data_frame)):
        venue_id = data_frame['id'][j]
        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'\
        .format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
        
        result = requests.get(url).json()
        
        # a venue whose details contain no rating is recorded as NaN
        try:
            ratings.append(result['response']['venue']['rating'])
        except KeyError:
            ratings.append(np.nan)
    
    a=np.array(ratings)
    
    # drop any np.nan values from the array
    a = a[~np.isnan(a)]
    
    # the mean of an empty array is NaN (NumPy emits a RuntimeWarning)
    avg_rating.append(a.mean())
    num_cafes.append(len(data_frame))

  ret = ret.dtype.type(ret / rcount)
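
The rating lookup and the averaging step can also be separated into two small helpers, which avoids the warning when a neighbourhood has no rated cafes. The following is a minimal sketch that assumes the `response` → `venue` → `rating` layout used in the loop above. The helper names `extract_rating` and `mean_rating` and the sample payloads are made up for illustration and are not real API output.

```python
import math

def extract_rating(result):
    """Return the venue rating from a details response, or None if absent."""
    return result.get("response", {}).get("venue", {}).get("rating")

def mean_rating(ratings):
    """Average the available ratings; return NaN when there are none."""
    present = [r for r in ratings if r is not None]
    return sum(present) / len(present) if present else math.nan

# Illustrative payloads only (hypothetical venues)
responses = [
    {"response": {"venue": {"id": "a", "rating": 8.0}}},
    {"response": {"venue": {"id": "b"}}},   # venue without a rating
    {"response": {}},                        # details call without a venue
    {"response": {"venue": {"id": "c", "rating": 6.0}}},
]
average = mean_rating(extract_rating(r) for r in responses)
print(average)
```

Returning NaN for an empty list keeps the result compatible with the `Average Rating` column, where missing values are handled in the analysis step.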


Next we will add the number of cafes and the average ratings to the dataframe

In [28]:
df.insert(6,"Number of cafes",num_cafes)
df.insert(7,"Average Rating",avg_rating)

See the result

In [29]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Distance (km),Number of cafes,Average Rating
0,M1B,Scarborough,"Rouge,Malvern",43.8113,-79.193,25.24667,0,
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.7878,-79.1564,25.535002,1,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7678,-79.1866,22.245142,1,
3,M1G,Scarborough,Woburn,43.7712,-79.2144,20.827534,1,
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,19.247252,5,
5,M1J,Scarborough,Scarborough Village,43.7464,-79.2323,17.899617,1,
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.7298,-79.2639,14.771578,2,
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.7122,-79.2843,12.245001,3,
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.7247,-79.2312,16.4889,1,
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.6952,-79.2646,12.467774,2,


<h2>Analysis</h2>

No ratings were returned for any neighbourhood, so we mark the missing averages as "Not Available" and check whether any rated rows remain

In [30]:
df['Average Rating'] = df['Average Rating'].replace(np.nan, 'Not Available')
df.loc[df['Average Rating'] != 'Not Available']

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Distance (km),Number of cafes,Average Rating


Let's find neighbourhoods with more than 7 cafes (the search limit of 10 caps the count at 10)

In [31]:
df.loc[df['Number of cafes'] > 7]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Distance (km),Number of cafes,Average Rating
22,M2N,North York,Willowdale South,43.7673,-79.4111,14.130379,10,Not Available
29,M3J,North York,"Northwood Park,York University",43.7694,-79.4921,16.159692,9,Not Available
36,M4C,East York,Woodbine Heights,43.6913,-79.3116,9.068859,10,Not Available
37,M4E,East Toronto,The Beaches,43.6784,-79.2941,9.492545,9,Not Available
40,M4J,East York,East Toronto,43.6872,-79.3368,7.253483,10,Not Available
41,M4K,East Toronto,"The Danforth West,Riverdale",43.6803,-79.3538,5.765507,10,Not Available
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.6693,-79.3155,7.49852,9,Not Available
43,M4M,East Toronto,Studio District,43.6561,-79.3406,5.06201,10,Not Available
45,M4P,Central Toronto,Davisville North,43.7135,-79.3887,8.168405,10,Not Available
46,M4R,Central Toronto,North Toronto West,43.7143,-79.4065,8.23011,10,Not Available


This gives us the number of cafes around each postal code. For the new cafe to succeed, we look for neighbourhoods that are closer to the center than average and have fewer existing cafes than average.

In [32]:
mean_dist = df['Distance (km)'].mean()
avg_num_cafes = df['Number of cafes'].mean()

print('Average distance to Neighborhoods : ' + str(mean_dist))
print('Average number of cafes : '+ str(avg_num_cafes))

Average distance to Neighborhoods : 10.447683555299843
Average number of cafes : 5.313725490196078


In [34]:
df.loc[(df['Distance (km)'] < mean_dist) & (df['Number of cafes'] < avg_num_cafes)]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Distance (km),Number of cafes,Average Rating
35,M4B,East York,"Woodbine Gardens,Parkview Hill",43.7063,-79.3094,10.315336,5,Not Available
38,M4G,East York,Leaside,43.7124,-79.3644,8.485618,4,Not Available
39,M4H,East York,Thorncliffe Park,43.7059,-79.3464,8.443861,4,Not Available
44,M4N,Central Toronto,Lawrence Park,43.7301,-79.3935,9.977952,4,Not Available
50,M4W,Downtown Toronto,Rosedale,43.6827,-79.373,5.163019,2,Not Available
59,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.623,-79.3936,1.990998,2,Not Available
62,M5M,North York,"Bedford Park,Lawrence Manor East",43.7335,-79.4177,10.447546,2,Not Available
63,M5N,Central Toronto,Roselawn,43.7113,-79.4195,8.040821,5,Not Available
64,M5P,Central Toronto,"Forest Hill North,Forest Hill West",43.6966,-79.412,6.325005,2,Not Available
73,M6C,York,Humewood-Cedarvale,43.6915,-79.4307,6.21019,4,Not Available
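
The same filter can be written with named boolean masks, which makes each criterion easier to read. The sketch below uses a four-row toy frame (the `toy` data and the `close`, `quiet`, and `candidates` names are illustrative). It also sorts the result by distance, which is an extra step that the notebook itself does not perform.

```python
import pandas as pd

# Toy data only; the column names match the notebook's dataframe
toy = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "Distance (km)": [2.0, 6.0, 20.0, 8.0],
    "Number of cafes": [2, 10, 1, 4],
})

# closer to the center than average
close = toy["Distance (km)"] < toy["Distance (km)"].mean()
# fewer cafes than average
quiet = toy["Number of cafes"] < toy["Number of cafes"].mean()

# keep rows meeting both criteria, nearest first
candidates = toy[close & quiet].sort_values(["Distance (km)", "Number of cafes"])
print(candidates)
```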


In [35]:
data_frame=df.loc[(df['Distance (km)'] < mean_dist) & (df['Number of cafes'] < avg_num_cafes)]
a=data_frame['Borough']
set(a)

{'Central Toronto',
 'Downtown Toronto',
 'East York',
 'Etobicoke',
 'North York',
 'York'}

In [36]:
x=data_frame['Neighbourhood']

# unique neighbourhood groups (one comma-separated string per postal code)
set(x)

{'Bedford Park,Lawrence Manor East',
 'Caledonia-Fairbanks',
 'Del Ray,Keelesdale,Mount Dennis,Silverthorn',
 'Forest Hill North,Forest Hill West',
 'Harbourfront East,Toronto Islands,Union Station',
 'Humber Bay Shores,Mimico South,New Toronto',
 "Humber Bay,King's Mill Park,Kingsway Park South East,Mimico NE,Old Mill South,The Queensway East,Royal York South East,Sunnylea",
 'Humewood-Cedarvale',
 'Lawrence Park',
 'Leaside',
 'Rosedale',
 'Roselawn',
 'The Kingsway,Montgomery Road,Old Mill North',
 'Thorncliffe Park',
 'Woodbine Gardens,Parkview Hill'}

<h2>Conclusion</h2>

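The neighbourhoods and boroughs listed above are potential areas where a new cafe is likely to succeed: they are closer to the city center than average and have fewer cafes than average. Because no ratings were available, the average rating did not contribute to the selection.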