# Coursera Applied Data Science 

## Capstone Project

### Introduction / Business Problem

Among all the factors that make or break a restaurant, location is one of the most important. A prime location matters because it leaves the entrepreneur with one less factor to worry about: customer traffic. In reality, however, only a limited number of prime locations exist in any area, which is what makes them "prime". Although many aspects make up a good location, an unsuitable one will cause unnecessary losses and, in the worst case, make the restaurant unsustainable.

This project aims to analyse what makes a good location, using the existing distribution of restaurants and of the resident population. I chose Singapore as the study location, and the analysis is based on the data that is available.

A suitable location has two characteristics. The first is adequate space for the desired restaurant business model. The second is heavy customer traffic around the location. This study revolves around the second.

One indicator of customer traffic is the current distribution of restaurants in the area. Restaurant distribution not only shows the level of competition, it also suggests that current customer traffic in the area is able to sustain these restaurants.

The second indicator is the population distribution in the area. Customer traffic is likely to be higher where population density is higher. There are exceptions, such as the Central Business District (CBD), where offices generate high traffic even though the resident population is essentially non-existent.

Besides basic maps and graphs for visualisation, I'll also use correlation to confirm the relationship between these indicators and the number of restaurants in each area. The assumption made in this study is that the number of restaurants in an area reflects how suitable that area is for a new restaurant: the more restaurants an area sustains, the better the location. This study therefore determines the relationship between the number of restaurants in an area and the other indicators, and then uses the correlation results to suggest a suitable location for the next restaurant.
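
As a rough illustration of the correlation step, the sketch below computes Pearson's r between restaurant counts and resident population per area with pandas. The area labels, column names and numbers are made-up placeholders, not the data used in this study.

```python
import pandas as pd

# Hypothetical per-area figures, for illustration only (not the study's data).
areas = pd.DataFrame(
    {
        "planning_area": ["Area A", "Area B", "Area C", "Area D", "Area E"],
        "restaurant_count": [12, 45, 30, 8, 60],
        "resident_population": [20000, 90000, 55000, 15000, 110000],
    }
)

# Pearson correlation coefficient between the two columns (ranges from -1 to 1).
r = areas["restaurant_count"].corr(areas["resident_population"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 would suggest that areas with more residents also tend to have more restaurants; a value near 0 would suggest no linear relationship.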

### Data

The first prerequisite for this study is to identify the areas of Singapore. The Year 2014 Singapore Planning Area boundaries are used to divide the map of Singapore into 55 areas. The number and distribution of restaurants across Singapore are collected through Foursquare. Because each call returns at most 50 results, the data is compiled by searching for restaurants within a 6 km radius of every MRT and LRT station in Singapore. Next, the resident population by planning area and ethnic group is used to examine the correlation with different ethnicities. Lastly, the resident population by planning area and type of dwelling is used to examine the correlation with different income levels. All data not obtained from Foursquare comes from Singapore's publicly available dataset portal, [Singapore Data](https://data.gov.sg/).
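
To illustrate how a restaurant's coordinates can be assigned to a planning area, here is a minimal sketch using `shapely`. The two rectangles are made-up stand-ins for real planning-area boundaries, and the helper name `find_area` is my own; the actual boundaries are irregular and would come from the 2014 planning area file.

```python
from shapely.geometry import Point, Polygon

# Hypothetical rectangular "planning areas" given as (lng, lat) corners.
areas = {
    "Area West": Polygon([(103.70, 1.32), (103.76, 1.32), (103.76, 1.36), (103.70, 1.36)]),
    "Area East": Polygon([(103.90, 1.30), (103.96, 1.30), (103.96, 1.34), (103.90, 1.34)]),
}

def find_area(lat, lng):
    """Return the name of the first area containing the point, or None."""
    point = Point(lng, lat)  # shapely uses (x, y), i.e. (lng, lat)
    for name, boundary in areas.items():
        if boundary.contains(point):
            return name
    return None

print(find_area(1.333945, 103.740333))  # a point inside the western rectangle
print(find_area(1.20, 103.50))          # a point outside both rectangles
```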

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import json
import folium # plotting library
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib as mpl

from IPython.display import Image # libraries for displaying images
from IPython.core.display import HTML 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from pykml.factory import KML_ElementMaker as KML
from pandas import json_normalize # transforming a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
from shapely.geometry import Point, Polygon
from matplotlib.ticker import PercentFormatter
from folium import plugins

print('Libraries imported.')

Libraries imported.


The credentials below are used to set up queries to the Foursquare API.

In [173]:
CLIENT_ID = '2PXGMUJF0IOVFDI2WCRNW5BCL325Y3GRQNRJMFKH4JWLDE4Z' # your Foursquare ID
CLIENT_SECRET = 'C4WKMAS3XK2KUGN42N2D5WNLN1LOH3VEB4Z1V44GX0JATRZD' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

search_query = 'restaurant'
radius = 6000
print(search_query + ' .... OK!')


Your credentials:
CLIENT_ID: 2PXGMUJF0IOVFDI2WCRNW5BCL325Y3GRQNRJMFKH4JWLDE4Z
CLIENT_SECRET:C4WKMAS3XK2KUGN42N2D5WNLN1LOH3VEB4Z1V44GX0JATRZD
restaurant .... OK!


The coordinates of each MRT and LRT station are obtained from a dataset published on Kaggle, available at this link: [Kaggle](https://www.kaggle.com/yxlee245/singapore-train-station-coordinates). The coordinates are later used as input for the Foursquare API queries.

In [4]:
df = pd.read_csv('mrt_lrt_data.csv')
df.head()

Unnamed: 0,station_name,type,lat,lng
0,Jurong East,MRT,1.333207,103.742308
1,Bukit Batok,MRT,1.349069,103.749596
2,Bukit Gombak,MRT,1.359043,103.751863
3,Choa Chu Kang,MRT,1.385417,103.744316
4,Yew Tee,MRT,1.397383,103.747523
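
Since each station becomes the centre of a 6 km search circle, it helps to see how far apart neighbouring stations are. The sketch below uses the haversine formula with a spherical Earth approximation; the helper name `haversine_m` is my own, and the two coordinates are the first two rows of the table above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Jurong East and Bukit Batok stations, taken from the table above.
d = haversine_m(1.333207, 103.742308, 1.349069, 103.749596)
print(f"{d:.0f} m apart; inside a 6 km radius: {d <= 6000}")
```

Neighbouring stations this close together produce heavily overlapping search circles, so the same venue can be returned by more than one query.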


As mentioned, the restaurant distribution data is obtained by querying the Foursquare API. Each query searches for restaurants within a 6 km radius of an MRT or LRT station.

In [6]:
test2 = []  # accumulates the raw venue records from every station query

for lat, lng, label in zip(df['lat'], df['lng'], df['station_name']):
    # build the search URL for this station and fetch the JSON response
    rngurl = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, search_query, radius, LIMIT)
    results = requests.get(rngurl).json()
    print("Completed : %s " % label)
    test1 = results['response']['venues']
    test2 += test1

Completed : Jurong East 
Completed : Bukit Batok 
Completed : Bukit Gombak 
Completed : Choa Chu Kang 
Completed : Yew Tee 
Completed : Kranji 
Completed : Marsiling 
Completed : Woodlands 
Completed : Admiralty 
Completed : Sembawang 
Completed : Yishun 
Completed : Khatib 
Completed : Yio Chu Kang 
Completed : Ang Mo Kio 
Completed : Bishan 
Completed : Braddell 
Completed : Toa Payoh 
Completed : Novena 
Completed : Newton 
Completed : Orchard 
Completed : Somerset 
Completed : Dhoby Ghaut 
Completed : City Hall 
Completed : Raffles Place 
Completed : Marina Bay 
Completed : Marina South Pier 
Completed : Tuas Link 
Completed : Tuas West Road 
Completed : Tuas Crescent 
Completed : Gul Circle 
Completed : Joo Koon 
Completed : Pioneer 
Completed : Boon Lay 
Completed : Lakeside 
Completed : Chinese Garden 
Completed : Clementi 
Completed : Dover 
Completed : Buona Vista 
Completed : Commonwealth 
Completed : Queenstown 
Completed : Redhill 
Completed : Tiong Bahru 
Completed : Outra
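
The loop above formats the query string by hand. As an alternative sketch, the same URL can be assembled from a dictionary with the standard library, which takes care of the URL encoding. The function name `build_search_url` and the placeholder credentials are assumptions for illustration; no request is sent here.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(lat, lng, client_id, client_secret,
                     version="20180604", query="restaurant", radius=6000, limit=100):
    """Build the venue search URL for one station without sending a request."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return f"{ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; substitute real ones before querying.
url = build_search_url(1.333207, 103.742308, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
print(parse_qs(urlparse(url).query)["ll"])
```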

The results are converted into a dataframe and cleaned for further analysis. Duplicate records are removed and restaurants outside Singapore are filtered out. The categories and addresses of the restaurants are then processed to simplify later analysis.

In [252]:
test3=pd.DataFrame(test2)
test3=test3.drop_duplicates(subset='id') # to remove any duplicate restaurant. ID number is unique for each venue
test5=test3.location
test5=test5.to_dict()
test6=pd.DataFrame(test5)
test6=test6.T
test6.head()

Unnamed: 0,address,lat,lng,labeledLatLngs,distance,cc,city,country,formattedAddress,crossStreet,neighborhood,postalCode,state
0,135 Jurong Gateway Rd #02-337,1.33394,103.74,"[{'label': 'display', 'lat': 1.333944846554801...",234,SG,Singapore,Singapore,"[135 Jurong Gateway Rd #02-337, Singapore]",,,,
1,IMM Building,1.33541,103.746,"[{'label': 'display', 'lat': 1.335406010196333...",501,SG,Singapore,Singapore,"[IMM Building (2 Jurong East Street 21), Singa...",2 Jurong East Street 21,Jurong East,,
2,303 Jurong East St. 32 #01-96,1.34478,103.735,"[{'label': 'display', 'lat': 1.344776588452771...",1512,SG,Singapore,Singapore,"[303 Jurong East St. 32 #01-96, 600303, Singap...",,Jurong East,600303.0,
3,202 Jurong East St. 21. #01-113,1.33648,103.743,"[{'label': 'display', 'lat': 1.336477183856219...",372,SG,Singapore,Singapore,"[202 Jurong East St. 21. #01-113, 600202, Sing...",,,600202.0,
4,#01-101B IMM Building,1.33439,103.746,"[{'label': 'display', 'lat': 1.334392759401934...",435,SG,Singapore,Singapore,"[#01-101B IMM Building (2 Jurong East St. 21),...",2 Jurong East St. 21,,,


In [251]:
test7 = pd.concat([test3,test6],axis=1)

test8 = test7.drop(['referralId','categories','hasPerk','venuePage','address','location','country','distance',
                    'crossStreet','state','labeledLatLngs','neighborhood','postalCode','city','cc'],axis=1)
test8.head()

Unnamed: 0,id,name,lat,lng,formattedAddress
0,55a24cb7498ea5a1aef817cc,Beng Hiang Restaurant,1.33394,103.74,"[135 Jurong Gateway Rd #02-337, Singapore]"
1,5a2cca826eda024462c994fe,White Restaurant 三巴旺白米粉 (White Restaurant),1.33541,103.746,"[IMM Building (2 Jurong East Street 21), Singa..."
2,4ba5d8cef964a5208c2539e3,Enaq Restaurant,1.34478,103.735,"[303 Jurong East St. 32 #01-96, 600303, Singap..."
3,4fb72ed2e4b07fc43111e478,Xiang Ji Seafood and Steamboat Restaurant,1.33648,103.743,"[202 Jurong East St. 21. #01-113, 600202, Sing..."
4,51304889e4b0a7dd557d360f,Soup Restaurant Teahouse 三盅兩件茶楼,1.33439,103.746,"[#01-101B IMM Building (2 Jurong East St. 21),..."
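
The flattening above goes through `to_dict` and a transpose. As a possible alternative, `pd.json_normalize` can expand the nested `location` field in one call. This is only a sketch: the records below are trimmed, illustrative versions of two venues from the output above, with a duplicate appended to mimic overlapping station queries.

```python
import pandas as pd

# Trimmed venue records shaped like the Foursquare response.
venues = [
    {"id": "55a24cb7498ea5a1aef817cc", "name": "Beng Hiang Restaurant",
     "location": {"lat": 1.33394, "lng": 103.74,
                  "formattedAddress": ["135 Jurong Gateway Rd #02-337", "Singapore"]}},
    {"id": "4ba5d8cef964a5208c2539e3", "name": "Enaq Restaurant",
     "location": {"lat": 1.34478, "lng": 103.735,
                  "formattedAddress": ["303 Jurong East St. 32 #01-96", "600303", "Singapore"]}},
]
venues.append(venues[0])  # simulate the same venue returned by a second query

# Nested keys become dotted column names such as "location.lat".
flat = pd.json_normalize(venues).drop_duplicates(subset="id")

# Join each address list into a single string and keep the columns of interest.
flat = flat.assign(formattedAddress=flat["location.formattedAddress"].apply(", ".join))
flat = flat[["id", "name", "location.lat", "location.lng", "formattedAddress"]]
print(flat)
```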


In [245]:
#Extract the category section within the categories list
test9=test7.categories
test9=test9.to_dict()
test10 = pd.DataFrame.from_dict(test9, orient='index')
a=test10[0].to_dict()
test11=pd.DataFrame(a)
test11=test11.T
test11=test11.drop(['id','name','icon','primary','pluralName'],axis=1)
test12 = pd.concat([test8,test11],axis=1)
test14=test12.formattedAddress

#to change the address from list into string
for a in test14.index :
    b=test14[a]
    b=str(b).strip("[]")
    b=str(b).replace("'","")
    test14[a]=b

In [248]:
test16=test12.drop('formattedAddress',axis=1)
test16 = pd.concat([test16,test14],axis=1)
test17=test16[~test16.formattedAddress.str.contains('Malaysia')]
test17.head()

Unnamed: 0,id,name,lat,lng,shortName,formattedAddress
0,55a24cb7498ea5a1aef817cc,Beng Hiang Restaurant,1.33394,103.74,Chinese,"135 Jurong Gateway Rd #02-337, Singapore"
1,5a2cca826eda024462c994fe,White Restaurant 三巴旺白米粉 (White Restaurant),1.33541,103.746,Chinese,"IMM Building (2 Jurong East Street 21), Singapore"
2,4ba5d8cef964a5208c2539e3,Enaq Restaurant,1.34478,103.735,Indian,"303 Jurong East St. 32 #01-96, 600303, Singapore"
3,4fb72ed2e4b07fc43111e478,Xiang Ji Seafood and Steamboat Restaurant,1.33648,103.743,Chinese,"202 Jurong East St. 21. #01-113, 600202, Singa..."
4,51304889e4b0a7dd557d360f,Soup Restaurant Teahouse 三盅兩件茶楼,1.33439,103.746,Chinese,"#01-101B IMM Building (2 Jurong East St. 21), ..."


The categories are then cleaned by manually checking the restaurants under each category, to give a more accurate representation of restaurant type. Restaurants are classified mainly by cultural influence rather than by the type of food they offer.

In [249]:
d = {'Turkish':'Middle Eastern','Modern European':'European','German':'European','Portuguese':'European','Seafood':'Chinese',
     'French':'European','BBQ':'Chinese','Cocktail':'European','Cuban':'Caribbean','Cantonese':'Chinese','Cafeteria':'Food Court',
     'Cajun / Creole':'Food Court','Deli / Bodega':'European','Italian':'Mediterranean','Spanish':'Mediterranean','Pub':'Bar',
     'South Indian':'Indian','Lebanese':'Middle Eastern','Dim Sum':'Chinese','Falafel':'Middle Eastern',
     'Molecular Gastronomy':'Middle Eastern','Comfort Food':'European','Tea Room':'Food Court','Shabu-Shabu':'Japanese',
     'Other Outdoors':'Food Court','Food & Drink':'Chinese','Halal':'Malay','Gourmet':'Malay','Fried Chicken':'Filipino',
     'Hotpot':'Chinese','North Indian':'Indian','Indian Chinese':'Manchu','Beijing': 'Chinese','Cha Chaan Teng':'Chinese',
     'Hainan':'Chinese','Arcade':'Indian','Winery':'Bar','Hong Kong':'Chinese','Shanghai':'Chinese','Peking Duck':'Chinese',
     'Szechuan':'Chinese','Dongbei':'Chinese','Dumplings':'Chinese','Sushi':'Japanese','Japanese Curry':'Japanese',
     'Bistro':'Bar','Beer Garden':'Bar','Brewery':'Bar','Café':'European','Noodles':'Japanese','Soup':'Chinese',
     'Coffee Shop':'Chinese','Breakfast':'Chinese','Restaurant':'European','Afghan':'European','Diner':'Korean','Food':'Malay',
     'Asian':'Chinese','Pakistani':'Middle Eastern','Scandinavian':'European','Swiss':'European','American':'European',
     'Steakhouse':'Bar'}
test19 = test17.replace(d)
# categories excluded from the analysis; rows without a category are dropped as well
excluded = ['Arepas', 'Burgers', 'Eastern European', 'Gay Bar', 'Mac & Cheese', 'New American',
            'Event Space', 'Office', 'Park', 'Entertainment', 'Parking', 'Metro',
            'Factory', 'Technology', 'Fast Food', 'African']
test20 = test19[~test19.shortName.isin(excluded) & test19.shortName.notna()]

test20.head()

Unnamed: 0,id,name,lat,lng,shortName,formattedAddress
0,55a24cb7498ea5a1aef817cc,Beng Hiang Restaurant,1.333945,103.740333,Chinese,"135 Jurong Gateway Rd #02-337, Singapore"
1,5a2cca826eda024462c994fe,White Restaurant 三巴旺白米粉 (White Restaurant),1.335406,103.746238,Chinese,"IMM Building (2 Jurong East Street 21), Singapore"
2,4ba5d8cef964a5208c2539e3,Enaq Restaurant,1.344777,103.735175,Indian,"303 Jurong East St. 32 #01-96, 600303, Singapore"
3,4fb72ed2e4b07fc43111e478,Xiang Ji Seafood and Steamboat Restaurant,1.336477,103.743038,Chinese,"202 Jurong East St. 21. #01-113, 600202, Singa..."
4,51304889e4b0a7dd557d360f,Soup Restaurant Teahouse 三盅兩件茶楼,1.334393,103.746033,Chinese,"#01-101B IMM Building (2 Jurong East St. 21), ..."
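
One thing to be aware of: calling `replace` on the whole dataframe applies the mapping to every column, so a venue whose name exactly matches a key (for example a venue simply called "Restaurant") would be renamed too. The sketch below, on made-up rows, restricts the replacement to the category column; the variable names are my own.

```python
import pandas as pd

# Hypothetical rows; only the category column should be rewritten.
venues = pd.DataFrame({
    "name": ["Restaurant", "Sushi Place", "Curry House", "Car Park B"],
    "shortName": ["Cantonese", "Sushi", "Indian", "Parking"],
})
mapping = {"Cantonese": "Chinese", "Sushi": "Japanese", "Restaurant": "European"}
excluded = ["Parking"]

# Apply the mapping to shortName only, then drop the excluded categories.
cleaned = venues.assign(shortName=venues["shortName"].replace(mapping))
cleaned = cleaned[~cleaned["shortName"].isin(excluded)]
print(cleaned)
```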


The overall count of restaurants per category is calculated for further processing of the types of restaurant available in Singapore.

In [250]:
test21=test20.groupby(['shortName']).count()
test21.head()

Unnamed: 0_level_0,id,name,lat,lng,formattedAddress
shortName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Bar,29,29,29,29,29
Buffet,6,6,6,6,6
Caribbean,2,2,2,2,2
Chinese,499,499,499,499,499
European,72,72,72,72,72


The restaurant locations are clustered and marked on the map for a simple visualisation.

In [202]:
address = 'Singapore'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create a map centred on Singapore
venues_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# instantiate a marker cluster object for the restaurants in the dataframe
summary = plugins.MarkerCluster().add_to(venues_map)

# loop through the dataframe and add each restaurant to the marker cluster
for lat, lng, label in zip(test20.lat, test20.lng, test20.shortName):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(summary)

# display map
venues_map