# Capstone Project - Chinese Restaurants in Sydney (Week 5)
### Applied Data Science Capstone by IBM/Coursera
<img src = "https://upload.wikimedia.org/wikipedia/commons/5/51/Sydney_skyline_from_the_north_August_2016_%2829009142591%29.jpg" >

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>

Sydney is a gateway to Australia for many international visitors. It hosted over 2.8 million international visitors in 2013, nearly half of all international visits to Australia. These visitors spent 59 million nights in the city and a total of $5.9 billion (Wikipedia).

In this project we analyse the distribution of Chinese restaurants in Sydney and try to identify a promising location for a new Chinese restaurant.

This report is intended to provide information on **Chinese restaurants** in **Sydney**, Australia.

Sydney is the largest and most populous city in Australia and the capital of New South Wales. Its population is approaching 5 million people. Because Sydney has a large number of Chinese restaurants, we focus only on the area covered by postcodes 1000 to 2249. We use Foursquare to find **Chinese restaurants** and their **locations**.


## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing Chinese restaurants in the neighborhood
* distance to Chinese restaurants in the neighborhood
* distribution of Chinese restaurants in Sydney
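
The second factor needs a distance between two latitude/longitude points. Foursquare returns a `distance` field for each venue relative to the query point, but when distances between arbitrary pairs of points are needed, a great-circle (haversine) calculation is one option. The sketch below is only an illustration: the function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions, not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Distance from the Sydney reference point to the coordinates listed for postcode 2205
d = haversine_m(-33.865143, 151.209900, -33.937551, 151.147956)
print(round(d / 1000, 1), 'km')
```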

We downloaded Australian postcodes from GitHub (https://github.com/matthewproctor/australianpostcodes). This CSV file contains the following fields:

| Field name | Description |
|---|---|
| postcode | The postcode in numerical format - 0000 to 9999 |
| locality | The locality of the postcode - typically the city/suburb or postal distribution centre |
| state | The Australian state in which the locality is situated |
| long | The longitude of the locality - defaults to 0 when not available |
| lat | The latitude of the locality - defaults to 0 when not available |
| dc1 | The Australia Post distribution centre servicing this postcode - defaults to blank when not available |
| type1 | The type of locality, such as a delivery area, post office or a "Large Volume Recipient" such as a GPO - defaults to blank when not available |
| status | A note indicating whether the data is new, removed or updated - new column Nov 2018 |

We use the localities in this file, rather than a regularly spaced grid of locations centered on the city center, to define our neighborhoods.

We focus only on postcodes from 1000 to 2249.

The latitude and longitude coordinates of Sydney, NSW, Australia are -33.865143, 151.209900.

The following data sources will be needed to extract or generate the required information:
* the longitude and latitude of candidate areas, used to request the list of Chinese restaurants from Foursquare
* the number of restaurants in every neighborhood, with their type and location, obtained using the **Foursquare API**

## Methodology <a name="methodology"></a>
* Review our sources of data

We will review our data sources and see what data preparation is needed. We have two main sources of data:

a) a CSV file listing all postcodes in Australia with their corresponding latitude and longitude coordinates, taken from https://github.com/matthewproctor/australianpostcodes

b) neighborhood information gathered through the Foursquare API for each (latitude, longitude) postal area in Sydney

* Collect Data

* Explore and Understand Data

* Data Preparation and Preprocessing

* Modeling

#### Collect Data

In [1]:
import requests
import pandas as pd
import numpy as np

In [2]:
# import csv file
df_au_postal = pd.read_csv('australian_postcodes.csv')

In [3]:
df_au_postal.head()

Unnamed: 0,postcode,locality,State,long,lat,id,dc,type,status
0,6532,CARRARANG,WA,115.004595,-28.440886,10861,GERALDTON DC,Delivery Area,
1,6532,COBURN,WA,115.004595,-28.440886,10862,GERALDTON DC,Delivery Area,
2,6532,COOLCALALAYA,WA,115.004595,-28.440886,10863,GERALDTON DC,Delivery Area,
3,6532,DARTMOOR,WA,115.004595,-28.440886,10864,GERALDTON DC,Delivery Area,
4,6532,DEEPDALE,WA,115.004595,-28.440886,10865,GERALDTON DC,Delivery Area,


In [4]:
# check type
df_au_postal.dtypes

postcode      int64
locality     object
State        object
long        float64
lat         float64
id            int64
dc           object
type         object
status       object
dtype: object

In [38]:
# clean data: keep postcodes 1000 to 2249 and take an explicit copy of the slice
df_sydney = df_au_postal[(df_au_postal['postcode'] >= 1000) & (df_au_postal['postcode'] <= 2249)].copy()
# drop the columns we do not need
df_sydney = df_sydney.drop(['State', 'id', 'type', 'status'], axis=1)
# remove rows whose coordinates fall outside Australia (for example the 0 defaults)
df_sydney = df_sydney[~((df_sydney['long'] < 100) | (df_sydney['lat'] > -10))]
# remove rows with missing coordinates
df_sydney = df_sydney.dropna(subset=['long', 'lat'])

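
pandas may emit a `SettingWithCopyWarning` when a filtered slice of a DataFrame is modified in place. One common way to avoid it is to take an explicit `.copy()` of the slice and reassign the result of each step. The sketch below shows the idea on a toy frame; apart from the column names and the coordinates for postcode 2205, all values are made up.

```python
import pandas as pd

# Toy stand-in for the postcode file (values are invented except the 2205 row)
toy = pd.DataFrame({
    'postcode': [800, 2205, 2206, 2207, 6532],
    'locality': ['A', 'B', 'C', 'D', 'E'],
    'long': [130.8, 151.147956, 0.0, None, 115.0],
    'lat': [-12.4, -33.937551, 0.0, -33.9, -28.4],
})

# Boolean mask for the postcode range; .copy() makes the result an independent frame
in_range = toy['postcode'].between(1000, 2249)
subset = toy[in_range].copy()

# Keep only plausible Australian coordinates, then drop missing values
valid = (subset['long'] >= 100) & (subset['lat'] <= -10)
subset = subset[valid].dropna(subset=['long', 'lat'])

print(subset)
```

Only the 2205 row survives here: the 2206 row carries the 0 defaults and the 2207 row has a missing longitude.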


In [46]:
# check again
df_sydney['lat'].isnull().sum()

0

In [48]:
df_sydney.head()

Unnamed: 0,postcode,locality,long,lat,dc
14279,2205,ARNCLIFFE,151.147956,-33.937551,ROCKDALE DC
14280,2205,TURRELLA,151.147956,-33.937551,ROCKDALE DC
14281,2205,WOLLI CREEK,151.147956,-33.937551,ROCKDALE DC
14282,2206,CLEMTON PARK,151.122881,-33.926056,KINGSGROVE DC
14283,2206,EARLWOOD,151.122881,-33.926056,KINGSGROVE DC


In [49]:
df_sydney.shape

(592, 5)
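
The preview above shows several localities sharing the same coordinates (Arncliffe, Turrella and Wolli Creek all map to one point), so querying every row may repeat identical API calls. As an optional refinement that the notebook itself does not apply, the check points could be reduced to distinct coordinate pairs. The name `check_points` is hypothetical.

```python
import pandas as pd

# Rows copied from the df_sydney preview
rows = pd.DataFrame({
    'postcode': [2205, 2205, 2205, 2206, 2206],
    'locality': ['ARNCLIFFE', 'TURRELLA', 'WOLLI CREEK', 'CLEMTON PARK', 'EARLWOOD'],
    'long': [151.147956, 151.147956, 151.147956, 151.122881, 151.122881],
    'lat': [-33.937551, -33.937551, -33.937551, -33.926056, -33.926056],
})

# One row per distinct coordinate pair; the first locality of each group is kept
check_points = rows.drop_duplicates(subset=['lat', 'long']).reset_index(drop=True)
print(len(rows), '->', len(check_points))
```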

In [47]:
# save to df_sydney.csv
df_sydney.to_csv('df_sydney.csv')

Let's visualize Sydney and our check points.

In [50]:
import folium

In [52]:
# create map of Sydney using latitude and longitude values
map_sydney = folium.Map(location=[-33.865143, 151.209900], zoom_start=10)

# add markers to map
for lat, lng, label in zip(df_sydney['lat'], df_sydney['long'], df_sydney['dc']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sydney)  
    
map_sydney

#### Next, we use the Foursquare API to explore Chinese restaurants and segment them.

In [56]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'FUEW23VONO3RORWYVJ5M5QSXMJEOBYUHOFMMOQFUCKOAGA2A' # your Foursquare ID
CLIENT_SECRET = '4HEVVH0BOGMVSCQWPNLQSLAYXR2ZD2FQD5Z5ET3ZWMOWHVYF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FUEW23VONO3RORWYVJ5M5QSXMJEOBYUHOFMMOQFUCKOAGA2A
CLIENT_SECRET:4HEVVH0BOGMVSCQWPNLQSLAYXR2ZD2FQD5Z5ET3ZWMOWHVYF


In [57]:
# Category IDs corresponding to Chinese restaurants were taken from Foursquare web site 
#(https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

chinese_restaurant_categories = [
    '4bf58dd8d48988d145941735',
    '52af3a5e3cf9994f4e043bea',
    '52af3a723cf9994f4e043bec',
    '52af3a7c3cf9994f4e043bed',
    '58daa1558bbb0b01f18ec1d3',
    '52af3a673cf9994f4e043beb',
    '52af3a903cf9994f4e043bee',
    '4bf58dd8d48988d1f5931735',
    '52af3a9f3cf9994f4e043bef',
    '52af3aaa3cf9994f4e043bf0',
    '52af3ab53cf9994f4e043bf1',
    '52af3abe3cf9994f4e043bf2',
    '52af3ac83cf9994f4e043bf3',
    '52af3ad23cf9994f4e043bf4',
    '52af3add3cf9994f4e043bf5',
    '52af3af23cf9994f4e043bf7',
    '52af3ae63cf9994f4e043bf6',
    '52af3afc3cf9994f4e043bf8',
    '52af3b053cf9994f4e043bf9',
    '52af3b213cf9994f4e043bfa',
    '52af3b293cf9994f4e043bfb',
    '52af3b343cf9994f4e043bfc',
    '52af3b3b3cf9994f4e043bfd',
    '52af3b463cf9994f4e043bfe',
    '52af3b633cf9994f4e043c01',
    '52af3b513cf9994f4e043bff',
    '52af3b593cf9994f4e043c00',
    '52af3b6e3cf9994f4e043c02',
    '52af3b773cf9994f4e043c03',
    '52af3b813cf9994f4e043c04',
    '52af3b893cf9994f4e043c05',
    '52af3b913cf9994f4e043c06',
    '52af3b9a3cf9994f4e043c07',
    '52af3ba23cf9994f4e043c08',
]

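def is_restaurant(categories, specific_filter=None):
    # categories: list of (name, id) pairs as returned by get_categories
    # returns (venue is a restaurant, venue matches an ID in specific_filter)
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        # fast food venues are not counted as restaurants
        if 'fast food' in category_name:
            restaurant = False
        # a match on the specific filter always counts as a restaurant
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific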

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Sydney', '')
    address = address.replace(', Australia', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except Exception:  # request failed or the response did not have the expected shape
        venues = []
    return venues
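
To see how `is_restaurant` classifies a venue, the sketch below repeats the function in compact form and runs it on hand-written category lists. The sample venues and the IDs `x1` and `x2` are made up; only the ID `4bf58dd8d48988d145941735` is taken from the `chinese_restaurant_categories` list.

```python
def is_restaurant(categories, specific_filter=None):
    """Return (is_restaurant, matches_specific_filter) for a list of (name, id) pairs."""
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for name, category_id in categories:
        name = name.lower()
        if any(word in name for word in restaurant_words):
            restaurant = True
        if 'fast food' in name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

chinese_ids = ['4bf58dd8d48988d145941735']  # first ID from chinese_restaurant_categories

# Hypothetical venues: the 'x1' and 'x2' IDs are invented for this example
print(is_restaurant([('Chinese Restaurant', '4bf58dd8d48988d145941735')], chinese_ids))  # (True, True)
print(is_restaurant([('Thai Restaurant', 'x1')], chinese_ids))                           # (True, False)
print(is_restaurant([('Fast Food Restaurant', 'x2')], chinese_ids))                      # (False, False)
```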

#### Now, let's get the list of Chinese restaurants within a radius of 500 meters of each check point.

In [64]:
# Let's now go over our check points and get nearby restaurants;
# we'll also maintain a dictionary of all found restaurants and all found Chinese restaurants


def get_restaurants(lats, lons):
    restaurants = {}
    chinese_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=500 to make sure we have overlaps/full coverage so we don't miss any restaurant
        # (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=500, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_chinese = is_restaurant(venue_categories, specific_filter=chinese_restaurant_categories)
            if is_res:
                #x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_chinese, venue_latlon[1], venue_latlon[0])
                if venue_distance<=450:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_chinese:
                    chinese_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, chinese_restaurants, location_restaurants
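
The comment about overlaps relies on a simple property: writing the same venue ID into a dictionary twice keeps a single entry, while the per-area lists keep every hit. A minimal sketch with made-up venues (the IDs `v1` to `v3` and the names are hypothetical):

```python
# Two overlapping search areas that both return venue 'v2'
area_a = [('v1', 'Golden Wok'), ('v2', 'Jade Garden')]
area_b = [('v2', 'Jade Garden'), ('v3', 'Noodle Bar')]

unique_venues = {}
per_area = []
for area in (area_a, area_b):
    # per-area lists keep duplicates across areas
    per_area.append([venue_id for venue_id, _ in area])
    for venue_id, name in area:
        # the same key is overwritten, so each venue is stored once
        unique_venues[venue_id] = name

print(len(unique_venues))  # 3
print(per_area)            # [['v1', 'v2'], ['v2', 'v3']]
```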



In [65]:
# Query Foursquare around every check point
restaurants, chinese_restaurants, location_restaurants = get_restaurants(
    df_sydney['lat'], df_sydney['long']
)


Obtaining venues around candidate locations: . . . . . . . . . . (progress dots truncated, one per check point)

In [68]:
# save data
import pickle

cache = (restaurants, chinese_restaurants, location_restaurants)

# the with-block closes the file even if pickling fails
with open("file.pkl", "wb") as f:
    pickle.dump(cache, f)
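
Saving the results is only useful if a later run reads them back instead of querying the API again. A minimal sketch of such a load-or-compute pattern follows. The names `load_or_compute` and `expensive_query` and the file name `demo_cache.pkl` are hypothetical stand-ins, not helpers from this notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, or call `compute()` and store its result there."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_query():
    # Stand-in for the Foursquare loop; records that it was called
    calls.append(1)
    return ({'v1': 'all restaurants'}, {'v1': 'chinese restaurants'}, [['v1']])

cache_path = os.path.join(tempfile.mkdtemp(), 'demo_cache.pkl')
first = load_or_compute(cache_path, expensive_query)   # computes and saves
second = load_or_compute(cache_path, expensive_query)  # loads from disk
print(first == second, len(calls))
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.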

In [66]:
chinese_restaurants

{'4b15d748f964a520efb423e3': ('4b15d748f964a520efb423e3',
  'Cams Best & Less Restaurant',
  -33.92728,
  151.124333,
  '377 Homer St (at Hartill Law Ave), Earlwood NSW 2206',
  191,
  True,
  151.124333,
  -33.92728),
 '4d54f814611aa35dcf792f39': ('4d54f814611aa35dcf792f39',
  'Kingsgrove Chinese Restaurant',
  -33.94197446311799,
  151.1013392018479,
  '270 Kingsgrove Rd, Kingsgrove NSW',
  316,
  True,
  151.1013392018479,
  -33.94197446311799),
 '4bc19e0df8219c74aca1b310': ('4bc19e0df8219c74aca1b310',
  'House Of Lee',
  -33.95805874372357,
  151.03435144951462,
  'New South Wales',
  417,
  True,
  151.03435144951462,
  -33.95805874372357),
 '55892297498ef02cb1650c20': ('55892297498ef02cb1650c20',
  'wokkin tuckshop',
  -33.958483,
  151.151043,
  'Australia',
  432,
  True,
  151.151043,
  -33.958483),
 '4ba5cad8f964a520932239e3': ('4ba5cad8f964a520932239e3',
  'Canton Noodle House',
  -33.96714448419305,
  151.10518388273002,
  'Rose St (Forest Rd), Hurstville Grove NSW 2220',
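
A dictionary of tuples like the one printed above is awkward to filter or plot. One option is to convert it to a DataFrame. The sketch below uses the first two entries from the output; the column names are assumptions chosen to match the tuple layout built in `get_restaurants`, and treating the distance as meters is also an assumption.

```python
import pandas as pd

# Two entries copied from the chinese_restaurants output
chinese_restaurants = {
    '4b15d748f964a520efb423e3': ('4b15d748f964a520efb423e3', 'Cams Best & Less Restaurant',
                                 -33.92728, 151.124333,
                                 '377 Homer St (at Hartill Law Ave), Earlwood NSW 2206',
                                 191, True, 151.124333, -33.92728),
    '4d54f814611aa35dcf792f39': ('4d54f814611aa35dcf792f39', 'Kingsgrove Chinese Restaurant',
                                 -33.94197446311799, 151.1013392018479,
                                 '270 Kingsgrove Rd, Kingsgrove NSW',
                                 316, True, 151.1013392018479, -33.94197446311799),
}

# Hypothetical column names, one per tuple position
columns = ['id', 'name', 'lat', 'lng', 'address', 'distance_m', 'is_chinese', 'lng_dup', 'lat_dup']
df_chinese = pd.DataFrame(list(chinese_restaurants.values()), columns=columns)

# The last two fields repeat lng/lat, so they can be dropped
df_chinese = df_chinese.drop(columns=['lng_dup', 'lat_dup'])
print(df_chinese[['name', 'distance_m']])
```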
 

In [None]:
# build the explore URL for a single neighborhood
# (neighborhood_latitude and neighborhood_longitude must be defined before this cell is run)
LIMIT = 100 # maximum number of venues returned by the Foursquare API
radius = 500 # search radius in meters
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
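
Formatting the query string by hand works, but it is easy to misplace a `&` or forget a parameter. As an alternative sketch, the standard library's `urlencode` can assemble the query from a dictionary. The function name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones already used in this notebook.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon, radius=500, limit=100):
    """Build a Foursquare v2 venues/explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in "lat,lon" unescaped
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials; the coordinates are the Sydney reference point
url = build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', -33.865143, 151.209900)
print(url)
```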