Capstone Project - The Battle of Neighbourhoods 
============

## Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

My stakeholders are those who plan to move house to Yongin city, Kyounggi province, South Korea. Since it borders Seongnam City's Bundang District and Suwon City, two more well developed area, Yongin city should be a good option  target to find the optimal wine bar location using the skills that I learned from the IBM Data Science course. Briefly thinking, I should consider some factors below to increase the possibility of business success.

  * 1. Crowded Area 
       : Near Subway, City Center, office district
  * 2. Less Competitive Area 
       : Consider the number of and distance from competitive places. Regarding competitor categories, I should consider that there are not many wine bars in Korea, still 0 or 1 wine bar in the neibourhoods around downtown and instead of wine bar, Korean people go to italian/spanish restaurant or hotel/Whiskey bars and lounge for drinking wine. So I will consider these places as competitor categories - Wine bars, Italian/spanish restaurants, hotel/whiskey bars, lounge.

FYI, Here is the link of wikipedia about Suji-gu. https://en.wikipedia.org/wiki/Suji-gu

## Data <a name="data"></a>

1) Get the location coordinates of Suji-gu office using Google geocode api.

In [1]:
import requests
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis

def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except:
        return [None, None]

google_api_key = 'AIzaSyBoF5cKq8jauHleQ3YzDalgzrLa9KsOKSg'
address = 'Suji-gu Office, Yongin-si, Gyeonggi-do'
suji_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, suji_center))

Coordinate of Suji-gu Office, Yongin-si, Gyeonggi-do: [37.32107999999999, 127.097029]


2) Create neighbourhoods that are equally spaced, centered around city center and within ~6km from Suji-gu Office. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart. Belows are funtions to create neighbourhoods and their center coordinates (With reference to Notebook: https://cocl.us/coursera_capstone_notebook)

In [3]:
!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Suji center longitude={}, latitude={}'.format(suji_center[1], suji_center[0]))
x, y = lonlat_to_xy(suji_center[1], suji_center[0])
print('Suji center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Suji center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Suji center longitude=127.097029, latitude=37.32107999999999
Suji center UTM X=6520267.347462841, Y=12918112.680821314
Suji center longitude=127.09702899999955, latitude=37.32107999999955


3) Create neighbourhoods and visualize them in folium map.

In [4]:
suji_center_x, suji_center_y = lonlat_to_xy(suji_center[1], suji_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = suji_center_x - 6000
x_step = 600
y_min = suji_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(suji_center_x, suji_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

364 candidate neighborhood centers generated.


In [5]:
!pip install folium

import folium

Collecting folium
[?25l  Downloading https://files.pythonhosted.org/packages/72/ff/004bfe344150a064e558cb2aedeaa02ecbf75e60e148a55a9198f0c41765/folium-0.10.0-py2.py3-none-any.whl (91kB)
[K     |████████████████████████████████| 92kB 10.5MB/s eta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.10.0


In [6]:
map_suji = folium.Map(location=suji_center, zoom_start=13)
folium.Marker(suji_center, popup='ttt').add_to(map_suji)
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_berlin) 
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_suji)
    #folium.Marker([lat, lon]).add_to(map_suji)
map_suji

4) Define the function to create the address of neighbourhoods

In [7]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

addr = get_address(google_api_key,suji_center[0], suji_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(suji_center[0], suji_center[1], addr))

Reverse geocoding check
-----------------------
Address of [37.32107999999999, 127.097029] is: Suji-gu Office, Pungdeokcheon 2(i)-dong, Yongin-si, South Korea


5) Get the addresses of all neighbourhoods

In [8]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', Suji', '') # We don't need country part of address
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [9]:
addresses[150:170]

['373-9 Sanghyeon-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '497-6 Sanghyeon-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '678 Sanghyeon 1(il)-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 'Migeum-ro, Gumi-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea',
 '63 Gumi-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea',
 '128-4 Gumi 1(il)-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea',
 '151-1 Gumi 1(il)-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea',
 '836-2 Gumi 1(il)-dong, Bundang-gu, Seongnam-si, Gyeonggi-do, South Korea',
 '995-4 Jukjeon-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '757 Pungdeokcheon 1(il)-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '13-1 Pungdeokcheon-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '700-2 Pungdeokcheon 1(il)-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '115-18 Pungdeokcheolro, Pungdeokcheon 1(il)-dong-gu, Yongin-si, Gyeonggi-do, South Korea',
 '1105 Pungdeokcheon 2(i)-dong-gu, Yongin-si, Gyeonggi

6) Create Dataframe that includes address, latitudes, longitudes of all neighbourhoods in Sujigu Area (It also includes adjacent neighborhoods of Suji-gu).

In [10]:
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head()

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"산11-15 Gogi-dong-gu, Yongin-si, Gyeonggi-do, S...",37.349542,127.068296,6518467.0,12912400.0,5992.495307
1,"산50-1 Dongcheon-dong-gu, Yongin-si, Gyeonggi-d...",37.346494,127.065757,6519067.0,12912400.0,5840.3767
2,"604-4 Dongcheon-dong-gu, Yongin-si, Gyeonggi-d...",37.343446,127.063219,6519667.0,12912400.0,5747.173218
3,"산90-1 Dongcheon-dong-gu, Yongin-si, Gyeonggi-d...",37.340398,127.060681,6520267.0,12912400.0,5715.767665
4,"산69 Sinbong-dong-gu, Yongin-si, Gyeonggi-do, S...",37.33735,127.058143,6520867.0,12912400.0,5747.173218


In [89]:
df_locations.to_pickle('./locations.pkl')    

7) Get the venue info using Foursquare API. Below is the Foursquare Credential.  

In [11]:
client_id = 'FQIQDSZ2JNHV2YB4MGT0DK4SAHODBFZHSLJY35WM4TEQQEAJ' # your Foursquare ID
client_secret = 'THCXEARMB11F2KDNRUDU0GU35PKVT4MVFSP1CAJGWQET1CBP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
#Get Request URL
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

In [14]:
# Category IDs corresponding to Italian restaurants were taken from Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

bar_category = '4bf58dd8d48988d116941735' # 'Root' category for all food-related venues

def is_bar(categories, specific_filter=None):
    bar_words = ['bar', 'wine', 'wine bar']
    bar = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in bar_words:
            if r in category_name:
                bar = True
        if 'fast food' in category_name:
            bar = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            bar = True
    return bar, specific
def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Suji', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

In [15]:
# Let's now go over our neighborhood locations and get nearby restaurants; we'll also maintain a dictionary of all found restaurants and all found italian restaurants

import pickle

def get_bars(lats, lons):
    bars = {}
    location_bars = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to meke sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, bar_category,client_id, client_secret, radius=350, limit=100)
        area_bars = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_ba = is_bar(venue_categories, specific_filter=None)
            if is_ba:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                bar = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_bar, x, y)
                if venue_distance<=300:
                    area_bars.append(bar)
                bars[venue_id] = bar                
        location_bars.append(area_bars)
        print(' .', end='')
    print(' done.')
    return bars, location_bars

# Try to load from local file system in case we did this before
bars = {}
location_bars = []
loaded = False
try:
    with open('bar2.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('location2_bar', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('bar data loaded.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    bars, location_bars = get_bars(latitudes, longitudes)
    
    # Let's persists this in local file system
    with open('bar2.pkl', 'wb') as f:
        pickle.dump(bars, f)
    with open('location_bar2.pkl', 'wb') as f:
        pickle.dump(bars, f)
        

Obtaining venues around candidate locations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [16]:
import numpy as np

print('Total number of bars:', len(bars))
print('Total number of Italian restaurants:', len(italian_restaurants))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of bars: 27


NameError: name 'italian_restaurants' is not defined

In [17]:
print('List of all restaurants')
print('-----------------------')
for r in list(bars.values())[:10]:
    print(r)
print('...')
print('Total:', len(bars))

List of all restaurants
-----------------------
('5083acb9e4b0b8e844bb943f', '와인365', 37.353983539243046, 127.10152954628367, '분당구 대왕판교로 109, 금곡동, 성남시, 경기도, 13552, 대한민국', 344, <function is_bar at 0x7f516c49e6a8>, 6515452.84928742, 12915613.269474773)
('57f506d4498e538603899b6e', 'HAZE the coffee bar', 37.337702, 127.087492, '수지구 동천로 137, 111호, 용인시, 경기도, 대한민국', 323, <function is_bar at 0x7f516c49e6a8>, 6518691.596467971, 12915563.302647974)
('4ee734ace30005f8ba64397d', '안주 (安酒)', 37.33544398590668, 127.08733699391671, '수지구 동천로153번길 7, 동천동, 용인시, 경기도, 16822, 대한민국', 308, <function is_bar at 0x7f516c49e6a8>, 6519010.917800356, 12915751.517015597)
('4b7e86d9f964a52008f12fe3', '이까', 37.35105985003611, 127.1097687554201, '분당구 성남대로172번길 12, 성남시, 경기도, 대한민국', 209, <function is_bar at 0x7f516c49e6a8>, 6515254.091641588, 12916775.725755235)
('4e57b3f2227131507c90bebb', '나루', 37.351541, 127.110061, '분당구 성남대로172번길 12, 성남시, 경기도, 대한민국', 209, <function is_bar at 0x7f516c49e6a8>, 6515167.293386067, 12916

In [None]:
# Category IDs corresponding to Italian restaurants were taken from Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

italian_restaurant_categories = ['4bf58dd8d48988d110941735','55a5a1ebe4b013909087cbb6','55a5a1ebe4b013909087cb7c',
                                 '55a5a1ebe4b013909087cba7','55a5a1ebe4b013909087cba1','55a5a1ebe4b013909087cba4',
                                 '55a5a1ebe4b013909087cb95','55a5a1ebe4b013909087cb89','55a5a1ebe4b013909087cb9b',
                                 '55a5a1ebe4b013909087cb98','55a5a1ebe4b013909087cbbf','55a5a1ebe4b013909087cb79',
                                 '55a5a1ebe4b013909087cbb0','55a5a1ebe4b013909087cbb3','55a5a1ebe4b013909087cb74',
                                 '55a5a1ebe4b013909087cbaa','55a5a1ebe4b013909087cb83','55a5a1ebe4b013909087cb8c',
                                 '55a5a1ebe4b013909087cb92','55a5a1ebe4b013909087cb8f','55a5a1ebe4b013909087cb86',
                                 '55a5a1ebe4b013909087cbb9','55a5a1ebe4b013909087cb7f','55a5a1ebe4b013909087cbbc',
                                 '55a5a1ebe4b013909087cb9e','55a5a1ebe4b013909087cbc2','55a5a1ebe4b013909087cbad']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Deutschland', '')
    address = address.replace(', Germany', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

In [None]:
# Let's now go over our neighborhood locations and get nearby restaurants; we'll also maintain a dictionary of all found restaurants and all found italian restaurants

import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    italian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to meke sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category,client_id, client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_italian = is_restaurant(venue_categories, specific_filter=italian_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_italian, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_italian:
                    italian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, italian_restaurants, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
italian_restaurants = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_351.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('italian_restaurants_351.pkl', 'rb') as f:
        italian_restaurants = pickle.load(f)
    with open('location_restaurants_351.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, italian_restaurants, location_restaurants = get_restaurants(latitudes, longitudes)
    
    # Let's persists this in local file system
    with open('restaurants_351.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('italian_restaurants_351.pkl', 'wb') as f:
        pickle.dump(italian_restaurants, f)
    with open('location_restaurants_351.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
        