Capstone Project - The Battle of Neighbourhoods (Week1)
============

## Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)

## Introduction: Business Problem <a name="introduction"></a>

##### My stakeholders are Samsung Electronics employees who plan to move to a new neighbourhood close to Samsung Digital City in Suwon, South Korea, and want to find the best residential district near the office. I aim to find the optimal house location using the skills that I learned in the IBM Data Science course. As a first pass, the following factors should be considered to increase the chance of business success.

  * 1. Near Samsung Digital City in Suwon
  * 2. Close to a park or green space
  * 3. Easy access to basic amenities

## Data <a name="data"></a>

##### I collected two main data sets using the Google Geocoding API and the Foursquare API. First, I obtained neighbourhood data around Samsung Digital City (SDC), including address, latitude, and longitude, using the Google Geocoding API. Then I obtained venue data for all neighbourhoods within 20 km of SDC using the Foursquare API. These two data sets will be used to analyze and cluster the neighbourhoods, and I will finally suggest the best neighbourhood to people who are looking for a new home around SDC.

1) Get the location coordinates of the Samsung office using the Google Geocoding API.

In [1]:
import requests
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except Exception: # any request or parsing failure yields [None, None]
        return [None, None]

google_api_key = 'AIzaSyBoF5cKq8jauHleQ3YzDalgzrLa9KsOKSg'
address = 'Samsung Digital City, Suwon-si, Gyeonggi-do'
sdc_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, sdc_center))

Coordinate of Samsung Digital City, Suwon-si, Gyeonggi-do: [37.2539047, 127.0485106]
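
The helper above returns `[None, None]` for every kind of failure, which hides whether the key was rejected, the quota was exceeded, or the address simply had no match. As a minimal sketch, assuming the response carries the top-level `status` and optional `error_message` fields that the Geocoding API documents, the parsing step could report the reason instead. The function name `parse_geocode_response` and both sample payloads are illustrative and hand-written, not real API output.

```python
def parse_geocode_response(response):
    """Return (lat, lng) from a Geocoding API JSON dict, or raise ValueError with the reported status."""
    status = response.get('status')
    results = response.get('results') or []
    if status != 'OK' or not results:
        message = response.get('error_message', 'no further detail')
        raise ValueError('Geocoding failed: status={} ({})'.format(status, message))
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-written sample payloads for illustration only
ok_payload = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 37.2539047, 'lng': 127.0485106}}}],
}
denied_payload = {
    'status': 'REQUEST_DENIED',
    'results': [],
    'error_message': 'The provided API key is invalid.',
}

print(parse_geocode_response(ok_payload))
try:
    parse_geocode_response(denied_payload)
except ValueError as err:
    print(err)
```

Raising instead of returning `None` is a design choice; a caller that prefers the original behaviour can catch `ValueError` and fall back to `[None, None]`.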


2) Create equally spaced neighbourhoods centred on Samsung Digital City and within ~20 km of it. In the code below the neighbourhood centres are placed 1 km apart on a hexagonal grid and each neighbourhood is treated as a circular area with a radius of 500 meters. The grid spans about 10 km on either side of Samsung Digital City, so every centre falls well inside the 20 km limit. Below are the coordinate conversion functions used to create the neighbourhoods and their centre coordinates (with reference to the notebook at https://cocl.us/coursera_capstone_notebook).

In [2]:
!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

import math

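def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    # NOTE: zone=33 is carried over from the reference notebook (Berlin); Suwon (~127 E) lies in UTM zone 52.
    # It is kept as-is so that the outputs recorded below remain reproducible.
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84') # must match the zone used in lonlat_to_xy
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]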

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Samsung Digital City center longitude={}, latitude={}'.format(sdc_center[1], sdc_center[0]))
x, y = lonlat_to_xy(sdc_center[1], sdc_center[0])
print('Samsung Digital City center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('SDC center longitude={}, latitude={}'.format(lo, la))

Collecting shapely
  Downloading https://files.pythonhosted.org/packages/38/b6/b53f19062afd49bb5abd049aeed36f13bf8d57ef8f3fa07a5203531a0252/Shapely-1.6.4.post2-cp36-cp36m-manylinux1_x86_64.whl (1.5MB)
Installing collected packages: shapely
Successfully installed shapely-1.6.4.post2
Coordinate transformation check
-------------------------------
Samsung Digital City center longitude=127.0485106, latitude=37.2539047
Samsung Digital City center UTM X=6532964.978720854, Y=12918928.066697462
SDC center longitude=127.04851059999953, latitude=37.25390469999955
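
The round trip above recovers the original longitude and latitude, but the projected X and Y values are far larger than typical UTM coordinates. A likely reason is that the helpers pass `zone=33`, which suits Berlin rather than Suwon. The sketch below uses the standard 6-degree zone rule to find the zone for a given longitude; the function names are illustrative, and the rule ignores the special zones around Norway and Svalbard.

```python
import math

def utm_zone_for_longitude(lon_deg):
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    return int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1

def utm_central_meridian(zone):
    """Central meridian, in degrees east, of a UTM zone."""
    return zone * 6 - 183

sdc_lon = 127.0485106  # longitude printed by the geocoding step above
zone = utm_zone_for_longitude(sdc_lon)
print('UTM zone for SDC:', zone)
print('Central meridian of that zone:', utm_central_meridian(zone))
print('UTM zone for Berlin (13.4 E):', utm_zone_for_longitude(13.4))
```

Passing the zone found here to `pyproj.Proj` would be expected to give more conventional coordinates with less distortion, at the cost of changing every projected number recorded in the rest of this notebook.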


3) Create the neighbourhoods and visualize them on a folium map.

In [3]:
sdc_center_x, sdc_center_y = lonlat_to_xy(sdc_center[1], sdc_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = sdc_center_x - 10000
x_step = 1000
y_min = sdc_center_y - 10000 - (int(21/k)*k*1000 - 20000)/2
y_step = 1000 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 500 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(sdc_center_x, sdc_center_y, x, y)
        if (distance_from_center <= 20001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

504 candidate neighborhood centers generated.
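
The loop above lays out an offset-row (hexagonal) grid: rows are `sqrt(3)/2` of a step apart and every other row is shifted by half a step, so each centre is the same distance from its six neighbours. The following is a self-contained sketch of that layout in plain planar coordinates, with no map projection involved. `hex_grid` and its parameters are illustrative names, and the vertical centring is slightly simpler than in the notebook cell.

```python
import math

def hex_grid(center_x, center_y, half_width, step):
    """Return (x, y) centres of an offset-row hexagonal grid covering a square around the centre."""
    k = math.sqrt(3) / 2                      # row spacing as a fraction of the column spacing
    n_cols = int(2 * half_width / step) + 1
    n_rows = int(n_cols / k)
    x_min = center_x - half_width
    y_min = center_y - (n_rows - 1) * step * k / 2   # centre the rows vertically
    points = []
    for i in range(n_rows):
        y = y_min + i * step * k
        x_offset = step / 2 if i % 2 == 0 else 0     # shift alternate rows by half a step
        for j in range(n_cols):
            points.append((x_min + j * step + x_offset, y))
    return points

points = hex_grid(0.0, 0.0, half_width=10000, step=1000)
print(len(points), 'centres')

# The distance from a point to its nearest neighbour should equal the step
px, py = points[len(points) // 2]
nearest = min(math.hypot(px - qx, py - qy) for qx, qy in points if (qx, qy) != (px, py))
print('nearest-neighbour distance:', round(nearest, 6))
```

With a 10 km half-width and a 1 km step this produces 21 columns by 24 rows, which matches the 504 centres reported above.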


In [4]:
!pip install folium

import folium



In [5]:
map_sdc = folium.Map(location=sdc_center, zoom_start=12)
folium.Marker(sdc_center, popup='Samsung Digital City').add_to(map_sdc)
for lat, lon in zip(latitudes, longitudes):
    # draw each neighbourhood as a circle with a 500 m radius around its centre
    folium.Circle([lat, lon], radius=500, color='blue', fill=False).add_to(map_sdc)
map_sdc

4) Define a function that returns the address of a neighbourhood centre (reverse geocoding).

In [6]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception: # any request or parsing failure yields None
        return None

addr = get_address(google_api_key,sdc_center[0], sdc_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(sdc_center[0], sdc_center[1], addr))

Reverse geocoding check
-----------------------
Address of [37.2539047, 127.0485106] is: 416 Sin-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea


5) Get the addresses of all neighbourhoods

In [7]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    # The full formatted address, including the country part, is kept as the neighbourhood name
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [8]:
addresses[150:170]

['506-4 Iui-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea',
 '807-4 Iui-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea',
 '1171-1 Iui-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea',
 '산7-5 Uman-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea',
 '12 Paldalmun-ro 163beon-gil, Uman-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea',
 '121-10 Jungbu-daero, Ji-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea',
 '258-1 Ingye-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea',
 '899-1 Ingye-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea',
 '138-122 Seryu-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '526-5 Seryu 3(sam)-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '546-8 Seryu-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '1039 Seryu 2(i)-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '422-9 Jangji-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '334-3 Pyeongni-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea',
 '600 Gosaek-dong, Gwonse

6) Create a DataFrame that contains the address, latitude, and longitude of every neighbourhood around Samsung Digital City, together with its projected X/Y coordinates and its distance from the centre.

In [9]:
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head()

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 0 | 산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeon... | 37.337181 | 127.0226 | 6523465.0 | 12908540.0 | 14080.12784 |
| 1 | 80 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeong... | 37.3321 | 127.018378 | 6524465.0 | 12908540.0 | 13425.721582 |
| 2 | 409-25 Sanggwanggyo-dong, Jangan-gu, Suwon, Gy... | 37.32702 | 127.014157 | 6525465.0 | 12908540.0 | 12816.005618 |
| 3 | 산99-6 Sanggwanggyo-dong, Jangan-gu, Suwon, Gye... | 37.32194 | 127.009937 | 6526465.0 | 12908540.0 | 12257.650672 |
| 4 | 364-1 Pajang-dong, Jangan-gu, Suwon, Gyeonggi-... | 37.316861 | 127.005718 | 6527465.0 | 12908540.0 | 11757.976016 |
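
The `Distance from center` column is measured in the projected X/Y plane, so it inherits whatever scale the projection has at this location. As a rough cross-check, assuming a spherical Earth with a radius of 6371 km, the great-circle distance can be computed directly from latitude and longitude. `haversine_m` is an illustrative helper and is not part of the notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two latitude/longitude points (spherical Earth)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

sdc = (37.2539047, 127.0485106)     # SDC centre from the geocoding step
first_row = (37.337181, 127.0226)   # Latitude/Longitude of row 0 in the table above
projected = 14080.12784             # 'Distance from center' of row 0, in projected units

ground = haversine_m(sdc[0], sdc[1], first_row[0], first_row[1])
print('great-circle distance: {:.0f} m'.format(ground))
print('projected / great-circle ratio: {:.2f}'.format(projected / ground))
```

For row 0 the great-circle distance comes out at roughly 9.5 km against a projected value of about 14.1 km, which suggests that the projected grid units are noticeably larger than metres on the ground at this location.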


7) Foursquare credentials

In [13]:
CLIENT_ID = 'FQIQDSZ2JNHV2YB4MGT0DK4SAHODBFZHSLJY35WM4TEQQEAJ' # your Foursquare ID
CLIENT_SECRET = 'THCXEARMB11F2KDNRUDU0GU35PKVT4MVFSP1CAJGWQET1CBP' # your Foursquare Secret
VERSION = '20190605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FQIQDSZ2JNHV2YB4MGT0DK4SAHODBFZHSLJY35WM4TEQQEAJ
CLIENT_SECRET:THCXEARMB11F2KDNRUDU0GU35PKVT4MVFSP1CAJGWQET1CBP
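
Printing credentials into the notebook output makes them visible to anyone who reads the exported file. One common alternative, sketched below, is to read them from environment variables and to mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, the helper names, and the demo values are all assumptions made for illustration.

```python
import os

def load_credential(name, env=os.environ):
    """Read a credential from the environment and fail loudly if it is missing."""
    value = env.get(name)
    if not value:
        raise RuntimeError('Environment variable {} is not set'.format(name))
    return value

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Stand-in environment for the demonstration; a real session would rely on os.environ
demo_env = {
    'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
    'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456',
}
client_id = load_credential('FOURSQUARE_CLIENT_ID', demo_env)
client_secret = load_credential('FOURSQUARE_CLIENT_SECRET', demo_env)
print('CLIENT_ID:', mask(client_id))
print('CLIENT_SECRET:', mask(client_secret))
```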


### Explore the first neighbourhood in the DataFrame

In [14]:
# first neighbourhood
df_locations.loc[0, 'Address']

'산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea'

In [15]:

# latitude and longitude of the first neighbourhood
neighbourhood_latitude = df_locations.loc[0, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = df_locations.loc[0, 'Longitude'] # neighborhood longitude value

neighbourhood_name = df_locations.loc[0, 'Address'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of 산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea are 37.337181039087454, 127.02259953357414.


In [16]:
#Get Request URL
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius
 # create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=FQIQDSZ2JNHV2YB4MGT0DK4SAHODBFZHSLJY35WM4TEQQEAJ&client_secret=THCXEARMB11F2KDNRUDU0GU35PKVT4MVFSP1CAJGWQET1CBP&v=20190605&ll=37.337181039087454,127.02259953357414&radius=500&limit=100'

In [17]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5dae9f88cf72a000397da32c'},
  'headerLocation': 'Current map view',
  'headerFullLocation': 'Current map view',
  'headerLocationGranularity': 'unknown',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 37.34168104358746,
    'lng': 127.02824877904665},
   'sw': {'lat': 37.33268103458745, 'lng': 127.01695028810163}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '514ec16ae4b08debeac5bc71',
       'name': '광동농원',
       'location': {'lat': 37.3358154296875,
        'lng': 127.02411651611328,
        'labeledLatLngs': [{'label': 'display',
          'lat': 37.3358154296875,
          'lng': 127.02411651611328}],
        'distance': 202,
        'cc': 'KR',
        'country': '대한민국',
        'formattedAddress': ['대한민국']},
      

In [18]:
# function that extracts the category of the venue (borrowed from Foursquare)
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [20]:
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | 광동농원 | Korean Restaurant | 37.335815 | 127.024117 |
| 1 | 상광교마을회관 | Bus Station | 37.338909 | 127.025948 |
| 2 | 오리농장 | Korean Restaurant | 37.334618 | 127.018562 |


In [21]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.
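
The cells above assume that the response always contains `response -> groups[0] -> items` and that every venue has at least one category. A more defensive extraction could look like the sketch below. `extract_venues` is an illustrative name, and the sample payload is hand-written to mimic the shape of the response, not real API output.

```python
def extract_venues(payload):
    """Return a list of (name, category, lat, lng) tuples from a venues/explore style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None   # tolerate venues without a category
        location = venue.get('location', {})
        rows.append((venue.get('name'), category, location.get('lat'), location.get('lng')))
    return rows

# Hand-written sample for illustration only
sample = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Venue A', 'categories': [{'name': 'Korean Restaurant'}],
                           'location': {'lat': 37.3358, 'lng': 127.0241}}},
                {'venue': {'name': 'Venue B', 'categories': [],
                           'location': {'lat': 37.3389, 'lng': 127.0259}}},
            ]
        }]
    }
}
for row in extract_venues(sample):
    print(row)
print(extract_venues({'response': {}}))
```

An empty or malformed response yields an empty list here, so a neighbourhood without venues would not stop a longer loop.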


### Explore all the neighbourhoods around Samsung Digital City

In [22]:
#Let's create a function to repeat the same process for all the neighbourhoods around Samsung Digital City
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [23]:
#Create SDC Neighbourhoods venues using above function
SDC_venues = getNearbyVenues(names=df_locations['Address'],
                             latitudes=df_locations['Latitude'],
                             longitudes=df_locations['Longitude']
                             )

산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
80 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
409-25 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
산99-6 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
364-1 Pajang-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
산121-1 Pajang-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
608 Pajang-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
46-15 Jeongja-ro 144beon-gil, Jeongja 1(il)-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
870-2 Jeongja-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
922-1 Jeongja-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea
723-5 Hwaseo 2(i)-dong, Paldal-gu, Suwon, Gyeonggi-do, South Korea
463 Guun-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea
535-8 Guun-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea
209 Tap-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea
140 Geumho-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea
1095-5 Homaesil-do
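
`getNearbyVenues` sends one HTTP request per neighbourhood, and a single failed request aborts the whole loop. A small retry wrapper with exponential backoff is one way to make a long run more robust. The sketch below is generic: `call_with_retries` is an illustrative name, and `flaky_fetch` only simulates a request that fails twice before succeeding.

```python
import time

def call_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print('attempt {} failed ({}); retrying in {}s'.format(attempt + 1, err, delay))
            sleep(delay)

# Simulated flaky request; it stands in for requests.get(url).json()
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return {'response': {'groups': [{'items': []}]}}

# The sleep function is replaced by a no-op so the demonstration finishes immediately
result = call_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, 'after', calls['count'], 'calls')
```

Inside the loop, the request line could then be wrapped as `call_with_retries(lambda: requests.get(url).json())`, assuming the same `url` variable as in the function above.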

In [25]:
print(SDC_venues.shape)
SDC_venues.head()

(3139, 7)


| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeon... | 37.337181 | 127.0226 | 광동농원 | 37.335815 | 127.024117 | Korean Restaurant |
| 1 | 산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeon... | 37.337181 | 127.0226 | 상광교마을회관 | 37.338909 | 127.025948 | Bus Station |
| 2 | 산34 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeon... | 37.337181 | 127.0226 | 오리농장 | 37.334618 | 127.018562 | Korean Restaurant |
| 3 | 80 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeong... | 37.3321 | 127.018378 | 광교산입구 | 37.334322 | 127.017133 | Trail |
| 4 | 80 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeong... | 37.3321 | 127.018378 | 폭포농원 | 37.334326 | 127.018045 | Korean Restaurant |


In [26]:
SDC_venues.groupby('Neighbourhood').count()

| Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| 1-1 Yeongtong 1(il)-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea | 10 | 10 | 10 | 10 | 10 | 10 |
| 1-11 Bora-dong, Giheung-gu, Yongin-si, Gyeonggi-do, South Korea | 3 | 3 | 3 | 3 | 3 | 3 |
| 1-68 Seodun-dong, Gwonseon-gu, Suwon, Gyeonggi-do, South Korea | 8 | 8 | 8 | 8 | 8 | 8 |
| 10-105 Annyeong-dong, Hwaseong-si, Gyeonggi-do, South Korea | 6 | 6 | 6 | 6 | 6 | 6 |
| 101-23 Jowon-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea | 4 | 4 | 4 | 4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 산86-4 Iui-dong, Yeongtong-gu, Suwon, Gyeonggi-do, South Korea | 2 | 2 | 2 | 2 | 2 | 2 |
| 산89 Sema-dong, Osan, Gyeonggi-do, South Korea | 2 | 2 | 2 | 2 | 2 | 2 |
| 산95 Yeongdeok-dong, Giheung-gu, Yongin-si, Gyeonggi-do, South Korea | 8 | 8 | 8 | 8 | 8 | 8 |
| 산99-6 Sanggwanggyo-dong, Jangan-gu, Suwon, Gyeonggi-do, South Korea | 1 | 1 | 1 | 1 | 1 | 1 |


In [27]:
print('There are {} unique categories.'.format(len(SDC_venues['Venue Category'].unique())))

There are 199 unique categories.
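
The grouping step that follows averages one-hot category columns per neighbourhood, which amounts to computing the share of each venue category within a neighbourhood. The sketch below shows the same calculation with the standard library only. `category_frequencies` is an illustrative name, and the sample pairs are the first five rows of `SDC_venues` shown above, with the neighbourhood names shortened.

```python
from collections import Counter, defaultdict

def category_frequencies(rows):
    """Map each neighbourhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in rows:
        counts[neighbourhood][category] += 1
    return {
        name: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for name, counter in counts.items()
    }

# (Neighbourhood, Venue Category) pairs from the first five rows of SDC_venues
rows = [
    ('산34 Sanggwanggyo-dong', 'Korean Restaurant'),
    ('산34 Sanggwanggyo-dong', 'Bus Station'),
    ('산34 Sanggwanggyo-dong', 'Korean Restaurant'),
    ('80 Sanggwanggyo-dong', 'Trail'),
    ('80 Sanggwanggyo-dong', 'Korean Restaurant'),
]
freqs = category_frequencies(rows)
for name, shares in freqs.items():
    print(name, shares)
```

Shares computed this way could later feed the criteria from the introduction, for example by comparing the share of park-like categories across neighbourhoods.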


In [None]:
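# one-hot encode the venue categories, then group rows by neighbourhood and take the mean of the frequency of occurrence of each category
SDC_onehot = pd.get_dummies(SDC_venues['Venue Category'])
SDC_onehot.insert(0, 'Neighbourhood', SDC_venues['Neighbourhood'])
SDC_grouped = SDC_onehot.groupby('Neighbourhood').mean().reset_index()
SDC_grouped.head()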

## End of document