<h2><center> The Effects of Spatial Agglomeration on Relative Net Profitability: a prospective look at Seoul’s retail market sector using machine learning </center></h2>

# Introduction

 Stakeholders looking to open new businesses often debate which part of a city to open in. The rise of specialized localities in Seoul has made this decision harder, since entrepreneurs now have to decide whether to open in a fiercely competitive neighborhood or risk failure in a less well-known location. This research therefore uses Seoul’s retail markets to assess whether clustered markets hold an inherent economic advantage over isolated ones.
 
 In particular, this research focuses on restaurants, cafes, and late-night food stores (such as chicken or pizza shops).

## Definitions

- **Retail Type**: a group of retail stores that share characteristics with one another. Retail types will be identified with the unsupervised K-means algorithm. The focus of this research is mainly on food and drink.

- **Cluster**: a location where stores of the same *retail type* are grouped together, as identified by the DBSCAN clustering algorithm. Stores of the same *retail type* that the algorithm does not assign to any cluster are considered *outliers*.
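
As a rough illustration of the *Cluster* definition, the sketch below runs DBSCAN on latitude/longitude pairs with a haversine metric. The coordinates, the 200 m neighborhood size (`eps_km`), and `min_samples=3` are assumptions chosen only for the example, not values used in this research.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def cluster_stores(coords_deg, eps_km=0.2, min_samples=3):
    """Label each (lat, lng) point with a cluster id; -1 marks an outlier."""
    coords_rad = np.radians(np.asarray(coords_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are measured in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)


# Hypothetical coordinates: five stores within a few dozen meters, one far away
stores = [
    (37.5665, 126.9780),
    (37.5666, 126.9781),
    (37.5664, 126.9779),
    (37.5667, 126.9782),
    (37.5665, 126.9783),
    (37.4000, 127.1000),
]
labels = cluster_stores(stores)
```

Points that share a non-negative label belong to the same cluster, while a label of `-1` corresponds to the *outliers* in the definition above.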


## Data

Based on the definition of our problem, some factors that will influence our discussion are:
- locations of popular districts in the city of Seoul
- types of retail stores, such as cafes, restaurants, and convenience stores


# Research
First, let's import all the necessary libraries for this project:

In [1]:
# Exploratory Data Analysis
import pandas as pd
import numpy as np

# Url requests & parse
import urllib.request, requests
from bs4 import BeautifulSoup

# Data Visualization
import matplotlib.pyplot as plt
import folium

# File Management
import re, os, sys
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

## Retrieve Locations of Interest

Here, I will create a function that uses Kakao's map API to return *latitude* and *longitude* coordinates based on the centroids of our candidate neighborhoods. The function calls a REST API on Kakao's map servers, which returns a JSON response containing information about the query. Since I am only interested in the specific location, I will return only the coordinates.

Eventually, these coordinates will give us an area where we will run K-means clustering algorithms to find the *types* of stores using their most defining characteristics.
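
The lookup described above can be split into two steps that are testable without a network call: building the request URL and extracting the coordinates from the response. This is a minimal sketch; the helper names `build_address_url` and `parse_first_match` are hypothetical, and the sample payload is a hand-written stand-in reduced to the fields the notebook reads.

```python
import json
from urllib.parse import urlencode

KAKAO_ADDRESS_URL = "https://dapi.kakao.com/v2/local/search/address.json"


def build_address_url(addr):
    """Return the address-search URL with the query percent-encoded."""
    return KAKAO_ADDRESS_URL + "?" + urlencode({"query": addr})


def parse_first_match(payload):
    """Return (lat, lng) of the first document, or None if there is no match."""
    documents = payload.get("documents", [])
    if not documents:
        return None
    address = documents[0]["address"]
    return float(address["y"]), float(address["x"])


# Hypothetical response body, reduced to the fields used here
sample = json.loads('{"documents": [{"address": {"x": "126.9780", "y": "37.5665"}}]}')
coords = parse_first_match(sample)
empty = parse_first_match({"documents": []})
url = build_address_url("서울")
```

Returning `None` for an empty result keeps the "no match" case explicit, so callers can check for it before unpacking the coordinates.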

In [2]:
Kakao_appkey = 'KakaoAK 1dd987b2b0f2925d8b7ea121281310c8'

def getLatLng(addr):
    url = 'https://dapi.kakao.com/v2/local/search/address.json?query='+addr
    headers = {"Authorization": Kakao_appkey}
    result = json.loads(str(requests.get(url,headers=headers).text))
    try:
        match_first = result['documents'][0]['address']
        return float(match_first['y']),float(match_first['x'])
    except IndexError:
        print("Not documented in API")
        
    return None

def queryAddress(query, lat='37.5665', lng='126.9780', radius='20000'):
    # Return the coordinates of the first match for `query` within `radius` meters
    # (default 20 km) of (lat, lng); the default coordinates are centered on Seoul
    url = 'https://dapi.kakao.com/v2/local/search/keyword.json?y={}&x={}&radius={}&query={}'.format(lat, lng, radius, query)
    headers = {"Authorization": Kakao_appkey}
    result = json.loads(str(requests.get(url,headers=headers).text))
    try:
        return float(result['documents'][0]['y']), float(result['documents'][0]['x'])
    except IndexError:
        return None

Seoul_lat, Seoul_lng = getLatLng("서울")

Now that we have a function for returning the coordinates of a specific address, let's use an Excel file containing all the subway stations and create a DataFrame.

**Note**: *This assumes that the areas surrounding subway stations are usually neighborhoods of high interest.*

In [3]:
df = pd.read_excel("Data/subwayStationNames.xlsx")
df = df.drop(['연번', '한자', '중국어', '일본어'], axis=1)
df.head(3)

Unnamed: 0,호선,역명,영문
0,1호선,서울역,Seoul Station
1,1호선,시청,City Hall
2,1호선,종각,Jonggak


Next, we use the functions above to look up the coordinates of each station and append them to the DataFrame. At the same time, we drop any row whose station is not found within a 20 km radius of the center of Seoul.
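
The notebook relies on the API's `radius` parameter to exclude distant stations. As an independent sanity check, the 20 km condition can also be tested locally with the haversine formula. This is only a sketch: `haversine_m` and `within_radius` are hypothetical helpers, and the default center reuses the default coordinates of `queryAddress`.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_radius(lat, lng, center=(37.5665, 126.9780), radius_m=20000):
    """True if (lat, lng) lies within radius_m meters of center."""
    return haversine_m(center[0], center[1], lat, lng) <= radius_m


# Seoul Station coordinates as they appear in the cleaned DataFrame
seoul_station_ok = within_radius(37.554679, 126.970607)
# A hypothetical point several hundred kilometers to the southeast
far_point_ok = within_radius(35.1796, 129.0756)
```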

In [4]:
# First store the station names in query format in an array
stations = []

for idx in df.index:
    name = (''.join(df['역명'][idx].split(' '))).split('(', 1)[0]
    if name[-1] != '역':
        name = name + '역'
    stations.append(name)

# Then we iterate over the array and find the coordinates for each element

station_lat = []
station_lng = []

for idx in range(len(stations)):
    try:
        lat, lng = queryAddress(stations[idx])
        station_lat.append(lat)
        station_lng.append(lng)
    except TypeError:
        print("{} not found within map radius".format(stations[idx]))
        df.drop([idx], axis=0, inplace=True) # drop the row containing unneeded locations
    

df['Latitude'] = station_lat
df['Longitude'] = station_lng
df = df.reset_index()

삼산체육관역 not found within map radius
굴포천역 not found within map radius
부평구청역 not found within map radius
남한산성입구역 not found within map radius
단대오거리역 not found within map radius
신흥역 not found within map radius
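
The name clean-up in the first loop above can be isolated into a small function, which makes it easy to check on a few inputs. This is a sketch of the same rule (remove spaces, drop any parenthesized suffix, make sure the name ends in '역'); the function name and the last sample input are hypothetical.

```python
def normalize_station_name(raw):
    """Remove spaces, drop a parenthesized suffix, and ensure the name ends in '역'."""
    name = raw.replace(' ', '').split('(', 1)[0]
    if not name.endswith('역'):
        name += '역'
    return name


samples = ['서울역', '시청', '종로3가', '총신대입구(이수)']
normalized = [normalize_station_name(s) for s in samples]
```

Using `endswith` instead of indexing the last character also avoids an `IndexError` on an empty name.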


Finally, let's clean up our DataFrame so that it is ready to be used for our analysis:

In [5]:
df = df.drop(['index'], axis=1)
df = df.rename(columns={'호선': 'LineNumber', '역명' : 'StationKor', '영문' : 'StationEng'})
df.head()

Unnamed: 0,LineNumber,StationKor,StationEng,Latitude,Longitude
0,1호선,서울역,Seoul Station,37.554679,126.970607
1,1호선,시청,City Hall,37.565344,126.977199
2,1호선,종각,Jonggak,37.570229,126.983152
3,1호선,종로3가,Jongno 3(sam)ga,37.570421,126.992153
4,1호선,종로5가,Jongno 5(o)ga,37.570976,127.001539


Now that we have all the locational data for our neighborhoods, let's visualize the data:

- Seoul center location
- Candidate neighborhoods based upon proximity to stations

In [6]:
map_Seoul = folium.Map(location=[Seoul_lat, Seoul_lng], zoom_start=12)
for index, row in df.iterrows():
    # Access columns by name; positional access on a labeled row is deprecated in recent pandas
    label = '{}, {}'.format(row['StationKor'], row['LineNumber'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#333333',
        fill_opacity=0.7,
        parse_html=False
    ).add_to(map_Seoul)

map_Seoul

## Data Exploration

Now that we have our neighborhoods, we can use either Kakao's own search API or Foursquare's API to find the nearest retail stores of interest. In my research, I will use Foursquare's venue API to look specifically for restaurants, cafes, and late-night food stores.

### Nearby retail stores using Foursquare API

Foursquare credentials are in the cell below.

In [7]:
CLIENT_ID = 'RHLWHDGP2UUV5VAZOUNVEZOC3G5WQ00DPXYVYQIZIQ1BUY1G' # your Foursquare ID
CLIENT_SECRET = 'DAMEJ4SSMIYJ2UXOBS1I33K1DR0VG42RGQHTLBKKZSKPLRZZ' # your Foursquare Secret
VERSION = '20200228' # Foursquare API version

Create a "filter" of sorts that will find which type of restaurant a certain venue is.

In [8]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

korean_restaurant_categories = [
    '4bf58dd8d48988d113941735','56aa371be4b08b9a8d5734e4','56aa371be4b08b9a8d5734f0',
    '56aa371be4b08b9a8d5734e7','56aa371be4b08b9a8d5734ed','56aa371be4b08b9a8d5734ea',
    '52af0bd33cf9994f4e043bdd','4bf58dd8d48988d145941735','4bf58dd8d48988d111941735',
    '55a59bace4b013909087cb24','55a59bace4b013909087cb15','55a59bace4b013909087cb27',
    '4bf58dd8d48988d1d2941735','55a59bace4b013909087cb2a','4bf58dd8d48988d1d1941735',
    '4bf58dd8d48988d149941735','4bf58dd8d48988d14a941735','4bf58dd8d48988d1df931735',
    '52e81612bcbc57f1066b79f4','4bf58dd8d48988d16c941735','4bf58dd8d48988d108941735',
    '4bf58dd8d48988d109941735','52e81612bcbc57f1066b7a05','4bf58dd8d48988d10c941735',
    '52e81612bcbc57f1066b79ff','4bf58dd8d48988d10f941735','4bf58dd8d48988d110941735',
    '4bf58dd8d48988d1c1941735','4bf58dd8d48988d153941735','4bf58dd8d48988d151941735',
    '4bf58dd8d48988d1c4941735','4bf58dd8d48988d1ce941735','4bf58dd8d48988d1cc941735',
    '56aa371be4b08b9a8d573538','4bf58dd8d48988d1d3941735'
]
# Further research can also yield the best places to open a bar

cafe_categories = [
    '4bf58dd8d48988d1dc931735','4bf58dd8d48988d1c5941735','4bf58dd8d48988d1bd941735','4bf58dd8d48988d112941735',
    '4bf58dd8d48988d148941735','52e81612bcbc57f1066b7a0a','5744ccdfe4b0c0459246b4e2','4bf58dd8d48988d1c9941735',
    '512e7cae91d4cbb4e5efe0af','4bf58dd8d48988d1bc941735','4bf58dd8d48988d1d0941735','4bf58dd8d48988d146941735',
    '4bf58dd8d48988d1e0931735','4bf58dd8d48988d16d941735','52e81612bcbc57f1066b7a0c','4bf58dd8d48988d16a941735',
    '4bf58dd8d48988d179941735', '4bf58dd8d48988d143941735','4bf58dd8d48988d14f941735'
]

late_night_categories = [
    '52e81612bcbc57f1066b7a06','4bf58dd8d48988d1ca941735','4bf58dd8d48988d14c941735','4d4ae6fc7a7b7dea34424761'
]

Now, implement functions required to filter the data and GET the venues using the API.

In [17]:
# Use the filters I defined above to categorize the venue as a restaurant, cafe, or late-night joint
def is_venue(categories, specific_filter=None):
    specific = False
    for c in categories:
        category_id = c[1]
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
    return specific

def get_categories(categories):
    return [(a['name'], a['id']) for a in categories]

def getNearbyVenues(lat, lon, category, radius, limit=100):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, lat, lon, category, radius, limit)
    #url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    #CLIENT_ID, CLIENT_SECRET, VERSION, lat, lon, radius, limit)
    results = requests.get(url).json()['response']['groups'][0]['items']
    venues = [(item['venue']['id'],
                item['venue']['name'],
                get_categories(item['venue']['categories']),
                (item['venue']['location']['lat'], item['venue']['location']['lng']),
                item['venue']['location']['formattedAddress'][0]) for item in results]        
    return venues

def createTable(names, lats, lngs, radius=500):
    restaurants = {}
    korean_restaurants = {}
    cafes = {}
    night_snacks = {}
    all_venues = []
    print("loading data", end=' ')
    for name, lat, lon in zip(names, lats, lngs):
        venues = getNearbyVenues(lat, lon, food_category, radius)
        area = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_coord = venue[3]
            venue_address = venue[4]
            restaurant = (name, venue_categories[0][0], venue_name, venue_coord[0], venue_coord[1], venue_address)
            area.append(restaurant)
            restaurants[venue_id] = restaurant
            
            if is_venue(venue_categories, specific_filter=korean_restaurant_categories):
                korean_restaurants[venue_id] = restaurant
            elif is_venue(venue_categories, specific_filter=cafe_categories):
                cafes[venue_id] = restaurant
            elif is_venue(venue_categories, specific_filter=late_night_categories):
                night_snacks[venue_id] = restaurant
        if len(area) == 0:
            area.append((name, 'none', 'none', 'none', 'none', 'none'))
        all_venues.append(area)
        print('.', end='')
    print('\nFinished loading data.')
    return restaurants, korean_restaurants, cafes, night_snacks, all_venues
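
The `if`/`elif` chain in `createTable` assigns each venue to at most one group, so a venue whose categories match several filters is counted only in the first one. The sketch below reproduces that priority order with set intersections so it can be checked without calling the API. The `classify` helper and its group names are hypothetical; each filter list is shortened to a single ID taken from the lists above.

```python
def classify(category_ids, korean_ids, cafe_ids, late_night_ids):
    """Return the first matching group name, mirroring the if/elif order."""
    ids = set(category_ids)
    if ids & set(korean_ids):
        return 'korean'
    if ids & set(cafe_ids):
        return 'cafe'
    if ids & set(late_night_ids):
        return 'late_night'
    return 'other'


# One ID from each filter list defined earlier in the notebook
korean_ids = ['4bf58dd8d48988d113941735']
cafe_ids = ['4bf58dd8d48988d1dc931735']
late_night_ids = ['4bf58dd8d48988d1ca941735']

only_cafe = classify(['4bf58dd8d48988d1dc931735'], korean_ids, cafe_ids, late_night_ids)
both = classify(['4bf58dd8d48988d1ca941735', '4bf58dd8d48988d113941735'],
                korean_ids, cafe_ids, late_night_ids)
unknown = classify(['not-a-listed-id'], korean_ids, cafe_ids, late_night_ids)
```

Converting the filter lists to sets once, outside the loop, would also make the membership checks faster when many venues are processed.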

Now we just need to load all of our data into memory:

In [18]:
restaurants = {}
korean_restaurants = {}
cafes = {}
night_snacks = {}
all_venues = []

restaurants, korean_restaurants, cafes, night_snacks, all_venues = createTable(df['StationEng'], df['Latitude'], df['Longitude'])

loading data .............................................................................................................................................................................................................................................................................................
Finished loading data.


In [60]:
venues = []
for area in all_venues:
    for location in area:
        venues.append(location)
df_venues = pd.DataFrame(venues, columns=['Station', 'Retail Type', 'Name', 'Latitude', 'Longitude', 'Address'])
df_venues = df_venues[df_venues.Latitude != 'none']
df_venues.head()

Unnamed: 0,Station,Retail Type,Name,Latitude,Longitude,Address
0,Seoul Station,Café,THE HOUSE 1932 (더하우스 1932),37.5555,126.967,만리재로35길 5
1,Seoul Station,Korean Restaurant,금자네생등심,37.5574,126.972,중구 통일로 22-4
2,Seoul Station,Bakery,Paris Baguette (파리바게뜨),37.555,126.972,중구 통일로 1 (PB서울역사점)
3,Seoul Station,Ramen Restaurant,유즈라멘,37.5569,126.968,중구 만리재로 217
4,Seoul Station,French Restaurant,Seasons (시즌즈),37.5552,126.975,중구 소월로 50


## K-Means Clustering the Data

Now that we have our data on retail stores in Seoul, we can run a K-Means algorithm to cluster it. The clustering function below takes the number of clusters *k* as an argument, so it can be rerun with various *k* values in order to find a suitable *k*.

Since the purpose of this research is to see how clustering affects a store's ability to earn money, we start clustering the stores based on their retail type relative to the nearby station in which they are located.

First, define all the remaining DataFrames from the dictionaries we created in the code above.

In [36]:
df_korean = pd.DataFrame.from_dict(korean_restaurants).transpose().reset_index()
df_cafes = pd.DataFrame.from_dict(cafes).transpose().reset_index()
df_night = pd.DataFrame.from_dict(night_snacks).transpose().reset_index()

### One-Hot Encoding the DataFrames

In order to implement a K-Means algorithm on the DataFrames, we need to first apply one-hot encoding on the DataFrames. The following function will take in a DataFrame as input and output a DataFrame that is one-hot encoded.

In [62]:
def one_hot_encode(dataframe):
    # one hot encoding
    onehot = pd.get_dummies(dataframe[[1]], prefix="", prefix_sep="")
    
    # add defining features back to the one hot dataframe
    onehot[['Neighborhood', 'Latitude', 'Longitude']] = dataframe[[0,3,4]]
    
    # move the defining columns back to the first three columns of the dataframe
    fixed_columns = [onehot.columns[-3],onehot.columns[-2],onehot.columns[-1]] + list(onehot.columns[:-3])
    onehot = onehot[fixed_columns]
    
    # return the one hot encoded DataFrame
    return onehot

korean_grouped = one_hot_encode(df_korean).groupby('Neighborhood').mean().reset_index()
cafes_grouped = one_hot_encode(df_cafes).groupby('Neighborhood').mean().reset_index()
night_grouped = one_hot_encode(df_night).groupby('Neighborhood').mean().reset_index()
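
Taking the mean of one-hot columns per neighborhood, as done above, gives the share of each retail type among that neighborhood's venues. The sketch below computes the same quantity with plain Python on a few hand-written rows, which may make the resulting feature vectors easier to interpret. The `type_shares` helper and the sample rows are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def type_shares(rows):
    """rows: (neighborhood, retail_type) pairs -> {neighborhood: {type: share}}."""
    rows = list(rows)
    counts = defaultdict(Counter)
    for neighborhood, retail_type in rows:
        counts[neighborhood][retail_type] += 1
    all_types = sorted({retail_type for _, retail_type in rows})
    return {
        neighborhood: {t: c[t] / sum(c.values()) for t in all_types}
        for neighborhood, c in counts.items()
    }


# Hypothetical rows in the same (station, retail type) form as the venue table
rows = [
    ('Seoul Station', 'Café'),
    ('Seoul Station', 'Bakery'),
    ('Seoul Station', 'Café'),
    ('Seoul Station', 'Korean Restaurant'),
    ('City Hall', 'Café'),
    ('City Hall', 'Café'),
]
shares = type_shares(rows)
```

Each neighborhood's shares sum to 1, so K-Means on these vectors groups neighborhoods by the mix of retail types rather than by the number of venues.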

### K-Means Clustering

In [73]:
from sklearn.cluster import KMeans

def K_means_cluster(kclusters, dataframe):
    # Set number of clusters with the kclusters argument
    cluster = dataframe.drop('Neighborhood', axis=1)  # the positional axis argument was removed in pandas 2.0
    kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster)
    
    merged_df = pd.DataFrame()
    merged_df['Cluster Label'] = kmeans.labels_
    merged_df['Station'] = dataframe['Neighborhood']
    
    merged = df
    merged = merged.join(merged_df.set_index('Station'), on='StationEng').dropna()
    
    return merged
    
    
korean_merged = K_means_cluster(10, korean_grouped)
korean_merged.head()

Unnamed: 0,LineNumber,StationKor,StationEng,Latitude,Longitude,Cluster Label
0,1호선,서울역,Seoul Station,37.554679,126.970607,5.0
1,1호선,시청,City Hall,37.565344,126.977199,4.0
2,1호선,종각,Jonggak,37.570229,126.983152,1.0
4,1호선,종로5가,Jongno 5(o)ga,37.570976,127.001539,4.0
5,1호선,동대문,Dongdaemun,37.571668,127.010632,1.0
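
The call above fixes *k* at 10. One common way to compare candidate values of *k* is the elbow heuristic: fit K-Means for several values and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below shows the idea on synthetic data; the `inertia_by_k` helper, the synthetic blobs, and the range of *k* are assumptions and are not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(features, k_values, random_state=0):
    """Fit K-Means once per k and return the list of inertias."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(features)
        inertias.append(model.inertia_)
    return inertias


# Synthetic stand-in for the grouped feature table: three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = inertia_by_k(features, k_values)
```

Plotting `inertias` against `k_values` with matplotlib and picking the bend in the curve is a heuristic rather than a guarantee, so the chosen *k* should still be checked against the resulting map.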


In [70]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[Seoul_lat, Seoul_lng], zoom_start=12)

# set color scheme for the clusters
kclusters = 10  # must match the k passed to K_means_cluster above
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for index, row in korean_merged.iterrows():
    # Access columns by name and use the cluster label directly as the color index
    cluster_label = int(row['Cluster Label'])
    label = 'Cluster: {}, {}, {}'.format(cluster_label, row['StationEng'], row['LineNumber'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=10,
        popup=label,
        color=rainbow[cluster_label],
        fill=True,
        fill_color=rainbow[cluster_label],
        fill_opacity=0.5,
        parse_html=False
    ).add_to(map_clusters)

map_clusters