In [1]:
# @hidden_cell
#
foursquare_credentials = {
    'client_id' : 'REMOVED_FOR_PUBLIC',
    'client_secret' : 'REMOVED_FOR_PUBLIC'
}

# Project Description

Finding a good location to open a hotel can be challenging. This project helps stakeholders evaluate the surrounding environment and identify promising candidate locations.

The city to be evaluated is Xi'an, where I live with my family. Xi'an is a tourist city in China, famous for its long history. Many visitors come to Xi'an and choose a museum as a starting point for getting to know the city. So in this project, preferred locations for a new hotel should be close to museums but should not have similar hotels nearby. For convenience, a place close to metro stations and to restaurants serving local food is also desirable.

I will use data science techniques to help stakeholders find a few such locations.

In [2]:
import requests
import folium
import json
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

The data I need to collect includes the location and category of venues in Xi'an. I will use the Foursquare API to obtain this data for hotels, museums, restaurants, and metro stations.

The IDs of the four categories above can be found on the Foursquare Developer page (https://developer.foursquare.com/docs/resources/categories). For restaurants, I didn't choose the generic 'Food' category. Instead, I use Shaanxi Restaurant, because the local food here is almost as famous as the city's history. There are many "must-have" dishes, so a place close to several local restaurants has an advantage. The ID of each category is listed below.

- History Museum (_**4bf58dd8d48988d190941735**_)

- Shaanxi Restaurant (_**52af3b633cf9994f4e043c01**_)

- Hotel (_**4bf58dd8d48988d1fa931735**_)

- Metro Station (_**4bf58dd8d48988d1fd931735**_)

I define some functions to obtain data from Foursquare. The "Search for Venues" API (https://developer.foursquare.com/docs/api/venues/search) returns a list of venues. Instead of the default center/radius parameters, I use a bounding-box search: I divide the main urban area of Xi'an into a 10×10 grid and use each grid cell as a bounding box when searching for venues of each category.
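The grid of bounding boxes described above can be sketched as a small generator. This is only an illustration: `grid_cells` is a hypothetical helper, not part of the notebook, and the origin and cell size passed to it mirror the values used in the collection loop further down.

```python
def grid_cells(lat0, lng0, delta, n):
    """Yield (sw_lat, sw_lng, ne_lat, ne_lng) for each cell of an n x n grid.

    lat0/lng0 is the south-west corner of the whole grid and delta is the
    cell size in degrees (assumed equal for latitude and longitude).
    """
    for i in range(n):
        for j in range(n):
            yield (lat0 + delta * i, lng0 + delta * j,
                   lat0 + delta * (i + 1), lng0 + delta * (j + 1))

cells = list(grid_cells(34.2, 108.85, 0.02, 10))
print(len(cells))   # number of bounding boxes to query
print(cells[0])     # south-west-most cell
```

Each tuple can be passed directly as the `sw` and `ne` corners of one bounding-box query, so one pass over `cells` covers the whole area without overlap.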

In [3]:
client_id=foursquare_credentials['client_id']
client_secret=foursquare_credentials['client_secret']

In [4]:
# Browse venues within a bounding box
def bb_search(sw_lat,sw_lng,ne_lat,ne_lng,
              categories,
              client_id=client_id,client_secret=client_secret,
              intent='browse',
              version='20190101', limit=5):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&sw={},{}&ne={},{}&intent={}&categoryId={}&v={}&limit={}'.format(
        client_id, client_secret,sw_lat,sw_lng,ne_lat,ne_lng,intent,categories,version,limit)
    try:
        results = requests.get(url).json()
    except (requests.RequestException, ValueError):
        # Network failure or non-JSON body: return an empty result with the
        # same shape as a normal response so callers can still index into it
        results = {'response': {'venues': []}}
    return results

def get_latitude(location):
    return location['lat']
def get_longitude(location):
    return location['lng']
def get_category(categories):
    return categories[0]['shortName']

def get_venues(res):
    # A failed or error response may not contain any venues; treat it as empty
    venues = res.get('response', {}).get('venues', []) if isinstance(res, dict) else []
    df = pd.DataFrame(venues)
    if df.empty:
        return df
    # Work on a copy so the new columns don't trigger SettingWithCopyWarning
    df1 = df[['location','name','categories']].copy()
    df1['latitude'] = df1['location'].apply(get_latitude)
    df1['longitude'] = df1['location'].apply(get_longitude)
    df1['category'] = df1['categories'].apply(get_category)
    return df1[['name','category','latitude','longitude']]

In the code below I initialize four pandas DataFrame objects to hold the four categories of venues. I compute the offset and size of each grid cell, then call the bb_search() function defined above to retrieve the venues inside that cell and append them to the corresponding DataFrame.

In [5]:
columns=['name','category','latitude','longitude']
museums = pd.DataFrame(columns=columns)
restaurants = pd.DataFrame(columns=columns)
hotels = pd.DataFrame(columns=columns)
metros = pd.DataFrame(columns=columns)

# south-west corner (lat, lng) of the search grid over Xi'an
lat=34.2
lng=108.85
delta=0.02
for i in range(0,10):
    for j in range(0,10):
        sw_lat = lat + delta*i
        sw_lng = lng + delta*j
        ne_lat = lat + delta*(i+1)
        ne_lng = lng + delta*(j+1)
        #History Museum (4bf58dd8d48988d190941735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d190941735')
        df = get_venues(res)
        if not df.empty:
            museums=pd.concat([museums,df],ignore_index=True)
        #Shaanxi Restaurant (52af3b633cf9994f4e043c01)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='52af3b633cf9994f4e043c01',limit=50)
        df = get_venues(res)
        if not df.empty:
            restaurants=pd.concat([restaurants,df],ignore_index=True)
        #Hotel (4bf58dd8d48988d1fa931735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d1fa931735',limit=50)
        df = get_venues(res)
        if not df.empty:
            hotels=pd.concat([hotels,df],ignore_index=True)
        #Metro Station (4bf58dd8d48988d1fd931735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d1fd931735')
        df = get_venues(res)
        if not df.empty:
            metros=pd.concat([metros,df],ignore_index=True)

Now let's take a look at the data, which will be used to calculate the locations of interest.

In [6]:
museums.shape

(14, 4)

In [7]:
restaurants.shape

(147, 4)

In [8]:
hotels.shape

(413, 4)

In [9]:
metros.shape

(47, 4)

The query returned a total of 14 museums, 147 restaurants, 413 hotels, and 47 metro stations.
Let's overlay the restaurants on a map of Xi'an to get an overview of how the Shaanxi restaurants are distributed.

In [10]:
map = folium.Map(location=(34.251568, 108.940178), zoom_start=12)

for index, row in restaurants.iterrows():
    label = '{}'.format(row['name'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['latitude'], row['longitude']],
        radius=3,
        popup=label,
        color='red',
        fill=True).add_to(map) 

display(map)

There are hundreds of restaurants and hotels, so as a first step I cluster them. Further investigation is based on the clusters.
Several clustering algorithms are available. For this project I use DBSCAN rather than k-Means. k-Means is a partitional technique that requires the number of clusters to be chosen in advance and assumes roughly spherical clusters, so it is not a good fit for this geospatial data. DBSCAN is a density-based technique that helps reveal where the venues of interest are concentrated.
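With `min_samples=1`, as in the cells that follow, every point counts as a core point, so DBSCAN reduces to grouping points that are linked by a chain of neighbours no farther apart than `eps`. The sketch below illustrates that behaviour in plain Python on made-up coordinates. `haversine_km` and `group_within` are hypothetical stand-ins for illustration, not the scikit-learn implementation.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    lat1, lng1 = radians(p[0]), radians(p[1])
    lat2, lng2 = radians(q[0]), radians(q[1])
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def group_within(points, eps_km):
    """Label points so that points chained by hops of at most eps_km share a label."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue  # already assigned to a group
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for k in range(len(points)):
                if labels[k] == -1 and haversine_km(points[i], points[k]) <= eps_km:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

# Illustrative coordinates only: two nearby points and one several km away
points = [(34.2451, 108.9424), (34.2500, 108.9500), (34.2578, 109.0303)]
labels = group_within(points, 1.5)
print(labels)
```

Because no point can be labelled as noise in this setting, the number of distinct labels equals the number of groups, which is why counting `set(labels)` gives the cluster count.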

In [11]:
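# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
restaurants_coords = restaurants[['latitude','longitude']].to_numpy(dtype=float)
kms_per_radian = 6371.0088
# The haversine metric works in radians, so express the 1.5 km radius in radians
epsilon = 1.5 / kms_per_radian
dc4r = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(restaurants_coords))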
dc4r_labels = dc4r.labels_
dc4r_num_clusters = len(set(dc4r_labels))
restaurants_clusters = pd.Series([restaurants_coords[dc4r_labels == n] for n in range(dc4r_num_clusters)])
print('Number of clusters for restaurants: {}'.format(dc4r_num_clusters))

Number of clusters for restaurants: 9


Now let's do the same for hotels.

In [12]:
hotels_coords = hotels[['latitude','longitude']].to_numpy(dtype=float)
dc4h = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(hotels_coords))
dc4h_labels = dc4h.labels_
dc4h_num_clusters = len(set(dc4h_labels))
hotels_clusters = pd.Series([hotels_coords[dc4h_labels == n] for n in range(dc4h_num_clusters)])
print('Number of clusters for hotels: {}'.format(dc4h_num_clusters))

Number of clusters for hotels: 11


I noticed some duplicated records among the museums: some museums were reported under both an English name and a Chinese name, at almost the same geographic location. As a simple fix, I also use DBSCAN clustering to merge the duplicated records.

In [13]:
museums_coords = museums[['latitude','longitude']].to_numpy(dtype=float)
dc4m = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(museums_coords))
dc4m_labels = dc4m.labels_
dc4m_num_clusters = len(set(dc4m_labels))
museums_clusters = pd.Series([museums_coords[dc4m_labels == n] for n in range(dc4m_num_clusters)])
print('Number of clustered museums: {}'.format(dc4m_num_clusters))

Number of clustered museums: 10


For each cluster, I'll need to find the centroid, so let's define a function to do so:

In [14]:
def get_cluster_center(cluster):
    # Compute the centroid once, then return the cluster member closest to it
    centroid_point = MultiPoint(cluster).centroid
    centroid = (centroid_point.x, centroid_point.y)
    center = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(center)

I use ***great_circle*** from the ***geopy.distance*** package and ***MultiPoint*** from the ***shapely.geometry*** package for the calculation. There are other ways of calculating the distance between two points given by latitude/longitude that are claimed to be more precise (http://www.movable-type.co.uk/scripts/latlong.html). For this study the precision requirements are not strict, so I use great_circle.
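The idea behind get_cluster_center can be illustrated without the two libraries. The sketch below assumes that the centroid of a set of points is simply the mean of their coordinates and substitutes a hand-written haversine formula for geopy's great_circle. `haversine_m` and `nearest_to_mean` are hypothetical names, and the coordinates are made up.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q, radius_m=6371008.8):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    lat1, lng1 = radians(p[0]), radians(p[1])
    lat2, lng2 = radians(q[0]), radians(q[1])
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * radius_m * asin(sqrt(h))

def nearest_to_mean(points):
    """Return the member of `points` that lies closest to their coordinate mean."""
    mean = (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))
    return min(points, key=lambda p: haversine_m(p, mean))

# Illustrative cluster of three venues
cluster = [(34.240, 108.930), (34.242, 108.934), (34.250, 108.950)]
center = nearest_to_mean(cluster)
print(center)
```

Returning an actual member rather than the raw mean guarantees that the representative point is a real venue location and not, for example, the middle of a park or a river.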

Now let's calculate center of clusters for restaurants.

In [15]:
restaurants_centers = restaurants_clusters.map(get_cluster_center)
latitude, longitude = zip(*restaurants_centers)
restaurants_centers_df = pd.DataFrame({'latitude':latitude,'longitude':longitude})
restaurants_centers_df.head()

    latitude   longitude
0   34.20513   108.90791
1  34.245114  108.942401
2   34.24544  109.002978
3  34.257818   109.03029
4  34.269761  108.886546


Do the same for hotels.

In [16]:
hotels_centers = hotels_clusters.map(get_cluster_center)
latitude, longitude = zip(*hotels_centers)
hotels_centers_df = pd.DataFrame({'latitude':latitude,'longitude':longitude})
hotels_centers_df.head()

    latitude   longitude
0  34.256277  108.941265
1  34.206074   108.85174
2  34.200749  108.933558
3  34.286003  108.911545
4  34.311146  109.016194


And for museums:

In [17]:
museums_centers = museums_clusters.map(get_cluster_center)
latitude, longitude = zip(*museums_centers)
museums_centers_df = pd.DataFrame({'latitude':latitude,'longitude':longitude})
museums_centers_df.head()

    latitude   longitude
0   34.20018  108.885365
1  34.216285  108.958955
2  34.240831  108.937348
3  34.246964  108.904415
4  34.253446  108.927099


Now let's use this cluster information to find suitable locations for a new hotel. To recap, the location should be close to a museum, close to restaurants serving local food, and close to a metro station, while keeping its distance from existing hotels.

The idea is as follows: for each museum, find the closest restaurant cluster and the closest metro station; these three points form a triangle on the map. The first step is to find the point in the triangle that minimizes the sum of the distances to the three vertices. This point is known as the Fermat point (https://en.wikipedia.org/wiki/Fermat_point).

Many implementations for finding the Fermat point of a given triangle exist; the point is listed as center X(13) in the Encyclopedia of Triangle Centers (http://faculty.evansville.edu/ck6/encyclopedia/ETC.html). In this study, I use a solution published by **abybaddi009** on *Stack Exchange* (https://codegolf.stackexchange.com/questions/79691/calculate-the-fermat-point-of-a-triangle).
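For readers who prefer named steps, here is a more verbose sketch of a Fermat point computation. It assumes planar (x, y) coordinates, which for latitude/longitude is only an approximation over short distances, and it uses the barycentric weights a·sec(A − π/6) of center X(13). `angle_at` and `fermat_point` are illustrative names, not taken from the referenced solution.

```python
from math import acos, cos, hypot, isclose, pi, sqrt

def angle_at(P, Q, R):
    """Interior angle at vertex P of triangle PQR, in radians (law of cosines)."""
    a = hypot(Q[0] - R[0], Q[1] - R[1])  # side opposite P
    b = hypot(P[0] - R[0], P[1] - R[1])
    c = hypot(P[0] - Q[0], P[1] - Q[1])
    cos_p = (b * b + c * c - a * a) / (2 * b * c)
    return acos(max(-1.0, min(1.0, cos_p)))  # clamp against rounding error

def fermat_point(A, B, C):
    """Point minimizing the summed distance to A, B and C (non-degenerate triangle)."""
    angles = [angle_at(A, B, C), angle_at(B, C, A), angle_at(C, A, B)]
    # If any angle is 120 degrees or more, that vertex is the minimizer.
    for vertex, ang in zip((A, B, C), angles):
        if ang >= 2 * pi / 3:
            return tuple(vertex)
    sides = [hypot(B[0] - C[0], B[1] - C[1]),   # a, opposite A
             hypot(C[0] - A[0], C[1] - A[1]),   # b, opposite B
             hypot(A[0] - B[0], A[1] - B[1])]   # c, opposite C
    # Barycentric weights of X(13): side length times sec(angle - pi/6)
    weights = [s / cos(ang - pi / 6) for s, ang in zip(sides, angles)]
    total = sum(weights)
    return tuple((weights[0] * A[i] + weights[1] * B[i] + weights[2] * C[i]) / total
                 for i in (0, 1))

# For an equilateral triangle the Fermat point coincides with the centroid
A, B, C = (0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)
P = fermat_point(A, B, C)
print(P)
```

The function does not guard against degenerate triangles in which two vertices coincide; in that case a side length is zero and the angle computation divides by zero, so such cases need to be handled by the caller.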

In [18]:
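from math import acos, cos, pi
# d: Euclidean distance between two points
d=lambda x,y:((x[0]-y[0])**2+(x[1]-y[1])**2)**0.5
# s: side lengths (a, b, c) opposite the vertices A, B, C
s=lambda A,B,C:(d(B,C), d(C,A), d(A,B))
# j: angle opposite side a (law of cosines)
j=lambda a,b,c:acos((b*b+c*c-a*a)/(2*b*c))
# t: sec(angle - pi/6), the weight factor of triangle center X(13)
t=lambda a,b,c:1/cos(j(a,b,c)-pi/6)
# b: point with barycentric coordinates p:q:r relative to A, B, C
b=lambda A,B,C,p,q,r:[(p*A[i]+q*B[i]+r*C[i])/(p+q+r) for i in [0,1]]
# f: Fermat point; a vertex whose angle is at least 120 degrees is returned as is
f=lambda A,B,C:A if j(*s(A,B,C)) >= 2*pi/3 else B if j(*s(B,C,A)) >= 2*pi/3 else C if j(*s(C,A,B)) >= 2*pi/3 else b(A,B,C,d(B,C)*t(*s(A,B,C)),d(C,A)*t(*s(B,C,A)),d(A,B)*t(*s(C,A,B)))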

And let's define the function to find the closest cluster centroid of a given museum:

In [19]:
def get_nearest_location(source, cluster):
    min_latitude = cluster['latitude'].iloc[0]
    min_longitude = cluster['longitude'].iloc[0]
    min_distance = great_circle(source, (min_latitude,min_longitude)).m
    for index, row in cluster.iterrows():
        distance = great_circle(source, (row['latitude'],row['longitude'])).m
        if (distance < min_distance):
            min_latitude = row['latitude']
            min_longitude = row['longitude']
            min_distance = distance
    return [min_latitude,min_longitude]

With the above functions, let's loop through all museums to find the optimal locations:

In [20]:
columns=['latitude','longitude','distance']
locations = pd.DataFrame(columns=columns)

for index, row in museums_centers_df.iterrows():
    x = [row['latitude'], row['longitude']]
    x1 = get_nearest_location(x, restaurants_centers_df)
    x2 = get_nearest_location(x, metros)
    x3 = get_nearest_location(x, hotels_centers_df)
    if (x[0] == x1[0] and x[1] == x1[1]):
        # Museum coincides with the restaurant center: the triangle is
        # degenerate, so use the midpoint (which is the museum itself)
        y = [(x[0] + x1[0])/2.0, (x[1] + x1[1])/2.0]
    elif (x[0] == x2[0] and x[1] == x2[1]):
        # Museum coincides with the metro station: same degenerate case
        y = [(x[0] + x2[0])/2.0, (x[1] + x2[1])/2.0]
    else:
        y = f(x,x1,x2)
    # Distance from the candidate to the nearest hotel cluster center
    distance = great_circle(y,x3).m
    z = [(y[0],y[1],distance)]
    locations = pd.concat([locations,pd.DataFrame(z,columns=columns)],ignore_index=True)
locations=locations.sort_values(['distance'],ascending=False).head(5)
print(locations)

    latitude   longitude     distance
7  34.274469  109.046455  4935.695032
0  34.207934  108.904283  4836.301236
8  34.295212  108.946036  4351.534043
1  34.224524  108.959999  3591.718837
3  34.246964  108.904415  3541.739580


As explained above, the desired location for a new hotel should keep its distance from existing hotels. So I sort the calculated locations by their distance to the centroid of the closest hotel cluster, in descending order, and pick the top 5.
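The ranking step itself is small enough to show in isolation. The sketch below uses made-up candidate labels and distances, which are not the notebook's results, and `heapq.nlargest` from the standard library as one possible way to take the top entries without sorting the whole list.

```python
import heapq

# Hypothetical candidates: (label, distance to nearest hotel cluster in metres)
candidates = [
    ("A", 3500.0),
    ("B", 1200.0),
    ("C", 4900.0),
    ("D", 800.0),
]

# Keep the two candidates that are farthest from existing hotel clusters
top = heapq.nlargest(2, candidates, key=lambda c: c[1])
print(top)
```

Note that this criterion only rewards distance from competitors; closeness to museums, restaurants, and metro stations is already built into each candidate through the Fermat point construction.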

Now let's show these locations on the map of Xi'an. To make the map more informative, I use the name of the closest museum as the label.

In [21]:
def get_closest_museum_name(source, cluster):
    min_latitude = cluster['latitude'].iloc[0]
    min_longitude = cluster['longitude'].iloc[0]
    name=cluster['name'].iloc[0]
    min_distance = great_circle(source, (min_latitude,min_longitude)).m
    for index, row in cluster.iterrows():
        distance = great_circle(source, (row['latitude'],row['longitude'])).m
        if (distance < min_distance):
            min_latitude = row['latitude']
            min_longitude = row['longitude']
            name=row['name']
            min_distance = distance
    return name

In [22]:
map = folium.Map(location=(34.251568, 108.940178), zoom_start=12)

for index, row in locations.iterrows():
    label = get_closest_museum_name([row['latitude'], row['longitude']],museums)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=label,
        icon=folium.Icon(color='green',icon='info-sign')
    ).add_to(map)

display(map)

With the five most promising locations shown on the map above, let's conclude this project.