In [1]:
# @hidden_cell
#
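import os

# Load the Foursquare credentials from environment variables instead of
# hard-coding them (the variable names below are placeholders).
foursquare_credentials = {
    'client_id' : os.environ['FOURSQUARE_CLIENT_ID'],
    'client_secret' : os.environ['FOURSQUARE_CLIENT_SECRET']
}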

# Project Description

Finding a good location to open a hotel can be challenging. This project helps stakeholders evaluate the surrounding environment and identify promising candidate locations.

The city to be evaluated is Xi'an, where I live with my family. Xi'an is a tourist city in China, famous for its long history. Many visitors come to Xi'an and choose a museum as the starting point for getting to know the city. So in this project, preferred locations for a new hotel should be close to museums but should not have similar hotels nearby. A place close to restaurants serving local food and to metro stations is also desirable for the sake of convenience.

I will use data science techniques to help stakeholders find a few such locations.

In [2]:
import requests
import folium
import json
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

The data I need to collect are the locations and categories of venues in Xi'an. I will use the Foursquare API to obtain them for four kinds of venue: hotels, museums, restaurants and metro stations.

The IDs of the four categories above can be found on the Foursquare developer page (https://developer.foursquare.com/docs/resources/categories). For restaurants, I didn't choose the generic category 'Food'. Instead, I use Shaanxi Restaurant, because the local food here is almost as famous as the city's history. There are many "must-have" dishes, so choosing a place close to several local restaurants can be an advantage. The ID of each category is listed below.

- History Museum (_**4bf58dd8d48988d190941735**_)

- Shaanxi Restaurant (_**52af3b633cf9994f4e043c01**_)

- Hotel (_**4bf58dd8d48988d1fa931735**_)

- Metro Station (_**4bf58dd8d48988d1fd931735**_)
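As a small convenience, the four IDs above could be kept in one mapping so that later code can loop over the categories by name. This is only a sketch: the dictionary name `CATEGORY_IDS` is a hypothetical choice and is not part of the Foursquare API.

```python
# Category IDs copied from the list above.
# CATEGORY_IDS is an illustrative name, not something Foursquare defines.
CATEGORY_IDS = {
    'History Museum': '4bf58dd8d48988d190941735',
    'Shaanxi Restaurant': '52af3b633cf9994f4e043c01',
    'Hotel': '4bf58dd8d48988d1fa931735',
    'Metro Station': '4bf58dd8d48988d1fd931735',
}

# Print each category next to the ID that would be passed as categoryId.
for name, category_id in CATEGORY_IDS.items():
    print('{:<20}{}'.format(name, category_id))
```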

I define some functions to obtain data from Foursquare. The "Search for Venues" API (https://developer.foursquare.com/docs/api/venues/search) returns a list of venues. Instead of the default center/radius parameters, I use a bounding-box search: I divide central Xi'an into a 10x10 grid and use each grid cell as a bounding box to search for venues of each category.
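The grid of bounding boxes can be sketched as a small generator. This is a minimal illustration rather than the code used in this notebook: `grid_cells` is a hypothetical helper name, and the origin (34.2, 108.85) and the cell size of 0.02 degrees are the values that appear in the data collection cell further down.

```python
def grid_cells(lat0, lng0, delta, n):
    """Yield (sw_lat, sw_lng, ne_lat, ne_lng) for each cell of an n x n grid.

    (lat0, lng0) is the south-west corner of the whole grid and delta is the
    cell size in degrees.
    """
    for i in range(n):
        for j in range(n):
            yield (lat0 + delta * i, lng0 + delta * j,
                   lat0 + delta * (i + 1), lng0 + delta * (j + 1))


cells = list(grid_cells(34.2, 108.85, 0.02, 10))
print(len(cells))  # 100 bounding boxes
print(cells[0])    # the south-west-most cell
```

Each tuple supplies the south-west and north-east corners that a bounding-box search expects.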

In [3]:
client_id=foursquare_credentials['client_id']
client_secret=foursquare_credentials['client_secret']

In [4]:
# browse within a bounding box
def bb_search(sw_lat,sw_lng,ne_lat,ne_lng,
              categories,
              client_id=client_id,client_secret=client_secret,
              intent='browse',
              version='20190101', limit=5):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&sw={},{}&ne={},{}&intent={}&categoryId={}&v={}&limit={}'.format(
        client_id, client_secret,sw_lat,sw_lng,ne_lat,ne_lng,intent,categories,version,limit)
    try:
        results = requests.get(url).json()
    except (requests.RequestException, ValueError):
        # network failure or a body that is not valid JSON: return an empty result
        results = {}
    return results

def get_latitude(location):
    return location['lat']
def get_longitude(location):
    return location['lng']
def get_category(categories):
    # a venue may come back without any category
    return categories[0]['shortName'] if categories else None

def get_venues(res):
    # tolerate failed or malformed responses that lack the expected keys
    venues = res.get('response', {}).get('venues', []) if isinstance(res, dict) else []
    df = pd.DataFrame.from_dict(venues)
    if df.empty:
        return df
    else:
        # work on a copy so the column assignments below do not raise SettingWithCopyWarning
        df1 = df[['location','name','categories']].copy()
        df1['latitude'] = df1['location'].apply(get_latitude)
        df1['longitude'] = df1['location'].apply(get_longitude)
        df1['category'] = df1['categories'].apply(get_category)
        return df1[['name','category','latitude','longitude']]

In the code below I initialize four pandas DataFrame objects to hold the four categories of venues. The grid starts at its south-west corner (34.2, 108.85) and each cell spans 0.02 degrees in both latitude and longitude, so the 10x10 grid covers latitudes 34.2 to 34.4 and longitudes 108.85 to 109.05. Within each cell, I use the bb_search() function defined above to retrieve the venues inside it and append them to the corresponding DataFrame.
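To get a feel for how large one grid cell is on the ground, the cell edges can be converted from degrees to kilometres. The sketch below uses the haversine formula from the standard library instead of geopy's `great_circle`, which this notebook imports; `haversine_km` is a hypothetical helper, and a spherical Earth with a mean radius of 6371.0088 km is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# South-west corner and cell size used for the search grid.
lat, lng, delta = 34.2, 108.85, 0.02

height_km = haversine_km(lat, lng, lat + delta, lng)  # north-south edge
width_km = haversine_km(lat, lng, lat, lng + delta)   # east-west edge
print(round(height_km, 2), round(width_km, 2))
```

At this latitude a 0.02 degree cell is roughly 2.2 km tall and 1.8 km wide, because lines of longitude converge away from the equator.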

In [5]:
columns=['name','category','latitude','longitude']
museums = pd.DataFrame(columns=columns)
restaurants = pd.DataFrame(columns=columns)
hotels = pd.DataFrame(columns=columns)
metros = pd.DataFrame(columns=columns)

# south-west corner of the search grid over Xi'an
lat=34.2
lng=108.85
# cell size in degrees
delta=0.02
for i in range(0,10):
    for j in range(0,10):
        sw_lat = lat + delta*i
        sw_lng = lng + delta*j
        ne_lat = lat + delta*(i+1)
        ne_lng = lng + delta*(j+1)
        #History Museum (4bf58dd8d48988d190941735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d190941735')
        df = get_venues(res)
        if not df.empty:
            museums=pd.concat([museums,df],ignore_index=True)
        #Shaanxi Restaurant (52af3b633cf9994f4e043c01)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='52af3b633cf9994f4e043c01',limit=50)
        df = get_venues(res)
        if not df.empty:
            restaurants=pd.concat([restaurants,df],ignore_index=True)
        #Hotel (4bf58dd8d48988d1fa931735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d1fa931735',limit=50)
        df = get_venues(res)
        if not df.empty:
            hotels=pd.concat([hotels,df],ignore_index=True)
        #Metro Station (4bf58dd8d48988d1fd931735)
        res = bb_search(sw_lat,sw_lng,ne_lat,ne_lng,categories='4bf58dd8d48988d1fd931735')
        df = get_venues(res)
        if not df.empty:
            metros=pd.concat([metros,df],ignore_index=True)


Now let's take a look at the data, which will be used to work out the locations of interest.

In [6]:
museums.shape

(14, 4)

In [7]:
restaurants.shape

(147, 4)

In [8]:
hotels.shape

(413, 4)

In [9]:
metros.shape

(47, 4)

In total, the query returned 14 museums, 147 restaurants, 413 hotels and 47 metro stations.
Let's overlay the restaurants on a map of Xi'an to get an overview of how Shaanxi restaurants are distributed.

In [10]:
# named xian_map so that the built-in map() is not shadowed
xian_map = folium.Map(location=(34.251568, 108.940178), zoom_start=12)

# add one red circle marker per restaurant, with its name as the popup
for index, row in restaurants.iterrows():
    label = '{}'.format(row['name'])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [row['latitude'], row['longitude']],
        radius=3,
        popup=label,
        color='red',
        fill=True).add_to(xian_map)

display(xian_map)

This concludes the work for week 1: defining the business problem and the data needed for the analysis. Next week, I'll analyze these data to find suitable locations for a hotel.

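I choose to cluster the restaurants, and I choose DBSCAN over k-means because the two work differently. K-means is a partitional clustering technique: it assigns every point to one of a fixed number of clusters. DBSCAN is a density-based clustering technique: it groups points that lie in dense regions and labels isolated points as noise.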
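A minimal sketch of density-based clustering on geographic coordinates, using scikit-learn's `DBSCAN` with the haversine metric. The coordinates are made up for illustration, and the `eps_km` and `min_samples` values are assumptions, not the parameters chosen for this project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km

# Made-up (latitude, longitude) points: two tight groups and one isolated point.
coords = np.array([
    [34.2600, 108.9400], [34.2605, 108.9406], [34.2598, 108.9409],
    [34.2200, 108.9600], [34.2204, 108.9595], [34.2197, 108.9603],
    [34.3000, 109.0200],
])

# The haversine metric works on radians, so eps is converted from km to radians
# and the coordinates are converted from degrees to radians.
eps_km = 0.3  # assumed neighbourhood radius, chosen only for this example
db = DBSCAN(eps=eps_km / KMS_PER_RADIAN, min_samples=3,
            algorithm='ball_tree', metric='haversine')
labels = db.fit_predict(np.radians(coords))

print(labels)  # two clusters (0 and 1); the isolated point is labelled -1
```

Unlike k-means, the number of clusters is not given in advance, and the isolated point is left out as noise instead of being forced into a cluster.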