# Title draft

## Business Case

Companies operating in the tourism sector are interested in suggesting itineraries and destinations tailored to their audience’s interests.

When visiting a city, some travellers are mainly interested in museums and art venues, others in shopping, others in gastronomy, and so on. It is also realistic to expect that many tourists want a mix of activities: for example, visiting art venues in the morning, shopping in the afternoon, and going out to a lively nightlife area in the evening.

Tourism agencies may therefore want to use machine learning, and specifically clustering, to give customers relevant suggestions on which areas to visit.


## Data

As a test case, I will use Foursquare data to cluster neighbourhoods in London, United Kingdom. The idea is to characterise each neighbourhood by the main attractions it has to offer, and where possible to assign it to a broad class, e.g. predominantly ‘shopping area’ or ‘nightlife area’. (Some areas obviously offer more than one type of attraction, e.g. there are usually many restaurants around art venues. The clustering should nevertheless be able to pick up on this and return ‘mixed’ clusters. Tourists only interested in food, for example, may still want to visit a ‘mixed’ area that scores very high on ‘food’.)
The analysis will use the k-means algorithm.

In order to create the clusters, I will build the underlying dataset so that:

•	Each venue category is recoded to its macro category as per the Foursquare category tree (see https://developer.foursquare.com/docs/resources/categories). For example, an Italian restaurant and a Chinese restaurant will both be recoded to the macro category ‘Food’. This allows the k-means algorithm to work with 9 aggregated macro variables (Food, Travel & Transport, Shop & Service, Arts & Entertainment, Nightlife Spot, Professional & Other Places, Outdoors & Recreation, College & University, Residence) rather than hundreds of variables (i.e. the venue-specific categories).

•	Each venue will have its own ‘weight’ in terms of ‘likes’. For each venue, I will retrieve the count of ‘likes’ (i.e. how many people liked that particular venue). This makes it possible to distinguish between a major and a minor venue (e.g. a relatively unknown music venue, ‘The Blue Studios’, has 6 likes, whereas an important music venue, ‘The O2 Arena’, has 3154 likes) rather than treating all venues as equal.

As for the geographical units, I have opted for the rail stations (overground and underground) in London zones 1, 2 and 3 (the central zones), as they are spread fairly uniformly across the city, whereas the actual administrative units (London boroughs) vary considerably in size.
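
The dataset design described above can be sketched as follows. This is only a minimal illustration on made-up data, not the final implementation: the `Likes` column name, the station names and the numbers are assumptions, while `Neighborhood` and `Macro Category Name` follow the column names used later in this notebook. Each station ends up as one row with the share of likes per macro category, which is the kind of table a k-means algorithm can work with.

```python
import pandas as pd

# Hypothetical sample: one row per venue, with its station,
# its macro category and its count of likes.
venues = pd.DataFrame({
    "Neighborhood": ["Station A", "Station A", "Station A", "Station B", "Station B"],
    "Macro Category Name": ["Food", "Food", "Arts & Entertainment", "Nightlife Spot", "Food"],
    "Likes": [10, 30, 60, 90, 10],
})

# Sum the likes per station and macro category:
# one row per station, one column per macro category.
weighted = venues.pivot_table(
    index="Neighborhood",
    columns="Macro Category Name",
    values="Likes",
    aggfunc="sum",
    fill_value=0,
)

# Convert to row shares so that busy and quiet stations are comparable.
profile = weighted.div(weighted.sum(axis=1), axis=0)
print(profile.round(2))
```

Using shares rather than raw totals is a design choice: it describes what kind of area a station is, independently of how popular the area is overall.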


## Methodology

I will do the following:

1. Retrieve the station coordinates
2. Build a lookup table


1. Retrieve the station coordinates:

In [4]:
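from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# download the page listing the London stations and take its first HTML table
res = requests.get("https://www.doogal.co.uk/london_stations.php")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
# read_html returns a list of dataframes; the markup is wrapped in StringIO
# because recent pandas versions expect a file-like object rather than a raw string
df = pd.read_html(StringIO(str(table)))[0]
print(df.shape)
df.head()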

(641, 7)


Unnamed: 0,Station,Zone,Postcode,Latitude,Longitude,Easting,Northing
0,Abbey Road,3.0,E15 3NB,51.531952,0.003738,539081,183352
1,Abbey Wood,4.0,SE2 9RH,51.490784,0.120286,547297,179002
2,Acton Central,2.0,W3 6BH,51.508758,-0.263416,520613,180299
3,Acton Main Line,3.0,W3 9EH,51.516887,-0.267676,520296,181196
4,Acton Town,3.0,W3 8HN,51.503071,-0.280288,519457,179639


The query returned 641 stations. For ease of computation, let's keep only those in zones 1, 2 and 3, the areas tourists are most likely to visit.

In [8]:
stations = df.loc[df['Zone'].isin(['1','2', '12', '3', '23', '34'])]
print(stations.shape)
stations.head()

(340, 7)


Unnamed: 0,Station,Zone,Postcode,Latitude,Longitude,Easting,Northing
0,Abbey Road,3.0,E15 3NB,51.531952,0.003738,539081,183352
2,Acton Central,2.0,W3 6BH,51.508758,-0.263416,520613,180299
3,Acton Main Line,3.0,W3 9EH,51.516887,-0.267676,520296,181196
4,Acton Town,3.0,W3 8HN,51.503071,-0.280288,519457,179639
8,Aldgate,1.0,EC3N 1AH,51.514342,-0.075613,533629,181246


This leaves 340 stations. Let's plot them on a map:

In [13]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library


# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map of London Stations using latitude and longitude values
map_stations = folium.Map(location=[51.49787, -0.04967], zoom_start=11)

# add markers to map
for lat, lng, label in zip(stations['Latitude'], stations['Longitude'], stations['Station']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_stations)  
    
map_stations

2. Working with Foursquare data

2.1 Create the lookup table:

In [45]:
# The code was removed by Watson Studio for sharing.

In [37]:
# 1. Create Foursquare Parent Category Table
import requests
import json
url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION)
print(url)
resp = requests.get(url).json()
data = resp['response']

myList = []
def print_dict(v, prefix=''):
    
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            print_dict(v2, p2)
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            print_dict(v2, p2)
    else:
        if 'id' in prefix:
            curr_category_id = v
            root_category_index = int(prefix[15]) # position 15 of prefix holds parent category from 0 to 9
            root_category_id = data['categories'][root_category_index]['id']
            root_category_name = data['categories'][root_category_index]['name']
            myList.extend([root_category_id, curr_category_id, root_category_name])
            #print(root_category_index, root_category_id, curr_category_id, root_category_name)
        #if 'name' in prefix:
        #    curr_category_name = v
        #    myList.extend([curr_category_name])
        #print('{} = {}'.format(prefix, repr(v)))

print_dict(data)
#print(myList)

# Split list l into consecutive chunks of n items:
def chunks(l, n):
    # step through l in strides of n
    for i in range(0, len(l), n):
        # yield the slice holding the next n items
        yield l[i:i+n]

# myList is flat, so group it into rows of 3 items:
# (root category id, category id, root category name)
myCatTable = pd.DataFrame(chunks(myList, 3))

# test a category id
#myCat = '52e81612bcbc57f1066b7a0d'
#print(myCatTable.loc[myCatTable[1]==myCat, 2].item())
myCatTable.head()

https://api.foursquare.com/v2/venues/categories?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180604


Unnamed: 0,0,1,2
0,4d4b7104d754a06370d81259,4d4b7104d754a06370d81259,Arts & Entertainment
1,4d4b7104d754a06370d81259,56aa371be4b08b9a8d5734db,Arts & Entertainment
2,4d4b7104d754a06370d81259,4fceea171983d5d06c3e9823,Arts & Entertainment
3,4d4b7104d754a06370d81259,4bf58dd8d48988d1e1931735,Arts & Entertainment
4,4d4b7104d754a06370d81259,4bf58dd8d48988d1e2931735,Arts & Entertainment
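
The lookup above relies on the category index sitting at a fixed character position in the prefix string, which only works while there are at most ten top-level categories. As an alternative, the category tree could be walked directly. The following is a minimal sketch, assuming each category object carries `id`, `name` and a nested `categories` list; `flatten_categories` and the sample data are hypothetical and are not part of the Foursquare API.

```python
def flatten_categories(categories, root=None):
    """Yield (root_id, category_id, root_name) for every node in the tree."""
    for cat in categories:
        # A top-level category is its own root; descendants inherit it.
        top = root if root is not None else cat
        yield top["id"], cat["id"], top["name"]
        # Recurse into the sub-categories, if there are any.
        yield from flatten_categories(cat.get("categories", []), top)


# Hypothetical, heavily truncated tree in the assumed shape.
sample = [
    {"id": "root-food", "name": "Food", "categories": [
        {"id": "italian", "name": "Italian Restaurant", "categories": [
            {"id": "pizza", "name": "Pizza Place", "categories": []},
        ]},
    ]},
    {"id": "root-arts", "name": "Arts & Entertainment", "categories": []},
]

rows = list(flatten_categories(sample))

# Map every category id (at any depth) to its macro category name.
lookup = {cat_id: root_name for _, cat_id, root_name in rows}
print(lookup["pizza"])
```

The rows have the same three-column layout as `myCatTable`, so passing them to `pd.DataFrame` should give an equivalent table, and the dictionary offers a direct lookup by category id.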


In [42]:
# 2. Get Foursquare venues
import requests
import json
import math


# 3. Get the venue info from Foursquare
radius = 1000  # not passed to getNearbyVenues below, so its default radius of 500 (metres) applies
LIMIT = 300
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    i = 1
    
    venues_list=[]
    #file = open('myStations.csv','a')
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        #print(i, name)
        i = i + 1
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'],
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],
            math.sqrt((v['venue']['location']['lat']-lat)**2 + (v['venue']['location']['lng']-lng)**2),
            v['venue']['categories'][0]['name'],
            v['venue']['categories'][0]['id'],
            myCatTable.loc[myCatTable[1] == v['venue']['categories'][0]['id'], 2].item()) for v in results])
        
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue Name',
                             'Venue ID',
                             'Venue Latitude', 
                             'Venue Longitude',
                             'Venue Distance from Neigh',
                             'Venue Category',
                             'Venue Category ID', 
                             'Macro Category Name']
    
    #file.close()
    return(nearby_venues)

london_venues = getNearbyVenues(names=stations['Station'],
                                latitudes=stations['Latitude'],
                                longitudes=stations['Longitude'])

# overwrite the file, so that re-running the cell does not append duplicate rows and headers
london_venues.to_csv('stations.csv', mode='w', header=True, index=False)

print(london_venues.shape)
london_venues.head()


(14134, 11)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue ID,Venue Latitude,Venue Longitude,Venue Distance from Neigh,Venue Category,Venue Category ID,Macro Category Name
0,Abbey Road,51.531952,0.003738,The Greenway,4e0ef23022711665f619a691,51.53035,0.001172,0.003025,Trail,4bf58dd8d48988d159941735,Outdoors & Recreation
1,Abbey Road,51.531952,0.003738,Rial Lifestyle Café,4bafbc5bf964a5205f1c3ce3,51.527761,0.005202,0.004439,Café,4bf58dd8d48988d16d941735,Food
2,Abbey Road,51.531952,0.003738,Stratford Depot Staff Halt,4dcc4270d164ef21c4b7db03,51.533036,0.001917,0.002119,Platform,4f4531504b9074f6e4fb0102,Travel & Transport
3,Abbey Road,51.531952,0.003738,Abbey Mills Pumping Station,4bf524716a31d13a337c962e,51.53078,-0.001491,0.005359,Historic Site,4deefb944765f83613cdba6e,Arts & Entertainment
4,Abbey Road,51.531952,0.003738,Platform 1,4de267051fc7ca155a6e0c3b,51.528684,0.00584,0.003885,Platform,4f4531504b9074f6e4fb0102,Travel & Transport
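
Note that the `Venue Distance from Neigh` column is a Euclidean distance computed on raw latitude and longitude, so it is expressed in degrees rather than metres, and at London's latitude a degree of longitude covers noticeably less ground than a degree of latitude. If a distance in metres is preferred, a haversine calculation is one option. The sketch below is an illustration only; `haversine_m` and `EARTH_RADIUS_M` are hypothetical names, and the coordinates are those of the first row above (Abbey Road station and The Greenway).

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Abbey Road station to The Greenway
distance = haversine_m(51.531952, 0.003738, 51.53035, 0.001172)
print(round(distance))
```

A distance in metres is directly comparable with the search radius passed to the API, which makes it easier to sanity-check the results.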


The London venues dataset contains just over 14,000 rows. Note that each venue category has been recoded to its macro category (see the last column).

Next, I check for duplicates, since the same venue may have been picked up by queries for different stations. Take, for example, venue '5c66c388947c05003948a783':

In [39]:
london_venues.loc[london_venues['Venue ID'] == '5c66c388947c05003948a783']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue ID,Venue Latitude,Venue Longitude,Venue Distance from Neigh,Venue Category,Venue Category ID,Macro Category Name
150,Aldgate,51.514342,-0.075613,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.005649,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation
4554,Fenchurch Street,51.511567,-0.07854,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.002321,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation
8678,Monument,51.51063,-0.086174,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.005573,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation
12369,Tower Gateway,51.510393,-0.074395,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.006628,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation
12433,Tower Hill,51.510394,-0.076687,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.004452,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation


There are indeed duplicates: the same venue has been picked up by five different queries. I deduplicate the dataframe, keeping each venue only once, assigned to the station it is closest to:

In [43]:
# sort the df:
london_venues.sort_values(by =['Venue ID' , 'Venue Distance from Neigh'], inplace = True)

# drop duplicates but keep the first one:
london_venues.drop_duplicates(subset = 'Venue ID', keep = 'first', inplace = True)

# create a csv file of the deduplicated venues (overwrite, so re-running does not append duplicates)
london_venues.to_csv('london_venues.csv', mode='w', header=True, index=False)

# see the results:
london_venues.shape

(10134, 11)

The dataframe now contains 10,134 rows. Check that the example venue is no longer duplicated:

In [44]:
london_venues.loc[london_venues['Venue ID'] == '5c66c388947c05003948a783']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Name,Venue ID,Venue Latitude,Venue Longitude,Venue Distance from Neigh,Venue Category,Venue Category ID,Macro Category Name
4554,Fenchurch Street,51.511567,-0.07854,The Garden at 120,5c66c388947c05003948a783,51.512101,-0.080799,0.002321,Garden,4bf58dd8d48988d15a941735,Outdoors & Recreation
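
Spot-checking a single venue ID does not guarantee that the whole dataframe is free of duplicates. A more general check could look like the following minimal sketch, which runs on a small made-up dataframe (the venue IDs `v1` and `v2` and the third distance are hypothetical) and uses the same sort-then-drop logic as above, without `inplace`.

```python
import pandas as pd

# Hypothetical sample: venue "v1" was returned by two station queries.
venues = pd.DataFrame({
    "Neighborhood": ["Aldgate", "Fenchurch Street", "Monument"],
    "Venue ID": ["v1", "v1", "v2"],
    "Venue Distance from Neigh": [0.005649, 0.002321, 0.004000],
})

# Sort so that the closest station comes first within each venue,
# then keep only that first row per venue.
deduped = (
    venues.sort_values(["Venue ID", "Venue Distance from Neigh"])
          .drop_duplicates(subset="Venue ID", keep="first")
)

# True only if no venue ID appears more than once.
print(deduped["Venue ID"].is_unique)
```

On the real data, the same check would be `london_venues['Venue ID'].is_unique`.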


2.2 Second level: collect the ‘likes’ information