# Travel Recommendation 

This notebook presents a proof of concept (POC) for a travel recommendation system: a user selects their preferred areas, venues or locations in their hometown or a recently visited destination, and the system suggests similar places that they may like at their next destination.

In this POC, we use **London** as the recently visited destination and **San Francisco** as the next destination, and we determine similar places of interest by location (e.g. Camden, Dartford, etc.).

The notebook is structured in the following 3 sections:
1. **Data collection, processing and analysis for the recently visited destination, i.e. London**
   This section presents the processing and analysis functions used to treat the data and provide recommendations.
2. **Data collection, processing and analysis for the next destination, i.e. San Francisco**
   This section follows the same process as the previous section to illustrate similar locations within San Francisco.
3. **Travel recommendation**
   This section allows you to provide the name of a place of interest either in the previous or future destination and illustrates similar locations in the future destination.

## Imports

In [1]:
# import libraries
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import pandas as pd
import re
import requests
import numpy as np
from geopy.geocoders import Nominatim
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import io
from credentials import FoursquareCredentials

## 1. Data collection, processing and analysis for the recently visited destination

### 1.1. Data collection via web scraping

In [2]:
def getTableDataFromURL(url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'):
    
    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    # Ask for url, open it and parse html
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

    # find table row tags
    table = soup.find('table', {"class": "wikitable sortable"})
    table_rows = table.find_all('tr')
    table_headers = [str(header.text).replace(u'\xa0', u' ').strip('\n') for header in table.find_all('th')]
    
    # extract rows to dataframe
    data = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [tr.text.strip() for tr in td if tr.text.strip()]
        if row:
            data.append(row)
    
    df = pd.DataFrame(data, columns=table_headers)
    
    return df

In [3]:
df = getTableDataFromURL(url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London')
df.head(10)

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
5,Aldborough Hatch,Redbridge[9],ILFORD,IG2,20,TQ455895
6,Aldgate,City[10],LONDON,EC3,20,TQ334813
7,Aldwych,Westminster[10],LONDON,WC2,20,TQ307810
8,Alperton,Brent[11],WEMBLEY,HA0,20,TQ185835
9,Anerley,Bromley[11],LONDON,SE20,20,TQ345695


Next, let's remove the footnote symbols contained within brackets in the "London borough" column

In [4]:
df["London borough"] = df["London borough"].apply(lambda x: re.sub(r'\[[^\[]*\]', '', x))
df.head(10)

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon,CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon,CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
5,Aldborough Hatch,Redbridge,ILFORD,IG2,20,TQ455895
6,Aldgate,City,LONDON,EC3,20,TQ334813
7,Aldwych,Westminster,LONDON,WC2,20,TQ307810
8,Alperton,Brent,WEMBLEY,HA0,20,TQ185835
9,Anerley,Bromley,LONDON,SE20,20,TQ345695
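
The footnote removal above can be checked on a couple of sample strings. The following is a minimal sketch rather than the notebook's own code: the helper name `strip_footnotes` is hypothetical, and the pattern is a slightly stricter variant that also removes the whitespace left in front of a marker such as ` [7]`.

```python
import re

def strip_footnotes(text):
    """Hypothetical helper: remove footnote markers such as '[8]' or ' [7]'."""
    # \s* also consumes any whitespace directly in front of the bracket
    return re.sub(r'\s*\[[^\[\]]*\]', '', text).strip()

samples = ["Bexley, Greenwich [7]", "Ealing, Hammersmith and Fulham[8]", "Bexley"]
cleaned = [strip_footnotes(s) for s in samples]
print(cleaned)
```

Strings without a footnote marker, such as `"Bexley"`, pass through unchanged.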


Let's now further clean the data by removing unnecessary columns and filtering out the rows whose "Post town" is not London.

In [5]:
# Keep only the datapoints where the post town is London and drop unnecessary columns
try:
    df = df[df["Post town"] == "LONDON"].reset_index()
    df.drop(columns=["Post town", "Dial code", "OS grid ref"], inplace = True)
except KeyError:
    # a missing column raises a KeyError, both when filtering and when dropping
    print("The dataframe does not contain one or more of the following columns: Post town, Dial code, OS grid ref")
    raise

In [6]:
df.head()

Unnamed: 0,index,Location,London borough,Postcode district
0,0,Abbey Wood,"Bexley, Greenwich",SE2
1,1,Acton,"Ealing, Hammersmith and Fulham","W3, W4"
2,6,Aldgate,City,EC3
3,7,Aldwych,Westminster,WC2
4,9,Anerley,Bromley,SE20


In [7]:
print("The dataframe has a length of " + str(len(df)) + " but there are " + str(len(df["Postcode district"].unique())) + 
      " unique post code values")

The dataframe has a length of 299 but there are 151 unique post code values


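The dataframe has more rows than unique postcode values because several locations share the same postcode district. To solve this issue, we will combine all locations with the same postcode into one row.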

In [8]:
df = df.groupby("Postcode district").agg({"Location": ", ".join, "London borough": "first"}).reset_index()
print(df.shape)
df.head()

(151, 3)


Unnamed: 0,Postcode district,Location,London borough
0,DA5,Dartford,Dartford
1,E1,"Mile End, Ratcliff, Shadwell, Spitalfields, St...",Tower Hamlets
2,E10,Lea Bridge,Hackney
3,"E10, E15",Leyton,Waltham Forest
4,E11,"Cann Hall, Leytonstone, Snaresbrook, Wanstead",Waltham Forest


### 1.2. Mapping the postcodes to their respective coordinates

To determine the coordinates of the postcodes, we will use the *Outcode Area Postcodes* dataset taken from the following [link](https://www.freemaptools.com/download-uk-postcode-lat-lng.htm)

In [9]:
# import the csv file from the given link into a dataframe
url="https://www.freemaptools.com/download/outcode-postcodes/postcode-outcodes.csv"
url_content = requests.get(url).content
uk_postcodes_data =pd.read_csv(io.StringIO(url_content.decode('utf-8')), index_col = False)

In [10]:
# drop any unnecessary columns and set the postcode as the index of the dataframe
uk_postcodes_data.drop(columns = "id", inplace=True)
uk_postcodes_data = uk_postcodes_data.set_index('postcode')
print("The dataframe has the following shape: " +str(uk_postcodes_data.shape) + " and the columns are of the following types.\n")
print(uk_postcodes_data.dtypes)
uk_postcodes_data.head()

The dataframe has the following shape: (3003, 2) and the columns are of the following types.

latitude     float64
longitude    float64
dtype: object


Unnamed: 0_level_0,latitude,longitude
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
AB10,57.13514,-2.11731
AB11,57.13875,-2.09089
AB12,57.101,-2.1106
AB13,57.10801,-2.23776
AB14,57.10076,-2.27073


One thing to notice is that some postcodes have a latitude and longitude of 0, most likely because of missing data (see below). Such coordinates cannot be correct for a UK postcode, so they will be handled later on.

In [11]:
nb_invalid = len(uk_postcodes_data[(uk_postcodes_data["latitude"] == 0.0) & (uk_postcodes_data["longitude"] == 0.0)])

print(str(nb_invalid) + " postcodes lead to a 0 0 coordinate, which is impossible as it's in the Atlantic Ocean.")
uk_postcodes_data[(uk_postcodes_data["latitude"] == 0.0) & (uk_postcodes_data["longitude"] == 0.0)].head()

28 postcodes lead to a 0 0 coordinate, which is impossible as it's in the Atlantic Ocean.


Unnamed: 0_level_0,latitude,longitude
postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
WF90,0.0,0.0
N81,0.0,0.0
SW95,0.0,0.0
RH77,0.0,0.0
WD99,0.0,0.0


Next we will determine the coordinates for the postcodes through the following function.

In [12]:
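def mapPostcodeToCoordinates(data, postcode, outputColName = 'latitude'):
    
    '''
    This function returns either the latitude or the longitude of a 
    provided postcode. 
    
    In the case where more than one postcode is provided, this function 
    will return the average value over the postcodes found in the data.
    If none of the postcodes is found, 0.0 is returned.
    '''
    
    codes = postcode.split(', ')
    values = []
    
    for code in codes:
        try:
            values.append(data[outputColName].loc[code])
        except KeyError:
            # the postcode is missing from the lookup table: skip it so that
            # it does not pull the average towards 0
            continue
        
    return sum(values)/len(values) if values else 0.0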

In [13]:
# obtain the latitudes and longitudes corresponding to the postcodes
df["latitude"] = df["Postcode district"].apply(lambda x: mapPostcodeToCoordinates(uk_postcodes_data, x,
                                                                                  outputColName = 'latitude'))
df["longitude"] = df["Postcode district"].apply(lambda x: mapPostcodeToCoordinates(uk_postcodes_data, x,
                                                                                  outputColName = 'longitude'))

In [14]:
df.head()

Unnamed: 0,Postcode district,Location,London borough,latitude,longitude
0,DA5,Dartford,Dartford,51.44033,0.14698
1,E1,"Mile End, Ratcliff, Shadwell, Spitalfields, St...",Tower Hamlets,51.51766,-0.05841
2,E10,Lea Bridge,Hackney,51.56814,-0.01153
3,"E10, E15",Leyton,Waltham Forest,51.553625,-0.00423
4,E11,"Cann Hall, Leytonstone, Snaresbrook, Wanstead",Waltham Forest,51.56769,0.01443
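
As an aside, the same per-district averaging could also be written without a row-wise `apply`. The sketch below is an alternative, not the approach used in this notebook. It assumes a pandas version that provides `DataFrame.explode`, and the postcodes and coordinates are made-up toy values.

```python
import pandas as pd

# Toy outcode lookup table (made-up codes and coordinates, for illustration only)
lookup = pd.DataFrame(
    {"latitude": [51.0, 52.0], "longitude": [-0.2, 0.0]},
    index=pd.Index(["AA1", "AA2"], name="postcode"),
)

# Toy areas: one with a single outcode, one spanning two outcodes
areas = pd.DataFrame({"Postcode district": ["AA1", "AA1, AA2"]})

# One row per (area, outcode) pair
exploded = areas.assign(code=areas["Postcode district"].str.split(", ")).explode("code")

# Look up each outcode; codes missing from the table would yield NaN
joined = exploded.join(lookup, on="code")

# Average per area; mean() skips NaN values by default
mean_coords = joined.groupby("Postcode district")[["latitude", "longitude"]].mean()
print(mean_coords)
```

With this variant, an area whose outcodes are all missing from the lookup table ends up with NaN coordinates instead of 0, which can then be removed with `dropna`.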


After this stage, it would seem that we are lucky in the sense that no points lead us to the middle of the Atlantic Ocean! 

However, to make this code more robust, we will write a line of code to treat this in case it arises later on. In this case, we will simply drop such rows. 

In [15]:
len(df[(df["latitude"] == 0.0) & (df["longitude"] == 0.0)])

0

In [16]:
# remove any eventual rows that would lead us to the middle of the Atlantic Ocean,
# i.e. rows where both the latitude and the longitude are 0
df = df[~((df.latitude == 0.0) & (df.longitude == 0.0))]

### 1.3. Map of London with its various locations

Let's now display a map of London with the locations contained within our dataframe.

In [17]:
def createMapWithLocations(df, address = 'London, UK', labels = 'Location'):
    
    geolocator = Nominatim(user_agent="ny_explorer", timeout=None)
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geograpical coordinates of {} are {}, {}.'.format(address, latitude, longitude))
    
    # create map using latitude and longitude values
    map_address = folium.Map(location=[latitude, longitude], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df['latitude'], df['longitude'], df[labels]):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_address)  
        
    return map_address, latitude, longitude

In [18]:
map_london, latitude, longitude = createMapWithLocations(df, address = 'London, UK')
map_london

The geograpical coordinates of London, UK are 51.5073219, -0.1276474.


### 1.4. Obtaining the venues within a given radius of a location and determining the most common categories of each location

To do this, we will use the Foursquare API. You will need to have an account and use your own credentials for the next step. 


In the case of this project, my credentials are contained within a class in a credentials.py script which is imported above in the following line: 

    from credentials import FoursquareCredentials
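
For reference, here is a minimal sketch of what such a `credentials.py` might look like. Only the attribute names (`CLIENT_ID`, `CLIENT_SECRET`, `VERSION`, `LIMIT`) come from how the class is used further down. The environment variable names and all values are placeholders and assumptions, not the author's actual file.

```python
import os

class FoursquareCredentials:
    """Container for the Foursquare API settings used in this notebook."""

    def __init__(self):
        # Hypothetical environment variable names; adapt them to your own setup
        self.CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "your-client-id")
        self.CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "your-client-secret")
        # Placeholder values: API version date (YYYYMMDD) and maximum venues per request
        self.VERSION = "20180605"
        self.LIMIT = 100

example_creds = FoursquareCredentials()
print(example_creds.VERSION, example_creds.LIMIT)
```

Reading the secrets from the environment keeps them out of the notebook and out of version control.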

In [19]:
creds = FoursquareCredentials()

In [20]:
def getNearbyVenuesWithAPI(credentials, names, latitudes, longitudes, cityName, radius, colName):
    
    '''
    Function to obtain the venues within a given radius of a given location using the 
    FourSquare API
    '''
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL with credentials
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            credentials.CLIENT_ID, 
            credentials.CLIENT_SECRET, 
            credentials.VERSION, 
            lat, 
            lng, 
            radius, 
            credentials.LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [colName, 
                  colName + ' Latitude', 
                  colName +' Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    # save the data to a csv file
    nearby_venues.to_csv(cityName + '_' + str(radius) + '.csv', index = False)
    
    return(nearby_venues)


def getNearbyVenues(credentials, names, latitudes, longitudes, cityName, colName = "Location", radius=250):
    
    '''
    Function to load the nearby venues from a saved csv file if it exists
    (this is to avoid exceeding the quota of the FourSquare API
    in the event that the request has already been made previously), otherwise use the API
    '''
    
    try:
        nearby_venues = pd.read_csv(cityName + '_' + str(radius) + '.csv')
    except FileNotFoundError:
        # no cached file for this city and radius: query the API instead
        nearby_venues = getNearbyVenuesWithAPI(credentials, names, latitudes, longitudes, cityName, radius, colName)
        
    return nearby_venues
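
As a side note, the request URL in `getNearbyVenuesWithAPI` is assembled with string formatting. A possible alternative, sketched below with the standard library only, is to build the query string from a dictionary so that values are escaped automatically. The helper name `build_explore_url` and the example values are hypothetical. The endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Hypothetical helper: assemble the venues/explore request URL from a dict."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials and coordinates, for illustration only
example_url = build_explore_url("ID", "SECRET", "20180605", 51.5, -0.12, 250, 100)

# Round trip: parse the query string back into its parameters
parsed = parse_qs(urlparse(example_url).query)
print(parsed["ll"], parsed["radius"])
```

The same dictionary could also be passed to `requests.get` through its `params` argument instead of building the URL by hand.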

Obtain up to LIMIT venues within a given radius of each given location. 

In [21]:
london_venues = getNearbyVenues(creds, names=df['Location'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude'], 
                                cityName = 'London')

In [22]:
def rateVenueVenueCategories(city_venues, colName = 'Location'):
    
    '''
    This function rates the importance of each venue category found in proximity to
    each location/neighborhood.
    '''
    
    # one hot encoding for the different venue categories
    city_onehot = pd.get_dummies(city_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    city_onehot[colName] = city_venues[colName] 

    # move location column to the first column
    col_order = [colName] + [col for col in city_onehot.columns if col != colName]
    city_onehot = city_onehot[col_order]
    
    city_grouped = city_onehot.groupby(colName).mean().reset_index()    
    
    return city_grouped

Determine the importance of each venue category for each given location.

In [23]:
london_grouped = rateVenueVenueCategories(london_venues, colName = 'Location')
london_grouped.head()

Unnamed: 0,Location,Accessories Store,African Restaurant,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Vietnamese Restaurant,Warehouse Store,Watch Shop,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio
0,"Abbey Wood, Crossness, West Heath",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Aldgate, Tower Hill",0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0
3,"Aldwych, Charing Cross, Covent Garden, St Giles",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
4,"Anerley, Penge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
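
To make these scores concrete, the following toy example applies the same two steps (one-hot encoding, then a mean per location) to four made-up venues. It is a sketch with invented data, not part of the London dataset.

```python
import pandas as pd

# Four made-up venues spread over two made-up locations
venues = pd.DataFrame({
    "Location": ["A", "A", "A", "B"],
    "Venue Category": ["Pub", "Pub", "Café", "Park"],
})

# One column per venue category, 1.0 where the venue belongs to that category
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Location", venues["Location"])

# The mean of the one-hot columns is the share of each category per location
grouped = onehot.groupby("Location").mean().reset_index()
print(grouped)
```

Location A has two pubs out of three venues, so its "Pub" score is 2/3. Each row of scores sums to 1.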


In [24]:
# function to determine the num_top_venues values in a row
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [25]:
def determineMostCommonVenues(data_grouped, colName = 'Location', num_top_venues = 10):

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = [colName]
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            # beyond the 3rd position, fall back to the 'th' suffix
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted[colName] = data_grouped[colName]

    for ind in np.arange(data_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(data_grouped.iloc[ind, :], num_top_venues)

    return neighborhoods_venues_sorted

Determine the N most common venues per location

In [26]:
london_neighborhoods_venues_sorted = determineMostCommonVenues(london_grouped, colName = 'Location')
london_neighborhoods_venues_sorted.head()

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Abbey Wood, Crossness, West Heath",Cosmetics Shop,Café,Platform,Yoga Studio,Exhibit,Electronics Store,English Restaurant,Ethiopian Restaurant,Event Space,Farmers Market
1,Acton,Gastropub,Café,Thai Restaurant,Yoga Studio,Eastern European Restaurant,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
2,"Aldgate, Tower Hill",Restaurant,Pub,Coffee Shop,Italian Restaurant,Speakeasy,Seafood Restaurant,Street Food Gathering,French Restaurant,Bookstore,Garden
3,"Aldwych, Charing Cross, Covent Garden, St Giles",Clothing Store,Theater,Cosmetics Shop,Deli / Bodega,Bakery,Coffee Shop,Dessert Shop,Cheese Shop,Chocolate Shop,Boutique
4,"Anerley, Penge",Pub,Hotel,Construction & Landscaping,Yoga Studio,Exhibit,Eastern European Restaurant,Electronics Store,English Restaurant,Ethiopian Restaurant,Event Space
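
The column naming in `determineMostCommonVenues` works for the default of 10 venues, but with a larger `num_top_venues` it would produce labels such as "21th". A more general suffix helper could look like the sketch below. The function name `ordinal` is hypothetical and is not used elsewhere in this notebook.

```python
def ordinal(n):
    """Hypothetical helper: return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2 or 3
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

column_names = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 4)]
print(column_names)
```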


### 1.5. Cluster the neighborhoods

In [27]:
def clusterDataAndAddLabel(data, data_grouped, neighborhoods_venues_sorted, colName = 'Location', nb_clusters = 5):
    
    '''
    Cluster the locations as a function of the rating given per venue type
    '''
    
    # Drop the column holding the location/neighborhood names
    data_clustering = data_grouped.drop(columns=colName).copy()

    # run k-means clustering
    kmeans = KMeans(n_clusters=nb_clusters, random_state=42).fit(data_clustering)
    
    # add clustering labels
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

    data_merged = data.copy()

    # merge data_grouped with data to add latitude/longitude for each neighborhood
    data_merged = data_merged.join(neighborhoods_venues_sorted.set_index(colName), on=colName)
    
    return data_merged

Cluster the locations as a function of the rating given per venue type.

In [28]:
k_clusters = 10
london_merged = clusterDataAndAddLabel(df, london_grouped, london_neighborhoods_venues_sorted, nb_clusters = k_clusters)
london_merged.head()

Unnamed: 0,Postcode district,Location,London borough,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,DA5,Dartford,Dartford,51.44033,0.14698,1.0,Pub,Cricket Ground,Café,Trail,Train Station,Asian Restaurant,Italian Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant
1,E1,"Mile End, Ratcliff, Shadwell, Spitalfields, St...",Tower Hamlets,51.51766,-0.05841,1.0,Pub,Asian Restaurant,Coffee Shop,Hotel,Convenience Store,Bakery,Recreation Center,Market,Eastern European Restaurant,Electronics Store
2,E10,Lea Bridge,Hackney,51.56814,-0.01153,1.0,Park,Convenience Store,Cricket Ground,Flower Shop,Flea Market,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Dumpling Restaurant,Falafel Restaurant
3,"E10, E15",Leyton,Waltham Forest,51.553625,-0.00423,1.0,Chinese Restaurant,Pub,Fried Chicken Joint,Restaurant,Park,Fast Food Restaurant,Farmers Market,Fish & Chips Shop,Falafel Restaurant,Exhibit
4,E11,"Cann Hall, Leytonstone, Snaresbrook, Wanstead",Waltham Forest,51.56769,0.01443,1.0,Pub,Thai Restaurant,Café,Fast Food Restaurant,Vietnamese Restaurant,Event Space,Eastern European Restaurant,Electronics Store,English Restaurant,Ethiopian Restaurant
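
The number of clusters is set by hand in this notebook. One common way to sanity-check such a choice, which is not used here, is to compare silhouette scores for several values of k. The sketch below does this on synthetic, well-separated 2D points (made-up data). On real venue frequencies the picture is usually much less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs of 50 points each (illustration only)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + 0.5 * rng.randn(50, 2) for c in centers])

# Fit k-means for several values of k and record the mean silhouette score
scores_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores_by_k[k] = silhouette_score(X, labels)

# The k with the highest score gives the most compact, best separated clusters
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k)
```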


In [29]:
def createMapWithClusteredLocations(data_merged, k_clusters, latitude, longitude, colName = 'Location'):

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme for the clusters
    x = np.arange(k_clusters)
    ys = [i + x + (i*x)**2 for i in range(k_clusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map, using the cluster labels of the dataframe passed in
    for lat, lon, poi, cluster in zip(data_merged['latitude'], data_merged['longitude'], data_merged[colName], 
                                      data_merged['Cluster Labels']):

        # locations without a cluster label (NaN) are skipped
        if not np.isnan(cluster):
            label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
            folium.CircleMarker(
                [lat, lon],
                radius=5,
                popup=label,
                color=rainbow[int(cluster)-1],
                fill=True,
                fill_color=rainbow[int(cluster)-1],
                fill_opacity=0.7).add_to(map_clusters)
        
    return map_clusters

Generate a map indicating the cluster label of each location. Feel free to zoom in or out and click on a marker to reveal the name of the location(s) or neighborhood(s).

In [30]:
map_clusters = createMapWithClusteredLocations(london_merged, k_clusters, latitude, longitude)
map_clusters

## 2. Data collection, processing and analysis for the next destination

### 2.1. Data Collection for the Next Destination

In our case, the next destination will be San Francisco. 

**Note:** San Francisco does not seem to have the same notion of borough as London or New York. As such, we will work directly with its neighborhoods when clustering and analysing the data in the following lines of code.

In [31]:
def getTableOfContentsFromURL(url = "https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco"):
    
    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    # Ask for url, open it and parse html
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    
    # extract table of contents to dataframe
    data = []
    table_of_contents = soup.find_all("span", {"class": "toctext"})
    
    # remove contents such as additional links, references, etc
    if "List_of_neighborhoods_in_San_Francisco" in url:
        table_of_contents = table_of_contents[:-4]

    for content in table_of_contents:
        row = content.text.strip()
    
        if row:
            data.append(row)
    
    df = pd.DataFrame(data, columns=["Neighborhood"])
    
    return df

In [32]:
sf_df = getTableOfContentsFromURL(url = "https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco")
sf_df.head()

Unnamed: 0,Neighborhood
0,Alamo Square
1,Anza Vista
2,Ashbury Heights
3,Balboa Park
4,Balboa Terrace


In [33]:
def getLocationCoordinates(arg, CityInfo = "San Francisco, CA, USA"):
    
    address = arg + ", " + CityInfo
    geolocator = Nominatim(user_agent="ny_explorer", timeout = 3)
    
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except Exception:
        # geocoding failed (no match found, timeout, etc.): flag the coordinates as missing
        return np.nan, np.nan
    
    return latitude, longitude

In [34]:
sf_df["latitude"], sf_df["longitude"] = zip(*sf_df["Neighborhood"]
                                            .map(lambda x: getLocationCoordinates(x, CityInfo = "San Francisco, California")))

In [35]:
sf_df.head()

Unnamed: 0,Neighborhood,latitude,longitude
0,Alamo Square,37.77636,-122.434689
1,Anza Vista,37.780836,-122.443149
2,Ashbury Heights,,
3,Balboa Park,37.721427,-122.447547
4,Balboa Terrace,37.775406,-122.501415


As we can see, in some cases Nominatim is unable to determine the coordinates of the neighborhood. In the case of this project, we will simply drop these datapoints.

In [36]:
print(sf_df[["latitude", "longitude"]].isna().sum())
print(sf_df.shape)
sf_df[sf_df[["latitude", "longitude"]].isna().any(axis=1)][:5]

latitude     27
longitude    27
dtype: int64
(119, 3)


Unnamed: 0,Neighborhood,latitude,longitude
2,Ashbury Heights,,
9,Butchertown (Old and New),,
11,Cathedral Hill,,
12,Cayuga Terrace,,
16,Clarendon Heights,,


In [37]:
sf_df.dropna(axis=0, inplace = True)
print(sf_df.shape)

(92, 3)


### 2.2. Clustering the locations/neighborhoods of the next destination

In the following cells, we will repeat the same process that was done previously. The clustering could be used by people who are already visiting the next destination and just want to see if there are any areas that may interest them whilst on site.

**Note:** The cluster colors that are used have no relation to the cluster colors displayed in the map of the recently visited location. 

In [38]:
map_sf, sf_latitude, sf_longitude = createMapWithLocations(sf_df, address = 'San Francisco, California, USA', 
                                                           labels = 'Neighborhood')

The geograpical coordinates of San Francisco, California, USA are 37.7790262, -122.4199061.


In [40]:
sf_venues = getNearbyVenues(creds, sf_df["Neighborhood"], sf_df["latitude"], sf_df["longitude"], 
                            colName = "Neighborhood", cityName = 'SanFrancisco')

In [41]:
sf_grouped = rateVenueVenueCategories(sf_venues, colName = 'Neighborhood')

In [42]:
sf_neighborhoods_venues_sorted = determineMostCommonVenues(sf_grouped, colName = 'Neighborhood')
sf_neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Alamo Square,Park,Historic Site,Music Venue,Tennis Court,Dog Run,Coworking Space,Cycle Studio,Dance Studio,Food Stand,Credit Union
1,Anza Vista,Health & Beauty Service,Egyptian Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Eye Doctor,Farm,Farmers Market,Fast Food Restaurant,Yoga Studio
2,Balboa Park,Light Rail Station,Bus Station,Café,Burger Joint,Food Truck,Bus Line,BBQ Joint,Flower Shop,Asian Restaurant,Yoga Studio
3,Balboa Terrace,Dry Cleaner,Sporting Goods Shop,Egyptian Restaurant,Chinese Restaurant,Spa,Food & Drink Shop,Bus Station,Sushi Restaurant,Shopping Mall,Shipping Store
4,Bayview,Bakery,Mexican Restaurant,Light Rail Station,Piercing Parlor,Café,Pharmacy,Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Dance Studio


In [43]:
sf_merged = clusterDataAndAddLabel(sf_df, sf_grouped, sf_neighborhoods_venues_sorted, colName = 'Neighborhood',
                                   nb_clusters = k_clusters)
sf_merged.head()

Unnamed: 0,Neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Alamo Square,37.77636,-122.434689,6.0,Park,Historic Site,Music Venue,Tennis Court,Dog Run,Coworking Space,Cycle Studio,Dance Studio,Food Stand,Credit Union
1,Anza Vista,37.780836,-122.443149,1.0,Health & Beauty Service,Egyptian Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Eye Doctor,Farm,Farmers Market,Fast Food Restaurant,Yoga Studio
3,Balboa Park,37.721427,-122.447547,1.0,Light Rail Station,Bus Station,Café,Burger Joint,Food Truck,Bus Line,BBQ Joint,Flower Shop,Asian Restaurant,Yoga Studio
4,Balboa Terrace,37.775406,-122.501415,1.0,Dry Cleaner,Sporting Goods Shop,Egyptian Restaurant,Chinese Restaurant,Spa,Food & Drink Shop,Bus Station,Sushi Restaurant,Shopping Mall,Shipping Store
5,Bayview,37.728889,-122.3925,1.0,Bakery,Mexican Restaurant,Light Rail Station,Piercing Parlor,Café,Pharmacy,Restaurant,Coffee Shop,Southern / Soul Food Restaurant,Dance Studio


In [44]:
map_sf_clusters = createMapWithClusteredLocations(sf_merged, k_clusters, sf_latitude, sf_longitude, colName = 'Neighborhood')
map_sf_clusters

## 3. Travel recommendation

### 3.1. Based on location name

In this example, we will use "Camden Town" as a location that we liked on a previous holiday.

**Note:** You are not required to provide the full name. For instance, you could just write "Camden"

In [45]:
interest_location = "Camden"

In order to find similar locations, we first need to determine the venue types that the 2 dataframes have in common. This is done through the following function.

In [52]:
def getCommonVenues(old_grouped, new_grouped, exception_cols = ["Neighborhood", "Location"]):
    
    '''
    This function determines the features (columns) that old_grouped and new_grouped dataframes have
    in common with the exception of the exception_cols.
    '''
    
    # determine the venue types other than the exception columns that the two locations have in common
    old_cols = old_grouped.columns.tolist()
    new_cols = new_grouped.columns.tolist()

    commonVenues =  [col for col in old_cols if (col in new_cols and col not in exception_cols)]
    print("There are " +str(len(commonVenues)) + " venue types in common.")
    
    return commonVenues

In [53]:
commonVenues = getCommonVenues(london_grouped, sf_grouped)

There are 189 venue types in common.


Once we know which features the two dataframes have in common, we can determine the similarity between our location of interest and those in the city we are going to visit. In the case of this project, the similarity score will simply be based on the Euclidean distance between the two datapoints. The datapoints that have the lowest score are then those that are most similar to our location of interest. 
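
As a small illustration of this scoring, the sketch below ranks three made-up candidates against a made-up reference vector using `math.dist` from the standard library (Python 3.8 or later). The names and numbers are invented for the example.

```python
import math

# Venue-category frequencies of the location of interest (made-up values)
reference = [0.5, 0.5, 0.0]

# Frequencies for three candidate neighborhoods (made-up values)
candidates = {
    "A": [0.5, 0.5, 0.0],
    "B": [1.0, 0.0, 0.0],
    "C": [0.0, 0.0, 1.0],
}

# Euclidean distance between the reference and each candidate
scores = {name: math.dist(reference, vec) for name, vec in candidates.items()}

# The lowest score comes first, i.e. the most similar candidate
ranked = sorted(scores, key=scores.get)
print(ranked)
```

Candidate A is identical to the reference and therefore has a score of 0, which places it first.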

In [57]:
def findNMostSimilarLocations(old_grouped, new_grouped, commonVenues, keyword, num_recommendations = 10):
    
    '''
    This function assigns a similarity score to the datapoints in the new_grouped dataframe as a 
    function of the datapoint in the old_grouped dataframe whose "Location" contains the keyword. 
    The num_recommendations datapoints with the lowest scores are then returned.
    '''
    
    # create a new dataframe containing only the common venue type columns
    comparison_cols = ["Neighborhood"] + commonVenues
    comparison_matrix = new_grouped[comparison_cols].copy()
    
    # obtain the index for the location in the previous city.
    key_index = old_grouped[old_grouped["Location"].str.contains(keyword)][commonVenues].index[0]
    
    # assign a similarity score based on Euclidean distance. 
    comparison_matrix[commonVenues] = comparison_matrix[commonVenues].subtract(old_grouped[commonVenues].loc[key_index], 
                                                                           axis = 1)
    comparison_matrix["score"] = np.sqrt(np.square(comparison_matrix[commonVenues]).sum(axis=1))
    
    # determine the most similar locations, i.e. the locations with the lowest score values. 
    nSimilarLocs = comparison_matrix.nsmallest(num_recommendations,'score')["Neighborhood"].tolist()
    
    return nSimilarLocs

In [58]:
nSimilarLocs = findNMostSimilarLocations(london_grouped, sf_grouped, commonVenues, interest_location)

We can then display these locations on a map as done previously. 

In [56]:
map_similar, latitude_sf, longitude_sf = createMapWithLocations(sf_merged[sf_merged["Neighborhood"].isin(nSimilarLocs)], 
                                                                address = 'San Francisco, California, USA', 
                                                                labels = 'Neighborhood')
map_similar

The geograpical coordinates of San Francisco, California, USA are 37.7790262, -122.4199061.


### 3.2. Based on venue type

In this case, we want to recommend locations based on the types of venues that can be found. To do this, we will use a function that determines the N highest scores for a given venue type and displays their locations on the map.

In [62]:
venueType = "Museum"

In [80]:
def displayMostRelevantToVenueType(data_grouped, data_merged, venueType = "Museum", address = 'San Francisco, California, USA', 
                                   num_recommendations = 10):
    
    # the first column of data_grouped holds the location/neighborhood names
    nameCol = data_grouped.columns[0]
    
    if venueType not in data_grouped.columns.tolist()[1:]:
        print("This venue type is not valid, please try another!")
        return None
    
    # the highest scores correspond to the locations where this venue type is most frequent
    nRelevantLocs = data_grouped.nlargest(num_recommendations, venueType)[nameCol]
    
    map_relevant, latitude_relevant, longitude_relevant = createMapWithLocations(
        data_merged[data_merged[nameCol].isin(nRelevantLocs)], address = address, labels = nameCol)
        
    return map_relevant

In [81]:
displayMostRelevantToVenueType(sf_grouped, sf_merged, venueType = venueType, address = 'San Francisco, California, USA')

The geograpical coordinates of San Francisco, California, USA are 37.7790262, -122.4199061.
