# <u> Activities one can do while travelling -  Recommendation Engine </u>

#### By Team Weekenders: _Elena Harda, Ritu Pardasani, Rucha Kulkarni, Sanjana Thakur_

Deciding "where to go" on a weekend is always a task, and "what to do" there matters even more. We usually choose travel destinations based on the activities we want to do. 

With machine learning (ML) it's easier to find insights and correlations. Hence we decided to use a **Collaborative Filtering (CF)** algorithm to build a **recommendation engine** that recommends activities, along with the city and cost of each activity, on the basis of the activity the user selects. 

So basically, we'll recommend activities that are similar to the user's selected activity. 



## A. <u> Data Set Description</u>
<ul>
We collected our data from https://www.viator.com/. We scraped the data using the Beautiful Soup library. The website blocked us after a few pages, so we collected some of the data manually. We prepared an Excel file, which was later converted to CSV after a few alterations. 

Our final dataset consists of a single comma-separated file (Travel_data.csv) that contains information about leisure activities that one can do in different cities of California. We have five main categories of activities: 

- Water Activities
- Food, Wine and Nightlife
- Outdoor Activities 
- Walking & biking 
- Tours

<ul>
<ul>
<ul>
<b>There are 11 columns in our dataset and their description is as follows:</b>

<li>1. Activity : Detailed name of the activity </li>
<li>2. City : Name of the city where that particular activity happens </li>
<li>3. Price : Cost in USD of performing that particular activity </li>
<li>4. Region: This has 3 main values - North, Central and South </li>
<li>5. Rating : User rating for that particular activity </li>
<li>6. Category : This has the name of the category (mentioned above) </li>
<li>7. Water Activities: Value of 0 or 1; it is 1 if it's a water activity, otherwise 0 (binary)</li>
<li>8. Food, Wine and Nightlife : Value of 0 or 1; it is 1 if it's a food, wine and/or nightlife activity, otherwise 0 (binary)</li>
<li>9. Outdoor Activities: Value of 0 or 1; it is 1 if it's an outdoor activity, otherwise 0 (binary)</li>
<li>10. Walking & biking : Value of 0 or 1; it is 1 if it's a walking and/or biking activity, otherwise 0 (binary) </li>
<li>11. Tours: Value of 0 or 1; it is 1 if it's a tour (private or guided), otherwise 0 (binary)</li>

## B. <u> Preparing the data for applying the Algorithm </u>

### <u><b>STEP 1: Importing all required libraries and reading the data</b></u>


In [1]:
from bs4 import BeautifulSoup
import requests as rq
import pandas as pd
import numpy as np
import re

### <u><b>STEP 2: Using Beautiful Soup to scrape data from the website - https://www.viator.com </b></u>


In [2]:
#Assign the URL directly; input() would only show it as a prompt and wait for typing
url = "https://www.viator.com/California/d272-ttd"
r  = rq.get(url)

data = r.text

soup = BeautifulSoup(data,"html5lib")

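Since the site blocked us after a few pages, one option we could have tried is spacing the requests out. The sketch below is only an illustration under that assumption: `fetch_pages`, its `fetch` callable and the 2-second default are hypothetical names and values that are not part of the notebook, and a delay is no guarantee against being blocked.

```python
import time

def fetch_pages(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Download each URL in turn, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page text,
    e.g. lambda u: rq.get(u, timeout=10).text with the requests import above.
    """
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages

# Offline demonstration with a stand-in fetcher, so nothing is downloaded here
demo = fetch_pages(['page-1', 'page-2'], fetch=lambda u: 'html of ' + u, delay_seconds=0)
print(demo)
```

Passing the fetcher in as an argument keeps the pacing logic separate from the HTTP call, so it can be tried out without touching the website.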

In [3]:
cityActivityList1 = []
cityActivityList2 = []
cityActivityList3 = []

In [4]:
#finds the html for activity
for x in soup.find_all('h2',attrs={'man mtm product-title'}):
    text = x.text
    cityActivityList1.append(text)

In [5]:
#finds the html for location
for x in soup.find_all('p' ,attrs={'man mts note xsmall'}):
    text = x.text
    cityActivityList2.append(text)

In [6]:
#finds the html for price
for x in soup.find_all('span' ,attrs={'price-amount'}):
    text = x.text
    cityActivityList3.append(text)

In [7]:
#convert all from list to dataframes
df1 = pd.DataFrame({'Activity': cityActivityList1})
df2 = pd.DataFrame({'City': cityActivityList2})
df3 = pd.DataFrame({'Price': cityActivityList3})

In [8]:
#join together based on row number to get the activity, location and price aligned
cityActivities = df1.join(df2).join(df3)

In [9]:
#function to clean the data so that we keep just the city. We don't need the state since everything is in California
def getLocation(Location):
    City = (Location.split(", California",1)[0])
    return City

In [10]:
#apply the getLocation function to the dataframe
cityActivities['City'] = cityActivities['City'].apply(getLocation)

Next we will map the region to each city which will be used for a filter that the user can select. Regions will be North, Central and South.

In [11]:
#Dictionary to be used for tagRegion function below
mapRegion = {
    'Central': ['Cambria','Carmel','Los Olivos','Monterey','Oceano','Paso Robles', \
               'San Luis Obispo','Santa Barbara','Solvang'],
    'North': ['Berkeley','Cupertino','Fish Camp','Healdsburg','Inverness','Jenner', \
             'Lake Tahoe', 'Livermore','Los Gatos','Mammoth Lakes','Mill Valley', \
             'Napa','Novato','Oakhurst','Oakland','Occidental','Petaluma','Point Reyes', \
             'Redding','Rohnert Park','Sacramento','San Francisco','San Jose','Santa Cruz','Santa Rosa', \
             'Sausalito','Sonoma','Stockton','Tahoe City','Truckee'],
    'South': ['Anaheim','Beverly Hills','Carlsbad','Catalina Island','Dana Point','Fontana', \
             'Huntington Beach','Joshua Tree','La Jolla','Laguna Beach','Long Beach','Los Angeles', \
             'Malibu','Newport Beach','Oceanside','Palm Desert','Palm Springs','Pasadena', \
             'Riverside','San Diego','Santa Ana','Santa Monica','Temecula','Universal City', \
             'Venice','West Hollywood']
}

In [12]:
def tagRegion(x):
    #Return the region whose city list contains x, or None if the city is not mapped
    for k,v in mapRegion.items():
        if x.strip() in v:
            return k
    return None

In [13]:
#Apply function to add the new column "Region" to our dataframe
cityActivities['Region'] = cityActivities['City'].apply(tagRegion)
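As an aside, the same lookup can be done without scanning every city list. This is a minimal sketch, assuming a shortened copy of `mapRegion` (only a few cities are shown) and two hypothetical names, `cityToRegion` and `tagRegionFast`, that do not appear in the notebook.

```python
# Shortened copy of mapRegion for illustration; the notebook's dictionary has many more cities
mapRegion = {
    'Central': ['Monterey', 'Santa Barbara'],
    'North': ['San Francisco', 'Napa'],
    'South': ['Los Angeles', 'San Diego'],
}

# Hypothetical helper dict: invert region -> cities into city -> region
cityToRegion = {city: region for region, cities in mapRegion.items() for city in cities}

def tagRegionFast(city):
    # Like tagRegion, this returns None for cities missing from the mapping
    return cityToRegion.get(city.strip())

print(tagRegionFast(' Napa '))
print(tagRegionFast('Fresno'))
```

With pandas, the inverted dictionary could be applied in one step as `cityActivities['City'].str.strip().map(cityToRegion)`.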

In [14]:
cityActivities.head(10)

| | Activity | City | Price | Region |
|---|---|---|---|---|
| 0 | Napa and Sonoma Wine Country Tour | San Francisco | $99.00 | North |
| 1 | Big Bus San Francisco Sightseeing and Alcatra... | San Francisco | $110.00 | North |
| 2 | Muir Woods, Giant Redwoods and Sausalito Half... | San Francisco | $62.00 | North |
| 3 | Yosemite National Park and Giant Sequoias Trip | San Francisco | $171.00 | North |
| 4 | San Francisco Hop-on Hop-off Ticket and Alcat... | San Francisco | $111.50 | North |
| 5 | Viator VIP: Early Access to Alcatraz and Excl... | San Francisco | $192.50 | North |
| 6 | Yosemite National Park Day Trip from San Fran... | San Francisco | $139.00 | North |
| 7 | San Francisco Super Saver: Muir Woods & Wine ... | San Francisco | $121.00 | North |
| 8 | Monterey, Carmel and 17-Mile Drive Day Trip f... | San Francisco | $89.00 | North |
| 9 | Small-Group Napa and Sonoma Wine Country Tour... | San Francisco | $149.00 | North |


In [16]:
#Export to csv so that we can clean the data (add the category tags)
#to_csv returns None when given a path, so we do not assign its result back to cityActivities
cityActivities.to_csv("cityActivities.csv")

In [10]:
#After cleaning the data, bring in the csv file
raw_data = pd.read_csv('Travel_data.csv')

### <u>STEP 3: Check whether we get insights from the data</u>


In [12]:
raw_data.describe()

| | Rating | Category | Water Activities | Outdoor Activities | Walking and Biking | Tours | Food,Wine & Nightlife |
|---|---|---|---|---|---|---|---|
| count | 1007.0 | 0.0 | 1007.0 | 1007.0 | 1007.0 | 1007.0 | 1007.0 |
| mean | 2.751241 | NaN | 0.178749 | 0.397219 | 0.282026 | 0.766634 | 0.387289 |
| std | 1.434084 | NaN | 0.383332 | 0.489565 | 0.450209 | 0.423184 | 0.487373 |
| min | 0.5 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.5 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 3.0 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 4.0 | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 5.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### <u>STEP 4: Filter by region based on the user's input</u>


In [4]:
#If nothing (or an unknown value) is selected, include all regions
#Function will be applied to the raw_data dataframe

regionSelection = '' #This can be changed based on the user input

def filterRegion(regionSelection):
    #Keep only the rows of the chosen region; otherwise return the data unfiltered
    if regionSelection in ('North', 'South', 'Central'):
        return raw_data.loc[raw_data['Region'] == regionSelection]
    return raw_data

In [5]:
raw_data = filterRegion(regionSelection)

### Now we'll do the following steps:
#### 1. Create a list for the categories: 
water activities, outdoor activities, walking and biking, tours, and food,wine & nightlife

#### 2. Put that list into the Category column

#### 3. Create a tuple with the aggregated data

In [6]:

## At this point the index of the DataFrame may no longer be contiguous (rows can be missing after filtering). This line drops the old index and resets it from 0 to num_rows - 1
activities_df = raw_data.reset_index(drop=True) 

#### 1. Create a list for the categories: 

The following single line does several things. We could have split it into multiple apply calls, but 
for conciseness we combined all of them into a single apply. We do the following here:
1. activities_df.iloc[:, 5:] selects all rows and the columns from position 5 onwards, i.e. the still-empty Category column followed by the five binary columns (Water Activities onwards). Basically, we want to apply a function to a subsection of the df
2. We then apply a series of functions to this sub-dataframe:

    a. .dropna() removes missing values, if we have any (this also discards the empty Category cell)
    
    b. .astype(int).astype(str) first converts the values to int and then to str. Going through int first gives strings such as "1" rather than "1.0" if pandas holds the values as floats. We need strings because we are going to join them with ',' in the next step
    
    c. ','.join joins all the values in each row, separated by commas
    
We are applying it row-by-row and not column-by-column, hence axis=1

In [7]:
activities_df['Category'] = activities_df.iloc[:, 5:].apply(lambda x: ','.join(x.dropna().astype(int).astype(str)),axis=1)

#### 2. Put that list into the Category column:

At this point we have a new column called 'Category' which contains a string like this: "1,0,0,1,1".
But we need a list of integers like this instead: [1, 0, 0, 1, 1]. To get there we again apply 
a function to each row (only for the column activities_df['Category'] this time). 

Now, we:

1. map(int, x.split(',')) - split the string on commas, then convert each piece to int, because the distance calculation needs numeric values and cannot work on str values

2. list(....) - put the values of map into a list

In [8]:
activities_df['Category'] = activities_df['Category'].apply(lambda x: list(map(int, x.split(','))))
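To make the two Category steps concrete, here is a pure-Python walk-through on a single made-up row. It is a sketch under assumptions: the `row` values are invented, and we assume pandas holds the flags as floats because the empty Category cell (NaN) shares the row with them.

```python
import math

# Hypothetical row slice: the empty Category cell (NaN) followed by the five binary flags
row = [math.nan, 1.0, 0.0, 0.0, 1.0, 1.0]

# Equivalent of x.dropna().astype(int).astype(str) followed by ','.join(...)
kept = [value for value in row if not math.isnan(value)]
encoded = ','.join(str(int(value)) for value in kept)

# Equivalent of list(map(int, x.split(',')))
decoded = list(map(int, encoded.split(',')))

print(encoded)
print(decoded)
```

Skipping the `int` conversion would produce pieces such as `'1.0'`, which `int()` cannot parse back in the second step.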

#### 3. Create a tuple with the aggregated data:

Now we have a df with a column called 'Category' holding a list like [1, 0, 0, 1, 1], 
but we need a dictionary of tuples, not a DataFrame. 

The following 2 lines of code convert our DF columns (notice only positions 0 to 5, since the end of the slice 0:6 is exclusive) to a list of tuples, row by row. Each tuple in this list is a row of the original DF. Finally, we put each tuple into a dict whose key is its index/position in the list. So the first item has key/id 0, the second has key/id 1, and so on. 

Now we have a dict{activity_id: activity}

In [9]:
activities_tuple = [tuple(x) for x in activities_df.iloc[:, 0:6].values]
activities_dict = {i:j for i,j in enumerate(activities_tuple)}
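The same two steps on a couple of made-up rows show the resulting shape. This is only an illustration: the activities, prices and ratings below are invented, and `sample_tuples` / `sample_dict` are hypothetical names.

```python
# Two invented rows in the notebook's column order:
# Activity, City, Price, Region, Rating, Category
rows = [
    ['Sample Kayak Tour', 'Monterey', '75', 'Central', 4.5, [1, 1, 0, 1, 0]],
    ['Sample Wine Tour', 'Napa', '120', 'North', 4.0, [0, 0, 0, 1, 1]],
]

# One tuple per row, then a dict keyed by the row position (enumerate starts at 0)
sample_tuples = [tuple(row) for row in rows]
sample_dict = {position: activity for position, activity in enumerate(sample_tuples)}

print(sample_dict[0][0])  # activity name of the first row
print(sample_dict[1][5])  # category list of the second row
```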

### The final data will look as follows:

In [153]:
#see what a sample row will look like in our dictionary

activities_dict[900]

('Sonoma Valley Wine Tour from San Francisco',
 'Sonoma',
 '139',
 'North',
 4.0,
 [0, 0, 0, 1, 1])

## C. <u> Applying the Algorithm</u>

### <u>STEP 5: Calculating the cosine distance </u>

Here we calculate the cosine distance between each activity's category vector and the category vectors of the other activities to find similarities between activities, and add the absolute difference in rating on top. A smaller distance implies more similarity and vice versa. 
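Written out, the combined distance that ComputeDistance returns is the following, where $c_a$ and $r_a$ are our shorthand (not notation from the notebook) for the category vector and the rating of activity $a$:

$$
d(a, b) = \left(1 - \frac{c_a \cdot c_b}{\lVert c_a \rVert \, \lVert c_b \rVert}\right) + \lvert r_a - r_b \rvert
$$

For 0/1 category vectors the first term lies between 0 and 1, while the rating term can reach 4.5 (ratings run from 0.5 to 5.0 in our data), so a large rating gap can outweigh the category match.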

In [154]:
#Compute the distance between two activities from their categories and ratings

from scipy import spatial

def ComputeDistance(a, b):
    # a[5] = category
    catA = a[5]
    catB = b[5]
    catDistance = spatial.distance.cosine(catA, catB)
    # a[4] = rating
    ratingA = a[4]
    ratingB = b[4]
    ratingDistance = abs(ratingA - ratingB)
    return (catDistance + ratingDistance)

In [155]:
#test out the ComputeDistance function from above

ComputeDistance(activities_dict[900], activities_dict[60])

0.5
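For readers who want to check the numbers by hand, the following is a pure-Python sketch of the same calculation. It is a stand-in for `spatial.distance.cosine`, not its implementation; `cosine_distance`, `compute_distance` and the two sample tuples are hypothetical, and the sketch assumes neither category vector is all zeros.

```python
import math

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| * |v|); assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def compute_distance(a, b):
    # Position 5 holds the category list, position 4 the rating
    return cosine_distance(a[5], b[5]) + abs(a[4] - b[4])

# Invented activities in the same tuple layout as activities_dict
a = ('Sample Wine Tour', 'Sonoma', '139', 'North', 4.0, [0, 0, 0, 1, 1])
b = ('Sample City Tour', 'Napa', '99', 'North', 4.5, [0, 0, 0, 1, 0])

print(round(compute_distance(a, b), 4))  # 0.2929 from the categories + 0.5 from the ratings
```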

### <u>STEP 6: Finding the nearest neighbors</u>

We are creating a function that will calculate the distance between the selected activity and every other activity (based on rating and category). It will only compare the selected activity with activities that are different from it. For example, if I search for horseback riding, it will not give me that same horseback riding activity as one of my results.  <br> <br>

activityName = key (index number) of the activity in activities_dict from above <br>
K = number of results we want to show 

Once a distance is calculated, it is appended, together with the activity's key, to the distances list. After sorting, the keys of the K closest activities go into the neighbors list.

In [156]:
#Write the function that will calculate the distance then find the nearest neighbors. 
#This will be used to display the activity recommendations

import operator
    
def getNeighbors(activityName, K): 
    distances = []
    for activity in activities_dict:
        if (activity != activityName):
            dist = ComputeDistance(activities_dict[activityName], activities_dict[activity])
            distances.append((activity, dist))
    distances.sort(key=operator.itemgetter(1))

    #Slicing keeps at most K keys, even when fewer than K candidates exist
    neighbors = [activity for activity, dist in distances[:K]]
    return neighbors
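When only the K closest items are needed, sorting the whole list is optional. A minimal sketch of an alternative, assuming a hypothetical `get_neighbors` that takes the dictionary and the distance function as arguments, uses `heapq.nsmallest` from the standard library:

```python
import heapq

def get_neighbors(items, target_id, k, distance):
    # Pair every other item with its distance to the target
    candidates = (
        (distance(items[target_id], items[other_id]), other_id)
        for other_id in items
        if other_id != target_id
    )
    # nsmallest returns at most k pairs, ordered by distance and then by id
    return [other_id for _, other_id in heapq.nsmallest(k, candidates)]

# Toy data: ids mapped to ratings only, with the absolute difference as the distance
toy = {0: 4.0, 1: 4.5, 2: 1.0, 3: 3.5, 4: 5.0}
print(get_neighbors(toy, 0, 3, lambda a, b: abs(a - b)))
```

In the notebook this would be called as `get_neighbors(activities_dict, 900, 5, ComputeDistance)`.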

In [157]:
#Use this to see what value you are entering in below. This won't be used for any functions

activities_dict[900]

('Sonoma Valley Wine Tour from San Francisco',
 'Sonoma',
 '139',
 'North',
 4.0,
 [0, 0, 0, 1, 1])

### <u>STEP 7: Displaying the recommendations</u>

Run the algorithm to find the activity recommendations. Our input is activity 900 (Sonoma Valley Wine Tour from San Francisco) and we only want to see 5 results. <br>

It will iterate through the list of neighbors (already in ascending order by distance value, so the closest activity comes first) and display the activity name, price, and city.

In [158]:
K = 5 #Total number of recommendations we want to display

neighbors = getNeighbors(900, K) #the number is the key we want to search by. Ex: 55 = Balboa Park Tour by Segway
for y in neighbors:
    print(activities_dict[y][0] + ", $" + str(activities_dict[y][2]) \
          + ", " + str(activities_dict[y][1]))

6 Hour Dessert Wine Tour - Napa Valley, $480, San Francisco
6 Hour Private Wine Tasting Tour with Pre-set Stops, $439.99, Napa
8 Hour Dry Creek Valley Wine Tasting Tour, $549.95, Sonoma
8 Hour Santa Cruz Mountains Wine Tasting Tour from San Francisco, $599.99, Santa Cruz
8 Hour South Bay Wine Tasting Tour from San Francisco, $539.88, San Jose
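getNeighbors expects the numeric key of an activity, so a user-facing version needs a way to get from a name to that key. The helper below is a hypothetical sketch of one way to do it (`find_activity_ids` and the sample entries are not part of the notebook): a case-insensitive substring search over the activity names.

```python
def find_activity_ids(activities, text):
    # Case-insensitive substring match on the activity name (tuple position 0)
    needle = text.lower()
    return [key for key, row in activities.items() if needle in row[0].lower()]

# Invented entries in the same layout as activities_dict
sample = {
    0: ('Sample Kayak Tour', 'Monterey', '75', 'Central', 4.5, [1, 1, 0, 1, 0]),
    1: ('Sample Wine Tour', 'Napa', '120', 'North', 4.0, [0, 0, 0, 1, 1]),
}

print(find_activity_ids(sample, 'wine'))
```

Any key it returns can then be passed to getNeighbors in place of the hard-coded 900.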




## D. <u> Insights from visualizations:</u>

#### Link-> https://public.tableau.com/profile/rucha5691#!/vizhome/DataScienceproject-TheWeekenders/MosthappeningCitiesinCalifornia?publish=yes


-> San Francisco proves to be the most happening city in California based on the number of activities in each city

-> North region has the highest variety and is the most expensive region for food, wine, nightlife

-> Beverly Hills and Pasadena are the highest rated cities to walk and bike

-> Lake Tahoe is the most expensive city based on average walking and biking cost

-> Avg cost of water activities is the highest in Central California

-> Activities in Central Calif - List of activities to do in Central California based on the avg ratings
