# Capstone Project: Travel Recommender System Based on Activity Preferences

Done by: Richelle-Joy Chia, [LinkedIn](https://www.linkedin.com/in/richelle-joy-chia/)

# Part 3: Feature Engineering and Recommender System

## 3.1 Import relevant libraries and datasets

In [1]:
# import  libraries 
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

In [2]:
# import datasets
final_data = pd.read_csv('./datasets/final_data.csv')

In [3]:
# preview data 
final_data.head()

Unnamed: 0,attraction_id,name,country,province,city_name,location__lat,location__lng,price,rating,group_reviews,...,park,photography,rental,sea tour,sightseeing,transport,wildlife,winery,duration,images
0,0,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,49.197832,-123.064996,80.0,4.5,Another 'Dave' Guides us Around Vancouver. Lan...,...,1,0,0,0,1,0,0,0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...
1,1,Vancouver To Victoria And Butchart Gardens Tou...,canada,British Columbia,Vancouver,49.197832,-123.064996,210.0,5.0,Canada. I was there in 2004 and below it only ...,...,1,0,0,1,1,0,0,0,13h,https://media-cdn.tripadvisor.com/media/attrac...
2,2,Quebec City And Montmorency Falls Day Trip Fro...,canada,Quebec,Montreal,45.500146,-73.572026,115.0,4.5,Interested City of Canada. Quebec is our belov...,...,0,0,0,0,1,0,0,0,12h,https://media-cdn.tripadvisor.com/media/attrac...
3,3,Niagara Falls Day Trip From Toronto,canada,Ontario,Toronto,43.656151,-79.384264,169.0,5.0,Great day. It was an amazing day. The falls ar...,...,0,0,0,1,1,0,0,1,9h 30m,https://media-cdn.tripadvisor.com/media/attrac...
4,4,"Best Of Niagara Falls Tour From Niagara Falls,...",canada,Ontario,Niagara Falls,43.085714,-79.082431,158.0,5.0,"Valentine's Trip. Chris was a wonderful guide,...",...,0,0,0,1,1,0,0,0,4–5 hours,https://media-cdn.tripadvisor.com/media/attrac...


In [4]:
# preview data info and dtypes 
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 45 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   attraction_id        1705 non-null   int64  
 1   name                 1705 non-null   object 
 2   country              1705 non-null   object 
 3   province             1705 non-null   object 
 4   city_name            1705 non-null   object 
 5   location__lat        1134 non-null   float64
 6   location__lng        1134 non-null   float64
 7   price                1705 non-null   float64
 8   rating               1705 non-null   float64
 9   group_reviews        797 non-null    object 
 10  attraction           1705 non-null   object 
 11  accommodation        1705 non-null   int64  
 12  activities           1705 non-null   int64  
 13  adventure            1705 non-null   int64  
 14  air tour             1705 non-null   int64  
 15  airland tour         1705 non-null   i

## 3.2 Base model 
- This base model includes all hand-labelled categories to test the recommender system.
- I will build the content-based recommender system using cosine similarity, one of the most popular techniques in recommendation systems. The attributes of an item are termed its "content", and two items are judged similar when their attributes are similar. For example, in this case, the attributes are the hand-labelled categories. The intuition behind this kind of recommender is that a user who liked a particular type of activity might like a similar activity.
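
As a minimal sketch of the idea (the toy attractions and their category flags below are made up for illustration and are not taken from the dataset), cosine similarity is the dot product of two attribute vectors divided by the product of their lengths:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical category flags: [sightseeing, park, cruise, food]
city_tour = [1, 1, 0, 0]
park_walk = [1, 1, 0, 0]
dinner_cruise = [0, 0, 1, 1]
harbour_tour = [1, 0, 1, 0]

same = cosine_sim(city_tour, park_walk)        # identical flags
none = cosine_sim(city_tour, dinner_cruise)    # no flags in common
partial = cosine_sim(city_tour, harbour_tour)  # one shared flag out of two each
```

Two attractions with identical flags score 1 and two with no flags in common score 0, so attractions tagged with exactly the same categories and location are indistinguishable to this model.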

In [5]:
# creating a separate df to store relevant columns for the rs - province, city, and categories
selected_data = final_data.drop(columns=['attraction_id', 'country', 'location__lat', 'location__lng', 'price', 'rating', 'group_reviews', 'attraction', 'duration', 'images'])

In [6]:
# load dataset and set attraction name as the index 
selected_data.set_index('name', inplace=True)
print(selected_data.shape)
selected_data.head()

(1705, 34)


Unnamed: 0_level_0,province,city_name,accommodation,activities,adventure,air tour,airland tour,airlandsea tour,airsea tour,alcohol,...,mountain views,nature,park,photography,rental,sea tour,sightseeing,transport,wildlife,winery
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Vancouver City Sightseeing Tour,British Columbia,Vancouver,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
Vancouver To Victoria And Butchart Gardens Tour By Bus,British Columbia,Vancouver,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,1,0,0,0
Quebec City And Montmorency Falls Day Trip From Montreal,Quebec,Montreal,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Niagara Falls Day Trip From Toronto,Ontario,Toronto,0,1,0,0,0,0,0,1,...,0,0,0,0,0,1,1,0,0,1
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",Ontario,Niagara Falls,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0


In [7]:
# create dummies for province and city columns 

selected_data = pd.get_dummies(data=selected_data, columns=['province', 'city_name'])

In [8]:
# preview data 
selected_data.head()

Unnamed: 0_level_0,accommodation,activities,adventure,air tour,airland tour,airlandsea tour,airsea tour,alcohol,beach,brewery,...,city_name_Vancouver Island,city_name_Wanuskewin Heritage Park,city_name_Whistler,city_name_Whitehorse,city_name_Wildgrape Tours,city_name_Windsor,city_name_Winnipeg,city_name_Wood Islands,city_name_Yarmouth,city_name_Yellowknife
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Vancouver City Sightseeing Tour,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Vancouver To Victoria And Butchart Gardens Tour By Bus,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Quebec City And Montmorency Falls Day Trip From Montreal,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Niagara Falls Day Trip From Toronto,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# compute the cosine similarity between every pair of attractions (rows) and preview the first rows
attr_sim = pd.DataFrame(cosine_similarity(selected_data), columns=selected_data.index, index=selected_data.index)
print(attr_sim.shape)
attr_sim.head()

(1705, 1705)


name,Vancouver City Sightseeing Tour,Vancouver To Victoria And Butchart Gardens Tour By Bus,Quebec City And Montmorency Falls Day Trip From Montreal,Niagara Falls Day Trip From Toronto,"Best Of Niagara Falls Tour From Niagara Falls, Ontario",Niagara Falls In One Day: Deluxe Sightseeing Tour Of American And Canadian Sides,Whistler Small-Group Day Trip From Vancouver,Ultimate Niagara Falls Tour Plus Helicopter Ride And Skylon Tower Lunch,"Local Food, Craft Beverage And Estate Winery Tour Of Cowichan Valley",Private Tour: Vancouver To Victoria Island,...,Quebec City Shore Excursion: Quebec City Sightseeing Tour,Montreal Indoor Skydiving Introductory Package,Banff Day Trip From Calgary,60-Minute Deluxe Horse-Drawn Carriage Tour,Toronto Inner Harbour Evening Cruise,30-Minute Distillery District Segway Tour In Toronto,Montreal Quadricycle Rental,Intro Survival Course Rockies,Private Toronto Guided City Tour,Peyto Lake Snowshoe Tour
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Vancouver City Sightseeing Tour,1.0,0.680414,0.547723,0.246183,0.288675,0.288675,0.57735,0.258199,0.308607,0.721688,...,0.5,0.182574,0.288675,0.771517,0.288675,0.5,0.182574,0.258199,0.5,0.369274
Vancouver To Victoria And Butchart Gardens Tour By Bus,0.680414,1.0,0.298142,0.502519,0.589256,0.589256,0.471405,0.421637,0.251976,0.942809,...,0.272166,0.0,0.235702,0.503953,0.353553,0.272166,0.0,0.210819,0.272166,0.301511
Quebec City And Montmorency Falls Day Trip From Montreal,0.547723,0.298142,1.0,0.26968,0.316228,0.316228,0.316228,0.282843,0.169031,0.316228,...,0.730297,0.6,0.316228,0.507093,0.316228,0.547723,0.6,0.282843,0.547723,0.26968
Niagara Falls Day Trip From Toronto,0.246183,0.502519,0.26968,1.0,0.746203,0.746203,0.319801,0.667424,0.569803,0.426401,...,0.369274,0.0,0.319801,0.227921,0.639602,0.615457,0.13484,0.286039,0.615457,0.272727
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",0.288675,0.589256,0.316228,0.746203,1.0,1.0,0.375,0.782624,0.267261,0.5,...,0.433013,0.0,0.375,0.267261,0.625,0.57735,0.158114,0.223607,0.57735,0.319801


In [10]:
# look at the attractions most similar to Vancouver City Sightseeing Tour
attr_sim['Vancouver City Sightseeing Tour'].sort_values(ascending=False).head(30) 

name
Vancouver City Sightseeing Tour                                       1.000000
Private Tour: Victoria And Butchart Gardens From Vancouver            1.000000
Grand City Tour Of Vancouver                                          1.000000
Vancouver Private Tour                                                1.000000
Victoria Tour                                                         1.000000
Stanley Park Bike Tour                                                0.925820
Vancouver Sightseeing Bus Tour (4 Hrs)                                0.925820
Private Tour: Vancouver Day Trip                                      0.925820
Vancouver City Tour With Vancouver Lookout Admission                  0.925820
Vancouver Shore Excursion: Pre-Cruise City Tour With Port Drop Off    0.925820
Vancouver Full-Day Sightseeing And Photography Tour                   0.925820
Epic Electric Bike Tour Of Vancouver                                  0.925820
Bike Tour Of Downtown Vancouver And Stanley Par

Profile based

In [11]:
# recap original dataset
print(selected_data.shape)
selected_data.head()

(1705, 168)


Unnamed: 0_level_0,accommodation,activities,adventure,air tour,airland tour,airlandsea tour,airsea tour,alcohol,beach,brewery,...,city_name_Vancouver Island,city_name_Wanuskewin Heritage Park,city_name_Whistler,city_name_Whitehorse,city_name_Wildgrape Tours,city_name_Windsor,city_name_Winnipeg,city_name_Wood Islands,city_name_Yarmouth,city_name_Yellowknife
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Vancouver City Sightseeing Tour,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Vancouver To Victoria And Butchart Gardens Tour By Bus,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Quebec City And Montmorency Falls Day Trip From Montreal,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Niagara Falls Day Trip From Toronto,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3.2.1 User profiling for base model
- This step addresses the cold-start problem, which occurs when the system cannot relate users to items (in this case, categories) because it has insufficient data.
- For this project, the main concern is the user cold-start problem, where almost no information is available about the user. I will address this by recommending categories with random items to the cold-start user and using the feedback to learn a profile. This step will be clearer in the Streamlit notebook and app.

Step 1: I will first create the "same" feature space by initializing all the categories to zero for a new user.

In [12]:
# create a vector for a "new user" sign-up by initializing all categories to zero
category = selected_data.columns # match the new user's feature space to the existing dataset
my_profile = pd.Series(data=np.zeros(len(category)), index=category) # initialize all categories to 0 to create the new user vector (https://numpy.org/doc/stable/reference/generated/numpy.zeros.html)
my_profile

accommodation             0.0
activities                0.0
adventure                 0.0
air tour                  0.0
airland tour              0.0
                         ... 
city_name_Windsor         0.0
city_name_Winnipeg        0.0
city_name_Wood Islands    0.0
city_name_Yarmouth        0.0
city_name_Yellowknife     0.0
Length: 168, dtype: float64

Step 2: Create a profile for a hypothetical user by asking them to rate categories 1 for like and -1 for dislike

In [13]:
# inputs assigned to the respective categories
my_profile['sightseeing'] = 1
my_profile['photography'] = 1
my_profile['wildlife'] = 1
my_profile['city'] = -1
my_profile['cruise'] = -1
my_profile['entertainment'] = -1

In [14]:
# new user vectors
my_profile.values

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,
       -1.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

Step 3: Find the dot product between attraction vectors and new user profile to find/show recommendations
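
As a toy illustration of this scoring step (the three attractions and their flags are hypothetical), the dot product adds 1 for every liked category an attraction has and subtracts 1 for every disliked one:

```python
import numpy as np
import pandas as pd

# hypothetical attractions described by three category flags
toy_attractions = pd.DataFrame(
    {"sightseeing": [1, 0, 1], "wildlife": [1, 0, 0], "cruise": [0, 1, 1]},
    index=["Eagle Float", "Dinner Cruise", "Harbour Sightseeing Cruise"],
)

# user profile on the same columns: 1 = like, -1 = dislike, 0 = no opinion
toy_profile = pd.Series({"sightseeing": 1, "wildlife": 1, "cruise": -1})

# score = (liked categories present) - (disliked categories present)
toy_scores = pd.Series(
    np.dot(toy_attractions.values, toy_profile.values),
    index=toy_attractions.index,
).sort_values(ascending=False)
```

Because the score is a plain count, many attractions tie on the same value, which is consistent with the blocks of 2.0 and -2.0 in the real recommendations.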

In [15]:
# dot product between the attraction vectors and my_profile (new user) vector that are both on the same feature space
recommendations = np.dot(selected_data.values, my_profile.values)

# convert to pandas Series for ease of working with the recommendations data
recommendations = pd.Series(recommendations, index=selected_data.index)

# getting top 20 recommendations for a new-user! 
recommendations.sort_values(ascending=False).head(20)

name
Squamish Cheakamus White-Water Rafting                                               2.0
Wilderness & Eagle Viewing Float                                                     2.0
Become A Voyageur - Canoe Tour                                                       2.0
Whistler Electric Bike Tour (Ebike)                                                  2.0
90 Minute Private Vacation Photography Session With Photographer In Banff            2.0
Banff, Jasper, Okanagan Lake & Kamloops Tour From Vancouver                          2.0
3-Day Algonquin Park Canoe Trip                                                      2.0
Whistler Singletrack Mountain Bike Tour                                              2.0
Brentwood Bay Guided Kayak Tours                                                     2.0
Banff, Yoho & Jasper National Parks Summer Tour From Calgary (Airport Transfers)     2.0
5.5-Hour Jasper Wildlife And Waterfalls Tour With Maligne Lake Cruise From Jasper    2.0
Aurora Tour     

In [16]:
# getting bottom 20 recommendations for a new-user! 
recommendations.sort_values(ascending=False).tail(20)

name
The Second City Comedy Theatre Admission                        -2.0
Halifax Dinner Cruise                                           -2.0
Friday Night Wine And Cheese Sail In Toronto                    -2.0
Quebec City Dinner Cruise                                       -2.0
Bateau-Mouche Dinner Cruise                                     -2.0
Thousand Islands Sunset Dinner Cruise                           -2.0
Vancouver Holiday Dinner And Carols Cruise                      -2.0
Oh Canada Eh Show                                               -2.0
Vancouver Harbor Sunset Dinner Cruise                           -2.0
Small-Group Toronto Ghosts And Spirits Of The Distillery Tour   -2.0
Art Party Paint Night                                           -2.0
Medieval Times Dinner And Tournament Toronto                    -2.0
Hot On The Harbour With Hot Country 103 5                       -2.0
Saturday Afternoon Lunch Cruise                                 -2.0
Ultimate Archery Tag Experien

To test the base model, I created a survey using Qualtrics ([click here for survey link](https://ntuhss.az1.qualtrics.com/jfe/form/SV_0GIxdDgBGg9L7kG)) and distributed it to 16 people from my network.
- Participants were asked to imagine that they were entering a new travel website to get some recommendations on where they can go. 
- Since the website does not have details on their preferences, they were asked to select their top 5 and bottom 5 categories before rating each category on a scale of 1-5.
- Finally, they received a follow-up call in which the recommended attractions were shared.

Results: Most of the participants were indeed satisfied with the recommendations and mentioned that these were activities they would do.

However, some reported that there were too many categories to choose from. Therefore, I attempted to reduce the number of categories and to use feature engineering to further improve the recommendations.

## 3.3 Feature Engineering

### 3.3.1 Combine related columns to create fewer categories and reduce noise in the data

Columns that have been combined because they are related to each other:
- Brewery, distillery, and winery into alcohol places
- Camping and hiking into outdoor activities
- Nature, mountain views, island, and park into nature

Columns that have been dropped, with the respective reasons:
- Dropped beach as too few attractions carry this tag, so such attractions rarely appear
- Dropped activities and adventure as these categories sounded too broad to users. Based on the coding sheet, these categories were also embedded in other categories (e.g., nature, city), which means that the attractions are still present in the database
- Dropped tour categories that have double codings (e.g., airland tour, which comprises air tour and land tour) and kept the single labels (e.g., air tour, land tour) instead
- Dropped experience as it sounded abstract to participants and most found it hard to relate to
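
One way to merge several 0/1 columns into a single flag, sketched here on made-up rows, is to take the row-wise maximum, which for binary columns acts as a logical OR. This is an illustrative alternative under that assumption, not the notebook's own code:

```python
import pandas as pd

# hypothetical rows with the three alcohol-related flags
toy = pd.DataFrame(
    {"brewery": [1, 0, 0, 1], "distillery": [0, 0, 1, 1], "winery": [0, 0, 0, 0]}
)

alcohol_cols = ["brewery", "distillery", "winery"]

# logical OR across the columns: 1 if at least one flag is set, else 0
toy["alcohol_places"] = toy[alcohol_cols].max(axis=1)
```

The vectorized form avoids a Python-level function call per row, which matters little at 1,705 rows but keeps the rule in one readable line.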

In [17]:
# based on user feedback, there is a beach category but users don't see any beach-related activities, so I decided to check the df.
# I will drop this category altogether as too few activities are tagged with it
selected_data['beach'].value_counts()

0    1677
1      28
Name: beach, dtype: int64

In [18]:
selected_data.columns.value_counts()

accommodation                 1
city_name_Maple Ridge         1
city_name_Lake George         1
city_name_Lake Louise         1
city_name_Lethbridge          1
                             ..
city_name_Advocate Harbour    1
city_name_Alexandria Bay      1
city_name_Baltimore           1
city_name_Banff               1
city_name_Yellowknife         1
Length: 168, dtype: int64

In [23]:
# combine distillery, brewery, winery: return 1 if any of the three flags is set
def combine_cat(row):
    if row['distillery'] == 1 or row['brewery'] == 1 or row['winery'] == 1:
        return 1
    return 0

In [24]:
# combine camping and hiking
# note: `> 1` returns 1 only when both flags are set; `> 0` would flag either one
def combine_outdoors(row):
    if row['camping'] + row['hiking'] > 1:
        return 1
    return 0

In [25]:
# combine nature, mountain views, island, and park
# note: `> 1` returns 1 only when at least two flags are set; `> 0` would flag any one of them
def combine_nature(row):
    if row['nature'] + row['mountain views'] + row['island'] + row['park'] > 1:
        return 1
    return 0

In [26]:
final_data['alcohol_places'] = final_data.apply(combine_cat, axis = 1)
final_data['outdoor activities'] = final_data.apply(combine_outdoors, axis = 1)
final_data['nature_combined'] = final_data.apply(combine_nature, axis = 1)

In [27]:
# drop irrelevant/combined columns
final_data.drop(['airland tour', 'airlandsea tour', 'landsea tour', 'airsea tour', 'brewery', 'distillery', 'winery', 'beach', 'nature', 'camping', 'hiking', 'mountain views', 'island', 'park', 'location__lat', 'location__lng', 'group_reviews', 'activities', 'adventure', 'experience'], axis = 1, inplace = True)

In [28]:
# check df info to see if the columns have been dropped 
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   attraction_id        1705 non-null   int64  
 1   name                 1705 non-null   object 
 2   country              1705 non-null   object 
 3   province             1705 non-null   object 
 4   city_name            1705 non-null   object 
 5   price                1705 non-null   float64
 6   rating               1705 non-null   float64
 7   attraction           1705 non-null   object 
 8   accommodation        1705 non-null   int64  
 9   air tour             1705 non-null   int64  
 10  alcohol              1705 non-null   int64  
 11  city                 1705 non-null   int64  
 12  classes & workshops  1705 non-null   int64  
 13  cruise               1705 non-null   int64  
 14  entertainment        1705 non-null   int64  
 15  food                 1705 non-null   i

### 3.3.2 Weights for presence or absence of ratings/reviews

I created a weight for the presence of ratings/reviews as part of my feature engineering to see whether attractions with no reviews or ratings would play a role in the recommendations. Attractions with no ratings/reviews at all are assigned a lower weight, while those with ratings/reviews are assigned a higher weight. The reasoning is that people who are really happy or extremely upset with what they have encountered are more inclined to leave a comment.

Before creating the weights, these are the steps involved:
- Merging the main dataset (name: final_data) with attractions_reviews_cleaned to get the individual ratings and reviews.
- As mentioned in the data cleaning notebook, the rating column needs to be cleaned (missing values are coded as -1) before creating the weights.
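
The overall idea can be sketched on a tiny made-up example (the frames and column values below are hypothetical, and the sketch uses a left merge rather than the outer merge used in this notebook): merge reviews onto attractions, flag whether each row has a rating, then average the flag per attraction.

```python
import pandas as pd

# hypothetical attractions and their individual review ratings
toy_attractions = pd.DataFrame({"attraction_id": [0, 1, 2]})
toy_reviews = pd.DataFrame({"attraction_id": [0, 0, 1], "rating": [5.0, 4.0, 3.0]})

# keep every attraction; those without reviews get NaN in `rating`
toy_merged = toy_attractions.merge(toy_reviews, how="left", on="attraction_id")

# 1 where a rating exists, 0 where it is missing
toy_merged["weights"] = toy_merged["rating"].notna().astype(int)

# collapse back to one row per attraction
toy_weights = toy_merged.groupby("attraction_id")["weights"].mean()
```

Attraction 2 has no reviews, so its weight is 0, while attractions 0 and 1 get a weight of 1.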

In [29]:
# import dataset of attractions_reviews 
attractions_reviews_cleaned = pd.read_csv('./datasets/attractions_reviews_cleaned.csv')

In [30]:
# preview data
attractions_reviews_cleaned.head()

Unnamed: 0,attraction_id,rating,review,review_date,user
0,0,5.0,Another 'Dave' Guides us Around Vancouver. Lan...,"March 14, 2019",drew22perthaustralia
1,0,5.0,Fantastic way to explore VC. An easy way to ex...,"March 1, 2019",marc_h
2,0,5.0,This was a great half day tour!. Was there for...,"February 28, 2019",maggiehand
3,0,5.0,All the main attractions. Scott was our lovely...,"December 19, 2018",catherine255066
4,0,5.0,Excellent Vancouver Sightseeing Tour. We would...,"November 29, 2018",gearjamkw


In [31]:
# merge datasets
attractions_reviews_cleaned_merged = attractions_reviews_cleaned.merge(final_data, how='outer', on='attraction_id')

In [32]:
# drop review rows with no matching attraction in final_data (their final_data columns, e.g. rental, are NaN after the outer merge)
attractions_reviews_cleaned_merged.dropna(subset=['rental'], inplace=True)

In [33]:
# check the number of unique attraction ids to make sure only the attractions in final_data remain
attractions_reviews_cleaned_merged['attraction_id'].nunique()

1705

In [34]:
# temporarily replace missing ratings with a 999 placeholder so they can be identified when assigning weights
attractions_reviews_cleaned_merged['rating_new'] = attractions_reviews_cleaned_merged['rating_x'].replace(np.nan, 999)

In [35]:
# check if the value has been replaced accordingly
attractions_reviews_cleaned_merged['rating_new'].tail()

34821    999.0
34822    999.0
34823    999.0
34824    999.0
34825    999.0
Name: rating_new, dtype: float64

In [36]:
# reset index to create a new column for rating_id
attractions_reviews_cleaned_merged.reset_index(inplace=True) 

In [37]:
attractions_reviews_cleaned_merged.rename(columns={'index': 'rating_id'}, inplace=True)

In [38]:
# preview data
attractions_reviews_cleaned_merged.head()

Unnamed: 0,rating_id,attraction_id,rating_x,review,review_date,user,name,country,province,city_name,...,sea tour,sightseeing,transport,wildlife,duration,images,alcohol_places,outdoor activities,nature_combined,rating_new
0,0,0,5.0,Another 'Dave' Guides us Around Vancouver. Lan...,"March 14, 2019",drew22perthaustralia,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,0.0,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0
1,1,0,5.0,Fantastic way to explore VC. An easy way to ex...,"March 1, 2019",marc_h,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,0.0,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0
2,2,0,5.0,This was a great half day tour!. Was there for...,"February 28, 2019",maggiehand,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,0.0,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0
3,3,0,5.0,All the main attractions. Scott was our lovely...,"December 19, 2018",catherine255066,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,0.0,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0
4,4,0,5.0,Excellent Vancouver Sightseeing Tour. We would...,"November 29, 2018",gearjamkw,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,0.0,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0


Since the weight takes only two values (0 and 1), scaling is not necessary.

In [39]:
# create a new column and set it to 0 for all rows
attractions_reviews_cleaned_merged['weights'] = 0

In [40]:
# assign weights for the presence (1) or absence (0) of ratings;
# a vectorized assignment avoids chained indexing and its SettingWithCopyWarning
attractions_reviews_cleaned_merged['weights'] = np.where(attractions_reviews_cleaned_merged['rating_new'] == 999, 0, 1)


In [41]:
# preview data
final_data.head()

Unnamed: 0,attraction_id,name,country,province,city_name,price,rating,attraction,accommodation,air tour,...,rental,sea tour,sightseeing,transport,wildlife,duration,images,alcohol_places,outdoor activities,nature_combined
0,0,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,80.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,1,0,0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0
1,1,Vancouver To Victoria And Butchart Gardens Tou...,canada,British Columbia,Vancouver,210.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,1,1,0,0,13h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,1
2,2,Quebec City And Montmorency Falls Day Trip Fro...,canada,Quebec,Montreal,115.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,1,0,0,12h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0
3,3,Niagara Falls Day Trip From Toronto,canada,Ontario,Toronto,169.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,1,1,0,0,9h 30m,https://media-cdn.tripadvisor.com/media/attrac...,1,0,0
4,4,"Best Of Niagara Falls Tour From Niagara Falls,...",canada,Ontario,Niagara Falls,158.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,1,1,0,0,4–5 hours,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0


In [42]:
# group the weights by attraction_id to merge it back with the main data afterwards
weights_merged = attractions_reviews_cleaned_merged.groupby(by=['attraction_id'])['weights'].mean()

In [43]:
weights_merged = pd.DataFrame(weights_merged).reset_index()

In [44]:
# merge data
final_data = final_data.merge(weights_merged, how='left', on='attraction_id')

In [45]:
final_data.head()

Unnamed: 0,attraction_id,name,country,province,city_name,price,rating,attraction,accommodation,air tour,...,sea tour,sightseeing,transport,wildlife,duration,images,alcohol_places,outdoor activities,nature_combined,weights
0,0,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,80.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,1,0,0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0
1,1,Vancouver To Victoria And Butchart Gardens Tou...,canada,British Columbia,Vancouver,210.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,1,1,0,0,13h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,1,1.0
2,2,Quebec City And Montmorency Falls Day Trip Fro...,canada,Quebec,Montreal,115.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,1,0,0,12h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0
3,3,Niagara Falls Day Trip From Toronto,canada,Ontario,Toronto,169.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,1,1,0,0,9h 30m,https://media-cdn.tripadvisor.com/media/attrac...,1,0,0,1.0
4,4,"Best Of Niagara Falls Tour From Niagara Falls,...",canada,Ontario,Niagara Falls,158.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,1,1,0,0,4–5 hours,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0


### 3.3.3 Weights for rating 
- I created weights for the ratings and scaled them to see whether they would affect the recommender system as well. Higher ratings are assigned a higher weight, while lower ratings are assigned a lower weight.

Before creating the weights, these are the steps involved:
- Cleaning the rating column (missing values coded as -1) and imputing the mean before creating the weights.
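
A minimal sketch of this cleaning step on made-up ratings (the values are hypothetical; only the -1 placeholder convention comes from the notebook):

```python
import numpy as np
import pandas as pd

# hypothetical ratings where -1 marks a missing value
toy_rating = pd.Series([5.0, -1.0, 4.0, -1.0, 3.0])

# turn the -1 placeholder into a real missing value
toy_rating = toy_rating.replace(-1, np.nan)

# the mean skips NaN, so it is computed from the observed ratings only
observed_mean = toy_rating.mean()

# impute the missing ratings with that mean
toy_rating = toy_rating.fillna(observed_mean)
```

Converting -1 to NaN first matters: otherwise the placeholder would be averaged in and drag the mean down.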

In [46]:
# replace the placeholder rating of -1 (missing) with NaN
final_data['rating'] = final_data['rating'].replace(-1, np.nan)

In [47]:
final_data['rating'].describe()

count    987.000000
mean       4.668693
std        0.620935
min        1.000000
25%        4.500000
50%        5.000000
75%        5.000000
max        5.000000
Name: rating, dtype: float64

In [48]:
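# replace the missing values with a hard-coded mean rating
# note: describe() above reports a mean of 4.668693; final_data['rating'].mean() would avoid hard-coding
final_data['rating'] = final_data['rating'].fillna(value=4.670363)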

In [49]:
# check the descriptives again
final_data['rating'].describe()

count    1705.000000
mean        4.669396
std         0.472335
min         1.000000
25%         4.670363
50%         4.670363
75%         5.000000
max         5.000000
Name: rating, dtype: float64

The ratings are left-skewed (most are 4.5 or 5), so I will rescale them to the 0-1 range with min-max scaling, which puts them on the same scale as the binary category columns.
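
For reference, min-max scaling computes (x - min) / (max - min). A small sketch with plain NumPy, using a few sample ratings chosen for illustration on the 1-5 scale:

```python
import numpy as np

# sample ratings on the 1-5 scale used in this notebook
ratings = np.array([1.0, 3.5, 4.5, 5.0])

# min-max scaling: (x - min) / (max - min), which maps the range onto [0, 1]
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())
```

With a minimum of 1 and a maximum of 5, a rating of 4.5 maps to 0.875 and 3.5 maps to 0.625. Scaling changes the range of the values but not the shape of their distribution.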

In [50]:
# import relevant libraries
from sklearn.preprocessing import MinMaxScaler

In [51]:
# instantiate minmaxscaler 
scaler = MinMaxScaler()

In [52]:
# fit transform the ratings
final_data[['rating_scaled']] = scaler.fit_transform(final_data[['rating']])

In [53]:
# check to see if values have been scaled
final_data[['rating_scaled']]

Unnamed: 0,rating_scaled
0,0.875000
1,1.000000
2,0.875000
3,1.000000
4,1.000000
...,...
1700,1.000000
1701,0.625000
1702,0.917591
1703,0.625000
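
The scaled values above follow from the min-max formula x' = (x - min) / (max - min). A small sketch that reproduces them by hand, assuming the observed range of 1.0 to 5.0 from the `describe()` output (the helper name `min_max_scale` is hypothetical):

```python
import numpy as np

def min_max_scale(values, lo, hi):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    values = np.asarray(values, dtype=float)
    return (values - lo) / (hi - lo)

# observed rating range from the describe() output: min 1.0, max 5.0
scaled = min_max_scale([1.0, 3.5, 4.5, 4.670363, 5.0], lo=1.0, hi=5.0)
print(np.round(scaled, 6))
```

A rating of 4.5 maps to 0.875, 3.5 maps to 0.625, and the imputed value 4.670363 maps to about 0.917591, matching the rows shown in the output.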


## 3.4 Main model

- This model includes the new features. 

In [54]:
# create a separate df to store the relevant columns for the recommender system: attraction name, province, city, categories, weights, and scaled ratings
selected_data_main_model = final_data.drop(columns=['attraction_id', 'attraction', 'country', 'price', 'rating', 'duration', 'images'])

In [55]:
# preview data info
selected_data_main_model.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1705 entries, 0 to 1704
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 1705 non-null   object 
 1   province             1705 non-null   object 
 2   city_name            1705 non-null   object 
 3   accommodation        1705 non-null   int64  
 4   air tour             1705 non-null   int64  
 5   alcohol              1705 non-null   int64  
 6   city                 1705 non-null   int64  
 7   classes & workshops  1705 non-null   int64  
 8   cruise               1705 non-null   int64  
 9   entertainment        1705 non-null   int64  
 10  food                 1705 non-null   int64  
 11  land tour            1705 non-null   int64  
 12  photography          1705 non-null   int64  
 13  rental               1705 non-null   int64  
 14  sea tour             1705 non-null   int64  
 15  sightseeing          1705 non-null   i

In [56]:
# set attraction name as the index
selected_data_main_model.set_index('name', inplace=True)
print(selected_data_main_model.shape)
selected_data_main_model.head()

(1705, 22)


name,province,city_name,accommodation,air tour,alcohol,city,classes & workshops,cruise,entertainment,food,...,rental,sea tour,sightseeing,transport,wildlife,alcohol_places,outdoor activities,nature_combined,weights,rating_scaled
Vancouver City Sightseeing Tour,British Columbia,Vancouver,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1.0,0.875
Vancouver To Victoria And Butchart Gardens Tour By Bus,British Columbia,Vancouver,0,0,0,0,0,1,0,0,...,0,1,1,0,0,0,0,1,1.0,1.0
Quebec City And Montmorency Falls Day Trip From Montreal,Quebec,Montreal,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,1.0,0.875
Niagara Falls Day Trip From Toronto,Ontario,Toronto,0,0,1,0,0,1,0,1,...,0,1,1,0,0,1,0,0,1.0,1.0
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",Ontario,Niagara Falls,0,0,0,0,0,1,0,0,...,0,1,1,0,0,0,0,0,1.0,1.0


In [57]:
# create dummies for province and city columns 
selected_data_main_model = pd.get_dummies(data=selected_data_main_model, columns=['province', 'city_name'])
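
To illustrate what the dummy encoding does to the feature space, here is a toy sketch with hypothetical attraction names. Passing `dtype=int` is an assumption made here so the output is 0/1 integers regardless of pandas version:

```python
import pandas as pd

# toy frame: one categorical column and one binary category column
toy = pd.DataFrame(
    {"province": ["Ontario", "Quebec", "Ontario"], "sightseeing": [1, 0, 1]},
    index=["tour_a", "tour_b", "tour_c"],
)

# each distinct province becomes its own 0/1 column, prefixed with the column name
encoded = pd.get_dummies(data=toy, columns=["province"], dtype=int)
print(encoded)
```

Each distinct province or city becomes one extra column, which is why the feature space grows well beyond the original category columns.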

In [58]:
# compute the cosine similarity between every pair of attractions (rows)
attr_sim_main = pd.DataFrame(
    cosine_similarity(selected_data_main_model),
    columns=selected_data_main_model.index,
    index=selected_data_main_model.index,
)
print(attr_sim_main.shape)
attr_sim_main.head()

(1705, 1705)


name,Vancouver City Sightseeing Tour,Vancouver To Victoria And Butchart Gardens Tour By Bus,Quebec City And Montmorency Falls Day Trip From Montreal,Niagara Falls Day Trip From Toronto,"Best Of Niagara Falls Tour From Niagara Falls, Ontario",Niagara Falls In One Day: Deluxe Sightseeing Tour Of American And Canadian Sides,Whistler Small-Group Day Trip From Vancouver,Ultimate Niagara Falls Tour Plus Helicopter Ride And Skylon Tower Lunch,"Local Food, Craft Beverage And Estate Winery Tour Of Cowichan Valley",Private Tour: Vancouver To Victoria Island,...,Quebec City Shore Excursion: Quebec City Sightseeing Tour,Montreal Indoor Skydiving Introductory Package,Banff Day Trip From Calgary,60-Minute Deluxe Horse-Drawn Carriage Tour,Toronto Inner Harbour Evening Cruise,30-Minute Distillery District Segway Tour In Toronto,Montreal Quadricycle Rental,Intro Survival Course Rockies,Private Toronto Guided City Tour,Peyto Lake Snowshoe Tour
Vancouver City Sightseeing Tour,1.0,0.752892,0.704388,0.449181,0.526711,0.526711,0.798563,0.471105,0.526711,0.708389,...,0.698792,0.360427,0.519511,0.853699,0.480376,0.708389,0.42173,0.384804,0.587317,0.445833
Vancouver To Victoria And Butchart Gardens Tour By Bus,0.752892,1.0,0.496588,0.603023,0.707107,0.707107,0.824958,0.632456,0.471405,0.881917,...,0.48795,0.166667,0.58313,0.629941,0.541093,0.503953,0.233299,0.347289,0.376867,0.540279
Quebec City And Montmorency Falls Day Trip From Montreal,0.704388,0.496588,1.0,0.449181,0.526711,0.526711,0.526711,0.471105,0.390786,0.417768,...,0.848868,0.744882,0.519511,0.708389,0.480376,0.708389,0.752904,0.384804,0.587317,0.445833
Niagara Falls Day Trip From Toronto,0.449181,0.603023,0.449181,1.0,0.746203,0.746203,0.426401,0.76277,0.639602,0.455842,...,0.441367,0.150756,0.419264,0.455842,0.695516,0.683763,0.211027,0.421803,0.600615,0.363955
"Best Of Niagara Falls Tour From Niagara Falls, Ontario",0.526711,0.707107,0.526711,0.746203,1.0,1.0,0.5,0.894427,0.375,0.534522,...,0.517549,0.176777,0.49163,0.534522,0.69474,0.668153,0.247451,0.368355,0.552006,0.426776


In [59]:
# look at similar attractions as Vancouver City Sightseeing Tour
attr_sim_main['Vancouver City Sightseeing Tour'].sort_values(ascending=False).head(30) 

name
Vancouver City Sightseeing Tour                                                1.000000
Stanley Park Horse-Drawn Tours                                                 1.000000
Vancouver Sightseeing Bus Tour (4 Hrs)                                         1.000000
Vancouver City Tour With Vancouver Lookout Admission                           1.000000
Vancouver City Private Tour                                                    0.999010
Street Art Cycling Tour                                                        0.999010
Small-Group Afternoon Bike Tour Of Vancouver                                   0.999010
Epic Electric Bike Tour Of Vancouver                                           0.999010
Beyond City Sights E-Bike Adventure Tour                                       0.999010
Grand City Tour Of Vancouver                                                   0.999010
Vancouver Private Walking Tour Of Downtown Chinatown And Gastown               0.999010
Bike Tour Of Downtown Vanco
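
For reference, cosine similarity is the dot product of two row vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on a toy feature matrix (hypothetical attractions and features). It assumes no row is all zeros, since that would divide by zero:

```python
import numpy as np
import pandas as pd

# toy feature matrix: three hypothetical attractions, four binary features
features = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]],
    index=["tour_a", "tour_b", "tour_c"],
    columns=["sightseeing", "city", "cruise", "food"],
)

X = features.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1, keepdims=True)  # length of each row vector
unit = X / norms                                   # rows scaled to unit length
sim = pd.DataFrame(unit @ unit.T, index=features.index, columns=features.index)
print(sim.round(3))
```

Here `tour_a` and `tour_b` share one of their two features and score 0.5, while `tour_c` shares none and scores 0. A score of 1.0, as in the first rows of the output above, means two attractions have feature vectors pointing in the same direction.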

### 3.4.1 User profiling for main model

Step 1: I will first create the "same" feature space by initializing all the categories to zero for a new user.

In [60]:
# create a vector for a "new user" sign-up by initializing all categories to zero
category = selected_data_main_model.columns  # match the new user's feature space to the existing dataset
my_profile2 = pd.Series(data=np.zeros(len(category)), index=category)  # initialize 0s for all categories to create the new user vector using: (https://numpy.org/doc/stable/reference/generated/numpy.zeros.html)
my_profile2

accommodation             0.0
air tour                  0.0
alcohol                   0.0
city                      0.0
classes & workshops       0.0
                         ... 
city_name_Windsor         0.0
city_name_Winnipeg        0.0
city_name_Wood Islands    0.0
city_name_Yarmouth        0.0
city_name_Yellowknife     0.0
Length: 156, dtype: float64

Step 2: Create a profile for a hypothetical user by asking them to rate each category 1 for like and -1 for dislike. Categories left unrated stay at 0.

In [61]:
# inputs assigned to the respective categories
my_profile2['sightseeing'] = 1
my_profile2['photography'] = 1
my_profile2['wildlife'] = 1
my_profile2['city'] = -1
my_profile2['cruise'] = -1
my_profile2['entertainment'] = -1

In [62]:
# new user vectors
my_profile2.values

array([ 0.,  0.,  0., -1.,  0., -1., -1.,  0.,  0.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [63]:
# dot product between the attraction vectors and the my_profile2 (new user) vector, which share the same feature space
recommendations = np.dot(selected_data_main_model.values, my_profile2.values)

# convert to pandas Series for ease of working with the recommendations data
recommendations = pd.Series(recommendations, index=selected_data_main_model.index)

# get the top 20 recommendations for the new user
recommendations.sort_values(ascending=False).head(20)

name
Squamish Cheakamus White-Water Rafting                                               2.0
Wilderness & Eagle Viewing Float                                                     2.0
Become A Voyageur - Canoe Tour                                                       2.0
Whistler Electric Bike Tour (Ebike)                                                  2.0
90 Minute Private Vacation Photography Session With Photographer In Banff            2.0
Banff, Jasper, Okanagan Lake & Kamloops Tour From Vancouver                          2.0
3-Day Algonquin Park Canoe Trip                                                      2.0
Whistler Singletrack Mountain Bike Tour                                              2.0
Brentwood Bay Guided Kayak Tours                                                     2.0
Banff, Yoho & Jasper National Parks Summer Tour From Calgary (Airport Transfers)     2.0
5.5-Hour Jasper Wildlife And Waterfalls Tour With Maligne Lake Cruise From Jasper    2.0
Aurora Tour     
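
The scoring step can be traced on a toy example. This is a sketch with hypothetical attraction names and values, not the notebook's data. Each liked category an attraction has adds 1 to its score and each disliked category subtracts 1:

```python
import pandas as pd

# toy attraction matrix (hypothetical names and values)
attractions = pd.DataFrame(
    {
        "sightseeing": [1, 1, 0],
        "wildlife": [1, 0, 0],
        "city": [0, 1, 1],
        "rating_scaled": [0.875, 1.0, 0.625],
    },
    index=["river_float", "city_bus_tour", "night_market"],
)

# user profile on the same feature space: 1 = like, -1 = dislike, 0 = no opinion
profile = pd.Series(0.0, index=attractions.columns)
profile["sightseeing"] = 1
profile["wildlife"] = 1
profile["city"] = -1

# dot product of every attraction row with the profile vector
scores = attractions.dot(profile)
print(scores.sort_values(ascending=False))
```

Because the profile entry for `rating_scaled` stays at 0, the rating does not contribute to the score in this sketch. Only columns with a non-zero profile entry matter, which is also why many attractions can tie on the same score.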

In [64]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1705 entries, 0 to 1704
Data columns (total 30 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   attraction_id        1705 non-null   int64  
 1   name                 1705 non-null   object 
 2   country              1705 non-null   object 
 3   province             1705 non-null   object 
 4   city_name            1705 non-null   object 
 5   price                1705 non-null   float64
 6   rating               1705 non-null   float64
 7   attraction           1705 non-null   object 
 8   accommodation        1705 non-null   int64  
 9   air tour             1705 non-null   int64  
 10  alcohol              1705 non-null   int64  
 11  city                 1705 non-null   int64  
 12  classes & workshops  1705 non-null   int64  
 13  cruise               1705 non-null   int64  
 14  entertainment        1705 non-null   int64  
 15  food                 1705 non-null   i

In [65]:
# export main model data
selected_data_main_model.to_csv('./datasets/selected_data_main_model.csv', index=False)

In [66]:
# export attractions reviews data for nlp
attractions_reviews_cleaned_merged.to_csv('./datasets/attractions_reviews_cleaned_merged.csv', index=False)

In [67]:
# save final data
final_data.to_csv('./datasets/data_test.csv', index=False)

## 3.5 Overall thoughts on base and main models 

Overall, both models seem to perform reasonably well: the attractions I would expect to be related were recommended, as seen in the similarity scores. I will store the cosine similarity calculations and use them to build the recommender application.

This recommender serves as a starting point for people to explore other activities that they could be interested in. For example, someone who fancies a city sightseeing tour would be exposed to similar activities they may not have thought of, such as an Indigenous walking tour or a horse-drawn tour. Furthermore, there may be merit in using the main model instead because of the feature engineering and the reduction of category labels.
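
A stored similarity matrix could be queried in the application with a small helper along these lines. This is a sketch only: the function name `top_similar`, the attraction names, and the scores are all hypothetical, and it assumes attraction names are unique in the index.

```python
import pandas as pd

def top_similar(sim, name, n=5):
    """Return the n attractions most similar to `name`, excluding itself."""
    return sim[name].drop(labels=name).sort_values(ascending=False).head(n)

# toy symmetric similarity matrix (illustrative scores)
names = ["city_tour", "horse_drawn_tour", "walking_tour", "kayak_tour"]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.2],
     [0.9, 1.0, 0.7, 0.3],
     [0.8, 0.7, 1.0, 0.1],
     [0.2, 0.3, 0.1, 1.0]],
    index=names,
    columns=names,
)

result = top_similar(sim, "city_tour", n=2)
print(result)
```

Dropping the query attraction first avoids returning it as its own best match, since every attraction has a similarity of 1.0 with itself.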
