# Capstone Project: Travel Recommender System Based on Activity Preferences

Done by: Richelle-Joy Chia, [Linkedin](https://www.linkedin.com/in/richelle-joy-chia/)

# Part 4: Natural Language Processing (NLP)

The purpose of this NLP step is to conduct a sentiment analysis (Hugging Face) to retrieve the emotions for each review and add the labels as an information for users in Streamlit. As mentioned in my introduction, people these days are bombarded with tons of information and to have to go through all the individual reviews to come up with a conclusion of whether or not this activity may be worth trying. Therefore, I would like to simplify this process by allowing users to have a snapshot of what people generally feel.

I used the jHHart model, which has been trained on 6 diverse datasets predicting Ekman's 6 basic emotions + a neutral class. This model aims to classify emotions from English text data. 

- https://huggingface.co/j-hartmann/emotion-english-roberta-large 

## 4.1 Import relevant libraries and datasets

In [1]:
# import libraries 

import pandas as pd
import numpy as np
import re
from sklearn.pipeline import Pipeline 
from transformers import pipeline
from tqdm import tqdm

In [2]:
nlp_dataset = pd.read_csv('./datasets/attractions_reviews_cleaned_merged.csv')
final_data = pd.read_csv('./datasets/data_test.csv')

## 4.2 Reviews EDA

In this section, I explored the reviews column to look at the general descriptives of character and word length. 

In [3]:
# preview data
nlp_dataset.head()

Unnamed: 0,rating_id,attraction_id,rating_x,review,review_date,user,name,country,province,city_name,...,sightseeing,transport,wildlife,duration,images,alcohol_places,outdoor activities,nature_combined,rating_new,weights
0,0,0,5.0,Another 'Dave' Guides us Around Vancouver. Lan...,"March 14, 2019",drew22perthaustralia,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0,1
1,1,0,5.0,Fantastic way to explore VC. An easy way to ex...,"March 1, 2019",marc_h,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0,1
2,2,0,5.0,This was a great half day tour!. Was there for...,"February 28, 2019",maggiehand,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0,1
3,3,0,5.0,All the main attractions. Scott was our lovely...,"December 19, 2018",catherine255066,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0,1
4,4,0,5.0,Excellent Vancouver Sightseeing Tour. We would...,"November 29, 2018",gearjamkw,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,...,1.0,0.0,0.0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0.0,0.0,0.0,5.0,1


In [4]:
# look at a sample review
nlp_dataset.iloc[6].review

"It was worth taking and very interesting.. I really enjoyed this tour, and I would recommend it to family and friends. You get to see alot of places and things when on the tour. it's very enjoyable."

In [5]:
# convert review column to string
nlp_dataset['review'] = nlp_dataset['review'].astype(str)

Here, I would like to explore the character and word count of reviews

In [6]:
# creating new column for character count
nlp_dataset['char_count'] = nlp_dataset['review'].map(len)

In [7]:
# creating new column for word count
nlp_dataset['word_count'] = nlp_dataset['review'].map(lambda x: len(x.split()))

In [8]:
# descriptives of character count
nlp_dataset['char_count'].describe()

count    14511.000000
mean       243.015437
std         74.696637
min          3.000000
25%        237.000000
50%        264.000000
75%        284.000000
max        419.000000
Name: char_count, dtype: float64

In [9]:
# descriptives of word count
nlp_dataset['word_count'].describe()

count    14511.000000
mean        43.693612
std         13.215418
min          1.000000
25%         45.000000
50%         48.000000
75%         50.000000
max         68.000000
Name: word_count, dtype: float64

## 4.3 Using RegEx to remove symbols

Thereafter, I used RegEx to remove symbols to process the data before putting it through HuggingFace.

In [10]:
# reformat using regex

def split_it(text):
    x = re.findall("[a-zA-Z]+", str(text))
    return(' '.join(x))

nlp_dataset['review_new'] = nlp_dataset['review'].apply(split_it)

In [11]:
nlp_dataset['review_new']

0        Another Dave Guides us Around Vancouver Landse...
1        Fantastic way to explore VC An easy way to exp...
2        This was a great half day tour Was there for b...
3        All the main attractions Scott was our lovely ...
4        Excellent Vancouver Sightseeing Tour We would ...
                               ...                        
14506                                                  nan
14507                                                  nan
14508                                                  nan
14509                                                  nan
14510                                                  nan
Name: review_new, Length: 14511, dtype: object

Over here, I checked to see if the reviews have been accurately processed, and it seems to be working as intended.

In [12]:
# explore review before RegEx
nlp_dataset.iloc[20].review

'Awesome tour!. Yesterday our tour guide, Ed, showed us the sights of Vancouver from Coal Harbor to Stanley Park, Granville Island, Chinatown, Gas Town and the 360 degree views from the Vancouver Lookout. It was a rainy day but our group of 6 friends since high school...'

In [13]:
# explore review after RegEx
nlp_dataset.iloc[20].review_new

'Awesome tour Yesterday our tour guide Ed showed us the sights of Vancouver from Coal Harbor to Stanley Park Granville Island Chinatown Gas Town and the degree views from the Vancouver Lookout It was a rainy day but our group of friends since high school'

## 4.4 Hugging Face for sentiment analysis (jHart)

In [19]:
# run pipeline to activate model 1 and test with a random sentence to see if it works

classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", truncation=True)
classifier("I love my life")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at j-hartmann/emotion-english-distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


[{'label': 'joy', 'score': 0.9614477157592773}]

In [20]:
# custom function to store the respective scores (in this scenario, scores refer to the probability of the emotion occuring, which will be used as the metric) 

def classify(df):
    labels=[]
    scores=[]
    for i in tqdm(df['review'], desc = 'tqdm() Progress Bar'): 
        index = 0
        if index < len(df):
            output = classifier([i])
            labels.append(output[0]['label'])
            scores.append(output[0]['score'])
            index += 1
    df['labels_jhart'] = pd.DataFrame(labels)
    df['scores_jhart'] = pd.DataFrame(scores)
    return

In [21]:
%%time
classify(no_att_reviews_merged_nlp)

tqdm() Progress Bar: 100%|██████████| 14566/14566 [28:48<00:00,  8.43it/s]

CPU times: total: 49min 32s
Wall time: 28min 48s





In [117]:
# store df as csv
nlp_dataset.to_csv('nlp_dataset.csv', index=False)

In [14]:
nlp_dataset = pd.read_csv('./datasets/nlp_dataset.csv')

#### Explore the labels created by the Hugging Face model

In this section, I explored the count of each label and examined the distribution of the labels. 

In [15]:
# look at the mean of the labels 
nlp_dataset.groupby('labels_jhart')['scores_jhart'].mean()

labels_jhart
anger       0.662026
disgust     0.655375
fear        0.759702
joy         0.827723
neutral     0.626314
sadness     0.641412
surprise    0.676826
Name: scores_jhart, dtype: float64

In [16]:
# creating dummies to separate out the columns
nlp_dataset = pd.get_dummies(data=nlp_dataset, columns=['labels_jhart'])

In [17]:
# preview data
nlp_dataset.head()

Unnamed: 0,attraction_id,rating_x,review,review_date,user,name,country,province,city,location__lat,...,word_count,review_new,scores_jhart,labels_jhart_anger,labels_jhart_disgust,labels_jhart_fear,labels_jhart_joy,labels_jhart_neutral,labels_jhart_sadness,labels_jhart_surprise
0,0,5.0,Another 'Dave' Guides us Around Vancouver. Lan...,"March 14, 2019",drew22perthaustralia,Vancouver City Sightseeing Tour,canada,british_columbia,vancouver,49.197832,...,51,Another Dave Guides us Around Vancouver Landse...,0.976757,0,0,1,0,0,0,0
1,0,5.0,Fantastic way to explore VC. An easy way to ex...,"March 1, 2019",marc_h,Vancouver City Sightseeing Tour,canada,british_columbia,vancouver,49.197832,...,50,Fantastic way to explore VC An easy way to exp...,0.972611,0,0,0,1,0,0,0
2,0,5.0,This was a great half day tour!. Was there for...,"February 28, 2019",maggiehand,Vancouver City Sightseeing Tour,canada,british_columbia,vancouver,49.197832,...,52,This was a great half day tour Was there for b...,0.891894,0,0,0,1,0,0,0
3,0,5.0,All the main attractions. Scott was our lovely...,"December 19, 2018",catherine255066,Vancouver City Sightseeing Tour,canada,british_columbia,vancouver,49.197832,...,49,All the main attractions Scott was our lovely ...,0.940322,0,0,0,1,0,0,0
4,0,5.0,Excellent Vancouver Sightseeing Tour. We would...,"November 29, 2018",gearjamkw,Vancouver City Sightseeing Tour,canada,british_columbia,vancouver,49.197832,...,49,Excellent Vancouver Sightseeing Tour We would ...,0.800723,0,0,0,0,1,0,0


As the dataset used in Hugging Face included individual reviews, I will be grouping the results by attraction ID before merging it to the main dataset that only shows 1 attraction id. 

In [18]:
# creating a new df to store the groupby results
nlp_merged = nlp_dataset.groupby(by=['attraction_id'])[['labels_jhart_anger', 'labels_jhart_disgust', 'labels_jhart_fear', 'labels_jhart_joy', 'labels_jhart_neutral', 'labels_jhart_sadness', 'labels_jhart_surprise']].sum()

In [19]:
# reset index that has been messed up due to merge 
nlp_merged = pd.DataFrame(nlp_merged).reset_index()

In [20]:
# preview data
nlp_merged.head()

Unnamed: 0,attraction_id,labels_jhart_anger,labels_jhart_disgust,labels_jhart_fear,labels_jhart_joy,labels_jhart_neutral,labels_jhart_sadness,labels_jhart_surprise
0,0,2,1,3,36,12,0,3
1,1,1,0,3,58,18,3,6
2,2,2,2,2,43,13,3,2
3,3,1,1,0,18,4,2,1
4,4,2,0,3,75,16,1,14


In [21]:
# displaying columns of nlp_merged df to see if the labels are accurate
nlp_merged.columns

Index(['attraction_id', 'labels_jhart_anger', 'labels_jhart_disgust',
       'labels_jhart_fear', 'labels_jhart_joy', 'labels_jhart_neutral',
       'labels_jhart_sadness', 'labels_jhart_surprise'],
      dtype='object')

This portion aims to display the top emotion labels for each attraction id. 

In [22]:
# created a copy to set attraction_id as index 
nlp_merged2 = nlp_merged.copy()

In [23]:
nlp_merged2.set_index('attraction_id',inplace=True)

In [24]:
# display top emotion labels 
nlp_merged2.idxmax(axis = 1)

attraction_id
0           labels_jhart_joy
1           labels_jhart_joy
2           labels_jhart_joy
3           labels_jhart_joy
4           labels_jhart_joy
                ...         
3546        labels_jhart_joy
3557        labels_jhart_joy
3561    labels_jhart_neutral
3609    labels_jhart_neutral
3649    labels_jhart_neutral
Length: 1760, dtype: object

In [25]:
# reset index
nlp_merged2 = nlp_merged2.idxmax(axis = 1).reset_index()

In [26]:
# rename column
nlp_merged2.rename({0: 'most_frequent_label'}, axis=1, inplace=True)

In [27]:
# preview data 
nlp_merged2

Unnamed: 0,attraction_id,most_frequent_label
0,0,labels_jhart_joy
1,1,labels_jhart_joy
2,2,labels_jhart_joy
3,3,labels_jhart_joy
4,4,labels_jhart_joy
...,...,...
1755,3546,labels_jhart_joy
1756,3557,labels_jhart_joy
1757,3561,labels_jhart_neutral
1758,3609,labels_jhart_neutral


After creating the new column (most_frequent_label), I will be merging this column back to the main df

In [28]:
# merge most_frequent_label column w main df
final_data = pd.merge(final_data,nlp_merged2[['attraction_id','most_frequent_label']],on='attraction_id', how='left')

In [29]:
# check to see if data has been merged correctly
final_data.head()

Unnamed: 0,attraction_id,name,country,province,city_name,price,rating,attraction,accommodation,air tour,...,transport,wildlife,duration,images,alcohol_places,outdoor activities,nature_combined,weights,rating_scaled,most_frequent_label
0,0,Vancouver City Sightseeing Tour,canada,British Columbia,Vancouver,80.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,3h 30m,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0,0.875,labels_jhart_joy
1,1,Vancouver To Victoria And Butchart Gardens Tou...,canada,British Columbia,Vancouver,210.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,13h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,1,1.0,1.0,labels_jhart_joy
2,2,Quebec City And Montmorency Falls Day Trip Fro...,canada,Quebec,Montreal,115.0,4.5,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,12h,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0,0.875,labels_jhart_joy
3,3,Niagara Falls Day Trip From Toronto,canada,Ontario,Toronto,169.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,9h 30m,https://media-cdn.tripadvisor.com/media/attrac...,1,0,0,1.0,1.0,labels_jhart_joy
4,4,"Best Of Niagara Falls Tour From Niagara Falls,...",canada,Ontario,Niagara Falls,158.0,5.0,https://tripadvisor.ca/AttractionProductDetail...,0,0,...,0,0,4–5 hours,https://media-cdn.tripadvisor.com/media/attrac...,0,0,0,1.0,1.0,labels_jhart_joy


Based on the descriptives below, the most common sentiments were neutral and joy. Given that the mean of the ratings is 4.67, it does seem plausible that the highest ratings were in general on the positive and neutarl side, more than the negative. 

In [30]:
# looking at the counts of each label
final_data['most_frequent_label'].value_counts()

labels_jhart_neutral     961
labels_jhart_joy         692
labels_jhart_surprise     17
labels_jhart_fear         16
labels_jhart_sadness      10
labels_jhart_disgust       6
labels_jhart_anger         3
Name: most_frequent_label, dtype: int64

In [34]:
# display mean of ratings again
final_data['rating'].mean()

4.669396266275661

The next 2 steps are important for display in streamlit - removing underline and capitalize text.

In [31]:
# remove symbols
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_neutral','neutral')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_joy','joy')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_surprise','surprise')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_fear','fear')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_sadness','sadness')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_disgust','disgust')
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.replace('labels_jhart_anger','anger')


In [32]:
# capitalize text
final_data['most_frequent_label'] = final_data['most_frequent_label'].str.capitalize()

In [33]:
# export data
final_data.to_csv('./datasets/data_test.csv', index=False)

# Conclusion 
In summary, this project aims to lay the foundation for a bigger goal, that is, to kick start a travel recommender system based on activity preferences in order to make travel planning a more efficient task. In this project, I was able to use Cosine Similarity, User Acceptance Testing, and A/B testing to improve the model predictions by 20%. 
- In the base model, 60% of the recommended attractions were found to be in line with users and what they would do.
- In the main model, 80% of the recommended attractions were what people would be likely to do as well. 

This recommender system was deployed in Streamlit - [Click here to try!](https://richellejoychia-streamlit-apps-streamlit-app-wtufwn.streamlit.app/)
 
# Limitations and future work
Even though the results of the recommender system were satisfactory, there are still some limitations to consider for future work.

- The activities included were only in Canada and may not be as comprehensive. It would be interesting to scrap data from other countries and incorporate it into the recommender system such that users who like a particular attraction would be recommended attractions from other countries as well. 
- Only cosine similarity was used in building this recommender systemm. Future work can collect implicit feedback that are not ratings or scores provided by the user, such as clicks and participated activities. Thereafter, a hybrid recommender system can be implemented.
- The data is still somewhat limited for a recommender system and future work can continue to collect more information from users to improve recommendations.
- The NLP Hugging Face model is only used for category labelling. It would be interesting to create weights with the results and see whether positive, negative, or neutral emotions would influence the results. 

As for streamlit deployment,
- It is best viewed on desktop due to the layout, which may not be convenient for most users. One possible solution would be to deploy this on a mobile app for easy access. 
- The radio button may not be the most intuitive for users to rate. Afterwhich, I would like to create a swipe function such that users can swipe the photo either right (like) or left (dislike) to indicate their preferences. Having a bigger button would suffice as well. These steps would allow smoother and more instantaneous response.
- The filter on the sidebar is still a work in progress and future work can improve the future to ensure that when users click on the fliter, the filtered activities will follow suit.
- Currently, there is only 1 emotion label that is being shown. It may be useful to incoporate more than 1. Moreover, future work can use different huggingface models to explore other labels. 
- Only 1 photo is being displayed and this photo may not be the most accurate. As such, it would be interesting to explore more photos or even short videos/reels in future. 
- There are only 6 recommendations and would be good to include an option for people to view more recommendations and rate whether or not they like the activity.