<a id="index"></a>
# Vacation Recommendations using Tripadvisor Reviews

This notebook walks through the steps to model and build a recommendation engine for vacation destinations and activities around the world, based on user interests.  

Key steps included in this workbook:  
  
[Step 1: Import scraped reviews from TripAdvisor](#review_data)    
[Step 2: Dimensionality reduction - LDA topic modeling with reviews](#LDA)    
[Step 3: Calculate similarities between activities based on topic distribution of reviews](#cosine)  
[Step 4: Create Recommendation Engine](#recommendation)  

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from operator import itemgetter
from gensim import corpora, models, similarities, matutils
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import seaborn as sns
import pickle
%pylab inline

Populating the interactive namespace from numpy and matplotlib


[Back to Top](#index)
<a id="review_data"></a>
## Step 1: Import scraped reviews from TripAdvisor
Please note: the Scrapy code is provided in a separate file in the repository. 

### Read in scraped files

In [2]:
# Selected JSON files from the TripAdvisor scrape; several cities are excluded for simplicity
atl_reviews = pd.read_json("scraped_data/all_atl_reviews.json")
berlin_reviews = pd.read_json("scraped_data/all_berlin_reviews.json")
budapest_reviews = pd.read_json("scraped_data/all_budapest_reviews.json")
chicago_reviews = pd.read_json("scraped_data/all_chicago_reviews.json")
cusco_reviews = pd.read_json("scraped_data/all_cusco_reviews.json")

In [3]:
# Combine cities into one pandas dataframe
dfs = [atl_reviews, berlin_reviews, budapest_reviews, chicago_reviews, cusco_reviews]
all_reviews = pd.concat(dfs)

In [4]:
# Select columns for use in this analysis
all_reviews = all_reviews[["city", "activity_name", "review_text"]]
all_reviews.head()

| | city | activity_name | review_text |
|---|---|---|---|
| 0 | atlanta | Georgia World Congress Center | This is one if the largest and nicest convenie... |
| 1 | atlanta | Georgia World Congress Center | I travel to a fair amount of conventions inclu... |
| 2 | atlanta | Georgia World Congress Center | The meeting and vendors were outstanding...how... |
| 3 | atlanta | Georgia World Congress Center | Great place to have a concert. We sat in the b... |
| 4 | atlanta | Georgia World Congress Center | We were very impressed with the facility, but ... |


[Back to Top](#index)
<a id="LDA"></a>
## Step 2: Dimensionality reduction - LDA topic modeling with reviews

This step uses topic modeling to reduce the dimensionality of the TripAdvisor user reviews: each review is represented by its distribution over 50 LDA topics rather than by word counts over the full vocabulary.

### Preprocessing reviews

In [5]:
review_text_only = all_reviews.iloc[:,-1].tolist()

In [15]:
p_stemmer = PorterStemmer()
en_stop = stopwords.words('english') + ['.', ',', '(', ')', "'", '"', "-", "!", "!!", "!!!", "..."]

# List for tokenized documents in loop
cleaned_reviews = []

# Loop through document list
for line in review_text_only:
    # Clean and tokenize document string
    raw = line.lower()
    tokens = wordpunct_tokenize(raw)
    
    # Remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    
    # Stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # Combine words back into reviews
    cleaned_reviews.append(" ".join(stemmed_tokens))
    

cleaned_reviews[:2]

['one largest nicest conveni center visit usher guest servic nice friendli knowledg',
 'travel fair amount convent includ held orlando vega set one modern facil apart next staff base experi georgia world congress center runaway favorit park attend']

In [16]:
# Create Count Vectorizer
vectorizer = CountVectorizer(stop_words=en_stop, max_df=0.5)
cv = vectorizer.fit_transform(cleaned_reviews).transpose()

In [17]:
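# Check number of unique words and validate matrix shape (terms x documents after the transpose)
# len(vectorizer.vocabulary_) is used because get_feature_names() was removed in scikit-learn 1.2
print("# of Unique Words: " + str(len(vectorizer.vocabulary_)))
print("Bag of Words Matrix shape: " + str(cv.shape))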

# of Unique Words: 29895
Bag of Words Matrix shape: (29895, 74862)


In [9]:
# Run LDA model on reviews. Model runtime is lengthy (~40 min), pickle file of results available below.
corpus = matutils.Sparse2Corpus(cv)
id2word = dict((v, k) for k, v in vectorizer.vocabulary_.items())
lda = models.LdaModel(corpus=corpus, num_topics=50, id2word=id2word, random_state=21, passes=10)

# Convert results from model to working format
lda_corpus = lda[corpus]
lda_docs = [doc for doc in lda_corpus]

# Pickle results of model
# with open('lda_docs_5_cities.pkl', 'wb') as picklefile:
#     pickle.dump(lda_docs, picklefile)

In [7]:
# Read in pickled results of model
with open('lda_docs_5_cities.pkl', 'rb') as picklefile_lda:
    lda_docs = pickle.load(picklefile_lda)
    
lda_docs[:2]

[[(5, 0.087720139515548662),
  (8, 0.20759607396803828),
  (18, 0.094213689538458753),
  (25, 0.10180648414140132),
  (31, 0.097364106918526233),
  (41, 0.084086662713812918),
  (42, 0.078461538461538541),
  (49, 0.18413592012729169)],
 [(5, 0.040818692113012478),
  (8, 0.12156664314465122),
  (17, 0.12947984723081701),
  (20, 0.042468201232785212),
  (21, 0.17327899824384269),
  (22, 0.08359388149720734),
  (23, 0.085545782449837904),
  (43, 0.040809558450479122),
  (44, 0.16322614286129261),
  (49, 0.088443022006844133)]]
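
Each document above is a sparse list of `(topic_id, probability)` pairs. gensim leaves out topics whose probability falls below the model's `minimum_probability` threshold, so the listed values need not sum to 1. A minimal sketch, using the first document from the output above (probabilities rounded) and a hypothetical helper `to_dense`, shows both the expansion to a fixed-length vector and the missing probability mass:

```python
# First document from the output above, probabilities rounded to 4 places
doc = [(5, 0.0877), (8, 0.2076), (18, 0.0942), (25, 0.1018),
       (31, 0.0974), (41, 0.0841), (42, 0.0785), (49, 0.1841)]

def to_dense(sparse_doc, num_topics=50):
    """Expand (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_doc:
        dense[topic_id] = prob
    return dense

vector = to_dense(doc)
print(len(vector))            # one entry per topic
print(round(sum(vector), 4))  # below 1: the omitted topics hold the remainder
```

Topics that do not appear in the sparse list simply stay at zero in the dense vector.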

[Back to Top](#index)
<a id="cosine"></a>
## Step 3: Calculate similarity between activities based on topic distribution of reviews 
In this step, I take the LDA topic distribution of each review and average these by activity to get a representative topic distribution vector for each activity. I then calculate the cosine similarity between every pair of activities in the dataset, which the recommendation engine uses to find similar activities.
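
The two operations described here can be sketched on toy data with the standard library alone. The three-topic vectors and activity names below are made up for illustration; the notebook itself uses 50 topics, pandas `groupby`, and scikit-learn's `cosine_similarity`.

```python
import math
from collections import defaultdict

# Hypothetical review-level topic vectors, keyed by activity
reviews = [
    ("Museum A", [0.8, 0.2, 0.0]),
    ("Museum A", [0.6, 0.4, 0.0]),
    ("Stadium B", [0.0, 0.1, 0.9]),
    ("Museum C", [0.7, 0.3, 0.0]),
]

# Group review vectors by activity, then average them element-wise
grouped = defaultdict(list)
for activity, vec in reviews:
    grouped[activity].append(vec)
activity_vec = {
    name: [sum(col) / len(vecs) for col in zip(*vecs)]
    for name, vecs in grouped.items()
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for other in ("Museum C", "Stadium B"):
    print(other, round(cosine(activity_vec["Museum A"], activity_vec[other]), 3))
```

In this toy data the two museums share nearly the same topic mix and score close to 1, while the stadium, dominated by a different topic, scores close to 0.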


### Process data to prepare for pairwise similarity

In [8]:
# Expand sparse matrix from model into full matrix for utilization in pairwise comparisons
length = len(lda_docs) 
doc_vectors = np.zeros((length, 50))  # Initialize numpy array
for i, doc2topics in enumerate(lda_docs):
    for topic, percentage in doc2topics:
        doc_vectors[i][topic] = percentage

all_reviews["lda_vector"] = list(doc_vectors)

In [9]:
# Average LDA topic distribution of each review by activity
activity_vectors = pd.DataFrame(all_reviews.groupby(["activity_name", "city"])
                                ["lda_vector"].apply(lambda x: np.mean(x, axis=0))).reset_index()


### Calculate cosine similarity

In [10]:
# Use LDA model results by activity for cosine similarity calculation
activity_vectors_lda = list(activity_vectors.lda_vector)
cosine_similarities_lda = cosine_similarity(activity_vectors_lda, activity_vectors_lda)
activity_vectors["cosine_similarities_lda"] = list(cosine_similarities_lda)
activity_vectors = activity_vectors.reset_index()

activity_vectors.head()

| | index | activity_name | city | lda_vector | cosine_similarities_lda |
|---|---|---|---|---|---|
| 0 | 0 | 31st Street Harbor | chicago | [0.00699710180582, 0.00736124401914, 0.0044676... | [1.0, 0.513042600253, 0.789935793263, 0.528108... |
| 1 | 1 | 333 West Wacker Drive | chicago | [0.045780379994, 0.00809322447906, 0.0, 0.0, 0... | [0.513042600253, 1.0, 0.544421766284, 0.349733... |
| 2 | 2 | 360 Chicago Observation Deck | chicago | [0.0283879799752, 0.00474233563058, 0.01093371... | [0.789935793263, 0.544421766284, 1.0, 0.629756... |
| 3 | 3 | 3d Gallery Budapest | budapest | [0.00412249661846, 0.00262300845502, 0.0103597... | [0.528108751076, 0.349733495365, 0.62975685397... |
| 4 | 4 | ACVB Visitor Center - Underground Atlanta | atlanta | [0.0, 0.0, 0.0119561989825, 0.00887876281606, ... | [0.432273473749, 0.34509709698, 0.453786622931... |


In [11]:
# Build one dict per activity mapping cosine similarity -> index of the compared activity,
# inserted in descending order of similarity (activities with identical scores share a key)
cosine_dicts_all = []
for idx, row in activity_vectors.iterrows():
    similar_items_dict = {}
    similar_indices = cosine_similarities_lda[idx].argsort()[::-1]
    for i in similar_indices:
        similar_items_dict[cosine_similarities_lda[idx][i]] = i
    cosine_dicts_all.append(similar_items_dict)
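
One caveat with keying a dict by similarity score is that two activities with exactly the same score collide, and only one of them survives. A minimal alternative sketch, not the structure used in the rest of this notebook, stores ranked `(index, similarity)` pairs instead. The 3x3 matrix and the helper name `ranked_neighbors` are made up for illustration:

```python
# Hypothetical similarity matrix; rows 1 and 2 are equally similar to row 0
sims = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]

def ranked_neighbors(matrix, idx):
    """Return (index, similarity) pairs for row idx, most similar first.

    The activity itself (j == idx) is left out of its own neighbor list.
    """
    pairs = [(j, s) for j, s in enumerate(matrix[idx]) if j != idx]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Score -> index mapping, as in the cell above
as_dict = {s: j for j, s in enumerate(sims[0])}
print(len(as_dict))               # fewer entries than columns: the tie collapsed
print(ranked_neighbors(sims, 0))  # both tied neighbors are retained
```

Exact ties are unlikely with averaged floating-point topic vectors, but they can occur, for example when two activities have identical review sets.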

[Back to Top](#index)
<a id="recommendation"></a>
## Step 4: Create Recommendation Engine

The recommendation engine takes as input a city and three activities the user enjoys in that city. It recommends the three most similar destination cities and, within each of them, the activities most similar to the user's selections.  
    
Please note: the code below is the core of the recommendation engine. The Flask app I built to expose it in a user-friendly format is not included in this notebook.
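
The city-ranking part of this logic can be sketched on its own: given the similarity scores collected for each candidate city, average them and keep the highest-scoring cities. The scores and the helper name `top_cities` below are hypothetical and only illustrate the idea; the full function follows.

```python
from operator import itemgetter

# Hypothetical similarity scores of the best-matching activities per city
by_city_sim = {
    "berlin": [0.91, 0.88, 0.86],
    "budapest": [0.90, 0.84, 0.80],
    "chicago": [0.95, 0.93, 0.89],
    "cusco": [0.70, 0.65, 0.60],
}

def top_cities(scores_by_city, n=3):
    """Rank cities by mean similarity and return the names of the n best."""
    averages = {city: sum(s) / len(s) for city, s in scores_by_city.items()}
    ranked = sorted(averages.items(), key=itemgetter(1), reverse=True)
    return [city for city, _ in ranked[:n]]

print(top_cities(by_city_sim))
```

Averaging rewards cities whose best matches are consistently similar to the user's picks, rather than cities with a single strong match.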

In [12]:
# Take user inputs and find the index for each activity to enable search in dictionary
def get_index(activity_names, city):
    act_indices=[]
    for activity_name in activity_names:
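        match = activity_vectors[(activity_vectors["activity_name"] == activity_name) &
                                 (activity_vectors["city"] == city)]
        # Take the first matching row explicitly; calling int() on a Series is deprecated in recent pandas
        act_index = int(match["index"].iloc[0])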
        act_indices.append(act_index)
    return act_indices

In [13]:
def get_recommendations(activity_names, city):
    
    # Get index of user selected activity names 
    act_indices = get_index(activity_names, city)
    
    # Collect, for each candidate city, up to 3 of the most similar activities per user-selected activity
    by_city_sim = {}
    by_city_idx = {}
    for act_index in act_indices:
        cosine_sims = cosine_dicts_all[act_index]
        count = Counter()
        for key, value in cosine_sims.items():
            cos_city = activity_vectors.iloc[value]["city"]
            if cos_city == city:
                pass
            elif cos_city not in by_city_sim:
                by_city_sim[cos_city] = [key]
                by_city_idx[cos_city] = [value]
                count[cos_city] += 1
            elif count[cos_city] < 3 and value not in by_city_idx[cos_city]:
                by_city_sim[cos_city].append(key)
                by_city_idx[cos_city].append(value)
                count[cos_city] += 1
                # Hard-coded early exit; never reached with the 5 cities loaded here (at most 4 x 3 = 12)
                if sum(count.values()) == 51:
                    break
            else: 
                pass
    
    # Average the similarity scores of the most similar activities for each city and select the three most similar cities
    avg_similar_acts = {}
    for key, value in by_city_sim.items():
        avg_similar_acts[key] = np.mean(value)
    sorted_avg_sims = sorted(avg_similar_acts.items(), key=itemgetter(1), reverse=True)
    top_3_cities = list(zip(*sorted_avg_sims))[0][:3]
    recommendations = {}
    for key, values in by_city_idx.items():
        if key in top_3_cities:
            activities = [activity_vectors.iloc[val]["activity_name"] for val in values]
            recommendations[key] = activities
    
    return recommendations

In [14]:
get_recommendations(["Little Five Points", "Fernbank Science Center", "Georgia Dome"], "atlanta")

{'berlin': ['Europa-Center',
  'Die Hackeschen Hoefe',
  'Alexanderplatz',
  'Science Center Spectrum',
  'LEGOLAND Discovery Centre',
  'Kindermuseum MachtMit',
  'Alte Forsterei',
  'Mercedes-Benz Arena Berlin',
  'Computerspielemuseum'],
 'budapest': ['Liszt Ferenc Square',
  'Originart Gallery',
  'WestEnd City Center',
  'Miniversum',
  'Palace of Wonders',
  'Semmelweis Museum of Medical History (Orvostorteneti Muzeum)',
  'Groupama Arena',
  'Budapest Pinball Museum',
  'Stade Puskas Ferenc'],
 'chicago': ['The Magnificent Mile',
  'Devon Avenue',
  "Bloomingdale's",
  "Chicago Children's Museum",
  'Adler Planetarium',
  'Museum of Science and Industry',
  'Guaranteed Rate Field',
  'United Center',
  'Wrigley Field']}