In [83]:
import sys
print(sys.version)

3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]


In [84]:
import numpy as np
import pandas as pd
import json

#### In this notebook, we process reviews of Acadia National Park. The reviews were scraped in the previous step, where we extracted the JSON containing the reviews from the HTML and stored the JSON strings in a CSV file. Here, we read that CSV file and continue with processing and analysis.

In [85]:
reviews_raw = pd.read_csv('../Data/Arcadia_Reviews.csv')

In [86]:
pd.set_option('max_colwidth', 200)
reviews_raw.head()

Unnamed: 0,Review_JSON
0,"{""totalCount"": 1757, ""preferredReviewIds"": [], ""reviews"": [{""id"": 774017634, ""url"": ""/ShowUserReviews-g60709-d12785836-r774017634-Acadia_National_Park-Bar_Harbor_Mount_Desert_Island_Maine.html"", ""..."
1,"{""totalCount"": 1757, ""preferredReviewIds"": [], ""reviews"": [{""id"": 773183548, ""url"": ""/ShowUserReviews-g60709-d12785836-r773183548-Acadia_National_Park-Bar_Harbor_Mount_Desert_Island_Maine.html"", ""..."
2,"{""totalCount"": 1757, ""preferredReviewIds"": [], ""reviews"": [{""id"": 772279667, ""url"": ""/ShowUserReviews-g60709-d12785836-r772279667-Acadia_National_Park-Bar_Harbor_Mount_Desert_Island_Maine.html"", ""..."
3,"{""totalCount"": 1757, ""preferredReviewIds"": [], ""reviews"": [{""id"": 771033153, ""url"": ""/ShowUserReviews-g60709-d12785836-r771033153-Acadia_National_Park-Bar_Harbor_Mount_Desert_Island_Maine.html"", ""..."
4,"{""totalCount"": 1757, ""preferredReviewIds"": [], ""reviews"": [{""id"": 770479455, ""url"": ""/ShowUserReviews-g60709-d12785836-r770479455-Acadia_National_Park-Bar_Harbor_Mount_Desert_Island_Maine.html"", ""..."


In [87]:
pd.reset_option('max_colwidth')

Each row in the DataFrame is one scraped page and may contain multiple reviews. We therefore first collect all the reviews from every page into a single list, then convert that list into a proper DataFrame with one row per review.

In [88]:
data = []
for row in reviews_raw.loc[:, 'Review_JSON']:
    reviews = json.loads(row)['reviews']
    
    for review in reviews:
        output = {}
        output['CreatedDate'] = review.get('createdDate')
        output['PublishedDate'] = review.get('publishedDate')
        output['Username'] = review['userProfile'].get('username')
        try:
            output['Hometown'] = review['userProfile']['hometown']['location']['additionalNames'].get('long')
        except (AttributeError, TypeError):
            output['Hometown'] = None
        output['NumUserGeneratedContent'] = review['userProfile']['contributionCounts'].get('sumAllUgc')
        output['NumHelpfulVote'] = review['userProfile']['contributionCounts'].get('helpfulVote')
        output['Review_Title'] = review.get('title')
        output['Review_Language'] = review.get('language')
        output['Review_TripDate'] = review['tripInfo'].get('stayDate')
        output['Review_TripType'] = review['tripInfo'].get('tripType')
        output['Review_Text'] = review.get('text')
        output['Review_Rating'] = review.get('rating')
        output['Review_HelpfulVotes'] = review['socialStatistics'].get('likeCount')
        
        data.append(output)
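
The loop above guards only the `Hometown` lookup with `try`/`except`; the other nested lookups (for example `review['userProfile'].get(...)`) would raise if an intermediate value were missing or `None`. A minimal sketch of a more defensive lookup follows. The helper name `dig` and the toy `review` dict are hypothetical and not part of the original notebook.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any level is missing or not a dict."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Toy review (hypothetical values) with a missing hometown location and no tripInfo at all.
review = {'userProfile': {'username': 'example_user', 'hometown': {'location': None}}}

username = dig(review, 'userProfile', 'username')
hometown = dig(review, 'userProfile', 'hometown', 'location', 'additionalNames', 'long')
trip_date = dig(review, 'tripInfo', 'stayDate')
print(username, hometown, trip_date)
```

With such a helper, every field could be extracted the same way, at the cost of silently turning structural surprises into `None`.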

In [89]:
#Example
data[0]

{'CreatedDate': '2020-10-12',
 'PublishedDate': '2020-10-12',
 'Username': 'mplegal',
 'Hometown': 'Orlando, Florida',
 'NumUserGeneratedContent': 227,
 'NumHelpfulVote': 46,
 'Review_Title': 'Nature Lovers’ Shanghai-la! ',
 'Review_Language': 'en',
 'Review_TripDate': '2020-10-31',
 'Review_TripType': 'COUPLES',
 'Review_Text': 'Visiting Acadia National Park specifically in the fall has been a bucket list destination for me for decades and I can say it’s the most beautiful place I’ve ever been!  We hit the big list destinations: Cadillac Mountain, Jordan’s Pond House, the Park Loop Road but we leave knowing there are so many other destinations within the park that we missed.  We barely scratched the surface, but that’s okay because it has inspired us to return.  The Rangers and staff were very friendly and accommodating, the park does a good job regarding social distancing, the bathrooms were clean and if you plan to travel to Cadillac Mountain or Sand Beach, Make sure you pre-registe

In [90]:
reviews = pd.DataFrame(data)
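
As an alternative to building each row by hand, `pd.json_normalize` can flatten nested review dicts into dotted column names. The sketch below uses toy pages whose keys mirror the ones used above (`reviews`, `userProfile`, `tripInfo`); the values are made up for illustration, and the real payload contains many more fields.

```python
import json
import pandas as pd

# Two toy "pages" of JSON strings, standing in for the Review_JSON column (values are invented).
pages = pd.Series([
    json.dumps({'reviews': [
        {'title': 'Great trip', 'rating': 5,
         'userProfile': {'username': 'user_a'}, 'tripInfo': {'stayDate': '2020-10-31'}},
        {'title': 'Nice views', 'rating': 4,
         'userProfile': {'username': 'user_b'}, 'tripInfo': {'stayDate': '2020-09-30'}},
    ]}),
    json.dumps({'reviews': [
        {'title': 'Great visit!', 'rating': 5,
         'userProfile': {'username': 'user_c'}, 'tripInfo': {'stayDate': '2020-09-30'}},
    ]}),
])

# Collect every review from every page, then flatten the nested dicts into columns.
records = [review for page in pages for review in json.loads(page)['reviews']]
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

The explicit loop in the notebook gives full control over column names and missing values; `json_normalize` is shorter but keeps every field it finds, so the result usually needs renaming and trimming afterwards.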

In [91]:
reviews.head()

Unnamed: 0,CreatedDate,PublishedDate,Username,Hometown,NumUserGeneratedContent,NumHelpfulVote,Review_Title,Review_Language,Review_TripDate,Review_TripType,Review_Text,Review_Rating,Review_HelpfulVotes
0,2020-10-12,2020-10-12,mplegal,"Orlando, Florida",227,46,Nature Lovers’ Shanghai-la!,en,2020-10-31,COUPLES,Visiting Acadia National Park specifically in ...,5,0
1,2020-10-11,2020-10-11,carolHjones,"Columbus, Georgia",31,21,Great trip,en,2020-10-31,FAMILY,Beautiful park. Definitely rent bikes in Bar H...,5,0
2,2020-10-11,2020-10-11,rydharter,"Austin, Texas",2,0,Amazing (Civid-Time) Road Trip,en,2020-10-31,NONE,Acadia was part of my first visit to New Engla...,5,0
3,2020-10-10,2020-10-10,JohnPatsi,"Tullahoma, Tennessee",559,94,Beautiful piece of God’s nature,en,2020-10-31,COUPLES,Beautiful piece of God’s nature situated along...,5,0
4,2020-10-06,2020-10-06,384katiec,"Indianapolis, Indiana",300,58,Great visit!,en,2020-09-30,NONE,"As the park ranger explained, this is the swis...",5,0


#### A careful reader might notice that Review_TripDate always falls on the last day of a month. The reason is that TripAdvisor does not store the day of the trip, so the day defaults to month end. We will leave it as is and deal with it later.
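
One possible way to handle this later, sketched on a few values copied from the preview above, is to parse the strings and convert them to monthly periods so the placeholder day is dropped. This is only an illustration, not necessarily the approach taken later in the project.

```python
import pandas as pd

# Sample Review_TripDate values taken from the preview above.
trip_dates = pd.Series(['2020-10-31', '2020-10-31', '2020-09-30'], name='Review_TripDate')

parsed = pd.to_datetime(trip_dates)

# Check the observation: every trip date is the last day of its month.
all_month_end = bool(parsed.dt.is_month_end.all())

# Keep only year and month by converting to monthly periods.
trip_months = parsed.dt.to_period('M')
print(all_month_end, trip_months.astype(str).tolist())
```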

In [92]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1757 entries, 0 to 1756
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   CreatedDate              1757 non-null   object
 1   PublishedDate            1757 non-null   object
 2   Username                 1757 non-null   object
 3   Hometown                 1329 non-null   object
 4   NumUserGeneratedContent  1757 non-null   int64 
 5   NumHelpfulVote           1757 non-null   int64 
 6   Review_Title             1757 non-null   object
 7   Review_Language          1757 non-null   object
 8   Review_TripDate          1757 non-null   object
 9   Review_TripType          1757 non-null   object
 10  Review_Text              1757 non-null   object
 11  Review_Rating            1757 non-null   int64 
 12  Review_HelpfulVotes      1757 non-null   int64 
dtypes: int64(4), object(9)
memory usage: 178.6+ KB


#### Now, let's start with some exploratory data visualizations
#### We can start with the distribution of PublishedDate to see when reviews were published

In [93]:
PublishedDate = reviews.loc[:, 'PublishedDate']
PublishedDate = PublishedDate.map(lambda x: x[:7])
pub_monthly_freq = PublishedDate.value_counts()
pub_monthly_freq.sort_index(inplace=True)
pub_monthly_freq = pd.DataFrame({'Date': pub_monthly_freq.index, 'NumOfReviews': pub_monthly_freq.values})

In [94]:
import plotly.express as px

fig = px.line(pub_monthly_freq, x = 'Date', y = 'NumOfReviews', hover_data = {'Date': '|%B, %Y'},
             title = 'Number of Reviews by Month')
fig.update_xaxes(dtick = 'M3', tickformat = '%b\n%Y')

fig.show()

#### From the chart above, we can see the following:
1. The reviews in our data date back only to August 2017; there are none earlier than that.
2. The number of reviews shows a seasonal pattern across the months of each year.
3. The seasonal peak in 2020 is weak, which is likely an effect of COVID-19. To confirm, we can plot the actual trip dates on the same chart.
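
One caveat when reading seasonality from these counts: `value_counts` only returns months that have at least one review, so a month with no reviews is absent rather than zero, and the line chart would simply connect its neighbours. A minimal sketch of filling such gaps, using made-up dates rather than the real data:

```python
import pandas as pd

# Toy publication dates (invented) with no review in 2017-09.
published = pd.Series(['2017-08-03', '2017-08-20', '2017-10-01'])

# Count reviews per month using monthly periods.
months = pd.to_datetime(published).dt.to_period('M')
counts = months.value_counts().sort_index()

# Reindex on every month between the first and last one, filling gaps with 0.
full_range = pd.period_range(counts.index.min(), counts.index.max(), freq='M')
counts = counts.reindex(full_range, fill_value=0)

# Convert back to the 'YYYY-MM' string layout used for plotting above.
monthly = pd.DataFrame({'Date': counts.index.astype(str), 'NumOfReviews': counts.values})
print(monthly)
```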

In [95]:
Review_TripDate = reviews.loc[:, 'Review_TripDate']
Review_TripDate = Review_TripDate.map(lambda x: x[:7])
rev_monthly_freq = Review_TripDate.value_counts()
rev_monthly_freq.sort_index(inplace=True)
rev_monthly_freq = pd.DataFrame({'Date': rev_monthly_freq.index, 'Review_TripDate': rev_monthly_freq.values})

pub_monthly_freq.rename(columns={'NumOfReviews': 'PublishedDate'}, inplace=True)

In [98]:
pubrev_monthly_freq = pd.merge(pub_monthly_freq, rev_monthly_freq, on = 'Date', how = 'outer')
pubrev_monthly_freq.sort_values(by = 'Date', inplace = True)
pubrev_monthly_freq.fillna(0.0, inplace = True)
pubrev_monthly_freq = pubrev_monthly_freq.astype({'PublishedDate': 'int32', 'Review_TripDate': 'int32'})
pubrev_monthly_freq.head()

Unnamed: 0,Date,PublishedDate,Review_TripDate
38,2017-03,0,2
39,2017-06,0,2
40,2017-07,0,12
0,2017-08,31,64
1,2017-09,77,98


In [101]:
fig = px.line(pubrev_monthly_freq, x = 'Date', y = ['PublishedDate', 'Review_TripDate'], hover_data = {'Date': '|%B, %Y'},
             title = 'Number of Reviews/Trips by Month', labels = {'value': 'Count'})
fig.update_xaxes(dtick = 'M3', tickformat = '%b\n%Y')

fig.show()

As expected, the review publishing dates closely resemble the actual trip dates. Still, the publishing dates generally lag a little behind the trip dates, which makes sense because tourists usually write reviews after their trips.
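
The lag can also be measured directly per review instead of being read off the chart. The sketch below computes the difference in whole months between publication and trip for two rows copied from the preview above; because the trip day is only a placeholder, the comparison is made at month resolution. The column name `LagMonths` is hypothetical.

```python
import pandas as pd

# Two rows copied from the reviews preview above.
sample = pd.DataFrame({
    'PublishedDate': ['2020-10-12', '2020-10-06'],
    'Review_TripDate': ['2020-10-31', '2020-09-30'],
})

pub = pd.to_datetime(sample['PublishedDate'])
trip = pd.to_datetime(sample['Review_TripDate'])

# Difference in calendar months, ignoring the day of month.
sample['LagMonths'] = (pub.dt.year - trip.dt.year) * 12 + (pub.dt.month - trip.dt.month)
print(sample['LagMonths'].tolist())
```

On the full `reviews` DataFrame, summarizing this column (for example with `value_counts()`) would show how many reviews are written in the same month as the trip versus later.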