# Conclusions
Our study explored user sentiment towards various locations by analyzing reviews on three popular review sites: TripAdvisor, Google Maps, and Yelp. We collected the data through a combination of web scraping and API calls, then used natural language processing (NLP) techniques to standardize the reviews and compute word frequencies.

We began by collecting reviews of different locations from the three review sites mentioned above. This allowed us to obtain a diverse dataset containing a wide range of reviews, including positive and negative ones. We then used NLP techniques to preprocess the data and extract useful information from the reviews.

One of the first things we observed during our analysis was that certain words were frequently used across all rating levels. These included words like "place," "great," "museum," and "San Francisco." This indicates that users tend to describe the general characteristics of a location in their reviews, regardless of whether they had a positive or negative experience.
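The observation above can be illustrated with a minimal sketch that counts the most common words per rating level and then intersects the lists. The stopword list, the function names, and the toy reviews are all assumptions made for this example; they are not the study's actual preprocessing or data.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list for illustration; the study's actual list is not given.
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "in", "for", "with"}


def tokenize(text):
    """Lowercase a review and split it into alphabetic tokens, dropping stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_words_by_rating(reviews, n=3):
    """Return the n most common tokens for each rating level."""
    counts = defaultdict(Counter)
    for rating, text in reviews:
        counts[rating].update(tokenize(text))
    return {rating: [w for w, _ in c.most_common(n)] for rating, c in counts.items()}


def shared_words(top_words):
    """Return the words that appear in the top list of every rating level."""
    lists = [set(words) for words in top_words.values()]
    return set.intersection(*lists) if lists else set()


# Toy reviews, invented for this example only.
sample = [
    (5, "Great place, the museum was great"),
    (5, "Great museum and a lovely place"),
    (1, "The place was crowded and the museum was closed"),
    (1, "Not a great museum or place for kids"),
]
top = top_words_by_rating(sample, n=3)
common = shared_words(top)
print(top)
print(common)
```

Words that survive the intersection describe the location itself rather than the quality of the visit, which is the pattern reported above.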

However, we also noticed that reviews with lower ratings tended to mention negative aspects related to visiting with children. This suggests that some locations may not be suitable for families with kids or may not offer sufficient facilities for them. This information can be valuable for businesses in the tourism industry to improve their services and cater to the needs of families.
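One way to check a pattern like this is to measure, for each rating, the share of reviews that contain a child-related term. The sketch below is only an illustration: the keyword list `CHILD_TERMS` and the toy reviews are hypothetical, and the terms the study actually examined are not specified.

```python
import re
from collections import defaultdict

# Hypothetical keyword list; the terms used in the study are not specified.
CHILD_TERMS = {"kid", "kids", "child", "children", "toddler", "stroller"}


def child_mention_share(reviews):
    """Return, per rating, the fraction of reviews containing a child-related term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rating, text in reviews:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        totals[rating] += 1
        if tokens & CHILD_TERMS:
            hits[rating] += 1
    return {rating: hits[rating] / totals[rating] for rating in totals}


# Toy reviews, invented for this example only.
sample = [
    (1, "Nothing for kids to do and no stroller access"),
    (1, "Overpriced tickets"),
    (5, "Beautiful views"),
    (5, "Great exhibits"),
]
shares = child_mention_share(sample)
print(shares)
```

A higher share at low ratings than at high ratings would be consistent with the observation, although a keyword match alone does not show that the mention is negative.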

To perform sentiment analysis on our dataset, we compared the performance of various classifiers, including NLTK's VADER and NaiveBayesClassifier, as well as scikit-learn's BernoulliNB, ComplementNB, MultinomialNB, DecisionTreeClassifier, LogisticRegression, and RandomForestClassifier. We evaluated the accuracy, compatibility with other classifiers, and time efficiency of each classifier to determine the best option for our dataset.

Our results showed that all scikit-learn classifiers performed similarly in terms of accuracy. However, DecisionTreeClassifier was the best option overall due to its compatibility with other classifiers and time efficiency. This classifier is easy to interpret and can be useful for predicting the sentiment of reviews in real-time applications.
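The comparison of the scikit-learn classifiers can be sketched roughly as follows. This is a simplified illustration under several assumptions: it uses plain word counts from `CountVectorizer` as features, a single train/test split, and a handful of invented toy reviews, none of which are confirmed by the study. NLTK's VADER and NaiveBayesClassifier are left out, and the helper name `compare_classifiers` is hypothetical.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(texts, labels, seed=0):
    """Fit each classifier on bag-of-words features; report accuracy and elapsed time."""
    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    # Assumption: plain word counts as features; the study's features are not specified.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_txt)
    X_test = vectorizer.transform(test_txt)

    models = {
        "BernoulliNB": BernoulliNB(),
        "ComplementNB": ComplementNB(),
        "MultinomialNB": MultinomialNB(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        elapsed = time.perf_counter() - start
        results[name] = {"accuracy": accuracy, "seconds": elapsed}
    return results


# Toy data invented for this example: 1 = positive, 0 = negative.
positive = ["great museum", "lovely place", "great view", "loved this place"] * 2
negative = ["terrible queue", "bad service", "awful crowded place", "not worth it"] * 2
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

results = compare_classifiers(texts, labels)
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.2f}, time={scores['seconds']:.4f}s")
```

On a dataset this small the accuracies and timings are not meaningful; the sketch only shows how accuracy and fit-plus-score time can be recorded side by side for each model.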

Overall, our study highlights the importance of analyzing reviews from multiple sources and using NLP techniques for sentiment analysis. These insights can help businesses in the tourism industry better understand their customers, make more informed decisions, and improve their services accordingly.

#### Appendix

################  Overall Statistics ####################### 
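import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# examine data (review_all is assumed to have been loaded earlier)
review_all.info()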

We first examined `review_all` and found null values in the *review* column and a *datecolumn* stored with the wrong data type. We therefore converted *datecolumn* to datetime, dropped the rows with an empty *review*, and removed duplicate rows.

# Convert the 'datecolumn' column to datetime format
review_all['datecolumn'] = pd.to_datetime(review_all['datecolumn'], format='%b, %Y')

# Drop rows with null values in the 'review' column
review_all = review_all.dropna(subset=['review'])

# Drop duplicate rows
review_all = review_all.drop_duplicates()
review_all.info()

We then counted the platforms, attractions, and reviews in the data.

# Count the platforms, attractions, and reviews
platforms = review_all['platform'].unique()
places = review_all['attraction'].unique()

fig = go.Figure()
fig.add_trace(go.Indicator(
    mode = "number",
    value = len(platforms),
    title = {'text': "Platforms",'font': {'color': 'black','size':20}},
    number={'font':{'color': 'black','size':50}},
    domain = {'row': 0, 'column': 0}
))
fig.add_trace(go.Indicator(
    mode = "number",
    value = len(places),
    title = {'text': "Attractions",'font': {'color': 'brown','size':20}},
    number={'font':{'color': 'brown','size':50}},
    domain = {'row': 0, 'column': 1}
))
fig.add_trace(go.Indicator(
    mode = "number",
    value = len(review_all['review']),
    title = {'text': "Reviews",'font': {'color': 'green','size':20}},
    number={'font':{'color': 'green','size':50}},
    domain = {'row': 0, 'column': 2}
))
fig.update_layout(
    grid = {'rows': 1, 'columns': 3, 'pattern': "independent"})
fig.show()

###################  Platform Wise Analysis ######################## 
We analyzed which platforms the reviews come from, then which attractions appear in our data and each attraction's share of the reviews on each platform.

# Create pie chart
fig1 = px.pie(review_all, names='platform')
fig1.update_layout(title='Share of Reviews by Platform')
fig1.show()

The chart shows that most of our reviews come from TripAdvisor and only a small percentage come from Yelp. This is because the Yelp API returns at most 3 reviews per location, while TripAdvisor and Google Maps imposed no such limit.

# Create a function to make a pie chart for a specific platform
def platform_piechart(platform, column):
    # Count how often each value of `column` occurs on this platform
    t_place_count = review_all.loc[review_all['platform'] == platform, column].value_counts().sort_values()
    # Pass the counts explicitly so the chart does not depend on the Series name
    fig = px.pie(values=t_place_count.values,
                 names=t_place_count.index,
                 title=f'{column}s from {platform}')
    fig.show()

# Draw one pie chart per platform
for platform in platforms:
    platform_piechart(platform, 'attraction')

The plots above show which places are included in the data. Most of the places obtained from Yelp are restaurants rather than attractions.

########################  Attraction Wise Analysis ########################  
Here we focus on the number of reviews we obtained for each location across platforms.

# create bar plot of reviews counts for different places
attraction_count = review_all['attraction'].value_counts()
fig = px.bar(attraction_count.head(10))
fig.update_xaxes(title='Place Name')
fig.update_yaxes(title='# of Reviews')
fig.update_layout(title='# of Reviews for Places (Top 10)')
fig.show()

########################   Date Wise Analysis ########################   
We first split *datecolumn* into two columns, month and year, and then analyzed how many reviews were written in the most recent year, from Mar 2022 to Feb 2023. We did not analyze every year because the COVID-19 pandemic fell within that period, so the earlier years may not represent the overall trend.

# Split datecolumn into month and year columns
review_all['month'] = review_all['datecolumn'].dt.month
review_all['year'] = review_all['datecolumn'].dt.year

# Create barplot for # reviews in a year
year_count = review_all['year'].value_counts()
fig = px.bar(year_count)
fig.update_xaxes(title='Year')
fig.update_yaxes(title='# of Reviews')
fig.update_layout(title='# of Reviews by Year')
fig.show()

# Create bar plot for reviews between mar 2022 to feb 2023
year2022 = review_all.loc[(review_all['datecolumn']>'2022-2-28') & (review_all['datecolumn']<'2023-3-1')]
month2022_count = year2022['month'].value_counts()
fig = px.bar(month2022_count)
fig.update_xaxes(title='Month')
fig.update_yaxes(title='# of Reviews')
fig.update_layout(title='# of Reviews by Month from Mar 2022 to Feb 2023')
fig.show()

The results show that many of the reviews are from 2019 and 2022. Because we wanted to focus on the past year only, we plotted the monthly review counts from Mar 2022 to Feb 2023. A larger share of the reviews was left between June and September 2022. One explanation might be that this period is the summer vacation, when many families travel with their children. Another explanation could be related to the decline in COVID-19 worldwide.
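Because the window from Mar 2022 to Feb 2023 crosses a year boundary, counting by month number alone places January and February 2023 ahead of March 2022 on the axis. A minimal sketch of one alternative is shown below: it counts by year-month period so that the bars stay in chronological order. The helper name `monthly_counts` and the toy dates are assumptions made for this example and are not part of the original analysis.

```python
import pandas as pd


def monthly_counts(dates, start, end):
    """Count entries per calendar month between start and end, in chronological order."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    s = pd.Series(pd.to_datetime(dates))
    in_window = s[(s >= start) & (s <= end)]
    # Group by year-month period rather than by month number alone
    counts = in_window.groupby(in_window.dt.to_period("M")).size()
    # Reindex so that months without any reviews appear as zero
    full_range = pd.period_range(start=start, end=end, freq="M")
    return counts.reindex(full_range, fill_value=0)


# Toy dates, invented for this example only.
toy_dates = ["2022-03-01", "2022-07-01", "2022-07-01", "2023-02-01", "2021-12-01"]
counts = monthly_counts(toy_dates, "2022-03-01", "2023-02-28")
print(counts)
```

The resulting Series could be plotted with `px.bar(x=counts.index.astype(str), y=counts.values)` to obtain one bar per month from March 2022 through February 2023.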