In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll be exploring European restaurant reviews. The dataset is rather tiny, but that's just because it has to run on any machine. In real life, just like image datasets, text corpora can run to several terabytes.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset and get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

In [None]:
restaursnts = pd.read_csv('./data/European Restaurant Reviews.csv')

# Convert column names to lowercase with underscores for easier processing and attribute access
restaursnts.columns = restaursnts.columns.str.lower().str.replace(' ', '_')

# France is the most represented country
restaursnts.country.value_counts()

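# review_date holds strings such as "May 2024", so min()/max() on the raw column compare alphabetically.
# Parse to datetimes first (assuming the "Mon YYYY" format; non-matching entries become NaT).
review_dates = pd.to_datetime(restaursnts.review_date, format='%b %Y', errors='coerce')
timeframe = review_dates.min(), review_dates.max()
timeframe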

# We do not have an equal number of reviews for each restaurant, which is expected
grouped = restaursnts.groupby('restaurant_name').size()
grouped

# The dataset is not balanced in terms of sentiment, we have 82% positive and 18% negative reviews
restaursnts.sentiment.value_counts(normalize=True) * 100

# We don't have missing values, so let's split the review date into month and year
restaursnts[restaursnts.review_date.isna()]
restaursnts[~restaursnts.review_date.str.contains(' ', na=False)]

# I will also split the review_date for easier analysis
split_cols = restaursnts.review_date.str.split(' ', expand=True)
restaursnts['review_month'] = split_cols[0]
restaursnts['review_year'] = split_cols[1].astype(int)

restaursnts.drop(columns=['review_date'], inplace=True)

restaursnts

Unnamed: 0,country,restaurant_name,sentiment,review_title,review,review_month,review_year
0,France,The Frog at Bercy Village,Negative,Rude manager,The manager became agressive when I said the c...,May,2024
1,France,The Frog at Bercy Village,Negative,A big disappointment,"I ordered a beef fillet ask to be done medium,...",Feb,2024
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,"This is an attractive venue with welcoming, al...",Nov,2023
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Sadly I used the high TripAdvisor rating too ...,Mar,2023
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,From the start this meal was bad- especially g...,Nov,2022
...,...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Despite the other reviews saying that this is ...,Oct,2016
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,beer is good. food is awfull The only decent...,Oct,2016
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,"for terrible service of a truly comedic level,...",Oct,2016
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,We visited the Havana's Club Museum which is l...,Oct,2016


### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for the "length" of a text; it's not necessarily just the character count. Can you explain your findings?

In [64]:
# First get the negative reviews
negative_reviews = restaursnts[restaursnts.sentiment == 'Negative']
negative_reviews_count = len(negative_reviews)
negative_reviews_words_count = negative_reviews.review.str.split().str.len().sum()
average_negative_review_length = negative_reviews_words_count / negative_reviews_count
average_negative_review_length

# Actually, we can do it a lot faster: describe() gives the mean along with the rest of the distribution
negative_reviews.review.str.split().str.len().describe()
# Average negative review's length is 140 words

# Now the positive reviews...
positive_reviews = restaursnts[restaursnts.sentiment == 'Positive']
# Average positive review is 50 words
positive_reviews.review.str.split().str.len().describe()

# Negative reviews are on average about 2.8 times longer than positive ones (140 / 50 words).
# A plausible explanation is that unhappy customers describe in detail what went wrong,
# while satisfied customers keep it short.
# For a robust length metric I count words rather than characters, since in a review words are what matter the most.
# The median (the 50% row) is worth reading next to the mean, because a few very long reviews pull the mean upward.
# The word counts are already computed above, so there isn't much more to do here.

restaursnts.review.str.split().str.len().describe()

count    1502.000000
mean       66.131158
std        74.008747
min         2.000000
25%        26.000000
50%        42.500000
75%        74.000000
max       646.000000
Name: review, dtype: float64
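
One way to make the length comparison more robust is to report the median word count per sentiment next to the mean, since the output above shows a maximum of 646 words against a median of 42.5. The sketch below uses a tiny made-up frame, `toy_reviews`, purely for illustration; on the real data the same `groupby` could be applied to the reviews DataFrame.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, for illustration only)
toy_reviews = pd.DataFrame({
    "sentiment": ["Positive", "Positive", "Positive", "Negative", "Negative"],
    "review": [
        "Great food",
        "Lovely place and friendly staff",
        "We will come back",
        "The waiter ignored us for an hour and the soup was cold",
        "Overpriced and bland",
    ],
})

# Word count per review: split on whitespace and count the tokens
toy_reviews["word_count"] = toy_reviews["review"].str.split().str.len()

# Median and mean word count per sentiment; the median resists outliers
length_by_sentiment = toy_reviews.groupby("sentiment")["word_count"].agg(["median", "mean"])
print(length_by_sentiment)
```

A large gap between the two columns for a sentiment would suggest that a handful of long reviews dominates its mean.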

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).
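
A minimal preprocessing sketch for this kind of task might lowercase the text, tokenize it, and drop stop words while keeping negations, because "not good" and "good" carry opposite sentiment. The names `STOP_WORDS`, `NEGATIONS`, and `preprocess` are hypothetical, and the short stop-word list is a hand-picked assumption rather than a standard one.

```python
import re

# Small hand-picked stop-word list (an assumption for this sketch, not a standard list)
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "i", "we"}
# Negations carry sentiment ("not good"), so they are never removed
NEGATIONS = {"not", "no", "never"}

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letters/apostrophes, and drop stop words except negations."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in NEGATIONS or t not in STOP_WORDS]

print(preprocess("The food was NOT good, and the service was slow!"))
```

Whether to also stem or lemmatize is a judgment call; for restaurant reviews it may be worth checking first whether forms like "delicious" and "deliciously" actually need merging.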

### Problem 4. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).
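
One possible definition of "top" is document frequency: the number of reviews a word appears in, so that a single long review repeating a word cannot dominate the ranking. The sketch below illustrates this with a hypothetical helper, `top_words`, on made-up reviews; it is one option among several, not the required approach.

```python
from collections import Counter
import re

def top_words(reviews, n=10, stop_words=frozenset()):
    """Return the n words that appear in the most reviews (document frequency)."""
    counts = Counter()
    for review in reviews:
        # set() so each word counts at most once per review
        words = set(re.findall(r"[a-z']+", review.lower()))
        counts.update(words - stop_words)
    return counts.most_common(n)

# Made-up positive reviews, for illustration only
toy_positive = ["Great food, great wine", "Great service", "Lovely food"]
print(top_words(toy_positive, n=2))
```

Running the same function on positive and negative reviews separately, and then comparing the two lists, shows which words are shared (and therefore uninformative) and which are specific to one sentiment.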

### Problem 5. Review titles (2 points)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?
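
The question of whether titles repeat the top words of their reviews can be turned into a number. The sketch below, using hypothetical helpers `tokenize` and `title_overlap` and made-up data, computes the fraction of reviews containing at least one top word whose title repeats one of those same words.

```python
import re

def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z']+", text.lower()))

def title_overlap(titles, reviews, top_words):
    """Fraction of reviews containing a top word whose title repeats at least one of those words."""
    top = set(top_words)
    hits = total = 0
    for title, review in zip(titles, reviews):
        in_review = tokenize(review) & top
        if not in_review:
            continue  # review has no top word, so it is not part of the comparison
        total += 1
        if in_review & tokenize(title):
            hits += 1
    return hits / total if total else 0.0

# Made-up titles and review bodies, for illustration only
toy_titles = ["Great dinner", "Nice evening", "Tasty"]
toy_bodies = ["The food was great", "Great wine and food", "Quick lunch"]
print(title_overlap(toy_titles, toy_bodies, ["great", "food"]))
```

A low fraction would hint that titles are written to catch attention rather than to summarize the review.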

### Problem 6. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).

### Problem 7. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Measure the accuracy of the model compared to the `sentiment` column in the dataset.
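
Whichever model you pick, the accuracy measurement itself is small. The sketch below assumes the model's predictions have already been collected into a list of label strings; `toy_predictions` is made up, and the upper-case labels are an assumption about the model's output format. It also computes a majority-class baseline, which is worth comparing against when the classes are imbalanced.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the reference labels (case-insensitive)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p.lower() == a.lower() for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical model output (assumed upper-case labels) and reference labels
toy_predictions = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
toy_sentiment = ["Positive", "Negative", "Negative", "Positive"]

model_accuracy = accuracy(toy_predictions, toy_sentiment)
# Majority-class baseline: always predict "Positive"
baseline_accuracy = accuracy(["Positive"] * len(toy_sentiment), toy_sentiment)

print(model_accuracy, baseline_accuracy)
```

A model is only informative if it clearly beats the baseline, so reporting both numbers side by side gives a fairer picture than accuracy alone.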

### Problem 8. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar with how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?
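
Once each review has an embedding vector, pairwise cosine similarity reduces to a matrix product of the row-normalized embeddings. The sketch below shows this with NumPy on made-up 3-dimensional vectors; `cosine_similarity_matrix` is a hypothetical helper, and real embeddings would have many more dimensions.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Made-up "embeddings" for three reviews (illustration only)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as the first row, different length
    [0.0, 3.0, 0.0],   # orthogonal to the first two rows
])

similarity = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarity, 2))
```

Averaging the similarity within each restaurant and comparing it with the average across restaurants is one way to check whether customers of the same restaurant tend to agree.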

### \* Problem 9. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.