In [234]:
%matplotlib inline

In [235]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll explore European restaurant reviews. The dataset is rather small, but that's only so that it can run on any machine. In real life, just like image datasets, text corpora can reach several terabytes in size.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset and get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

In [236]:
european_restaurant_reviews_data = pd.read_csv("data/European Restaurant Reviews.csv")
european_restaurant_reviews_data.index = european_restaurant_reviews_data.index + 1
european_restaurant_reviews_data

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review
1,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
2,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
3,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."
4,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...
5,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...
...,...,...,...,...,...,...
1498,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...
1499,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...
1500,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,..."
1501,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...


In [237]:
len(european_restaurant_reviews_data)

1502

In [238]:
european_restaurant_reviews_data.columns

Index(['Country', 'Restaurant Name', 'Sentiment', 'Review Title',
       'Review Date', 'Review'],
      dtype='object')

In [239]:
# Normalize column names to snake_case: trim, lowercase, spaces to underscores.
european_restaurant_reviews_data.columns = european_restaurant_reviews_data.columns.str.strip().str.lower().str.replace(" ", "_")
european_restaurant_reviews_data.columns

Index(['country', 'restaurant_name', 'sentiment', 'review_title',
       'review_date', 'review'],
      dtype='object')

In [240]:
european_restaurant_reviews_data.head()

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review
1,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
2,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
3,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."
4,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...
5,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...


In [241]:
# Count reviews per country once, then pick the most frequent one.
country_counts = european_restaurant_reviews_data["country"].value_counts()
most_common_country = country_counts.idxmax()
print(f"Most represented country: {most_common_country}")

Most represented country: France


In [242]:
european_restaurant_reviews_data.describe(include='all')

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review
count,1502,1502,1502,1502,1502,1502
unique,7,7,2,1343,143,1426
top,France,The Frog at Bercy Village,Positive,Amazing,May 2014 •,I actually never write reviews for the restaur...
freq,512,512,1237,9,108,4


In [243]:
european_restaurant_reviews_data['restaurant_name'].value_counts()

restaurant_name
The Frog at Bercy Village                512
Ad Hoc Ristorante (Piazza del Popolo)    318
The LOFT                                 210
Old Square (Plaza Vieja)                 146
Stara Kamienica                          135
Pelmenya                                 100
Mosaic                                    81
Name: count, dtype: int64

In [244]:
european_restaurant_reviews_data.columns

Index(['country', 'restaurant_name', 'sentiment', 'review_title',
       'review_date', 'review'],
      dtype='object')

In [245]:
european_restaurant_reviews_data['review_date'] = european_restaurant_reviews_data['review_date'].str.extract(r'([A-Za-z]+\s+\d{4})')[0]

In [246]:
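# "%B" matches full month names only, so abbreviated values such as "Feb 2024"
# are coerced to NaT; only "May" (spelled the same in both forms) parses.
european_restaurant_reviews_data["review_date"] = pd.to_datetime(
    european_restaurant_reviews_data["review_date"], format="%B %Y", errors="coerce")
# strftime turns the column back into strings, so min()/max() below compare text.
european_restaurant_reviews_data["review_date"] = european_restaurant_reviews_data["review_date"].dt.strftime("%d-%m-%Y")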

In [247]:
european_restaurant_reviews_data = european_restaurant_reviews_data.dropna(subset=['review_date'])

In [248]:
european_restaurant_reviews_data['review_date']

1       01-05-2024
10      01-05-2019
16      01-05-2018
17      01-05-2018
19      01-05-2017
           ...    
1462    01-05-2014
1463    01-05-2014
1464    01-05-2014
1465    01-05-2014
1466    01-05-2014
Name: review_date, Length: 248, dtype: object

In [251]:
reviews_data_min = european_restaurant_reviews_data["review_date"].min()
reviews_data_min

'01-05-2012'

In [252]:
reviews_data_max = european_restaurant_reviews_data["review_date"].max()
reviews_data_max

'01-05-2024'
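
The raw dates use abbreviated month names (`May 2024`, `Feb 2024`, `Nov 2023`), which correspond to the `%b` directive rather than `%B`. The following is a minimal sketch, assuming a small hypothetical sample in the same `Mon YYYY •` shape and an English locale. It keeps the values as datetimes so that `min()` and `max()` compare chronologically rather than as strings.

```python
import pandas as pd

# Hypothetical sample in the same "Mon YYYY •" shape as the review_date column.
sample_dates = pd.Series(["May 2024 •", "Feb 2024 •", "Nov 2023 •", "Oct 2016 •"])

# Keep only the "Mon YYYY" part, dropping the trailing bullet.
month_year = sample_dates.str.extract(r"([A-Za-z]+\s+\d{4})", expand=False)

# %b matches abbreviated month names ("Feb"); %B expects full names ("February").
parsed_dates = pd.to_datetime(month_year, format="%b %Y", errors="coerce")

# Comparing datetimes (not formatted strings) gives a chronological range.
date_range = (parsed_dates.min(), parsed_dates.max())
unparsed_count = int(parsed_dates.isna().sum())
print(date_range, unparsed_count)
```

If `unparsed_count` is non-zero on the real data, some month spellings would need to be normalized before parsing, and it is worth checking how many rows `dropna` would discard.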

#### No, it's not balanced. It's heavily skewed toward positive reviews.

In [None]:

european_restaurant_reviews_data["sentiment"].value_counts()

sentiment
Positive    222
Negative     26
Name: count, dtype: int64

### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for the "length" of a text; it's not necessarily just the character count. Can you explain your findings?
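
One possible length metric is the number of whitespace-separated words, summarized per sentiment with the median. This is only a sketch on a hypothetical mini-sample (the `sample_reviews` frame is made up); other metrics such as sentence count or token count after preprocessing may be equally defensible.

```python
import pandas as pd

# Hypothetical mini-sample; the real data has `sentiment` and `review` columns.
sample_reviews = pd.DataFrame({
    "sentiment": ["Positive", "Positive", "Negative", "Negative"],
    "review": [
        "Great food and lovely staff.",
        "Amazing!",
        "The food was cold, the waiter was rude and we waited an hour.",
        "Never again. Overpriced and bland.",
    ],
})

# Word count: number of whitespace-separated tokens in each review.
sample_reviews["word_count"] = sample_reviews["review"].str.split().str.len()

# The median is less affected by a few very long reviews than the mean.
median_length_by_sentiment = sample_reviews.groupby("sentiment")["word_count"].median()
print(median_length_by_sentiment)
```

Because the classes are imbalanced, comparing medians (or full distributions) per class is safer than comparing totals.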

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).
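
As a starting point, preprocessing could look like the sketch below. The stop-word list is hypothetical and deliberately tiny; negations such as "not" are kept on purpose, since removing them would flip the apparent sentiment of phrases like "not good".

```python
import re

# Hypothetical, deliberately small stop-word list. Negations ("not", "no", "never")
# are left out of it on purpose, because they carry sentiment.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "i", "we"}

def preprocess(text):
    """Lowercase, tokenize on letters and apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

tokens = preprocess("The food was NOT good, and the service wasn't great.")
print(tokens)
```

Whether to also stem, lemmatize, or drop domain words such as "restaurant" is a design choice worth documenting alongside the code.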

### Problem 4. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).
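
A simple way to count top words is `collections.Counter` over tokenized text. The sketch below uses hypothetical reviews and raw frequency as the definition of "top"; raw counts tend to favor words common to both classes, so comparing relative frequencies between sentiments is one alternative.

```python
import re
from collections import Counter

def top_words(texts, n=10):
    """Return the n most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical reviews grouped by sentiment.
positive_reviews = ["Great food, great wine", "Lovely staff and great food"]
negative_reviews = ["Cold food and rude staff", "Rude manager, cold soup"]

top_positive = top_words(positive_reviews, n=2)
top_negative = top_words(negative_reviews, n=2)
print(top_positive, top_negative)
```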

### Problem 5. Review titles (2 points)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?
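
One rough way to quantify the overlap is the share of top words that also occur in a title. The function name and the sample words below are hypothetical; this is a sketch of the idea rather than a definitive measure.

```python
import re

def share_of_top_words_in_title(title, top_words):
    """Fraction of the given top words that occur in the title."""
    top_word_set = set(top_words)
    if not top_word_set:
        return 0.0
    title_tokens = set(re.findall(r"[a-z']+", title.lower()))
    return len(title_tokens & top_word_set) / len(top_word_set)

# Hypothetical top words for negative reviews and a hypothetical title.
top_negative_words = ["rude", "cold", "bland", "worst"]
title_share = share_of_top_words_in_title("Rude manager and bland food", top_negative_words)
print(title_share)
```

Averaging this share over all reviews of one sentiment gives a single number to compare against the same measure computed on the review bodies.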

### Problem 6. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).
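
The sketch below shows one possible pair of `CountVectorizer` configurations and a weighted combination of the two feature matrices. The texts are hypothetical, and `TITLE_WEIGHT` is an assumed, untuned value. Note that scikit-learn's built-in `stop_words="english"` list includes negations such as "not", which is why it is not used here.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles and review bodies.
titles = ["Rude manager", "Amazing dinner", "Bland food"]
reviews = [
    "The manager was rude and the food was cold.",
    "Amazing food and very friendly staff.",
    "The food was bland and not worth the price.",
]

# Titles are short, so presence/absence (binary=True) may be enough.
title_vectorizer = CountVectorizer(lowercase=True, binary=True)
# Bigrams keep phrases such as "not worth" that unigrams would split.
review_vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1)

title_features = title_vectorizer.fit_transform(titles)
review_features = review_vectorizer.fit_transform(reviews)

# TITLE_WEIGHT is an assumed value, not a tuned one.
TITLE_WEIGHT = 2.0
combined_features = hstack([title_features * TITLE_WEIGHT, review_features]).tocsr()
print(combined_features.shape)
```

Keeping two vectorizers means the title and body vocabularies stay separate, so the same word can carry a different weight depending on where it appears.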

### Problem 7. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Measure the accuracy of the model compared to the `sentiment` column in the dataset.
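
The evaluation step can be sketched independently of the chosen model. The helper below assumes the classifier behaves like a Hugging Face text-classification pipeline (a list of strings in, one dict with a `"label"` key per text out) and that its labels are spelled like `"POSITIVE"`/`"NEGATIVE"`; both are assumptions that depend on the model. A trivial stand-in classifier is used so the sketch runs without downloading anything.

```python
def evaluate_classifier(classifier, texts, true_labels):
    """Accuracy of a text classifier against the dataset's sentiment labels.

    `classifier` is assumed to behave like a Hugging Face text-classification
    pipeline: called with a list of strings, it returns one dict per text
    with a "label" key such as "POSITIVE" or "NEGATIVE".
    """
    predictions = classifier(list(texts))
    # Map "POSITIVE"/"NEGATIVE" to the dataset's "Positive"/"Negative" spelling.
    predicted_labels = [prediction["label"].capitalize() for prediction in predictions]
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Stand-in classifier so the sketch runs without downloading a model.
# With transformers installed, one option would be:
#   from transformers import pipeline
#   classifier = pipeline("sentiment-analysis")
def keyword_classifier(texts):
    return [{"label": "NEGATIVE" if "rude" in text.lower() else "POSITIVE"} for text in texts]

sample_texts = ["Rude manager. Never again.", "Amazing food!", "Bland and cold."]
sample_labels = ["Negative", "Positive", "Negative"]
accuracy = evaluate_classifier(keyword_classifier, sample_texts, sample_labels)
print(accuracy)
```

On an imbalanced sample, always predicting the majority class already scores well, so it helps to report that baseline (or per-class recall) next to the accuracy. Long reviews may also need truncating to the model's maximum input length.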

### Problem 8. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar with how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?
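
Cosine similarity between two embeddings $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal NumPy sketch follows, using hypothetical 3-dimensional vectors in place of real model embeddings:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (n_texts, dim) array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms to avoid division by zero for all-zero rows.
    unit_vectors = embeddings / np.clip(norms, 1e-12, None)
    return unit_vectors @ unit_vectors.T

# Hypothetical 3-dimensional embeddings for three reviews.
review_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
])
similarities = cosine_similarity_matrix(review_embeddings)
print(np.round(similarities, 2))
```

Averaging the similarities within each restaurant and across restaurants is one way to check whether customers of the same restaurant tend to agree.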

### \* Problem 9. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.