In [1]:
%matplotlib inline

In [103]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import string
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer


# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll be looking at and exploring European restaurant reviews. The dataset is rather tiny, but that's just because it has to run on any machine. In real life, just like with images, texts can be several terabytes long.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset, get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

In [106]:
restaurants_data = pd.read_csv("data/European Restaurant Reviews.csv")

In [41]:
restaurants_data.head(3)

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."


In [5]:
restaurants_data.shape

(1502, 6)

In [6]:
print(f"There are {restaurants_data.shape[0]} observations.")

There are 1502 observations.


In [7]:
restaurants_data.dtypes

Country            object
Restaurant Name    object
Sentiment          object
Review Title       object
Review Date        object
Review             object
dtype: object

In [8]:
restaurants_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1502 entries, 0 to 1501
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Country          1502 non-null   object
 1   Restaurant Name  1502 non-null   object
 2   Sentiment        1502 non-null   object
 3   Review Title     1502 non-null   object
 4   Review Date      1502 non-null   object
 5   Review           1502 non-null   object
dtypes: object(6)
memory usage: 70.5+ KB


In [107]:
# make column names snake_case for easier work with the data:
restaurants_data.columns = restaurants_data.columns.str.replace(" ", "_").str.lower()
restaurants_data.columns

Index(['country', 'restaurant_name', 'sentiment', 'review_title',
       'review_date', 'review'],
      dtype='object')

In [10]:
restaurants_data.review_date

0       May 2024 •
1       Feb 2024 •
2       Nov 2023 •
3       Mar 2023 •
4       Nov 2022 •
           ...    
1497    Oct 2016 •
1498    Oct 2016 •
1499    Oct 2016 •
1500    Oct 2016 •
1501    Oct 2016 •
Name: review_date, Length: 1502, dtype: object

In [108]:
# convert review date column to datetime type:
restaurants_data.review_date = restaurants_data.review_date.str.replace(" •", "", regex = False)
restaurants_data.review_date = pd.to_datetime(restaurants_data.review_date, format = "%b %Y", errors = "coerce")
restaurants_data.review_date

0      2024-05-01
1      2024-02-01
2      2023-11-01
3      2023-03-01
4      2022-11-01
          ...    
1497   2016-10-01
1498   2016-10-01
1499   2016-10-01
1500   2016-10-01
1501   2016-10-01
Name: review_date, Length: 1502, dtype: datetime64[ns]
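Because `errors = "coerce"` silently turns every unparseable value into `NaT`, it is worth checking how many dates were lost in the conversion. The snippet below is a minimal sketch of such a check on a small, made-up series (the values in `raw_dates` are toy data, not rows from the dataset):

```python
import pandas as pd

# Toy data: two well-formed dates and one value that does not match "%b %Y".
raw_dates = pd.Series(["May 2024 •", "Feb 2024 •", "sometime in 2019"])

cleaned = raw_dates.str.replace(" •", "", regex=False)
parsed = pd.to_datetime(cleaned, format="%b %Y", errors="coerce")

# Rows that became NaT are the ones that failed to parse.
unparsed = raw_dates[parsed.isna()]
print(f"{len(unparsed)} value(s) could not be parsed: {unparsed.tolist()}")
```

On the real column, the same `isna()` check shows whether any review dates were dropped before the time range is computed.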

In [44]:
most_repr_country = restaurants_data.country.value_counts().idxmax()
country_count = restaurants_data.country.value_counts().max()
print(f"The most represented country is {most_repr_country}. It occurs {country_count} times.")

The most represented country is France. It occurs 512 times.


In [45]:
max_date = restaurants_data.review_date.max().strftime("%B %Y")

In [46]:
min_date = restaurants_data.review_date.min().strftime("%B %Y")

In [47]:
print(f"Time range of dataset: from {min_date} to {max_date}.")

Time range of dataset: from September 2010 to July 2024.


In [48]:
rest_review_counts = restaurants_data.restaurant_name.value_counts()
rest_review_counts

restaurant_name
The Frog at Bercy Village                512
Ad Hoc Ristorante (Piazza del Popolo)    318
The LOFT                                 210
Old Square (Plaza Vieja)                 146
Stara Kamienica                          135
Pelmenya                                 100
Mosaic                                    81
Name: count, dtype: int64

In [49]:
# check if all review counts are the same number:
is_balanced = rest_review_counts.nunique() == 1
print(f"Is the sample balanced in terms of restaurants - {is_balanced}.")

Is the sample balanced in terms of restaurants - False.


In [50]:
rest_sentiment_counts = restaurants_data.sentiment.value_counts()
rest_sentiment_counts

sentiment
Positive    1237
Negative     265
Name: count, dtype: int64

In [51]:
# check if all review counts are the same number:
is_balanced_sent = rest_sentiment_counts.nunique() == 1
print(f"Is the sample balanced in terms of sentiment - {is_balanced_sent}.")

Is the sample balanced in terms of sentiment - False.
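A plain True/False answer hides how strong the imbalance is. As a rough sketch, the class proportions and the majority-to-minority ratio can be computed from the counts printed above (the series is rebuilt by hand here so that the snippet runs on its own):

```python
import pandas as pd

# Sentiment counts copied from the value_counts() output above.
sentiment_counts = pd.Series({"Positive": 1237, "Negative": 265})

proportions = sentiment_counts / sentiment_counts.sum()
imbalance_ratio = sentiment_counts.max() / sentiment_counts.min()

print(proportions.round(3))
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

The proportion of the majority class is also the accuracy of a trivial model that always predicts "Positive", which is a useful baseline for any later classifier.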


### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for "length" of a text; it's not necessarily just the character count. Can you explain your findings?

In [20]:
restaurants_data.review

0       The manager became agressive when I said the c...
1       I ordered a beef fillet ask to be done medium,...
2       This is an attractive venue with welcoming, al...
3       Sadly I  used the high TripAdvisor rating too ...
4       From the start this meal was bad- especially g...
                              ...                        
1497    Despite the other reviews saying that this is ...
1498    beer is good.  food is awfull  The only decent...
1499    for terrible service of a truly comedic level,...
1500    We visited the Havana's Club Museum which is l...
1501    Food and service was awful. Very pretty stop. ...
Name: review, Length: 1502, dtype: object

In [109]:
# make new column 'review_words' with the words from each review:
restaurants_data["review_words"] = restaurants_data.review.str.split(r"\s+") # split on one or more whitespace characters
restaurants_data["review_words"] = restaurants_data.review_words.apply(lambda review_words: [w.lower() for w in review_words]) # make every word lowercase

In [110]:
# make new column 'review_words_count' with number of words in each review:
restaurants_data["review_words_count"] = restaurants_data.review_words.apply(len)

In [23]:
restaurants_data.head()

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,review_words,review_words_count
0,France,The Frog at Bercy Village,Negative,Rude manager,2024-05-01,The manager became agressive when I said the c...,"[the, manager, became, agressive, when, i, sai...",28
1,France,The Frog at Bercy Village,Negative,A big disappointment,2024-02-01,"I ordered a beef fillet ask to be done medium,...","[i, ordered, a, beef, fillet, ask, to, be, don...",58
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,2023-11-01,"This is an attractive venue with welcoming, al...","[this, is, an, attractive, venue, with, welcom...",40
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,2023-03-01,Sadly I used the high TripAdvisor rating too ...,"[sadly, i, used, the, high, tripadvisor, ratin...",279
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,2022-11-01,From the start this meal was bad- especially g...,"[from, the, start, this, meal, was, bad-, espe...",243


Calculate the mean, median and std of the number of words in the positive and negative reviews:

In [111]:
positive_reviews_mean = restaurants_data.review_words_count[restaurants_data.sentiment == "Positive"].mean()
negative_reviews_mean = restaurants_data.review_words_count[restaurants_data.sentiment == "Negative"].mean()

positive_reviews_median = restaurants_data.review_words_count[restaurants_data.sentiment == "Positive"].median()
negative_reviews_median = restaurants_data.review_words_count[restaurants_data.sentiment == "Negative"].median()

positive_reviews_std = restaurants_data.review_words_count[restaurants_data.sentiment == "Positive"].std()
negative_reviews_std = restaurants_data.review_words_count[restaurants_data.sentiment == "Negative"].std()

In [55]:
print(f"Information about number of words in Negative reviews: " 
    f"mean = {negative_reviews_mean}, median = {negative_reviews_median}, std = {negative_reviews_std}.")

print(f"Information about number of words in Positive reviews: " 
    f"mean = {positive_reviews_mean}, median = {positive_reviews_median}, std = {positive_reviews_std}.")

Information about number of words in Negative reviews: mean = 140.57358490566037, median = 95.0, std = 131.7596355957585.
Information about number of words in Positive reviews: mean = 50.18350848827809, median = 37.0, std = 38.7410428479818.


**Mean and median**: \
Positive reviews are typically shorter (median of 37 words versus 95 for negative reviews), perhaps because the reviewers' expectations were met. They focus on the key positive aspects without needing detailed explanations. \
Negative reviews, on the other hand, are often longer, because people tend to give more detailed feedback when they are dissatisfied. They may elaborate on what went wrong, give specific examples, and express their frustration.

**Standard deviation**: \
The standard deviation for negative reviews (about 132 words) shows high variability in length: some reviews are very long, while others are still very short.
The standard deviation for positive reviews is lower (about 39 words), so positive reviews tend to be more uniformly brief.
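The same three statistics can be obtained in one step with `groupby`, which also makes it easy to compare the mean with the median. This is a minimal sketch on made-up word counts (the `toy` frame is not the real dataset); it illustrates why the median is the more robust measure when a few reviews are very long:

```python
import pandas as pd

# Toy word counts: one very long negative review pulls the mean upwards.
toy = pd.DataFrame({
    "sentiment": ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"],
    "review_words_count": [20, 30, 40, 50, 100, 450],
})

summary = toy.groupby("sentiment")["review_words_count"].agg(["mean", "median", "std"])
print(summary)
```

In the toy data the negative mean (200) is twice the negative median (100) because of a single outlier, which mirrors the gap between mean and median seen in the real negative reviews.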

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).

To prepare the restaurant reviews for sentiment analysis, I clean and preprocess the text: I remove punctuation, non-alphabetic tokens and stop words, and then lemmatize the remaining tokens.

In [112]:
# download the NLTK resources used during preprocessing:
nltk.download("stopwords") # list of common English stop words
nltk.download("punkt") # tokenizer models used by word_tokenize
nltk.download("wordnet") # lexical database used by WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [113]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def text_preprocessing(review_words):
    review_words = ' '.join(review_words)
    review_words = review_words.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
    
    tokens = word_tokenize(review_words) # tokenize the words
    
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words] # keep only alphabetic tokens that are not stop words

    lematized_tokens = ' '.join([lemmatizer.lemmatize(word) for word in tokens]) # lemmatize the tokens and join them back into one string
    
    return lematized_tokens

In [114]:
restaurants_data["lematized_review_words"] = restaurants_data.review_words.apply(text_preprocessing)
restaurants_data["lematized_review_words"]

0       manager became agressive said carbonara good r...
1       ordered beef fillet ask done medium got well d...
2       attractive venue welcoming albeit somewhat slo...
3       sadly used high tripadvisor rating literally f...
4       start meal bad especially given price visited ...
                              ...                        
1497    despite review saying lovely place hang especi...
1498    beer good food awfull decent thing shish kabob...
1499    terrible service truly comedic level full pint...
1500    visited havana club museum located old havana ...
1501    food service awful pretty stop good photo bad ...
Name: lematized_review_words, Length: 1502, dtype: object
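One thing to keep in mind for sentiment analysis is that NLTK's English stop-word list contains negations such as "not" and "no", so removing all stop words turns "not good" into "good". The snippet below is a hedged sketch of one possible variant that keeps negations; `STOP_WORDS` is a tiny hypothetical list used only so that the example runs without downloading NLTK data, and `preprocess_keep_negations` is an illustrative helper, not the function used in this notebook:

```python
import string

# Hypothetical, deliberately tiny stop-word list (stand-in for the NLTK list).
STOP_WORDS = {"the", "was", "is", "a", "and", "not", "no", "i", "it"}
# Words that carry sentiment and should survive stop-word removal.
NEGATIONS = {"not", "no", "never", "nor"}

def preprocess_keep_negations(text, stop_words=STOP_WORDS, negations=NEGATIONS):
    """Lowercase, strip punctuation, and drop stop words except negations."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    effective_stop_words = stop_words - negations
    return [w for w in text.split() if w.isalpha() and w not in effective_stop_words]

tokens = preprocess_keep_negations("The food was not good, and the service was no better.")
print(tokens)  # ['food', 'not', 'good', 'service', 'no', 'better']
```

Whether keeping negations helps depends on the downstream model; with unigram counts alone the effect is limited, while bigrams such as "not good" can make use of it.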

In [59]:
restaurants_data.head(3)

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,review_words,review_words_count,lematized_review_words
0,France,The Frog at Bercy Village,Negative,Rude manager,2024-05-01,The manager became agressive when I said the c...,"[the, manager, became, agressive, when, i, sai...",28,manager became agressive said carbonara good r...
1,France,The Frog at Bercy Village,Negative,A big disappointment,2024-02-01,"I ordered a beef fillet ask to be done medium,...","[i, ordered, a, beef, fillet, ask, to, be, don...",58,ordered beef fillet ask done medium got well d...
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,2023-11-01,"This is an attractive venue with welcoming, al...","[this, is, an, attractive, venue, with, welcom...",40,attractive venue welcoming albeit somewhat slo...


### Problem 4. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).

In [115]:
def return_top_words(lematized_words, n):
    words = " ".join(lematized_words).split() # make list of words
    words_count = Counter(words)
    
    return words_count.most_common(n)

In [116]:
words_from_positive_reviews = restaurants_data[restaurants_data.sentiment == "Positive"].lematized_review_words
words_from_negative_reviews = restaurants_data[restaurants_data.sentiment == "Negative"].lematized_review_words

In [117]:
n = 10
top_positive_words = return_top_words(words_from_positive_reviews, n)
top_negative_words = return_top_words(words_from_negative_reviews, n)

print(f"The top 10 words in positive reviews are:  {top_positive_words}\n")
print(f"The top 10 words in negative reviews are:  {top_negative_words}")

The top 10 words in positive reviews are:  [('food', 740), ('great', 570), ('service', 542), ('good', 511), ('restaurant', 433), ('place', 397), ('nice', 302), ('wine', 301), ('menu', 265), ('staff', 257)]

The top 10 words in negative reviews are:  [('restaurant', 250), ('food', 247), ('u', 209), ('wine', 204), ('table', 172), ('menu', 152), ('good', 151), ('service', 146), ('one', 142), ('would', 127)]


**Summary**: \
Here, "top" words are the most frequently used words in the reviews after preprocessing (lowercasing, stop-word removal and lemmatization). \
The most common words in positive reviews are associated with satisfaction and a pleasant dining experience. Words like "food", "great", "service", "restaurant", "place" and "nice" suggest that customers frequently highlight the quality of the food, the good service and the nice atmosphere. \
The most common words in negative reviews name the things people complain about: the restaurant's food, service and other aspects of the dining experience. Six of them ("restaurant", "food", "wine", "menu", "good" and "service") also appear in the positive top 10, so they describe the topic of a review rather than its sentiment; words like "table", "one" and "would" hint at longer, more narrative complaints and unfulfilled expectations. The token "u" is most likely a preprocessing artifact (the lemmatizer appears to reduce "us" to "u") rather than a real word. \
Taken together, these words suggest the most important things that people look for when visiting restaurants.
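Raw frequency favours words that are common in both classes. An alternative definition of "top" is the words that are most distinctive for one class, for example by comparing smoothed relative frequencies. The following is a minimal sketch of that idea on toy documents; `distinctive_words` is a hypothetical helper and the add-one smoothing is an assumption, not something prescribed by the lab:

```python
import math
from collections import Counter

def distinctive_words(target_docs, other_docs, n=10):
    """Rank words by the log-ratio of their smoothed relative frequencies."""
    target = Counter(" ".join(target_docs).split())
    other = Counter(" ".join(other_docs).split())
    vocab = set(target) | set(other)
    # Add-one smoothing so that words missing from one class do not divide by zero.
    target_total = sum(target.values()) + len(vocab)
    other_total = sum(other.values()) + len(vocab)
    scores = {
        w: math.log(((target[w] + 1) / target_total) / ((other[w] + 1) / other_total))
        for w in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy, already preprocessed documents.
positive_docs = ["great food great service", "nice food great wine"]
negative_docs = ["bad food rude service", "bad wine bad food"]

print(distinctive_words(positive_docs, negative_docs, n=2))  # ['great', 'nice']
print(distinctive_words(negative_docs, positive_docs, n=2))  # ['bad', 'rude']
```

Words such as "food" that occur equally often in both classes get a score of zero and drop out of the ranking.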

### Problem 5. Review titles (2 points)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?

In [118]:
restaurants_data["title_words"] = restaurants_data.review_title.str.split(r"\s+")
restaurants_data["title_words"] = restaurants_data.title_words.apply(lambda title_words: [w.lower() for w in title_words]) 

In [119]:
restaurants_data["lematized_title_words"] = restaurants_data.title_words.apply(text_preprocessing)
restaurants_data = restaurants_data.drop(columns = "title_words")
restaurants_data.lematized_title_words

0                              rude manager
1                        big disappointment
2                   pretty place bland food
3          great service wine inedible food
4       avoid worst meal rome possibly ever
                       ...                 
1497                           tourism trap
1498                           beer factory
1499                                brewery
1500                       nothing exciting
1501                           tourist trap
Name: lematized_title_words, Length: 1502, dtype: object

In [120]:
title_words_from_positive_reviews = restaurants_data[restaurants_data.sentiment == "Positive"].lematized_title_words
title_words_from_negative_reviews = restaurants_data[restaurants_data.sentiment == "Negative"].lematized_title_words

In [121]:
n = 10
top_positive_title_words = return_top_words(title_words_from_positive_reviews, n)
top_negative_title_words = return_top_words(title_words_from_negative_reviews, n)

print(f"The top 10 words in positive titles are:  {top_positive_title_words}\n")
print(f"The top 10 words in negative titles are:  {top_negative_title_words}")

The top 10 words in positive titles are:  [('great', 225), ('food', 176), ('good', 105), ('place', 104), ('service', 95), ('excellent', 78), ('best', 78), ('restaurant', 75), ('dinner', 74), ('amazing', 66)]

The top 10 words in negative titles are:  [('food', 32), ('service', 25), ('disappointing', 21), ('place', 16), ('bad', 15), ('rome', 14), ('great', 12), ('terrible', 12), ('restaurant', 11), ('ad', 11)]


In [122]:
def extract_words(words_frequencies):
    words_list = [word for word, f in words_frequencies]
    return words_list

set_top_positive_words_review = set(extract_words(top_positive_words))
set_top_negative_words_review = set(extract_words(top_negative_words))

set_top_positive_words_title = set(extract_words(top_positive_title_words))
set_top_negative_words_title = set(extract_words(top_negative_title_words))

In [123]:
positive_matching_words = set_top_positive_words_title.intersection(set_top_positive_words_review)
negative_matching_words = set_top_negative_words_title.intersection(set_top_negative_words_review)

print(f"The most frequently used words that occur in both positive reviews and titles: {', '.join(positive_matching_words)}.")
print(f"The most frequently used words that occur in both negative reviews and titles: {', '.join(negative_matching_words)}.")

The most frequently used words that occur in both positive reviews and titles: great, food, service, restaurant, good, place.
The most frequently used words that occur in both negative reviews and titles: food, restaurant, service.


**Summary of results for most frequently used words**: \
Most frequently used words in both positive reviews and titles are: great, good, restaurant, food, service, place. The presence of these words in both contexts suggests they are significant in expressing positive sentiments. Words like "great", "good" and "food" are commonly used to describe positive dining experiences. \
Most frequently used words in both negative reviews and titles are: restaurant, food, service. These words reflect common complaints or issues, such as problems with the "restaurant", "food" or "service". The recurrence of these terms in negative contexts emphasizes the typical points of criticism.

In [124]:
restaurants_data.lematized_review_words = restaurants_data.lematized_review_words.str.split(r"\s+")
restaurants_data.lematized_title_words = restaurants_data.lematized_title_words.str.split(r"\s+")

In [125]:
top_positive_words = ['food', 'great', 'service', 'good', 'restaurant', 'place', 'nice', 'wine', 'menu', 'staff']
top_negative_words = ['restaurant', 'food', 'u', 'wine', 'table', 'menu', 'good', 'service', 'one', 'would']
top_positive_words_set = set(top_positive_words)
top_negative_words_set = set(top_negative_words)

Extract the words that are top positive / negative words and exist in review:

In [126]:
def extract_positive_words(row):
    if row["sentiment"] == "Positive":
        review_words_set = set(row["lematized_review_words"])
        common_words = review_words_set.intersection(top_positive_words_set)
        return list(common_words)
    return np.nan 

def extract_negative_words(row):
    if row["sentiment"] == "Negative":
        review_words_set = set(row["lematized_review_words"])
        common_words = review_words_set.intersection(top_negative_words_set)
        return list(common_words)
    return np.nan 

restaurants_data["existing_top_pos_words_review"] = restaurants_data.apply(extract_positive_words, axis = 1)
restaurants_data["existing_top_neg_words_review"] = restaurants_data.apply(extract_negative_words, axis = 1)

Extract the words that are top positive / negative title words and exist in the title:

In [127]:
top_positive_words_title = ['great', 'food', 'good', 'place', 'service', 'excellent', 'best', 'restaurant', 'dinner', 'amazing']
top_negative_words_title = ['food', 'service', 'disappointing', 'place', 'bad', 'rome', 'great', 'terrible', 'restaurant', 'ad']
top_positive_words_set_title = set(top_positive_words_title)
top_negative_words_set_title = set(top_negative_words_title)

def extract_positive_words(row):
    if row["sentiment"] == "Positive":
        review_words_set = set(row["lematized_title_words"])
        common_words = review_words_set.intersection(top_positive_words_set_title)
        return list(common_words)
    return np.nan 

def extract_negative_words(row):
    if row["sentiment"] == "Negative":
        review_words_set = set(row["lematized_title_words"])
        common_words = review_words_set.intersection(top_negative_words_set_title)
        return list(common_words)
    return np.nan 

restaurants_data["existing_top_pos_words_title"] = restaurants_data.apply(extract_positive_words, axis = 1)
restaurants_data["existing_top_neg_words_title"] = restaurants_data.apply(extract_negative_words, axis = 1)

In [128]:
restaurants_data.sample(3)

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,review_words,review_words_count,lematized_review_words,lematized_title_words,existing_top_pos_words_review,existing_top_neg_words_review,existing_top_pos_words_title,existing_top_neg_words_title
549,Italy,Ad Hoc Ristorante (Piazza del Popolo),Negative,"Very slow, average food, lack of wine, totally...",2011-12-01,Visited Ad Hoc for NYE with my husband and was...,"[visited, ad, hoc, for, nye, with, my, husband...",436,"[visited, ad, hoc, nye, husband, disappointed,...","[slow, average, food, lack, wine, totally, ove...",,"[food, would, one, service, table, menu, resta...",,[food]
1387,Cuba,Old Square (Plaza Vieja),Positive,Restoration to its former glory a good start,2014-05-01,Well on its way to being restored to its forme...,"[well, on, its, way, to, being, restored, to, ...",27,"[well, way, restored, former, splendour, squar...","[restoration, former, glory, good, start]",[],,[good],
948,Poland,Stara Kamienica,Positive,Lovely Touch,2023-06-01,I booked this restaurant a month earlier back ...,"[i, booked, this, restaurant, a, month, earlie...",54,"[booked, restaurant, month, earlier, back, aus...","[lovely, touch]","[great, service, nice, restaurant, wine]",,[],


In [129]:
def has_common_words_for_sentiment(row, sentiment_type):
    if row["sentiment"] == sentiment_type:
        if sentiment_type == "Positive":
            list1 = row["existing_top_pos_words_review"]
            list2 = row["existing_top_pos_words_title"]
        elif sentiment_type == "Negative":
            list1 = row["existing_top_neg_words_review"]
            list2 = row["existing_top_neg_words_title"]
        else:
            return False
        return bool(set(list1) & set(list2))
    return False


restaurants_data["positive_correlation"] = restaurants_data.apply(
    lambda row: has_common_words_for_sentiment(row, "Positive"), axis=1)

restaurants_data["negative_correlation"] = restaurants_data.apply(
    lambda row: has_common_words_for_sentiment(row, "Negative"), axis=1)

In [130]:
total_positives = len(restaurants_data[restaurants_data.sentiment == "Positive"])
total_negatives = len(restaurants_data[restaurants_data.sentiment == "Negative"])

positive_with_top_words = restaurants_data.positive_correlation.sum()
negative_with_top_words = restaurants_data.negative_correlation.sum()

pct_positive_matching = (positive_with_top_words / total_positives) * 100
pct_negative_matching = (negative_with_top_words / total_negatives) * 100

print(f"{pct_positive_matching:.2f}% of all positive reviews contain one or more of the top positive words in both the review and the title.")
print(f"{pct_negative_matching:.2f}% of all negative reviews contain one or more of the top negative words in both the review and the title.")

29.59% of all positive reviews contain one or more of the top positive words in both the review and the title.
14.34% of all negative reviews contain one or more of the top negative words in both the review and the title.


**Correlation analysis**: \
Positive sentiment percentage = 29.59% \
About 30% of positive reviews share at least one top positive word between the review text and the title. This indicates a moderate link between the content and the title of positive reviews, maybe because satisfied reviewers are more focused and consistent. It suggests that, for these reviews, the title is descriptive of the positive feedback given in the text.

Negative sentiment percentage = 14.34% \
Approximately 14% of negative reviews share at least one top negative word between the review text and the title. This lower percentage compared to the positive reviews shows less consistency, so titles of negative reviews might be less descriptive of the content. Possible reasons are emotion or an attempt to capture attention. Negative titles may express general dissatisfaction, while the review text focuses on a specific problem explained in detail.
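The percentages above only consider the top 10 words. A complementary, per-review measure of how descriptive a title is could be the share of title tokens that also occur in the review text. This is a minimal sketch of that idea; `title_coverage` is a hypothetical helper and the token lists are toy examples:

```python
def title_coverage(title_tokens, review_tokens):
    """Fraction of distinct title tokens that also occur in the review text."""
    title_set = set(title_tokens)
    if not title_set:
        return 0.0  # an empty title covers nothing
    return len(title_set & set(review_tokens)) / len(title_set)

# Toy, already preprocessed tokens.
title = ["pretty", "place", "bland", "food"]
review = ["attractive", "venue", "food", "bland", "place", "slow", "service"]

print(title_coverage(title, review))  # 0.75
```

Applied row by row to the lemmatized title and review columns, this would give a distribution of coverage values per sentiment rather than a single yes/no flag.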

### Problem 6. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).

In [82]:
restaurants_data.lematized_title_words.head(5)

0                               [rude, manager]
1                         [big, disappointment]
2                  [pretty, place, bland, food]
3        [great, service, wine, inedible, food]
4    [avoid, worst, meal, rome, possibly, ever]
Name: lematized_title_words, dtype: object

In [83]:
restaurants_data.lematized_review_words.head(5)

0    [manager, became, agressive, said, carbonara, ...
1    [ordered, beef, fillet, ask, done, medium, got...
2    [attractive, venue, welcoming, albeit, somewha...
3    [sadly, used, high, tripadvisor, rating, liter...
4    [start, meal, bad, especially, given, price, v...
Name: lematized_review_words, dtype: object

In [131]:
# convert lematized words list to string:
restaurants_data["lematized_title_words_str"] = restaurants_data.lematized_title_words.apply(lambda x: " ".join(x))
restaurants_data["lematized_review_words_str"] = restaurants_data.lematized_review_words.apply(lambda x: " ".join(x))

In [85]:
restaurants_data.head(3)

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,review_words,review_words_count,lematized_review_words,lematized_title_words,existing_top_pos_words_review,existing_top_neg_words_review,existing_top_pos_words_title,existing_top_neg_words_title,positive_correlation,negative_correlation,lematized_title_words_str,lematized_review_words_str
0,France,The Frog at Bercy Village,Negative,Rude manager,2024-05-01,The manager became agressive when I said the c...,"[the, manager, became, agressive, when, i, sai...",28,"[manager, became, agressive, said, carbonara, ...","[rude, manager]",,[good],,[],False,False,rude manager,manager became agressive said carbonara good r...
1,France,The Frog at Bercy Village,Negative,A big disappointment,2024-02-01,"I ordered a beef fillet ask to be done medium,...","[i, ordered, a, beef, fillet, ask, to, be, don...",58,"[ordered, beef, fillet, ask, done, medium, got...","[big, disappointment]",,[],,[],False,False,big disappointment,ordered beef fillet ask done medium got well d...
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,2023-11-01,"This is an attractive venue with welcoming, al...","[this, is, an, attractive, venue, with, welcom...",40,"[attractive, venue, welcoming, albeit, somewha...","[pretty, place, bland, food]",,"[food, restaurant, service]",,"[food, place]",False,True,pretty place bland food,attractive venue welcoming albeit somewhat slo...


In [132]:
# initialize vectorizers, set hyperparameters:

title_vectorizer = CountVectorizer(
    ngram_range = (1, 1),   # use only single words (unigrams)
    stop_words = "english", # remove common stop words
    lowercase = True,       # convert input to lowercase for consistency
    max_features = 1000,    # keep at most the 1000 most frequent terms
    min_df = 2,             # ignore terms that appear in fewer than 2 documents
    max_df = 0.90,          # ignore terms that appear in more than 90% of the documents
)

review_vectorizer = CountVectorizer(
    ngram_range = (1, 2),   # capture unigrams and bigrams
    stop_words = "english", # remove common stop words
    lowercase = True,       # convert input to lowercase for consistency (ok in my case)
    max_features = 10000,   # keep at most the 10000 most frequent terms
    min_df = 2,             # ignore terms that appear in fewer than 2 documents
    max_df = 0.85,          # ignore terms that appear in more than 85% of the documents
)

In [133]:
# fit and transform
title_features = title_vectorizer.fit_transform(restaurants_data.lematized_title_words_str)
review_features = review_vectorizer.fit_transform(restaurants_data.lematized_review_words_str)

In [134]:
title_features

<1502x405 sparse matrix of type '<class 'numpy.int64'>'
	with 3662 stored elements in Compressed Sparse Row format>

In [136]:
review_features

<1502x7402 sparse matrix of type '<class 'numpy.int64'>'
	with 54898 stored elements in Compressed Sparse Row format>
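For a unified representation that still lets the title count more than the content, one option is to scale the title matrix and stack it next to the review matrix. The snippet below is only a sketch under stated assumptions: `TITLE_WEIGHT` is a hypothetical value that would need tuning, and the two short lists stand in for the real title and review columns:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the title and review columns.
titles = ["rude manager", "great food"]
reviews = ["the manager was rude to us", "great food and great wine"]

title_X = CountVectorizer().fit_transform(titles)
review_X = CountVectorizer().fit_transform(reviews)

TITLE_WEIGHT = 2.0  # hypothetical weight; not derived from the data

# The first block of columns holds weighted title counts, the second the review counts.
combined = hstack([TITLE_WEIGHT * title_X, review_X]).tocsr()
print(combined.shape)
```

Keeping two separate vocabularies means that "food" in a title and "food" in a review are different features, which is what allows them to be weighted differently.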

In [137]:
# share of non-zero entries (density), computed from the matrices themselves:
pct_titles = title_features.nnz / (title_features.shape[0] * title_features.shape[1]) * 100
pct_reviews = review_features.nnz / (review_features.shape[0] * review_features.shape[1]) * 100

print(f"Density of title matrix: {pct_titles:.2f}%")
print(f"Density of review content matrix: {pct_reviews:.2f}%")

Density of title matrix: 0.60%
Density of review content matrix: 0.49%


Titles: a density of 0.60% means that approximately 0.60% of the matrix entries are non-zero and the remaining ones are zero, i.e. the sparsity is about 99.40%. Titles are very short, but their vocabulary is also small (405 terms) and focused, with the same words appearing across many titles, so the title matrix is slightly denser than the review matrix.

Reviews: a density of 0.49% means that about 0.49% of the matrix entries are non-zero, i.e. the sparsity is about 99.51%. The review matrix is therefore slightly sparser than the title matrix. Each review contains more words than a title, but the vocabulary is much broader (7402 unigrams and bigrams), so an even larger proportion of the entries is zero.

### Problem 7. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Measure the accuracy of the model compared to the `sentiment` column in the dataset.
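Whatever model is chosen, its labels have to be mapped to the `Positive` / `Negative` values of the `sentiment` column before accuracy can be computed, and the result should be compared with the majority-class baseline because the classes are imbalanced. This is a minimal sketch of the evaluation step only; `raw_predictions` is a hypothetical model output, not the result of running any real model:

```python
def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels in the dataset's format.
y_true = ["Positive", "Positive", "Negative", "Positive", "Negative"]

# Hypothetical model output that uses a different label spelling.
raw_predictions = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE"]
y_pred = [label.capitalize() for label in raw_predictions]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, ["Positive"] * len(y_true))

print(f"model: {model_accuracy:.2f}, always-Positive baseline: {baseline_accuracy:.2f}")
```

On the real data the always-Positive baseline is already about 82%, so a model needs to beat that figure to be useful.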

### Problem 8. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar with how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?
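Cosine similarity compares the direction of two embedding vectors and ignores their length. A minimal sketch of the pairwise computation is shown below; the 2-dimensional `toy_embeddings` are made-up vectors standing in for real model embeddings, and `cosine_similarity_matrix` is an illustrative helper:

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2D array."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T

# Toy embeddings: rows 0 and 1 point in the same direction, row 2 is orthogonal.
toy_embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
similarity = cosine_similarity_matrix(toy_embeddings)

print(similarity.round(2))
```

Averaging the similarity within each restaurant and across restaurants would be one way to check whether customers of the same restaurant tend to agree.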

### \* Problem 9. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.