<a href="https://colab.research.google.com/github/sumitrB/DataMining/blob/main/HotelReviewsEmbeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Word2Vec for Semantic Analysis of Hotel Reviews**
This notebook explores the application of Word2Vec to analyze semantic relationships in hotel reviews. Through various tuning experiments, we assess the model’s ability to capture meaningful connections between words, identify potential preprocessing challenges, and evaluate the impact of dataset limitations on embedding quality.

In [None]:
!pip install gensim



In [None]:
import pandas as pd
import string
import nltk
import gensim
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt_tab')  # Tokenizer
nltk.download('stopwords')  # Stopwords list

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Path to the CSV file on Google Drive (assumes Drive is already mounted)
file = '/content/drive/MyDrive/DataMining/Asgn4/hotel_reviews_extract.csv'

# Read the dataset
df = pd.read_csv(file)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2438 entries, 0 to 2437
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_full  2438 non-null   object
 1   date         1654 non-null   object
 2   summary      2438 non-null   object
 3   details      2438 non-null   object
dtypes: object(4)
memory usage: 76.3+ KB


In [None]:
df.head()

Unnamed: 0,review_full,date,summary,details
0,Nov 18 2009\tMary's Going Off to College Bash\...,Nov 18 2009,Mary's Going Off to College Bash,Neal Buker etc. We didn't have much of a view ...
1,Nov 16 2009 \tRefreshing Marriott\tStayed a lo...,Nov 16 2009,Refreshing Marriott,Stayed a long weekend at the Marriott with wif...
2,Oct 4 2009 \tLoved La Gaffe\tMy partner and I ...,Oct 4 2009,Loved La Gaffe,My partner and I stayed here for 10days at the...
3,Oct 13 2009 \tPerfect location nice big room e...,Oct 13 2009,Perfect location nice big room excellent service,We found a great hotel in the Bloomsbury secti...
4,\tClean enjoyable but definately room for impr...,,Clean enjoyable but definately room for improv...,Stayed a few nights at Hotel Fusion. Great loc...


In [None]:
# Extract the 'details' column
details = df['details']
details.head()

Unnamed: 0,details
0,Neal Buker etc. We didn't have much of a view ...
1,Stayed a long weekend at the Marriott with wif...
2,My partner and I stayed here for 10days at the...
3,We found a great hotel in the Bloomsbury secti...
4,Stayed a few nights at Hotel Fusion. Great loc...


# **Preprocessing**

In [None]:
# Convert text to lowercase
details = details.str.lower()

details.head()

Unnamed: 0,details
0,neal buker etc. we didn't have much of a view ...
1,stayed a long weekend at the marriott with wif...
2,my partner and i stayed here for 10days at the...
3,we found a great hotel in the bloomsbury secti...
4,stayed a few nights at hotel fusion. great loc...


In [None]:
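# Remove punctuation and non-alphabetic characters.
# Note: matches are deleted rather than replaced by a space, so words joined only by
# punctuation (e.g. "helpful.the") are fused into a single token ("helpfulthe").
details = details.apply(lambda x: re.sub(r'[^a-z\s]', '', x))  # Keep only lowercase letters and spaces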

# Tokenize the text
details = details.apply(word_tokenize)

# Remove stopwords and words of two characters or fewer
stop_words = set(stopwords.words('english'))  # Load NLTK's stop words list
details = details.apply(lambda tokens: [word for word in tokens if word not in stop_words and len(word) > 2])

# Display first few processed rows
details.head()

Unnamed: 0,details
0,"[neal, buker, etc, didnt, much, view, really, ..."
1,"[stayed, long, weekend, marriott, wife, son, a..."
2,"[partner, stayed, days, thend, long, european,..."
3,"[found, great, hotel, bloomsbury, section, lon..."
4,"[stayed, nights, hotel, fusion, great, locatio..."


# **Train a Word2Vec model**

In [None]:
# Train Word2Vec Model
word2vec_model = Word2Vec(
    sentences=details,  # Tokenized sentences
    vector_size=150,    # Each word is represented as a 150-dimensional vector
    window=10,          # Words within a 10-word distance are considered as context
    min_count=5,        # Ignores words that appear fewer than 5 times
    workers=4,          # Uses 4 worker threads for parallel training
    sg=0                # CBOW (Continuous Bag of Words) model (use sg=1 for Skip-Gram)
)

# Save the model for later use
word2vec_model.save("word2vec_hotel_reviews.model")

# Print summary
print("Word2Vec model trained successfully!")
print(f"Vocabulary size: {len(word2vec_model.wv)}")


Word2Vec model trained successfully!
Vocabulary size: 3861


# **Let's explore our Model**

In [None]:
# 10 most frequent words in the dataset (gensim sorts the vocabulary by descending frequency by default)
most_common_words = list(word2vec_model.wv.key_to_index.keys())[:10]

print("Top 10 most popular words in the dataset:")
print(most_common_words)

Top 10 most popular words in the dataset:
['hotel', 'room', 'stay', 'good', 'would', 'staff', 'one', 'rooms', 'great', 'breakfast']


In [None]:
def words_similar_to(word):
    try:
        similar_words = word2vec_model.wv.most_similar(word, topn=5)
        return [word_pair[0] for word_pair in similar_words]
    except KeyError:
        return f"'{word}' not found in vocabulary."

# Show the nearest neighbours of the ten most frequent words

for word in most_common_words:
    print(f"Words similar to '{word}': {words_similar_to(word)}")

Words similar to 'hotel': ['london', 'business', 'highly', 'definitely', 'recommend']
Words similar to 'room': ['floor', 'bed', 'double', 'shower', 'toilet']
Words similar to 'stay': ['would', 'definitely', 'money', 'recommend', 'staying']
Words similar to 'good': ['great', 'excellent', 'food', 'helpfull', 'restaurant']
Words similar to 'would': ['stay', 'staying', 'money', 'better', 'anyone']
Words similar to 'staff': ['friendly', 'helpful', 'service', 'helpfull', 'courteous']
Words similar to 'one': ['left', 'moved', 'sleep', 'copy', 'another']
Words similar to 'rooms': ['spacious', 'size', 'comfortable', 'large', 'clean']
Words similar to 'great': ['excellent', 'good', 'food', 'value', 'restaurant']
Words similar to 'breakfast': ['buffet', 'nice', 'free', 'included', 'helpfulthe']


# **Analysis of Model Output**
The Word2Vec model captures several meaningful relationships between words in the dataset. It groups related terms: ‘rooms’ is associated with ‘spacious’ and ‘comfortable’, and ‘staff’ with ‘friendly’ and ‘helpful’. It also places words of similar sentiment close together, as seen in how ‘great’ aligns with ‘excellent’ and ‘good’, and it picks up topical associations such as ‘breakfast’ with ‘buffet’, ‘free’, and ‘included’, reflecting common themes in hotel reviews.

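However, some inconsistencies exist. Certain neighbours, such as ‘copy’ for ‘one’, appear irrelevant, indicating possible noise in the dataset. Misspellings such as ‘helpfull’ are learned as separate tokens, and fused tokens such as ‘helpfulthe’ point to a preprocessing error: the cleaning step deletes punctuation instead of replacing it with a space, so two words separated only by a punctuation mark most likely collapse into one token. A larger dataset (this one has only 2,438 reviews) could improve embedding quality, leading to more accurate and meaningful word associations.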
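
One possible fix for the fused tokens is sketched below, under the assumption that they come from punctuation being deleted rather than replaced. The helper `clean_and_split` and its parameters are hypothetical and not part of the notebook. It uses plain `str.split` instead of NLTK so that it runs on its own.

```python
import re

def clean_and_split(text, stop_words=frozenset(), min_len=3):
    """Lowercase, turn every non-letter into a space, then split on whitespace."""
    text = text.lower()
    # Replacing with a space (not an empty string) keeps "helpful.The" as two words
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in stop_words and len(tok) >= min_len]

# Hypothetical review snippet with a missing space after the full stop
sample = "The staff were helpful.The room was clean!"
print(clean_and_split(sample, stop_words={"the", "were", "was"}))
# → ['staff', 'helpful', 'room', 'clean']
```

A side effect of this variant is that contractions are split at the apostrophe ("didn't" becomes "didn" and "t"), so the stopword list and length filter need to account for the fragments.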

# **Creative word math**

In [None]:
def word_math(positive, negative1, negative2):
    # Return the word closest to: positive - negative1 - negative2
    try:
        result = word2vec_model.wv.most_similar(positive=[positive], negative=[negative1, negative2], topn=1)
        return result[0][0]  # Return the most similar word
    except KeyError:
        return f"One of the words ('{positive}', '{negative1}', '{negative2}') is not in the vocabulary."

# Example usage
print(word_math("staff", "rude", "unfriendly"))

confortable
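
For context, `most_similar` works roughly as follows: it adds the unit-normalized vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine similarity to the result, skipping the input words. The sketch below reproduces that idea in plain Python. The 2-D vectors and the helper `toy_most_similar` are made-up illustrations, not values or code from the trained model.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    inputs = set(positive) | set(negative)  # input words are excluded from the results
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-D embeddings: second axis loosely stands for "pleasantness"
toy_vectors = {
    "staff": [1.0, 0.2],
    "rude": [0.1, -1.0],
    "unfriendly": [0.2, -0.9],
    "friendly": [0.3, 1.0],
    "room": [1.0, -0.1],
}
print(toy_most_similar(toy_vectors, ["staff"], ["rude", "unfriendly"])[0][0])
# → friendly
```

With clean, well-separated vectors the subtraction pushes the query toward the opposite of the negative words. On a small, noisy corpus the same query can land near any token whose vector happens to point that way, including typos.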


In [None]:
# What makes a room better? (+ "room" - "small" - "cramped")
print(word_math("room", "small", "cramped"))

told


In [None]:
# What do guests value in service? (+ "service" - "slow" - "unhelpful")
print(word_math("service", "slow", "unhelpful"))

confortable


In [None]:
# What improves a hotel experience? (+ "hotel" - "dirty" - "noisy")
print(word_math("hotel", "dirty", "noisy"))

location


In [None]:
print("confortable" in word2vec_model.wv.key_to_index)

True
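
The membership check confirms that the misspelling was learned as its own token. A rough way to surface such near-duplicates is to compare vocabulary entries against a list of trusted spellings with the standard library's `difflib`. The helper `suggest_corrections`, the reference list, and the cutoff of 0.85 are assumptions for illustration; in practice the reference could be a dictionary word list or the most frequent vocabulary entries.

```python
import difflib

def suggest_corrections(tokens, reference_words, cutoff=0.85):
    """Map each token missing from reference_words to its closest reference word, if any."""
    reference = list(reference_words)
    known = set(reference)
    suggestions = {}
    for token in tokens:
        if token in known:
            continue  # already a trusted spelling
        match = difflib.get_close_matches(token, reference, n=1, cutoff=cutoff)
        if match:
            suggestions[token] = match[0]
    return suggestions

# Hypothetical inputs: a few vocabulary entries and a small trusted word list
reference_words = ["comfortable", "helpful", "friendly", "staff"]
vocab_sample = ["confortable", "helpfull", "staff", "breakfast"]
print(suggest_corrections(vocab_sample, reference_words))
# → {'confortable': 'comfortable', 'helpfull': 'helpful'}
```

Applying such a mapping before training would merge the counts of each misspelling into the correct word instead of spreading them across separate vectors.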


# **Analysis of Results**

The result for (+ "staff" - "rude" - "unfriendly") → "confortable" (misspelled) suggests that the model has learned a typo from the dataset. This indicates that "staff" and "comfort" are closely associated in the reviews, but the presence of a misspelled word suggests potential data quality issues.

The result for (+ "room" - "small" - "cramped") → "told" is unrelated to room size, which means the model is not effectively capturing spatial concepts. This could be due to a lack of training examples where "room" is explicitly linked to adjectives like "spacious" or "large."

The result for (+ "service" - "slow" - "unhelpful") → "confortable" (misspelled) again reinforces the issue of learned typos. The fact that a service-related query returns "comfortable" suggests that the dataset frequently associates good service with comfort. However, the misspelling indicates a need for better text preprocessing before training.

The result for (+ "hotel" - "dirty" - "noisy") → "location" indicates that the model has learned an association between hotels and their location in positive contexts. While "clean" or "quiet" would have been expected, this result suggests that positive hotel reviews often emphasize location, influencing the model’s predictions.


In [None]:
# Retrain the Word2Vec model with adjusted parameters
word2vec_model = Word2Vec(
    sentences=details,
    vector_size=150,
    window=13,          # Larger window size for better context understanding
    min_count=2,        # Retain more words to improve associations
    workers=4,
    sg=0
)

# Save the retrained model
word2vec_model.save("word2vec_hotel_reviews_v2.model")

# Print updated vocabulary size
print(f"Updated Vocabulary Size: {len(word2vec_model.wv)}")

Updated Vocabulary Size: 8093
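
The jump in vocabulary size follows directly from `min_count`: words that occur fewer than `min_count` times are dropped before training, so lowering the threshold lets rare words back in, including typos and fused tokens. The toy corpus and the helper `vocab_size` below are hypothetical and only illustrate the counting rule.

```python
from collections import Counter

def vocab_size(sentences, min_count):
    """Count distinct tokens that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sum(1 for count in counts.values() if count >= min_count)

# Hypothetical tokenized reviews; "helpfulthe" and "confortable" occur once each
toy_corpus = [
    ["staff", "friendly", "room", "clean"],
    ["staff", "helpful", "room", "clean"],
    ["staff", "helpfulthe", "room", "spacious"],
    ["staff", "friendly", "room", "confortable"],
    ["staff", "helpful", "room", "spacious"],
]
for threshold in (5, 2, 1):
    print(threshold, vocab_size(toy_corpus, threshold))
# → 5 2
# → 2 6
# → 1 8
```

Words admitted at a low threshold have very few training contexts, so their vectors tend to be poorly estimated even though they now compete in every similarity query.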


In [None]:
# Conceptual Questions for Word Vector Arithmetic

print("1. What makes a guest feel welcome?")
print(word_math("staff", "rude", "unfriendly"))

print("\n2. What makes a room better?")
print(word_math("room", "small", "cramped"))

print("\n3. What improves a hotel experience?")
print(word_math("hotel", "dirty", "noisy"))

print("\n4. What do guests value in service?")
print(word_math("service", "slow", "unhelpful"))

print("\n5. What makes breakfast enjoyable?")
print(word_math("breakfast", "cold", "bland"))


1. What makes a guest feel welcome?
ken

2. What makes a room better?
issuei

3. What improves a hotel experience?
solely

4. What do guests value in service?
welcomingthe

5. What makes breakfast enjoyable?
welcomingthe


# **Conclusion**
The Word2Vec model struggled to establish clear semantic relationships in word arithmetic, producing vague or irrelevant outputs. With the first configuration shown above (CBOW, sg=0, window=10, min_count=5), the queries returned "confortable", "told", and "location" rather than opposites of the negative terms. Initially, with CBOW (sg=0), window=5, and min_count=3 (a run not shown in this notebook), all word arithmetic queries returned "pleasant", indicating over-generalization rather than meaningful word associations.

Increasing the context window to 13 and reducing min_count to 2 resulted in fragmented and meaningless words like "welcomingthe" and "issuei", suggesting poor tokenization or insufficient context for rare words. Even with more training epochs (15, also not shown), the model failed to form strong relationships, highlighting dataset limitations.
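
Fused tokens of this kind can be flagged before training by checking whether a token splits cleanly into two known words. The sketch below is a minimal illustration: the helper `split_fused` and the `known_words` set are hypothetical, and the set deliberately includes stopwords because the second half of a fused token is often one.

```python
def split_fused(token, known_words):
    """Return (left, right) if the token is two known words glued together, else None."""
    if token in known_words:
        return None  # a legitimate word on its own
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in known_words and right in known_words:
            return left, right
    return None

# Hypothetical word list; in practice this could be a dictionary plus stopwords
known_words = {"welcoming", "the", "issue", "i", "helpful", "staff"}
for token in ["welcomingthe", "issuei", "helpfulthe", "staff", "solely"]:
    print(token, split_fused(token, known_words))
```

Any token that returns a pair is a candidate for splitting or removal. Short entries in the word list raise the risk of false positives, so the flagged tokens are worth reviewing by hand.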