## Hotel Review

We aim to perform sentiment analysis on customer reviews to understand how customers feel about a product or service. This helps us gauge customer satisfaction and identify areas that need improvement.

NLP techniques:
1. Text preprocessing: tokenization, stop word removal and lemmatization.
2. Feature extraction: Bag of Words, Term Frequency-Inverse Document Frequency (TF-IDF) and word embeddings.
3. Sentiment classification: Random Forest (RF) classifier, Logistic Regression and CNN.

### Data Preparation

Each customer review consists of textual feedback and an overall rating.  
Ratings range from 1 to 10.  
We split the reviews into two categories: bad reviews have ratings < 5 and good reviews have ratings >= 5.

The textual feedback is divided into two parts (positive and negative). We concatenate them so that we start from a single raw text field.
Additionally, if the user leaves no negative or no positive comment, the field contains "No Negative" or "No Positive". These placeholders have to be removed from the text.

In [47]:
import pandas as pd

#read data
reviews_df = pd.read_csv('dataset/Hotel_Reviews.csv')

#append the positive and negative reviews
reviews_df['review'] = reviews_df['Negative_Review'] + reviews_df['Positive_Review']
#create the label
reviews_df['review_type'] = reviews_df['Reviewer_Score'].apply(lambda x: 'Bad_review' if x < 5 else 'Good_review')
#sample data in order to speed up the computation
reviews_df = reviews_df.sample(frac=0.1, replace=False, random_state=42)
#clean data
reviews_df['review'] = reviews_df['review'].apply(lambda x: x.replace('No Negative', '').replace('No Positive', ''))


We will perform several transformations to clean the textual data:
- lowercase the text
- tokenize the text and remove punctuation
- remove stop words
- lemmatize the text
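
The first three steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `simple_clean` and the tiny `STOP_WORDS` set are hypothetical names introduced here, and lemmatization is left out because it needs a language model such as the spaCy pipeline used in this notebook.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook relies on spaCy's built-in stop words instead.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "no", "of", "the", "to", "was"}

def simple_clean(text: str) -> List[str]:
    # 1. Lowercase the text.
    text = text.lower()
    # 2. Tokenize on runs of letters, which also drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text)
    # 3. Remove stop words and single-character tokens.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(simple_clean("No tissue paper box was present at the room."))
# ['tissue', 'paper', 'box', 'present', 'room']
```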

To speed up later runs, we save the cleaned data as a pickle file and continue working from the .pkl file instead of re-processing the CSV.
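
As a rough sketch of why caching helps, the round trip below stores already tokenized reviews and reads them back as Python lists, so the cleaning step does not have to run again. The toy data and the temporary file path are assumptions made for this example, not the notebook's actual files.

```python
import os
import pickle
import tempfile

# Toy stand-in for the cleaned reviews (hypothetical data).
cleaned = [["tissue", "paper", "box"], ["lovely", "hotel", "staff"]]

# Write to a temporary file rather than the notebook's dataset/ folder.
path = os.path.join(tempfile.mkdtemp(), "reviews_clean.pkl")
with open(path, "wb") as f:
    pickle.dump(cleaned, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The token lists come back as lists, with no re-parsing needed.
print(restored == cleaned)
```

Writing the same token lists to a CSV file would turn each list into a string, which then has to be parsed again after loading.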

In [83]:
from typing import List
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

#lemmatize the tokens, dropping stop words and tokens of a single character
def clean_text(text: str) -> List[str]:
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and len(token) > 1]
        
#convert the csv data to pkl in order to speed up the computation  
#reviews_df['review_clean'] = reviews_df['review'].apply(lambda x: clean_text(x))
#reviews_df.to_pickle('dataset/Hotel_Reviews.pkl')


Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng,review,review_type,review_clean
488440,Via Senigallia 6 20161 Milan Italy,904,7/21/2017,8.1,Hotel Da Vinci,United Kingdom,Would have appreciated a shop in the hotel th...,52,16670,Hotel was great clean friendly staff free bre...,62,1,9.6,"[' Leisure trip ', ' Couple ', ' Double Room '...",13 days,45.533137,9.171102,Would have appreciated a shop in the hotel th...,Good_review,"[appreciate, shop, hotel, sell, drinking, wate..."
274649,Arlandaweg 10 Westpoort 1043 EW Amsterdam Neth...,612,12/12/2016,8.6,Urban Lodge Hotel,Belgium,No tissue paper box was present at the room,10,5018,No Positive,0,7,8.8,"[' Leisure trip ', ' Group ', ' Triple Room ',...",234 day,52.385649,4.834443,No tissue paper box was present at the room,Good_review,"[tissue, paper, box, present, room]"
374688,Mallorca 251 Eixample 08008 Barcelona Spain,46,11/26/2015,8.3,Alexandra Barcelona A DoubleTree by Hilton,Sweden,Pillows,3,351,Nice welcoming and service,5,15,7.9,"[' Business trip ', ' Solo traveler ', ' Twin ...",616 day,41.393192,2.16152,Pillows Nice welcoming and service,Good_review,"[Pillows, nice, welcoming, service]"
404352,Piazza Della Repubblica 17 Central Station 201...,241,10/17/2015,9.1,Hotel Principe Di Savoia,United States of America,No Negative,0,1543,Everything including the nice upgrade The Hot...,27,9,10.0,"[' Leisure trip ', ' Couple ', ' Ambassador Ju...",656 day,45.479888,9.196298,Everything including the nice upgrade The Hot...,Good_review,"[include, nice, upgrade, Hotel, revamp, surpri..."
451596,Singel 303 309 Amsterdam City Center 1012 WJ A...,834,5/16/2016,9.1,Hotel Esther a,United Kingdom,No Negative,0,4687,Lovely hotel v welcoming staff,7,2,9.6,"[' Business trip ', ' Solo traveler ', ' Class...",444 day,52.370545,4.888644,Lovely hotel v welcoming staff,Good_review,"[lovely, hotel, welcome, staff]"


In [3]:
import pickle

with open('dataset/Hotel_Reviews.pkl', 'rb') as f:
    reviews = pickle.load(f)
 
#reviews = pd.read_pickle('dataset/Hotel_Reviews.pkl')

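### TF-IDF Encoding

We want to create document embeddings using a bag-of-words approach: each word is represented by its TF-IDF weights across the training documents, and a document embedding is the average of the vectors of its words.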
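
To make the weighting concrete, here is a small worked example in plain Python. It assumes the same formulas as the `TfIdfModel` class in this section, namely a term frequency of log10(count + 1) multiplied by an inverse document frequency of log10(N / df). The helper `tfidf_weights` and the toy documents are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    num_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: math.log10(count + 1) * math.log10(num_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["clean", "room", "room"], ["noisy", "room"], ["room", "staff"]]
for w in tfidf_weights(docs):
    print({term: round(value, 3) for term, value in w.items()})
```

In this toy corpus "room" appears in every document, so its inverse document frequency is log10(3 / 3) = 0 and its weight vanishes, while rarer terms such as "clean" keep a positive weight.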

In [26]:
from typing import List
from scipy.sparse import lil_matrix, csr_matrix
import numpy as np

class TfIdfModel:

    #Create an index for the vocabulary from the docs
    def build_index(self, docs: List[List[str]]) -> None:
        self.index = dict()
        words = [word for doc in docs for word in doc]
        self.index = {word:i for i, word in enumerate(sorted(set(words)))}


    def train(self, docs: List[List[str]]) -> None:
        self.build_index(docs)
        num_docs = len(docs)
        num_terms = len(self.index)
    
        # Use a sparse matrix
        term_doc_matrix = lil_matrix((num_docs, num_terms), dtype=np.float64)
    
        # Compute the term frequency matrix
        for i, doc in enumerate(docs):
            for term in doc:
                term_doc_matrix[i, self.index[term]] += 1
    
        # Convert to CSR format for efficient arithmetic operations
        term_doc_matrix = term_doc_matrix.tocsr()
    
        td_log_matrix = term_doc_matrix.copy()
        td_log_matrix.data = np.log10(td_log_matrix.data + 1)
    
        df_vector = np.diff(term_doc_matrix.tocsc().indptr)
        df_vector[df_vector == 0] = 1
        idf_vector = np.log10(num_docs / df_vector)
    
        # Store in CSC format so that column lookups in embed() are efficient
        self.tfidf_matrix = td_log_matrix.multiply(idf_vector).tocsc()


    #Embed a word into our TF-IDF vector space. If the word is not in the index, return None
    def embed(self, word: str) -> np.ndarray:
        if word not in self.index:
            return None

        word_index = self.index[word]
        word_vector = self.tfidf_matrix.getcol(word_index).toarray().flatten()
        return word_vector

    def vector_size(self) -> int:
        return self.tfidf_matrix.shape[0]

In [30]:
import numpy as np

#Create a document embedding using the bag of words approach
def bagOfWords(model: TfIdfModel, doc: List[str]) -> np.ndarray:
    embd_sum = np.zeros(model.vector_size())
    valid_embeds = []

    for token in doc:
        embed = model.embed(token)
        if embed is not None:
            valid_embeds.append(embed)

    if valid_embeds:
        embd_sum = np.sum(valid_embeds, axis=0) / len(valid_embeds)

    return embd_sum
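
The averaging step in `bagOfWords` can be illustrated without the model. In this minimal sketch the `word_vectors` dictionary and the `average_vectors` helper are hypothetical stand-ins for `model.embed()` and the function above.

```python
from typing import Dict, List

# Hypothetical word vectors; in the notebook these come from model.embed().
word_vectors: Dict[str, List[float]] = {
    "clean": [1.0, 0.0, 2.0],
    "room": [3.0, 2.0, 0.0],
}

def average_vectors(doc: List[str], size: int = 3) -> List[float]:
    # Skip tokens without a vector, as bagOfWords does when embed() returns None.
    found = [word_vectors[t] for t in doc if t in word_vectors]
    if not found:
        # No known token: fall back to the zero vector.
        return [0.0] * size
    # Component-wise mean over the vectors that were found.
    return [sum(component) / len(found) for component in zip(*found)]

print(average_vectors(["clean", "room", "balcony"]))  # [2.0, 1.0, 1.0]
print(average_vectors(["balcony"]))                   # [0.0, 0.0, 0.0]
```

Unknown tokens such as "balcony" do not count toward the average, so they neither shift the result nor shrink it toward zero.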


In [None]:
model = TfIdfModel()
#use all reviews except the last one, as tokenized documents
docs = reviews['review_clean'].iloc[:-1]
model.train(list(docs))

embed_train = np.array([bagOfWords(model, review) for review in docs])
#slice the scores the same way so that labels stay aligned with the embeddings
labels_train = reviews['Reviewer_Score'].iloc[:-1].to_numpy()

print(embed_train.shape)
print(labels_train.shape)