## Hotel Review

We perform sentiment analysis on hotel customer reviews to understand how 
guests feel about the service they received. This helps us gauge customer 
satisfaction and identify areas that need improvement.
 
NLP Techniques: 
1. Text Preprocessing: Tokenization, stop word removal and lemmatization. 
2. Feature Extraction: Bag of Words, Term Frequency – Inverse Document 
Frequency (TF-IDF) and word embeddings. 
3. Sentiment Classification: Random Forest (RF) Classifier, Logistic Regression, 
CNN 

### Data Preparation

The data is provided as a CSV file. After preparing the data and cleaning the text, we save the result in pickle format.
Each customer review consists of textual feedback and an overall rating.  
The ratings can range from 1 to 10.  
We split the reviews into two categories: bad reviews have ratings < 5 and good reviews have ratings >= 5.

The text of each review is divided into two parts (positive and negative). We concatenate them so that we start from a single raw text field.
Additionally, if the user leaves no negative or no positive comment, the corresponding field contains "No Negative" or "No Positive". These placeholders have to be removed from the text.
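
The merge-and-strip step described above can be sketched on plain strings as follows. This is a minimal illustration, not the notebook's exact code: the helper name `merge_review` is hypothetical, and joining the two parts with a space is an assumption added so that the last word of one part does not run into the first word of the other.

```python
def merge_review(negative: str, positive: str) -> str:
    """Join both parts of a review and drop the placeholder strings."""
    # Assumption: a space separator keeps the two parts from running together.
    text = f"{negative} {positive}"
    for placeholder in ("No Negative", "No Positive"):
        text = text.replace(placeholder, "")
    # Collapse any whitespace left behind by the removed placeholders.
    return " ".join(text.split())


print(merge_review("No Negative", "Great location"))
print(merge_review("Room was small", "No Positive"))
```

With a DataFrame, a helper like this could be applied row by row to the `Negative_Review` and `Positive_Review` columns.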

In [1]:
#no longer needed once the data has been converted to pickle
'''import pandas as pd

#read data
reviews_df = pd.read_csv('dataset/Hotel_Reviews.csv')

#append the positive and negative reviews
reviews_df['review'] = reviews_df['Negative_Review'] + reviews_df['Positive_Review']
#create the label
reviews_df['review_type'] = reviews_df['Reviewer_Score'].apply(lambda x: 'Bad_review' if x < 5 else 'Good_review')
#sample data in order to speed up the computation
reviews_df = reviews_df.sample(frac=0.1, replace=False, random_state=42)
#clean data
reviews_df['review'] = reviews_df['review'].apply(lambda x: x.replace('No Negative', '').replace('No Positive', '')) '''


We will perform several transformations to clean the textual data:
- lower the text
- tokenize the text and remove the punctuation
- remove useless stop words
- lemmatize the text

To speed up the computation, we convert the CSV data to pickle and continue working with the pkl file.

In [2]:
#lemmatize tokens and remove stop words; keep only tokens longer than one character
#convert the data to pickle
'''
from typing import List
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def clean_text(text: str) -> List[str]:
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and len(token) > 1]
        
#convert the csv data to pkl in order to speed up the computation  
reviews_df['review_clean'] = reviews_df['review'].apply(lambda x: clean_text(x))
reviews_df.to_pickle('dataset/Hotel_Reviews.pkl')
'''


In [3]:
import pickle

with open('dataset/Hotel_Reviews.pkl', 'rb') as f:
    reviews = pickle.load(f)


### TF-IDF Encoding

We want to create document embeddings using a bag-of-words approach.
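
For reference, the weighting computed by the `TfIdfModel` class in the next cell can be written as follows. The symbols are our own notation, not the notebook's: $c_{d,t}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}_{d,t} = \log_{10}\left(c_{d,t} + 1\right), \qquad
\mathrm{idf}_t = \log_{10}\frac{N}{\mathrm{df}_t}, \qquad
\mathrm{tfidf}_{d,t} = \mathrm{tf}_{d,t} \cdot \mathrm{idf}_t
$$

A term that occurs in every document has $\mathrm{idf}_t = \log_{10} 1 = 0$, so its weight is zero in all documents.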

In [26]:
from typing import List
from scipy.sparse import lil_matrix, csr_matrix
import numpy as np

class TfIdfModel:

    #Create an index for the vocabulary from the docs
    def build_index(self, docs: List[List[str]]) -> None:
        self.index = dict()
        words = [word for doc in docs for word in doc]
        self.index = {word:i for i, word in enumerate(sorted(set(words)))}


    def train(self, docs: List[List[str]]) -> None:
        self.build_index(docs)
        num_docs = len(docs)
        num_terms = len(self.index)
    
        # Use a sparse matrix
        term_doc_matrix = lil_matrix((num_docs, num_terms), dtype=np.float64)
    
        # Compute the term frequency matrix
        for i, doc in enumerate(docs):
            for term in doc:
                term_doc_matrix[i, self.index[term]] += 1
    
        # Convert to CSR format for efficient arithmetic operations
        term_doc_matrix = term_doc_matrix.tocsr()
    
        td_log_matrix = term_doc_matrix.copy()
        td_log_matrix.data = np.log10(td_log_matrix.data + 1)
    
        df_vector = np.diff(term_doc_matrix.tocsc().indptr)
        df_vector[df_vector == 0] = 1
        idf_vector = np.log10(num_docs / df_vector)
    
        self.tfidf_matrix = td_log_matrix.multiply(idf_vector)


    #Embed a word into our TF-IDF vector space. Returns None if the word is not in the index.
    def embed(self, word: str) -> np.ndarray:
        if word not in self.index:
            return None

        word_index = self.index[word]
        word_vector = self.tfidf_matrix.getcol(word_index).toarray().flatten()
        return word_vector

    def vector_size(self) -> int:
        return self.tfidf_matrix.shape[0]
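
To make the weighting concrete, here is a small dense NumPy sketch of the same computation on a two-document toy corpus. The corpus is invented for illustration, and dense arrays are used only for readability; the class above uses sparse matrices.

```python
import numpy as np

# Toy corpus of already-cleaned token lists (invented for illustration).
docs = [["good", "room"], ["bad", "room"]]

# Vocabulary index built as in build_index: sorted unique words.
index = {word: i for i, word in enumerate(sorted({w for d in docs for w in d}))}

# Raw term counts, one row per document and one column per term.
counts = np.zeros((len(docs), len(index)))
for row, doc in enumerate(docs):
    for term in doc:
        counts[row, index[term]] += 1

tf = np.log10(counts + 1)              # dampened term frequency
df = np.count_nonzero(counts, axis=0)  # number of documents containing each term
idf = np.log10(len(docs) / df)         # inverse document frequency
tfidf = tf * idf                       # idf is broadcast across the rows

print(index)
print(tfidf.round(4))
```

The column for `room` is all zeros because the term appears in both documents. Since `embed` returns a column of the matrix, a word vector has one entry per document, which is why `vector_size` returns the number of rows.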

In [30]:
import numpy as np

#Create a document embedding using the bag of words approach
def bagOfWords(model: TfIdfModel, doc: List[str]) -> np.ndarray:
    embd_sum = np.zeros(model.vector_size())
    valid_embeds = []

    for token in doc:
        embed = model.embed(token)
        if embed is not None:
            valid_embeds.append(embed)

    if valid_embeds:
        embd_sum = np.sum(valid_embeds, axis=0) / len(valid_embeds)

    return embd_sum
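
The averaging step can be checked in isolation with a few made-up vectors. In this sketch the dictionary `word_vectors` and the helper `average_embedding` are hypothetical stand-ins for `model.embed` and `bagOfWords`; the values are invented.

```python
import numpy as np

# Hypothetical word vectors (invented values), one entry per known word.
word_vectors = {
    "clean": np.array([1.0, 0.0, 2.0]),
    "quiet": np.array([3.0, 2.0, 0.0]),
}


def average_embedding(tokens, vectors, size):
    """Mean of the vectors of the known tokens; a zero vector if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)


# The unknown token is skipped, so only two vectors are averaged.
doc_vector = average_embedding(["clean", "quiet", "unknownword"], word_vectors, 3)
print(doc_vector)
```

As in `bagOfWords`, tokens without a vector are ignored rather than counted as zeros, so they do not pull the average towards the origin.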


In [None]:
model = TfIdfModel()
#use the same slice (every review except the last) for training, embeddings and labels
#so that embed_train and labels_train have the same number of rows
docs = list(reviews['review_clean'][:len(reviews)-1])
model.train(docs)

embed_train = np.array([bagOfWords(model, review) for review in docs])
labels_train = np.array(reviews['Reviewer_Score'][:len(reviews)-1])

print(embed_train.shape)
print(labels_train.shape)
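
Note that `labels_train` holds the raw `Reviewer_Score` values rather than the two categories defined under Data Preparation. If binary labels are wanted for classification, one possible conversion is sketched below. The example scores are invented, and the threshold of 5 is the one stated earlier.

```python
import numpy as np

# Example scores (invented); in the notebook these would come from labels_train.
scores = np.array([2.5, 5.0, 9.6, 4.9])

# Same rule as in Data Preparation: bad if the score is < 5, good otherwise.
review_type = np.where(scores < 5, "Bad_review", "Good_review")

print(review_type)
```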