<a href="https://colab.research.google.com/github/skappal7/NLP/blob/main/hotel_review_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HOTEL REVIEW ANALYSIS - CUSTOMER SENTIMENT ORIENTATION STUDY 🙂 😐 ☹️

![](https://images.pexels.com/photos/60217/pexels-photo-60217.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500)

In [None]:
'jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10'

'jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10'

#  **OBJECTIVE 💡**

**The objective of this analysis is to understand the sentiment orientation of customers relative to their hotel stay. The secondary aim is to identify topics and themes that can inform customer experience improvement strategies based on the data insights.**

**Getting the data:**
To get the data within Kaggle, run the code block below, which ingests the data into the environment. If you are using Google Colab, copy the data path from the left-hand side panel, or mount your Google Drive to access data stored there.

Google Colab Data Path: ('/content/filename.csv') should do the magic.
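
If the same notebook has to run in both environments, one option is to check a few candidate paths and use the first that exists. This is only a sketch: the Colab path is the one used later in this notebook, while the Kaggle folder name and the `first_existing` helper are assumptions made for illustration.

```python
from pathlib import Path

# Candidate locations, checked in order.
CANDIDATE_PATHS = [
    # Colab upload path used later in this notebook
    "/content/tripadvisor_hotel_reviews.csv",
    # Hypothetical Kaggle path: the dataset folder name is an assumption
    "/kaggle/input/hotel-reviews/tripadvisor_hotel_reviews.csv",
]


def first_existing(candidates):
    """Return the first candidate that exists as a file, else raise."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    raise FileNotFoundError(f"None of these paths exist: {list(candidates)}")


# Usage inside the notebook (after importing pandas as pd):
# data = pd.read_csv(first_existing(CANDIDATE_PATHS))
```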

pandas is needed for the DataFrame operations that follow.

In [1]:
import pandas as pd

PyCaret is a low-code machine learning library that provides a host of modules for data problems ranging from simple regression to NLP tasks. In this notebook I will use PyCaret's NLP module to perform the hotel review sentiment analysis: creating a Latent Dirichlet Allocation (LDA) model and assigning that model to the data so that each review is labeled with its topic weights.

Options to install PyCaret:
* pip install pycaret (basic)
* pip install pycaret[full] (entire package with all the dependencies)
* pip install pycaret-nightly (nightly build with the latest updates)

In [2]:
pip install pycaret-nightly[full]

Collecting pycaret-nightly[full]
  Downloading https://files.pythonhosted.org/packages/80/1b/60c46571035b95a1e727ed69ba201ef8b4431038612384061b6a1c57e8bf/pycaret_nightly-2.3.2.dev1625618655-py3-none-any.whl (264kB)

PyCaret produces interactive plots. To render them in Google Colab, run the lines below in your Colab notebook.

In [14]:
from pycaret.utils import enable_colab 
enable_colab()

Colab mode enabled.


In [3]:
data = pd.read_csv('/content/tripadvisor_hotel_reviews.csv')

In [4]:
data.head()

Unnamed: 0,Review,Rating
0,nice hotel expensive parking got good deal sta...,4
1,ok nothing special charge diamond member hilto...,2
2,nice rooms not 4* experience hotel monaco seat...,3
3,"unique, great stay, wonderful time hotel monac...",5
4,"great stay great stay, went seahawk game aweso...",5


In [5]:
data.shape

(20491, 2)

In [6]:
from pycaret.nlp import *
nlp_sent = setup(data = data, target = 'Review', session_id = 999, log_experiment = True, experiment_name = 'HotRev1')

Description,Value
session_id,999
Documents,20491
Vocab Size,32301
Custom Stopwords,False


# Once setup() executes successfully, it prints an information grid with the following fields: 🛠️

**session_id :** A pseudo-random number used as a seed in all functions for later reproducibility. If no session_id is passed, a random number is generated automatically and distributed to all functions. In this experiment session_id is set to 999.

**Number of Documents :** Number of documents (or samples in dataset if dataframe is passed).

**Vocab Size :** Size of vocabulary in the corpus after applying all text pre-processing such as removal of stopwords, bigram/trigram extraction, lemmatization etc.
Notice that all text pre-processing steps are performed automatically when you execute setup().
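
To see what "Vocab Size" measures, here is a minimal sketch of the idea in plain Python. It is not PyCaret's actual pipeline: it only lowercases, tokenizes and drops stopwords, and it skips bigram/trigram extraction and lemmatization. The `STOPWORDS` set, the `preprocess` helper and the two sample reviews are made up for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption, not PyCaret's list).
STOPWORDS = {"the", "a", "was", "and", "to", "of"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


# Two made-up reviews standing in for the 'Review' column.
reviews = [
    "The room was nice and the staff was friendly",
    "Parking was expensive, room not clean",
]

tokenized = [preprocess(review) for review in reviews]

# The vocabulary is the set of distinct tokens left after pre-processing.
vocab = sorted({token for doc in tokenized for token in doc})

print("Documents:", len(tokenized))
print("Vocab Size:", len(vocab))
```

On the real dataset the same count is what setup() reports as 32301, after its own (more thorough) pre-processing.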

# Let's Perform Topic Modeling 🎯

**What is a Topic Model?** In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. In a hotel review dataset, for example, words such as check-in, checkout and night stay will appear with different frequencies depending on which aspect of the customer experience a review describes.
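
To make the idea concrete, here is a toy Latent Dirichlet Allocation (LDA) fit written with a collapsed Gibbs sampler in plain Python. It is a teaching sketch only: the gensim LdaModel that PyCaret wraps is, as far as I know, trained with variational inference rather than Gibbs sampling, and the `toy_lda` function, the hyperparameters `alpha` and `beta`, and the tiny tokenized reviews are all assumptions made for illustration.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=999):
    """Fit a tiny LDA model with collapsed Gibbs sampling.

    Returns per-document topic proportions and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Random initial topic assignment for every token.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    def update(d, word, topic, delta):
        doc_topic[d][topic] += delta
        topic_word[topic][word] += delta
        topic_total[topic] += delta

    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            update(d, word, assignments[d][i], +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                update(d, word, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                new_topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, word, new_topic, +1)

    # Smoothed topic proportions per document (each row sums to 1).
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Made-up, already tokenized reviews.
DOCS = [
    ["room", "clean", "bed", "comfortable", "room"],
    ["bed", "room", "quiet", "clean"],
    ["breakfast", "coffee", "buffet", "fresh"],
    ["buffet", "breakfast", "coffee", "breakfast"],
]

theta, topic_word = toy_lda(DOCS, num_topics=2)
for row in theta:
    print([round(weight, 2) for weight in row])
for topic, counts in enumerate(topic_word):
    print("Topic", topic, counts.most_common(3))
```

Each printed row plays the same role as the Topic_0 ... Topic_n weights that assign_model() attaches to every review later in this notebook.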

In [7]:
lda = create_model('lda')

In [8]:
print(lda)

LdaModel(num_terms=32301, num_topics=4, decay=0.5, chunksize=100)


We have created a Latent Dirichlet Allocation (LDA) model using create_model(). Notice that num_topics is set to 4, the default used when no num_topics argument is passed to create_model(). In the example below, we create an LDA model with 6 topics and also set the multi_core parameter to True. When multi_core is True, LDA uses all CPU cores to parallelize and speed up model training.

In [9]:
lda2 = create_model('lda', num_topics = 6, multi_core = True)

In [10]:
print(lda2)

LdaModel(num_terms=32301, num_topics=6, decay=0.5, chunksize=100)


In [11]:
lda_results = assign_model(lda2)
lda_results.head()

Unnamed: 0,Review,Rating,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Dominant_Topic,Perc_Dominant_Topic
0,nice hotel expensive parking get good deal sta...,4,0.736896,0.002287,0.002322,0.002304,0.143942,0.112249,Topic 0,0.74
1,special charge decide chain shoot anniversary ...,2,0.387373,0.000947,0.134051,0.000952,0.475724,0.000953,Topic 4,0.48
2,nice room experience hotel hotel level positiv...,3,0.519576,0.001139,0.072922,0.044589,0.360628,0.001146,Topic 0,0.52
3,unique great stay wonderful time location exce...,5,0.339264,0.129007,0.243405,0.002444,0.002442,0.283437,Topic 0,0.34
4,great stay great stay go seahawk game building...,5,0.084995,0.028271,0.001176,0.124227,0.490774,0.270558,Topic 4,0.49
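
The last two columns can be read directly off the topic weights: Dominant_Topic is the topic with the largest weight and Perc_Dominant_Topic is that weight rounded to two decimals. The sketch below reproduces them for the first two rows of the table above. It illustrates how the columns relate and is not PyCaret's own code; the `dominant_topic` helper is an assumption.

```python
def dominant_topic(weights):
    """Return (label, rounded weight) for the largest topic weight."""
    name = max(weights, key=weights.get)
    return name.replace("_", " "), round(weights[name], 2)


# Topic weights copied from rows 0 and 1 of the assign_model() output.
row_0 = {
    "Topic_0": 0.736896, "Topic_1": 0.002287, "Topic_2": 0.002322,
    "Topic_3": 0.002304, "Topic_4": 0.143942, "Topic_5": 0.112249,
}
row_1 = {
    "Topic_0": 0.387373, "Topic_1": 0.000947, "Topic_2": 0.134051,
    "Topic_3": 0.000952, "Topic_4": 0.475724, "Topic_5": 0.000953,
}

print(dominant_topic(row_0))  # ('Topic 0', 0.74)
print(dominant_topic(row_1))  # ('Topic 4', 0.48)
```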


In [15]:
plot_model()

**Top 100 Bigrams**

In [16]:

plot_model(plot = 'bigram')

**Frequency Distribution of Topic 5**

In [20]:
plot_model(lda2, plot = 'frequency', topic_num = 'Topic 5')

**Topic Distribution**

In [18]:
plot_model(lda2, plot = 'topic_distribution')

In [22]:
evaluate_model(lda2)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Frequency Plot', 'freque…

# Intrinsic Model Evaluation Method Using Coherence Value

**What is Intrinsic Evaluation Method?**

Intrinsic evaluation methods judge a model on its own output, without reference to a downstream task. For word embeddings this means checking how well they capture semantic relationships (the meaning of words) and syntactic relationships (grammar, such as plurals, tenses and comparatives). For topic models, a common intrinsic measure is topic coherence, which scores how strongly the top words of each topic are related to one another.

Hence, using tune_model() we will compute a topic coherence score by iterating over a pre-defined grid of topic counts and creating a model for each value. Topic coherence is then evaluated for each model and presented in a graph with the Coherence Score on the y-axis as a function of # Topics on the x-axis. You can view the results below:

**Note: This was the slowest part of the process, taking around 4+ hours to train and score the candidate models.**
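
For intuition about what a coherence score rewards, here is a small sketch of the UMass coherence measure in plain Python. UMass is used here only because it needs nothing beyond document co-occurrence counts; tune_model() may well use a different coherence measure internally, so treat this as an illustration of the concept rather than a reproduction of the 0.3898 reported below. The `umass_coherence` function and the toy documents are assumptions.

```python
import math


def umass_coherence(top_words, docs):
    """UMass coherence of one topic, given its top words in ranked order.

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over every pair with l < m,
    where D counts the documents that contain the given word(s).
    Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


# Made-up, already tokenized reviews.
docs = [
    ["room", "clean", "bed"],
    ["room", "bed", "comfortable"],
    ["breakfast", "coffee", "buffet"],
    ["breakfast", "buffet", "fresh"],
]

coherent = umass_coherence(["room", "bed", "clean"], docs)
mixed = umass_coherence(["room", "buffet", "coffee"], docs)
print(round(coherent, 3), round(mixed, 3))
```

A topic whose top words tend to appear in the same reviews scores higher than one that mixes unrelated words, which is the property the tuning step uses to compare models with different numbers of topics.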

In [27]:
tuned_unsupervised = tune_model(model = 'lda', multi_core = True)

IntProgress(value=0, description='Processing: ', max=25)

Output()

Best Model: Latent Dirichlet Allocation | # Topics: 200 | Coherence: 0.3898


In [28]:
evaluate_model(tuned_unsupervised)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Frequency Plot', 'freque…

In [30]:
print(tuned_unsupervised)

LdaModel(num_terms=32301, num_topics=200, decay=0.5, chunksize=100)


In [29]:
plot_model(tuned_unsupervised, plot = 'topic_distribution')

In [31]:
plot_model(tuned_unsupervised, plot = 'frequency', topic_num = 'Topic 70')

In [37]:
save_model(tuned_unsupervised,'Final Tuned LDA Model 07072021')

Model Succesfully Saved


(<gensim.models.ldamulticore.LdaMulticore at 0x7f95e70d8f10>,
 'Final Tuned LDA Model 07072021.pkl')

In [38]:
saved_lda = load_model('Final Tuned LDA Model 07072021')

Model Sucessfully Loaded


In [39]:
print(saved_lda)

LdaModel(num_terms=32301, num_topics=200, decay=0.5, chunksize=100)
