## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk

from tqdm import trange
from nltk import tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer


**Data handling:**
* `numpy`: Numerical operations and array manipulation.
* `pandas`: Data manipulation and analysis using DataFrames.

**Visualization:**
* `matplotlib.pyplot`: Plotting interface for creating static, interactive, and animated visualizations.
* `seaborn`: High-level interface on top of matplotlib for drawing statistical graphics.

**Text processing:**
* `string`: Constants and classes for string operations.
* `re`: Regular expressions for pattern matching and text processing.
* `nltk` (Natural Language Toolkit): A suite of libraries for natural language processing. Modules used here:
  * `nltk.tokenize`: Splits text into words or sentences.
  * `nltk.corpus.stopwords`: Lists of common stopwords in various languages.
  * `nltk.stem.WordNetLemmatizer`: Reduces words to their base (dictionary) form.
  * `nltk.probability.FreqDist`: Computes the frequency distribution of words or events.

**Utilities:**
* `tqdm.trange`: Adds a progress bar to loops.

**Data structures:**
* `collections.Counter`: Counts occurrences of elements in an iterable, useful for frequency analysis.

**Feature extraction:**
* `sklearn.feature_extraction.text.CountVectorizer`: Converts a collection of text documents to a matrix of token counts.

In [None]:
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages
nltk.download('omw-1.4', quiet=True)  # Open Multilingual WordNet data
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (17, 7)  # default figure size for all Matplotlib plots
plt.rcParams['font.size'] = 18  # default font size

## Loading the Data

In [None]:
data = pd.read_csv("tripadvisor_hotel_reviews.csv")
data.head(10)

Now that we have our data, we can begin with the EDA.<br>**But first**, we need to convert the 'Rating' column into binary labels.

In [None]:
data['Rating'].value_counts()  # frequency of each unique value in the Rating column

In [None]:
# rating 4, 5 => Positive; 1, 2, 3 => Negative
def ratings(rating):
    if rating>3 and rating<=5:
        return "Positive"
    if rating>0 and rating<=3:
        return "Negative"

In [None]:
data['Rating'] = data['Rating'].apply(ratings)  # map each numeric rating to "Positive" or "Negative"
rating_counts = data['Rating'].value_counts()
# Take the labels from the same Series as the counts so every slice gets the right label
plt.pie(rating_counts, labels=rating_counts.index, autopct='%1.1f%%')
plt.show()

## Exploratory Data Analysis

### Counts and Length:
Start by checking how long the reviews are:
* Character count
* Word count
* Mean word length
* Mean sentence length

In [None]:
length = len(data['Review'][0])  # first element (row) of the Review column
print(f'Length of a sample review: {length}')

Sample review (excerpt): "nice hotel expensive parking got good deal stayed sat night because attending event hotel clean comfortable would stay again bargain price parking good central location" (593 characters in full)

In [None]:
data['Length'] = data['Review'].str.len()
data.head(10)

#### **Word Count**: Number of words in a review

In [None]:
sample_words = data['Review'][0].split()  # split the first review on whitespace
print(f'Word count in a sample review: {len(sample_words)}')

In [None]:
def word_count(review):
    review_list = review.split()
    return len(review_list)

In [None]:
data['Word_count'] = data['Review'].apply(word_count)
data.head(10)

#### **Mean word length**: Average length of words

In [None]:
# Average length of the words in each review
data['mean_word_length'] = data['Review'].map(lambda rev: np.mean([len(word) for word in rev.split()]))
data.head(10)

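The mean word length is the total number of characters inside the words divided by the word count:

$$\text{Mean word length} = \frac{\sum_{w \in \text{words}} \text{len}(w)}{\text{Word count}}$$

For example, for the first review:

* Length of the review: 593 characters (including whitespace)
* Word count: 87
* Mean word length: ≈ 5.804598

Whitespace does not belong to any word, so the result is not simply 593 / 87 ≈ 6.82. The value 5.804598 corresponds to 505 characters inside words: 505 / 87 ≈ 5.804598.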
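
To make the calculation concrete, here is a minimal sketch on a made-up review (not taken from the dataset). The helper name `mean_word_length` is hypothetical; it mirrors the `np.mean` expression used in the cell above and shows that whitespace is excluded from the average.

```python
def mean_word_length(review: str) -> float:
    """Average number of characters per whitespace-separated word."""
    words = review.split()
    return sum(len(word) for word in words) / len(words)

# Made-up review with a double space and a trailing space
sample = "nice hotel  great staff "

print(len(sample))               # total characters, whitespace included
print(len(sample.split()))       # word count
print(mean_word_length(sample))  # only characters inside words are averaged
```

Dividing the raw character count by the word count would give a larger number, because it also counts the spaces between words.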

#### **Mean sentence length**: Average length of the sentences in the review

In [None]:
nltk.download('punkt_tab')  # sentence tokenizer data required by sent_tokenize

# Mean sentence length (in characters) of the first review
np.mean([len(sent) for sent in tokenize.sent_tokenize(data['Review'][0])])

* `tokenize.sent_tokenize(data['Review'][0])`: Splits the first review (`data['Review'][0]`) into individual sentences.
* `len(sent)`: Calculates the number of characters in each sentence.
* `[len(sent) for sent in ...]`: Creates a list of sentence lengths for the review.
* `np.mean(...)`: Calculates the mean (average) of the sentence lengths.

In [None]:
data['mean_sent_length'] = data['Review'].map(lambda rev: np.mean([len(sent) for sent in tokenize.sent_tokenize(rev)]))
data.head(10)

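$$\text{Mean sentence length} = \frac{\sum_{s \in \text{sentences}} \text{len}(s)}{\text{Number of sentences}}$$

For the first review, `sent_tokenize` finds a single sentence, so the mean is just the length of that sentence: 591.0. This is slightly below the full review length of 593, presumably because the tokenizer drops leading or trailing whitespace.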

As an illustration, consider two made-up reviews:

* Row 1:
  * Sentences: ["I love this product.", "It works well."]
  * Lengths: [20, 14]
  * Mean: (20 + 14) / 2 = 17.0
* Row 2:
  * Sentences: ["Not worth the price.", "Too expensive and low quality."]
  * Lengths: [20, 30]
  * Mean: (20 + 30) / 2 = 25.0

The `mean_sent_length` column will contain these averages for each review.
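
The two illustrative rows above can be checked with a short sketch. Note the assumption: to keep it runnable without downloading NLTK data, it uses a naive regex splitter as a simplified stand-in for `sent_tokenize`, so results may differ from NLTK on harder text (abbreviations, missing punctuation). The function name `mean_sentence_length` is hypothetical.

```python
import re

def mean_sentence_length(review: str) -> float:
    """Mean sentence length in characters.

    Assumption: splitting after '.', '!' or '?' followed by whitespace is a
    simplified stand-in for nltk.tokenize.sent_tokenize.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', review.strip()) if s]
    return sum(len(s) for s in sentences) / len(sentences)

print(mean_sentence_length("I love this product. It works well."))
print(mean_sentence_length("Not worth the price. Too expensive and low quality."))
```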

In [None]:
def visualize(col):
    # Left panel: box plot of the feature for each rating class
    plt.subplot(1, 2, 1)
    sns.boxplot(data=data, x='Rating', y=col)
    plt.ylabel(col, labelpad=12.5)

    # Right panel: kernel density estimate of the feature for each rating class
    plt.subplot(1, 2, 2)
    sns.kdeplot(data=data, x=col, hue='Rating')  # seaborn draws the hue legend itself
    plt.xlabel('')
    plt.ylabel('')

    plt.show()  # render one figure per feature instead of overlaying them


In [None]:
features = data.columns.tolist()[2:]
for feature in features:
    visualize(feature)

## Term Frequency Analysis
Examining the most frequently occurring words is one of the most common techniques in text analytics. For example, in a sentiment analysis problem, a positive text is likely to contain words such as 'good', 'great', and 'nice' more often than words that imply the opposite.

*Note*: Term frequencies go beyond counts and lengths, so the first step is to preprocess the text.
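
As a rough illustration of the idea, the sketch below counts terms in three made-up, already-cleaned reviews (they are not from the dataset) using `collections.Counter`.

```python
from collections import Counter

# Hypothetical, already-cleaned reviews used only for illustration
reviews = [
    "great hotel great staff",
    "good location nice staff",
    "great breakfast",
]

# Flatten all reviews into one list of words, then count each word
tokens = [word for review in reviews for word in review.split()]
term_freq = Counter(tokens)

print(term_freq.most_common(2))  # [('great', 3), ('staff', 2)]
```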

In [None]:
df = data.drop(features, axis=1)
df.head()

In [None]:
df.info()

There is no missing data, so we can move to the next stage. For term frequency analysis, the text must be preprocessed first:
* Lowercase
* Remove punctuation
* Stopword removal
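
The three steps can be sketched on a single made-up sentence. Assumptions: `STOPWORDS` here is a tiny hand-picked set standing in for NLTK's English stopword list, and `clean_text` is a hypothetical name chosen so it does not clash with the notebook's own function.

```python
import re

# Assumption: a small illustrative stopword set, not NLTK's full English list
STOPWORDS = {"the", "was", "and", "a", "is", "it"}

def clean_text(text: str) -> str:
    text = text.lower()                       # 1. lowercase
    text = re.sub(r"[^a-z0-9 -]+", "", text)  # 2. drop punctuation (keeps hyphens)
    # 3. remove stopwords
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(clean_text("The room was CLEAN, and the staff was friendly!"))
# room clean staff friendly
```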

In [None]:
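nltk.download('stopwords', quiet=True)
STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def clean(review):
    review = review.lower()  # lowercase
    review = re.sub(r'[^a-z0-9 -]+', '', review)  # keep letters, digits, spaces, hyphens
    # Drop stopwords without reloading the stopword list for every word
    review = " ".join(word for word in review.split() if word not in STOP_WORDS)
    return review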

In [None]:
import nltk
nltk.download('stopwords')  # stopword list used by clean()
# clean() lowercases the text, removes punctuation, and removes stopwords
df['Review'] = df['Review'].apply(clean)
df.head(10)

In [None]:
df['Review'][0]

In [None]:
def to_word_list(text):
    # Split a cleaned review into a list of words
    return text.split()

In [None]:
df['Review_lists'] = df['Review'].apply(to_word_list)
df.head(10)

In [None]:
# Concatenate the word lists of all reviews into one flat list
corpus = []
for i in trange(df.shape[0], ncols=150, nrows=10, colour='green', smoothing=0.8):
    corpus += df['Review_lists'][i]
len(corpus)  # total number of words across all reviews

In [None]:
mostCommon = Counter(corpus).most_common(10)
mostCommon

In [None]:
words = []
freq = []
for word, count in mostCommon:
    words.append(word)
    freq.append(count)

In [None]:
sns.barplot(x=freq, y=words)
plt.title('Top 10 Most Frequently Occurring Words')
plt.show()

## Most Frequently Occurring N-grams

**What is an N-gram?** <br>
An n-gram is a sequence of n words in a text. Most words by themselves do not carry the full context. Adverbs such as 'most' or 'very' typically modify verbs and adjectives, so n-grams help analyse phrases rather than isolated words, which can lead to better insights.
<br>
> A **Bi-gram** is two words in a sequence, for example 'Very good' or 'Too great'.<br>
> A **Tri-gram** is three words in a sequence. 'How was your day' would be broken down into 'How was your' and 'was your day'.<br>

For separating text into n-grams, we will use `CountVectorizer` from scikit-learn.
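
Before handing the work to `CountVectorizer`, the sliding-window idea can be illustrated in plain Python. This is only a sketch; the helper `ngrams` is hypothetical and is not used elsewhere in the notebook.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return all runs of n consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("How was your day", 2))  # ['How was', 'was your', 'your day']
print(ngrams("How was your day", 3))  # ['How was your', 'was your day']
```

Unlike this sketch, `CountVectorizer` lowercases the text and, with its default token pattern, ignores single-character tokens, so its n-grams can differ slightly.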

In [None]:
cv = CountVectorizer(ngram_range=(2,2))
bigrams = cv.fit_transform(df['Review'])

In [None]:
count_values = np.asarray(bigrams.sum(axis=0)).ravel()  # sum on the sparse matrix to avoid a huge dense array
ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv.vocabulary_.items()], reverse = True))
ngram_freq.columns = ["frequency", "ngram"]

In [None]:
sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])
plt.title('Top 10 Most Frequently Occurring Bigrams')
plt.show()

In [None]:
cv1 = CountVectorizer(ngram_range=(3,3))
trigrams = cv1.fit_transform(df['Review'])
count_values = np.asarray(trigrams.sum(axis=0)).ravel()  # sum on the sparse matrix to avoid a huge dense array
ngram_freq = pd.DataFrame(sorted([(count_values[i], k) for k, i in cv1.vocabulary_.items()], reverse = True))
ngram_freq.columns = ["frequency", "ngram"]

In [None]:
sns.barplot(x=ngram_freq['frequency'][:10], y=ngram_freq['ngram'][:10])
plt.title('Top 10 Most Frequently Occurring Trigrams')
plt.show()

<div class="alert alert-info" role="alert">
    <h2>But what about Word Clouds?</h2>

<p>
    While word clouds are very appealing, they don't provide much information. A word or two stand out clearly, but beyond that there is not a lot to examine. <b>A simple bar plot may not be as attractive as a word cloud, but it is surely more informative</b>, which is our ultimate goal. A word cloud may serve better as a cover to present your solution (which is why it's right on top), but it can hardly be the solution. Of course, this is my personal opinion, and word clouds should be used if they're absolutely needed. <br><br>
    What do you think? Let me know in the comments!</p>
</div>