# **Import Libraries, Packages and Data**

**Import Libraries & Packages**

In [None]:
import pandas as pd
#import numpy as np
from pprint import pprint

from nltk.tokenize import sent_tokenize, word_tokenize

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


**Mount Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Import Data**

In [None]:
path = '/content/drive/MyDrive/Colab Notebooks/Hotel_Analysis/Hotel_Datasets/515k_Hotel_Reviews_SENTIMENTS_40tokens.csv'
df = pd.read_csv(path)
df.head(5)

| | Review_Date | Hotel_Country | Hotel_Name | Sentiments | Reviewer_Nationality | Review | cleaned_Reviews |
|---|---|---|---|---|---|---|---|
| 0 | 2017-07-24 | Netherlands | Hotel Arena | Negative | Poland | Backyard of the hotel is total mess shouldn t... | backyard hotel total mess happen hotel star |
| 1 | 2017-07-17 | Netherlands | Hotel Arena | Negative | United Kingdom | Cleaner did not change our sheet and duvet ev... | cleaner change sheet duvet everyday made bed a... |
| 2 | 2017-07-17 | Netherlands | Hotel Arena | Negative | United Kingdom | Apart from the price for the brekfast Everyth... | apart price brekfast good |
| 3 | 2017-09-07 | Netherlands | Hotel Arena | Negative | Belgium | Even though the pictures show very clean room... | even though picture show clean room actual roo... |
| 4 | 2017-08-07 | Netherlands | Hotel Arena | Negative | Norway | The aircondition makes so much noise and its ... | aircondition make noise hard sleep night |


**Overview of Dataset**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545181 entries, 0 to 545180
Data columns (total 7 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   Review_Date           545181 non-null  object
 1   Hotel_Country         545181 non-null  object
 2   Hotel_Name            545181 non-null  object
 3   Sentiments            545181 non-null  object
 4   Reviewer_Nationality  545181 non-null  object
 5   Review                545181 non-null  object
 6   cleaned_Reviews       545181 non-null  object
dtypes: object(7)
memory usage: 29.1+ MB


**Extract only reviews text**

In [None]:
# Extract only reviews text
reviews = pd.DataFrame(df.cleaned_Reviews)

**First 5 rows of reviews text**

In [None]:
reviews.head()

| | cleaned_Reviews |
|---|---|
| 0 | backyard hotel total mess happen hotel star |
| 1 | cleaner change sheet duvet everyday made bed a... |
| 2 | apart price brekfast good |
| 3 | even though picture show clean room actual roo... |
| 4 | aircondition make noise hard sleep night |


### **Forming Bigrams & Trigrams**

1. We identify bigrams and trigrams so that each one can be concatenated and treated as a single token.

2. Bigrams are two-word phrases, e.g. ‘air conditioning’, whose words co-occur more often than their individual frequencies would suggest.

3. Likewise, trigrams are three-word phrases whose words tend to co-occur, e.g. ‘food and drinks’.

4. We use the Pointwise Mutual Information (PMI) score to identify significant bigrams and trigrams to concatenate.

5. We also filter by part-of-speech pattern, (noun/adj, noun) for bigrams and (noun/adj, any type, noun/adj) for trigrams, because these structures commonly indicate noun-type n-grams.

6. This helps the LDA model form more coherent topics.

***Reference:***
https://nicharuc.github.io/topic_modeling/
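
For reference, the PMI score mentioned in point 4 is conventionally defined as below. The count-based estimate on the right is the usual one, where $c(\cdot)$ is a frequency count and $N$ is the total number of words. These symbols are introduced here for illustration only, and NLTK's exact normalisation may differ in detail.

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \approx \log_2 \frac{c(w_1, w_2)\, N}{c(w_1)\, c(w_2)}
$$

A score of 0 means the two words co-occur exactly as often as chance would predict. Larger positive values indicate a stronger association.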

**Create bigrams and keep only those that occur at least 50 times**

In [None]:
# Create bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents([text.split() for text in reviews.cleaned_Reviews])

# Keep only bigrams that occur at least 50 times
finder.apply_freq_filter(50)
bigram_scores = finder.score_ngrams(bigram_measures.pmi)
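
The frequency filter matters because PMI tends to reward rare pairs: two words that each appear once, and appear together, score higher than a genuinely common phrase. The sketch below illustrates this in plain Python, assuming a simple count-based PMI. The helper `score_bigrams` and the toy documents are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def score_bigrams(documents, min_count=1):
    """Rank adjacent word pairs by count-based PMI, keeping pairs seen at least min_count times."""
    words, pairs = Counter(), Counter()
    for tokens in documents:
        words.update(tokens)                    # unigram counts
        pairs.update(zip(tokens, tokens[1:]))   # adjacent-pair counts within a document
    total = sum(words.values())
    scored = {
        pair: math.log2(count * total / (words[pair[0]] * words[pair[1]]))
        for pair, count in pairs.items()
        if count >= min_count
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Toy corpus: one common phrase and one misspelt pair that occurs only once
toy_docs = [["front", "desk", "staff"]] * 5 + [["brekfast", "bufet", "staff"]]

unfiltered = score_bigrams(toy_docs)
filtered = score_bigrams(toy_docs, min_count=2)

print(unfiltered[0][0])  # ('brekfast', 'bufet'): the one-off pair tops the ranking
print(filtered[0][0])    # ('front', 'desk'): the common phrase wins once rare pairs are dropped
```

The notebook's threshold of 50 plays the same role as `min_count` here, on a much larger corpus.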

**Create trigrams and keep only those that occur at least 50 times**

In [None]:
# Create trigrams
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_documents([text.split() for text in reviews.cleaned_Reviews])

# Keep only trigrams that occur at least 50 times
finder.apply_freq_filter(50)
trigram_scores = finder.score_ngrams(trigram_measures.pmi)

**Store bigram scores into dataframe**

In [None]:
# store bigram scores into dataframe
bigram_pmi = pd.DataFrame(bigram_scores)
bigram_pmi.columns = ['bigram', 'pmi']
bigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)

**Store trigram scores into dataframe**

In [None]:
# store trigram scores into dataframe
trigram_pmi = pd.DataFrame(trigram_scores)
trigram_pmi.columns = ['trigram', 'pmi']
trigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)

**Filter for bigrams with only noun-type structure**

(noun/adj, noun)

In [None]:
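# Filter for bigrams with a noun-type structure
def bigram_filter(bigram):
    tag = nltk.pos_tag(bigram)
    # As written, this rejects a bigram only when BOTH tests fail (the first
    # word is not an adjective/noun AND the second word is not a noun).
    # Use `or` instead of `and` to require the full (noun/adj, noun) pattern.
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']:
        return False
    # Drop bigrams containing the stray single-letter tokens 'n' or 't'
    if 'n' in bigram or 't' in bigram:
        return False
    # Drop bigrams containing a literal 'PRON' token
    if 'PRON' in bigram:
        return False
    return True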
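
The first check in `bigram_filter` combines its two tests with `and`, so it is more permissive than the (noun/adj, noun) pattern suggests. The sketch below compares that behaviour with a stricter `or` variant. It uses hand-assigned tags rather than calling the tagger, so the tags are assumptions and a real tagger may label these words differently. The names `rejects_with_and` and `rejects_with_or` are hypothetical.

```python
def rejects_with_and(tags):
    """Mirror of the check as written: reject only if both positions fail."""
    return tags[0] not in ("JJ", "NN") and tags[1] not in ("NN",)

def rejects_with_or(tags):
    """Stricter variant: reject if either position breaks (noun/adj, noun)."""
    return tags[0] not in ("JJ", "NN") or tags[1] not in ("NN",)

# Hand-assigned Penn Treebank tags, for illustration only
examples = {
    ("memory", "foam"): ("NN", "NN"),
    ("alarm", "went"): ("NN", "VBD"),
    ("went", "room"): ("VBD", "NN"),
    ("went", "quickly"): ("VBD", "RB"),
}

kept_and = [bigram for bigram, tags in examples.items() if not rejects_with_and(tags)]
kept_or = [bigram for bigram, tags in examples.items() if not rejects_with_or(tags)]

print(kept_and)  # [('memory', 'foam'), ('alarm', 'went'), ('went', 'room')]
print(kept_or)   # [('memory', 'foam')]
```

Whether the permissive or the strict version is preferable depends on how many non-noun phrases are acceptable in the final n-gram list.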

**Filter for trigrams with only noun-type structure**

(noun/adj,all types,noun/adj)

In [None]:
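# Filter for trigrams with a noun-type structure
def trigram_filter(trigram):
    tag = nltk.pos_tag(trigram)
    # As written, this tests the first and second words and rejects only when
    # BOTH fail. To require (noun/adj, any type, noun/adj), test tag[0] and
    # tag[2] and combine the two tests with `or`.
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['JJ','NN']:
        return False
    # Drop trigrams containing the stray single-letter tokens 'n' or 't'
    if 'n' in trigram or 't' in trigram:
        return False
    # Drop trigrams containing a literal 'PRON' token
    if 'PRON' in trigram:
        return False
    return True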

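**Set PMI threshold**

Keep the n-grams that pass the structure filter and have a PMI score above 5, then take the 500 with the highest PMI.

In [None]:
# keep n-grams that pass the filter and have PMI > 5
# the dataframes are already sorted by PMI, so [:500] takes the top 500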
filtered_bigram = bigram_pmi[bigram_pmi.apply(lambda bigram:\
                                              bigram_filter(bigram['bigram'])\
                                              and bigram.pmi > 5, axis = 1)][:500]

filtered_trigram = trigram_pmi[trigram_pmi.apply(lambda trigram: \
                                                 trigram_filter(trigram['trigram'])\
                                                 and trigram.pmi > 5, axis = 1)][:500]


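# Join each n-gram into a single string; `and` binds tighter than `or`,
# which the parentheses in the trigram condition make explicit
bigrams = [' '.join(x) for x in filtered_bigram.bigram.values if len(x[0]) > 2 or len(x[1]) > 2]
trigrams = [' '.join(x) for x in filtered_trigram.trigram.values if len(x[0]) > 2 or (len(x[1]) > 2 and len(x[2]) > 2)]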



**Examples of bigrams**

In [None]:
# examples of bigrams
bigrams[:20]

['moulin rouge',
 'ziggo dome',
 'h tel',
 'shepherd bush',
 'hustle bustle',
 'du nord',
 'hash brown',
 'wear tear',
 'boarding pass',
 'memory foam',
 'lancaster gate',
 'sagrada familia',
 'marble arch',
 'stone throw',
 'dressing gown',
 'hammersmith apollo',
 'westminster abbey',
 'winter wonderland',
 'lick paint',
 'body lotion']

**Examples of trigrams**

In [None]:
# examples of trigrams
trigrams[:20]

['arc de triomphe',
 'gare de lyon',
 'champ lys e',
 'royal albert hall',
 'st paul cathedral',
 'iron ironing board',
 'gare de l',
 'stone throw away',
 'lancaster gate tube',
 'cross st pancras',
 'hop hop bus',
 'full length mirror',
 'new year eve',
 'fresh orange juice',
 'red light district',
 'earl court tube',
 'fire alarm went',
 'single pushed together',
 'earl court underground',
 'flat screen tv']

**Concatenate n-grams**

In [None]:
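# Concatenate n-grams by joining their words with underscores
def replace_ngram(x):
    # Trigrams go first so that a bigram cannot break up a longer match
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x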
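
One thing to be aware of is that `str.replace` matches substrings, so an n-gram can also match across a word boundary inside a longer word. The sketch below contrasts plain replacement with a whitespace-bounded regular expression. The sample n-gram lists and both helper names are hypothetical, standing in for the PMI-derived lists above.

```python
import re

# Hypothetical n-gram lists, for illustration only
sample_trigrams = ["flat screen tv"]
sample_bigrams = ["flat screen", "bar area"]

def replace_ngram_plain(text):
    """Substring replacement, trigrams first (same approach as replace_ngram)."""
    for gram in sample_trigrams + sample_bigrams:
        text = text.replace(gram, "_".join(gram.split()))
    return text

def replace_ngram_bounded(text):
    """Replace an n-gram only when it starts and ends at whitespace or the string edge."""
    for gram in sample_trigrams + sample_bigrams:
        pattern = r"(?<!\S)" + re.escape(gram) + r"(?!\S)"
        text = re.sub(pattern, "_".join(gram.split()), text)
    return text

review = "flat screen tv broken minibar area small"
print(replace_ngram_plain(review))    # flat_screen_tv broken minibar_area small
print(replace_ngram_bounded(review))  # flat_screen_tv broken minibar area small
```

The bounded version is slower on a large corpus, so it is a trade-off between speed and avoiding accidental merges.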

**Copy the reviews dataframe**

In [None]:
# copy the reviews dataframe so the original stays unchanged
reviews_w_ngrams = reviews.copy()

**Apply ngrams function**

In [None]:
# apply ngrams function
reviews_w_ngrams.cleaned_Reviews = reviews_w_ngrams.cleaned_Reviews.map(replace_ngram)

**Form tokens**

In [None]:
# form tokens, dropping words of two characters or fewer
reviews_w_ngrams = reviews_w_ngrams.cleaned_Reviews.map(
    lambda x: [word for word in x.split() if len(word) > 2])

**First 5 rows of Ngrams data**

In [None]:
reviews_w_ngrams.head()

0    [backyard, hotel, total, mess, happen, hotel, ...
1    [cleaner, change, sheet, duvet, everyday, made...
2                       [apart, price, brekfast, good]
3    [even, though, picture, show, clean, room, act...
4      [aircondition, make, noise, hard, sleep, night]
Name: cleaned_Reviews, dtype: object

### **Filter for only Nouns**
1. Nouns are usually the strongest indicators of a topic.

2. For example, in the sentence ‘The bed is comfortable’, the topic is ‘bed’.

3. The other words in the sentence add context and detail about the topic (‘bed’).

4. Keeping only the nouns therefore leaves the words that are most interpretable in the topic model.

**Function to filter for only nouns**

In [None]:
# Filter for only nouns
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    # keep only tokens tagged 'NN' (singular or mass noun)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    # to keep both nouns and verbs, use this line instead:
    #filtered = [word[0] for word in pos_comment if word[1] in ['NN','VB', 'VBD', 'VBG', 'VBN', 'VBZ']]
    return filtered
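
Because `noun_only` keeps only the `NN` tag, plural and proper nouns (`NNS`, `NNP`, `NNPS`) are dropped if the tagger assigns them. The sketch below shows the difference using a hand-tagged list in place of `nltk.pos_tag` output. The tags are assumptions for illustration, and `keep_by_tag` is a hypothetical helper.

```python
# The four Penn Treebank noun tags
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_by_tag(tagged_tokens, allowed_tags):
    """Return the words whose part-of-speech tag is in allowed_tags."""
    return [word for word, tag in tagged_tokens if tag in allowed_tags]

# Hand-tagged tokens standing in for nltk.pos_tag output
tagged = [("rooms", "NNS"), ("clean", "JJ"), ("staff", "NN"), ("london", "NNP")]

singular_only = keep_by_tag(tagged, {"NN"})
all_nouns = keep_by_tag(tagged, NOUN_TAGS)

print(singular_only)  # ['staff']
print(all_nouns)      # ['rooms', 'staff', 'london']
```

On text that has already been lower-cased and lemmatised, as the cleaned reviews appear to be, the narrower `NN` filter may lose little in practice.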

**Apply filter for nouns function**

In [None]:
# apply filter for nouns function
final_reviews = reviews_w_ngrams.map(noun_only)

**Dataset after extracting Nouns only**

In [None]:
final_reviews

0                                [hotel, mess, hotel, star]
1         [change, sheet, duvet, everyday, bed, floor, b...
2                                         [price, brekfast]
3         [picture, room, room, quit, dirty, clock, room...
4                              [aircondition, sleep, night]
                                ...                        
545176                                          [breakfast]
545177                                 [staff, check, time]
545178                                          [breakfast]
545179                 [room, family, member, comfy, space]
545180                                        [staff, kind]
Name: cleaned_Reviews, Length: 545181, dtype: object

### **Output Bigrams & Trigrams Dataset for Topic Modelling**

In [None]:
# Convert to dataframe
reviews_w_ngrams = pd.DataFrame(reviews_w_ngrams)

In [None]:
# output as csv file
output_loc = '/content/drive/MyDrive/Colab Notebooks/Hotel_Analysis/Hotel_Datasets/515k_Hotel_Reviews_NGRAMS_TOKENS.csv'
reviews_w_ngrams.to_csv(output_loc, index = False)

### **Output Bigrams & Trigrams + Nouns-only Dataset for Topic Modelling**

In [None]:
# convert to dataframe
reviews_final = pd.DataFrame(final_reviews)

In [None]:
#output as csv
output_loc = '/content/drive/MyDrive/Colab Notebooks/Hotel_Analysis/Hotel_Datasets/515k_Hotel_Reviews_NOUNS_TOKENS.csv'
reviews_final.to_csv(output_loc, index = False)
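
Note that a CSV cell can only hold text, so each token list is saved as its string representation and comes back as a string when the file is read. The sketch below shows a round trip with the standard library, using an in-memory buffer as a stand-in for the file on Drive. The sample token lists are made up for illustration.

```python
import ast
import csv
import io

token_lists = [["hotel", "mess", "star"], ["price", "brekfast"]]

# Write: each list is stored as its string form, e.g. "['hotel', 'mess', 'star']"
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cleaned_Reviews"])
for tokens in token_lists:
    writer.writerow([str(tokens)])

# Read: the cell comes back as a plain string, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["cleaned_Reviews"]

# ast.literal_eval safely parses the string back into a Python list
restored = [ast.literal_eval(row["cleaned_Reviews"]) for row in rows]

print(type(raw).__name__)  # str
print(restored[0])         # ['hotel', 'mess', 'star']
```

With pandas, the same parsing can be done at load time by passing `converters={'cleaned_Reviews': ast.literal_eval}` to `pd.read_csv`.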