# Topic Extraction on Bokan & Competitors

## Goals
- Extract meaningful topics from 5-star reviews of key competitors -> word cloud for each topic -> ideas for rebranding
- Extract topics from Bokan's bad reviews -> word cloud for each topic -> points to focus on when rebranding
- Extract topics from Bokan's good reviews -> word cloud for each topic -> points to put forward when advertising

## Outline
- Import Libraries
- Load and preprocess the data
- Topic Extraction
    - Topic extraction from competitors' good reviews using LDA with gensim
    - Topic extraction from Bokan's bad reviews using GSDMM & conclusions
    - Topic extraction from Bokan's good reviews using GSDMM & conclusions

## Import libraries 

In [1]:
#!git clone https://github.com/rwalk/gsdmm
#!pip install transformers==2.4.1
#!pip install flair

In [2]:
import os
import numpy as np
import pandas as pd
import pickle
import gensim
import gensim.corpora as corpora
import pyLDAvis.gensim
import pyLDAvis
from gsdmm.gsdmm import MovieGroupProcess
from gensim.models.coherencemodel import CoherenceModel
import operator
from tqdm.notebook import tqdm
import flair
import nltk
from nltk import word_tokenize
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords 
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import seaborn as sns
import matplotlib.pyplot as plt



In [3]:
nltk.download('vader_lexicon')
flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

2020-03-14 22:24:20,218 loading file /Users/Abderrahmane/.flair/models/imdb-v0.4.pt


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/Abderrahmane/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Load and preprocess the data

In [143]:
from data import jl_to_df

In [144]:
bokan_37 = jl_to_df.read_jl_file("data/Bokan_37.jl")
bokan_38_39 = jl_to_df.read_jl_file("data/Bokan_38_39.jl")
boisdale = jl_to_df.read_jl_file("data/Boisdale.jl")
cinnamon = jl_to_df.read_jl_file("data/Cinnamon.jl")
ivy = jl_to_df.read_jl_file("data/Ivy.jl")
peninsula = jl_to_df.read_jl_file("data/Peninsula.jl")

In [147]:
# Put all the reviews of bokan and competitors in two dataframes
document_competitors = pd.concat([boisdale, cinnamon, ivy, peninsula], axis = 0)
document_bokan = pd.concat([bokan_37, bokan_38_39], axis = 0)
document_competitors.rating = document_competitors.rating.astype(int)
document_bokan.rating = document_bokan.rating.astype(int)

In [150]:
# Check that the dataframes were created correctly
print(len(document_competitors) == sum(map(len, [boisdale, cinnamon, ivy, peninsula])))
print(len(document_bokan) == sum(map(len, [bokan_37, bokan_38_39])))

True
True


In [151]:
document_competitors.head()

Unnamed: 0,id_resto,id_comment,resto,resto_url,rating,title,diner_date,rating_date,answer_text,reviewer_pseudo,reviewer_origin,reviewer_info_sup,other_ratings_category,other_ratings_value,url,content
0,g186338-d2180904,g186338-d2180904-r747618087,Boisdale_Canary_Wharf,/Restaurant_Review-g186338-d2180904-Reviews-Bo...,4,Fabulous ambience and delightful service from ...,February 2020,25 February 2020,,Hopesprings1,"[Sunningdale, United Kingdom]","[[pencil-paper, 2]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,"[Fabulous food, service and old school live Ja..."
1,g186338-d2180904,g186338-d2180904-r744514687,Boisdale_Canary_Wharf,/Restaurant_Review-g186338-d2180904-Reviews-Bo...,4,Upscale McDonalds,February 2020,12 February 2020,,futtock21,"[London, United Kingdom]","[[pencil-paper, 2802], [thumbs-up-fill, 1252]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[Boisdale Canary Wharf is part of a coterie of...
2,g186338-d2180904,g186338-d2180904-r743436725,Boisdale_Canary_Wharf,/Restaurant_Review-g186338-d2180904-Reviews-Bo...,4,Good night out!,January 2020,6 February 2020,,Jan S,"[Hornchurch, United Kingdom]","[[pencil-paper, 94], [thumbs-up-fill, 54]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[My first visit to Boisdale. I visited on a Fr...
3,g186338-d2180904,g186338-d2180904-r743081178,Boisdale_Canary_Wharf,/Restaurant_Review-g186338-d2180904-Reviews-Bo...,5,"Very good, will return",February 2020,4 February 2020,,tripbiscuit,"[London, UK]","[[pencil-paper, 113], [thumbs-up-fill, 83]]","[Value, Service, Food]","[50, 50, 50]",https://www.tripadvisor.co.uk/ShowUserReviews-...,[I took advantage of a groupon offer as I was ...
4,g186338-d2180904,g186338-d2180904-r744537071,Boisdale_Canary_Wharf,/Restaurant_Review-g186338-d2180904-Reviews-Bo...,3,Boisdale,February 2020,12 February 2020,,grahamm586,[],"[[pencil-paper, 3]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[Not very busy when we visited. Service pretty...


In [155]:
# We are only interested in 5 star ratings from competitors
document_competitors = document_competitors.loc[document_competitors.rating == 5, :]

# And we split Bokan's reviews into very good (5 stars) and very bad (<= 2 stars)
document_bokan_good = document_bokan.loc[document_bokan.rating == 5, :]
document_bokan_bad = document_bokan.loc[document_bokan.rating <= 2, :]

In [156]:
document_bokan_bad.head()

Unnamed: 0,id_resto,id_comment,resto,resto_url,rating,title,diner_date,rating_date,answer_text,reviewer_pseudo,reviewer_origin,reviewer_info_sup,other_ratings_category,other_ratings_value,url,content
2,g186338-d12156905,g186338-d12156905-r741474592,Bokan_37_Restaurant,/Restaurant_Review-g186338-d12156905-Reviews-B...,2,Much better out there......,January 2020,27 January 2020,[Thank you Kevin T for sharing your experience...,satanbug,"[Auckland, New Zealand]","[[pencil-paper, 528], [thumbs-up-fill, 170]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[We were on site so decided to visit. We share...
5,g186338-d12156905,g186338-d12156905-r741502709,Bokan_37_Restaurant,/Restaurant_Review-g186338-d12156905-Reviews-B...,2,Disappointed,January 2020,27 January 2020,[Thank you Elena T for your review. I have spo...,elenatX4465YE,"[London, United Kingdom]","[[pencil-paper, 6], [thumbs-up-fill, 5]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[I dined here with friends on a Saturday night...
17,g186338-d12156905,g186338-d12156905-r734982490,Bokan_37_Restaurant,/Restaurant_Review-g186338-d12156905-Reviews-B...,1,Christmas Disasters,December 2019,27 December 2019,[Thank you for your feedback Yvonne G. I am sl...,Yvonne G,"[Croydon, United Kingdom]","[[pencil-paper, 72], [thumbs-up-fill, 72]]","[Value, Service, Food]","[10, 10, 10]",https://www.tripadvisor.co.uk/ShowUserReviews-...,[I booked Bokan Christmas dinner because we we...
18,g186338-d12156905,g186338-d12156905-r709002464,Bokan_37_Restaurant,/Restaurant_Review-g186338-d12156905-Reviews-B...,2,"Ok food, terrible service",September 2019,12 September 2019,"[Thank you Claire, I am so sorry about this an...",ClaireCur,[],"[[pencil-paper, 1], [thumbs-up-fill, 1]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,[My husband and I were staying at the Novotel ...
23,g186338-d12156905,g186338-d12156905-r708642279,Bokan_37_Restaurant,/Restaurant_Review-g186338-d12156905-Reviews-B...,1,"Good view, bad service",September 2019,10 September 2019,[Thank you for your review. I am so sad and di...,danielcO2200BQ,[],"[[pencil-paper, 1], [thumbs-up-fill, 2]]",[],[],https://www.tripadvisor.co.uk/ShowUserReviews-...,"[We had really bad service, change of courses ..."


In [159]:
print("Bokan bad: " ,len(document_bokan_bad))
print("Bokan good: " ,len(document_bokan_good))
print("Competitors: " ,len(document_competitors))

Bokan bad:  70
Bokan good:  320
Competitors:  969


In [160]:
# These are all the 5 star reviews from competitors 
content_competitors = document_competitors.content
# And these are good and bad reviews for bokan
content_bokan_good = document_bokan_good.content
content_bokan_bad = document_bokan_bad.content

In [161]:
content_bokan_bad.head()

2     [We were on site so decided to visit. We share...
5     [I dined here with friends on a Saturday night...
17    [I booked Bokan Christmas dinner because we we...
18    [My husband and I were staying at the Novotel ...
23    [We had really bad service, change of courses ...
Name: content, dtype: object

In [165]:
content_competitors.index = range(0, len(content_competitors))
content_bokan_good.index = range(0, len(content_bokan_good))
content_bokan_bad.index = range(0, len(content_bokan_bad))

In [167]:
# Each review is stored as a list of text fragments; join them into a single string
content_competitors = content_competitors.apply(lambda x: "".join(x))
content_bokan_good = content_bokan_good.apply(lambda x: "".join(x))
content_bokan_bad = content_bokan_bad.apply(lambda x: "".join(x))

In [169]:
content_bokan_good

0      Highly recommend the food! The most amazing lo...
1      Whaou It’s really really good, We’re passionne...
2      My brother and I went to Bokan for dinner last...
3      We went to dinner with my partner in this rest...
4      Took my friend for her birthday it was one of ...
                             ...                        
315    We are staying at the hotel, had lunch here. O...
316    I went on a walk on Sunday evening and was not...
317    Spotted this place whilst staying at neighbour...
318    Both bars very nice, decor lovely, views amazi...
319    The atmosphere here was lovely. The view was t...
Name: content, Length: 320, dtype: object

In [170]:
# Split reviews into sentences
content_competitors = content_competitors.apply(lambda x: x.split(". "))
content_bokan_good = content_bokan_good.apply(lambda x: x.split(". "))
content_bokan_bad = content_bokan_bad.apply(lambda x: x.split(". "))
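
Splitting on `". "` only catches full stops followed by a space, so a sentence ending in `!` or `?` stays attached to the next one (for example "Highly recommend the food! The most amazing lo..."). A minimal sketch of a regex-based alternative is shown here; `split_sentences` is a hypothetical helper that is not part of the notebook, and NLTK's `sent_tokenize` would be another option. Unlike `split(". ")`, it keeps the end punctuation, which the cleaning step removes later anyway.

```python
import re

# Hypothetical helper: split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def split_sentences(text):
    """Return the non-empty sentences of `text`, keeping their end punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]

review = "Highly recommend the food! The most amazing location. Would you go back? Yes"
print(split_sentences(review))
```

It could be applied in place of the lambda above, e.g. `content_bokan_bad.apply(split_sentences)`.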

In [171]:
# Put all sentences in a big list for each df
l_competitors = []
l_bokan_good = []
l_bokan_bad = []
for i in content_competitors:
    l_competitors += i
for i in content_bokan_good:
    l_bokan_good += i
for i in content_bokan_bad:
    l_bokan_bad += i

In [213]:
# Turn the lists into dataframes with one sentence per row
data_competitors = pd.DataFrame(l_competitors, columns=["content"])
print(data_competitors.head())
print("\n")
data_bokan_good = pd.DataFrame(l_bokan_good, columns=["content"])
print(data_bokan_good.head())
print("\n")
data_bokan_bad = pd.DataFrame(l_bokan_bad, columns=["content"])
print(data_bokan_bad.head())
data_bokan_bad.to_csv(r'bokan_bad.csv')

                                             content
0  I took advantage of a groupon offer as I was m...
1  We've both passed it many times on the way to ...
2  Anyway, it's a very nice space, upmarket brass...
3  It looks as if they can cater for private part...
4  Service was charming and attentive, the wine l...


                                             content
0  Highly recommend the food! The most amazing lo...
1  Whaou It’s really really good, We’re passionne...
2  My brother and I went to Bokan for dinner last...
3                        Then the service, brilliant
4  Professional, precise, very knowledgeable and ...


                                             content
0                We were on site so decided to visit
1  We shared a few starters and a main with a cou...
2  We arrived as the restaurant opened and were t...
3  Hopeless...The starters were good, we enjoyed ...
4  For close to £ 300 pounds you would expect muc...


In [180]:
# Cleaning, tokenization & stemming functions from the previous notebooks
def basic_cleaning(series):
    # Remove punctuation (raw string avoids the invalid-escape warning;
    # regex=True is needed with recent pandas versions)
    new_series = series.str.replace(r'[^\w\s]', '', regex=True)
    # Strip leading and trailing spaces
    new_series = new_series.str.strip(" ")
    # Lowercase the text
    new_series = new_series.apply(lambda x: str(x).lower())
    return new_series

# Build the stop word set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
# Add personalised stop words
STOP_WORDS |= {"london", "food", "drink", "restaurant"}

def tokenize_filter(sentence):
    # Tokenize the sentence, then drop the stop words
    word_tokens = word_tokenize(sentence)
    filtered_sentence = [w for w in word_tokens if w not in STOP_WORDS]
    return (word_tokens, filtered_sentence)

def stem_review(tokens):
    porter = PorterStemmer()
    return tokens.apply(lambda words: [porter.stem(w) for w in words])

def lemmatize_review(tokens):
    lemmatizer = WordNetLemmatizer()
    return tokens.apply(lambda words: [lemmatizer.lemmatize(w) for w in words])

def preprocess_data(data):
    # Work on a copy so the input dataframe is not modified
    df = data.copy()
    df["cleaned_content"] = basic_cleaning(df["content"])
    df["tokenized_content"] = df["cleaned_content"].apply(lambda x: tokenize_filter(x)[1])
    df["clean_content"] = lemmatize_review(df["tokenized_content"])
    # Return a copy so columns can be added later without a SettingWithCopyWarning
    return df[["clean_content"]].copy()

df_competitors = preprocess_data(data_competitors)
df_bokan_good = preprocess_data(data_bokan_good)
df_bokan_bad = preprocess_data(data_bokan_bad)



In [211]:
print(df_competitors.head())
print("\n")
print(df_bokan_good.head())
print("\n")
print(df_bokan_bad.head())

                                       clean_content
0  [took, advantage, groupon, offer, meeting, fri...
1  [weve, passed, many, time, way, somewhere, els...
2  [anyway, nice, space, upmarket, brasseriestyle...
3       [look, cater, private, party, larger, group]
4  [service, charming, attentive, wine, list, go,...


                                       clean_content
0  [highly, recommend, amazing, location, best, c...
1  [whaou, really, really, good, passionned, good...
2  [brother, went, bokan, dinner, last, week, ill...
3                               [service, brilliant]
4  [professional, precise, knowledgeable, lovely,...


                                       clean_content  nb_token
0                             [site, decided, visit]         3
1        [shared, starter, main, couple, cold, beer]         6
2  [arrived, opened, told, two, lobster, left, fa...         7
3  [hopelessthe, starter, good, enjoyed, main, fi...        11
4  [close, 300, pound, would, expect, much, m
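
The token `hopelessthe` in the output above comes from deleting punctuation without leaving a space, which fuses `Hopeless...The` into one word. A minimal sketch of a variant that replaces punctuation with a space instead is shown here; `clean_text` is a hypothetical stand-in for `basic_cleaning` applied to a single string. The trade-off is that contractions such as `we've` become two tokens (`we`, `ve`).

```python
import re

def clean_text(text):
    """Lowercase `text`, turning punctuation into spaces so words stay separated."""
    no_punct = re.sub(r'[^\w\s]', ' ', text)
    # Collapse the runs of whitespace created by the substitution
    return re.sub(r'\s+', ' ', no_punct).strip().lower()

print(clean_text("Hopeless...The starters were good"))
```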

## Topic extraction 

### Topic extraction from competitors good reviews using LDA with gensim 

In [359]:
# Dictionary
tokens = df_competitors.clean_content
dictionary = gensim.corpora.Dictionary(tokens)
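# Filter out tokens in the dictionary by their document frequency.
# Note: no_below is an absolute number of documents and no_above a fraction of the corpus,
# so no_below=0.05 does not remove any rare token.
dictionary.filter_extremes(no_below=0.05, no_above=0.9)
# doc2bow: convert each document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples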
corpus = [dictionary.doc2bow(tok) for tok in tokens]
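
In gensim, `no_below` is an absolute number of documents while `no_above` is a fraction of the corpus. The sketch below mimics that rule in plain Python to show what each bound does on a toy corpus; `filter_by_doc_frequency` is a hypothetical helper, not gensim's implementation, and it ignores the `keep_n` cap.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents (absolute count)
    and in at most `no_above` of all documents (fraction)."""
    # Count each token once per document
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [["view", "staff"], ["view", "wine"], ["view", "staff"], ["view", "menu"]]
# A fractional no_below keeps every rare token; only "view" (in all documents) is dropped
print(sorted(filter_by_doc_frequency(toy_docs, no_below=0.05, no_above=0.9)))
# An integer no_below of 2 also drops tokens seen in a single document
print(sorted(filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.9)))
```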

In [360]:
ldaModel = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=3, 
                                           random_state=41,
                                           alpha=0.1,
                                           eta=0.1,
                                           per_word_topics=True)

In [361]:
for i,topic in ldaModel.show_topics(formatted=True, num_topics=3, num_words=50):
    print(str(i)+": "+ topic+"\n")

0: 0.025*"staff" + 0.021*"wine" + 0.019*"service" + 0.016*"menu" + 0.015*"would" + 0.011*"friendly" + 0.011*"definitely" + 0.010*"back" + 0.010*"excellent" + 0.010*"tasting" + 0.010*"well" + 0.009*"recommend" + 0.009*"course" + 0.008*"hotel" + 0.008*"night" + 0.008*"meal" + 0.008*"go" + 0.008*"delicious" + 0.008*"really" + 0.008*"presented" + 0.008*"attentive" + 0.006*"dish" + 0.006*"good" + 0.006*"best" + 0.006*"lovely" + 0.006*"chef" + 0.006*"enjoyed" + 0.006*"highly" + 0.006*"view" + 0.005*"peninsula" + 0.005*"taste" + 0.005*"o2" + 0.005*"amazing" + 0.005*"great" + 0.005*"evening" + 0.005*"professional" + 0.004*"nice" + 0.004*"time" + 0.004*"u" + 0.004*"special" + 0.004*"tasty" + 0.004*"star" + 0.004*"helpful" + 0.004*"visit" + 0.004*"michelin" + 0.004*"worth" + 0.004*"even" + 0.003*"experience" + 0.003*"say" + 0.003*"dinner"

1: 0.016*"great" + 0.012*"place" + 0.011*"birthday" + 0.010*"well" + 0.009*"staff" + 0.009*"nice" + 0.009*"friend" + 0.008*"service" + 0.008*"wine" + 0.008*"e

In [362]:
cm = CoherenceModel(model=ldaModel, corpus=corpus, texts=tokens ,coherence="c_v")
cm.get_coherence()

0.30473726857285466

In [363]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldaModel, corpus, dictionary)
vis

**Conclusion for topic extraction on competitors' 5-star reviews**
- The topics are not clearly separated (c_v coherence of about 0.30)
- Further manual analysis is required to find the features that competitors have and that we are missing

### Topic extraction from bokan bad reviews using GSDMM 

In [324]:
# Dictionary
tokens = df_bokan_bad.clean_content
dictionary = gensim.corpora.Dictionary(tokens)
# Filter out tokens in the dictionary by their frequency
dictionary.filter_extremes(no_below=0.05, no_above=0.9)
# doc2bow: Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples
corpus = [dictionary.doc2bow(tok) for tok in tokens]

In [325]:
df_bokan_bad['nb_token'] = list(map(len, df_bokan_bad['clean_content']))
docs = df_bokan_bad.clean_content.to_list()
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)

In [326]:
nb_topic = 3
alpha = 0.1
beta = 0.1

mgpModel = MovieGroupProcess(K=nb_topic, alpha=alpha, beta=beta, n_iters=20)
mgpModelFit = mgpModel.fit(tokens, n_terms)

# Save model
#with open(f'model/gsdmm_model.pkl', 'wb') as f:
#    pickle.dump(mgpModel, f)
#    f.close()

In stage 0: transferred 338 clusters with 3 clusters populated
In stage 1: transferred 189 clusters with 3 clusters populated
In stage 2: transferred 117 clusters with 3 clusters populated
In stage 3: transferred 108 clusters with 3 clusters populated
In stage 4: transferred 99 clusters with 3 clusters populated
In stage 5: transferred 96 clusters with 3 clusters populated
In stage 6: transferred 111 clusters with 3 clusters populated
In stage 7: transferred 103 clusters with 3 clusters populated
In stage 8: transferred 107 clusters with 3 clusters populated
In stage 9: transferred 89 clusters with 3 clusters populated
In stage 10: transferred 78 clusters with 3 clusters populated
In stage 11: transferred 90 clusters with 3 clusters populated
In stage 12: transferred 83 clusters with 3 clusters populated
In stage 13: transferred 75 clusters with 3 clusters populated
In stage 14: transferred 69 clusters with 3 clusters populated
In stage 15: transferred 69 clusters with 3 clusters popul

In [327]:
def topWordsPerTopic(clusterDistrib, topIndex, nbWord):
    for index in topIndex:
        clusterWord = clusterDistrib[index]
        sortedCluster = sorted(clusterWord.items(), key=operator.itemgetter(1), reverse=True)
        clusterTopWords = sortedCluster[:nbWord]
        print(f"Cluster {index} : {clusterTopWords}")
        print('*'*20)

**QUESTION:** Can we do a wordcloud on the clusters?
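
One possible answer, as a sketch: each entry of `mgpModel.cluster_word_distribution` is a word-to-count mapping, which is the kind of input the `wordcloud` package accepts through `WordCloud.generate_from_frequencies`. The helper names (`cluster_frequencies`, `render_wordcloud`) are hypothetical, the sample counts are taken from Cluster 1 of the bad-review run for illustration only, and rendering assumes the optional `wordcloud` package is installed.

```python
def cluster_frequencies(cluster_word_counts, top_n=30):
    """Return the `top_n` most frequent words of a cluster as relative frequencies."""
    top = sorted(cluster_word_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top} if total else {}

def render_wordcloud(frequencies, path):
    """Write a word cloud image to `path`; return False if `wordcloud` is missing."""
    try:
        from wordcloud import WordCloud  # optional third-party dependency
    except ImportError:
        return False
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies).to_file(path)
    return True

sample_cluster = {"table": 68, "time": 45, "minute": 31}
print(cluster_frequencies(sample_cluster, top_n=2))
```

In the notebook this could be called once per cluster, e.g. `render_wordcloud(cluster_frequencies(mgpModel.cluster_word_distribution[1]), "cluster_1.png")`, where the file name is only an example.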

In [328]:
docCount = np.array(mgpModel.cluster_doc_count)
print('Number of documents per topic :', docCount)
print('*'*20)
# Topics sorted by the number of documents allocated to them
topIndex = docCount.argsort()[::-1]
print('Most important clusters (by number of docs inside):', topIndex)
print('*'*20)
# Show the top 30 words in term frequency for each cluster 
topWordsPerTopic(mgpModel.cluster_word_distribution, topIndex, 30)

Number of documents per topic : [137 285 168]
********************
Most important clusters (by number of docs inside): [1 2 0]
********************
Cluster 1 : [('table', 68), ('u', 58), ('time', 45), ('bar', 45), ('drink', 45), ('order', 36), ('minute', 31), ('didnt', 30), ('asked', 29), ('staff', 28), ('ordered', 26), ('could', 25), ('arrived', 24), ('waiting', 24), ('waiter', 23), ('cocktail', 22), ('got', 22), ('even', 21), ('take', 21), ('one', 21), ('like', 20), ('service', 19), ('main', 19), ('took', 18), ('view', 17), ('told', 17), ('would', 17), ('said', 17), ('another', 17), ('however', 16)]
********************
Cluster 2 : [('service', 24), ('view', 22), ('night', 19), ('bokan', 19), ('good', 17), ('time', 17), ('bar', 17), ('would', 13), ('really', 13), ('one', 13), ('experience', 12), ('great', 12), ('customer', 12), ('u', 11), ('visited', 10), ('birthday', 10), ('drink', 10), ('venue', 10), ('friday', 10), ('booked', 10), ('friend', 9), ('day', 9), ('canary', 9), ('wharf'

### pyLDAvis with gsdmm

In [329]:
#Topic-term matrix shape (n_topic, n_term)
def createTopicTermMatrix(vocab, mgp):
    zero = np.zeros((len(mgp.cluster_word_distribution), len(vocab)))
    df = pd.DataFrame(data=zero, columns=list(vocab))
    for i, cluster_word_distrib in tqdm(enumerate(mgp.cluster_word_distribution)):
        for key, val in cluster_word_distrib.items():
            df.loc[i, key] = val
    return df
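
`createTopicTermMatrix` stores raw word counts per cluster, whereas pyLDAvis documents its `topic_term_dists` argument as a matrix of topic-term probabilities. A minimal sketch of row normalization in plain Python follows, assuming each cluster is a word-to-count dictionary; `topic_term_probabilities` is a hypothetical helper, not part of gsdmm or pyLDAvis.

```python
def topic_term_probabilities(cluster_word_counts, vocab):
    """Return one row per cluster with the probability of each vocabulary term."""
    matrix = []
    for counts in cluster_word_counts:
        total = sum(counts.values())
        # An empty cluster gets a row of zeros instead of a division by zero
        row = [counts.get(term, 0) / total if total else 0.0 for term in vocab]
        matrix.append(row)
    return matrix

toy_clusters = [{"table": 3, "time": 1}, {}]
toy_vocab = ["table", "time", "view"]
print(topic_term_probabilities(toy_clusters, toy_vocab))
```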

In [330]:
TopicTermMatrix = createTopicTermMatrix(vocab, mgpModel)





In [331]:
#Matrix of document-topic probabilities shape (n_doc, n_topics)
def createDocumentTopicProbaMatrix(mgp, doc_list):
    score_per_doc = []
    for doc in tqdm(doc_list):
        score_per_doc.append(mgp.score(doc))
    df = pd.DataFrame(data=score_per_doc, columns=[i for i in range(nb_topic)])
    return df
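
`mgp.score(doc)` returns one probability per cluster. To read example sentences for each topic, it can help to reduce each score vector to its most likely cluster. The sketch below works on plain lists; `dominant_topic` is a hypothetical helper, and its all-zero case corresponds to the documents that `deleteNonProcessedDocument` drops.

```python
def dominant_topic(scores):
    """Return (topic_index, probability) of the highest score, or (None, 0.0)
    when the document received no probability mass at all."""
    if not scores or sum(scores) == 0:
        return None, 0.0
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

print(dominant_topic([0.1, 0.7, 0.2]))
print(dominant_topic([0.0, 0.0, 0.0]))
```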

In [332]:
probaDocumentTopicMatrix = createDocumentTopicProbaMatrix(mgpModel, docs)





In [333]:
def deleteNonProcessedDocument(probaDocumentTopicMatrix, document):
    check = probaDocumentTopicMatrix.sum(axis=1)
    toDelete = check[check==0]
    idxToDelete = toDelete.index 
    probaDocumentTopicMatrix = probaDocumentTopicMatrix.drop(idxToDelete)
    
    idx = list(map(lambda x: document.index[x], idxToDelete))
    document = document.drop(idx)
    return probaDocumentTopicMatrix, document

In [334]:
probaDocumentTopicMatrixClean, documentClean = deleteNonProcessedDocument(probaDocumentTopicMatrix, df_bokan_bad)

In [335]:
#doc length shape (n_doc)
docLength = documentClean.nb_token
print(len(documentClean))

590


In [336]:
# Vocabulary as a list, shape (n_term)
vocabList = list(vocab) 

In [337]:
#Term frequency shape (n_term)
def TermFrequency(vocab_list, doc_list):
    res = []
    for word in tqdm(vocab_list):
        word_per_doc = sum(list(map(lambda x: x.count(word), doc_list)))
        res.append(word_per_doc)
    return res
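
`TermFrequency` rescans every document for every vocabulary word. A sketch of an equivalent single-pass count using `collections.Counter` is shown here; `term_frequency_fast` is a hypothetical name, and it assumes the documents are lists of tokens as in `docs`.

```python
from collections import Counter

def term_frequency_fast(vocab_list, doc_list):
    """Return the corpus-wide count of each word, in the order of `vocab_list`."""
    # Count every token once, then look each vocabulary word up
    counts = Counter(tok for doc in doc_list for tok in doc)
    return [counts[word] for word in vocab_list]

toy_docs = [["view", "bar", "view"], ["bar"]]
print(term_frequency_fast(["bar", "view", "menu"], toy_docs))
```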

In [338]:
termFrequencyList = TermFrequency(vocabList, docs)





In [340]:
%%time
vis = pyLDAvis.prepare(TopicTermMatrix, probaDocumentTopicMatrixClean, docLength, vocabList, termFrequencyList, sort_topics=False)
pyLDAvis.display(vis)

CPU times: user 1.33 s, sys: 10.2 ms, total: 1.34 s
Wall time: 1.68 s


**Conclusion: Topic extraction for bad reviews of bokan**
- **Topic 1: Time:** reservation time, time to get the menu, time to get the drinks, service time --> more staff, time guarantees for service
- **Topic 2: Food:** cold, not well cooked; several dishes come up frequently (duck, fish, etc.) and should be analyzed semi-manually
- **Topic 3: Staff:** rude, not apologizing when a mistake is made --> training, and having the employees themselves rewrite a personal mission statement for the restaurant

### Topic extraction from bokan good reviews using GSDMM

In [341]:
# Dictionary
tokens = df_bokan_good.clean_content
dictionary = gensim.corpora.Dictionary(tokens)
# Filter out tokens in the dictionary by their frequency
dictionary.filter_extremes(no_below=0.05, no_above=0.9)
# doc2bow: Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples
corpus = [dictionary.doc2bow(tok) for tok in tokens]

In [342]:
df_bokan_good['nb_token'] = list(map(len, df_bokan_good['clean_content']))
docs = df_bokan_good.clean_content.to_list()
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)

In [343]:
nb_topic = 3
alpha = 0.1
beta = 0.1

mgpModel = MovieGroupProcess(K=nb_topic, alpha=alpha, beta=beta, n_iters=20)
mgpModelFit = mgpModel.fit(tokens, n_terms)

In stage 0: transferred 764 clusters with 3 clusters populated
In stage 1: transferred 500 clusters with 3 clusters populated
In stage 2: transferred 392 clusters with 3 clusters populated
In stage 3: transferred 338 clusters with 3 clusters populated
In stage 4: transferred 319 clusters with 3 clusters populated
In stage 5: transferred 317 clusters with 3 clusters populated
In stage 6: transferred 269 clusters with 3 clusters populated
In stage 7: transferred 294 clusters with 3 clusters populated
In stage 8: transferred 293 clusters with 3 clusters populated
In stage 9: transferred 270 clusters with 3 clusters populated
In stage 10: transferred 257 clusters with 3 clusters populated
In stage 11: transferred 265 clusters with 3 clusters populated
In stage 12: transferred 235 clusters with 3 clusters populated
In stage 13: transferred 242 clusters with 3 clusters populated
In stage 14: transferred 235 clusters with 3 clusters populated
In stage 15: transferred 247 clusters with 3 clust

In [344]:
def topWordsPerTopic(clusterDistrib, topIndex, nbWord):
    for index in topIndex:
        clusterWord = clusterDistrib[index]
        sortedCluster = sorted(clusterWord.items(), key=operator.itemgetter(1), reverse=True)
        clusterTopWords = sortedCluster[:nbWord]
        print(f"Cluster {index} : {clusterTopWords}")
        print('*'*20)

**QUESTION:** Can we do a wordcloud on the clusters?

In [345]:
docCount = np.array(mgpModel.cluster_doc_count)
print('Number of documents per topic :', docCount)
print('*'*20)
# Topics sorted by the number of documents allocated to them
topIndex = docCount.argsort()[::-1]
print('Most important clusters (by number of docs inside):', topIndex)
print('*'*20)
# Show the top 30 words in term frequency for each cluster 
topWordsPerTopic(mgpModel.cluster_word_distribution, topIndex, 30)

Number of documents per topic : [443 630 302]
********************
Most important clusters (by number of docs inside): [1 0 2]
********************
Cluster 1 : [('view', 91), ('bokan', 82), ('definitely', 72), ('amazing', 62), ('great', 55), ('u', 54), ('place', 53), ('bar', 52), ('would', 51), ('staff', 50), ('service', 50), ('back', 49), ('dinner', 47), ('table', 44), ('go', 44), ('went', 40), ('menu', 38), ('recommend', 37), ('time', 36), ('friend', 34), ('visit', 33), ('night', 32), ('evening', 29), ('made', 29), ('experience', 29), ('best', 27), ('drink', 27), ('special', 26), ('birthday', 26), ('well', 26)]
********************
Cluster 0 : [('view', 142), ('staff', 85), ('service', 83), ('great', 82), ('amazing', 57), ('friendly', 48), ('bar', 40), ('good', 39), ('nice', 38), ('cocktail', 38), ('time', 36), ('really', 35), ('u', 33), ('table', 31), ('excellent', 27), ('fantastic', 27), ('atmosphere', 27), ('place', 26), ('bokan', 26), ('would', 23), ('stunning', 21), ('always', 2

### pyLDAvis with gsdmm

In [346]:
#Topic-term matrix shape (n_topic, n_term)
def createTopicTermMatrix(vocab, mgp):
    zero = np.zeros((len(mgp.cluster_word_distribution), len(vocab)))
    df = pd.DataFrame(data=zero, columns=list(vocab))
    for i, cluster_word_distrib in tqdm(enumerate(mgp.cluster_word_distribution)):
        for key, val in cluster_word_distrib.items():
            df.loc[i, key] = val
    return df

In [347]:
TopicTermMatrix = createTopicTermMatrix(vocab, mgpModel)





In [348]:
#Matrix of document-topic probabilities shape (n_doc, n_topics)
def createDocumentTopicProbaMatrix(mgp, doc_list):
    score_per_doc = []
    for doc in tqdm(doc_list):
        score_per_doc.append(mgp.score(doc))
    df = pd.DataFrame(data=score_per_doc, columns=[i for i in range(nb_topic)])
    return df

In [349]:
probaDocumentTopicMatrix = createDocumentTopicProbaMatrix(mgpModel, docs)





In [350]:
def deleteNonProcessedDocument(probaDocumentTopicMatrix, document):
    check = probaDocumentTopicMatrix.sum(axis=1)
    toDelete = check[check==0]
    idxToDelete = toDelete.index 
    probaDocumentTopicMatrix = probaDocumentTopicMatrix.drop(idxToDelete)
    
    idx = list(map(lambda x: document.index[x], idxToDelete))
    document = document.drop(idx)
    return probaDocumentTopicMatrix, document

In [353]:
probaDocumentTopicMatrixClean, documentClean = deleteNonProcessedDocument(probaDocumentTopicMatrix, df_bokan_good)

In [354]:
#doc length shape (n_doc)
docLength = documentClean.nb_token
print(len(documentClean))

1375


In [355]:
# Vocabulary as a list, shape (n_term)
vocabList = list(vocab) 

In [356]:
#Term frequency shape (n_term)
def TermFrequency(vocab_list, doc_list):
    res = []
    for word in tqdm(vocab_list):
        word_per_doc = sum(list(map(lambda x: x.count(word), doc_list)))
        res.append(word_per_doc)
    return res

In [357]:
termFrequencyList = TermFrequency(vocabList, docs)





In [358]:
%%time
vis = pyLDAvis.prepare(TopicTermMatrix, probaDocumentTopicMatrixClean, docLength, vocabList, termFrequencyList, sort_topics=False)
pyLDAvis.display(vis)

CPU times: user 1.99 s, sys: 20.5 ms, total: 2.02 s
Wall time: 2.35 s


**Conclusion: Topic extraction for good reviews of bokan**
- **Topic 1: Staff:** attentive, friendly, great service --> shows that some waiters do a really good job and highlights that some waiters are more welcoming than others
- **Topic 2: Food:** delicious, tasty, well presented; several dishes come up frequently (lamb, some cocktails and some desserts) and should be analyzed semi-manually
- **Topic 3: View:** liked by all customers, should be put forward when advertising Bokan