## References

In [None]:
# https://radimrehurek.com/gensim/tut2.html#Gensim    
# https://markroxor.github.io/gensim/static/notebooks/lda_training_tips.html
# https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/
# https://pypi.org/project/pyLDAvis/1.0.0/
# https://gist.github.com/tokestermw/3588e6fbbb2f03f89798
# https://stackoverflow.com/questions/11162402/lda-topic-modeling-training-and-testing
# https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24

## Notebook Setup

In [None]:
# Import libraries
import logging
import matplotlib.pyplot as plt
import os.path
import pyLDAvis.gensim
import pyLDAvis
import pickle
from gensim import corpora, models
from gensim.models import CoherenceModel
from wordcloud import WordCloud

In [None]:
# Display plots within notebook
%matplotlib inline

In [None]:
# Log events
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Load Vectorized Corpora and Dictionaries

In [None]:
# Load no pooling corpus
if os.path.exists("../outputs/tourism_no_pooling.dict"):
    dictionary_no_pooling = corpora.Dictionary.load("../outputs/tourism_no_pooling.dict")
    corpus_no_pooling = corpora.MmCorpus("../outputs/tourism_no_pooling.mm")
    print("Vectorized no pooling corpus loaded!")
else:
    print("Please run preprocessing script first!")

# Load user pooling corpus
if os.path.exists("../outputs/tourism_user_pooling.dict"):
    dictionary_user_pooling = corpora.Dictionary.load("../outputs/tourism_user_pooling.dict")
    corpus_user_pooling = corpora.MmCorpus("../outputs/tourism_user_pooling.mm")
    print("Vectorized user pooling corpus loaded!")
else:
    print("Please run preprocessing script first!")

# Load hashtag pooling corpus
if os.path.exists("../outputs/tourism_hashtag_pooling.dict"):
    dictionary_hashtag_pooling = corpora.Dictionary.load("../outputs/tourism_hashtag_pooling.dict")
    corpus_hashtag_pooling = corpora.MmCorpus("../outputs/tourism_hashtag_pooling.mm")
    print("Vectorized hashtag pooling corpus loaded!")
else:
    print("Please run preprocessing script first!")

## Load Tokenized Documents

In [None]:
# Load the tokenized documents of each pooling method
with open("../outputs/tokenized_documents_no_pooling.p", "rb") as fp:
    tokenized_documents_no_pooling = pickle.load(fp)
with open("../outputs/tokenized_documents_user_pooling.p", "rb") as fp:
    tokenized_documents_user_pooling = pickle.load(fp)
with open("../outputs/tokenized_documents_hashtag_pooling.p", "rb") as fp:
    tokenized_documents_hashtag_pooling = pickle.load(fp)

## Implement LDA Models with Different Pooling Methods

Two common evaluation metrics for topic models are coherence and perplexity. Coherence is used here to compare LDA models trained with different numbers of topics, because it tends to favor topics that humans find interpretable, which is the objective of this research. The number of topics is limited to a maximum of 8 to avoid overly granular topics. The highest coherence value does not always yield the most interpretable topics, however, so visualizing the topic models additionally helps to understand and interpret them. The c_v measure is used as the coherence measure to evaluate the LDA models.
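
To make the idea of a coherence score more concrete, the following is a minimal sketch of a simplified coherence measure: the average normalized pointwise mutual information (NPMI) over the word pairs of a topic, based on document-level co-occurrence. It is not the c_v measure itself, which additionally relies on sliding windows and an indirect similarity between word vectors. The function name `npmi_coherence` and the toy documents are assumptions made purely for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of a topic (simplified, not c_v)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest possible NPMI
        elif p12 == 1:
            scores.append(1.0)   # words co-occur in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical tokenized tweets
toy_documents = [
    ["beach", "sun", "summer"],
    ["beach", "sun", "friends"],
    ["gaudi", "church", "architecture"],
    ["gaudi", "architecture", "travel"],
]

coherent_score = npmi_coherence(["beach", "sun", "summer"], toy_documents)
mixed_score = npmi_coherence(["beach", "gaudi", "travel"], toy_documents)
print(coherent_score, mixed_score)
```

A topic whose top words tend to appear in the same documents scores higher than a topic that mixes words from unrelated documents, which is the intuition behind using coherence as a proxy for interpretability.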

In [None]:
# Define function to train LDA models with different numbers of topics
# and evaluate their coherence values (choose the number of topics with the highest coherence value)
def compute_coherence_values(dictionary, corpus, texts, limit=9, start=4, step=1):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of tokenized input texts
    limit : Upper bound for the number of topics (exclusive)
    start : Smallest number of topics
    step : Step size between numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA models with the respective number of topics
    """

    coherence_values = []
    model_list = []

    for num_topics in range(start, limit, step):
        model = models.LdaModel(corpus=corpus, id2word=dictionary, alpha='auto', eta='auto',
                                eval_every=1, iterations=400, passes=20, num_topics=num_topics)
        model_list.append(model)

        # Extract the top words of each topic
        # (show_topics returns at most 10 topics with 10 words each by default)
        model_topics = model.show_topics(formatted=False)
        model_topics = [[word for word, prob in topic] for topicid, topic in model_topics]

        # Compute the c_v coherence of the topics on the tokenized texts
        coherencemodel = CoherenceModel(topics=model_topics, texts=texts, dictionary=dictionary,
                                        coherence='c_v', window_size=10)
        coherence_values.append(coherencemodel.get_coherence())

    return (model_list, coherence_values)

### No Pooling

In [None]:
# Train and evaluate different no pooling models by running the function
no_pooling_models = compute_coherence_values(dictionary=dictionary_no_pooling,
                         corpus=corpus_no_pooling, texts=tokenized_documents_no_pooling)

In [None]:
# Display the coherence score of the different models
model_list_no_pooling = no_pooling_models[0]
coherence_values_no_pooling = no_pooling_models[1]

limit, start, step = 9, 4, 1
x = range(start, limit, step)
_ = plt.plot(x, coherence_values_no_pooling)
_ = plt.xlabel("Num Topics")
_ = plt.ylabel("Coherence score")
# The trailing comma makes this a one-element tuple; a plain string would be split into characters
_ = plt.legend(("coherence_values",), loc='best')
_ = plt.savefig("no_pooling_coherence_scores")
_ = plt.show()

Choose the model with the highest coherence score (7 topics).
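
The list index used to pick this model follows from the `start` and `step` values of the topic range. The following is a minimal sketch of that mapping, assuming a hypothetical helper `best_model_index` and made-up coherence values (the numbers are placeholders, not results from this notebook):

```python
def best_model_index(coherence_values, start=4, step=1):
    """Return (list index, number of topics) of the highest coherence value."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return best_index, start + best_index * step


# Placeholder coherence values for 4, 5, 6, 7 and 8 topics
example_coherence_values = [0.31, 0.35, 0.33, 0.41, 0.38]

best_index, best_num_topics = best_model_index(example_coherence_values)
print(best_index, best_num_topics)
```

With `start=4` and `step=1`, index 3 corresponds to the 7 topics model and index 4 to the 8 topics model.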

In [None]:
# Print topics of model with highest coherence score
lda_model_no_pooling = model_list_no_pooling[3] # 7 topics model
_ = lda_model_no_pooling.print_topics()

### Visualize No Pooling Model

In [None]:
pyLDAvis.enable_notebook()
vis_np = pyLDAvis.gensim.prepare(lda_model_no_pooling, corpus_no_pooling, dictionary_no_pooling)

In [None]:
vis_np

Although topical trends are already visible in the no pooling model, the topics overlap and could be more interpretable. This finding can be attributed to the shortness of tweets.

### User Pooling

In [None]:
# Train and evaluate different user pooling models by running the function
user_pooling_models = compute_coherence_values(dictionary=dictionary_user_pooling,
                         corpus=corpus_user_pooling, texts=tokenized_documents_user_pooling)

In [None]:
# Display the coherence score of the different models
model_list_user_pooling = user_pooling_models[0]
coherence_values_user_pooling = user_pooling_models[1]

limit, start, step = 9, 4, 1
x = range(start, limit, step)
_ = plt.plot(x, coherence_values_user_pooling)
_ = plt.xlabel("Num Topics")
_ = plt.ylabel("Coherence score")
# The trailing comma makes this a one-element tuple; a plain string would be split into characters
_ = plt.legend(("coherence_values",), loc='best')
_ = plt.savefig("user_pooling_coherence_scores")
_ = plt.show()

Choose the model with the highest coherence score (8 topics).

In [None]:
# Print topics of model with highest coherence score
lda_model_user_pooling = model_list_user_pooling[4] # 8 topics model
_ = lda_model_user_pooling.print_topics()

### Visualize User Pooling Model

In [None]:
pyLDAvis.enable_notebook()
vis_up = pyLDAvis.gensim.prepare(lda_model_user_pooling, corpus_user_pooling, dictionary_user_pooling)

In [None]:
vis_up

The results of the user pooling model look similar to those of the no pooling model. However, the topics are even more mixed and less interpretable, since individual users tend to tweet about several different topics.

### Hashtag Pooling

In [None]:
# Train and evaluate different hashtag pooling models by running the function
hashtag_pooling_models = compute_coherence_values(dictionary=dictionary_hashtag_pooling,
                         corpus=corpus_hashtag_pooling, texts=tokenized_documents_hashtag_pooling)

In [None]:
# Display the coherence score of the different models
model_list_hashtag_pooling = hashtag_pooling_models[0]
coherence_values_hashtag_pooling = hashtag_pooling_models[1]

limit, start, step = 9, 4, 1
x = range(start, limit, step)
_ = plt.plot(x, coherence_values_hashtag_pooling)
_ = plt.xlabel("Num Topics")
_ = plt.ylabel("Coherence score")
# The trailing comma makes this a one-element tuple; a plain string would be split into characters
_ = plt.legend(("coherence_values",), loc='best')
_ = plt.savefig("hashtag_pooling_coherence_scores")
_ = plt.show()

Choose the model with the highest coherence score (7 topics).

In [None]:
# Print topics of model with highest coherence score
lda_model_hashtag_pooling = model_list_hashtag_pooling[3] # 7 topics
_ = lda_model_hashtag_pooling.print_topics()

### Visualize Hashtag Pooling Model

In [None]:
pyLDAvis.enable_notebook()
vis_hp = pyLDAvis.gensim.prepare(lda_model_hashtag_pooling, corpus_hashtag_pooling, dictionary_hashtag_pooling)

In [None]:
vis_hp

### Intermediate Result

After training ~50 models for each pooling method, the following conclusion was reached: the topics of the no pooling and user pooling models are less interpretable than those of the hashtag pooling models and repeat some words across topics. Moreover, the no pooling and user pooling models are very unstable because tweets are very short. The best hashtag pooling model (the one with the most human-interpretable topics) is therefore saved and used for further purposes. Hashtag pooling has also been reported to give the best results in various research papers.
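
The pooling itself happens in the preprocessing script, which is not shown in this notebook. As a rough sketch of the idea only, assuming that tweets are available as token lists and that hashtags are tokens starting with `#` (both assumptions, as is the helper name `pool_by_hashtag`), hashtag pooling merges all tweets that share a hashtag into one longer pseudo-document:

```python
from collections import defaultdict


def pool_by_hashtag(tokenized_tweets):
    """Merge tweets sharing a hashtag into one pseudo-document per hashtag."""
    pools = defaultdict(list)
    for tokens in tokenized_tweets:
        hashtags = [t for t in tokens if t.startswith("#")]
        for tag in hashtags:
            pools[tag].extend(tokens)
    return dict(pools)


# Hypothetical tokenized tweets
example_tweets = [
    ["#beach", "sun", "friends"],
    ["#beach", "summer", "smile"],
    ["#streetart", "graffiti", "raval"],
    ["tapas", "yum"],  # no hashtag: not assigned to any pool in this sketch
]

pooled_documents = pool_by_hashtag(example_tweets)
for hashtag, tokens in pooled_documents.items():
    print(hashtag, tokens)
```

The longer pseudo-documents give LDA more word co-occurrences per document than single tweets do. How tweets without hashtags are handled is a design choice that this sketch leaves open.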

## Save Models

In [None]:
lda_model_no_pooling.save("../outputs/lda_model_no_pooling.model") 
lda_model_user_pooling.save("../outputs/lda_model_user_pooling.model") 
lda_model_hashtag_pooling.save("../outputs/lda_model_hashtag_pooling.model")

## Test Whether TFIDF Can Improve LDA (Instead of BOW)

Sometimes TFIDF improves LDA performance although LDA is mathematically meant to process a BOW input. TFIDF is therefore used to transform the corpus of the chosen model (hashtag pooling model with 7 topics).

In [None]:
# Initialize tfidf model
tfidf_hashtag_pooling = models.TfidfModel(corpus_hashtag_pooling)
   
# Run term frequency inverse document frequency transformation
# (transform bag-of-words integer counts corpus to tfidf real-valued weights
# corpus)
corpus_tfidf_hashtag_pooling = tfidf_hashtag_pooling[corpus_hashtag_pooling]
for doc in corpus_tfidf_hashtag_pooling:
    print(doc)

In [None]:
# Train hashtag pooling model with tfidf corpus
lda_model_hashtag_pooling_tfidf = models.LdaModel(corpus_tfidf_hashtag_pooling,
                                                  id2word=dictionary_hashtag_pooling,
                                                  alpha='auto', eta='auto',
                                                  eval_every=1,
                                                  iterations=400, passes=20, num_topics=7)

In [None]:
# Print topics of model
_ = lda_model_hashtag_pooling_tfidf.print_topics()

In the case of tweets, however, TFIDF does not improve the results; it makes them worse and less interpretable. Very rare terms receive higher weights, but in tweets these terms seldom carry an interpretable topic (e.g. "#youcanseeourhousefromhere"). The final LDA model is thus the 7 topics hashtag pooling model trained on a BOW corpus.
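
The effect of TFIDF on rare terms can be illustrated with a small pure-Python sketch. It assumes the common weighting of term frequency times log2(N / document frequency), followed by L2 normalization, which should be close to the default settings of gensim's `TfidfModel`. The helper `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document (tf * log2(N/df), L2-normalized)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        raw = {term: count * math.log2(n_docs / doc_freq[term])
               for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


# Hypothetical pooled documents
toy_documents = [
    ["barcelona", "beach", "sun"],
    ["barcelona", "gaudi"],
    ["barcelona", "beach", "youcanseeourhousefromhere"],
]

weights = tfidf_weights(toy_documents)[2]
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy example the one-off hashtag dominates its document, while a term that occurs in every document receives a weight of zero.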

### Analysis of Topics

In [None]:
pyLDAvis.enable_notebook()
vis_hp = pyLDAvis.gensim.prepare(lda_model_hashtag_pooling, corpus_hashtag_pooling, dictionary_hashtag_pooling)

In [None]:
vis_hp

#### Manual inspection of the topics leads to the following labels:
#### Topic 0: Sightseeing (Sagrada Familia, gaudi, architecture, travel, church ...)
#### Topic 1: Summer, Sun & Friends (beach, friends, summer, smile, sun ...)
#### Topic 2: Streetart (graffiti, streetart, arte urbano, massive, streetphotography...)
#### Topic 3: Everyday Life (yum, home, place, call, tapas ...)
#### Topic 4: Lifestyle & Culture (yoga, selfie, contemporaryart, yummy, brianeno ...)
#### Topic 5: Nightlife (night, olgod beer bar, cocktail, beer, raval ...) 
#### Topic 6: Sports, Health & Image (workout, fit, meditation, healthy, video ...)
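
For later use, the labels above can be kept in a small lookup table. The sketch below assumes a topic distribution in the form of `(topic_id, probability)` pairs, as returned when a gensim LDA model is applied to a bag-of-words document. The helper `dominant_topic_label` and the example distribution are hypothetical.

```python
# Labels taken from the manual inspection above
topic_labels = {
    0: "Sightseeing",
    1: "Summer, Sun & Friends",
    2: "Streetart",
    3: "Everyday Life",
    4: "Lifestyle & Culture",
    5: "Nightlife",
    6: "Sports, Health & Image",
}


def dominant_topic_label(topic_distribution):
    """Return the label of the most probable topic in a list of (id, prob) pairs."""
    topic_id, _ = max(topic_distribution, key=lambda pair: pair[1])
    return topic_labels[topic_id]


# Made-up distribution for a single document
example_distribution = [(0, 0.12), (5, 0.71), (6, 0.17)]
print(dominant_topic_label(example_distribution))
```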

In [None]:
# Display the 10 most important words for each topic
n_topics = 7
topic_terms = []

for topic_id in range(n_topics):
    # show_topic returns a list of (word, probability) pairs
    terms = lda_model_hashtag_pooling.show_topic(topic_id, 10)
    topic_terms.append(terms)
    print("Top 10 terms for topic #" + str(topic_id) + ": " + ", ".join(str(term[0]) for term in terms))

In [None]:
# Display wordclouds for the topics
def terms_to_wordcounts(terms, multiplier=1000):
    # Repeat each word in proportion to its probability so that WordCloud can count it
    return " ".join(" ".join(int(multiplier * prob) * [word]) for word, prob in terms)

for i, topic in enumerate(topic_terms):
    wordcloud = WordCloud(background_color="black", collocations=False).generate(terms_to_wordcounts(topic))

    _ = plt.imshow(wordcloud)
    _ = plt.axis("off")
    _ = plt.savefig("terms_wordcloud_topic" + str(i))
    _ = plt.show()
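
The word clouds rely on a small trick: each term is repeated in proportion to its topic probability, so that the frequency counting of the word cloud reproduces the topic weights. A minimal sketch with made-up probabilities shows the resulting text (a small `multiplier` is used here only to keep the output short):

```python
def terms_to_wordcounts(terms, multiplier=1000):
    """Repeat each term int(multiplier * probability) times in one string."""
    return " ".join(" ".join(int(multiplier * prob) * [term]) for term, prob in terms)


# Made-up (term, probability) pairs, not taken from the trained model
example_terms = [("beach", 0.5), ("sun", 0.25), ("smile", 0.125)]
example_text = terms_to_wordcounts(example_terms, multiplier=8)
print(example_text)
```

Because the repetition count is truncated to an integer, terms whose probability times `multiplier` is below 1 drop out of the text entirely.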