# Latent Dirichlet Allocation (LDA) Example

This notebook contains an applied example of using LDA for topic modelling, roughly following the example [here](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0).    
Please note that this notebook shows how to run LDA, but does not explain how it works. For more information on how LDA works, see [this article](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2).

---
**Load packages:**

Note: If you've never run this before, you might need to run `python3 -m spacy download en_core_web_sm` in the terminal first.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import re
import random
import spacy  # spaCy for preprocessing

# Set random seed
random.seed(1234)

---
The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date. More information can be found here: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

**Fetch data:**

In [None]:
newsgroups = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
categories = newsgroups.target_names
target_num = newsgroups.target
target = [categories[x] for x in target_num]

In [None]:
df = pd.DataFrame()
df['text'] = newsgroups.data
df['target'] = target
df.head()

----
**Clean data:**

In [None]:
# Replace newlines and tabs with spaces
df['text_processed'] = df['text'].map(lambda x: re.sub(r'[\n\t]', ' ', x))
# Remove basic punctuation (commas, full stops, exclamation and question marks)
df['text_processed'] = df['text_processed'].map(lambda x: re.sub(r'[,.!?]', '', x))
# Lowercase
df['text_processed'] = df['text_processed'].map(lambda x: x.lower())
df
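
To see what these cleaning steps do, here is a minimal sketch on a made-up sample string. The `clean_text` helper is an assumption for illustration and is not used elsewhere in the notebook. Note that only a small set of punctuation marks is removed here; anything else (such as brackets) is left for the tokenizer to handle later.

```python
import re

def clean_text(text):
    """Apply the same three cleaning steps as the cell above (hypothetical helper)."""
    text = re.sub(r'[\n\t]', ' ', text)   # newlines and tabs become spaces
    text = re.sub(r'[,.!?]', '', text)    # drop commas, full stops, '!' and '?'
    return text.lower()                   # lowercase everything

sample = "Hello, World!\nIs this\tLDA? (Yes.)"
print(clean_text(sample))
```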

In [None]:
# Preserve orig df
df_orig = df.copy()

To keep the example small, filter the dataset down to one subcategory per top-level category:

In [None]:
df['category'] = df.target.str.split('.').str[0]

In [None]:
unq_targets = df.loc[:, ["category", "target"]].drop_duplicates()
chosen_targets = []
for i in unq_targets.category.unique():
    choice = unq_targets[unq_targets.category == i].iloc[[0]].target.iloc[0]
    chosen_targets.append(choice)

In [None]:
df = df.loc[df.target.isin(chosen_targets),].reset_index(drop = True)
df
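
The selection logic above can be sketched without pandas. This is an illustrative version only: the function name and the ordering of the sample target names are made up, and the idea is simply to keep the first subcategory encountered for each top-level category (the text before the first `.`).

```python
def first_subcategory_per_category(targets):
    """Return the first target seen for each top-level category."""
    chosen = {}
    for target in targets:
        category = target.split('.')[0]
        chosen.setdefault(category, target)  # keep only the first one encountered
    return list(chosen.values())

# A made-up ordering of a few 20 newsgroups target names
targets = ['rec.autos', 'comp.graphics', 'rec.motorcycles',
           'sci.med', 'comp.windows.x', 'sci.space']
print(first_subcategory_per_category(targets))
```

As in the notebook cell, which subcategory is chosen depends on the order in which the targets first appear in the data.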

-----
**Wordcloud:**

In [None]:
# Import the wordcloud library
from wordcloud import WordCloud

# Join the processed texts together.
long_string = ','.join(list(df['text_processed'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

------
**Prepare data for LDA:**

- Tokenize text
- Remove stopwords
- Convert to corpus and dictionary

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; tokenizing also drops punctuation

data = df.text_processed.values.tolist()
data_words = list(sent_to_words(data))
print(data_words[:1])

**Bigram and trigram models**  
https://medium.com/analytics-vidhya/topic-modeling-using-gensim-lda-in-python-48eaa2344920

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
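
For some intuition about `min_count` and `threshold`: a pair of adjacent words is merged into a phrase when its score exceeds the threshold. The sketch below assumes the scoring formula from Mikolov et al., `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, which is understood to be gensim's default. The function name and all counts are made up for illustration, and this is not gensim's own code.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram (assumed default formula, see note above)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 100

# Two rare words that nearly always appear together score highly...
score_strong = phrase_score(count_a=20, count_b=20, count_ab=18,
                            vocab_size=10000, min_count=5)
# ...while two frequent words that rarely co-occur score low
score_weak = phrase_score(count_a=2000, count_b=1500, count_ab=10,
                          vocab_size=10000, min_count=5)

print(score_strong, score_strong > threshold)
print(score_weak, score_weak > threshold)
```

Under this formula, raising `threshold` means only more strongly associated pairs are kept, which is why a higher threshold gives fewer phrases.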

In [None]:
# Define function for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
#python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0][:30])
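
The corpus format printed above is a list of `(token_id, count)` pairs per document. A minimal stdlib sketch of the same idea follows; the helper names are hypothetical, and the ids here are assigned in order of first appearance, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [['car', 'engine', 'car'], ['engine', 'oil']]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

Word order is discarded at this point: only which words occur, and how often, is passed on to the model.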

We can see how the text is reduced to a bag of words:

In [None]:
print(texts[10])
print(df.text_processed.iloc[10])

---------
**Model training:**

To keep things simple, we’ll keep all the parameters to default except for inputting the number of topics. 

In [None]:
from pprint import pprint

# number of topics
num_topics = 7 # This is our expected number of categories

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)
# Print the keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

These topics don't look very good - from the words they all look pretty similar.

-------
**Analyzing LDA model results:**

In [None]:
import pyLDAvis.gensim  # in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis
import os

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_data_filepath = os.path.join('./results/ldavis_prepared_'+str(num_topics))
# # this is a bit time consuming - make the if statement True
# # if you want to execute visualization prep yourself
if 1 == 1:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)

In [None]:
LDAvis_prepared

-----

**Evaluating topic predictions**  
Generate a list of assigned topics and probabilities for each text:

In [None]:
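topic_result = []
topic_prob = []
for bow in corpus:
    # lda_model[bow] returns (topic_id, probability) pairs ordered by topic id,
    # so take the most probable pair rather than the first one
    best_topic, best_prob = max(lda_model[bow], key=lambda pair: pair[1])
    topic_result.append(best_topic)
    topic_prob.append(best_prob)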
    
df["topic_result"] = topic_result
df["topic_prob"] = topic_prob
df
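
When assigning a single topic to each text, it matters which entry of the model output is used. For one document the model returns a list of `(topic_id, probability)` pairs ordered by topic id, so the first pair is not necessarily the most probable one. A small sketch with made-up numbers (the helper name is hypothetical):

```python
def most_probable_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up output for a single document, ordered by topic id
doc_topics = [(0, 0.05), (3, 0.80), (5, 0.15)]

print("First pair:        ", doc_topics[0])
print("Most probable pair:", most_probable_topic(doc_topics))
```

Taking the first pair would label this document as topic 0 even though the model gives topic 3 a much higher probability.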

We can use a confusion matrix to see how our text is classified by the model vs. the actual topic:

**Confusion matrix:** (https://towardsdatascience.com/5-ways-to-use-a-seaborn-heatmap-python-tutorial-c79950f5add3)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
conf_matrix = pd.crosstab(df['target'], df['topic_result'], rownames=['Actual'], colnames=['Predicted'])
print (conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

The model doesn't seem to distinguish the subject matter very well: lots of entries are assigned to topic 0. Would the pattern become clearer if we set these aside?

Let's look at specific topics, 3 and 5, as they appear to have stronger relationships with certain categories.

In [None]:
topic_no = 3
print(lda_model.print_topics()[topic_no])

In [None]:
df.loc[df.topic_result == topic_no].head()

In [None]:
for i in range(5):
    print("---- Data row", i, "--------------------------")
    print("Actual category:", df.loc[df.topic_result == topic_no]['target'].iloc[i])
    print(df.loc[df.topic_result == topic_no]['text'].iloc[i])

----- 
**Filtering to topics with probability threshold**  
What if we only keep rows whose `topic_prob` is above a certain threshold?

First, look at the distribution of topic probabilities across the dataset:

In [None]:
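# Sort the values in descending order
sorted_values = df['topic_prob'].sort_values(ascending = False)

# Calculate cumulative frequencies: the number of texts with a probability at or above each value
cumulative_frequencies = pd.Series(range(1, len(sorted_values) + 1), index=sorted_values.index)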

In [None]:
sorted_values

In [None]:
# Create the cumulative frequency plot
plt.plot(sorted_values, cumulative_frequencies, marker='o')
plt.xlabel('Topic probability')
plt.ylabel('Cumulative Frequency')
plt.title('Cumulative Frequency Plot')
plt.grid(True)
plt.xlim(max(sorted_values), min(sorted_values))
plt.show()

Filter the dataset to look at just the texts whose topic probability meets or exceeds a threshold.

In [None]:
# Set probability threshold
prob_thresh = 0.9
df_prob_thresh = df.loc[df.topic_prob >= prob_thresh]
df_prop_over_thresh = round((df_prob_thresh.shape[0]/df.shape[0])*100, 1)

In [None]:
print(f"{df_prop_over_thresh}% of texts have a topic probability of {prob_thresh} or above.")
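
The same calculation can be repeated for several thresholds to see how quickly the dataset shrinks as the threshold rises. A minimal sketch on made-up probabilities (the function name and the values are assumptions for illustration):

```python
def share_at_or_above(probs, threshold):
    """Fraction of probabilities that are greater than or equal to the threshold."""
    return sum(p >= threshold for p in probs) / len(probs)

# Made-up topic probabilities for five texts
probs = [0.95, 0.91, 0.72, 0.55, 0.40]

for threshold in (0.9, 0.7, 0.5):
    print(threshold, share_at_or_above(probs, threshold))
```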

In [None]:
conf_matrix = pd.crosstab(df_prob_thresh['target'], df_prob_thresh['topic_result'], rownames=['Actual'], colnames=['Predicted'])
print (conf_matrix)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

In [None]:
topic_no = 4
print(lda_model.print_topics()[topic_no])

In [None]:
df_prob_thresh.loc[df_prob_thresh.topic_result == topic_no].head()

-----

## Evaluating the model

How can we judge how well our topic model works?

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#13viewthetopicsinldamodel

In [None]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

- The lower the perplexity, the better the model.
- The higher the topic coherence, the more human-interpretable the topics.

In [None]:
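# Compute the per-word likelihood bound. Despite its name, log_perplexity returns this bound
# rather than the perplexity itself, so values closer to zero (less negative) are better.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))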

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
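
A note on reading the first number: gensim's `log_perplexity` returns a per-word likelihood bound (a negative number) rather than the perplexity itself. The sketch below assumes the relationship gensim reports in its own log output, perplexity = 2^(-bound). The function name and the example bounds are made up.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# A bound closer to zero corresponds to a lower (better) perplexity
print(perplexity_from_bound(-8.0))
print(perplexity_from_bound(-9.0))
```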

-----
-----
## Improving the model

It could be that LDA doesn't work particularly well on this data because of the short text length.


**ChatGPT: Does LDA work well for short texts?**

_LDA is generally more effective when applied to longer texts compared to short texts. This is because short texts often lack sufficient context and word co-occurrence patterns to accurately determine topic assignments. However, there are variations and adaptations of LDA that have been developed specifically for short texts._

_When dealing with short texts, such as tweets or product reviews, the limited amount of text can lead to sparsity and ambiguity in the word distributions. This can make it challenging for LDA to reliably identify meaningful topics. Additionally, short texts often lack the necessary depth and breadth of information that longer documents provide._

In [None]:
# Explore number of words in texts
df["words"] = df["text"].apply(lambda n: len(n.split()))

In [None]:
# Plotting the histogram
plt.hist(df['words'], bins='auto')

# Adding labels and title
plt.xlabel('Number of words')
plt.ylabel('Count')

# Setting the x-axis limits
x_min = 0  # Replace with the desired minimum x-axis value
x_max = 500  # Replace with the desired maximum x-axis value
plt.xlim(x_min, x_max)

# Display the plot
plt.show()

LDA typically doesn't work so well for texts of 50 words or fewer. What proportion of our data is this true for?

In [None]:
df.loc[df.words <= 50].shape[0] / df.shape[0]

----
----
## New model
Note: this model doesn't use bigrams or lemmatization, so any change in results could also be down to removing those steps.

It would be interesting to run the model on just the longer texts.   
This wouldn't be an option when applying the technique in practice, but here it is a useful way to see whether LDA becomes more effective with longer texts.

In [None]:
# Remove shorter texts (copy so that columns can be added later without a SettingWithCopyWarning)
df_long = df.loc[df.words > 50].copy()
df_long.shape

Run basic model (note, this doesn't include the use of bigrams):

In [None]:
data_long = df_long.text_processed.values.tolist()
data_words_long = list(sent_to_words(data_long))
# remove stop words
data_words_long = remove_stopwords(data_words_long)
print(data_words_long[:1][0][:30])

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word_long = corpora.Dictionary(data_words_long)

# Create Corpus
texts_long = data_words_long

# Term Document Frequency
corpus_long = [id2word_long.doc2bow(text) for text in texts_long]

# View
print(corpus_long[:1][0][:30])

We can see how the text is reduced to a bag of words:

In [None]:
print(texts_long[10])
print(df_long.text_processed.iloc[10])

---------
**Model training:**

To keep things simple, we’ll keep all the parameters to default except for inputting the number of topics. 

In [None]:
from pprint import pprint

# number of topics
num_topics = 7 # This is our expected number of categories

# Build LDA model
lda_model_long = gensim.models.LdaMulticore(corpus=corpus_long,
                                       id2word=id2word_long,
                                       num_topics=num_topics)
# Print the keywords for each topic
pprint(lda_model_long.print_topics())
doc_lda_long = lda_model_long[corpus_long]

These topics don't look very good - from the words they all look pretty similar.

-------
**Analyzing LDA model results:**

In [None]:
import pyLDAvis.gensim
import pyLDAvis
import os

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_data_filepath = os.path.join('./results/ldavis_prepared_'+str(num_topics))
# # this is a bit time consuming - make the if statement True
# # if you want to execute visualization prep yourself
if 1 == 1:
    LDAvis_prepared_long = pyLDAvis.gensim.prepare(lda_model_long, corpus_long, id2word_long)

In [None]:
LDAvis_prepared_long

-----

**Evaluating topic predictions**  
Generate a list of assigned topics and probabilities for each text:

In [None]:
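topic_result = []
topic_prob = []
for bow in corpus_long:
    # lda_model_long[bow] returns (topic_id, probability) pairs ordered by topic id,
    # so take the most probable pair rather than the first one
    best_topic, best_prob = max(lda_model_long[bow], key=lambda pair: pair[1])
    topic_result.append(best_topic)
    topic_prob.append(best_prob)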

In [None]:
df_long["topic_result"] = topic_result
df_long["topic_prob"] = topic_prob
df_long

We can use a confusion matrix to see how our text is classified by the model vs. the actual topic:

**Confusion matrix:** (https://towardsdatascience.com/5-ways-to-use-a-seaborn-heatmap-python-tutorial-c79950f5add3)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
conf_matrix_long = pd.crosstab(df_long['target'], df_long['topic_result'], rownames=['Actual'], colnames=['Predicted'])
print (conf_matrix_long)

In [None]:
#Plot confusion matrix heatmap
plt.figure(figsize=(10, 10))
sns.set(font_scale=1.5)

sns.heatmap(conf_matrix_long,
            cmap='coolwarm',
            annot=True,
            fmt='.5g',
            vmax=200)

plt.xlabel('Predicted',fontsize=22)
plt.ylabel('Actual',fontsize=22)

This looks like it might be working slightly better: topic 2 seems to cover religion and politics, which could share similar language, while topic 5 covers computing and science, which are related.

In [None]:
topic_no = 2
print(lda_model_long.print_topics()[topic_no])

In [None]:
df_long.loc[df_long.topic_result == topic_no].head()

- The lower the perplexity, the better the model.
- The higher the topic coherence, the more human-interpretable the topics.

In [None]:
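# Compute the per-word likelihood bound. Despite its name, log_perplexity returns this bound
# rather than the perplexity itself, so values closer to zero (less negative) are better.
print('\nPer-word likelihood bound: ', lda_model_long.log_perplexity(corpus_long))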

# Compute Coherence Score
coherence_model_lda_long = CoherenceModel(model=lda_model_long, texts=data_words_long, dictionary=id2word_long, coherence='c_v')
coherence_lda_long = coherence_model_lda_long.get_coherence()
print('\nCoherence Score: ', coherence_lda_long)