# Topic Modeling with Gensim LDA
This code replicates the Google Colab topic modeling notebook for use on Jupyter Notebook or another local Python environment.

Below, Gensim is used to create LDA topic models of a pre-loaded dataframe of texts. Methods for evaluating topic coherence and analyzing topic output are also demonstrated. 

LDA topic modeling assumes that each document is a mixture of topics (the number of topics is set by the researcher) and that each topic is a mixture of words. To determine which words belong to each topic, the LDA model starts by randomly assigning each word in a set of documents to a topic, then iterates through the assignments and adjusts them until each word is assigned to the topic where it has the highest probability of belonging.
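
The "mixture" assumption can be made concrete with a toy calculation. The sketch below is not how Gensim fits a model; it only shows how, under the LDA assumption, the probability of a word in a document combines the document's topic shares with each topic's word probabilities. The vocabulary, topic names, and all numbers are invented for illustration.

```python
# Toy illustration of the LDA assumption (all values are made up).
vocab = ["ship", "planet", "robot", "love", "family"]

# Each topic is a probability distribution over the vocabulary.
topic_word = {
    "space": [0.5, 0.4, 0.1, 0.0, 0.0],
    "home":  [0.0, 0.0, 0.1, 0.4, 0.5],
}

# Each document is a probability distribution over topics.
doc_topic = {"space": 0.7, "home": 0.3}

# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
doc_word = [
    sum(doc_topic[t] * topic_word[t][i] for t in topic_word)
    for i in range(len(vocab))
]

for word, p in zip(vocab, doc_word):
    print(f"{word}: {p:.2f}")
```

Training an LDA model runs this logic in reverse: it observes only the words in each document and estimates the two sets of distributions that best explain them.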

This code is adapted from [Intro to Topic Modeling with Gensim and pyLDAvis](https://github.com/hawc2/text-analysis-with-python/blob/master/Topic_Modeling.ipynb).

LDA topic modeling can be used with aggregated or disaggregated data.
It works well with disaggregated texts from the Text Sectioning and Disaggregation code from [this repository](https://github.com/SF-Nexus/Extracted-Features).


## Install and Load Packages
This code requires three main packages:
- **NLTK:** Cleaning disaggregated data
- **Gensim:** Preprocessing data and creating word embeddings, coherence models and topic models
- **pyLDAvis:** Visualizing topic models

Several other packages for wrangling and processing the data, such as io and pandas, will also be imported.

**BEFORE USING THIS CODE FOR THE FIRST TIME**:

This code was created using a Jupyter Notebook through Anaconda. Before running the code for the first time, **create a new environment in Anaconda** where all packages and libraries will be stored for future usages. Learn how to create an environment using Anaconda Navigator [here.](https://wiki.math.ntnu.no/anaconda/createenvironment)

When using a new environment for the first time, several packages will need to be installed. It is recommended that you install these directly in the terminal rather than through Jupyter Notebook (though both are possible). To install in terminal, open a new terminal window, make sure you are in the correct environment, and input the following commands: 

```conda install nltk```
```conda install gensim```
```conda install pyLDAvis```

Once these packages are installed, they can be imported below. *You only need to install these the FIRST time you are using this code in a new environment.*

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
#Get dictionary of English words to keep 
from nltk.corpus import words
nltk.download('words')
nltk.download('wordnet')
from nltk import WordNetLemmatizer
!pip install wordcloud
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import multiprocessing
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import io
import os
import re
from pprint import pprint

## Retrieve and Convert Corpus to Data Frame

This code requires the upload of a previously created dataframe which contains a corpus of disaggregated texts. For optimal use, this dataframe should also contain labels associated with each text (e.g. book or chapter numbers). 

In [None]:
#Get current working directory
path = os.getcwd()
print(path)

#Change working directory (os.chdir returns None, so its result is not assigned)
os.chdir("/home/dssadmin/Desktop/")

#Upload dataframe
#df = pd.read_csv('mellon_text_ebook_chunks.csv')
df = pd.read_csv('mellon_txt_chunks_clean.csv')

df

In [None]:
#Add values in text column of choice to new list 
data = df.Text.values.tolist()

#Define function to perform simple preprocessing on text
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

#Run preprocessing on data list
data_words = list(sent_to_words(data))
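
To make the preprocessing step less of a black box, here is a rough pure-Python approximation of what `simple_preprocess(..., deacc=True)` does: strip accents, lowercase, split into alphabetic tokens, and drop very short or very long tokens. The function name `approx_simple_preprocess` is hypothetical, and the length limits (2 and 15) reflect Gensim's documented defaults as I understand them; treat this as a sketch rather than a replacement for the library call.

```python
import re
import unicodedata

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess(text, deacc=True)."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Lowercase and keep runs of letters; digits and punctuation split tokens.
    tokens = re.findall(r"[^\W\d_]+", no_accents.lower())
    # Drop tokens that are too short or too long.
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

tokens = approx_simple_preprocess("The café's 2 robots didn't sleep; I watched.")
print(tokens)
```

Note that contractions are split at the apostrophe and the one-letter fragments are discarded, which is why fragments such as "didn" can show up in topic word lists.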

## Building Dictionary and Corpus
Once the dataset is cleaned, two inputs must be created to run the topic model. The dictionary maps every word in the corpus to a unique id number. The variable `corpus` holds a bag-of-words representation of each document, that is, the frequency of each word in that document. Another optional cleaning measure can be applied here: removing words that are extremely rare (appearing in fewer than n texts) or extremely common (appearing in more than a set proportion of texts).
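
As a sketch of what these two inputs look like, the following pure-Python example builds a word-to-id mapping and a bag-of-words corpus for three invented documents. It mirrors the idea behind `Dictionary` and `doc2bow` but is not Gensim's implementation; the names `token2id` and `to_bow` are used here only for illustration.

```python
from collections import Counter

# Three tiny, already-tokenized documents (invented for illustration).
docs = [
    ["robot", "ship", "robot"],
    ["ship", "planet"],
    ["robot", "planet", "planet", "moon"],
]

# Dictionary step: give each unique word an integer id,
# assigned in order of first appearance.
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))

def to_bow(doc):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())

# Corpus step: one bag-of-words list per document.
bow_corpus = [to_bow(doc) for doc in docs]

print(token2id)
print(bow_corpus)
```

Word order is discarded in this representation; only which words occur, and how often, is passed to the topic model.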

In [None]:
# Import dictionary
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(data_words)

#Calculate term document frequency for each word in dataset
corpus = [dictionary.doc2bow(doc) for doc in data_words]
corpus

In [None]:
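# OPTIONAL cleaning: filter out words that occur in fewer than 2 documents, or in more than 80% of the documents.
# If you enable this, re-run the corpus creation cell above afterwards so the word ids in corpus match the filtered dictionary.
#dictionary.filter_extremes(no_below=2, no_above=0.8)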
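
The effect of the `no_below` and `no_above` thresholds can be sketched without Gensim by counting document frequencies directly. The documents below are invented, and the filtering rule is my reading of how `filter_extremes` applies the two thresholds; the library additionally caps the vocabulary size, which is not shown here.

```python
from collections import Counter

# Four tiny tokenized documents (invented for illustration).
docs = [
    ["robot", "ship", "the"],
    ["ship", "planet", "the"],
    ["robot", "planet", "moon", "the"],
    ["ship", "the"],
]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

no_below = 2    # keep words that appear in at least this many documents
no_above = 0.8  # ...and in at most this fraction of all documents

kept = sorted(
    word for word, df in doc_freq.items()
    if df >= no_below and df / len(docs) <= no_above
)
print(kept)
```

Here "moon" is dropped as too rare (one document) and "the" as too common (every document).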

In [None]:
#Get number of unique tokens and documents
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

# Multicore Processing (Use this One)

To speed up runtime, use the LdaMulticore model option and set number of workers to the number of your machine's cores minus one. 
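
One small caveat with "cores minus one": on a single-core machine it evaluates to zero workers. A minimal guard, assuming you always want at least one worker, looks like this:

```python
import multiprocessing

cores = multiprocessing.cpu_count()

# Leave one core free for the rest of the system,
# but never drop below a single worker.
workers = max(1, cores - 1)

print(f"{cores} core(s) detected; using {workers} worker(s)")
```

The resulting `workers` value can then be passed to the model in place of `cores-1`.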

In [None]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
cores

In [None]:
#Test LDA model multicore
from gensim.models import ldamulticore
from gensim import corpora, models

import time
start_time = time.time()

# Set training parameters.
num_topics = 100
chunksize = 1000
passes = 400
iterations = 2000
eval_every = None  # Don't evaluate model perplexity, takes too much time.
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

lda_model2 = models.LdaMulticore(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    workers = cores-1,
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_model2.log_perplexity(corpus))  # log_perplexity returns a per-word likelihood bound; values closer to zero (higher) indicate a better fit.

# Compute C_V Coherence Score
coherence_model_lda2 = CoherenceModel(model=lda_model2, texts=data_words, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda2.get_coherence()
print('\nCoherence Score: ', coherence_lda)
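
The number returned by `log_perplexity` is a per-word likelihood bound rather than perplexity itself. As far as I am aware, Gensim's training log converts the bound to a perplexity estimate as 2 raised to the negative bound. The sketch below applies that conversion to two hypothetical bound values (not output from this notebook) to show which direction is "better".

```python
# Hypothetical per-word bounds, standing in for log_perplexity output.
bound_a = -8.5
bound_b = -8.0

# Perplexity estimate: 2 raised to the negative bound.
perplexity_a = 2 ** (-bound_a)
perplexity_b = 2 ** (-bound_b)

print(f"bound {bound_a} -> perplexity {perplexity_a:.1f}")
print(f"bound {bound_b} -> perplexity {perplexity_b:.1f}")
```

The model with the higher bound (closer to zero) has the lower perplexity, so when comparing raw `log_perplexity` values, higher is better.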

## Examine Topic Model Output
Once the model has been run, it is possible to retrieve the top words in each topic and visualize the model. These methods can help further assess model coherence and point to future directions for analysis:
- **Top Words Per Topic:** Evaluate to what extent each topic contains semantically similar words, and how/why these words might be meaningful in the context of the corpus
- **Visualizations:** Determine topic relatedness (how far apart the topic circles are on the plane) and topic prevalence (the size of each circle corresponds to the topic's prevalence in the corpus)
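
When examining top words, note that `print_topics` returns each topic as a single formatted string of the form `weight*"word" + weight*"word"`. If you need the words and weights as separate values (for example, after saving the strings to a CSV), a regular expression can split them apart. The topic string below is a hypothetical example, not output from this model.

```python
import re

# Hypothetical topic string in the format produced by print_topics.
topic_string = '0.042*"ship" + 0.031*"planet" + 0.027*"crew"'

# Capture each weight and the quoted word that follows it.
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
]
print(pairs)
```

When working with a live model, `show_topic` returns (word, weight) pairs directly, so this parsing is only needed for strings that have already been exported.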

Additional reading: 

https://towardsdatascience.com/6-tips-to-optimize-an-nlp-topic-model-for-interpretability-20742f3047e2

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#15visualizethetopicskeywords

https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/


In [None]:
#Print n number of words in each topic (num_topics=-1 returns every topic; the default shows only 20)
for idx, topic in lda_model2.print_topics(num_topics=-1, num_words=20):
    print('Topic: {} \nWords: {}'.format(idx, topic))

In [None]:
#Create dictionary with topics and words (num_topics=-1 returns every topic; the default shows only 20)
my_dict = {"Topic":[],"Words":[]}
for idx, topic in lda_model2.print_topics(num_topics=-1, num_words=20):
    my_dict["Topic"].append(idx)
    my_dict["Words"].append(topic)

#Convert dictionary to dataframe
topics_df = pd.DataFrame.from_dict(my_dict)
topics_df.head()

#Change path to where you want to save the files (os.chdir returns None, so its result is not assigned)
os.chdir("/home/dssadmin/Desktop/")

#Download dataframe as csv
topics_df.to_csv('LDA_topic_word_counts.csv', index=False)

In [None]:
#Remove custom words (stopwords and names not previously filtered out)
#Add words to this list as needed, e.g. ['wa']
custom_stop_words = []

#Define word cloud function
from wordcloud import WordCloud 
def create_wordcloud(model, topic):
    text = {word: value for word, value in model.show_topic(topic) if word not in custom_stop_words}
    wc = WordCloud(background_color="white", max_words=1000)
    wc.generate_from_frequencies(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title("Topic" + " "+ str(topic))
    plt.show()
    
#Ignore deprecation warnings
import warnings
warnings.filterwarnings("ignore")

#Create word clouds for every topic (topic ids start at 0)
for i in range(num_topics):
    create_wordcloud(lda_model2, topic=i)

In [None]:
#Create visualization of topic model above 
%matplotlib inline
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(topic_model=lda_model2, corpus=corpus, dictionary=dictionary, mds='mmds')
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)

In [None]:
pyLDAvis.save_html(vis, 'Mellon200_multicore.html')

## Find Top Topics Per Document

Find the topic with highest percentage in each document in corpus. 

Additional reading: 
https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf
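
The core of this step is simple: each document comes back from the model as a list of (topic id, probability) pairs, and the dominant topic is the pair with the largest probability. The sketch below uses hypothetical distributions in that shape rather than real model output.

```python
# Hypothetical per-document topic distributions,
# each a list of (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.10), (3, 0.85), (7, 0.05)],
    [(2, 0.55), (5, 0.45)],
    [(1, 0.33), (4, 0.67)],
]

# The dominant topic is the pair with the highest probability.
dominant = [max(dist, key=lambda pair: pair[1]) for dist in doc_topics]

for doc_no, (topic_id, prob) in enumerate(dominant):
    print(f"Document {doc_no}: topic {topic_id} ({prob:.0%})")
```

A document whose top two topics are close (like the second one here) is a reminder that the dominant topic alone can hide a genuinely mixed text.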

In [None]:
#Define function that retrieves dominant topic for each document and puts in dataframe
def format_topics_texts(ldamodel=None, corpus=corpus, texts=data):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        if not row:
            # No topic returned for this document: keep a placeholder so rows stay aligned with texts
            rows.append([np.nan, np.nan, ""])
            continue
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    text_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    text_topics_df = pd.concat([text_topics_df, contents], axis=1)
    return text_topics_df

In [None]:
#Run dominant topic function on corpus
df_topic_texts_keywords = format_topics_texts(ldamodel=lda_model2, corpus=corpus, texts=data_words)

# Format
df_dominant_topic = df_topic_texts_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)

In [None]:
df_dominant_topic['Title'] = df['Book + Chunk']
df_dominant_topic

In [None]:
#Download top topic per document df to csv
df_dominant_topic.to_csv('LDA_df_dominant_topic.csv', index=False)

## Get DataFrame with Each Topic and Keywords

In [None]:
#Keep only topic and keyword columns
topics_keywords_df = df_dominant_topic[['Dominant_Topic', 'Keywords']].copy()

#Remove duplicates
topics_keywords_df = topics_keywords_df.drop_duplicates()

In [None]:
#Sort topics in ascending order, then reset index so it follows the sorted order
topics_keywords_df = topics_keywords_df.sort_values(by='Dominant_Topic')
topics_keywords_df = topics_keywords_df.reset_index(drop=True)
topics_keywords_df

In [None]:
## Download to csv
topics_keywords_df.to_csv('LDA_topics_keywords_df.csv', index=False)

## Find Top Documents Per Topic
Calculate the top documents attributed to each topic in the model. 

Additional reading:

https://stackoverflow.com/questions/63777101/topic-wise-document-distribution-in-gensim-lda

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb 

https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/#7.-The-most-representative-sentence-for-each-topic 
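
The idea can be sketched in plain Python before applying it to the dataframe: group documents by their dominant topic, then keep the n documents with the largest topic share in each group. The rows and labels below are hypothetical, and `top_n` is an illustrative parameter.

```python
from collections import defaultdict
import heapq

# Hypothetical rows: (document label, dominant topic, topic share).
rows = [
    ("Book A, chunk 1", 0, 0.62),
    ("Book A, chunk 2", 1, 0.48),
    ("Book B, chunk 1", 0, 0.91),
    ("Book B, chunk 2", 1, 0.77),
    ("Book C, chunk 1", 0, 0.55),
]

# Group (share, label) pairs by topic.
by_topic = defaultdict(list)
for label, topic, share in rows:
    by_topic[topic].append((share, label))

# Keep the top_n documents with the largest share in each topic.
top_n = 2
top_docs = {
    topic: heapq.nlargest(top_n, docs)
    for topic, docs in sorted(by_topic.items())
}
print(top_docs)
```

Raising `top_n` gives several representative documents per topic, which is often more informative than a single one.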

In [None]:
#Add Book + Chunk labels to dataframe for easier ID
doc_names = df['Book + Chunk']
df_topic_texts_keywords = df_topic_texts_keywords.join(doc_names)
df_topic_texts_keywords

In [None]:
# Get most representative text for each topic 
#Display setting to show more characters in column
pd.options.display.max_colwidth = 100

#Create new dataframe and group topic keywords by dominant topic column
topics_sorted_df = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_texts_keywords.groupby('Dominant_Topic')

#Sort data by percent contribution and select highest n values for each topic
for i, grp in sent_topics_outdf_grpd:
    topics_sorted_df = pd.concat([topics_sorted_df, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

In [None]:
# Reset Index of new df
topics_sorted_df.reset_index(drop=True, inplace=True)

# Format
topics_sorted_df.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Representative Text", "Text Name"]
topics_sorted_df = topics_sorted_df.reindex(columns=['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text Name", "Representative Text"])
# Show
topics_sorted_df.head()

In [None]:
#Download top doc per topic dataframe to csv
topics_sorted_df.to_csv('top_doc_per_LDA_topic.csv', index=False)

# Sources

**More Examples of Topic Modeling Research:**

*   http://www.digitalhumanities.org/dhq/vol/11/2/000291/000291.html
*   http://www.cs.columbia.edu/~blei/papers/Blei2011.pdf
*   https://maria-antoniak.github.io/resources/2019_cscw_birth_stories.pdf 

**More Topic Modeling Tools:**

*   https://github.com/polsci/colab-gensim-mallet/blob/master/topic-modeling-with-colab-gensim-mallet.ipynb
*   https://github.com/laurejt/authorless-tms 
*   https://colab.research.google.com/github/kldarek/skok/blob/master/_notebooks/2021-05-27-Topic-Models-Introduction.ipynb


# Ignore the rest

## OPTIONAL STEP - Find Optimal Number of Topics

Calculating topic model coherence (i.e. similarity between the highest-scoring words in each topic) is one way to assess the optimal number of topics to use when creating and visualizing topic models. Below are two methods of calculating coherence: **C_V coherence**, which is calculated based on word co-occurrences, and **U_Mass coherence**, which is calculated based on how frequently high-scoring words co-occur in the same documents of the corpus. Both have been used in prior research and either may yield better results, depending on corpus specifications.

Additional readings: 

https://aclanthology.org/D12-1087.pdf 

https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html

### C_V Coherence Calculations

The function below calculates C_V coherence scores for n topic models run based on set parameters. Coherence is calculated on a range of 0 < x < 1 where higher-scoring models are assumed more coherent. For example, a model with a score of .50 is more coherent than a model with a .25 score.  
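
Once a list of (number of topics, coherence) pairs has been collected, picking a candidate is a matter of finding the highest score. The sketch below uses hypothetical scores. The tolerance step is an optional heuristic of my own (an assumption, not part of the original workflow): when several models score almost the same, it prefers the one with the fewest topics.

```python
# Hypothetical (num_topics, c_v coherence) pairs.
coherence = [(10, 0.41), (20, 0.47), (30, 0.52), (40, 0.515), (50, 0.49)]

# Highest-scoring model.
best_k, best_score = max(coherence, key=lambda pair: pair[1])

# Optional heuristic (an assumption): among models within a small
# tolerance of the best score, prefer the smallest number of topics.
tolerance = 0.01
near_best = [k for k, score in coherence if best_score - score <= tolerance]
chosen_k = min(near_best)

print(f"best k = {best_k} (score {best_score}); near-best = {near_best}; chosen = {chosen_k}")
```

Because LDA training is randomized, small differences in coherence between neighbouring values of k are usually not meaningful, which is the motivation for a tolerance band.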

In [None]:
import time
from tqdm import tqdm
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
from gensim.models import ldamulticore
from gensim import corpora, models
    
#Define list for model and coherence values
coherence = []

start_time = time.time()

#Find coherence for set range of models and append to list
for k in tqdm(range(2,200)):
    #print('Round: '+str(k))
    ldamodel = models.LdaMulticore(corpus, num_topics=k, \
               id2word = dictionary, workers = cores-1, eval_every = None)
    cm = gensim.models.coherencemodel.CoherenceModel(\
         model=ldamodel, texts=data_words,\
         dictionary=dictionary, coherence='c_v')   
                                                
    coherence.append((k,cm.get_coherence()))
    
end_time = time.time()

total_time = end_time - start_time

In [None]:
# Transpose coherence data
x, y = np.array(coherence).T
  
      
# plot our list in X,Y coordinates
optimal, = plt.plot(x, y)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()


In [None]:
#Get coherence score for each num topics sorted from highest to lowest
#Highest value will be optimal number of topics
Data = {'Num Topics': optimal.get_xdata(), 'Coherence': optimal.get_ydata()}
type(Data)
df_optimal = pd.DataFrame.from_dict(Data)
df_optimal.sort_values(by='Coherence',ascending=False).head()

### U_Mass Coherence Calculation
The function below calculates U_Mass coherence scores of n topic models run based on set parameters. Coherence is calculated on a range of -14 < x < 14, where scores are typically negative and values closer to zero indicate more coherent models. For example, a model with a score of -.4 is more coherent than a model with a -.9 score.
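
To show where U_Mass scores come from, here is a minimal sketch of a document co-occurrence coherence in the style of the original UMass formulation: for each pair of top words, take the log of (documents containing both words + 1) divided by (documents containing the higher-ranked word), then average. Gensim's own implementation may differ in smoothing and aggregation details, and the documents and word lists below are invented.

```python
import math
from itertools import combinations

# Four tiny documents represented as sets of words (invented).
docs = [
    {"ship", "planet", "crew"},
    {"ship", "crew"},
    {"planet", "moon"},
    {"ship", "planet"},
]

def doc_count(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def umass(top_words):
    """Average log conditional co-occurrence over ranked word pairs."""
    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        # +1 smoothing avoids log(0) when two words never co-occur.
        ratio = (doc_count(later, earlier) + 1) / doc_count(earlier)
        scores.append(math.log(ratio))
    return sum(scores) / len(scores)

together = umass(["ship", "crew"])   # these words share documents
apart = umass(["ship", "moon"])      # these words never co-occur
print(together, apart)
```

The pair that shares documents scores at zero, while the pair that never co-occurs scores well below it.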


In [None]:
#Define list for model and coherence values
coherence2 = []

#Find coherence for set range of models and append to list
for k in range(2,200):
    print('Round: '+str(k))
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(corpus, num_topics=k, \
               id2word = dictionary, eval_every = None)
    
    cm = gensim.models.coherencemodel.CoherenceModel(\
         model=ldamodel, texts=data_words,\
         dictionary=dictionary, coherence='u_mass')   
                                                
    coherence2.append((k,cm.get_coherence()))

In [None]:
# Transpose coherence data
x, y = np.array(coherence2).T
        
# plot our list in X,Y coordinates
optimal2, = plt.plot(x, y)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()


In [None]:
#Get coherence score for each num topics sorted from highest to lowest
#Highest value (closest to zero) will be optimal number of topics
Data = {'Num Topics': optimal2.get_xdata(), 'Coherence': optimal2.get_ydata()}
df_optimal2 = pd.DataFrame.from_dict(Data)
df_optimal2.sort_values(by='Coherence',ascending=False).head()

## Create Topic Models with Optimal Parameters (Single-Core Alternative to the Multicore Option Above)
Input the parameters of the topic model and run. The model has multiple parameters, including: 
- num_topics: Number of topics the model will generate (default = 100)
- chunksize: Number of documents processed at a time (default = 2000)
- passes: Number of times model is trained on corpus (default = 1)
- iterations: Number of times model "loops" over each document (default = 50)

Calculating coherence (above) helps determine topic number, and other methods can be used to determine appropriate values for the other parameters. 

**Chunk size:** Setting chunk size to a larger number than that of documents in the model ensures that all documents are processed at once (though this requires enough memory space). 

**Passes and Iterations:** A common way to determine the best number of passes and iterations is to train a topic model and check the log for the document convergence rate (what percentage of topic/word assignments attain stability). If convergence is low, increase the number of passes and iterations.

In general, as chunksize increases, passes and iterations should increase as well. Also keep in mind that corpus size may affect the number of topics: with smaller corpora, using too many topics will likely make them too general OR too limited to the context of only one text. Consider running multiple models and comparing coherence, perplexity, and top words per topic.
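
Checking the log by eye gets tedious over many passes, so the convergence rate can be pulled out with a regular expression. The log lines below are hypothetical examples written in the style of Gensim's "documents converged" messages. Gensim appears to emit these messages at the DEBUG level, so the logging level may need to be DEBUG for them to be written to the file.

```python
import re

# Hypothetical log lines in the style Gensim writes during training.
log_lines = [
    "2024-01-01 10:00:00,000:DEBUG:612/1000 documents converged within 40 iterations",
    "2024-01-01 10:00:05,000:INFO:PROGRESS: pass 0, at document #1000/1000",
    "2024-01-01 10:00:09,000:DEBUG:987/1000 documents converged within 40 iterations",
]

# Capture "<converged>/<total> documents converged".
pattern = re.compile(r"(\d+)/(\d+) documents converged")

rates = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        converged, total = int(match.group(1)), int(match.group(2))
        rates.append(converged / total)

print(rates)
```

To use this on a real run, replace `log_lines` with the lines read from the log file. A rate that climbs toward 1.0 by the final passes suggests the chosen passes and iterations are sufficient.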

Additional reading: 

https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#12buildingthetopicmodel

In [None]:
#Import logging to gauge passes and iterations; will output file to working directory
#Gensim writes its "documents converged" messages at the DEBUG level, so use DEBUG rather than INFO
import logging
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.DEBUG)

In [None]:
# Train LDA model.
from gensim.models import LdaModel

start_time = time.time()

# Set training parameters.
num_topics = 100
chunksize = 1000
passes = 20
iterations = 40
eval_every = 1  # Evaluate perplexity at every update so the log can be checked; set to None to skip this (much faster).

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

print("--- %s seconds ---" % (time.time() - start_time))

In [None]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # log_perplexity returns a per-word likelihood bound; values closer to zero (higher) indicate a better fit.

# Compute C_V Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_words, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)