# Lesson: Topic Modeling

by Devon Mordell and Jay Brodeur for [DMDS 2023-2024](https://scds.github.io/dmds23-24/textanalyses.html).

In this notebook, you'll have a chance to gain hands-on experience performing topic modeling using Python. Topic modeling is a text analysis method that tries to get at what a corpus is about by identifying topics, or groups of words that commonly appear together within it. The assumption is that words that frequently appear together are semantically related. The various words within a topic give a more nuanced impression of what is discussed within a text than a simple word count. Topic modeling is best illustrated by trying it out!

You may also wish to refer to the online workshop, [Exploring Themes with Topic Modeling](https://scds.github.io/text-analysis-3/), for a more detailed explanation of the code and background on topic modeling.

The code in the notebook draws on WJB Mattingly's [Implementing LDA in Python](https://github.com/wjbmattingly/topic_modeling_textbook/blob/main/03_03_lda_model_demo.ipynb). We encourage you to check out more of Mattingly's work on [GitHub](https://github.com/wjbmattingly) - Mattingly has done some incredible work on natural language processing (NLP) for the Humanities.

# 1. Install required packages
Most of the libraries we will use in the script are already installed within Google Colab's runtime environment. If you receive a "ModuleNotFoundError: No module named \[name]" error when you run the code in step 2, then you can add the name of the package after `pyLDAvis` in the code below, separated with a space (e.g. `%pip install pyLDAvis spacy`).

Libraries, delightfully, can have dependencies on each other; sometimes an update to one breaks another. In the cell below, we uninstall the version of numpy that "ships" with Colab and update it to avoid an error with the library central to our script, Gensim. To clear the previous version of the library, we must restart the runtime, which can be done by going to "Runtime" in the Colab menu above and selecting "Restart Runtime".

In [None]:
# Reinstall numpy, install pyLDAvis and update Gensim with pip
%pip uninstall -y numpy # Uninstall Colab version of numpy to address conflict with gensim (-y skips the confirmation prompt)
%pip install -U numpy pyLDAvis gensim # Reinstall numpy, install pyLDAvis and update gensim

# Important! Restart runtime after running this cell
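
After restarting the runtime, it can be reassuring to confirm which versions actually ended up installed. The cell below is a minimal sketch using only Python's standard library; the helper name `installed_versions` is our own invention for illustration, and the package list is simply the set of libraries named in this notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Return a dict mapping each package name to its installed version (or None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed in this runtime
    return versions

# Packages used in this notebook; a value of None means the package still needs to be installed
print(installed_versions(["numpy", "gensim", "pyLDAvis", "spacy"]))
```

If any package reports `None`, add its name to the `%pip install` line above and rerun the install cell.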

# 2. Import internal and external libraries
The topic modeling script draws on numerous Python libraries to help us work with text data. Our next step is to import them so that the script can make use of them.

An ignorable warning about NVML may appear; you can... well, ignore it.


In [None]:
# Import internal libraries: glob for grabbing docs from directory
import glob

# Import external libraries: gensim for preprocessing and LDA
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Import external libraries: spaCy for tokenization, lemmatization and stopwords
import spacy
from spacy.lang.en import English                 # For other languages, refer to the SpaCy website: https://spacy.io/usage/models
from spacy.lang.en.stop_words import STOP_WORDS   # Also need to update stopwords for other languages (e.g. spacy.lang.uk.stop_words for Ukrainian)

# Import external libraries: pyLDAvis for visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# 3. Read files containing text data for topic modeling
Our Python script reads data from text files with the .txt extension (one significant difference from Mattingly's script; if you are working with JSON data, [Mattingly goes through how to prepare it](https://github.com/wjbmattingly/topic_modeling_textbook/blob/main/03_03_lda_model_demo.ipynb)).

You will need to upload the files to the Google Colab runtime environment. Create a folder called "dir" (must be lowercase) in the "Files" panel and upload your text file or files to the folder. You can refer to Jay's lesson on [Basic text prep with Python](https://colab.research.google.com/drive/1ynkHM3WOQUGj9mj8R060p3BYqI6ThbAj) on how to upload files for use in Colab (in step 3.).

You can use your own documents or you can use a sample corpus from [Project Gutenberg](https://www.gutenberg.org/) (copy the "Plain Text UTF-8" file and remove the preamble and end text). If using your own documents, they must be plain text documents with the .txt file extension. You can even use a single file, but topic modeling works best with large corpora.

Depending on the encoding of the files, you may need to change the encoding specified in the code block below. Other possible encodings include "mac-roman" (if using text files exported from certain text editors on a Mac, for example).

In [None]:
# Read files from directory and create list from contents
file_list = glob.glob('./dir' + '/*.txt') # directory containing text (.txt) files

texts = []

for filename in file_list:
    with open(filename, mode = 'r', encoding = 'utf-8') as f: # specify encoding as appropriate
        texts.append(f.read())

print(texts[0]) # print the first .txt file in the list to confirm
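
If you are not sure which encoding your files use, one option is to try a short list of encodings in order and keep the first one that decodes without error. The cell below is a sketch of that idea rather than part of the original script: the function name `read_with_fallback` and the default list of encodings are assumptions made for illustration, and the demonstration writes a throwaway file so that the cell runs on its own.

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "mac-roman")):
    """Return (text, encoding) using the first encoding that decodes the file without error."""
    for encoding in encodings:
        try:
            with open(path, mode="r", encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue  # Try the next encoding in the list
    raise ValueError(f"None of {encodings} could decode {path}")

# Demonstration with a temporary file saved in the mac-roman encoding
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write("café".encode("mac-roman"))
    sample_text, used_encoding = read_with_fallback(sample_path)

print(used_encoding)
```

Keep in mind that decoding without an error is not proof that the encoding is correct: single-byte encodings such as mac-roman will accept almost any file, so it is still worth printing a sample of the text and checking accented characters by eye.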

# 4. Identify stopwords for the corpus

Stopwords are commonly used words - such as "and," "the," "we" and... well, "as" - that can be expected to be found in any text and so can be omitted from our analysis because they are unlikely to be of any interest. Stopwords will be removed in step 5., but let us first review what words are in the list and evaluate whether we need to add or remove words.

If you would like to add any stopwords - since they can be context-specific - you can add them by uncommenting (deleting the "#") the `STOP_WORDS.add("[your word here]")` line and adding your word in place of \[your word here] (your word must be surrounded with quotation marks as it is of string data type). The simplified technique below only allows one word to be input at a time but you can rerun the block below to add more words, which will be retained in the current runtime.

You can use `STOP_WORDS.remove("[your word here]")` to remove them in the same way.

You might find yourself coming back to step 4. once you have run the visualization in step 11. if you notice words appearing in topics that you would rather omit. Data visualization is not only a way to perform initial data analysis; it also helps us to better understand what is required in our preprocessing stage.

In [None]:
# Print the initial set of stopwords from SpaCy
# Also available at: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py
print(STOP_WORDS)

# Add a word to, or remove a word from, the list
# STOP_WORDS.add("[word]")
# STOP_WORDS.remove("[word]")

# Print to confirm that your word has been added or removed
# print(STOP_WORDS)
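
If you have several context-specific stopwords, adding them one at a time gets tedious. Because the stopword list is an ordinary Python set, several words can be added or removed in one go. The sketch below uses a hypothetical helper, `update_stopwords`, and a small stand-in set so that it runs on its own; in the notebook you would pass SpaCy's `STOP_WORDS` instead. It lowercases each word on the assumption that the stopword list itself is lowercase.

```python
def update_stopwords(stop_words, add=(), remove=()):
    """Add and remove several stopwords in place; returns the same set for convenience."""
    stop_words.update(word.lower() for word in add)
    stop_words.difference_update(word.lower() for word in remove)
    return stop_words

# A small stand-in set; in the notebook you would pass STOP_WORDS instead
demo_stopwords = {"and", "the", "we", "as"}
update_stopwords(demo_stopwords, add=["said", "Chapter"], remove=["we"])
print(sorted(demo_stopwords))
```

Unlike `STOP_WORDS.remove(...)`, `difference_update` does not raise an error if a word is not in the set, which is convenient when rerunning the cell.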

# 5. Tokenize and lemmatize text data, and remove stopwords
As you observed from the output in step 3., we are currently working with unstructured data - paragraphs and sentences that we can make sense of but a computer has a harder time working with. Our next step is to turn that data into tokens, or a list of individual words.

Since topic modeling is dependent on term frequencies - how often a given word appears in a segment of text and with other words - tokenizing allows for the computer to count and classify comparable terms.

Similarly, in the lemmatization process, SpaCy's lemmatizer tool analyzes tokens to determine their "root" form and reduces them to it. Otherwise, a computer would treat "banana" and "bananas" as two distinct terms when - for the purposes of our analysis - they are not.

The SpaCy `nlp` pipeline - or workflow - also removes the stopwords we added in step 4.

**Note:** there is a maximum of one million characters in the default nlp pipeline; if you try to run the script on text data with more than one million characters, you will get an error to that effect.

The one million character threshold is based on (anticipated) available RAM. You can change the maximum number of characters using `nlp.max_length` as below:

`nlp = spacy.load('en_core_web_sm')`

`nlp.max_length = 1500000 # Or other value, given sufficient RAM`

Increasing the number of characters will increase the amount of RAM required to run the text data through the `nlp` pipeline. A [helpful troubleshooting post on StackExchange](https://datascience.stackexchange.com/questions/38745/increasing-spacy-max-nlp-limit) recommends omitting the RAM-intensive parts of the pipeline to free it up for the higher character count:

`doc = nlp(text, disable = ['ner', 'parser'])`

If your machine balks (i.e. stops responding, serves you up a spinning beach ball of death or otherwise appears unhappy) at the higher number of characters after modifying the pipeline steps, you may have to break your document(s) into smaller chunks. With some text analysis techniques (e.g. NER), splitting a corpus has no effect on the end result - but, needless to say, it will make a big difference to the topics created. If you find yourself having to work with smaller subsets of your text data, select or arrange them thoughtfully!
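
If you do need to break a long document into smaller pieces, splitting on paragraph boundaries keeps related sentences together. The cell below is one possible sketch, not part of the original script: `split_into_chunks` is a hypothetical helper, and it assumes that paragraphs in your files are separated by blank lines. A paragraph that is longer than the limit on its own is simply cut at fixed offsets.

```python
def split_into_chunks(text, max_chars=1_000_000):
    """Split text into chunks of at most max_chars characters, breaking on blank lines where possible."""
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Cut any single paragraph that exceeds the limit into fixed-size pieces
        pieces = [paragraph[i:i + max_chars] for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate      # Still fits: keep growing the current chunk
            else:
                chunks.append(current)   # Full: store it and start a new chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks

# Tiny demonstration with an artificially small limit
demo_chunks = split_into_chunks("aaa\n\nbbb\n\nccc", max_chars=8)
print(demo_chunks)
```

Each chunk then becomes its own "document" in the model, which is exactly why the way you divide the text will shape the topics you get.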

In [None]:
# Lemmatize tokens
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):   # Doing part of speech (PoS) tagging helps with lemmatization
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]) # For other languages, use models from step 2.
    # nlp.max_length = 1500000 # Uncomment if getting max character error
    texts_out = []
    for text in texts:
        doc = nlp(text)
        new_text = []
        for token in doc:
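            # Keep tokens with an allowed part of speech that are not in the stopword list from step 4.
            if token.pos_ in allowed_postags and token.lower_ not in STOP_WORDS and token.lemma_.lower() not in STOP_WORDS:
                new_text.append(token.lemma_)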
        final = " ".join(new_text)
        texts_out.append(final)
    return (texts_out)

lemmatized_texts = lemmatization(texts)
print(lemmatized_texts[0][0:90])

# 6. Preprocess texts using the Gensim library
Our next step is to put our tokens (words / terms) into a list that Gensim can work with. Note the difference in output between step 5. and step 6.: after running step 6., each document is a list of individual strings (one per token) rather than one long string.

In [None]:
# Preprocess texts
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc = True) # If working with other languages, you can set deacc to False
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)

print(data_words[0][0:20])
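
To get a feel for what this preprocessing does to a sentence, the cell below gives a rough stand-in written with the standard library only. It is an approximation for illustration and not Gensim's actual implementation: it lowercases the text, optionally strips accents (the effect of `deacc`), keeps alphabetic tokens, and drops very short or very long ones. The function name `rough_preprocess` and the length limits of 2 and 15 characters are assumptions chosen to resemble Gensim's defaults.

```python
import re
import unicodedata

def rough_preprocess(text, deacc=True, min_len=2, max_len=15):
    """Approximate lowercase tokenization with optional accent stripping."""
    text = text.lower()
    if deacc:
        # Decompose accented characters, then drop the combining accent marks
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[^\W\d_]+", text)  # Runs of letters only
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("Café Bananas, a b!"))
print(rough_preprocess("Café Bananas, a b!", deacc=False))
```

Comparing the two printed lists shows why you may want to set `deacc` to `False` for languages in which accents distinguish one word from another.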

# 7. Combine bigrams and trigrams
Bigrams and trigrams are sets of consecutive words - two and three in a row, respectively. If they recur often enough within the text, Gensim will connect them with an underscore ('human_right' or 'streaming_service') when creating the token list to provide additional context around the sense in which the term is being used. The term “high_speed_internet” is more semantically rich than “high,” “speed” or “internet” on their own.

The code below is from [Selva Prabhakaran](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#9createbigramandtrigrammodels).

In [None]:
bigram_phrases = gensim.models.Phrases(data_words, min_count=5, threshold=50)
trigram_phrases = gensim.models.Phrases(bigram_phrases[data_words], threshold=50)

bigram = gensim.models.phrases.Phraser(bigram_phrases)
trigram = gensim.models.phrases.Phraser(trigram_phrases)

def make_bigrams(texts):
    return [bigram[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram[bigram[doc]] for doc in texts]

data_bigrams = make_bigrams(data_words)
data_bigrams_trigrams = make_trigrams(data_bigrams)

print(data_bigrams_trigrams[0])
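
The basic idea behind joining frequent word pairs can be illustrated with plain Python. The sketch below is a deliberate simplification: it only counts how often two words appear side by side and joins pairs that meet a minimum count, whereas Gensim's `Phrases` model also scores each pair against its `threshold` setting. The function names and the sample documents are made up for illustration.

```python
from collections import Counter

def count_bigrams(docs):
    """Count every pair of adjacent tokens across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def join_frequent_bigrams(tokens, counts, min_count=2):
    """Join adjacent tokens with an underscore when the pair occurs at least min_count times."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and counts[pair] >= min_count:
            out.append("_".join(pair))
            i += 2  # Skip both words of the joined pair
        else:
            out.append(tokens[i])
            i += 1
    return out

demo_docs = [["human", "right", "law"], ["human", "right", "court"]]
demo_counts = count_bigrams(demo_docs)
print(join_frequent_bigrams(demo_docs[0], demo_counts))
```

Raising `min_count` here, like raising `min_count` or `threshold` in the cell above, results in fewer joined terms.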

# 8. Create a dictionary of words
To keep track of the words in the corpus, we will create a dictionary that assigns an index to each unique word. Each document is then represented as a list of tuples, where the first number is the index of the word and the second number is how frequently the word appears in that document.

In [None]:
# Create dictionary of all words in texts
id2word = corpora.Dictionary(data_bigrams_trigrams)

# Represent each document as a list of (index, frequency) tuples
corpus = []
for text in data_bigrams_trigrams:
    new = id2word.doc2bow(text)
    corpus.append(new)

print(corpus[0][0:20])  # Prints the first twenty (index, frequency) tuples for the first document
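
If the tuples look cryptic, it may help to see the same idea built by hand. The cell below is a simplified sketch in plain Python, not Gensim's implementation: it numbers words in order of first appearance (Gensim may assign indexes in a different order) and then counts how often each index occurs in a document. The names `build_vocab` and `to_bow` and the sample documents are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an index to each unique token, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return a sorted list of (index, frequency) tuples for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

demo_docs = [["banana", "apple", "banana"], ["apple", "cherry"]]
demo_vocab = build_vocab(demo_docs)
demo_bow = to_bow(demo_docs[0], demo_vocab)
print(demo_vocab)
print(demo_bow)
```

In the first document, "banana" has index 0 and appears twice, which is what the tuple `(0, 2)` records.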

# 9. Retrieve words from dictionary
This is more of a test or exploratory step: it is not necessary for the purpose of creating a visualization but does give us a chance to verify our work and interpret the results from step 8.

In [None]:
# Retrieve an individual word from the dictionary by its index
word = id2word[0]   # Change the number (currently 0) to see the various terms indexed in step 8.
print(word)

# 10. Create topics in Gensim
In our next steps, we create the topics that will form the basis for our visualization in step 11. Printing the topics, you will note a series of word groups, with each word accompanied by a number (its weight).

This is another instance in which the code might throw up a bunch of deprecation warnings that don't affect our output (which appears at the end of them) but sure are fiercely distracting. We think the issue is resolved, but apologies if not.

In [None]:
# Specify number of topics (clusters of words)

num_topics = 10   # Experiment with more and fewer numbers of topics, comparing results

# Create LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,     # Number of documents processed in each training chunk
                                           passes=50,         # Can do more passes but will increase the time it takes the block to run
                                           alpha="auto")

# Print topics
lda_model.show_topics()
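
By default, each topic is printed as a single string of weights and words joined with plus signs, which can be hard to scan. The cell below sketches one way to turn such a string into a list of (word, weight) pairs. The helper `parse_topic_string` is hypothetical, and the sample string, including its words and weights, is invented to mimic the printed format rather than taken from a real model.

```python
import re

def parse_topic_string(topic):
    """Convert a string like '0.030*"whale" + 0.020*"ship"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]

# Invented example in the same format as the printed topics
demo_topic = '0.030*"whale" + 0.020*"ship" + 0.010*"sea"'
demo_pairs = parse_topic_string(demo_topic)
print(demo_pairs)
```

With a real model, you could apply the function to the second item of each entry returned by `lda_model.show_topics()`. Alternatively, calling `lda_model.show_topics(formatted=False)` returns the words and weights as pairs directly, with no parsing needed.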

# 11. Create topic modeling visualization with LDAvis
Visualization offers an analytical modality that may support some researchers' ability to make observations from the data.

Alternatively, other data practitioners may use the weights assigned to words within topics to create sonifications. Shawn Graham provides an introduction to "[The Sound of Data](https://programminghistorian.org/en/lessons/sonification)" on the Programming Historian.

The code below produces an interactive visualization using the pyLDAvis library. And, you guessed it - another warning to ignore!

Your visualization will appear in the same file area as the "dir" folder, and will be named "topicVis*N*.html" where *N* is the number of topics. You can open it as HTML markup in Colab; to view it as a visualization, download the .html file and open it in your browser.

In [None]:
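# Output visualization
vis_data = gensimvis.prepare(lda_model, corpus, id2word, R=10, mds='mmds')
pyLDAvis.save_html(vis_data, './topicVis' + str(num_topics) + '.html') # Save the visualization as an .html file
pyLDAvis.display(vis_data) # Last expression in the cell, so the notebook can try to render it inline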