# Introduction to Topic Modeling

In today's lesson, we're going to work with a method of text analysis called "topic modeling".

- [Part 1: What is a topic model?](#Part-1:-What-is-a-topic-model?)
- [Part 2: Topic Modeling Historical *New York Times* Obituaries (1852-2007)](#Part-2:-Topic-Modeling-Historical-*New-York-Times*-Obituaries-(1852-2007))
- [Part 3: Visualizing topic modeling results](#Part-3:-Visualizing-topic-modeling-results)



## Part 1: What is a topic model?

![image](../_images/blei-lda.png)
From David Blei, "Probabilistic Topic Models" (2012)

How do I "topic model"?

1. [**MALLET: MAchine Learning for LanguagE Toolkit**](http://mallet.cs.umass.edu/index.php)
    - ![image](../_images/MALLET.png)


2. **David Mimno's in-browser topic model:**
    - ![image](../_images/mimno-browser1.png)

    - ![image](../_images/mimno-browser2.png)


3. **Today: Topic Modeling using MALLET within Python**

A Note:
1. You can follow along with today's tutorial in Binder.
2. You can download this notebook and run it on your own machine.

If you are running this notebook in the cloud, all of the necessary software has already been downloaded. 

If you are running this notebook locally, you need to have a few things installed:
- See [Instructions on how to install the Java Development Kit, MALLET, little_mallet_wrapper, and seaborn](https://github.com/sceckert/IntroDHFall2022/blob/main/_week9/topic-modeling-set-up-instructions.md)



## Part 2: Topic Modeling Historical *New York Times* Obituaries (1852-2007)

## Let's get started!

In this particular lesson, we’re going to use [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), a Python wrapper for [MALLET](http://mallet.cs.umass.edu/index.php), to topic model 379 obituaries of significant historical figures published by *The New York Times*. This dataset is based on data originally collected by Matt Lavin for his Programming Historian [TF-IDF tutorial](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf). Melanie Walsh cleaned the obituaries so that the subject’s name and death year are included in each text file name, and added 13 more “Overlooked” obituaries, including Karen Spärck Jones, the computer scientist who introduced TF-IDF.

This dataset can be found on our GitHub page under "_datasets/texts/history/NYT-Obituaries"


## Import Packages

In [None]:
!pip install little_mallet_wrapper

In [17]:
import little_mallet_wrapper
import seaborn
import glob
import os
from pathlib import Path

## Set the path to MALLET
Note: if you're running this notebook on your *local* machine, you need to replace the path below with the path to MALLET on your own machine. Then run the cell below.

In [20]:
path_to_mallet = 'mallet-2.0.8/bin/mallet'
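
Before moving on, it can help to confirm that the path actually points at a file. The check below is only a minimal sketch: `mallet_is_available` is a hypothetical helper name (it is not part of Little MALLET Wrapper), and it only tests that a file exists at that location, not that MALLET or Java runs correctly.

```python
from pathlib import Path

def mallet_is_available(path_to_mallet):
    """Return True if the given path points to an existing file."""
    return Path(path_to_mallet).is_file()

path_to_mallet = 'mallet-2.0.8/bin/mallet'  # same value as in the cell above
if mallet_is_available(path_to_mallet):
    print(f"Found MALLET at {path_to_mallet}")
else:
    print(f"No file at {path_to_mallet}; update path_to_mallet before continuing")
```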

## Define our corpus of texts
In this workshop, we'll be working with a dataset of NYT obituaries, produced by Matt Lavin as part of his TF-IDF tutorial, with a few recent additions by Melanie Walsh.

In [21]:
# Make a variable and assign it to the path to our directory that contains our text files
directory = "../_datasets/NYT-Obituaries/"

Using the `glob` function and the wildcard `*`, we're going to make a list of all the text files in our directory.

In [22]:
files = glob.glob(f"{directory}/*.txt")
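
`glob.glob()` makes no promise about the order of its results, which is why a list of matched files is often not alphabetical. If a stable order matters, one option is to sort the list. The sketch below is self-contained and uses a temporary directory with stand-in file names rather than the real corpus.

```python
import glob
import tempfile
from pathlib import Path

# Build a throwaway directory with a few stand-in files (hypothetical contents)
with tempfile.TemporaryDirectory() as demo_directory:
    for name in ["1962-Marilyn-Monroe.txt", "1885-Ulysses-Grant.txt", "notes.md"]:
        Path(demo_directory, name).write_text("placeholder", encoding="utf-8")

    # The * wildcard matches any file name ending in .txt; sorted() fixes the order
    demo_files = sorted(glob.glob(f"{demo_directory}/*.txt"))
    demo_names = [Path(file).name for file in demo_files]

print(demo_names)
```

Keep in mind that sorting changes each document's position in the list, so any lookups by index number later in the notebook would change as well.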

In [23]:
files

['../_datasets/NYT-Obituaries/1945-Adolf-Hitler.txt',
 '../_datasets/NYT-Obituaries/1915-F-W-Taylor.txt',
 '../_datasets/NYT-Obituaries/1975-Chiang-Kai-shek.txt',
 '../_datasets/NYT-Obituaries/1984-Ethel-Merman.txt',
 '../_datasets/NYT-Obituaries/1953-Jim-Thorpe.txt',
 '../_datasets/NYT-Obituaries/1964-Nella-Larsen.txt',
 '../_datasets/NYT-Obituaries/1955-Margaret-Abbott.txt',
 '../_datasets/NYT-Obituaries/1984-Lillian-Hellman.txt',
 '../_datasets/NYT-Obituaries/1959-Cecil-De-Mille.txt',
 '../_datasets/NYT-Obituaries/1928-Mabel-Craty.txt',
 '../_datasets/NYT-Obituaries/1973-Eddie-Rickenbacker.txt',
 '../_datasets/NYT-Obituaries/1989-Ferdinand-Marcos.txt',
 '../_datasets/NYT-Obituaries/1991-Martha-Graham.txt',
 '../_datasets/NYT-Obituaries/1997-Deng-Xiaoping.txt',
 '../_datasets/NYT-Obituaries/1938-George-E-Hale.txt',
 '../_datasets/NYT-Obituaries/1885-Ulysses-Grant.txt',
 '../_datasets/NYT-Obituaries/1909-Sarah-Orne-Jewett.txt',
 '../_datasets/NYT-Obituaries/1957-Christian-Dior.txt',
 

## Process our text files
Next we're going to use the wrapper "little_mallet_wrapper" to process our text files and create two variables, "training_data" and "original_texts". The first tells MALLET what to use as training data; the second is a copy of our original texts that we can refer back to.

The code we'll use to run this is: `little_mallet_wrapper.process_string(text, numbers='remove')`

First, we're going to create an empty list, and then iterate over all the files in our `files` variable, processing each of them with the function `little_mallet_wrapper.process_string(text, numbers='remove')`, which makes the text lowercase, removes stopwords and punctuation, and removes numbers.

> For the list of stopwords that mallet removes, and to change them, look inside the directory "mallet-2.0.8/stoplists"  
> What implications might that have for our model?

In [24]:
training_data = []
for file in files:
    text = Path(file).read_text(encoding='utf-8')  # reads the file and closes it afterwards
    processed_text = little_mallet_wrapper.process_string(text, numbers='remove')
    training_data.append(processed_text)
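
To get a feel for what this kind of preprocessing does, here is a simplified stand-in written in plain Python. It is not the code that Little MALLET Wrapper runs: `simple_process` and the tiny `DEMO_STOPWORDS` set are hypothetical and exist only for illustration.

```python
import re

# A deliberately tiny, made-up stop list; real stop lists are much longer
DEMO_STOPWORDS = {"the", "a", "an", "of", "in", "and", "was", "she", "he"}

def simple_process(text):
    """Lowercase, drop punctuation and digits, then drop stop words."""
    text = text.lower()
    # Keep only the letters a-z and whitespace (this crude pattern also
    # drops accented letters, which a real pipeline may want to keep)
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [word for word in text.split() if word not in DEMO_STOPWORDS]
    return " ".join(words)

print(simple_process("She was born in 1926, and the studio renamed her."))
```

Notice how much of the sentence disappears. Whatever is removed at this stage can never show up in a topic, which is one reason the choice of stop list matters.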

In [25]:
original_texts = []
for file in files:
    text = Path(file).read_text(encoding='utf-8')  # reads the file and closes it afterwards
    original_texts.append(text)

## Process the titles of obituaries
Since our text files all contain the year and name of the individual in them, we're going to use the filename as part of the way we label the text of that obituary.

In [26]:
obit_titles = [Path(file).stem for file in files]
# Path(file).stem returns the filename without the .txt extension
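
Because every filename follows the same year-then-name pattern, the stem can be split further if you ever want the year and the name separately. This is a minimal sketch under that assumption; `split_title` is a hypothetical helper, not something the notebook relies on.

```python
from pathlib import Path

def split_title(file_path):
    """Split a path like '.../1962-Marilyn-Monroe.txt' into (year, name)."""
    stem = Path(file_path).stem            # e.g. '1962-Marilyn-Monroe'
    year, _, name = stem.partition("-")    # split at the first hyphen only
    return int(year), name.replace("-", " ")

print(split_title("../_datasets/NYT-Obituaries/1962-Marilyn-Monroe.txt"))
```

One limitation: hyphens that belong to a name are also turned into spaces, so `1975-Chiang-Kai-shek` comes back as `Chiang Kai shek`.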

In [27]:
obit_titles

['1945-Adolf-Hitler',
 '1915-F-W-Taylor',
 '1975-Chiang-Kai-shek',
 '1984-Ethel-Merman',
 '1953-Jim-Thorpe',
 '1964-Nella-Larsen',
 '1955-Margaret-Abbott',
 '1984-Lillian-Hellman',
 '1959-Cecil-De-Mille',
 '1928-Mabel-Craty',
 '1973-Eddie-Rickenbacker',
 '1989-Ferdinand-Marcos',
 '1991-Martha-Graham',
 '1997-Deng-Xiaoping',
 '1938-George-E-Hale',
 '1885-Ulysses-Grant',
 '1909-Sarah-Orne-Jewett',
 '1957-Christian-Dior',
 '1987-Clare-Boothe-Luce',
 '1976-Jacques-Monod',
 '1954-Getulio-Vargas',
 '1979-Stan-Kenton',
 '1990-Leonard-Bernstein',
 '1972-Jackie-Robinson',
 '1998-Fred-W-Friendly',
 '1991-Leo-Durocher',
 '1915-B-T-Washington',
 '1997-James-Stewart',
 '1981-Joe-Louis',
 '1983-Muddy-Waters',
 '1942-George-M-Cohan',
 '1989-Samuel-Beckett',
 '1962-Marilyn-Monroe',
 '2000-Charles-M-Schulz',
 '1967-Gregory-Pincus',
 '1894-R-L-Stevenson',
 '1978-Bruce-Catton',
 '1982-Arthur-Rubinstein',
 '1875-Andrew-Johnson',
 '1974-Charles-Lindbergh',
 '1964-Rachel-Carson',
 '1953-Marjorie-Rawlings',


## Get Statistics on our Training Dataset
The `little_mallet_wrapper.print_dataset_stats()` function gives us some basic statistics on the dataset we want to use as our training data.

In [28]:
little_mallet_wrapper.print_dataset_stats(training_data)

Number of Documents: 379
Mean Number of Words per Document: 1314.6
Vocabulary Size: 35983
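
To see roughly where numbers like these come from, the sketch below computes the same three statistics by hand on a toy dataset. It assumes that words are separated by whitespace; `dataset_stats` is a hypothetical helper and may not count things exactly the way the library does.

```python
def dataset_stats(documents):
    """Return (number of documents, mean words per document, vocabulary size)."""
    token_lists = [document.split() for document in documents]
    number_of_documents = len(token_lists)
    mean_words = sum(len(tokens) for tokens in token_lists) / number_of_documents
    # The vocabulary is the set of distinct words across all documents
    vocabulary = {token for tokens in token_lists for token in tokens}
    return number_of_documents, mean_words, len(vocabulary)

toy_documents = ["film actress film studio", "army general war", "war film"]
print(dataset_stats(toy_documents))
```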


## Training the Topic Model
Now for the big part! Before we can train the model, we need to define a few variables. We need to tell our model:

- How many topics to find
- What our training data is
- The location of a directory to output our topic modeling data (including sub-directories)

And we need to import all this information into Little MALLET Wrapper.



### Set the Number of Topics

In [29]:
num_topics = 15 # Change this number to change the number of topics

### Set the Training Data

In [30]:
training_data = training_data

### Set the Location of the Topic Model Output Files

Topic modeling produces a lot of output files, including the words in topics, and statistics on their relative distributions within the documents. We need to tell Little MALLET Wrapper where to output all of these results. The code below defines a directory called “topic-model-output” and a subdirectory called “NYT-Obits”, all of which will be inside your current directory. 

Notice how we're able to re-use the path we defined, "output_directory_path", to tell Little MALLET Wrapper where to put each of the output files.

In [31]:
#Change to your desired output directory
output_directory_path = 'topic-model-output/NYT-Obits'

#No need to change anything below here
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

path_to_training_data           = output_directory_path + '/training.txt'
path_to_formatted_training_data = output_directory_path + '/mallet.training'
path_to_model                   = output_directory_path + '/mallet.model.' + str(num_topics)
path_to_topic_keys              = output_directory_path + '/mallet.topic_keys.' + str(num_topics)
path_to_topic_distributions     = output_directory_path + '/mallet.topic_distributions.' + str(num_topics)
path_to_word_weights            = output_directory_path + '/mallet.word_weights.' + str(num_topics)
path_to_diagnostics             = output_directory_path + '/mallet.diagnostics.' + str(num_topics) + '.xml'



### Import our Data into Little MALLET Wrapper
Here we're importing the variables we just defined into Little MALLET Wrapper.

In [32]:
little_mallet_wrapper.import_data(path_to_mallet,
                path_to_training_data,
                path_to_formatted_training_data,
                training_data)

Importing data...
Complete


### Train the Topic Model

The final and most important step: we're going to use `little_mallet_wrapper.train_topic_model()` to train our model (using all of the parameters that we just defined). The topic model should take about 45 seconds to 1 minute to fully train and complete.

If you're running this notebook locally (not in the cloud), you can look at your Terminal or PowerShell while it’s running and see what the model looks like as it trains.

In [33]:
little_mallet_wrapper.train_topic_model(path_to_mallet,
                      path_to_formatted_training_data,
                      path_to_model,
                      path_to_topic_keys,
                      path_to_topic_distributions,
                      path_to_word_weights,
                      path_to_diagnostics,
                      num_topics)

Training topic model...
Complete


## Display Topics and Top Words
To examine the 15 topics that the topic model extracted from the NYT obituaries, run the cell below. This code uses the `little_mallet_wrapper.load_topic_keys()` function to read and process the MALLET topic model output from your computer, specifically the file “mallet.topic_keys.15”.

Take a look at the topics below. Think about what each topic seems to capture. 

- Are there any that seem to have a clear theme? 
- What about any oddities or outliers?

In [34]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)

for topic_number, topic in enumerate(topics):
    print(f"✨Topic {topic_number}✨\n\n{topic}\n")

✨Topic 0✨

['said', 'years', 'one', 'new', 'first', 'two', 'later', 'time', 'life', 'also', 'man', 'old', 'world', 'many', 'last', 'year', 'never', 'like', 'died', 'made']

✨Topic 1✨

['work', 'art', 'university', 'professor', 'research', 'science', 'picasso', 'institute', 'scientific', 'modern', 'society', 'oppenheimer', 'world', 'artist', 'human', 'prize', 'atomic', 'paris', 'schweitzer', 'nobel']

✨Topic 2✨

['israel', 'king', 'gandhi', 'british', 'minister', 'peace', 'queen', 'india', 'prince', 'prime', 'mrs', 'arab', 'israeli', 'government', 'lord', 'egypt', 'sadat', 'england', 'victoria', 'begin']

✨Topic 3✨

['grant', 'gen', 'general', 'army', 'president', 'harrison', 'upon', 'men', 'made', 'fort', 'command', 'douglass', 'lee', 'miles', 'sent', 'union', 'sherman', 'troops', 'days', 'mckinley']

✨Topic 4✨

['war', 'general', 'hitler', 'united', 'army', 'german', 'france', 'military', 'french', 'germany', 'world', 'american', 'nations', 'secretary', 'troops', 'macarthur', 'churchi

----


## Exercise 1: Training topics

## Your Turn!!
1. Change the number of topics (`num_topics`) from 15 to some other number.
2. Then, run the command to train the topic model, `little_mallet_wrapper.train_topic_model()`.
3. Then, re-run the code in the cell above to display the topics and top words.
4. What do you notice? How did your choice change the topics? What implications might this have for someone who wants to use a topic model?

When you're done, change the number of topics back to 15, and re-train the model.

----

## Load Topic Distributions

MALLET also calculates the likely mixture of these topics for every single obituary in the corpus. This mixture is really a probability distribution, that is, the probability that each topic exists in the document. We can use these probability distributions to examine which of the above topics are strongly associated with which specific obituaries.

To get the topic distributions, we’re going to use the `little_mallet_wrapper.load_topic_distributions()` function, which will read and process the MALLET topic model output, specifically the file “mallet.topic_distributions.15”, and return it as a list of lists: one list of topic probabilities per obituary.

In [None]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)

We can use the list that we just created to look at the probability distribution over the 15 topics in any one obituary.

Let's look at Marilyn Monroe's obituary, which is at index 32 in our list.

In [None]:
topic_distributions[32]

It's a little hard to get a sense of what these topics are, so we can pair this with the data we have on the title and the top words in the topic:

In [None]:
obituary_to_check = "1962-Marilyn-Monroe"

obit_number = obit_titles.index(obituary_to_check)

print(f"Topic Distributions for {obit_titles[obit_number]}\n")
for topic_number, (topic, topic_distribution) in enumerate(zip(topics, topic_distributions[obit_number])):
    print(f"✨Topic {topic_number} {topic[:6]} ✨\nProbability: {round(topic_distribution, 3)}\n")
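
Because each obituary's distribution is just a list of probabilities, ordinary list operations work on it. The sketch below uses made-up numbers (not real model output) to show two quick checks: that the probabilities sum to roughly 1, and which topics rank highest. `top_topics` is a hypothetical helper.

```python
def top_topics(distribution, how_many=3):
    """Return (topic_number, probability) pairs, highest probability first."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:how_many]

# Made-up distribution over 5 topics for a single document
toy_distribution = [0.05, 0.60, 0.10, 0.20, 0.05]

# A probability distribution should sum to (approximately) 1
print(round(sum(toy_distribution), 6))
print(top_topics(toy_distribution, how_many=2))
```

With real output, you could pass in something like `topic_distributions[obit_number]` in place of the toy list.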

> **💡 CHECK-IN:**  
> Remember, these are PROBABILITIES. Each time that we re-run the model, the probabilities will change slightly as the model trains.

> **EXERCISE**  
1. Run the "Train the Topic Model" command again
2. Then click on the cell above to output the topic distributions for Marilyn Monroe's obituary.
3. What do you notice? What changed? And what implications might this have for how we use these distribution statistics? 

## Part 3: Visualizing topic modeling results

### Create a Heatmap of Topics and Texts

We can visualize and compare these topic probability distributions with a heatmap by using the `little_mallet_wrapper.plot_categories_by_topics_heatmap()` function.

We have everything we need for the heatmap except for our list of target_labels, the sample of texts that we’d like to visualize and compare with the heatmap. Below we make our list of desired target labels.

In [None]:
target_labels = ['1852-Ada-Lovelace', '1885-Ulysses-Grant',
                 '1900-Nietzsche', '1931-Ida-B-Wells', '1940-Marcus-Garvey',
                 '1941-Virginia-Woolf', '1954-Frida-Kahlo', '1962-Marilyn-Monroe',
                 '1963-John-F-Kennedy', '1964-Nella-Larsen', '1972-Jackie-Robinson',
                 '1973-Pablo-Picasso', '1984-Ray-A-Kroc','1986-Jorge-Luis-Borges', '1991-Miles-Davis',
                 '1992-Marsha-P-Johnson', '1993-Cesar-Chavez']

If you’d like to make a random list of target labels, you can uncomment and run the cell below.

In [None]:
#import random
#target_labels = random.sample(obit_titles, 10)
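
A random sample will differ every time the cell runs. If you want a random but repeatable selection, one option is to draw from a seeded generator. This sketch uses a short stand-in list of titles, and the seed value 42 is an arbitrary choice.

```python
import random

# Stand-in list; in the notebook this would be obit_titles
demo_titles = ['1945-Adolf-Hitler', '1915-F-W-Taylor', '1975-Chiang-Kai-shek',
               '1984-Ethel-Merman', '1953-Jim-Thorpe', '1964-Nella-Larsen']

# A dedicated generator with a fixed seed gives the same sample on every run
seeded_generator = random.Random(42)
demo_labels = seeded_generator.sample(demo_titles, 3)
print(demo_labels)
```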

Now let's use those target_labels to create a heatmap:

In [None]:
little_mallet_wrapper.plot_categories_by_topics_heatmap(obit_titles,
                                      topic_distributions,
                                      topics,
                                      output_directory_path + '/categories_by_topics.pdf',
                                      target_labels=target_labels,
                                      dim= (13, 9)
                                     )

The darker squares in this heatmap represent a high probability for the corresponding topic (compared to everyone else in the heatmap) and the lighter squares in the heatmap represent a low probability for the corresponding topic. For example, if you scan across the row of Marilyn Monroe, you can see a dark square for the topic “miss film theater movie theater broadway”. If you scan across the row of Ada Lovelace, an English mathematician who is now recognized as the first computer programmer, according to her NYT obituary, you can see a dark square for “university professor research science also”.
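
To make "compared to everyone else in the heatmap" more concrete, here is one way such a relative comparison could be computed: rescaling each topic column so that values are expressed relative to the other rows. This is an illustrative assumption with made-up numbers, and it may or may not match what the plotting function does internally; `standardize_columns` is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each column to z-scores: (value - column mean) / column std."""
    columns = list(zip(*rows))
    scaled_columns = []
    for column in columns:
        center, spread = mean(column), pstdev(column)
        # Guard against a zero spread when every value in a column is equal
        scaled_columns.append([(value - center) / spread if spread else 0.0
                               for value in column])
    # Transpose back so that each inner list is a document (row) again
    return [list(row) for row in zip(*scaled_columns)]

# Made-up probabilities: 3 documents (rows) by 2 topics (columns)
toy_rows = [[0.10, 0.50],
            [0.20, 0.30],
            [0.60, 0.10]]
for row in standardize_columns(toy_rows):
    print([round(value, 2) for value in row])
```

Under this kind of scaling, a square's darkness depends on which other texts are in the sample, so the same obituary can look different in a heatmap with different `target_labels`.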

The `plot_categories_by_topics_heatmap()` function also helpfully outputs a PDF of the heatmap to `output_directory_path + '/categories_by_topics.pdf'`. We can download this PDF and explore it in more detail or embed it in an article or blog post!

In [None]:
from IPython.display import IFrame
IFrame(output_directory_path + '/categories_by_topics.pdf', width=1000, height=600)

### Display Top Titles Per Topic
We can also display the obituaries that have the highest probability for every topic with the `little_mallet_wrapper.get_top_docs()` function.

Because most of the obituaries in our corpus are pretty long, however, it will be more useful for us to simply display the title of each obituary, rather than the entire document—at least as a first step. To do so, we’ll first need to make two dictionaries, which will allow us to find the corresponding obituary title and the original text from a given training document.

In [None]:
training_data_obit_titles = dict(zip(training_data, obit_titles))
training_data_original_text = dict(zip(training_data, original_texts))

Then we’ll make our own function `display_top_titles_per_topic()` that will display the top text titles for every topic. This function accepts a given topic_number as well as a desired `number_of_documents` to display.

In [None]:
def display_top_titles_per_topic(topic_number=0, number_of_documents=5):
    
    print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):
        print(round(probability, 4), training_data_obit_titles[document] + "\n")
    return
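
Conceptually, finding the top documents for a topic amounts to sorting the documents by their probability for that topic. The sketch below shows that idea with made-up data; `rank_documents_for_topic` is a hypothetical helper, not the library's implementation.

```python
def rank_documents_for_topic(titles, distributions, topic_number, how_many=2):
    """Return (probability, title) pairs for one topic, highest first."""
    pairs = [(distribution[topic_number], title)
             for title, distribution in zip(titles, distributions)]
    return sorted(pairs, reverse=True)[:how_many]

toy_titles = ["doc-a", "doc-b", "doc-c"]
toy_distributions = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]   # made-up values

print(rank_documents_for_topic(toy_titles, toy_distributions, topic_number=1))
```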

NOTE: The number of a topic (e.g., Topic 2, Topic 11) does not correspond to its frequency in the corpus. There is no intrinsic difference between a topic that MALLET assigns as Topic 0 and a topic that MALLET assigns as Topic 14; it's just a labelling convention.

**Topic 10**

To display the top 5 obituary titles with the highest probability of containing Topic 10, we will run:

In [None]:
display_top_titles_per_topic(topic_number=10, number_of_documents=5)

What descriptive label would you give to Topic 10? Double-click the cell below to assign one.

Topic 10 : [DOUBLE-CLICK HERE TO TYPE IN A LABEL]

**Topic 7**

To display the top 5 obituary titles with the highest probability of containing Topic 7, we will run:

In [None]:
display_top_titles_per_topic(topic_number=7, number_of_documents=5)

Topic 7 : [DOUBLE-CLICK HERE TO TYPE IN A LABEL]

**Topic 14**

To display the top 5 obituary titles with the highest probability of containing Topic 14, we will run:

In [None]:
display_top_titles_per_topic(topic_number=14, number_of_documents=5)

Topic 14 : [DOUBLE-CLICK HERE TO TYPE IN A LABEL]

### Display Topic Words in Context of Original Text

Often it’s useful to actually look at the document that has ranked highly for a given topic and puzzle out why it ranks so highly.

To display the original obituary texts that rank highly for a given topic, with the relevant topic words **bolded** for emphasis, we are going to make the function `display_bolded_topic_words_in_context()`.

In the cell below, we’re importing two special Jupyter notebook display modules, which will allow us to make the relevant topic words bolded, as well as the regular expressions library `re`, which will allow us to find and replace the correct words.

In [None]:
from IPython.display import Markdown, display
import re

def display_bolded_topic_words_in_context(topic_number=3, number_of_documents=3, custom_words=None):

    for probability, document in little_mallet_wrapper.get_top_docs(training_data, topic_distributions, topic_number, n=number_of_documents):
        
        print(f"✨Topic {topic_number}✨\n\n{topics[topic_number]}\n")
        
        probability = f"✨✨✨\n\n**{probability}**"
        obit_title = f"**{training_data_obit_titles[document]}**"
        original_text = training_data_original_text[document]
        topic_words = topics[topic_number]
        topic_words = custom_words if custom_words is not None else topic_words

        for word in topic_words:
            if word in original_text:
                # re.escape() keeps any special regex characters in a word from being misread
                original_text = re.sub(f"\\b{re.escape(word)}\\b", f"**{word}**", original_text)

        display(Markdown(probability))
        display(Markdown(obit_title))
        display(Markdown(original_text))
    return
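
The find-and-replace step is the heart of this function, so here it is in isolation. The function above matches words exactly as they appear in the topic (lowercase), so a capitalized word at the start of a sentence is left unbolded. The sketch below shows a case-insensitive variant as one possible alternative; `bold_words` is a hypothetical helper and the sentence is made up.

```python
import re

def bold_words(text, words):
    """Wrap whole-word, case-insensitive matches of each word in ** **."""
    for word in words:
        # \b marks a word boundary, so "army" will not match inside "armies"
        pattern = rf"\b{re.escape(word)}\b"
        # The lambda re-inserts the matched text so its original capitalization is kept
        text = re.sub(pattern, lambda match: f"**{match.group(0)}**",
                      text, flags=re.IGNORECASE)
    return text

print(bold_words("General Grant led the army. Armies followed.", ["general", "army"]))
```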

**Topic 3**

To display the top 3 original obituaries with the highest probability of containing Topic 3 and with relevant topic words bolded, we will run:

In [None]:
display_bolded_topic_words_in_context(topic_number=3, number_of_documents=3)

## Exercise 2: Visualizing and analyzing topics

## Your Turn!

Choose a topic from the results above and write down its corresponding topic number below.

**Topic: [Your Number Choice Here]**

1. Display the top 6 obituary titles for this topic.

In [None]:
# Your Code Here

2. Display the topic words in the context of the original obituary for these 6 top titles.

In [None]:
# Your Code Here

3. Come up with a label for your topic and double-click on the cell below to write it down:

4. Why did you label your topic the way you did? What do you think this topic means in the context of all the *NYT* obituaries?

5. What’s another collection of texts that you think might be interesting to topic model? Why?