# Mallet Topic Modeling Tutorial

Topic modeling is a statistical method for discovering abstract themes or "topics" within a collection of documents. MALLET is the most widely used tool for topic modeling in the Humanities. It is a Java package that must be installed separately from Lexos. However, the Lexos `mallet` module provides an easy way to call MALLET, manage its outputs, and generate some simple visualizations of your topic model. For more on topic modeling, as well as instructions for installing MALLET, see Shawn Graham, Scott Weingart, and Ian Milligan's tutorial [Getting Started with Topic Modeling and MALLET](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet).

## Import the `Mallet` class from the `mallet` Module

First, import the `Mallet` class from the Lexos `mallet` module. The utility functions for reading data are imported later, in the cells where they are used.

In [None]:
# Import the Mallet class
from lexos.topic_modeling.mallet import Mallet

## Check Mallet Installation

We need to start by verifying that MALLET is correctly installed and accessible from your environment. Make sure you know the path to the MALLET binary file. Configure your path in the cell below and then run the cell. If you get a list of MALLET commands, MALLET is installed and the path is correct. Use this path when you create your `Mallet` instance below.

In [None]:
mallet_binary_path = "/mnt/c/mallet/bin/mallet"
!$mallet_binary_path
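
If you prefer to check the path from Python before running the shell command, a minimal sketch is shown below. The helper name `mallet_binary_exists` is hypothetical (it is not part of Lexos). It only confirms that a file exists at the path and is marked executable, not that MALLET itself runs correctly.

```python
import os


def mallet_binary_exists(path: str) -> bool:
    """Return True if `path` points to an existing, executable file."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# Example path from this tutorial: replace with the location of your own MALLET binary
print(mallet_binary_exists("/mnt/c/mallet/bin/mallet"))
```

If this prints `False`, double-check the path before creating your `Mallet` instance.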

## Load Your Data

Your data must take the form of a list of strings or spaCy `Doc` objects, such as the list below:

```python
sample_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast brown fox leaps over sleeping dogs.",
    "Dogs are great pets for families.",
    "Foxes are wild animals found in forests."
]
```

In most cases, you will load your data from a directory of text files or from a single text file. Each method is illustrated below:

### Reading Directories

The `read_dirs()` function will read all text files in a directory (or list of directories) into a list of strings. Each file will be treated as a separate document. Note that the order of the documents in the list is important, as it will be used for document indexes in the topic model.

Make sure you configure the `corpus_dir` before running the cell below. For the purposes of illustration, we will use this tutorial's `sample_data` folder, which is a reproduction of the one distributed with MALLET. It contains text from 12 Wikipedia articles.

In [None]:
# Import the read_dirs function
from lexos.topic_modeling.mallet import read_dirs

# Load all .txt files from a directory as documents
corpus_dir = "sample_data"
docs = read_dirs(corpus_dir)

# Preview the first five documents
for doc in docs[:5]:
    print(f"- {doc[0:100]}...")
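
Because document indexes follow list order, it can help to keep a parallel list of file names. The sketch below is an illustration only: it does not call `read_dirs()` and makes no claim about the order in which `read_dirs()` returns files. It reads `.txt` files in sorted name order with `pathlib`, so that `labels[i]` always names the file behind `texts[i]`. The function name `read_sorted_txt` and the temporary demo files are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_sorted_txt(directory: str) -> Tuple[List[str], List[str]]:
    """Read .txt files in sorted name order; return (labels, texts)."""
    paths = sorted(Path(directory).glob("*.txt"))
    labels = [p.stem for p in paths]
    texts = [p.read_text(encoding="utf-8") for p in paths]
    return labels, texts


# Demo with throwaway files so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_fox.txt").write_text("Foxes are wild animals.", encoding="utf-8")
    Path(tmp, "a_dog.txt").write_text("Dogs are great pets.", encoding="utf-8")
    labels, texts = read_sorted_txt(tmp)

# Each index now maps to a known file name
for index, (label, text) in enumerate(zip(labels, texts)):
    print(index, label, text)
```

A list like `labels` can later serve as document metadata when you inspect results by document index.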

### Reading from a File

You can also load documents from a single file using the `read_file()` function. Here each line in the file will be treated as a separate document. Again, the order of documents will be used for document indexes in the topic model.

Note: Technically, MALLET expects a tab-delimited file in which the first column is an index, the second is an optional label, and the third is the document text itself. The `read_file()` function accepts files in this format, as well as files containing only texts.

If you want to load data from a single file, make sure you configure the `corpus_file` path before running the cell below. If you have already loaded your data from a directory, you can skip the cell below.

In [None]:
# Import the read_file function
from lexos.topic_modeling.mallet import read_file

# Load documents from a single file
corpus_file = "path/to/your/file.txt"
docs = read_file(corpus_file)

# Preview the first five documents
for doc in docs[:5]:
    print(f"- {doc[0:100]}...")
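
To make the tab-delimited layout concrete, here is a minimal sketch of how such lines could be split into index, label, and text. It is not the `read_file()` implementation. The helper `parse_line` is hypothetical, and it assumes that any line without exactly three tab-separated columns is plain text.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Tuple[Optional[str], Optional[str], str]:
    """Split one line into (index, label, text).

    Assumes three tab-separated columns; any other shape is treated as text only.
    """
    stripped = line.rstrip("\n")
    parts = stripped.split("\t", 2)
    if len(parts) == 3:
        index, label, text = parts
        # An empty label column means no label was supplied
        return index, label or None, text
    return None, None, stripped


sample_lines = [
    "0\tanimals\tThe quick brown fox jumps over the lazy dog.",
    "1\t\tDogs are great pets for families.",  # empty (optional) label
    "Foxes are wild animals found in forests.",  # text only
]
parsed = [parse_line(line) for line in sample_lines]
texts = [text for _, _, text in parsed]
print(texts)
```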

## Train a Topic Model

You are now ready to train a topic model. This involves three steps:

1. Create a `Mallet` instance, providing it with the path to a directory in which to save the model and, if required (see above), the path to your MALLET binary file.
2. Import your data with the `import_data()` method.
3. Train the topic model with the `train()` method.

The model `metadata` property returns a dictionary containing paths to output files and model statistics.

Before running the cell below, make sure you configure the path to your `model_dir` and the path to your MALLET binary.

In [None]:

# Configure the path to your model_dir and Mallet binary
model_dir = "mallet_model"
path_to_mallet = "/mnt/c/mallet/bin/mallet"

# Initialize Mallet model (creates the model directory if it does not exist)
mallet_model = Mallet(model_dir=model_dir, path_to_mallet=path_to_mallet)

Now import your training data with `import_data`. You can configure the following parameters:

- `keep_sequence`: Keep the token sequence. Default is `True`.
- `preserve_case`: Preserve case. Default is `True`.
- `remove_stopwords`: Remove stopwords. Default is `True`.
- `training_ids`: A list of integers indicating the IDs of the documents you want to import. If `None`, all documents in your training data will be imported.

When you train a model, MALLET creates a "pipe" file that records the sequence of data processing steps (such as tokenization, stopword removal, case normalization, etc.). This file is saved in your model directory with the extension `.mallet`.

If you later want to import new documents for inference or further modeling, you can use `use_pipe_from` with the path to the `.mallet` file. This guarantees consistency between training and inference, so your new documents are handled identically to your training set.

In [None]:
# Import training data from docs
mallet_model.import_data(training_data=docs)

Finally, train your model.

The `train()` method takes the following parameters:

- `num_topics`: The number of topics to train. The default is 20.
- `num_iterations`: The number of iterations to train for. The default is 100.
- `optimize_interval`: The interval at which to optimize the model. The default is 10.
- `verbose`: Whether to print the MALLET output showing the state at different iterations. The default is `True`.
- `path_to_inferencer`: Optional output filename for saving a trained inferencer object (see below). If not provided, defaults to `model_dir/inferencer.mallet`.

In [None]:
# Train the topic model using the Mallet class
mallet_model.train(num_topics=20, num_iterations=100, verbose=False)

After training a topic model with the `Mallet` class, you can access several useful properties:

- `mallet_model.metadata`: Dictionary containing paths to output files and model statistics.
- `mallet_model.topic_keys`: List of topics and their top words.
- `mallet_model.distributions`: Topic distributions for each document.
- `mallet_model.num_docs`: Number of documents in the model.
- `mallet_model.vocab_size`: Vocabulary size.
- `mallet_model.mean_num_tokens`: Average number of tokens per document.

These properties allow you to inspect the model, analyze results, and use outputs for further processing, as shown below. Uncomment the last two lines to see more.

In [None]:
# Example: Access model properties after training
print("Number of documents:", mallet_model.num_docs)
print("\nVocabulary size:", mallet_model.vocab_size)
print("\nMean number of tokens per document:", mallet_model.mean_num_tokens)
print("\nMetadata:", mallet_model.metadata)
# print("\nTopic keys:", mallet_model.topic_keys)
# print("\nTopic distributions:", mallet_model.distributions)
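
As one example of further processing, the sketch below finds the highest-probability topic for each document. It uses a small hand-made list in place of `mallet_model.distributions` and assumes that property is a list with one list of topic probabilities per document. The toy numbers are invented for illustration.

```python
# Toy stand-in for mallet_model.distributions (3 documents, 3 topics)
toy_distributions = [
    [0.70, 0.20, 0.10],
    [0.10, 0.15, 0.75],
    [0.25, 0.60, 0.15],
]

# For each document, find the index of the topic with the largest probability
dominant_topics = [
    max(range(len(dist)), key=lambda topic: dist[topic])
    for dist in toy_distributions
]

for doc_index, topic in enumerate(dominant_topics):
    probability = toy_distributions[doc_index][topic]
    print(f"Document {doc_index}: topic {topic} ({probability:.2f})")
```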


## Display Topics and Top Words

Once you have created your model, you can display the discovered topics and their top words using `get_keys()`. This method takes the following parameters:

- `num_topics`: The number of topics to get keys for. If None, get keys for all topics.
- `topics`: A list of topic indices to get keys for. If None, get keys for all topics.
- `num_keys`: The number of key terms to output for each topic.
- `as_df`: Whether to return the result as a pandas DataFrame instead of a string. The default is `True`.

In [None]:
# Display the topics and their top terms
df = mallet_model.get_keys(as_df=True)
df = df.data
df.to_markdown()

## Display the Top Documents in Each Topic

You can display the documents most strongly associated with a given topic using `get_top_docs()`. This method takes the following parameters:

- `topic`: The topic number to display.
- `n`: The number of top documents to return.
- `metadata`: A pandas DataFrame with the metadata in the same order as the training data. This can include information such as document labels.
- `as_str`: Whether to return the result as a string instead of a pandas DataFrame. The default is `False`.


In [None]:
# Display the top 10 documents for topic 0
df = mallet_model.get_top_docs(topic=0, n=10)
# df = df.data
df.to_markdown()
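
Conceptually, the top documents for a topic are the documents with the largest probability for that topic. The sketch below shows that ranking on invented data. It is an illustration of the idea, not the `get_top_docs()` implementation, and the helper name `rank_docs_for_topic` is hypothetical.

```python
def rank_docs_for_topic(distributions, topic, n):
    """Return (doc_index, probability) pairs for the n documents
    with the highest probability for `topic`."""
    ranked = sorted(
        enumerate(dist[topic] for dist in distributions),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


# Invented distributions: 4 documents, 2 topics
toy_distributions = [
    [0.70, 0.30],
    [0.10, 0.90],
    [0.45, 0.55],
    [0.95, 0.05],
]
print(rank_docs_for_topic(toy_distributions, topic=0, n=2))
```

The returned document indexes follow the order of the training data, which is why that order matters when you attach metadata.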

## Display the Topic Term Probabilities

You can display the term distribution for a given topic with `get_topic_term_probabilities()`. This method takes the following parameters:

- `topics`: The topic number (or list of topic numbers) to display. If None, get the probabilities for all topics.
- `n`: The number of key terms to display.
- `as_df`: Whether to return the result as a pandas DataFrame instead of a string. The default is `False`.

In [None]:
# Display the top 10 term probabilities for topics 0 and 1
df = mallet_model.get_topic_term_probabilities(topics=[0, 1], n=10, as_df=True)
# df = df.data
df.to_markdown()
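
A term distribution assigns each term a probability, and the probabilities for a topic sum to 1. The sketch below normalizes invented per-topic word weights into probabilities and keeps the top `n` terms. It is only an illustration of the concept, under the assumption that probabilities are obtained by dividing each weight by the topic's total weight. The helper name `term_probabilities` is hypothetical and is not a Lexos function.

```python
def term_probabilities(weights: dict, n: int) -> list:
    """Normalize word weights to probabilities and return the
    top n (term, probability) pairs, largest first."""
    total = sum(weights.values())
    probs = {term: weight / total for term, weight in weights.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented weights for a single topic
toy_weights = {"fox": 6.0, "dog": 3.0, "forest": 1.0}
top_terms = term_probabilities(toy_weights, n=2)
print(top_terms)
```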

## Visualizing Topic Probabilities by Category with Boxplots

The `plot_categories_by_topic_boxplots()` method lets you visualize how topic probabilities are distributed across different categories (e.g., genres, labels, or other groupings). This is useful for understanding which topics are most associated with which categories in your data.

The function takes a number of parameters that allow you to choose your topics and categories, as well as to customize the appearance of the box plots.

- `categories`: List of category labels for each document (must match the order of your training data).
- `topics`: Topic index or list of indices to plot. If `None`, all topics are plotted.
- `output_path`: Path to save the figure. If `None`, the plot is shown but not saved.
- `target_labels`: List of unique category labels to include. If `None`, all categories are included.
- `num_keys`: Number of top keywords to display in the plot title.
- `figsize`: Size of the figure (tuple, e.g., `(8, 6)`).
- `font_scale`: Font scaling for the plot.
- `color`: Color for the boxplots (matplotlib color name or object).
- `show`: Whether to display the plot (`True`) or just return the figure object (`False`).
- `title`: Custom title for the plot. If not provided, a default is used.
- `overlay`: How to display individual data points (`'strip'`, `'swarm'`, or `'none'`).
- `overlay_kws`: Dictionary of keyword arguments for the overlay plot (e.g., point size, color).

**Overlay advice:**
- Use `'strip'` (default) for most cases, especially when you have a moderate number of documents per category. It shows individual points with jitter for visibility.
- Use `'swarm'` when you have a small number of documents and want to avoid overlapping points; it arranges points to minimize overlap.
- Use `'none'` if you only want to see the boxplot summary and not individual data points (useful for large datasets).

The cell below will run a basic example. You can uncomment the other code blocks to see how you can modify the parameters.

In [None]:
# Example: Visualize topic probabilities by category
categories = ["People", "Concepts", "People", "People", "People", "Battles", "Texts", "Texts", "Animals", "Planets", "People", "People"]

# Basic usage: plot a single topic (topic 1)
mallet_model.plot_categories_by_topic_boxplots(categories, topics=1, show=True)

# Uncomment the code blocks below for more examples

# Plot for a specific topic with custom options
# mallet_model.plot_categories_by_topic_boxplots(
#     categories,
#     topics=0,  # plot only topic 0
#     num_keys=5,  # show top 5 keywords in title
#     figsize=(8, 6),
#     font_scale=1.5,
#     color="lightblue",
#     show=True,
#     title="Topic 0 by Category",
#     overlay="swarm",  # show individual points as a swarm plot
#     overlay_kws={"size": 6, "color": "red"}
# )

# # Save the plot to a file and only include selected categories
# mallet_model.plot_categories_by_topic_boxplots(
#     categories,
#     topics=[0, 1],
#     target_labels=["People"],
#     output_path="topic_boxplot.png",
#     show=True
# )
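
A boxplot summarizes, for each category, the spread of one topic's probabilities across that category's documents. The sketch below groups invented probabilities by category and prints each group's median using the standard library. It also checks that there is exactly one category label per document, which the `categories` parameter requires. The helper name `group_by_category` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def group_by_category(categories, distributions, topic):
    """Map each category to the list of `topic` probabilities of its documents."""
    if len(categories) != len(distributions):
        raise ValueError("Need exactly one category label per document.")
    groups = defaultdict(list)
    for category, dist in zip(categories, distributions):
        groups[category].append(dist[topic])
    return dict(groups)


# Invented data: 5 documents, 2 topics
toy_categories = ["People", "Texts", "People", "Texts", "People"]
toy_distributions = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9], [0.7, 0.3]]

groups = group_by_category(toy_categories, toy_distributions, topic=0)
for category, values in groups.items():
    print(category, "median:", median(values))
```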

## Visualizing Topic-Category Associations with a Heatmap

The `plot_categories_by_topics_heatmap()` method creates a heatmap showing how topics are distributed across different categories. This is useful for quickly spotting which topics are most associated with which categories, especially when you have many topics or categories. It takes the following parameters:

- `categories`: List of category labels for each document (must match the order of your training data).
- `output_path`: Path to save the figure. If `None`, the plot is shown but not saved.
- `target_labels`: List of unique category labels to include. If `None`, all categories are included.
- `num_keys`: Number of top keywords to display in the topic labels.
- `figsize`: Size of the figure (tuple, e.g., `(10, 8)`).
- `font_scale`: Font scaling for the plot.
- `cmap`: Colormap for the heatmap (e.g., `"rocket_r"`, `"viridis"`, or any matplotlib colormap).
- `show`: Whether to display the plot (`True`) or just return the figure object (`False`).
- `title`: Custom title for the plot. If not provided, a default is used.

### Example Usage

In [None]:
# Example: Visualize topic-category associations with a heatmap
categories = ["People", "Concepts", "People", "People", "People", "Battles", "Texts", "Texts", "Animals", "Planets", "People", "People"]

# Sample heatmap with some customization: select colormap, figure size, and title
mallet_model.plot_categories_by_topics_heatmap(
    categories=categories,
    num_keys=3,  # show top 3 keywords in topic labels
    figsize=(8, 6),
    font_scale=1,
    cmap="viridis",
    show=True,
    title="Topic-Category Heatmap",
#     output_path="topic_heatmap.png" # Uncomment to save the heatmap to a file
)
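
To get a feel for the numbers behind a category-by-topic heatmap, the sketch below builds a small table with one row per category and one value per topic. It assumes the values are averaged with the mean, which is an assumption for illustration and may differ from how `plot_categories_by_topics_heatmap()` aggregates. The data are invented and the helper name `category_topic_means` is hypothetical.

```python
from statistics import mean


def category_topic_means(categories, distributions):
    """Return {category: [mean probability of each topic]}.

    The mean is an assumed aggregation, chosen for illustration.
    """
    num_topics = len(distributions[0])
    table = {}
    for category in sorted(set(categories)):
        rows = [d for c, d in zip(categories, distributions) if c == category]
        table[category] = [mean(row[t] for row in rows) for t in range(num_topics)]
    return table


# Invented data: 4 documents, 2 topics
toy_categories = ["People", "Texts", "People", "Texts"]
toy_distributions = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]]

table = category_topic_means(toy_categories, toy_distributions)
for category, row in table.items():
    print(category, [round(value, 2) for value in row])
```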


## Visualizing Topics with Word Clouds

The `topic_clouds()` method in the Mallet class generates word clouds for each topic, providing a visual summary of the most important terms per topic. This is useful for quickly understanding the main themes captured by your model.

**Parameters:**

- `topics`: (int or list[int], optional) Topics to include. If None, all topics are shown.
- `max_terms`: (int, optional) Maximum number of keywords per topic cloud (default: 30).
- `figsize`: (tuple, optional) Size of the overall figure (default: (10, 10)).
- `output_path`: (str, optional) If provided, saves the figure to this path.
- `show`: (bool, optional) If True, displays the figure; if False, returns the matplotlib Figure object.
- `round_mask`: (bool|int|str, optional) Whether to use a circular mask for the clouds (True/False or integer radius).
- `title`: (str, optional) Title for the figure.
- `**kwargs`: Additional keyword arguments for customization (see below).

**Customization:**

- Pass `opts` in `**kwargs` to control word cloud appearance (e.g., background color, colormap). Accepts arguments for the Python [wordcloud](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud) package.
- Pass `figure_opts` in `**kwargs` to control figure-level options using `matplotlib` (e.g., facecolor).

### Example Usage

The following cell generates a basic plot of topic clouds. Uncomment the other code blocks to see how to apply various configurations.

In [None]:
# Show word clouds for all topics (default settings)
mallet_model.topic_clouds()

# Uncomment the code blocks below for more examples

# # Show clouds for specific topics, with custom options
# mallet_model.topic_clouds(topics=[0, 1], max_terms=20, figsize=(12, 6), round_mask=True, title="Top Terms for Selected Topics")

# # Save the figure instead of displaying
# mallet_model.topic_clouds(output_path="topic_clouds.png", show=False)

# # Advanced: Customize word cloud appearance
# mallet_model.topic_clouds(max_terms=40, opts={"background_color": "black", "colormap": "plasma"}, figure_opts={"facecolor": "black"})

## Visualizing Topic Trends Over Time

The `plot_topics_over_time()` method in the Mallet class allows you to visualize how the probability of a specific topic changes across a sequence of documents, such as those ordered by time or another variable. This is useful for exploring temporal or sequential patterns in your corpus.

**Parameters:**

- `times`: (list) Sequence of time points or other ordering variable, one per document.
- `topic_index`: (int) The topic to plot (0-based index).
- `topic_distributions`: (list[list[float]], optional) Topic distributions per document. If None, uses the model's distributions.
- `topic_keys`: (list[list[str]], optional) Topic keys. If None, uses the model's keys.
- `output_path`: (str, optional) If provided, saves the figure to this path.
- `figsize`: (tuple, optional) Size of the figure (default: (7, 2.5)).
- `font_scale`: (float, optional) Seaborn font scale (default: 1.2).
- `color`: (str, optional) Line color (default: "cornflowerblue").
- `show`: (bool, optional) If True, displays the figure; if False, returns the matplotlib Figure object.
- `title`: (str, optional) Title for the figure. If not supplied, uses topic keywords.

**Note:**

- The `times` list must be the same length as the number of documents.


In [None]:
# Example time points (e.g., years, document order)
times = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]  # Replace with your actual sequence

# Plot the probability of topic 0 over time
mallet_model.plot_topics_over_time(times=times, topic_index=0)

# Uncomment the code below to try out customization options

# Customize appearance and save the figure
# mallet_model.plot_topics_over_time(times=times, topic_index=1, color="darkred", figsize=(10, 3), title="Topic 1 Trend", output_path="topic1_trend.png", show=True)
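
The length requirement noted above can be checked before plotting. The sketch below pairs each time point with the probability of one topic and sorts the pairs chronologically, which is one way to prepare a series for a line plot. It does not reproduce what `plot_topics_over_time()` does internally. The data are invented and the helper name `topic_series` is hypothetical.

```python
def topic_series(times, distributions, topic_index):
    """Return (time, probability) pairs for one topic, sorted by time."""
    if len(times) != len(distributions):
        raise ValueError(
            f"Got {len(times)} time points for {len(distributions)} documents."
        )
    pairs = [(time, dist[topic_index]) for time, dist in zip(times, distributions)]
    return sorted(pairs, key=lambda pair: pair[0])


# Invented data: 3 documents, 2 topics, listed out of chronological order
toy_times = [2003, 2001, 2002]
toy_distributions = [[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
print(topic_series(toy_times, toy_distributions, topic_index=0))
```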


## Advanced: Infer Topics for New Documents

Sometimes you want to train a model and then feed it new documents after training. To help you do this, Lexos creates an inferencer file when you initially train the model. It is automatically saved as `inferencer.mallet` in your model's folder, but you can use the `path_to_inferencer` parameter if you want to save it somewhere else.

In [None]:

# Import the os module and define a path for the inferencer file
import os
inferencer_path = os.path.join(mallet_model.model_dir, "inferencer.mallet")

# Train the model, saving the inferencer file to the path defined above
mallet_model.train(num_topics=20, verbose=False, path_to_inferencer=inferencer_path)

# Display the path where the inferencer file was saved
mallet_model.metadata['path_to_inferencer']

Next, use the inferencer to infer new topic distributions. You will need to define the following paths:

- `pipe_file`: Path to the `.mallet` pipe file created during training. Ensures new documents are processed identically to training data.
- `output_path`: Where to save the topic distributions for the new documents (as a text file).
- `path_to_inferencer`: Path to the inferencer file created during training. This is used to infer topics for new documents.

In the cell below, we use the default paths that should have been created in your model folder.


In [None]:
# Import the os module
import os

# Prepare new documents for inference
new_docs = [
    "A fox runs quickly through the forest.",
    "Dogs are loyal and friendly pets."
]

# Define paths to your pipe file, output path, and inferencer file
pipe_file = os.path.join(mallet_model.model_dir, "training_data.mallet")
output_path = os.path.join(mallet_model.model_dir, "new_doc_topics.txt")
path_to_inferencer = mallet_model.metadata['path_to_inferencer']

# Use the Mallet class's infer() method to get topic distributions
inferred_topics = mallet_model.infer(new_docs, path_to_inferencer=path_to_inferencer, use_pipe_from=pipe_file, output_path=output_path)

# Display the inferred topic distributions (also saved to the output_path)
print("First Two Distributions:")
for i, dist in enumerate(inferred_topics[:2]):
    print(f"Document {i}: {dist}\n")

Now combine the old and the new distributions. Since some of our visualization methods involve categories, we create categories for our new documents as well.

In [None]:
# Combine training and new distributions
all_distributions = mallet_model.distributions + inferred_topics

# Create a combined categories list (must match the length of all_distributions)
# For example, if you have 12 training docs and 2 new docs:
categories_training = ["People", "Concepts", "People", "People", "People", "Battles", "Texts", "Texts", "Animals", "Planets", "People", "People"]
categories_new = ["Animals", "Animals"]  # Label your new documents as needed
all_categories = categories_training + categories_new
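
Before plotting, it is worth confirming that the combined lists line up. The sketch below runs a few sanity checks on invented stand-in data. It assumes that both the training and the inferred distributions are lists with one list of probabilities per document, and the variable names are hypothetical.

```python
import math

# Invented stand-ins for mallet_model.distributions and inferred_topics
training_distributions = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
inferred = [[0.6, 0.4], [0.1, 0.9]]
combined = training_distributions + inferred

# Invented stand-in for the combined category labels
combined_categories = ["People", "Texts", "People"] + ["Animals", "Animals"]

# One category label per document
assert len(combined) == len(combined_categories)
# Every document has the same number of topics
assert len({len(dist) for dist in combined}) == 1
# Each distribution should sum to approximately 1
assert all(math.isclose(sum(dist), 1.0, abs_tol=1e-6) for dist in combined)

print(f"{len(combined)} documents, {len(combined[0])} topics each")
```

If any of these checks fails on your own data, fix the mismatch before passing the combined lists to the plotting methods.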

Now we can create a boxplot with our new distributions.

In [None]:
# Now plot using the combined data
mallet_model.plot_categories_by_topic_boxplots(
    topics=3,
    categories=all_categories,
    topic_distributions=all_distributions,  # Pass the combined distributions
    show=True
)


Here are example heatmap and topics-over-time plots using the combined distributions.

In [None]:
# Example: Visualize topic-category associations with a heatmap using combined distributions
mallet_model.plot_categories_by_topics_heatmap(
    all_categories,
    topic_distributions=all_distributions,
    num_keys=3,
    figsize=(8, 6),
    font_scale=1,
    cmap="viridis",
    show=True,
    title="Topic-Category Heatmap (Combined Distributions)"
)

# Example: Plot topic probability over time for new and training documents
# Here we use the combined distributions and a matching time sequence
times = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014]  # Extend as needed for new docs
mallet_model.plot_topics_over_time(
    times=times,  # Make sure this matches the length of all_distributions
    topic_index=0,
    topic_distributions=all_distributions,
    title="Topic 0 Trend (Combined Distributions)",
    color="blue",
    figsize=(10, 3),
    show=True
)