An interactive topic modeling framework inspired by BERTopic and designed for iterative refinement and human-in-the-loop validation. The goal is to enable better exploration of data with natural language processing tools, made accessible through a user-friendly interface. See below for details.
- Single-line functions: Perform complex operations with minimal coding skills required
- Interactive visualizations: Explore text data quickly and intuitively
- Interactive topic refinement: Split topics, merge topics, manually reassign documents
- Hierarchical structures: Create topic hierarchies via splitting or grouping operations with support for multiple embedding spaces
- Semantic search: Find documents with similar meanings even when they do not share the same words
- Preview-then-commit workflow: Preview topic splits before committing changes
- Full undo/redo support: Track and reverse all modeling operations
- Document validation: Mark high-confidence assignments and optionally freeze them during refitting
- Representative documents: Automatically identify the most representative documents for each topic
- Rich topic metadata: Auto-generated labels, top terms via c-TF-IDF, centroid embeddings
- Flexible scoring: Multiple modes for document-topic similarity (embedding, TF-IDF, harmonic mean)
pip install interactive-topic-model

Refer to the vignette for more details.
import pandas as pd
from interactive_topic_model import InteractiveTopicModel
# Prepare your documents
texts = [
"Machine learning is a subset of artificial intelligence...",
"Deep neural networks have revolutionized computer vision...",
# ... more documents
]
# Initialize model
itm = InteractiveTopicModel(texts)
# Fit initial model
itm.fit()
# View topic summary
print(itm.get_topic_info())
# Access individual topics
topic = itm.topics[0]
print(f"Topic {topic.topic_id}: {topic.label}")
print(f"Top terms: {topic.get_top_terms(n=5)}")
print(topic.get_examples(n=5))
print(topic.get_representative_docs(n=5))
# Interactive refinement
preview = topic.split()
print(preview)
topic.commit_split()
# Merge topics
itm.merge_topics([1, 2], into="new", new_label="Combined Topic")
# Reassign a document
itm.assign_doc(doc_id=42, topic_id=3, validated=True)
# Visualize
# Note: visualize_documents must be imported first; its import path is not shown in this example
fig = visualize_documents(itm)
fig.show()
# Undo if needed
itm.undo()
# Check undo history
print(itm.get_history())

Topic modeling is commonly treated as an objective task, that is, one that has a correct answer. In that view, the task of designing a topic model is to make it as accurate as possible by making it increasingly sophisticated.
I have a different view. Because texts are so rich, the range of acceptable ways to group them is very large. The question then becomes less "Is this the correct classification?" and more "Is this a useful or interesting classification?" That question cannot be answered without interpretive input, because what counts as useful or interesting depends on one's values and purposes.
As such, I have kept the topic modeling operations (in the strict sense) very basic. The outputs of these basic topic models then serve as a basis for further exploration through interaction. In other words, topic modeling (embedding -> dimension reduction -> clustering), along with other tools like semantic search (embedding -> cosine similarity), serves to augment the interpretive process, not replace it.
In line with this reasoning, ITM is also very flexible: multiple pretrained embedding models can be used in concert, since each model encodes linguistic features differently.
In short, this approach empowers people to tailor topic models to their own questions and values, making ITM a tool for assisting the exploration/discovery process rather than automating it.
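The semantic search step mentioned above (embedding -> cosine similarity) can be illustrated with a minimal sketch. The vectors are hand-made toy values standing in for real embeddings, and `cosine_similarity` and `semantic_search` are hypothetical helper names used for illustration only, not part of ITM's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]
results = semantic_search(query_vec, doc_vecs, top_k=2)
```

Because the ranking depends only on vector direction, two documents can match without sharing any words, provided the embedding model places them close together.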
InteractiveTopicModel is the main engine that orchestrates topic modeling operations:
- Manages document assignments, strengths, and validation status
- Coordinates semantic spaces via BasicTopicModel instances
- Provides operations: fit, assign, split, merge, group, archive
- Tracks undo/redo history
A topic object (such as itm.topics[0]) represents a single topic, with:
- Access to documents: get_doc_ids(), get_texts(), get_examples()
- Representations: get_embedding(), get_ctfidf(), get_top_terms()
- Representative documents: get_representative_doc_ids(), get_representative_docs()
- Operations: split(), commit_split()
- Metadata: label, parent, active status
BasicTopicModel manages a semantic space:
- Embeds documents via embedder (e.g., sentence-transformers)
- Reduces dimensions via reducer (e.g., UMAP)
- Clusters via clusterer (e.g., HDBSCAN)
- Caches embeddings and reduced representations
- Can have its own vocabulary (for sub-splits with custom vectorizers)
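One way to picture the three stages listed above is as swappable callables chained in order. The sketch below is an assumption about the general shape, not ITM's implementation: `run_pipeline` and the three toy components are hypothetical, and a real setup would put a sentence-transformers model, UMAP, and HDBSCAN in these roles.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def run_pipeline(
    texts: Sequence[str],
    embedder: Callable[[Sequence[str]], List[Vector]],
    reducer: Callable[[List[Vector]], List[Vector]],
    clusterer: Callable[[List[Vector]], List[int]],
) -> List[int]:
    """Embed, reduce, then cluster; returns one cluster label per text."""
    embeddings = embedder(texts)
    reduced = reducer(embeddings)
    return clusterer(reduced)


# Toy stand-ins (assumptions, for illustration only).
def toy_embedder(texts):
    # Two features per text: character length and word count.
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def toy_reducer(vectors):
    # Keep only the first dimension.
    return [v[:1] for v in vectors]


def toy_clusterer(vectors):
    # Label 0 for short texts, 1 for long ones.
    return [0 if v[0] < 20 else 1 for v in vectors]


labels = run_pipeline(
    ["short text", "a considerably longer piece of text"],
    toy_embedder,
    toy_reducer,
    toy_clusterer,
)
```

Keeping each stage behind a plain callable is what makes it possible to swap in a different embedder, reducer, or clusterer without touching the other two.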
Topics are characterized by multiple representations:
- Representative documents: Documents with highest cluster membership strength
- c-TF-IDF vectors: Class-based TF-IDF distinguishing each topic from siblings
- Centroid embeddings: Mean embedding of representative documents
- Top terms: Highest-weighted terms from c-TF-IDF
- Auto-generated labels: Based on top terms
All representations are computed lazily and invalidated when assignments change.
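Lazy computation with invalidation can be sketched as a cached value that is cleared whenever assignments change. `LazyTopic` is a hypothetical toy class, not ITM's topic class, and for simplicity it averages all assigned documents rather than only the representative ones.

```python
class LazyTopic:
    """Toy topic that caches its centroid until assignments change."""

    def __init__(self, embeddings):
        self._embeddings = embeddings  # doc_id -> vector
        self._doc_ids = set()
        self._centroid = None          # None means "stale, recompute on demand"
        self.compute_count = 0         # how many times the centroid was rebuilt

    def assign(self, doc_id):
        self._doc_ids.add(doc_id)
        self._centroid = None          # invalidate the cached centroid

    def get_embedding(self):
        if self._centroid is None:
            if not self._doc_ids:
                raise ValueError("topic has no documents")
            vectors = [self._embeddings[d] for d in sorted(self._doc_ids)]
            dims = len(vectors[0])
            self._centroid = [
                sum(v[i] for v in vectors) / len(vectors) for i in range(dims)
            ]
            self.compute_count += 1
        return self._centroid


topic = LazyTopic({0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]})
topic.assign(0)
topic.assign(1)
first = topic.get_embedding()   # computed on first access
second = topic.get_embedding()  # served from the cache
topic.assign(2)                 # invalidates the cache
third = topic.get_embedding()   # recomputed
```

The centroid is rebuilt only twice here, even though it is requested three times.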
ITM provides default components but allows full customization:
from interactive_topic_model import (
InteractiveTopicModel,
SentenceTransformerEmbedder,
UMAPReducer,
HDBSCANClusterer,
custom_vectorizer,
harmonic_scorer,
)
# Custom components
itm = InteractiveTopicModel(
texts=texts,
embedder=SentenceTransformerEmbedder("all-mpnet-base-v2"),
reducer=UMAPReducer(n_components=5, n_neighbors=15),
clusterer=HDBSCANClusterer(min_cluster_size=10),
vectorizer=custom_vectorizer(
ngram_range=(1, 3),
min_df=3,
max_df=0.85,
use_en_stopwords=True,
),
scorer=harmonic_scorer,
n_representative_docs=5,
)

Embedders:
- SentenceTransformerEmbedder: sentence-transformers models
- e5_base_embedder(): Convenience function for E5-base-v2
Reducers:
- UMAPReducer: UMAP dimensionality reduction
Clusterers:
- HDBSCANClusterer: Hierarchical DBSCAN clustering
Scorers:
- embedding_scorer: Cosine similarity on embeddings
- tfidf_scorer: Cosine similarity on TF-IDF vectors
- harmonic_scorer: Harmonic mean of both (default)
Vectorizers:
- default_vectorizer(): Balanced defaults for general use
- custom_vectorizer(): Customizable n-grams, stopwords, etc.
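To illustrate why a harmonic mean is a sensible way to combine the embedding and TF-IDF similarities, here is a minimal sketch. `harmonic_mean_score` is a hypothetical function, not the library's harmonic_scorer, and returning 0.0 for non-positive inputs is an assumption made here, since the behaviour in that case is not described.

```python
def harmonic_mean_score(embedding_sim, tfidf_sim):
    """Harmonic mean of two similarities; 0.0 if either is non-positive."""
    if embedding_sim <= 0.0 or tfidf_sim <= 0.0:
        return 0.0  # assumed convention for this sketch
    return 2.0 * embedding_sim * tfidf_sim / (embedding_sim + tfidf_sim)


balanced = harmonic_mean_score(0.8, 0.8)   # both signals agree
lopsided = harmonic_mean_score(0.9, 0.1)   # signals disagree
arithmetic = (0.9 + 0.1) / 2.0             # plain average, for comparison
```

The lopsided pair scores well below its arithmetic average, so a document ranks highly only when both similarities are reasonably high.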
Fitting:
- fit(): Fit initial topic model
Assignment Operations:
- assign_doc(doc_id, topic_id, strength, validated): Assign document to topic
- validate_doc(doc_id): Mark document as validated
- suggest_assignment(doc_id, mode, threshold): Suggest best topic
Topic Operations:
- merge_topics(topic_ids, into, new_label, archive_sources, validate_refits): Merge topics
- archive_topic(topic_id): Deactivate a topic
- group_topics(topic_ids, label): Create hierarchical grouping
Information Retrieval:
- get_topic_info(top_words_n, show_inactive, show_outliers): Summary DataFrame
- get_representative_documents(topic_ids, n): Representative docs per topic
- get_examples(topics, n, include_descendants, random_state): Sample documents
- get_active_topics(): List of active topic IDs
- get_outlier_count(): Count of outlier documents
- get_validated_count(): Count of validated documents
Undo/Redo:
- undo(steps): Undo operations
- redo(steps): Redo operations
- can_undo(): Check if undo is possible
- can_redo(): Check if redo is possible
- get_history(stack, reverse, include_timestamp): View edit history
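A common way to support undo and redo is a pair of stacks holding reversible operations. The sketch below assumes that design; `History` and `make_reassign` are hypothetical names, and ITM's own history tracking may work differently.

```python
class History:
    """Two-stack undo/redo over reversible operations."""

    def __init__(self):
        self._undo_stack = []
        self._redo_stack = []

    def record(self, description, do, undo):
        """Run `do`, remember how to reverse it, and clear the redo stack."""
        do()
        self._undo_stack.append((description, do, undo))
        self._redo_stack.clear()

    def can_undo(self):
        return bool(self._undo_stack)

    def can_redo(self):
        return bool(self._redo_stack)

    def undo(self, steps=1):
        for _ in range(steps):
            if not self._undo_stack:
                break
            entry = self._undo_stack.pop()
            entry[2]()  # run the reversing callable
            self._redo_stack.append(entry)

    def redo(self, steps=1):
        for _ in range(steps):
            if not self._redo_stack:
                break
            entry = self._redo_stack.pop()
            entry[1]()  # re-run the original callable
            self._undo_stack.append(entry)

    def get_history(self):
        return [description for description, _, _ in self._undo_stack]


assignments = {42: 1}  # doc_id -> topic_id
history = History()


def make_reassign(doc_id, new_topic):
    """Build a do/undo pair that moves one document between topics."""
    old_topic = assignments[doc_id]

    def do():
        assignments[doc_id] = new_topic

    def undo():
        assignments[doc_id] = old_topic

    return do, undo


do, undo = make_reassign(42, 3)
history.record("assign doc 42 -> topic 3", do, undo)
after_do = assignments[42]
history.undo()
after_undo = assignments[42]
history.redo()
after_redo = assignments[42]
```

Recording a new operation clears the redo stack, so the history stays linear.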
Document Access:
- get_count(include_descendants): Number of documents
- get_doc_ids(include_descendants): List of document IDs
- get_texts(include_descendants): List of document texts
- get_examples(n, include_descendants, random_state): Sample documents
- get_representative_doc_ids(): IDs of representative documents
- get_representative_docs(): Texts of representative documents
Representations:
- get_embedding(): Centroid embedding
- get_ctfidf(): c-TF-IDF vector
- get_top_terms(n): Top n terms with scores
- get_auto_label(n): Generate label from top terms
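For intuition about what a c-TF-IDF vector and its top terms contain, the sketch below follows the class-based weighting described in the BERTopic documentation: term frequency within a class times log(1 + A / f), where A is the average number of words per class and f is the term's frequency across all classes. Whether ITM uses exactly this weighting is an assumption, and `ctfidf` and `top_terms` are hypothetical helpers.

```python
import math
from collections import Counter


def ctfidf(class_docs):
    """class_docs: topic_id -> list of tokenized documents.

    Returns topic_id -> {term: weight}.
    """
    # Count terms per class, treating each class as one concatenated document.
    counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in class_docs.items()
    }
    total_per_term = Counter()
    for counter in counts.values():
        total_per_term.update(counter)
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)

    weights = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        weights[c] = {
            term: (freq / n_words) * math.log(1.0 + avg_words / total_per_term[term])
            for term, freq in counter.items()
        }
    return weights


def top_terms(term_weights, n=3):
    """Highest-weighted terms first; ties broken alphabetically."""
    return sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy tokenized corpus with two topics (assumed data, for illustration only).
class_docs = {
    0: [["neural", "network", "training"], ["neural", "model"]],
    1: [["market", "price", "model"], ["market", "trading"]],
}
weights = ctfidf(class_docs)
best_0 = top_terms(weights[0], n=1)[0][0]
best_1 = top_terms(weights[1], n=1)[0][0]
```

In this toy corpus, "model" appears in both topics and is therefore down-weighted relative to terms that occur in only one.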
Operations:
- split(clusterer, ...): Preview topic split
- commit_split(): Commit the split preview
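The preview-then-commit pattern behind these two operations can be sketched as staging a result and applying it only on request. `SplittableTopic` is a hypothetical toy class; the real split() accepts more options and its preview object may look different.

```python
class SplittableTopic:
    """Toy topic whose split is staged first and applied only on commit."""

    def __init__(self, doc_ids):
        self.doc_ids = list(doc_ids)
        self._pending = None  # staged split, not yet applied

    def split(self, clusterer):
        """Stage a split and return the preview; nothing is applied yet."""
        labels = clusterer(self.doc_ids)
        preview = {}
        for doc_id, label in zip(self.doc_ids, labels):
            preview.setdefault(label, []).append(doc_id)
        self._pending = preview
        return preview

    def commit_split(self):
        """Apply the staged split and return the new child topics."""
        if self._pending is None:
            raise RuntimeError("call split() before commit_split()")
        children = [SplittableTopic(ids) for _, ids in sorted(self._pending.items())]
        self._pending = None
        return children


topic = SplittableTopic([10, 11, 12, 13])
# Toy clusterer (assumption): separates even and odd document IDs.
preview = topic.split(lambda ids: [i % 2 for i in ids])
children = topic.commit_split()
```

Calling split() again before committing simply replaces the staged preview, which is what makes it cheap to try several clusterer settings first.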
Properties:
- label: Topic label (get/set)
- parent: Parent topic or ITM
- active: Whether topic is active
- semantic_space: Associated BasicTopicModel
- Update vignette to demonstrate more features and navigation
- Add serialization features
- Add advanced semantic features (idea training, etc.)
- Add assignment export/import features
MIT License