# Topic Modeling Analysis with BERTopic

This notebook demonstrates how to use the **topic_modeling_analysis** module for building, training, and analyzing topic models on scientific literature.

## Key Features
- ✅ **Dual format support**: Works with both legacy JSON (nested paragraph lists) and new format from `retrieve_query_texts.py`
- ✅ **Multiple text modes**: Analyze query results alone ("result") or with full context ("result-with-context")
- ✅ **Automatic query level detection**: Distinguishes sentence-level vs paragraph-level queries
- ✅ **Standardized naming**: Models automatically named by query_level + text_mode + nr_topics
- ✅ **Modular architecture**: Separate classes for data loading, embedding generation, modeling, evaluation, and visualization
- ✅ **GPU acceleration**: Apple Silicon MPS and NVIDIA CUDA support for embeddings
- ✅ **Coherence evaluation**: Multiple metrics (C_V, U_Mass) for model quality assessment
- ✅ **Interactive visualizations**: Plotly-based 3D scatter plots, hierarchies, barcharts, and heatmaps
- ✅ **Checkpoint/resume**: Save and load models at any stage
- ✅ **Batch processing**: Build models for all text modes with single function call

## Text Modes Explained
- **"result"**: Analyzes only the matched text from the query (the `text` field)
- **"result-with-context"**: Analyzes the matched text together with `context_before` and `context_after` fields, providing more contextual information for topic modeling

## Model Naming Convention
Models are automatically named following the pattern: `{query_level}_{text_mode}_{nr_topics}`

Examples:
- `sentence_result_auto` - Sentence-level query, result only, automatic topic count
- `sentence_context_20topics` - Sentence-level query, with context, 20 topics
- `paragraph_result_auto` - Paragraph-level query, result only, automatic topic count

## Workflow Overview
1. Setup environment and load data
2. Choose text mode (result or result-with-context)
3. Generate or load embeddings
4. Configure and build topic model
5. Evaluate model quality (coherence)
6. Generate visualizations
7. Export results and distributions
8. (Optional) Test different topic numbers
9. (Optional) Build models for all text modes to compare


## 1. Setup and Configuration

Import the topic modeling module and configure environment.

In [1]:
# Import the topic modeling module
from topic_modeling_analysis import (
    DataLoader,
    EmbeddingGenerator,
    TopicModeler,
    ModelEvaluator,
    Visualizer,
    TopicDistributionAnalyzer,
    ModelConfig,
    EnvironmentSetup,
    test_topic_numbers,
    build_all_text_mode_models
)

from pathlib import Path
import pandas as pd
import numpy as np

# Setup environment (GPU detection, working directory, etc.)
env_info = EnvironmentSetup.setup_local_environment()

print("\n" + "="*60)
print("Environment Setup Complete")
print("="*60)
print(f"GPU Available: {env_info['gpu_available']}")
if env_info['gpu_available']:
    print(f"GPU: {env_info['gpu_name']}")
    if env_info['gpu_memory'] is not None:
        print(f"GPU Memory: {env_info['gpu_memory']:.1f} GB")


Working directory: /Users/wiktorrorot/Library/CloudStorage/GoogleDrive-w.rorot@uw.edu.pl/My Drive/ornak/corpus-study
GPU available: Apple Silicon (MPS)

Environment Setup Complete
GPU Available: True
GPU: Apple Silicon (MPS)


In [2]:
# Configure paths and parameters
DATA_DIR = Path("./compositionality-data")  # Change to your data directory
OUTPUT_DIR = Path("./compositionality-data/topic_models")  # Output directory for models and visualizations

# Input data file (can be legacy format or retrieve_query_texts.py output)
INPUT_JSON = DATA_DIR / "sentence_query_results_text_20251113_1139.json"  # Change to your file

# Text mode: "result" (just query text) or "result-with-context" (text + context_before/after)
TEXT_MODE = "result"  # Change to "result-with-context" to include context

# Create directories
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Input file: {INPUT_JSON}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Text mode: {TEXT_MODE}")


Input file: compositionality-data/sentence_query_results_text_20251113_1139.json
Output directory: compositionality-data/topic_models
Text mode: result


## 2. Load Data

The `DataLoader` automatically detects the format (legacy or new) and loads the data accordingly.

In [3]:
# Initialize data loader with text mode
loader = DataLoader(INPUT_JSON, text_mode=TEXT_MODE)

# Load data (automatically detects format and query level)
docs, metadata_df = loader.load()

print(f"\nLoaded {len(docs)} documents")
print(f"Format detected: {loader.format_type}")
print(f"Query level detected: {loader.query_level}")
print(f"\nMetadata columns: {list(metadata_df.columns)}")
print(f"\nFirst 3 documents:")
for i, doc in enumerate(docs[:3]):
    print(f"\n{i+1}. {doc[:200]}...")


Loading data from: compositionality-data/sentence_query_results_text_20251113_1139.json
Text mode: result
Detected format: retrieve_query_texts
Query level: sentence
Loading retrieve_query_texts format (mode: result)...
Loaded 14000 documents

Loaded 14000 documents
Format detected: retrieve_query_texts
Query level detected: sentence

Metadata columns: ['query_idx', 'query', 'rank', 'distance', 'corpusid', 'collection', 'sentence_number', 'sentence_indices', 'text', 'context_before', 'context_after']

First 3 documents:

1. Some people have such responses to swear-words, some to racist or homophobic language, while others are unbothered by such things....

2. This process, Kirby argues, leads to many of the features of language commonly interpreted as reflecting biological specialization, such as the way that all languages break down utterances into words ...

3. This process, Kirby argues, leads to many of the features of language commonly interpreted as reflecting biological speciali

## 3. Generate or Load Embeddings

Generate embeddings using sentence transformers or load pre-computed embeddings.

In [4]:
# Create preliminary config to generate standardized filename
temp_config = ModelConfig(
    query_level=loader.query_level,
    text_mode=TEXT_MODE,
    nr_topics=None  # We'll set this later
)
model_suffix = temp_config.get_model_suffix()

# Path for saving/loading embeddings (unique per text mode and query level)
EMBEDDINGS_FILE = OUTPUT_DIR / f"embeddings_{model_suffix}.pkl"

# Check if embeddings already exist
if EMBEDDINGS_FILE.exists():
    print(f"Loading pre-computed embeddings from {EMBEDDINGS_FILE}")
    embeddings = EmbeddingGenerator.load(EMBEDDINGS_FILE)
else:
    print("Generating new embeddings...")
    # Initialize embedding generator
    embedding_gen = EmbeddingGenerator(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    
    # Generate embeddings
    embeddings = embedding_gen.generate(docs, batch_size=128, show_progress=True)
    
    # Save for future use
    embedding_gen.save(embeddings, EMBEDDINGS_FILE)

print(f"\nEmbeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")


Loading pre-computed embeddings from compositionality-data/topic_models/embeddings_sentence_result_auto.pkl
Loaded embeddings from: compositionality-data/topic_models/embeddings_sentence_result_auto.pkl

Embeddings shape: (14000, 384)
Embedding dimension: 384


## 4. Configure Model

Create a configuration object with your desired parameters.

In [16]:
# Create model configuration
config = ModelConfig(
    # Model parameters
    embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
    nr_topics=22,  # None = automatic, or specify a number (e.g., 20)
    calculate_probabilities=False,
    
    # Data characteristics (set from loader)
    query_level=loader.query_level,
    text_mode=TEXT_MODE,
    
    # UMAP parameters
    umap_n_neighbors=15,
    umap_n_components=10,
    umap_min_dist=0.0,
    
    # HDBSCAN parameters
    hdbscan_min_cluster_size=50,
    
    # Outlier reduction
    outlier_strategy="distributions",  # Options: "distributions", "embeddings", "c-tf-idf"
    outlier_threshold=0.1,
    
    # GPU
    use_gpu=True,  # Set to False if no GPU available
    
    # Output
    output_dir=OUTPUT_DIR
)

# Get standardized model suffix
model_suffix = config.get_model_suffix()

print("Model configuration:")
print(f"  Embedding model: {config.embedding_model_name}")
print(f"  Query level: {config.query_level}")
print(f"  Text mode: {config.text_mode}")
print(f"  Number of topics: {config.nr_topics or 'automatic'}")
print(f"  UMAP neighbors: {config.umap_n_neighbors}")
print(f"  HDBSCAN min cluster size: {config.hdbscan_min_cluster_size}")
print(f"  GPU enabled: {config.use_gpu}")
print(f"  Model suffix: {model_suffix}")


Model configuration:
  Embedding model: sentence-transformers/all-MiniLM-L6-v2
  Query level: sentence
  Text mode: result
  Number of topics: 22
  UMAP neighbors: 15
  HDBSCAN min cluster size: 50
  GPU enabled: True
  Model suffix: sentence_result_22topics


## 5. Build and Train Topic Model

Fit the BERTopic model to the documents.

In [17]:
# Initialize modeler
modeler = TopicModeler(config)

# Fit model
print("Fitting topic model...")
model, topics, probs = modeler.fit(docs, embeddings)

# Display topic information
topic_info = model.get_topic_info()
print(f"\n{'='*60}")
print(f"Model fitted successfully!")
print(f"{'='*60}")
print(f"Number of topics: {len(topic_info)}")
print(f"Number of outliers: {(np.array(topics) == -1).sum()}")
print(f"\nTop 10 topics:")
print(topic_info.head(10))

Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Fitting topic model...
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 22 topics

Model fitted successfully!
Number of topics: 22
Number of outliers: 4799

Top 10 topics:
   Topic  Count                                               Name  \
0     -1    864              -1_language_linguistic_words_meanings   
1      0   2035      0_utterance_utterances_speaker_interpretation   
2      1   1730  1_communication_human_human communication_comm...   
3      2   1723  2_compositionality_compositional_composition_l...   
4      3   1033                3_language_evolution_cultural_human   
5      4   1013            4_syntax_syntactic_structure_structures   
6      5    838             5_context_contexts_contextual_language   
7      6    933          6_meaning_expression_meanings_expressions   
8      7    387               7_language_use_communication_symbols   
9      8    609        8_words_representations_semantic_linguistic   

                                      Representation  \
0  [language, linguistic, meaning, words, meaning...   

## 6. Evaluate Model Quality

Calculate coherence metrics to assess topic quality.

In [18]:
# Initialize evaluator
evaluator = ModelEvaluator(coherence_metrics=["c_v", "u_mass"])

# Calculate coherence
coherence_scores = evaluator.evaluate(docs, model)

print("\n" + "="*60)
print("Coherence Scores")
print("="*60)
for metric, score in coherence_scores.items():
    if score is not None:
        print(f"{metric.upper()}: {score:.4f}")
    else:
        print(f"{metric.upper()}: Failed to calculate")

# Save coherence scores
coherence_df = pd.DataFrame([coherence_scores])
coherence_file = OUTPUT_DIR / "coherence_scores.csv"
coherence_df.to_csv(coherence_file, index=False)
print(f"\nSaved coherence scores to: {coherence_file}")

Evaluating model coherence...
Number of topics: 22
Evaluating 21 valid topics
Calculating c_v coherence...
c_v coherence = 0.4530
Calculating u_mass coherence...
u_mass coherence = -4.8878

Coherence Scores
C_V: 0.4530
U_MASS: -4.8878

Saved coherence scores to: compositionality-data/topic_models/coherence_scores.csv


## 7. Export Results and Distributions

Extract document-level topic assignments and distributions. **Note:** These must be computed before generating interactive visualizations.

In [19]:
# Initialize analyzer
analyzer = TopicDistributionAnalyzer(OUTPUT_DIR)

# Compute document info (topic assignments)
doc_info = analyzer.compute_document_info(
    model=model,
    docs=docs,
    metadata_df=metadata_df,
    name_suffix=f"_{model_suffix}"
)

print(f"\nDocument info shape: {doc_info.shape}")
print(f"\nColumns: {list(doc_info.columns)}")
print(f"\nFirst 5 documents:")
print(doc_info[['Document', 'Topic', 'Name', 'Probability']].head())


Computing document-level topic information...
Built-in method failed: list index out of range
Creating document info manually...
✅ Saved document info to: compositionality-data/topic_models/document_info_sentence_result_22topics.csv

Document info shape: (14000, 15)

Columns: ['Document', 'Topic', 'Probability', 'Name', 'query_idx', 'query', 'rank', 'distance', 'corpusid', 'collection', 'sentence_number', 'sentence_indices', 'text', 'context_before', 'context_after']

First 5 documents:
                                            Document  Topic     Name  \
0  Some people have such responses to swear-words...     -1  Outlier   
1  This process, Kirby argues, leads to many of t...     -1  Outlier   
2  This process, Kirby argues, leads to many of t...     -1  Outlier   
3  This process, Kirby argues, leads to many of t...     -1  Outlier   
4  This process, Kirby argues, leads to many of t...     -1  Outlier   

   Probability  
0          0.0  
1          0.0  
2          0.0  
3      

In [20]:
# Compute topic distributions (probabilities for all topics per document)
topic_dist = analyzer.compute_topic_distributions(
    model=model,
    docs=docs,
    metadata_df=metadata_df,
    name_suffix=f"_{model_suffix}"
)

print(f"\nTopic distributions shape: {topic_dist.shape}")
print(f"\nFirst 3 rows (showing first few topic columns):")
topic_cols = [col for col in topic_dist.columns if col.startswith('topic_id=')]
print(topic_dist[topic_cols[:5]].head(3))


Computing topic distributions...
✅ Saved topic distributions to: compositionality-data/topic_models/topic_distributions_sentence_result_22topics.csv

Topic distributions shape: (14000, 32)

First 3 rows (showing first few topic columns):
   topic_id=0  topic_id=1  topic_id=2  topic_id=3  topic_id=4
0    0.000000         0.0    0.045659    0.192322    0.082855
1    0.552463         0.0    0.000000    0.161949    0.163583
2    0.552463         0.0    0.000000    0.161949    0.163583


## 8. Generate Visualizations

Create interactive visualizations for exploring topics. **Note:** This step requires topic distributions to be computed first (previous cells).

In [21]:
# Initialize visualizer
visualizer = Visualizer(OUTPUT_DIR)

# Debug: Check what we have available
print("\n" + "="*60)
print("Debug: Checking available data before visualization")
print("="*60)
print(f"Number of docs: {len(docs)}")
print(f"Embeddings shape: {embeddings.shape}")
print(f"Number of topics: {len(topics)}")
print(f"Model object: {type(model)}")
print(f"Has probabilities: {probs is not None}")
if probs is not None:
    print(f"Probs shape: {probs.shape}")
print(f"Topic info shape: {topic_info.shape}")

# Try to check if model has required reduced embeddings
try:
    if hasattr(model, 'umap_model') and model.umap_model is not None:
        print(f"✅ Model has UMAP model")
    else:
        print(f"⚠️ Model does not have UMAP model")
except Exception as e:
    print(f"⚠️ Error checking UMAP model: {e}")

print("\n" + "="*60)
print("Generating visualizations...")
print("="*60)

# Generate all visualizations using standardized naming
visualizer.generate_all(
    model=model,
    docs=docs,
    embeddings=embeddings,
    name_suffix=f"_{model_suffix}"
)

print(f"\nVisualizations saved to: {OUTPUT_DIR}")
print("Files created:")
print(f"  - vis_documents_{model_suffix}_umap_2d.html")
print(f"  - vis_documents_{model_suffix}_umap_3d.html")
print(f"  - vis_documents_{model_suffix}_tsne_2d.html")
print(f"  - vis_documents_{model_suffix}_tsne_3d.html")
print(f"  - vis_hierarchy_{model_suffix}.html")
print(f"  - vis_barchart_{model_suffix}.html")
print(f"  - vis_heatmap_{model_suffix}.html (if >8 topics)")


Debug: Checking available data before visualization
Number of docs: 14000
Embeddings shape: (14000, 384)
Number of topics: 14000
Model object: <class 'bertopic._bertopic.BERTopic'>
Has probabilities: True
Probs shape: (14000,)
Topic info shape: (22, 8)
✅ Model has UMAP model

Generating visualizations...
Generating visualizations...
Creating custom document visualizations...
  Documents: 14000
  Embeddings shape: (14000, 384)
  Outliers: 864
  Topics: 21
  Sample topic names: [(-1, 'nan'), (0, '0_utterance_utterances_speaker_interpretation'), (1, '1_communication_human_human communication_communicative')]
  Reducing dimensions with UMAP...
    Input embeddings shape: (14000, 384)
    ✓ UMAP 2D coords shape: (14000, 2)
    ✓ UMAP 2D range: X=[-9.25, 23.37], Y=[-10.39, 21.73]
    ✓ UMAP 3D coords shape: (14000, 3)
    ✓ UMAP 3D range: X=[-5.95, 19.18], Y=[-7.60, 18.55], Z=[-9.25, 15.51]
  Reducing dimensions with t-SNE...
    ✓ t-SNE 2D coords shape: (14000, 2)
    ✓ t-SNE 2D range: X=[

100%|██████████| 20/20 [00:00<00:00, 651.32it/s]
Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.


✅ Hierarchical visualization complete
Creating barchart visualization...


Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.


✅ Barchart visualization complete
Creating heatmap visualization...


Numba: Attempted to fork from a non-main thread, the TBB library may be in an invalid state in the child process.


✅ Heatmap visualization complete
Visualization generation completed

Visualizations saved to: compositionality-data/topic_models
Files created:
  - vis_documents_sentence_result_22topics_umap_2d.html
  - vis_documents_sentence_result_22topics_umap_3d.html
  - vis_documents_sentence_result_22topics_tsne_2d.html
  - vis_documents_sentence_result_22topics_tsne_3d.html
  - vis_hierarchy_sentence_result_22topics.html
  - vis_barchart_sentence_result_22topics.html
  - vis_heatmap_sentence_result_22topics.html (if >8 topics)


## 9. Save Model

Save the trained model for future use.

In [22]:
# Save the model with standardized naming
model_file = OUTPUT_DIR / f"topic_model_{model_suffix}.safetensors"
modeler.save(model_file)

# Also save topic info as CSV
topic_info_file = OUTPUT_DIR / f"topic_info_{model_suffix}.csv"
topic_info.to_csv(topic_info_file, index=False)

print(f"Model saved to: {model_file}")
print(f"Topic info saved to: {topic_info_file}")
print(f"\nModel suffix used: {model_suffix}")
print(f"  Query level: {config.query_level}")
print(f"  Text mode: {config.text_mode}")
print(f"  Topics: {config.nr_topics or 'automatic'}")


Saved model to: compositionality-data/topic_models/topic_model_sentence_result_22topics.safetensors
Model saved to: compositionality-data/topic_models/topic_model_sentence_result_22topics.safetensors
Topic info saved to: compositionality-data/topic_models/topic_info_sentence_result_22topics.csv

Model suffix used: sentence_result_22topics
  Query level: sentence
  Text mode: result
  Topics: 22


## 10. Load Pretrained Model (Optional)

Load a previously saved model and use it for analysis.

In [None]:
# Example: Load a pretrained model
# loaded_model = TopicModeler.load(model_file)

# # Use loaded model to transform new documents
# new_topics, new_probs = loaded_model.transform(docs, embeddings)

# print(f"Loaded model has {len(loaded_model.get_topic_info())} topics")

print("To load a model, uncomment the code above and provide the path to the saved model.")

## 11. Test Different Numbers of Topics (Optional)

Find the optimal number of topics by testing a range and comparing coherence scores.

In [15]:
# WARNING: This can be time-consuming!
# Uncomment to run topic number testing

results = test_topic_numbers(
    docs=docs,
    embeddings=embeddings,
    range_min=14,
    range_max=30,
    step=1,
    output_dir=OUTPUT_DIR
)

print("\nCoherence scores by number of topics:")
print(results)

# Plot results
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# C_V coherence
if 'c_v' in results.columns:
    axes[0].plot(results.index, results['c_v'], marker='o')
    axes[0].set_xlabel('Number of Topics')
    axes[0].set_ylabel('C_V Coherence')
    axes[0].set_title('C_V Coherence by Number of Topics')
    axes[0].grid(True)

# U_Mass coherence
if 'u_mass' in results.columns:
    axes[1].plot(results.index, results['u_mass'], marker='o', color='orange')
    axes[1].set_xlabel('Number of Topics')
    axes[1].set_ylabel('U_Mass Coherence')
    axes[1].set_title('U_Mass Coherence by Number of Topics')
    axes[1].grid(True)

plt.tight_layout()
# plt.savefig(OUTPUT_DIR / 'coherence_comparison.png', dpi=300)
plt.show()

# print("To test different numbers of topics, uncomment the code above.")


Testing 14 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 14 topics
Evaluating model coherence...
Number of topics: 14
Evaluating 13 valid topics
Calculating c_v coherence...
c_v coherence = 0.4078
Calculating u_mass coherence...
u_mass coherence = -5.1206
Coherence for 14 topics: {'c_v': 0.40776530349658885, 'u_mass': -5.120634748865859}
Elapsed: 18.5s
Memory usage after cleanup: 74.9%

Testing 15 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 15 topics
Evaluating model coherence...
Number of topics: 15
Evaluating 14 valid topics
Calculating c_v coherence...
c_v coherence = 0.4290
Calculating u_mass coherence...
u_mass coherence = -4.8967
Coherence for 15 topics: {'c_v': 0.4290201466281963, 'u_mass': -4.896694373914412}
Elapsed: 18.8s
Memory usage after cleanup: 77.2%

Testing 16 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 16 topics
Evaluating model coherence...
Number of topics: 16
Evaluating 15 valid topics
Calculating c_v coherence...
c_v coherence = 0.4395
Calculating u_mass coherence...
u_mass coherence = -4.8730
Coherence for 16 topics: {'c_v': 0.43952121458416066, 'u_mass': -4.873030960301079}
Elapsed: 19.3s
Memory usage after cleanup: 78.5%

Testing 17 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 17 topics
Evaluating model coherence...
Number of topics: 17
Evaluating 16 valid topics
Calculating c_v coherence...
c_v coherence = 0.4372
Calculating u_mass coherence...
u_mass coherence = -4.7596
Coherence for 17 topics: {'c_v': 0.437170971820403, 'u_mass': -4.7596147780331375}
Elapsed: 19.5s
Memory usage after cleanup: 78.3%

Testing 18 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 18 topics
Evaluating model coherence...
Number of topics: 18
Evaluating 17 valid topics
Calculating c_v coherence...
c_v coherence = 0.4419
Calculating u_mass coherence...
u_mass coherence = -4.6498
Coherence for 18 topics: {'c_v': 0.4419416882439639, 'u_mass': -4.649759881843556}
Elapsed: 19.8s
Memory usage after cleanup: 81.5%

Testing 19 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 19 topics
Evaluating model coherence...
Number of topics: 19
Evaluating 18 valid topics
Calculating c_v coherence...
c_v coherence = 0.4628
Calculating u_mass coherence...
u_mass coherence = -4.5800
Coherence for 19 topics: {'c_v': 0.46279035237173516, 'u_mass': -4.580029445537191}
Elapsed: 20.3s
Memory usage after cleanup: 78.0%

Testing 20 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 20 topics
Evaluating model coherence...
Number of topics: 20
Evaluating 19 valid topics
Calculating c_v coherence...
c_v coherence = 0.4498
Calculating u_mass coherence...
u_mass coherence = -4.8338
Coherence for 20 topics: {'c_v': 0.4497692968123441, 'u_mass': -4.83379909036446}
Elapsed: 20.3s
Memory usage after cleanup: 79.5%

Testing 21 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 21 topics
Evaluating model coherence...
Number of topics: 21
Evaluating 20 valid topics
Calculating c_v coherence...
c_v coherence = 0.4484
Calculating u_mass coherence...
u_mass coherence = -4.7174
Coherence for 21 topics: {'c_v': 0.4484347065989428, 'u_mass': -4.717360538935319}
Elapsed: 20.2s
Memory usage after cleanup: 79.9%

Testing 22 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 22 topics
Evaluating model coherence...
Number of topics: 22
Evaluating 21 valid topics
Calculating c_v coherence...
c_v coherence = 0.4530
Calculating u_mass coherence...
u_mass coherence = -4.8878
Coherence for 22 topics: {'c_v': 0.4529674393856303, 'u_mass': -4.887781348581775}
Elapsed: 20.1s
Memory usage after cleanup: 79.7%

Testing 23 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 23 topics
Evaluating model coherence...
Number of topics: 23
Evaluating 22 valid topics
Calculating c_v coherence...
c_v coherence = 0.4433
Calculating u_mass coherence...
u_mass coherence = -4.9388
Coherence for 23 topics: {'c_v': 0.4433236616963484, 'u_mass': -4.938778315872197}
Elapsed: 21.0s
Memory usage after cleanup: 79.4%

Testing 24 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 24 topics
Evaluating model coherence...
Number of topics: 24
Evaluating 23 valid topics
Calculating c_v coherence...
c_v coherence = 0.4441
Calculating u_mass coherence...
u_mass coherence = -5.0195
Coherence for 24 topics: {'c_v': 0.44414753907426524, 'u_mass': -5.019542618986798}
Elapsed: 21.9s
Memory usage after cleanup: 77.9%

Testing 25 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 25 topics
Evaluating model coherence...
Number of topics: 25
Evaluating 24 valid topics
Calculating c_v coherence...
c_v coherence = 0.4728
Calculating u_mass coherence...
u_mass coherence = -4.8388
Coherence for 25 topics: {'c_v': 0.47278260267490085, 'u_mass': -4.838792107865115}
Elapsed: 21.6s
Memory usage after cleanup: 77.7%

Testing 26 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 26 topics
Evaluating model coherence...
Number of topics: 26
Evaluating 25 valid topics
Calculating c_v coherence...
c_v coherence = 0.4516
Calculating u_mass coherence...
u_mass coherence = -5.0912
Coherence for 26 topics: {'c_v': 0.45164716536088967, 'u_mass': -5.091223874227555}
Elapsed: 21.7s
Memory usage after cleanup: 78.7%

Testing 27 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 27 topics
Evaluating model coherence...
Number of topics: 27
Evaluating 26 valid topics
Calculating c_v coherence...
c_v coherence = 0.4461
Calculating u_mass coherence...
u_mass coherence = -5.4188
Coherence for 27 topics: {'c_v': 0.44610950563098856, 'u_mass': -5.418775575968378}
Elapsed: 21.5s
Memory usage after cleanup: 80.9%

Testing 28 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 28 topics
Evaluating model coherence...
Number of topics: 28
Evaluating 27 valid topics
Calculating c_v coherence...
c_v coherence = 0.4418
Calculating u_mass coherence...
u_mass coherence = -5.5626
Coherence for 28 topics: {'c_v': 0.4418246523516525, 'u_mass': -5.562647775753709}
Elapsed: 21.7s
Memory usage after cleanup: 79.2%

Testing 29 topics
Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2
Building BERTopic model...
cuML not available, using CPU UMAP
cuML not available, using CPU HDBSCAN
Fitting model...
Reducing outliers...




Model fitted with 29 topics
Evaluating model coherence...
Number of topics: 29
Evaluating 28 valid topics
Calculating c_v coherence...
c_v coherence = 0.4367
Calculating u_mass coherence...
u_mass coherence = -5.7432
Coherence for 29 topics: {'c_v': 0.4367410029304702, 'u_mass': -5.743212848594207}
Elapsed: 21.8s
Memory usage after cleanup: 80.3%

✅ Saved coherence test results to: compositionality-data/topic_models/coherence_test_14-30-1.csv

Coherence scores by number of topics:
         c_v    u_mass
14  0.407765 -5.120635
15  0.429020 -4.896694
16  0.439521 -4.873031
17  0.437171 -4.759615
18  0.441942 -4.649760
19  0.462790 -4.580029
20  0.449769 -4.833799
21  0.448435 -4.717361
22  0.452967 -4.887781
23  0.443324 -4.938778
24  0.444148 -5.019543
25  0.472783 -4.838792
26  0.451647 -5.091224
27  0.446110 -5.418776
28  0.441825 -5.562648
29  0.436741 -5.743213


RuntimeError: Failed to process string with tex because latex could not be found

Error in callback <function _draw_all_if_interactive at 0x312663d90> (for post_execute), with arguments args (),kwargs {}:


RuntimeError: Failed to process string with tex because latex could not be found

RuntimeError: Failed to process string with tex because latex could not be found

<Figure size 1400x500 with 2 Axes>

## 12. Build Models for All Text Modes (Advanced)

This convenience function builds topic models for both "result" and "result-with-context" modes automatically, making it easy to compare how including context affects topic discovery.

In [None]:
# WARNING: This will build TWO complete models, which can take significant time!
# Uncomment to run

# # Base configuration (will be used for both text modes)
# base_config = ModelConfig(
#     embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
#     nr_topics=None,  # Automatic topic discovery
#     umap_n_neighbors=15,
#     umap_n_components=10,
#     hdbscan_min_cluster_size=50,
#     use_gpu=True,
#     output_dir=OUTPUT_DIR
# )
# 
# # Build models for both text modes
# all_results = build_all_text_mode_models(
#     data_path=INPUT_JSON,
#     config=base_config,
#     text_modes=["result", "result-with-context"],
#     output_dir=OUTPUT_DIR
# )
# 
# # Compare coherence scores
# print("\n" + "="*60)
# print("Coherence Comparison Across Text Modes")
# print("="*60)
# for text_mode, results in all_results.items():
#     print(f"\n{text_mode}:")
#     print(f"  Model suffix: {results['model_suffix']}")
#     print(f"  Number of documents: {len(results['docs'])}")
#     print(f"  Number of topics: {len(results['model'].get_topic_info())}")
#     print(f"  Coherence scores: {results['coherence']}")
# 
# # Access individual models
# result_model = all_results["result"]["model"]
# context_model = all_results["result-with-context"]["model"]

print("To build models for all text modes, uncomment the code above.")
print("This will create separate models for:")
print("  1. 'result' - just the matched query text")
print("  2. 'result-with-context' - text with surrounding context")


## Summary

This notebook demonstrated the complete workflow for topic modeling analysis:

1. ✅ **Setup**: Configured environment and paths
2. ✅ **Text Mode Selection**: Chose between "result" and "result-with-context"
3. ✅ **Data Loading**: Loaded data (automatically detected format and query level)
4. ✅ **Embeddings**: Generated or loaded document embeddings
5. ✅ **Model Configuration**: Set up BERTopic parameters with data characteristics
6. ✅ **Model Training**: Fitted topic model to documents
7. ✅ **Evaluation**: Calculated coherence metrics
8. ✅ **Visualizations**: Created interactive plots
9. ✅ **Export**: Saved results and distributions
10. ✅ **Model Persistence**: Saved model with standardized naming

## New Features in Version 2.1.0

- **Text Mode Selection**: Analyze query results with or without context
- **Query Level Detection**: Automatically detects sentence vs paragraph queries
- **Standardized Naming**: Models named by `{query_level}_{text_mode}_{topics}`
- **Batch Processing**: `build_all_text_mode_models()` creates models for all text modes
- **Apple Silicon GPU**: Full MPS support for M1/M2/M3/M4 chips

## Comparing Text Modes

To understand how including context affects topic discovery:

1. Run the notebook with `TEXT_MODE = "result"` (just the matched text)
2. Run again with `TEXT_MODE = "result-with-context"` (text + context)
3. Compare:
   - Number of topics discovered
   - Topic coherence scores
   - Topic word distributions
   - Document assignments

Or use `build_all_text_mode_models()` to build both automatically!

## Next Steps

- **Explore visualizations**: Open the HTML files in `OUTPUT_DIR` to interactively explore topics
- **Analyze distributions**: Use the exported CSV files for downstream analysis
- **Compare text modes**: Build models with both "result" and "result-with-context"
- **Tune parameters**: Adjust `ModelConfig` parameters and re-run to optimize results
- **Test topic numbers**: Use `test_topic_numbers()` to find optimal topic count
- **Apply to new data**: Load the saved model and apply to new documents

## Key Files Generated

All files use the standardized naming pattern with model suffix:

- `embeddings_{model_suffix}.pkl`: Document embeddings
- `topic_model_{model_suffix}.safetensors`: Trained BERTopic model
- `topic_info_{model_suffix}.csv`: Topic information (words, counts)
- `document_info_{model_suffix}.csv`: Document-level topic assignments
- `topic_distributions_{model_suffix}.csv`: Full topic distributions
- `coherence_scores.csv`: Model quality metrics
- `vis_*_{model_suffix}.html`: Interactive visualizations

Where `{model_suffix}` follows the pattern: `{query_level}_{text_mode}_{topics}`

Examples:
- `embeddings_sentence_result_auto.pkl`
- `topic_model_sentence_context_20topics.safetensors`
- `vis_documents_paragraph_result_auto.html`

## Documentation

For more details on the module API, see the docstrings in `topic_modeling_analysis.py`.
