# Full Tutorial: Topic Modeling Workflow

This notebook demonstrates a complete topic modeling workflow from data preparation to result analysis.

## Workflow Overview

1. **Data Loading** - Load and explore raw text data
2. **Preprocessing** - Clean and tokenize documents
3. **Vocabulary** - Build vocabulary with filtering
4. **Model Selection** - Choose and configure model
5. **Training** - Train with monitoring
6. **Evaluation** - Compute metrics
7. **Visualization** - Explore results
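8. **Document Analysis** - Inspect topic assignments for individual documents
9. **Saving** - Export and reload the model
10. **Inference** - Apply the model to new documents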

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# NBTM imports
from nbtm.data import (
    Corpus,
    load_corpus,
    TextPreprocessor,
    Vocabulary,
    ENGLISH_STOPWORDS,
)
from nbtm.models import create_model, get_available_models
from nbtm.training import Trainer
from nbtm.evaluation import (
    compute_coherence,
    compute_topic_diversity,
    compute_perplexity,
)
from nbtm.visualization import (
    plot_topic_words,
    plot_topic_heatmap,
    plot_document_topics,
    plot_topic_distribution,
    plot_training_history,
    plot_topic_wordcloud,
    plot_all_topic_wordclouds,
)

## 1. Data Loading

For this tutorial, we'll create a small synthetic corpus: 24 one-sentence abstracts, eight each on machine learning, statistics, and NLP. In practice, you would load your own data.

In [None]:
# Synthetic document collection (simulating academic abstracts)
raw_documents = [
    # Machine Learning topic
    "Machine learning algorithms enable computers to learn patterns from data without explicit programming.",
    "Deep neural networks have revolutionized computer vision and natural language processing tasks.",
    "Supervised learning requires labeled training data to learn a mapping from inputs to outputs.",
    "Gradient descent optimization is fundamental to training neural network models.",
    "Convolutional neural networks excel at image recognition and computer vision applications.",
    "Recurrent neural networks process sequential data like text and time series.",
    "Transfer learning enables models to leverage knowledge from pre-trained networks.",
    "Regularization techniques prevent overfitting in machine learning models.",
    
    # Statistics topic
    "Bayesian inference provides a principled framework for updating beliefs given new evidence.",
    "Hypothesis testing allows researchers to make decisions based on statistical evidence.",
    "The central limit theorem states that sample means approach a normal distribution.",
    "Maximum likelihood estimation finds parameter values that maximize data probability.",
    "Confidence intervals quantify uncertainty in parameter estimates.",
    "Regression analysis models relationships between dependent and independent variables.",
    "The bootstrap method estimates sampling distributions through resampling.",
    "Markov chain Monte Carlo enables sampling from complex probability distributions.",
    
    # NLP topic
    "Natural language processing enables computers to understand human language.",
    "Word embeddings represent words as dense vectors capturing semantic meaning.",
    "Transformer models have achieved state-of-the-art results in NLP tasks.",
    "Named entity recognition identifies and classifies entities in text.",
    "Sentiment analysis determines the emotional tone of text documents.",
    "Machine translation converts text from one language to another.",
    "Text classification assigns categories to documents based on content.",
    "Language models predict the probability of word sequences.",
]

print(f"Loaded {len(raw_documents)} documents")
print(f"\nSample document:\n{raw_documents[0]}")
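For real data, the documents usually come from files rather than a hard-coded list. The sketch below uses only the standard library and assumes one document per line in a UTF-8 text file. The helper name `read_documents` and the file name are hypothetical; `load_corpus` (imported above) may offer a more direct route, but its signature is not shown in this notebook.

```python
from pathlib import Path
import tempfile


def read_documents(path):
    """Read one document per line, skipping blank lines (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demonstrate on a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "abstracts.txt"
    sample_path.write_text(
        "Bayesian inference updates beliefs.\n\nWord embeddings are dense vectors.\n",
        encoding="utf-8",
    )
    file_documents = read_documents(sample_path)

print(f"Loaded {len(file_documents)} documents")
```

The resulting list of strings can take the place of `raw_documents` in the rest of the workflow.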

## 2. Text Preprocessing

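Clean and tokenize the documents: lowercase the text, strip punctuation and numbers, remove English stopwords, and drop very short tokens. Each document becomes a list of tokens.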

In [None]:
# Create preprocessor
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    remove_stopwords=True,
    stopwords=ENGLISH_STOPWORDS,
    min_word_length=3,
)

# Preprocess documents
documents = [preprocessor.process(doc) for doc in raw_documents]

print(f"Original: {raw_documents[0]}")
print(f"\nProcessed: {documents[0]}")
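To make the preprocessing options concrete, here is a plain-Python approximation of the same steps. It is a sketch, not the `TextPreprocessor` implementation: the function name `simple_preprocess`, the tiny stopword set, and the choice to treat `min_word_length` as an inclusive lower bound are all assumptions.

```python
import string

# Tiny illustrative stopword list (assumption); ENGLISH_STOPWORDS is presumably much larger.
DEMO_STOPWORDS = {"the", "a", "to", "from", "and", "of", "in", "without"}


def simple_preprocess(text, stopwords=DEMO_STOPWORDS, min_word_length=3):
    """Lowercase, strip punctuation and digits, then filter tokens."""
    text = text.lower()
    # Replace punctuation and digits with spaces so "state-of-the-art" splits cleanly.
    drop = string.punctuation + string.digits
    text = text.translate(str.maketrans(drop, " " * len(drop)))
    # Keep tokens that are not stopwords and are long enough (inclusive bound assumed).
    return [
        tok for tok in text.split()
        if tok not in stopwords and len(tok) >= min_word_length
    ]


tokens = simple_preprocess(
    "Machine learning algorithms enable computers to learn patterns "
    "from data without explicit programming."
)
print(tokens)
```

Replacing punctuation with spaces, rather than deleting it, avoids gluing hyphenated words together.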

In [None]:
# Explore document statistics
doc_lengths = [len(doc) for doc in documents]

print("Document statistics:")
print(f"  Total documents: {len(documents)}")
print(f"  Min length: {min(doc_lengths)} words")
print(f"  Max length: {max(doc_lengths)} words")
print(f"  Mean length: {np.mean(doc_lengths):.1f} words")

## 3. Build Vocabulary

Build a vocabulary filtered by document frequency: keep only words that appear in at least two documents and in no more than 80% of them.

In [None]:
# Build vocabulary
vocab = Vocabulary()
vocab.build_from_documents(
    documents,
    min_df=2,  # Drop words that appear in fewer than 2 documents
    max_df_ratio=0.8,  # Drop words that appear in more than 80% of documents
)

print(f"Vocabulary size: {len(vocab)}")
print("\nTop 20 words by frequency:")
for word, count in vocab.get_top_words(20):
    print(f"  {word}: {count}")
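The two thresholds are easier to reason about with a small worked example. The following sketch reimplements document-frequency filtering with `collections.Counter`; the function `filter_by_document_frequency` is hypothetical, and whether `Vocabulary` treats its bounds as inclusive in the same way is an assumption.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Keep words found in at least min_df documents and at most max_df_ratio of them."""
    # Count each word once per document (document frequency, not term frequency).
    df = Counter(word for doc in docs for word in set(doc))
    max_df = max_df_ratio * len(docs)
    # Inclusive bounds on both sides (assumption).
    return {word: n for word, n in df.items() if min_df <= n <= max_df}


toy_docs = [
    ["neural", "network", "model"],
    ["neural", "model", "training"],
    ["bayesian", "model", "inference"],
    ["bayesian", "model", "sampling", "sampling"],
    ["neural", "bayesian", "model"],
]
kept = filter_by_document_frequency(toy_docs)
print(sorted(kept.items()))
```

Here "model" is dropped because it occurs in every document, and "sampling" is dropped because both of its occurrences are in a single document.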

## 4. Model Configuration

Choose and configure the topic model.

In [None]:
# List available models
print("Available models:")
for name, desc in get_available_models().items():
    print(f"  {name}: {desc}")

In [None]:
# Create model
model = create_model(
    "lda_gibbs",
    num_topics=3,
    alpha=0.1,      # Document-topic prior (lower = sparser)
    beta=0.01,      # Topic-word prior (lower = sparser)
    random_state=42,
)

print("Model configuration:")
print(f"  Type: {model.__class__.__name__}")
print(f"  Topics: {model.num_topics}")
print(f"  Alpha: {model.alpha}")
print(f"  Beta: {model.beta}")
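The "lower = sparser" comments on `alpha` and `beta` can be checked directly by sampling from a symmetric Dirichlet prior with NumPy. This sketch is independent of NBTM; the sample size and the variable name `mean_largest` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
num_topics = 3
mean_largest = {}

for alpha in (0.1, 1.0, 10.0):
    # Each row is one document's topic proportions drawn from the prior.
    samples = rng.dirichlet([alpha] * num_topics, size=2000)
    # Average share of the single largest topic: close to 1 means sparse mixtures.
    mean_largest[alpha] = samples.max(axis=1).mean()
    print(f"alpha={alpha:>4}: mean largest topic share = {mean_largest[alpha]:.2f}")
```

With a small `alpha`, most sampled documents are dominated by one topic; with a large `alpha`, the proportions approach an even split. The same reasoning applies to `beta` and the distribution over words within each topic.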

## 5. Training

Fit the model for 500 iterations, then plot the training history to check convergence.

In [None]:
# Train model
print("Training model...")
model.fit(
    documents,
    num_iterations=500,
)

print("\nTraining complete!")
print(f"Final log-likelihood: {model.log_likelihood():.2f}")
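For readers curious about what a model named `lda_gibbs` might be doing during `fit`, here is a minimal collapsed Gibbs sampler for LDA. It is a teaching sketch under stated assumptions, not NBTM's implementation: documents are lists of integer word ids, and the function name `gibbs_lda` and all of its parameters are hypothetical.

```python
import numpy as np


def gibbs_lda(docs, vocab_size, num_topics, alpha=0.1, beta=0.01,
              num_iterations=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA on lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))    # tokens in doc d assigned to topic k
    topic_word = np.zeros((num_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(num_topics)               # tokens assigned to topic k

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(num_topics, size=len(doc))
        assignments.append(z_d)
        for w, z in zip(doc, z_d):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Full conditional p(z = k | all other assignments), up to a constant.
                p = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                     / (topic_total + vocab_size * beta))
                z = rng.choice(num_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1

    # Smoothed, normalized counts give the document-topic and topic-word estimates.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi


# Toy corpus: words 0-2 co-occur in the first two documents, words 3-5 in the last two.
toy_docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = gibbs_lda(toy_docs, vocab_size=6, num_topics=2)
print(np.round(theta, 2))
```

Each sweep resamples the topic of every token given all other assignments, which is why training cost grows with both the number of iterations and the total number of tokens.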

In [None]:
# Plot training history
if model.training_history:
    fig = plot_training_history(model)
    plt.show()
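A raw log-likelihood is hard to compare across corpora of different sizes, so it is often converted to per-token perplexity. The sketch below shows the conversion, assuming the log-likelihood is a total over all tokens in natural-log units; this notebook does not confirm that `model.log_likelihood()` follows that convention. The `compute_perplexity` function imported earlier is presumably the library's own route, and its signature is not shown here.

```python
import math


def perplexity(log_likelihood, num_tokens):
    """Per-token perplexity: exp of the negative average log-likelihood (natural log)."""
    return math.exp(-log_likelihood / num_tokens)


# Sanity check: a model that assigns every token probability 1/50
# has perplexity 50, regardless of corpus size.
num_tokens = 200
uniform_ll = num_tokens * math.log(1 / 50)
print(perplexity(uniform_ll, num_tokens))
```

Lower perplexity means the model finds the tokens less surprising.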

## 6. Evaluation

Compute topic coherence (UMass) and topic diversity for the trained model.

In [None]:
# Compute metrics
coherence_umass = compute_coherence(model, documents, measure="umass")
diversity = compute_topic_diversity(model, top_n=10)

print("Evaluation Metrics:")
print(f"  Topic Coherence (UMass): {coherence_umass:.4f}")
print(f"  Topic Diversity: {diversity:.4f}")
print(f"  Number of Topics: {model.num_topics}")

In [None]:
# Interpretation guide
print("\nMetric Interpretation:")
print("  Coherence: Higher is better (less negative for UMass)")
print("  Diversity: Higher is better (0-1, measures topic uniqueness)")
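To make the two metrics concrete, the following sketch computes them by hand from lists of top words. It follows the usual textbook definitions (UMass coherence from document co-occurrence counts, diversity as the fraction of unique top words); whether `compute_coherence` and `compute_topic_diversity` match these exactly is an assumption, since libraries differ on details such as summing versus averaging over word pairs. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs, eps=1.0):
    """UMass coherence for one topic; top_words is ordered most to least probable."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    # For each pair, condition on the higher-ranked word (earlier in the list).
    # Assumes every top word occurs in at least one document.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + eps) / doc_freq(earlier))
    return score


def topic_diversity(topics_top_words):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)


toy_docs = [
    ["neural", "network", "training"],
    ["neural", "network", "vision"],
    ["bayesian", "inference", "sampling"],
    ["bayesian", "sampling", "neural"],
]
topics = [["neural", "network"], ["bayesian", "sampling"], ["neural", "sampling"]]

for words in topics:
    print(words, round(umass_coherence(words, toy_docs), 3))
print("diversity:", round(topic_diversity(topics), 3))
```

Word pairs that rarely share a document pull the coherence down, while topics that reuse each other's top words pull the diversity down.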

## 7. Visualization

Explore the learned topics visually.

In [None]:
# Print topics
print("Learned Topics:")
print("=" * 60)
model.print_topics(top_n=8)

In [None]:
# Topic-word bar chart
fig = plot_topic_words(model, top_n=8)
plt.tight_layout()
plt.show()

In [None]:
# Topic-word heatmap
fig = plot_topic_heatmap(model, top_n=10)
plt.show()

In [None]:
# Word clouds for each topic
fig = plot_all_topic_wordclouds(model, ncols=3)
plt.show()

In [None]:
# Document-topic distribution
fig = plot_document_topics(model, doc_indices=list(range(10)))
plt.show()

In [None]:
# Topic distribution across corpus
fig = plot_topic_distribution(model)
plt.show()

## 8. Analyze Specific Documents

Examine topic assignments for individual documents.

In [None]:
# Get document-topic matrix
doc_topics = model.get_document_topics()

# Analyze a few documents
for i in [0, 8, 16]:  # First document of each source group (ML, statistics, NLP)
    print(f"\nDocument {i}:")
    print(f"  Text: {raw_documents[i][:80]}...")
    print(f"  Topic distribution: {doc_topics[i]}")
    dominant_topic = np.argmax(doc_topics[i])
    print(f"  Dominant topic: {dominant_topic} ({doc_topics[i][dominant_topic]:.2%})")
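The per-document loop scales poorly to large corpora, where a vectorized summary is more practical. The sketch below assumes the document-topic matrix is a 2-D array of shape (documents, topics) with rows summing to 1, which matches how `doc_topics` is indexed in this notebook; the matrix values here are made up for illustration.

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents x 3 topics, rows sum to 1.
doc_topics_demo = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
    [0.45, 0.40, 0.15],
])

dominant = doc_topics_demo.argmax(axis=1)   # dominant topic per document
confidence = doc_topics_demo.max(axis=1)    # probability of that topic
counts = np.bincount(dominant, minlength=doc_topics_demo.shape[1])

print("Dominant topics:", dominant)
print("Documents per topic:", counts)
# Flag documents whose dominant topic holds less than half the probability mass.
mixed = np.flatnonzero(confidence < 0.5)
print("Mixed documents:", mixed)
```

Documents flagged as mixed are often the most informative ones to read by hand, since they sit between topics.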

## 9. Save and Load Model

Export the trained model for later use.

In [None]:
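# Save model
import os

model_path = "outputs/full_tutorial_model.pkl"
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # Ensure the output directory exists
model.save(model_path)
print(f"Model saved to: {model_path}")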

In [None]:
# Load model
from nbtm.models import GibbsLDA

loaded_model = GibbsLDA.load(model_path)
print(f"Loaded model: {loaded_model}")
print("\nVerify topics match:")
loaded_model.print_topics(top_n=5)
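Comparing printed topics by eye works for a tutorial, but a numerical check is more reliable. The sketch below shows the idea with the standard `pickle` module and a stand-in dictionary; the `.pkl` extension suggests pickle, but the on-disk format used by `model.save()` and the attribute names of the real model are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for a trained model's state (hypothetical structure).
state = {"num_topics": 3, "topic_word": np.random.default_rng(0).random((3, 5))}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "outputs" / "model.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)  # create the output directory first
    with path.open("wb") as f:
        pickle.dump(state, f)
    with path.open("rb") as f:
        restored = pickle.load(f)

# Compare the parameters numerically rather than by eye.
assert restored["num_topics"] == state["num_topics"]
assert np.array_equal(restored["topic_word"], state["topic_word"])
print("Round trip preserved the parameters")
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.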

## 10. Inference on New Documents

Apply the trained model to new documents.

In [None]:
# New documents
new_docs_raw = [
    "Neural networks learn hierarchical representations of data.",
    "Statistical hypothesis testing requires careful consideration of p-values.",
]

# Preprocess
new_docs = [preprocessor.process(doc) for doc in new_docs_raw]

# Infer topics
new_doc_topics = model.transform(new_docs)

for i, (doc, topics) in enumerate(zip(new_docs_raw, new_doc_topics)):
    print(f"\nNew Document {i+1}:")
    print(f"  Text: {doc}")
    print(f"  Topics: {topics}")
    print(f"  Dominant: Topic {np.argmax(topics)}")
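One simple way to infer topics for an unseen document is to hold the learned topic-word distributions fixed and estimate only the document's topic proportions. The sketch below does this with a fixed-point iteration that alternates between per-token topic responsibilities and a smoothed proportion update. It is one possible approach, not necessarily what `model.transform()` does; the function `fold_in` and the matrix `phi` are hypothetical, and out-of-vocabulary words are assumed to have been dropped beforehand.

```python
import numpy as np


def fold_in(doc_word_ids, phi, alpha=0.1, num_iterations=50):
    """Estimate topic proportions for one new document with the topics phi held fixed."""
    num_topics = phi.shape[0]
    theta = np.full(num_topics, 1.0 / num_topics)
    if len(doc_word_ids) == 0:
        return theta  # no evidence: fall back to a uniform mixture
    word_probs = phi[:, doc_word_ids]  # shape: (topics, tokens)
    for _ in range(num_iterations):
        # Responsibility of each topic for each token.
        resp = theta[:, None] * word_probs
        resp /= resp.sum(axis=0, keepdims=True)
        # Smoothed update of the topic proportions.
        theta = resp.sum(axis=1) + alpha
        theta /= theta.sum()
    return theta


# Hypothetical topic-word matrix: topic 0 favors words 0-1, topic 1 favors words 2-3.
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])
theta_new = fold_in([0, 1, 0, 2], phi)
print(np.round(theta_new, 3), "dominant:", theta_new.argmax())
```

Because the topics themselves are never updated, inference on new documents is much cheaper than retraining and leaves the saved model unchanged.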

## Summary

This tutorial covered the complete topic modeling workflow:

1. **Data Loading** - Load raw text documents
2. **Preprocessing** - Clean and tokenize with `TextPreprocessor`
3. **Vocabulary** - Build filtered vocabulary with `Vocabulary`
4. **Model** - Create model with `create_model()`
5. **Training** - Train with `model.fit()`
6. **Evaluation** - Compute coherence and diversity
7. **Visualization** - Plot topics with various visualizations
8. **Document Analysis** - Inspect per-document topics with `model.get_document_topics()`
9. **Saving** - Export model with `model.save()`
10. **Inference** - Apply to new documents with `model.transform()`

### Next Steps

- Experiment with different numbers of topics
- Try different models (for example, HDP, which infers the number of topics from the data)
- Fine-tune hyperparameters (alpha, beta)
- Use CLI for batch experiments: `nbtm train --config config.yaml`