# CPU-Bound Offline Topic Modeling with Meno

This notebook demonstrates how to use Meno in a CPU-bound offline environment with:
- Minimal dependency installation (`meno[minimal,embeddings]`)
- Local model usage (offline copy of all-MiniLM-L6-v2)
- Automatic topic number detection
- Topic labeling without LLM dependencies
- Complete HTML reports and visualizations

This approach is ideal for:
- Air-gapped environments with no internet access
- Systems with limited or no GPU capability
- Installations that must keep dependencies to a minimum
- Datasets with an unknown number of topics

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [None]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import time

# Add parent directory to path to import meno if needed
parent_dir = str(Path().resolve().parent)
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

# Import meno components
from meno import MenoWorkflow
from meno.modeling.embeddings import DocumentEmbedding
from meno.visualization.lightweight_viz import plot_topic_landscape, plot_topic_heatmap
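
In a fully air-gapped environment it can also help to tell the Hugging Face libraries themselves not to attempt network access. The sketch below sets the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables; whether Meno's own `offline_mode` option already does this is not confirmed here, so treat it as an optional extra safeguard. These variables are read when the libraries are imported, so this would need to run before the import cell above.

```python
import os

# Ask the Hugging Face libraries not to make network calls.
# Assumption: this runs before `huggingface_hub` / `transformers` are imported,
# because both read these variables at import time.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

print({k: os.environ[k] for k in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")})
```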

## 2. Configuration

Set up paths and configuration for offline, CPU-optimized usage.

In [None]:
# Set up paths and configuration
# Point to your downloaded model directory - update this path for your system
LOCAL_MODEL_PATH = os.path.expanduser("~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/24485cc25a8c8b310657ded9e17f6d18d1bdf0ae")

# You can also look for the model in the standard HuggingFace cache location.
# The block below is disabled by wrapping it in a string literal; remove the
# surrounding triple quotes to enable the automatic lookup.
"""
try:
    cache_home = os.path.expanduser("~/.cache/huggingface/hub")
    model_files_dir = os.path.join(cache_home, "models--sentence-transformers--all-MiniLM-L6-v2")
    if os.path.exists(model_files_dir):
        # Find snapshots directory
        snapshots_dir = os.path.join(model_files_dir, "snapshots")
        if os.path.exists(snapshots_dir):
            snapshot_dirs = [d for d in os.listdir(snapshots_dir) 
                            if os.path.isdir(os.path.join(snapshots_dir, d))]
            if snapshot_dirs:
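                # Snapshot names are commit hashes, so pick the newest by
                # modification time rather than by name
                latest_snapshot = max(
                    snapshot_dirs,
                    key=lambda d: os.path.getmtime(os.path.join(snapshots_dir, d)),
                )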
                LOCAL_MODEL_PATH = os.path.join(snapshots_dir, latest_snapshot)
                print(f"Found model in HuggingFace cache: {LOCAL_MODEL_PATH}")
            else:
                print("No snapshot directories found")
        else:
            print("Snapshots directory not found")
    else:
        print("Model directory not found in cache")
except Exception as e:
    print(f"Error finding local model: {e}")
"""

# Check if the specified path exists
if not os.path.exists(LOCAL_MODEL_PATH):
    print(f"WARNING: Model path {LOCAL_MODEL_PATH} does not exist!")
    print("Please update the LOCAL_MODEL_PATH to point to your downloaded model.")
else:
    print(f"Using model from: {LOCAL_MODEL_PATH}")

# Create output directory
OUTPUT_DIR = Path("./cpu_output")
OUTPUT_DIR.mkdir(exist_ok=True)
print(f"Output will be saved to: {OUTPUT_DIR.absolute()}")

# Configure for optimal CPU performance and offline use
CPU_CONFIG = {
    "preprocessing": {
        "normalization": {
            "lowercase": True,
            "remove_punctuation": True,
            "lemmatize": True,
            "language": "en",
        },
        "stopwords": {
            "use_default": True,
        },
    },
    "modeling": {
        "embeddings": {
            # CPU optimizations
            "device": "cpu",
            "use_gpu": False,
            "batch_size": 32,  # Adjust based on available RAM
            "quantize": True,  # Memory-efficient model loading
            
            # Offline settings
            "local_model_path": LOCAL_MODEL_PATH,
            "local_files_only": True,
        },
        "topic_detection": {
            "min_topic_size": 5,  # Adjust based on your data
            "auto_detect_topics": True,  # Enable automatic topic number detection
        },
    },
    "visualization": {
        "plots": {
            "width": 900,
            "height": 600,
        },
    },
}
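
Before pointing the configuration at a local model, it may be worth confirming that the directory looks like a complete model snapshot. The following is a minimal sketch using only the standard library. The helper names `find_snapshot` and `missing_model_files` are hypothetical, and the list of required files (`config.json`, `modules.json`) is an assumption about a typical sentence-transformers layout, so adjust it to match your downloaded copy.

```python
from pathlib import Path


def find_snapshot(cache_home, repo_dir_name):
    """Return the most recently modified snapshot directory, or None."""
    snapshots = Path(cache_home).expanduser() / repo_dir_name / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    # Snapshot names are commit hashes, so order by modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def missing_model_files(model_dir, required=("config.json", "modules.json")):
    """List the required files (assumed names) that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

A possible use is to call `find_snapshot("~/.cache/huggingface/hub", "models--sentence-transformers--all-MiniLM-L6-v2")` and stop early if `missing_model_files(...)` returns a non-empty list.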

## 3. Generate Sample Data

For this example, we'll generate synthetic data with clear topics.

In [None]:
# Sample data generation function
def generate_sample_data(n_samples=200):
    """Generate synthetic data for demonstration."""
    print(f"Generating {n_samples} sample documents...")
    
    # Create topic templates
    topics = {
        "Technology": [
            "artificial intelligence machine learning data algorithms computers",
            "software development programming code application web system",
            "cloud computing storage server infrastructure network security",
            "mobile devices apps smartphones tablets technology hardware"
        ],
        "Healthcare": [
            "medical health doctors patients hospital treatment therapy",
            "disease diagnosis symptoms medication prescription clinical",
            "healthcare insurance coverage benefits claims provider",
            "wellness prevention fitness nutrition exercise lifestyle"
        ],
        "Finance": [
            "investment market stocks bonds trading portfolio assets",
            "banking financial loans credit mortgage debt interest rates",
            "retirement savings pension fund planning wealth management",
            "insurance risk coverage policy premium claims benefits"
        ],
        "Education": [
            "learning teaching students school curriculum education classroom",
            "academic university college degree research scholarship campus",
            "training skills development career professional certification",
            "online courses e-learning digital education virtual classroom"
        ]
    }
    
    # Generate documents from topics
    documents = []
    doc_ids = []
    doc_topics = []
    
    topic_names = list(topics.keys())
    doc_id = 1
    
    for _ in range(n_samples):
        # Select a random topic
        topic = np.random.choice(topic_names)
        doc_topics.append(topic)
        
        # Select a random template
        template = np.random.choice(topics[topic])
        
        # Create variations by resampling the template's words
        words = template.split()
        
        # Vary the document length, with a floor of 5 words
        num_words = max(5, len(words) + np.random.randint(-3, 10))
            
        # Select random words with replacement to create variations.
        # Converting to a list avoids NumPy's fixed-width string dtype,
        # which could truncate a longer replacement word.
        selected_words = np.random.choice(words, size=num_words, replace=True).tolist()
        
        # Replace every fifth word, starting at index 2, with a transitional word
        transitions = ["and", "also", "including", "with", "for", "about", "regarding"]
        for i in range(2, len(selected_words), 5):
            selected_words[i] = np.random.choice(transitions)
        
        document = " ".join(selected_words)
        documents.append(document)
        doc_ids.append(f"doc_{doc_id}")
        doc_id += 1
    
    # Create DataFrame
    df = pd.DataFrame({
        "text": documents,
        "id": doc_ids,
        "actual_topic": doc_topics
    })
    
    print(f"Generated {len(df)} documents across {len(topic_names)} topics")
    return df

# Generate the sample data
df = generate_sample_data(n_samples=200)

# Display a few sample documents
print("\nSample documents:")
for topic in df["actual_topic"].unique():
    sample = df[df["actual_topic"] == topic].sample(1)["text"].values[0]
    print(f"\n{topic}: {sample}")

# Save the data for reference
df.to_csv(OUTPUT_DIR / "sample_data.csv", index=False)
print(f"\nSample data saved to {OUTPUT_DIR / 'sample_data.csv'}")
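
The generator above draws from NumPy's global random state without a seed, so each run produces different documents and therefore different topics and timings. One way to make runs comparable is to seed the generator, for example by calling `np.random.seed(42)` before `generate_sample_data`. The sketch below shows the same idea with the standard library's `random.Random`; the helper `sample_words` is hypothetical and only illustrates that a fixed seed gives identical samples.

```python
import random


def sample_words(words, n, seed):
    """Draw n words with replacement using a locally seeded generator."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return [rng.choice(words) for _ in range(n)]


words = "investment market stocks bonds trading portfolio assets".split()
first = sample_words(words, 8, seed=42)
second = sample_words(words, 8, seed=42)

print(first == second)  # True: the same seed reproduces the same sample
```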

## 4. Initialize the Workflow

Now we'll set up the MenoWorkflow with offline settings.

In [None]:
# Initialize the workflow with offline settings
print("Initializing MenoWorkflow with offline settings...")
start_time = time.time()

workflow = MenoWorkflow(
    config_overrides=CPU_CONFIG,
    local_model_path=LOCAL_MODEL_PATH,
    local_files_only=True,
    offline_mode=True  # Critical for bypassing internet checks
)

# Load the data
workflow.load_data(data=df, text_column="text", id_column="id")
print(f"Loaded {len(df)} documents into the workflow")

# Measure initialization time
init_time = time.time() - start_time
print(f"Initialization completed in {init_time:.2f} seconds")

## 5. Preprocessing

Generate preprocessing reports and process the documents.

In [None]:
# Start preprocessing timer
start_time = time.time()

print("Generating preprocessing reports...")

# Generate acronym report
try:
    workflow.generate_acronym_report(
        output_path=OUTPUT_DIR / "acronyms.html", 
        open_browser=False
    )
    print(f"Acronym report saved to {OUTPUT_DIR / 'acronyms.html'}")
except Exception as e:
    print(f"Could not generate acronym report: {e}")

# Generate misspelling report
try:
    workflow.generate_misspelling_report(
        output_path=OUTPUT_DIR / "misspellings.html", 
        open_browser=False
    )
    print(f"Misspelling report saved to {OUTPUT_DIR / 'misspellings.html'}")
except Exception as e:
    print(f"Could not generate misspelling report: {e}")

# Preprocess documents
print("\nPreprocessing documents...")
workflow.preprocess_documents()

# Get the preprocessed data
preprocessed_df = workflow.get_preprocessed_data()
print(f"Preprocessing completed for {len(preprocessed_df)} documents")

# Display sample of preprocessed text
print("\nSample of preprocessed text:")
sample_processed = preprocessed_df[["text", "processed_text"]].head(3)
display(sample_processed)

# Measure preprocessing time
preproc_time = time.time() - start_time
print(f"Preprocessing completed in {preproc_time:.2f} seconds")
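
To make the `lowercase`, `remove_punctuation`, and stopword settings in the configuration more concrete, here is a rough standard-library sketch of those three steps. It is not Meno's implementation: the `normalize` helper and the tiny stopword set are illustrative assumptions, and lemmatization is left out because it needs a language resource.

```python
import string

# Tiny illustrative stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"the", "and", "a", "of", "for", "with"}


def normalize(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stopwords)


print(normalize("The doctors, and the patients!"))  # doctors patients
```

Note that stripping punctuation this way also merges hyphenated terms, so `e-learning` in the sample data would become `elearning`.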

## 6. Topic Modeling with Automatic Topic Detection

Now we'll run topic modeling with automatic topic number detection.

In [None]:
# Start topic modeling timer
start_time = time.time()

print("Discovering topics with automatic topic detection...")
workflow.discover_topics(
    method="embedding_cluster",
    auto_detect_topics=True,
    modeling_approach="lightweight",  # Use efficient model like NMF
)

# Get topic information
topics_df = workflow.get_topic_assignments()
topic_info = workflow.modeler.get_topic_info()
print(f"\nDiscovered {len(topic_info)} topics automatically")

# Display topic distribution
print("\nTopic distribution:")
display(topic_info[['Topic', 'Size', 'Name']])

# Display top words per topic
print("\nTop words per topic:")
for _, row in topic_info.iterrows():
    topic_id = row["Topic"]
    topic_words = workflow.modeler.get_topic_words(topic_id, top_n=10)
    word_str = ", ".join([word for word, _ in topic_words[:10]])
    print(f"Topic {topic_id} ({row['Size']} docs): {word_str}")

# Compare with ground truth
if "actual_topic" in df.columns:
    # Get document assignment with original IDs
    doc_topics = topics_df.merge(df[["id", "actual_topic"]], on="id")
    
    # Show contingency table
    print("\nContingency table (Discovered vs. Actual):")
    contingency = pd.crosstab(doc_topics["topic"], doc_topics["actual_topic"])
    display(contingency)
    
    # Calculate adjusted mutual information
    from sklearn.metrics import adjusted_mutual_info_score
    ami = adjusted_mutual_info_score(doc_topics["topic"], doc_topics["actual_topic"])
    print(f"\nAdjusted Mutual Information: {ami:.4f}")

# Measure topic modeling time
topic_time = time.time() - start_time
print(f"\nTopic modeling completed in {topic_time:.2f} seconds")
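
Adjusted Mutual Information is chance-corrected but not always easy to interpret at a glance. A simpler, complementary check (a different metric from the one used above) is cluster purity: the fraction of documents that fall in their cluster's majority ground-truth class. The sketch below computes it with the standard library only; the `cluster_purity` helper and the toy labels are illustrative. Purity is not adjusted for chance and tends to rise as the number of clusters grows, so it is best read alongside AMI rather than instead of it.

```python
from collections import Counter, defaultdict


def cluster_purity(predicted, actual):
    """Fraction of documents that belong to their cluster's majority class."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(predicted, actual):
        by_cluster[cluster][label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(predicted)


# Toy example: 3 discovered clusters, 3 true topics
predicted = [0, 0, 0, 1, 1, 2, 2, 2]
actual = ["Tech", "Tech", "Finance", "Health", "Health", "Finance", "Finance", "Tech"]

print(cluster_purity(predicted, actual))  # 0.75
```

With the notebook's data, the inputs would be `doc_topics["topic"]` and `doc_topics["actual_topic"]`.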

## 7. Auto-Label Topics (No LLM Required)

Create meaningful topic labels from keywords without using an LLM.

In [None]:
# Auto-label topics using keywords
print("Generating topic labels from keywords...")
topic_labels = {}

for topic_id in topic_info["Topic"].unique():
    if topic_id == -1:
        topic_labels[topic_id] = "Outliers"
        continue
        
    # Get top words for this topic
    top_words = workflow.modeler.get_topic_words(topic_id, top_n=5)
    keywords = [word for word, _ in top_words]
    
    # Create a descriptive label from top words
    label = " & ".join(keywords[:2]) + " Topic"
    topic_labels[topic_id] = label

# Print auto-generated labels
print("\nAuto-generated topic labels:")
for topic_id, label in topic_labels.items():
    topic_size = topic_info[topic_info["Topic"] == topic_id]["Size"].values[0]
    print(f"Topic {topic_id} ({topic_size} docs): {label}")

# Save topic information with labels
topic_info["Label"] = topic_info["Topic"].map(topic_labels)
topic_info.to_csv(OUTPUT_DIR / "topic_summary.csv", index=False)
print(f"\nTopic summary saved to {OUTPUT_DIR / 'topic_summary.csv'}")

# Save document-topic assignments
topics_df["Topic_Label"] = topics_df["topic"].map(topic_labels)
topics_df.to_csv(OUTPUT_DIR / "document_topics.csv", index=False)
print(f"Document-topic assignments saved to {OUTPUT_DIR / 'document_topics.csv'}")
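
The labeling rule above can be factored into a small function, which makes it easier to test and to handle edge cases such as duplicate or missing keywords. This is a sketch rather than part of Meno: `make_label`, its `max_words` parameter, the title-casing, and the `"Unlabeled Topic"` fallback are all assumptions added for illustration.

```python
def make_label(keywords, max_words=2, outlier=False):
    """Build a short topic label from the first distinct keywords."""
    if outlier:
        return "Outliers"
    chosen = []
    for word in keywords:
        if word and word not in chosen:
            chosen.append(word)
        if len(chosen) == max_words:
            break
    if not chosen:
        return "Unlabeled Topic"  # hypothetical fallback for empty keyword lists
    return " & ".join(w.title() for w in chosen) + " Topic"


print(make_label(["market", "market", "stocks", "bonds"]))  # Market & Stocks Topic
```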

## 8. Generate Comprehensive Report

Create an interactive HTML report with all the topic modeling results.

In [None]:
# Start report generation timer
start_time = time.time()

print("Generating comprehensive report...")
report_path = workflow.generate_comprehensive_report(
    output_path=OUTPUT_DIR / "topic_report.html",
    open_browser=False,
    title="CPU-Optimized Topic Analysis Report"
)

print(f"Report generated at {report_path}")

# Measure report generation time
report_time = time.time() - start_time
print(f"Report generation completed in {report_time:.2f} seconds")

## 9. Additional Visualizations

Create additional CPU-efficient visualizations.

In [None]:
# Start visualization timer
start_time = time.time()

print("Creating additional visualizations...")

# Topic distribution
try:
    dist_fig = workflow.modeler.visualize_topic_distribution(return_figure=True)
    dist_fig.write_html(OUTPUT_DIR / "topic_distribution.html")
    print(f"Topic distribution saved to {OUTPUT_DIR / 'topic_distribution.html'}")
except Exception as e:
    print(f"Could not create topic distribution: {e}")

# Embedding visualization
try:
    embed_fig = workflow.modeler.visualize_embeddings(return_figure=True)
    embed_fig.write_html(OUTPUT_DIR / "topic_embeddings.html")
    print(f"Topic embeddings saved to {OUTPUT_DIR / 'topic_embeddings.html'}")
except Exception as e:
    print(f"Could not create embedding visualization: {e}")

# Get model and documents for specialized visualizations
model = workflow.modeler.topic_model
documents = workflow.get_preprocessed_data()["processed_text"].tolist()

# Create PCA-based visualization (CPU-efficient alternative to UMAP)
try:
    print("\nGenerating PCA-based topic landscape...")
    landscape = plot_topic_landscape(
        model=model,
        documents=documents,
        method="pca",  # PCA is more CPU-efficient than UMAP
        width=900,
        height=600
    )
    landscape.write_html(OUTPUT_DIR / "topic_landscape_pca.html")
    print(f"PCA topic landscape saved to {OUTPUT_DIR / 'topic_landscape_pca.html'}")
except Exception as e:
    print(f"Could not create topic landscape: {e}")

# Topic heatmap (if multiple models were trained)
try:
    print("\nGenerating topic heatmap...")
    heatmap = plot_topic_heatmap(
        model=model,
        documents=documents,
        width=900,
        height=600
    )
    heatmap.write_html(OUTPUT_DIR / "topic_heatmap.html")
    print(f"Topic heatmap saved to {OUTPUT_DIR / 'topic_heatmap.html'}")
except Exception as e:
    print(f"Could not create topic heatmap: {e}")

# Measure visualization time
viz_time = time.time() - start_time
print(f"\nVisualizations completed in {viz_time:.2f} seconds")

## 10. Performance Summary

Summarize the performance metrics of our CPU-optimized workflow.

In [None]:
# Create performance summary
print("Performance Summary")
print("===================\n")
print(f"Dataset size: {len(df)} documents")
print(f"Topics discovered: {len(topic_info)}")
print("\nProcessing times:")
print(f"- Initialization: {init_time:.2f} seconds")
print(f"- Preprocessing: {preproc_time:.2f} seconds")
print(f"- Topic modeling: {topic_time:.2f} seconds")
print(f"- Report generation: {report_time:.2f} seconds")
print(f"- Visualizations: {viz_time:.2f} seconds")
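core_time = init_time + preproc_time + topic_time
total_time = core_time + report_time + viz_time
print(f"- Total processing time: {total_time:.2f} seconds")
# Throughput counts only initialization, preprocessing, and topic modeling
print(f"- Documents per second (excluding reports and visualizations): {len(df) / core_time:.2f}")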

if "actual_topic" in df.columns:
    print(f"\nAdjusted Mutual Information: {ami:.4f}")
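
Each stage in this notebook repeats the same `start_time = time.time()` pattern. One possible alternative is a small context manager that records every stage in a dictionary, which keeps the summary in one place. The `timed` helper and `timings` dictionary below are hypothetical names, and `time.perf_counter` is used because it is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage, store=timings):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[stage] = time.perf_counter() - start


with timed("demo"):
    sum(range(100_000))  # stand-in for a workflow stage

print(f"demo: {timings['demo']:.4f} seconds")
```

In the notebook, a call such as `workflow.preprocess_documents()` could be wrapped in `with timed("preprocessing"):` and the summary printed by iterating over `timings.items()`.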

## 11. Summary and Conclusion

This notebook demonstrated a complete CPU-optimized workflow for topic modeling in an offline environment using Meno. We were able to:

1. Use local models without internet connectivity
2. Configure for optimal CPU performance
3. Automatically detect the optimal number of topics
4. Generate meaningful topic labels without LLMs
5. Create comprehensive reports and visualizations

All outputs have been saved to the `cpu_output` directory for further analysis.

### Key Benefits of this Approach

- Works completely offline
- Requires minimal dependencies (just `meno[minimal,embeddings]`)
- Efficiently uses CPU resources
- Detects topics without requiring a preset topic count
- Creates meaningful topic labels from document content
- Generates interactive visualizations optimized for performance