# Quality-First CPU Topic Modeling with Meno

This notebook demonstrates how to prioritize quality over speed when running Meno on CPU-only systems. This approach is ideal when:

- Processing time is not a critical concern
- You want the highest quality results possible
- You don't have GPU acceleration available
- You need superior topic separation and visualization

We'll use:
- The `all-MiniLM-L6-v2` embedding model at full float32 precision, without quantization
- UMAP dimensionality reduction (slower than PCA, but better at preserving non-linear structure)
- BERTopic with HDBSCAN clustering, which infers the number of topics from the data
- Detailed visualizations optimized for quality

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.
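
Because the imports in this section fail one at a time, it can help to check up front which packages are available. The following is a minimal sketch using only the standard library; the package list is an assumption inferred from this notebook's imports and the UMAP/HDBSCAN/BERTopic stack named above, and `find_missing_packages` is a hypothetical helper, not part of Meno.

```python
import importlib.util

# Assumed package list, inferred from this notebook (import names, not pip names).
REQUIRED_PACKAGES = ["pandas", "numpy", "sklearn", "plotly", "umap", "hdbscan", "bertopic", "meno"]


def find_missing_packages(packages):
    """Return the names in `packages` that cannot be imported in this environment."""
    missing = []
    for name in packages:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

The check only locates modules without importing them, so it stays fast even for heavy packages.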

In [None]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path to import meno if needed
parent_dir = str(Path().resolve().parent)
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

# Import meno components
from meno import MenoWorkflow, MenoTopicModeler
from meno.modeling.embeddings import DocumentEmbedding
from meno.modeling.bertopic_model import BERTopicModel
from meno.visualization.bertopic_viz import create_bertopic_hierarchy

## 2. Configuration

Set up paths and configuration optimized for quality results.
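
The model path used below follows the Hugging Face hub cache layout (`models--<org>--<name>/snapshots/<commit hash>`). As an alternative to hard-coding the hash, the sketch below looks up the newest snapshot directory. It assumes the default cache location under `~/.cache/huggingface/hub` (a custom cache location would need a different `cache_root`), and `find_cached_snapshot` is a hypothetical helper name.

```python
from pathlib import Path


def find_cached_snapshot(repo_id, cache_root=None):
    """Return the most recently modified snapshot directory for `repo_id`, or None."""
    if cache_root is None:
        # Assumed default cache location
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    # The hub cache stores "org/name" as "models--org--name"
    repo_dir = Path(cache_root) / ("models--" + repo_id.replace("/", "--"))
    snapshots_dir = repo_dir / "snapshots"
    if not snapshots_dir.is_dir():
        return None
    snapshots = [p for p in snapshots_dir.iterdir() if p.is_dir()]
    if not snapshots:
        return None
    # Snapshot names are commit hashes, so choose by modification time, not by name
    return max(snapshots, key=lambda p: p.stat().st_mtime)


snapshot = find_cached_snapshot("sentence-transformers/all-MiniLM-L6-v2")
print(snapshot if snapshot else "Model not found in the local cache")
```

The returned path can be converted with `str(snapshot)` and assigned to `LOCAL_MODEL_PATH`.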

In [None]:
# Set up paths and configuration
# Point to your downloaded model directory - update this path for your system
LOCAL_MODEL_PATH = os.path.expanduser("~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/24485cc25a8c8b310657ded9e17f6d18d1bdf0ae")

# You can also check if the model exists in the standard HuggingFace cache location
# Uncomment this code to automatically find the model in the cache
"""
try:
    cache_home = os.path.expanduser("~/.cache/huggingface/hub")
    model_files_dir = os.path.join(cache_home, "models--sentence-transformers--all-MiniLM-L6-v2")
    if os.path.exists(model_files_dir):
        # Find snapshots directory
        snapshots_dir = os.path.join(model_files_dir, "snapshots")
        if os.path.exists(snapshots_dir):
            snapshot_dirs = [d for d in os.listdir(snapshots_dir) 
                            if os.path.isdir(os.path.join(snapshots_dir, d))]
            if snapshot_dirs:
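                # Snapshot names are commit hashes, so pick the newest by modification time
                latest_snapshot = max(
                    snapshot_dirs,
                    key=lambda d: os.path.getmtime(os.path.join(snapshots_dir, d)),
                )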
                LOCAL_MODEL_PATH = os.path.join(snapshots_dir, latest_snapshot)
                print(f"Found model in HuggingFace cache: {LOCAL_MODEL_PATH}")
            else:
                print("No snapshot directories found")
        else:
            print("Snapshots directory not found")
    else:
        print("Model directory not found in cache")
except Exception as e:
    print(f"Error finding local model: {e}")
"""

# Check if the specified path exists
if not os.path.exists(LOCAL_MODEL_PATH):
    print(f"WARNING: Model path {LOCAL_MODEL_PATH} does not exist!")
    print("Please update the LOCAL_MODEL_PATH to point to your downloaded model.")
else:
    print(f"Using model from: {LOCAL_MODEL_PATH}")

# Create output directory
OUTPUT_DIR = Path("./quality_output")
OUTPUT_DIR.mkdir(exist_ok=True)
print(f"Output will be saved to: {OUTPUT_DIR.absolute()}")

# Configure for quality-first CPU usage
QUALITY_CONFIG = {
    "preprocessing": {
        "normalization": {
            "lowercase": True,
            "remove_punctuation": True,
            "remove_stopwords": True,
            "lemmatize": True,
            "language": "en",
        },
    },
    "modeling": {
        "embeddings": {
            # Full-precision (unquantized) all-MiniLM-L6-v2 sentence-embedding model
            "model_name": "all-MiniLM-L6-v2",
            "local_model_path": LOCAL_MODEL_PATH,
            "local_files_only": True,
            
            # CPU settings (but not optimized for speed)
            "device": "cpu",
            "use_gpu": False,
            
            # Quality-focused settings
            "precision": "float32",  # Full precision for best quality
            "quantize": False,        # No quantization for best quality
            "batch_size": 16,         # Smaller batch size for better memory management
        },
        # High-quality HDBSCAN clustering settings
        "clustering": {
            "min_cluster_size": 5,
            "min_samples": 5,
            "prediction_data": True,
        },
    },
    "visualization": {
        # High-quality UMAP settings
        "umap": {
            "n_neighbors": 15,  # UMAP's default; raise it to emphasize global structure
            "n_components": 3,   # 3D visualization
            "min_dist": 0.1,
            "metric": "cosine",
            "low_memory": False,  # Quality over memory efficiency
        },
        "plots": {
            "width": 1000,       # Larger plots for detail
            "height": 800,
            "template": "plotly_white",
        },
    },
}

## 3. Generate Sample Data

For this example, we'll generate synthetic data with subtle topic overlaps to demonstrate quality-focused modeling.
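
The generator in this section draws from NumPy's global random state without a seed, so each run yields different documents. The sketch below shows the idea of seeded, reproducible template sampling using only the standard library; `sample_documents` and the two toy templates are hypothetical. In the notebook itself, calling `np.random.seed(...)` before generating the data has a comparable effect.

```python
import random


def sample_documents(templates, n_samples, seed=42):
    """Draw `n_samples` shuffled-word documents from `templates` reproducibly."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    documents = []
    for _ in range(n_samples):
        topic = rng.choice(sorted(templates))
        words = rng.choice(templates[topic]).split()
        rng.shuffle(words)
        documents.append((topic, " ".join(words)))
    return documents


# Toy templates, for illustration only
templates = {
    "AI Technology": ["neural networks deep learning algorithms"],
    "Sustainable Energy": ["renewable energy solar wind power"],
}
first = sample_documents(templates, n_samples=5, seed=42)
second = sample_documents(templates, n_samples=5, seed=42)
print(first == second)  # same seed, same documents
```

A fixed seed makes quality metrics comparable across runs, which matters when tuning clustering settings.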

In [None]:
# Sample data generation function with more nuanced topics
def generate_quality_sample_data(n_samples=300):
    """Generate synthetic data for demonstration with subtle topic overlaps."""
    print(f"Generating {n_samples} sample documents with nuanced topics...")
    
    # Create topic templates with some overlapping terms
    topics = {
        "AI Technology": [
            "artificial intelligence neural networks deep learning algorithms training data",
            "machine learning models prediction classification regression computer vision",
            "natural language processing transformers bert gpt text generation tokens",
            "reinforcement learning agents environments rewards optimization policy"
        ],
        "Data Science": [
            "data analysis statistics regression visualization insights correlation",
            "big data processing pipelines hadoop spark streaming computation",
            "predictive modeling machine learning algorithms classification accuracy",
            "data science projects python pandas numpy visualization matplotlib"
        ],
        "Healthcare Analytics": [
            "medical data analysis patient outcomes treatment effectiveness metrics",
            "healthcare analytics prediction hospital readmission prevention care",
            "clinical decision support systems algorithms evidence patient data",
            "medical imaging analysis deep learning detection diagnosis pathology"
        ],
        "Financial Technology": [
            "fintech innovation banking technology digital payments blockchain",
            "algorithmic trading market prediction financial models risk analysis",
            "cryptocurrency blockchain transactions distributed ledger smart contracts",
            "financial data analysis machine learning fraud detection patterns"
        ],
        "Sustainable Energy": [
            "renewable energy solar wind hydroelectric power generation efficiency",
            "smart grid optimization data analysis consumption forecasting models",
            "energy storage technology batteries capacity efficiency innovation",
            "carbon emissions reduction monitoring data analysis climate impact"
        ]
    }
    
    # Create some cross-topic terms to make distinctions more subtle
    cross_topic_terms = {
        ("AI Technology", "Data Science"): 
            ["algorithms", "machine learning", "prediction", "models", "classification"],
        ("AI Technology", "Healthcare Analytics"): 
            ["medical imaging", "diagnosis", "prediction", "deep learning"],
        ("Data Science", "Financial Technology"): 
            ["data analysis", "prediction", "models", "algorithms"],
        ("AI Technology", "Sustainable Energy"): 
            ["optimization", "prediction", "models", "forecasting"],
        ("Data Science", "Healthcare Analytics"): 
            ["data analysis", "prediction", "patient data", "outcomes"]
    }
    
    # Generate documents from topics
    documents = []
    doc_ids = []
    doc_topics = []
    doc_subtopics = []
    
    topic_names = list(topics.keys())
    doc_id = 1
    
    for _ in range(n_samples):
        # Select a random primary topic
        primary_topic = np.random.choice(topic_names)
        doc_topics.append(primary_topic)
        
        # Select a random template
        template = np.random.choice(topics[primary_topic])
        words = template.split()
        
        # With some probability, add influence from another topic
        if np.random.random() < 0.3:  # 30% chance of topic overlap
            # Select a secondary topic that has cross-topic terms with the primary
            candidates = [
                t for t in topic_names
                if t != primary_topic
                and ((primary_topic, t) in cross_topic_terms or (t, primary_topic) in cross_topic_terms)
            ]
            if candidates:
                secondary_topic = np.random.choice(candidates)
                doc_subtopics.append(secondary_topic)
                
                # Get cross-topic terms
                if (primary_topic, secondary_topic) in cross_topic_terms:
                    terms = cross_topic_terms[(primary_topic, secondary_topic)]
                else:
                    terms = cross_topic_terms[(secondary_topic, primary_topic)]
                
                # Add some cross-topic terms
                for term in np.random.choice(terms, size=min(3, len(terms)), replace=False):
                    words.append(term)
            else:
                doc_subtopics.append("None")
        else:
            doc_subtopics.append("None")
        
        # Create variations by adding noise and varying length
        num_words = len(words) + np.random.randint(-3, 10)
        if num_words < 5:
            num_words = 5
            
        # Select random words with replacement and shuffle for more realistic text
        selected_words = list(np.random.choice(words, size=num_words, replace=True))
        np.random.shuffle(selected_words)
        
        # Add some random transitional words for more natural text
        transitions = ["and", "also", "including", "with", "for", "about", "regarding", 
                      "related to", "concerning", "in terms of", "specifically"]
        for i in range(2, len(selected_words), 5):
            selected_words[i] = np.random.choice(transitions)
        
        document = " ".join(selected_words)
        documents.append(document)
        doc_ids.append(f"doc_{doc_id}")
        doc_id += 1
    
    # Create DataFrame
    df = pd.DataFrame({
        "text": documents,
        "id": doc_ids,
        "primary_topic": doc_topics,
        "secondary_topic": doc_subtopics
    })
    
    print(f"Generated {len(df)} documents across {len(topic_names)} primary topics")
    return df

# Generate the sample data
df = generate_quality_sample_data(n_samples=300)

# Display a few sample documents
print("\nSample documents:")
for topic in df["primary_topic"].unique():
    sample = df[df["primary_topic"] == topic].sample(1)
    secondary = sample["secondary_topic"].values[0]
    secondary_info = f" (with {secondary} influence)" if secondary != "None" else ""
    print(f"\n{topic}{secondary_info}: {sample['text'].values[0]}")

# Show topic distribution
print("\nPrimary topic distribution:")
display(df["primary_topic"].value_counts())

# Show secondary topic influence
print("\nSecondary topic influence:")
display(df["secondary_topic"].value_counts())

# Save the data for reference
df.to_csv(OUTPUT_DIR / "quality_sample_data.csv", index=False)
print(f"\nSample data saved to {OUTPUT_DIR / 'quality_sample_data.csv'}")

## 4. Initialize the Workflow

Now we'll set up the MenoWorkflow with quality-first settings.
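
How Meno applies `config_overrides` internally is not shown in this notebook. A common pattern for nested settings like `QUALITY_CONFIG` is a recursive merge over defaults, sketched below. This illustrates that pattern only and is not Meno's implementation; `deep_merge` and the default values are hypothetical.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict where nested keys in `overrides` replace those in `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the default section survive
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, for illustration only
defaults = {"modeling": {"embeddings": {"device": "auto", "batch_size": 64, "quantize": True}}}
overrides = {"modeling": {"embeddings": {"device": "cpu", "quantize": False}}}

merged = deep_merge(defaults, overrides)
print(merged["modeling"]["embeddings"])
```

Under this pattern, keys omitted from the override (here `batch_size`) keep their default values, so an override dictionary only needs to list what differs.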

In [None]:
# Initialize the workflow with quality-first settings
print("Initializing MenoWorkflow with quality-first settings...")
start_time = time.time()

workflow = MenoWorkflow(
    config_overrides=QUALITY_CONFIG,
    local_model_path=LOCAL_MODEL_PATH,
    local_files_only=True,
    offline_mode=True
)

# Load the data
workflow.load_data(data=df, text_column="text", id_column="id")
print(f"Loaded {len(df)} documents into the workflow")

# Measure initialization time
init_time = time.time() - start_time
print(f"Initialization completed in {init_time:.2f} seconds")

## 5. Preprocessing

Clean the documents thoroughly before modeling: lowercasing, punctuation and number removal, stopword filtering, and lemmatization.
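
To make the effect of these steps concrete, here is a minimal standard-library sketch of lowercasing, punctuation and digit removal, and stopword filtering. It is not Meno's preprocessing code: the stopword list is deliberately tiny, lemmatization is omitted because it needs an NLP library, and `normalize` is a hypothetical name.

```python
import re
import string

# A deliberately tiny stopword list for illustration; real pipelines use a full one
STOPWORDS = {"a", "an", "and", "the", "of", "for", "with", "in", "to", "is"}

# Map every punctuation character to a space so "fraud-detection" splits into two words
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lowercase, strip punctuation and digits, and drop stopwords."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(normalize("The 3 Models, trained in 2023, improved Fraud-Detection accuracy!"))
# prints: models trained improved fraud detection accuracy
```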

In [None]:
# Start preprocessing timer
start_time = time.time()

print("Preprocessing documents with extensive cleaning...")
workflow.preprocess_documents(
    lowercase=True,
    remove_punctuation=True,
    remove_stopwords=True,
    lemmatize=True,
    remove_numbers=True
)

# Get the preprocessed data
preprocessed_df = workflow.get_preprocessed_data()
print(f"Preprocessing completed for {len(preprocessed_df)} documents")

# Display sample of preprocessed text
print("\nSample of preprocessed text:")
sample_processed = preprocessed_df[["text", "processed_text"]].head(3)
display(sample_processed)

# Measure preprocessing time
preproc_time = time.time() - start_time
print(f"Preprocessing completed in {preproc_time:.2f} seconds")

## 6. High-Quality Topic Modeling with BERTopic

Run high-quality topic modeling using BERTopic with UMAP and HDBSCAN.

In [None]:
# Start topic modeling timer
start_time = time.time()

print("Discovering topics with high-quality settings (UMAP + HDBSCAN)...")
workflow.discover_topics(
    method="embedding_cluster",  # Use BERTopic for high-quality results
    clustering_method="hdbscan",
    min_topic_size=5,
    num_topics="auto"  # Let HDBSCAN determine optimal topic count
)

# Get topic information
topics_df = workflow.get_topic_assignments()
topic_info = workflow.modeler.get_topic_info()
n_topics = int((topic_info["Topic"] != -1).sum())  # exclude the -1 outlier topic, if present
print(f"\nDiscovered {n_topics} meaningful topics automatically")

# Display topic distribution
print("\nTopic distribution:")
display(topic_info[['Topic', 'Size', 'Name']])

# Display top words per topic
print("\nTop words per topic:")
for _, row in topic_info.iterrows():
    topic_id = row["Topic"]
    if topic_id != -1:  # Skip outlier topic
        topic_words = workflow.modeler.get_topic_words(topic_id, top_n=10)
        word_str = ", ".join([word for word, _ in topic_words[:10]])
        print(f"Topic {topic_id} ({row['Size']} docs): {word_str}")

# Compare with ground truth
if "primary_topic" in df.columns:
    # Get document assignment with original IDs
    doc_topics = topics_df.merge(df[["id", "primary_topic"]], on="id")
    
    # Show contingency table
    print("\nContingency table (Discovered vs. Actual):")
    contingency = pd.crosstab(doc_topics["topic"], doc_topics["primary_topic"])
    display(contingency)
    
    # Calculate adjusted mutual information
    from sklearn.metrics import adjusted_mutual_info_score
    ami = adjusted_mutual_info_score(
        doc_topics["topic"].astype(str),
        doc_topics["primary_topic"]
    )
    print(f"\nAdjusted Mutual Information: {ami:.4f}")

# Measure topic modeling time
topic_time = time.time() - start_time
print(f"\nTopic modeling completed in {topic_time:.2f} seconds")
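
Alongside Adjusted Mutual Information, cluster purity gives a quick, easily interpreted reading of the same contingency table: the fraction of documents that share the majority ground-truth label of their cluster. The sketch below computes it with the standard library; the toy labels are invented for illustration and `cluster_purity` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def cluster_purity(predicted, actual):
    """Fraction of documents whose cluster's majority label matches their own label."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(predicted, actual):
        by_cluster[cluster][label] += 1
    # For each cluster, count the documents carrying its most common label
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(predicted)


# Toy assignments, for illustration only
predicted = [0, 0, 0, 1, 1, 2]
actual = ["AI", "AI", "Data", "Energy", "Energy", "Finance"]
print(f"Purity: {cluster_purity(predicted, actual):.3f}")
```

Purity rises trivially as the number of clusters grows, whereas AMI corrects for chance agreement, so purity is best read as a companion to AMI rather than a replacement.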

## 7. Enhanced Topic Labeling

Create more descriptive topic labels based on the top keywords.

In [None]:
# Enhanced topic labeling function for better descriptions
def generate_enhanced_topic_label(topic_words, topic_id):
    """Generate a more descriptive topic label from keywords."""
    if topic_id == -1:
        return "Miscellaneous/Outliers"
    
    # Keep the five highest-ranked keywords (their weights are not used in the label)
    keywords = [word for word, _ in topic_words[:5]]
    if not keywords:
        return f"Topic {topic_id}"
    
    # Check for specific domain indicators
    domains = {
        "ai": ["ai", "artificial", "intelligence", "machine", "learning", "neural", "deep"],
        "healthcare": ["medical", "health", "patient", "clinical", "hospital", "doctor"],
        "finance": ["financial", "finance", "banking", "investment", "market", "trading"],
        "energy": ["energy", "power", "renewable", "sustainable", "carbon", "grid"],
        "data": ["data", "analysis", "analytics", "processing", "visualization"]
    }
    
    # Check if keywords match any domains
    domain_matches = {}
    for domain, terms in domains.items():
        matches = sum(1 for kw in keywords if kw in terms)
        if matches > 0:
            domain_matches[domain] = matches
    
    # If we have domain matches, use the best one as prefix
    if domain_matches:
        best_domain = max(domain_matches.items(), key=lambda x: x[1])[0]
        domain_prefix = {
            "ai": "AI & ",
            "healthcare": "Healthcare & ",
            "finance": "Financial & ",
            "energy": "Energy & ",
            "data": "Data Science & "
        }[best_domain]
    else:
        domain_prefix = ""
    
    # Use the top two keywords as the headline and the next two as detail
    primary_keywords = keywords[:2]
    secondary_keywords = keywords[2:4]
    
    # Format the label based on available keywords
    if secondary_keywords:
        label = f"{domain_prefix}{' & '.join(primary_keywords).title()} ({', '.join(secondary_keywords)})"
    else:
        label = f"{domain_prefix}{' & '.join(primary_keywords).title()}"
    
    return label

# Generate enhanced topic labels
print("Generating enhanced topic labels...")
topic_labels = {}

for topic_id in topic_info["Topic"].unique():
    # Get top words for this topic
    top_words = workflow.modeler.get_topic_words(topic_id, top_n=10) if topic_id != -1 else []
    
    # Generate enhanced label
    label = generate_enhanced_topic_label(top_words, topic_id)
    topic_labels[topic_id] = label

# Print enhanced labels
print("\nEnhanced topic labels:")
for topic_id, label in topic_labels.items():
    if topic_id != -1:  # Skip outlier topic in display
        topic_size = topic_info[topic_info["Topic"] == topic_id]["Size"].values[0]
        print(f"Topic {topic_id} ({topic_size} docs): {label}")

# Save topic information with enhanced labels
topic_info["Enhanced_Label"] = topic_info["Topic"].map(topic_labels)
topic_info.to_csv(OUTPUT_DIR / "topic_summary_enhanced.csv", index=False)
print(f"\nEnhanced topic summary saved to {OUTPUT_DIR / 'topic_summary_enhanced.csv'}")

# Save document-topic assignments with enhanced labels
topics_df["Topic_Label"] = topics_df["topic"].map(topic_labels)
topics_df.to_csv(OUTPUT_DIR / "document_topics_enhanced.csv", index=False)
print(f"Document-topic assignments saved to {OUTPUT_DIR / 'document_topics_enhanced.csv'}")

## 8. Generate Comprehensive HTML Report

Create a high-quality HTML report with all the topic modeling results.

In [None]:
# Start report generation timer
start_time = time.time()

print("Generating comprehensive report with enhanced visualizations...")
report_path = workflow.generate_comprehensive_report(
    output_path=OUTPUT_DIR / "high_quality_topic_report.html",
    open_browser=False,
    title="High-Quality CPU Topic Analysis Report",
    include_interactive=True,
    topic_labels=topic_labels  # Use our enhanced labels
)

print(f"Report generated at {report_path}")

# Measure report generation time
report_time = time.time() - start_time
print(f"Report generation completed in {report_time:.2f} seconds")

## 9. High-Quality BERTopic Visualizations

Create detailed UMAP-based visualizations that prioritize quality.
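
Each plot in this section is wrapped in the same try/except so that one failure does not stop the others. That pattern can be factored into a small helper, sketched below. `save_figure` is a hypothetical name, and `_StubFigure` stands in for a Plotly figure (which exposes `write_html`) so the sketch runs without Plotly installed.

```python
import tempfile
from pathlib import Path


def save_figure(name, build_figure, output_dir):
    """Build a figure and write it to HTML; return the path, or None on failure."""
    path = Path(output_dir) / f"{name}.html"
    try:
        figure = build_figure()
        figure.write_html(str(path))
    except Exception as exc:  # keep going so one failed plot does not stop the rest
        print(f"Could not create {name}: {exc}")
        return None
    print(f"Saved {name} to {path}")
    return path


class _StubFigure:
    """Stand-in for a Plotly figure, for illustration only."""

    def write_html(self, path):
        Path(path).write_text("<html></html>", encoding="utf-8")


def _failing_builder():
    raise ValueError("no topics to plot")


with tempfile.TemporaryDirectory() as tmp:
    ok = save_figure("demo_plot", _StubFigure, tmp)
    failed = save_figure("broken_plot", _failing_builder, tmp)
```

In the notebook, `build_figure` would be a lambda wrapping one of the visualization calls, and `output_dir` would be `OUTPUT_DIR`.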

In [None]:
# Start visualization timer
start_time = time.time()

print("Creating high-quality UMAP visualizations (this may take some time)...")

# Get the BERTopic model
model = workflow.modeler.topic_model

# 3D embedding visualization with UMAP (higher quality than PCA)
try:
    print("\nGenerating 3D UMAP visualization...")
    embed_fig = workflow.modeler.visualize_embeddings(
        return_figure=True,
        plot_3d=True,  # 3D visualization for better separation
        width=1000,
        height=800,
        include_topic_labels=True,  # Add topic labels
        hover_data=["topic", "Topic_Label"],  # Add custom hover info
        topic_label_dict=topic_labels  # Use enhanced labels
    )
    embed_fig.write_html(OUTPUT_DIR / "3d_topic_embeddings.html")
    print(f"3D UMAP visualization saved to {OUTPUT_DIR / '3d_topic_embeddings.html'}")
except Exception as e:
    print(f"Could not create 3D UMAP visualization: {e}")

# Topic similarity network (high-quality visualization)
try:
    print("\nGenerating topic similarity network...")
    # In BERTopic, `topics` takes a list of topic IDs; None selects all topics
    network_fig = model.visualize_topics(
        topics=None,
        top_n_topics=None
    )
    network_fig.write_html(OUTPUT_DIR / "topic_similarity_network.html")
    print(f"Topic similarity network saved to {OUTPUT_DIR / 'topic_similarity_network.html'}")
except Exception as e:
    print(f"Could not create topic similarity network: {e}")

# Topic hierarchy visualization (detailed hierarchical clustering)
try:
    print("\nGenerating topic hierarchy visualization...")
    hierarchy_fig = create_bertopic_hierarchy(
        model=model,
        orientation="left",
        width=1200,
        height=800
    )
    if hierarchy_fig:
        hierarchy_fig.write_html(OUTPUT_DIR / "topic_hierarchy.html")
        print(f"Topic hierarchy saved to {OUTPUT_DIR / 'topic_hierarchy.html'}")
except Exception as e:
    print(f"Could not create topic hierarchy: {e}")

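# Inter-topic distance map with enhanced labels (uses UMAP for topic coordinates)
try:
    print("\nGenerating inter-topic distance map...")
    # visualize_topics draws the inter-topic distance map; visualize_topics_over_time
    # requires a topics_over_time DataFrame and is meant for temporal data.
    model.set_topic_labels(topic_labels)  # register the enhanced labels on the model
    distance_fig = model.visualize_topics(
        custom_labels=True,
        width=1000,
        height=800
    )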
    distance_fig.write_html(OUTPUT_DIR / "topic_distance_map.html")
    print(f"Inter-topic distance map saved to {OUTPUT_DIR / 'topic_distance_map.html'}")
except Exception as e:
    print(f"Could not create inter-topic distance map: {e}")

# Term score decline curves
try:
    print("\nGenerating term score decline curves...")
    for topic_id in [t for t in topic_info["Topic"].unique() if t != -1][:5]:  # Show first 5 topics
        term_fig = model.visualize_term_rank(topics=[topic_id])  # BERTopic expects a list of topic IDs
        term_fig.write_html(OUTPUT_DIR / f"term_rank_topic_{topic_id}.html")
        print(f"Term rank curve for Topic {topic_id} saved")
except Exception as e:
    print(f"Could not create term score decline curves: {e}")

# Measure visualization time
viz_time = time.time() - start_time
print(f"\nHigh-quality visualizations completed in {viz_time:.2f} seconds")

## 10. Advanced Topic Analysis

Perform deeper analysis of the topic structure and document assignments.
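
The similarity matrix in this section is built from pairwise cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖). The sketch below shows that computation on toy vectors using only the standard library, with a guard for zero-length vectors. The vectors are invented for illustration, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a {topic_id: vector} mapping."""
    ids = sorted(vectors)
    return ids, [[cosine_similarity(vectors[i], vectors[j]) for j in ids] for i in ids]


# Toy topic vectors, for illustration only
toy_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ids, matrix = similarity_matrix(toy_vectors)
for topic_id, row in zip(ids, matrix):
    print(topic_id, [round(value, 3) for value in row])
```

Orthogonal vectors score 0, identical directions score 1, and the matrix is symmetric, so only the upper triangle needs computing for large topic counts.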

In [None]:
# Start advanced analysis timer
start_time = time.time()

print("Performing advanced topic analysis...")

# Extract and save representative documents per topic
representative_docs = []
for topic_id in [t for t in topic_info["Topic"].unique() if t != -1]:  # Skip outlier topic
    # Get top documents for this topic
    try:
        # BERTopic's get_representative_docs takes only the topic ID; keep the first three
        topic_docs = model.get_representative_docs(topic=topic_id)[:3]
        
        for doc in topic_docs:
            representative_docs.append({
                "topic_id": topic_id,
                "topic_label": topic_labels.get(topic_id, f"Topic {topic_id}"),
                "document": doc
            })
    except Exception as e:
        print(f"Could not get representative docs for Topic {topic_id}: {e}")
        continue

# Create DataFrame of representative documents
if representative_docs:
    rep_docs_df = pd.DataFrame(representative_docs)
    rep_docs_df.to_csv(OUTPUT_DIR / "representative_documents.csv", index=False)
    print(f"\nSaved {len(representative_docs)} representative documents to {OUTPUT_DIR / 'representative_documents.csv'}")
    
    # Display a sample
    print("\nSample representative documents:")
    display(rep_docs_df.groupby("topic_label").head(1)[["topic_label", "document"]])

# Calculate topic similarity matrix
try:
    print("\nCalculating topic similarity matrix...")
    topic_ids = [t for t in topic_info["Topic"].unique() if t != -1]
    similarity_matrix = np.zeros((len(topic_ids), len(topic_ids)))
    
    for i, topic1 in enumerate(topic_ids):
        for j, topic2 in enumerate(topic_ids):
            if i == j:
                similarity_matrix[i, j] = 1.0
            else:
                # Get topic vectors
                vector1 = model.topic_vectors.get(topic1, None)
                vector2 = model.topic_vectors.get(topic2, None)
                
                if vector1 is not None and vector2 is not None:
                    # Calculate cosine similarity
                    similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
                    similarity_matrix[i, j] = similarity
    
    # Create similarity DataFrame with enhanced labels
    topic_labels_list = [topic_labels.get(t, f"Topic {t}") for t in topic_ids]
    similarity_df = pd.DataFrame(similarity_matrix, index=topic_labels_list, columns=topic_labels_list)
    
    # Save to CSV
    similarity_df.to_csv(OUTPUT_DIR / "topic_similarity_matrix.csv")
    print(f"Topic similarity matrix saved to {OUTPUT_DIR / 'topic_similarity_matrix.csv'}")
    
    # Create heatmap visualization of topic similarity
    import plotly.graph_objects as go
    fig = go.Figure(data=go.Heatmap(
        z=similarity_matrix,
        x=topic_labels_list,
        y=topic_labels_list,
        colorscale='Viridis',
        showscale=True
    ))
    
    fig.update_layout(
        title="Topic Similarity Matrix",
        width=1000,
        height=800,
        xaxis=dict(title="Topic"),
        yaxis=dict(title="Topic")
    )
    
    fig.write_html(OUTPUT_DIR / "topic_similarity_heatmap.html")
    print(f"Topic similarity heatmap saved to {OUTPUT_DIR / 'topic_similarity_heatmap.html'}")
    
except Exception as e:
    print(f"Could not calculate topic similarity matrix: {e}")

# Measure advanced analysis time
analysis_time = time.time() - start_time
print(f"\nAdvanced analysis completed in {analysis_time:.2f} seconds")

## 11. Performance Summary

Summarize the performance metrics of our quality-focused workflow.
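
The notebook times each stage with paired `time.time()` calls. A context manager can collect the same measurements in one dictionary and make the summary easier to extend. This is a minimal sketch with placeholder workloads; `timed`, `format_summary`, and `timings` are hypothetical names.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the duration even if the block raised
        timings[stage] = time.perf_counter() - start


def format_summary(stage_times):
    """Return one line per stage with its share of the total, plus a total line."""
    total = sum(stage_times.values())
    lines = []
    for stage, seconds in stage_times.items():
        share = 100 * seconds / total if total else 0.0
        lines.append(f"- {stage}: {seconds:.2f} seconds ({share:.0f}%)")
    lines.append(f"- Total: {total:.2f} seconds")
    return lines


with timed("Preprocessing"):
    sum(range(100_000))  # placeholder workload
with timed("Topic modeling"):
    sum(range(200_000))  # placeholder workload

print("\n".join(format_summary(timings)))
```

Reporting each stage's share of the total makes it easier to see where a quality-first configuration spends its extra time.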

In [None]:
# Create performance summary
print("Performance Summary")
print("===================\n")
print(f"Dataset size: {len(df)} documents")
print(f"Topics discovered: {(topic_info['Topic'] != -1).sum()} (excluding outliers)")
print("\nProcessing times:")
print(f"- Initialization: {init_time:.2f} seconds")
print(f"- Preprocessing: {preproc_time:.2f} seconds")
print(f"- Topic modeling: {topic_time:.2f} seconds")
print(f"- Report generation: {report_time:.2f} seconds")
print(f"- Visualizations: {viz_time:.2f} seconds")
print(f"- Advanced analysis: {analysis_time:.2f} seconds")
print(f"- Total processing time: {init_time + preproc_time + topic_time + report_time + viz_time + analysis_time:.2f} seconds")

if "primary_topic" in df.columns:
    print(f"\nAdjusted Mutual Information: {ami:.4f}")

## 12. Summary and Conclusion

This notebook demonstrated a quality-first approach to CPU-only topic modeling using Meno. We prioritized result quality over processing speed by using:

1. **Full-featured embedding models**: Using `all-MiniLM-L6-v2` without quantization for best embedding quality
2. **UMAP dimensionality reduction**: Slower but produces superior topic separation vs. PCA
3. **High-quality BERTopic**: Using HDBSCAN clustering for optimal topic coherence
4. **Enhanced visualizations**: Creating detailed, information-rich visualizations
5. **Advanced topic analysis**: Performing deeper analysis of topic structure and relationships

### Key Benefits of this Approach

- **Superior topic separation**: Better distinguishes between related topics
- **Higher topic coherence**: Topics are more internally consistent
- **More detailed visualizations**: Richer visual representations of topic relationships
- **Enhanced topic labels**: More meaningful descriptions of topic content
- **Works entirely on CPU**: No GPU required, just more processing time

This approach is ideal when you prioritize result quality over processing speed, especially for more nuanced datasets where topics have subtle differences or overlap.