# Meno + BERTopic Integration Tutorial

This notebook demonstrates how to integrate the [Meno](https://github.com/yourusername/meno) topic modeling toolkit with [BERTopic](https://github.com/MaartenGr/BERTopic) for advanced topic modeling.

## Overview

This notebook covers:

1. Setting up the environment
2. Loading and preprocessing data with Meno
3. Topic modeling with BERTopic
4. Hyperparameter optimization for BERTopic
5. Combining Meno's workflow with BERTopic's topic modeling
6. Visualizing and analyzing results

By the end of this notebook, you'll have a pipeline that uses Meno for preprocessing and reporting and BERTopic for topic discovery.

## 1. Environment Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
from pathlib import Path
from time import time
from datasets import load_dataset

# Add parent directory to path for Meno imports (if needed)
sys.path.insert(0, os.path.abspath('..'))

# Import Meno
from meno import MenoWorkflow
from meno.modeling.bertopic_model import BERTopicModel

# Import BERTopic components
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
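# UMAP and HDBSCAN come from the umap-learn and hdbscan packages (both BERTopic
# dependencies); they are aliased here to the names used in the cells below
from umap import UMAP as UMAPReducer
from hdbscan import HDBSCAN as HDBSCANClusterer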

# Create output directory
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

print("Environment setup complete!")
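
If the setup cell fails with an `ImportError`, it helps to see all missing packages at once rather than one per run. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the list of module names are illustrative and not part of Meno or BERTopic.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]

# Third-party modules this notebook relies on (top-level import names)
required = ["pandas", "numpy", "matplotlib", "datasets", "meno", "bertopic"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a module without importing it, so the check is fast and has no side effects.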

## 2. Loading and Preprocessing Data with Meno

Meno handles data preprocessing: acronym detection and expansion, spelling correction, and text normalization. Let's use it to prepare our data.

In [None]:
# Load sample dataset from Hugging Face
def load_data():
    """Load insurance dataset from Hugging Face."""
    print("Loading insurance dataset from Hugging Face...")
    dataset = load_dataset("soates/australian-insurance-pii-dataset-corrected")
    
    # Convert to DataFrame
    df = pd.DataFrame({
        "text": dataset["train"]["original_text"],
        "id": dataset["train"]["id"]
    })
    
    # Take a sample for faster processing
    df = df.sample(n=200, random_state=42)
    print(f"Loaded {len(df)} documents")
    
    return df

# Load the data
df = load_data()

In [None]:
# Initialize Meno Workflow
workflow = MenoWorkflow()

# Load data into workflow
workflow.load_data(
    data=df,
    text_column="text",
    id_column="id"
)

# Generate acronym report
print("Detecting acronyms...")
acronym_report = workflow.generate_acronym_report(
    min_length=2,
    min_count=3,
    output_path=str(output_dir / "insurance_acronyms.html"),
    open_browser=False
)

# Define custom acronym mappings based on report
acronym_mappings = {
    "PDS": "Product Disclosure Statement",
    "CTP": "Compulsory Third Party",
    "RSA": "Roadside Assistance"
}

# Apply acronym expansions
print("Expanding acronyms...")
workflow.expand_acronyms(custom_mappings=acronym_mappings)

# Generate spelling report
print("Detecting potential misspellings...")
spelling_report = workflow.generate_misspelling_report(
    min_length=5,
    min_count=2,
    output_path=str(output_dir / "insurance_misspellings.html"),
    open_browser=False
)

# Define custom spelling corrections
spelling_corrections = {
    "recieved": "received",
    "reciept": "receipt"
}

# Apply spelling corrections
print("Correcting spelling...")
workflow.correct_spelling(custom_corrections=spelling_corrections)

# Preprocess text data
print("Preprocessing documents...")
preprocessed_df = workflow.preprocess_documents(
    lowercase=True,
    remove_punctuation=True,
    remove_stopwords=True,
    additional_stopwords=["insurance", "policy", "claim", "customer"]
)

# Display a sample preprocessed document
print("\nSample preprocessed document:")
print(preprocessed_df["processed_text"].iloc[0])
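
To make the order of these steps concrete, here is a rough standard-library sketch of dictionary-based expansion followed by normalization. It is not Meno's implementation; `expand_terms` and `normalize` are hypothetical helpers, and the stopword set is a toy one. The point it illustrates is that acronyms are expanded before lowercasing, because the mapping keys are case-sensitive.

```python
import re
import string

def expand_terms(text, mappings):
    """Replace whole-word occurrences of each key with its mapped value."""
    if not mappings:
        return text
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mappings) + r")\b")
    return pattern.sub(lambda m: mappings[m.group(1)], text)

def normalize(text, stopwords):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in stopwords)

toy_acronyms = {"PDS": "Product Disclosure Statement", "CTP": "Compulsory Third Party"}
toy_corrections = {"recieved": "received", "reciept": "receipt"}
toy_stopwords = {"the", "i", "for", "my", "insurance", "policy", "claim", "customer"}

raw = "I recieved the PDS for my CTP policy."
expanded = expand_terms(raw, toy_acronyms)           # case-sensitive, so run before lowercasing
corrected = expand_terms(expanded, toy_corrections)  # same mechanism for spelling fixes
cleaned = normalize(corrected, toy_stopwords)
print(cleaned)
```

Running normalization first would lowercase `PDS` to `pds`, and the uppercase mapping keys would no longer match.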

## 3. Topic Modeling with BERTopic

Now that we've preprocessed the data with Meno, let's use BERTopic for topic modeling. We'll demonstrate three approaches:

1. Using Meno's BERTopicModel wrapper
2. Using BERTopic directly with a simple configuration
3. Using BERTopic with a custom pipeline

### 3.1 Using Meno's BERTopicModel Wrapper

In [None]:
# Initialize Meno's BERTopicModel
meno_bertopic = BERTopicModel(
    n_topics=10,
    min_topic_size=5,
    n_neighbors=15,
    n_components=5,
    verbose=True
)

# Fit the model
print("Fitting Meno BERTopicModel...")
meno_bertopic.fit(preprocessed_df["processed_text"])

# Display topic information
print("\nDiscovered topics:")
for topic_id, topic_name in meno_bertopic.topics.items():
    if topic_id != -1:  # Skip outlier topic
        topic_size = meno_bertopic.topic_sizes[topic_id]
        print(f"Topic {topic_id} ({topic_size} documents): {topic_name}")

### 3.2 Using BERTopic Directly with Simple Configuration

In [None]:
# Create a simple BERTopic model
simple_bertopic = BERTopic(
    embedding_model="all-MiniLM-L6-v2",
    nr_topics=10,
    min_topic_size=5,
    verbose=True
)

# Fit the model
print("Fitting simple BERTopic model...")
topics, probs = simple_bertopic.fit_transform(preprocessed_df["processed_text"].tolist())

# Get topic information
topic_info = simple_bertopic.get_topic_info()
print(f"\nDiscovered {len(topic_info[topic_info['Topic'] != -1])} topics")

# Print top words for each topic
print("\nTop words per topic:")
for topic_id in sorted(simple_bertopic.get_topics().keys()):
    if topic_id != -1:  # Skip outlier topic
        topic_words = simple_bertopic.get_topic(topic_id)
        words = [word for word, _ in topic_words[:5]]
        print(f"Topic {topic_id}: {', '.join(words)}")
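
The top words printed above are ranked by BERTopic's class-based TF-IDF (c-TF-IDF). As a rough illustration, the sketch below follows the weighting described in the BERTopic documentation: the term frequency within a topic, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the term's frequency across all topics. It is a simplified pure-Python version on toy data, not the library's code; the function name `ctfidf` and the sample tokens are assumptions.

```python
import math
from collections import Counter

def ctfidf(class_tokens):
    """Compute c-TF-IDF-style weights.

    class_tokens maps a topic id to the tokens of all its documents, concatenated.
    """
    term_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total_per_term = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)

    weights = {}
    for c, counts in term_counts.items():
        n_tokens = sum(counts.values())
        weights[c] = {
            term: (freq / n_tokens) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in counts.items()
        }
    return weights

toy_classes = {
    0: "car crash repair car quote".split(),
    1: "home flood repair home quote".split(),
}
toy_weights = ctfidf(toy_classes)
top_word = max(toy_weights[0], key=toy_weights[0].get)
print(top_word)
```

In this toy data, `crash` outranks `repair` within topic 0 even though both occur once there, because `repair` also appears in topic 1.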

### 3.3 Using BERTopic with Custom Pipeline

In [None]:
# Create custom components
print("Creating custom BERTopic pipeline...")

# Enhanced c-TF-IDF transformer
ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True,  # Reduce impact of frequent terms
    bm25_weighting=True          # BM25-inspired IDF weighting; can be more robust to stop words on small datasets
)

# Dimensionality reduction with UMAP
umap_model = UMAPReducer(
    n_neighbors=15,              # Balance local vs global structure
    n_components=5,              # Intermediate dimensionality
    min_dist=0.1,                # How tightly to pack points
    metric="cosine",             # Distance metric
    low_memory=True              # Memory optimization
)

# Clustering with HDBSCAN
hdbscan_model = HDBSCANClusterer(
    min_cluster_size=5,          # Minimum size of clusters (smaller for demonstration)
    min_samples=3,               # Sample size for core points
    metric="euclidean",          # Distance metric
    prediction_data=True,        # Store data for predicting new points
    cluster_selection_method="eom"  # Excess of mass method
)

# Topic representation model - combining two approaches
keybert_model = KeyBERTInspired()                # Better keywords
mmr_model = MaximalMarginalRelevance(diversity=0.3)  # Diversity in representation

# Create BERTopic model with custom pipeline
custom_bertopic = BERTopic(
    # Main embedding model
    embedding_model="all-MiniLM-L6-v2",
    
    # Custom components
    ctfidf_model=ctfidf_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=[keybert_model, mmr_model],  # Pipeline of representations
    
    # Model parameters
    nr_topics=10,
    calculate_probabilities=True,
    verbose=True
)

# Fit the model
print("Fitting custom BERTopic model...")
topics, probs = custom_bertopic.fit_transform(preprocessed_df["processed_text"].tolist())

# Get topic information
topic_info = custom_bertopic.get_topic_info()
print(f"\nDiscovered {len(topic_info[topic_info['Topic'] != -1])} topics")

# Print top words for each topic
print("\nTop words per topic:")
for topic_id in sorted(custom_bertopic.get_topics().keys()):
    if topic_id != -1:  # Skip outlier topic
        topic_words = custom_bertopic.get_topic(topic_id)
        words = [word for word, _ in topic_words[:5]]
        print(f"Topic {topic_id}: {', '.join(words)}")
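
The `diversity` argument passed to `MaximalMarginalRelevance` trades relevance against redundancy among the selected keywords. The sketch below shows the general greedy MMR idea on hand-made 2-D vectors. It is an illustration under stated assumptions, not BERTopic's implementation: `mmr_select`, the toy vectors, and the word list are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, candidates, top_n, diversity):
    """Greedily pick top_n words from candidates ({word: vector}) using MMR."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(remaining[word], doc_vec)
            # Redundancy: similarity to the closest word already selected
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

toy_doc = [1.0, 1.0]
toy_words = {
    "claim": [1.0, 0.9],
    "claims": [1.0, 0.8],   # near-duplicate of "claim"
    "storm": [0.2, 1.0],
}
low_diversity = mmr_select(toy_doc, toy_words, top_n=2, diversity=0.0)
high_diversity = mmr_select(toy_doc, toy_words, top_n=2, diversity=0.8)
print(low_diversity)
print(high_diversity)
```

With `diversity=0.0` the two most relevant words win, even though they are near-duplicates. With `diversity=0.8` the second pick switches to the less similar word.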

## 4. Hyperparameter Optimization for BERTopic

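BERTopic has several tunable components. We first define an evaluation function that scores a fitted model, then a small grid search over the UMAP, HDBSCAN, and MMR parameters (`n_neighbors`, `n_components`, `min_cluster_size`, `diversity`).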

In [None]:
def evaluate_bertopic_model(model, documents):
    """Evaluate a BERTopic model.
    
    Parameters
    ----------
    model : BERTopic
        The fitted BERTopic model
    documents : list
        List of document texts
        
    Returns
    -------
    dict
        Dictionary of evaluation metrics
    """
    # Get model information
    topic_info = model.get_topic_info()
    non_outliers = topic_info[topic_info['Topic'] != -1]
    
    # Calculate metrics
    n_topics = len(non_outliers)
    # Look up the outlier row (-1) explicitly; it is absent when every document is assigned
    outlier_count = topic_info.loc[topic_info['Topic'] == -1, 'Count'].sum()
    outlier_percentage = outlier_count / len(documents) * 100
    avg_topic_size = non_outliers['Count'].mean() if n_topics else 0.0
    
    # Calculate topic diversity based on word overlap
    all_topic_words = []
    for topic_id in model.get_topics():
        if topic_id != -1:  # Skip outlier topic
            words = [word for word, _ in model.get_topic(topic_id)[:10]]
            all_topic_words.append(set(words))
    
    # Average pairwise Jaccard similarity between the topics' top words
    # (lower similarity means more distinct topics)
    total_similarity = 0
    comparisons = 0
    for i in range(len(all_topic_words)):
        for j in range(i+1, len(all_topic_words)):
            intersection = len(all_topic_words[i].intersection(all_topic_words[j]))
            union = len(all_topic_words[i].union(all_topic_words[j]))
            if union > 0:
                total_similarity += intersection / union
                comparisons += 1
    
    # Diversity is 1 minus the mean similarity (1.0 when there is nothing to compare)
    word_diversity = 1 - (total_similarity / max(1, comparisons))
    
    return {
        "n_topics": n_topics,
        "outlier_percentage": outlier_percentage,
        "avg_topic_size": avg_topic_size,
        "word_diversity": word_diversity,
        "combined_score": n_topics * word_diversity * (100 - outlier_percentage) / 100
    }
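
The word-diversity metric is easier to check on a small example. The sketch below recomputes it for three hand-written topics using only the standard library; `jaccard`, `word_diversity`, and the toy word sets are illustrative names, not part of BERTopic or Meno.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def word_diversity(topic_word_sets):
    """1 minus the mean pairwise Jaccard similarity (1.0 with fewer than two topics)."""
    pairs = list(combinations(topic_word_sets, 2))
    if not pairs:
        return 1.0
    return 1 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)

toy_topics = [
    {"car", "crash", "repair"},
    {"car", "theft", "police"},
    {"flood", "storm", "roof"},
]
toy_diversity = word_diversity(toy_topics)
print(round(toy_diversity, 3))
```

Only the first two topics share a word (one of five distinct words, so a similarity of 0.2). The mean over the three pairs is 0.2 / 3, which gives a diversity of about 0.933. Note that the combined score multiplies this value by the number of topics, so it rewards models that find more topics.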

In [None]:
def optimize_bertopic_hyperparameters(documents, n_trials=3):
    """Optimize BERTopic hyperparameters.
    
    Parameters
    ----------
    documents : list
        List of document texts
    n_trials : int, optional
        Number of trials to run, by default 3
        
    Returns
    -------
    tuple
        (best_params, best_model, best_score)
    """
    import itertools
    
    # Define hyperparameter grid
    param_grid = {
        "n_neighbors": [5, 15, 30],
        "n_components": [5, 10, 15],
        "min_cluster_size": [5, 10, 15],
        "diversity": [0.1, 0.3, 0.5]
    }
    
    # For the notebook demo, use a smaller grid
    # In a real scenario, you would use the full grid or even more parameters
    if n_trials < 5:  # Use a reduced grid for quick demonstration
        param_grid = {
            "n_neighbors": [15],
            "n_components": [5],
            "min_cluster_size": [5, 10],
            "diversity": [0.3]
        }
    
    # Generate parameter combinations
    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())
    param_combinations = list(itertools.product(*param_values))
    
    # If too many combinations, select a random subset
    import random
    if len(param_combinations) > n_trials:
        random.seed(42)
        param_combinations = random.sample(param_combinations, n_trials)
    
    # Evaluate each combination
    best_score = -float('inf')
    best_params = None
    best_model = None
    
    print(f"Running {len(param_combinations)} trials...")
    
    for i, values in enumerate(param_combinations):
        params = dict(zip(param_names, values))
        print(f"\nTrial {i+1}/{len(param_combinations)}: {params}")
        
        # Create components with these parameters
        umap_model = UMAPReducer(
            n_neighbors=params["n_neighbors"],
            n_components=params["n_components"],
            min_dist=0.1,
            metric="cosine",
            low_memory=True
        )
        
        hdbscan_model = HDBSCANClusterer(
            min_cluster_size=params["min_cluster_size"],
            min_samples=params["min_cluster_size"] // 2,
            metric="euclidean",
            prediction_data=True,
            cluster_selection_method="eom"
        )
        
        # Topic representation with diversity parameter
        keybert_model = KeyBERTInspired()
        mmr_model = MaximalMarginalRelevance(diversity=params["diversity"])
        
        # Create BERTopic model
        model = BERTopic(
            embedding_model="all-MiniLM-L6-v2",
            umap_model=umap_model,
            hdbscan_model=hdbscan_model,
            representation_model=[keybert_model, mmr_model],
            calculate_probabilities=True,
            verbose=False  # Reduce output during optimization
        )
        
        start_time = time()
        
        # Fit the model
        try:
            model.fit_transform(documents)
            
            # Evaluate the model
            metrics = evaluate_bertopic_model(model, documents)
            print(f"Metrics: {metrics}")
            
            # Check if this is the best model so far
            if metrics["combined_score"] > best_score:
                best_score = metrics["combined_score"]
                best_params = params
                best_model = model
                print(f"New best model! Score: {best_score:.2f}")
                
        except Exception as e:
            print(f"Error with parameters {params}: {str(e)}")
            
        print(f"Trial completed in {time() - start_time:.1f} seconds")
    
    return best_params, best_model, best_score
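
The full grid in this function has 3 × 3 × 3 × 3 = 81 combinations, which is why it is sampled down to `n_trials`. The sketch below isolates that sampling step so it can be inspected without fitting any model. `sample_trials` is a hypothetical helper, and it uses a local `random.Random` instance instead of seeding the global generator, which is a design choice of this sketch rather than of the function above.

```python
import itertools
import random

toy_grid = {
    "n_neighbors": [5, 15, 30],
    "n_components": [5, 10, 15],
    "min_cluster_size": [5, 10, 15],
    "diversity": [0.1, 0.3, 0.5],
}

def sample_trials(grid, n_trials, seed=42):
    """Return up to n_trials parameter dicts drawn without replacement from the grid."""
    names = list(grid)
    combos = list(itertools.product(*grid.values()))
    if len(combos) > n_trials:
        # A local generator keeps the sample reproducible without touching global state
        combos = random.Random(seed).sample(combos, n_trials)
    return [dict(zip(names, values)) for values in combos]

all_trials = sample_trials(toy_grid, n_trials=1000)
some_trials = sample_trials(toy_grid, n_trials=10)
print(len(all_trials), len(some_trials))
```

Because sampling is without replacement, no parameter combination is evaluated twice.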

In [None]:
# Run the hyperparameter optimization
print("Optimizing BERTopic hyperparameters...")
best_params, best_model, best_score = optimize_bertopic_hyperparameters(
    preprocessed_df["processed_text"].tolist(),
    n_trials=2  # Use a small number for demonstration
)

print(f"\nBest hyperparameters: {best_params}")
print(f"Best score: {best_score:.2f}")

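# Errors are caught inside each trial, so every trial can fail and leave best_model as None
if best_model is None:
    raise RuntimeError("No trial produced a fitted model; check the errors printed above")

# Get topic information from the best model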
topic_info = best_model.get_topic_info()
print(f"\nNumber of topics in best model: {len(topic_info[topic_info['Topic'] != -1])}")

# Print top words for each topic
print("\nTop words per topic in best model:")
for topic_id in sorted(best_model.get_topics().keys()):
    if topic_id != -1:  # Skip outlier topic
        topic_words = best_model.get_topic(topic_id)
        words = [word for word, _ in topic_words[:5]]
        print(f"Topic {topic_id}: {', '.join(words)}")

## 5. Combining Meno's Workflow with BERTopic

Now let's demonstrate how to integrate the optimized BERTopic model with Meno's workflow.

In [None]:
# Assign topics from the best BERTopic model to the documents
topics, probs = best_model.transform(preprocessed_df["processed_text"].tolist())

# Create a DataFrame with topic assignments, aligned to the preprocessed documents' index
topic_df = pd.DataFrame({
    "topic": [f"Topic_{t}" if t >= 0 else "Outlier" for t in topics],
    "topic_probability": [float(max(prob)) for prob in probs]
}, index=preprocessed_df.index)

# Update the Meno workflow with BERTopic results
workflow.set_topic_assignments(topic_df)

# Generate Meno visualizations
print("Generating Meno visualizations...")

# Topic embedding visualization
embedding_viz = workflow.visualize_topics(plot_type="embeddings")
embedding_viz.write_html(str(output_dir / "topic_embeddings.html"))

# Topic distribution visualization
distribution_viz = workflow.visualize_topics(plot_type="distribution")
distribution_viz.write_html(str(output_dir / "topic_distribution.html"))

# Generate comprehensive report
print("\nGenerating comprehensive report...")
report_path = workflow.generate_comprehensive_report(
    output_path=str(output_dir / "meno_bertopic_report.html"),
    title="Meno + BERTopic Integration Results",
    include_interactive=True,
    include_raw_data=True,
    open_browser=False
)
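
The labelling step at the top of this cell can be checked in isolation. BERTopic marks outliers with topic id `-1`, and with `calculate_probabilities=True` each row of `probs` holds one probability per topic. The sketch below reproduces the mapping on toy values using plain lists; `label_topics` and `top_probability` are hypothetical helper names.

```python
def label_topics(topic_ids):
    """Map topic ids to string labels; -1 is the outlier topic."""
    return [f"Topic_{t}" if t >= 0 else "Outlier" for t in topic_ids]

def top_probability(prob_rows):
    """Highest per-document topic probability (0.0 for an empty row)."""
    return [float(max(row)) if len(row) else 0.0 for row in prob_rows]

toy_ids = [0, 2, -1]
toy_probs = [
    [0.7, 0.1, 0.2],
    [0.1, 0.2, 0.6],
    [0.0, 0.0, 0.0],  # an outlier with no probability mass on any topic
]
toy_labels = label_topics(toy_ids)
toy_top = top_probability(toy_probs)
print(toy_labels)
print(toy_top)
```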

## 6. Visualizing and Analyzing Results

Let's create some visualizations to analyze our results.

In [None]:
# Generate BERTopic visualizations
print("Generating BERTopic visualizations...")

# Topic similarity visualization
best_model.visualize_topics().write_html(str(output_dir / "bertopic_similarity.html"))

# Topic hierarchy
best_model.visualize_hierarchy().write_html(str(output_dir / "bertopic_hierarchy.html"))

# Topic barchart
best_model.visualize_barchart(top_n_topics=10).write_html(str(output_dir / "bertopic_barchart.html"))

# List all generated files
print("\nGenerated files:")
for file in output_dir.glob("*.html"):
    print(f"- {file.name}")

## Conclusion

In this notebook, we've demonstrated how to:

1. Use Meno's preprocessing capabilities to clean and prepare text data
2. Apply BERTopic for topic modeling, using both simple and advanced configurations
3. Optimize BERTopic hyperparameters for better topic quality
4. Integrate Meno's workflow system with BERTopic's topic modeling
5. Generate comprehensive visualizations and reports

This integration pairs Meno's data preprocessing and workflow management with BERTopic's topic modeling, so cleaning, topic discovery, and reporting run in a single pipeline.