In this example, we evaluate CAST, BERTopic, Top2Vec, LDA, and TopClus on the 20NewsGroups benchmark dataset, which is loaded from the [OCTIS library](https://github.com/MIND-Lab/OCTIS).

We use the default LDA implementation from OCTIS and the default parameters for TopClus. For CAST, Top2Vec, and BERTopic, we use the "all-mpnet-base-v2" model as an example. Additionally, we fix the UMAP and HDBSCAN parameters across these models to ensure a fair comparison. Each number of topics is evaluated three times, and the scores are averaged.

## Preparation

In [16]:
from octis.dataset.dataset import Dataset # load preprocessed data
from CAST import CAST, Candidate
import time
from sentence_transformers import SentenceTransformer
from evaluation_metrics import TopicCoherence, TopicDiversity
import json
import itertools

In [3]:
# Load dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
corpus = dataset.get_corpus()
documents = [" ".join(words) for words in corpus]
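
The `TopicCoherence` and `TopicDiversity` classes come from the local `evaluation_metrics` module, whose implementation is not shown in this notebook. As a rough guide to what the scores mean, the sketch below implements two common definitions: topic diversity as the share of unique words among the top-k words of all topics, and NPMI coherence averaged over word pairs. Both function names are hypothetical. The NPMI version estimates probabilities from whole-document co-occurrence, whereas library implementations of `c_npmi` (gensim's, for example) typically use sliding windows, so its values will likely differ from those reported by `TopicCoherence`.

```python
import math
from itertools import combinations


def topic_diversity(topics, topk=10):
    """Share of unique words among the top-k words of all topics (hypothetical helper)."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


def npmi_coherence(topics, texts, topk=10):
    """Mean pairwise NPMI per topic, averaged over topics (document-level sketch)."""
    doc_sets = [set(doc) for doc in texts]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:topk], 2):
            p12 = prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # the pair never co-occurs
            elif p12 == 1:
                pair_scores.append(1.0)  # convention used here: pair occurs in every document
            else:
                pmi = math.log(p12 / (prob(w1) * prob(w2)))
                pair_scores.append(pmi / -math.log(p12))
        if pair_scores:  # skip topics with fewer than two words
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)
```

Both functions take a list of topics, each a list of words; `npmi_coherence` also takes a tokenized corpus in the same form as `corpus`.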

## Evaluation

### CAST

On the first run, CAST will store the sentence and word embeddings in the `embeddings` folder. For subsequent runs, CAST will load these precomputed embeddings to save time.

In [6]:
MODEL_NAME = 'sentence-transformers/all-mpnet-base-v2'  # you can also use other sentence embedding models

params = {"n_dimensions": 5, 
          "min_cluster_size": 15, 
          "min_count": 50, 
          "self_sim_threshold": 0.4,
          "candidate_mode": 'word_level'
          }

In [None]:
for nr_topics in [10]:
    results = []
    for j in range(3):
        start = time.time()
        model = CAST(documents=documents, model_name=MODEL_NAME, nr_topics=nr_topics, **params)

        twords = model.pipeline()
        end = time.time()
        computation_time = float(end - start)

        for cluster_id, keywords in twords.items():
            print(f"Cluster {cluster_id}: {', '.join(keywords)}") 

        twords_list = list(twords.values())

        scores={}
        coherence_metric = TopicCoherence(texts=corpus, topk=10, measure="c_npmi")
        coherence_score = coherence_metric.score(twords_list)

        diversity_metric = TopicDiversity(topk=10)
        diversity_score = diversity_metric.score(twords_list)


        print(f"TC: {coherence_score}; TD: {diversity_score};")
        scores['npmi'] = coherence_score
        scores['diversity'] = diversity_score


        result = {
            "Model": MODEL_NAME,
            "Dataset Size": len(documents),
            "nr_topics": len(twords_list),
            "Params": params,
            "Scores": scores,
            "Computation Time": computation_time,
            "Topic words": twords_list,
        }
        results.append(result)


    with open(f"20News_CAST_{MODEL_NAME.split('/')[1]}_{nr_topics}.json", "w") as file:
        json.dump(results, file) 
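
Each file written by this loop holds a list of three run dictionaries. The averaging over runs is not shown in the notebook; a minimal sketch, assuming the `Scores` layout used above (the helper name `average_scores` is hypothetical):

```python
import json
from statistics import mean


def average_scores(path):
    """Average every entry of "Scores" over the runs stored in one results file."""
    with open(path) as file:
        runs = json.load(file)
    metric_names = runs[0]["Scores"].keys()
    return {name: mean(run["Scores"][name] for run in runs) for name in metric_names}
```

For example, `average_scores("20News_CAST_all-mpnet-base-v2_10.json")` would return the mean NPMI and diversity of the three CAST runs with 10 topics.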

### BERTopic

For more information, see the [BERTopic repository](https://github.com/MaartenGr/BERTopic).

In [14]:
from bertopic import BERTopic

In [None]:
MODEL_NAME = 'all-mpnet-base-v2'
embedding_model = SentenceTransformer(MODEL_NAME)
embeddings = embedding_model.encode(documents, show_progress_bar=True)

In [None]:
params = {"embedding_model": MODEL_NAME,
          'min_topic_size': 15,
          "verbose": True
          }

In [None]:
# nr_topics is one above the target: the loop reads topics 0..nr_topics-2
# and treats the outlier topic (-1) as counted in nr_topics.
for nr_topics in [11]:
    results = []
    for j in range(3):
        start = time.time()
        model = BERTopic(**params, nr_topics=nr_topics)
        topics, probs = model.fit_transform(documents, embeddings)
        print(f"number of topics: {len(set(topics))}")
        end = time.time()
        computation_time = float(end - start)

        # Top-10 words of each non-outlier topic
        twords = [
            [vals[0] for vals in model.get_topic(i)[:10]]
            for i in range(nr_topics - 1)
        ]
        print(f"the length of twords: {len(twords)}")
        print(twords)
        coherence_metric = TopicCoherence(texts=corpus, topk=10, measure="c_npmi")
        coherence_score = coherence_metric.score(twords)

        diversity_metric = TopicDiversity(topk=10)
        diversity_score = diversity_metric.score(twords)
        
        scores = {}
        scores['npmi'] = coherence_score
        scores['diversity'] = diversity_score

        result = {
            "Model": MODEL_NAME,
            "Dataset Size": len(documents),
            "Params": params,
            "Scores": scores,
            "Computation Time": computation_time,
            "Topic words": twords
        }
        results.append(result)


    with open(f"20NewsGroup_Bertopic_{MODEL_NAME}_{nr_topics-1}.json", "w") as file:
        json.dump(results, file)
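
The comprehension above indexes topics by position and assumes that ids 0 to `nr_topics - 2` all exist. A hedged alternative is to start from the mapping returned by `model.get_topics()` (topic id to a list of `(word, score)` pairs) and drop the outlier topic `-1` explicitly. The helper name `top_words_per_topic` is hypothetical, and the toy mapping below only mimics that structure:

```python
def top_words_per_topic(topic_map, topk=10, outlier_id=-1):
    """Return the top-k words of every non-outlier topic, ordered by topic id."""
    return [
        [word for word, _score in topic_map[topic_id][:topk]]
        for topic_id in sorted(topic_map)
        if topic_id != outlier_id
    ]


# Toy mapping in the shape of BERTopic's get_topics() output (made-up values)
toy_topics = {
    -1: [("the", 0.02), ("of", 0.01)],
    0: [("space", 0.09), ("nasa", 0.07)],
    1: [("game", 0.08), ("team", 0.06)],
}
print(top_words_per_topic(toy_topics, topk=2))
# [['space', 'nasa'], ['game', 'team']]
```

In the loop above, this would be called as `top_words_per_topic(model.get_topics())`.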

### Top2Vec

In [None]:
from top2vec import Top2Vec

Top2Vec supports the following `embedding_model`:
- doc2vec
- universal-sentence-encoder
- universal-sentence-encoder-large
- universal-sentence-encoder-multilingual
- universal-sentence-encoder-multilingual-large
- distiluse-base-multilingual-cased
- all-MiniLM-L6-v2
- paraphrase-multilingual-MiniLM-L12-v2

For models not in this list, Top2Vec also accepts a callable as `embedding_model`; see the [Top2Vec API documentation](https://top2vec.readthedocs.io/en/latest/api.html) for details. In this example, we define `embed_with_mpnet` to use the `all-mpnet-base-v2` model.

In [None]:
def embed_with_mpnet(texts):
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(texts)

In [None]:
params = {"umap_args": {"n_components": 5},
          "hdbscan_args": {'min_cluster_size': 15},
          "min_count": 50
          }

In [None]:
for num_topics in [10]:
    results = []
    for i in range(3):
        print(f"num_topic: {num_topics}")
        start = time.time()
        model = Top2Vec(documents, embedding_model=embed_with_mpnet, **params)
        topic_words, _, _ = model.get_topics()

        # Merge the discovered topics down to num_topics, then read the reduced topics
        _ = model.hierarchical_topic_reduction(num_topics)
        topic_words, _, _ = model.get_topics(num_topics=num_topics, reduced=True)
        end = time.time()
        computation_time = float(end - start)
        twords = [topic_word[:10].tolist() for topic_word in topic_words]
        
        coherence_metric = TopicCoherence(texts=corpus, topk=10, measure="c_npmi")
        coherence_score = coherence_metric.score(twords)

        diversity_metric = TopicDiversity(topk=10)
        diversity_score = diversity_metric.score(twords)
        
        scores = {}
        scores['npmi'] = coherence_score
        scores['diversity'] = diversity_score

        result = {
            "Model": MODEL_NAME,
            "num_topic": len(topic_words),
            "Dataset Size": len(documents),
            "Params": params,
            "Scores": scores,
            "Computation Time": computation_time,
            "Topic words": twords
        }
        results.append(result)


    with open(f"20NewsGroup_Top2vec_{MODEL_NAME}_{num_topics}.json", "w") as file:
        json.dump(results, file)


### LDA

In this example, we load the LDA model from [OCTIS](https://github.com/MIND-Lab/OCTIS).

In [None]:
from octis.models.LDA import LDA

In [None]:
for nr_topics in [10]:
    results = []
    for i, random_state in enumerate([0, 21, 42]):
        params = {"random_state": random_state,
                "alpha" : "auto", "eta": "auto", "decay" : 0.5, "offset" : 1.0,
                "iterations" : 400, "gamma_threshold" : 0.001}
        
        model = LDA(**params, num_topics=nr_topics)
        model.use_partitions = False


        start = time.time()
        output_tm = model.train_model(dataset)
        end = time.time()
        computation_time = end - start

        twords = output_tm['topics']
        
        coherence_metric = TopicCoherence(texts=corpus, topk=10, measure="c_npmi")
        coherence_score = coherence_metric.score(twords)

        diversity_metric = TopicDiversity(topk=10)
        diversity_score = diversity_metric.score(twords)
        
        scores = {}
        scores['npmi'] = coherence_score
        scores['diversity'] = diversity_score

        result = {
            "Dataset Size": len(documents),
            "Params": params,
            "Scores": scores,
            "Computation Time": computation_time,
            "Topic words": twords
        }
        results.append(result)


    with open(f"20NewsGroups_LDA_{nr_topics}.json", "w") as file:
        json.dump(results, file)


### TopClus

Follow the instructions of [TopClus](https://github.com/yumeng5/TopClus):

To execute the code on a new dataset, you need to
1. Create a directory named `your_dataset` under `datasets`.
2. Prepare a text corpus `texts.txt` (one document per line) under `your_dataset` as the target corpus for topic discovery.
3. Run `src/trainer.py` with appropriate command line arguments (the default values are usually good starting points).
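
Steps 1 and 2 can be scripted from the `documents` list built in the Preparation section. A minimal sketch, assuming it is run from the root of a TopClus checkout so that `datasets/` is the right parent directory; the helper name `write_topclus_corpus` is hypothetical:

```python
from pathlib import Path


def write_topclus_corpus(documents, dataset_name, root="datasets"):
    """Write one document per line to <root>/<dataset_name>/texts.txt."""
    target_dir = Path(root) / dataset_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "texts.txt"
    with open(target_file, "w", encoding="utf-8") as file:
        for doc in documents:
            # Collapse internal whitespace so each document stays on one line
            file.write(" ".join(doc.split()) + "\n")
    return target_file
```

For this notebook, the call would be `write_topclus_corpus(documents, "20NewsGroup")`.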



In this example, we create a directory `20NewsGroup` under `datasets` and store `texts.txt` in it. We use the following script to run the model:

```
clusters=(10 20 30)

# Loop over the cluster values
for n_clusters in "${clusters[@]}"; do
    python src/trainer.py --dataset 20NewsGroup --n_clusters "$n_clusters" --lr 5e-4 --cluster_weight 0.1 --seed 42 --do_cluster --do_inference
done

```

Once training has finished, you can evaluate the resulting topic words with the following script.

In [None]:
def read_twords(filename):
    result_dict = {}
    with open(filename, 'r') as file:
        for line in file:
            # Remove any leading/trailing whitespace characters
            line = line.strip()
            if line:  # If the line is not empty
                # Split on the first colon only, so colons in the values do not break parsing
                key, values = line.split(':', 1)
                # Split the values by comma and strip any extra whitespace
                value_list = [value.strip() for value in values.split(',')]
                # Add the key and list of values to the dictionary
                result_dict[key] = value_list
    return result_dict

results = []
for nr_topics in [10, 20, 30]:
    twords_dic = read_twords(f'your_file_path/topics_final_{nr_topics}.txt')
    twords = list(twords_dic.values())
    coherence_metric = TopicCoherence(texts=corpus, topk=10, measure="c_npmi")
    umass = TopicCoherence(texts=corpus, topk=10, measure='u_mass')
    uci = TopicCoherence(texts=corpus, topk=10, measure='c_uci')
    
    coherence_score = coherence_metric.score(twords)
    umass_score = umass.score(twords)
    uci_score = uci.score(twords)
  
     
    diversity_metric = TopicDiversity(topk=10)
    diversity_score = diversity_metric.score(twords)

    scores = {}
    scores['npmi'] = coherence_score
    scores['diversity'] = diversity_score
    scores['umass'] = umass_score
    scores['uci'] = uci_score

    result = {
        "Dataset Size": len(corpus),
        "nr_topics" : nr_topics,
        "Scores": scores,
        "Topic words": twords
    }
    results.append(result)


with open("20NewsGroup_TopClus.json", "w") as file:
    json.dump(results, file)