# Introduction

This notebook demonstrates an end-to-end pipeline that improves topic modeling with <b>BERTopic</b> by shrinking the noise cluster while preserving topic coherence and diversity. The method introduces an <b>objective function</b> that balances three competing goals: topic coherence, topic diversity, and noise reduction. Through <b>hyperparameter optimization (HPO)</b> with Optuna, we systematically search for the configuration of UMAP and HDBSCAN parameters that maximizes this objective.

We apply the approach to the IMDb movie review dataset, a high-noise, real-world benchmark, and compare the results against a baseline BERTopic configuration. The tuned configuration reduces the noise cluster from ~60% to ~40% of the documents, which increases the amount of usable data and yields a finer-grained topic structure (roughly 600 topics versus roughly 300 for the baseline).

To keep the pipeline scalable, we use NVIDIA cuML to accelerate UMAP and HDBSCAN on the GPU. Enabling it takes a single drop-in line at the top of the notebook: <b>%load_ext cuml.accel</b>

# GPU Acceleration (cuML) Activation 

In [1]:
%load_ext cuml.accel

[2025-06-16 15:02:22.178] [CUML] [info] cuML: Installed accelerator for sklearn.
[2025-06-16 15:02:23.732] [CUML] [info] cuML: Installed accelerator for umap.
[2025-06-16 15:02:23.736] [CUML] [info] cuML: Installed accelerator for hdbscan.
[2025-06-16 15:02:23.736] [CUML] [info] cuML: Successfully initialized accelerator.


# BERTopic Introduction

In [None]:
from bertopic import BERTopic
import hdbscan
import umap
from datasets import load_dataset

# Load IMDb reviews dataset
dataset = load_dataset("imdb", split="train")
docs = dataset["text"]  # List of text documents

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state = 42)
hdbscan_model = hdbscan.HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model
)
topics, probs = topic_model.fit_transform(docs)

# Visualize the topics
topic_model.visualize_topics()

<b>Note:</b> The Intertopic Distance Map visualization has been removed due to its large size. You can still generate it by running the notebook locally.

Get the Topic ID and the Respective Top Keywords

In [3]:
# Get the topic keywords
topic_info = topic_model.get_topic_info()  # Returns a DataFrame with topic details
topic_info  # Includes Topic ID and top keywords

|  | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 14801 | -1_the_to_and_of | [the, to, and, of, this, is, it, in, that, movie] | [Now, I have seen a lot of movies in my day, b... |
| 1 | 0 | 873 | 0_show_series_episode_episodes | [show, series, episode, episodes, season, show... | ["That '70s Show" is definitely the funniest s... |
| 2 | 1 | 505 | 1_japanese_chinese_martial_japan | [japanese, chinese, martial, japan, arts, acti... | [Billy Chung Siu Hung's (the bloody swordplay ... |
| 3 | 2 | 461 | 2_bad_movie_worst_it | [bad, movie, worst, it, you, this, acting, was... | [Worst film ever, this is a statement that peo... |
| 4 | 3 | 341 | 3_horror_scary_movie_movies | [horror, scary, movie, movies, it, you, house,... | [(SPOILERS included) This film surely is the b... |
| ... | ... | ... | ... | ... | ... |
| 299 | 298 | 5 | 298_hutton_duchovny_geekboy_muldoon | [hutton, duchovny, geekboy, muldoon, jolie, jo... | [Okay, truthfully, I saw the previews for this... |
| 300 | 299 | 5 | 299_elephant_elephants_plantation_taylor | [elephant, elephants, plantation, taylor, finc... | [A beautiful shopgirl in London is swept off h... |
| 301 | 300 | 5 | 300_tomanovich_reverand_dara_mamets | [tomanovich, reverand, dara, mamets, kirkland,... | [This centers on unironic notions of coming to... |
| 302 | 301 | 5 | 301_bombshells_elizabeth_dench_patrick | [bombshells, elizabeth, dench, patrick, laine,... | [A charming little film set in the UK about th... |


In [4]:
import cupy as cp
noise_count = topic_info[topic_info.Topic == -1].Count
float(cp.round((noise_count.values[0]/len(docs)) * 100, 2))

59.2

# The Objective Function

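The optimization objective is a weighted linear combination of three metrics: topic coherence, topic diversity, and a penalty on excess noise.
<div align="center">
$$
J = w_c \cdot \mathrm{Coherence} + w_d \cdot \mathrm{Diversity} - w_n \cdot \max\left(0, \frac{\mathrm{Noise}\% - \mathrm{Threshold}}{100}\right)
$$
</div>

Here $\mathrm{Noise}\%$ is the percentage of documents assigned to the noise cluster (topic -1), so the penalty applies only when that percentage exceeds the threshold.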

In [5]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

In [6]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true" 

<div>
    For example, with $w_c = 0.4$, $w_d = 0.3$, $w_n = 0.3$, and $\mathrm{Threshold} = 20$, the objective becomes:
</div>
<div align="center">
$$
J = 0.4 \cdot \mathrm{Coherence} + 0.3 \cdot \mathrm{Diversity} - 0.3 \cdot \max\left(0, \frac{\mathrm{Noise}\% - 20}{100}\right)
$$
</div>
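
As a sanity check on the formula, the weighted score can be computed by hand. The sketch below is a minimal, standalone version of the objective; the function name `objective_score` and its keyword defaults are illustrative assumptions rather than part of the notebook's pipeline. The inputs are the coherence, diversity, and noise values printed for the tuned model later in this notebook.

```python
def objective_score(coherence, diversity, noise_percent,
                    w_c=0.4, w_d=0.3, w_n=0.3, threshold=20.0):
    """Weighted objective J (illustrative helper, not part of BERTopic)."""
    # Hinge penalty: zero when noise is at or below the threshold.
    noise_penalty = max(0.0, (noise_percent - threshold) / 100)
    return w_c * coherence + w_d * diversity - w_n * noise_penalty


# Scores printed for the tuned model later in this notebook.
score = objective_score(coherence=0.6562, diversity=0.7634, noise_percent=41.25)
print(f"J = {score:.5f}")
```

Term by term this is 0.26248 + 0.22902 - 0.06375, or about 0.428. A model with noise at or below 20% would incur no penalty at all, so the optimizer gains nothing from pushing noise below the threshold at the expense of coherence or diversity.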

In [7]:
import re
from itertools import chain

# Preprocess and tokenize documents
def preprocess(doc):
    doc = re.sub(r'<.*?>', '', doc)  # Remove HTML
    doc = re.sub(r'[^a-zA-Z\s]', '', doc)  # Remove non-alphabetic chars
    return doc.lower().split()

# Extract topic words (excluding outlier topic -1)
def get_topic_words(topic_model, topics):
    return [
        [word for word, _ in topic_model.get_topic(topic)]
        for topic in set(topics) - {-1}
    ]

# Calculate the noise percentage and the penalty for exceeding the threshold
def calculate_noise_penalty(topic_model, doc_count, threshold=20):
    noise_row = topic_model.get_topic_info().query("Topic == -1")
    noise_count = int(noise_row.Count.values[0]) if not noise_row.empty else 0
    noise_percent = round((noise_count / doc_count) * 100, 2)
    # Penalize only noise above the threshold: max(0, Noise% - Threshold)
    penalty = max(0.0, noise_percent - threshold)
    return noise_percent, penalty

# Calculate coherence score
def calculate_coherence(topic_model, tokenized_docs, topics, dictionary):
    topic_words = get_topic_words(topic_model, topics)
    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence='c_v'
    )
    return coherence_model.get_coherence()

# Calculate topic diversity score
def calculate_diversity(topic_model, topics):
    topic_words = get_topic_words(topic_model, topics)
    all_words = chain.from_iterable(topic_words)
    word_list = list(all_words)
    return len(set(word_list)) / len(word_list) if word_list else 0
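
To make the diversity metric concrete: it is the share of unique words among all top words across topics, so it drops whenever topics reuse the same keywords. The toy example below applies the same ratio to hand-written word lists; the `topic_diversity` name and the sample topics are hypothetical and only illustrate the calculation.

```python
from itertools import chain


def topic_diversity(topic_words):
    """Fraction of unique words across all topics' top-word lists."""
    words = list(chain.from_iterable(topic_words))
    return len(set(words)) / len(words) if words else 0.0


# Hypothetical top words for three topics; "movie" appears in two of them.
toy_topics = [
    ["horror", "scary", "movie"],
    ["bad", "worst", "movie"],
    ["show", "series", "episode"],
]
print(topic_diversity(toy_topics))  # 8 unique words out of 9
```

A score of 1.0 would mean that no keyword is shared between any two topics.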


In [8]:
# Training and evaluation pipeline
def train_and_eval(docs, n_components=5, n_neighbors=15, min_dist=0.0, min_samples=10,
                   gen_min_span_tree=True, prediction_data=True):
    try:
        # Dimensionality reduction and clustering models
        spread_val = max(min_dist + 1e-3, 1.0)
        umap_model = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors,
                               min_dist=min_dist, spread=spread_val)
        hdbscan_model = hdbscan.HDBSCAN(min_samples=min_samples,
                                        gen_min_span_tree=gen_min_span_tree,
                                        prediction_data=prediction_data)

        topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
        topics, _ = topic_model.fit_transform(docs)

        # Preprocess docs once
        tokenized_docs = list(map(preprocess, docs))
        dictionary = Dictionary(tokenized_docs)

        # Metric calculations
        noise_percent, penalty = calculate_noise_penalty(topic_model, len(docs))
        coherence = calculate_coherence(topic_model, tokenized_docs, topics, dictionary)
        diversity = calculate_diversity(topic_model, topics)

        # Weighted score
        score = (0.4 * coherence) + (0.3 * diversity) - (0.3 * penalty / 100)
        return score

    except Exception as e:
        print(f"[Trial failed] Exception: {e}")
        return -100.0


# Hyperparameter Optimization (HPO)

In [9]:
import optuna

In [10]:
def objective(trial):
    params = {
        "n_components": trial.suggest_int("n_components", 5, 20),
        "n_neighbors": trial.suggest_int("n_neighbors", 5, 20),
        "min_dist": trial.suggest_float("min_dist", 0.0, 1.0),
        "min_samples": trial.suggest_int("min_samples", 5, 25),
        "gen_min_span_tree": trial.suggest_categorical("gen_min_span_tree", [True, False]),
        "prediction_data": trial.suggest_categorical("prediction_data", [True, False]),  
    }
    return train_and_eval(docs=docs, **params)


In [11]:
%%time

study = optuna.create_study(
    direction="maximize",
    study_name="optuna_bertopic",
    sampler=optuna.samplers.TPESampler(seed=142),
)

study.optimize(objective, n_trials=40)

print(f"Best params: {study.best_params}")

[I 2025-06-16 14:15:35,127] A new study created in memory with name: optuna_bertopic


[2025-06-16 14:15:51.973] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:15:59,898] Trial 0 finished with value: 0.2585886740893275 and parameters: {'n_components': 19, 'n_neighbors': 13, 'min_dist': 0.6559847055064072, 'min_samples': 22, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 0 with value: 0.2585886740893275.


[2025-06-16 14:16:16.933] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:16:38,791] Trial 1 finished with value: 0.39402694288194373 and parameters: {'n_components': 11, 'n_neighbors': 17, 'min_dist': 0.7855316209877806, 'min_samples': 8, 'gen_min_span_tree': False, 'prediction_data': False}. Best is trial 1 with value: 0.39402694288194373.


[2025-06-16 14:16:55.829] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:17:04,156] Trial 2 finished with value: 0.22755259693611923 and parameters: {'n_components': 14, 'n_neighbors': 11, 'min_dist': 0.9578462739625135, 'min_samples': 22, 'gen_min_span_tree': False, 'prediction_data': True}. Best is trial 1 with value: 0.39402694288194373.


[2025-06-16 14:17:21.286] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:17:29,947] Trial 3 finished with value: 0.25855867408932753 and parameters: {'n_components': 16, 'n_neighbors': 13, 'min_dist': 0.8552674895776905, 'min_samples': 25, 'gen_min_span_tree': False, 'prediction_data': True}. Best is trial 1 with value: 0.39402694288194373.


[2025-06-16 14:17:47.197] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:18:03,011] Trial 4 finished with value: 0.3969103338719902 and parameters: {'n_components': 16, 'n_neighbors': 14, 'min_dist': 0.17615075495259813, 'min_samples': 20, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 4 with value: 0.3969103338719902.


[2025-06-16 14:18:19.734] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:18:28,197] Trial 5 finished with value: 0.2586486740893275 and parameters: {'n_components': 6, 'n_neighbors': 19, 'min_dist': 0.919979952129186, 'min_samples': 13, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 4 with value: 0.3969103338719902.


[2025-06-16 14:18:44.840] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:19:03,844] Trial 6 finished with value: 0.3993301583667193 and parameters: {'n_components': 18, 'n_neighbors': 11, 'min_dist': 0.5024363814847549, 'min_samples': 14, 'gen_min_span_tree': False, 'prediction_data': False}. Best is trial 6 with value: 0.3993301583667193.


[2025-06-16 14:19:21.161] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:19:35,801] Trial 7 finished with value: 0.3602612607986829 and parameters: {'n_components': 11, 'n_neighbors': 8, 'min_dist': 0.537753088603063, 'min_samples': 23, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 6 with value: 0.3993301583667193.


[2025-06-16 14:19:52.978] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:20:11,850] Trial 8 finished with value: 0.3958639030397454 and parameters: {'n_components': 10, 'n_neighbors': 12, 'min_dist': 0.41410321525062666, 'min_samples': 16, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 6 with value: 0.3993301583667193.


[2025-06-16 14:20:28.364] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:20:47,884] Trial 9 finished with value: 0.41012372293006766 and parameters: {'n_components': 8, 'n_neighbors': 13, 'min_dist': 0.35778167444036124, 'min_samples': 13, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 9 with value: 0.41012372293006766.


[2025-06-16 14:21:04.739] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:22:34,765] Trial 10 finished with value: 0.4238033112251916 and parameters: {'n_components': 5, 'n_neighbors': 6, 'min_dist': 0.0856621756325428, 'min_samples': 5, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 10 with value: 0.4238033112251916.


[2025-06-16 14:22:51.603] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:24:25,417] Trial 11 finished with value: 0.43095635637979335 and parameters: {'n_components': 5, 'n_neighbors': 5, 'min_dist': 0.006038095781139979, 'min_samples': 5, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 11 with value: 0.43095635637979335.


[2025-06-16 14:24:42.079] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:26:17,592] Trial 12 finished with value: 0.42267679278847486 and parameters: {'n_components': 5, 'n_neighbors': 5, 'min_dist': 0.0231801885461784, 'min_samples': 5, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 11 with value: 0.43095635637979335.


[2025-06-16 14:26:34.922] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:27:15,937] Trial 13 finished with value: 0.42899552852255635 and parameters: {'n_components': 7, 'n_neighbors': 5, 'min_dist': 0.005350668686725374, 'min_samples': 8, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 11 with value: 0.43095635637979335.


[2025-06-16 14:27:32.818] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:28:04,631] Trial 14 finished with value: 0.41762493357797337 and parameters: {'n_components': 8, 'n_neighbors': 8, 'min_dist': 0.22660613306919541, 'min_samples': 9, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 11 with value: 0.43095635637979335.


[2025-06-16 14:28:21.660] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:28:57,256] Trial 15 finished with value: 0.4160984707849811 and parameters: {'n_components': 8, 'n_neighbors': 8, 'min_dist': 0.239794161607893, 'min_samples': 9, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 11 with value: 0.43095635637979335.


[2025-06-16 14:29:14.197] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:30:23,834] Trial 16 finished with value: 0.4339047428924338 and parameters: {'n_components': 7, 'n_neighbors': 5, 'min_dist': 0.0012105087172698276, 'min_samples': 7, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:30:40.854] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:31:12,040] Trial 17 finished with value: 0.4247040583210651 and parameters: {'n_components': 10, 'n_neighbors': 7, 'min_dist': 0.11928553834143374, 'min_samples': 11, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:31:28.990] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:31:47,305] Trial 18 finished with value: 0.4111306368703605 and parameters: {'n_components': 13, 'n_neighbors': 10, 'min_dist': 0.28222095137931086, 'min_samples': 17, 'gen_min_span_tree': False, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:32:04.357] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:32:34,649] Trial 19 finished with value: 0.39295782217238157 and parameters: {'n_components': 5, 'n_neighbors': 16, 'min_dist': 0.34485406140919583, 'min_samples': 6, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:32:51.869] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:33:18,625] Trial 20 finished with value: 0.4281653736607506 and parameters: {'n_components': 7, 'n_neighbors': 9, 'min_dist': 0.11614095187671158, 'min_samples': 11, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:33:35.574] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:34:38,016] Trial 21 finished with value: 0.4300780064867389 and parameters: {'n_components': 7, 'n_neighbors': 5, 'min_dist': 0.006053230377275454, 'min_samples': 7, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 16 with value: 0.4339047428924338.


[2025-06-16 14:34:55.079] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:35:54,447] Trial 22 finished with value: 0.43720227759735714 and parameters: {'n_components': 9, 'n_neighbors': 6, 'min_dist': 0.0021612169364261147, 'min_samples': 7, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:36:11.938] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:36:44,094] Trial 23 finished with value: 0.4203556793466974 and parameters: {'n_components': 9, 'n_neighbors': 7, 'min_dist': 0.1491895612932953, 'min_samples': 10, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:37:01.313] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:38:00,677] Trial 24 finished with value: 0.42857334374774014 and parameters: {'n_components': 6, 'n_neighbors': 6, 'min_dist': 0.07274897583375675, 'min_samples': 7, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:38:17.440] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:39:36,718] Trial 25 finished with value: 0.41297708499067276 and parameters: {'n_components': 9, 'n_neighbors': 6, 'min_dist': 0.19056897574329767, 'min_samples': 5, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:39:53.308] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:40:36,657] Trial 26 finished with value: 0.42831085772894417 and parameters: {'n_components': 12, 'n_neighbors': 9, 'min_dist': 0.06959809405442266, 'min_samples': 7, 'gen_min_span_tree': False, 'prediction_data': True}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:40:53.260] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:41:18,716] Trial 27 finished with value: 0.41787947335202097 and parameters: {'n_components': 6, 'n_neighbors': 7, 'min_dist': 0.28458375701470495, 'min_samples': 11, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:41:35.895] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:42:15,635] Trial 28 finished with value: 0.41107165824176783 and parameters: {'n_components': 10, 'n_neighbors': 5, 'min_dist': 0.5722102382508596, 'min_samples': 6, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:42:33.077] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:42:50,741] Trial 29 finished with value: 0.3996944601873649 and parameters: {'n_components': 20, 'n_neighbors': 9, 'min_dist': 0.4253684658245429, 'min_samples': 18, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:43:08.082] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:43:30,133] Trial 30 finished with value: 0.40046380005015 and parameters: {'n_components': 9, 'n_neighbors': 15, 'min_dist': 0.7216519748157755, 'min_samples': 9, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:43:47.421] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:44:50,470] Trial 31 finished with value: 0.4299196363726815 and parameters: {'n_components': 7, 'n_neighbors': 5, 'min_dist': 0.0023900101400661407, 'min_samples': 7, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:45:07.851] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:45:53,153] Trial 32 finished with value: 0.42942273187676866 and parameters: {'n_components': 7, 'n_neighbors': 6, 'min_dist': 0.03237838187232781, 'min_samples': 8, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:46:10.597] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:47:29,096] Trial 33 finished with value: 0.41870547975942163 and parameters: {'n_components': 6, 'n_neighbors': 5, 'min_dist': 0.14844782847265728, 'min_samples': 6, 'gen_min_span_tree': True, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:47:46.196] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:48:29,469] Trial 34 finished with value: 0.4283902675174767 and parameters: {'n_components': 5, 'n_neighbors': 7, 'min_dist': 0.06959943207016685, 'min_samples': 8, 'gen_min_span_tree': False, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:48:46.566] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:49:06,952] Trial 35 finished with value: 0.4000443955721958 and parameters: {'n_components': 8, 'n_neighbors': 19, 'min_dist': 0.19589947603322433, 'min_samples': 10, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:49:23.979] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:50:55,418] Trial 36 finished with value: 0.4177026289782734 and parameters: {'n_components': 14, 'n_neighbors': 6, 'min_dist': 0.05706027163931862, 'min_samples': 5, 'gen_min_span_tree': False, 'prediction_data': False}. Best is trial 22 with value: 0.43720227759735714.


[2025-06-16 14:51:12.336] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:51:36,240] Trial 37 finished with value: 0.4399132307553768 and parameters: {'n_components': 11, 'n_neighbors': 10, 'min_dist': 0.0021227884589070994, 'min_samples': 12, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 37 with value: 0.4399132307553768.


[2025-06-16 14:51:53.287] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:52:17,096] Trial 38 finished with value: 0.4199464121730489 and parameters: {'n_components': 12, 'n_neighbors': 11, 'min_dist': 0.12133431022641758, 'min_samples': 12, 'gen_min_span_tree': True, 'prediction_data': True}. Best is trial 37 with value: 0.4399132307553768.


[2025-06-16 14:52:34.502] [CUML] [info] Building knn graph using brute force


[I 2025-06-16 14:52:52,877] Trial 39 finished with value: 0.39252737727258014 and parameters: {'n_components': 11, 'n_neighbors': 10, 'min_dist': 0.8433903744638245, 'min_samples': 15, 'gen_min_span_tree': False, 'prediction_data': True}. Best is trial 37 with value: 0.4399132307553768.


Best params: {'n_components': 11, 'n_neighbors': 10, 'min_dist': 0.0021227884589070994, 'min_samples': 12, 'gen_min_span_tree': True, 'prediction_data': True}
CPU times: user 54min 41s, sys: 3min, total: 57min 41s
Wall time: 37min 17s


Apply HPO best results to BERTopic

In [12]:
spread_val = max(study.best_params['min_dist'] + 1e-3, 1.0)
n_components = study.best_params['n_components']
n_neighbors = study.best_params['n_neighbors']
min_dist = study.best_params['min_dist']
min_samples = study.best_params['min_samples']
gen_min_span_tree = study.best_params['gen_min_span_tree']
prediction_data = study.best_params['prediction_data']

In [None]:
# Create instances of GPU-accelerated UMAP and HDBSCAN

umap_model = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, min_dist=min_dist, spread=spread_val, random_state = 42)
hdbscan_model = hdbscan.HDBSCAN(min_samples=min_samples, gen_min_span_tree=gen_min_span_tree, prediction_data=prediction_data)

# Pass the models to BERTopic
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Visualize the topics
topic_model.visualize_topics()

<b>Note:</b> The Intertopic Distance Map visualization has been removed due to its large size. You can still generate it by running the notebook locally.

Check the cluster sizes

In [14]:
# Get the topic keywords
topic_info = topic_model.get_topic_info()  # Returns a DataFrame with topic details
topic_info  # Includes Topic ID and top keywords

|  | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 10007 | -1_this_it_movie_was | [this, it, movie, was, to, the, br, that, and,... | [Firstly, I really enjoyed this movie and its ... |
| 1 | 0 | 377 | 0_horror_scarecrow_gore_scary | [horror, scarecrow, gore, scary, scarecrows, g... | [Scarecrows is one of those films that, with a... |
| 2 | 1 | 360 | 1_bollywood_indian_kapoor_akshay | [bollywood, indian, kapoor, akshay, khan, indi... | [Do not waste your time with this movie. This ... |
| 3 | 2 | 204 | 2_french_paris_alexandre_la | [french, paris, alexandre, la, love, je, taime... | [I saw "Paris Je T'Aime" because a friend real... |
| 4 | 3 | 198 | 3_martial_jackie_kung_arts | [martial, jackie, kung, arts, fu, chan, action... | [Jackie Chan's classic directorial feature POL... |
| ... | ... | ... | ... | ... | ... |
| 602 | 601 | 5 | 601_show_smart_butterflies_jokes | [show, smart, butterflies, jokes, jackass, rea... | [I understand the jokes quite well, they just ... |
| 603 | 602 | 5 | 602_powell_vance_powells_philo | [powell, vance, powells, philo, astor, thin, a... | [I've seen the Thin Man series -- Powell and L... |
| 604 | 603 | 5 | 603_suleiman_anansa_amin_linderby | [suleiman, anansa, amin, linderby, arab, caine... | [The story at the outset is interesting: slave... |
| 605 | 604 | 5 | 604_coulier_host_hosts_mustve | [coulier, host, hosts, mustve, kinnear, greg, ... | [In my opinion, this is a pretty good celebrit... |


Calculate Coherence Score (c_v)

In [15]:
tokenized_docs = list(map(preprocess, docs))
dictionary = Dictionary(tokenized_docs)
coherence = calculate_coherence(topic_model, tokenized_docs, topics, dictionary)
print(f"Coherence Score: {coherence:.4f}")


Coherence Score: 0.6562


Calculate Diversity Score 

In [16]:
diversity = calculate_diversity(topic_model, topics)
print(f"Diversity Score: {diversity:.4f}")


Diversity Score: 0.7634


Calculate Noise (-1 Cluster) Percentage

In [17]:
percent, penalty = calculate_noise_penalty(topic_model, len(docs))
print(f"Noise Percentage: {percent:.4f}")

Noise Percentage: 41.2500


As a rough rule of thumb, c_v coherence scores can be interpreted as follows:

* above 0.5 = Acceptable

* above 0.65 = Good

* above 0.75 = Excellent

A diversity score of 0.85 or higher is usually considered good.
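
Applying these rules of thumb to the scores printed above gives a quick read on the tuned model. The helper below is a small sketch; the function name `rate_coherence` and the "Poor" label for scores at or below 0.5 are assumptions added for illustration.

```python
def rate_coherence(c_v):
    """Map a c_v coherence score to the rule-of-thumb labels above."""
    if c_v > 0.75:
        return "Excellent"
    if c_v > 0.65:
        return "Good"
    if c_v > 0.5:
        return "Acceptable"
    return "Poor"  # assumed label; the guideline does not name this range


coherence, diversity = 0.6562, 0.7634  # values printed earlier in this notebook
print(f"Coherence {coherence:.4f}: {rate_coherence(coherence)}")
print(f"Diversity {diversity:.4f}: meets 0.85 target = {diversity >= 0.85}")
```

By these guidelines the tuned model's coherence lands in the "Good" band, while its diversity falls short of the 0.85 target.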

# GPU Acceleration with NVIDIA cuML

The CPU-based and GPU-accelerated approaches use identical code; the only difference is the <b>%load_ext cuml.accel</b> line at the top of the notebook, which enables GPU acceleration with cuML. This one-line change can significantly accelerate the UMAP and HDBSCAN components of the BERTopic model, even within a single trial. To see the effect, run the code below on both CPU and GPU and compare the execution times.

In [18]:
%%time

from bertopic import BERTopic
import hdbscan
import umap
from datasets import load_dataset

# Load IMDb reviews dataset
dataset = load_dataset("imdb", split="train")
docs = dataset["text"] # List of text documents

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state = 42)
hdbscan_model = hdbscan.HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=True)

# Pass the above models to be used in BERTopic
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model
)
topics, probs = topic_model.fit_transform(docs)

[2025-06-16 15:25:47.289] [CUML] [info] build_algo set to brute_force_knn because random_state is given
CPU times: user 53.7 s, sys: 1.88 s, total: 55.6 s
Wall time: 22.8 s
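
Outside a notebook, where the `%%time` magic is unavailable, the same comparison can be made with a small timing helper. The sketch below uses only the standard library; the `timed` context manager and the placeholder workload are hypothetical. In practice the body of the `with` block would be the `fit_transform` call, run once with and once without the accelerator enabled.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("placeholder workload", timings):
    # Stand-in for topic_model.fit_transform(docs)
    sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Collecting the durations in a dictionary makes it straightforward to record a CPU run and a GPU run under different labels and compare them afterwards.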
