<a target="_blank" href="https://colab.research.google.com/github/rapidsai-community/event-notebooks/blob/main/GoogleNext_2025/TopicModelingAtTheSpeedOfLight.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic modeling at the speed of light

Topic modeling is a type of statistical modeling used to discover abstract topics within a collection of documents. It is widely used in natural language processing (NLP) to uncover hidden thematic structures in large text collections. Traditional topic modeling techniques, such as Latent Dirichlet Allocation (LDA), can be computationally intensive, especially with large datasets. Leveraging GPUs can significantly accelerate the process, making it feasible to handle larger datasets and more complex models.

## Why would you use GPUs?

GPUs (Graphics Processing Units) are designed to handle parallel processing tasks efficiently. They are particularly well-suited for the matrix and vector operations that are common in machine learning and deep learning algorithms. By utilizing GPUs, we can achieve substantial speedups in training and inference times for topic modeling.

Let's get started!

## Setup

First, let's make sure we are running on a runtime with a GPU.

In [None]:
!nvidia-smi

If the above cell returns an error, please make sure that you are using a container or runtime with a GPU attached, then restart.
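
If you would rather have the check fail softly from Python, here is a minimal sketch of the same idea. The helper name `gpu_available` is made up for this example, and it only tests that `nvidia-smi` is on the `PATH` and exits cleanly; it says nothing about how much GPU memory is free.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and exits successfully."""
    # No executable on PATH means the NVIDIA driver tools are not installed.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A non-zero exit code usually means the driver cannot see a GPU.
    return result.returncode == 0


if gpu_available():
    print("GPU detected.")
else:
    print("No GPU detected; attach one and restart the runtime.")
```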

To perform the topic modeling we will use [BERTopic](https://maartengr.github.io/BERTopic/index.html), a widely used NLP (Natural Language Processing) framework built on top of BERT embeddings and designed to provide coherent and natural-sounding topic descriptions. So, let's make sure the package is available in our environment.

In [None]:
!pip install bertopic --quiet

### Imports

RAPIDS provides a new experience that lets you harness GPUs to run code written in pandas or scikit-learn, all *without* changing that code in any meaningful way. The Zero-Code-Change (ZCC) experience runs seamlessly on a GPU with no additional work on the user's part. And whenever an operation is not yet supported on the GPU, the framework falls back to the CPU version of the code without any input from the user!

![test](https://rapids.ai/cudf-pandas/chart.png)

To enable this experience, all you need to do is add these lines at the top of your notebook!

In [None]:
%load_ext cudf.pandas
%load_ext cuml.accel
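
The `%load_ext` magics only work inside IPython or Jupyter. In a plain Python script the same effect can be sketched as below. This assumes that `cudf.pandas.install()` and `cuml.accel.install()` are available in your installed RAPIDS version (please check the RAPIDS docs for your release), and the wrapper `enable_gpu_acceleration` is a made-up helper that falls back to the CPU when they are not.

```python
def enable_gpu_acceleration() -> bool:
    """Try to switch on the RAPIDS accelerators; return True on success."""
    try:
        import cudf.pandas
        import cuml.accel

        # Both calls must run before pandas, scikit-learn, umap or hdbscan
        # are imported, otherwise those imports stay on the CPU versions.
        cudf.pandas.install()
        cuml.accel.install()
    except Exception as exc:  # packages missing or no usable GPU
        print(f"RAPIDS accelerators unavailable, staying on the CPU: {exc}")
        return False
    return True


accelerated = enable_gpu_acceleration()
print(f"GPU acceleration enabled: {accelerated}")
```

Either way, the rest of the code stays exactly the same, which is the whole point of the zero-code-change experience.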

Now that we have the environment set up, we can do our imports.

In [None]:
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

We are using a bunch of frameworks here. Most of these are fairly self-explanatory (who doesn't know pandas?!) and we have already touched upon BERTopic. The remaining frameworks help us with the following:

1. [SentenceTransformer](https://www.sbert.net/) gives us access to a large collection of over 5000 pre-trained models that create the embeddings we will use to train the topic model.
2. [UMAP](https://umap-learn.readthedocs.io/en/latest/) is a state-of-the-art (SOTA) dimensionality reduction tool that is useful with non-linear problems.
3. [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) is a powerful clustering algorithm that extends the density-based DBSCAN algorithm into a hierarchical version (hence the H in the name).

## Download the data
In this example we will be using an Amazon Reviews dataset and focus on the reviews of beauty products.

In [None]:
!wget https://mcauleylab.ucsd.edu/public_datasets/data/amazon_2023/raw/review_categories/All_Beauty.jsonl.gz --no-check-certificate

The file is downloaded to the local drive, so now we can use pandas like we normally would, but all of this code actually runs on the GPU!

In [None]:
path = "All_Beauty.jsonl.gz"
data = pd.read_json(path, lines=True)
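
In case the `lines=True` flag looks unfamiliar: it tells pandas that the file is in JSON Lines format, i.e. one JSON object per line. The sketch below builds a tiny gzip-compressed file in that format using only the standard library. The two reviews and the `demo.jsonl.gz` file name are made up; `rating` and `text` simply mirror columns we use later in this notebook.

```python
import gzip
import json
import os
import tempfile

# Two made-up reviews; `rating` and `text` mirror columns used in this notebook.
records = [
    {"rating": 5, "text": "Great shampoo, smells lovely."},
    {"rating": 2, "text": "Too greasy for my skin."},
]

# Write one JSON object per line into a gzip-compressed file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# Read it back line by line; every non-empty line parses on its own.
with gzip.open(demo_path, "rt", encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]

print(loaded[0]["text"])
```

Calling `pd.read_json(demo_path, lines=True)` on this file should give a two-row DataFrame, since pandas infers the gzip compression from the `.gz` extension.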

You can check this for yourself by running the `nvidia-smi` command; you should see about 1 GB of memory usage on the GPU.

In [None]:
!nvidia-smi

In this particular exercise -- we will only use the first 200k records.

In [None]:
# Limit to e.g., 200K records for demo purposes
N = 200000

data = data.head(N)

Let's have a peek at what the data looks like.

In [None]:
data.head()

We have the rating and additional metadata associated with the review. However, we will be using the `text` column only as we are interested in understanding if we can uncover any patterns in the reviews.

In [None]:
sample_docs = data.text.tolist()
sample_docs[0]

## Let's have some fun!

Now that we have the data to work with -- let's start our main task: the topic modeling. First, we cannot simply pass raw text to the BERTopic model; we need to turn each and every review into a numerical representation -- an embedding. In this notebook we will use the [`all-MiniLM-L6-v2`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model.
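
To get a feel for why embeddings are useful, here is a toy sketch with hand-made 3-dimensional vectors (the real model produces 384 dimensions). The numbers and the review names are invented purely for illustration. The idea is that reviews about similar things end up pointing in similar directions, which we can measure with cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings"; the values are invented.
shampoo_review = [0.9, 0.1, 0.0]
conditioner_review = [0.8, 0.2, 0.1]
lipstick_review = [0.0, 0.2, 0.9]

# Similar reviews score close to 1, unrelated ones close to 0.
print(cosine_similarity(shampoo_review, conditioner_review))
print(cosine_similarity(shampoo_review, lipstick_review))
```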

In [None]:
%%time
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(sample_docs, batch_size=128, show_progress_bar=True)
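
Since encoding is the slowest step so far, it can be worth caching the result so a restarted session can skip it. Below is a minimal sketch using NumPy's `.npy` format. The random array stands in for the real `embeddings`, and the cache path is a made-up temporary location; in practice you would pick a path that survives a runtime restart.

```python
import os
import tempfile

import numpy as np

# Stand-in for the array returned by `sentence_model.encode(...)`.
embeddings_demo = np.random.default_rng(0).random((4, 384), dtype=np.float32)

# Hypothetical cache location; a temp dir is used only to keep the demo tidy.
cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")

np.save(cache_path, embeddings_demo)  # write once, right after encoding
restored = np.load(cache_path)        # reload in a later session

print(restored.shape, np.array_equal(restored, embeddings_demo))
```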

This process may take 2 minutes or so, but we only need to do it once. Now that we have the embeddings, we can train our initial *vanilla* topic model.

In [None]:
%%time
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(sample_docs, embeddings)

Woot! We have now successfully trained a BERTopic model! On a GPU, no less!

Let's explore what we have learned! First, we can quickly discern the most commonly occurring words in each topic (here we only show the top 8 topics).

In [None]:
topic_model.visualize_barchart(top_n_topics=8)

So we can clearly see that each topic groups semantically related words. This is good!

But how many topics are there, you ask!? Well...

In [None]:
print(f'The model identified {len(set(topics))} distinct topics...')

That's a lot... Let's see a distribution of how common each topic was.

In [None]:
pd.DataFrame(topics).hist()
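
If you prefer exact counts over a histogram, a `Counter` does the job. One thing to keep in mind: BERTopic assigns the label `-1` to outlier documents, so `len(set(topics))` counts that label as a topic too. The topic list below is made up to keep the sketch self-contained.

```python
from collections import Counter

# Made-up topic assignments; -1 is the label for outlier documents.
topics_demo = [0, 0, 0, 0, -1, -1, 1, 1, 2, 0, -1, 3]

counts = Counter(topics_demo)
n_outliers = counts.pop(-1, 0)  # set the outliers aside
n_topics = len(counts)          # real topics only

print(f"{n_topics} topics, {n_outliers} outlier documents")
for topic, size in counts.most_common(3):
    print(f"topic {topic}: {size} documents")
```

On the real model, `topic_model.get_topic_info()` returns a similar per-topic count as a DataFrame.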

Well... we're seeing that it's a long-tail distribution and we can likely do better than this. Let's see if there is any similarity between these topics before we proceed to refine them.

In [None]:
topic_model.visualize_heatmap(top_n_topics=100)

So the heatmap clearly shows *regions* of similar / overlapping topics (the bluer areas) and patches of less condensed overlap. This is likely easier to see on the distance map between topics.

In [None]:
topic_model.visualize_topics()

Okay... There are a lot of topics but **many** overlaps!! We can surely do better than that now that we have this knowledge!

## Clustering to the rescue!

We can use UMAP to reduce the dimensionality of our dataset and then apply a clustering model (HDBSCAN) to semantically (since we're working on embeddings!) group some of the reviews into more refined clusters!

And the added benefit -- it all runs on a GPU!!! So no more waiting a long time for the UMAP model alone to finish its job! Running on a GPU gives us the freedom to experiment at lightning speed!

In [None]:
umap_model = UMAP(n_components=15, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=100, gen_min_span_tree=True, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
%time topics, probs = topic_model.fit_transform(sample_docs, embeddings)

topic_model.visualize_topics()

As I was saying... it's quick!

And now we have much more refined topics (and there are only 105 of them!). Let's see if we still have similar words retained in each topic!

In [None]:
topic_model.visualize_barchart(top_n_topics=16)

Unsurprisingly -- we still do. Some topics could still be closely related; e.g., topic 0 and topic 6 may overlap when the reviewer talks about how gentle certain shampoos or conditioners are on the skin and how nice they smell.

Luckily, BERTopic lets us quickly pull up the hierarchy of these topics.

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(sample_docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

Nice. So now we can decide how to further cluster our topics. However, one thing we saw when we first peeked inside the dataset was that there were a ton of short reviews. These reviews are likely skewing our results.

In [None]:
data.head()

Let's see how big of a problem this is for us.

In [None]:
data_word_count = data.text.str.count(r"\w+")  # raw string avoids an invalid-escape warning
data_word_count.head()

Now let's see the stats!

In [None]:
data_word_count.describe()
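
`describe()` gives us the quartiles, but what we really want to know is how much data survives a given minimum length. Here is a small sketch of that calculation on four made-up reviews; `share_kept` is a hypothetical helper, not something from pandas or BERTopic.

```python
import re

# Made-up reviews standing in for `data.text`.
reviews = [
    "good product",
    "Love it!",
    "Works well and smells nice, would buy again for sure, highly recommended.",
    "The conditioner left my hair soft and shiny without feeling heavy at all.",
]

# Same word pattern as in the notebook, applied with the standard library.
word_counts = [len(re.findall(r"\w+", review)) for review in reviews]


def share_kept(counts, min_words):
    """Fraction of documents with at least `min_words` words."""
    return sum(c >= min_words for c in counts) / len(counts)


for threshold in (5, 10, 20):
    print(threshold, share_kept(word_counts, threshold))
```

On the real data the equivalent one-liner would be `(data_word_count >= 10).mean()`.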

Alright... So, we see that the median is 21 words - that's a decent length. However, using it as a cutoff would filter out half of our reviews. A 10-word minimum (which keeps over 70% of our dataset) should be usable, so let's try that.

In [None]:
longer_reviews = data.loc[
    data.text.str.count(r"\w+") >= 10
]

filtered_docs = longer_reviews.text.tolist()
filtered_embeddings = embeddings[longer_reviews.index]

filtered_embeddings.shape
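
A quick note on `embeddings[longer_reviews.index]`: this works because `data` still has its default 0..N-1 index, so the index labels double as row positions in the embeddings array. If the DataFrame had been shuffled or re-indexed, selecting by position would be the safer route. Here is a sketch of that on made-up numbers.

```python
import numpy as np

# Four made-up word counts and a matching 4 x 2 "embedding" array.
word_counts_demo = np.array([2, 12, 3, 15])
embeddings_small = np.arange(8, dtype=np.float32).reshape(4, 2)

mask = word_counts_demo >= 10     # boolean mask, one entry per review
kept_rows = np.flatnonzero(mask)  # row positions of the surviving reviews

# Select the embedding rows that belong to the surviving reviews.
filtered = embeddings_small[kept_rows]
print(kept_rows, filtered.shape)
```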

Good! We still have almost 150k reviews left but we are now guaranteed that these reviews convey a little bit more information than a simple -- "good product".

In [None]:
topic_model_long = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
%time topics, probs = topic_model_long.fit_transform(filtered_docs, filtered_embeddings)

How many topics are we seeing now?

In [None]:
print(f'Revised number of topics: {len(set(topics))}')

In [None]:
topic_model_long.visualize_topics()

Okay, this looks even better than before! Let's check the word frequencies.

In [None]:
topic_model_long.visualize_barchart(top_n_topics=16)

## Summary

In the end, in just a few minutes, we have well-defined clusters of topics that the reviewers of beauty products cared to share with us. Thanks to the power of the GPU, we were able to quickly sift through 200k reviews and come up with up to 70 clearly delineated topics that we can further group into broader themes using the topic hierarchy.

And we were able to achieve *all* of this without changing any code that would have run for hours on a CPU.

The power of GPUs gives you the power to build better models faster!

Try it out for yourself!