# Topic Modelling using BERTopic & cuBERTopic

This sample notebook demonstrates cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents. We run the same operations with `BERTopic` to compare their behaviour.

## Quick Start
In both cases, we extract topics from the well-known 20 Newsgroups dataset, loaded through `sklearn`, which consists of English documents.

### Installing relevant packages
We also need to install the dependencies for `BERTopic`, because we compare its performance with that of `cuBERTopic`.

`cuBERTopic` runs on `cudf` and `cuml`, which can be installed using the instructions at https://rapids.ai/start.html, and on `cupy`, which can be installed from https://docs.cupy.dev/en/stable/install.html

More detailed instructions are in the README.
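
Before running the cells, it can help to confirm that the packages are importable in the current environment. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the list simply mirrors the modules imported in this notebook.

```python
import importlib.util

# Import names used in this notebook (an illustrative list, adjust as needed).
REQUIRED = ["bertopic", "sklearn", "transformers", "torch",
            "cuBERTopic", "cudf", "cuml", "cupy"]

def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

An empty result only means the modules can be located; it does not verify that a compatible GPU or CUDA runtime is present.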

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from transformers import AutoTokenizer, AutoModel
import torch
from cuBERTopic import gpu_BERTopic

docs = fetch_20newsgroups(subset='all')['data']

### Running `BERTopic`
`BERTopic` accepts custom embeddings, so we create sentence embeddings using `AutoTokenizer` followed by `AutoModel` from `transformers` and pass them to the `fit_transform` method of the `BERTopic` class, which fits the models on a collection of documents, generates topics, and returns the topic assigned to each document together with its probability.
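
The sentence embeddings are obtained by averaging token embeddings while ignoring padding positions. The toy example below is a sketch of that masked average in NumPy rather than PyTorch, with made-up numbers, to show why the attention mask matters: the padded token in the first sequence carries large values but does not affect the result.

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, skipping padding."""
    mask = attention_mask[..., None].astype(float)    # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Two sequences of three tokens with 2-dimensional embeddings (made-up values).
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third token is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pooling(token_embeddings, attention_mask)
print(pooled)
```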

In [2]:
%%time
# Load the tokenizer from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# Tokenize sentences
encoded_input = tokenizer(
    docs,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[
        0
    ]  # First element of model_output contains all token embeddings
    input_mask_expanded = (
        attention_mask
        .unsqueeze(-1)
        .expand(token_embeddings.size())
        .float()
    )
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(
    model_output,
    encoded_input["attention_mask"]
)
sentence_embeddings = sentence_embeddings.to('cpu').numpy()
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, sentence_embeddings)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
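
The warning above comes from the `tokenizers` library when the process is forked after the tokenizer has already used parallelism. As the message itself suggests, one way to silence it is to set the environment variable explicitly. A minimal sketch, assuming it is placed in the first cell of the notebook, before the tokenizer is used:

```python
import os

# Set before the tokenizer is first used in this process;
# "false" disables tokenizer parallelism and avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```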

`get_topic_info` returns information about each topic, including its id, frequency, and name.

In [3]:
%%time
topic_model.get_topic_info()

CPU times: user 4.08 ms, sys: 0 ns, total: 4.08 ms
Wall time: 3.29 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5843 | -1_information_system_email_anyone |
| 1 | 0 | 775 | 0_newsgroup_ripem_risc_address |
| 2 | 1 | 459 | 1_monitor_card_video_drivers |
| 3 | 2 | 265 | 2_gun_guns_firearms_militia |
| 4 | 3 | 166 | 3_bike_bikes_honda_motorcycle |
| ... | ... | ... | ... |
| 363 | 369 | 11 | 369_clutch_sabo_nonclutch_samuel |
| 377 | 376 | 10 | 376_mormon_temple_ceremonies_temples |
| 378 | 377 | 10 | 377_w4wg_lan_workgroups_workplace |
| 379 | 378 | 10 | 378_xman_xkernel_aranalabeines_sadie |
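
The row with topic id `-1` groups the documents that were not assigned to any topic. The sketch below shows one way to measure that share from per-document assignments such as those returned by `fit_transform`. The short `topics` list is a made-up stand-in for `topics_cpu`, which is assumed here to be a list with one integer topic id per document.

```python
from collections import Counter

# Made-up stand-in for topics_cpu: one topic id per document, -1 = outlier.
topics = [0, -1, 2, -1, 1, 0, -1, 3]

counts = Counter(topics)
outlier_share = counts[-1] / len(topics)

# Topic sizes with the outlier bucket removed.
assigned = {topic: n for topic, n in counts.items() if topic != -1}

print(f"Outlier share: {outlier_share:.1%}")
print("Documents per topic:", assigned)
```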


`get_topic` returns the top n words of a topic together with their c-TF-IDF scores.

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [4]:
%%time
topic_model.get_topic(0)

CPU times: user 10 µs, sys: 5 µs, total: 15 µs
Wall time: 24.3 µs


[('newsgroup', 0.002335821068308015),
 ('ripem', 0.0022073291862053524),
 ('risc', 0.0019953134707877745),
 ('address', 0.0017269943863598774),
 ('computer', 0.001567019885977901),
 ('email', 0.0015440261562222233),
 ('group', 0.001505439992869189),
 ('list', 0.001489374021387911),
 ('widget', 0.0014891264945428434),
 ('please', 0.0014204805941597995)]
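
The scores above are c-TF-IDF weights. As a rough illustration, the sketch below follows the formulation commonly given for BERTopic, where all documents of a topic are treated as a single class: the term frequency within the class is multiplied by `log(1 + A / f)`, with `A` the average number of words per class and `f` the frequency of the word across all classes. Treat it as an approximation under that assumption, not as the library's exact implementation; the counts and vocabulary are made up.

```python
import numpy as np

def c_tf_idf(counts):
    """Class-based TF-IDF for a (classes x vocabulary) count matrix."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)   # per-class term frequency
    avg_words_per_class = counts.sum(axis=1).mean()   # A
    word_freq = counts.sum(axis=0)                    # f, per word
    idf = np.log(1.0 + avg_words_per_class / word_freq)
    return tf * idf

vocab = ["gun", "bike", "please"]
counts = [
    [4, 0, 4],  # topic 0: "gun" is specific, "please" appears everywhere
    [0, 4, 4],  # topic 1: "bike" is specific
]

scores = c_tf_idf(counts)
for topic_id, row in enumerate(scores):
    print(topic_id, vocab[int(row.argmax())])
```

Words that are frequent in one topic but rare elsewhere receive the highest weight, which is why generic words such as "please" rank lower than topic-specific ones.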

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: `docs` is passed in as a collection of strings to model. Here, instead of `AutoTokenizer` from `transformers`, we use `SubwordTokenizer` from `cuDF` in combination with `AutoModel` from `transformers`.

Due to the stochastic nature of UMAP, the results might differ from run to run, and the quality can degrade.

In [5]:
%%time
gpu_topic = gpu_BERTopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

UMAP:  0.7581701278686523
Label prop iterations: 23
Label prop iterations: 7
Label prop iterations: 5
Label prop iterations: 3
Label prop iterations: 3
Label prop iterations: 3
Iterations: 6
2616,172,371,19,316,1419
Label prop iterations: 3
Label prop iterations: 2
Iterations: 2
1040,68,153,6,100,189
Label prop iterations: 2
Iterations: 1
976,45,101,4,54,91
HDBSCAN:  0.4366781711578369
Topic creation:  4.247727632522583
time for topic representation:  8.51262354850769
CPU times: user 32.9 s, sys: 7.35 s, total: 40.3 s
Wall time: 32.9 s


In [6]:
%%time
gpu_topic.get_topic_info()

CPU times: user 18 ms, sys: 0 ns, total: 18 ms
Wall time: 16.3 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6406 | -1_information_email_file_anyone |
| 112 | 0 | 377 | 0_monitor_card_video_drivers |
| 260 | 1 | 283 | 1_gun_guns_firearms_weapons |
| 311 | 2 | 230 | 2_clipper_chip_encryption_key |
| 300 | 3 | 153 | 3_drive_drives_disk_ide |
| ... | ... | ... | ... |
| 324 | 389 | 10 | 389_hunting_rkba_neighborhoods_deer |
| 360 | 390 | 10 | 390_wings_murray_gm_detroit |
| 385 | 391 | 10 | 391_moon_luna_coffman_lunar |
| 386 | 392 | 10 | 392_ampere_amp_db_ohmite |


In [7]:
%%time
gpu_topic.get_topic(0)

CPU times: user 119 ms, sys: 4.81 ms, total: 124 ms
Wall time: 123 ms


[('monitor', array(0.01677312)),
 ('card', array(0.0151612)),
 ('video', array(0.0124634)),
 ('drivers', array(0.01139202)),
 ('vga', array(0.01042369)),
 ('monitors', array(0.00925574)),
 ('diamond', array(0.00856024)),
 ('ati', array(0.00840078)),
 ('vesa', array(0.00724637)),
 ('driver', array(0.0068034))]
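
Because topic ids are not stable across runs or implementations, comparing the two models by id is misleading: the monitor/video topic is topic 1 in the `BERTopic` run and topic 0 in the `cuBERTopic` run. One simple way to line topics up is by keyword overlap. The sketch below uses the Jaccard similarity of the keywords embedded in the topic names, with the names copied from the two `get_topic_info` outputs above; the helper names are illustrative.

```python
def keywords(topic_name):
    """Split a name such as '1_gun_guns' into its keyword set, dropping the id."""
    return set(topic_name.split("_")[1:])

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

cpu_names = [
    "0_newsgroup_ripem_risc_address",
    "1_monitor_card_video_drivers",
    "2_gun_guns_firearms_militia",
    "3_bike_bikes_honda_motorcycle",
]
gpu_names = [
    "0_monitor_card_video_drivers",
    "1_gun_guns_firearms_weapons",
    "2_clipper_chip_encryption_key",
    "3_drive_drives_disk_ide",
]

# For each CPU topic, keep the best-overlapping GPU topic (if any overlap exists).
matches = {}
for cpu in cpu_names:
    best = max(gpu_names, key=lambda gpu: jaccard(keywords(cpu), keywords(gpu)))
    score = jaccard(keywords(cpu), keywords(best))
    if score > 0:
        matches[cpu] = (best, score)

for cpu, (gpu, score) in matches.items():
    print(f"{cpu} <-> {gpu}: {score:.2f}")
```

Only the few topics shown in the truncated tables are compared here, so unmatched topics in this sketch may still have a counterpart among the rows that were elided.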