# Topic Modelling using BERTopic & cuBERTopic

This sample notebook demonstrates cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents.

### Installing relevant packages
We also need to install the dependencies for `BERTopic`, since we compare its performance with that of `cuBERTopic`.

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from cuBERTopic import gpu_bertopic

docs = fetch_20newsgroups(subset='all')['data']

### Running `BERTopic`
`BERTopic` accepts custom embeddings, so we create sentence embeddings with a `SentenceTransformer` model and pass them to the `fit_transform` method of the `BERTopic` class, which fits the model on a collection of documents, generates topics, and returns the topic assigned to each document together with its probability.

In [2]:
%%time
model_sbert = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model_sbert.encode(
    docs,
    show_progress_bar=True,
    batch_size=64,
    convert_to_numpy=True,
)
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, embeddings)

Batches:   0%|          | 0/295 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
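
The `huggingface/tokenizers` warning in this output suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is acceptable to disable tokenizer parallelism for this run, is to set the variable at the top of the notebook, before the tokenizer is first used:

```python
import os

# Set this before the first call that uses the tokenizer (e.g. before encoding),
# so the library does not have to disable parallelism itself after a fork.
# "false" is an assumption for this notebook; "true" is the other accepted value.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

This only silences the warning; it is not expected to change the resulting topics.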

`get_topic_info` returns information about each topic, including its id, frequency, and name.

In [3]:
%%time
topic_model.get_topic_info()

CPU times: user 3.94 ms, sys: 153 µs, total: 4.09 ms
Wall time: 3.29 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6081 | -1_program_email_file_available |
| 1 | 0 | 855 | 0_baseball_game_team_year |
| 2 | 1 | 402 | 1_gun_guns_militia_firearms |
| 3 | 2 | 299 | 2_address_os_internet_lyme |
| 4 | 3 | 157 | 3_amp_sale_stereo_sony |
| ... | ... | ... | ... |
| 338 | 346 | 11 | 346_timing_ultralong_timer_snow |
| 337 | 347 | 11 | 347_olwm_editres_openlook_xterminal |
| 336 | 348 | 11 | 348_jgfootminervacisyaleedu_encryptiononly_por... |
| 335 | 342 | 11 | 342_level_software_wingert_process |
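
Topic `-1` is the label `BERTopic` uses for outliers, that is, documents not assigned to any topic. As a rough sketch of how to read the table, the snippet below computes the share of outliers from the counts shown in this output. The rows are copied by hand, and the total of 18846 documents is an assumption (the size of `fetch_20newsgroups(subset='all')`); in a live session the sum of the `Count` column could be used instead.

```python
# (Topic, Count) pairs copied from the first rows of the CPU get_topic_info() output.
shown_rows = [(-1, 6081), (0, 855), (1, 402), (2, 299), (3, 157)]

# Assumption: total number of documents in the 'all' subset of 20 Newsgroups.
N_DOCS = 18846

counts = dict(shown_rows)
outlier_share = counts[-1] / N_DOCS        # documents left without a topic
largest_topic_share = counts[0] / N_DOCS   # documents in the largest topic

print(f"outliers:      {outlier_share:.1%}")
print(f"largest topic: {largest_topic_share:.1%}")
```

Under these assumptions roughly a third of the documents are treated as outliers, which is worth keeping in mind when comparing topic sizes between runs.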


`get_topic` returns the top n words of a topic together with their c-TF-IDF scores.

In [4]:
%%time
topic_model.get_topic(0)

CPU times: user 11 µs, sys: 0 ns, total: 11 µs
Wall time: 19.1 µs


[('baseball', 0.006877399121173026),
 ('game', 0.005831486240381638),
 ('team', 0.005720180807388458),
 ('year', 0.00546482449813302),
 ('players', 0.0054403405659096795),
 ('braves', 0.005221678217344607),
 ('hit', 0.00501477478125479),
 ('games', 0.004965242396763328),
 ('runs', 0.0047352143773061635),
 ('pitching', 0.004480412343762814)]
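
To make the c-TF-IDF scores in this output more concrete, here is a minimal sketch of one common formulation, `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the frequency of term `t` in topic `c` normalised by the topic's length, `f(t)` is the frequency of `t` across all topics, and `A` is the average number of words per topic. This is an assumption about the variant; the exact weighting in the installed `BERTopic` version may differ. The `c_tf_idf` helper and the toy tokens are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per topic; class_tokens maps topic id -> all tokens of that topic.

    Hypothetical helper, not part of BERTopic or cuBERTopic.
    """
    tf = {topic: Counter(tokens) for topic, tokens in class_tokens.items()}

    # f(t): frequency of each term across all topics.
    total = Counter()
    for term_counts in tf.values():
        total.update(term_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(class_tokens)

    scores = {}
    for topic, term_counts in tf.items():
        n_words = sum(term_counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in term_counts.items()
        }
    return scores


# Toy data: each topic is the concatenation of its documents' tokens.
toy = {
    0: "baseball game team game pitching".split(),
    1: "gun firearms militia gun game".split(),
}
scores = c_tf_idf(toy)

# Top 3 words per topic, highest score first (similar in spirit to get_topic).
top_words = {
    topic: sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    for topic, term_scores in scores.items()
}
print(top_words)
```

In the toy data, `game` appears in both topics, so in topic 1 it scores lower than `firearms` even though both occur once there: terms shared across topics are down-weighted.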

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: `docs` is passed in as a collection of strings to model. A `SentenceTransformer` model is used by default in this case.

In [5]:
%%time
gpu_topic = gpu_bertopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

Label prop iterations: 23
Label prop iterations: 9
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 4
Label prop iterations: 2
Iterations: 6
4872,173,368,23,321,1482
Label prop iterations: 3
Label prop iterations: 2
Iterations: 2
3587,67,152,7,116,220
Label prop iterations: 2
Iterations: 1
3453,45,102,5,58,104
CPU times: user 3min 25s, sys: 14.5 s, total: 3min 39s
Wall time: 34.7 s


In [6]:
%%time
gpu_topic.get_topic_info()

<class 'list'>
<class 'cudf.core.series.Series'>
CPU times: user 14 ms, sys: 4.19 ms, total: 18.2 ms
Wall time: 16.7 ms


| | Topic | Count | Name |
|---|---|---|---|
| 182 | -1 | 5941 | -1_email_file_program_version |
| 61 | 0 | 713 | 0_baseball_game_year_players |
| 48 | 1 | 663 | 1_clipper_encryption_key_chip |
| 11 | 2 | 380 | 2_gun_guns_firearms_militia |
| 133 | 3 | 380 | 3_scsi_drive_ide_drives |
| ... | ... | ... | ... |
| 246 | 339 | 10 | 339_tank_fj11001200_tankbag_bag |
| 257 | 340 | 10 | 340_audio_relays_switching_clicks |
| 292 | 341 | 10 | 341_manned_lunar_exploration_crystal |
| 319 | 342 | 10 | 342_convex_corp_visserconvexcom_visser |


In [7]:
%%time
gpu_topic.get_topic(0)

CPU times: user 10 µs, sys: 1e+03 ns, total: 11 µs
Wall time: 19.6 µs


[('baseball', array(0.00701936)),
 ('game', array(0.00624271)),
 ('year', array(0.00587031)),
 ('players', array(0.00562376)),
 ('runs', array(0.00530307)),
 ('team', array(0.00527507)),
 ('hit', array(0.00525333)),
 ('games', array(0.00522359)),
 ('morris', array(0.00483839)),
 ('pitching', array(0.00475111))]
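
One simple way to compare the two runs is the overlap of their top words for topic 0. The sketch below copies the words from the two `get_topic(0)` outputs shown in this notebook and computes their Jaccard similarity; the variable names are illustrative. Topic ids are not guaranteed to refer to the same topic across runs, so this comparison assumes, as the outputs suggest, that topic 0 is the baseball topic in both.

```python
# Top words of topic 0, copied from the CPU (BERTopic) output.
cpu_words = ["baseball", "game", "team", "year", "players",
             "braves", "hit", "games", "runs", "pitching"]

# Top words of topic 0, copied from the GPU (cuBERTopic) output.
gpu_words = ["baseball", "game", "year", "players", "runs",
             "team", "hit", "games", "morris", "pitching"]

shared = set(cpu_words) & set(gpu_words)
union = set(cpu_words) | set(gpu_words)
jaccard = len(shared) / len(union)

only_cpu = set(cpu_words) - set(gpu_words)
only_gpu = set(gpu_words) - set(cpu_words)

print(f"shared words: {len(shared)} of {len(union)} (Jaccard = {jaccard:.2f})")
print(f"only in CPU run: {sorted(only_cpu)}")
print(f"only in GPU run: {sorted(only_gpu)}")
```

For these two outputs, nine of the ten words coincide; only `braves` (CPU) and `morris` (GPU) differ.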