# Topic Modelling using BERTopic & cuBERTopic

A sample notebook demonstrating cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents. We run the same operations with `BERTopic` to compare their behaviour.

## Quick Start
In both cases, we extract topics from the well-known 20 newsgroups dataset, loaded via `sklearn`, which consists of English documents.

### Installing relevant packages
We need to install the dependencies for `BERTopic` as well, since we compare its performance with that of `cuBERTopic`.

`cuBERTopic` runs on `cudf` and `cuml`, which can be installed by following the instructions at https://rapids.ai/start.html, and on `cupy`, which can be installed from https://docs.cupy.dev/en/stable/install.html
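
Before running the cells below, it can help to confirm which of these packages are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name `check_packages` is made up for illustration, and the list of module names is an assumption based on the usual import names of these libraries.

```python
from importlib.util import find_spec

# Top-level import names assumed for the packages used in this notebook.
REQUIRED = ["bertopic", "sentence_transformers", "sklearn", "cudf", "cuml", "cupy"]


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:24s} {'ok' if found else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check is cheap and does not initialize the GPU.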

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from cuBERTopic import gpu_BERTopic

docs = fetch_20newsgroups(subset='all')['data']

### Running `BERTopic`
`BERTopic` accepts custom embeddings, so we create sentence embeddings with a `SentenceTransformer` model and pass them to the `fit_transform` method of the `BERTopic` class. This method fits the model on a collection of documents, generates topics, and returns the topic assignments (`topics_cpu`) together with their probabilities (`probs_cpu`).

In [2]:
%%time
model_sbert = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model_sbert.encode(
    docs,
    show_progress_bar=True,
    batch_size=64,
    convert_to_numpy=True,
)
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, embeddings)

Batches:   0%|          | 0/295 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
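
The `huggingface/tokenizers` warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the variable is set near the top of the notebook, before the tokenizer is first used:

```python
import os

# Set this before the tokenizer is first used (ideally in the first cell).
# "false" disables tokenizer parallelism, which the warning says happens
# anyway after the fork; setting it explicitly should silence the message.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```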

`get_topic_info` returns information about each topic, including its ID, frequency, and name.

In [3]:
%%time
topic_model.get_topic_info()

CPU times: user 4.26 ms, sys: 232 µs, total: 4.5 ms
Wall time: 3.62 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5834 | -1_program_email_file_information |
| 1 | 0 | 471 | 0_scsi_drive_ide_drives |
| 2 | 1 | 382 | 1_gun_guns_firearms_militia |
| 3 | 2 | 264 | 2_address_mailing_lyme_internet |
| 4 | 3 | 216 | 3_clayton_cramer_gay_homosexual |
| ... | ... | ... | ... |
| 360 | 359 | 10 | 359_image_databases_processing_imaging |
| 361 | 360 | 10 | 360_thyroid_thyroxin_deficiency_thyroidal |
| 362 | 361 | 10 | 361_infallible_conscience_boundary_confident |
| 363 | 362 | 10 | 362_hpgl_polytechnique_povray_ecole |
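
The `-1` row in the table collects the documents that were not assigned to any topic. To see how large that group is relative to the whole corpus, the share of outliers can be computed directly from the list of topic assignments. This is a small sketch on toy data; `outlier_share` is a hypothetical helper, not part of `BERTopic`.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Toy assignments, not the notebook's real output.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
print(f"{outlier_share(toy_topics):.1%} of documents are outliers")
```

In this notebook the same function could be applied to `topics_cpu`, assuming it is a plain list of integer topic IDs.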


`get_topic` returns the top n words of a topic together with their c-TF-IDF scores.

Topic -1 contains all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [4]:
%%time
topic_model.get_topic(0)

CPU times: user 11 µs, sys: 0 ns, total: 11 µs
Wall time: 19.6 µs


[('scsi', 0.0220991978158162),
 ('drive', 0.02199492294325124),
 ('ide', 0.018172151108964266),
 ('drives', 0.015052401784791796),
 ('controller', 0.013235501469234521),
 ('disk', 0.010827320352523567),
 ('bus', 0.009073747528564382),
 ('hard', 0.008638074004252863),
 ('scsi2', 0.008222033377157038),
 ('isa', 0.00745744294758776)]
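
To give a feel for where scores like these come from, here is a rough pure-Python sketch of a class-based TF-IDF. It assumes the weighting tf(word, class) × log(1 + A / f(word)), where A is the average number of words per class and f(word) is the word's frequency across all classes, which is how the BERTopic documentation describes c-TF-IDF as far as I understand it. The library itself works on sparse matrices and may differ in detail, and the function names and toy data below are illustrative only.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score each word per class: tf(word, class) * log(1 + A / f(word)).

    class_tokens maps a class (topic) id to the list of tokens from all
    documents in that class, concatenated.
    """
    if not class_tokens:
        return {}
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # f(word): frequency of the word across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring (word, score) pairs for one topic."""
    return sorted(scores[topic].items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy classes, not the notebook's data.
toy = {
    0: "scsi drive ide drive controller the the".split(),
    1: "gun guns firearms gun the the".split(),
}
print(top_words(c_tf_idf(toy), 0))
```

Words that are frequent within one class but rare elsewhere (here `drive` or `gun`) score higher than words that are spread across classes (here `the`).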

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: `docs` is passed in as a collection of strings to model. A `SentenceTransformer` model is used by default in this case.

Due to the stochastic nature of UMAP, the results may differ between runs, and the quality can degrade.

In [5]:
%%time
gpu_topic = gpu_BERTopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

Label prop iterations: 23
Label prop iterations: 8
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 3
Label prop iterations: 2
Iterations: 6
4847,197,385,17,361,1512
Label prop iterations: 3
Label prop iterations: 2
Iterations: 2
4207,72,158,8,119,236
Label prop iterations: 2
Iterations: 1
3488,48,107,5,62,110
CPU times: user 3min 52s, sys: 9.56 s, total: 4min 2s
Wall time: 34.8 s


In [6]:
%%time
gpu_topic.get_topic_info()

CPU times: user 9.56 ms, sys: 7.88 ms, total: 17.4 ms
Wall time: 16 ms


| | Topic | Count | Name |
|---|---|---|---|
| 183 | -1 | 6173 | -1_file_email_available_information |
| 279 | 0 | 841 | 0_baseball_game_team_year |
| 4 | 1 | 430 | 1_gun_guns_firearms_militia |
| 48 | 2 | 368 | 2_scsi_drive_ide_drives |
| 254 | 3 | 227 | 3_armenian_turkish_armenians_armenia |
| ... | ... | ... | ... |
| 275 | 340 | 10 | 340_christian_bible_reading_book |
| 301 | 341 | 10 | 341_alarm_alarms_viper_sensor |
| 317 | 342 | 10 | 342_habs_roy_hextall_goal |
| 320 | 343 | 10 | 343_depression_dariceyoyoccmonasheduau_sex_rice |
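
Topic IDs are not stable across the two runs: the `scsi` topic is topic 0 in the `BERTopic` output but topic 2 here. One simple way to line topics up is to compare the words in their names. The sketch below does this with Jaccard similarity on a handful of names copied from the two outputs; the helper functions are hypothetical and not part of either library.

```python
def name_words(name):
    """Split a topic name such as '0_scsi_drive_ide_drives' into its words."""
    return set(name.split("_")[1:])


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(name, candidates):
    """Return the candidate name whose words overlap most with `name`."""
    words = name_words(name)
    return max(candidates, key=lambda other: jaccard(words, name_words(other)))


# Topic names copied from the two get_topic_info() outputs in this notebook.
cpu_names = ["0_scsi_drive_ide_drives", "1_gun_guns_firearms_militia"]
gpu_names = [
    "0_baseball_game_team_year",
    "1_gun_guns_firearms_militia",
    "2_scsi_drive_ide_drives",
    "3_armenian_turkish_armenians_armenia",
]
for name in cpu_names:
    print(name, "->", best_match(name, gpu_names))
```

Matching on the four name words is coarse; comparing the full top-n word lists from `get_topic` would give a finer signal at the cost of a little more code.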


In [7]:
%%time
gpu_topic.get_topic(0)

CPU times: user 14 µs, sys: 1 µs, total: 15 µs
Wall time: 22.4 µs


[('baseball', array(0.00702754)),
 ('game', array(0.00605834)),
 ('team', array(0.00581931)),
 ('year', array(0.00555358)),
 ('players', array(0.00547964)),
 ('braves', array(0.00530604)),
 ('games', array(0.00510702)),
 ('hit', array(0.00508724)),
 ('runs', array(0.00478629)),
 ('pitching', array(0.00455138))]
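
Unlike the `BERTopic` output, the scores here print as zero-dimensional arrays rather than plain floats. If plain numbers are easier to work with downstream, they can be unwrapped with `float()`. The sketch uses NumPy 0-d arrays as a stand-in; the output does not show whether `cuBERTopic` returns NumPy or CuPy arrays, so treat the array type as an assumption. `to_plain_scores` is a hypothetical helper.

```python
import numpy as np


def to_plain_scores(topic):
    """Convert (word, 0-d array) pairs into (word, float) pairs."""
    return [(word, float(score)) for word, score in topic]


# First three entries of the GPU topic 0 output, rebuilt with NumPy.
gpu_topic_0 = [
    ("baseball", np.array(0.00702754)),
    ("game", np.array(0.00605834)),
    ("team", np.array(0.00581931)),
]
print(to_plain_scores(gpu_topic_0))
```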