# Topic Modelling using BERTopic & cuBERTopic

This sample notebook demonstrates cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents. We run the same operations with `BERTopic` to compare their behaviour.

## Quick Start
In both cases, we start by extracting topics from the well-known 20 newsgroups dataset, loaded through `sklearn`, which consists of English documents.

### Installing relevant packages
Here we need to install the dependencies for `BERTopic` as well, since we compare its performance with that of `cuBERTopic`.

`cuBERTopic` runs on `cudf` and `cuml`, which can be installed by following the instructions at https://rapids.ai/start.html, and on `cupy`, which can be installed from https://docs.cupy.dev/en/stable/install.html

More detailed instructions are in the README.
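
Before running the cells below, it can help to confirm that the packages are importable. The following is a minimal sketch: the helper name `missing_packages` is hypothetical, and the package list is an assumption that simply mirrors the libraries named in this notebook (the `cuBERTopic` module itself is left out, since how it is installed depends on your setup).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


# Top-level packages used in this notebook (an assumption based on the
# import cell and the install notes; adjust the list to your environment).
required = ["bertopic", "sklearn", "transformers", "torch", "cudf", "cuml", "cupy"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only looks the package up, so the check does not load any GPU libraries.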

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from transformers import AutoTokenizer, AutoModel
import torch
from cuBERTopic import gpu_BERTopic

docs = fetch_20newsgroups(subset='all')['data']



### Running `BERTopic`
`BERTopic` lets us supply custom embeddings, so we create sentence embeddings using `AutoTokenizer` followed by `AutoModel` from `transformers` and pass them to the `fit_transform` method of the `BERTopic` class, which fits the model on a collection of documents, generates topics, and returns the topic assigned to each document along with its probability.
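
The pooling step in the next cell averages token vectors while ignoring padded positions. As a toy illustration in plain Python (the function name `mean_pool` and the numbers are made up for this example; the notebook itself does this with `torch` tensors over whole batches):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions.

    token_embeddings: list of equal-length vectors, one per token position.
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding row, as torch.clamp(..., min=1e-9) does.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Toy sentence: two real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padded vector `[100.0, 100.0]` has no effect on the result, which is the point of multiplying by the attention mask before summing.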

In [2]:
%%time
# Load the tokenizer from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# Tokenize sentences
encoded_input = tokenizer(
    docs,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask
        .unsqueeze(-1)
        .expand(token_embeddings.size())
        .float()
    )
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(
    model_output,
    encoded_input["attention_mask"]
)
sentence_embeddings = sentence_embeddings.to('cpu').numpy()
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, sentence_embeddings)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
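
The warning above names its own remedy. A minimal sketch, assuming it runs before the tokenizer is first used in the process (for instance at the top of the import cell):

```python
import os

# Set the variable named in the warning before the tokenizer is first used,
# so that later process forks no longer trigger the message.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"true"` instead also silences the message, but keeps the deadlock risk that the warning describes.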

`get_topic_info` returns information about each topic, including its id, frequency, and name.

In [3]:
%%time
topic_model.get_topic_info()

CPU times: user 4.3 ms, sys: 86 µs, total: 4.38 ms
Wall time: 3.58 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5738 | -1_db_drive_program_file |
| 1 | 0 | 367 | 0_gun_guns_firearms_weapons |
| 2 | 1 | 227 | 1_sy_rh_manta_reserve |
| 3 | 2 | 179 | 2_dos_windows_swap_memory |
| 4 | 3 | 165 | 3_bike_dod_motorcycle_bikes |
| ... | ... | ... | ... |
| 382 | 393 | 11 | 393_timing_timer_ultralong_snow |
| 381 | 396 | 11 | 396_wheelie_shaftdrive_xlyxvax5citcornelledu_w... |
| 402 | 401 | 10 | 401_fpu_apple_brochures_c650 |
| 403 | 402 | 10 | 402_language_tounges_gifted_tongues |


`get_topic` returns the top n words of a topic together with their c-TF-IDF scores.

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [4]:
%%time
topic_model.get_topic(0)

CPU times: user 10 µs, sys: 9 µs, total: 19 µs
Wall time: 31.5 µs


[('gun', 0.014857341362931618),
 ('guns', 0.008647026450956679),
 ('firearms', 0.006604684982692749),
 ('weapons', 0.005635420912832051),
 ('militia', 0.005310199595599718),
 ('amendment', 0.004533582499531278),
 ('weapon', 0.0044622578247057905),
 ('control', 0.0044088484956494485),
 ('handgun', 0.00429303387467014),
 ('firearm', 0.004227456523340576)]
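
To give a feel for where scores like these come from, here is a small sketch of class-based TF-IDF in plain Python. It assumes the formulation described in the BERTopic documentation, where a word's frequency within a topic (normalized by topic size) is multiplied by `log(1 + A / f)`, with `A` the average number of words per topic and `f` the word's frequency across all topics. The exact weighting may differ between library versions and in `cuBERTopic`; the function names and the toy data are made up for this example.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score each word per class with a class-based TF-IDF.

    class_docs maps a class (topic) id to the list of words from all
    documents assigned to that class, already tokenized.
    """
    counts = {c: Counter(words) for c, words in class_docs.items()}
    # f: frequency of each word across all classes.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            word: (tf / n_words) * math.log(1 + avg_words / total[word])
            for word, tf in counter.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring (word, score) pairs of one topic."""
    ranked = sorted(scores[topic].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Toy corpus: each topic's documents are already merged into one word list.
toy = {
    0: "gun guns gun law the the".split(),
    1: "bike ride bike the the road".split(),
}
scores = c_tf_idf(toy)
print(top_words(scores, 0, n=2))
```

In the toy data, "the" is as frequent as "gun" inside topic 0 but also appears in topic 1, so its score drops below that of the topic-specific words.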

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: we pass in `docs` as a set of strings to model. Here, instead of `AutoTokenizer` from `transformers`, we use `SubwordTokenizer` from `cuDF` in combination with `AutoModel` from `transformers`.

Due to the stochastic nature of UMAP, the results might differ between runs and the quality can degrade.

In [5]:
%%time
gpu_topic = gpu_BERTopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

CPU times: user 33.2 s, sys: 24.3 s, total: 57.5 s
Wall time: 57.7 s


In [6]:
%%time
gpu_topic.get_topic_info()

CPU times: user 21.6 ms, sys: 1.44 ms, total: 23 ms
Wall time: 21.2 ms


| | Topic | Count | Name |
|---|---|---|---|
| 206 | -1 | 6182 | -1_file_drive_system_please |
| 210 | 0 | 643 | 0_clipper_encryption_chip_key |
| 326 | 1 | 431 | 1_monitor_card_video_drivers |
| 334 | 2 | 303 | 2_car_cars_toyota_engine |
| 266 | 3 | 171 | 3_printer_deskjet_print_hp |
| ... | ... | ... | ... |
| 297 | 377 | 10 | 377_oto_templars_reuss_oriental |
| 318 | 378 | 10 | 378_2600_atari_tia_4k |
| 319 | 379 | 10 | 379_tv_exploding_tube_prasad |
| 335 | 380 | 10 | 380_alarm_alarms_viper_sensor |


In [7]:
%%time
gpu_topic.get_topic(0)

CPU times: user 13 µs, sys: 13 µs, total: 26 µs
Wall time: 38.9 µs


[('clipper', array(0.01191998)),
 ('encryption', array(0.01146954)),
 ('chip', array(0.00923971)),
 ('key', array(0.00884558)),
 ('keys', array(0.00599331)),
 ('algorithm', array(0.00585033)),
 ('escrow', array(0.00550255)),
 ('crypto', array(0.00516373)),
 ('security', array(0.00512615)),
 ('nsa', array(0.004853))]
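
Topic ids are not aligned between the two models: topic 0 is about guns in the `BERTopic` run and about the Clipper chip in the `cuBERTopic` run. Comparing the two therefore means matching topics by content rather than by id. One simple way to do that, sketched here under the assumption that each topic is given in the `get_topic` format of `(word, score)` pairs, is the Jaccard overlap of the top words. The helper names are hypothetical and not part of either library.

```python
def top_word_set(topic):
    """Collect the words from a `get_topic`-style list of (word, score) pairs."""
    return {word for word, _score in topic}


def jaccard(topic_a, topic_b):
    """Share of top words two topics have in common (0.0 to 1.0)."""
    a, b = top_word_set(topic_a), top_word_set(topic_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(topic, candidates):
    """Return (topic_id, overlap) of the candidate most similar to `topic`."""
    return max(
        ((tid, jaccard(topic, cand)) for tid, cand in candidates.items()),
        key=lambda pair: pair[1],
    )


# Top words copied from the two `get_topic(0)` outputs above, scores rounded.
cpu_topic_0 = [("gun", 0.0149), ("guns", 0.0086), ("firearms", 0.0066)]
gpu_topic_0 = [("clipper", 0.0119), ("encryption", 0.0115), ("chip", 0.0092)]
print(jaccard(cpu_topic_0, gpu_topic_0))  # 0.0
```

In the notebook, `candidates` would be a dict mapping topic ids to `get_topic` results from the other model, built however suits your setup. The scores are ignored, so it does not matter that one model returns floats and the other returns arrays.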