# Topic Modelling using BERTopic & cuBERTopic

Sample notebook demonstrating cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents. We run the same operations using `BERTopic` to compare their behaviour.

## Quick Start
In both cases, we start by extracting topics from the well-known 20 newsgroups dataset, loaded through `sklearn`, which consists of English documents.

### Installing relevant packages
We need the dependencies for `BERTopic` as well as for `cuBERTopic`, since we compare performance between the two.

More detailed instructions are in the README.

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from transformers import AutoTokenizer, AutoModel
import torch
from cuBERTopic import gpu_BERTopic
import rmm
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
# Pre-allocate a 5 GB RMM memory pool; the size is passed as an integer number of bytes
rmm.reinitialize(pool_allocator=True, initial_pool_size=int(5e9))

docs = fetch_20newsgroups(subset='all')['data']

### Running `BERTopic`
`BERTopic` accepts custom embeddings, so we create sentence embeddings using `AutoTokenizer` followed by `AutoModel` from `transformers` and pass them to the `fit_transform` method of the `BERTopic` class, which fits the model on a collection of documents, generates topics, and returns the topic assigned to each document together with the associated probabilities.

In [2]:
len(docs)

18846

In [3]:
%%time
# Load AutoModel from huggingface model repository
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2"
)

# Tokenize sentences
encoded_input = tokenizer(
    docs,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask
        .unsqueeze(-1)
        .expand(token_embeddings.size())
        .float()
    )
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(
    model_output,
    encoded_input["attention_mask"]
)
sentence_embeddings = sentence_embeddings.to('cpu').numpy()
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, sentence_embeddings)

CPU times: user 52min 27s, sys: 27min 26s, total: 1h 19min 54s
Wall time: 2min 53s
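
To make the mean pooling step above easier to follow, here is a minimal sketch of the same arithmetic on a toy input. It uses NumPy instead of `torch` purely to keep the example light; the function name `mean_pooling_np` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dim
    mask = attention_mask[..., np.newaxis].astype(np.float32)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clipping guards against division by zero for a row that is all padding
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: one sentence, three token slots, hidden size 2; the last slot is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])
pooled = mean_pooling_np(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded slot holds large values on purpose: because its mask entry is 0, it does not affect the average.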


`get_topic_info` returns information about each topic, including its id, frequency, and name.

In [4]:
%%time
topic_model.get_topic_info()

CPU times: user 2.58 ms, sys: 0 ns, total: 2.58 ms
Wall time: 2.3 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6169 | -1_file_drive_email_information |
| 1 | 0 | 279 | 0_gun_guns_militia_firearms |
| 2 | 1 | 241 | 1_car_cars_toyota_mustang |
| 3 | 2 | 228 | 2_clipper_chip_encryption_key |
| 4 | 3 | 196 | 3_sy_rh_reserve_year |
| ... | ... | ... | ... |
| 394 | 393 | 10 | 393_kibology_religion_forming_disasters |
| 395 | 394 | 10 | 394_mode_640x400_vga_vesa |
| 396 | 395 | 10 | 395_wip_sports_eagles_fan |
| 397 | 396 | 10 | 396_mac_rebooting_constantly_plus |


`get_topic` returns the top n words of a topic together with their c-TF-IDF scores.

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [5]:
%%time
topic_model.get_topic(0)

CPU times: user 9 µs, sys: 5 µs, total: 14 µs
Wall time: 24.3 µs


[('gun', 0.01439620175290087),
 ('guns', 0.00948467877156171),
 ('militia', 0.006324452798138242),
 ('firearms', 0.005763341042633199),
 ('weapons', 0.005673735137682135),
 ('weapon', 0.005168657529058602),
 ('crime', 0.004562871433230919),
 ('amendment', 0.004400026705177315),
 ('control', 0.004320586427788624),
 ('handgun', 0.004202791173387931)]
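
For intuition about the scores above, the following is a simplified sketch of class-based TF-IDF (c-TF-IDF). It follows the weighting described in the BERTopic documentation as we understand it: term frequency within a topic multiplied by `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of the term across all topics. The function name `c_tf_idf` and the toy tokens are assumptions for illustration, and the library's actual implementation differs in details such as vectorization and optional settings.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a topic id to the tokens of all its documents joined together."""
    counts = {topic: Counter(tokens) for topic, tokens in class_docs.items()}
    # f_x: frequency of each term summed over all topics
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of tokens per topic
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for topic, topic_counts in counts.items():
        n_tokens = sum(topic_counts.values())
        scores[topic] = {
            term: (freq / n_tokens) * math.log(1 + avg_words / total[term])
            for term, freq in topic_counts.items()
        }
    return scores

# Toy data (hypothetical): two topics, each represented by one merged token list
toy = {
    0: "gun guns gun militia the".split(),
    1: "car cars toyota car the".split(),
}
scores = c_tf_idf(toy)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(top[0][0])  # gun
```

A term such as "the" that appears in every topic gets a lower weight than a term of equal frequency that is specific to one topic, which is why topic names are dominated by distinctive words.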

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: we pass in `docs` as a collection of strings to model. Here, instead of `AutoTokenizer` from `transformers`, it uses `SubwordTokenizer` from `cuDF` in combination with `AutoModel` from `transformers`.

Due to the stochastic nature of UMAP, the results might differ between runs and the quality can degrade.

In [6]:
%%time
gpu_topic = gpu_BERTopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

Label prop iterations: 17
Label prop iterations: 8
Label prop iterations: 5
Label prop iterations: 4
Label prop iterations: 4
Iterations: 5
1175,214,345,19,215,1186
Label prop iterations: 5
Label prop iterations: 4
Iterations: 2
973,113,167,6,83,291
CPU times: user 25.7 s, sys: 4.03 s, total: 29.8 s
Wall time: 22.5 s


In [7]:
%%time
gpu_topic.get_topic_info()

CPU times: user 15.7 ms, sys: 0 ns, total: 15.7 ms
Wall time: 13.9 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6747 | -1_file_email_information_program |
| 92 | 0 | 413 | 0_monitor_card_video_drivers |
| 258 | 1 | 201 | 1_car_cars_convertible_toyota |
| 270 | 2 | 148 | 2_printer_deskjet_printers_hp |
| 345 | 3 | 143 | 3_israel_israeli_arab_arabs |
| ... | ... | ... | ... |
| 334 | 402 | 10 | 402_jets_canucks_winnipeg_selanne |
| 365 | 403 | 10 | 403_clipper_phone_tapped_crooks |
| 389 | 404 | 10 | 404_ampere_amp_db_ohmite |
| 399 | 405 | 10 | 405_smiley_object_kuiper_karla |
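
Because the two runs assign different topic ids, a label-independent summary is a simple way to compare them. The sketch below counts the non-outlier topics and the share of documents labelled -1. The helper name `summarize_topics` and the toy labels are assumptions for illustration; it expects a plain Python list of topic ids, so `topics_cpu` or `topics_gpu` may need converting first.

```python
from collections import Counter

def summarize_topics(topics):
    """Return (number of non-outlier topics, share of documents labelled -1)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    n_topics = len(counts) - (1 if -1 in counts else 0)
    return n_topics, n_outliers / len(topics)

# Toy labels (hypothetical): eight documents, two of them outliers
toy_topics = [-1, 0, 0, 1, -1, 2, 1, 0]
n_topics, outlier_share = summarize_topics(toy_topics)
print(n_topics, outlier_share)  # 3 0.25
```

Applied to the counts reported in this notebook, the outlier rows (6169 documents for `BERTopic`, 6747 for `cuBERTopic`, out of 18846) correspond to outlier shares of roughly 33% and 36%.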


In [8]:
%%time
gpu_topic.get_topic(0)

CPU times: user 359 ms, sys: 37.7 ms, total: 397 ms
Wall time: 395 ms


[('monitor', array(0.01571617)),
 ('card', array(0.01371929)),
 ('video', array(0.01272293)),
 ('drivers', array(0.01032598)),
 ('vga', array(0.0095135)),
 ('monitors', array(0.0088007)),
 ('ati', array(0.0078124)),
 ('diamond', array(0.00775029)),
 ('vesa', array(0.00629848)),
 ('screen', array(0.00619216))]
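
The GPU output above wraps each score in a zero-dimensional array, whereas the CPU output contains plain floats. If a matching format is needed for comparison or serialization, a small conversion is enough. This is a sketch that assumes the scores behave like NumPy 0-d arrays, as the printed `array(...)` values suggest; `to_plain_scores` is a hypothetical helper, not part of either library.

```python
import numpy as np

def to_plain_scores(topic_words):
    """Convert (word, 0-d array) pairs into (word, float) pairs."""
    return [(word, float(score)) for word, score in topic_words]

# First two entries of the GPU output shown above
sample = [("monitor", np.array(0.01571617)), ("card", np.array(0.01371929))]
plain = to_plain_scores(sample)
print(plain)  # [('monitor', 0.01571617), ('card', 0.01371929)]
```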