# Topic Modelling With BERTopic

*Learn how to perform topic modelling to determine what topics are within unlabelled text data using BERTopic with Python.*

![](https://drive.google.com/uc?id=1eyuwF6NcmFsykzOUx1DbBSKEdJN4tNjK&authuser=recohut.data.001%40gmail.com&usp=drive_fs)

Topic modelling is a common task in NLP. It’s an unsupervised technique for discovering which topics (which can be thought of as categories) appear in a set of documents, and which topics each document most likely belongs to. Because it’s unsupervised, no labels are required: we don’t need a predefined list of topics, just the text of the documents. In this article, we’ll discuss how to perform topic modelling with BERTopic, a leading Python package for this task that uses state-of-the-art Transformer models.

Topic modelling has a wide range of applications. For example, a social media site may want to know which news topics are currently trending but does not have a predefined list of current news topics. By applying topic modelling, the site can determine which topics are trending and which articles fall within them. This is just one of the countless applications of topic modelling. In this tutorial, we’ll apply it to news headlines related to the economy.

We’ll discuss how BERTopic works from a high level. Although knowing these concepts is not strictly necessary to use the library, I believe this background information will give you some appreciation for the package. It will also provide you with some intuition to better understand its capabilities and limitations.

Here’s a great [article](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6) by the author of BERTopic that explains the methodology in more detail.

## Install

In [3]:
%%capture
pip install bertopic datasets

## Import

In [4]:
from bertopic import BERTopic
from datasets import load_dataset

## Data

We’ll use a dataset called newspop, which contains titles and headlines for news articles. Each article is about one of four topics: “economy,” “microsoft,” “obama,” or “palestine.” The data is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0). See the [repository](https://huggingface.co/datasets/newspop) for more detail.

In [5]:
dataset = load_dataset("newspop", split="train[:]")

In [6]:
print(dataset)

Dataset({
    features: ['id', 'title', 'headline', 'source', 'topic', 'publish_date', 'facebook', 'google_plus', 'linked_in'],
    num_rows: 93239
})


In [10]:
# Count how many rows carry each topic label (sorted for a stable display order)
{label: dataset["topic"].count(label) for label in sorted(set(dataset["topic"]))}

{'economy': 33928, 'microsoft': 21858, 'obama': 28610, 'palestine': 8843}
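
As a side note, the same tallies can be computed in a single pass with `collections.Counter` from the standard library, which avoids rescanning the column once per label. The sketch below is only illustrative: `topic_column` is a short hypothetical list standing in for `dataset["topic"]`.

```python
from collections import Counter

# Hypothetical stand-in for dataset["topic"]; the real column has 93,239 entries.
topic_column = ["economy", "obama", "economy", "microsoft", "palestine", "economy"]

# Counter walks the list once and tallies each distinct label.
label_counts = Counter(topic_column)

for label, count in sorted(label_counts.items()):
    print(f"{label}: {count}")
```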

Let’s create a list of strings called docs that contains all of the headlines for the “economy” topic. We’ll ignore the other topics in this tutorial to keep the subject matter more specific.

In [7]:
docs = []
for case in dataset:
  if case["topic"] == "economy":
    docs.append(case["headline"])
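
The same filter can also be written as a list comprehension. The sketch below assumes records shaped like the dataset's rows; `rows` and the helper `headlines_for` are hypothetical names used only for illustration, and the headlines are invented.

```python
# Hypothetical rows shaped like the dataset's records (only two fields shown).
rows = [
    {"topic": "economy", "headline": "Markets rally as growth beats forecasts"},
    {"topic": "obama", "headline": "President addresses the nation"},
    {"topic": "economy", "headline": "Central bank holds interest rates"},
]


def headlines_for(rows, topic):
    """Return the headline of every row whose topic matches."""
    return [row["headline"] for row in rows if row["topic"] == topic]


economy_docs = headlines_for(rows, "economy")
print(len(economy_docs))
```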

## Model

By default, BERTopic uses a Transformer model called all-MiniLM-L6-v2 to produce embeddings (which we call vectors). Other models from this [webpage](https://www.sbert.net/docs/pretrained_models.html) may be used instead, like all-mpnet-base-v2. To load a different model, pass its name as a string to the embedding_model argument of the BERTopic class, or don’t pass anything to use the default model. Note that the first positional argument of BERTopic is the language, so the model name should be passed by keyword.

In [11]:
topic_model = BERTopic()
topic_model_large = BERTopic(embedding_model="all-mpnet-base-v2")

## Fit

In [12]:
import time

In [13]:
start = time.time()

topics, probs = topic_model.fit_transform(docs)
end = time.time()
print(end - start)

106.7158408164978


We fit the model by calling the BERTopic object’s fit_transform() method, as in the cell above. This runs all of the steps of the methodology described in the article linked earlier. The method returns two values, which we assigned to the variables topics and probs. topics contains one integer per document indicating which topic the document was assigned to. Similarly, probs contains the probability that each document belongs to its assigned topic, or in other words, the model’s certainty.



In [14]:
print("topics: ", topics)
print("probs: ", probs)

topics:  [0, -1, 194, 53, 31, -1, 31, 89, 31, 0, 0, 376, 376, 376, 92, 0, -1, -1, 43, -1, 92, 92, 92, 93, -1, -1, -1, 0, 0, 109, 0, 17, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, 17, 35, 72, -1, -1, 158, 57, 90, 0, 0, 177, -1, -1, -1, 104, 104, 143, 2, 0, 79, 5, 5, 5, 5, -1, 5, -1, 243, -1, -1, -1, -1, -1, -1, -1, 167, 167, -1, -1, -1, 167, -1, 203, 167, -1, -1, 167, -1, 81, 2, 46, 50, 0, 0, 32, 32, 129, 129, 24, 3, 169, 30, 1, -1, 85, 29, 0, 42, 46, 139, 139, -1, 110, -1, 41, 85, 20, 15, 79, -1, 13, 0, 69, 183, -1, 8, -1, 0, 20, 14, 13, -1, -1, -1, 2, 2, 225, -1, 187, -1, 8, -1, 225, 398, -1, -1, 8, -1, 53, -1, 140, 0, 14, 273, 214, 196, 120, 179, 196, 196, 196, 27, 83, 0, -1, 0, -1, 1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 14, 8, -1, 8, 8, 63, 66, 34, -1, 1, -1, -1, 88, 1, 79, -1, -1, 21, 0, -1, -1, 183, -1, -1, -1, 395, -1, 5, 395, 63, 175, 97, -1, -1, 5, 24, -1, 8, -1, 395, 395, 164, 164, 348, 94, 395, 19, 48, -1, -1, 117, 14, 131, 67, -1, -1, -1, 15, 58, -1, 0, 69, 73, -1, -1, 95, -1, 164, 1

In [15]:
print(len(topics))
print(len(probs))

33928
33928
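
A quick way to get a feel for the assignments is to measure how many documents landed in the outlier topic (-1) and which topics are most common. The sketch below uses only the first 16 values from the printed topics list, and `summarize` is a hypothetical helper, not part of BERTopic.

```python
from collections import Counter

# First 16 topic assignments copied from the printed output above.
sample_topics = [0, -1, 194, 53, 31, -1, 31, 89, 31, 0, 0, 376, 376, 376, 92, 0]


def summarize(topics, top_n=3):
    """Return the outlier share and the most common non-outlier topics."""
    counts = Counter(topics)
    outlier_share = counts.get(-1, 0) / len(topics)
    assigned = Counter({t: c for t, c in counts.items() if t != -1})
    return outlier_share, assigned.most_common(top_n)


outlier_share, top_topics = summarize(sample_topics)
print(f"outlier share: {outlier_share:.1%}")
print("most common topics:", top_topics)
```

On the full run, the topic table in the next section reports 10,932 of the 33,928 documents in topic -1, which is roughly 32%.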


## Topic Information

We can get information describing each topic by calling our BERTopic object’s get_topic_info() method.

In [16]:
topic_information = topic_model.get_topic_info()

In [17]:
print(type(topic_information))

<class 'pandas.core.frame.DataFrame'>


In [18]:
print(topic_information)

     Topic  Count                                          Name
0       -1  10932                         -1_the_us_economic_of
1        0   2539                0_chinas_chinese_china_beijing
2        1    901                   1_india_indian_indias_delhi
3        2    681                 2_japans_japanese_tokyo_japan
4        3    468         3_australias_australian_australia_nsw
..     ...    ...                                           ...
404    405     10     405_climate_humancaused_change_scientific
403    406     10  406_utilisation_investments_assocham_private
401    408     10          408_savings_expenses_borrow_directly
400    404     10           404_shift_drastic_necessarily_thats
410    409     10       409_damaged_votes_forecasters_scotlands

[411 rows x 3 columns]


The output is a pandas DataFrame listing each topic along with the number of documents that fall within it. It contains 411 rows (410 topics plus the outlier topic), ordered by how many documents belong to each. The first row, topic “-1”, is a special topic that collects outlier documents. After that comes the topic with the ID “0”, which has the keywords chinas, chinese, china and beijing. The keywords are lowercase (“china” rather than “China”) because the text is lowercased before the keywords are extracted.
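
The Name column packs the topic ID and its top keywords into one underscore-separated string. The sketch below splits such a string back into its parts. It assumes the `<id>_<keyword>_<keyword>...` pattern seen in the output above and that no keyword contains an underscore; `parse_topic_name` is a hypothetical helper, not a BERTopic function.

```python
def parse_topic_name(name):
    """Split a name like '0_chinas_chinese_china_beijing' into (id, keywords)."""
    topic_id, _, keywords = name.partition("_")
    return int(topic_id), keywords.split("_")


# Names copied from the table above.
names = [
    "-1_the_us_economic_of",
    "0_chinas_chinese_china_beijing",
    "1_india_indian_indias_delhi",
]

for name in names:
    print(parse_topic_name(name))
```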



## Topic Words

We can get more information about a topic by calling the BERTopic object’s get_topic() method with the topic’s ID. This returns a list of (word, score) pairs for the topic, ordered by c-TF-IDF score, or in simple terms, by how frequent the words are within the topic and how unique they are to it.

In [19]:
topic_words = topic_model.get_topic(1)

print(topic_words)

[('india', 0.023809099661973258), ('indian', 0.023665045640721694), ('indias', 0.01744525918806283), ('delhi', 0.015393242449096407), ('modi', 0.01190209318457343), ('narendra', 0.010521589380143675), ('jaitley', 0.009279842716726484), ('fastest', 0.007869132335817437), ('arun', 0.007291678409053478), ('73', 0.007059091730358834)]
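
To build some intuition for these scores, here is a simplified pure-Python sketch of the c-TF-IDF idea: all documents of a topic are treated as one long document, and each word is scored as its frequency within the topic times log(1 + A / f), where f is the word's count across all topics and A is the average number of words per topic. This is an illustration under those assumptions, not BERTopic's actual implementation, which may differ in normalization and preprocessing details. The function `c_tf_idf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score each word per class as tf * log(1 + A / f).

    tf is the word's share of the class's words, f is the word's count
    across all classes, and A is the average number of words per class.
    """
    class_counts = {c: Counter(text.lower().split()) for c, text in class_docs.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical toy data: all documents of a topic joined into one string.
class_docs = {
    0: "china economy growth china beijing",
    1: "india economy growth india delhi",
}

scores = c_tf_idf(class_docs)
for c, word_scores in scores.items():
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    print(c, [word for word, _ in ranked[:3]])
```

Words shared by both toy topics ("economy", "growth") score lower than words that appear in only one of them, which is why topic-specific terms rise to the top.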


## Display

We can visualize the topics by calling the visualize_topics() method. This allows us to see how closely related the topics are to each other. But keep in mind that this is a 2D visualization, so it’s not a perfect representation of the relationships between the topics.



In [20]:
topic_model.visualize_topics()

### Barchart

We can also create a bar chart by calling the visualize_barchart() method.

In [21]:
topic_model.visualize_barchart()

## Predict

We can predict which topic any arbitrary text belongs to using the fitted model by calling the transform() method. The code below demonstrates this with a made-up headline for a news article.

In [22]:
topic_model.calculate_probabilities = False

In [24]:
text = "Canadian exports are increasing thanks to the price of the loonie."
preds, probs = topic_model.transform(text)
print(preds)

[57]


In [25]:
top_topic = preds[0]
print(topic_model.get_topic(top_topic))

[('canadian', 0.04704896368456583), ('canadians', 0.021028340114528036), ('canadas', 0.020594432571126805), ('canada', 0.013105547999807966), ('loonie', 0.01227048208803553), ('oil', 0.012083564043674434), ('dollar', 0.00958801983076037), ('prices', 0.009272796632930336), ('toronto', 0.008362943605941656), ('mounts', 0.008274521987416834)]
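
For display purposes, it can be handy to turn a list of (word, score) pairs like the one above into a short label. The sketch below is one possible way to do that. `top_keywords` is a hypothetical helper rather than part of BERTopic, and the scores are rounded copies of the first five pairs in the output above.

```python
def top_keywords(topic_words, n=3):
    """Return the n highest-scoring words from a list of (word, score) pairs."""
    ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# First five (word, score) pairs copied from the output above, scores rounded.
canada_topic = [
    ("canadian", 0.0470), ("canadians", 0.0210), ("canadas", 0.0206),
    ("canada", 0.0131), ("loonie", 0.0123),
]

print(", ".join(top_keywords(canada_topic)))
```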


That's all.

Thanks for your attention.