<a href="https://colab.research.google.com/github/rfahrn/Exercise_3/blob/main/example_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling with BERTopic

Taken from:

https://maartengr.github.io/BERTopic/index.html

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9

In [None]:
!pip install bertopic
from bertopic import BERTopic

## Load the data

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

In [2]:
from sklearn.datasets import fetch_20newsgroups

# The data needs to be a list
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
len(docs)

18846

## Create the model
`.fit_transform()` fits the model to the documents and returns two values: the topic assigned to each document and the corresponding probabilities. On the full dataset this can take around half an hour.

Optional arguments for `BERTopic()`:
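- `verbose=True` logs progress during training
- `nr_topics=20` reduces the result to 20 topics
- `nr_topics='auto'` lets the model reduce the number of topics automatically
- `language='German'` sets the main language of the documents (the default is `'english'`)
- `language='multilingual'` supports 50+ languages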

In [None]:
topic_model = BERTopic()  

topics, probs = topic_model.fit_transform(docs)

## Select top topics

We can now inspect the most frequent topics that were generated:

In [None]:
topic_model.get_topic_info()

Topic -1 collects all outliers and should typically be ignored. We can look at the size of the topics in descending order (we use `.head(11)` so that ten real topics remain once the -1 row is set aside):

In [None]:
topic_model.get_topic_freq().head(11)
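
As a rough illustration of what this table contains, the sketch below counts topic sizes from a list of per-document topic ids and drops the outlier topic -1. The `toy_topics` list and the `topic_sizes` helper are hypothetical stand-ins, not part of BERTopic; in the notebook, the `topics` list returned by `fit_transform()` would take the place of `toy_topics`.

```python
from collections import Counter

# Toy stand-in for the `topics` list returned by fit_transform():
# one topic id per document, with -1 marking outliers.
toy_topics = [0, 1, -1, 0, 2, 0, -1, 1, 0, -1, -1]


def topic_sizes(topic_ids, include_outliers=False):
    """Return (topic_id, count) pairs sorted by count, largest first."""
    counts = Counter(
        topic_id
        for topic_id in topic_ids
        if include_outliers or topic_id != -1
    )
    return counts.most_common()


sizes = topic_sizes(toy_topics)
print(sizes)
```

Filtering on the topic id rather than slicing off the first row is a little more robust, since it does not rely on the outlier topic being the largest one.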

## Select one topic

You can select a specific topic and get the top n words for that topic and their c-TF-IDF scores.

In [None]:
topic_model.get_topic(0)
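
To give an intuition for the c-TF-IDF scores, here is a minimal pure-Python sketch. It assumes the formula given in the BERTopic documentation, where the weight of word x in topic c is its frequency in that topic times log(1 + A / f_x), with f_x the frequency of x across all topics and A the average number of words per topic. The toy word lists and the helper names `c_tf_idf` and `top_words` are made up for illustration, and BERTopic's real implementation works on sparse matrices and may differ in details such as normalisation.

```python
import math
from collections import Counter

# Toy data: every topic is treated as one long document made of its members' words.
toy_topic_words = {
    0: ["game", "team", "game", "the"],
    1: ["space", "orbit", "space", "the"],
}


def c_tf_idf(words_per_topic):
    """Score each word per topic as tf * log(1 + A / f_x) (simplified sketch)."""
    # f_x: how often each word occurs across all topics.
    word_totals = Counter()
    for words in words_per_topic.values():
        word_totals.update(words)

    # A: average number of words per topic.
    avg_words = sum(len(w) for w in words_per_topic.values()) / len(words_per_topic)

    scores = {}
    for topic, words in words_per_topic.items():
        counts = Counter(words)
        scores[topic] = {
            # Term frequency is normalised by topic length (an assumption of this sketch).
            word: (count / len(words)) * math.log(1 + avg_words / word_totals[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, topic, n=3):
    """Return the n highest-scoring (word, score) pairs for one topic."""
    ranked = sorted(scores[topic].items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


scores = c_tf_idf(toy_topic_words)
print(top_words(scores, 0))
```

In this toy example, "the" appears in both topics and therefore scores lower than words that are specific to a single topic.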

## Topic modelling visualisation

Let's look at three methods of visualisation. See https://maartengr.github.io/BERTopic/index.html for a list of all methods.

### Visualise topics

`visualize_topics()` shows the topics with their sizes and corresponding words.

In [None]:
topic_model.visualize_topics()

### Visualise terms

`visualize_barchart()` shows the top terms for a few topics as bar charts of their c-TF-IDF scores. You can then compare topic representations with each other and gain more insight into the topics that were generated.

In [None]:
topic_model.visualize_barchart()

### Visualise topic similarity

`visualize_heatmap()` shows how similar the topics are to each other.

In [None]:
topic_model.visualize_heatmap()
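
As far as we understand it, the heatmap is built from cosine similarities between topic representations. The sketch below shows that computation on hand-made vectors; `toy_topic_vectors` and both helper functions are hypothetical and only stand in for the representations BERTopic computes internally.

```python
import math

# Hand-made vectors standing in for topic representations (illustrative only).
toy_topic_vectors = {
    0: [1.0, 0.0, 0.2],
    1: [0.9, 0.1, 0.3],
    2: [0.0, 1.0, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return {(topic_i, topic_j): similarity} for every pair of topics."""
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in vectors
        for j in vectors
    }


similarity = similarity_matrix(toy_topic_vectors)
for (i, j), value in sorted(similarity.items()):
    print(f"topic {i} vs topic {j}: {value:.3f}")
```

Topics 0 and 1 point in almost the same direction and get a similarity close to 1, while topic 2 shares no dimension with topic 0 and gets a similarity of 0.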

## Reducing topics

If you did not set `nr_topics` before training, you can still reduce the number of topics afterwards without re-training.

In [None]:
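topic_model.reduce_topics(docs, nr_topics=15)  # recent BERTopic releases update the model in place; older ones used reduce_topics(docs, topics, probs, nr_topics=15)
new_topics = topic_model.topics_  # updated topic assignment for each document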

## Make a prediction

To predict the topics of new documents, pass them as a list to the `transform()` method.

In [None]:
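new_docs = ["The new graphics card runs hot under load."]  # illustrative placeholder; replace with your own documents
pred_topics, pred_probs = topic_model.transform(new_docs)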

## Save the model

After generating topics and their probabilities (i.e. training the model), we can save the model so that we do not have to retrain it every time we want to tweak something.

In [None]:
topic_model.save('my_topic_model')

## Load the model

In [None]:
trained_model = BERTopic.load('my_topic_model')