<a href="https://colab.research.google.com/github/rfahrn/Exercise_3/blob/main/ex_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling with BERTopic

Taken from:

https://maartengr.github.io/BERTopic/index.html

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9

In [None]:
!pip install bertopic
from bertopic import BERTopic

## Load the data

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

In [None]:
from sklearn.datasets import fetch_20newsgroups

# The data needs to be a list
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

## Create the model
`.fit_transform()` fits the model to the data, generates the topics, and returns one topic assignment per document together with the corresponding probabilities. It can take around half an hour.

Optional arguments for `BERTopic()`:
- `verbose=True`
- `nr_topics=20`
- `nr_topics='auto'`
- `language='German'`
- `language='multilingual'` supports 50+ languages
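
As a rough sketch of how these options combine, the snippet below collects them into a keyword dictionary. The helper name `bertopic_options` is hypothetical and exists only for illustration; `language`, `nr_topics` and `verbose` are the constructor arguments listed above.

```python
def bertopic_options(multilingual=False, nr_topics="auto", verbose=True):
    """Collect BERTopic constructor arguments (hypothetical helper)."""
    return {
        # "english" is the default; "multilingual" covers 50+ languages
        "language": "multilingual" if multilingual else "english",
        # an integer fixes the number of topics, "auto" lets the model decide
        "nr_topics": nr_topics,
        # print progress information during training
        "verbose": verbose,
    }


options = bertopic_options(multilingual=True, nr_topics=20)
print(options)

# The dictionary can then be unpacked into the constructor:
# topic_model = BERTopic(**options)
```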

In [None]:
topic_model = BERTopic()  

topics, probs = topic_model.fit_transform(docs)
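
`topics` holds one topic id per document, in the same order as `docs`. A minimal sketch of how that alignment can be used to group documents by topic is shown here; `toy_docs` and `toy_topics` are made-up stand-ins, not real model output.

```python
from collections import defaultdict

# Hypothetical stand-ins for docs and for the topics returned by fit_transform()
toy_docs = [
    "Pens beat the Devils",
    "Clipper chip debate",
    "SCSI vs IDE drives",
    "asdf qwerty",
]
toy_topics = [0, 1, 3, -1]  # one topic id per document; -1 marks an outlier

# Group the documents under the topic id they were assigned to
docs_by_topic = defaultdict(list)
for doc, topic in zip(toy_docs, toy_topics):
    docs_by_topic[topic].append(doc)

for topic, members in sorted(docs_by_topic.items()):
    print(topic, members)
```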






## Select top topics

We can now access the most frequent topics that were generated:

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 6546 | -1_to_of_and_the |
| 1 | 0 | 1832 | 0_game_team_games_he |
| 2 | 1 | 569 | 1_key_clipper_chip_encryption |
| 3 | 2 | 533 | 2_ites_hello_cheek_hi |
| 4 | 3 | 454 | 3_drive_scsi_drives_ide |
| ... | ... | ... | ... |
| 213 | 212 | 10 | 212_ear_wax_ears_syringe |
| 214 | 213 | 10 | 213_icon_icons_click_program |
| 215 | 214 | 10 | 214_religion_religious_wars_dogma |
| 216 | 215 | 10 | 215_law_jesus_god_sabbath |


Topic -1 collects all outliers and should typically be ignored. We can look at the size of the topics in descending order (we use `.head(11)` because the first row is the outlier topic -1):

In [None]:
topic_model.get_topic_freq().head(11)

| | Topic | Count |
|---|---|---|
| 0 | -1 | 6546 |
| 1 | 0 | 1832 |
| 2 | 1 | 569 |
| 3 | 2 | 533 |
| 4 | 3 | 454 |
| 5 | 4 | 450 |
| 6 | 5 | 329 |
| 7 | 6 | 328 |
| 8 | 7 | 265 |
| 9 | 8 | 185 |
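
To leave the outliers out altogether, the -1 row can be filtered away. The sketch below does this in plain Python on the (Topic, Count) pairs copied from the output above; the variable names are illustrative only.

```python
# (Topic, Count) pairs copied from the output above
topic_freq = [
    (-1, 6546), (0, 1832), (1, 569), (2, 533), (3, 454),
    (4, 450), (5, 329), (6, 328), (7, 265), (8, 185),
]

# Drop the outlier topic and keep the rest, largest first
real_topics = sorted(
    ((topic, count) for topic, count in topic_freq if topic != -1),
    key=lambda pair: pair[1],
    reverse=True,
)

print(real_topics[:3])  # [(0, 1832), (1, 569), (2, 533)]
```

On the DataFrame itself, a boolean filter such as `freq[freq["Topic"] != -1]` (with `freq` being the result of `get_topic_freq()`) should have the same effect.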


## Select one topic

You can select a specific topic and get the top n words for that topic and their c-TF-IDF scores.

In [None]:
topic_model.get_topic(0)

[('game', 0.01037988667299007),
 ('team', 0.009040690955853959),
 ('games', 0.007202051400546848),
 ('he', 0.007079899520713731),
 ('players', 0.0063152731728265445),
 ('season', 0.006258325596614474),
 ('hockey', 0.0061343588212836985),
 ('play', 0.005791802025830788),
 ('25', 0.005657570761628978),
 ('year', 0.005629384769157129)]
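
The result is a list of `(word, score)` tuples, so it can be processed with ordinary Python. As a small illustration, the `Name` shown in the topic table appears to be the topic id followed by the top four words, which the sketch below reproduces from the values copied from the output above.

```python
# First four (word, score) pairs copied from the output above
topic_id = 0
top_words = [
    ("game", 0.01037988667299007),
    ("team", 0.009040690955853959),
    ("games", 0.007202051400546848),
    ("he", 0.007079899520713731),
]

# Join the topic id and the words with underscores
label = "_".join([str(topic_id)] + [word for word, _ in top_words])
print(label)  # 0_game_team_games_he
```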

## Topic modelling visualisation

Let's look at three methods of visualisation. See https://maartengr.github.io/BERTopic/index.html for a list of all methods.

### Visualise topics

`visualize_topics()` shows the topics with their sizes and corresponding words.

In [None]:
topic_model.visualize_topics()

### Visualise terms

`visualize_barchart()` shows the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores. You can then compare topic representations to each other and gain more insight into the topics generated.

In [None]:
topic_model.visualize_barchart()
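
For reference, c-TF-IDF treats all documents of a topic as one long document and weights each term by its relative frequency in that topic times `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The following is a minimal pure-Python sketch of that idea on made-up tokens, not BERTopic's own implementation; the function name `c_tf_idf` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per class; class_tokens maps a topic id to its list of tokens."""
    counts_per_class = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # Frequency of each term across all classes
    total_counts = Counter()
    for counts in counts_per_class.values():
        total_counts.update(counts)

    # Average number of words per class
    avg_words = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)

    scores = {}
    for c, counts in counts_per_class.items():
        n_words = sum(counts.values())
        scores[c] = {
            word: (k / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, k in counts.items()
        }
    return scores


# Made-up tokens for two topics
toy = {
    0: "game team game hockey the".split(),
    1: "key chip encryption key the".split(),
}
scores = c_tf_idf(toy)
for topic, word_scores in scores.items():
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    print(topic, ranked[:2])
```

In this toy example, "the" occurs in both topics and therefore scores lower than a word of equal in-topic frequency that occurs in only one topic.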

### Visualise topic similarity

You can also visualize how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap()
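
A heatmap of this kind is typically built from a pairwise similarity matrix. The sketch below assumes cosine similarity between topic vectors and uses three made-up vectors; both the vectors and the helper name are hypothetical, and the snippet is not taken from BERTopic's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up topic vectors: the first two point in similar directions
topic_vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]

# Pairwise similarity matrix, one row and one column per topic
sim = [[cosine_similarity(u, v) for v in topic_vectors] for u in topic_vectors]
for row in sim:
    print([round(value, 3) for value in row])
```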

## Reducing topics

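If you didn't use the `nr_topics` argument before training, you can reduce the number of topics afterwards without re-training. The call below follows the signature of older BERTopic releases; newer releases take only `docs` and `nr_topics` and update the model in place.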

In [None]:
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=15)

## Make a prediction

To predict the topics of new documents, pass them as a list to the `transform()` method.

In [None]:
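# Hypothetical example document; replace with your own list of strings
new_docs = ["The team won the hockey game in overtime."]

topics, probs = topic_model.transform(new_docs)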

## Save the model

After generating topics and their probabilities (i.e. training the model), we can save the model so that we don't have to re-train it every time we want to tweak something.

In [None]:
topic_model.save('my_topic_model')

## Load the model

In [None]:
trained_model = BERTopic.load('my_topic_model')