#BERTopic

BERTopic is a topic modeling technique that combines transformer embeddings with a class-based TF-IDF (c-TF-IDF) to form dense clusters, which yields easily interpretable topics while keeping important words in the topic descriptions.
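
As a rough illustration of the class-based TF-IDF idea, the sketch below joins all documents of a topic into one "class document" and weights each term by its in-class frequency times log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The function name `class_tfidf`, the whitespace tokenizer, and the toy documents are assumptions made for this example; BERTopic's own implementation differs in details such as tokenization and normalization.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic):
    """Simplified c-TF-IDF: one score per (topic, term) pair."""
    # Term counts per topic, treating all documents of a topic as one document.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Term counts across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per class (topic).
    avg_words = sum(total.values()) / len(tf)
    return {
        topic: {t: f * math.log(1 + avg_words / total[t]) for t, f in counts.items()}
        for topic, counts in tf.items()
    }


# Hypothetical toy corpus, already grouped by topic.
toy = {
    0: ["the team won the game", "the game went to overtime"],
    1: ["the drive uses a scsi controller", "the ide drive failed"],
}
scores = class_tfidf(toy)
top_term = {topic: max(w, key=w.get) for topic, w in scores.items()}
print(top_term)
```

A term that is frequent inside one topic but rare elsewhere ("game", "drive") outranks a term that is frequent everywhere ("the"), although common words can still score fairly high without stop-word removal.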

In [1]:
%%capture
!pip install bertopic

#Get the data ready

In [1]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [15]:
len(docs)

18846

#Training BERTopic model using the above dataset

In [2]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2023-05-10 04:45:36,684 - BERTopic - Transformed documents to Embeddings
2023-05-10 04:46:14,412 - BERTopic - Reduced dimensionality
2023-05-10 04:46:58,034 - BERTopic - Clustered reduced embeddings


NOTE: Use language="multilingual" to select a model that supports 50+ languages.

#Extract topics

In [3]:
freq = topic_model.get_topic_info(); freq.head(5)

    Topic  Count  Name
0      -1   6594  -1_to_the_and_of
1       0   1824  0_game_team_games_he
2       1    605  1_key_clipper_chip_encryption
3       2    524  2_ites_cheek_yep_huh
4       3    426  3_drive_scsi_drives_ide
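
Topic -1 is BERTopic's outlier bucket: documents that the clustering step did not assign to any topic, so it is usually set aside when reading this table. The sketch below uses only the counts shown above and the corpus size of 18846 to show how large that bucket is; the variable names are illustrative.

```python
# Counts copied from the get_topic_info() output above (first five rows only).
topic_counts = {-1: 6594, 0: 1824, 1: 605, 2: 524, 3: 426}
n_docs = 18846  # len(docs)

# Share of documents that ended up in the outlier topic.
outlier_share = topic_counts[-1] / n_docs

# Ignore the outlier bucket when looking for the largest real topic.
real_topics = {t: c for t, c in topic_counts.items() if t != -1}
largest_topic = max(real_topics, key=real_topics.get)

print(f"outlier share: {outlier_share:.3f}, largest topic: {largest_topic}")
```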


In [4]:
topic_model.get_topic(0)

[('game', 0.010536276312972109),
 ('team', 0.009174122321601728),
 ('games', 0.007303857957639276),
 ('he', 0.007227373276361553),
 ('players', 0.006419733458247637),
 ('season', 0.0063495243789260854),
 ('hockey', 0.006210497295960609),
 ('play', 0.005871172635283199),
 ('25', 0.0057452062784470615),
 ('year', 0.005724989144616739)]

BERTopic can produce different topics on different runs because UMAP (Uniform Manifold Approximation and Projection) is stochastic.
UMAP is a dimensionality reduction technique similar to t-SNE.
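
If reproducible topics matter, one common approach is to hand BERTopic a UMAP model with a fixed `random_state`. The sketch below is a minimal version of that idea. The helper name `build_seeded_topic_model` is hypothetical, the seed 42 is arbitrary, and the remaining UMAP values are assumed to mirror BERTopic's defaults, so treat them as a starting point and not as the author's settings. Fixing the seed may also make UMAP slower.

```python
# UMAP settings; only random_state is essential for reproducibility.
# The other values are assumptions meant to mirror BERTopic's defaults.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,
}


def build_seeded_topic_model(params=UMAP_PARAMS):
    """Return a BERTopic model whose UMAP step uses a fixed seed."""
    # Imported lazily so the parameters can be inspected without the packages.
    from umap import UMAP
    from bertopic import BERTopic

    return BERTopic(
        umap_model=UMAP(**params),
        language="english",
        calculate_probabilities=True,
        verbose=True,
    )
```

Calling `build_seeded_topic_model().fit_transform(docs)` would then replace the plain `BERTopic(...)` construction used earlier.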

In [5]:
topic_model.topics_[:10]

[0, 10, 64, 3, 103, -1, -1, 0, 0, -1]

In [6]:
topic_model.visualize_topics()

In [7]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)

In [8]:
topic_model.visualize_hierarchy(top_n_topics=50)

In [16]:
topic_model.visualize_barchart(top_n_topics=5)

In [17]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [18]:
topic_model.visualize_term_rank()

In [19]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [20]:
topic_model.get_topic(0)

[('game', 0.006669213345197113),
 ('team', 0.0057000241241332225),
 ('he', 0.005397815969065439),
 ('games', 0.004493551207465028),
 ('the', 0.004164927757402022),
 ('was', 0.0038693889924185845),
 ('players', 0.0038554001761822846),
 ('season', 0.003811702474453964),
 ('in', 0.0037331202402915292),
 ('hockey', 0.0037181456047040763)]
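
The `n_gram_range=(1, 2)` argument asks for both single words and two-word phrases as candidate topic terms. The sketch below shows what that range means on a single sentence; `extract_ngrams` and its whitespace tokenizer are assumptions for illustration, not the vectorizer BERTopic uses internally.

```python
def extract_ngrams(text, n_gram_range=(1, 2)):
    """List all word n-grams whose length lies within n_gram_range (inclusive)."""
    tokens = text.lower().split()
    shortest, longest = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(shortest, longest + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(extract_ngrams("the hockey season"))
```

With a wider range the vocabulary grows quickly, which is one reason the per-term scores in the output above are lower than before the update.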

In [21]:
topic_model.reduce_topics(docs, nr_topics=60)

2023-05-10 05:20:33,204 - BERTopic - Reduced number of topics from 210 to 60


<bertopic._bertopic.BERTopic at 0x7f1fe1a0bdf0>

#Find similar topics

In [22]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[6, 22, 30, 19, 41]
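
Conceptually, `find_topics` embeds the search term and ranks topics by how similar their embeddings are to it. The sketch below reproduces that ranking step with cosine similarity on made-up 3-dimensional vectors; the helper `rank_topics`, the vectors, and the topic ids are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(query_vec, topic_vecs, top_n=2):
    """Return the top_n topic ids and their similarities, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), topic) for topic, vec in topic_vecs.items()),
        reverse=True,
    )
    best = scored[:top_n]
    return [topic for _, topic in best], [score for score, _ in best]


# Hypothetical topic embeddings and query embedding.
topic_vecs = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.9, 0.1], 2: [0.7, 0.3, 0.1]}
query_vec = [1.0, 0.2, 0.0]

similar, sims = rank_topics(query_vec, topic_vecs)
print(similar, sims)
```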

In [23]:
topic_model.get_topic(6)

[('the', 0.014206015221196903),
 ('bike', 0.012488691894486382),
 ('to', 0.010648520285567242),
 ('car', 0.010590652202972981),
 ('and', 0.010565512345557344),
 ('it', 0.009947822905246392),
 ('you', 0.00879021344429998),
 ('in', 0.008785025966820827),
 ('is', 0.008601195734211793),
 ('on', 0.008584396575636104)]

In [24]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census',
                                  'London is known for its finacial district'])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))

Similarity: tensor([[0.5472, 0.6330]])
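
The cell above scores passages with a dot product. For embeddings that are normalized to unit length, which the "cos" in this model's name suggests, the dot product and cosine similarity give the same number. The sketch below checks that identity on two small hypothetical vectors using only the standard library.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity computed from the raw vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings, deliberately not unit length.
query = [3.0, 4.0]
passage = [4.0, 3.0]

dot_of_normalized = dot(normalize(query), normalize(passage))
cosine_of_raw = cosine(query, passage)
print(dot_of_normalized, cosine_of_raw)
```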
