## Topic Identification Models

In the world of topic identification, there are many options to choose from, for instance:

- LDA (with Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) as the variant for shorter texts): Latent Dirichlet Allocation (LDA) is a popular probabilistic model used for topic modeling.
- BERTopic: Uses transformer-based embeddings (like BERT) to create dense representations of documents, followed by clustering to identify topics.

Each of them has pros and cons. For instance, LDA does not perform well on shorter texts, which is the gap GSDMM is meant to fill. So we will try both of these approaches to see whether they work and, if so, how many topics we can extract.
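GSDMM is mentioned here only by name. As a rough illustration of how it handles short texts, the sketch below implements a toy collapsed Gibbs sampler for the Dirichlet Multinomial Mixture, in which every document is assigned to exactly one cluster. The function name `gsdmm`, its default hyperparameters, and the four sample documents are assumptions made for illustration; it is not the implementation used in this notebook.

```python
import math
import random
from collections import Counter


def gsdmm(docs, n_clusters=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Toy collapsed Gibbs sampler: returns one cluster label per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    labels = [rng.randrange(n_clusters) for _ in docs]
    doc_count = [0] * n_clusters                         # documents per cluster
    word_total = [0] * n_clusters                        # tokens per cluster
    word_count = [Counter() for _ in range(n_clusters)]  # per-word counts per cluster

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster k."""
        doc_count[k] += sign
        word_total[k] += sign * len(doc)
        for w in doc:
            word_count[k][w] += sign

    for doc, k in zip(docs, labels):
        move(doc, k, +1)

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            move(doc, labels[i], -1)  # take the document out of its cluster
            log_p = []
            for k in range(n_clusters):
                # Popular clusters are favoured ...
                lp = math.log(doc_count[k] + alpha)
                # ... and so are clusters that already contain the document's words.
                seen = Counter()
                for j, w in enumerate(doc):
                    lp += math.log(word_count[k][w] + beta + seen[w])
                    lp -= math.log(word_total[k] + vocab_size * beta + j)
                    seen[w] += 1
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[i] = rng.choices(range(n_clusters), weights=weights)[0]
            move(doc, labels[i], +1)  # put it into the sampled cluster
    return labels


# Hypothetical mini-corpus in the shape of the tokenized documents.
sample_docs = [["iphone", "beschikbaar"], ["iphone", "bestellen"],
               ["samsung", "s20", "beschikbaar"], ["google", "pixel"]]
print(gsdmm(sample_docs, n_clusters=3))
```

Because each document gets a single label rather than a mixture of topics, the model needs far fewer word co-occurrences per document than LDA does, which is why it tends to suit short messages better. Clusters that end up empty are simply unused, so `n_clusters` acts as an upper bound.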

### Result
 As per the results we can say that at least topics are there in unlabelled dataset

### Data Prep

- Labelled data
- Unlabelled data

In [16]:
from utils import Dataset, Preprocessing
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')

In [17]:
data_obj = Dataset() 

# get labelled data
train_df = data_obj.get_labelled_data()
process_obj = Preprocessing()

# preprocessing includes text cleaning and label encoding
cleaned_train_df = process_obj.get_preprocessd_data(train_df, text_column="phrase")

# we need tokenized docs for GSDMM
tokenized_train_docs = process_obj.get_tokenized(cleaned_train_df)

train_vocab = set(x for doc in tokenized_train_docs for x in doc)

print(tokenized_train_docs)
# get unlabelled data
test_df = data_obj.get_unlabelled_data()
process_obj = Preprocessing()

cleaned_test_df = process_obj.get_preprocessd_data(test_df, text_column="user_msg", has_labels = False)

# we need tokenized docs for GSDMM
tokenized_test_docs = process_obj.get_tokenized(cleaned_test_df)

test_vocab = set(x for doc in tokenized_test_docs for x in doc)

[['iphone'], ['iphone', 'beschikbaar'], ['komen', 'iphone'], ['samsung', 'verkopen'], ['google', 'pixel'], ['iphone', 'xs', 'beschikbaar'], ['toestel', 'assortiment'], ['iphone', 'beschikbaar'], ['iphone', 'beschikbaar'], ['telefoon', 'aanbieden'], ['iphone', 'bestellen'], ['verkoop', 'samsung', 'note'], ['iphone', 'plus', 'beschikbaar'], ['verkopen', 'samsung', 'galaxy', 's20'], ['iphone', 'bestellen'], ['telefoon', 'bieden'], ['telefoon', 'beschikbaar'], ['verkopen', 'iphone'], ['iphone', 'beschikbaarheid'], ['verkoop', 'nieuw', 'toestel'], ['samsung', 'galaxy', 'amsung', 's20', 'assortiment'], ['toestel', 'beschikbaar'], ['samsung', 's20', 'beschikbaar'], ['iphone', 'beschikbaar'], ['toestel'], ['nieuw', 'telefoon', 'beschikbaar'], ['iphone', 'mini', 'beschikbaar'], ['iphone', 'verkopen'], ['beschikbaarheid', 'toestel'], ['beschikbaarheid', 'iphone', 'se'], ['nieuw', 'pixel', 'koop'], ['iphone', 'se', 'beschikbaar'], ['iphone', 'xs', 'beschikbaar'], ['iphonexs', 'beschikbaar'], ['ve

#### BERTopic

We tried our conventional way of topic modeling, but the results were not that good, so let's try the newer LLM-based approaches. BERTopic is one of the most popular tools for identifying topics. Let's try that.

In [28]:
from bertopic import BERTopic

First, let's apply the technique to the labelled documents to see whether it works.

In [31]:
# Fetch the list of documents from cleaned_train_df
train_docs = cleaned_train_df["cleaned_text"].to_list()

# Apply BERTopic with language set to Dutch and nr_topics initially set to 20; nr_topics can also be left unset to let the model decide how many topics to extract
model_train = BERTopic(language="Dutch", nr_topics=20)

# Fit the model over our train docs
topics, probabilities = model_train.fit_transform(train_docs)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


Let's see how many topics there are and how many documents are associated with each topic.

In [32]:
model_train.get_topic_freq()

|  | Topic | Count |
|---|---|---|
| 2 | 0 | 225 |
| 14 | 1 | 150 |
| 15 | 2 | 146 |
| 9 | -1 | 142 |
| 11 | 3 | 90 |
| 4 | 4 | 82 |
| 3 | 5 | 80 |
| 16 | 6 | 80 |
| 19 | 7 | 79 |
| 18 | 8 | 65 |


So, as specified by the parameter, it created about 20 topics. The documents under topic -1 are categorized as noise, which means they could not be placed in any definitive cluster. Now let's check a particular topic to see whether the words within it are relevant.
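To put a number on the noise, the share of documents that landed in topic -1 can be computed directly from the list of topic assignments returned by `fit_transform`. The helper name `outlier_share` and the toy assignments below are made up for illustration; with the real model one would pass `topics` instead.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[-1] / len(topics)


# Made-up topic assignments, one entry per document.
toy_topics = [0, 0, 1, -1, 2, -1, 0, 1]
print(f"{outlier_share(toy_topics):.1%} of the documents are outliers")
```

A high share suggests that many messages do not fit any cluster well, which is worth keeping in mind when comparing runs with different parameters.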

In [33]:
model_train.get_topic(1)

[('abonnemenen', 0.23856939580542766),
 ('abonnement', 0.17815924447131343),
 ('verlengen', 0.10240756769147039),
 ('aanpassen', 0.10162964550689321),
 ('wijzig', 0.0830828072217799),
 ('veranderen', 0.07983064144690818),
 ('huidig', 0.06689062262875899),
 ('vernieuw', 0.056536445156321696),
 ('databundel', 0.047920540715351075),
 ('tussentijds', 0.04120115530472438)]
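The number next to each word is its weight within the topic. To my understanding of the BERTopic documentation, these weights come from a class-based TF-IDF (c-TF-IDF): all documents of a topic are treated as one large document, and a word scores high when it is frequent in that topic but rare across topics. The sketch below is a simplified reimplementation under that assumption, not the library's code; the function name `class_tfidf` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(tokens_per_topic):
    """Score words per topic; tokens_per_topic maps topic id -> all tokens of that topic."""
    if not tokens_per_topic:
        return {}
    term_freq = {t: Counter(tokens) for t, tokens in tokens_per_topic.items()}

    # Frequency of each word across all topics.
    total_freq = Counter()
    for counts in term_freq.values():
        total_freq.update(counts)

    # Average number of tokens per topic.
    avg_words = sum(total_freq.values()) / len(term_freq)

    scores = {}
    for topic, counts in term_freq.items():
        n_tokens = sum(counts.values())
        scores[topic] = {
            # normalized in-topic frequency * log(1 + average size / global frequency)
            w: (c / n_tokens) * math.log(1 + avg_words / total_freq[w])
            for w, c in counts.items()
        }
    return scores


# Toy tokens, loosely modelled on the topics in this notebook.
toy = {
    0: ["abonnement", "verlengen", "abonnement", "wijzig"],
    1: ["samsung", "galaxy", "samsung", "tablet"],
}
scores = class_tfidf(toy)
for topic, words in scores.items():
    print(topic, max(words, key=words.get))
```

This also explains why several words in a topic can share exactly the same score: they occur equally often in the topic and equally often overall.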

At first glance, at least, the words seem relevant. Let's check another topic.

In [34]:
model_train.get_topic(14)

[('samsung', 0.4520382195378256),
 ('tablet', 0.2730915309799744),
 ('kleur', 0.19487030032604502),
 ('pixel', 0.19487030032604502),
 ('s20', 0.16517292757206117),
 ('huawei', 0.13289610319904602),
 ('google', 0.13289610319904602),
 ('s9', 0.0971137913078263),
 ('note', 0.0971137913078263),
 ('galaxy', 0.0971137913078263)]

This one is also clearly good: the words relate to mobile phone brands. Let's visualize the topics to see how far apart they are on the plane.

In [35]:
model_train.visualize_topics()

We can see that some of the topics could still be merged, for instance topics 1 and 17. However, we can also try to extract topics without setting the nr_topics parameter. Let's try that and visualize the result.
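One way to back up the visual impression that two topics overlap is to compare a vector per topic with cosine similarity and list the pairs above a threshold. This is a different technique from the distance map shown by `visualize_topics()`, and the vectors, the threshold of 0.9, and the function names below are all assumptions for illustration; with a fitted model one would substitute real per-topic vectors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_candidates(topic_vectors, threshold=0.9):
    """Return (topic_a, topic_b, similarity) for every pair at or above the threshold."""
    pairs = []
    for (a, va), (b, vb) in combinations(sorted(topic_vectors.items()), 2):
        sim = cosine(va, vb)
        if sim >= threshold:
            pairs.append((a, b, round(sim, 3)))
    return pairs


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_vectors = {1: [0.9, 0.1, 0.0], 14: [0.0, 0.2, 0.9], 17: [0.8, 0.2, 0.1]}
print(merge_candidates(toy_vectors))
```

The pairs it returns are only candidates: whether two topics should really be merged still depends on reading their top words.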

In [36]:
# Apply BERTopic with language set to Dutch
model_train = BERTopic(language="Dutch")

# Fit the model over our train docs
topics, probabilities = model_train.fit_transform(train_docs)

In [40]:
model_train.visualize_topics()

We see that it captured a total of 52 topics. Most of them are far apart, yet some still overlap, which means the corpus could be covered with fewer than 52 topics.

Now let's find the possible topics in the unlabelled data. Although we could reuse the model defined for the labelled data, the texts in the unlabelled data are more free-form, so let's try a separate model with no parameter other than language, which is needed to select the relevant sentence transformer (sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2).

In [43]:
model_test = BERTopic(language="Dutch")

In [45]:
test_docs = cleaned_test_df["cleaned_text"].to_list()

test_topics, test_probabilities = model_test.fit_transform(test_docs)

# new_topics, new_probs = model_test.reduce_topics(docs, topics, probabilities, nr_topics=30)

In [46]:
model_test.get_topic_freq()

|  | Topic | Count |
|---|---|---|
| 15 | -1 | 526 |
| 8 | 0 | 508 |
| 11 | 1 | 159 |
| 26 | 2 | 120 |
| 3 | 3 | 100 |
| ... | ... | ... |
| 98 | 93 | 13 |
| 44 | 94 | 12 |
| 24 | 95 | 12 |
| 27 | 96 | 11 |


In [47]:
model_test.get_topic(2)

[('wijzig', 0.25279429371991063),
 ('abonnemenen', 0.14767597401132776),
 ('verhoog', 0.032171168883330686),
 ('abonement', 0.02704542595061215),
 ('direct', 0.02578038435860101),
 ('programmas', 0.01867456562406856),
 ('beheerder', 0.01867456562406856),
 ('opnames', 0.01867456562406856),
 ('doorspoelen', 0.01867456562406856),
 ('reclame', 0.016085584441665343)]

In [48]:
model_test.get_topic(94)

[('abbonement', 0.2701212844325855),
 ('maandelijks', 0.13934756973813478),
 ('opzegbaar', 0.10277862360994972),
 ('wekelijks', 0.09897519780756336),
 ('voordeliger', 0.09897519780756336),
 ('ozegbaar', 0.09897519780756336),
 ('jaar', 0.0868889310052321),
 ('afsluit', 0.08525359754082631),
 ('opzegging', 0.08525359754082631),
 ('actief', 0.08525359754082631)]

In [49]:
model_test.visualize_topics()

Here we can see that although the clusters for each topic make sense, a number of clusters still overlap and could be merged with one another. Moreover, some of the topics contain very few documents, which makes them hard to justify as separate topics. So perhaps we can train the model again, this time setting the parameter for the minimum cluster size.
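To get a feel for how many topics a given minimum size would affect, the topic counts can be filtered against a threshold before retraining. The sketch below uses only the rows visible in the frequency table above (the rows hidden behind `...` are not included), and the helper name `small_topics` is made up. Note that it only counts topics: it does not emulate the re-clustering that happens when the model is refitted with a different minimum size.

```python
def small_topics(topic_counts, min_size):
    """Return the ids of topics (outlier topic -1 excluded) with fewer than min_size docs."""
    return sorted(t for t, n in topic_counts.items() if t != -1 and n < min_size)


# Counts copied from the visible rows of the frequency table.
visible_counts = {-1: 526, 0: 508, 1: 159, 2: 120, 3: 100,
                  93: 13, 94: 12, 95: 12, 96: 11}
print(small_topics(visible_counts, min_size=20))
```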

In [50]:
model_test = BERTopic(language="Dutch", min_topic_size=20)

test_topics, test_probabilities = model_test.fit_transform(test_docs)

model_test.get_topic_freq()

|  | Topic | Count |
|---|---|---|
| 19 | -1 | 617 |
| 8 | 0 | 507 |
| 11 | 1 | 179 |
| 15 | 2 | 121 |
| 26 | 3 | 119 |
| ... | ... | ... |
| 53 | 59 | 23 |
| 49 | 60 | 23 |
| 45 | 61 | 23 |
| 40 | 62 | 22 |


In [55]:
model_test.get_topic(59)

[('betalen', 0.13177949575308714),
 ('factuur', 0.08768870597024409),
 ('ideal', 0.08121384002747009),
 ('kost', 0.06605886510563347),
 ('waaneer', 0.05648395805286159),
 ('stuuur', 0.05648395805286159),
 ('hoveel', 0.05648395805286159),
 ('doorvoeren', 0.05648395805286159),
 ('activeringskost', 0.05648395805286159),
 ('gebruikskost', 0.0492367457638707)]

In [56]:
model_test.visualize_topics()

In [62]:
model_test = BERTopic(language="Dutch", min_topic_size=20, nr_topics=20)

test_topics, test_probabilities = model_test.fit_transform(test_docs)

model_test.get_topic_freq()

|  | Topic | Count |
|---|---|---|
| 3 | -1 | 726 |
| 4 | 0 | 724 |
| 6 | 1 | 507 |
| 2 | 2 | 396 |
| 0 | 3 | 362 |
| 7 | 4 | 309 |
| 10 | 5 | 224 |
| 1 | 6 | 192 |
| 13 | 7 | 160 |
| 8 | 8 | 159 |


In [63]:
model_test.get_topic(10)

[('slecht', 0.3701954783314747),
 ('bereik', 0.22376849894021772),
 ('traag', 0.1114002505088417),
 ('heel', 0.09885994437628183),
 ('verbinding', 0.08399363369294205),
 ('streep', 0.07042527245275938),
 ('ontvangst', 0.05036228058956477),
 ('langzaam', 0.04730763001458205),
 ('tijd', 0.044900398906076444),
 ('super', 0.042255163471655637)]

In [64]:
model_test.visualize_topics()

In [65]:
model_test.visualize_barchart()

In [67]:
model_test.visualize_distribution(test_probabilities)

In [72]:
# from sentence_transformers import SentenceTransformer
# from umap import UMAP

In [74]:
# sentence_model = SentenceTransformer("NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers")
# embeddings = sentence_model.encode(test_docs, show_progress_bar=False)
# topic_model = BERTopic(min_topic_size=20).fit(test_docs, embeddings)

# topic_model.get_topic_freq()

# topic_model.visualize_documents(test_docs, embeddings=embeddings)

# # Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# topic_model.visualize_documents(test_docs, reduced_embeddings=reduced_embeddings)

|  | Topic | Count |
|---|---|---|
| 5 | -1 | 623 |
| 8 | 0 | 502 |
| 4 | 1 | 145 |
| 36 | 2 | 129 |
| 10 | 3 | 115 |
| ... | ... | ... |
| 56 | 56 | 23 |
| 49 | 57 | 22 |
| 59 | 58 | 21 |
| 58 | 59 | 20 |


In [75]:

# topic_model.get_topic(10)

[('wijzig', 0.3827571838784957),
 ('abonnemenen', 0.22840662138829837),
 ('abonament', 0.044626029352060906),
 ('veranderen', 0.025354175925931286),
 ('tussentijds', 0.021104462522211827),
 ('sony', 0.02012133996865295),
 ('site', 0.02012133996865295),
 ('geldig', 0.017953109360939728),
 ('vinden', 0.01462666065903743),
 ('verlengen', 0.012117394635119226)]

In [76]:
# topic_model.visualize_topics()