# BERTopic on Quran Urdu Translations
This notebook applies topic modeling to Urdu translations of the Quran using BERTopic, a BERT-based topic modeling technique.

Shaista Zulfiqar

## Mounting Google Drive
If the dataset is stored on Google Drive, mount the drive in Google Colaboratory first.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive




# Installing required dependencies
**After installing the libraries, restart the runtime so that the newly installed versions are loaded and other dependencies are not affected.**
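
Once the runtime has been restarted, it can help to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are the ones passed to pip in this notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip commands in this notebook.
report = installed_versions(["bertopic", "sentence-transformers", "urduhack"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as not installed usually means the install cell has not been run in the current runtime.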

In [None]:
!pip install bertopic
!pip install -U sentence-transformers
!pip install urduhack

Collecting bertopic
  Downloading bertopic-0.16.2-py2.py3-none-any.whl (158 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.36-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl (85 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)


In [None]:
!pip install --upgrade keras tensorflow-addons




# Importing required dependencies
We import numpy, pandas, re, bertopic, and gensim here; other libraries are imported later in the notebook.

Pandas is used to create a DataFrame and handle the CSV file. NumPy is used for fast array computation. The re library is used for cleaning the data. Gensim is used to train LDA and to compute coherence scores. BERTopic is trained on our Quran-UTM dataset using the pretrained multilingual MiniLM language model.

In [None]:
import pandas as pd
import numpy as np
import re
from bertopic import BERTopic
from urduhack.normalization import remove_diacritics #Rerun this cell if you get any error
from gensim.models import LdaMulticore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
#optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

## DataFrame




In [None]:

junapd = pd.read_table("", header=None, encoding='utf-8') #provide your data path

print(junapd.head(5))

stopspd=pd.read_csv('/content/drive/MyDrive/stopwords.txt',names=['List'])#your path

stopspd

                                                   0
0  1|1|شروع کرتا ہوں اللہ تعالیٰ کے نام سے جو بڑا...
1  1|2|سب تعریف اللہ تعالیٰ کے لئے ہے جو تمام جہا...
2                  1|3|بڑا مہربان نہایت رحم کرنے واﻻ
3             1|4|بدلے کے دن (یعنی قیامت) کا مالک ہے
4  1|5|ہم صرف تیری ہی عبادت کرتے ہیں اور صرف تجھ ...


Unnamed: 0,List
0,کی
1,ہیں
2,ہے
3,رہا
4,رہی
...,...
396,گئی
397,ہونے
398,وجہ
399,ہوگیا


## Cleaning of Data
After collecting the eight Quran Urdu translations, we observed that they contained irrelevant information such as metadata, punctuation, and diacritics, so we cleaned the translations before topic modeling.

Stopwords are common words that are often filtered out during text processing in natural language processing (NLP) tasks, because they carry little or no value in conveying the actual meaning of the text. We use a list of 401 stopwords for topic modeling. Stopwords are removed in the post-preprocessing phase.
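
To make the cleaning steps concrete, here is a minimal sketch applied to a single line in the `chapter|verse|text` layout shown above. The helper name `clean_line` and the three-word stopword set are hypothetical, and the character class is only similar to the notebook's punctuation list; diacritic removal is left out because it relies on urduhack.

```python
import re

# Matches a leading "chapter|verse|" reference such as "1|4|".
VERSE_REF = re.compile(r'^\d+\|\d+\|')
# Urdu and ASCII punctuation; hyphen and square brackets are escaped.
PUNCTUATION = re.compile(r'[؛؟،٫٬‘’“”«»!"٪&\'*+,\-./:;<=>@^_`()\[\]{|}~]')


def clean_line(line: str) -> str:
    """Strip the verse reference, non-breaking spaces, and punctuation."""
    line = VERSE_REF.sub('', line)
    line = line.replace('\xa0', ' ')
    line = PUNCTUATION.sub('', line)
    return line.strip()


sample = "1|4|بدلے کے دن (یعنی قیامت) کا مالک ہے"
cleaned = clean_line(sample)
print(cleaned)

# Illustrative stopword subset (assumption); the notebook loads its list from a file.
stopwords = {"کے", "کا", "ہے"}
tokens = [word for word in cleaned.split() if word not in stopwords]
print(tokens)
```

In the notebook itself, stopwords are not removed from the documents before embedding; they are filtered later through the vectorizer passed to BERTopic.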

In [None]:
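import re

# Remove Urdu and ASCII punctuation
def remove_urdu_punctuation(text):
    # The hyphen and square brackets are escaped so that they are treated
    # as literal members of the character class.
    pattern = r'[؛؟،٫٬‘’“”«»!"٪&\'*+,\-./:;<=>@^_`()\[\]{|}~]'
    cleaned_text = re.sub(pattern, '', text)
    return cleaned_text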


In [None]:
# Remove trailing metadata lines
index_to_drop = junapd[junapd[0].str.startswith("#")].index
junapd.drop(index_to_drop, inplace=True)

# Remove verse reference from every line
junapd[0] = junapd[0].str.replace(r'\d+\|\d+\|', '', regex=True)

# Remove diacritics
junapd[0] = junapd[0].apply(remove_diacritics)

# Remove punctuation
junapd[0] = junapd[0].apply(remove_urdu_punctuation)

# Reset index
junapd.reset_index(drop=True, inplace=True)

# Display the cleaned DataFrame
print("Cleaned DataFrame:")
print(junapd.head())



Cleaned DataFrame:
                                                   0
0  شروع کرتا ہوں اللہ تعالی کے نام سے جو بڑا مہرب...
1  سب تعریف اللہ تعالی کے لئے ہے جو تمام جہانوں ک...
2                      بڑا مہربان نہایت رحم کرنے واﻻ
3                 بدلے کے دن (یعنی قیامت) کا مالک ہے
4  ہم صرف تیری ہی عبادت کرتے ہیں اور صرف تجھ ہی س...


In [None]:
def remove_nonbreaking_space(text):
    return re.sub(r'\xa0', ' ', text)

junapd[0] = junapd[0].apply(remove_nonbreaking_space)

In [None]:
# Save the cleaned DataFrame to a text file
file_path = 'cleaned_Junagarhi_text.txt'
junapd[0].to_csv(file_path, sep='\n', index=False, header=False, encoding='utf-8')

print(f'Text has been saved to {file_path}')

Text has been saved to cleaned_Junagarhi_text.txt


In [None]:
# Convert the DataFrame to a list of strings
data = junapd[0].tolist()

In [None]:
print(len(data))

6236


In [None]:
print(data[56:90])

['اور جب ہم نے تمہارے لئے دریا چیر (پھاڑ) دیا اور تمہیں اس سے پار کردیا اور فرعونیوں کو تمہاری نظروں کے سامنے اس میں ڈبو دیا', 'اور ہم نے (حضرت) موسی ﴿علیہ السلام﴾ سے چالیس راتوں کا وعده کیا، پھر تم نے اس کے بعد بچھڑا پوجنا شروع کردیا اور ﻇالم بن گئے', 'لیکن ہم نے باوجود اس کے پھر بھی تمہیں معاف کردیا، تاکہ تم شکر کرو', 'اور ہم نے (حضرت) موسی﴿علیہ السلام﴾ کو تمہاری ہدایت کے لئے کتاب اور معجزے عطا فرمائے', 'جب (حضرت موسی) ﴿علیہ السلام﴾ نے اپنی قوم سے کہا کہ اے میری قوم! بچھڑے کو معبود بنا کر تم نے اپنی جانوں پر ﻇلم کیا ہے، اب تم اپنے پیدا کرنے والے کی طرف رجوع کرو، اپنے کو آپس میں قتل کرو، تمہاری بہتری اللہ تعالی کے نزدیک اسی میں ہے، تو اس نے تمہاری توبہ قبول کی، وه توبہ قبول کرنے واﻻ اور رحم وکرم کرنے واﻻ ہے', 'اور (تم اسے بھی یاد کرو) تم نے (حضرت) موسی ﴿علیہ السلام﴾ سے کہا تھا کہ جب تک ہم اپنے رب کو سامنے نہ دیکھ لیں ہرگز ایمان نہ ﻻئیں گے (جس گستاخی کی سزا میں) تم پر تمہارے دیکھتے ہوئے بجلی گری', 'لیکن پھر اس لئے کہ تم شکرگزاری کرو، اس موت کے بعد بھی ہم نے تمہیں زنده کردیا', 'اور ہم ن

# BERTopic Training
The default BERTopic embedding model is paraphrase-multilingual-MiniLM-L12-v2 when `language="multilingual"` is selected. We take the MiniLM model from [sentence-transformers](https://www.sbert.net/docs/pretrained_models.html), create custom document embeddings with it, and pass them to the BERTopic model for training.

In [None]:
#create custom embedding
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(data, show_progress_bar=True)
print(embeddings)

[[-0.05596548  0.5835919  -0.3221646  ... -0.04675008  0.28727624
   0.02047918]
 [-0.03789676  0.54637575 -0.21332069 ... -0.12550303  0.07511239
   0.05277733]
 [-0.00982627  0.3162045  -0.09909539 ... -0.2868312   0.25163165
   0.05825587]
 ...
 [ 0.03173115  0.19379574 -0.060475   ... -0.07272384  0.22816095
   0.13100289]
 [ 0.01132959  0.0149532  -0.17385605 ...  0.05484664  0.17869394
   0.07355011]
 [-0.0037258   0.07043177  0.04907688 ... -0.03990352  0.15049328
  -0.03631473]]


In [None]:
stop_words_list = stopspd['List'].tolist()

In [None]:
#pass vectorizer_model to bertopic for stopwords removal
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words= stop_words_list)


In [None]:
# UMAP for dimensionality reduction
from umap import UMAP
dim_model = UMAP(n_components=4, random_state=42)

In [None]:
# KMeans used for clustering
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=5, random_state=42)

In [None]:
np.random.seed(42)

In [None]:
# The KMeans model is passed through hdbscan_model to replace the default HDBSCAN clustering
topic_model = BERTopic(
    language="urdu",
    low_memory=True,
    calculate_probabilities=True,
    top_n_words=10,
    hdbscan_model=cluster_model,
    umap_model=dim_model,
    vectorizer_model=vectorizer_model,
    verbose=True,
)

In [None]:
#Fit documents in bertopic
topics, probs = topic_model.fit_transform(data,embeddings)

2024-06-09 15:02:23,557 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-06-09 15:03:12,371 - BERTopic - Dimensionality - Completed ✓
2024-06-09 15:03:12,374 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-06-09 15:03:12,761 - BERTopic - Cluster - Completed ✓
2024-06-09 15:03:12,784 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-06-09 15:03:13,090 - BERTopic - Representation - Completed ✓


In [None]:
print(probs)

None


In [None]:
# Topic assigned to each document
print(topics)

[1, 1, 2, 0, 2, 0, 0, 0, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 2, 2, 1, 0, 1, 0, 1, 1, 2, 1, 1, 0, 2, 1, 1, 1, 1, 3, 3, 2, 3, 3, 2, 2, 3, 0, 2, 3, 2, 0, 3, 2, 2, 2, 2, 0, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 1, 1, 0, 1, 0, 2, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 3, 1, 1, 1, 1, 3, 2, 1, 1, 1, 2, 1, 0, 2, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 2, 2, 2, 3, 1, 1, 3, 2, 2, 2, 3, 3, 3, 0, 1, 1, 1, 1, 3, 1, 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 1, 1, 2, 1, 0, 2, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 2, 1, 1, 2, 0, 2, 1, 1, 2, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 1, 3, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 2, 1, 2, 

In [None]:
topic_model.get_topic_freq()

Unnamed: 0,Topic,Count
2,0,2151
0,1,1948
1,2,1536
3,3,569
4,4,32


In [None]:
document_topics = topic_model.get_topics()

In [None]:
#topics with score
print(document_topics)

{0: [('دن', 0.0464722778633009), ('پس', 0.03150121810967468), ('جائیں', 0.030957348019091765), ('لوگ', 0.027129070972407476), ('کوئی', 0.026649038084400722), ('یقینا', 0.026022130573654188), ('پیدا', 0.0259991332612175), ('عذاب', 0.02377807841452333), ('والوں', 0.023292657655769686), ('وقت', 0.022989176686948387)], 1: [('اللہ', 0.14460906089339248), ('تعالی', 0.09752328745436274), ('واﻻ', 0.04188940309141993), ('کوئی', 0.03375854507728429), ('لوگ', 0.03241164196384428), ('تمہارے', 0.030816135976750095), ('ایمان', 0.03063410745363759), ('لوگوں', 0.027782768442137076), ('تمہیں', 0.02534517791535824), ('چیز', 0.021974974245793602)], 2: [('رب', 0.06678070748911359), ('ایمان', 0.04279725807114603), ('لوگ', 0.04142203528016125), ('واﻻ', 0.034466579902883175), ('لوگوں', 0.03277577224220408), ('پروردگار', 0.031721589679804325), ('پاس', 0.030156558677440645), ('اے', 0.027310313302848874), ('کوئی', 0.027179752954481173), ('عذاب', 0.026783997701245597)], 3: [('السلام', 0.09604209472797368), ('علی

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,2151,0_دن_پس_جائیں_لوگ,"[دن, پس, جائیں, لوگ, کوئی, یقینا, پیدا, عذاب, ...",[پس اگر ہم تجھے یہاں سے لے بھی جائیں تو بھی ہم...
1,1,1948,1_اللہ_تعالی_واﻻ_کوئی,"[اللہ, تعالی, واﻻ, کوئی, لوگ, تمہارے, ایمان, ل...",[اس سے پہلے، لوگوں کو ہدایت کرنے والی بنا کر، ...
2,2,1536,2_رب_ایمان_لوگ_واﻻ,"[رب, ایمان, لوگ, واﻻ, لوگوں, پروردگار, پاس, اے...",[اے ہمارے رب! ہم نے سنا کہ منادی کرنے واﻻ بآوا...
3,3,569,3_السلام_علیہ_موسی_فرعون,"[السلام, علیہ, موسی, فرعون, قوم, اے, رب, میرے,...",[اور ہم نے موسی (علیہ السلام) کو اپنی نشانیاں ...
4,4,32,4_نعمت_جھٹلاؤ_رب_پس,"[نعمت, جھٹلاؤ, رب, پس, جنو, انسانو, اے, کون, پ...","[پس تم اپنے رب کی کس کس نعمت کو جھٹلاؤ گے؟, پس..."


In [None]:
topic_distr, _ = topic_model.approximate_distribution(data, window=3, min_similarity=0.01)

100%|██████████| 7/7 [00:01<00:00,  3.80it/s]


In [None]:
print(topic_distr)

[[0.06770668 0.61750968 0.15966548 0.15511816 0.        ]
 [0.06477241 0.61169466 0.15573635 0.1472886  0.02050799]
 [0.16518651 0.27237735 0.39489618 0.16753996 0.        ]
 ...
 [0.5634443  0.1184214  0.15415836 0.16397594 0.        ]
 [0.26959802 0.2232964  0.30335588 0.2037497  0.        ]
 [0.47626493 0.08400767 0.28820584 0.05863305 0.09288851]]


In [None]:
topic_model.visualize_distribution(topic_distr[0], width=600, height=600, title="Topic Probability Distribution")

# Evaluation
We used three evaluation metrics to compare the results: two coherence metrics and one diversity metric.

1. The coherence score captures the degree of similarity between the words within each topic; higher scores indicate more coherent topics. We used two coherence metrics: NPMI and C_V.
2. Inverted rank-biased overlap (IRBO) measures how different and distinct the topics in a topic model are.
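
As a rough illustration of what NPMI coherence measures, the sketch below scores word pairs by how often they co-occur across documents. It is a simplification under stated assumptions: co-occurrence is counted per whole document rather than with a sliding window, the function names `npmi` and `topic_npmi` are hypothetical, and the toy corpus uses English placeholder tokens. The notebook's reported scores come from Gensim's `CoherenceModel`, not from this code.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """NPMI of two words from document-level co-occurrence counts."""
    n_docs = len(docs)
    p_a = sum(word_a in doc for doc in docs) / n_docs
    p_b = sum(word_b in doc for doc in docs) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in docs) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur
    denominator = -math.log(p_ab)
    if denominator == 0:
        return 0.0  # degenerate case: both words occur in every document
    return math.log(p_ab / (p_a * p_b)) / denominator


def topic_npmi(topic_words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens (placeholder words).
docs = [
    {"mercy", "lord", "faith"},
    {"mercy", "lord"},
    {"pharaoh", "moses"},
    {"moses", "pharaoh", "lord"},
]
print(topic_npmi(["moses", "pharaoh"], docs))
print(topic_npmi(["mercy", "pharaoh"], docs))
```

Words that always appear together score close to 1, words that never co-occur score -1, and a topic's coherence is the average over its word pairs.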


### Coherence Score
To evaluate the coherence of the model's topics, we use the [Gensim](https://radimrehurek.com/gensim/models/coherencemodel.html) library.

In [None]:
# Tokenize each document on whitespace and drop stopwords
texts = [[word for word in str(document).split() if word not in stop_words_list] for document in data]
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

In [None]:
# Collect the top words of each topic, dropping their scores
topics_bert = [
    [word for word, _ in topic_model.get_topic(topic_id)]
    for topic_id in topic_model.get_topics()
]

In [None]:
print(topics_bert)

[['دن', 'پس', 'جائیں', 'لوگ', 'کوئی', 'یقینا', 'پیدا', 'عذاب', 'والوں', 'وقت'], ['اللہ', 'تعالی', 'واﻻ', 'کوئی', 'لوگ', 'تمہارے', 'ایمان', 'لوگوں', 'تمہیں', 'چیز'], ['رب', 'ایمان', 'لوگ', 'واﻻ', 'لوگوں', 'پروردگار', 'پاس', 'اے', 'کوئی', 'عذاب'], ['السلام', 'علیہ', 'موسی', 'فرعون', 'قوم', 'اے', 'رب', 'میرے', 'پاس', 'مجھے'], ['نعمت', 'جھٹلاؤ', 'رب', 'پس', 'جنو', 'انسانو', 'اے', 'کون', 'پروردگار', 'تکذیب']]


In [None]:
# compute Coherence Score CV

cm = CoherenceModel(topics=topics_bert, texts=texts, dictionary=id2word, coherence='c_v')
coherence = round(cm.get_coherence(),2)
print('\nCV Score: ', coherence)


CV Score:  0.52


In [None]:
# compute Coherence Score NPMI

cm = CoherenceModel(topics=topics_bert, texts=texts, dictionary=id2word, coherence='c_npmi')
coherence = round(cm.get_coherence(),2)
print('\nNPMI Score: ', coherence)


NPMI Score:  0.02


### Diversity Score

Upload the rbo.py file to the runtime before running the import in the next cell.

In [None]:
import itertools
from rbo import rbo
import numpy as np

class InvertedRBO:
    def __init__(self):
        pass

    def irbo(self, topics, topk=10, weight=0.9):
        """
        Calculate inverted Rank Biased Overlap (RBO) as a measure of topic diversity from a list of lists of words.

        :param topics: A list of lists of words representing different topics.
        :param topk: The number of top words on which RBO will be computed.
        :param weight: Weight of each agreement at depth d: p**(d-1). When set to 1.0, there is no weight,
                       and the RBO returns to average overlap.
        :return: The inverted RBO topic diversity score.
        """
        if topk <= 0:
            raise ValueError("topk must be a positive integer.")

        num_topics = len(topics)
        if num_topics == 0:
            raise ValueError("topics list cannot be empty.")

        if topk > len(topics[0]):
            raise Exception('Words in topics are less than topk')

        collect = []
        for list1, list2 in itertools.combinations(topics, 2):
            rbo_val = rbo(list1[:topk], list2[:topk], p=weight)[2]
            collect.append(rbo_val)

        Irbo_score = 1 - np.mean(collect)
        return Irbo_score

In [None]:
inverted_rbo_calculator = InvertedRBO()
IRBO= round(inverted_rbo_calculator.irbo(topics_bert, topk=10, weight=0.9),2)
print("Inverted RBO Score:", IRBO)

Inverted RBO Score: 0.87
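
To show what the inverted RBO score captures, here is a simplified, self-contained sketch. It computes the plain truncated rank-biased overlap, a weighted sum of prefix agreements, and does not reproduce rbo.py: that file is not shown here, so how it handles the tail beyond the listed words is an assumption left out of this version. The function names `truncated_rbo` and `inverted_rbo` are hypothetical.

```python
from itertools import combinations


def truncated_rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two ranked word lists."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        # Fraction of shared words among the top-d words of both lists.
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score


def inverted_rbo(topics, p=0.9):
    """One minus the mean pairwise overlap: higher means more diverse topics."""
    pairs = list(combinations(topics, 2))
    mean_overlap = sum(truncated_rbo(a, b, p) for a, b in pairs) / len(pairs)
    return 1 - mean_overlap


# Placeholder topics: the first two share their top words, the third is disjoint.
toy_topics = [["a", "b", "c"], ["a", "b", "d"], ["x", "y", "z"]]
print(inverted_rbo(toy_topics))
```

Topics with no words in common give a score of 1, while topics that repeat the same top-ranked words pull the score down, since agreements at early ranks carry the most weight.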


# Visualize Topics

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(n_words=10,width=220, height=270, title="Topic Word Scores")

# Model serialization

In [None]:
# Save model
topic_model.save("my_model")



In [None]:
# Load the saved model through the BERTopic class
loaded_topic_model = BERTopic.load("my_model")
loaded_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,2151,0_دن_پس_جائیں_لوگ,"[دن, پس, جائیں, لوگ, کوئی, یقینا, پیدا, عذاب, ...",[پس اگر ہم تجھے یہاں سے لے بھی جائیں تو بھی ہم...
1,1,1948,1_اللہ_تعالی_واﻻ_کوئی,"[اللہ, تعالی, واﻻ, کوئی, لوگ, تمہارے, ایمان, ل...",[اس سے پہلے، لوگوں کو ہدایت کرنے والی بنا کر، ...
2,2,1536,2_رب_ایمان_لوگ_واﻻ,"[رب, ایمان, لوگ, واﻻ, لوگوں, پروردگار, پاس, اے...",[اے ہمارے رب! ہم نے سنا کہ منادی کرنے واﻻ بآوا...
3,3,569,3_السلام_علیہ_موسی_فرعون,"[السلام, علیہ, موسی, فرعون, قوم, اے, رب, میرے,...",[اور ہم نے موسی (علیہ السلام) کو اپنی نشانیاں ...
4,4,32,4_نعمت_جھٹلاؤ_رب_پس,"[نعمت, جھٹلاؤ, رب, پس, جنو, انسانو, اے, کون, پ...","[پس تم اپنے رب کی کس کس نعمت کو جھٹلاؤ گے؟, پس..."


# LDA

We use the [parallelized Latent Dirichlet Allocation (LDA)](https://radimrehurek.com/gensim/models/ldamulticore.html) implementation from Gensim.

Note: for LDA, the number of topics must be defined in advance.

In [None]:
n_topics = 5  # number of topics, fixed in advance
lda = LdaMulticore(corpus, id2word=id2word, random_state=42, num_topics=n_topics)
topics = lda.show_topics(num_topics=n_topics, formatted=False)

#Extract the words from the topics
topics_list = []
for _, topic_words in topics:
    words = [word for word, _ in topic_words]
    topics_list.append(words)

print(topics_list)

In [None]:
#CV Score
cm = CoherenceModel(topics=topics_list, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_v')
coherence_lda = round(cm.get_coherence(),2)
print('\nCV Score: ', coherence_lda)


CV Score:  0.46


In [None]:
#NPMI Score
cm = CoherenceModel(topics=topics_list, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_npmi')
coherence_lda = round(cm.get_coherence(),2)
print('\nNPMI Score: ', coherence_lda)


NPMI Score:  -0.01


In [None]:
# Calculate IRBO Score
inverted_rbo_calculator = InvertedRBO()
IRBO_LDA = round(inverted_rbo_calculator.irbo(topics_list, topk=10, weight=0.9),2)
print('\nIRBO Score: ',IRBO_LDA)


IRBO Score:  0.51


# NMF
We use the Gensim library's implementation of NMF.

Note: for NMF, the number of topics must be defined in advance.

In [None]:
#Using Gensim
from gensim import corpora, models


# Train NMF model
num_topics = 5  # Define the number of topics
nmf_model = models.Nmf(corpus, num_topics=num_topics, id2word=id2word, random_state=42)

# Extract topics
topics = []
for topic_id in range(num_topics):
    topic_words = nmf_model.show_topic(topic_id, topn=10)
    topic_words = [word for word, _ in topic_words]
    topics.append(topic_words)

# Print topics in the desired format
print(topics)

In [None]:
#Calculate CV Score
cm = CoherenceModel(topics=topics, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_v')
coherence_nmf = round(cm.get_coherence(),2)
print('\nCV Score: ', coherence_nmf)

In [None]:
#Calculate NPMI Score
cm = CoherenceModel(topics=topics, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_npmi')
coherence_nmf = round(cm.get_coherence(),2)
print('\nNPMI Score: ', coherence_nmf)

In [None]:
# Calculate IRBO Score
inverted_rbo_calculator = InvertedRBO()
IRBO_NMF = round(inverted_rbo_calculator.irbo(topics, topk=10, weight=0.9), 2)
print('\nIRBO Score: ',IRBO_NMF)