# Topic Modeling
This notebook demonstrates topic modelling with Gensim, a popular Python library. It loads a dataset from Hugging Face, fits Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) models, and compares their quality using the coherence score.



# Dataset
The dataset consists of Indian news headlines labelled with a category and a topic (https://huggingface.co/datasets/Nirmalt13/news_topicModelling). It contains over 11,000 headlines covering diverse topics.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
import pandas as pd

In [52]:
!pip install datasets


In [61]:
from datasets import load_dataset

ds = load_dataset("Nirmalt13/news_topicModelling")


In [62]:
ds

DatasetDict({
    train: Dataset({
        features: ['headline', 'category', 'topic'],
        num_rows: 11183
    })
})

In [63]:
ds['train']['headline'][20]

'Insects Top Newly Discovered Species List'

In [65]:
dataset_df = pd.DataFrame(ds['train'])

In [68]:
dataset_df.sample(10)

index,headline,category,topic
7358,Anderson Cooper Shares 'The Sad Fact' About Ma...,POLITICS,Politics - Gun Control
1403,New York Fashion Week 2012: Derek Lam Fall 201...,STYLE & BEAUTY,Entertainment - Fashion
6526,John Oliver Crafts Perfect Meme For Aging Rela...,ENTERTAINMENT,Entertainment - Social Media
6063,‘Quick Reaction Forces’ And The Lingering Myst...,POLITICS,Politics - National Security
6256,Travis Scott Says He'll 'Continue To Show Up' ...,ENTERTAINMENT,Entertainment - Music Festivals
1667,How To Be Organized: Cleaning Your Handbag,HOME & LIVING,Education - Productivity
4398,The Country's Most Expensive Hotel Rooms (PHOTOS),TRAVEL,Business - Luxury Hospitality
5523,Jennifer Lopez Breaks Down Over Oscars Snub In...,ENTERTAINMENT,Entertainment - Movies
5813,China Weighs Exit From 'Zero COVID' And The Ri...,WORLD NEWS,Politics - Public Health Policy
2596,"'It Gets Better' on Cable, with Commercials",QUEER VOICES,Entertainment - Television


# Dataset preprocessing

In [13]:
%%capture
!pip install -U gensim

In [14]:
!pip uninstall numpy
!pip install numpy==1.23.5


In [14]:
from gensim.utils import tokenize
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_tags,
    strip_punctuation,
    strip_numeric,
    remove_stopwords,
    strip_short,
)
from gensim.corpora.dictionary import Dictionary
from gensim import models

In [None]:
help(preprocess_string)

In [69]:
dataset_df['Clean_news'] = dataset_df['headline'].apply(preprocess_string)

In [70]:
dataset_df.sample(10)

index,headline,category,topic,Clean_news
10481,SC to examine acquitted man’s ‘right to be for...,explained,Technology - Cyber Law,"[examin, acquit, man’, ‘right, forgotten’, rig..."
1953,"'Glee' Bashed, 'This Means War' Crashed And Mo...",ENTERTAINMENT,Entertainment - Movie/TV Reviews,"[glee, bash, mean, war, crash, week, ouch]"
2762,Weaning And Depression Linked In Many Women,PARENTING,Education - Women's Health,"[wean, depress, link, women]"
9669,"ICMAI CMA 2024: June session results out, link...",education,Education - Examination Results,"[icmai, cma, june, session, result, link, icmai]"
4303,"Health Care Reform Could Get You A Raise, But ...",MONEY,Business - Employee Compensation,"[health, care, reform, rais, catch]"
9026,"Spain vs France Semi Final, EURO 2024 Highligh...",sports,Sports - EURO 2024,"[spain, franc, semi, final, euro, highlight, y..."
7687,International Criminal Court Ruling Brings Hop...,WORLD NEWS,Politics - International Relations,"[intern, crimin, court, rule, bring, hope, pal..."
4822,How To Cocktail: The French 75 (Video),FOOD & DRINK,Entertainment - Mixology,"[cocktail, french, video]"
6208,Rapper Drakeo The Ruler Fatally Stabbed At LA ...,ENTERTAINMENT,Entertainment - Music,"[rapper, drakeo, ruler, fatal, stab, music, fe..."
5430,The Races Where Democrats Are Rooting For Elec...,POLITICS,Politics - US Elections,"[race, democrat, root, elect, denier]"


In [72]:
# Custom filter list that omits stemming, so tokens stay as full words
filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]
dataset_df['Clean_news1'] = dataset_df['headline'].apply(lambda x: preprocess_string(x, filters))

In [73]:
dataset_df.sample(10)

index,headline,category,topic,Clean_news,Clean_news1
3209,A Viennese Ball For Good,TRAVEL,Entertainment - Charity Events,"[viennes, ball, good]","[viennese, ball, good]"
3351,STOP: Don't Believe Everything You Think,WELLNESS,Education - Mindfulness,"[stop, believ, think]","[stop, believe, think]"
8634,‘Moderating inflation aiding goods trade recov...,business,Business - International Trade,"[‘moder, inflat, aid, good, trade, recoveri, i...","[‘moderating, inflation, aiding, goods, trade,..."
5477,WATCH LIVE — A New Labor Movement: How Workers...,POLITICS,Business - Labor Unions,"[watch, live, new, labor, movement, worker, un...","[watch, live, new, labor, movement, workers, u..."
2352,New Study: Common Over-the-Counter Drugs May R...,WELLNESS,Education - Medical Research,"[new, studi, common, counter, drug, reduc, spr...","[new, study, common, counter, drugs, reduce, s..."
4841,The Regrets of the Young,WELLNESS,Entertainment - Movies,"[regret, young]","[regrets, young]"
145,Bella Santorum Hospitalized With Pneumonia: Is...,WELLNESS,Entertainment - Celebrity Health,"[bella, santorum, hospit, pneumonia, complic, ...","[bella, santorum, hospitalized, pneumonia, com..."
111,Hipster Freeze-Tag Brings Childhood Game To Ac...,COMEDY,Entertainment - Games,"[hipster, freez, tag, bring, childhood, game, ...","[hipster, freeze, tag, brings, childhood, game..."
9245,Buying Dali and Picasso in India,lifestyle,Entertainment - Art,"[bui, dali, picasso, india]","[buying, dali, picasso, india]"
10761,"As a new campus rises at an ancient site, the ...",explained,Education - Universities,"[new, campu, rise, ancient, site, stori, nalanda]","[new, campus, rises, ancient, site, story, nal..."
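
The samples above still contain tokens such as `‘moderating` and `man’`, which suggests that typographic (curly) quotes survive the punctuation filter. A minimal sketch of an extra filter follows. The name `strip_curly_quotes` is hypothetical and not part of Gensim, and the set of characters it handles is an assumption based on the headlines shown here.

```python
# Hypothetical helper (not a Gensim function): replace typographic quotes
# with spaces so that later filters see plain word boundaries.
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"  # left/right single and double quotes
_QUOTE_TABLE = str.maketrans({ch: " " for ch in CURLY_QUOTES})


def strip_curly_quotes(text: str) -> str:
    """Return text with curly single and double quotes replaced by spaces."""
    return text.translate(_QUOTE_TABLE)


sample = "\u2018Moderating inflation aiding goods trade\u2019"
print(strip_curly_quotes(sample).split())
```

Such a function could be placed near the start of the `filters` list. Because the quotes are replaced by spaces, a possessive like `man’s` would split into `man` and `s`, and a short-token filter could then drop the stray `s`.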


In [74]:
dataset_dictionary = Dictionary(dataset_df['Clean_news1'])

In [75]:
len(dataset_dictionary)

18526

In [76]:
print(dataset_dictionary.token2id)



In [77]:
# Create a bag-of-words corpus: one list of (token_id, count) pairs per headline
dataset_corpus_bow = [dataset_dictionary.doc2bow(text) for text in dataset_df['Clean_news1']]

In [78]:
len(dataset_corpus_bow)

11183

In [81]:
print(dataset_corpus_bow[1000])

[(69, 1), (352, 1), (1225, 1), (2496, 1), (3060, 1), (3061, 1), (3062, 1), (3063, 1), (3064, 2), (3065, 1)]
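
Each pair printed above is `(token_id, count)`. As an illustration of the idea (not Gensim's implementation), a bag-of-words encoding can be sketched in plain Python. The toy documents and the helper names `build_vocab` and `to_bow` are made up for this example, and the id assignment order is an assumption that need not match `Dictionary`.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of doc found in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [["india", "wins", "match"], ["india", "india", "election"]]
vocab = build_vocab(toy_docs)
print(to_bow(toy_docs[1], vocab))  # [(0, 2), (3, 1)]
```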


In [82]:
tfidf = models.TfidfModel(dataset_corpus_bow)
dataset_corpus_tfidf = tfidf[dataset_corpus_bow]

In [83]:
len(dataset_corpus_tfidf)

11183

In [85]:
print(dataset_corpus_tfidf[10])

[(58, 0.3655571011755978), (59, 0.3383761533539835), (60, 0.3655571011755978), (61, 0.3111952055323693), (62, 0.0990276886345307), (63, 0.24093347467943796), (64, 0.3655571011755978), (65, 0.284014257710755), (66, 0.2162833287202901), (67, 0.2793955351129638), (68, 0.3111952055323693), (69, 0.13251724433141182)]
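
The weights above are TF-IDF scores. The sketch below shows one common weighting scheme: raw term frequency times log2(N / document frequency), followed by L2 normalisation of each document vector. This is believed to mirror the defaults of `TfidfModel`, but that is an assumption worth checking against the Gensim documentation. The helper name `tfidf_weights` and the toy corpus are illustrative.

```python
import math


def tfidf_weights(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # Document frequency: number of documents containing each token id.
    df = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        norm = math.sqrt(sum(w * w for _, w in vec))
        if norm == 0.0:
            weighted.append([])  # every token occurs in all documents
            continue
        # Drop zero weights and scale the rest to unit Euclidean length.
        weighted.append([(tid, w / norm) for tid, w in vec if w > 0.0])
    return weighted


toy_bow = [[(0, 1), (1, 1)], [(0, 1), (2, 2)]]
for doc in tfidf_weights(toy_bow):
    print(doc)
```

In the toy corpus, token 0 occurs in both documents and therefore receives zero weight, which is how TF-IDF downweights terms that do not discriminate between documents.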


# Topic Modelling with Latent Dirichlet Allocation (LDA)

In [30]:
from gensim.models.ldamodel import LdaModel

In [86]:
lda_bow = LdaModel(dataset_corpus_bow, num_topics=20, id2word=dataset_dictionary, random_state=0)



In [87]:
lda_topics_bow = lda_bow.print_topics(num_words=8)
for topic in lda_topics_bow:
  print(topic)

(0, '0.032*"house" + 0.032*"probe" + 0.027*"calls" + 0.026*"issues" + 0.017*"officials" + 0.016*"special" + 0.016*"white" + 0.012*"gave"')
(1, '0.030*"dead" + 0.023*"accused" + 0.023*"like" + 0.023*"today" + 0.018*"survey" + 0.018*"list" + 0.017*"order" + 0.016*"leader"')
(2, '0.085*"case" + 0.030*"high" + 0.028*"state" + 0.019*"government" + 0.017*"china" + 0.017*"reveals" + 0.016*"water" + 0.015*"school"')
(3, '0.058*"police" + 0.032*"arrested" + 0.030*"president" + 0.025*"time" + 0.022*"global" + 0.021*"chief" + 0.020*"means" + 0.019*"union"')
(4, '0.025*"women" + 0.020*"party" + 0.017*"attack" + 0.016*"expert" + 0.016*"fight" + 0.016*"data" + 0.016*"hospital" + 0.015*"said"')
(5, '0.169*"india" + 0.022*"story" + 0.020*"win" + 0.019*"politics" + 0.017*"security" + 0.013*"new" + 0.013*"million" + 0.012*"food"')
(6, '0.043*"know" + 0.018*"need" + 0.017*"twitter" + 0.016*"hearing" + 0.016*"google" + 0.015*"users" + 0.015*"health" + 0.015*"finds"')
(7, '0.072*"indian" + 0.028*"news" + 0

In [88]:
lda_tfidf = LdaModel(dataset_corpus_tfidf, id2word=dataset_dictionary, num_topics=20)



In [89]:
lda_topics_tfidf = lda_tfidf.print_topics(num_words=8)
for topic in lda_topics_tfidf:
  print(topic)

(0, '0.011*"west" + 0.011*"issues" + 0.011*"news" + 0.010*"arrest" + 0.010*"announces" + 0.009*"capitol" + 0.009*"order" + 0.008*"leaders"')
(1, '0.013*"today" + 0.011*"calls" + 0.011*"global" + 0.010*"set" + 0.009*"shot" + 0.009*"said" + 0.008*"members" + 0.008*"explains"')
(2, '0.012*"space" + 0.012*"texas" + 0.010*"come" + 0.008*"kerala" + 0.007*"boost" + 0.007*"ceo" + 0.007*"control" + 0.007*"race"')
(3, '0.027*"police" + 0.018*"arrested" + 0.009*"claims" + 0.009*"journey" + 0.008*"movement" + 0.007*"road" + 0.006*"attacks" + 0.006*"lead"')
(4, '0.011*"class" + 0.010*"review" + 0.009*"market" + 0.009*"shooting" + 0.009*"action" + 0.008*"trial" + 0.008*"france" + 0.007*"east"')
(5, '0.026*"year" + 0.022*"years" + 0.016*"old" + 0.010*"right" + 0.009*"survey" + 0.008*"jobs" + 0.007*"schools" + 0.007*"games"')
(6, '0.015*"check" + 0.014*"government" + 0.013*"meet" + 0.012*"key" + 0.011*"record" + 0.011*"tech" + 0.007*"prince" + 0.007*"justice"')
(7, '0.013*"law" + 0.012*"china" + 0.010

# Topic Modelling with Latent Semantic Analysis/Indexing (LSA/LSI)

In [35]:
from gensim.models.lsimodel import LsiModel

In [90]:
lsi_bow = LsiModel(corpus=dataset_corpus_bow, id2word=dataset_dictionary, num_topics=20)



In [91]:
lsi_topics_bow = lsi_bow.print_topics(num_words=8)
for topic in lsi_topics_bow:
  print(topic)

(0, '0.875*"photos" + 0.249*"new" + 0.186*"week" + 0.171*"fashion" + 0.118*"day" + 0.112*"york" + 0.097*"video" + 0.096*"style"')
(1, '0.776*"new" + -0.372*"photos" + 0.252*"york" + 0.243*"fashion" + 0.241*"week" + 0.123*"fall" + 0.103*"video" + 0.083*"trump"')
(2, '0.616*"trump" + 0.429*"says" + 0.293*"day" + 0.251*"video" + 0.227*"biden" + -0.114*"fashion" + -0.112*"new" + -0.111*"week"')
(3, '-0.738*"video" + -0.430*"day" + 0.363*"trump" + 0.157*"says" + 0.131*"photos" + 0.104*"biden" + 0.103*"new" + -0.079*"valentine"')
(4, '-0.732*"day" + 0.563*"video" + 0.159*"trump" + -0.125*"valentine" + -0.114*"india" + -0.090*"look" + -0.086*"olympics" + -0.084*"paris"')
(5, '-0.706*"says" + 0.540*"trump" + 0.180*"day" + -0.173*"delhi" + -0.117*"india" + 0.108*"week" + 0.102*"fashion" + 0.095*"donald"')
(6, '-0.590*"week" + -0.500*"fashion" + 0.482*"new" + -0.217*"says" + -0.183*"fall" + -0.118*"york" + 0.103*"photos" + -0.098*"paris"')
(7, '-0.702*"india" + -0.346*"olympics" + -0.325*"paris"

In [92]:
lsi_tfidf = LsiModel(dataset_corpus_tfidf, id2word=dataset_dictionary, num_topics=20)



In [93]:
lsi_topics_tfidf = lsi_tfidf.print_topics(num_words=8)
for topic in lsi_topics_tfidf:
  print(topic)

(0, '-0.459*"photos" + -0.446*"week" + -0.401*"fashion" + -0.311*"new" + -0.283*"york" + -0.226*"fall" + -0.151*"day" + -0.119*"best"')
(1, '-0.518*"day" + -0.328*"photos" + 0.282*"fashion" + -0.269*"valentine" + 0.263*"week" + 0.228*"york" + 0.214*"new" + -0.181*"love"')
(2, '0.539*"photos" + -0.459*"day" + -0.244*"valentine" + -0.227*"new" + 0.207*"style" + 0.157*"best" + 0.151*"evolution" + -0.140*"york"')
(3, '0.342*"trump" + -0.340*"day" + 0.288*"biden" + 0.244*"says" + -0.197*"valentine" + 0.147*"covid" + 0.144*"joe" + 0.144*"new"')
(4, '-0.869*"love" + 0.245*"day" + -0.118*"wedding" + -0.113*"divorce" + -0.089*"chefs" + -0.081*"marriage" + 0.078*"trump" + 0.078*"photos"')
(5, '0.581*"best" + 0.349*"video" + -0.249*"photos" + 0.226*"like" + 0.194*"looks" + -0.169*"trump" + 0.168*"divorce" + -0.167*"style"')
(6, '0.525*"divorce" + -0.279*"best" + 0.276*"getting" + -0.254*"trump" + 0.242*"married" + -0.213*"biden" + 0.174*"women" + -0.174*"love"')
(7, '0.463*"video" + -0.433*"new" 

# Topic Modelling Visualization with pyLDAvis

In [40]:
%%capture
!pip install pyLDAvis

In [41]:
import pyLDAvis
import pyLDAvis.gensim_models

In [94]:
pyLDAvis.enable_notebook()

In [95]:
vis_bow = pyLDAvis.gensim_models.prepare(lda_bow, dataset_corpus_bow, dataset_dictionary)
vis_bow

In [96]:
vis_tfidf = pyLDAvis.gensim_models.prepare(lda_tfidf, dataset_corpus_tfidf, dataset_dictionary)
vis_tfidf

# Model evaluation for Topic Modelling

Topic coherence is a quantitative measure of topic quality: it captures how semantically similar a topic's top words are to each other, and therefore how interpretable the topic is to humans. Coherence is expressed as the sum of pairwise scores over the words w1, …, wn used to describe the topic. Coherence measures can be intrinsic (computed from the modelled corpus itself) or extrinsic (computed against an external reference corpus). In this notebook, two measures are computed with Gensim's CoherenceModel: u_mass, which is based on how often two words occur in the same document and is often quoted as ranging from -14 to 14 (although the LDA score below falls outside that interval), and c_v, which ranges from 0 to 1. For both measures, a higher value indicates more coherent topics.
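
To make the pairwise idea concrete, here is a minimal sketch of the UMass score for a single topic, following the formulation usually attributed to Mimno et al. (2011): for each ordered pair of top words, take log((D(wi, wj) + 1) / D(wj)), where D counts the documents containing the given words. The helper name `umass_topic` and the toy documents are illustrative. Gensim's `CoherenceModel` may differ in smoothing and aggregation details, so the numbers are not directly comparable with the scores in this notebook.

```python
import math


def umass_topic(top_words, docs):
    """UMass coherence of one topic: sum over pairs of log((D(wi, wj) + 1) / D(wj)).

    top_words is ordered from most to least probable, and each wj precedes wi.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            score += math.log((doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score


toy_docs = [
    ["police", "arrested", "suspect"],
    ["police", "arrested"],
    ["fashion", "week"],
    ["police", "fashion"],
]
print(umass_topic(["police", "arrested"], toy_docs))  # words that co-occur often
print(umass_topic(["police", "week"], toy_docs))      # words that never co-occur
```

The first pair scores higher than the second, reflecting that words which frequently share a document make for a more coherent topic under this measure.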

In [45]:
from gensim.models import CoherenceModel

In [97]:
cm_lda_bow_umass = CoherenceModel(model=lda_bow, texts=dataset_df['Clean_news1'], corpus=dataset_corpus_bow, coherence='u_mass')
cm_lda_bow_umass.get_coherence()

-17.90310838056634

In [98]:
cm_lsi_bow_umass = CoherenceModel(model=lsi_bow, texts=dataset_df['Clean_news1'], corpus=dataset_corpus_bow, coherence='u_mass')
cm_lsi_bow_umass.get_coherence()

-12.547275093658538

In [99]:
# Drop headlines left empty by preprocessing before computing c_v
texts = dataset_df['Clean_news1']
texts = [x for x in texts if x]

In [100]:
cm_lda_bow_cv = CoherenceModel(model=lda_bow, texts=texts, dictionary=dataset_dictionary, coherence='c_v')
cm_lda_bow_cv.get_coherence()

0.6353223372145674

In [101]:
cm_lsi_bow_cv = CoherenceModel(model=lsi_bow, texts=texts, dictionary=dataset_dictionary, coherence='c_v')
cm_lsi_bow_cv.get_coherence()

0.41613665171522146
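
For a side-by-side view, the four scores reported above can be collected into a small table. The sketch below only reuses the printed values (rounded to four decimals) and assumes that, for both measures, a higher value indicates more coherent topics. The helper name `best_by` is illustrative.

```python
# Coherence scores copied from the cell outputs above (rounded).
scores = {
    "LDA (BoW)": {"u_mass": -17.9031, "c_v": 0.6353},
    "LSI (BoW)": {"u_mass": -12.5473, "c_v": 0.4161},
}


def best_by(metric, table):
    """Return the model name with the highest value for the given metric."""
    return max(table, key=lambda name: table[name][metric])


for metric in ("u_mass", "c_v"):
    print(metric, "->", best_by(metric, scores))
```

Under that assumption the two measures disagree: u_mass favours the LSI model while c_v favours the LDA model, which is one reason to inspect the topics manually as well.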