# V1 Latent Dirichlet Allocation - Narasipal Topic Modelling
## Pipeline:
1. Case folding
2. Normalization
3. Bigram & trigram detection
4. Tokenization (w/ NLTK)
5. Stopword removal
6. Stemming
8. BoW
9. LDA
10. Coherence score test

- No named-entity removal (no NER step)
- Removed POS filtering
- Fine tuning alpha and eta values
- Added perplexity score

## 0. Installing Libraries & Packages

In [1]:
pip install --upgrade gensim



In [2]:
pip install pyldavis==3.2.1



In [3]:
pip install Sastrawi



In [4]:
pip install swifter



In [5]:
import pandas as pd
import numpy as np
import nltk
import spacy
import gensim
import pyLDAvis
import Sastrawi
import swifter
import string
import re

## 1. Data Pre-processing

In [6]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [7]:
df_titles = pd.read_excel('/content/drive/MyDrive/+Skripsi/Narasipal LDA/merged_titles.xlsx')
df_titles.head()

Unnamed: 0,judul_berita
0,Panas! 5.000 Roket Ditembakkan dari Gaza ke Is...
1,Militer Israel Mulai Operasi Skala Besar Peran...
2,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu..."
3,"Perang Hamas Vs Israel, Rusia Serukan Gencatan..."
4,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...


### 1. Case folding and punctuation removal

In [None]:
# Make sure every title is a string before using the .str accessor
df_titles['judul_berita'] = df_titles['judul_berita'].astype(str)

# Lowercase
df_titles['judul_berita'] = df_titles['judul_berita'].str.lower()

# Remove punctuation
df_titles['judul_berita'] = df_titles['judul_berita'].str.translate(str.maketrans('', '', string.punctuation))

# Remove numbers
df_titles['judul_berita'] = df_titles['judul_berita'].str.replace(r'\d+', '', regex=True)

# Remove single characters
df_titles['judul_berita'] = df_titles['judul_berita'].str.replace(r'\b[a-zA-Z]\b', '', regex=True)

# Collapse runs of whitespace last (the removals above leave gaps), then trim both ends
df_titles['judul_berita'] = df_titles['judul_berita'].str.replace(r'\s+', ' ', regex=True).str.strip()

df_titles.head()
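
The same cleaning steps can be checked on a single title without pandas. The sketch below is only an illustration: `clean_title` is a hypothetical helper name, and it collapses whitespace last so that the gaps left by removed digits and single characters do not survive.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_title(title: str) -> str:
    """Lowercase, strip punctuation, digits and single letters, tidy whitespace."""
    text = title.lower()
    text = text.translate(_PUNCT_TABLE)
    text = re.sub(r'\d+', '', text)            # remove numbers
    text = re.sub(r'\b[a-zA-Z]\b', '', text)   # remove single characters
    text = re.sub(r'\s+', ' ', text).strip()   # collapse and trim whitespace
    return text

sample = clean_title("Panas! 5.000 Roket Ditembakkan dari Gaza ke Israel")
print(sample)
```

Because `string.punctuation` includes `-`, hyphenated terms are fused rather than split, which is consistent with tokens such as `palestinaisrael` in the outputs of this notebook.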

### 2. Normalization

In [9]:
# Load normalized word excel
normalized_word = pd.read_excel('/content/drive/MyDrive/+Skripsi/Narasipal LDA/normalisasi.xlsx')

# Create normalized word dictionary (first column: term, second column: replacement)
normalized_word_dict = {}

# Strip stray whitespace from each term; if a term appears more than once, keep its first replacement
for _, row in normalized_word.iterrows():
    term = str(row.iloc[0]).strip()
    if term and term not in normalized_word_dict:
        normalized_word_dict[term] = str(row.iloc[1]).strip()

# Function for normalizing word
def normalized_term(title):
    for term, replacement in normalized_word_dict.items():
        title = re.sub(rf'\b{re.escape(term)}\b', f' {replacement} ', title)  # Add spaces around replacement
    return ' '.join(title.split())

df_titles['textdata_normalized'] = df_titles['judul_berita'].apply(normalized_term)
df_titles.head(10)

Unnamed: 0,judul_berita,textdata_normalized
0,panas 5000 roket ditembakkan dari gaza ke israel,panas 5000 roket ditembakkan dari gaza ke israel
1,militer israel mulai operasi skala besar peran...,militer israel mulai operasi skala besar peran...
2,perang hamas vs israel pecah rusia desak semua...,perang hamas versus israel pecah rusia desak s...
3,perang hamas vs israel rusia serukan gencatan ...,perang hamas versus israel rusia serukan genca...
4,6 fakta dampak hamas vs israel 40 orang tewasr...,6 fakta dampak hamas versus israel 40 orang te...
5,198 orang di gaza tewas akibat serangan balik ...,198 orang di gaza tewas akibat serangan balik ...
6,kbri amman soal hamas vs israel sejauh ini tak...,kbri amman soal hamas versus israel sejauh ini...
7,korban serangan hamas di israel lebih dari 200...,korban serangan hamas di israel lebih dari 200...
8,kemlu indonesia prihatin meningkatnya eskalasi...,kementerian luar negeri indonesia prihatin men...
9,prihatin ketegangan palestinaisrael china mint...,prihatin ketegangan palestinaisrael cina minta...
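
Looping over every dictionary entry for every title can be slow when the dictionary is large. One possible alternative is a single compiled pattern that replaces all terms in one pass. This is a sketch under assumptions: `build_normalizer` is a hypothetical helper, and the sample mapping is inferred from the output above rather than read from `normalisasi.xlsx`. Unlike the sequential loop, a single pass never re-normalizes text produced by an earlier replacement.

```python
import re

def build_normalizer(mapping):
    """Return a function that replaces whole-word terms using one regex pass."""
    # Longest terms first so multi-word entries win over their sub-terms
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(t) for t in terms) + r')\b')

    def normalize(text):
        replaced = pattern.sub(lambda m: mapping[m.group(0)], text)
        return ' '.join(replaced.split())  # tidy whitespace

    return normalize

# Sample entries only (assumed from the displayed output)
normalize = build_normalizer({
    "vs": "versus",
    "kemlu": "kementerian luar negeri",
    "china": "cina",
})
print(normalize("perang hamas vs israel"))
```

In the notebook this would be applied with `df_titles['judul_berita'].apply(normalize)` after building the normalizer from `normalized_word_dict`.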


#### Remove '-'

In [10]:
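# Replace '-' with ' '. The punctuation step already strips hyphens from the titles,
# so this only affects hyphens introduced by the normalization replacements.
df_titles['textdata_normalized'] = df_titles['textdata_normalized'].str.replace('-', ' ', regex=False)
df_titles.head()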

Unnamed: 0,judul_berita,textdata_normalized
0,panas 5000 roket ditembakkan dari gaza ke israel,panas roket ditembakkan dari gaza ke israel
1,militer israel mulai operasi skala besar peran...,militer israel mulai operasi skala besar peran...
2,perang hamas vs israel pecah rusia desak semua...,perang hamas versus israel pecah rusia desak s...
3,perang hamas vs israel rusia serukan gencatan ...,perang hamas versus israel rusia serukan genca...
4,6 fakta dampak hamas vs israel 40 orang tewasr...,fakta dampak hamas versus israel orang tewasrs...


### 3. Bigram & trigram detection
Parameters:
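- `min_count` : **FREQUENCY FILTER** Words and word pairs whose total count is below this value are ignored, so a pair must occur at least this often before it can become a bigram/trigram
- `threshold` : **STRENGTH OF ASSOCIATION** A pair is joined into a phrase only if its score exceeds this value. With gensim's default scorer the score is a PMI-like co-occurrence ratio (Mikolov et al., 2013); normalized PMI is available through `scoring='npmi'`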
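
To make the two parameters concrete, the sketch below reimplements the formula used by gensim's default scorer (`gensim.models.phrases.original_scorer`) in plain Python. The function name `default_phrase_score` and all counts are made up for illustration; they are not taken from this corpus.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate pair: (count_ab - min_count) / count_a / count_b * vocab_size."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Hypothetical counts: word a seen 40 times, word b 50 times, the pair 30 times
score = default_phrase_score(count_a=40, count_b=50, count_ab=30,
                             vocab_size=1000, min_count=10)

# A pair becomes a phrase only when its score exceeds the threshold
accepted_at_20 = score > 20
accepted_at_5 = score > 5
print(score, accepted_at_20, accepted_at_5)
```

Raising `min_count` lowers every score and drops rare pairs entirely, while raising `threshold` keeps only pairs that co-occur much more often than their individual frequencies would suggest.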

In [11]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Convert normalized text into tokenized lists
df_titles['tokenized'] = df_titles['textdata_normalized'].apply(lambda x: x.split())

# Train bigram model
bigram = Phrases(df_titles['tokenized'], min_count=10, threshold=20)
bigram_phraser = Phraser(bigram)  # Optimized for faster processing

# Train trigram model on bigram-transformed data
trigram = Phrases(bigram_phraser[df_titles['tokenized']], min_count=5, threshold=10)
trigram_phraser = Phraser(trigram)

# Apply the models to detect bigrams and trigrams
df_titles['bigrams'] = df_titles['tokenized'].apply(lambda x: bigram_phraser[x])
df_titles['trigrams'] = df_titles['bigrams'].apply(lambda x: trigram_phraser[x])

# View the processed output
df_titles[['textdata_normalized', 'bigrams', 'trigrams']].head(10)

Unnamed: 0,textdata_normalized,bigrams,trigrams
0,panas roket ditembakkan dari gaza ke israel,"[panas, roket, ditembakkan, dari, gaza, ke, is...","[panas, roket, ditembakkan_dari, gaza, ke, isr..."
1,militer israel mulai operasi skala besar peran...,"[militer, israel, mulai, operasi, skala, besar...","[militer, israel, mulai, operasi, skala_besar,..."
2,perang hamas versus israel pecah rusia desak s...,"[perang, hamas, versus, israel, pecah, rusia, ...","[perang, hamas_versus, israel, pecah, rusia, d..."
3,perang hamas versus israel rusia serukan genca...,"[perang, hamas, versus, israel, rusia, serukan...","[perang, hamas_versus, israel, rusia, serukan_..."
4,fakta dampak hamas versus israel orang tewasrs...,"[fakta, dampak, hamas, versus, israel, orang, ...","[fakta, dampak, hamas_versus, israel, orang, t..."
5,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat_serangan, bali...","[orang, di, gaza, tewas, akibat_serangan, bali..."
6,kbri amman soal hamas versus israel sejauh ini...,"[kbri, amman, soal, hamas, versus, israel, sej...","[kbri, amman, soal, hamas_versus, israel, seja..."
7,korban serangan hamas di israel lebih dari ora...,"[korban, serangan, hamas, di, israel, lebih_da...","[korban_serangan, hamas, di, israel, lebih_dar..."
8,kementerian luar negeri indonesia prihatin men...,"[kementerian_luar, negeri, indonesia, prihatin...","[kementerian_luar_negeri, indonesia, prihatin,..."
9,prihatin ketegangan palestinaisrael cina minta...,"[prihatin, ketegangan, palestinaisrael, cina, ...","[prihatin, ketegangan, palestinaisrael, cina, ..."


In [12]:
# Extract phrases from the trained bigram model
detected_bigrams = bigram.export_phrases()

# Extract phrases from the trained trigram model
detected_trigrams = trigram.export_phrases()

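# In gensim 4, export_phrases() returns a {phrase: score} dict, so iterating yields the phrases;
# keep them as strings (decode defensively if a phrase is bytes)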
detected_bigrams = [phrase if isinstance(phrase, str) else phrase.decode("utf-8") for phrase in detected_bigrams]
detected_trigrams = [phrase if isinstance(phrase, str) else phrase.decode("utf-8") for phrase in detected_trigrams]

In [13]:
detected_bigrams[:50]

['gencatan_senjata',
 'akibat_serangan',
 'tak_ada',
 'lebih_dari',
 'orang_tewas',
 'kementerian_luar',
 'luar_negeri',
 'konflik_palestinaisrael',
 'dalam_jam',
 'baku_tembak',
 'minta_maaf',
 'amerika_serikat',
 'tel_aviv',
 'uni_eropa',
 'rumah_sakit',
 'jadi_sasaran',
 'iron_dome',
 'perang_hamasisrael',
 'fadli_zon',
 'evakuasi_wni',
 'tepi_barat',
 'keluar_dari',
 'harus_dihentikan',
 'kapal_induk',
 'apa_itu',
 'festival_musik',
 'ketum_pbnu',
 'gus_yahya',
 'tembak_mati',
 'situasi_terkini',
 'liga_arab',
 'balas_dendam',
 'sekjen_pbb',
 'bantuan_kemanusiaan',
 'kamp_pengungsi',
 'jet_tempur',
 'tak_boleh',
 'turun_tangan',
 'new_york',
 'gal_gadot',
 'perang_israelhamas',
 'lancarkan_serangan',
 'putra_mahkota',
 'hari_ini',
 'unjuk_rasa',
 'terus_gempur',
 'korban_jiwa',
 'warga_sipil',
 'depan_kedubes',
 'kedubes_amerika']

In [14]:
detected_trigrams[:50]

['ditembakkan_dari',
 'skala_besar',
 'hamas_versus',
 'semua_pihak',
 'gencatan_senjata',
 'serukan_gencatan_senjata',
 'akibat_serangan',
 'tak_ada',
 'jadi_korban',
 'korban_serangan',
 'lebih_dari',
 'orang_tewas',
 'kementerian_luar',
 'kementerian_luar_negeri',
 'konflik_palestinaisrael',
 'tewas_ditembak',
 'ditembak_tentara',
 'dalam_jam',
 'detikdetik_rudal',
 'baku_tembak',
 'bertambah_jadi',
 'minta_maaf',
 'amerika_serikat',
 'amerika_serikat_kirim',
 'kirim_kapal',
 'pesawat_tempur',
 'klaim_berhasil',
 'kok_bisa',
 'perang_lawan',
 'uni_eropa',
 'presiden_iran',
 'rumah_sakit',
 'sejarah_rumah_sakit',
 'rumah_sakit_indonesia',
 'jadi_sasaran',
 'iron_dome',
 'negara_muslim',
 'perang_hamasisrael',
 'evakuasi_wni',
 'tepi_barat',
 'di_tepi_barat',
 'imbau_wni',
 'keluar_dari',
 'harus_dihentikan',
 'jalur_gaza',
 'kapal_induk',
 'hancur_digempur',
 'hentikan_kekerasan',
 'pertahanan_udara',
 'kutuk_serangan']

### 4. Tokenization
With NLTK's `punkt_tab` tokenizer data

In [15]:
from nltk.tokenize import word_tokenize

In [16]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [17]:
df_titles['tokens'] = df_titles['trigrams'].apply(lambda x: word_tokenize(" ".join(x)))
df_titles[['textdata_normalized', 'trigrams', 'tokens']].head(20)

Unnamed: 0,textdata_normalized,trigrams,tokens
0,panas roket ditembakkan dari gaza ke israel,"[panas, roket, ditembakkan_dari, gaza, ke, isr...","[panas, roket, ditembakkan_dari, gaza, ke, isr..."
1,militer israel mulai operasi skala besar peran...,"[militer, israel, mulai, operasi, skala_besar,...","[militer, israel, mulai, operasi, skala_besar,..."
2,perang hamas versus israel pecah rusia desak s...,"[perang, hamas_versus, israel, pecah, rusia, d...","[perang, hamas_versus, israel, pecah, rusia, d..."
3,perang hamas versus israel rusia serukan genca...,"[perang, hamas_versus, israel, rusia, serukan_...","[perang, hamas_versus, israel, rusia, serukan_..."
4,fakta dampak hamas versus israel orang tewasrs...,"[fakta, dampak, hamas_versus, israel, orang, t...","[fakta, dampak, hamas_versus, israel, orang, t..."
5,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat_serangan, bali...","[orang, di, gaza, tewas, akibat_serangan, bali..."
6,kbri amman soal hamas versus israel sejauh ini...,"[kbri, amman, soal, hamas_versus, israel, seja...","[kbri, amman, soal, hamas_versus, israel, seja..."
7,korban serangan hamas di israel lebih dari ora...,"[korban_serangan, hamas, di, israel, lebih_dar...","[korban_serangan, hamas, di, israel, lebih_dar..."
8,kementerian luar negeri indonesia prihatin men...,"[kementerian_luar_negeri, indonesia, prihatin,...","[kementerian_luar_negeri, indonesia, prihatin,..."
9,prihatin ketegangan palestinaisrael cina minta...,"[prihatin, ketegangan, palestinaisrael, cina, ...","[prihatin, ketegangan, palestinaisrael, cina, ..."


### 5. Stop word removal

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
from nltk.corpus import stopwords

list_stopwords = stopwords.words('indonesian')
# Tokens are single words or underscore-joined phrases, so the phrase is listed as 'ada_apa'
list_stopwords.extend(['bikin', 'masuk', 'gegara', 'update', 'puluhan', 'detik', 'potret', 'foto', 'ada_apa'])
list_stopwords = set(list_stopwords)

def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]

# nsw = no stop words
df_titles['textdata_tokens_nsw'] = df_titles['tokens'].apply(stopwords_removal)
df_titles.head()

Unnamed: 0,judul_berita,textdata_normalized,tokenized,bigrams,trigrams,tokens,textdata_tokens_nsw
0,panas 5000 roket ditembakkan dari gaza ke israel,panas roket ditembakkan dari gaza ke israel,"[panas, roket, ditembakkan, dari, gaza, ke, is...","[panas, roket, ditembakkan, dari, gaza, ke, is...","[panas, roket, ditembakkan_dari, gaza, ke, isr...","[panas, roket, ditembakkan_dari, gaza, ke, isr...","[panas, roket, ditembakkan_dari, gaza, israel]"
1,militer israel mulai operasi skala besar peran...,militer israel mulai operasi skala besar peran...,"[militer, israel, mulai, operasi, skala, besar...","[militer, israel, mulai, operasi, skala, besar...","[militer, israel, mulai, operasi, skala_besar,...","[militer, israel, mulai, operasi, skala_besar,...","[militer, israel, operasi, skala_besar, perang..."
2,perang hamas vs israel pecah rusia desak semua...,perang hamas versus israel pecah rusia desak s...,"[perang, hamas, versus, israel, pecah, rusia, ...","[perang, hamas, versus, israel, pecah, rusia, ...","[perang, hamas_versus, israel, pecah, rusia, d...","[perang, hamas_versus, israel, pecah, rusia, d...","[perang, hamas_versus, israel, pecah, rusia, d..."
3,perang hamas vs israel rusia serukan gencatan ...,perang hamas versus israel rusia serukan genca...,"[perang, hamas, versus, israel, rusia, serukan...","[perang, hamas, versus, israel, rusia, serukan...","[perang, hamas_versus, israel, rusia, serukan_...","[perang, hamas_versus, israel, rusia, serukan_...","[perang, hamas_versus, israel, rusia, serukan_..."
4,6 fakta dampak hamas vs israel 40 orang tewasr...,fakta dampak hamas versus israel orang tewasrs...,"[fakta, dampak, hamas, versus, israel, orang, ...","[fakta, dampak, hamas, versus, israel, orang, ...","[fakta, dampak, hamas_versus, israel, orang, t...","[fakta, dampak, hamas_versus, israel, orang, t...","[fakta, dampak, hamas_versus, israel, orang, t..."


### 6. Stemming
Takes about 10 minutes without a GPU, 7 minutes with a GPU

In [20]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()

custom_words = {
    "pasukan": "pasukan",
    "bantu": "bantuan",
    "bantuan": "bantuan",
    "hizbullah": "hizbullah",
    "pengungsi":"pengungsi",
    "pengungsian":"pengungsi",
    "bombardir": "bom",
    "akui": "akui"
}

def stem_words(words):
    return [custom_words[word] if word in custom_words else stemmer.stem(word) for word in words]

df_titles['textdata_tokens_stemmed'] = df_titles['textdata_tokens_nsw'].apply(stem_words)
df_titles[['textdata_tokens_nsw', 'textdata_tokens_stemmed']].head()


Unnamed: 0,textdata_tokens_nsw,textdata_tokens_stemmed
0,"[panas, roket, ditembakkan_dari, gaza, israel]","[panas, roket, tembak dari, gaza, israel]"
1,"[militer, israel, operasi, skala_besar, perang...","[militer, israel, operasi, skala besar, rang, ..."
2,"[perang, hamas_versus, israel, pecah, rusia, d...","[perang, hamas versus, israel, pecah, rusia, d..."
3,"[perang, hamas_versus, israel, rusia, serukan_...","[perang, hamas versus, israel, rusia, seru gen..."
4,"[fakta, dampak, hamas_versus, israel, orang, t...","[fakta, dampak, hamas versus, israel, orang, t..."
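
In the output above, phrase tokens lose their underscore during stemming (`ditembakkan_dari` becomes `tembak dari`). If the delimiter should survive, one option is to stem each part separately and rejoin the parts. The sketch below is only an illustration: `stem_phrase` is a hypothetical helper, and `toy_stem` is a crude stand-in for `stemmer.stem` so the block runs without Sastrawi.

```python
def stem_phrase(token, stem_fn, overrides=None, delimiter="_"):
    """Stem each part of a delimiter-joined phrase and rejoin the parts."""
    overrides = overrides or {}
    parts = token.split(delimiter)
    stemmed = [overrides[p] if p in overrides else stem_fn(p) for p in parts]
    return delimiter.join(stemmed)

def toy_stem(word):
    """Placeholder stemmer (NOT Sastrawi): drops a 'di' prefix and a 'kan' suffix."""
    if word.startswith("di"):
        word = word[2:]
    if word.endswith("kan"):
        word = word[:-3]
    return word

tokens = ["panas", "roket", "ditembakkan_dari", "gaza", "bantu"]
stemmed_tokens = [stem_phrase(t, toy_stem, {"bantu": "bantuan"}) for t in tokens]
print(stemmed_tokens)
```

In the notebook, `stem_fn` would be `stemmer.stem` and `overrides` would be `custom_words`. Whether to keep the underscore is a design choice; the dictionary treats `tembak dari` and `tembak_dari` alike as a single token.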


### 8. Bag of Words for LDA

In [21]:
import pickle

from gensim.corpora import Dictionary

# Create a dictionary from tokenized text
dictionary = Dictionary(df_titles["textdata_tokens_stemmed"])

# Convert tokenized text into a Bag of Words representation
corpus = [dictionary.doc2bow(text) for text in df_titles["textdata_tokens_stemmed"]]

# Save dictionary & corpus for future use
dictionary.save("lda_dictionary.dict")
with open("lda_corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)

print(f"Dictionary size: {len(dictionary)} unique tokens")
print(f"Example BoW for first document: {corpus[0]}")

Dictionary size: 9288 unique tokens
Example BoW for first document: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
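
Each pair in the Bag of Words output is `(token_id, count)`. The plain-Python sketch below mimics the idea on two toy documents; it is a simplified illustration with hypothetical helper names (`build_vocab`, `to_bow`), and the ids it assigns will not necessarily match the ones gensim's `Dictionary` produces.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["israel", "gaza", "israel"], ["gaza", "perang"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_corpus)
```

Word order is discarded at this point; LDA only sees how often each token occurs in each title.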


## 2. LDA Model

### Fine-tuning alpha and eta

In [26]:
alpha_values = ["symmetric", "asymmetric"]
eta_values = [0.01, 0.1]
num_topics_range = range(5, 7, 1)
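
For reference, the two `alpha` settings differ in how prior weight is spread over topics. According to the gensim documentation, `"symmetric"` uses `1.0 / num_topics` for every topic, and `"asymmetric"` uses a normalized `1.0 / (topic_index + sqrt(num_topics))`. The sketch below reproduces both in plain Python; the function names are hypothetical and the values are only meant to show the shape of each prior.

```python
import math

def symmetric_alpha(num_topics):
    """Every topic gets the same prior weight."""
    return [1.0 / num_topics] * num_topics

def asymmetric_alpha(num_topics):
    """Earlier topics get more prior weight; weights are normalized to sum to 1."""
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]

sym = symmetric_alpha(6)
asym = asymmetric_alpha(6)
print([round(a, 3) for a in sym])
print([round(a, 3) for a in asym])
```

With an asymmetric prior, the first topics are expected to absorb more of each document, which can help when a few themes dominate the corpus.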

In [27]:
from gensim.models import LdaModel, CoherenceModel
import itertools

best_score = 0
best_params = {}

for num_topics, alpha, eta in itertools.product(num_topics_range, alpha_values, eta_values):
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        random_state=42,
        passes=10,
        chunksize=100,
        alpha=alpha,
        eta=eta,
    )

    coherence_model = CoherenceModel(model=model, texts=df_titles["textdata_tokens_stemmed"], dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()

    print(f"Topics: {num_topics}, Alpha: {alpha}, Eta: {eta} → Coherence: {coherence_score}")

    if coherence_score > best_score:
        best_score = coherence_score
        best_params = {"num_topics": num_topics, "alpha": alpha, "eta": eta}

print(f"\n🎯 Best Model → Topics: {best_params['num_topics']}, Alpha: {best_params['alpha']}, Eta: {best_params['eta']} with Coherence: {best_score}")

KeyboardInterrupt: 

### Final LDA model

In [28]:
from gensim.models import LdaModel

num_topics = 6

# Train LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    chunksize=100,
    alpha="asymmetric",
    eta=0.01,
)

# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Save model for later use
lda_model.save("lda_model")

Topic 0: 0.213*"israel" + 0.125*"warga" + 0.082*"palestina" + 0.067*"amerika serikat" + 0.052*"as" + 0.044*"iran" + 0.022*"tentara" + 0.020*"senjata" + 0.019*"tewas" + 0.019*"detikdetik"
Topic 1: 0.205*"perang" + 0.191*"gaza" + 0.184*"israel" + 0.094*"pbb" + 0.071*"indonesia" + 0.053*"dunia" + 0.034*"orang tewas" + 0.020*"di tepi barat" + 0.019*"desak" + 0.013*"hantam"
Topic 2: 0.581*"palestina" + 0.108*"dukung" + 0.041*"merdeka" + 0.039*"gempur" + 0.034*"konflik" + 0.029*"aksi" + 0.026*"bebas" + 0.023*"jalan" + 0.017*"penuh" + 0.016*"keluarga"
Topic 3: 0.229*"israel" + 0.213*"serang" + 0.056*"bantuan" + 0.039*"militer" + 0.037*"tampak" + 0.029*"presiden" + 0.028*"lawan" + 0.023*"jalur gaza" + 0.023*"tahan" + 0.019*"roket"
Topic 4: 0.554*"israel" + 0.323*"hamas" + 0.052*"panas" + 0.047*"versus" + 0.005*"picu" + 0.004*"harga" + 0.000*"statement" + 0.000*"hawa" + 0.000*"audiensi" + 0.000*"berangan"
Topic 5: 0.615*"gaza" + 0.257*"israel" + 0.073*"tewas" + 0.026*"orang" + 0.011*"ribu" + 0.

In [29]:
from collections import Counter

topics = lda_model.show_topics(formatted=False)
data_flat = [w for w_list in df_titles['textdata_tokens_stemmed'] for w in w_list]
counter = Counter(data_flat)

out = []
for i, topic in topics:
    for word, weight in topic:
        out.append([word, i , weight, counter[word]])

df_imp_wcount = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])
df_imp_wcount

Unnamed: 0,word,topic_id,importance,word_count
0,israel,0,0.213181,11621
1,warga,0,0.124585,2365
2,palestina,0,0.081518,7915
3,amerika serikat,0,0.067218,666
4,as,0,0.052194,609
5,iran,0,0.044125,496
6,tentara,0,0.022268,450
7,senjata,0,0.019866,229
8,tewas,0,0.019324,829
9,detikdetik,0,0.018901,203


#### Download word-topic list

In [None]:
file_path = "/content/drive/MyDrive/+Skripsi/Narasipal LDA/it5_df_imp_wcount.xlsx"
df_imp_wcount.to_excel(file_path, index=False)
print(f"File saved to: {file_path}")

File saved to: /content/drive/MyDrive/+Skripsi/Narasipal LDA/it4_df_imp_wcount.xlsx


## 3. Coherence Score Testing
Takes about 6 minutes with a GPU

In [31]:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(
    model=lda_model,
    texts=df_titles["textdata_tokens_stemmed"],
    dictionary=dictionary,
    coherence="c_v",
)

coherence_score = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_score}")

Coherence Score: 0.4198681802533353


In [32]:
import numpy as np

def compute_coherence_values(dictionary, corpus, texts, start=2, limit=12, step=1):
    coherence_values = []
    for num_topics in range(start, limit, step):
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            chunksize=100,
            alpha="auto",
            eta="auto",
        )
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence="c_v")
        coherence_values.append((num_topics, coherence_model.get_coherence()))
    return coherence_values

# Run coherence test
coherence_scores = compute_coherence_values(dictionary, corpus, df_titles["textdata_tokens_stemmed"])

# Print results
for num_topics, score in coherence_scores:
    print(f"Num Topics: {num_topics}, Coherence Score: {score}")

# Find best number of topics
best_num_topics = max(coherence_scores, key=lambda x: x[1])[0]
print(f"Best number of topics: {best_num_topics}")




KeyboardInterrupt: 

## 4. Frequency Test

In [30]:
from collections import Counter

all_words = [word for tokens in df_titles['textdata_tokens_stemmed'] for word in tokens]
word_freq = Counter(all_words)
top_200_words = word_freq.most_common(200)

for word, freq in top_200_words:
    print(f"{word}: {freq}")

israel: 11621
palestina: 7915
gaza: 7544
warga: 2365
hamas: 2185
serang: 2064
perang: 1178
indonesia: 1140
netanyahu: 997
dukung: 962
pbb: 835
tewas: 829
bantuan: 706
rafah: 672
amerika serikat: 666
as: 609
bom: 528
iran: 496
biden: 485
negara: 458
tentara: 450
bunuh: 429
desak: 426
militer: 421
dunia: 401
temu: 382
menteri luar negeri: 380
jalur gaza: 378
jokowi: 370
hizbullah: 353
henti: 349
tolak: 344
prabowo: 341
bela: 340
presiden: 339
di tepi barat: 334
gencat senjata: 331
ancam: 327
kecam: 305
genosida: 305
perintah: 300
mesir: 300
seru: 299
orang tewas: 297
anak: 297
tahan: 296
hancur: 296
lebanon: 280
klaim: 278
pasukan: 276
pengungsi: 272
anakanak: 269
konflik: 267
orang: 259
lawan: 248
gempur: 247
damai: 247
tembak: 245
merdeka: 244
pm: 240
momen: 238
korban tewas: 230
senjata: 229
houthi: 223
sandera: 221
menteri luar negeri retno: 213
korban: 212
rumah sakit: 212
aksi: 211
bebas: 208
rudal: 206
cina: 205
bahas: 205
tampak: 204
detikdetik: 203
unrwa: 199
batas: 198
kirim: 1

## 5. Perplexity Score
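
No code was recorded for this step. A minimal sketch of how the score could be computed, assuming gensim's `LdaModel.log_perplexity`, which returns a per-word likelihood bound rather than the perplexity itself: the perplexity is then `2 ** (-bound)`. The helper name `perplexity_from_bound` is hypothetical, and the bound used in the example is a made-up value, not a result from this model.

```python
def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert a per-word likelihood bound (log base 2) into perplexity."""
    return 2.0 ** (-per_word_bound)

# In the notebook (assumes lda_model and corpus from the cells above):
# bound = lda_model.log_perplexity(corpus)
# print(f"Per-word bound: {bound}, Perplexity: {perplexity_from_bound(bound)}")

# Made-up bound, for illustration only
example_perplexity = perplexity_from_bound(-8.0)
print(example_perplexity)
```

Lower perplexity indicates a better fit, but scoring the training corpus gives an optimistic figure; a held-out split is the more informative comparison.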