# Narasipal Latent Dirichlet Allocation
1. Installing libraries & packages
2. Data preprocessing
3. LDA model
4. Model evaluation
5. LDA Visualization


## 1. Installing libraries & packages

```pip install -r requirements.txt```

In [1]:
import pandas as pd
import numpy as np
import nltk
import spacy
import gensim
import Sastrawi
import swifter
import string
import re



## 2. Data preprocessing

The dataset used from here on is a subset of the original dataset, restricted to news headlines of at least nine words.

This filtering follows Çano and Morisio (2019), who in their sentiment analysis research removed reviews shorter than five tokens from their dataset in order to eliminate meaningless input and noise.
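
The length filter described above can be sketched as follows. This is a minimal illustration, not the code used to build `dataset_gte9.xlsx`: the helper `keep_long_titles`, the `min_words` parameter, and the sample titles are assumptions, and words are counted by simple whitespace splitting.

```python
def keep_long_titles(titles, min_words=9):
    """Keep only titles with at least `min_words` whitespace-separated words."""
    return [title for title in titles if len(title.split()) >= min_words]


# Hypothetical sample titles, for illustration only
sample_titles = [
    "satu dua tiga empat lima enam tujuh delapan sembilan",  # 9 words, kept
    "judul yang terlalu pendek",                              # 4 words, dropped
]
filtered_titles = keep_long_titles(sample_titles)
print(len(filtered_titles))
```

The same rule could be applied to a DataFrame column by counting the words in each title and keeping the rows that reach the minimum.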

In [2]:
df_titles = pd.read_excel('/Users/salmadanu/Desktop/Skripsi/skripsi-env/narasipal/lda_topic_modeling/dataset_gte9.xlsx')
df_titles.head()

Unnamed: 0,judul_berita
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu..."
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...
3,KBRI Amman soal Hamas Vs Israel: Sejauh Ini Ta...
4,Korban Serangan Hamas di Israel: Lebih dari 20...


### 1. Case folding & punctuation removal

In [3]:
# Replace hyphens with spaces first so hyphenated words are split, not fused
df_titles['textdata_cfpr'] = df_titles['judul_berita'].str.replace('-', ' ')
# Case folding
df_titles['textdata_cfpr'] = df_titles['textdata_cfpr'].str.lower()
# Strip ASCII punctuation, then remove digits
df_titles['textdata_cfpr'] = df_titles['textdata_cfpr'].str.translate(str.maketrans('', '', string.punctuation))
df_titles['textdata_cfpr'] = df_titles['textdata_cfpr'].str.replace(r'\d+', '', regex=True)
df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...


### 2. Normalization

In [4]:
normalized_word = pd.read_excel('/Users/salmadanu/Desktop/Skripsi/skripsi-env/narasipal/lda_topic_modeling/normalisasi_lda.xlsx')

# Map each raw term (first column) to its normalized form (second column)
normalized_word_dict = {
    term.strip(): replacement
    for term, replacement in zip(normalized_word.iloc[:, 0], normalized_word.iloc[:, 1])
}

def normalized_term(title):
    # Expand common abbreviations: "as" -> "amerika serikat", "ri" -> "indonesia"
    title = re.sub(r'\b[Aa][Ss]\b', 'amerika serikat', title)
    title = re.sub(r'\b[Rr][Ii]\b', 'indonesia', title)

    # Replace every whole-word occurrence of a dictionary term
    for term, replacement in normalized_word_dict.items():
        title = re.sub(rf'\b{re.escape(term)}\b', f' {replacement} ', title)

    # Collapse the extra whitespace introduced by the replacements
    return ' '.join(title.split())

df_titles['textdata_normalized'] = df_titles['textdata_cfpr'].apply(normalized_term)
df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr,textdata_normalized
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...,perang hamas vs israel pecah rusia desak semua...
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...,fakta dampak hamas vs israel orang tewas rumah...
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...,orang di gaza tewas akibat serangan balik isra...


### 3. Tokenizing

In [5]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/salmadanu/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [6]:
df_titles['tokenized'] = df_titles['textdata_normalized'].apply(lambda x: x.split())
df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr,textdata_normalized,tokenized
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...,perang hamas vs israel pecah rusia desak semua...,"[perang, hamas, vs, israel, pecah, rusia, desa..."
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...,fakta dampak hamas vs israel orang tewas rumah...,"[fakta, dampak, hamas, vs, israel, orang, tewa..."
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat, serangan, bal..."


### 4. Stopword removal

In [7]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/salmadanu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords

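list_stopwords = stopwords.words('indonesian')
# Add corpus-specific stopwords.
# Note: 'ada apa' contains a space, so it can never equal a single token produced by split().
list_stopwords.extend(['bikin', 'masuk', 'gegara', 'update', 'puluhan', 'detik', 'potret', 'foto','ada apa', 'vs', 'versus', 'israel', 'gaza', 'palestina', 'monas'])
list_stopwords = set(list_stopwords)  # set gives O(1) membership checks
list_stopwords.discard("luar")  # keep "luar" in the text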

def stopwords_removal(words):
    return [word for word in words if word not in list_stopwords]

# nsw = no stop words
df_titles['textdata_nsw'] = df_titles['tokenized'].apply(stopwords_removal)
df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr,textdata_normalized,tokenized,textdata_nsw
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...,perang hamas vs israel pecah rusia desak semua...,"[perang, hamas, vs, israel, pecah, rusia, desa...","[perang, hamas, pecah, rusia, desak, menahan]"
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...,fakta dampak hamas vs israel orang tewas rumah...,"[fakta, dampak, hamas, vs, israel, orang, tewa...","[fakta, dampak, hamas, orang, tewas, rumah, sa..."
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat, serangan, bal...","[orang, tewas, akibat, serangan, ribuan, terluka]"


### 5. Bigram & trigram detection
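Parameters:
- `min_count` : **FREQUENCY FILTER** Words and word pairs whose total count is below this value are ignored when forming bigrams/trigrams
- `threshold` : **STRENGTH OF ASSOCIATION** A pair is merged into a phrase only when its score exceeds this value. With gensim's default scorer this is the Mikolov et al. (2013) score, which is related to, but not the same as, Pointwise Mutual Information (PMI); `scoring='npmi'` selects normalized PMI instead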
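
To make `threshold` concrete, the sketch below reimplements the score that gensim's `Phrases` applies under its default scoring, following the formula from Mikolov et al. (2013). The function name `default_phrase_score` and all counts are made up for illustration; in gensim the counts are collected from the training corpus.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the pair (a, b): (count_ab - min_count) * vocab_size / (count_a * count_b)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts: "rumah" occurs 60 times, "sakit" 50 times,
# the pair "rumah sakit" 40 times, in a vocabulary of 3000 distinct tokens.
score = default_phrase_score(count_a=60, count_b=50, count_ab=40,
                             vocab_size=3000, min_count=10)
print(score)       # (40 - 10) * 3000 / (60 * 50) = 30.0
print(score > 20)  # the pair is merged only if the score exceeds `threshold`
```

Raising `min_count` lowers every score, and pairs made of very frequent words need a high joint count to pass, which is why both parameters are tuned together.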

In [9]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser

In [10]:
bigram_phraser = Phrases(df_titles["textdata_nsw"], min_count=10, threshold=20)
trigram_phraser = Phrases(bigram_phraser[df_titles["textdata_nsw"]], min_count=5, threshold=10)

df_titles["bigrams"] = [bigram_phraser[doc] for doc in df_titles["textdata_nsw"]]
df_titles["trigrams"] = [trigram_phraser[doc] for doc in df_titles["bigrams"]]


df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr,textdata_normalized,tokenized,textdata_nsw,bigrams,trigrams
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...,perang hamas vs israel pecah rusia desak semua...,"[perang, hamas, vs, israel, pecah, rusia, desa...","[perang, hamas, pecah, rusia, desak, menahan]","[perang, hamas, pecah, rusia, desak, menahan]","[perang, hamas, pecah, rusia, desak, menahan]"
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...,fakta dampak hamas vs israel orang tewas rumah...,"[fakta, dampak, hamas, vs, israel, orang, tewa...","[fakta, dampak, hamas, orang, tewas, rumah, sa...","[fakta, dampak, hamas, orang, tewas, rumah_sak...","[fakta, dampak, hamas, orang_tewas, rumah_saki..."
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat, serangan, bal...","[orang, tewas, akibat, serangan, ribuan, terluka]","[orang, tewas_akibat, serangan, ribuan, terluka]","[orang_tewas_akibat, serangan, ribuan, terluka]"


### 6. Stemming
`custom_words` is added to unify word forms and to handle proper names that look as if they carry an affix (in "hizbullah", "-lah" is part of the name, not a suffix).
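
The override pattern can be sketched in isolation as below. `naive_stem` is a toy stand-in written for this example, not Sastrawi's algorithm, and the `overrides` entries are a small illustrative subset; the point is only that a word found in the override table never reaches the stemmer.

```python
def naive_stem(word):
    """Toy stand-in for a real stemmer: strips a trailing '-lah' particle."""
    return word[:-3] if word.endswith("lah") and len(word) > 5 else word


# Words listed here bypass the stemmer and map straight to a fixed form
overrides = {"hizbullah": "hizbullah", "serangan": "serang"}


def stem_with_overrides(words):
    """Use the override when one exists, otherwise fall back to the stemmer."""
    return [overrides[word] if word in overrides else naive_stem(word) for word in words]


print(naive_stem("hizbullah"))  # hizbul: the name is damaged without an override
print(stem_with_overrides(["hizbullah", "pergilah", "serangan"]))
```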

In [11]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

In [12]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

custom_words = {
    "pasukan": "pasukan",
    "bantu": "bantuan",
    "bantuan": "bantuan",
    "hizbullah": "hizbullah",
    "pengungsi":"pengungsi",
    "pengungsian":"pengungsi",
    "bombardir": "bom",
    "akui": "akui",
    "perang":"perang",
    "perangi":"perang",
    "serangan":"serang",
    "diserang":"serang",
    "serang": "serang",
    "gencatan_senjata":"gencatan_senjata"
}

def stem_words(words):
    return [custom_words[word] if word in custom_words else stemmer.stem(word) for word in words]

df_titles['textdata_stemmed'] = df_titles['trigrams'].apply(stem_words)
df_titles.head(3)

Unnamed: 0,judul_berita,textdata_cfpr,textdata_normalized,tokenized,textdata_nsw,bigrams,trigrams,textdata_stemmed
0,"Perang Hamas Vs Israel Pecah, Rusia Desak Semu...",perang hamas vs israel pecah rusia desak semua...,perang hamas vs israel pecah rusia desak semua...,"[perang, hamas, vs, israel, pecah, rusia, desa...","[perang, hamas, pecah, rusia, desak, menahan]","[perang, hamas, pecah, rusia, desak, menahan]","[perang, hamas, pecah, rusia, desak, menahan]","[perang, hamas, pecah, rusia, desak, tahan]"
1,6 Fakta Dampak Hamas Vs Israel: 40 Orang Tewas...,fakta dampak hamas vs israel orang tewas rs ...,fakta dampak hamas vs israel orang tewas rumah...,"[fakta, dampak, hamas, vs, israel, orang, tewa...","[fakta, dampak, hamas, orang, tewas, rumah, sa...","[fakta, dampak, hamas, orang, tewas, rumah_sak...","[fakta, dampak, hamas, orang_tewas, rumah_saki...","[fakta, dampak, hamas, orang tewas, rumah saki..."
2,198 Orang di Gaza Tewas Akibat Serangan Balik ...,orang di gaza tewas akibat serangan balik isr...,orang di gaza tewas akibat serangan balik isra...,"[orang, di, gaza, tewas, akibat, serangan, bal...","[orang, tewas, akibat, serangan, ribuan, terluka]","[orang, tewas_akibat, serangan, ribuan, terluka]","[orang_tewas_akibat, serangan, ribuan, terluka]","[orang tewas akibat, serang, ribu, luka]"


### 7. TF-IDF

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from gensim import corpora
from gensim.models import LdaModel

In [14]:
# Convert list of textdata_stemmed into string format
documents = df_titles['textdata_stemmed'].apply(lambda x: ' '.join(x)).tolist()

# Step 1: Apply TF-IDF
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.6,
    min_df=20,
    stop_words=None,
    ngram_range=(1, 3)
)

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

In [15]:
dictionary = corpora.Dictionary(df_titles['textdata_stemmed'])  # Tokenized texts
dictionary.filter_extremes(no_below=20,
                           no_above=0.6
                           )

corpus = [dictionary.doc2bow(text) for text in df_titles['textdata_stemmed']]

### 8. Bag of Words (BoW)
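The dictionary and bag-of-words corpus built above are saved to disk for reuse. The BoW corpus (term counts per document), not the TF-IDF matrix, is what the LDA model is trained on.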
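
For readers unfamiliar with the format, the sketch below shows what a bag-of-words corpus looks like: each document becomes a list of `(token_id, count)` pairs. It is a simplified stand-in for gensim's `Dictionary` and `doc2bow`, written with the standard library only; the helper names and the two sample documents are assumptions, and gensim may assign token ids in a different order.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical tokenized documents, for illustration only
docs = [["hamas", "serang", "hamas"], ["perang", "serang"]]
token2id = build_vocabulary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Tokens dropped from the vocabulary, as `filter_extremes` does in the notebook, simply disappear from every document's pair list.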

In [16]:
from gensim.corpora import Dictionary

In [17]:
dictionary.save("lda_dictionary.dict")
import pickle
with open("lda_corpus.pkl", "wb") as f:
    pickle.dump(corpus, f)

print(f"Dictionary size: {len(dictionary)} unique tokens")

Dictionary size: 903 unique tokens


## 3. LDA Model

In [18]:
from gensim.models import LdaModel

# Define the number of topics
num_topics = 6

# Train LDA model
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=num_topics,
    random_state=42,
    passes=10,
    chunksize=100,
    alpha="asymmetric", # Topic distribution within a document is expected to be uneven
    eta=0.01, # Each topic is expected to be represented by only a few words
)

# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Save model for later use
lda_model.save("lda_model")

Topic 0: 0.123*"warga" + 0.067*"arab" + 0.066*"pbb" + 0.051*"biden" + 0.047*"negara" + 0.030*"rumah sakit" + 0.027*"senjata" + 0.026*"korban" + 0.024*"militer" + 0.023*"tentara"
Topic 1: 0.323*"perang" + 0.298*"amerika serikat" + 0.080*"iran" + 0.062*"bantuan" + 0.038*"lawan" + 0.035*"bela" + 0.032*"orang" + 0.031*"gempur" + 0.024*"setop" + 0.010*"luka"
Topic 2: 0.176*"indonesia" + 0.091*"dunia" + 0.088*"dukung" + 0.084*"rafah" + 0.076*"tepi barat" + 0.044*"kirim" + 0.040*"rudal" + 0.037*"kecam" + 0.034*"seru" + 0.032*"inggris"
Topic 3: 0.332*"tewas" + 0.120*"orang tewas" + 0.115*"panas" + 0.079*"jalur" + 0.058*"ribu" + 0.055*"konflik" + 0.049*"aksi" + 0.031*"penuh" + 0.029*"imbas" + 0.029*"hilang"
Topic 4: 0.544*"hamas" + 0.435*"serang" + 0.014*"buntut" + 0.005*"terbang" + 0.000*"ungkap" + 0.000*"peduli" + 0.000*"houthi yaman" + 0.000*"profil" + 0.000*"salur bantu" + 0.000*"besok"
Topic 5: 0.190*"negara arab" + 0.186*"rumah sakit indonesia" + 0.177*"fakta" + 0.152*"temu" + 0.122*"kena

## 4. Model evaluation

In [19]:
from gensim.models import CoherenceModel

texts = [[word for word in text if word in dictionary.token2id] for text in df_titles['textdata_stemmed']]  # Filter out removed words

coherence_model_lda = CoherenceModel(
    model=lda_model,
    texts=texts,
    dictionary=dictionary,
    coherence="c_v",
)

coherence_score = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_score}")

Coherence Score: 0.5088729864386473


## 5. LDA visualization

In [20]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [21]:
pyLDAvis.enable_notebook()

lda_display = gensimvis.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)