# 🧠 Topic Modeling with LDA: Using the `df_preprocessing.csv` Dataset

This notebook builds a **Latent Dirichlet Allocation (LDA)** model for the `df_preprocessing.csv` dataset. It uses two of the dataset's columns:
- `tokenize_indo` → news text after preprocessing and tokenization
- `Kategori Berita` → news category label (optional, for comparison analysis)
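
Judging from the `df.head()` output, each `tokenize_indo` cell holds a Python list serialized as a string (for example `"['kompas', 'com', ...]"`), and the notebook passes those strings to the vectorizer as they are. As a hedged alternative, the sketch below shows how such a cell could be parsed back into real tokens with `ast.literal_eval`. The helper name `parse_token_cell` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import ast

def parse_token_cell(cell):
    """Turn a stringified token list into a list of str (hypothetical helper)."""
    tokens = ast.literal_eval(cell)  # safely evaluates literals only, never arbitrary code
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list, got {type(tokens).__name__}")
    return [str(t) for t in tokens]

# Made-up cell written in the format seen in the df.head() output
sample = "['kompas', 'com', 'menteri', 'pemuda']"
tokens = parse_token_cell(sample)
joined = " ".join(tokens)  # space-joined form that a CountVectorizer can consume

print(tokens)
print(joined)
```

In the notebook this could be applied column-wise, for instance with `df['tokenize_indo'].apply(parse_token_cell)`, before building `documents`.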

In [3]:
import pandas as pd
import numpy as np
import re
# import nltk
# from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# nltk.download('stopwords')

## 1️⃣ Load Dataset

In [4]:
# Load dataset
df = pd.read_csv('datasets/df_preprocessing.csv')

# Take the columns that have already been preprocessed
documents = df['tokenize_indo'].astype(str).tolist()
labels = df['Kategori Berita'].tolist()

print('Total documents:', len(documents))
df.head()

Total documents: 1200


Unnamed: 0,Isi Berita,lwr_indo,clean_sw_indo,clean_stb_indo,clean_typo_indo,stemming_indo,tokenize_indo,Kategori Berita
0,KOMPAS.com - Menteri Pemuda dan Olahraga Erick...,kompas.com - menteri pemuda dan olahraga erick...,kompas.com - menteri pemuda olahraga erick tho...,kompas com menteri pemuda olahraga erick tho...,kompas com menteri pemuda olahraga erick thohi...,kompas com menteri pemuda olahraga erick thohi...,"['kompas', 'com', 'menteri', 'pemuda', 'olahra...",BOLA
1,"KOMPAS.com - Manajer Chelsea, Enzo Maresca men...","kompas.com - manajer chelsea, enzo maresca men...","kompas.com - manajer chelsea, enzo maresca men...",kompas com manajer chelsea enzo maresca meng...,kompas com manajer chelsea enzo maresca mengan...,kompas com manajer chelsea enzo maresca anggap...,"['kompas', 'com', 'manajer', 'chelsea', 'enzo'...",BOLA
2,"KOMPAS.com - Pelatih Liverpool, Arne Slot, men...","kompas.com - pelatih liverpool, arne slot, men...","kompas.com - pelatih liverpool, arne slot, men...",kompas com pelatih liverpool arne slot menga...,kompas com pelatih liverpool arne slot mengaku...,kompas com latih liverpool arne slot aku cryst...,"['kompas', 'com', 'latih', 'liverpool', 'arne'...",BOLA
3,KOMPAS.com - Hasil terbaru pekan kelima Liga I...,kompas.com - hasil terbaru pekan kelima liga i...,kompas.com - hasil terbaru pekan liga italia 2...,kompas com hasil terbaru pekan liga italia ...,kompas com hasil terbaru pekan liga italia per...,kompas com hasil baru pekan liga italia beda n...,"['kompas', 'com', 'hasil', 'baru', 'pekan', 'l...",BOLA
4,KOMPAS.com - Menteri Pemuda dan Olahraga Repub...,kompas.com - menteri pemuda dan olahraga repub...,kompas.com - menteri pemuda olahraga republik ...,kompas com menteri pemuda olahraga republik ...,kompas com menteri pemuda olahraga republik in...,kompas com menteri pemuda olahraga republik in...,"['kompas', 'com', 'menteri', 'pemuda', 'olahra...",BOLA


## 2️⃣ Text Normalization & Preprocessing

In [5]:
# stop_words = set(stopwords.words('indonesian'))

# def normalize_text(text):
#     text = text.lower()
#     text = re.sub(r'[^a-zA-Z\s]', '', text)
#     tokens = text.split()
#     tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
#     return ' '.join(tokens)

# norm_corpus = [normalize_text(doc) for doc in documents]
# print('Sample normalization result:')
# print(norm_corpus[0][:200])

## 3️⃣ Text Vectorization

In [6]:
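# Drop terms that appear in more than 90% of documents or in fewer than 5 documents.
# Note: stop_words='english' removes English stop words only; Indonesian stop words
# were already removed upstream (see the `clean_sw_indo` column).
cv = CountVectorizer(max_df=0.9, min_df=5, stop_words='english')
cv_matrix = cv.fit_transform(documents)
print('Shape of document-term matrix:', cv_matrix.shape)
vocab = cv.get_feature_names_out()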

Shape of document-term matrix: (1200, 4168)


## 4️⃣ Train LDA Model

In [7]:
lda = LatentDirichletAllocation(n_components=5, max_iter=1000, random_state=0)
doc_topic_matrix = lda.fit_transform(cv_matrix)
print('LDA model trained!')

LDA model trained!


## 5️⃣ Inspect Topics

In [8]:
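def get_topics_meanings(components, feature_names, n_top_words=10):
    """Return the n_top_words highest-weight terms for each topic."""
    topics = {}
    for idx, topic in enumerate(components):
        # argsort is ascending, so the reversed slice yields the largest weights first
        top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        topics[f'Topic {idx+1}'] = top_features
    return topics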

topics = get_topics_meanings(lda.components_, vocab, n_top_words=10)
for k, v in topics.items():
    print(f'{k}:', ', '.join(v))

Topic 1: mobil, kendara, motor, listrik, milik, bbm, indonesia, jakarta, honda, jalan
Topic 2: mbg, makan, racun, siswa, sekolah, anak, program, gizi, dapur, barat
Topic 3: korban, jalan, warga, rumah, orang, sebut, itu, jakarta, laku, polisi
Topic 4: rp, indonesia, jakarta, kerja, usaha, persen, menteri, perintah, harga, negara
Topic 5: main, laga, indonesia, gol, vs, hasil, tim, menang, liga, menit


## 6️⃣ pyLDAvis Visualization

In [9]:
import pyLDAvis
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()
vis = pyLDAvis.lda_model.prepare(lda, cv_matrix, cv, mds='mmds')
vis

## 7️⃣ Cluster Documents Based on Topic Distributions

In [10]:
km = KMeans(n_clusters=5, random_state=0)
km.fit(doc_topic_matrix)
df['ClusterLabel'] = km.labels_
df[['tokenize_indo', 'ClusterLabel']].head()

Unnamed: 0,tokenize_indo,ClusterLabel
0,"['kompas', 'com', 'menteri', 'pemuda', 'olahra...",2
1,"['kompas', 'com', 'manajer', 'chelsea', 'enzo'...",3
2,"['kompas', 'com', 'latih', 'liverpool', 'arne'...",3
3,"['kompas', 'com', 'hasil', 'baru', 'pekan', 'l...",3
4,"['kompas', 'com', 'menteri', 'pemuda', 'olahra...",2
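
The notebook loads `labels` from `Kategori Berita` for comparison analysis but never uses them. One possible way to make that comparison is a category-by-cluster count table plus a purity score, which is the fraction of documents that fall in their cluster's majority category. The sketch below uses made-up category and cluster lists as stand-ins for `df['Kategori Berita']` and `df['ClusterLabel']`. The `OTHER` category and the purity metric are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict

# Made-up stand-ins for df['Kategori Berita'] and df['ClusterLabel']
categories = ["BOLA", "BOLA", "BOLA", "OTHER", "OTHER", "BOLA"]
clusters = [3, 3, 2, 0, 0, 0]

# Contingency table: category -> cluster -> document count
table = defaultdict(Counter)
# Reverse view: cluster -> category -> document count (used for purity)
by_cluster = defaultdict(Counter)
for cat, cl in zip(categories, clusters):
    table[cat][cl] += 1
    by_cluster[cl][cat] += 1

# Purity: sum each cluster's majority-category count, divide by total documents
majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
purity = majority_total / len(clusters)

for cat, counts in table.items():
    print(cat, dict(counts))
print("purity:", round(purity, 3))
```

On the real data, `pd.crosstab(df['Kategori Berita'], df['ClusterLabel'])` would produce the same kind of table directly.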


## 8️⃣ Hyperparameter Tuning (Grid Search)

In [None]:
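# Search over the number of topics. Note: `learning_decay` only affects
# learning_method='online', so with 'batch' both values produce identical fits.
search_params = {'n_components': range(3,8), 'learning_decay': [.5, .7]}
model = LatentDirichletAllocation(learning_method='batch', max_iter=1000, random_state=0)
# With no `scoring` argument, GridSearchCV ranks candidates by the estimator's score(),
# an approximate log-likelihood evaluated on each held-out fold.
gridsearch = GridSearchCV(model, param_grid=search_params, n_jobs=-1, verbose=1)
gridsearch.fit(cv_matrix)
best_lda = gridsearch.best_estimator_

print('Best Model Params:', gridsearch.best_params_)
print('Best Log Likelihood:', gridsearch.best_score_)  # mean over the CV folds
print('Best Perplexity:', best_lda.perplexity(cv_matrix))  # computed on the full matrix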

Fitting 5 folds for each of 10 candidates, totalling 50 fits
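
The grid search cell reports both a log-likelihood and a perplexity. The two are linked by the standard definition: perplexity is the exponential of the negative log-likelihood per token, so lower perplexity means a better fit. The sketch below illustrates that relation on made-up numbers. The function name is hypothetical, and scikit-learn computes both quantities from an approximate variational bound, so its values will not match an exact likelihood.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Return exp(-log_likelihood / n_tokens) (hypothetical helper)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)

# Made-up corpus: 1000 tokens, each assigned probability 1/50 by the model
n_tokens = 1000
log_likelihood = n_tokens * math.log(1 / 50)
ppl = perplexity_from_log_likelihood(log_likelihood, n_tokens)

print(ppl)
```

A model that spreads probability uniformly over 50 words has perplexity of about 50, which is why perplexity is often read as an effective vocabulary size per token.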


## 9️⃣ Predict Topics for New Text

In [None]:
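new_docs = ['pemerintah mengumumkan kebijakan ekonomi baru',
            'tim sepak bola memenangkan pertandingan final']
# `normalize_text` is commented out in section 2, so calling it would raise a NameError.
# Vectorize the raw strings directly; for better matches, run new text through the same
# preprocessing (stop-word removal, stemming) that produced `tokenize_indo`.
new_docs_vec = cv.transform(new_docs)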

new_topic_matrix = best_lda.transform(new_docs_vec)
topic_labels = [f'Topic {i+1}' for i in range(best_lda.n_components)]
new_df = pd.DataFrame(new_topic_matrix, columns=topic_labels)
new_df['predicted_topic'] = new_df.idxmax(axis=1)
new_df['document'] = new_docs
new_df

## 🔟 Optional: Coherence Evaluation (tmtoolkit)

In [None]:
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

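def topic_model_coherence_generator(topic_num_start, topic_num_end, documents, cv_matrix, cv):
    """Fit one LDA model per topic count in [start, end) and score each with c_v coherence."""
    vocab = np.array(cv.get_feature_names_out())
    vocab_set = set(vocab)
    # Tokenize with the vectorizer's own analyzer and keep only in-vocabulary tokens,
    # so the texts line up with the fitted document-term matrix
    analyzer = cv.build_analyzer()
    texts = [[t for t in analyzer(doc) if t in vocab_set] for doc in documents]
    models = []
    coherence_scores = []

    for i in range(topic_num_start, topic_num_end):
        cur_lda = LatentDirichletAllocation(n_components=i, max_iter=1000, random_state=0)
        cur_lda.fit(cv_matrix)
        # Reuse the fitted matrix instead of refitting `cv` on every iteration
        cur_coherence = metric_coherence_gensim(
            measure='c_v', top_n=10, topic_word_distrib=cur_lda.components_,
            dtm=cv_matrix, vocab=vocab, texts=texts)
        models.append(cur_lda)
        coherence_scores.append(np.mean(cur_coherence))
    return models, coherence_scores

# `norm_corpus` is never defined (section 2 is commented out), so pass `documents`
models, scores = topic_model_coherence_generator(2, 8, documents, cv_matrix, cv)
pd.DataFrame({'n_topics': range(2,8), 'coherence': scores})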