# 13. Topic Modeling (LDA) from Scratch

Topic modeling is an unsupervised machine learning technique for discovering the abstract "topics" that occur in a collection of documents.

We will implement a simplified version of **Latent Dirichlet Allocation (LDA)**, trained with **collapsed Gibbs sampling**, in plain Python using only the standard library.

## Goal:
Given our corpus of 50+ synthetic documents, can we automatically rediscover the 5 topics (Politics, Sports, Technology, Travel, Culture)?

In [1]:
import os
import glob
import random
import math
from collections import Counter, defaultdict

DATA_DIR = "../data"

def load_documents(data_dir):
    documents = []
    filenames = sorted(glob.glob(os.path.join(data_dir, "doc*.txt")))
    for filepath in filenames:
        with open(filepath, 'r', encoding='utf-8') as f:
            documents.append(f.read())
    return documents, filenames

raw_documents, filenames = load_documents(DATA_DIR)
print(f"Loaded {len(raw_documents)} documents.")

Loaded 60 documents.


## 1. Preprocessing
We tokenize on whitespace, strip surrounding punctuation, and remove stopwords so that only content words remain.

In [2]:
STOPWORDS = {"र", "को", "मा", "हामी", "यो", "लागि", "ले", "का", "हरु", "तथा", "भने", "भयो", "छ", "हो", "पनि", "गर्न", "गरे"}

def preprocess(text):
    tokens = text.split()
    clean = []
    for t in tokens:
        # Remove punctuation
        t = t.strip('।,.!?;:"\'-()[]{}/')
        if t and t not in STOPWORDS:
            clean.append(t)
    return clean

docs = [preprocess(d) for d in raw_documents]

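# Build Vocabulary
vocab = set(word for doc in docs for word in doc)
# Sort before enumerating: set iteration order for strings can change between
# interpreter runs, which would make the word IDs unstable.
w2id = {w: i for i, w in enumerate(sorted(vocab))}
id2w = {i: w for w, i in w2id.items()}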

print(f"Vocabulary size: {len(vocab)}")

Vocabulary size: 516
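
To make the preprocessing and ID-mapping steps concrete, here is a minimal sketch on a two-sentence English toy corpus. The names `TOY_STOPWORDS`, `toy_preprocess`, and the toy sentences are illustrative assumptions and are not part of the notebook's data. The logic mirrors `preprocess` and the `w2id` construction above, with the toy vocabulary sorted so that the IDs are stable.

```python
# Toy stand-ins (assumptions): the real notebook uses Nepali text and STOPWORDS.
TOY_STOPWORDS = {"the", "a", "is"}

def toy_preprocess(text, stopwords):
    """Whitespace-tokenize, strip surrounding punctuation, drop stopwords."""
    tokens = []
    for t in text.split():
        t = t.strip('.,!?;:"\'-()[]{}/')
        if t and t not in stopwords:
            tokens.append(t)
    return tokens

toy_raw = ["the match is on.", "a vote, a match!"]
toy_docs = [toy_preprocess(d, TOY_STOPWORDS) for d in toy_raw]

# Map each distinct word to an integer ID, then encode the documents.
toy_vocab = sorted({w for doc in toy_docs for w in doc})
toy_w2id = {w: i for i, w in enumerate(toy_vocab)}
toy_doc_ids = [[toy_w2id[w] for w in doc] for doc in toy_docs]

print(toy_docs)     # [['match', 'on'], ['vote', 'match']]
print(toy_doc_ids)  # [[0, 1], [2, 0]]
```

The integer-encoded documents are the form the sampler consumes: each word ID indexes directly into a topic's word-count list.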


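## 2. Simple LDA Implementation (Gibbs Sampling)

**Core Idea**:
1. Randomly assign a topic to each word in each document.
2. Iteratively resample the topic assignment of each word based on:
   - How prevalent is that topic in this document? $P(z|d)$
   - How prevalent is this word in that topic? $P(w|z)$

$$ P(z_i = k | \dots) \propto (n_{d,k} + \alpha) \times \frac{n_{k,w} + \beta}{n_k + V\beta} $$

Here $n_{d,k}$ is the number of words in document $d$ assigned to topic $k$, $n_{k,w}$ is the number of times word $w$ is assigned to topic $k$, $n_k$ is the total number of words assigned to topic $k$, and $V$ is the vocabulary size. All counts exclude the word currently being resampled.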
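
As a worked example of the update rule, the sketch below evaluates the conditional for a single word with made-up counts. The function name `topic_conditional` and all the numbers are hypothetical; the counts are assumed to already exclude the word being resampled.

```python
def topic_conditional(nd_k_row, nk_w_col, nk, alpha, beta, V):
    """Return normalized P(z_i = k | ...) over topics for one word.

    nd_k_row[k]: words in this document assigned to topic k
    nk_w_col[k]: times this word is assigned to topic k
    nk[k]:       total words assigned to topic k
    """
    weights = [
        (nd_k_row[k] + alpha) * (nk_w_col[k] + beta) / (nk[k] + V * beta)
        for k in range(len(nk))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical counts for K=2 topics and a vocabulary of V=5 words.
probs = topic_conditional(
    nd_k_row=[3, 1],
    nk_w_col=[2, 0],
    nk=[10, 10],
    alpha=0.1, beta=0.1, V=5,
)
print([round(p, 3) for p in probs])  # [0.983, 0.017]
```

Topic 0 dominates because both factors favor it: the document already uses it more, and the word has been seen in it before. The small but nonzero probability for topic 1 comes entirely from the smoothing terms $\alpha$ and $\beta$.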

In [3]:
class SimpleLDA:
    def __init__(self, K, alpha=0.1, beta=0.1, iterations=20):
        self.K = K  # Number of topics
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    def fit(self, docs, vocab_size):
        self.docs = docs
        self.V = vocab_size
        
        # Initialize counts
        self.nd_k = defaultdict(lambda: [0]*self.K) # Doc-Topic counts
        self.nk_w = defaultdict(lambda: [0]*self.V) # Topic-Word counts
        self.nk = [0]*self.K                        # Total words in topic k
        
        # Random assignment
        self.z = [] # Topic assignments for every word in every doc
        for d, doc in enumerate(docs):
            z_doc = []
            for w_id in doc:
                topic = random.randint(0, self.K-1)
                z_doc.append(topic)
                
                self.nd_k[d][topic] += 1
                self.nk_w[topic][w_id] += 1
                self.nk[topic] += 1
            self.z.append(z_doc)
            
        # Gibbs Sampling
        print(f"Starting Gibbs Sampling ({self.iterations} iterations)...")
        for it in range(self.iterations):
            for d, doc in enumerate(docs):
                for i, w_id in enumerate(doc):
                    # 1. Remove current assignment
                    curr_topic = self.z[d][i]
                    self.nd_k[d][curr_topic] -= 1
                    self.nk_w[curr_topic][w_id] -= 1
                    self.nk[curr_topic] -= 1
                    
                    # 2. Calculate probabilities for new assignment
                    p_z = []
                    for k in range(self.K):
                        # Unnormalized P(topic | doc); the document-length
                        # denominator is the same for every k, so it is dropped
                        p_doc = self.nd_k[d][k] + self.alpha
                        # P(word | topic)
                        p_word = (self.nk_w[k][w_id] + self.beta) / (self.nk[k] + self.V * self.beta)
                        
                        p_z.append(p_doc * p_word)
                    
                    # Normalize
                    total_p = sum(p_z)
                    probs = [p/total_p for p in p_z]
                    
                    # 3. Sample new topic (roulette wheel)
                    r = random.random()
                    acc = 0
                    # Default to the last topic so that floating-point rounding
                    # (acc ending slightly below 1.0) cannot silently pick topic 0
                    new_topic = self.K - 1
                    for k, p in enumerate(probs):
                        acc += p
                        if r < acc:
                            new_topic = k
                            break
                            
                    # 4. updates
                    self.z[d][i] = new_topic
                    self.nd_k[d][new_topic] += 1
                    self.nk_w[new_topic][w_id] += 1
                    self.nk[new_topic] += 1
            
            if (it+1) % 5 == 0:
                print(f"  Iteration {it+1}/{self.iterations} complete")

    def get_topics(self, top_n=5):
        topics = []
        for k in range(self.K):
            # Find top words for topic k
            word_counts = []
            for w_id in range(self.V):
                count = self.nk_w[k][w_id]
                word_counts.append((w_id, count))
            
            word_counts.sort(key=lambda x: x[1], reverse=True)
            top_word_ids = word_counts[:top_n]
            
            # NOTE: id2w is the module-level id -> word mapping built in the preprocessing cell
            topics.append([id2w[wid] for wid, _ in top_word_ids])
        return topics
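
The count tables kept by `SimpleLDA` can also be turned into smoothed probability estimates, which is the usual way to read off per-document topic proportions and per-topic word distributions after sampling. The helper names `estimate_theta` and `estimate_phi` below are hypothetical, and the counts are toy values; this is a sketch of the standard smoothed point estimates, not something the class above computes itself.

```python
def estimate_theta(nd_k_row, alpha):
    """Smoothed topic proportions for one document:
    (n_dk + alpha) / (n_d + K * alpha)."""
    K = len(nd_k_row)
    total = sum(nd_k_row) + K * alpha
    return [(c + alpha) / total for c in nd_k_row]

def estimate_phi(nk_w_row, beta):
    """Smoothed word distribution for one topic:
    (n_kw + beta) / (n_k + V * beta)."""
    V = len(nk_w_row)
    total = sum(nk_w_row) + V * beta
    return [(c + beta) / total for c in nk_w_row]

# Toy counts (assumptions): one document over K=3 topics, one topic over V=4 words.
theta = estimate_theta([8, 0, 2], alpha=0.1)
phi = estimate_phi([5, 0, 0, 5], beta=0.1)

print([round(t, 3) for t in theta])
print([round(p, 3) for p in phi])
```

After fitting, the same helpers could be applied to a row such as `lda.nd_k[d]` with `lda.alpha`, or `lda.nk_w[k]` with `lda.beta`. Unlike raw counts, these estimates never assign exactly zero probability to a topic or word.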

## 3. Train and Visualize

We know we have 5 genres, so we set K=5.

In [4]:
# Convert docs to IDs
doc_ids = []
for doc in docs:
    doc_ids.append([w2id[w] for w in doc])

lda = SimpleLDA(K=5, iterations=30)
lda.fit(doc_ids, len(vocab))

print("\nDiscovered Topics:")
print("="*30)
topics = lda.get_topics(top_n=7)
for i, t in enumerate(topics):
    print(f"Topic {i+1}: {', '.join(t)}")

Starting Gibbs Sampling (30 iterations)...
  Iteration 5/30 complete
  Iteration 10/30 complete
  Iteration 15/30 complete
  Iteration 20/30 complete
  Iteration 25/30 complete
  Iteration 30/30 complete

Discovered Topics:
==============================
Topic 1: संविधान, चुनाव, राजनीति, अधिकार, संसदमा, मन्त्रीले, मन्त्री
Topic 2: क्रिकेट, फुटबल, प्रतियोगिता, अभ्यास, मैदानमा, खेलाडीले, खेलाडी
Topic 3: नेपाल, सगरमाथा, आउँछन्, देश, हिमालको, पर्यटक, चढ्न
Topic 4: संस्कृति, धेरै, मोबाइल, दशैं, सफ्टवेयर, चाडपर्व, लेख्छन्
Topic 5: नेपालमा, नेपालको, छन्, हुन्, नेपाली, जस्ता, विश्वविद्यालय
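
Because the sampler draws from Python's `random` module without a fixed seed, topic numbering and top words will generally differ from run to run. The sketch below shows, on made-up weights, how a seeded generator makes the draws repeatable. The function `draw_topics`, the seed value, and the weights are illustrative assumptions; `random.Random` and its `choices` method are standard library.

```python
import random

def draw_topics(seed, weights, n_draws):
    """Draw topic indices from unnormalized weights using a private, seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n_draws)

weights = [0.62, 0.01, 0.37]  # hypothetical unnormalized topic weights
run_a = draw_topics(seed=42, weights=weights, n_draws=10)
run_b = draw_topics(seed=42, weights=weights, n_draws=10)

print(run_a == run_b)  # True: the same seed replays the same draws
```

For the module-level generator used in `SimpleLDA`, calling `random.seed(...)` with a fixed value before `lda.fit(...)` would play the same role. Even with a seed, the topic indices carry no inherent meaning; only the word groupings do.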


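## 4. Evaluation
Check whether each document's dominant topic matches the genre in its filename. For each sampled document, we normalize its row of `nd_k` counts into proportions and report the largest one.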

In [5]:
print("\nDocument Topic Distribution (Sample):")
for i in range(10, 15): # Check some synthetic docs
    doc_name = filenames[i]
    counts = lda.nd_k[i]
    total = sum(counts)
    if total > 0:
        props = [c/total for c in counts]
        dominant = props.index(max(props))
        print(f"{os.path.basename(doc_name)} -> Dominant Topic {dominant+1} ({max(props):.2f})")


Document Topic Distribution (Sample):
doc02.txt -> Dominant Topic 3 (0.71)
doc020_politics.txt -> Dominant Topic 1 (1.00)
doc021_sports.txt -> Dominant Topic 2 (1.00)
doc022_sports.txt -> Dominant Topic 2 (1.00)
doc023_sports.txt -> Dominant Topic 2 (1.00)
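
Spot-checking five documents can be extended to a single summary number. One option is cluster purity: group documents by dominant topic, credit each topic with its most common genre, and divide by the number of labeled documents. The sketch below assumes that the genre is the part of the filename after the first underscore (as in `doc020_politics.txt`) and that files without an underscore are unlabeled; `genre_from_filename` and `purity` are hypothetical helpers, and the inputs are toy values rather than results from the notebook.

```python
import os
from collections import Counter, defaultdict

def genre_from_filename(path):
    """Return the text after the first underscore in the file stem, or None."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_", 1)
    return parts[1] if len(parts) == 2 else None

def purity(genres, dominant_topics):
    """Fraction of labeled documents whose genre is the majority genre
    of their dominant topic. Unlabeled documents (genre None) are skipped."""
    by_topic = defaultdict(Counter)
    for genre, topic in zip(genres, dominant_topics):
        if genre is not None:
            by_topic[topic][genre] += 1
    labeled = sum(sum(c.values()) for c in by_topic.values())
    if labeled == 0:
        return 0.0
    matched = sum(max(c.values()) for c in by_topic.values())
    return matched / labeled

# Toy inputs (assumptions): filenames in the notebook's style and made-up topics.
toy_files = ["doc020_politics.txt", "doc021_sports.txt", "doc022_sports.txt",
             "doc023_sports.txt", "doc024_politics.txt", "doc02.txt"]
toy_topics = [0, 1, 1, 0, 0, 2]
toy_genres = [genre_from_filename(f) for f in toy_files]

print(purity(toy_genres, toy_topics))  # 0.8
```

In the notebook, the dominant topic for document `i` could be taken as the index of the largest entry in `lda.nd_k[i]`. Purity rewards topics that are dominated by one genre, but it does not penalize two topics that share the same genre, so it is best read alongside the top-word lists.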
