# 07. Enhanced Language Modeling for IR

## Introduction

**Language Modeling** is a probabilistic approach to Information Retrieval in which each document is treated as its own language model. Documents are ranked by the probability that their model generates the query.

### Formula:
Assuming query terms are generated independently of one another:

$$P(q|d) = \prod_{t \in q} P(t|d)$$

With an unsmoothed estimate, a single query term that is missing from a document drives $P(q|d)$ to zero. **Smoothing** techniques avoid this by reserving some probability mass for unseen terms.

## Techniques Implemented:
1. **Maximum Likelihood Estimation (MLE)**
2. **Jelinek-Mercer Smoothing** (Linear Interpolation)
3. **Dirichlet Prior Smoothing** (Bayesian Smoothing)
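
The zero-probability problem can be seen with a small sketch of unsmoothed MLE scoring. The function name `mle_log_likelihood` and the English toy document are illustrative assumptions, not part of the notebook's data.

```python
import math
from collections import Counter

def mle_log_likelihood(query_terms, doc_tokens):
    """Unsmoothed query log-likelihood: sum of log P_MLE(t|d) over query terms."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    if doc_len == 0:
        return float("-inf")
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0:
            # One unseen term makes the whole product zero (log-probability -inf).
            return float("-inf")
        score += math.log(tf / doc_len)
    return score

# Hypothetical toy document (8 tokens), used only for illustration.
toy_doc = "history of nepal and the history of kathmandu".split()

seen_score = mle_log_likelihood(["history", "nepal"], toy_doc)
unseen_score = mle_log_likelihood(["history", "everest"], toy_doc)

print(f"all terms present: {seen_score:.4f}")
print(f"one term missing:  {unseen_score}")
```

The second query shares a term with the document, yet it scores the same as a query with no overlap at all. That is the behavior smoothing is meant to fix.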

In [5]:
from pathlib import Path
import math
from collections import Counter, defaultdict

# Load data
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def tokenize(text):
    # Simple whitespace tokenization
    return text.split()

documents = load_documents(DATA_DIR)
print(f"Loaded {len(documents)} documents")

Loaded 10 documents


## 1. Building the Collection Model
The collection model $P(t|C)$ acts as a background probability distribution. It helps estimate probabilities for unseen words in a document based on their prevalence in the entire corpus.
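
As a rough illustration of the idea, the sketch below computes $P(t|C)$ for a two-document toy corpus. The corpus and variable names are assumptions made for the example and are independent of the `LanguageModel` class.

```python
from collections import Counter

# Hypothetical toy corpus; document "a" never mentions "kathmandu".
toy_docs = {
    "a": "nepal history nepal".split(),
    "b": "kathmandu valley history".split(),
}

# Pool term counts over every document to get the collection counts.
collection_counts = Counter()
for tokens in toy_docs.values():
    collection_counts.update(tokens)
collection_size = sum(collection_counts.values())

# P(t|C) is the relative frequency of t in the pooled collection.
collection_prob = {t: c / collection_size for t, c in collection_counts.items()}

for term, prob in sorted(collection_prob.items()):
    print(f"P({term}|C) = {prob:.3f}")
```

Although "kathmandu" never occurs in document "a", the collection model still assigns it a non-zero probability, which is what smoothing borrows from.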

In [6]:
class LanguageModel:
    def __init__(self, documents):
        self.doc_models = {}  # Stores counts per document
        self.collection_counts = Counter()
        self.collection_size = 0
        self.doc_lengths = {}
        
        self._build(documents)
        
    def _build(self, documents):
        for doc_id, text in documents.items():
            tokens = tokenize(text)
            counts = Counter(tokens)
            
            self.doc_models[doc_id] = counts
            self.doc_lengths[doc_id] = len(tokens)
            
            self.collection_counts.update(tokens)
            self.collection_size += len(tokens)
            
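    def get_collection_prob(self, term):
        """Return P(t|C), the relative frequency of `term` in the whole collection."""
        count = self.collection_counts.get(term, 0)
        if count == 0:
            # Out-of-vocabulary fallback: a small non-zero floor so math.log never
            # sees 0. This is a heuristic; the result is no longer a normalized
            # distribution.
            return 1 / (self.collection_size + 1)
        return count / self.collection_size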

lm = LanguageModel(documents)
print(f"Collection Size: {lm.collection_size} tokens")

Collection Size: 797 tokens


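## 2. Smoothing Implementations

### Jelinek-Mercer Smoothing
Linearly interpolates the document model with the collection model.

$$P_{JM}(t|d) = \lambda P_{MLE}(t|d) + (1-\lambda) P(t|C)$$

Here $\lambda$ is the weight on the document model (some references define it as the weight on the collection model instead). This notebook uses $\lambda = 0.7$ as its default; the best value depends on the collection and on query length.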
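
A worked instance of the interpolation, as a minimal sketch. The helper `jm_prob` and the numbers (a 10-token document, $P(t|C) = 0.01$) are made up for illustration.

```python
def jm_prob(tf, doc_len, p_coll, lam=0.7):
    """Jelinek-Mercer estimate: lam * P_MLE(t|d) + (1 - lam) * P(t|C)."""
    return lam * (tf / doc_len) + (1 - lam) * p_coll

# Hypothetical values: the term occurs twice in a 10-token document.
p_seen = jm_prob(tf=2, doc_len=10, p_coll=0.01)    # 0.7 * 0.2 + 0.3 * 0.01
# Same document, but the term does not occur in it at all.
p_unseen = jm_prob(tf=0, doc_len=10, p_coll=0.01)  # 0.7 * 0.0 + 0.3 * 0.01

print(f"seen term:   {p_seen:.4f}")
print(f"unseen term: {p_unseen:.4f}")
```

The unseen term keeps a small but non-zero probability, contributed entirely by the collection model.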
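
### Dirichlet Prior Smoothing
Adds pseudo-counts $\mu P(t|C)$, drawn from the collection model, to the observed term counts. Because $\mu$ is fixed, short documents are smoothed more heavily than long ones.

$$P_{DIR}(t|d) = \frac{c(t,d) + \mu P(t|C)}{|d| + \mu}$$

A commonly used default is $\mu = 2000$. Useful values tend to scale with document length, so a much smaller $\mu$ suits the short documents in this collection.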
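
The Dirichlet formula can be rewritten as a Jelinek-Mercer interpolation whose document weight is $|d| / (|d| + \mu)$, which makes the length dependence explicit. The sketch below checks this numerically; the helper names and the sample values are assumptions for illustration.

```python
def dirichlet_prob(tf, doc_len, p_coll, mu=2000):
    """P_DIR(t|d) = (c(t,d) + mu * P(t|C)) / (|d| + mu)."""
    return (tf + mu * p_coll) / (doc_len + mu)

def effective_lambda(doc_len, mu=2000):
    """Weight on the document model when Dirichlet is written as an interpolation."""
    return doc_len / (doc_len + mu)

p_coll = 0.01  # hypothetical collection probability
rows = []
for doc_len in (50, 500, 5000):
    tf = doc_len // 10  # the term makes up 10% of each document
    lam_d = effective_lambda(doc_len)
    interpolated = lam_d * (tf / doc_len) + (1 - lam_d) * p_coll
    rows.append((doc_len, lam_d, dirichlet_prob(tf, doc_len, p_coll), interpolated))

for doc_len, lam_d, direct, interpolated in rows:
    print(f"|d|={doc_len:5d}  lambda_d={lam_d:.3f}  "
          f"dirichlet={direct:.5f}  interpolated={interpolated:.5f}")
```

The two probability columns agree, and the document weight grows with length: longer documents are trusted more and smoothed less.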

In [7]:
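def score_jm(lm, query_terms, doc_id, lam=0.7):
    """Query log-likelihood under Jelinek-Mercer smoothing."""
    score = 0.0
    doc_len = lm.doc_lengths[doc_id]
    if doc_len == 0:
        return -float('inf')

    for term in query_terms:
        tf = lm.doc_models[doc_id].get(term, 0)
        p_mle = tf / doc_len
        p_coll = lm.get_collection_prob(term)

        prob = lam * p_mle + (1 - lam) * p_coll
        if prob <= 0:
            # Reachable when lam=1 and the term is absent from the document;
            # return -inf instead of letting math.log(0) raise ValueError.
            return -float('inf')
        score += math.log(prob)

    return score


def score_dirichlet(lm, query_terms, doc_id, mu=2000):
    """Query log-likelihood under Dirichlet prior smoothing."""
    score = 0.0
    doc_len = lm.doc_lengths[doc_id]
    if doc_len == 0:
        return -float('inf')

    for term in query_terms:
        tf = lm.doc_models[doc_id].get(term, 0)
        p_coll = lm.get_collection_prob(term)

        numerator = tf + (mu * p_coll)
        denominator = doc_len + mu

        prob = numerator / denominator
        if prob <= 0:
            # Reachable when mu=0 and the term is absent from the document.
            return -float('inf')
        score += math.log(prob)

    return score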
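
Both scoring functions above sum log-probabilities instead of multiplying the probabilities from the query-likelihood formula. The ranking is the same because the logarithm is monotonic, and the sum avoids floating-point underflow. A small sketch with made-up per-term probabilities shows the difference:

```python
import math

# Hypothetical per-term probabilities for a long, 200-term query.
term_probs = [1e-3] * 200

# Direct product: the true value (1e-600) is below the smallest positive
# float, so the result underflows to exactly 0.0.
product = 1.0
for p in term_probs:
    product *= p

# Sum of logs: the same quantity in log space stays finite.
log_score = sum(math.log(p) for p in term_probs)

print(f"product of probabilities: {product}")
print(f"sum of log-probabilities: {log_score:.2f}")
```

Once the product underflows, every document scores 0.0 and the ranking is lost, while the log-space scores remain comparable.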

## 3. Comparison and Ranking

In [8]:
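def rank_documents(query, method='jm', param=None):
    """Rank all documents for `query`; uses the module-level `lm` built above."""
    query_terms = tokenize(query)
    scores = []

    for doc_id in lm.doc_models:
        if method == 'jm':
            # Compare against None so an explicit 0 is not replaced by the default
            lam = 0.7 if param is None else param
            s = score_jm(lm, query_terms, doc_id, lam=lam)
        elif method == 'dirichlet':
            mu = 2000 if param is None else param
            s = score_dirichlet(lm, query_terms, doc_id, mu=mu)
        else:
            raise ValueError(f"Unknown smoothing method: {method!r}")
        scores.append((doc_id, s))

    return sorted(scores, key=lambda x: x[1], reverse=True)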

# Test Query
query = "नेपालको इतिहास"

print("--- Jelinek-Mercer (lambda=0.7) ---")
for doc_id, score in rank_documents(query, 'jm')[:3]:
    print(f"{doc_id}: {score:.4f}")

print("\n--- Dirichlet (mu=100) ---")
# Using smaller mu because our docs are short
for doc_id, score in rank_documents(query, 'dirichlet', param=100)[:3]:
    print(f"{doc_id}: {score:.4f}")

--- Jelinek-Mercer (lambda=0.7) ---
doc01: -7.1772
doc05: -9.7240
doc04: -9.9014

--- Dirichlet (mu=100) ---
doc01: -7.6233
doc05: -9.3132
doc04: -9.4448
