# Part 6.2: Probabilistic Ranking with BM25

This notebook implements the Okapi BM25 ranking algorithm, a widely used probabilistic ranking function in Information Retrieval. We implement it from scratch in plain Python (standard library only) to understand its two main components: term frequency saturation and document length normalization.

In [1]:
import math
import os
import glob
from collections import defaultdict, Counter

## 1. Load and Preprocess Data
We will reuse our dummy Nepali dataset and basic preprocessing.

In [2]:
def load_documents(data_dir="../data"):
    documents = {}
    for filepath in glob.glob(os.path.join(data_dir, "*.txt")):
        doc_id = os.path.basename(filepath)
        with open(filepath, 'r', encoding='utf-8') as f:
            documents[doc_id] = f.read()
    return documents

def preprocess(text):
    # Whitespace tokenization; lower() only affects Latin-script tokens,
    # because Devanagari has no letter case
    return text.lower().split()

documents = load_documents()
print(f"Loaded {len(documents)} documents.")

Loaded 10 documents.
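
Whitespace tokenization leaves punctuation attached to tokens, so a word followed by a danda (`।`) would be indexed as a different term from the bare word. If the texts contain such punctuation, a slightly stricter tokenizer may help. The following is a minimal sketch; the name `preprocess_strip_punct` and the punctuation set `EDGE_PUNCT` are illustrative assumptions, not part of the original preprocessing.

```python
# Characters trimmed from token edges (an assumed set; extend as needed).
EDGE_PUNCT = "।॥,.!?;:\"'()"

def preprocess_strip_punct(text):
    """Whitespace tokenization that also trims punctuation from token edges."""
    tokens = (tok.strip(EDGE_PUNCT) for tok in text.lower().split())
    # Drop tokens that consisted only of punctuation
    return [tok for tok in tokens if tok]

print(preprocess_strip_punct("नेपालको संविधान।"))
```

Queries would need to go through the same function as the documents, otherwise query and index terms will not match.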


## 2. BM25 Theory and Components

The BM25 score for a document $D$ given a query $Q$ is calculated as:

$$ \text{score}(D, Q) = \sum_{q \in Q} \text{IDF}(q) \cdot \frac{f(q, D) \cdot (k_1 + 1)}{f(q, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$

Where:
- $f(q, D)$ is the term frequency of term $q$ in document $D$
- $|D|$ is the length of document $D$ in tokens
- $\text{avgdl}$ is the average document length in the collection
- $k_1$ controls term frequency saturation (typically 1.2 to 2.0)
- $b$ controls length normalization (0 to 1, typically 0.75)
- $\text{IDF}(q)$ is the Inverse Document Frequency weight; this notebook uses $\ln\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)$, where $N$ is the number of documents and $n(q)$ is the number of documents containing $q$
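
To make the saturation effect concrete, the term frequency factor of the formula can be evaluated on its own. This is a minimal sketch: the helper `tf_weight` is a hypothetical name introduced for illustration, and `norm` stands for the length term $1 - b + b \cdot |D| / \text{avgdl}$, fixed at 1 (a document of average length).

```python
def tf_weight(tf, k1, norm=1.0):
    """Per-term BM25 factor without IDF; norm stands for 1 - b + b * |D| / avgdl."""
    return tf * (k1 + 1) / (tf + k1 * norm)

# Each additional occurrence adds less weight; the factor is bounded by k1 + 1.
for k1 in (0.5, 1.5, 2.0):
    weights = [round(tf_weight(tf, k1), 3) for tf in (1, 2, 5, 20, 100)]
    print(f"k1={k1}: {weights}  (upper bound k1 + 1 = {k1 + 1})")
```

For a document of average length the factor equals 1 at `tf = 1` for any `k1`, and it never exceeds `k1 + 1`. A smaller `k1` gets close to that ceiling after fewer occurrences.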

## 3. Implementation of BM25 Indexer

In [3]:
class BM25Indexer:
    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_lengths = {}
        self.avgdl = 0
        self.doc_freqs = defaultdict(int)
        self.term_freqs = defaultdict(lambda: defaultdict(int)) # term -> doc_id -> count
        self.N = 0
        self.idf = {}
    
    def fit(self, documents):
        self.N = len(documents)
        total_length = 0
        
        for doc_id, text in documents.items():
            tokens = preprocess(text)
            length = len(tokens)
            self.doc_lengths[doc_id] = length
            total_length += length
            
            # Calculate Term Frequencies per document
            counts = Counter(tokens)
            for term, count in counts.items():
                self.term_freqs[term][doc_id] = count
                self.doc_freqs[term] += 1
                
        # Guard against ZeroDivisionError when the collection is empty
        self.avgdl = total_length / self.N if self.N else 0.0
        self._calc_idf()
        
    def _calc_idf(self):
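        # Smoothed IDF: log((N - n + 0.5) / (n + 0.5) + 1), where n is the document frequency.
        # The +1 inside the log keeps the weight positive even for very common terms.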
        for term, freq in self.doc_freqs.items():
            self.idf[term] = math.log((self.N - freq + 0.5) / (freq + 0.5) + 1)

    def score(self, query, doc_id):
        score = 0.0
        query_terms = preprocess(query)
        doc_len = self.doc_lengths.get(doc_id, 0)
        
        for term in query_terms:
            if term not in self.term_freqs:
                continue
            
            idf = self.idf.get(term, 0)
            tf = self.term_freqs[term].get(doc_id, 0)
            
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / self.avgdl))
            
            score += idf * (numerator / denominator)
            
        return score
    
    def search(self, query):
        # Score every indexed document and keep only those with a positive score
        scores = []
        for doc_id in self.doc_lengths:
            s = self.score(query, doc_id)
            if s > 0:
                scores.append((doc_id, s))
        
        return sorted(scores, key=lambda x: x[1], reverse=True)
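
The IDF used in `_calc_idf` adds 1 inside the logarithm. The sketch below compares it with the same expression without the `+ 1`, which becomes negative once a term appears in more than half of the documents. The function names and the collection size `N = 10` are assumptions chosen for illustration.

```python
import math

def idf_smoothed(N, n):
    # Variant used in this notebook: always positive
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def idf_unsmoothed(N, n):
    # Same expression without the +1: negative when n > N / 2
    return math.log((N - n + 0.5) / (n + 0.5))

N = 10  # hypothetical collection size
for n in (1, 5, 9, 10):
    print(n, round(idf_smoothed(N, n), 4), round(idf_unsmoothed(N, n), 4))
```

Because the smoothed weight is always positive, any document containing at least one query term gets a positive score, so the `s > 0` filter in `search` drops only documents that contain none of the query terms.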

## 4. Testing the BM25 Model

In [4]:
bm25 = BM25Indexer(k1=1.5, b=0.75)
bm25.fit(documents)

# Sample Query
query = "नेपालको संविधान" # Constitution of Nepal
results = bm25.search(query)

print(f"Results for query: '{query}'")
for doc_id, score in results:
    print(f"{doc_id}: {score:.4f}")

Results for query: 'नेपालको संविधान'
doc04.txt: 0.4348
doc01.txt: 0.4201
doc08.txt: 0.4188
doc03.txt: 0.3801
doc07.txt: 0.2711
doc02.txt: 0.2574
doc10.txt: 0.2503
doc09.txt: 0.2490


## 5. Parameter Tuning Analysis
Let's see how changing $k_1$ affects the scores while $b$ stays fixed at 0.75.

In [5]:
# Compare high k1 (2.0, slower TF saturation) vs low k1 (0.5, faster TF saturation)
bm25_high_k = BM25Indexer(k1=2.0, b=0.75)
bm25_high_k.fit(documents)

bm25_low_k = BM25Indexer(k1=0.5, b=0.75)
bm25_low_k.fit(documents)

print("\n--- High k1 (2.0) ---")
print(bm25_high_k.search(query)[:3])

print("\n--- Low k1 (0.5) ---")
print(bm25_low_k.search(query)[:3])


--- High k1 (2.0) ---
[('doc04.txt', 0.47064720728437776), ('doc01.txt', 0.4516810846315697), ('doc08.txt', 0.4500324129514447)]

--- Low k1 (0.5) ---
[('doc04.txt', 0.33315166872757707), ('doc01.txt', 0.32827303324143997), ('doc08.txt', 0.32783659668051623)]
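
The cell above varies only `k1`. The effect of `b` can be sketched separately with the term frequency factor alone. This is a standalone illustration under stated assumptions: the helper names, the average length of 100 tokens, and the two document lengths are hypothetical and not taken from the dataset.

```python
def length_norm(doc_len, avgdl, b):
    """Length term 1 - b + b * |D| / avgdl from the BM25 denominator."""
    return 1 - b + b * (doc_len / avgdl)

def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Per-term BM25 factor without IDF."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avgdl, b))

avgdl = 100  # hypothetical average document length
for b in (0.0, 0.75, 1.0):
    short_doc = tf_component(tf=2, doc_len=50, avgdl=avgdl, b=b)
    long_doc = tf_component(tf=2, doc_len=300, avgdl=avgdl, b=b)
    print(f"b={b}: short={short_doc:.4f} long={long_doc:.4f}")
```

With `b = 0` document length is ignored and both documents get the same weight. As `b` approaches 1, the longer document is penalized more strongly for the same term frequency.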
