# 09. Enhanced Query Expansion

## Introduction

Query Expansion aims to solve vocabulary mismatch by adding related terms to the user's query. In this enhanced notebook, we implement:

1. **Pseudo-Relevance Feedback**: Extracting terms from top-ranked documents.
2. **Global Analysis (Co-occurrence Matrix)**: Identifying terms that frequently appear together across the entire collection to find semantic associations.

In [1]:
from pathlib import Path
from collections import Counter, defaultdict
import math

DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def tokenize(text):
    # Simple tokenization for demonstration
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token:
            cleaned.append(token)
    return cleaned

documents = load_documents(DATA_DIR)
print(f"Loaded {len(documents)} documents")

Loaded 10 documents


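## 1. Co-occurrence Matrix Construction
We build a term-term co-occurrence matrix. Two terms are treated as related if they frequently co-occur within the same window of tokens (`window_size` tokens on either side of the target term).

Association measure: **mutual information** or a simple **co-occurrence count**. The class in this section uses the raw co-occurrence count.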

In [2]:
class CooccurrenceMatrix:
    def __init__(self, documents, window_size=5):
        self.matrix = defaultdict(Counter)
        self.vocab = set()
        self.window_size = window_size
        self._build(documents)
        
    def _build(self, documents):
        for doc_text in documents.values():
            tokens = tokenize(doc_text)
            self.vocab.update(tokens)
            
            # Window-based co-occurrence
            for i, target_term in enumerate(tokens):
                start = max(0, i - self.window_size)
                end = min(len(tokens), i + self.window_size + 1)
                
                context = tokens[start:i] + tokens[i+1:end]
                for ctx_term in context:
                    self.matrix[target_term][ctx_term] += 1
                    
    def get_related_terms(self, term, top_k=5):
        if term not in self.matrix:
            return []
        
        # Return most frequent co-occurring terms
        return self.matrix[term].most_common(top_k)

co_matrix = CooccurrenceMatrix(documents)
print(f"Built matrix with vocabulary size: {len(co_matrix.vocab)}")

Built matrix with vocabulary size: 452
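
The class above ranks related terms by raw count, which tends to favor very frequent words. As a rough illustration of the mutual-information alternative mentioned earlier, the following sketch scores context terms by pointwise mutual information (PMI). The helper names `build_counts` and `pmi_related_terms` and the toy English corpus are assumptions made for this example, not part of the notebook's code; probabilities are estimated from the same symmetric window counts.

```python
import math
from collections import Counter, defaultdict

def build_counts(token_lists, window_size=1):
    """Symmetric window co-occurrence counts, same scheme as CooccurrenceMatrix."""
    cooc = defaultdict(Counter)
    for tokens in token_lists:
        for i, target in enumerate(tokens):
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            for ctx in tokens[start:i] + tokens[i + 1:end]:
                cooc[target][ctx] += 1
    return cooc

def pmi_related_terms(cooc, term, top_k=5):
    """Rank context terms by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    if term not in cooc:
        return []
    # Marginal counts: how many (target, context) pairs each term takes part in.
    row_totals = {t: sum(c.values()) for t, c in cooc.items()}
    total = sum(row_totals.values())
    scored = []
    for ctx, n_xy in cooc[term].items():
        n_y = row_totals.get(ctx, 0)
        if n_y == 0:
            continue
        # With P = n / total, the ratio simplifies to n_xy * total / (n_x * n_y).
        pmi = math.log2((n_xy * total) / (row_totals[term] * n_y))
        scored.append((ctx, pmi))
    # Highest PMI first; ties broken alphabetically for determinism.
    scored.sort(key=lambda kv: (-kv[1], kv[0]))
    return scored[:top_k]

# Hypothetical toy corpus: "the" and "nepal" co-occur with "history" equally often,
# but "the" also appears next to many other words.
toy_docs = [
    ["the", "history", "nepal"],
    ["the", "history", "nepal"],
    ["the", "cat"],
    ["the", "dog"],
    ["the", "sun"],
]
cooc = build_counts(toy_docs)
ranked = pmi_related_terms(cooc, "history")
print(ranked)
```

In this toy corpus both candidates have the same raw count with "history", but PMI places "nepal" ahead of "the" because "the" co-occurs with many other words. PMI is known to overrate rare terms, so a minimum-count threshold is a common companion; that is left out here for brevity.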


## 2. Query Expansion using Co-occurrence
For each query term, we find highly correlated terms from the matrix and add them to the query.

In [3]:
def expand_query_global(query, co_matrix, num_expansion_terms=2):
    query_terms = tokenize(query)
    expanded_query = list(query_terms)
    
    print(f"Original terms: {query_terms}")
    
    for term in query_terms:
        related = co_matrix.get_related_terms(term, top_k=num_expansion_terms)
        if related:
            print(f"  Related to '{term}': {related}")
            new_terms = [t for t, count in related]
            expanded_query.extend(new_terms)
            
    # Remove duplicates while preserving order
    final_query = list(dict.fromkeys(expanded_query))
    return " ".join(final_query)

# Test
test_query = "इतिहास"
expanded = expand_query_global(test_query, co_matrix)
print(f"\nExpanded Query: {expanded}")

Original terms: ['इतिहास']
  Related to 'इतिहास': [('नेपालको', 2), ('र', 2)]

Expanded Query: इतिहास नेपालको र
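
The expanded query above picks up the conjunction 'र' ("and"), which adds little retrieval value. One possible remedy is to filter expansion candidates against a stopword list before adding them. The sketch below assumes a hypothetical, deliberately tiny stopword set (`NEPALI_STOPWORDS`) and a hypothetical helper `select_expansion_terms`; neither is part of the notebook's code.

```python
# Hypothetical, deliberately tiny stopword set; a real list would be much longer.
NEPALI_STOPWORDS = {"र", "छ", "को", "मा", "हो"}

def select_expansion_terms(related, query_terms, stopwords, num_terms=2):
    """Pick up to num_terms candidates that are neither stopwords nor already in the query."""
    seen = set(query_terms)
    selected = []
    for term, _count in related:
        if term in stopwords or term in seen:
            continue
        selected.append(term)
        seen.add(term)
        if len(selected) == num_terms:
            break
    return selected

# Candidates copied from the output above.
related = [("नेपालको", 2), ("र", 2)]
filtered = select_expansion_terms(related, ["इतिहास"], NEPALI_STOPWORDS)
print(filtered)  # ['नेपालको']
```

Because filtering can discard candidates, it helps to request more candidates than needed (a larger `top_k` in `get_related_terms`) and trim after filtering.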


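## 3. Pseudo-Relevance Feedback (Local Analysis)
Pseudo-relevance feedback treats the top-ranked documents for the original query as relevant and draws expansion terms from them. This is a refined version of the earlier placeholder implementation; it assumes a ranking function is available, so a hard-coded result list stands in for the ranker.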

In [4]:
# Placeholder ranking results for demonstration
# In a real scenario, we would run `bm25.search(query)` first
top_ranked_docs = ['doc01', 'doc03'] 

def extract_feedback_terms(relevant_doc_ids, documents, top_k=3):
    term_counts = Counter()
    for doc_id in relevant_doc_ids:
        if doc_id in documents:
            tokens = tokenize(documents[doc_id])
            term_counts.update(tokens)
            
    return [term for term, count in term_counts.most_common(top_k)]

feedback_terms = extract_feedback_terms(top_ranked_docs, documents)
print(f"Feedback terms from {top_ranked_docs}: {feedback_terms}")

Feedback terms from ['doc01', 'doc03']: ['र', 'छ', 'नेपालको']
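
The cell above hard-codes the ranked list. To show how the pieces might fit together end to end, here is a minimal sketch of the full feedback loop: rank, take the top documents, count their terms, and append the most frequent ones to the query. The ranker is a simple term-overlap count used as a stand-in, not BM25, and the function names, toy documents, and stopword set are all assumptions for illustration.

```python
from collections import Counter

def rank_by_overlap(query_terms, docs, top_n=2):
    """Stand-in ranker: score is the number of query-term occurrences in a document."""
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

def pseudo_relevance_feedback(query_terms, docs, top_n=2, top_k=2, stopwords=frozenset()):
    """Expand the query with the most frequent terms of the top-ranked documents."""
    top_docs = rank_by_overlap(query_terms, docs, top_n)
    term_counts = Counter()
    for doc_id in top_docs:
        term_counts.update(
            t for t in docs[doc_id] if t not in stopwords and t not in query_terms
        )
    feedback = [term for term, _ in term_counts.most_common(top_k)]
    return list(query_terms) + feedback

# Hypothetical pre-tokenized toy collection.
toy_docs = {
    "d1": ["history", "of", "nepal", "kingdom"],
    "d2": ["nepal", "history", "and", "culture"],
    "d3": ["football", "match", "report"],
}
expanded_terms = pseudo_relevance_feedback(["history"], toy_docs, stopwords={"of", "and"})
print(expanded_terms)
```

Excluding stopwords and the original query terms before counting keeps the feedback list from being dominated by function words, which is what happens in the output above.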


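## Summary
Global analysis (co-occurrence) is static and corpus-dependent: associations are computed once over the whole collection and tend to be generic. Local analysis (pseudo-relevance feedback) is dynamic and query-dependent, offering context-specific terms. The two are complementary and can be combined. In both outputs above, function words such as 'र' ("and") rank among the top candidates, so stopword filtering or a frequency-normalized association measure is needed before either method is useful in practice.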
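
One way the two sources could be combined is to merge their ranked candidate lists with a weighted reciprocal-rank score. This is only a sketch of one plausible scheme, not a method taken from this notebook; `combine_expansions`, its `weight_global` parameter, and the example terms are assumptions.

```python
def combine_expansions(query_terms, global_terms, local_terms,
                       weight_global=0.5, max_terms=3):
    """Merge two ranked candidate lists using weighted reciprocal-rank scores."""
    scores = {}
    sources = ((weight_global, global_terms), (1.0 - weight_global, local_terms))
    for weight, ranked in sources:
        for rank, term in enumerate(ranked, start=1):
            if term in query_terms:
                continue
            # A term earns more credit the higher it ranks in each list.
            scores[term] = scores.get(term, 0.0) + weight / rank
    # Highest combined score first; ties broken alphabetically.
    best = sorted(scores, key=lambda t: (-scores[t], t))[:max_terms]
    return list(query_terms) + best

# Hypothetical candidate lists from global and local analysis.
combined = combine_expansions(
    ["history"],
    global_terms=["nepal", "ancient"],
    local_terms=["nepal", "kingdom"],
)
print(combined)  # ['history', 'nepal', 'ancient', 'kingdom']
```

Terms proposed by both analyses accumulate score from each list, so agreement between the global and local views is rewarded. The weight would need tuning against relevance judgments for a real collection.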