# 04.01. Positional Index

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: Positional Indexing](#theory)
3. [Building Positional Index](#building)
4. [Phrase Queries](#phrases)
5. [Proximity Queries](#proximity)
6. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

A **positional index** extends the inverted index by storing not just which documents contain a term, but also **where** in each document the term appears.

### Why Positional Indexing?
- **Phrase Queries**: Find "नेपाल सरकार" (exactly this phrase)
- **Proximity Queries**: Find documents where two terms appear close together
- **More Precise**: Rejects documents that contain all the terms but not next to (or near) each other, which a simple Boolean AND would still return

### Example:
```
Document: "नेपाल सुन्दर देश हो। नेपाल हिमालको देश हो।"

Simple Index:
नेपाल → {doc1}

Positional Index:
नेपाल → {doc1: [0, 4]}  (appears at positions 0 and 4)
```

---

## 2. Theory: Positional Indexing <a name="theory"></a>

### Structure:
```
Term → {DocID: [pos1, pos2, pos3, ...]}
```

### Storage Requirements:
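- **Simple Index**: `O(T × D)` in the worst case, where T = unique terms and D = documents (one entry per term-document pair)
- **Positional Index**: `O(T × D × P)`, where P = average number of positions per term per document (one entry per term occurrence, so it grows with the total token count)
- Typically **2-4x larger** than a simple inverted index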
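
To make the difference concrete, here is a small sketch that counts the entries each kind of index would hold for a toy corpus. The corpus `toy_docs` is made up for illustration, and counting entries ignores per-entry overhead and compression, so it shows only the growth pattern, not real byte sizes.

```python
from collections import defaultdict

# Hypothetical, already-tokenized toy corpus (not the notebook's data).
toy_docs = {
    "d1": ["नेपाल", "देश", "नेपाल", "हिमाल", "नेपाल"],
    "d2": ["देश", "देश", "हिमाल"],
    "d3": ["नेपाल", "हिमाल"],
}

doc_postings = defaultdict(set)                         # term -> {doc_id}
pos_postings = defaultdict(lambda: defaultdict(list))   # term -> {doc_id: [positions]}

for doc_id, tokens in toy_docs.items():
    for position, token in enumerate(tokens):
        doc_postings[token].add(doc_id)
        pos_postings[token][doc_id].append(position)

# Simple index: one entry per (term, document) pair.
n_doc_entries = sum(len(docs) for docs in doc_postings.values())
# Positional index: one stored position per token occurrence.
n_positions = sum(len(p) for docs in pos_postings.values() for p in docs.values())
n_tokens = sum(len(tokens) for tokens in toy_docs.values())

print(n_doc_entries)  # 7
print(n_positions)    # 10 (equal to the total number of tokens)
```

The positional count always equals the number of tokens in the corpus, which is why long documents inflate a positional index far more than a simple one.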

### Trade-off:
- ✓ More expressive queries
- ✓ Better precision
- ✗ Larger storage
- ✗ Slower to build

---

## 3. Building Positional Index <a name="building"></a>

In [1]:
from pathlib import Path
from collections import defaultdict

# Load data
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f, None)  # skip the first line (CSV header); safe on an empty file
        for line in f:
            stopwords.add(line.strip())
    return stopwords

def load_stemming_dict(file_path):
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f, None)  # skip the first line (CSV header); safe on an empty file
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                stem_dict[parts[0]] = parts[1]
    return stem_dict

def tokenize(text):
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded {len(preprocessed_docs)} documents")

✓ Loaded 10 documents


In [2]:
def build_positional_index(preprocessed_docs):
    """
    Build positional index.
    
    Structure: {term: {doc_id: [pos1, pos2, ...]}}
    """
    positional_index = defaultdict(lambda: defaultdict(list))
    
    for doc_id, terms in preprocessed_docs.items():
        for position, term in enumerate(terms):
            positional_index[term][doc_id].append(position)
    
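    # Convert to plain dicts so later lookups cannot silently create entries
    return {term: dict(postings) for term, postings in positional_index.items()}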

# Build the index
pos_index = build_positional_index(preprocessed_docs)

print("✓ Built positional index")
print(f"  Unique terms: {len(pos_index)}")

# Show example
sample_term = next(iter(pos_index))
print(f"\n📌 Example term: '{sample_term}'")
for doc_id, positions in list(pos_index[sample_term].items())[:2]:
    print(f"   {doc_id}: positions {positions}")

✓ Built positional index
  Unique terms: 398

📌 Example term: 'नेपाल'
   doc01: positions [0, 3, 19, 25, 36, 47]
   doc02: positions [2, 21, 23, 30]


---

## 4. Phrase Queries <a name="phrases"></a>

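A **phrase query** finds documents where the query terms appear in the same order at consecutive positions. Because positions are assigned after stopword removal and stemming, "consecutive" here means adjacent in the preprocessed token stream, so a phrase can still match when a stopword separated the words in the raw text.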

### Algorithm:
1. Get posting lists for all terms in phrase
2. Find documents containing ALL terms
3. Check if positions are consecutive

In [3]:
def phrase_query(phrase_terms, pos_index):
    """
    Find documents containing the exact phrase.
    
    Parameters:
    -----------
    phrase_terms : list
        Terms in the phrase (already preprocessed)
    pos_index : dict
        Positional index
    
    Returns:
    --------
    set : Document IDs containing the phrase
    """
    if not phrase_terms:
        return set()
    
    # Get documents containing first term
    if phrase_terms[0] not in pos_index:
        return set()
    
    candidate_docs = set(pos_index[phrase_terms[0]].keys())
    
    # Filter to docs containing all terms
    for term in phrase_terms[1:]:
        if term not in pos_index:
            return set()
        candidate_docs &= set(pos_index[term].keys())
    
    # Check for consecutive positions
    result = set()
    
    for doc_id in candidate_docs:
        # Get positions of first term
        first_positions = pos_index[phrase_terms[0]][doc_id]
        
        for start_pos in first_positions:
            # Check if subsequent terms appear at consecutive positions
            found_phrase = True
            
            for i, term in enumerate(phrase_terms[1:], 1):
                expected_pos = start_pos + i
                if expected_pos not in pos_index[term][doc_id]:
                    found_phrase = False
                    break
            
            if found_phrase:
                result.add(doc_id)
                break  # Found phrase in this doc
    
    return result

# Example phrase query
phrase = "नेपाल हिमाल"  # Example phrase; replace with any Nepali phrase
phrase_tokens = preprocess_text(phrase, stopwords, stem_dict)

print(f"🔍 Phrase Query: '{phrase}'")
print(f"   Preprocessed: {phrase_tokens}")

results = phrase_query(phrase_tokens, pos_index)
print(f"\n✓ Documents containing phrase: {results}")

🔍 Phrase Query: 'नेपाल हिमाल'
   Preprocessed: ['नेपाल', 'हिमाल']

✓ Documents containing phrase: {'doc02'}
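
The membership test inside `phrase_query` scans a position list for every candidate start. One alternative, sketched below under the assumption that a single document's postings are available as `{term: [positions]}`, is to shift each term's positions back by its offset in the phrase and intersect the resulting sets. The function name `phrase_positions` and the sample data are hypothetical.

```python
def phrase_positions(phrase_terms, postings_for_doc):
    """Return the sorted start positions of the phrase within one document.

    postings_for_doc: {term: [positions]} for a single document
    (an assumed layout, chosen for this illustration).
    """
    if not phrase_terms:
        return []
    starts = None
    for offset, term in enumerate(phrase_terms):
        # Shift each occurrence back by the term's offset in the phrase, so
        # a full match lines up on the same start position for every term.
        shifted = {p - offset for p in postings_for_doc.get(term, [])}
        starts = shifted if starts is None else starts & shifted
        if not starts:
            return []
    return sorted(starts)

example_doc = {"नेपाल": [0, 4], "हिमालको": [5], "देश": [2, 6]}
print(phrase_positions(["नेपाल", "हिमालको", "देश"], example_doc))  # [4]
print(phrase_positions(["नेपाल", "देश"], example_doc))              # []
```

Unlike `phrase_query`, this variant reports every start position rather than stopping at the first match, which can be useful for highlighting.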


---

## 5. Proximity Queries <a name="proximity"></a>

**Proximity queries** find documents in which two terms occur within a given number of positions of each other, in either order.

Example: Find "नेपाल" within 3 words of "संस्कृति"

In [4]:
def proximity_query(term1, term2, max_distance, pos_index):
    """
    Find documents where term1 and term2 appear within max_distance.
    
    Parameters:
    -----------
    term1, term2 : str
        Terms to search for
    max_distance : int
        Maximum distance between terms
    pos_index : dict
        Positional index
    
    Returns:
    --------
    dict : {doc_id: [(pos1, pos2), ...]}
    """
    if term1 not in pos_index or term2 not in pos_index:
        return {}
    
    # Find common documents
    docs1 = set(pos_index[term1].keys())
    docs2 = set(pos_index[term2].keys())
    common_docs = docs1 & docs2
    
    result = {}
    
    for doc_id in common_docs:
        positions1 = pos_index[term1][doc_id]
        positions2 = pos_index[term2][doc_id]
        
        matches = []
        for p1 in positions1:
            for p2 in positions2:
                if abs(p1 - p2) <= max_distance:
                    matches.append((p1, p2))
        
        if matches:
            result[doc_id] = matches
    
    return result

# Example proximity query
print("🔍 Proximity Query Example:")
print("   (Would need actual Nepali terms in index)")
print("\n💡 Proximity queries allow flexible matching!")

🔍 Proximity Query Example:
   (Would need actual Nepali terms in index)

💡 Proximity queries allow flexible matching!
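
Since the cell above does not run a real query, here is a self-contained sketch on a made-up index. It also replaces the nested loops of `proximity_query` with a two-pointer scan, which only answers whether some pair is close enough and relies on position lists being sorted (true when positions are appended in document order). The names `within_distance` and `toy_index` are hypothetical.

```python
def within_distance(positions1, positions2, max_distance):
    """Return True if some p1, p2 satisfy |p1 - p2| <= max_distance.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        p1, p2 = positions1[i], positions2[j]
        if abs(p1 - p2) <= max_distance:
            return True
        # Advance the pointer at the smaller position: it cannot get any
        # closer to the later positions in the other list.
        if p1 < p2:
            i += 1
        else:
            j += 1
    return False

# Hypothetical index: term -> {doc_id: [positions]}
toy_index = {
    "नेपाल": {"d1": [0, 10], "d2": [3]},
    "संस्कृति": {"d1": [12], "d2": [9]},
}

term1, term2 = "नेपाल", "संस्कृति"
common_docs = set(toy_index[term1]) & set(toy_index[term2])
matches = sorted(
    doc_id for doc_id in common_docs
    if within_distance(toy_index[term1][doc_id], toy_index[term2][doc_id], 3)
)
print(matches)  # ['d1']
```

In `d1` the terms sit at positions 10 and 12 (distance 2), while in `d2` they are 6 positions apart, so only `d1` matches. The scan touches each position at most once, instead of comparing every pair.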


---

## 6. Summary <a name="summary"></a>

### What We Learned:

1. **Positional Index**
   - Stores term positions within documents
   - Structure: `{term: {doc: [positions]}}`
   - 2-4x larger than simple index

2. **Phrase Queries**
   - Find exact phrase matches
   - Check consecutive positions
   - More precise than Boolean AND

3. **Proximity Queries**
   - Find terms within distance threshold
   - Flexible matching
   - Useful for related concepts

### Comparison:

| Query Type | Example | Matches |
|------------|---------|----------|
| Boolean AND | नेपाल AND हिमाल | Both terms anywhere |
| Phrase | "नेपाल हिमाल" | Exact consecutive |
| Proximity | नेपाल /3 हिमाल | Within 3 words |
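
The three rows can be compared side by side with a short sketch. The documents in `toy_docs` are invented for illustration, and the queries are written inline rather than calling the notebook's functions so that the block runs on its own.

```python
from collections import defaultdict

# Hypothetical, already-tokenized documents.
toy_docs = {
    "d1": ["नेपाल", "हिमाल", "देश"],                        # adjacent
    "d2": ["नेपाल", "सुन्दर", "हिमाल"],                      # 2 positions apart
    "d3": ["नेपाल", "सुन्दर", "देश", "हो", "देश", "हिमाल"],  # 5 positions apart
}

index = defaultdict(dict)  # term -> {doc_id: [positions]}
for doc_id, tokens in toy_docs.items():
    for position, token in enumerate(tokens):
        index[token].setdefault(doc_id, []).append(position)

t1, t2 = "नेपाल", "हिमाल"
both = set(index[t1]) & set(index[t2])

# Boolean AND: both terms anywhere in the document.
boolean_and = sorted(both)
# Phrase: t2 directly follows t1.
phrase = sorted(d for d in both
                if any(p + 1 in index[t2][d] for p in index[t1][d]))
# Proximity (/3): the terms are at most 3 positions apart, in either order.
proximity = sorted(d for d in both
                   if any(abs(p1 - p2) <= 3
                          for p1 in index[t1][d] for p2 in index[t2][d]))

print(boolean_and)  # ['d1', 'd2', 'd3']
print(phrase)       # ['d1']
print(proximity)    # ['d1', 'd2']
```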

### Limitations:
- Larger storage requirements
- Slower query processing than a plain Boolean lookup, because position lists must be compared as well
- Not suitable for very large corpora without optimization

### Extensions:
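- **Biword indexes**: Index each pair of consecutive words as a single term
- **Skip pointers**: Speed up postings-list intersection by jumping over entries that cannot match
- **Compressed positions**: Reduce storage by storing gaps (deltas) between successive positions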
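
As a rough sketch of the delta idea, the snippet below converts a sorted position list into gaps and back. It shows only the gap transformation; a real index would additionally pack the gaps with a variable-length code, which is not shown here. The function names `to_gaps` and `from_gaps` are illustrative.

```python
from itertools import accumulate

def to_gaps(positions):
    """Encode a sorted position list as the first position plus successive gaps."""
    gaps = []
    previous = 0
    for p in positions:
        gaps.append(p - previous)
        previous = p
    return gaps

def from_gaps(gaps):
    """Recover the original positions with a running sum."""
    return list(accumulate(gaps))

positions = [0, 3, 19, 25, 36, 47]
gaps = to_gaps(positions)
print(gaps)             # [0, 3, 16, 6, 11, 11]
print(from_gaps(gaps))  # [0, 3, 19, 25, 36, 47]
```

The gaps are never larger than the positions they replace, and in long documents they are usually much smaller, so they fit in fewer bits once a variable-length code is applied.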

### References:
- Manning, Raghavan & Schütze (2008): "Introduction to Information Retrieval", Section 2.4
- Zobel & Moffat (2006): "Inverted files for text search engines"