# 03. Boolean Retrieval Model

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: Boolean Retrieval](#theory)
3. [Document Term Matrix](#matrix)
4. [Boolean Query Processing](#queries)
5. [Query Examples](#examples)
6. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

The **Boolean Retrieval Model** is one of the oldest and simplest information retrieval (IR) models. It treats each document as a set of terms and each query as a Boolean expression over terms (AND, OR, NOT); a document is returned only if it satisfies the expression.

### Real-World Uses:
- Library catalog systems
- Legal document retrieval
- Patent search
- E-discovery systems

---

## 2. Theory: Boolean Retrieval <a name="theory"></a>

### Binary Matching:
- Documents either **match** or **don't match** a query
- No concept of partial relevance or ranking
- Each term is either **present (1)** or **absent (0)** in a document

### Boolean Operators:

1. **AND**: Both terms must be present
   ```
   Query: नेपाल AND हिमाल
   Match: Documents containing BOTH terms
   ```

2. **OR**: At least one term must be present
   ```
   Query: नेपाल OR भारत
   Match: Documents containing EITHER term (or both)
   ```

3. **NOT**: Term must be absent
   ```
   Query: नेपाल AND NOT पर्यटक
   Match: Documents with नेपाल but without पर्यटक
   ```
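
The three operators map directly onto Python set operations. The following is a minimal sketch on a toy collection: the document IDs and term sets in `toy_docs` and the helper `docs_with` are made up for illustration and are not the notebook's corpus or API.

```python
# Toy collection (hypothetical): doc ID -> set of terms it contains
toy_docs = {
    "d1": {"नेपाल", "हिमाल"},
    "d2": {"नेपाल", "पर्यटक"},
    "d3": {"भारत"},
}

def docs_with(term):
    """Return the set of doc IDs whose term set contains `term`."""
    return {doc_id for doc_id, terms in toy_docs.items() if term in terms}

and_result = docs_with("नेपाल") & docs_with("हिमाल")    # AND     -> intersection
or_result = docs_with("नेपाल") | docs_with("भारत")      # OR      -> union
not_result = docs_with("नेपाल") - docs_with("पर्यटक")   # AND NOT -> difference

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Each operator costs one set operation once the per-term document sets are known, which is why Boolean matching is fast.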

### Document-Term Matrix:

```
          नेपाल  हिमाल  शिक्षा  पर्यटक
doc01      1      1      0       0
doc02      1      1      0       1
doc03      1      0      1       0
doc04      1      0      0       0
```
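
Reading such a matrix column by column gives one 0/1 incidence vector per term, and a Boolean query then becomes a bitwise operation over those vectors. The sketch below stores each vector as a Python integer. The vector values, `ALL_DOCS`, and `decode` are assumptions chosen for illustration, not values taken from the corpus.

```python
# Toy incidence vectors (hypothetical values): one bit per document,
# ordered doc01, doc02, doc03, doc04 from left to right.
doc_ids = ["doc01", "doc02", "doc03", "doc04"]
incidence = {
    "नेपाल": 0b1110,
    "हिमाल": 0b0100,
    "पर्यटक": 0b0100,
}
ALL_DOCS = (1 << len(doc_ids)) - 1  # mask with one bit set per document

def decode(bits):
    """Turn a bit vector back into the list of matching doc IDs."""
    n = len(doc_ids)
    return [doc_ids[i] for i in range(n) if bits & (1 << (n - 1 - i))]

# नेपाल AND NOT पर्यटक: AND the first vector with the complement of the second.
# The mask keeps the complement within the four document bits.
result = incidence["नेपाल"] & (~incidence["पर्यटक"] & ALL_DOCS)
print(decode(result))
```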

### Advantages:
- ✓ Simple and fast
- ✓ Precise control over queries
- ✓ Reproducible results

### Disadvantages:
- ✗ No ranking (every match is treated as equally relevant)
- ✗ Requires users to write exact Boolean expressions, which is hard for non-experts
- ✗ Feast or famine: AND queries tend to return too few results, OR queries too many

---

## 3. Document-Term Matrix <a name="matrix"></a>

In [1]:
from pathlib import Path
from collections import defaultdict

# Preprocessing helpers from the previous notebook, redefined here so this notebook is self-contained
DATA_DIR = Path('../data')

def load_documents(data_dir):
    """Load all documents."""
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def load_stopwords(file_path):
    """Load stopwords from CSV."""
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)  # Skip header
        for line in f:
            stopwords.add(line.strip())
    return stopwords

def load_stemming_dict(file_path):
    """Load stemming dictionary."""
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)  # Skip header
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                stem_dict[parts[0]] = parts[1]
    return stem_dict

def tokenize(text):
    """Tokenize text."""
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    """Complete preprocessing pipeline."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

# Load resources
documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

# Preprocess all documents
preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded and preprocessed {len(preprocessed_docs)} documents")

✓ Loaded and preprocessed 10 documents
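
To make the tokenizer's filtering rule concrete, here is a standalone sketch of the same logic applied to a single sentence. The sample sentence is made up. The range U+0900 to U+097F is the Unicode Devanagari block, so tokens without any Devanagari character (Latin words, ASCII digits) are dropped.

```python
def tokenize(text):
    """Split on whitespace, strip punctuation, keep tokens with Devanagari."""
    cleaned = []
    for token in text.split():
        # Remove leading/trailing punctuation, including the danda (।)
        token = token.strip('।,.!?;:"\'-()[]{}/')
        # Keep the token only if at least one character is Devanagari
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

sample = "नेपाल (Nepal) सुन्दर देश हो। 2024"  # hypothetical input
tokens = tokenize(sample)
print(tokens)
```

Note that `strip` only removes punctuation at the edges of a token, so punctuation inside a token is left in place.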


In [2]:
def build_document_term_matrix(preprocessed_docs):
    """
    Build a binary document-term matrix.
    
    Matrix[doc_id][term] = 1 if term in document, 0 otherwise
    
    Parameters:
    -----------
    preprocessed_docs : dict
        Mapping from doc_id to list of preprocessed terms
    
    Returns:
    --------
    dict : Nested dictionary representing the matrix
    set : Vocabulary (all unique terms)
    """
    # Build vocabulary
    vocabulary = set()
    for terms in preprocessed_docs.values():
        vocabulary.update(terms)
    
    # Build binary matrix
    matrix = {}
    for doc_id, terms in preprocessed_docs.items():
        # Convert terms list to set for fast lookup
        term_set = set(terms)
        matrix[doc_id] = {}
        
        # For each term in vocabulary, check if present in document
        for term in vocabulary:
            matrix[doc_id][term] = 1 if term in term_set else 0
    
    return matrix, vocabulary

# Build matrix
doc_term_matrix, vocabulary = build_document_term_matrix(preprocessed_docs)

print("✓ Built document-term matrix")
print(f"  Documents: {len(doc_term_matrix)}")
print(f"  Vocabulary size: {len(vocabulary)}")
print(f"  Matrix size: {len(doc_term_matrix)} × {len(vocabulary)} = {len(doc_term_matrix) * len(vocabulary)} cells")

✓ Built document-term matrix
  Documents: 10
  Vocabulary size: 398
  Matrix size: 10 × 398 = 3980 cells
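
A dense matrix stores a cell for every (document, term) pair, even though most cells are typically 0. The helper below is a rough sketch for measuring how full such a matrix is. `matrix_density` is a hypothetical name, and the toy matrix only stands in for `doc_term_matrix`.

```python
def matrix_density(matrix):
    """Fraction of cells equal to 1 in a nested-dict binary matrix."""
    total = sum(len(row) for row in matrix.values())
    ones = sum(sum(row.values()) for row in matrix.values())
    return ones / total if total else 0.0

# Toy matrix (hypothetical): 2 documents x 4 terms, 3 cells set to 1
toy_matrix = {
    "d1": {"a": 1, "b": 0, "c": 0, "d": 0},
    "d2": {"a": 0, "b": 1, "c": 1, "d": 0},
}
density = matrix_density(toy_matrix)
print(f"Density: {density:.3f}")
```

A low density means most of the stored cells carry no information, which is the usual motivation for moving to a sparse structure such as an inverted index.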


In [3]:
# Visualize a small portion of the matrix
def show_matrix_sample(matrix, vocabulary, sample_terms, sample_docs=None):
    """
    Display a sample of the document-term matrix.
    """
    if sample_docs is None:
        sample_docs = sorted(matrix.keys())[:5]
    
    print("\n📊 Document-Term Matrix (sample):")
    print("="*80)
    
    # Header
    header = "Doc ID    "
    for term in sample_terms:
        header += f"{term[:8]:<10}"
    print(header)
    print("="*80)
    
    # Rows
    for doc_id in sample_docs:
        row = f"{doc_id:<10}"
        for term in sample_terms:
            value = matrix[doc_id].get(term, 0)
            row += f"{value:<10}"
        print(row)
    
    print("="*80)

# Show sample with interesting terms
interesting_terms = ['नेपाल', 'हिमाल', 'शिक्षा', 'पर्यटक', 'स्वास्थ्य', 'राजनीति']
show_matrix_sample(doc_term_matrix, vocabulary, interesting_terms)


📊 Document-Term Matrix (sample):
Doc ID    नेपाल     हिमाल     शिक्षा    पर्यटक    स्वास्थ्  राजनीति   
doc01     1         1         0         0         0         0         
doc02     1         1         0         1         0         0         
doc03     1         0         1         0         0         0         
doc04     1         0         0         0         0         0         
doc05     1         0         0         0         0         0         


---

## 4. Boolean Query Processing <a name="queries"></a>

Now let's implement Boolean query processing with AND, OR, and NOT operators.

In [4]:
def get_documents_containing_term(term, matrix):
    """
    Get all documents containing a specific term.
    
    Parameters:
    -----------
    term : str
        Search term
    matrix : dict
        Document-term matrix
    
    Returns:
    --------
    set : Set of document IDs containing the term
    """
    result = set()
    for doc_id, terms in matrix.items():
        if terms.get(term, 0) == 1:
            result.add(doc_id)
    return result

def boolean_and(term1, term2, matrix):
    """
    Boolean AND: Documents containing BOTH terms.
    
    Returns:
    --------
    set : Intersection of documents containing term1 and term2
    """
    docs1 = get_documents_containing_term(term1, matrix)
    docs2 = get_documents_containing_term(term2, matrix)
    return docs1 & docs2  # Set intersection

def boolean_or(term1, term2, matrix):
    """
    Boolean OR: Documents containing EITHER term.
    
    Returns:
    --------
    set : Union of documents containing term1 or term2
    """
    docs1 = get_documents_containing_term(term1, matrix)
    docs2 = get_documents_containing_term(term2, matrix)
    return docs1 | docs2  # Set union

def boolean_not(term1, term2, matrix):
    """
    Boolean AND NOT: Documents containing term1 but NOT term2.
    
    Returns:
    --------
    set : Documents with term1 minus documents with term2
    """
    docs1 = get_documents_containing_term(term1, matrix)
    docs2 = get_documents_containing_term(term2, matrix)
    return docs1 - docs2  # Set difference

print("✓ Boolean query functions defined")

✓ Boolean query functions defined
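
`boolean_not` takes two terms, so it is really the binary AND NOT. A standalone NOT has to be complemented against the full set of document IDs. One possible sketch follows: `docs_containing` and `boolean_complement` are hypothetical helpers, and the toy matrix is illustrative only.

```python
def docs_containing(term, matrix):
    """Doc IDs whose row has a 1 for `term`."""
    return {doc_id for doc_id, row in matrix.items() if row.get(term, 0) == 1}

def boolean_complement(term, matrix):
    """Unary NOT: every document that does not contain `term`."""
    all_docs = set(matrix)  # the keys of the matrix are the document IDs
    return all_docs - docs_containing(term, matrix)

# Toy matrix (hypothetical): 3 documents x 2 terms
toy_matrix = {
    "d1": {"x": 1, "y": 0},
    "d2": {"x": 0, "y": 1},
    "d3": {"x": 0, "y": 0},
}
not_x = boolean_complement("x", toy_matrix)
print(sorted(not_x))
```

A bare NOT usually matches most of the collection, which is why it is normally combined with AND in practice.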


---

## 5. Query Examples <a name="examples"></a>

In [5]:
# Example 1: Single term query
print("\n🔍 Query 1: Documents containing 'नेपाल'")
print("="*70)
results = get_documents_containing_term('नेपाल', doc_term_matrix)
print(f"Results: {sorted(results)}")
print(f"Number of documents: {len(results)}")


🔍 Query 1: Documents containing 'नेपाल'
Results: ['doc01', 'doc02', 'doc03', 'doc04', 'doc05', 'doc06', 'doc07', 'doc08', 'doc09', 'doc10']
Number of documents: 10


In [6]:
# Example 2: AND query
print("\n🔍 Query 2: 'नेपाल' AND 'हिमाल'")
print("="*70)
results = boolean_and('नेपाल', 'हिमाल', doc_term_matrix)
print(f"Results: {sorted(results)}")
print(f"Number of documents: {len(results)}")

# Show context from matching documents
if results:
    print("\n📄 Sample from matching document:")
    sample_doc = sorted(results)[0]
    print(f"\n{sample_doc}:")
    print(documents[sample_doc][:200] + "...")


🔍 Query 2: 'नेपाल' AND 'हिमाल'
Results: ['doc01', 'doc02', 'doc09']
Number of documents: 3

📄 Sample from matching document:

doc01:
नेपालको इतिहास र संस्कृति

नेपाल दक्षिण एशियामा अवस्थित एउटा सुन्दर हिमाली देश हो। यो देश आफ्नो समृद्ध इतिहास र विविध संस्कृतिको लागि विश्वभर प्रसिद्ध छ। नेपालमा विभिन्न जातजाति र धर्मका मानिसहरू सद्भ...


In [7]:
# Example 3: OR query
print("\n🔍 Query 3: 'शिक्षा' OR 'स्वास्थ्य'")
print("="*70)
results = boolean_or('शिक्षा', 'स्वास्थ्य', doc_term_matrix)
print(f"Results: {sorted(results)}")
print(f"Number of documents: {len(results)}")


🔍 Query 3: 'शिक्षा' OR 'स्वास्थ्य'
Results: ['doc03', 'doc07']
Number of documents: 2


In [8]:
# Example 4: NOT query
print("\n🔍 Query 4: 'नेपाल' AND NOT 'पर्यटक'")
print("="*70)
results = boolean_not('नेपाल', 'पर्यटक', doc_term_matrix)
print(f"Results: {sorted(results)}")
print(f"Number of documents: {len(results)}")
print("\nInterpretation: Documents about Nepal but not about tourism")


🔍 Query 4: 'नेपाल' AND NOT 'पर्यटक'
Results: ['doc01', 'doc03', 'doc04', 'doc05', 'doc06', 'doc07', 'doc08', 'doc09', 'doc10']
Number of documents: 9

Interpretation: Documents about Nepal but not about tourism


In [11]:
# Example 5: Complex query
def complex_boolean_query(matrix):
    """
    Example: (शिक्षा OR प्रविधि) AND नेपाल
    
    Find documents about education OR technology in Nepal.
    """
    # Step 1: शिक्षा OR प्रविधि
    education_or_tech = boolean_or('शिक्षा', 'प्रविधि', matrix)
    
    # Step 2: Result AND नेपाल
    nepal_docs = get_documents_containing_term('नेपाल', matrix)
    final_result = education_or_tech & nepal_docs
    
    return final_result

print("\n🔍 Query 5: (शिक्षा OR प्रविधि) AND नेपाल")
print("="*70)
results = complex_boolean_query(doc_term_matrix)
print(f"Results: {sorted(results)}")
print(f"Number of documents: {len(results)}")
print("\nInterpretation: Documents about education or technology in Nepal")


🔍 Query 5: (शिक्षा OR प्रविधि) AND नेपाल
Results: ['doc03']
Number of documents: 1

Interpretation: Documents about education or technology in Nepal
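
Hand-coding a function for each compound query does not scale. One way to generalize is to represent a query as a nested tuple and evaluate it recursively. This is only a sketch under stated assumptions: the tuple format, the `evaluate` function, and the toy matrix are invented here and are not part of the notebook's code.

```python
def evaluate(query, matrix):
    """Evaluate a nested Boolean query against a binary doc-term matrix.

    A query is either a term (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2), or ("NOT", q).
    """
    if isinstance(query, str):
        # Base case: a single term -> documents whose row has a 1 for it
        return {d for d, row in matrix.items() if row.get(query, 0) == 1}
    op, *args = query
    if op == "NOT":
        # Unary NOT: complement against all document IDs
        return set(matrix) - evaluate(args[0], matrix)
    left, right = (evaluate(a, matrix) for a in args)
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    raise ValueError(f"Unknown operator: {op}")

# Toy matrix (hypothetical values)
toy_matrix = {
    "d1": {"नेपाल": 1, "शिक्षा": 1, "प्रविधि": 0},
    "d2": {"नेपाल": 1, "शिक्षा": 0, "प्रविधि": 0},
    "d3": {"नेपाल": 0, "शिक्षा": 0, "प्रविधि": 1},
}
query = ("AND", ("OR", "शिक्षा", "प्रविधि"), "नेपाल")
result = evaluate(query, toy_matrix)
print(sorted(result))
```

Because every sub-query returns a plain set of document IDs, operators can be nested to any depth without writing a new function for each query shape.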


---

## 6. Summary <a name="summary"></a>

### What We Learned:

1. **Boolean Retrieval Model**
   - Binary representation: Term present (1) or absent (0)
   - No ranking: Documents either match or don't match
   - Used in library systems, legal search, patents

2. **Document-Term Matrix**
   - Binary matrix representation of documents
   - Rows = documents, Columns = terms
   - Foundation for retrieval operations

3. **Boolean Operators**
   - **AND**: Intersection (both terms)
   - **OR**: Union (either term)
   - **NOT**: Difference (exclude term)
   - Implemented using set operations

4. **Query Processing**
   - Convert query to set operations
   - Combine results using Boolean logic
   - Support complex queries with multiple operators

### Limitations:
- **No ranking**: Can't distinguish more relevant documents
- **Binary**: Ignores term frequency and importance
- **User burden**: Requires understanding Boolean logic
- **All or nothing**: Too many or too few results

### Next Steps:
In the next notebook (`04_inverted_index.ipynb`), we will:
- Build an efficient inverted index structure
- Optimize Boolean query processing
- Implement posting lists
- Compare performance with matrix approach

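### Research References:
- Manning, Raghavan, and Schütze, *Introduction to Information Retrieval* (Cambridge University Press, 2008), Chapter 1: "Boolean retrieval"
- Boolean retrieval introduces the term-document representation that later IR models build on
- Most modern search systems use ranked retrieval (covered in later notebooks)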