# 09.01. Rocchio's Algorithm

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: Rocchio's Algorithm](#theory)
3. [Implementation](#implementation)
4. [Interactive Feedback Loop](#interactive)
5. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

**Rocchio's Algorithm** is a classic relevance feedback method that improves query results by learning from user feedback.

### The Process:
1. User provides an initial query
2. System returns initial results
3. **User marks** documents as relevant or non-relevant
4. **System refines** the query based on that feedback
5. System returns improved results

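### Key Idea:
Treat the query and the documents as vectors in the same term space, then shift the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones.
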
### Formula:
$$
\vec{Q}_{new} = \alpha \vec{Q}_{old} + \beta \frac{1}{|D_r|} \sum_{d \in D_r} \vec{d} - \gamma \frac{1}{|D_{nr}|} \sum_{d \in D_{nr}} \vec{d}
$$

Where:
- $\vec{Q}_{old}$ = Original query vector
- $D_r$ = Set of relevant documents
- $D_{nr}$ = Set of non-relevant documents
- $\alpha, \beta, \gamma$ = Weights (typically: α=1, β=0.75, γ=0.15)
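
To make the formula concrete, here is a small worked example. The 3-term vectors are made up for illustration and the helper name `rocchio_update` is an assumption, not part of this notebook's code.

```python
def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Apply the Rocchio formula to plain Python lists (illustrative helper)."""
    dims = len(query)

    def mean(vectors):
        # Centroid: component-wise average; zero vector if the set is empty.
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    rel_c = mean(relevant)
    nonrel_c = mean(non_relevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, nonrel_c)]


# Hypothetical toy data over a 3-term vocabulary.
q_old = [1, 0, 1]
d_rel = [[2, 1, 0], [0, 1, 2]]   # centroid = [1, 1, 1]
d_nonrel = [[0, 4, 0]]           # centroid = [0, 4, 0]

q_new = rocchio_update(q_old, d_rel, d_nonrel)
print([round(x, 2) for x in q_new])  # [1.75, 0.15, 1.75]
```

The second term, which appears only in the feedback documents, ends up with a small positive weight because the relevant centroid contributes 0.75 and the non-relevant centroid removes 0.60.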

---

## 2. Theory: Rocchio's Algorithm <a name="theory"></a>

### Intuition:
- **Move query** closer to relevant documents
- **Move query** away from non-relevant documents
- **Preserve** some of the original query intent (α)
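
This geometric intuition can be checked numerically. The sketch below uses hypothetical 2-D vectors (the axis meanings are invented for the example) and compares cosine similarity before and after one Rocchio update with the typical weights.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-D vectors: axis 0 = "mountain" term, axis 1 = "river" term.
query = [1.0, 1.0]
relevant = [3.0, 1.0]       # mostly about mountains
non_relevant = [1.0, 3.0]   # mostly about rivers

alpha, beta, gamma = 1.0, 0.75, 0.15
new_query = [alpha * q + beta * r - gamma * n
             for q, r, n in zip(query, relevant, non_relevant)]

before_rel, after_rel = cosine(query, relevant), cosine(new_query, relevant)
before_non, after_non = cosine(query, non_relevant), cosine(new_query, non_relevant)

print(f"relevant:     {before_rel:.3f} -> {after_rel:.3f}")
print(f"non-relevant: {before_non:.3f} -> {after_non:.3f}")
```

In this toy case the query starts equally similar to both documents; after the update its similarity to the relevant vector rises and its similarity to the non-relevant vector falls.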

### Parameters:
- **α (alpha)**: Weight for original query (usually 1.0)
- **β (beta)**: Weight for relevant docs (usually 0.75)
- **γ (gamma)**: Weight for non-relevant docs (usually 0.15)

### Why it Works:
- Relevant docs likely share common features
- Non-relevant docs indicate what to avoid
- The weights balance trust in the original query against trust in the feedback

---

## 3. Implementation <a name="implementation"></a>

In [1]:
from pathlib import Path
from collections import Counter
import math

# Load data (same as previous notebooks)
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            stopwords.add(line.strip())
    return stopwords

def load_stemming_dict(file_path):
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f)
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                stem_dict[parts[0]] = parts[1]
    return stem_dict

def tokenize(text):
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded {len(preprocessed_docs)} documents")

✓ Loaded 10 documents


In [2]:
# Build vocabulary and TF vectors
vocabulary = sorted(set(term for terms in preprocessed_docs.values() for term in terms))
term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}

def build_tf_vector(terms, vocabulary, term_to_idx):
    """Build TF vector for a document."""
    vector = [0] * len(vocabulary)
    term_counts = Counter(terms)
    
    for term, count in term_counts.items():
        if term in term_to_idx:
            vector[term_to_idx[term]] = count
    
    return vector

# Build document vectors
doc_vectors = {}
for doc_id, terms in preprocessed_docs.items():
    doc_vectors[doc_id] = build_tf_vector(terms, vocabulary, term_to_idx)

print(f"✓ Built {len(doc_vectors)} document vectors")
print(f"  Vocabulary size: {len(vocabulary)}")

✓ Built 10 document vectors
  Vocabulary size: 398


In [3]:
def vector_add(v1, v2, weight=1.0):
    """Add two vectors with optional weight."""
    return [x + weight * y for x, y in zip(v1, v2)]

def vector_scale(vector, scale):
    """Scale a vector by a constant."""
    return [scale * x for x in vector]

def centroid(vectors):
    """Calculate centroid of multiple vectors."""
    if not vectors:
        return []  # No vectors given, so there is no centroid to compute
    
    result = [0] * len(vectors[0])
    for vec in vectors:
        result = vector_add(result, vec)
    
    return vector_scale(result, 1.0 / len(vectors))

def rocchio(query_vector, relevant_docs, non_relevant_docs, 
            alpha=1.0, beta=0.75, gamma=0.15):
    """
    Rocchio's algorithm for query refinement.
    
    Parameters:
    -----------
    query_vector : list
        Original query vector
    relevant_docs : list of lists
        Vectors of relevant documents
    non_relevant_docs : list of lists
        Vectors of non-relevant documents
    alpha, beta, gamma : float
        Rocchio parameters
    
    Returns:
    --------
    list : Modified query vector
    """
    # Start with weighted original query
    new_query = vector_scale(query_vector, alpha)
    
    # Add relevant document centroid
    if relevant_docs:
        rel_centroid = centroid(relevant_docs)
        new_query = vector_add(new_query, rel_centroid, beta)
    
    # Subtract non-relevant document centroid
    if non_relevant_docs:
        nonrel_centroid = centroid(non_relevant_docs)
        new_query = vector_add(new_query, nonrel_centroid, -gamma)
    
    return new_query

print("✓ Rocchio's algorithm functions defined")

✓ Rocchio's algorithm functions defined
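
Subtracting the non-relevant centroid can push some term weights below zero. A common convention (described in Manning et al., Section 9.1) is to set such weights to zero. The `rocchio` function above does not do this. The helper `clip_negative` below is a hypothetical addition that sketches one way to post-process the returned vector.

```python
def clip_negative(vector):
    """Replace negative term weights with 0.0 (hypothetical helper)."""
    return [max(0.0, w) for w in vector]

# Toy modified query in which the second term was pushed negative.
modified = [1.75, -0.6, 0.3]
print(clip_negative(modified))  # [1.75, 0.0, 0.3]
```

Under this assumption, a call such as `clip_negative(rocchio(query_vector, relevant_vecs, non_relevant_vecs))` would keep the modified query in the non-negative term space used by the document vectors.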


---

## 4. Interactive Feedback Loop <a name="interactive"></a>

In [4]:
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors."""
    dot_prod = sum(x * y for x, y in zip(v1, v2))
    mag1 = math.sqrt(sum(x ** 2 for x in v1))
    mag2 = math.sqrt(sum(x ** 2 for x in v2))
    
    if mag1 == 0 or mag2 == 0:
        return 0.0
    
    return dot_prod / (mag1 * mag2)

def search(query_vector, doc_vectors, top_k=5):
    """Search documents using query vector."""
    scores = []
    for doc_id, doc_vec in doc_vectors.items():
        score = cosine_similarity(query_vector, doc_vec)
        scores.append((doc_id, score))
    
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]

# Example: Relevance Feedback Iteration
print("📚 Rocchio's Algorithm - Relevance Feedback Example")
print("="*70)

# Initial query
query_text = "नेपाल हिमाल"  # Nepali for "Nepal" and "mountain"
query_terms = preprocess_text(query_text, stopwords, stem_dict)
query_vector = build_tf_vector(query_terms, vocabulary, term_to_idx)

print(f"\n🔍 Initial Query: '{query_text}'")
print(f"   Terms: {query_terms}")

# Initial search
results = search(query_vector, doc_vectors, top_k=5)
print(f"\n📊 Initial Results:")
for rank, (doc_id, score) in enumerate(results, 1):
    print(f"   {rank}. {doc_id} (score: {score:.4f})")

# Simulate user feedback
# (In real implementation, user would mark these)
relevant_doc_ids = [results[0][0], results[1][0]]  # Top 2 as relevant
non_relevant_doc_ids = [results[3][0]]  # 4th as non-relevant

print(f"\n✓ User Feedback:")
print(f"   Relevant: {relevant_doc_ids}")
print(f"   Non-relevant: {non_relevant_doc_ids}")

# Apply Rocchio
relevant_vecs = [doc_vectors[doc_id] for doc_id in relevant_doc_ids]
non_relevant_vecs = [doc_vectors[doc_id] for doc_id in non_relevant_doc_ids]

modified_query = rocchio(query_vector, relevant_vecs, non_relevant_vecs)

# Search with modified query
new_results = search(modified_query, doc_vectors, top_k=5)

print(f"\n📊 Results After Rocchio:")
for rank, (doc_id, score) in enumerate(new_results, 1):
    marker = "✓" if doc_id in relevant_doc_ids else ""
    print(f"   {rank}. {doc_id} (score: {score:.4f}) {marker}")

print("\n💡 Relevant documents score higher after feedback!")

📚 Rocchio's Algorithm - Relevance Feedback Example
======================================================================

🔍 Initial Query: 'नेपाल हिमाल'
   Terms: ['नेपाल', 'हिमाल']

📊 Initial Results:
   1. doc02 (score: 0.6152)
   2. doc01 (score: 0.4698)
   3. doc09 (score: 0.4308)
   4. doc05 (score: 0.4045)
   5. doc04 (score: 0.3536)

✓ User Feedback:
   Relevant: ['doc02', 'doc01']
   Non-relevant: ['doc05']

📊 Results After Rocchio:
   1. doc02 (score: 0.8139) ✓
   2. doc01 (score: 0.7570) ✓
   3. doc09 (score: 0.3401)
   4. doc06 (score: 0.3035)
   5. doc04 (score: 0.2799)

💡 Relevant documents score higher after feedback!
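
The cell above runs a single round of feedback. Rocchio can also be applied repeatedly, with each modified query becoming the starting query of the next round. The sketch below is self-contained and does not use the notebook's data: the toy vectors, the document ids, and the simulated judge `judged_relevant` (standing in for a real user) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors, dims):
    """Component-wise mean; zero vector if no vectors are given."""
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def rank(query, docs):
    """Return doc ids sorted by descending cosine similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical toy collection over a 3-term vocabulary.
docs = {
    "a": [3, 0, 1],
    "b": [2, 1, 0],
    "c": [0, 3, 0],
    "d": [0, 2, 2],
}
judged_relevant = {"a", "b"}   # stands in for the user's judgments

query = [1.0, 1.0, 0.0]
alpha, beta, gamma = 1.0, 0.75, 0.15
history = []

for round_no in range(3):
    ranking = rank(query, docs)
    history.append(ranking)
    top = ranking[:3]          # the simulated user only inspects the top 3
    rel = [docs[d] for d in top if d in judged_relevant]
    non = [docs[d] for d in top if d not in judged_relevant]
    rel_c, non_c = centroid(rel, 3), centroid(non, 3)
    # Each round starts from the previous round's modified query.
    query = [alpha * q + beta * r - gamma * n
             for q, r, n in zip(query, rel_c, non_c)]
    print(f"round {round_no + 1}: {ranking}")
```

In this toy run the non-relevant document `c` starts in second place and falls below both relevant documents after the first update. Repeating the update keeps pulling the query toward the judged documents, which is also how the query can drift from the original intent.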


---

## 5. Summary <a name="summary"></a>

### What We Learned:

1. **Rocchio's Formula**
   - Move query toward relevant docs
   - Move query away from non-relevant docs
   - Balance with original query intent

2. **Parameters**
   - α = 1.0 (original query)
   - β = 0.75 (relevant docs)
   - γ = 0.15 (non-relevant docs)

3. **Implementation**
   - Vector operations only
   - Centroid calculation
   - Iterative refinement

### Benefits:
- ✓ Simple and effective
- ✓ Works with any vector model
- ✓ Can improve recall, and often precision
- ✓ User-friendly interaction

### Limitations:
- Requires user feedback
- May drift from original intent
- Assumes relevant docs are similar
- Query may become too long
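
One way to address the last limitation is to keep only the highest-weighted terms of the modified query. This is a sketch under that assumption and not something the notebook implements; `truncate_query` and the cutoff `k` are hypothetical names.

```python
def truncate_query(query_vector, k):
    """Keep the k largest positive weights; zero out every other term."""
    # Indices of positive weights, sorted by descending weight.
    positive = [i for i, w in enumerate(query_vector) if w > 0]
    keep = set(sorted(positive, key=lambda i: query_vector[i], reverse=True)[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(query_vector)]

# Toy modified query over a 6-term vocabulary.
modified = [1.75, 0.05, 0.0, 0.9, -0.2, 0.3]
print(truncate_query(modified, k=3))  # [1.75, 0.0, 0.0, 0.9, 0.0, 0.3]
```

A small `k` keeps the query close to a handful of strong terms, at the cost of discarding weaker signals from the feedback documents.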

### Variations:
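- **Positive feedback only**: Set γ = 0
- **Ide dec-hi**: Adds the relevant documents without averaging and subtracts only the highest-ranked non-relevant document
- **Probabilistic RF**: Re-estimates term relevance probabilities from the feedback instead of moving a query vector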
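
The first two variations can be sketched side by side. This is a minimal illustration on hypothetical toy vectors. The function names are made up for this example, and the Ide dec-hi version follows the usual textbook description (unnormalized sum of relevant documents minus the single top-ranked non-relevant document).

```python
def positive_only(query, relevant, alpha=1.0, beta=0.75):
    """Rocchio with gamma = 0: only relevant documents move the query."""
    n = len(relevant)
    return [alpha * q + (beta * sum(d[i] for d in relevant) / n if n else 0.0)
            for i, q in enumerate(query)]

def ide_dec_hi(query, relevant, top_non_relevant):
    """Ide dec-hi: add all relevant vectors, subtract one non-relevant vector."""
    return [q + sum(d[i] for d in relevant) - top_non_relevant[i]
            for i, q in enumerate(query)]

query = [1, 0, 1]
relevant = [[2, 1, 0], [0, 1, 2]]
non_relevant_ranked = [[0, 4, 0], [1, 1, 1]]   # highest-ranked first

print(positive_only(query, relevant))                       # [1.75, 0.75, 1.75]
print(ide_dec_hi(query, relevant, non_relevant_ranked[0]))  # [3, -2, 3]
```

Because Ide dec-hi does not average, the feedback documents dominate the original query much more strongly than in the Rocchio update with the typical weights.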

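### Real-World Use:
Explicit Rocchio feedback is rare in production systems, but the same idea of learning from user judgments appears in:
- "More like this" and "similar pages" search features
- Product recommendations that learn from clicks and purchases
- Email spam filters (marking spam helps)
- Music recommendation systems (likes and skips)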

### References:
- Rocchio, J.J. (1971): "Relevance feedback in information retrieval"
- Manning, Raghavan & Schütze (2008): "Introduction to Information Retrieval", Section 9.1
- Salton & Buckley (1990): "Improving retrieval performance by relevance feedback"