# Part 8: Late Interaction Retrieval (ColBERT)

## Learning Objectives

By the end of this notebook, you will:
1. Understand ColBERT and late interaction retrieval
2. Compare token-level vs document-level embeddings
3. Implement the MaxSim scoring mechanism
4. Use RAGatouille for ColBERT indexing
5. Apply ColBERT to code vulnerability patterns
6. Compare ColBERT with dense embeddings
7. Understand when to use ColBERT vs traditional embeddings

## The Evolution of Retrieval Models

### 1. Sparse Retrieval (BM25)
```
Query:  ["sql", "injection", "prevention"]
Doc:    ["sql", "injection", "attack", "prevention", "guide"]
Score:  Term frequency + IDF weighting
```
- ✅ Fast, interpretable
- ❌ No semantic understanding
- ❌ Exact term matching only
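
To make the "term frequency + IDF weighting" line concrete, here is a minimal sketch of Okapi BM25 over pre-tokenized documents. The function name `bm25_scores` and the parameter defaults `k1=1.5`, `b=0.75` are illustrative choices, not something the notebook prescribes.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # exact term matching only: no credit for synonyms
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["sql", "injection", "attack", "prevention", "guide"],
    ["guide", "to", "preventing", "database", "attacks"],
]
scores = bm25_scores(["sql", "injection", "prevention"], docs)
print(scores)
```

The second document is about the same topic but shares no exact term with the query, so it scores zero. That is the "no semantic understanding" limitation in practice.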

### 2. Dense Retrieval (Embeddings - what we've been using)
```
Query:  "sql injection prevention" → [0.23, -0.45, 0.78, ...] (1536 dims)
Doc:    "guide to preventing SQL attacks" → [0.25, -0.43, 0.76, ...]
Score:  cosine_similarity(query_vec, doc_vec)
```
- ✅ Semantic understanding
- ✅ Fast at query time (pre-computed doc embeddings)
- ❌ Single vector per document loses fine-grained information
- ❌ No term-level interactions
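
The dense score above is a single cosine similarity between two vectors. A small sketch, using only the three leading values shown in the example (real embeddings have 1536 dimensions, so the resulting number is purely illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Truncated toy vectors taken from the example above
query_vec = [0.23, -0.45, 0.78]
doc_vec = [0.25, -0.43, 0.76]

score = cosine_similarity(query_vec, doc_vec)
print(f"cosine similarity: {score:.4f}")
```

Whatever the document contains, it is compressed into one vector before this comparison happens, which is why fine-grained term information can get lost.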

### 3. Late Interaction (ColBERT)
```
Query:  "sql injection prevention"
  → ["sql": [0.1, 0.2, ...], "injection": [0.3, 0.4, ...], "prevention": [0.5, 0.6, ...]]

Doc:    "guide to preventing SQL attacks"
  → ["guide": [...], "to": [...], "preventing": [...], "SQL": [...], "attacks": [...]]

Score:  MaxSim for each query token across all doc tokens
```
- ✅ Token-level semantic understanding
- ✅ Captures both semantic AND lexical matches
- ✅ Better for technical terms, code, identifiers
- ❌ Slower than dense embeddings (more vectors to compare)
- ❌ Larger index size

## ColBERT: Contextualized Late Interaction over BERT

### Key Innovation: MaxSim Scoring

For each query token, find the most similar document token:

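```python
# Pseudocode: for each query token take the best-matching document token,
# then sum those maxima over all query tokens
score = sum(
    max(similarity(q_i, d_j) for d_j in doc_tokens)
    for q_i in query_tokens
)
```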

**Example:**
```
Query: "SQL injection"
  q_token[0] = "SQL" embedding
  q_token[1] = "injection" embedding

Doc: "Prevent SQL attacks and code injection"
  d_token[0] = "Prevent"
  d_token[1] = "SQL" ← Matches q_token[0]
  d_token[2] = "attacks"
  d_token[3] = "and"
  d_token[4] = "code"
  d_token[5] = "injection" ← Matches q_token[1]

Score = max_sim(q[0], d[*]) + max_sim(q[1], d[*])
      = sim(q[0], d[1]) + sim(q[1], d[5])
      = 0.95 + 0.98 = 1.93
```
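
The worked example can be reproduced with a small pure-Python MaxSim. This is a sketch with hand-made 2-D toy vectors, not real ColBERT embeddings (those are 128-dimensional and contextualized); the helper names `maxsim_score` and `_unit` are made up for illustration.

```python
import math


def _unit(vec):
    """L2-normalize a vector so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """Return (score, best_doc_index_per_query_token) using MaxSim."""
    q_unit = [_unit(v) for v in query_vecs]
    d_unit = [_unit(v) for v in doc_vecs]

    total, matches = 0.0, []
    for q in q_unit:
        sims = [sum(a * b for a, b in zip(q, d)) for d in d_unit]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append(best)
        total += sims[best]  # only the best-matching doc token counts
    return total, matches


# Toy embeddings: axis 0 ~ "SQL-ness", axis 1 ~ "injection-ness"
query_vecs = [[1.0, 0.0], [0.0, 1.0]]          # "SQL", "injection"
doc_vecs = [
    [0.6, 0.8],   # "Prevent"
    [1.0, 0.0],   # "SQL"
    [0.8, 0.6],   # "attacks"
    [0.0, 1.0],   # "injection"
]

score, matches = maxsim_score(query_vecs, doc_vecs)
print(score, matches)
```

Each query token contributes at most 1.0 (a perfect cosine match), so the score is bounded by the number of query tokens.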

## Why ColBERT for Security?

1. **Technical Jargon**: "LSASS", "NTDS", "CVE-2024-1234"
2. **Code Patterns**: Function names, variable names, code snippets
3. **Exploit Signatures**: Specific attack patterns
4. **Precise Matching**: Both semantic meaning AND exact terms
5. **Mixed Content**: Natural language + code + identifiers

---
## 1. Environment Setup

In [None]:
# Install RAGatouille (ColBERT wrapper)
!pip install -q ragatouille

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from typing import List, Dict, Tuple
import numpy as np

# RAGatouille for ColBERT
from ragatouille import RAGPretrainedModel

# LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Load environment variables
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    print("⚠️  WARNING: OPENAI_API_KEY not found")
else:
    print("✅ OpenAI API key loaded")

In [None]:
# Initialize standard embeddings and LLM (for comparison)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Standard embeddings and LLM initialized")

In [None]:
# Load our existing vector store
vectorstore = Chroma(
    collection_name="owasp_llm_top10",
    embedding_function=embeddings,
    persist_directory="../data/chroma_db"
)

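# Get all documents directly from the store. Embedding an empty query string
# is unreliable and would silently cap the result at k documents.
stored = vectorstore.get(include=["documents", "metadatas"])
all_docs = [
    Document(page_content=text, metadata=meta or {})
    for text, meta in zip(stored["documents"], stored["metadatas"])
]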

print("✅ Vector store loaded")
print(f"   Total documents: {len(all_docs)}")

---
## 2. ColBERT with RAGatouille

RAGatouille is a wrapper around ColBERT that makes it easy to use.

In [None]:
# Initialize RAGatouille ColBERT model
print("🔄 Initializing ColBERT model (this may download the model on first run)...\n")

colbert_model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

print("✅ ColBERT model initialized")
print("   Model: colbert-ir/colbertv2.0")
print("   Token-level embeddings: 128 dimensions per token")

---
## 3. Index Documents with ColBERT

In [None]:
# Prepare documents for ColBERT indexing
# ColBERT expects a list of strings
colbert_documents = [doc.page_content for doc in all_docs]
document_ids = [f"doc_{i}" for i in range(len(all_docs))]

print(f"📄 Preparing {len(colbert_documents)} documents for ColBERT indexing...")
print(f"   Average document length: {sum(len(d) for d in colbert_documents) / len(colbert_documents):.0f} characters")

In [None]:
# Index documents with ColBERT
# This creates token-level embeddings for each document
print("\n🔄 Indexing documents with ColBERT...")
print("   This may take a few minutes...\n")

index_name = "owasp_security_colbert"

# index() returns the path where RAGatouille wrote the index
# (by default under .ragatouille/ in the working directory)
index_path = colbert_model.index(
    collection=colbert_documents,
    document_ids=document_ids,
    index_name=index_name,
    max_document_length=512,  # Maximum tokens per document
    split_documents=True  # Automatically split long documents
)

print("\n✅ ColBERT indexing complete!")
print(f"   Index saved to: {index_path}")

---
## 4. Query with ColBERT (MaxSim Scoring)

In [None]:
def colbert_search(query: str, k: int = 3) -> List[Dict]:
    """
    Search using ColBERT with MaxSim scoring.
    
    Args:
        query: Search query
        k: Number of results to return
        
    Returns:
        List of search results with scores
    """
    results = colbert_model.search(
        query=query,
        k=k
    )
    
    return results

print("✅ ColBERT search function created")

In [None]:
# Test ColBERT search
test_query = "prompt injection attacks"

print("="*80)
print(f"🔍 ColBERT Search: '{test_query}'")
print("="*80)

results = colbert_search(test_query, k=3)

print(f"\nTop {len(results)} results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. Score: {result['score']:.4f}")
    print(f"   Document ID: {result['document_id']}")
    print(f"   Content: {result['content'][:200]}...\n")

---
## 5. Comparison: ColBERT vs Dense Embeddings

Let's compare ColBERT with our standard dense embeddings on various query types.

In [None]:
def compare_retrieval_methods(query: str, vectorstore, k: int = 3):
    """
    Compare ColBERT vs dense embeddings retrieval.
    """
    print("\n" + "="*80)
    print(f"❓ Query: {query}")
    print("="*80)
    
    # Dense embeddings (OpenAI)
    print("\n1️⃣  DENSE EMBEDDINGS (OpenAI text-embedding-3-small)")
    print("-"*80)
    dense_results = vectorstore.similarity_search_with_score(query, k=k)
    print(f"Retrieved {len(dense_results)} documents:\n")
    for i, (doc, score) in enumerate(dense_results, 1):
        print(f"{i}. Distance: {score:.4f} | {doc.metadata.get('id')}: {doc.metadata.get('title')}")
        print(f"   Preview: {doc.page_content[:150]}...\n")
    
    # ColBERT
    print("\n2️⃣  COLBERT (Token-Level Late Interaction)")
    print("-"*80)
    colbert_results = colbert_search(query, k=k)
    print(f"Retrieved {len(colbert_results)} documents:\n")
    for i, result in enumerate(colbert_results, 1):
        print(f"{i}. Score: {result['score']:.4f}")
        print(f"   Preview: {result['content'][:150]}...\n")
    
    print("\n" + "="*80)
    print("📊 ANALYSIS")
    print("="*80)
    print("✅ Dense: Fast, good for semantic similarity")
    print("✅ ColBERT: Better for technical terms, code, precise matching")
    print("✅ ColBERT: Captures both semantic AND lexical matches")
    print("\n" + "="*80 + "\n")

print("✅ Comparison function created")

### Test Case 1: Technical Terms

In [None]:
compare_retrieval_methods(
    "LLM01 prompt injection",
    vectorstore,
    k=3
)

### Test Case 2: Semantic Query

In [None]:
compare_retrieval_methods(
    "How can attackers manipulate AI systems?",
    vectorstore,
    k=3
)

### Test Case 3: Mixed (Semantic + Technical)

In [None]:
compare_retrieval_methods(
    "OWASP LLM security risks and mitigations",
    vectorstore,
    k=3
)

---
## 6. Complete RAG with ColBERT

In [None]:
def rag_with_colbert(
    query: str,
    llm,
    k: int = 3
) -> str:
    """
    Complete RAG pipeline using ColBERT retrieval.
    """
    print(f"\n{'='*80}")
    print(f"🔍 RAG with ColBERT")
    print(f"{'='*80}\n")
    
    # Retrieve with ColBERT
    print(f"1️⃣  Retrieving with ColBERT (token-level matching)...")
    results = colbert_search(query, k=k)
    print(f"   Retrieved {len(results)} documents\n")
    
    # Format context
    context = "\n\n".join([
        f"Document {i+1} (ColBERT Score: {result['score']:.4f}):\n{result['content']}"
        for i, result in enumerate(results)
    ])
    
    # Generate answer
    print("2️⃣  Generating answer...\n")
    
    answer_prompt = ChatPromptTemplate.from_template(
        """You are an AI security expert assistant using ColBERT token-level retrieval.

The context below was retrieved using precise token-level matching, capturing both semantic meaning and exact technical terms.

Context:
{context}

User Question: {question}

Instructions:
1. Provide a comprehensive answer based on the precisely matched context
2. Leverage the technical accuracy from token-level matching
3. Include specific security recommendations
4. Cite relevant vulnerability IDs and technical terms

Answer:"""
    )
    
    prompt_value = answer_prompt.invoke({"context": context, "question": query})
    response = llm.invoke(prompt_value)
    
    return response.content

print("✅ RAG with ColBERT pipeline created")

In [None]:
# Test complete RAG with ColBERT
query = "What are the prevention measures for LLM01 prompt injection?"

answer = rag_with_colbert(
    query=query,
    llm=llm,
    k=3
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

---
## 7. Use Case: Code Vulnerability Patterns

ColBERT excels at matching code patterns and technical identifiers.

In [None]:
# Example code vulnerability documents
code_docs = [
    """SQL Injection Example:
```python
# Vulnerable code
query = f"SELECT * FROM users WHERE username = '{user_input}'"
cursor.execute(query)
```
This is vulnerable to SQL injection. An attacker could input: ' OR '1'='1
""",
    """Cross-Site Scripting (XSS) Example:
```javascript
// Vulnerable code
element.innerHTML = userInput;
```
This allows script injection. Use textContent instead or sanitize input.
""",
    """Command Injection Example:
```python
# Vulnerable code
os.system(f"ping {user_input}")
```
Attacker could inject: 8.8.8.8; rm -rf /
"""
]

print("💻 Code Vulnerability Pattern Matching")
print("="*80)
print("\nColBERT tends to suit code patterns because:")
print("1. Token-level matching captures function names, variables")
print("2. Identifiers keep their own embeddings instead of being pooled into one vector")
print("3. Matches both semantic intent AND exact identifiers")
print("\nExample queries that benefit from ColBERT:")
print("  - 'Find SQL injection in Python code'")
print("  - 'Show XSS vulnerabilities with innerHTML'")
print("  - 'Locate os.system command injection'")
print("\n" + "="*80)
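
The cell above defines `code_docs` but never searches them. A possible way to try the example queries is sketched below. The helper `search_code_patterns` and the index name `code_vuln_patterns` are hypothetical; the `index(...)` and `search(...)` keyword arguments mirror the ones used earlier in this notebook. Treat it as untested against any particular RAGatouille version: very small collections can behave differently, and building a second index may change which index the model object searches by default.

```python
from typing import Any, Dict, List, Optional


def search_code_patterns(
    model: Any,
    docs: List[str],
    queries: List[str],
    index_name: str = "code_vuln_patterns",  # hypothetical index name
) -> Dict[str, Optional[dict]]:
    """Index `docs` with a ColBERT-style model and return the top hit per query."""
    model.index(
        collection=docs,
        document_ids=[f"code_{i}" for i in range(len(docs))],
        index_name=index_name,
        split_documents=False,  # keep each snippet as a single passage
    )

    top_hits: Dict[str, Optional[dict]] = {}
    for query in queries:
        hits = model.search(query=query, k=1)
        top_hits[query] = hits[0] if hits else None
    return top_hits


# Usage in this notebook (assumes colbert_model and code_docs from the cells above):
# hits = search_code_patterns(
#     colbert_model,
#     code_docs,
#     ["Show XSS vulnerabilities with innerHTML", "Locate os.system command injection"],
# )
# for query, hit in hits.items():
#     print(query, "->", hit["document_id"] if hit else None)
```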

---
## 8. When to Use ColBERT vs Dense Embeddings

### Use ColBERT When:

1. **Technical Content**
   - Code snippets and patterns
   - Function names, APIs, identifiers
   - Technical jargon and acronyms
   - Version numbers, CVE IDs

2. **Precise Matching Required**
   - Security signatures
   - Exploit patterns
   - Configuration examples
   - Command-line syntax

3. **Mixed Content**
   - Natural language + code
   - Documentation with examples
   - Technical guides with syntax

### Use Dense Embeddings When:

1. **Pure Natural Language**
   - Concept explanations
   - High-level overviews
   - General knowledge questions

2. **Speed is Critical**
   - Real-time search (<100ms)
   - Large-scale retrieval (millions of docs)
   - Resource-constrained environments

3. **Semantic Similarity Only**
   - Paraphrase matching
   - Conceptual similarity
   - Topic clustering

### Hybrid Approach (Best of Both):

1. **Dense for initial retrieval** (fast, top 50-100)
2. **ColBERT for reranking** (precise, top 10)
3. **Combine scores** with weights

---
## 9. Production Considerations

### Index Size and Storage

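```
# Dense embeddings (one float32 vector per document)
1,000 documents × 1,536 dimensions × 4 bytes ≈ 6.1 MB

# ColBERT (one float32 vector per token, uncompressed)
1,000 documents × ~200 tokens × 128 dimensions × 4 bytes ≈ 102 MB
```

**Uncompressed, ColBERT is ~16-20x larger than dense embeddings** (about 17x with the numbers above). ColBERTv2 stores compressed residuals instead of full float32 vectors, so a real index is typically much smaller than this estimate.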
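
The same back-of-the-envelope estimate as a reusable sketch, so you can plug in your own corpus size and average token count. The function names are illustrative, and the estimate assumes uncompressed float32 storage with no index overhead.

```python
def dense_index_bytes(num_docs: int, dims: int = 1536, bytes_per_value: int = 4) -> int:
    """One vector per document."""
    return num_docs * dims * bytes_per_value


def colbert_index_bytes(
    num_docs: int, avg_tokens: int = 200, dims: int = 128, bytes_per_value: int = 4
) -> int:
    """One vector per token, uncompressed."""
    return num_docs * avg_tokens * dims * bytes_per_value


dense = dense_index_bytes(1000)
late = colbert_index_bytes(1000)
ratio = late / dense

print(f"Dense:   {dense / 1e6:.2f} MB")
print(f"ColBERT: {late / 1e6:.2f} MB")
print(f"Ratio:   {ratio:.1f}x")
```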

### Query Latency

```
Dense embeddings:  10-50ms  (single dot product per doc)
ColBERT:          50-200ms  (MaxSim across all tokens)
```

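**ColBERT is ~5-10x slower than dense embeddings**

These figures are rough orders of magnitude; actual latency depends on hardware, corpus size, and index configuration, so measure on your own setup.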
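
A small timing harness can replace the estimates with numbers from your own environment. This is a sketch: `measure_latency_ms` is a hypothetical helper that times any callable taking a query string, so it works for both retrievers in this notebook.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    search_fn: Callable[[str], object], queries: List[str], repeats: int = 5
) -> Dict[str, float]:
    """Time search_fn on every query `repeats` times; report milliseconds."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(query)
            timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
        "runs": len(timings),
    }


# Usage in this notebook (assumes vectorstore and colbert_search from earlier cells):
# queries = ["LLM01 prompt injection", "How can attackers manipulate AI systems?"]
# print(measure_latency_ms(lambda q: vectorstore.similarity_search(q, k=3), queries))
# print(measure_latency_ms(lambda q: colbert_search(q, k=3), queries))
```

The dense path here includes a network call to the embedding API, so compare like with like before drawing conclusions.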

### Optimization Strategies

1. **Use ColBERT selectively**
   - Only for technical/code queries
   - Classify query type first
   - Route to appropriate retriever

2. **Two-stage retrieval**
   - Stage 1: Dense retrieval (top 50)
   - Stage 2: ColBERT rerank (top 10)

3. **Cache results**
   - Cache frequent queries
   - Cache ColBERT embeddings

4. **Hardware acceleration**
   - Use GPU for ColBERT
   - Batch queries when possible
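
Strategy 1 (classify, then route) can start as a handful of regular expressions. The patterns below are illustrative heuristics, not a validated classifier; a production router would need evaluation on real queries.

```python
import re

# Hypothetical signals that a query contains identifiers or code-like tokens
TECHNICAL_PATTERNS = [
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),  # CVE IDs
    re.compile(r"\bLLM\d{2}\b", re.IGNORECASE),          # OWASP LLM Top 10 IDs, e.g. LLM01
    re.compile(r"\b\w+\.\w+\b"),                         # dotted names, e.g. os.system
    re.compile(r"\b\w+\("),                              # call syntax, e.g. execute(
    re.compile(r"\b[a-z]+[A-Z]\w*\b"),                   # camelCase, e.g. innerHTML
    re.compile(r"\b[A-Za-z]+_\w+\b"),                    # snake_case, e.g. user_input
]


def choose_retriever(query: str) -> str:
    """Return 'colbert' for identifier-heavy queries, otherwise 'dense'."""
    if any(pattern.search(query) for pattern in TECHNICAL_PATTERNS):
        return "colbert"
    return "dense"


for q in [
    "Locate os.system command injection",
    "Show XSS vulnerabilities with innerHTML",
    "How can attackers manipulate AI systems?",
]:
    print(f"{choose_retriever(q):8s} <- {q}")
```

Because the dotted-name pattern also fires on abbreviations such as "e.g", expect some false positives; routing those to ColBERT costs latency but not correctness.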

### When NOT to Use ColBERT

- Very large corpora (10M+ documents) without GPU
- Real-time latency requirements (<50ms)
- Memory-constrained environments
- Pure concept-based retrieval

---
## 10. Summary and Key Takeaways

### What We Built

✅ Complete ColBERT implementation:
1. **ColBERT with RAGatouille**: Token-level embeddings and indexing
2. **MaxSim Scoring**: Late interaction retrieval mechanism
3. **Comparison Framework**: ColBERT vs dense embeddings
4. **Complete RAG Pipeline**: End-to-end with ColBERT
5. **Use Case Analysis**: When to use each approach
6. **Production Considerations**: Performance and optimization

### Core Concepts Learned

1. **Late Interaction**: Encode query and document independently, then defer their interaction to a cheap token-level comparison at scoring time
2. **Token-Level Embeddings**: Each token gets its own embedding
3. **MaxSim Scoring**: Maximum similarity for each query token
4. **Hybrid Retrieval**: Combining dense + late interaction
5. **Trade-offs**: Accuracy vs speed vs storage

### Key Insights

**ColBERT Strengths:**
- ↑↑ Better for technical content (code, identifiers, jargon)
- ↑↑ Precise matching (semantic AND lexical)
- ↑ Captures token-level interactions
- ✅ Essential for security patterns and code

**ColBERT Limitations:**
- ↓ Slower than dense embeddings (5-10x)
- ↓ Larger index size (16-20x)
- ↓ More complex infrastructure

**Best Practice: Hybrid**
- Use dense for speed and scale
- Use ColBERT for precision
- Combine both for best results

### Production Recommendations

1. **Query Classification**: Route queries to appropriate retriever
2. **Two-Stage Retrieval**: Dense → ColBERT rerank
3. **Selective Use**: Only for technical/code content
4. **GPU Acceleration**: Use GPU for ColBERT at scale
5. **Caching**: Cache frequent ColBERT results
6. **Monitor Performance**: Track latency and quality

### Next Steps

In **Part 9**, we'll focus on **Security Hardening**:
- Detect prompt injection in queries
- Prevent jailbreaking attempts
- Implement source verification
- Add confidence scoring
- Redact sensitive information
- Validate outputs for safety

Making the RAG system itself secure against adversarial use!

---

### 🎯 Practice Exercises

1. **Index Code Repository**: Index a code repository with ColBERT
2. **Build Hybrid Retriever**: Combine dense + ColBERT with weights
3. **Benchmark Performance**: Measure latency and quality
4. **Query Classification**: Build router to select retriever
5. **Optimize Index**: Tune ColBERT parameters for your use case

### 📚 Further Reading

- [ColBERT Paper](https://arxiv.org/abs/2004.12832)
- [ColBERTv2 Paper](https://arxiv.org/abs/2112.01488)
- [RAGatouille Documentation](https://github.com/bclavie/RAGatouille)
- [Late Interaction Models](https://arxiv.org/abs/2104.01967)