# Part 6: Intelligent Ranking (Re-ranking by Severity)

## Learning Objectives

By the end of this notebook, you will:
1. Understand why reranking matters for security applications
2. Review and apply Reciprocal Rank Fusion (RRF)
3. Implement semantic reranking with cross-encoders
4. Build security-specific ranking functions
5. Combine multiple ranking signals (semantic + metadata)
6. Compare different reranking strategies
7. Optimize ranking for production use cases

## The Problem with Similarity-Only Ranking

Vector similarity search ranks documents purely by semantic similarity. For security applications, this has limitations:

### Example: "What are the most critical risks to my ML system?"

**Similarity-only ranking might return:**
1. Medium severity vulnerability (high semantic match)
2. Low severity vulnerability (mentions "critical" in text)
3. Critical vulnerability (lower semantic match)

**But security practitioners need:**
1. **Critical** vulnerability (CVSS 9.0+) with active exploits
2. **High** severity vulnerability (CVSS 7.0+) from this year
3. **High** severity vulnerability with proof-of-concept

## Solution: Intelligent Re-ranking

Combine multiple signals to rank documents by both **relevance** and **priority**:

1. **Semantic Relevance**: How well does the content match the query?
2. **Severity/Risk**: How critical is the vulnerability?
3. **Recency**: How recent is the vulnerability?
4. **Exploit Status**: Is it being actively exploited?
5. **Business Impact**: Does it affect our specific systems?

## Reranking Approaches

### 1. Reciprocal Rank Fusion (RRF)
- Combine multiple ranked lists
- Simple, effective, no training needed
- Review from Part 3

### 2. Cross-Encoder Reranking
- Use powerful model to rerank results
- More accurate than bi-encoders (embeddings)
- Slower but higher quality

### 3. Custom Scoring Functions
- Weighted combination of signals
- Domain-specific (security)
- Configurable and explainable

### 4. Hybrid Ranking
- Combine semantic + metadata scoring
- Best of both worlds
- Production-ready approach

---
## 1. Environment Setup

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from typing import List, Dict, Optional, Tuple
from datetime import datetime, timedelta
from collections import defaultdict
import numpy as np

# LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# Load environment variables
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    print("⚠️  WARNING: OPENAI_API_KEY not found")
else:
    print("✅ OpenAI API key loaded")

In [None]:
# Initialize embeddings and LLM
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Embeddings and LLM initialized")

In [None]:
# Load vector store
vectorstore = Chroma(
    collection_name="owasp_llm_top10",
    embedding_function=embeddings,
    persist_directory="../data/chroma_db"
)

print("✅ Vector store loaded")
print(f"   Collection: {vectorstore._collection.count()} documents")

---
## 2. Review: Reciprocal Rank Fusion (RRF)

We covered RRF in Part 3. Let's review and apply it to security ranking.

In [None]:
def reciprocal_rank_fusion(
    results: List[List[Document]], 
    k: int = 60
) -> List[Tuple[Document, float]]:
    """
    Apply Reciprocal Rank Fusion to combine multiple ranked lists.
    
    RRF score(doc) = Σ 1 / (k + rank), summed over every list that contains doc
    
    Args:
        results: List of ranked document lists
        k: Constant for RRF formula (default: 60)
        
    Returns:
        List of (document, rrf_score) tuples, sorted by score
    """
    rrf_scores = defaultdict(float)
    doc_map = {}
    
    # Accumulate RRF scores
    for doc_list in results:
        for rank, doc in enumerate(doc_list, 1):
            doc_id = hash(doc.page_content)
            rrf_scores[doc_id] += 1.0 / (k + rank)
            if doc_id not in doc_map:
                doc_map[doc_id] = doc
    
    # Create scored list
    scored_docs = [(doc_map[doc_id], score) for doc_id, score in rrf_scores.items()]
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    
    return scored_docs

print("✅ RRF function created")
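
To make the formula concrete, here is a small self-contained sketch that runs the same accumulation on plain string IDs instead of `Document` objects. The helper name `rrf_scores`, the IDs, and the two rankings are invented for illustration.

```python
from collections import defaultdict

def rrf_scores(rankings, k=60):
    """Return {item: RRF score} for several ranked lists of hashable IDs."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, 1):  # ranks start at 1
            scores[item] += 1.0 / (k + rank)
    return dict(scores)

# Hypothetical rankings from two retrievers
vector_ranking = ["LLM01", "LLM02", "LLM03"]
keyword_ranking = ["LLM02", "LLM04", "LLM01"]

scores = rrf_scores([vector_ranking, keyword_ranking])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)
```

`LLM02` ends up first because it is ranked near the top of both lists (1/62 + 1/61), narrowly ahead of `LLM01` (1/61 + 1/63). Items that appear in only one list trail behind.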

---
## 3. Cross-Encoder Reranking (Semantic)

Cross-encoders jointly encode query + document for more accurate relevance scoring.

### Bi-Encoder vs Cross-Encoder

**Bi-Encoder (Embeddings - what we've been using):**
```
query → embedding → [0.23, -0.45, ...]
doc   → embedding → [0.25, -0.43, ...]
similarity = cosine(query_emb, doc_emb)
```
- Fast: Pre-compute document embeddings
- Scalable: Millions of documents
- Less accurate: No query-document interaction

**Cross-Encoder (Reranking):**
```
[query, doc] → model → relevance_score
```
- Slower: Must encode each query-doc pair
- Not scalable: Can't pre-compute
- More accurate: Models query-document interactions

### Strategy: Two-Stage Retrieval

1. **Stage 1 (Bi-Encoder)**: Retrieve top 20-50 candidates quickly
2. **Stage 2 (Cross-Encoder)**: Rerank top candidates accurately
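
The two stages can be sketched independently of any particular model. In the sketch below, `two_stage_retrieve`, `toy_retrieve`, and `toy_score` are hypothetical names: the retriever and the pair scorer are passed in as plain functions, and the toy versions only stand in for a vector store and a cross-encoder.

```python
from typing import Callable, List, Tuple

def two_stage_retrieve(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    k_candidates: int = 20,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    """Over-fetch with a cheap retriever, then rescore each (query, doc) pair."""
    candidates = retrieve(query, k_candidates)                      # Stage 1
    scored = [(doc, score_pair(query, doc)) for doc in candidates]  # Stage 2
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k_final]

# Toy stand-ins for a vector store and a cross-encoder
corpus = [
    "prompt injection attacks",
    "model theft",
    "insecure output handling",
    "training data poisoning",
]

def toy_retrieve(query: str, k: int) -> List[str]:
    return corpus[:k]

def toy_score(query: str, doc: str) -> float:
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

results = two_stage_retrieve(
    "data poisoning", toy_retrieve, toy_score, k_candidates=4, k_final=2
)
print(results)
```

Keeping the scorer as a parameter means the heuristic can later be swapped for a real reranking model without touching the pipeline.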

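For this demo, the function below stands in for the cross-encoder stage. It defines the prompt an LLM-based scorer would send, but to avoid API calls it scores each document with a keyword-overlap heuristic instead.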

In [None]:
def semantic_rerank(
    query: str,
    documents: List[Document],
    llm,
    top_k: int = 5
) -> List[Tuple[Document, float]]:
    """
    Rerank documents by relevance to the query.
    
    Note: This demo scores with a keyword-overlap heuristic and never
    calls the LLM. In production, use:
    - Cohere Rerank API
    - sentence-transformers cross-encoders
    - Specialized reranking models
    
    Args:
        query: User query
        documents: List of candidate documents
        llm: Language model (kept for interface compatibility; unused here)
        top_k: Number of documents to return
        
    Returns:
        List of (document, relevance_score) tuples
    """
    print(f"\n🔄 Semantic reranking {len(documents)} documents...")
    
    scored_docs = []
    
    for i, doc in enumerate(documents):
        # Simplified scoring: use LLM to rate relevance
        # In production, use proper reranking models
        prompt = ChatPromptTemplate.from_template(
            """Rate the relevance of this document to the query on a scale of 0.0 to 1.0.

Query: {query}

Document: {document}

Respond with ONLY a number between 0.0 and 1.0, where:
- 1.0 = Highly relevant, directly answers the query
- 0.5 = Somewhat relevant, partially related
- 0.0 = Not relevant

Relevance score:"""
        )
        
        # For demo purposes, we'll use a simpler heuristic
        # In production, call the LLM or use Cohere Rerank
        
        # Simplified: score based on keyword overlap + metadata
        query_lower = query.lower()
        doc_lower = doc.page_content.lower()
        
        # Keyword overlap score
        query_words = set(query_lower.split())
        doc_words = set(doc_lower.split())
        overlap = len(query_words & doc_words) / len(query_words) if query_words else 0
        
        # Boost score for title match
        title_boost = 0.2 if any(word in doc.metadata.get('title', '').lower() for word in query_words) else 0
        
        score = min(overlap + title_boost, 1.0)
        scored_docs.append((doc, score))
    
    # Sort by score
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    
    print(f"   ✓ Reranked, returning top {top_k}\n")
    return scored_docs[:top_k]

print("✅ Semantic reranking function created")
print("\n💡 Note: This is a simplified demo. In production, use:")
print("   - Cohere Rerank API (highly recommended)")
print("   - sentence-transformers cross-encoders")
print("   - Specialized reranking models")
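
If the relevance prompt above is actually sent to the LLM, the reply comes back as free text and has to be turned into a score. Below is a minimal sketch of that parsing step, assuming the model mostly follows the "ONLY a number" instruction. `parse_relevance` is a hypothetical helper, not part of LangChain.

```python
import re

def parse_relevance(reply: str, default: float = 0.0) -> float:
    """Extract the first number from an LLM reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"-?(?:\d+\.?\d*|\.\d+)", reply)
    if match is None:
        return default  # No number found: treat the document as not relevant
    return max(0.0, min(1.0, float(match.group())))

print(parse_relevance("0.85"))
print(parse_relevance("Relevance score: 0.7"))
print(parse_relevance("not sure"))
```

Clamping guards against out-of-range replies such as `1.5`, and the default keeps one malformed reply from aborting the whole reranking pass.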

---
## 4. Security-Specific Ranking Functions

Now let's build ranking functions that prioritize by security-relevant metadata.

In [None]:
def severity_score(doc: Document) -> float:
    """
    Score document by severity/risk level.
    
    Returns:
        Score between 0.0 and 1.0
    """
    risk_level = doc.metadata.get('risk_level', 'Medium')
    
    severity_map = {
        'Critical': 1.0,
        'High': 0.7,
        'Medium': 0.4,
        'Low': 0.2
    }
    
    return severity_map.get(risk_level, 0.4)


def recency_score(doc: Document, decay_days: int = 365) -> float:
    """
    Score document by recency (newer = higher score).
    
    Args:
        doc: Document with optional 'date_published' metadata (ISO 8601 string)
        decay_days: Time constant in days; the score falls to 1/e (about 0.37)
            when the document is this old
        
    Returns:
        Score between 0.0 and 1.0
    """
    # The demo data has no dates, so this normally returns the neutral score
    
    date_str = doc.metadata.get('date_published')
    if not date_str:
        return 0.5  # Neutral score if no date
    
    try:
        pub_date = datetime.fromisoformat(date_str)
        # Clamp at 0 so future-dated entries cannot score above 1.0
        age_days = max((datetime.now() - pub_date).days, 0)
    except (ValueError, TypeError):
        # Unparseable date, or a timezone-aware date compared with naive now()
        return 0.5
    
    # Exponential decay: score = exp(-age / decay_days)
    return float(np.exp(-age_days / decay_days))


def exploit_score(doc: Document) -> float:
    """
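    Score document by whether its text mentions exploitation.
    
    This is a keyword heuristic, not a lookup of real exploit status.
    
    Returns:
        1.0 if any exploit keyword appears in the content, 0.5 otherwise
    """
    # Substring match, so 'exploit' alone already covers the longer phrases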
    content_lower = doc.page_content.lower()
    exploit_keywords = ['exploit', 'exploitation', 'actively exploited', 'in-the-wild']
    
    if any(keyword in content_lower for keyword in exploit_keywords):
        return 1.0
    return 0.5


print("✅ Security-specific scoring functions created")
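
`recency_score` uses `exp(-age / decay_days)`, which reaches about 0.37 (1/e) when the age equals `decay_days`. If a true half-life is preferred, so that the score is exactly 0.5 at a chosen age, a base-2 decay is one option. The sketch below uses illustrative names (`half_life_recency`, `half_life_days`) and takes `now` as a parameter so the result is reproducible.

```python
from datetime import datetime, timedelta

def half_life_recency(
    pub_date: datetime, now: datetime, half_life_days: float = 365.0
) -> float:
    """Score in (0, 1] that halves every `half_life_days` days."""
    age_days = max((now - pub_date).days, 0)  # Future dates count as age 0
    return 0.5 ** (age_days / half_life_days)

now = datetime(2025, 1, 1)
print(half_life_recency(now, now))                        # published today
print(half_life_recency(now - timedelta(days=365), now))  # one half-life old
print(half_life_recency(now - timedelta(days=730), now))  # two half-lives old
```

The two forms are equivalent up to a constant: a half-life of `h` days corresponds to `decay_days = h / ln(2)` in the exponential version.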

---
## 5. Multi-Signal Ranking

Combine multiple scoring signals with configurable weights.

In [None]:
def multi_signal_rerank(
    query: str,
    documents: List[Document],
    weights: Optional[Dict[str, float]] = None,
    top_k: int = 5
) -> List[Tuple[Document, float, Dict[str, float]]]:
    """
    Rerank documents using multiple signals.
    
    Args:
        query: User query
        documents: Candidate documents, in retrieval order
        weights: Weight for each signal (default: semantic 0.4, severity 0.3,
            recency 0.15, exploit 0.15)
        top_k: Number of documents to return
        
    Returns:
        List of (document, final_score, signal_scores) tuples
    """
    if weights is None:
        weights = {
            'semantic': 0.4,    # Relevance to query
            'severity': 0.3,    # Risk level
            'recency': 0.15,    # How recent
            'exploit': 0.15     # Exploit availability
        }
    
    print(f"\n🔄 Multi-signal reranking with weights:")
    for signal, weight in weights.items():
        print(f"   {signal}: {weight:.2f}")
    print()
    
    scored_docs = []
    
    for doc in documents:
        # Calculate individual signal scores
        signals = {
            'semantic': 1.0,  # Placeholder: identical for every document, so it never changes the order
            'severity': severity_score(doc),
            'recency': recency_score(doc),
            'exploit': exploit_score(doc)
        }
        
        # Weighted combination; a signal with no entry in weights contributes 0
        final_score = sum(score * weights.get(name, 0.0) for name, score in signals.items())
        
        scored_docs.append((doc, final_score, signals))
    
    # Sort by final score
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    
    print(f"✓ Reranked, returning top {top_k}\n")
    return scored_docs[:top_k]

print("✅ Multi-signal reranking function created")
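
The `semantic` signal above is a constant, so it shifts every score by the same amount and cannot change the order. One simple way to let retrieval order contribute, assuming the candidates arrive sorted by similarity (as `similarity_search` returns them), is to convert each position into a score in (0, 1]. `rank_to_score` is a hypothetical helper:

```python
def rank_to_score(position: int, total: int) -> float:
    """Map a 0-based position in a ranked list to a score in (0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= position < total:
        raise ValueError("position must be in range(total)")
    return 1.0 - position / total

# Four candidates in retrieval order: the first gets 1.0, the last 0.25
semantic_scores = [rank_to_score(i, 4) for i in range(4)]
print(semantic_scores)
```

Inside `multi_signal_rerank`, this would mean iterating with `enumerate(documents)` and setting `'semantic': rank_to_score(i, len(documents))`. If actual similarity values are preferred over positions, LangChain vector stores also provide `similarity_search_with_relevance_scores`, which returns (document, score) pairs.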

---
## 6. Demonstrations

Let's compare different reranking strategies.

In [None]:
# Test query
test_query = "What are the most critical security risks?"

print("="*80)
print(f"❓ Query: {test_query}")
print("="*80)

# Initial retrieval
print("\n1️⃣  INITIAL RETRIEVAL (Similarity Only)")
print("-"*80)
initial_docs = vectorstore.similarity_search(test_query, k=5)
print(f"Retrieved {len(initial_docs)} documents:\n")
for i, doc in enumerate(initial_docs, 1):
    print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')}")
    print(f"   Risk: {doc.metadata.get('risk_level')}")
    print()

In [None]:
# Multi-signal reranking
print("\n2️⃣  MULTI-SIGNAL RERANKING")
print("-"*80)
reranked = multi_signal_rerank(test_query, initial_docs, top_k=5)

print("Reranked results:\n")
for i, (doc, final_score, signals) in enumerate(reranked, 1):
    print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')}")
    print(f"   Risk: {doc.metadata.get('risk_level')}")
    print(f"   Final Score: {final_score:.3f}")
    print(f"   Signals: severity={signals['severity']:.2f}, recency={signals['recency']:.2f}, exploit={signals['exploit']:.2f}")
    print()

### Custom Weights Example

In [None]:
# Heavily weight severity for security-critical applications
security_focused_weights = {
    'semantic': 0.2,    # Less emphasis on semantic match
    'severity': 0.5,    # Heavily prioritize severity
    'recency': 0.1,     # Less emphasis on recency
    'exploit': 0.2      # Moderate emphasis on exploits
}

print("\n3️⃣  SECURITY-FOCUSED RERANKING")
print("-"*80)
security_reranked = multi_signal_rerank(
    test_query, 
    initial_docs, 
    weights=security_focused_weights,
    top_k=5
)

print("Security-focused reranking results:\n")
for i, (doc, final_score, signals) in enumerate(security_reranked, 1):
    print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')}")
    print(f"   Risk: {doc.metadata.get('risk_level')}")
    print(f"   Final Score: {final_score:.3f}")
    print()

---
## 7. Complete RAG Pipeline with Reranking

In [None]:
def rag_with_reranking(
    query: str,
    vectorstore: Chroma,
    llm,
    rerank_method: str = 'multi_signal',
    weights: Optional[Dict[str, float]] = None,
    k_retrieve: int = 10,
    k_final: int = 3
) -> str:
    """
    Complete RAG pipeline with reranking.
    
    Args:
        query: User query
        vectorstore: Vector store
        llm: Language model
        rerank_method: 'multi_signal', 'semantic', or 'none'
        weights: Custom weights for multi_signal
        k_retrieve: Number of candidates to retrieve
        k_final: Number of documents to use for generation
        
    Returns:
        Generated answer
    """
    print(f"\n{'='*80}")
    print(f"🔍 RAG with Reranking: {rerank_method}")
    print(f"{'='*80}\n")
    
    # Step 1: Initial retrieval (over-fetch)
    print(f"1️⃣  Retrieving {k_retrieve} candidate documents...")
    candidates = vectorstore.similarity_search(query, k=k_retrieve)
    print(f"   Retrieved {len(candidates)} candidates\n")
    
    # Step 2: Rerank
    print(f"2️⃣  Reranking with method: {rerank_method}")
    
    if rerank_method == 'multi_signal':
        reranked = multi_signal_rerank(query, candidates, weights, k_final)
        final_docs = [doc for doc, score, signals in reranked]
    elif rerank_method == 'semantic':
        reranked = semantic_rerank(query, candidates, llm, k_final)
        final_docs = [doc for doc, score in reranked]
    else:
        # No reranking
        final_docs = candidates[:k_final]
        print(f"   No reranking applied\n")
    
    print(f"   Selected top {len(final_docs)} documents for generation\n")
    
    # Step 3: Generate answer
    print("3️⃣  Generating answer...\n")
    
    context = "\n\n".join([
        f"Document {i+1} ({doc.metadata.get('id')} - {doc.metadata.get('title')}, Risk: {doc.metadata.get('risk_level')}):\n{doc.page_content}"
        for i, doc in enumerate(final_docs)
    ])
    
    answer_prompt = ChatPromptTemplate.from_template(
        """You are an AI security expert assistant.

Use the following security documentation to answer the user's question.
The documents have been ranked by both relevance and security priority.

Context:
{context}

User Question: {question}

Instructions:
1. Provide a comprehensive answer based on the context
2. Prioritize information from higher-risk vulnerabilities
3. Cite specific vulnerabilities and risk levels
4. Include prevention measures and best practices

Answer:"""
    )
    
    prompt_value = answer_prompt.invoke({"context": context, "question": query})
    response = llm.invoke(prompt_value)
    
    return response.content

print("✅ Complete RAG with reranking pipeline created")

In [None]:
# Test complete pipeline
query = "What are the most critical security risks for LLM applications?"

answer = rag_with_reranking(
    query=query,
    vectorstore=vectorstore,
    llm=llm,
    rerank_method='multi_signal',
    k_retrieve=10,
    k_final=3
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

---
## 8. Comparison: Different Reranking Strategies

In [None]:
def compare_reranking_strategies(query: str, vectorstore: Chroma):
    """
    Compare different reranking strategies.
    """
    print("\n" + "="*80)
    print(f"❓ Query: {query}")
    print("="*80)
    
    # Retrieve candidates
    candidates = vectorstore.similarity_search(query, k=10)
    
    # Strategy 1: No reranking (baseline)
    print("\n1️⃣  NO RERANKING (Baseline)")
    print("-"*80)
    baseline = candidates[:5]
    print("Top 5 documents:\n")
    for i, doc in enumerate(baseline, 1):
        print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')} (Risk: {doc.metadata.get('risk_level')})")
    
    # Strategy 2: Multi-signal (balanced)
    print("\n2️⃣  MULTI-SIGNAL (Balanced Weights)")
    print("-"*80)
    balanced = multi_signal_rerank(query, candidates, top_k=5)
    print("Top 5 documents:\n")
    for i, (doc, score, signals) in enumerate(balanced, 1):
        print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')} (Risk: {doc.metadata.get('risk_level')}, Score: {score:.3f})")
    
    # Strategy 3: Severity-focused
    print("\n3️⃣  SEVERITY-FOCUSED")
    print("-"*80)
    severity_weights = {'semantic': 0.2, 'severity': 0.5, 'recency': 0.1, 'exploit': 0.2}
    severity_focused = multi_signal_rerank(query, candidates, weights=severity_weights, top_k=5)
    print("Top 5 documents:\n")
    for i, (doc, score, signals) in enumerate(severity_focused, 1):
        print(f"{i}. {doc.metadata.get('id')}: {doc.metadata.get('title')} (Risk: {doc.metadata.get('risk_level')}, Score: {score:.3f})")
    
    print("\n" + "="*80)
    print("📊 ANALYSIS")
    print("="*80)
    print("✅ Notice how different strategies prioritize different documents")
    print("✅ Severity-focused ranking surfaces Critical/High risks first")
    print("✅ Choose weights based on your use case and priorities")
    print("\n" + "="*80 + "\n")

print("✅ Comparison function created")

In [None]:
# Run comparison
compare_reranking_strategies(
    "What security vulnerabilities should I prioritize?",
    vectorstore
)

---
## 9. Production Best Practices

### Reranking Strategy Selection

**Choose based on use case:**

1. **No Reranking (Baseline)**
   - Use when: Semantic relevance is all that matters
   - Pros: Fastest, simplest
   - Cons: May not prioritize by business impact

2. **Multi-Signal Reranking**
   - Use when: Need to balance relevance + metadata
   - Pros: Configurable, explainable, fast
   - Cons: Requires tuning weights

3. **Cross-Encoder Reranking**
   - Use when: Quality > speed, semantic nuance matters
   - Pros: Most accurate
   - Cons: Slower, more expensive

4. **Hybrid (Multi-Stage)**
   - Use when: Need both speed and quality
   - Approach: Bi-encoder retrieval → Multi-signal filter → Cross-encoder rerank top-5
   - Pros: Best of all worlds
   - Cons: Most complex

### Weight Tuning Guidelines

**For different scenarios:**

```python
# Security-critical production system
production_weights = {
    'semantic': 0.2,
    'severity': 0.5,    # Heavily prioritize critical issues
    'recency': 0.1,
    'exploit': 0.2      # Active exploits are urgent
}

# Research/analysis
research_weights = {
    'semantic': 0.6,    # Relevance is key
    'severity': 0.2,
    'recency': 0.1,
    'exploit': 0.1
}

# Compliance/audit
compliance_weights = {
    'semantic': 0.3,
    'severity': 0.4,
    'recency': 0.2,     # Recent changes matter
    'exploit': 0.1
}
```
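
Each preset above sums to 1.0, which keeps final scores comparable across configurations. When weights come from user input or a config file, a small validation step can enforce that. This is a sketch: `normalize_weights` and `REQUIRED_SIGNALS` are hypothetical names, and the signal list assumes the four signals used in this notebook.

```python
from typing import Dict

REQUIRED_SIGNALS = ("semantic", "severity", "recency", "exploit")

def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Validate signal weights and rescale them so they sum to 1.0."""
    missing = [name for name in REQUIRED_SIGNALS if name not in weights]
    if missing:
        raise ValueError(f"missing weights for: {missing}")
    if any(weights[name] < 0 for name in REQUIRED_SIGNALS):
        raise ValueError("weights must be non-negative")
    total = sum(weights[name] for name in REQUIRED_SIGNALS)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weights[name] / total for name in REQUIRED_SIGNALS}

# Relative importances expressed as integers, rescaled to fractions
normalized = normalize_weights(
    {"semantic": 2, "severity": 5, "recency": 1, "exploit": 2}
)
print(normalized)
```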

### Performance Optimization

1. **Over-fetch then rerank**: Retrieve 2-5x more than needed
2. **Cache reranking results**: Cache for repeated queries
3. **Batch reranking**: Process multiple queries together
4. **Async reranking**: Don't block on slow rerankers
5. **Fallback strategy**: If reranking fails, use baseline
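
The fallback strategy in point 5 can be sketched as a thin wrapper around any reranker. `rerank_with_fallback` and the toy rerankers are hypothetical; the wrapper catches every exception, which is a deliberate simplification for the sketch.

```python
import logging
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)

def rerank_with_fallback(
    candidates: Sequence[T],
    reranker: Callable[[Sequence[T]], List[T]],
    top_k: int,
) -> List[T]:
    """Return the reranked top_k, or the retrieval-order top_k if reranking fails."""
    try:
        return list(reranker(candidates))[:top_k]
    except Exception:
        logger.exception("Reranker failed; falling back to retrieval order")
        return list(candidates)[:top_k]

def broken_reranker(_candidates):
    raise RuntimeError("reranker unavailable")

docs = ["a", "b", "c", "d"]
reranked_ok = rerank_with_fallback(docs, lambda c: sorted(c, reverse=True), top_k=2)
reranked_fallback = rerank_with_fallback(docs, broken_reranker, top_k=2)
print(reranked_ok, reranked_fallback)
```

Logging the failure keeps the degradation visible in monitoring even though the user still receives an answer.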

### Monitoring & Evaluation

Track these metrics:
- **Ranking quality**: NDCG, MRR, Precision@K
- **Latency**: P50, P95, P99 for reranking time
- **User feedback**: Click-through rate on reranked results
- **Coverage**: % of queries where reranking changes order
- **Signal distribution**: How often each signal influences ranking
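
Two of the ranking-quality metrics can be computed in a few lines. The sketch below assumes each result list has been labeled with graded relevance judgments (for example 3 = most useful, 0 = irrelevant). The labels are hypothetical, and for simplicity the ideal ordering is built from the returned list only rather than from every relevant document in the collection.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, 1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# Hypothetical judgments for one query, before and after reranking
before = [1, 2, 3]
after = [3, 2, 1]
print(ndcg_at_k(before, 3), ndcg_at_k(after, 3))
print(reciprocal_rank([0, 0, 1]))
```

MRR is the mean of `reciprocal_rank` over a set of queries. Comparing `ndcg_at_k` before and after reranking on a labeled query set gives a direct measure of whether a weight change helped.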

---
## 10. Summary and Key Takeaways

### What We Built

✅ Complete reranking system:
1. **RRF Review**: Reciprocal Rank Fusion from Part 3
2. **Semantic Reranking**: Cross-encoder style (simplified)
3. **Security-Specific Scoring**: Severity, recency, exploit status
4. **Multi-Signal Ranking**: Weighted combination of signals
5. **Complete RAG Pipeline**: End-to-end with reranking
6. **Comparison Framework**: Evaluate different strategies

### Core Concepts Learned

1. **Why Reranking Matters**: Semantic similarity ≠ business priority
2. **Two-Stage Retrieval**: Fast bi-encoder → slow but accurate reranker
3. **Multi-Signal Scoring**: Combine semantic + metadata signals
4. **Configurable Weights**: Tune for different use cases
5. **Production Patterns**: Over-fetch, cache, fallback

### Key Insights

**Reranking Benefits:**
- ↑↑ Prioritize by business importance, not just relevance
- ↑ Better user experience (right results first)
- ↑ Configurable and explainable
- ✅ Essential for security/compliance applications

**Trade-offs:**
- **Accuracy vs Speed**: Cross-encoders are slow but accurate
- **Complexity vs Control**: More signals = more tuning
- **Generality vs Specificity**: Custom scoring fits one domain well but needs retuning before it transfers to another

### Production Recommendations

1. **Start with multi-signal**: Simple, fast, configurable
2. **Tune weights for your domain**: Security ≠ e-commerce
3. **Monitor ranking quality**: Track user engagement
4. **Use two-stage retrieval**: Over-fetch then rerank
5. **Implement fallbacks**: Don't fail if reranking breaks
6. **A/B test weights**: Optimize based on real usage

### Next Steps

In **Part 7**, we'll implement **RAPTOR**:
- Hierarchical knowledge organization
- Recursive summarization
- Tree-based retrieval
- Multi-level abstraction (tactics → techniques → procedures)

Example: Navigate MITRE ATT&CK hierarchy from high-level tactics down to specific procedures.

---

### 🎯 Practice Exercises

1. **Integrate Cohere Rerank**: Use real reranking API
2. **Add More Signals**: Implement affected_products, cwe_id scoring
3. **Learn-to-Rank**: Use ML to learn optimal weights
4. **Dynamic Weights**: Adjust weights based on query type
5. **Ranking Explainability**: Show users why docs were ranked

### 📚 Further Reading

- [Cohere Rerank Documentation](https://docs.cohere.com/docs/reranking)
- [Cross-Encoders for Reranking](https://www.sbert.net/examples/applications/cross-encoder/README.html)
- [Learning to Rank](https://en.wikipedia.org/wiki/Learning_to_rank)
- [NDCG and Ranking Metrics](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)