# Part 5: Metadata & Filtering (Query Structuring)

## Learning Objectives

By the end of this notebook, you will:
1. Understand the importance of metadata in RAG systems
2. Define metadata schemas for security documents
3. Extract structured filters from natural language queries
4. Implement metadata-aware retrieval
5. Filter documents by severity, date, category, and other fields
6. Compare filtered vs unfiltered retrieval
7. Build production-ready query structuring

## The Problem with Unfiltered Retrieval

Pure similarity search has limitations for security use cases:

### Example: "Show me critical vulnerabilities"

**Without metadata filtering:**
- Retrieves any document mentioning "critical" and "vulnerabilities"
- May return Medium or Low severity vulnerabilities that mention the word "critical"
- No guarantee the CVSS score is actually Critical (9.0+)
- Can't filter by date, affected systems, or exploit status

**With metadata filtering:**
```python
{
  "severity": "Critical",
  "cvss_score": {"$gte": 9.0}
}
```
- Only returns documents whose metadata has `severity = "Critical"`
- Every result has a recorded CVSS score >= 9.0, so the match is only as reliable as the metadata itself
- Can be combined with date filters, product filters, etc.
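
To make the filter semantics concrete, here is a minimal in-memory sketch of how such a filter could be evaluated against one document's metadata. The `matches` helper and the two sample records are hypothetical. The sketch assumes that multiple top-level keys mean an implicit AND, which some stores do not accept (they may require an explicit `$and`). A real vector store applies the filter internally.

```python
import operator
from typing import Any, Dict

# Comparison operators in the MongoDB-style filter syntax shown above
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if every condition in `where` holds for `metadata`."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 9.0}
            if not all(_OPS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # A bare value means equality
            return False
    return True

where = {"severity": "Critical", "cvss_score": {"$gte": 9.0}}

# Illustrative records only; the second one could still mention "critical" in its text
docs = [
    {"id": "A", "severity": "Critical", "cvss_score": 9.8},
    {"id": "B", "severity": "Medium", "cvss_score": 5.3},
]

print([d["id"] for d in docs if matches(d, where)])  # ['A']
```

Document B is rejected on metadata alone, regardless of how similar its text is to the query.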

## Common Security Metadata

### CVE/Vulnerability Metadata
- **severity**: Critical, High, Medium, Low
- **cvss_score**: 0.0 - 10.0
- **date_published**: ISO date
- **affected_products**: List of products/libraries
- **exploit_available**: Boolean
- **cwe_id**: Common Weakness Enumeration ID

### MITRE ATT&CK Metadata
- **tactic**: Initial Access, Execution, Persistence, etc.
- **technique_id**: T1190, T1059, etc.
- **platforms**: Windows, Linux, macOS, Cloud
- **data_sources**: Process monitoring, network traffic, etc.

### OWASP LLM Metadata
- **vulnerability_id**: LLM01, LLM02, etc.
- **risk_level**: Critical, High, Medium, Low
- **category**: Input Validation, Output Handling, etc.
- **source**: OWASP Top 10 for LLMs
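
The field lists above can be turned into a typed record before indexing. The sketch below uses only the standard library and covers a subset of the CVE fields. The `CveMetadata` class, the `cve_id` field, and the placeholder values are assumptions for illustration. It also assumes the target store accepts only scalar metadata values, so the product list is flattened into a string; check your store's documentation before relying on that.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Union


class Severity(str, Enum):
    """Controlled vocabulary for the severity field."""
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class CveMetadata:
    cve_id: str                      # hypothetical identifier field
    severity: Severity
    cvss_score: float                # 0.0 - 10.0
    date_published: str              # ISO date, e.g. "2024-01-31"
    affected_products: List[str] = field(default_factory=list)
    exploit_available: bool = False

    def __post_init__(self) -> None:
        # Reject scores outside the CVSS range early, before indexing
        if not 0.0 <= self.cvss_score <= 10.0:
            raise ValueError(f"cvss_score out of range: {self.cvss_score}")

    def to_flat_dict(self) -> Dict[str, Union[str, float, bool]]:
        """Flatten to scalar values (assumed store limitation)."""
        return {
            "cve_id": self.cve_id,
            "severity": self.severity.value,
            "cvss_score": self.cvss_score,
            "date_published": self.date_published,
            "affected_products": ",".join(self.affected_products),
            "exploit_available": self.exploit_available,
        }


# Placeholder record, not a real CVE
record = CveMetadata(
    cve_id="CVE-0000-0000",
    severity=Severity.CRITICAL,
    cvss_score=9.8,
    date_published="2024-01-31",
    affected_products=["example-lib", "example-app"],
)
print(record.to_flat_dict())
```

Joining the product list into one string keeps the record storable but makes exact per-product filtering harder; a separate boolean field per product is one alternative.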

## Solution: Query Structuring

1. **Extract structure from natural language**: Use an LLM to parse filters from the query
2. **Apply metadata filters**: Combine with similarity search
3. **Return precise results**: Only documents matching both semantic and metadata criteria
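
Before bringing in an LLM, it can help to see what "extracting structure" means on a toy scale. The sketch below is a rule-based stand-in for step 1, not the approach used in the rest of this notebook: the hypothetical `extract_filters` function pulls a risk level and an `LLMxx` ID out of the query with regular expressions and returns the remaining text as the semantic query.

```python
import re
from typing import Dict, Tuple

RISK_LEVELS = ("Critical", "High", "Medium", "Low")


def extract_filters(user_query: str) -> Tuple[Dict[str, str], str]:
    """Rule-based stand-in for the LLM step: returns (filters, semantic_query)."""
    filters: Dict[str, str] = {}
    remaining = user_query

    # An explicit ID such as "LLM01" becomes an exact-match filter
    id_match = re.search(r"\bLLM\d{2}\b", remaining, flags=re.IGNORECASE)
    if id_match:
        filters["vulnerability_id"] = id_match.group(0).upper()

    # The first risk level mentioned becomes a filter and is removed from the text
    for level in RISK_LEVELS:
        pattern = rf"\b{level}\b(\s+(severity|risk))?"
        if re.search(pattern, remaining, flags=re.IGNORECASE):
            filters["risk_level"] = level
            remaining = re.sub(pattern, "", remaining, flags=re.IGNORECASE)
            break

    # Collapse the whitespace left behind by the removal
    semantic_query = " ".join(remaining.split())
    return filters, semantic_query


for q in [
    "Show me critical vulnerabilities related to prompt injection",
    "What are high severity output handling issues?",
    "Tell me about LLM01",
]:
    print(extract_filters(q))
```

Rules like these break down quickly on paraphrases ("most dangerous", "top priority"), which is the motivation for using an LLM for this step.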

---
## 1. Environment Setup

In [None]:
# Install additional dependencies
!pip install -q pydantic

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from typing import List, Dict, Optional, Literal
from datetime import datetime, timedelta
from pydantic import BaseModel, Field

# LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser, PydanticOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document

# Load environment variables
load_dotenv()

if not os.getenv("OPENAI_API_KEY"):
    print("⚠️  WARNING: OPENAI_API_KEY not found")
else:
    print("✅ OpenAI API key loaded")

In [None]:
# Initialize embeddings and LLM
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

print("✅ Embeddings and LLM initialized")

---
## 2. Metadata Schema Definition

We'll use Pydantic to define a structured schema for the filters that can be applied to our security documents.

In [None]:
# Define metadata schema using Pydantic
class SecurityQueryFilter(BaseModel):
    """Structured filters for security document queries."""
    
    # Severity/Risk filters
    risk_level: Optional[Literal["Critical", "High", "Medium", "Low"]] = Field(
        None,
        description="Filter by risk/severity level"
    )
    
    # Category filters
    category: Optional[str] = Field(
        None,
        description="Filter by vulnerability category (e.g., 'Input Validation', 'Output Handling')"
    )
    
    # ID filters
    vulnerability_id: Optional[str] = Field(
        None,
        description="Filter by specific vulnerability ID (e.g., 'LLM01', 'LLM02')"
    )
    
    # Source filters
    source: Optional[str] = Field(
        None,
        description="Filter by document source (e.g., 'OWASP Top 10 for LLMs', 'MITRE ATT&CK')"
    )
    
    # Free-text query (for hybrid filtering)
    query: str = Field(
        ...,
        description="The semantic search query text"
    )

print("✅ Metadata schema defined")
print("\nSchema fields:")
# model_fields is the Pydantic v2 accessor (__fields__ is deprecated in v2)
for field_name, field in SecurityQueryFilter.model_fields.items():
    print(f"  - {field_name}: {field.annotation}")

---
## 3. Structured Query Extraction

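We'll prompt the LLM with format instructions generated by a `PydanticOutputParser`, then parse its response into a `SecurityQueryFilter` to extract structured filters from natural language.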

In [None]:
# Create parser for Pydantic model
parser = PydanticOutputParser(pydantic_object=SecurityQueryFilter)

# Prompt template for query structuring
query_structuring_template = """You are an AI assistant that converts natural language security queries into structured filters.

Extract the following information from the user's query:
1. risk_level: Critical, High, Medium, or Low (if mentioned)
2. category: The vulnerability category (if mentioned)
3. vulnerability_id: Specific vulnerability ID like LLM01, LLM02, etc. (if mentioned)
4. source: The document source like "OWASP Top 10 for LLMs" (if mentioned)
5. query: The semantic search query (always required)

User query: {user_query}

{format_instructions}

Structured Query:"""

query_structuring_prompt = ChatPromptTemplate.from_template(query_structuring_template)

# Create chain
query_structuring_chain = (
    query_structuring_prompt.partial(format_instructions=parser.get_format_instructions())
    | llm
    | parser
)

print("✅ Query structuring chain created")

In [None]:
# Test query structuring
test_queries = [
    "Show me critical vulnerabilities related to prompt injection",
    "What are high severity output handling issues?",
    "Tell me about LLM01",
    "Find medium risk vulnerabilities in the OWASP Top 10",
]

print("🧪 Testing Query Structuring\n")
print("=" * 80)

for query in test_queries:
    print(f"\n❓ Natural Language: '{query}'")
    structured = query_structuring_chain.invoke({"user_query": query})
    print(f"\n📋 Structured Filter:")
    print(f"   risk_level: {structured.risk_level}")
    print(f"   category: {structured.category}")
    print(f"   vulnerability_id: {structured.vulnerability_id}")
    print(f"   source: {structured.source}")
    print(f"   query: '{structured.query}'")
    print("-" * 80)

---
## 4. Load Vector Store with Enhanced Metadata

Let's load our existing vector store and examine the metadata.

In [None]:
# Load vector store
vectorstore = Chroma(
    collection_name="owasp_llm_top10",
    embedding_function=embeddings,
    persist_directory="../data/chroma_db"
)

print("✅ Vector store loaded")
print(f"   Collection: {vectorstore._collection.count()} documents")

In [None]:
# Examine metadata in our documents
print("\n📊 Metadata Schema in Current Documents:\n")
print("=" * 80)

# Get a sample document
sample_docs = vectorstore.similarity_search("test", k=1)
if sample_docs:
    sample_metadata = sample_docs[0].metadata
    print("Sample document metadata:")
    for key, value in sample_metadata.items():
        print(f"  {key}: {value}")
else:
    print("No documents found")

print("\n" + "=" * 80)

---
## 5. Metadata-Aware Retrieval

Now let's implement retrieval that combines similarity search with metadata filtering.

In [None]:
def filtered_retrieval(
    query: str,
    vectorstore: Chroma,
    filters: Optional[SecurityQueryFilter] = None,
    k: int = 3
) -> List[Document]:
    """
    Retrieve documents using semantic similarity + metadata filters.
    
    Args:
        query: Semantic search query
        vectorstore: Chroma vector store
        filters: Structured metadata filters
        k: Number of documents to retrieve
        
    Returns:
        List of filtered documents
    """
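    # Collect one equality condition per requested metadata field
    conditions = []
    
    if filters:
        if filters.risk_level:
            conditions.append({"risk_level": filters.risk_level})
        if filters.category:
            conditions.append({"category": filters.category})
        if filters.vulnerability_id:
            # This notebook reads the vulnerability ID from the "id" metadata key
            conditions.append({"id": filters.vulnerability_id})
        if filters.source:
            conditions.append({"source": filters.source})
    
    # Chroma expects a single top-level key in a where clause,
    # so multiple conditions are combined with an explicit $and
    if len(conditions) > 1:
        where_filter = {"$and": conditions}
    elif conditions:
        where_filter = conditions[0]
    else:
        where_filter = {}
    
    # Perform filtered similarity search
    if where_filter:
        print(f"🔍 Searching with filters: {where_filter}")
        docs = vectorstore.similarity_search(
            query=query,
            k=k,
            filter=where_filter
        )
    else:
        print("🔍 Searching without filters")
        docs = vectorstore.similarity_search(query=query, k=k)
    
    return docs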

print("✅ Filtered retrieval function created")

---
## 6. End-to-End Query Structuring Pipeline

Combine query structuring + filtered retrieval + answer generation.

In [None]:
def structured_query_rag(user_query: str, vectorstore: Chroma, llm) -> str:
    """
    End-to-end RAG with query structuring and metadata filtering.
    
    Args:
        user_query: Natural language query
        vectorstore: Vector store
        llm: Language model
        
    Returns:
        Generated answer
    """
    print(f"\n{'='*80}")
    print(f"❓ User Query: {user_query}")
    print(f"{'='*80}\n")
    
    # Step 1: Extract structured filters
    print("1️⃣  Extracting structured filters...")
    structured = query_structuring_chain.invoke({"user_query": user_query})
    print(f"   Filters: risk_level={structured.risk_level}, category={structured.category}, id={structured.vulnerability_id}")
    print(f"   Query: '{structured.query}'\n")
    
    # Step 2: Retrieve with filters
    print("2️⃣  Retrieving documents...")
    docs = filtered_retrieval(
        query=structured.query,
        vectorstore=vectorstore,
        filters=structured,
        k=3
    )
    print(f"   Retrieved {len(docs)} documents\n")
    
    if not docs:
        return "No documents found matching your criteria. Try broadening your search."
    
    # Show retrieved documents
    print("   📄 Retrieved Documents:")
    for i, doc in enumerate(docs, 1):
        print(f"      {i}. {doc.metadata.get('id', 'N/A')}: {doc.metadata.get('title', 'N/A')} (Risk: {doc.metadata.get('risk_level', 'N/A')})")
    print()
    
    # Step 3: Generate answer
    print("3️⃣  Generating answer...\n")
    
    # Format context; .get() avoids a KeyError when a metadata field is missing
    context = "\n\n".join([
        f"Document {i} ({doc.metadata.get('id', 'N/A')} - {doc.metadata.get('title', 'N/A')}, "
        f"Risk: {doc.metadata.get('risk_level', 'N/A')}):\n{doc.page_content}"
        for i, doc in enumerate(docs, 1)
    ])
    
    # Answer prompt
    answer_prompt = ChatPromptTemplate.from_template(
        """You are an AI security expert assistant.

Use the following security documentation to answer the user's question.

Context:
{context}

User Question: {question}

Instructions:
1. Provide a comprehensive answer based on the context
2. Cite specific vulnerabilities (e.g., LLM01) and risk levels
3. Include prevention measures and best practices
4. Be specific and actionable
5. If the context doesn't fully answer the question, acknowledge limitations

Answer:"""
    )
    
    prompt_value = answer_prompt.invoke({"context": context, "question": user_query})
    response = llm.invoke(prompt_value)
    
    return response.content

print("✅ Structured query RAG pipeline created")

---
## 7. Demonstrations

Let's test the structured query RAG system with various filtered queries.

### Example 1: Filter by Severity

In [None]:
answer = structured_query_rag(
    "Show me critical vulnerabilities",
    vectorstore,
    llm
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

### Example 2: Filter by Category

In [None]:
answer = structured_query_rag(
    "What are the output validation vulnerabilities?",
    vectorstore,
    llm
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

### Example 3: Filter by Specific ID

In [None]:
answer = structured_query_rag(
    "Tell me about LLM01",
    vectorstore,
    llm
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

### Example 4: Combined Filters

In [None]:
answer = structured_query_rag(
    "Show me high severity data privacy issues",
    vectorstore,
    llm
)

print("\n" + "="*80)
print("📄 ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)

---
## 8. Comparison: Filtered vs Unfiltered Retrieval

Let's compare retrieval quality with and without metadata filtering.

In [None]:
def compare_filtering(user_query: str, vectorstore: Chroma):
    """
    Compare filtered vs unfiltered retrieval.
    """
    print("\n" + "="*80)
    print(f"❓ Query: {user_query}")
    print("="*80)
    
    # Extract filters
    structured = query_structuring_chain.invoke({"user_query": user_query})
    
    # Unfiltered retrieval
    print("\n1️⃣  UNFILTERED RETRIEVAL (Similarity Only)")
    print("-" * 80)
    unfiltered_docs = vectorstore.similarity_search(structured.query, k=3)
    print(f"Retrieved {len(unfiltered_docs)} documents:\n")
    for i, doc in enumerate(unfiltered_docs, 1):
        print(f"{i}. {doc.metadata.get('id', 'N/A')}: {doc.metadata.get('title', 'N/A')}")
        print(f"   Risk: {doc.metadata.get('risk_level', 'N/A')}, Category: {doc.metadata.get('category', 'N/A')}")
        print(f"   Preview: {doc.page_content[:100]}...\n")
    
    # Filtered retrieval
    print("\n2️⃣  FILTERED RETRIEVAL (Similarity + Metadata)")
    print("-" * 80)
    filtered_docs = filtered_retrieval(structured.query, vectorstore, structured, k=3)
    print(f"\nRetrieved {len(filtered_docs)} documents:\n")
    for i, doc in enumerate(filtered_docs, 1):
        print(f"{i}. {doc.metadata.get('id', 'N/A')}: {doc.metadata.get('title', 'N/A')}")
        print(f"   Risk: {doc.metadata.get('risk_level', 'N/A')}, Category: {doc.metadata.get('category', 'N/A')}")
        print(f"   Preview: {doc.page_content[:100]}...\n")
    
    # Analysis
    print("\n" + "="*80)
    print("📊 ANALYSIS")
    print("="*80)
    print(f"Filters applied: risk_level={structured.risk_level}, category={structured.category}, id={structured.vulnerability_id}")
    print(f"Unfiltered results: {len(unfiltered_docs)}")
    print(f"Filtered results: {len(filtered_docs)}")
    print(f"\n✅ Filtering ensures documents match both semantic similarity AND metadata criteria")
    print("\n" + "="*80 + "\n")

print("✅ Comparison function created")

In [None]:
# Test comparison
compare_filtering("Show me critical vulnerabilities", vectorstore)

In [None]:
# Another comparison
compare_filtering("What are high severity input validation issues?", vectorstore)
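
The comparison above prints both result lists, but a single number makes the difference easier to track across queries. As a rough sketch, the hypothetical `filter_match_rate` helper below computes the fraction of retrieved metadata records that satisfy every extracted filter (equality only). The sample records are illustrative, not output from the vector store; in the notebook you could pass `[doc.metadata for doc in unfiltered_docs]` instead.

```python
from typing import Any, Dict, List


def filter_match_rate(results: List[Dict[str, Any]], filters: Dict[str, Any]) -> float:
    """Fraction of retrieved metadata records that satisfy every filter."""
    if not results:
        return 0.0
    hits = sum(
        all(meta.get(key) == value for key, value in filters.items())
        for meta in results
    )
    return hits / len(results)


# Illustrative metadata records
unfiltered = [
    {"id": "doc-1", "risk_level": "Critical"},
    {"id": "doc-2", "risk_level": "High"},
    {"id": "doc-3", "risk_level": "Medium"},
]
filtered = [{"id": "doc-1", "risk_level": "Critical"}]
filters = {"risk_level": "Critical"}

print(f"unfiltered: {filter_match_rate(unfiltered, filters):.2f}")  # 0.33
print(f"filtered:   {filter_match_rate(filtered, filters):.2f}")    # 1.00
```

A filtered search scores 1.0 by construction, so the useful number is the unfiltered rate: it shows how often similarity alone would have returned documents that violate the user's criteria.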

---
## 9. Production Best Practices

### Metadata Design Principles

1. **Consistent Schema**: Use the same metadata fields across all documents
2. **Controlled Vocabulary**: Use enums for categorical fields (e.g., severity levels)
3. **Nullable Fields**: Make most fields optional to handle incomplete data
4. **Rich Metadata**: Include all relevant filtering dimensions
5. **Indexing**: Ensure metadata fields are indexed for fast filtering

### Query Structuring Best Practices

1. **Graceful Degradation**: If no structured filters are found, fall back to semantic search
2. **Validation**: Validate extracted filters before applying
3. **User Feedback**: Show users what filters were applied
4. **Refinement**: Allow users to adjust filters interactively
5. **Logging**: Log structured queries for analysis and improvement
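
As one possible shape for the validation step, the sketch below normalizes extracted values against controlled vocabularies and silently drops anything it does not recognize, so an unexpected LLM output cannot produce a filter that matches nothing. The `validate_filters` function and the category vocabulary are assumptions; the vocabulary should mirror the values actually stored in your collection.

```python
import re
from typing import Dict, Optional

ALLOWED_RISK_LEVELS = {
    "critical": "Critical",
    "high": "High",
    "medium": "Medium",
    "low": "Low",
}

# Hypothetical category vocabulary; replace with the values in your metadata
KNOWN_CATEGORIES = {
    "input validation": "Input Validation",
    "output handling": "Output Handling",
}


def validate_filters(raw: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep only filter values that match the controlled vocabulary."""
    clean: Dict[str, str] = {}

    risk = (raw.get("risk_level") or "").strip().lower()
    if risk in ALLOWED_RISK_LEVELS:
        clean["risk_level"] = ALLOWED_RISK_LEVELS[risk]

    category = (raw.get("category") or "").strip().lower()
    if category in KNOWN_CATEGORIES:
        clean["category"] = KNOWN_CATEGORIES[category]

    vuln_id = (raw.get("vulnerability_id") or "").strip().upper()
    if re.fullmatch(r"LLM\d{2}", vuln_id):
        clean["vulnerability_id"] = vuln_id

    return clean


extracted = {
    "risk_level": " critical ",
    "category": "Output Validation",   # not in the vocabulary, so it is dropped
    "vulnerability_id": "llm01",
}
print(validate_filters(extracted))
# {'risk_level': 'Critical', 'vulnerability_id': 'LLM01'}
```

Dropping an unknown value widens the search rather than failing it; logging the dropped values is a cheap way to discover vocabulary gaps.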

### Performance Optimization

1. **Index Metadata Fields**: Ensure the vector store indexes metadata for fast filtering
2. **Limit Filter Complexity**: Too many filters can slow down queries
3. **Cache Structured Queries**: Cache filter extraction for common queries
4. **Batch Processing**: Process multiple queries in parallel when possible
5. **Monitor Performance**: Track query latency and filter effectiveness
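
Caching filter extraction can be as simple as a dictionary keyed on a normalized form of the query. The sketch below is a minimal illustration: `extract_filters_uncached` is a hypothetical stand-in for the LLM call, and the unbounded in-process dictionary is an assumption that would need size limits and expiry in a long-running service.

```python
from typing import Dict

# Counts how often the (simulated) expensive extraction actually runs
stats: Dict[str, int] = {"llm_calls": 0}

_cache: Dict[str, Dict[str, str]] = {}


def extract_filters_uncached(user_query: str) -> Dict[str, str]:
    """Hypothetical stand-in for the LLM-backed extraction step."""
    stats["llm_calls"] += 1
    return {"query": user_query.strip()}


def extract_filters_cached(user_query: str) -> Dict[str, str]:
    """Reuse earlier results for queries that differ only in case or whitespace."""
    key = " ".join(user_query.lower().split())
    if key not in _cache:
        # The original query text is passed through; only the cache key is normalized
        _cache[key] = extract_filters_uncached(user_query)
    return _cache[key]


extract_filters_cached("Show me critical vulnerabilities")
extract_filters_cached("  show me CRITICAL   vulnerabilities ")
print(stats["llm_calls"])  # 1
```

Caching is only safe when extraction is deterministic for a given query, which is one reason to keep `temperature=0` for the structuring step.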

In [None]:
# Example: Graceful degradation
def robust_structured_query_rag(user_query: str, vectorstore: Chroma, llm) -> str:
    """
    RAG with graceful degradation if filter extraction fails.
    """
    try:
        # Try structured query extraction
        structured = query_structuring_chain.invoke({"user_query": user_query})
        docs = filtered_retrieval(structured.query, vectorstore, structured, k=3)
        
        # If no results with filters, try without
        if not docs:
            print("⚠️  No results with filters, trying unfiltered search...")
            docs = vectorstore.similarity_search(structured.query, k=3)
            
    except Exception as e:
        # If extraction fails, fall back to basic search
        print(f"⚠️  Filter extraction failed ({e}), falling back to basic search...")
        docs = vectorstore.similarity_search(user_query, k=3)
    
    if not docs:
        return "No relevant documents found."
    
    # Generate answer
    context = "\n\n".join([doc.page_content for doc in docs])
    answer_prompt = ChatPromptTemplate.from_template(
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )
    prompt_value = answer_prompt.invoke({"context": context, "question": user_query})
    response = llm.invoke(prompt_value)
    return response.content

print("✅ Robust structured query RAG created")
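
Dropping all filters at once is the simplest fallback, but it discards every constraint the user expressed. A gentler variant, sketched below under stated assumptions, relaxes one filter at a time in a fixed priority order and reports which filters were actually used. The `search_with_relaxation` helper, the toy corpus, and the `toy_search` function are hypothetical; in the notebook, `search` would wrap the vector store call.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Metadata = Dict[str, Any]


def search_with_relaxation(
    search: Callable[[Metadata], List[Metadata]],
    filters: Metadata,
    drop_order: Sequence[str],
) -> Tuple[List[Metadata], Metadata]:
    """Try all filters first, then drop fields in `drop_order` until results appear."""
    active = dict(filters)
    results = search(active)
    for field in drop_order:
        if results:
            break
        if field in active:
            del active[field]
            results = search(active)
    return results, active


# Toy corpus and equality-only search, for illustration
CORPUS = [
    {"id": "doc-1", "risk_level": "High", "category": "Output Handling"},
    {"id": "doc-2", "risk_level": "Low", "category": "Input Validation"},
]


def toy_search(where: Metadata) -> List[Metadata]:
    return [d for d in CORPUS if all(d.get(k) == v for k, v in where.items())]


results, used = search_with_relaxation(
    toy_search,
    {"risk_level": "High", "category": "Data Privacy"},
    drop_order=["category", "risk_level"],  # least important filter first
)
print([d["id"] for d in results], used)
# ['doc-1'] {'risk_level': 'High'}
```

Returning the filters that were actually applied lets the caller tell the user which criteria were relaxed, in line with the "User Feedback" practice above.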

---
## 10. Summary and Key Takeaways

### What We Built

✅ Complete metadata filtering pipeline:
1. **Metadata Schema**: Pydantic models for structured data
2. **Query Structuring**: LLM-based filter extraction
3. **Filtered Retrieval**: Combined similarity + metadata search
4. **End-to-End RAG**: Complete pipeline with filtering
5. **Comparison Framework**: Filtered vs unfiltered evaluation

### Core Concepts Learned

1. **Metadata Importance**: Why metadata is critical for production RAG
2. **Structured Queries**: Converting natural language to filters
3. **Hybrid Search**: Combining semantic similarity with metadata filtering
4. **Graceful Degradation**: Handling filter extraction failures
5. **Production Patterns**: Best practices for real-world systems

### Key Insights

**Metadata Filtering Benefits:**
- ↑↑ Precision (only relevant documents)
- ↑ User control (explicit filter criteria)
- ↑ Explainability (clear why documents matched)
- ✅ Essential for enterprise security applications

**When to Use Metadata Filtering:**
- **Categorical queries**: "Show me critical vulnerabilities"
- **Time-bound queries**: "CVEs from last 6 months"
- **Product-specific**: "Vulnerabilities affecting PyTorch"
- **Compliance**: "Show GDPR-relevant issues"

### Production Recommendations

1. **Design metadata schema early**: Plan fields before indexing
2. **Use controlled vocabularies**: Enums for consistency
3. **Validate extracted filters**: Check before applying
4. **Show filters to users**: Transparency builds trust
5. **Monitor filter effectiveness**: Track precision/recall
6. **Implement graceful degradation**: Fall back to semantic search

### Next Steps

In **Part 6**, we'll add **Intelligent Reranking**:
- Rerank by relevance AND priority (severity, recency)
- Use Cohere Rerank for semantic reranking
- Implement security-specific ranking functions
- Combine multiple ranking signals

Example: Boost critical vulnerabilities with recent exploits to the top, even if their similarity score is slightly lower.

---

### 🎯 Practice Exercises

1. **Add More Metadata Fields**: Add date_published, affected_products, exploit_available
2. **Implement Date Filtering**: Parse date ranges from queries ("last 6 months")
3. **Add CVSS Filtering**: Filter by CVSS score ranges
4. **Build Interactive UI**: Streamlit app with filter controls
5. **Implement Filter Suggestions**: Suggest relevant filters based on query

### 📚 Further Reading

- [LangChain Self-Query Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query)
- [Chroma Metadata Filtering](https://docs.trychroma.com/usage-guide#filtering-by-metadata)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [OpenAI Function Calling](https://platform.openai.com/docs/guides/function-calling)