# Lab 06: RAG for Security Documentation

Build a Retrieval-Augmented Generation system for security knowledge bases.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab06_security_rag.ipynb)

## Learning Objectives
- Document chunking strategies
- Vector embeddings for semantic search
- RAG pipeline architecture
- Context-aware response generation

In [None]:
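# Only anthropic is used below, and it is optional; chromadb and sentence-transformers are for the Next Steps
# !pip install chromadb sentence-transformers anthropic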

In [None]:
from typing import List, Dict
import json
import hashlib

## 1. Security Knowledge Base

In [None]:
# Sample security documentation
SECURITY_DOCS = [
    {
        "title": "Ransomware Response Playbook",
        "content": """When ransomware is detected, immediately isolate affected systems from the network. 
        Do not pay the ransom. Preserve evidence by capturing memory dumps and disk images.
        Identify patient zero and the infection vector. Check backup integrity before restoration.
        Report to law enforcement (FBI IC3) and consider engaging incident response specialists.""",
        "category": "incident_response"
    },
    {
        "title": "Phishing Detection Guidelines",
        "content": """Signs of phishing emails include: sender domain mismatches, urgency language,
        suspicious links, unexpected attachments, grammar errors, and requests for credentials.
        Enable SPF, DKIM, and DMARC for email authentication. Train users to report suspicious emails.
        Use email gateway filtering and sandbox analysis for attachments.""",
        "category": "threat_detection"
    },
    {
        "title": "PowerShell Security Monitoring",
        "content": """Enable PowerShell script block logging (Event ID 4104) and module logging.
        Monitor for encoded commands (-enc, -encodedcommand), download cradles (IEX, Invoke-Expression),
        and AMSI bypass attempts. Use Constrained Language Mode in production.
        Block PowerShell v2 which lacks modern security features.""",
        "category": "endpoint_security"
    },
    {
        "title": "Lateral Movement Detection",
        "content": """Monitor for lateral movement indicators: PsExec usage, WMI remote execution,
        RDP connections from unusual sources, pass-the-hash attacks, and Kerberos anomalies.
        Look for Event IDs 4624 (Type 3 logons), 4648 (explicit credentials), and 5140 (share access).
        Implement network segmentation and jump servers for admin access.""",
        "category": "threat_detection"
    }
]

print(f"Loaded {len(SECURITY_DOCS)} documents")
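
Vector databases such as ChromaDB expect every record to carry a unique ID. One possible approach, sketched here, derives a stable ID from a document's title and content with `hashlib`, so re-indexing the same document yields the same ID. The helper name `doc_id` and the 16-character truncation are illustrative assumptions, not part of the lab.

```python
import hashlib
from typing import Dict

def doc_id(doc: Dict[str, str], length: int = 16) -> str:
    """Return a stable hex ID derived from title and content (hypothetical helper)."""
    payload = f"{doc['title']}\n{doc['content']}".encode("utf-8")
    # Truncating the digest is an assumption made for readability
    return hashlib.sha256(payload).hexdigest()[:length]

sample = {"title": "Ransomware Response Playbook", "content": "Isolate affected systems."}
print(doc_id(sample))  # the same input always produces the same ID
```

Hashing the content also means an edited document gets a new ID, which may or may not be what a given knowledge base needs.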

## 2. Simple Vector Store

In [None]:
import re

def tokenize(text: str) -> set:
    """Lowercase text and split it into alphanumeric tokens, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class SimpleVectorStore:
    """Simple in-memory document store scored by word overlap (a stand-in for real vector search)."""
    
    def __init__(self):
        self.documents = []
        self.vocabulary = set()
    
    def add_documents(self, docs: List[Dict]):
        """Add documents to the store."""
        for doc in docs:
            # Strip punctuation so "detected," in a document matches "detected?" in a query
            words = tokenize(doc['content'])
            self.vocabulary.update(words)
            self.documents.append({
                'title': doc['title'],
                'content': doc['content'],
                'words': words,
                'category': doc.get('category', 'general')
            })
    
    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Search for relevant documents."""
        query_words = tokenize(query)
        
        scores = []
        for doc in self.documents:
            # Simple word overlap scoring
            overlap = len(query_words & doc['words'])
            score = overlap / (len(query_words) + 0.1)
            scores.append((score, doc))
        
        # Sort by score
        scores.sort(key=lambda x: x[0], reverse=True)
        
        return [{'score': s, 'document': d} for s, d in scores[:top_k]]

# Initialize store
store = SimpleVectorStore()
store.add_documents(SECURITY_DOCS)
print(f"Indexed {len(store.documents)} documents")
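
The store above scores by raw word overlap, so a filler word like "is" counts as much as "ransomware". A minimal sketch of TF-IDF weighting with cosine similarity, using only the standard library, shows one way to down-weight common words. The function names (`build_idf`, `tfidf_vector`, `tfidf_search`) and the smoothed IDF formula are assumptions chosen for illustration, not part of the lab.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(docs: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency: rarer words get larger weights."""
    n = len(docs)
    df = Counter(word for text in docs for word in set(tokens(text)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}

def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; words unseen at index time are ignored."""
    counts = Counter(tokens(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tfidf_search(query: str, docs: List[str], top_k: int = 3) -> List[Tuple[float, int]]:
    """Return (score, document index) pairs, best first."""
    idf = build_idf(docs)
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(text, idf)), i) for i, text in enumerate(docs)]
    return sorted(scored, reverse=True)[:top_k]

corpus = [
    "When ransomware is detected, isolate affected systems from the network.",
    "Signs of phishing emails include urgency language and suspicious links.",
    "Enable PowerShell script block logging and monitor for encoded commands.",
]
best_score, best_index = tfidf_search("How is ransomware detected?", corpus, top_k=1)[0]
print(best_index, round(best_score, 3))
```

This is still lexical matching: a query about "extortion malware" would not find the ransomware playbook, which is the gap semantic embeddings are meant to close.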

## 3. RAG Pipeline

In [None]:
class SecurityRAG:
    """RAG system for security documentation."""
    
    def __init__(self, vector_store):
        self.store = vector_store
        try:
            from anthropic import Anthropic
            self.client = Anthropic()
            self.available = True
        except Exception:
            # SDK not installed or client could not be created: fall back to mock responses
            self.available = False
    
    def query(self, question: str) -> str:
        """Answer a question using RAG."""
        # Retrieve relevant documents
        results = self.store.search(question, top_k=3)
        
        # Build context
        context = "\n\n".join([
            f"Document: {r['document']['title']}\n{r['document']['content']}"
            for r in results if r['score'] > 0
        ])
        
        if not self.available:
            return self._mock_response(question, results)
        
        prompt = f"""Use the following security documentation to answer the question.
If the answer isn't in the documentation, say so.

DOCUMENTATION:
{context}

QUESTION: {question}

Provide a clear, actionable answer."""
        
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.content[0].text
    
    def _mock_response(self, question: str, results: List) -> str:
        if not results or results[0]['score'] == 0:
            return "I couldn't find relevant information in the security documentation."
        
        doc = results[0]['document']
        return f"""Based on the {doc['title']}:

{doc['content']}

---
(Retrieved from: {doc['category']} documentation)"""

In [None]:
# Test RAG system
rag = SecurityRAG(store)

questions = [
    "What should I do when ransomware is detected?",
    "How can I detect PowerShell attacks?",
    "What are signs of phishing?"
]

for q in questions:
    print(f"\nQ: {q}")
    print("-" * 50)
    print(rag.query(q))
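
`query` places every retrieved document with a positive score into the prompt. That works for four short documents, but a larger knowledge base could overrun the model's context window. The sketch below caps the context with a character budget. The helper name `build_context` and the `max_chars` parameter are hypothetical, and character count is only a rough proxy for tokens.

```python
from typing import Dict, List

def build_context(results: List[Dict], max_chars: int = 2000) -> str:
    """Join retrieved documents, best first, until the character budget is used up (hypothetical helper)."""
    parts: List[str] = []
    used = 0
    for r in results:
        if r["score"] <= 0:
            continue  # skip documents with no overlap, as query() does
        block = f"Document: {r['document']['title']}\n{r['document']['content']}"
        if parts and used + len(block) > max_chars:
            break  # budget exhausted; the top hit is always kept
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)

# Results in the same shape that SimpleVectorStore.search returns
demo_results = [
    {"score": 0.6, "document": {"title": "A", "content": "x" * 50}},
    {"score": 0.4, "document": {"title": "B", "content": "y" * 50}},
    {"score": 0.0, "document": {"title": "C", "content": "z" * 50}},
]
print(build_context(demo_results, max_chars=80))
```

Keeping the top hit even when it exceeds the budget is a design choice made here so the prompt is never empty when something relevant was found.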

## Summary

We built a RAG system for security documentation with:

1. **Document Store** - Indexed security playbooks and guides
2. **Search** - Simple word-overlap scoring (production: use embeddings)
3. **Generation** - Context-aware answers from an LLM, with a mock fallback when the Anthropic client is unavailable

### Next Steps:
1. Use sentence-transformers for semantic embeddings
2. Add ChromaDB or Pinecone for vector storage
3. Implement document chunking for large docs
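
As a starting point for the chunking step, here is a minimal sketch of fixed-size chunking with overlap, measured in words. The function name `chunk_text` and its default sizes are assumptions for illustration; real pipelines often chunk by tokens, sentences, or headings instead.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of up to chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_text(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of indexing some words twice.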