# Part 1: Introduction & Use Case

## Building an AI Security Analyst Assistant with RAG

Welcome to **Security RAG from Scratch**! In this series of notebooks, we'll build a comprehensive Retrieval-Augmented Generation (RAG) system specifically designed for AI/ML security applications.

### What You'll Learn

This project demonstrates:
- 🎯 **RAG Fundamentals**: From basic retrieval to advanced techniques
- 🛡️ **Security Domain Expertise**: Working with real security frameworks and data
- 🔬 **Cutting-Edge Research**: Implementing RAPTOR and ColBERT
- 🏗️ **Production Readiness**: Security hardening, evaluation, and deployment

### Portfolio Value

This project is ideal for portfolios because it shows:
1. **Technical depth** - Advanced RAG techniques beyond basic implementations
2. **Domain expertise** - Real-world security applications
3. **Research awareness** - Implementing recent academic papers
4. **End-to-end thinking** - From data acquisition to deployed application
5. **Communication skills** - Clear documentation and educational content

## The Problem: Information Overload in Security

Security professionals face an overwhelming amount of information:

- 📚 **Massive Frameworks**: MITRE ATT&CK has 600+ techniques and sub-techniques across 14 tactics
- 🔍 **CVE Explosion**: 25,000+ new CVEs published annually
- 📄 **Research Proliferation**: Hundreds of security papers published monthly
- 🎯 **New Attack Surfaces**: AI/ML systems introduce novel vulnerabilities
- ⏰ **Time Pressure**: Security teams need answers quickly during incidents

### Current Challenges

1. **Finding Relevant Information**: Searching through multiple sources manually
2. **Understanding Context**: Security knowledge is interconnected and hierarchical
3. **Staying Current**: New vulnerabilities and techniques emerge constantly
4. **Connecting Dots**: Relating tactics, techniques, vulnerabilities, and mitigations
5. **AI/ML Security**: Emerging field with scattered knowledge

### Our Solution: AI Security Analyst Assistant

A RAG-powered assistant that:
- ✅ Retrieves relevant security information from authoritative sources
- ✅ Provides context-aware answers with source citations
- ✅ Handles complex, multi-part security questions
- ✅ Filters by severity, recency, and affected systems
- ✅ Navigates hierarchical security knowledge (tactics → techniques → procedures)
- ✅ Understands both high-level concepts and specific technical details

## Use Cases We'll Support

### 1. Threat Intelligence Lookup
**Query**: "What are credential dumping techniques used by APT groups?"

**Response**: Retrieves MITRE ATT&CK techniques (T1003), lists sub-techniques, references APT groups known to use them, and provides detection methods.

---

### 2. Vulnerability Assessment
**Query**: "Show me critical vulnerabilities in PyTorch from the last 6 months"

**Response**: Filters the CVE database by severity (Critical), product (PyTorch), and date range, then returns a prioritized list with CVSS scores and remediation guidance.

---

### 3. AI/ML Security Guidance
**Query**: "How do I defend against prompt injection attacks in my LLM application?"

**Response**: Retrieves OWASP Top 10 for LLMs content, provides defense strategies, links to related vulnerabilities, and suggests validation techniques.

---

### 4. Incident Response
**Query**: "What should I do if I suspect model extraction is happening?"

**Response**: Provides step-by-step incident response guidance, detection methods, containment strategies, and references to relevant research.

---

### 5. Security Framework Navigation
**Query**: "Explain the Credential Access tactic and its most common techniques"

**Response**: Hierarchical view of the tactic, lists techniques with descriptions, shows relationships, and provides real-world examples.

---

### 6. Research Synthesis
**Query**: "What are the latest defenses against adversarial attacks on image classifiers?"

**Response**: Synthesizes information from recent research papers, compares approaches, and provides practical implementation guidance.

## RAG Architecture Overview

### What is RAG?

**Retrieval-Augmented Generation (RAG)** combines:
1. **Information Retrieval**: Finding relevant documents from a knowledge base
2. **Language Generation**: Using an LLM to generate answers based on retrieved context

This solves key LLM limitations:
- ❌ **Knowledge cutoff**: LLMs only know information up to their training date
- ❌ **Hallucination**: LLMs may confidently generate incorrect information
- ❌ **No source attribution**: Can't verify where information comes from

RAG provides:
- ✅ **Current information**: Query up-to-date knowledge bases
- ✅ **Grounded answers**: Responses based on retrieved documents
- ✅ **Source citations**: Transparency about information sources

### Basic RAG Pipeline

```
┌─────────────────┐
│  User Question  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Embedding     │  Convert question to vector
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Vector Search   │  Find similar documents
│  (Similarity)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Retrieve     │  Get top-k documents
│   Documents     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Prompt      │  Context + Question
│  Construction   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   LLM Answer    │  Generate response
│   Generation    │
└─────────────────┘
```
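
To make the diagram concrete, here is a minimal sketch of the same steps in plain Python. It rests on loud assumptions: the "embedding" is just a bag-of-words count vector standing in for a real embedding model, the three documents are a hypothetical mini knowledge base, and the final LLM call is left out, so the sketch stops at the constructed prompt. The names `embed`, `retrieve`, and `build_prompt` are illustrative, not part of any library.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; the real series loads MITRE ATT&CK, OWASP, and CVE text.
DOCS = [
    "Prompt injection manipulates an LLM through crafted inputs.",
    "OS credential dumping extracts account login material from the operating system.",
    "Model theft is unauthorized copying or extraction of a proprietary model.",
]

def embed(text: str) -> Counter:
    """Toy embedding: term counts (a stand-in for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Embed the question, score every document, and keep the top-k."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Combine numbered context passages with the question."""
    joined = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the context below and cite sources by number.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )

question = "What is prompt injection?"
top_docs = retrieve(question, DOCS, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string is what would be sent to the LLM
```

In Part 2 the toy pieces are replaced by real ones: an embedding model for `embed`, a vector store for `retrieve`, and an LLM call on the prompt.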

### Our Advanced RAG Architecture

We'll progressively build toward this sophisticated system:

```
┌──────────────────────────────────────────────────────────────┐
│                     User Question                            │
└──────────────────────┬───────────────────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │   Query Processing        │
         ├───────────────────────────┤
         │ • Multi-Query             │
         │ • Decomposition           │
         │ • Query Structuring       │
         │ • Metadata Extraction     │
         └─────────────┬─────────────┘
                       │
         ┌─────────────┴─────────────┐
         │   Retrieval Methods       │
         ├───────────────────────────┤
         │ • Dense Embeddings        │
         │ • ColBERT (Late Inter.)   │
         │ • RAPTOR (Hierarchical)   │
         │ • Metadata Filtering      │
         └─────────────┬─────────────┘
                       │
         ┌─────────────┴─────────────┐
         │   Re-ranking              │
         ├───────────────────────────┤
         │ • Cohere Rerank           │
         │ • Severity Weighting      │
         │ • Recency Boost           │
         │ • RAG-Fusion              │
         └─────────────┬─────────────┘
                       │
         ┌─────────────┴─────────────┐
         │   Security Validation     │
         ├───────────────────────────┤
         │ • Adversarial Detection   │
         │ • Source Verification     │
         │ • Confidence Scoring      │
         └─────────────┬─────────────┘
                       │
         ┌─────────────┴─────────────┐
         │   Answer Generation       │
         ├───────────────────────────┤
         │ • Context-aware LLM       │
         │ • Source Citations        │
         │ • Confidence Indicators   │
         └───────────────────────────┘
```

## Security Data Sources

We'll use authoritative, publicly available security knowledge bases:

### 1. MITRE ATT&CK Framework

**What**: Globally-accessible knowledge base of adversary tactics and techniques

**Content**:
- 14 Tactics (strategic goals like Credential Access, Lateral Movement)
- 600+ Techniques and Sub-techniques (specific methods to achieve goals)
- Real-world examples from APT groups
- Detection and mitigation strategies

**Why It's Perfect for RAG**:
- Hierarchical structure (Tactics → Techniques → Sub-techniques)
- Rich interconnections between concepts
- Regularly updated with new threat intelligence
- Structured data format (STIX/JSON)

**Example Content**:
```
Tactic: Credential Access (TA0006)
├── Technique: OS Credential Dumping (T1003)
│   ├── Sub-technique: LSASS Memory (T1003.001)
│   ├── Sub-technique: Security Account Manager (T1003.002)
│   └── Sub-technique: NTDS (T1003.003)
```
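
One way to prepare a hierarchy like this for retrieval is to flatten it into records that remember their ancestors, so a chunk about LSASS Memory still knows it belongs to Credential Access. The sketch below hand-codes the example tree above as a nested dict; it does not parse the actual STIX/JSON bundle, and the `kind` and `ancestors` field names are assumptions made for illustration.

```python
# IDs and names come from the example tree above; the dict layout is an assumption.
ATTACK_TREE = {
    "id": "TA0006", "name": "Credential Access", "kind": "tactic",
    "children": [
        {
            "id": "T1003", "name": "OS Credential Dumping", "kind": "technique",
            "children": [
                {"id": "T1003.001", "name": "LSASS Memory", "kind": "sub-technique", "children": []},
                {"id": "T1003.002", "name": "Security Account Manager", "kind": "sub-technique", "children": []},
                {"id": "T1003.003", "name": "NTDS", "kind": "sub-technique", "children": []},
            ],
        },
    ],
}

def flatten(node: dict, path: tuple = ()):
    """Yield one record per node, keeping the chain of ancestor IDs as metadata."""
    yield {
        "id": node["id"],
        "name": node["name"],
        "kind": node["kind"],
        "ancestors": list(path),
    }
    for child in node["children"]:
        yield from flatten(child, path + (node["id"],))

records = list(flatten(ATTACK_TREE))
for record in records:
    indent = "  " * len(record["ancestors"])
    print(f'{indent}{record["name"]} ({record["id"]})')
```

Keeping `ancestors` as metadata means a later filter can ask for "everything under TA0006" without walking the tree again.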

---

### 2. OWASP Top 10 for LLMs

**What**: A standard awareness document for developers and security teams about the most critical risks to LLM applications

**Content** (the ten risks as listed in version 1.1):
1. Prompt Injection
2. Insecure Output Handling
3. Training Data Poisoning
4. Model Denial of Service
5. Supply Chain Vulnerabilities
6. Sensitive Information Disclosure
7. Insecure Plugin Design
8. Excessive Agency
9. Overreliance
10. Model Theft

**Why It's Perfect for RAG**:
- Emerging field with scattered knowledge
- Each vulnerability has descriptions, examples, and mitigations
- Directly relevant to AI/ML security
- Educational content perfect for Q&A

---

### 3. CVE Database (National Vulnerability Database)

**What**: Comprehensive database of standardized vulnerability information

**Content**:
- CVE identifiers (CVE-2024-12345)
- CVSS base scores (0.0-10.0) and severity ratings (Critical, High, Medium, Low)
- Affected products and versions
- Vulnerability descriptions
- References and advisories
- CWE classifications (weakness types)

**Why It's Perfect for RAG**:
- Rich metadata for filtering (severity, date, product)
- Structured data ideal for query structuring
- Time-sensitive information requiring recency
- Large volume requiring intelligent retrieval

**Example Metadata**:
```json
{
  "cve_id": "CVE-2024-1234",
  "cvss_score": 9.8,
  "severity": "CRITICAL",
  "published_date": "2024-03-15",
  "affected_products": ["PyTorch < 2.1.0"],
  "description": "Remote code execution vulnerability..."
}
```
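
Metadata like this is what makes filtered retrieval possible. As a rough sketch, the filter behind a query such as "critical PyTorch vulnerabilities since January 2024" might look like the code below. The `CveRecord` class and `filter_cves` function are hypothetical names, and every record is a placeholder: the first mirrors the example metadata above and the other two are invented purely for illustration, so none of them should be read as real CVE entries.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CveRecord:
    cve_id: str
    cvss_score: float
    severity: str
    published_date: date
    affected_products: list[str]

# Placeholder records for illustration only; these are not real CVE entries.
RECORDS = [
    CveRecord("CVE-2024-1234", 9.8, "CRITICAL", date(2024, 3, 15), ["PyTorch < 2.1.0"]),
    CveRecord("CVE-0000-0001", 5.3, "MEDIUM", date(2024, 2, 1), ["PyTorch < 2.0.0"]),
    CveRecord("CVE-0000-0002", 9.1, "CRITICAL", date(2022, 6, 30), ["ExampleLib < 1.0"]),
]

def filter_cves(records, severity=None, product=None, published_after=None):
    """Keep records that match every filter that was supplied, highest CVSS first."""
    matches = []
    for record in records:
        if severity and record.severity != severity.upper():
            continue
        if product and not any(product.lower() in p.lower() for p in record.affected_products):
            continue
        if published_after and record.published_date < published_after:
            continue
        matches.append(record)
    return sorted(matches, key=lambda r: r.cvss_score, reverse=True)

hits = filter_cves(
    RECORDS,
    severity="critical",
    product="pytorch",
    published_after=date(2024, 1, 1),
)
print([r.cve_id for r in hits])
```

In the series itself this filtering is delegated to the vector store's metadata filters (Part 5); the plain-Python version only shows the logic.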

---

### 4. Security Research Papers (Optional)

**What**: Academic research on adversarial ML, model security, and defenses

**Sources**:
- arXiv (cs.CR - Cryptography and Security)
- arXiv (cs.LG - Machine Learning)
- Conference proceedings (USENIX Security, IEEE S&P, ACM CCS)

**Why It's Useful**:
- Cutting-edge research on emerging threats
- Detailed technical explanations
- Novel defense mechanisms
- Demonstrates research awareness

## Learning Path: 11-Part Journey

We'll build progressively from fundamentals to advanced techniques:

### Part 1: Introduction & Use Case (This Notebook)
Understanding the problem and architecture overview

### Part 2: Basic RAG with Security Data
- Load MITRE ATT&CK and OWASP content
- Create embeddings and vector store
- Build simple retrieval + generation pipeline
- Answer basic security questions

### Part 3: Advanced Retrieval
- Multi-Query: Generate multiple query perspectives
- RAG-Fusion: Combine results with reciprocal rank fusion
- Improve retrieval coverage and quality
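
As a preview, reciprocal rank fusion can be sketched in a few lines: each document earns `1 / (k + rank)` from every ranked list it appears in, and the sums decide the final order. The document IDs below are hypothetical, and `k = 60` is a commonly used default rather than a value fixed by this series.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# One ranked list per generated query variant (hypothetical IDs).
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_a", "doc_d", "doc_b"],
]

fused = reciprocal_rank_fusion(rankings)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

A document that ranks well in several lists beats one that ranks first in only a single list, which is the behavior Multi-Query retrieval relies on.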

### Part 4: Complex Query Handling
- Decompose complex questions into sub-questions
- Answer sequentially or in parallel
- Synthesize comprehensive responses

### Part 5: Metadata & Filtering
- Convert natural language to structured queries
- Filter by severity, date, affected systems
- Production-ready filtered retrieval

### Part 6: Intelligent Ranking
- Cohere Rerank for semantic reranking
- Security-specific ranking (severity, recency)
- Custom scoring functions
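
A custom scoring function of this kind could blend semantic relevance with severity and recency. The sketch below is one possible shape, not the scoring used later in the series: the weights, the one-year half-life, and the `security_score` name are all assumptions chosen for illustration, and the candidate values are made up.

```python
from datetime import date

def security_score(
    relevance: float,      # similarity or reranker score, assumed to be in [0, 1]
    cvss: float,           # CVSS base score in [0, 10]
    published: date,
    today: date,
    w_sev: float = 0.3,    # assumed weight for severity
    w_rec: float = 0.2,    # assumed weight for recency
    half_life_days: float = 365.0,
) -> float:
    """Weighted blend of relevance, normalized severity, and exponentially decayed recency."""
    severity = cvss / 10.0
    age_days = max((today - published).days, 0)
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when new, 0.5 after one half-life
    w_rel = 1.0 - w_sev - w_rec
    return w_rel * relevance + w_sev * severity + w_rec * recency

today = date(2024, 6, 1)  # fixed date so the example is reproducible
candidates = [
    # (label, relevance, cvss, published) with hypothetical values
    ("old-low", 0.75, 3.1, date(2019, 1, 1)),
    ("recent-critical", 0.70, 9.8, date(2024, 5, 1)),
]
ranked = sorted(
    candidates,
    key=lambda c: security_score(c[1], c[2], c[3], today),
    reverse=True,
)
print([label for label, *_ in ranked])
```

Here the slightly less relevant but recent, critical entry overtakes the older low-severity one; tuning the weights is exactly the kind of decision Part 10's evaluation is meant to inform.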

### Part 7: Hierarchical Security Knowledge (RAPTOR)
- Recursive abstractive summarization
- Tree-based knowledge organization
- Multi-level retrieval (tactics → techniques → procedures)

### Part 8: Late Interaction Retrieval (ColBERT)
- Token-level embeddings
- MaxSim scoring
- Well suited to code patterns and technical content
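
MaxSim itself is simple to state: for each query token, take its best similarity against any document token, then sum those maxima. Here is a minimal sketch using hand-made 2-D vectors in place of real per-token embeddings; with unit-length embeddings the dot product equals cosine similarity, but the toy vectors below are not normalized and exist only to make the arithmetic easy to follow.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: sum over query tokens of the max similarity to any doc token."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy per-token vectors (assumed values, not output from a real ColBERT model).
query = [[1.0, 0.0], [0.0, 1.0]]                # two query tokens
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]    # has a good match for both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]                # matches only the first query token well

print(f"doc_a: {maxsim_score(query, doc_a):.2f}")
print(f"doc_b: {maxsim_score(query, doc_b):.2f}")
```

Because every query token has to find its own best match, a document that covers only part of the query scores lower, which is the intuition behind late interaction.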

### Part 9: Security Hardening
- Adversarial query detection
- Source verification
- Confidence scoring
- Output validation

### Part 10: Evaluation & Metrics
- Retrieval metrics (Precision@K, Recall@K, MRR, NDCG)
- Generation metrics (Faithfulness, Relevance, Completeness)
- Benchmarking different configurations
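
Three of the retrieval metrics are short enough to sketch here. The document IDs and relevance judgments below are hypothetical, and the function names are illustrative rather than taken from an evaluation library.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average over queries of 1/rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Hypothetical results for two queries.
retrieved_1, relevant_1 = ["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}
retrieved_2, relevant_2 = ["d9", "d1"], {"d9"}

print(precision_at_k(retrieved_1, relevant_1, k=2))   # 1 of the top 2 is relevant
print(recall_at_k(retrieved_1, relevant_1, k=4))      # 2 of 3 relevant docs are found
print(mean_reciprocal_rank([(retrieved_1, relevant_1), (retrieved_2, relevant_2)]))
```

For the first query the first relevant hit is at rank 2 and for the second it is at rank 1, so the MRR is the average of 1/2 and 1.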

### Part 11: Deployment & Demo
- Streamlit web application
- Interactive demo interface
- Production considerations
- Portfolio presentation

## Example Queries We'll Answer

By the end of this series, our system will handle queries like these:

### Simple Lookups
```
Q: "What is prompt injection?"
A: Retrieves OWASP LLM01 content, provides definition, examples, and mitigations.
```

### Filtered Searches
```
Q: "Show me critical CVEs affecting TensorFlow from 2024"
A: Filters by severity=CRITICAL, product=TensorFlow, year=2024, returns ranked list.
```

### Complex Multi-Part Questions
```
Q: "How do I secure my ML pipeline from data poisoning during training and adversarial 
     attacks during inference?"
A: Decomposes into:
   1. What is data poisoning?
   2. How to defend against data poisoning?
   3. What are adversarial attacks at inference?
   4. How to defend against adversarial attacks?
   Then synthesizes comprehensive answer covering both phases.
```

### Hierarchical Navigation
```
Q: "Explain Credential Access techniques"
A: Uses RAPTOR to provide:
   - High-level tactic overview
   - List of techniques under this tactic
   - Can drill down to specific technique details
```

### Code Pattern Matching (ColBERT)
```
Q: "Find examples of SQL injection vulnerabilities"
A: Uses ColBERT token-level matching to find similar code patterns,
   even with different variable names or slight variations.
```

### Research Synthesis
```
Q: "What are the latest defenses against model extraction attacks?"
A: Retrieves recent research papers, synthesizes approaches,
   compares effectiveness, provides implementation guidance.
```

## Technical Stack

### Core Technologies

**LangChain**: RAG framework providing:
- Document loaders and text splitters
- Embedding and vector store integrations
- Prompt templates and chains
- Retriever abstractions

**OpenAI**: LLM provider for:
- GPT-4 for answer generation
- text-embedding-3-small for embeddings

**ChromaDB**: Vector database for:
- Storing document embeddings
- Fast similarity search
- Metadata filtering

### Advanced Components

**RAGatouille**: ColBERT implementation for:
- Token-level embeddings
- Late interaction retrieval
- Fine-grained matching that can help with technical content

**Cohere**: Re-ranking API for:
- Semantic reranking beyond embeddings
- Cross-encoder approach
- Improved result relevance

**LangSmith**: Observability platform for:
- Tracing RAG pipeline execution
- Debugging retrieval and generation
- Evaluation and benchmarking

### Deployment

**Streamlit**: Web framework for:
- Interactive demo application
- Quick prototyping
- Portfolio presentation

## Environment Setup

Let's verify our environment is ready to go!

In [None]:
# Check Python version (should be 3.10+)
import sys
print(f"Python version: {sys.version}")
assert sys.version_info >= (3, 10), "Python 3.10+ required"

In [None]:
# Install required packages (run this if you haven't already)
# Uncomment the line below to install
# !pip install -r ../requirements.txt

In [None]:
# Import key libraries to verify installation
import os
from pathlib import Path

try:
    import langchain
    print(f"✅ LangChain version: {langchain.__version__}")
except ImportError:
    print("❌ LangChain not installed")

try:
    import openai
    print(f"✅ OpenAI version: {openai.__version__}")
except ImportError:
    print("❌ OpenAI not installed")

try:
    import chromadb
    print(f"✅ ChromaDB version: {chromadb.__version__}")
except ImportError:
    print("❌ ChromaDB not installed")

try:
    import tiktoken
    print("✅ Tiktoken available")
except ImportError:
    print("❌ Tiktoken not installed")

## API Keys Configuration

You'll need the following API keys:

1. **OpenAI API Key** (Required)
   - Get it from: https://platform.openai.com/api-keys
   - Used for: GPT-4 generation and embeddings

2. **Cohere API Key** (Optional, for Part 6)
   - Get it from: https://dashboard.cohere.com/api-keys
   - Used for: Re-ranking

3. **LangSmith API Key** (Optional, highly recommended)
   - Get it from: https://smith.langchain.com/
   - Used for: Tracing, debugging, evaluation

### Option 1: Environment Variables (Recommended)

Create a `.env` file in the project root:
```bash
OPENAI_API_KEY=sk-your-key-here
COHERE_API_KEY=your-key-here
LANGCHAIN_API_KEY=your-key-here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=security-rag-from-scratch
```

### Option 2: Set in Notebook

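If you prefer to set keys directly in the notebook (not recommended for production), assign them to `os.environ` before running the next cell. The cell itself loads the `.env` file from Option 1 and reports which keys are set: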

In [None]:
# Load environment variables from .env file
from dotenv import load_dotenv

# Load from .env file in parent directory
load_dotenv(Path("..") / ".env")

# Verify OpenAI key is set
if os.getenv("OPENAI_API_KEY"):
    print("✅ OpenAI API key found")
else:
    print("❌ OpenAI API key not found")
    print("   Set OPENAI_API_KEY environment variable or create .env file")

# Verify optional keys
if os.getenv("COHERE_API_KEY"):
    print("✅ Cohere API key found")
else:
    print("⚠️  Cohere API key not found (optional, needed for Part 6)")

if os.getenv("LANGCHAIN_API_KEY"):
    print("✅ LangSmith API key found")
else:
    print("⚠️  LangSmith API key not found (optional but recommended for tracing)")

## Quick Test: Your First LLM Call

Let's verify the OpenAI connection with a single LLM call (retrieval comes in Part 2):

In [None]:
# Simple test to verify OpenAI connection
from langchain_openai import ChatOpenAI

try:
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    response = llm.invoke("Say 'Hello from Security RAG!'")
    print("✅ OpenAI connection successful!")
    print(f"Response: {response.content}")
except Exception as e:
    print(f"❌ OpenAI connection failed: {e}")
    print("   Check your OPENAI_API_KEY")

## Project Directory Structure

Let's check the project structure and create any directories that are missing:

In [None]:
# Check project structure
project_root = Path("..").resolve()
print(f"Project root: {project_root}\n")

required_dirs = [
    "notebooks",
    "data",
    "assets",
    "app"
]

for dir_name in required_dirs:
    dir_path = project_root / dir_name
    if dir_path.exists():
        print(f"✅ {dir_name}/ directory exists")
    else:
        print(f"❌ {dir_name}/ directory missing")
        dir_path.mkdir(exist_ok=True)
        print(f"   Created {dir_name}/ directory")

## What's Next?

Now that we understand the problem, use cases, and architecture, we're ready to start building!

### Part 2: Basic RAG with Security Data

In the next notebook, we'll:

1. **Load Security Data**
   - Download and parse OWASP Top 10 for LLMs
   - Load sample MITRE ATT&CK content
   - Understand document structure

2. **Create Embeddings**
   - Split documents into chunks
   - Generate vector embeddings
   - Store in ChromaDB

3. **Build Basic RAG**
   - Create retriever
   - Design security-focused prompts
   - Implement generation pipeline

4. **Answer Questions**
   - Test with security queries
   - Analyze results
   - Identify limitations
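
To give a feel for step 2, splitting documents into chunks can be as simple as a sliding character window with overlap, so text cut at one boundary still appears whole in a neighboring chunk. This is only a sketch under stated assumptions: `chunk_text` is a hypothetical helper, the sizes are arbitrary, and Part 2 is expected to use LangChain's text splitters instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

# Synthetic 500-character text so chunk boundaries are easy to inspect.
sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(sample, chunk_size=200, overlap=50)

print(len(chunks))
print([len(c) for c in chunks])
```

Character windows ignore sentence and section boundaries, which is one reason structure-aware splitters tend to work better on documents such as ATT&CK technique pages.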

### Key Takeaways from Part 1

- ✅ **Problem**: Security professionals face information overload
- ✅ **Solution**: RAG-powered AI assistant for security questions
- ✅ **Data Sources**: MITRE ATT&CK, OWASP, CVE database, research papers
- ✅ **Architecture**: Progressive enhancement from basic to advanced RAG
- ✅ **Journey**: 11 notebooks from fundamentals to deployment
- ✅ **Value**: Portfolio-ready demonstration of RAG expertise

Ready to dive into the code? Let's move to **Part 2: Basic RAG with Security Data**!

## Additional Resources

### RAG Fundamentals
- [LangChain RAG Tutorial](https://python.langchain.com/docs/use_cases/question_answering/)
- [Retrieval-Augmented Generation Paper](https://arxiv.org/abs/2005.11401)

### Security Frameworks
- [MITRE ATT&CK](https://attack.mitre.org/)
- [OWASP Top 10 for LLMs](https://owasp.org/www-project-top-10-for-large-language-model-applications/)
- [National Vulnerability Database](https://nvd.nist.gov/)

### Advanced Techniques
- [RAPTOR Paper](https://arxiv.org/abs/2401.18059) - Recursive Abstractive Processing
- [ColBERT Paper](https://arxiv.org/abs/2004.12832) - Late Interaction Retrieval
- [RAG Survey Paper](https://arxiv.org/abs/2312.10997) - Comprehensive overview

### AI/ML Security
- [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox)
- [CleverHans](https://github.com/cleverhans-lab/cleverhans) - Adversarial examples library
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)