From 486296cbbf1984120b5ad5a8686a2953a95b92ad Mon Sep 17 00:00:00 2001 From: "Gabe@w7dev" Date: Sun, 1 Feb 2026 10:44:09 +0000 Subject: [PATCH 1/3] docs: restructure INITIAL-9 into modular three-phase roadmap Decompose monolithic INITIAL-9 into three specialized technical phases: - INITIAL-9: RAG Knowledge Base ("The Memory") - pgvector + OpenAI embeddings - Markdown/OpenAPI-aware chunking - Semantic retrieval endpoints - INITIAL-10: Agentic Layer ("The Brain") - PydanticAI agents (Experiment Orchestrator, RAG Assistant) - Tool orchestration with structured outputs - Human-in-the-loop approval workflow - INITIAL-11: ForecastLab Dashboard ("The Face") - React 19 + Vite + shadcn/ui - TanStack Table/Query for data management - Recharts for time series visualization - Agent chat interface with streaming Update PHASE-index.md and DAILY-FLOW.md to align with new structure. Co-Authored-By: Claude Opus 4.5 --- INITIAL-10.md | 421 ++++++++++++++++++++++++++++++++++++++++++++ INITIAL-11.md | 417 +++++++++++++++++++++++++++++++++++++++++++ INITIAL-9.md | 352 ++++++++++++++++++++++++++++++++---- docs/DAILY-FLOW.md | 36 ++-- docs/PHASE-index.md | 35 ++-- 5 files changed, 1204 insertions(+), 57 deletions(-) create mode 100644 INITIAL-10.md create mode 100644 INITIAL-11.md diff --git a/INITIAL-10.md b/INITIAL-10.md new file mode 100644 index 00000000..1c510772 --- /dev/null +++ b/INITIAL-10.md @@ -0,0 +1,421 @@ +# INITIAL-10.md — Agentic Layer (The Brain) + +## Architectural Role + +**"The Brain"** - Autonomous decision-making, tool orchestration, and structured outputs using PydanticAI. + +This phase provides intelligent orchestration capabilities: +- Experiment automation (config generation → backtest → deploy) +- RAG-powered Q&A with evidence-grounded answers and citations +- Human-in-the-loop approval for sensitive operations +- Structured, schema-enforced outputs + +--- + +## Tech Stack + +| Component | Technology | Purpose | +|-----------|------------|---------| +| Agent Framework | [PydanticAI](https://ai.pydantic.dev/) | Type-safe agent orchestration | +| Tool System | [Function Tools](https://ai.pydantic.dev/tools/) | API binding | +| Tool Groups | [Toolsets](https://ai.pydantic.dev/toolsets/) | Grouped tool management | +| LLM Provider | Anthropic Claude / OpenAI GPT-4 | Configurable provider | +| Streaming | [PydanticAI Streaming](https://ai.pydantic.dev/results/#streamed-results) | Real-time responses | + +--- + +## FEATURE + +### Experiment Orchestrator Agent +Autonomous experiment workflow management: +- **Tools**: `list_models`, `run_backtest`, `compare_runs`, `create_alias`, `archive_run` +- **Workflow**: Generate configs → Run backtests → Analyze metrics → Select best → Deploy alias +- **Output**: Structured `ExperimentReport` with methodology, results, and recommendations + +### RAG Assistant Agent +Evidence-grounded question answering: +- **Tools**: `retrieve_context` (from INITIAL-9), `format_citation` +- **Workflow**: Parse query → Retrieve chunks → Synthesize answer → Format citations +- **Output**: Structured `RAGResponse` with answer, citations, and confidence score + +### Agent Session Management +- Session state persistence for multi-turn conversations +- Tool call logging with correlation IDs +- Human-in-the-loop approval for sensitive actions +- Graceful LLM API failure handling with retries + +--- + +## ENDPOINTS + +### POST /agents/experiment/run +Execute an experiment workflow with the Orchestrator Agent. + +**Request**: +```json +{ + "objective": "Find the best model configuration for store S001, product P001", + "constraints": { + "model_types": ["moving_average", "seasonal_naive"], + "min_train_size": 60, + "max_splits": 5 + }, + "auto_deploy": false, + "session_id": "optional-session-id" +} +``` + +**Response**: +```json +{ + "session_id": "sess_abc123", + "status": "completed", + "report": { + "objective": "Find the best model configuration for store S001, product P001", + "methodology": "Evaluated 6 configurations using 5-fold expanding window CV", + "experiments_run": 6, + "best_run": { + "run_id": "run_xyz789", + "model_type": "moving_average", + "config": {"window": 14}, + "metrics": { + "mae": 12.5, + "smape": 15.2, + "wape": 0.08 + } + }, + "baseline_comparison": { + "vs_naive": { + "mae_improvement_pct": 23.5, + "smape_improvement_pct": 18.2 + } + }, + "recommendation": "Deploy moving_average with window=14", + "approval_required": true, + "pending_action": "create_alias" + }, + "tool_calls": [ + { + "tool": "list_models", + "args": {}, + "result_summary": "Found 4 model types" + }, + { + "tool": "run_backtest", + "args": {"model_type": "moving_average", "window": 7}, + "result_summary": "MAE: 15.2" + } + ], + "tokens_used": 2450, + "duration_ms": 45000 +} +``` + +### POST /agents/experiment/approve +Approve a pending action from an experiment session. + +**Request**: +```json +{ + "session_id": "sess_abc123", + "action": "create_alias", + "approved": true, + "comment": "Approved for staging deployment" +} +``` + +**Response**: +```json +{ + "session_id": "sess_abc123", + "action": "create_alias", + "status": "executed", + "result": { + "alias_name": "production", + "run_id": "run_xyz789" + } +} +``` + +### POST /agents/rag/query +Query with answer generation using the RAG Assistant Agent. + +**Request**: +```json +{ + "query": "How does the backtesting module prevent data leakage?", + "session_id": "optional-session-id", + "include_sources": true +} +``` + +**Response**: +```json +{ + "session_id": "sess_def456", + "answer": "The backtesting module prevents data leakage through several mechanisms:\n\n1. **Time-based splits only**: The TimeSeriesSplitter uses expanding or sliding window strategies, never random splits.\n\n2. **Gap parameter**: Configurable gap between train and test sets simulates operational latency.\n\n3. **Lag feature validation**: Features are computed with explicit cutoff dates to prevent future data access.", + "confidence": 0.92, + "citations": [ + { + "source_type": "markdown", + "source_path": "docs/PHASE/5-BACKTESTING.md", + "chunk_id": "chunk_abc123", + "snippet": "TimeSeriesSplitter uses time-based splits (expanding/sliding window)...", + "relevance_score": 0.94 + }, + { + "source_type": "markdown", + "source_path": "CLAUDE.md", + "chunk_id": "chunk_def456", + "snippet": "Backtesting uses time-based splits (rolling/expanding), never random split...", + "relevance_score": 0.89 + } + ], + "tokens_used": 1250, + "duration_ms": 3200 +} +``` + +### GET /agents/status/{session_id} +Check agent session status. + +**Response**: +```json +{ + "session_id": "sess_abc123", + "agent_type": "experiment_orchestrator", + "status": "awaiting_approval", + "created_at": "2026-02-01T10:30:00Z", + "last_activity": "2026-02-01T10:35:00Z", + "pending_action": { + "action": "create_alias", + "details": { + "alias_name": "production", + "run_id": "run_xyz789" + } + }, + "tool_calls_count": 8, + "tokens_used": 2450 +} +``` + +### WS /agents/stream +WebSocket endpoint for streaming responses. + +**Client → Server**: +```json +{ + "type": "query", + "agent": "rag_assistant", + "payload": { + "query": "Explain the model registry workflow" + } +} +``` + +**Server → Client (streaming)**: +```json +{"type": "token", "content": "The"} +{"type": "token", "content": " model"} +{"type": "token", "content": " registry"} +{"type": "tool_call", "tool": "retrieve_context", "status": "started"} +{"type": "tool_call", "tool": "retrieve_context", "status": "completed", "summary": "Found 5 relevant chunks"} +{"type": "token", "content": " tracks..."} +{"type": "complete", "session_id": "sess_xyz", "tokens_used": 850} +``` + +--- + +## AGENT DEFINITIONS + +### Experiment Orchestrator Agent + +```python +from pydantic_ai import Agent +from pydantic import BaseModel + +class ExperimentReport(BaseModel): + """Structured output for experiment results.""" + objective: str + methodology: str + experiments_run: int + best_run: RunSummary + baseline_comparison: BaselineComparison + recommendation: str + approval_required: bool + pending_action: str | None + +experiment_agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + result_type=ExperimentReport, + system_prompt="""You are an ML experiment orchestrator for retail demand forecasting. + +Your goal is to find the best model configuration through systematic experimentation. +Always: +1. Start with baseline models (naive, seasonal_naive) +2. Compare against baselines with improvement percentages +3. Use time-based backtesting with appropriate train/test splits +4. Recommend the best configuration with justification +5. Request approval before deployment actions""", + tools=[list_models, run_backtest, compare_runs, create_alias, archive_run] +) +``` + +### RAG Assistant Agent + +```python +class RAGResponse(BaseModel): + """Structured output for RAG queries.""" + answer: str + confidence: float # 0.0 - 1.0 + citations: list[Citation] + insufficient_context: bool = False + +rag_agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + result_type=RAGResponse, + system_prompt="""You are a documentation assistant for ForecastLabAI. + +Your responses must be evidence-grounded: +- Only answer based on retrieved context +- Include citations for all claims +- If context is insufficient, set insufficient_context=True and explain what's missing +- Never hallucinate information not in the retrieved chunks""", + tools=[retrieve_context, format_citation] +) +``` + +--- + +## TOOL DEFINITIONS + +### list_models +```python +@tool +async def list_models(ctx: RunContext[AgentDeps]) -> list[ModelInfo]: + """List available forecasting models with their configurations. + + Use this to discover what model types can be experimented with. + Returns model_type, default_config, and description. + """ + ... +``` + +### run_backtest +```python +@tool +async def run_backtest( + ctx: RunContext[AgentDeps], + model_type: str, + config: dict[str, Any], + store_id: str, + product_id: str, + n_splits: int = 5 +) -> BacktestResult: + """Run a backtest for a model configuration. + + Use this to evaluate model performance with time-series CV. + Returns per-fold and aggregated metrics (MAE, sMAPE, WAPE). + """ + ... +``` + +### retrieve_context +```python +@tool +async def retrieve_context( + ctx: RunContext[AgentDeps], + query: str, + top_k: int = 5 +) -> list[RetrievedChunk]: + """Retrieve relevant documentation chunks for a query. + + Use this before answering any question about the system. + Returns chunks with content, source_path, and relevance_score. + """ + ... +``` + +--- + +## CONFIGURATION (Settings) + +```python +# app/core/config.py additions + +# Agent LLM Configuration +agent_default_model: str = "anthropic:claude-sonnet-4-20250514" +agent_fallback_model: str = "openai:gpt-4o" +agent_temperature: float = 0.1 +agent_max_tokens: int = 4096 + +# Agent Execution Configuration +agent_max_tool_calls: int = 10 +agent_timeout_seconds: int = 120 +agent_retry_attempts: int = 3 +agent_retry_delay_seconds: float = 1.0 + +# Human-in-the-Loop Configuration +agent_require_approval: list[str] = ["create_alias", "archive_run"] +agent_approval_timeout_minutes: int = 60 + +# Streaming Configuration +agent_enable_streaming: bool = True +agent_stream_chunk_size: int = 10 # tokens per chunk + +# Session Configuration +agent_session_ttl_minutes: int = 120 +agent_max_sessions_per_user: int = 5 +``` + +--- + +## SUCCESS CRITERIA + +- [ ] Agents produce schema-enforced structured outputs +- [ ] Tool calls are logged with correlation IDs and timing +- [ ] Human-in-the-loop approval blocks sensitive actions +- [ ] Graceful handling of LLM API failures with retries +- [ ] WebSocket streaming delivers tokens in real-time +- [ ] Session state persists across multiple requests +- [ ] Unit tests with mocked LLM responses +- [ ] Integration tests with real LLM calls (rate-limited) +- [ ] Structured logging for all agent operations +- [ ] Token usage tracked per session for cost monitoring + +--- + +## CROSS-MODULE INTEGRATION + +| Direction | Module | Integration Point | +|-----------|--------|-------------------| +| **← RAG Layer** | INITIAL-9 | Uses `retrieve_context` tool | +| **← Registry** | Phase 6 | Uses `list_runs`, `compare_runs`, `create_alias` tools | +| **← Backtesting** | Phase 5 | Uses `run_backtest` tool | +| **← Forecasting** | Phase 4 | Uses `list_models`, `train_model` tools | +| **→ Dashboard** | INITIAL-11 | Provides chat interface backend | +| **→ Jobs** | Phase 7 | Creates job records for audit trail | + +--- + +## DOCUMENTATION LINKS + +- [PydanticAI Documentation](https://ai.pydantic.dev/) +- [PydanticAI Agents](https://ai.pydantic.dev/agents/) +- [PydanticAI Tools](https://ai.pydantic.dev/tools/) +- [PydanticAI Toolsets](https://ai.pydantic.dev/toolsets/) +- [PydanticAI Built-in Tools](https://ai.pydantic.dev/builtin-tools/) +- [PydanticAI Streaming Results](https://ai.pydantic.dev/results/#streamed-results) +- [PydanticAI GitHub](https://github.com/pydantic/pydantic-ai) +- [Anthropic Claude API](https://docs.anthropic.com/en/api) + +--- + +## OTHER CONSIDERATIONS + +- **Structured Outputs**: All agent responses are Pydantic models, never raw text +- **Tool Docstrings**: Follow guidance in CLAUDE.md for agent-optimized tool documentation +- **Cost Control**: Track and limit token usage per session +- **Audit Trail**: All tool calls logged with request correlation for debugging +- **Fallback Provider**: Automatic failover to fallback model on primary failure +- **Approval Workflow**: Pending actions expire after `agent_approval_timeout_minutes` diff --git a/INITIAL-11.md b/INITIAL-11.md new file mode 100644 index 00000000..3138f3c6 --- /dev/null +++ b/INITIAL-11.md @@ -0,0 +1,417 @@ +# INITIAL-11.md — ForecastLab Dashboard (The Face) + +## Architectural Role + +**"The Face"** - User interface, data visualization, and agent interaction using React 19 + shadcn/ui. + +This phase provides the visual layer for: +- Data exploration with server-side pagination and filtering +- Time series visualization with interactive charts +- Agent chat interface with streaming responses +- Admin panel for system management + +--- + +## Tech Stack + +| Component | Technology | Purpose | +|-----------|------------|---------| +| Framework | React 19 + [Vite](https://vite.dev/) | Fast build, HMR | +| Components | [shadcn/ui](https://ui.shadcn.com/) | Accessible, customizable UI | +| Data Tables | [TanStack Table](https://tanstack.com/table/latest) | Server-side data grids | +| Data Fetching | [TanStack Query](https://tanstack.com/query/latest) | Caching, invalidation | +| Charts | [Recharts](https://recharts.org/) | Time series visualization | +| Styling | Tailwind CSS 4 | Utility-first CSS | +| State | React 19 `use()` + TanStack Query | Server state management | + +--- + +## FEATURE + +### Data Explorer +Interactive data tables with full server-side capabilities: +- **Tables**: Sales, Stores, Products, Model Runs, Jobs +- **Features**: Pagination, sorting, filtering, column visibility +- **Export**: CSV download for selected/all rows +- **Pattern**: [shadcn/ui Data Table](https://ui.shadcn.com/docs/components/data-table) + +### Time Series Visualizers +Charts for forecasting analysis: +- **Actual vs Predicted**: Line chart with confidence intervals +- **Backtest Folds**: Train/test split visualization +- **Metric Comparison**: Bar charts for model comparison +- **Interactive**: Tooltips, zoom, pan, brush selection + +### Agent Chat Interface +Real-time interaction with AI agents: +- **Streaming**: WebSocket-based token streaming +- **Citations**: Rendered with source links +- **Tool Calls**: Collapsible visualization of agent actions +- **History**: Session sidebar with conversation threads + +### Admin Panel +System management and monitoring: +- **RAG Sources**: Index/delete documentation sources +- **Model Aliases**: Manage deployment aliases +- **Health Dashboard**: Service status, recent errors +- **Job Monitor**: Active and historical job status + +--- + +## PAGE STRUCTURE + +### /dashboard +Main dashboard with KPI summary cards and quick actions. + +### /explorer/sales +Sales data explorer with date range filtering. + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Sales Explorer [Export] │ +├─────────────────────────────────────────────────────────────┤ +│ Filters: [Date Range] [Store ▼] [Product ▼] [Search...] │ +├─────────────────────────────────────────────────────────────┤ +│ Date │ Store │ Product │ Quantity │ Revenue │ +│ 2026-01-15 │ S001 │ P001 │ 150 │ $2,250.00 │ +│ 2026-01-15 │ S001 │ P002 │ 75 │ $1,125.00 │ +│ ... │ ... │ ... │ ... │ ... │ +├─────────────────────────────────────────────────────────────┤ +│ Page 1 of 50 │ [< Prev] [1] [2] [3] ... [50] [Next >] │ +└─────────────────────────────────────────────────────────────┘ +``` + +### /explorer/runs +Model run explorer with comparison capabilities. + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Model Runs [Compare Selected] │ +├─────────────────────────────────────────────────────────────┤ +│ [☐] │ Run ID │ Model │ Status │ MAE │ Created │ +│ [☐] │ run_abc │ MA(14) │ SUCCESS │ 12.5 │ 2h ago │ +│ [☐] │ run_def │ SN(7) │ SUCCESS │ 15.2 │ 3h ago │ +│ [☐] │ run_ghi │ Naive │ SUCCESS │ 18.9 │ 5h ago │ +├─────────────────────────────────────────────────────────────┤ +│ Showing 3 of 127 runs │ +└─────────────────────────────────────────────────────────────┘ +``` + +### /visualize/forecast +Forecast visualization with actual vs predicted overlay. + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Forecast: Store S001, Product P001 │ +├─────────────────────────────────────────────────────────────┤ +│ [Store ▼] [Product ▼] [Model Run ▼] [Date Range] │ +├─────────────────────────────────────────────────────────────┤ +│ │ +│ 200 ─┤ ╭────── │ +│ │ ╭────╯ Predicted │ +│ 150 ─┤ ╭────╯ │ +│ │ ╭────╯ ───── Actual │ +│ 100 ─┤ ╭────╯ - - - Confidence │ +│ │ ╭────╯ │ +│ 50 ─┤ ╭────╯ │ +│ │─╯ │ +│ 0 ─┼──────────────────────────────────────────────── │ +│ Jan 1 Jan 15 Feb 1 Feb 15 Mar 1 │ +│ │ +├─────────────────────────────────────────────────────────────┤ +│ MAE: 12.5 │ sMAPE: 15.2% │ WAPE: 8.1% │ Bias: -2.3 │ +└─────────────────────────────────────────────────────────────┘ +``` + +### /visualize/backtest +Backtest fold visualization. + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Backtest: run_abc123 (5-fold Expanding Window) │ +├─────────────────────────────────────────────────────────────┤ +│ │ +│ Fold 1: ████████████░░░░ MAE: 14.2 sMAPE: 16.8% │ +│ Fold 2: █████████████████░░░░ MAE: 13.1 sMAPE: 15.4% │ +│ Fold 3: ███████████████████████░░░░ MAE: 12.8 sMAPE: 14.9│ +│ Fold 4: █████████████████████████████░░░░ MAE: 11.9 │ +│ Fold 5: ███████████████████████████████████░░░░ MAE: 11.2│ +│ │ +│ █ Train ░ Test │ +├─────────────────────────────────────────────────────────────┤ +│ Aggregated: MAE: 12.6 ± 1.1 │ Stability: 0.91 │ +└─────────────────────────────────────────────────────────────┘ +``` + +### /chat +Agent chat interface with streaming. + +``` +┌─────────────────────────────────────────────────────────────┐ +│ ForecastLab Assistant │ +├────────────┬────────────────────────────────────────────────┤ +│ Sessions │ │ +│ ─────────│ How does backtesting prevent data leakage? │ +│ Today │ │ +│ ◉ Current │ The backtesting module prevents data leakage │ +│ ○ 10:30am │ through several mechanisms: │ +│ ○ 9:15am │ │ +│ Yesterday │ 1. **Time-based splits**: Uses expanding... │ +│ ○ 4:45pm │ │ +│ │ 📚 Citations: │ +│ │ [1] docs/PHASE/5-BACKTESTING.md │ +│ │ [2] CLAUDE.md │ +│ │ │ +│ │ ────────────────────────────────────────── │ +│ │ 🔧 Tool: retrieve_context (5 chunks found) │ +│ │ ────────────────────────────────────────── │ +├────────────┴────────────────────────────────────────────────┤ +│ [Type your question...] [Send ➤] │ +└─────────────────────────────────────────────────────────────┘ +``` + +### /admin +Admin panel for system management. + +--- + +## COMPONENTS + +### DataTable (shadcn/ui pattern) + +```tsx +// components/data-table/data-table.tsx +import { + ColumnDef, + flexRender, + getCoreRowModel, + useReactTable, +} from "@tanstack/react-table" + +interface DataTableProps { + columns: ColumnDef[] + data: TData[] + pageCount: number + pageIndex: number + pageSize: number + onPaginationChange: (pagination: PaginationState) => void + onSortingChange: (sorting: SortingState) => void + onFilterChange: (filters: ColumnFiltersState) => void +} + +export function DataTable({ + columns, + data, + pageCount, + ...props +}: DataTableProps) { + const table = useReactTable({ + data, + columns, + pageCount, + manualPagination: true, + manualSorting: true, + manualFiltering: true, + getCoreRowModel: getCoreRowModel(), + // ... + }) + + return ( + + ... + ... +
+ ) +} +``` + +### TimeSeriesChart + +```tsx +// components/charts/time-series-chart.tsx +import { LineChart, Line, XAxis, YAxis, Tooltip, Legend } from 'recharts' + +interface TimeSeriesChartProps { + data: { date: string; actual: number; predicted?: number }[] + showConfidence?: boolean + height?: number +} + +export function TimeSeriesChart({ data, showConfidence, height = 400 }: TimeSeriesChartProps) { + return ( + + + + + + + {data[0]?.predicted !== undefined && ( + + )} + + ) +} +``` + +### ChatMessage + +```tsx +// components/chat/chat-message.tsx +interface ChatMessageProps { + role: 'user' | 'assistant' + content: string + citations?: Citation[] + toolCalls?: ToolCall[] + isStreaming?: boolean +} + +export function ChatMessage({ role, content, citations, toolCalls, isStreaming }: ChatMessageProps) { + return ( +
+
+ {content} + {isStreaming && } + {citations && } + {toolCalls && } +
+
+ ) +} +``` + +--- + +## API HOOKS (TanStack Query) + +```tsx +// hooks/use-sales.ts +export function useSales(params: SalesQueryParams) { + return useQuery({ + queryKey: ['sales', params], + queryFn: () => api.get('/analytics/drilldowns', { params }), + placeholderData: keepPreviousData, + }) +} + +// hooks/use-runs.ts +export function useRuns(params: RunsQueryParams) { + return useQuery({ + queryKey: ['runs', params], + queryFn: () => api.get('/registry/runs', { params }), + }) +} + +// hooks/use-chat.ts +export function useChat(sessionId?: string) { + const [messages, setMessages] = useState([]) + const ws = useWebSocket(`${WS_URL}/agents/stream`) + + const sendMessage = useCallback((content: string) => { + ws.send(JSON.stringify({ type: 'query', agent: 'rag_assistant', payload: { query: content } })) + }, [ws]) + + return { messages, sendMessage, isConnected: ws.readyState === WebSocket.OPEN } +} +``` + +--- + +## CONFIGURATION (Environment) + +```env +# .env.example for frontend + +# API Configuration +VITE_API_BASE_URL=http://localhost:8123 +VITE_WS_URL=ws://localhost:8123/agents/stream + +# Feature Flags +VITE_ENABLE_AGENT_CHAT=true +VITE_ENABLE_ADMIN_PANEL=true + +# Visualization +VITE_DEFAULT_PAGE_SIZE=25 +VITE_MAX_CHART_POINTS=365 +``` + +--- + +## EXAMPLES + +### examples/ui/README.md +```markdown +# Dashboard Page Map + +| Page | API Endpoints | Description | +|------|---------------|-------------| +| /dashboard | GET /analytics/kpis | KPI summary cards | +| /explorer/sales | GET /analytics/drilldowns | Sales data table | +| /explorer/runs | GET /registry/runs | Model run table | +| /visualize/forecast | GET /forecasting/predict | Forecast chart | +| /visualize/backtest | GET /backtesting/results/{run_id} | Fold visualization | +| /chat | WS /agents/stream | Agent chat | +| /admin | GET /rag/sources, GET /registry/aliases | Admin panel | + +## Running the Dashboard + +\`\`\`bash +cd frontend +pnpm install +pnpm dev +\`\`\` + +Open http://localhost:5173 +``` + +--- + +## SUCCESS CRITERIA + +- [ ] Data tables handle 10k+ rows with virtual scrolling +- [ ] Server-side pagination, sorting, filtering all functional +- [ ] Charts render smoothly with 365+ data points +- [ ] WebSocket chat shows streaming tokens in real-time +- [ ] Citations render as clickable source links +- [ ] Tool calls displayed in collapsible sections +- [ ] Responsive design works on tablet and mobile +- [ ] Lighthouse performance score > 90 +- [ ] Accessibility: keyboard navigation, screen reader support +- [ ] Dark/light theme toggle + +--- + +## CROSS-MODULE INTEGRATION + +| Direction | Module | Integration Point | +|-----------|--------|-------------------| +| **← RAG Layer** | INITIAL-9 | Displays indexed sources, allows re-indexing | +| **← Agentic Layer** | INITIAL-10 | Chat interface, experiment status display | +| **← Registry** | Phase 6 | Run leaderboard, comparison views | +| **← Analytics** | Phase 7 | KPI dashboard, drilldown charts | +| **← Jobs** | Phase 7 | Job status monitoring | +| **← Dimensions** | Phase 7 | Store/product selectors | + +--- + +## DOCUMENTATION LINKS + +- [shadcn/ui Documentation](https://ui.shadcn.com/) +- [shadcn/ui Data Table](https://ui.shadcn.com/docs/components/data-table) +- [shadcn/ui Table](https://ui.shadcn.com/docs/components/table) +- [TanStack Table](https://tanstack.com/table/latest) +- [TanStack Query](https://tanstack.com/query/latest) +- [Recharts](https://recharts.org/) +- [Vite Documentation](https://vite.dev/) +- [React 19 Documentation](https://react.dev/) +- [Tailwind CSS 4](https://tailwindcss.com/) + +--- + +## OTHER CONSIDERATIONS + +- **No Hardcoded URLs**: API base URL from environment variable only +- **Error Boundaries**: Graceful error handling with retry options +- **Loading States**: Skeleton components for all async data +- **Optimistic Updates**: Instant UI feedback for mutations +- **Caching**: TanStack Query manages cache invalidation +- **Bundle Size**: Code splitting per route for fast initial load diff --git a/INITIAL-9.md b/INITIAL-9.md index e82c4453..da491760 100644 --- a/INITIAL-9.md +++ b/INITIAL-9.md @@ -1,33 +1,319 @@ -# INITIAL-9.md — Dashboard + RAG + Agentic Layer (PydanticAI) - -## FEATURE: -- Dashboard (React + Vite + shadcn/ui Data Table): - - Data Explorer (tables, filters, export) - - Model Runs (leaderboard, compare) - - Train & Predict (forms, status) - - Predictions (tabular view) -- RAG assistant (pgvector): - - indexed sources: README.md, /docs/*, OpenAPI export, run reports - - retrieve top-k → answer with citations -- Optional PydanticAI: - - agent with tools: - - experiment orchestrator (generate configs → backtest → select best → report) - - rag assistant (query → retrieve → structured answer) - - enforced structured outputs - -## EXAMPLES: -- `examples/ui/README.md` — page map + API mapping (no hardcoded base URL). -- `examples/rag/index_docs.py` — chunk+embed+store (Settings-driven). -- `examples/rag/query.http` — Q&A returning a citations schema. -- `examples/agent/` — best-practice agent setup (providers, tools, dependencies). - -## DOCUMENTATION: -- shadcn/ui Data Table pattern + TanStack Table -- pgvector similarity search + indexing strategies -- PydanticAI docs (include link in README as a code block) - -## OTHER CONSIDERATIONS: -- Required: `.env.example` for frontend (`VITE_API_BASE_URL`). -- RAG must be evidence-grounded: if no support, return “not found” (no hallucinations). -- Stable citation schema: source_type, source_id/path, chunk_id, snippet/span. -- Embedding model + dimension must come from Settings (never hardcoded). +# INITIAL-9.md — RAG Knowledge Base (The Memory) + +## Architectural Role + +**"The Memory"** - Vector storage, document ingestion, and semantic retrieval infrastructure. + +This phase provides the foundational knowledge layer that enables: +- Indexed documentation and run reports for AI-assisted search +- Semantic retrieval with relevance scoring +- Evidence-grounded context for the Agentic Layer (INITIAL-10) + +--- + +## Tech Stack + +| Component | Technology | Purpose | +|-----------|------------|---------| +| Vector Store | PostgreSQL 16 + [pgvector](https://github.com/pgvector/pgvector) | Similarity search | +| Embeddings | [OpenAI text-embedding-3-small](https://platform.openai.com/docs/models/text-embedding-3-small) | 1536-dim vectors (configurable) | +| Chunking | Markdown-aware, OpenAPI endpoint-aware | Semantic boundaries | +| Index Type | HNSW (default) or IVFFlat | Approximate nearest neighbor | + +--- + +## FEATURE + +### Database Layer +- `document_chunk` table with vector column (`embedding VECTOR(1536)`) +- HNSW index for cosine similarity search +- Unique constraint `(source_id, chunk_index)` for idempotent re-indexing +- Metadata JSONB for source type, heading hierarchy, timestamps + +### Ingestion Pipeline +- **Markdown Chunker**: Heading-aware splitting (configurable size/overlap) +- **OpenAPI Chunker**: Endpoint-based granularity (one chunk per operation) +- **Embedding Service**: Async batch processing with rate limiting +- **Source Registry**: Track indexed sources with version/hash for change detection + +### Retrieval Engine +- Top-k semantic search with configurable similarity threshold +- Metadata filtering (source_type, date_range, tags) +- Relevance score normalization (0.0 - 1.0) +- Context window assembly for downstream consumption + +--- + +## ENDPOINTS + +### POST /rag/index +Index documents from various sources. + +**Request**: +```json +{ + "source_type": "markdown", + "source_path": "docs/ARCHITECTURE.md", + "metadata": { + "category": "documentation", + "version": "1.0.0" + } +} +``` + +**Response**: +```json +{ + "source_id": "src_abc123", + "chunks_created": 15, + "tokens_processed": 4250, + "duration_ms": 1234.56, + "status": "indexed" +} +``` + +### POST /rag/retrieve +Semantic search across indexed documents. + +**Request**: +```json +{ + "query": "How does backtesting prevent data leakage?", + "top_k": 5, + "similarity_threshold": 0.7, + "filters": { + "source_type": ["markdown", "openapi"], + "category": "documentation" + } +} +``` + +**Response**: +```json +{ + "results": [ + { + "chunk_id": "chunk_xyz789", + "source_id": "src_abc123", + "source_path": "docs/PHASE/5-BACKTESTING.md", + "content": "TimeSeriesSplitter uses time-based splits (expanding/sliding window) to prevent leakage...", + "relevance_score": 0.92, + "metadata": { + "heading": "Leakage Prevention", + "section_path": ["Phase 5: Backtesting", "Implementation", "Leakage Prevention"] + } + } + ], + "query_embedding_time_ms": 45.2, + "search_time_ms": 12.8, + "total_chunks_searched": 1250 +} +``` + +### GET /rag/sources +List all indexed sources with metadata. + +**Response**: +```json +{ + "sources": [ + { + "source_id": "src_abc123", + "source_type": "markdown", + "source_path": "docs/ARCHITECTURE.md", + "chunk_count": 15, + "indexed_at": "2026-02-01T10:30:00Z", + "content_hash": "sha256:abc123..." + } + ], + "total_sources": 12, + "total_chunks": 450 +} +``` + +### DELETE /rag/sources/{source_id} +Remove an indexed source and all its chunks. + +**Response**: +```json +{ + "source_id": "src_abc123", + "chunks_deleted": 15, + "status": "deleted" +} +``` + +--- + +## DATABASE SCHEMA + +```sql +-- Enable pgvector extension +CREATE EXTENSION IF NOT EXISTS vector; + +-- Document source registry +CREATE TABLE document_source ( + source_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + source_type VARCHAR(50) NOT NULL, -- 'markdown', 'openapi', 'run_report' + source_path TEXT NOT NULL, + content_hash VARCHAR(64) NOT NULL, -- SHA-256 for change detection + metadata JSONB DEFAULT '{}', + indexed_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE (source_type, source_path) +); + +-- Document chunks with embeddings +CREATE TABLE document_chunk ( + chunk_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + source_id UUID NOT NULL REFERENCES document_source(source_id) ON DELETE CASCADE, + chunk_index INTEGER NOT NULL, + content TEXT NOT NULL, + embedding VECTOR(1536), -- Configurable dimension + token_count INTEGER NOT NULL, + metadata JSONB DEFAULT '{}', -- heading, section_path, etc. + created_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE (source_id, chunk_index) +); + +-- HNSW index for cosine similarity +CREATE INDEX idx_chunk_embedding_hnsw +ON document_chunk +USING hnsw (embedding vector_cosine_ops) +WITH (m = 16, ef_construction = 64); + +-- Metadata filtering index +CREATE INDEX idx_chunk_metadata ON document_chunk USING gin (metadata); +``` + +--- + +## EXAMPLES + +### examples/rag/index_docs.py +```python +"""Index documentation into RAG knowledge base.""" +import asyncio +from pathlib import Path +import httpx + +async def index_markdown_docs(): + """Index all markdown docs from docs/ directory.""" + async with httpx.AsyncClient(base_url="http://localhost:8123") as client: + docs_dir = Path("docs") + for md_file in docs_dir.rglob("*.md"): + response = await client.post( + "/rag/index", + json={ + "source_type": "markdown", + "source_path": str(md_file), + "metadata": {"category": "documentation"} + } + ) + result = response.json() + print(f"Indexed {md_file}: {result['chunks_created']} chunks") + +if __name__ == "__main__": + asyncio.run(index_markdown_docs()) +``` + +### examples/rag/query.http +```http +### Semantic search query +POST http://localhost:8123/rag/retrieve +Content-Type: application/json + +{ + "query": "How do I configure backtesting splits?", + "top_k": 5, + "similarity_threshold": 0.7 +} + +### List all indexed sources +GET http://localhost:8123/rag/sources + +### Re-index after documentation update +POST http://localhost:8123/rag/index +Content-Type: application/json + +{ + "source_type": "markdown", + "source_path": "README.md", + "metadata": {"category": "overview"} +} +``` + +--- + +## CONFIGURATION (Settings) + +```python +# app/core/config.py additions + +# RAG Embedding Configuration +rag_embedding_model: str = "text-embedding-3-small" +rag_embedding_dimension: int = 1536 +rag_embedding_batch_size: int = 100 + +# RAG Chunking Configuration +rag_chunk_size: int = 512 # tokens +rag_chunk_overlap: int = 50 # tokens +rag_min_chunk_size: int = 100 # minimum tokens per chunk + +# RAG Retrieval Configuration +rag_top_k: int = 5 +rag_similarity_threshold: float = 0.7 +rag_max_context_tokens: int = 4000 + +# RAG Index Configuration +rag_index_type: Literal["hnsw", "ivfflat"] = "hnsw" +rag_hnsw_m: int = 16 +rag_hnsw_ef_construction: int = 64 +``` + +--- + +## SUCCESS CRITERIA + +- [ ] pgvector extension enabled and tested in docker-compose +- [ ] Markdown chunker respects heading boundaries +- [ ] OpenAPI chunker produces one chunk per endpoint +- [ ] Embeddings generated via async batch processing +- [ ] Retrieval returns top-k with normalized relevance scores +- [ ] Re-indexing is idempotent (content_hash change detection) +- [ ] Unique constraint prevents duplicate chunks +- [ ] HNSW index provides sub-100ms search latency +- [ ] Integration tests with real embeddings (mocked in unit tests) +- [ ] Structured logging for all index/retrieve operations + +--- + +## CROSS-MODULE INTEGRATION + +| Direction | Module | Integration Point | +|-----------|--------|-------------------| +| **→ Agentic Layer** | INITIAL-10 | Provides `retrieve_context` tool for RAG Assistant agent | +| **→ Dashboard** | INITIAL-11 | Sources list displayed in Admin panel | +| **← Registry** | Phase 6 | Run reports indexed as knowledge sources | +| **← Jobs** | Phase 7 | Indexing operations tracked as jobs | + +--- + +## DOCUMENTATION LINKS + +- [pgvector GitHub](https://github.com/pgvector/pgvector) +- [pgvector Tutorial (DataCamp)](https://www.datacamp.com/tutorial/pgvector-tutorial) +- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings) +- [OpenAI API Reference](https://platform.openai.com/docs/api-reference/embeddings) +- [Neon pgvector Docs](https://neon.com/docs/extensions/pgvector) +- [HNSW Algorithm Paper](https://arxiv.org/abs/1603.09320) + +--- + +## OTHER CONSIDERATIONS + +- **Evidence-Grounded**: Retrieval returns raw chunks only; no answer generation in this layer +- **Idempotency**: Content hash comparison prevents unnecessary re-embedding +- **Rate Limiting**: Respect OpenAI API rate limits during batch embedding +- **Cost Tracking**: Log token counts for embedding cost monitoring +- **Dimension Flexibility**: Support for other embedding models (e.g., 3072-dim text-embedding-3-large) diff --git a/docs/DAILY-FLOW.md b/docs/DAILY-FLOW.md index 66521dbc..7ecba511 100644 --- a/docs/DAILY-FLOW.md +++ b/docs/DAILY-FLOW.md @@ -162,21 +162,29 @@ gh run watch --- -## Következő Phase: Forecasting (PRP-5) +## Következő Phases (INITIAL-9 → INITIAL-11) -```bash -# Kezdés -git checkout dev -git pull origin dev -git checkout -b feat/prp-5-forecasting +A projekt a moduláris három-fázisú roadmap szerint halad: -# Fejlesztés... -# PR → dev → main → release → phase-4 snapshot ``` +Phase 8: RAG Knowledge Base ("The Memory") + ↓ +Phase 9: Agentic Layer ("The Brain") + ↓ +Phase 10: ForecastLab Dashboard ("The Face") +``` + +### Phase 8: RAG Knowledge Base (INITIAL-9) +- pgvector embeddings + semantic retrieval +- Markdown/OpenAPI chunking +- POST /rag/index, POST /rag/retrieve endpoints + +### Phase 9: Agentic Layer (INITIAL-10) +- PydanticAI agents (Experiment Orchestrator, RAG Assistant) +- Tool orchestration + structured outputs +- WebSocket streaming -### PRP-5 Scope (INITIAL-5) -- Model zoo: naive, seasonal naive, moving average -- Unified model interface: fit/predict, serialize/load -- Scikit-learn Pipeline: Scaling → Encoding → Regressor -- Joblib-based ModelBundle persistence -- Multi-horizon recursive forecasting +### Phase 10: Dashboard (INITIAL-11) +- React 19 + Vite + shadcn/ui +- Data tables + time series charts +- Agent chat interface diff --git a/docs/PHASE-index.md b/docs/PHASE-index.md index 280fa43b..b655d0c9 100644 --- a/docs/PHASE-index.md +++ b/docs/PHASE-index.md @@ -17,8 +17,8 @@ This document indexes all implementation phases of the ForecastLabAI project. | 6 | Model Registry | Completed | PRP-7 | [6-MODEL_REGISTRY.md](./PHASE/6-MODEL_REGISTRY.md) | | 7 | Serving Layer | Completed | PRP-8 | [7-SERVING_LAYER.md](./PHASE/7-SERVING_LAYER.md) | | 8 | RAG Knowledge Base | Pending | PRP-9 | - | -| 9 | Dashboard | Pending | PRP-10 | - | -| 10 | Agentic Layer | Pending | - | - | +| 9 | Agentic Layer | Pending | PRP-10 | - | +| 10 | ForecastLab Dashboard | Pending | PRP-11 | - | --- @@ -277,14 +277,29 @@ jobs_retention_days: int = 30 ## Pending Phases -### Phase 8: RAG Knowledge Base -pgvector embeddings with evidence-grounded answers and citations. - -### Phase 9: Dashboard -React + Vite + shadcn/ui frontend with data tables and visualizations. - -### Phase 10: Agentic Layer (Optional) -PydanticAI integration for experiment orchestration. +### Phase 8: RAG Knowledge Base ("The Memory") +Vector storage, document ingestion, and semantic retrieval infrastructure. +- PostgreSQL 16 + pgvector extension +- OpenAI text-embedding-3-small embeddings (1536 dimensions) +- Markdown-aware and OpenAPI endpoint-aware chunking +- HNSW index for cosine similarity search +- Endpoints: POST /rag/index, POST /rag/retrieve, GET /rag/sources, DELETE /rag/sources/{id} + +### Phase 9: Agentic Layer ("The Brain") +Autonomous decision-making, tool orchestration, and structured outputs using PydanticAI. +- Experiment Orchestrator Agent (backtest → compare → deploy workflow) +- RAG Assistant Agent (query → retrieve → answer with citations) +- Human-in-the-loop approval for sensitive operations +- WebSocket streaming for real-time responses +- Endpoints: POST /agents/experiment/run, POST /agents/rag/query, WS /agents/stream + +### Phase 10: ForecastLab Dashboard ("The Face") +User interface, data visualization, and agent interaction. +- React 19 + Vite + shadcn/ui + Tailwind CSS 4 +- TanStack Table for server-side data grids +- TanStack Query for data fetching and caching +- Recharts for time series visualization +- Agent chat interface with streaming and citations --- From e6a8fdaac564636bc1d778e9651b7dc128bcd9e4 Mon Sep 17 00:00:00 2001 From: "Gabe@w7dev" Date: Sun, 1 Feb 2026 10:55:42 +0000 Subject: [PATCH 2/3] docs(prp): add PRP-9 RAG Knowledge Base implementation plan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Comprehensive PRP for INITIAL-9 RAG Knowledge Base feature: - pgvector + SQLAlchemy 2.0 integration patterns - Markdown-aware and OpenAPI-aware chunking - Async OpenAI embeddings with batch processing - HNSW index for cosine similarity search - 15 ordered implementation tasks - 5-level validation loop (syntax → types → unit → integration → smoke) - Full ORM models and Pydantic schemas - Known gotchas and anti-patterns documented Confidence score: 8.5/10 Co-Authored-By: Claude Opus 4.5 --- PRPs/PRP-9-rag-knowledge-base.md | 776 +++++++++++++++++++++++++++++++ 1 file changed, 776 insertions(+) create mode 100644 PRPs/PRP-9-rag-knowledge-base.md diff --git a/PRPs/PRP-9-rag-knowledge-base.md b/PRPs/PRP-9-rag-knowledge-base.md new file mode 100644 index 00000000..011ef88b --- /dev/null +++ b/PRPs/PRP-9-rag-knowledge-base.md @@ -0,0 +1,776 @@ +# PRP-9: RAG Knowledge Base ("The Memory") + +**Feature**: INITIAL-9.md — RAG Knowledge Base +**Status**: Ready for Implementation +**Confidence Score**: 8.5/10 + +--- + +## Goal + +Build the RAG Knowledge Base layer providing: +1. **Document ingestion** with markdown-aware and OpenAPI-aware chunking +2. **Vector storage** using PostgreSQL + pgvector for embeddings +3. **Semantic retrieval** with configurable top-k and similarity thresholds +4. **Idempotent re-indexing** via content hash comparison + +This is the foundational "Memory" layer that INITIAL-10 (Agentic Layer) will consume via the `retrieve_context` tool. + +--- + +## Why + +- **Agent-Ready**: Provides `retrieve_context` tool for INITIAL-10 RAG Assistant +- **Evidence-Grounded**: Returns raw chunks with citations (no hallucination) +- **Cost-Effective**: Uses existing PostgreSQL (no new infrastructure) +- **Portfolio Value**: Demonstrates full-stack RAG implementation + +--- + +## What + +### Endpoints + +| Method | Path | Description | +|--------|------|-------------| +| `POST` | `/rag/index` | Index document (markdown/openapi) | +| `POST` | `/rag/retrieve` | Semantic search with filters | +| `GET` | `/rag/sources` | List indexed sources | +| `DELETE` | `/rag/sources/{source_id}` | Remove source and chunks | + +### Success Criteria + +- [ ] pgvector extension enabled via migration +- [ ] Markdown chunker respects heading boundaries +- [ ] OpenAPI chunker produces one chunk per endpoint +- [ ] Async batch embedding with OpenAI API +- [ ] HNSW index for sub-100ms retrieval +- [ ] Idempotent re-indexing (content_hash change detection) +- [ ] 80+ unit tests, 15+ integration tests +- [ ] All validation gates green (ruff, mypy, pyright, pytest) + +--- + +## All Needed Context + +### Documentation & References + +```yaml +# CRITICAL - pgvector SQLAlchemy Integration +- url: https://github.com/pgvector/pgvector-python + why: "Official pgvector Python library - Vector column, HNSW index, cosine_distance" + +- url: https://github.com/pgvector/pgvector-python/blob/master/README.md + why: "SQLAlchemy 2.0 patterns, Index creation with postgresql_ops" + +# pgvector Indexing +- url: https://neon.com/blog/understanding-vector-search-and-hnsw-index-with-pgvector + why: "HNSW vs IVFFlat tradeoffs, index tuning parameters" + +# OpenAI Embeddings +- url: https://platform.openai.com/docs/api-reference/embeddings + why: "Embeddings API reference - batch processing, input limits (8192 tokens)" + +- url: https://platform.openai.com/docs/guides/embeddings + why: "Best practices, token counting with tiktoken cl100k_base" + +# Markdown Chunking +- url: https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/ + why: "MarkdownHeaderTextSplitter pattern for heading-aware splitting" + +# Codebase Patterns (CRITICAL) +- file: app/features/registry/models.py + why: "ORM pattern with JSONB, TimestampMixin, Index creation" + +- file: app/features/registry/schemas.py + why: "Pydantic v2 patterns - ConfigDict, field_validator, from_attributes" + +- file: app/features/registry/routes.py + why: "FastAPI patterns - APIRouter, response_model, HTTPException" + +- file: app/features/registry/service.py + why: "Async service pattern - get_settings(), structured logging" + +- file: app/features/registry/tests/conftest.py + why: "Test fixtures - db_session, client, cleanup patterns" + +# ADR +- file: docs/ADR/ADR-0003-vector-storage-pgvector-in-postgres.md + why: "Architectural decision for pgvector over dedicated vector DB" +``` + +### Current Codebase Tree (Relevant Parts) + +``` +app/ +├── core/ +│ ├── config.py # Settings singleton - ADD RAG settings here +│ ├── database.py # Base, get_db, get_engine +│ ├── logging.py # get_logger, structured logging +│ └── exceptions.py # ForecastLabError base class +├── shared/ +│ └── models.py # TimestampMixin +├── features/ +│ ├── registry/ # REFERENCE: Follow this pattern exactly +│ │ ├── models.py +│ │ ├── schemas.py +│ │ ├── routes.py +│ │ ├── service.py +│ │ ├── storage.py +│ │ └── tests/ +│ └── rag/ # NEW: Create this vertical slice +├── main.py # Include rag router here +docker-compose.yml # Already uses pgvector/pgvector:pg16 +alembic/versions/ # Add migration for pgvector extension + tables +``` + +### Desired Codebase Tree (Files to Create) + +``` +app/features/rag/ +├── __init__.py # Export router +├── models.py # DocumentSource, DocumentChunk ORM models +├── schemas.py # IndexRequest/Response, RetrieveRequest/Response, etc. +├── routes.py # FastAPI router with /rag/* endpoints +├── service.py # RAGService - indexing and retrieval logic +├── chunkers.py # MarkdownChunker, OpenAPIChunker classes +├── embeddings.py # EmbeddingService - async OpenAI API calls +├── tests/ +│ ├── __init__.py +│ ├── conftest.py # db_session, client fixtures +│ ├── test_schemas.py # Schema validation tests +│ ├── test_chunkers.py # Chunking logic tests (unit, no DB) +│ ├── test_embeddings.py # Embedding tests with mocked API +│ ├── test_service.py # Service tests (unit + integration) +│ └── test_routes.py # Route integration tests + +alembic/versions/ +└── xxxx_create_rag_tables.py # Migration with CREATE EXTENSION vector + +examples/rag/ +├── index_docs.py # Example: index docs/ directory +└── query.http # HTTP client examples +``` + +### Known Gotchas & Library Quirks + +```python +# CRITICAL: pgvector SQLAlchemy requires explicit import +from pgvector.sqlalchemy import Vector # NOT from sqlalchemy + +# CRITICAL: HNSW index requires vector_cosine_ops for cosine distance +Index( + "ix_embedding_hnsw", + DocumentChunk.embedding, + postgresql_using="hnsw", + postgresql_with={"m": 16, "ef_construction": 64}, + postgresql_ops={"embedding": "vector_cosine_ops"}, # MUST match query distance +) + +# CRITICAL: Cosine distance query uses cosine_distance method +from pgvector.sqlalchemy import Vector +stmt = select(DocumentChunk).order_by( + DocumentChunk.embedding.cosine_distance(query_embedding) # NOT <=> operator +).limit(top_k) + +# CRITICAL: OpenAI embeddings input limit is 8192 tokens per text +# Use tiktoken to count tokens before sending to API +import tiktoken +enc = tiktoken.get_encoding("cl100k_base") +tokens = enc.encode(text) +if len(tokens) > 8191: + # Truncate or split + +# CRITICAL: OpenAI API returns embeddings in same order as input +# But batch requests should be <= 2048 inputs per call + +# CRITICAL: Pydantic v2 uses ConfigDict, not class Config +from pydantic import BaseModel, ConfigDict +class MySchema(BaseModel): + model_config = ConfigDict(from_attributes=True, extra="forbid") + +# CRITICAL: SQLAlchemy 2.0 uses Mapped[] and mapped_column() +from sqlalchemy.orm import Mapped, mapped_column +embedding = mapped_column(Vector(1536)) # Vector column + +# CRITICAL: Alembic migration needs op.execute for CREATE EXTENSION +op.execute("CREATE EXTENSION IF NOT EXISTS vector") +``` + +--- + +## Implementation Blueprint + +### Data Models + +#### ORM Models (models.py) + +```python +"""RAG knowledge base ORM models.""" +from __future__ import annotations +import uuid +from datetime import datetime +from typing import Any +from sqlalchemy import ( + DateTime, Index, Integer, String, Text, UniqueConstraint, ForeignKey, +) +from sqlalchemy.dialects.postgresql import JSONB +from sqlalchemy.orm import Mapped, mapped_column, relationship +from pgvector.sqlalchemy import Vector +from app.core.database import Base +from app.shared.models import TimestampMixin + + +class DocumentSource(TimestampMixin, Base): + """Registered document source for indexing.""" + __tablename__ = "document_source" + + id: Mapped[int] = mapped_column(Integer, primary_key=True) + source_id: Mapped[str] = mapped_column(String(32), unique=True, index=True) + source_type: Mapped[str] = mapped_column(String(50), index=True) # markdown, openapi + source_path: Mapped[str] = mapped_column(Text, nullable=False) + content_hash: Mapped[str] = mapped_column(String(64), nullable=False) # SHA-256 + metadata_: Mapped[dict[str, Any] | None] = mapped_column("metadata", JSONB, nullable=True) + indexed_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False) + + # Relationship + chunks: Mapped[list[DocumentChunk]] = relationship( + back_populates="source", cascade="all, delete-orphan" + ) + + __table_args__ = ( + UniqueConstraint("source_type", "source_path", name="uq_source_type_path"), + ) + + +class DocumentChunk(TimestampMixin, Base): + """Indexed document chunk with embedding.""" + __tablename__ = "document_chunk" + + id: Mapped[int] = mapped_column(Integer, primary_key=True) + chunk_id: Mapped[str] = mapped_column(String(32), unique=True, index=True) + source_id: Mapped[int] = mapped_column( + Integer, ForeignKey("document_source.id", ondelete="CASCADE"), index=True + ) + chunk_index: Mapped[int] = mapped_column(Integer, nullable=False) + content: Mapped[str] = mapped_column(Text, nullable=False) + embedding = mapped_column(Vector(1536), nullable=True) # Dimension from settings + token_count: Mapped[int] = mapped_column(Integer, nullable=False) + metadata_: Mapped[dict[str, Any] | None] = mapped_column("metadata", JSONB, nullable=True) + + # Relationship + source: Mapped[DocumentSource] = relationship(back_populates="chunks") + + __table_args__ = ( + UniqueConstraint("source_id", "chunk_index", name="uq_source_chunk_index"), + Index( + "ix_chunk_embedding_hnsw", + "embedding", + postgresql_using="hnsw", + postgresql_with={"m": 16, "ef_construction": 64}, + postgresql_ops={"embedding": "vector_cosine_ops"}, + ), + Index("ix_chunk_metadata_gin", "metadata", postgresql_using="gin"), + ) +``` + +#### Pydantic Schemas (schemas.py) + +```python +"""Pydantic schemas for RAG API contracts.""" +from datetime import datetime +from typing import Any, Literal +from pydantic import BaseModel, ConfigDict, Field, field_validator + + +class IndexRequest(BaseModel): + """Request to index a document.""" + model_config = ConfigDict(extra="forbid") + + source_type: Literal["markdown", "openapi"] = Field( + ..., description="Type of document to index" + ) + source_path: str = Field(..., min_length=1, max_length=500) + content: str | None = Field(None, description="Optional content override") + metadata: dict[str, Any] | None = Field(None, description="Custom metadata") + + +class IndexResponse(BaseModel): + """Response from indexing operation.""" + model_config = ConfigDict(from_attributes=True) + + source_id: str + source_path: str + chunks_created: int + tokens_processed: int + duration_ms: float + status: Literal["indexed", "updated", "unchanged"] + + +class RetrieveRequest(BaseModel): + """Request for semantic search.""" + model_config = ConfigDict(extra="forbid") + + query: str = Field(..., min_length=1, max_length=2000) + top_k: int = Field(default=5, ge=1, le=50) + similarity_threshold: float = Field(default=0.7, ge=0.0, le=1.0) + filters: dict[str, Any] | None = Field(None, description="Metadata filters") + + +class ChunkResult(BaseModel): + """Single chunk in retrieval results.""" + model_config = ConfigDict(from_attributes=True) + + chunk_id: str + source_id: str + source_path: str + source_type: str + content: str + relevance_score: float + metadata: dict[str, Any] | None = None + + +class RetrieveResponse(BaseModel): + """Response from retrieval operation.""" + results: list[ChunkResult] + query_embedding_time_ms: float + search_time_ms: float + total_chunks_searched: int + + +class SourceResponse(BaseModel): + """Source details response.""" + model_config = ConfigDict(from_attributes=True) + + source_id: str + source_type: str + source_path: str + chunk_count: int + content_hash: str + indexed_at: datetime + metadata: dict[str, Any] | None = None + + +class SourceListResponse(BaseModel): + """List of indexed sources.""" + sources: list[SourceResponse] + total_sources: int + total_chunks: int + + +class DeleteResponse(BaseModel): + """Response from delete operation.""" + source_id: str + chunks_deleted: int + status: Literal["deleted"] +``` + +--- + +## Task List + +### Task 1: Add Dependencies to pyproject.toml + +```yaml +MODIFY: pyproject.toml +ADD to dependencies: + - "pgvector>=0.3.0" # pgvector SQLAlchemy support + - "openai>=1.40.0" # OpenAI API client (async) + - "tiktoken>=0.7.0" # Token counting for chunk size + - "httpx>=0.28.0" # Already in dev, may need in main for async HTTP +``` + +### Task 2: Add RAG Settings to config.py + +```yaml +MODIFY: app/core/config.py +ADD after "jobs_retention_days" (~line 65): + # RAG Embedding Configuration + rag_embedding_model: str = "text-embedding-3-small" + rag_embedding_dimension: int = 1536 + rag_embedding_batch_size: int = 100 + openai_api_key: str = "" # Required for embeddings + + # RAG Chunking Configuration + rag_chunk_size: int = 512 # tokens + rag_chunk_overlap: int = 50 # tokens + rag_min_chunk_size: int = 100 + + # RAG Retrieval Configuration + rag_top_k: int = 5 + rag_similarity_threshold: float = 0.7 + rag_max_context_tokens: int = 4000 + + # RAG Index Configuration + rag_index_type: Literal["hnsw", "ivfflat"] = "hnsw" + rag_hnsw_m: int = 16 + rag_hnsw_ef_construction: int = 64 +``` + +### Task 3: Create Alembic Migration + +```yaml +CREATE: alembic/versions/xxxx_create_rag_tables.py +PATTERN: Follow app/features/registry migration pattern + +Pseudocode: +def upgrade(): + # Enable pgvector extension + op.execute("CREATE EXTENSION IF NOT EXISTS vector") + + # Create document_source table + op.create_table("document_source", ...) + + # Create document_chunk table with Vector column + op.create_table("document_chunk", + sa.Column("embedding", Vector(1536), nullable=True), + ... + ) + + # Create HNSW index + op.create_index( + "ix_chunk_embedding_hnsw", + "document_chunk", + ["embedding"], + postgresql_using="hnsw", + postgresql_with={"m": 16, "ef_construction": 64}, + postgresql_ops={"embedding": "vector_cosine_ops"}, + ) +``` + +### Task 4: Create ORM Models + +```yaml +CREATE: app/features/rag/models.py +MIRROR: app/features/registry/models.py pattern +CRITICAL: + - Use pgvector.sqlalchemy.Vector for embedding column + - Add HNSW index in __table_args__ + - Use TimestampMixin + - Cascade delete from source to chunks +``` + +### Task 5: Create Pydantic Schemas + +```yaml +CREATE: app/features/rag/schemas.py +MIRROR: app/features/registry/schemas.py pattern +INCLUDE: + - IndexRequest, IndexResponse + - RetrieveRequest, RetrieveResponse, ChunkResult + - SourceResponse, SourceListResponse + - DeleteResponse +``` + +### Task 6: Create Chunker Classes + +```yaml +CREATE: app/features/rag/chunkers.py + +Classes: + BaseChunker (ABC): + - chunk(content: str) -> list[ChunkData] + + MarkdownChunker(BaseChunker): + - Split on heading boundaries (# ## ###) + - Respect chunk_size and chunk_overlap from settings + - Extract heading hierarchy for metadata + - Use tiktoken cl100k_base for token counting + + OpenAPIChunker(BaseChunker): + - Parse OpenAPI JSON/YAML + - One chunk per endpoint (path + method) + - Include operation summary, description, parameters + +CRITICAL: + - Use tiktoken for token counting (cl100k_base encoding) + - Never exceed 8191 tokens per chunk (OpenAI limit) +``` + +### Task 7: Create Embedding Service + +```yaml +CREATE: app/features/rag/embeddings.py + +Class EmbeddingService: + __init__(self): + - Load settings (api_key, model, dimension, batch_size) + - Initialize AsyncOpenAI client + + async def embed_texts(self, texts: list[str]) -> list[list[float]]: + - Batch texts into groups of batch_size + - Call OpenAI embeddings API for each batch + - Handle rate limits with exponential backoff + - Return embeddings in same order as input + + async def embed_query(self, query: str) -> list[float]: + - Single text embedding for retrieval queries + +CRITICAL: + - Use openai.AsyncOpenAI for async calls + - Validate token count before API call + - Log token usage for cost tracking +``` + +### Task 8: Create RAG Service + +```yaml +CREATE: app/features/rag/service.py +MIRROR: app/features/registry/service.py pattern + +Class RAGService: + async def index_document(self, db, request: IndexRequest) -> IndexResponse: + - Read content from source_path (or use provided content) + - Compute SHA-256 content hash + - Check if source exists with same hash (skip if unchanged) + - Chunk content using appropriate chunker + - Generate embeddings for all chunks + - Upsert source record + - Delete old chunks, insert new chunks + - Return IndexResponse with stats + + async def retrieve(self, db, request: RetrieveRequest) -> RetrieveResponse: + - Generate query embedding + - Build pgvector similarity query with cosine_distance + - Apply metadata filters if provided + - Execute query, compute relevance scores + - Return top-k results above threshold + + async def list_sources(self, db) -> SourceListResponse: + - Query all sources with chunk counts + - Return paginated list + + async def delete_source(self, db, source_id: str) -> DeleteResponse: + - Find source by source_id + - Delete (cascades to chunks) + - Return delete count + +CRITICAL: + - Use cosine_distance for similarity (NOT l2_distance) + - Relevance score = 1 - cosine_distance (normalized to 0-1) + - Handle source not found with 404 +``` + +### Task 9: Create FastAPI Routes + +```yaml +CREATE: app/features/rag/routes.py +MIRROR: app/features/registry/routes.py pattern + +Routes: + POST /rag/index -> IndexResponse (201 CREATED) + POST /rag/retrieve -> RetrieveResponse (200 OK) + GET /rag/sources -> SourceListResponse (200 OK) + DELETE /rag/sources/{source_id} -> DeleteResponse (200 OK) + +CRITICAL: + - Use structured logging with rag.* event prefix + - Handle OpenAI API errors gracefully + - Validate source_id format +``` + +### Task 10: Register Router in main.py + +```yaml +MODIFY: app/main.py +ADD import: from app.features.rag.routes import router as rag_router +ADD router: app.include_router(rag_router) +``` + +### Task 11: Create Test Fixtures + +```yaml +CREATE: app/features/rag/tests/conftest.py +MIRROR: app/features/registry/tests/conftest.py + +Fixtures: + - db_session: Async session with cleanup (delete test-* sources) + - client: AsyncClient with db override + - sample_markdown_content: Test markdown with headings + - sample_openapi_content: Test OpenAPI spec + - mock_embedding_service: Mocked EmbeddingService for unit tests +``` + +### Task 12: Create Unit Tests + +```yaml +CREATE: app/features/rag/tests/test_schemas.py + - Test IndexRequest validation + - Test RetrieveRequest validation (query length, threshold bounds) + +CREATE: app/features/rag/tests/test_chunkers.py + - Test MarkdownChunker respects heading boundaries + - Test MarkdownChunker respects chunk_size + - Test MarkdownChunker extracts heading metadata + - Test OpenAPIChunker creates one chunk per endpoint + - Test chunk token counts are within limits + +CREATE: app/features/rag/tests/test_embeddings.py + - Test embed_texts batching logic + - Test embed_query returns correct dimension + - Mock OpenAI API responses + +CREATE: app/features/rag/tests/test_service.py (unit) + - Test content hash computation + - Test idempotent re-indexing logic + - Test relevance score normalization +``` + +### Task 13: Create Integration Tests + +```yaml +CREATE: app/features/rag/tests/test_routes.py +@pytest.mark.integration tests: + - test_index_markdown_creates_chunks + - test_index_same_content_returns_unchanged + - test_index_updated_content_re_indexes + - test_retrieve_returns_relevant_chunks + - test_retrieve_respects_threshold + - test_list_sources_returns_all + - test_delete_source_removes_chunks + - test_delete_nonexistent_returns_404 +``` + +### Task 14: Create Examples + +```yaml +CREATE: examples/rag/index_docs.py + - Script to index docs/ directory + +CREATE: examples/rag/query.http + - HTTP client examples for all endpoints +``` + +### Task 15: Update .env.example + +```yaml +MODIFY: .env.example +ADD: + # RAG Configuration + OPENAI_API_KEY=sk-... + RAG_EMBEDDING_MODEL=text-embedding-3-small + RAG_CHUNK_SIZE=512 + RAG_TOP_K=5 +``` + +--- + +## Validation Loop + +### Level 1: Syntax & Style + +```bash +# Run FIRST - fix any errors before proceeding +uv run ruff check app/features/rag/ --fix +uv run ruff format app/features/rag/ + +# Expected: No errors +``` + +### Level 2: Type Checking + +```bash +# MUST be green +uv run mypy app/features/rag/ +uv run pyright app/features/rag/ + +# Expected: 0 errors on both +``` + +### Level 3: Unit Tests + +```bash +# No database required +uv run pytest app/features/rag/tests/ -v -m "not integration" + +# Expected: All pass +# If failing: Read error, fix code, re-run +``` + +### Level 4: Integration Tests + +```bash +# Requires PostgreSQL running +docker-compose up -d + +# Run migrations +uv run alembic upgrade head + +# Run integration tests +uv run pytest app/features/rag/tests/ -v -m integration + +# Expected: All pass +``` + +### Level 5: Manual Smoke Test + +```bash +# Start API +uv run uvicorn app.main:app --reload --port 8123 + +# Index a document +curl -X POST http://localhost:8123/rag/index \ + -H "Content-Type: application/json" \ + -d '{"source_type": "markdown", "source_path": "README.md"}' + +# Expected: {"source_id": "...", "chunks_created": N, ...} + +# Retrieve +curl -X POST http://localhost:8123/rag/retrieve \ + -H "Content-Type: application/json" \ + -d '{"query": "What is ForecastLabAI?", "top_k": 3}' + +# Expected: {"results": [...], ...} + +# List sources +curl http://localhost:8123/rag/sources + +# Delete source +curl -X DELETE http://localhost:8123/rag/sources/{source_id} +``` + +--- + +## Final Validation Checklist + +- [ ] All tests pass: `uv run pytest app/features/rag/tests/ -v` +- [ ] No linting errors: `uv run ruff check app/features/rag/` +- [ ] No type errors: `uv run mypy app/features/rag/ && uv run pyright app/features/rag/` +- [ ] Migration applies cleanly: `uv run alembic upgrade head` +- [ ] Manual smoke test successful +- [ ] Structured logging events follow `rag.*` prefix +- [ ] Content hash prevents duplicate embeddings +- [ ] HNSW index used for similarity queries + +--- + +## Anti-Patterns to Avoid + +- ❌ Don't use `l2_distance` when you want cosine similarity +- ❌ Don't forget to enable pgvector extension in migration +- ❌ Don't exceed 8191 tokens per embedding input +- ❌ Don't use sync OpenAI client - use AsyncOpenAI +- ❌ Don't hardcode embedding dimensions - use settings +- ❌ Don't catch all exceptions - be specific +- ❌ Don't skip content hash comparison (wastes API calls) +- ❌ Don't create new patterns when registry patterns work + +--- + +## Confidence Score: 8.5/10 + +**Strengths:** +- Docker already has pgvector image +- Clear patterns from registry module to follow +- Comprehensive documentation available +- ADR decision already made + +**Risks:** +- OpenAI API rate limits during bulk indexing +- HNSW index creation on large datasets may be slow +- tiktoken token counting edge cases + +**Mitigations:** +- Implement exponential backoff for API calls +- Create index after initial data load +- Extensive unit tests for chunking edge cases From 16d6b5a7d2cead156672c1dfff1525cf88f3947a Mon Sep 17 00:00:00 2001 From: "Gabe@w7dev" Date: Sun, 1 Feb 2026 11:01:58 +0000 Subject: [PATCH 3/3] docs(prp): add PRP-10 Agentic Layer implementation plan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Comprehensive PRP for INITIAL-10 Agentic Layer feature: - PydanticAI agent framework integration - Experiment Orchestrator Agent (backtest → compare → deploy) - RAG Assistant Agent (query → retrieve → answer with citations) - Human-in-the-loop approval workflow for sensitive actions - WebSocket streaming for real-time token delivery - Session persistence with JSONB message history - 17 ordered implementation tasks - Tool definitions for registry, backtesting, forecasting, RAG - Full Pydantic schemas and ORM models Confidence score: 7.5/10 Co-Authored-By: Claude Opus 4.5 --- PRPs/PRP-10-agentic-layer.md | 920 +++++++++++++++++++++++++++++++++++ 1 file changed, 920 insertions(+) create mode 100644 PRPs/PRP-10-agentic-layer.md diff --git a/PRPs/PRP-10-agentic-layer.md b/PRPs/PRP-10-agentic-layer.md new file mode 100644 index 00000000..6cade0dc --- /dev/null +++ b/PRPs/PRP-10-agentic-layer.md @@ -0,0 +1,920 @@ +# PRP-10: Agentic Layer ("The Brain") + +**Feature**: INITIAL-10.md — Agentic Layer +**Status**: Ready for Implementation +**Confidence Score**: 7.5/10 + +--- + +## Goal + +Build the Agentic Layer using PydanticAI providing: +1. **Experiment Orchestrator Agent** - Autonomous model experimentation workflow +2. **RAG Assistant Agent** - Evidence-grounded Q&A with citations +3. **Human-in-the-Loop Approval** - Blocking sensitive actions until approved +4. **WebSocket Streaming** - Real-time token delivery to clients +5. **Session Management** - Persistent state across multi-turn conversations + +This is the "Brain" layer that orchestrates tools from INITIAL-9 (RAG), Phase 5 (Backtesting), and Phase 6 (Registry). + +--- + +## Why + +- **Autonomous Experimentation**: Agent runs backtests, compares results, deploys winners +- **Evidence-Grounded Answers**: RAG-powered Q&A prevents hallucination +- **Safety Controls**: Human approval for deployment actions +- **Real-Time UX**: Streaming responses for responsive chat interface +- **Portfolio Value**: Demonstrates modern AI agent architecture + +--- + +## What + +### Endpoints + +| Method | Path | Description | +|--------|------|-------------| +| `POST` | `/agents/experiment/run` | Execute experiment workflow | +| `POST` | `/agents/experiment/approve` | Approve pending action | +| `POST` | `/agents/rag/query` | Query with answer generation | +| `GET` | `/agents/status/{session_id}` | Check session status | +| `WS` | `/agents/stream` | WebSocket for streaming | + +### Success Criteria + +- [ ] Agents produce schema-enforced structured outputs +- [ ] Tool calls logged with correlation IDs and timing +- [ ] Human-in-the-loop blocks sensitive actions +- [ ] WebSocket streaming delivers tokens in real-time +- [ ] Session state persists across requests +- [ ] Graceful LLM API failure handling with retries +- [ ] 60+ unit tests with mocked LLM responses +- [ ] 15+ integration tests (rate-limited real LLM calls) +- [ ] All validation gates green + +--- + +## All Needed Context + +### Documentation & References + +```yaml +# CRITICAL - PydanticAI Documentation +- url: https://ai.pydantic.dev/ + why: "Official PydanticAI docs - main reference" + +- url: https://ai.pydantic.dev/agents/ + why: "Agent constructor, result_type, system_prompt, run/run_stream methods" + +- url: https://ai.pydantic.dev/tools/ + why: "@agent.tool decorator, RunContext, deps_type, tool parameters" + +- url: https://ai.pydantic.dev/output/ + why: "AgentRunResult, StreamedRunResult, token usage tracking" + +- url: https://ai.pydantic.dev/examples/chat-app/ + why: "FastAPI + streaming integration example" + +- url: https://github.com/pydantic/pydantic-ai + why: "Source code for edge cases" + +# Anthropic API (fallback reference) +- url: https://docs.anthropic.com/en/api + why: "Claude model IDs, rate limits, error codes" + +# Codebase Patterns (CRITICAL) +- file: app/features/registry/service.py + why: "Service pattern - __init__, get_settings(), structured logging" + +- file: app/features/jobs/service.py + why: "Job execution pattern - state machine, error handling, audit trail" + +- file: app/features/backtesting/service.py + why: "BacktestingService - the agent will call this via tools" + +- file: app/features/registry/routes.py + why: "Route patterns - APIRouter, response_model, HTTPException" + +- file: app/features/registry/tests/conftest.py + why: "Test fixtures - db_session, client, async patterns" + +# RAG Integration (INITIAL-9 dependency) +- file: PRPs/PRP-9-rag-knowledge-base.md + why: "RAG layer the agent will consume via retrieve_context tool" +``` + +### Current Codebase Tree (Relevant Parts) + +``` +app/ +├── core/ +│ ├── config.py # Settings - ADD agent settings +│ ├── database.py # get_db dependency +│ ├── logging.py # get_logger +│ └── exceptions.py # ForecastLabError base +├── features/ +│ ├── backtesting/ # Agent tool: run_backtest +│ ├── registry/ # Agent tools: list_runs, compare_runs, create_alias +│ ├── forecasting/ # Agent tool: list_models +│ ├── rag/ # INITIAL-9 - Agent tool: retrieve_context +│ └── agents/ # NEW: Create this vertical slice +├── main.py # Include agents router + WebSocket +``` + +### Desired Codebase Tree (Files to Create) + +``` +app/features/agents/ +├── __init__.py # Export router +├── models.py # AgentSession ORM model +├── schemas.py # Request/Response Pydantic schemas +├── routes.py # REST endpoints +├── websocket.py # WebSocket endpoint handler +├── service.py # AgentService orchestration +├── agents/ +│ ├── __init__.py +│ ├── base.py # Base agent configuration +│ ├── experiment.py # Experiment Orchestrator Agent +│ └── rag_assistant.py # RAG Assistant Agent +├── tools/ +│ ├── __init__.py +│ ├── registry_tools.py # list_runs, compare_runs, create_alias +│ ├── backtesting_tools.py # run_backtest +│ ├── forecasting_tools.py # list_models +│ └── rag_tools.py # retrieve_context, format_citation +├── deps.py # AgentDeps dataclass for dependency injection +├── tests/ +│ ├── __init__.py +│ ├── conftest.py # Fixtures with mocked LLM +│ ├── test_schemas.py +│ ├── test_tools.py +│ ├── test_agents.py +│ ├── test_service.py +│ └── test_routes.py + +alembic/versions/ +└── xxxx_create_agent_sessions_table.py + +examples/agents/ +├── experiment_demo.py +├── rag_query.http +└── websocket_client.py +``` + +### Known Gotchas & Library Quirks + +```python +# CRITICAL: PydanticAI model identifier format +# Use "anthropic:claude-sonnet-4-20250514" NOT "claude-sonnet-4-20250514" +agent = Agent(model="anthropic:claude-sonnet-4-20250514") + +# CRITICAL: deps_type must match RunContext generic parameter +agent = Agent( + model="anthropic:claude-sonnet-4-20250514", + deps_type=AgentDeps, # Your dependency dataclass +) + +@agent.tool +def my_tool(ctx: RunContext[AgentDeps], param: str) -> str: + # ctx.deps is typed as AgentDeps + db = ctx.deps.db + ... + +# CRITICAL: Use @agent.tool for context access, @agent.tool_plain without +@agent.tool_plain +def roll_dice() -> str: + """No RunContext needed here.""" + return str(random.randint(1, 6)) + +# CRITICAL: output_type (not result_type) for structured outputs +agent = Agent( + model="...", + output_type=ExperimentReport, # NOT result_type +) + +# CRITICAL: run() is async, run_sync() is sync wrapper +result = await agent.run(prompt, deps=deps) # Async +result = agent.run_sync(prompt, deps=deps) # Sync + +# CRITICAL: Streaming requires async context manager +async with agent.run_stream(prompt, deps=deps) as result: + async for text in result.stream_text(): + yield text + +# CRITICAL: Access token usage after run completes +print(result.usage()) # RunUsage(input_tokens=X, output_tokens=Y) + +# CRITICAL: Message history for multi-turn +result2 = await agent.run( + "follow-up question", + deps=deps, + message_history=result.messages, # Previous messages +) + +# CRITICAL: Tool docstrings become schema descriptions +@agent.tool +async def run_backtest( + ctx: RunContext[AgentDeps], + model_type: str, + config: dict[str, Any], +) -> BacktestResult: + """Run a backtest for a model configuration. + + Use this to evaluate model performance with time-series CV. + Returns per-fold and aggregated metrics (MAE, sMAPE, WAPE). + + Args: + model_type: Type of model (naive, seasonal_naive, moving_average) + config: Model-specific configuration + """ + ... + +# CRITICAL: FastAPI WebSocket pattern +from fastapi import WebSocket, WebSocketDisconnect + +@router.websocket("/agents/stream") +async def websocket_stream(websocket: WebSocket): + await websocket.accept() + try: + while True: + data = await websocket.receive_json() + # Process and stream response + async for chunk in stream_agent_response(data): + await websocket.send_json(chunk) + except WebSocketDisconnect: + pass + +# CRITICAL: PydanticAI retry mechanism +from pydantic_ai import ModelRetry + +@agent.tool +async def risky_tool(ctx: RunContext[AgentDeps]) -> str: + try: + return await external_api() + except APIError as e: + raise ModelRetry(f"API failed: {e}. Please try again.") from e +``` + +--- + +## Implementation Blueprint + +### Data Models + +#### ORM Model (models.py) + +```python +"""Agent session persistence.""" +from __future__ import annotations +from datetime import datetime +from enum import Enum +from typing import Any +from sqlalchemy import DateTime, Integer, String, Text +from sqlalchemy.dialects.postgresql import JSONB +from sqlalchemy.orm import Mapped, mapped_column +from app.core.database import Base +from app.shared.models import TimestampMixin + + +class SessionStatus(str, Enum): + """Agent session states.""" + ACTIVE = "active" + AWAITING_APPROVAL = "awaiting_approval" + COMPLETED = "completed" + EXPIRED = "expired" + FAILED = "failed" + + +class AgentType(str, Enum): + """Available agent types.""" + EXPERIMENT_ORCHESTRATOR = "experiment_orchestrator" + RAG_ASSISTANT = "rag_assistant" + + +class AgentSession(TimestampMixin, Base): + """Persistent agent session for multi-turn conversations.""" + __tablename__ = "agent_session" + + id: Mapped[int] = mapped_column(Integer, primary_key=True) + session_id: Mapped[str] = mapped_column(String(32), unique=True, index=True) + agent_type: Mapped[str] = mapped_column(String(50), index=True) + status: Mapped[str] = mapped_column(String(30), default=SessionStatus.ACTIVE.value) + + # Message history for multi-turn + message_history: Mapped[list[dict[str, Any]]] = mapped_column(JSONB, default=list) + + # Pending approval + pending_action: Mapped[dict[str, Any] | None] = mapped_column(JSONB, nullable=True) + + # Usage tracking + total_tokens_used: Mapped[int] = mapped_column(Integer, default=0) + tool_calls_count: Mapped[int] = mapped_column(Integer, default=0) + + # Timing + last_activity: Mapped[datetime] = mapped_column(DateTime(timezone=True)) + expires_at: Mapped[datetime] = mapped_column(DateTime(timezone=True)) +``` + +#### Dependencies (deps.py) + +```python +"""Agent dependencies for tool access.""" +from dataclasses import dataclass +from sqlalchemy.ext.asyncio import AsyncSession + + +@dataclass +class AgentDeps: + """Dependencies passed to agent tools via RunContext.""" + db: AsyncSession + session_id: str + request_id: str | None = None +``` + +#### Pydantic Schemas (schemas.py) + +```python +"""Agent API schemas.""" +from datetime import datetime +from typing import Any, Literal +from pydantic import BaseModel, ConfigDict, Field + + +# === Experiment Agent === + +class ExperimentConstraints(BaseModel): + """Constraints for experiment search.""" + model_config = ConfigDict(extra="forbid") + + model_types: list[str] = Field(default_factory=lambda: ["naive", "seasonal_naive"]) + min_train_size: int = Field(default=60, ge=30) + max_splits: int = Field(default=5, ge=1, le=20) + + +class ExperimentRequest(BaseModel): + """Request to run experiment workflow.""" + model_config = ConfigDict(extra="forbid") + + objective: str = Field(..., min_length=10, max_length=500) + store_id: int = Field(..., ge=1) + product_id: int = Field(..., ge=1) + constraints: ExperimentConstraints = Field(default_factory=ExperimentConstraints) + auto_deploy: bool = False + session_id: str | None = None + + +class RunSummary(BaseModel): + """Summary of a model run.""" + run_id: str + model_type: str + config: dict[str, Any] + metrics: dict[str, float] + + +class BaselineComparison(BaseModel): + """Comparison against baseline models.""" + vs_naive: dict[str, float] | None = None + vs_seasonal_naive: dict[str, float] | None = None + + +class ExperimentReport(BaseModel): + """Structured output from Experiment Agent.""" + objective: str + methodology: str + experiments_run: int + best_run: RunSummary | None + baseline_comparison: BaselineComparison | None + recommendation: str + approval_required: bool + pending_action: str | None = None + + +class ToolCallSummary(BaseModel): + """Summary of a tool call.""" + tool: str + args: dict[str, Any] + result_summary: str + duration_ms: float + + +class ExperimentResponse(BaseModel): + """Response from experiment workflow.""" + session_id: str + status: Literal["completed", "awaiting_approval", "failed"] + report: ExperimentReport | None = None + tool_calls: list[ToolCallSummary] = Field(default_factory=list) + tokens_used: int = 0 + duration_ms: float = 0 + + +# === Approval === + +class ApprovalRequest(BaseModel): + """Request to approve/reject pending action.""" + model_config = ConfigDict(extra="forbid") + + session_id: str + action: str + approved: bool + comment: str | None = Field(None, max_length=500) + + +class ApprovalResponse(BaseModel): + """Response from approval action.""" + session_id: str + action: str + status: Literal["executed", "rejected"] + result: dict[str, Any] | None = None + + +# === RAG Agent === + +class RAGQueryRequest(BaseModel): + """Request for RAG-powered Q&A.""" + model_config = ConfigDict(extra="forbid") + + query: str = Field(..., min_length=5, max_length=2000) + session_id: str | None = None + include_sources: bool = True + + +class Citation(BaseModel): + """Citation from RAG retrieval.""" + source_type: str + source_path: str + chunk_id: str + snippet: str + relevance_score: float + + +class RAGQueryResponse(BaseModel): + """Response from RAG query.""" + session_id: str + answer: str + confidence: float = Field(..., ge=0.0, le=1.0) + citations: list[Citation] = Field(default_factory=list) + insufficient_context: bool = False + tokens_used: int = 0 + duration_ms: float = 0 + + +# === Session Status === + +class SessionStatusResponse(BaseModel): + """Session status details.""" + session_id: str + agent_type: str + status: str + created_at: datetime + last_activity: datetime + pending_action: dict[str, Any] | None = None + tool_calls_count: int + tokens_used: int + + +# === WebSocket Messages === + +class WSMessage(BaseModel): + """WebSocket message from client.""" + type: Literal["query", "approve", "cancel"] + agent: Literal["rag_assistant", "experiment_orchestrator"] + payload: dict[str, Any] + + +class WSEvent(BaseModel): + """WebSocket event to client.""" + type: Literal["token", "tool_call", "complete", "error"] + content: str | None = None + tool: str | None = None + status: str | None = None + summary: str | None = None + session_id: str | None = None + tokens_used: int | None = None +``` + +--- + +## Task List + +### Task 1: Add Dependencies to pyproject.toml + +```yaml +MODIFY: pyproject.toml +ADD to dependencies: + - "pydantic-ai>=0.1.0" # PydanticAI agent framework + - "anthropic>=0.40.0" # Anthropic SDK for Claude + - "websockets>=13.0" # WebSocket support (already in uvicorn[standard]) +``` + +### Task 2: Add Agent Settings to config.py + +```yaml +MODIFY: app/core/config.py +ADD after RAG settings: + + # Agent LLM Configuration + agent_default_model: str = "anthropic:claude-sonnet-4-20250514" + agent_fallback_model: str = "openai:gpt-4o" + agent_temperature: float = 0.1 + agent_max_tokens: int = 4096 + anthropic_api_key: str = "" # Required + + # Agent Execution Configuration + agent_max_tool_calls: int = 10 + agent_timeout_seconds: int = 120 + agent_retry_attempts: int = 3 + agent_retry_delay_seconds: float = 1.0 + + # Human-in-the-Loop Configuration + agent_require_approval: list[str] = ["create_alias", "archive_run"] + agent_approval_timeout_minutes: int = 60 + + # Session Configuration + agent_session_ttl_minutes: int = 120 + agent_max_sessions_per_user: int = 5 + + # Streaming Configuration + agent_enable_streaming: bool = True +``` + +### Task 3: Create Alembic Migration + +```yaml +CREATE: alembic/versions/xxxx_create_agent_sessions_table.py +PATTERN: Follow existing migration patterns + +Key columns: + - session_id (String 32, unique, indexed) + - agent_type (String 50, indexed) + - status (String 30) + - message_history (JSONB) + - pending_action (JSONB, nullable) + - total_tokens_used (Integer) + - tool_calls_count (Integer) + - last_activity (DateTime TZ) + - expires_at (DateTime TZ) + - created_at, updated_at (TimestampMixin) +``` + +### Task 4: Create ORM Models + +```yaml +CREATE: app/features/agents/models.py +MIRROR: app/features/registry/models.py pattern +INCLUDE: + - SessionStatus enum + - AgentType enum + - AgentSession model with JSONB columns +``` + +### Task 5: Create Dependencies Dataclass + +```yaml +CREATE: app/features/agents/deps.py +CONTENT: + - AgentDeps dataclass + - Fields: db (AsyncSession), session_id, request_id +``` + +### Task 6: Create Pydantic Schemas + +```yaml +CREATE: app/features/agents/schemas.py +MIRROR: app/features/registry/schemas.py pattern +INCLUDE: + - ExperimentRequest, ExperimentResponse, ExperimentReport + - ApprovalRequest, ApprovalResponse + - RAGQueryRequest, RAGQueryResponse, Citation + - SessionStatusResponse + - WSMessage, WSEvent +``` + +### Task 7: Create Tool Modules + +```yaml +CREATE: app/features/agents/tools/registry_tools.py +TOOLS: + - list_runs(ctx, filters) -> list[RunSummary] + - compare_runs(ctx, run_id_a, run_id_b) -> CompareResult + - create_alias(ctx, alias_name, run_id) -> AliasResult + - archive_run(ctx, run_id) -> ArchiveResult + +CREATE: app/features/agents/tools/backtesting_tools.py +TOOLS: + - run_backtest(ctx, model_type, config, store_id, product_id, n_splits) -> BacktestResult + +CREATE: app/features/agents/tools/forecasting_tools.py +TOOLS: + - list_models(ctx) -> list[ModelInfo] + +CREATE: app/features/agents/tools/rag_tools.py +TOOLS: + - retrieve_context(ctx, query, top_k) -> list[RetrievedChunk] + - format_citation(ctx, chunk) -> Citation + +CRITICAL for all tools: + - Use @agent.tool decorator (not @agent.tool_plain) for db access + - First param is RunContext[AgentDeps] + - Detailed docstrings for LLM schema + - Structured logging with timing +``` + +### Task 8: Create Agent Definitions + +```yaml +CREATE: app/features/agents/agents/base.py +CONTENT: + - get_agent_settings() helper + - Common model configuration + +CREATE: app/features/agents/agents/experiment.py +CONTENT: + - ExperimentReport output schema + - experiment_agent = Agent(...) + - System prompt for experiment orchestration + - Tools: list_models, run_backtest, compare_runs, create_alias + +CREATE: app/features/agents/agents/rag_assistant.py +CONTENT: + - RAGResponse output schema + - rag_agent = Agent(...) + - System prompt for evidence-grounded answers + - Tools: retrieve_context, format_citation +``` + +### Task 9: Create Agent Service + +```yaml +CREATE: app/features/agents/service.py +MIRROR: app/features/jobs/service.py pattern + +Class AgentService: + async def run_experiment(self, db, request) -> ExperimentResponse: + - Create/resume session + - Build AgentDeps + - Run experiment_agent with tools + - Capture tool calls and timing + - Handle approval_required check + - Update session state + - Return structured response + + async def run_rag_query(self, db, request) -> RAGQueryResponse: + - Create/resume session + - Run rag_agent with tools + - Extract citations from tool results + - Return structured response + + async def approve_action(self, db, request) -> ApprovalResponse: + - Load session + - Validate pending_action matches + - Execute action if approved + - Update session status + - Return result + + async def get_session_status(self, db, session_id) -> SessionStatusResponse: + - Load session + - Return status details + + async def stream_response(self, db, message) -> AsyncGenerator[WSEvent]: + - Route to appropriate agent + - Use run_stream for token-by-token delivery + - Yield WSEvent for each chunk +``` + +### Task 10: Create REST Routes + +```yaml +CREATE: app/features/agents/routes.py +MIRROR: app/features/registry/routes.py pattern + +Routes: + POST /agents/experiment/run -> ExperimentResponse + POST /agents/experiment/approve -> ApprovalResponse + POST /agents/rag/query -> RAGQueryResponse + GET /agents/status/{session_id} -> SessionStatusResponse + +CRITICAL: + - Structured logging with agents.* prefix + - Handle LLM API errors gracefully + - Timeout handling +``` + +### Task 11: Create WebSocket Handler + +```yaml +CREATE: app/features/agents/websocket.py +PATTERN: FastAPI WebSocket with async iteration + +Key functions: + websocket_stream(websocket: WebSocket): + - Accept connection + - Receive JSON messages + - Parse WSMessage + - Call service.stream_response() + - Send WSEvent for each chunk + - Handle disconnect gracefully + +CRITICAL: + - Use asyncio.wait_for for timeout + - Catch WebSocketDisconnect + - Log all events with correlation ID +``` + +### Task 12: Register Router in main.py + +```yaml +MODIFY: app/main.py +ADD import: from app.features.agents.routes import router as agents_router +ADD import: from app.features.agents.websocket import websocket_stream +ADD router: app.include_router(agents_router) +ADD websocket: app.add_api_websocket_route("/agents/stream", websocket_stream) +``` + +### Task 13: Create Test Fixtures + +```yaml +CREATE: app/features/agents/tests/conftest.py +FIXTURES: + - db_session: Async session with cleanup + - client: AsyncClient with db override + - mock_anthropic: Mock Anthropic API responses + - sample_experiment_request: Test request + - sample_rag_request: Test request +``` + +### Task 14: Create Unit Tests + +```yaml +CREATE: app/features/agents/tests/test_schemas.py + - Test all request/response validation + +CREATE: app/features/agents/tests/test_tools.py + - Test each tool function with mocked deps + - Test tool return types + - Test error handling + +CREATE: app/features/agents/tests/test_agents.py + - Test agent with mocked LLM + - Test structured output parsing + - Test tool call ordering +``` + +### Task 15: Create Integration Tests + +```yaml +CREATE: app/features/agents/tests/test_routes.py +@pytest.mark.integration: + - test_experiment_run_creates_session + - test_experiment_approval_workflow + - test_rag_query_returns_citations + - test_session_status_returns_details + - test_websocket_streaming (with TestClient) +``` + +### Task 16: Create Examples + +```yaml +CREATE: examples/agents/experiment_demo.py + - Full experiment workflow demo + +CREATE: examples/agents/rag_query.http + - HTTP client examples + +CREATE: examples/agents/websocket_client.py + - Python WebSocket client example +``` + +### Task 17: Update .env.example + +```yaml +MODIFY: .env.example +ADD: + # Agent Configuration + ANTHROPIC_API_KEY=sk-ant-... + AGENT_DEFAULT_MODEL=anthropic:claude-sonnet-4-20250514 + AGENT_MAX_TOOL_CALLS=10 + AGENT_TIMEOUT_SECONDS=120 +``` + +--- + +## Validation Loop + +### Level 1: Syntax & Style + +```bash +# Run FIRST +uv run ruff check app/features/agents/ --fix +uv run ruff format app/features/agents/ + +# Expected: No errors +``` + +### Level 2: Type Checking + +```bash +# MUST be green +uv run mypy app/features/agents/ +uv run pyright app/features/agents/ + +# Expected: 0 errors +``` + +### Level 3: Unit Tests + +```bash +# No LLM calls required (mocked) +uv run pytest app/features/agents/tests/ -v -m "not integration" + +# Expected: All pass +``` + +### Level 4: Integration Tests + +```bash +# Requires PostgreSQL + API keys +docker-compose up -d +uv run alembic upgrade head +uv run pytest app/features/agents/tests/ -v -m integration + +# Expected: All pass (rate-limited) +``` + +### Level 5: Manual Smoke Test + +```bash +# Start API +uv run uvicorn app.main:app --reload --port 8123 + +# RAG Query +curl -X POST http://localhost:8123/agents/rag/query \ + -H "Content-Type: application/json" \ + -d '{"query": "How does backtesting prevent data leakage?"}' + +# Expected: {"session_id": "...", "answer": "...", "citations": [...]} + +# Experiment (requires indexed RAG data) +curl -X POST http://localhost:8123/agents/experiment/run \ + -H "Content-Type: application/json" \ + -d '{ + "objective": "Find best model for store 1, product 1", + "store_id": 1, + "product_id": 1 + }' + +# Expected: {"session_id": "...", "status": "completed", "report": {...}} + +# WebSocket test +python examples/agents/websocket_client.py +``` + +--- + +## Final Validation Checklist + +- [ ] All tests pass: `uv run pytest app/features/agents/tests/ -v` +- [ ] No linting errors: `uv run ruff check app/features/agents/` +- [ ] No type errors: `uv run mypy && pyright` +- [ ] Migration applies: `uv run alembic upgrade head` +- [ ] Manual smoke tests pass +- [ ] Structured logging with `agents.*` prefix +- [ ] Tool calls logged with timing +- [ ] Session state persists across requests +- [ ] Approval workflow blocks sensitive actions +- [ ] WebSocket streaming works + +--- + +## Anti-Patterns to Avoid + +- ❌ Don't use `result_type` - use `output_type` in PydanticAI +- ❌ Don't forget `deps_type` when using `RunContext[AgentDeps]` +- ❌ Don't use `@agent.tool_plain` when db access needed +- ❌ Don't forget to handle `WebSocketDisconnect` +- ❌ Don't block on LLM calls without timeout +- ❌ Don't store raw message_history as strings - use JSONB +- ❌ Don't skip structured logging for tool calls +- ❌ Don't hardcode model names - use settings + +--- + +## Confidence Score: 7.5/10 + +**Strengths:** +- PydanticAI has excellent documentation +- Clear FastAPI integration patterns +- Existing service patterns to follow +- Tool integrations with existing modules + +**Risks:** +- PydanticAI is relatively new (versioning may change) +- WebSocket streaming with tools is complex +- LLM rate limits may affect tests +- Message history serialization edge cases + +**Mitigations:** +- Pin PydanticAI version in pyproject.toml +- Comprehensive mocking for unit tests +- Rate-limited integration tests +- JSONB for flexible message storage