Merged

Dev #116

Changes from all commits (30 commits)
716dbb7
refactor(engine): remove ask functionality and query context in favor…
zTgx Apr 24, 2026
80a3e38
refactor(agent): remove unused agent module and update workspace conf…
zTgx Apr 24, 2026
5997d05
refactor: remove unused validation module and simplify codebase
zTgx Apr 24, 2026
ad42c1a
refactor(config): restructure configuration modules and remove retrie…
zTgx Apr 24, 2026
0ee2f3a
refactor(config): remove sufficiency and strategy configs from storag…
zTgx Apr 24, 2026
6675b92
feat(ask): enhance JSON parsing with proper error handling
zTgx Apr 24, 2026
2a583c2
refactor(vectorless-document): remove unused ReferenceResolver struct
zTgx Apr 24, 2026
96e3f05
refactor(engine): remove unused ask method and related types from eng…
zTgx Apr 24, 2026
bc80113
refactor(storage): remove memory backend and simplify persistence layer
zTgx Apr 24, 2026
63f81f8
refactor(workspace): remove unused test module
zTgx Apr 24, 2026
5500841
feat: upgrade minimum Python version to 3.11
zTgx Apr 24, 2026
24120bb
feat: update project description and Python version requirement
zTgx Apr 24, 2026
04247bf
refactor(core): remove unused query types and events from engine
zTgx Apr 24, 2026
340f22a
refactor(vectorless-document): remove unused exports and types from u…
zTgx Apr 24, 2026
e43213f
refactor(core): remove SufficiencyLevel enum and consolidate utility …
zTgx Apr 24, 2026
f1f10cc
feat(compiler): rename index module to compiler with updated types
zTgx Apr 24, 2026
1c49ca8
feat(ask): add LLM-powered cross-document insight extraction
zTgx Apr 24, 2026
cc7b66f
refactor(compiler): rename stages to passes and update module structure
zTgx Apr 24, 2026
5355ed0
docs(vectorless-compiler): update pipeline documentation with phase c…
zTgx Apr 24, 2026
9fde833
docs(compiler): add comprehensive documentation for compilation pipeline
zTgx Apr 24, 2026
8339fc0
docs(compiler): add documentation for custom passes, parsers, and sta…
zTgx Apr 24, 2026
666d51a
feat(compiler): add new backend passes for query routing, reasoning c…
zTgx Apr 24, 2026
1c824f4
feat(compiler): add new compilation passes for enhanced functionality
zTgx Apr 24, 2026
03fc6be
refactor(compiler): rename index pipeline to compile pipeline
zTgx Apr 24, 2026
6f212cc
refactor(compiler): remove deprecated StageResult alias and unused Cu…
zTgx Apr 24, 2026
e091cef
feat(document): add agent acceleration data to compiled documents
zTgx Apr 24, 2026
e4ef462
docs(architecture): rename index pipeline to compile pipeline with en…
zTgx Apr 24, 2026
88b1b61
docs(markdown): update example code snippets to use correct module paths
zTgx Apr 24, 2026
d8664b6
refactor(compiler): reorganize module declarations and imports
zTgx Apr 24, 2026
e1d0632
docs(HISTORY): add release notes for version 0.1.12
zTgx Apr 24, 2026
6 changes: 3 additions & 3 deletions Cargo.toml
@@ -14,8 +14,8 @@ members = [
    # "vectorless-core/vectorless-query",
    # "vectorless-core/vectorless-agent",
    # "vectorless-core/vectorless-retrieval",
    "vectorless-core/vectorless-index",
    "vectorless-core/vectorless-rerank",
    # "vectorless-core/vectorless-rerank",
    "vectorless-core/vectorless-compiler",
    "vectorless-core/vectorless-primitives",
    "vectorless-core/vectorless-engine",
    "vectorless-core/vectorless-py",
@@ -24,7 +24,7 @@ resolver = "2"

[workspace.package]
version = "0.1.12"
description = "Document Understanding Engine for AI"
description = "Knowing by reasoning, not vectors."
edition = "2024"
authors = ["zTgx <beautifularea@gmail.com>"]
license = "Apache-2.0"
14 changes: 14 additions & 0 deletions HISTORY.md
@@ -1,5 +1,19 @@
# HISTORY

## 0.1.12 (2026-04-24)

- **Compile pipeline**: renamed index pipeline to compile pipeline with passes-based architecture
- **Compiler refactor**: renamed stages to passes, removed deprecated `StageResult` alias and `CustomStageBuilder`
- New backend compilation passes: query routing, reasoning chains, overlap detection, and scoring
- Agent acceleration data added to compiled documents
- LLM-powered cross-document insight extraction in ask module
- Enhanced JSON parsing with proper error handling
- Upgraded minimum Python version to 3.11
- Removed unused modules: agent, memory backend, validation, ReferenceResolver, SufficiencyLevel
- Restructured configuration modules and removed legacy retrieval config
- Simplified storage layer by removing memory backend
- Documentation updates for architecture and compilation pipeline

## 0.1.11 (2026-04-21)

- Project description updated to "reasoning-based document engine"
56 changes: 38 additions & 18 deletions docs/docs/architecture.mdx
@@ -10,7 +10,7 @@ Vectorless transforms documents into hierarchical semantic trees and uses LLM-po

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │────▶│    Index     │────▶│   Storage    │
│   Document   │────▶│   Compile    │────▶│   Storage    │
│   (PDF/MD)   │     │   Pipeline   │     │    (Disk)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
@@ -20,23 +20,42 @@ Vectorless transforms documents into hierarchical semantic trees and uses LLM-po
└──────────────┘     └──────────────┘
```

## Index Pipeline
## Compile Pipeline

The indexing pipeline processes documents through ordered stages:
The compile pipeline processes documents through four phases (Frontend → Analysis → Transform → Backend), each containing independent passes:

| Stage | Priority | Description |
|-------|----------|-------------|
| **Parse** | 10 | Parse document into raw nodes (Markdown headings, PDF pages) |
| **Build** | 20 | Construct arena-based tree with thinning and content merge |
| **Validate** | 22 | Tree integrity checks |
| **Split** | 25 | Split oversized leaf nodes (>4000 tokens) |
| **Enhance** | 30 | Generate LLM summaries (Full, Selective, or Lazy strategy) |
| **Enrich** | 40 | Calculate metadata, page ranges, resolve cross-references |
| **Reasoning Index** | 45 | Build keyword-to-node mappings, synonym expansion, summary shortcuts |
| **Navigation Index** | 50 | Build NavEntry + ChildRoute data for agent navigation |
| **Optimize** | 60 | Final tree optimization |
| Phase | Pass | Priority | Description |
|-------|------|----------|-------------|
| **Frontend** | Parse | 10 | Parse document into raw nodes (Markdown headings, PDF pages) |
| **Frontend** | Build | 20 | Construct arena-based tree with thinning and content merge |
| **Analysis** | Validate | 22 | Tree integrity checks |
| **Transform** | Split | 25 | Split oversized leaf nodes (>4000 tokens) |
| **Transform** | Enhance | 30 | Generate LLM summaries (Full, Selective, or Lazy strategy) |
| **Transform** | Enrich | 40 | Calculate metadata, page ranges, resolve cross-references |
| **Backend** | Reasoning Index | 45 | Build keyword-to-node mappings, synonym expansion, summary shortcuts |
| **Backend** | Concept | 46 | Extract key concepts with section associations |
| **Backend** | Navigation Index | 50 | Build NavEntry + ChildRoute data for agent navigation |
| **Backend** | Route | 52 | Build query routing table (intent routes + concept routes) |
| **Backend** | Chain | 54 | Build reasoning chains from cross-references |
| **Backend** | Overlap | 56 | Detect content overlap between nodes (Jaccard similarity) |
| **Backend** | Score | 58 | Compute evidence quality scores (density, richness, specificity) |
| **Backend** | Verify | 59 | Validate compiled output integrity |
| **Backend** | Optimize | 60 | Final tree optimization |

Each stage is independently configurable. The pipeline supports incremental re-indexing via content fingerprinting.
Each pass is independently configurable. The pipeline supports incremental recompilation via content fingerprinting and checkpoint/resume for fault tolerance.
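
The priority ordering in the table above can be sketched as a sort over registered passes. This is a minimal illustration only; the `Phase`, `Pass`, and `execution_order` names are assumptions, not the actual vectorless-compiler API:

```rust
// Illustrative sketch: passes run in ascending priority order.
// Types and names here are assumptions for illustration.
#[derive(Clone, Copy)]
pub enum Phase {
    Frontend,
    Analysis,
    Transform,
    Backend,
}

pub struct Pass {
    pub name: &'static str,
    pub phase: Phase,
    pub priority: u32,
}

/// Return pass names in execution order (ascending priority).
pub fn execution_order(mut passes: Vec<Pass>) -> Vec<&'static str> {
    passes.sort_by_key(|p| p.priority);
    passes.into_iter().map(|p| p.name).collect()
}
```

Registering passes in any order and sorting by priority is what lets a custom pass slot between built-ins (e.g. priority 47 lands between Concept and Navigation Index).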

### Agent Acceleration Data

The backend passes produce pre-computed acceleration data used by Workers during retrieval:

| Data | Pass | Purpose |
|------|------|---------|
| **QueryRoutingTable** | Route | Maps intents and concepts to scored target nodes |
| **ChainIndex** | Chain | Connects sections via reasoning chains (elaboration, supporting, etc.) |
| **ContentOverlapMap** | Overlap | Flags duplicate/subset/summary overlap between nodes |
| **EvidenceScoreMap** | Score | Ranks nodes by information density and data richness |

This data is injected as Phase 1.5 hints into the Worker's navigation plan, allowing the LLM to make informed routing decisions without additional navigation steps.

## Tree Structure

@@ -104,7 +104,7 @@ When the user specifies document IDs directly, the Orchestrator skips the analys
Each Worker navigates a single document's tree to collect evidence through a command-based loop:

1. **Bird's-eye** — `ls` the root for an overview
2. **Plan** — LLM generates a navigation plan based on keyword index hits
2. **Plan** — LLM generates a navigation plan based on keyword index hits + acceleration data
3. **Navigate** — Loop: LLM selects command → execute → observe result → repeat
4. **Return** — Collected evidence only — no answer synthesis
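
The four steps above can be sketched as a command loop. The `Command` set and planner signature below are assumptions; in the real engine an LLM plays the planner role and `ls`/`find` consult the document tree:

```rust
// Minimal sketch of the Worker's command loop (illustrative types only).
pub enum Command {
    Ls(String),   // list children of a node
    Find(String), // jump via keyword index
    Read(String), // in this sketch, carries the evidence text directly
    Done,
}

/// Loop: ask the planner for the next command, execute it, observe,
/// repeat. Returns collected evidence only; no answer synthesis here.
pub fn navigate(mut plan: impl FnMut(&[String]) -> Command) -> Vec<String> {
    let mut evidence = Vec::new();
    loop {
        match plan(&evidence) {
            Command::Read(text) => evidence.push(text),
            Command::Done => return evidence,
            // Ls / Find would hit the document tree; elided in this sketch.
            Command::Ls(_) | Command::Find(_) => {}
        }
    }
}
```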

Expand Down Expand Up @@ -132,6 +151,7 @@ Workers prioritize keyword-based navigation over manual exploration:
1. When keyword index hits are available, Workers use `find` with the exact keyword to jump directly to relevant sections
2. Workers use `ls` when no keyword hints exist or when discovering unknown structure
3. Workers use `findtree` when the section title pattern is known but not the exact name
4. Pre-computed acceleration data (routes, scores, chains) is injected as Phase 1.5 hints to guide the Worker toward high-value nodes

#### Dynamic Re-planning

@@ -149,11 +169,11 @@ The system returns raw evidence text — no LLM synthesis or paraphrasing. This

## DocCard Catalog

When multiple documents are indexed, Vectorless maintains a lightweight `catalog.bin` containing DocCard metadata for each document. This allows the Orchestrator to analyze and select relevant documents without loading the full document trees — a significant optimization for workspaces with many documents.
When multiple documents are compiled, Vectorless maintains a lightweight `catalog.bin` containing DocCard metadata for each document. This allows the Orchestrator to analyze and select relevant documents without loading the full document trees — a significant optimization for workspaces with many documents.

## Cross-Document Graph

When multiple documents are indexed, Vectorless automatically builds a relationship graph based on shared keywords and Jaccard similarity. The graph is constructed as a background task after each indexing operation.
When multiple documents are compiled, Vectorless automatically builds a relationship graph based on shared keywords and Jaccard similarity. The graph is constructed as a background task after each compilation operation.
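
As a rough sketch, a Jaccard weight over two documents' keyword sets could look like this (hypothetical helper illustrating the idea, not the actual implementation):

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over shared keywords.
/// Hypothetical helper mirroring the graph-weighting idea above.
pub fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0; // two empty keyword sets share nothing
    }
    a.intersection(b).count() as f64 / union as f64
}
```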

## Zero Infrastructure

117 changes: 117 additions & 0 deletions docs/docs/compiler/checkpoint.mdx
@@ -0,0 +1,117 @@
---
sidebar_position: 6
---

# Checkpoint and Resume

Checkpointing allows the pipeline to resume from where it left off after an interruption (crash, timeout, process kill). This is critical for large documents where LLM-enhanced compilation can take minutes.

## How It Works

When `PipelineOptions::checkpoint_dir` is set, the orchestrator saves state to disk after each execution group completes:

```text
Group 0: [ParsePass] → save checkpoint
Group 1: [BuildPass] → save checkpoint
Group 2: [ValidatePass, SplitPass] → save checkpoint
Group 3: [EnhancePass] → save checkpoint ← expensive LLM calls
...
```

On restart, the orchestrator loads the checkpoint and skips already-completed passes.
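
Skipping already-completed passes on resume amounts to a set difference between the full pass order and the checkpoint's `completed_stages`. A sketch (names are illustrative):

```rust
/// Given the full pass order and the names recorded as completed in a
/// checkpoint, return the passes that still need to run. Illustrative only.
pub fn remaining_passes<'a>(all: &[&'a str], completed: &[String]) -> Vec<&'a str> {
    all.iter()
        .filter(|name| !completed.iter().any(|done| done.as_str() == **name))
        .copied()
        .collect()
}
```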

## What's Stored

Each checkpoint contains:

```rust
pub struct PipelineCheckpoint {
    pub doc_id: String,
    pub source_hash: String,           // SHA-256 of source content
    pub processing_version: u32,       // Algorithm version
    pub config_fingerprint: String,    // Hash of PipelineOptions
    pub completed_stages: Vec<String>, // Names of completed passes
    pub context_data: CheckpointContextData,
    pub timestamp: DateTime<Utc>,
}

pub struct CheckpointContextData {
    pub raw_nodes: Vec<RawNode>,    // From ParsePass
    pub tree: Option<DocumentTree>, // From BuildPass
    pub metrics: IndexMetrics,      // Cumulative metrics
    pub page_count: Option<usize>,
    pub line_count: Option<usize>,
    pub description: Option<String>,
}
```

## Validation

Before resuming, the checkpoint is validated against the current input:

| Check | Purpose |
|---|---|
| `source_hash` matches | Source content hasn't changed |
| `processing_version` matches | Algorithm hasn't been upgraded |
| `config_fingerprint` matches | Pipeline options haven't changed |

If any check fails, the checkpoint is discarded and the pipeline starts fresh.
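
The three checks above reduce to one conjunction. The field names follow the `PipelineCheckpoint` struct; the standalone `CheckpointMeta` type and function body are assumptions for illustration:

```rust
// Illustrative subset of the checkpoint fields used for validation.
pub struct CheckpointMeta {
    pub source_hash: String,
    pub processing_version: u32,
    pub config_fingerprint: String,
}

/// A checkpoint is resumable only if all three fingerprints still match.
pub fn is_valid_for_resume(
    cp: &CheckpointMeta,
    source_hash: &str,
    processing_version: u32,
    config_fingerprint: &str,
) -> bool {
    cp.source_hash == source_hash
        && cp.processing_version == processing_version
        && cp.config_fingerprint == config_fingerprint
}
```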

## Lifecycle

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Start     │────▶│     Load     │────▶│    Valid?    │
│   Pipeline   │     │  Checkpoint  │     │              │
└──────────────┘     └──────────────┘     └──┬───────┬───┘
                                             │       │
                                        Yes  │   No  │
                                             │       │
                                    ┌────────▼────┐  ┌▼────────────┐
                                    │ Resume from │  │ Start fresh │
                                    │ completed   │  │             │
                                    │ stages      │  │             │
                                    └──────┬──────┘  └─────────────┘
                                           │
                              ┌────────────▼─────────────┐
                              │ Execute remaining passes │
                              │ Save after each group    │
                              └────────────┬─────────────┘
                                           │
                              ┌────────────▼─────────────┐
                              │ All complete?            │
                              │ → Clear checkpoint file  │
                              └──────────────────────────┘
```

## Configuration

```rust
let options = PipelineOptions::default()
    .with_checkpoint_dir("./workspace/checkpoints");
```

Checkpoints are stored as individual JSON files in the checkpoint directory, one per document (keyed by `doc_id`). On successful completion, the checkpoint file is deleted.

## CheckpointManager API

```rust
let manager = CheckpointManager::new("./checkpoints");

// Save checkpoint
manager.save(&doc_id, &checkpoint)?;

// Load checkpoint
let checkpoint = manager.load(&doc_id);

// Check if valid for resume
let valid = CheckpointManager::is_valid_for_resume(
    &checkpoint,
    &source_hash,
    processing_version,
    &config_fingerprint,
);

// Clear after successful completion
manager.clear(&doc_id)?;
```