Merged

Dev #116

Changes from all commits (30 commits)
716dbb7
refactor(engine): remove ask functionality and query context in favor…
zTgx Apr 24, 2026
80a3e38
refactor(agent): remove unused agent module and update workspace conf…
zTgx Apr 24, 2026
5997d05
refactor: remove unused validation module and simplify codebase
zTgx Apr 24, 2026
ad42c1a
refactor(config): restructure configuration modules and remove retrie…
zTgx Apr 24, 2026
0ee2f3a
refactor(config): remove sufficiency and strategy configs from storag…
zTgx Apr 24, 2026
6675b92
feat(ask): enhance JSON parsing with proper error handling
zTgx Apr 24, 2026
2a583c2
refactor(vectorless-document): remove unused ReferenceResolver struct
zTgx Apr 24, 2026
96e3f05
refactor(engine): remove unused ask method and related types from eng…
zTgx Apr 24, 2026
bc80113
refactor(storage): remove memory backend and simplify persistence layer
zTgx Apr 24, 2026
63f81f8
refactor(workspace): remove unused test module
zTgx Apr 24, 2026
5500841
feat: upgrade minimum Python version to 3.11
zTgx Apr 24, 2026
24120bb
feat: update project description and Python version requirement
zTgx Apr 24, 2026
04247bf
refactor(core): remove unused query types and events from engine
zTgx Apr 24, 2026
340f22a
refactor(vectorless-document): remove unused exports and types from u…
zTgx Apr 24, 2026
e43213f
refactor(core): remove SufficiencyLevel enum and consolidate utility …
zTgx Apr 24, 2026
f1f10cc
feat(compiler): rename index module to compiler with updated types
zTgx Apr 24, 2026
1c49ca8
feat(ask): add LLM-powered cross-document insight extraction
zTgx Apr 24, 2026
cc7b66f
refactor(compiler): rename stages to passes and update module structure
zTgx Apr 24, 2026
5355ed0
docs(vectorless-compiler): update pipeline documentation with phase c…
zTgx Apr 24, 2026
9fde833
docs(compiler): add comprehensive documentation for compilation pipeline
zTgx Apr 24, 2026
8339fc0
docs(compiler): add documentation for custom passes, parsers, and sta…
zTgx Apr 24, 2026
666d51a
feat(compiler): add new backend passes for query routing, reasoning c…
zTgx Apr 24, 2026
1c824f4
feat(compiler): add new compilation passes for enhanced functionality
zTgx Apr 24, 2026
03fc6be
refactor(compiler): rename index pipeline to compile pipeline
zTgx Apr 24, 2026
6f212cc
refactor(compiler): remove deprecated StageResult alias and unused Cu…
zTgx Apr 24, 2026
e091cef
feat(document): add agent acceleration data to compiled documents
zTgx Apr 24, 2026
e4ef462
docs(architecture): rename index pipeline to compile pipeline with en…
zTgx Apr 24, 2026
88b1b61
docs(markdown): update example code snippets to use correct module paths
zTgx Apr 24, 2026
d8664b6
refactor(compiler): reorganize module declarations and imports
zTgx Apr 24, 2026
e1d0632
docs(HISTORY): add release notes for version 0.1.12
zTgx Apr 24, 2026
6 changes: 3 additions & 3 deletions Cargo.toml
@@ -14,8 +14,8 @@ members = [
    # "vectorless-core/vectorless-query",
    # "vectorless-core/vectorless-agent",
    # "vectorless-core/vectorless-retrieval",
    "vectorless-core/vectorless-index",
    "vectorless-core/vectorless-rerank",
    # "vectorless-core/vectorless-rerank",
    "vectorless-core/vectorless-compiler",
    "vectorless-core/vectorless-primitives",
    "vectorless-core/vectorless-engine",
    "vectorless-core/vectorless-py",
@@ -24,7 +24,7 @@ resolver = "2"

[workspace.package]
version = "0.1.12"
description = "Document Understanding Engine for AI"
description = "Knowing by reasoning, not vectors."
edition = "2024"
authors = ["zTgx <beautifularea@gmail.com>"]
license = "Apache-2.0"
14 changes: 14 additions & 0 deletions HISTORY.md
@@ -1,5 +1,19 @@
# HISTORY

## 0.1.12 (2026-04-24)

- **Compile pipeline**: renamed index pipeline to compile pipeline with passes-based architecture
- **Compiler refactor**: renamed stages to passes, removed deprecated `StageResult` alias and `CustomStageBuilder`
- New backend compilation passes: query routing, reasoning chains, overlap detection, and scoring
- Agent acceleration data added to compiled documents
- LLM-powered cross-document insight extraction in ask module
- Enhanced JSON parsing with proper error handling
- Upgraded minimum Python version to 3.11
- Removed unused modules: agent, memory backend, validation, ReferenceResolver, SufficiencyLevel
- Restructured configuration modules and removed legacy retrieval config
- Simplified storage layer by removing memory backend
- Documentation updates for architecture and compilation pipeline

## 0.1.11 (2026-04-21)

- Project description updated to "reasoning-based document engine"
56 changes: 38 additions & 18 deletions docs/docs/architecture.mdx
@@ -10,7 +10,7 @@ Vectorless transforms documents into hierarchical semantic trees and uses LLM-po

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │────▶│    Index     │────▶│   Storage    │
│   Document   │────▶│   Compile    │────▶│   Storage    │
│   (PDF/MD)   │     │   Pipeline   │     │    (Disk)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
@@ -20,23 +20,42 @@ Vectorless transforms documents into hierarchical semantic trees and uses LLM-po
└──────────────┘     └──────────────┘
```

## Index Pipeline
## Compile Pipeline

The indexing pipeline processes documents through ordered stages:
The compile pipeline processes documents through four phases (Frontend → Analysis → Transform → Backend), each containing independent passes:

| Stage | Priority | Description |
|-------|----------|-------------|
| **Parse** | 10 | Parse document into raw nodes (Markdown headings, PDF pages) |
| **Build** | 20 | Construct arena-based tree with thinning and content merge |
| **Validate** | 22 | Tree integrity checks |
| **Split** | 25 | Split oversized leaf nodes (>4000 tokens) |
| **Enhance** | 30 | Generate LLM summaries (Full, Selective, or Lazy strategy) |
| **Enrich** | 40 | Calculate metadata, page ranges, resolve cross-references |
| **Reasoning Index** | 45 | Build keyword-to-node mappings, synonym expansion, summary shortcuts |
| **Navigation Index** | 50 | Build NavEntry + ChildRoute data for agent navigation |
| **Optimize** | 60 | Final tree optimization |
| Phase | Pass | Priority | Description |
|-------|------|----------|-------------|
| **Frontend** | Parse | 10 | Parse document into raw nodes (Markdown headings, PDF pages) |
| **Frontend** | Build | 20 | Construct arena-based tree with thinning and content merge |
| **Analysis** | Validate | 22 | Tree integrity checks |
| **Transform** | Split | 25 | Split oversized leaf nodes (>4000 tokens) |
| **Transform** | Enhance | 30 | Generate LLM summaries (Full, Selective, or Lazy strategy) |
| **Transform** | Enrich | 40 | Calculate metadata, page ranges, resolve cross-references |
| **Backend** | Reasoning Index | 45 | Build keyword-to-node mappings, synonym expansion, summary shortcuts |
| **Backend** | Concept | 46 | Extract key concepts with section associations |
| **Backend** | Navigation Index | 50 | Build NavEntry + ChildRoute data for agent navigation |
| **Backend** | Route | 52 | Build query routing table (intent routes + concept routes) |
| **Backend** | Chain | 54 | Build reasoning chains from cross-references |
| **Backend** | Overlap | 56 | Detect content overlap between nodes (Jaccard similarity) |
| **Backend** | Score | 58 | Compute evidence quality scores (density, richness, specificity) |
| **Backend** | Verify | 59 | Validate compiled output integrity |
| **Backend** | Optimize | 60 | Final tree optimization |

Each stage is independently configurable. The pipeline supports incremental re-indexing via content fingerprinting.
Each pass is independently configurable. The pipeline supports incremental recompilation via content fingerprinting and checkpoint/resume for fault tolerance.
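
The priority ordering in the table above can be sketched as a sort over registered passes. This is a minimal illustration only; the `Phase`, `Pass`, and `execution_order` names are assumptions, not the actual vectorless-compiler API:

```rust
// Illustrative sketch: passes run in ascending priority order.
// Types and names here are assumptions for illustration.
#[derive(Clone, Copy)]
pub enum Phase {
    Frontend,
    Analysis,
    Transform,
    Backend,
}

pub struct Pass {
    pub name: &'static str,
    pub phase: Phase,
    pub priority: u32,
}

/// Return pass names in execution order (ascending priority).
pub fn execution_order(mut passes: Vec<Pass>) -> Vec<&'static str> {
    passes.sort_by_key(|p| p.priority);
    passes.into_iter().map(|p| p.name).collect()
}
```

Registering passes in any order and sorting by priority is what lets a custom pass slot between built-ins (e.g. priority 47 lands between Concept and Navigation Index).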

### Agent Acceleration Data

The backend passes produce pre-computed acceleration data used by Workers during retrieval:

| Data | Pass | Purpose |
|------|------|---------|
| **QueryRoutingTable** | Route | Maps intents and concepts to scored target nodes |
| **ChainIndex** | Chain | Connects sections via reasoning chains (elaboration, supporting, etc.) |
| **ContentOverlapMap** | Overlap | Flags duplicate/subset/summary overlap between nodes |
| **EvidenceScoreMap** | Score | Ranks nodes by information density and data richness |

This data is injected as Phase 1.5 hints into the Worker's navigation plan, allowing the LLM to make informed routing decisions without additional navigation steps.

## Tree Structure

@@ -104,7 +104,7 @@ When the user specifies document IDs directly, the Orchestrator skips the analys
Each Worker navigates a single document's tree to collect evidence through a command-based loop:

1. **Bird's-eye** — `ls` the root for an overview
2. **Plan** — LLM generates a navigation plan based on keyword index hits
2. **Plan** — LLM generates a navigation plan based on keyword index hits + acceleration data
3. **Navigate** — Loop: LLM selects command → execute → observe result → repeat
4. **Return** — Collected evidence only — no answer synthesis
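
The four steps above can be sketched as a command loop. The `Command` set and planner signature below are assumptions; in the real engine an LLM plays the planner role and `ls`/`find` consult the document tree:

```rust
// Minimal sketch of the Worker's command loop (illustrative types only).
pub enum Command {
    Ls(String),   // list children of a node
    Find(String), // jump via keyword index
    Read(String), // in this sketch, carries the evidence text directly
    Done,
}

/// Loop: ask the planner for the next command, execute it, observe,
/// repeat. Returns collected evidence only; no answer synthesis here.
pub fn navigate(mut plan: impl FnMut(&[String]) -> Command) -> Vec<String> {
    let mut evidence = Vec::new();
    loop {
        match plan(&evidence) {
            Command::Read(text) => evidence.push(text),
            Command::Done => return evidence,
            // Ls / Find would hit the document tree; elided in this sketch.
            Command::Ls(_) | Command::Find(_) => {}
        }
    }
}
```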

Expand Down Expand Up @@ -132,6 +151,7 @@ Workers prioritize keyword-based navigation over manual exploration:
1. When keyword index hits are available, Workers use `find` with the exact keyword to jump directly to relevant sections
2. Workers use `ls` when no keyword hints exist or when discovering unknown structure
3. Workers use `findtree` when the section title pattern is known but not the exact name
4. Pre-computed acceleration data (routes, scores, chains) is injected as Phase 1.5 hints to guide the Worker toward high-value nodes

#### Dynamic Re-planning

@@ -149,11 +169,11 @@ The system returns raw evidence text — no LLM synthesis or paraphrasing. This

## DocCard Catalog

When multiple documents are indexed, Vectorless maintains a lightweight `catalog.bin` containing DocCard metadata for each document. This allows the Orchestrator to analyze and select relevant documents without loading the full document trees — a significant optimization for workspaces with many documents.
When multiple documents are compiled, Vectorless maintains a lightweight `catalog.bin` containing DocCard metadata for each document. This allows the Orchestrator to analyze and select relevant documents without loading the full document trees — a significant optimization for workspaces with many documents.

## Cross-Document Graph

When multiple documents are indexed, Vectorless automatically builds a relationship graph based on shared keywords and Jaccard similarity. The graph is constructed as a background task after each indexing operation.
When multiple documents are compiled, Vectorless automatically builds a relationship graph based on shared keywords and Jaccard similarity. The graph is constructed as a background task after each compilation operation.
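
As a rough sketch, a Jaccard weight over two documents' keyword sets could look like this (hypothetical helper illustrating the idea, not the actual implementation):

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over shared keywords.
/// Hypothetical helper mirroring the graph-weighting idea above.
pub fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0; // two empty keyword sets share nothing
    }
    a.intersection(b).count() as f64 / union as f64
}
```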

## Zero Infrastructure

117 changes: 117 additions & 0 deletions docs/docs/compiler/checkpoint.mdx
@@ -0,0 +1,117 @@
---
sidebar_position: 6
---

# Checkpoint and Resume

Checkpointing allows the pipeline to resume from where it left off after an interruption (crash, timeout, process kill). This is critical for large documents where LLM-enhanced compilation can take minutes.

## How It Works

When `PipelineOptions::checkpoint_dir` is set, the orchestrator saves state to disk after each execution group completes:

```text
Group 0: [ParsePass] → save checkpoint
Group 1: [BuildPass] → save checkpoint
Group 2: [ValidatePass, SplitPass] → save checkpoint
Group 3: [EnhancePass] → save checkpoint ← expensive LLM calls
...
```

On restart, the orchestrator loads the checkpoint and skips already-completed passes.
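
Skipping already-completed passes on resume amounts to a set difference between the full pass order and the checkpoint's `completed_stages`. A sketch (names are illustrative):

```rust
/// Given the full pass order and the names recorded as completed in a
/// checkpoint, return the passes that still need to run. Illustrative only.
pub fn remaining_passes<'a>(all: &[&'a str], completed: &[String]) -> Vec<&'a str> {
    all.iter()
        .filter(|name| !completed.iter().any(|done| done.as_str() == **name))
        .copied()
        .collect()
}
```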

## What's Stored

Each checkpoint contains:

```rust
pub struct PipelineCheckpoint {
    pub doc_id: String,
    pub source_hash: String,           // SHA-256 of source content
    pub processing_version: u32,       // Algorithm version
    pub config_fingerprint: String,    // Hash of PipelineOptions
    pub completed_stages: Vec<String>, // Names of completed passes
    pub context_data: CheckpointContextData,
    pub timestamp: DateTime<Utc>,
}

pub struct CheckpointContextData {
    pub raw_nodes: Vec<RawNode>,    // From ParsePass
    pub tree: Option<DocumentTree>, // From BuildPass
    pub metrics: IndexMetrics,      // Cumulative metrics
    pub page_count: Option<usize>,
    pub line_count: Option<usize>,
    pub description: Option<String>,
}
```

## Validation

Before resuming, the checkpoint is validated against the current input:

| Check | Purpose |
|---|---|
| `source_hash` matches | Source content hasn't changed |
| `processing_version` matches | Algorithm hasn't been upgraded |
| `config_fingerprint` matches | Pipeline options haven't changed |

If any check fails, the checkpoint is discarded and the pipeline starts fresh.
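
The three checks above reduce to one conjunction. The field names follow the `PipelineCheckpoint` struct; the standalone `CheckpointMeta` type and function body are assumptions for illustration:

```rust
// Illustrative subset of the checkpoint fields used for validation.
pub struct CheckpointMeta {
    pub source_hash: String,
    pub processing_version: u32,
    pub config_fingerprint: String,
}

/// A checkpoint is resumable only if all three fingerprints still match.
pub fn is_valid_for_resume(
    cp: &CheckpointMeta,
    source_hash: &str,
    processing_version: u32,
    config_fingerprint: &str,
) -> bool {
    cp.source_hash == source_hash
        && cp.processing_version == processing_version
        && cp.config_fingerprint == config_fingerprint
}
```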

## Lifecycle

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Start     │────▶│     Load     │────▶│    Valid?    │
│   Pipeline   │     │  Checkpoint  │     │              │
└──────────────┘     └──────────────┘     └──┬───────┬───┘
                                             │       │
                                        Yes  │   No  │
                                             │       │
                                    ┌────────▼────┐  ┌▼────────────┐
                                    │ Resume from │  │ Start fresh │
                                    │ completed   │  │             │
                                    │ stages      │  │             │
                                    └──────┬──────┘  └─────────────┘
                                           │
                              ┌────────────▼─────────────┐
                              │ Execute remaining passes │
                              │ Save after each group    │
                              └────────────┬─────────────┘
                                           │
                              ┌────────────▼─────────────┐
                              │ All complete?            │
                              │ → Clear checkpoint file  │
                              └──────────────────────────┘
```

## Configuration

```rust
let options = PipelineOptions::default()
    .with_checkpoint_dir("./workspace/checkpoints");
```

Checkpoints are stored as individual JSON files in the checkpoint directory, one per document (keyed by `doc_id`). On successful completion, the checkpoint file is deleted.

## CheckpointManager API

```rust
let manager = CheckpointManager::new("./checkpoints");

// Save checkpoint
manager.save(&doc_id, &checkpoint)?;

// Load checkpoint
let checkpoint = manager.load(&doc_id);

// Check if valid for resume
let valid = CheckpointManager::is_valid_for_resume(
    &checkpoint,
    &source_hash,
    processing_version,
    &config_fingerprint,
);

// Clear after successful completion
manager.clear(&doc_id)?;
```