Merged
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "vectorless"
-version = "0.1.13"
+version = "0.1.14"
edition = "2024"
authors = ["zTgx <beautifularea@gmail.com>"]
description = "Hierarchical, reasoning-native document intelligence engine"
42 changes: 29 additions & 13 deletions docs/README.md
@@ -1,35 +1,51 @@
# Vectorless Documentation

## Brand Assets
Welcome to the Vectorless documentation.

Logos and icons for use in README, website, and presentations.
## What is Vectorless?

- [assets/brand/](assets/brand/) — Logo variants (light, dark, horizontal, icon)
Vectorless is a **reasoning-native document intelligence engine** that uses LLM-powered tree navigation instead of vector embeddings. It keeps each document's hierarchical structure as a tree and navigates that tree to find relevant content.

## Design Documents
## Key Features

System architecture and core mechanism documentation.
- **Dual Pipeline Architecture** - Separate Index and Retrieval pipelines
- **Pilot System** - LLM-guided navigation with layered fallback
- **Multi-Strategy Retrieval** - Keyword, LLM, and Structure-aware strategies
- **Zero Infrastructure** - No vector database, no embeddings
- **Multi-Format Support** - Markdown, PDF, DOCX, HTML

| Document | Description |
|----------|-------------|
| [architecture.svg](design/architecture.svg) | System architecture diagram |
| [recovery.md](design/recovery.md) | Graceful degradation and error recovery strategy |
## Getting Started

## Development Guides
- [Quick Start Guide](guides/quick-start.md) - Get up and running in 5 minutes

Guides for using and contributing to Vectorless.
## Guides

| Guide | Description |
|-------|-------------|
| [deployment.md](guides/deployment.md) | Production deployment checklist |
| [Quick Start](guides/quick-start.md) | Get up and running quickly |
| [Dual Pipeline](guides/dual-pipeline.md) | Understand Index + Retrieval pipelines |
| [Pilot System](guides/pilot-system.md) | LLM-guided navigation |
| [Multi-Strategy Retrieval](guides/multi-strategy.md) | Keyword, LLM, Structure strategies |

## Design Documents

System architecture and core mechanism documentation.

| Document | Description |
|----------|-------------|
| [pilot.md](design/pilot.md) | Pilot system design |
| [content-aggregation.md](design/content-aggregation.md) | Content aggregation design |
| [client-module.md](design/client-module.md) | Client API design |
| [v3.md](design/v3.md) | Version 3 architecture |

## RFCs (Feature Proposals)

Detailed design documents for new features.

| RFC | Title | Status |
|-----|-------|--------|
| [0001](rfcs/0001-docx-parser.md) | DOCX Parser | Proposed |
| [0001](rfcs/0001-docx-parser.md) | DOCX Parser | Implemented |
| [0002](rfcs/0002-html-parser.md) | HTML Parser | Implemented |

### RFC Process

4 changes: 3 additions & 1 deletion docs/guides/README.md
@@ -1 +1,3 @@
# Guide
# Vectorless Guides

Practical guides for using Vectorless effectively.
152 changes: 152 additions & 0 deletions docs/guides/dual-pipeline.md
@@ -0,0 +1,152 @@
# Understanding the Dual Pipeline

Vectorless uses a **dual pipeline architecture** that separates document processing from retrieval. Documents are indexed into a workspace, and queries are then answered against the stored tree.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         Vectorless Architecture                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────┐     ┌─────────────────────────────┐   │
│   │       INDEX PIPELINE        │     │     RETRIEVAL PIPELINE      │   │
│   │                             │     │                             │   │
│   │  Parse → Build → Enrich     │     │  Analyze → Plan → Search    │   │
│   │    ↓       ↓       ↓        │     │     ↓       ↓       ↓       │   │
│   │  Enhance → Optimize →       │     │  Evaluate (Sufficiency)     │   │
│   │  Persist                    │     │     ↑_______________│       │   │
│   │                             │     │     │ (NeedMoreData)│       │   │
│   └─────────────────────────────┘     └─────────────────────────────┘   │
│                  │                                   ▲                  │
│                  └──────────── Workspace ────────────┘                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

## Index Pipeline

The Index Pipeline processes documents and builds a searchable tree structure.

### Stages

| Stage | Purpose |
|-------|---------|
| **Parse** | Extract content from file (MD, PDF, DOCX, HTML) |
| **Build** | Construct hierarchical document tree |
| **Enrich** | Add metadata, TOC, references |
| **Enhance** | Generate summaries (optional) |
| **Optimize** | Prune, compress, optimize tree |
| **Persist** | Save to workspace storage |
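
To make the Build stage concrete, here is a minimal sketch of turning Markdown headings into a hierarchical tree. It is an illustration only: `Node` and `build_tree` are hypothetical names, not Vectorless types, and the real parser is likely more careful (for example, this sketch would misread a `# comment` line inside a code fence as a heading).

```rust
/// A node in a hypothetical document tree (not the Vectorless type).
#[derive(Debug, Default)]
pub struct Node {
    pub title: String,
    pub level: usize,
    pub text: String,
    pub children: Vec<Node>,
}

/// Builds a heading tree from Markdown: each `#` heading becomes a node
/// nested under the nearest preceding heading of a lower level.
pub fn build_tree(markdown: &str) -> Node {
    // Stack of open nodes; index 0 is the synthetic root (level 0).
    let mut stack = vec![Node::default()];
    for line in markdown.lines() {
        let hashes = line.chars().take_while(|c| *c == '#').count();
        let is_heading = hashes > 0 && line[hashes..].starts_with(' ');
        if is_heading {
            // Close every open node at the same or a deeper level.
            while stack.len() > 1 && stack.last().unwrap().level >= hashes {
                let done = stack.pop().unwrap();
                stack.last_mut().unwrap().children.push(done);
            }
            stack.push(Node {
                title: line[hashes..].trim().to_string(),
                level: hashes,
                ..Node::default()
            });
        } else {
            // Body text belongs to the innermost open node.
            let node = stack.last_mut().unwrap();
            node.text.push_str(line);
            node.text.push('\n');
        }
    }
    // Fold the remaining open nodes back into the root.
    while stack.len() > 1 {
        let done = stack.pop().unwrap();
        stack.last_mut().unwrap().children.push(done);
    }
    stack.pop().unwrap()
}
```

Calling `build_tree` on a document returns a synthetic root whose `children` are the top-level sections. Later stages such as Enrich or Optimize would then operate on this tree.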

### Example

```rust
// Index pipeline is triggered automatically
let doc_id = engine.index(IndexContext::from_path("./manual.md")).await?;

// With summary generation
let doc_id = engine.index(
    IndexContext::from_path("./manual.md")
        .with_options(IndexOptions::new().with_summaries())
).await?;
```

## Retrieval Pipeline

The Retrieval Pipeline processes queries and retrieves relevant content.

### Stages

| Stage | Purpose |
|-------|---------|
| **Analyze** | Analyze query complexity, extract keywords |
| **Plan** | Select retrieval strategy and algorithm |
| **Search** | Navigate tree to find candidates |
| **Evaluate** | Check sufficiency, aggregate content |

### The Evaluate Stage

The Evaluate stage decides whether the retrieved content is sufficient. If it is not, the pipeline searches again:

```text
                ┌─────────────┐
                │   Search    │
                └──────┬──────┘
                       ▼
                ┌─────────────┐
                │  Evaluate   │
                └──────┬──────┘
      ┌────────────────┼────────────────┐
      │                │                │
      ▼                ▼                ▼
  Sufficient   PartialSufficient  Insufficient
      │                │                │
      ▼                ▼                ▼
    Return        More Search      Expand Beam
                 (1 iteration)   (2 iterations)
```
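
The search/evaluate loop in the diagram can be sketched as follows. This is a simplified model under stated assumptions, not the Vectorless implementation: the `Sufficiency` variant names come from the diagram, while `retrieve`, the beam-doubling rule, and the `max_rounds` budget are hypothetical. The per-branch iteration counts shown in the diagram are not modeled.

```rust
/// Outcome of a sufficiency check, named after the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sufficiency {
    Sufficient,
    PartialSufficient,
    Insufficient,
}

/// Runs search/evaluate rounds until the content is sufficient or the
/// round budget is used up. `search` receives the current beam width.
pub fn retrieve<S, E>(mut search: S, evaluate: E, max_rounds: u32) -> Vec<String>
where
    S: FnMut(usize) -> Vec<String>,
    E: Fn(&[String]) -> Sufficiency,
{
    let mut beam_width = 1;
    let mut collected = Vec::new();
    for _ in 0..max_rounds {
        collected.extend(search(beam_width));
        match evaluate(&collected) {
            // Enough content: stop and return it.
            Sufficiency::Sufficient => break,
            // Close, but not enough: search again with the same beam.
            Sufficiency::PartialSufficient => {}
            // Far from enough: widen the beam before searching again.
            Sufficiency::Insufficient => beam_width *= 2,
        }
    }
    collected
}
```

Passing the search and evaluation steps as closures keeps the loop independent of any particular strategy, so the same control flow works whether candidates come from keyword matching or from an LLM.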

### Retrieval Strategies

```rust
// Three built-in strategies:

// 1. Keyword - Fast, exact matching
// 2. LLM - Semantic understanding via Pilot
// 3. Structure - Hierarchy-aware navigation
```
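
As a rough illustration of the Keyword strategy ("fast, exact matching"), the sketch below scores sections by how many query words they contain verbatim. The function names and the scoring rule are assumptions made for this example; the actual strategy in Vectorless may tokenize and rank differently.

```rust
use std::collections::HashSet;

/// Splits text into lowercase alphanumeric words.
fn words(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Counts how many distinct query words occur as whole words in the section.
pub fn keyword_score(query: &str, section: &str) -> usize {
    let section_words = words(section);
    words(query)
        .into_iter()
        .filter(|w| section_words.contains(w))
        .count()
}

/// Returns the index of the best-scoring section, or None if nothing matches.
pub fn best_section(query: &str, sections: &[&str]) -> Option<usize> {
    sections
        .iter()
        .enumerate()
        .map(|(i, s)| (i, keyword_score(query, s)))
        .filter(|&(_, score)| score > 0)
        .max_by_key(|&(_, score)| score)
        .map(|(i, _)| i)
}
```

Because matching is on whole words, a query for `auth` does not match `authentication`. That gap is the kind of case where a semantic (LLM) strategy would be expected to help.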

## The Pilot System

Pilot is the "brain" of the Retrieval Pipeline:

- **Query Analysis**: Understands what the user is asking
- **Context Building**: Creates navigation context from TOC
- **Decision Making**: Decides which branches to explore
- **Fallback**: Algorithm takes over when LLM fails

See [The Pilot System](./pilot-system.md) for details.
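
The fallback behaviour listed above can be sketched as a primary chooser with an algorithmic backup. Everything here is hypothetical: `choose_branch` is not a Vectorless function, and the `primary` closure merely stands in for an LLM call that may fail or return an unusable answer.

```rust
/// Picks a child branch by index. The primary chooser stands in for an
/// LLM call; if it errors or returns an out-of-range index, the
/// deterministic fallback decides instead.
pub fn choose_branch<P, F>(candidates: &[&str], primary: P, fallback: F) -> Option<usize>
where
    P: FnOnce(&[&str]) -> Result<usize, String>,
    F: FnOnce(&[&str]) -> Option<usize>,
{
    if candidates.is_empty() {
        return None;
    }
    match primary(candidates) {
        Ok(index) if index < candidates.len() => Some(index),
        // LLM error or invalid answer: let the algorithm take over.
        _ => fallback(candidates),
    }
}
```

Validating the primary answer before trusting it means a malformed LLM response is handled the same way as a network failure.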

## Data Flow

```
Document ──► Index Pipeline ──► Workspace
Query ──► Retrieval Pipeline ───────┘
                  │
                  ▼
           RetrievalResult
           ├── content
           ├── node_ids
           ├── confidence
           └── trace
```
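
For orientation, the result shape in the diagram might look roughly like the struct below. Only the four field names come from the diagram; the field types and the `content_if_confident` helper are assumptions for illustration and may not match the real `RetrievalResult`.

```rust
/// Hypothetical shape of a retrieval result. Field names follow the
/// diagram above; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct RetrievalResult {
    /// Aggregated text returned to the caller.
    pub content: String,
    /// Identifiers of the tree nodes the content came from.
    pub node_ids: Vec<String>,
    /// Score in the range 0.0..=1.0 (assumed).
    pub confidence: f32,
    /// Human-readable navigation steps (assumed).
    pub trace: Vec<String>,
}

impl RetrievalResult {
    /// Returns the content only if confidence meets the threshold.
    pub fn content_if_confident(&self, threshold: f32) -> Option<&str> {
        if self.confidence >= threshold {
            Some(self.content.as_str())
        } else {
            None
        }
    }
}
```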

## Session-Based Operations

For multi-document operations, use sessions:

```rust
// Create a session
let session = engine.session().await;

// Index multiple documents
session.index(IndexContext::from_path("./doc1.md")).await?;
session.index(IndexContext::from_path("./doc2.md")).await?;

// Query across all documents
let results = session.query_all("What is the architecture?").await?;

for result in results {
    println!("From {}: {}", result.doc_id, result.content);
}
```

## See Also

- [Multi-Strategy Retrieval](./multi-strategy.md)
- [Content Aggregation](./content-aggregation.md)
- [Sufficiency Checking](./sufficiency.md)
89 changes: 89 additions & 0 deletions docs/guides/quick-start.md
@@ -0,0 +1,89 @@
# Quick Start Guide

Get up and running with Vectorless in 5 minutes.

## Prerequisites

- Rust 1.85+ installed (the crate uses the 2024 edition)
- An OpenAI API key (or compatible LLM endpoint)

## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
vectorless = "0.1"
tokio = { version = "1", features = ["full"] }
```

## Basic Usage

```rust
use vectorless::{Engine, IndexContext};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Create an engine with OpenAI
    let engine = Engine::builder()
        .with_workspace("./workspace")
        .with_openai(std::env::var("OPENAI_API_KEY")?)
        .build()
        .await?;

    // 2. Index a document
    let doc_id = engine.index(IndexContext::from_path("./manual.md")).await?;
    println!("Indexed: {}", doc_id);

    // 3. Query the document
    let result = engine.query(&doc_id, "How do I configure authentication?").await?;
    println!("Answer: {}", result.content);

    Ok(())
}
```

## Index from Different Sources

```rust
// From file path
let id1 = engine.index(IndexContext::from_path("./doc.pdf")).await?;

// From string content
let html = "<html><body><h1>Title</h1><p>Content</p></body></html>";
let id2 = engine.index(
    IndexContext::from_content(html, vectorless::parser::DocumentFormat::Html)
        .with_name("webpage")
).await?;

// From bytes (e.g., from HTTP response)
let pdf_bytes = std::fs::read("./document.pdf")?;
let id3 = engine.index(
    IndexContext::from_bytes(pdf_bytes, vectorless::parser::DocumentFormat::Pdf)
).await?;
```

## Index Modes

```rust
use vectorless::IndexMode;

// Default: Skip if already indexed
engine.index(IndexContext::from_path("./doc.md")).await?;

// Force: Always re-index
engine.index(
    IndexContext::from_path("./doc.md").with_mode(IndexMode::Force)
).await?;

// Incremental: Only re-index if changed
engine.index(
    IndexContext::from_path("./doc.md").with_mode(IndexMode::Incremental)
).await?;
```
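
The three modes differ only in when a document is re-indexed. The sketch below models that decision with a content fingerprint. It is a guess at the semantics described in the comments above, not the library's code: `Mode`, `should_index`, and the use of a hash to detect changes are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the index modes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Default,
    Force,
    Incremental,
}

/// Hashes document content so changes can be detected cheaply.
fn fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Decides whether a document needs indexing, and records its
/// fingerprint in `seen` when it does.
pub fn should_index(
    seen: &mut HashMap<String, u64>,
    path: &str,
    content: &str,
    mode: Mode,
) -> bool {
    let current = fingerprint(content);
    let needs_work = match (mode, seen.get(path)) {
        // Never indexed before: always index.
        (_, None) => true,
        // Force: re-index regardless of previous state.
        (Mode::Force, Some(_)) => true,
        // Default: skip anything already indexed.
        (Mode::Default, Some(_)) => false,
        // Incremental: re-index only if the content changed.
        (Mode::Incremental, Some(previous)) => *previous != current,
    };
    if needs_work {
        seen.insert(path.to_string(), current);
    }
    needs_work
}
```

Under this model, Default is the cheapest mode but can serve stale content after a file changes, which is the case Incremental is meant to cover.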

## Next Steps

- [Understanding the Dual Pipeline](./dual-pipeline.md) - Learn how Vectorless works
- [Indexing Documents](./indexing.md) - Deep dive into document indexing
- [Querying Documents](./querying.md) - Advanced query techniques