Feat understanding#108

Merged
zTgx merged 28 commits into dev from feat-understanding
Apr 23, 2026

@zTgx zTgx commented Apr 23, 2026

Summary

Changes

Checklist

  • Code compiles (cargo build)
  • Tests pass (cargo test --lib --all-features)
  • No new clippy warnings (cargo clippy --all-features)
  • Public APIs have documentation comments
  • Python bindings updated (if Rust API changed)

Notes

zTgx added 28 commits April 22, 2026 22:00
- Add PyAnswer wrapper with content, evidence, confidence, and trace
  getters
- Rename DocumentInfo to reflect "understood" instead of "indexed"
- Change id field to doc_id for clarity
- Replace summary field with concepts extraction
- Update section_count and rename list method to list_documents
- Add Concept class for key concept extraction
- Refactor Engine methods: index->ingest, query->ask, remove->forget
- Remove deprecated streaming and context modules
- Update documentation examples to use new API
- Rename "reasoning-native document intelligence engine" to
  "Document Understanding Engine for AI"
- Update project structure to reflect cargo workspace with
  vectorless-core/vectorless and vectorless-py crates
- Change Engine.query() to Engine.ask() in retrieval flow
- Update build commands to use workspace root
- Adjust development workflow paths to use crates/vectorless
- Update Python binding paths to crates/vectorless-py/src/lib.rs
- Add Python SDK development notes

Update the development workflow documentation to reflect new directory
structure:
- Change feature implementation path from crates/vectorless/src/ to
  vectorless-core/vectorless/src/
- Update Python bindings path from crates/vectorless-py/src/lib.rs to
  vectorless-core/vectorless-py/src/lib.rs
- Update Python SDK path from python/vectorless/ to vectorless/
- Update description from "Reasoning-based Document Engine" to
  "Document Understanding Engine for AI"
- Bump minimum Python requirement from 3.9 to 3.10
- Update author name from "vectorless developers" to "Vectorless"
- Remove Python 3.9 classifier and add clarifying comment for tomli dependency
- Update keywords to better reflect document understanding focus
- Update mypy and ruff target versions to Python 3.10
- Add uv tool configuration with dev dependencies
- Remove exclude directive from Cargo.toml that was excluding docs/,
  examples/, and .* patterns
- Delete all example files including deep_retrieval.rs, events.rs,
  flow.rs, graph.rs, index_directory.rs, index_incremental.rs,
  and index_pdf.rs
- Change Rust examples description from "Rust examples (flow, indexing, pdf, batch, etc.)"
  to "Rust examples (legacy, no new additions)"
- Add Python examples entry with description "Python examples (primary, for Python ecosystem)"
- Remove samples/ directory from documentation
- Replace deprecated IndexContext and QueryContext imports with IngestInput
- Update method names: index/list/remove to ingest/list_documents/forget
- Change query API usage from QueryContext to direct ask method call
- Update terminology from 'indexing' to 'ingesting' and 'understanding'
- Rename challenge queries to challenge questions
- Add confidence, evidence count, and trace steps to output display
- Update variable names from doc.id to doc.doc_id for consistency
- Create HISTORY.md to track project changes and version history
- Add complete history tracking from initial release (0.1.0) to current version (0.1.11)
- Document core principles: "Reason don't vector", "Model fails we fail", "No thought no answer"
- Include detailed changelog covering agent-based retrieval architecture,
  navigation commands, orchestrator supervisor loop, and query understanding pipeline
- Track evolution from basic indexing to reasoning-based document engine
- Document PDF parsing improvements, streaming retrieval, and multi-document support
- Downgrade workspace package version from 0.1.32 to 0.1.12
- Update description from "Reasoning-based Document Engine"
  to "Document Understanding Engine for AI"
- Change pyproject.toml to use dynamic version management instead
  of hardcoded version 0.1.11
- Introduce ConceptExtractionStage that extracts key concepts from
  document topics and summaries using LLM calls
- Add fallback mechanism for keyword-based concept extraction when
  LLM is unavailable
- Implement maximum limits for topics (20) and concepts (15) to
  control processing scope
- Add proper error handling with fallback to basic extraction on LLM
  failures
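
The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `llm_extract`, `keyword_fallback`, and `extract_concepts` are hypothetical names, the LLM call is stubbed to always fail, and the frequency-based keyword ranking is an assumed heuristic. Only the two caps (20 topics, 15 concepts) come from the commit message.

```rust
use std::collections::HashMap;

// Caps taken from the commit message.
const MAX_TOPICS: usize = 20;
const MAX_CONCEPTS: usize = 15;

/// Hypothetical stand-in for the LLM call; it always fails here so that
/// the fallback path is exercised.
fn llm_extract(_topics: &[String]) -> Result<Vec<String>, String> {
    Err("LLM unavailable".to_string())
}

/// Assumed keyword heuristic: rank lowercased words of at least four
/// characters by how often they occur across the topics.
fn keyword_fallback(topics: &[String]) -> Vec<String> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for topic in topics {
        for word in topic.split(|c: char| !c.is_alphanumeric()) {
            if word.chars().count() >= 4 {
                *counts.entry(word.to_lowercase()).or_insert(0) += 1;
            }
        }
    }
    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    // Most frequent first; ties are broken alphabetically so the output
    // is deterministic.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().take(MAX_CONCEPTS).map(|(w, _)| w).collect()
}

/// Try the LLM first and fall back to keywords on any error.
fn extract_concepts(topics: &[String]) -> Vec<String> {
    let topics = &topics[..topics.len().min(MAX_TOPICS)];
    match llm_extract(topics) {
        Ok(mut concepts) => {
            concepts.truncate(MAX_CONCEPTS);
            concepts
        }
        Err(_) => keyword_fallback(topics),
    }
}
```

Applying the caps before and after the LLM call keeps the processing scope bounded on both the success and the failure path.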

feat(document): add utility methods for document navigation

- Add `cat()` method to get node content by ID for agent commands
- Add `find()` method to search nodes by keyword in title/content
- Add `node_title()` method to retrieve node titles by ID
- Add `section_count()` method to get total number of sections
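
As a rough picture of what these four helpers do, here is a sketch over a flat node list. The `Node` and `Document` types, the `usize` IDs, the return types, and the case-insensitive matching in `find` are all assumptions made for illustration; the PR does not show the real signatures.

```rust
/// Hypothetical node record; the real tree type is not shown in this PR.
struct Node {
    id: usize,
    title: String,
    content: String,
}

/// Hypothetical document holding its sections in a flat list.
struct Document {
    nodes: Vec<Node>,
}

impl Document {
    /// Return the content of the node with the given ID, if it exists.
    fn cat(&self, id: usize) -> Option<&str> {
        self.nodes.iter().find(|n| n.id == id).map(|n| n.content.as_str())
    }

    /// Return the IDs of nodes whose title or content contains the keyword
    /// (case-insensitive matching is an assumption).
    fn find(&self, keyword: &str) -> Vec<usize> {
        let needle = keyword.to_lowercase();
        self.nodes
            .iter()
            .filter(|n| {
                n.title.to_lowercase().contains(&needle)
                    || n.content.to_lowercase().contains(&needle)
            })
            .map(|n| n.id)
            .collect()
    }

    /// Return the title of the node with the given ID, if it exists.
    fn node_title(&self, id: usize) -> Option<&str> {
        self.nodes.iter().find(|n| n.id == id).map(|n| n.title.as_str())
    }

    /// Total number of sections in the document.
    fn section_count(&self) -> usize {
        self.nodes.len()
    }
}
```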

refactor(index): integrate concept extraction into pipeline

- Register ConceptExtractionStage in pipeline executor at priority 47
- Update pipeline documentation to reflect new stage ordering
- Modify IndexContext to include concepts field for stage output
- Update PipelineResult to include concepts for final output
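
To illustrate how a numeric priority can fix a stage's position in a pipeline, here is a small sketch. It assumes that lower priorities run first, which the commit message does not state, and the `Stage` and `Executor` types as well as the neighbouring stage names and priorities are hypothetical. Only the value 47 for concept extraction comes from the commit.

```rust
/// Hypothetical stage descriptor: a name plus its ordering priority.
struct Stage {
    name: &'static str,
    priority: u32,
}

/// Hypothetical executor that keeps its stages sorted by priority.
#[derive(Default)]
struct Executor {
    stages: Vec<Stage>,
}

impl Executor {
    /// Register a stage and re-sort; the sort is stable, so stages with
    /// equal priority keep their registration order.
    /// Assumption: a lower priority value means the stage runs earlier.
    fn register(&mut self, name: &'static str, priority: u32) {
        self.stages.push(Stage { name, priority });
        self.stages.sort_by_key(|s| s.priority);
    }

    /// Names of the stages in execution order.
    fn order(&self) -> Vec<&'static str> {
        self.stages.iter().map(|s| s.name).collect()
    }
}
```

With hypothetical neighbours registered at 40 and 50, a stage at 47 lands between them regardless of registration order.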

refactor(storage): persist concepts in indexed documents

- Add concepts field to PersistedDocument struct with serde
  serialization
- Include concepts in IndexedDocument for runtime access
- Ensure concepts are properly saved and loaded during persistence

refactor(indexer): pass concepts through indexing workflow

- Update indexer to transfer concepts from pipeline results to
  indexed documents
- Ensure concepts are properly persisted along with other document
  metadata

Add trace_steps field to Output and WorkerOutput structs to capture
reasoning trace steps during agent navigation. Initialize trace_steps
in constructors and extend WorkerState with trace collection
capabilities.
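
A minimal sketch of this kind of trace collection is shown below. The field name `trace_steps` and the type names `WorkerState` and `WorkerOutput` come from the commit message; representing each step as a `String`, and the `record` and `finish` methods, are assumptions for illustration.

```rust
/// Mutable state carried by a worker during navigation.
#[derive(Debug, Default)]
struct WorkerState {
    // Assumption: each trace step is stored as a plain string.
    trace_steps: Vec<String>,
}

/// Final result handed back by a worker.
#[derive(Debug)]
struct WorkerOutput {
    answer: String,
    trace_steps: Vec<String>,
}

impl WorkerState {
    /// Start with an empty trace, matching "initialize in constructors".
    fn new() -> Self {
        Self { trace_steps: Vec::new() }
    }

    /// Hypothetical helper: append one reasoning step to the trace.
    fn record(&mut self, step: impl Into<String>) {
        self.trace_steps.push(step.into());
    }

    /// Hypothetical helper: move the collected trace into the output.
    fn finish(self, answer: impl Into<String>) -> WorkerOutput {
        WorkerOutput {
            answer: answer.into(),
            trace_steps: self.trace_steps,
        }
    }
}
```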

Add navigation index building and verification stage to pipeline that
validates ingest output reliability by checking tree structure,
document summary, and concept extraction results before persistence.
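
The three checks named above could look something like the following sketch. The `IngestOutput` struct, its fields, and the specific pass/fail rules (non-empty tree, non-blank summary, at least one concept) are assumptions; the commit only says that these three areas are checked before persistence.

```rust
/// Hypothetical summary of what an ingest run produced.
struct IngestOutput {
    section_count: usize,
    summary: String,
    concepts: Vec<String>,
}

/// Collect every problem found instead of stopping at the first one,
/// so the caller can report them together.
fn verify(output: &IngestOutput) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    if output.section_count == 0 {
        problems.push("tree has no sections".to_string());
    }
    if output.summary.trim().is_empty() {
        problems.push("document summary is empty".to_string());
    }
    if output.concepts.is_empty() {
        problems.push("no concepts were extracted".to_string());
    }
    if problems.is_empty() {
        Ok(())
    } else {
        Err(problems)
    }
}
```

Running a check like this before persistence means an unreliable ingest result can be rejected or retried rather than stored.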

Refactor document loading to use unified Document structure and
implement trace collection in agent state management.
- Split the main crate into multiple specialized crates including
  vectorless-error, vectorless-document, vectorless-config,
  vectorless-utils, vectorless-scoring, vectorless-graph,
  vectorless-events, vectorless-metrics, vectorless-llm,
  vectorless-storage, vectorless-query, vectorless-index,
  vectorless-agent, vectorless-retrieval, vectorless-rerank,
  and vectorless-engine
- Add comprehensive command parsing system for agent navigation
  with support for ls, cd, cat, find, grep, head, findtree, wc,
  pwd, check, and done commands
- Implement quote-stripping and multi-level target resolution
  with exact, case-insensitive, substring, and numeric matching
- Add extended target resolution with deep search capability
  up to depth 4 using BFS algorithm
- Create agent configuration system with worker and answer
  pipeline settings including navigation budgets and evidence
  caps
- Implement structured output types for agent results including
  evidence collection, metrics tracking, and confidence scoring
- Add read-only context wrappers for accessing document
  navigation indices, content trees, and reasoning indexes
- Include comprehensive test suite for command parsing and
  target resolution functionality
- Add Python script to fix crate:: import references across
  split modules
- Add vectorless-rerank dependency to vectorless-agent
- Introduce Evidence type in vectorless-rerank and re-export it from
  vectorless-agent instead of defining locally
- Move query-related types (EvidenceItem, QueryMetrics, QueryResultItem,
  Confidence) from vectorless-engine to vectorless-retrieval
- Update imports across multiple modules to use correct paths after
  refactoring
- Add necessary dependencies (regex, serde_json) and remove
  vectorless-agent dependency from vectorless-rerank
- Update module visibility for config, memo, and throttle in
  vectorless-llm
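
The agent command parsing and multi-level target resolution listed in the commits above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: only a subset of the commands is modelled, the `Command` enum and function names are hypothetical, the tiers are tried in the order the commit lists them (exact, case-insensitive, substring, numeric), and numeric targets are assumed to be 1-based positions.

```rust
/// Subset of the navigation commands named in the commit message.
#[derive(Debug, PartialEq)]
enum Command {
    Ls,
    Pwd,
    Done,
    Cd(String),
    Cat(String),
    Find(String),
    Unknown(String),
}

/// Remove one pair of matching single or double quotes, if present.
fn strip_quotes(s: &str) -> &str {
    let s = s.trim();
    for q in ['"', '\''] {
        if s.len() >= 2 && s.starts_with(q) && s.ends_with(q) {
            return &s[1..s.len() - 1];
        }
    }
    s
}

/// Split a line into verb and argument, then map it to a command.
fn parse(line: &str) -> Command {
    let line = line.trim();
    let (verb, rest) = match line.split_once(char::is_whitespace) {
        Some((v, r)) => (v, strip_quotes(r).to_string()),
        None => (line, String::new()),
    };
    match verb {
        "ls" => Command::Ls,
        "pwd" => Command::Pwd,
        "done" => Command::Done,
        "cd" => Command::Cd(rest),
        "cat" => Command::Cat(rest),
        "find" => Command::Find(rest),
        other => Command::Unknown(other.to_string()),
    }
}

/// Resolve a target against child titles, trying stricter tiers first.
fn resolve(target: &str, children: &[&str]) -> Option<usize> {
    if target.is_empty() {
        return None;
    }
    // Tier 1: exact match.
    if let Some(i) = children.iter().position(|c| *c == target) {
        return Some(i);
    }
    let lower = target.to_lowercase();
    // Tier 2: case-insensitive match.
    if let Some(i) = children.iter().position(|c| c.to_lowercase() == lower) {
        return Some(i);
    }
    // Tier 3: case-insensitive substring match.
    if let Some(i) = children.iter().position(|c| c.to_lowercase().contains(&lower)) {
        return Some(i);
    }
    // Tier 4: numeric position (assumed 1-based).
    target
        .parse::<usize>()
        .ok()
        .filter(|n| (1..=children.len()).contains(n))
        .map(|n| n - 1)
}
```

Trying the strictest tier first means a loosely typed target such as `budget` still resolves, while an exact title always wins over a partial one.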

This change centralizes query result types in vectorless-retrieval
module and introduces proper re-ranking capabilities through the new
vectorless-rerank module.

BREAKING CHANGE: Evidence type is now re-exported from
vectorless-rerank::types instead of being defined in vectorless-agent.
- Move tempfile to dev-dependencies in Cargo.toml
- Update import path from crate::llm::throttle to crate::throttle
  in client.rs and executor.rs test modules
- Fixes incorrect module path references in test code

Change import from crate::document::DocumentTree to
crate::tree::DocumentTree across multiple test modules to align
with updated module structure.

BREAKING CHANGE: This change updates the internal module
structure and import paths for DocumentTree.

Change the type annotation from crate::DocumentTree to
vectorless_document::DocumentTree for consistency with module structure.

feat(retriever): import additional types and update module paths

Import DocContext, Scope, and WorkspaceContext from vectorless_agent
config module and update QueryResult import path from crate::client
to super::types.

refactor(retriever): remove redundant module prefix in type usage

Replace agent::DocContext with DocContext and update agent::Scope
and agent::WorkspaceContext to their respective unqualified imports.

chore(retrieval): add indextree as dev dependency

Add indextree to dev-dependencies section of Cargo.toml for
workspace configuration.
- Move some import statements to improve code readability and
  maintain consistent ordering
- Reorder some field declarations and function calls to follow
  standard Rust formatting conventions
- Remove unused pub(crate) mod test_support from vectorless-engine
- Remove unused test_support.rs file as it's no longer needed
- Adjust some long lines to fit within 100 character limit
- Move DocumentGraphConfig export to proper location in types module
- Reorder some struct field initializations for better readability
- Replace direct doc.as_context() call with explicit DocContext
  construction using individual fields (tree, nav_index,
  reasoning_index, doc_name)
- Update concurrency configuration to use proper type conversion
  from throttle config

refactor(graph): consolidate configuration in vectorless-config

- Remove local DocumentGraphConfig implementation
- Add vectorless-config dependency to vectorless-graph
- Re-export DocumentGraphConfig from vectorless_config as single
  source of truth

refactor(python): update module imports to use vectorless_engine

- Replace ::vectorless imports with ::vectorless_engine in python
  bindings for Answer, Config, DocumentInfo, Engine, Error, Graph,
  and Metrics types
- This ensures consistent usage of the engine module across Python
  API
- Add re-exports of Config from vectorless_config
- Add re-exports of core document types (Answer, Concept, DocumentInfo, etc.)
- Add re-exports of error handling types (Error, Result)
- Add re-exports of event types (EventEmitter, IndexEvent, QueryEvent, etc.)
- Add re-exports of graph types (DocumentGraph, DocumentGraphNode, etc.)
- Add re-exports of metrics types (LlmMetricsReport, MetricsReport, etc.)
- Add re-export of DocumentTree from vectorless_document
…raph crate

- Remove tracing and tokio dependencies from vectorless-config
- Add vectorless-graph as dependency instead
- Remove graph module from types and update import to use vectorless-graph
- Move DocumentGraphConfig re-export to use vectorless_graph crate

refactor(vectorless-graph): move DocumentGraphConfig implementation to graph crate

- Remove vectorless-config dependency from vectorless-graph
- Implement DocumentGraphConfig directly in vectorless-graph crate
- Include all configuration fields and methods for document graph settings
- Maintain same API interface while moving implementation to correct location

BREAKING CHANGE: Remove the entire vectorless core module including:
- Cargo.toml configuration and dependencies
- Single document challenge example that tested deep reasoning
- Agent command parsing system with navigation commands (ls, cd, cat, find, grep, etc.)
- Target resolution logic for document tree navigation
- All associated tests and implementations

This removes the core vectorless functionality that enabled AI-powered
document navigation and reasoning capabilities.
…ned crates

Update CLAUDE.md to reflect the new architecture with 17 fine-grained Rust
crates instead of the previous monolithic structure. Add detailed tree view
of the new crate organization and dependency layers showing compilation
isolation benefits.

Remove the fix_imports.py script that was used for the crate splitting
process as it's no longer needed.

Update development workflow instructions to reflect the new multi-crate
structure and add information about cargo test counts and specific crate
building commands.

Move DocumentGraphConfig import to maintain consistent ordering
and improve code organization.

fix(engine): format ConcurrencyConfig initialization

Properly format the ConcurrencyConfig initialization across
multiple lines to improve readability.

refactor(lib): consolidate DocumentTree export

Move DocumentTree export to correct location in engine lib to
avoid duplicate exports and maintain proper module structure.

refactor(python): format graph module imports

Reformat imports in python graph module to follow consistent
multi-line style for better readability.

Remove complete examples directory containing various demonstration
files including batch indexing, document management, error handling,
index metrics, PDF indexing, and session walkthrough examples.

The entire examples folder with all subdirectories and files has been
removed, including:
- README.md files explaining each example
- main.py implementation files
- Directory indexing and management examples
- Error handling demonstrations
- Index metrics and PDF indexing examples
- Session API walkthrough materials

Add a comprehensive example demonstrating advanced document indexing
and querying capabilities. The example includes a realistic technical
report about quantum computing research with complex inter-lab
dependencies, financial data, and technical specifications.

The challenge demonstrates the engine's ability to handle queries
requiring deep navigation through the document tree, cross-referencing
details across distant sections, and extracting information from
nested structures rather than surface-level keyword matching.

Includes five challenge questions that test:
- Cross-referencing device characterization needs with equipment specs
- Tracing dependency chains between research milestones
- Calculating impacts from distributed data points
- Complex multi-step reasoning across document sections

vercel Bot commented Apr 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vectorless | Ready | Preview, Comment | Apr 23, 2026 3:25am |

@zTgx zTgx merged commit 4cc38f4 into dev Apr 23, 2026
2 checks passed