Labels: core, router, breaking-change, performance
1. Executive Summary
This proposal integrates the REFRAG framework to lower RAG latency by ~30x.
Unlike standard RAG (which returns text), this upgrade transforms ruvector into a Tensor Store & Neural Router.
We will store pre-computed "Chunk Embeddings" and use a lightweight policy to decide—at runtime—whether to send the Compressed Vector or the Raw Text to the LLM.
Reference: REFRAG: Rethinking RAG based Decoding (arXiv:2509.01092)
2. Technical Architecture
A. The "Compress" Layer (Storage)
Problem: `VectorEntry` currently supports only JSON metadata. Serializing/deserializing thousands of float vectors from JSON is too slow for sub-millisecond goals.
Solution: Dedicated binary storage for "Representation Tensors".
- Modify `ruvector-core/src/entry.rs`:
```rust
pub struct RefragEntry {
    pub id: PointId,
    pub search_vector: Vec<f32>, // Standard HNSW Index Vector

    // NEW: The "Chunk Embedding" (Compressed Representation)
    // Must support Zero-Copy access (via rkyv or raw bytes).
    // Typical shape: [1, 768] (RoBERTa) or [1, 4096] (LLaMA-Projected)
    pub representation_tensor: Option<Vec<u8>>,

    // NEW: Projection Matrix ID (Optional)
    // Identifies which LLM space this tensor is aligned to (e.g., "llama3-8b")
    pub alignment_model_id: Option<String>,

    pub payload: Map<String, Value>, // Fallback Raw Text
}
```
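To illustrate the binary path, here is a minimal sketch of writing and reading the tensor bytes without a JSON round-trip. It assumes little-endian f32 layout and copies on read (true zero-copy would come from `rkyv` or an alignment-aware cast); the helper names are illustrative, not the final API:

```rust
// Sketch: round-tripping a representation tensor through the Vec<u8>
// field of RefragEntry. Assumes little-endian f32 byte layout.
fn f32_as_bytes(values: &[f32]) -> Vec<u8> {
    // Flatten each f32 into its 4 little-endian bytes.
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

fn tensor_as_f32(bytes: &[u8]) -> Vec<f32> {
    // Rebuild f32 values from 4-byte chunks; trailing partial chunks
    // (corrupt input) are silently dropped by chunks_exact.
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    let tensor = vec![0.25f32, -1.5, 3.0];
    let bytes = f32_as_bytes(&tensor);
    assert_eq!(tensor_as_f32(&bytes), tensor);
}
```

Note that this version copies on decode; the zero-copy goal stated above would require storing the bytes with f32 alignment guarantees, which is exactly what `rkyv` provides.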
B. The "Sense" Layer (Router & Compute)
Problem: `ruvector-router` selects nodes but doesn't score content. REFRAG requires a Policy Network (a small classifier) to run on every retrieved chunk.
Solution: Embed a lightweight inference engine in the router.
- New Module: `ruvector-policy`
- Input: `representation_tensor` + `query_vector`
- Model: A lightweight classifier (Linear Layer or small MLP).
- Inference Engine: Use `candle-core` (Hugging Face Rust) or pure `ndarray` for simple dot-product policies.
- Output: `Action::Expand` (return Text) or `Action::Compress` (return Tensor).
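A minimal sketch of what such a policy could look like in pure Rust, assuming a single linear layer over the concatenated query/chunk vectors followed by a sigmoid and a threshold. `PolicyModel`, `decide`, and the threshold semantics are illustrative, not the final API:

```rust
// Hedged sketch of the "Sense" policy: one linear layer over
// [query ‖ chunk], thresholded into an action. Names are illustrative.
#[derive(Debug, PartialEq)]
enum Action {
    Expand,   // send raw text to the LLM
    Compress, // send the compressed tensor
}

struct PolicyModel {
    weights: Vec<f32>, // length = query_dim + chunk_dim
    bias: f32,
}

impl PolicyModel {
    fn decide(&self, query: &[f32], chunk: &[f32], threshold: f32) -> Action {
        // Dot product of the concatenated input with the weight vector.
        let logit: f32 = query
            .iter()
            .chain(chunk.iter())
            .zip(&self.weights)
            .map(|(x, w)| x * w)
            .sum::<f32>()
            + self.bias;
        let p = 1.0 / (1.0 + (-logit).exp()); // sigmoid
        if p >= threshold { Action::Expand } else { Action::Compress }
    }
}

fn main() {
    let policy = PolicyModel { weights: vec![1.0; 4], bias: 0.0 };
    let action = policy.decide(&[0.9, 0.8], &[0.7, 0.9], 0.5);
    assert_eq!(action, Action::Expand);
}
```

A pure-Rust dot product like this is trivially auto-vectorized; `candle-core` would only be needed once the policy grows into a real MLP with trained weights.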
C. The "Expand" Layer (API & Projection)
Problem: LLMs cannot ingest raw 768-dim encoder vectors if they expect 4096-dim inputs.
Solution: Implement an on-the-fly Projection Layer (Adapter) if the stored tensors are not pre-projected.
- Task: Add a `Projector` struct.
- If `representation_tensor` dim != `target_llm` dim, apply the linear transformation $W \cdot x + b$ before returning.
- Optimization: Ideally, users store pre-projected tensors to skip this step.
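A hedged sketch of such a `Projector`, implementing $y = W \cdot x + b$ with dense row-major weights; the dimensions are toy-sized here, where a real deployment would be e.g. 768 → 4096:

```rust
// Illustrative Projector: maps a stored encoder vector into the target
// LLM's input dimension via y = W·x + b. Not the final API.
struct Projector {
    weight: Vec<f32>, // out_dim * in_dim, row-major
    bias: Vec<f32>,   // out_dim
    in_dim: usize,
    out_dim: usize,
}

impl Projector {
    fn project(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.in_dim);
        (0..self.out_dim)
            .map(|i| {
                // Row i of W dotted with x, plus bias[i].
                let row = &self.weight[i * self.in_dim..(i + 1) * self.in_dim];
                row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + self.bias[i]
            })
            .collect()
    }
}

fn main() {
    // 2 -> 3 projection with an easy-to-check weight matrix.
    let p = Projector {
        weight: vec![1.0, 0.0,  0.0, 1.0,  1.0, 1.0],
        bias: vec![0.0, 0.0, 0.5],
        in_dim: 2,
        out_dim: 3,
    };
    assert_eq!(p.project(&[2.0, 3.0]), vec![2.0, 3.0, 5.5]);
}
```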
3. Implementation Checklist
Phase 1: Core Data Structures (Breaking Change)
- Add `representation_tensor` (binary blob) to the `VectorEntry` struct.
- Implement `rkyv` support for `RefragEntry` to ensure zero-copy deserialization from disk/memory.
- Update `POST /insert` to accept a `tensor` field alongside `vector` and `metadata`.

Phase 2: The "Sense" Policy
- Create a `PolicyModel` that can load simple weights (e.g., `.safetensors` or `.bin`).
- Add `refrag_threshold` (float) to the search parameters.

Phase 3: Hybrid Output Format
- Return a mixed result list where each hit is either `type: "EXPAND"` with raw text `content`, or `type: "COMPRESS"` with a base64-encoded float32 tensor (`tensor_b64`).
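The hybrid search response from Phase 3, with the field names used in this proposal:

```json
{
  "results": [
    { "id": "doc_1", "score": 0.95, "type": "EXPAND", "content": "The quick brown fox..." },
    { "id": "doc_2", "score": 0.88, "type": "COMPRESS", "tensor_b64": "base64_encoded_float32_array..." }
  ]
}
```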
4. Why this works for RuVector
- Rust/WASM: Python implementations of REFRAG suffer from "Python overhead" in the loop. RuVector can run the "Sense" policy in compiled Rust (SIMD-optimized), making the decision step negligible (<50µs).
- Edge Deployment: The `ruvector-wasm` build can now serve as a "Smart Context Compressor" running entirely in the user's browser, sending only the necessary tokens/tensors to the server LLM.
Acceptance Criteria:
- Search results can return `representation_tensor`.
- `ruvector-router` successfully loads a dummy linear policy.
- Latency benchmark: Router mixes Text/Tensor results based on the threshold.