
[RFC] Architecture Upgrade: Full REFRAG Pipeline (Compress-Sense-Expand) with Tensor Storage #10

@JLMA-Agentic-Ai

Description


Labels: core, router, breaking-change, performance

1. Executive Summary

This proposal integrates the REFRAG framework to reduce RAG time-to-first-token latency by up to ~30x.
Unlike standard RAG (which returns text), this upgrade transforms ruvector into a Tensor Store & Neural Router.
We will store pre-computed "Chunk Embeddings" and use a lightweight policy to decide—at runtime—whether to send the Compressed Vector or the Raw Text to the LLM.

Reference: REFRAG: Rethinking RAG based Decoding (arXiv:2509.01092)


2. Technical Architecture

A. The "Compress" Layer (Storage)

Problem: VectorEntry currently supports only JSON metadata. Serializing/deserializing thousands of float vectors through JSON is too slow for sub-millisecond latency goals.
Solution: Introduce dedicated binary storage for "Representation Tensors".

  • Modify ruvector-core/src/entry.rs:
    use serde_json::{Map, Value}; // existing JSON payload types

    pub struct RefragEntry {
        pub id: PointId,
        pub search_vector: Vec<f32>, // Standard HNSW index vector

        // NEW: The "Chunk Embedding" (compressed representation).
        // Must support zero-copy access (via rkyv or raw bytes).
        // Typical shape: [1, 768] (RoBERTa) or [1, 4096] (LLaMA-projected)
        pub representation_tensor: Option<Vec<u8>>,

        // NEW: Projection matrix ID (optional).
        // Identifies which LLM space this tensor is aligned to (e.g., "llama3-8b")
        pub alignment_model_id: Option<String>,

        pub payload: Map<String, Value>, // Fallback raw text
    }
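As a sketch of how the `representation_tensor` bytes could be written and read without JSON overhead (assuming little-endian f32 storage; these helper names are illustrative, not existing ruvector APIs — the real implementation may use rkyv archives instead):

```rust
/// Encode an f32 tensor as little-endian raw bytes for `representation_tensor`.
fn tensor_to_bytes(tensor: &[f32]) -> Vec<u8> {
    tensor.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Decode the stored bytes back into f32 values. Returns None if the
/// byte length is not a multiple of 4 (a corrupted or truncated blob).
fn bytes_to_tensor(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Fixing the endianness at the storage boundary keeps the blob portable across nodes; true zero-copy access would additionally require alignment guarantees, which is what rkyv provides.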

B. The "Sense" Layer (Router & Compute)

Problem: ruvector-router selects nodes but doesn't score content. REFRAG requires a Policy Network (a small classifier) to run on every retrieved chunk.
Solution: Embed a lightweight inference engine in the router.

  • New Module: ruvector-policy
    • Input: representation_tensor + query_vector
    • Model: A lightweight classifier (Linear Layer or small MLP).
    • Inference Engine: Use candle-core (HuggingFace Rust) or pure ndarray for simple dot-product policies.
    • Output: Action::Expand (return Text) or Action::Compress (return Tensor).
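A minimal version of the "Sense" decision, assuming the simplest policy named above (a pure dot product, no candle dependency). The `threshold` parameter mirrors the `refrag_threshold` proposed in Phase 2; the assumption that a high relevance score warrants expansion is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum RefragAction {
    Expand,   // return raw text to the LLM
    Compress, // return the compressed tensor
}

/// Score a chunk against the query with a plain dot product and compare
/// against `threshold`. Assumes both tensors are already L2-normalized
/// and have equal length.
fn decide_action(chunk: &[f32], query: &[f32], threshold: f32) -> RefragAction {
    let score: f32 = chunk.iter().zip(query).map(|(c, q)| c * q).sum();
    if score >= threshold {
        // Highly relevant: the LLM should see the full text.
        RefragAction::Expand
    } else {
        RefragAction::Compress
    }
}
```

A learned MLP policy would slot into the same signature; only the scoring line changes.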

C. The "Expand" Layer (API & Projection)

Problem: LLMs cannot ingest raw 768-dim encoder vectors if they expect 4096-dim inputs.
Solution: Implement an on-the-fly Projection Layer (Adapter) if the stored tensors are not pre-projected.

  • Task: Add Projector struct.
    • If representation_tensor dim != target_llm dim, apply linear transformation $W \cdot x + b$ before returning.
    • Optimization: Ideally, users store pre-projected tensors to skip this step.
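A sketch of the Projector adapter described above, computing $W \cdot x + b$ with plain nested vectors (illustrative struct; in practice the weights would be loaded from a .safetensors file and the matmul delegated to candle or ndarray):

```rust
/// On-the-fly linear adapter mapping a stored encoder tensor
/// (e.g. 768-dim) into the target LLM's hidden size (e.g. 4096).
struct Projector {
    weight: Vec<Vec<f32>>, // shape [out_dim][in_dim]
    bias: Vec<f32>,        // shape [out_dim]
}

impl Projector {
    /// Computes y = W·x + b. Returns None when the input dimension
    /// does not match the projection matrix.
    fn project(&self, x: &[f32]) -> Option<Vec<f32>> {
        if self.weight.first().map(|row| row.len()) != Some(x.len()) {
            return None;
        }
        Some(
            self.weight
                .iter()
                .zip(&self.bias)
                .map(|(row, b)| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b)
                .collect(),
        )
    }
}
```

The dimension check is what lets the router fall back to raw text when a tensor is aligned to the wrong `alignment_model_id`.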

3. Implementation Checklist

Phase 1: Core Data Structures (Breaking Change)

  • Schema Update: Add representation_tensor (binary blob) to the entry schema (the RefragEntry struct above, superseding the current VectorEntry).
  • Serialization: Implement rkyv support for RefragEntry to ensure zero-copy deserialization from disk/memory.
  • Ingestion API: Update POST /insert to accept a tensor field alongside vector and metadata.

Phase 2: The "Sense" Policy

  • Policy Loader: Create a trait PolicyModel that can load simple weights (e.g., .safetensors or .bin).
  • Scoring Logic: Implement the function:
    fn decide_action(chunk: &Tensor, query: &Tensor, policy: &PolicyModel) -> RefragAction;
  • Configuration: Add refrag_threshold (float) to the search parameters.

Phase 3: Hybrid Output Format

  • API Response: The search endpoint must return a Union Type (Pseudo-code):
    {
      "results": [
        {
          "id": "doc_1",
          "score": 0.95,
          "type": "EXPAND",
          "content": "The quick brown fox..."
        },
        {
          "id": "doc_2",
          "score": 0.88,
          "type": "COMPRESS",
          "tensor_b64": "base64_encoded_float32_array..." 
        }
      ]
    }
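On the Rust side, the union type above could be modeled as a tagged enum. This sketch keeps the field and variant names from the pseudo-code but leaves serde serialization (tagging on "type") out of scope:

```rust
/// Rust-side model of the hybrid search result, mirroring the JSON union.
enum RefragContent {
    Expand { content: String },      // "type": "EXPAND" -> raw text
    Compress { tensor_b64: String }, // "type": "COMPRESS" -> base64 tensor
}

struct RefragResult {
    id: String,
    score: f32,
    content: RefragContent,
}

/// The "type" discriminator the API would emit for a result.
fn type_tag(result: &RefragResult) -> &'static str {
    match result.content {
        RefragContent::Expand { .. } => "EXPAND",
        RefragContent::Compress { .. } => "COMPRESS",
    }
}
```

Making the variants carry different payloads means a client cannot accidentally read `content` from a compressed result; the type system enforces the union.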

4. Why this works for RuVector

  • Rust/WASM: Python implementations of REFRAG suffer from "Python overhead" in the loop. RuVector can run the "Sense" policy in compiled Rust (SIMD-optimized), making the decision step negligible (<50µs).
  • Edge Deployment: The ruvector-wasm build can now serve as a "Smart Context Compressor" running entirely in the user's browser, sending only the necessary tokens/tensors to the server LLM.

Acceptance Criteria:

  • Can store a 768-dim binary vector in representation_tensor.
  • ruvector-router successfully loads a dummy linear policy.
  • Search returns mixed Text/Tensor results based on threshold.
