
[RFC] Architecture Upgrade: Full REFRAG Pipeline (Compress-Sense-Expand) with Tensor Storage #10

@JLMA-Agentic-Ai

Description


Labels: core, router, breaking-change, performance

1. Executive Summary

This proposal integrates the REFRAG framework to reduce RAG time-to-first-token latency by up to ~30x.
Unlike standard RAG (which returns text), this upgrade transforms ruvector into a Tensor Store & Neural Router.
We will store pre-computed "Chunk Embeddings" and use a lightweight policy to decide—at runtime—whether to send the Compressed Vector or the Raw Text to the LLM.

Reference: REFRAG: Rethinking RAG based Decoding (arXiv:2509.01092)


2. Technical Architecture

A. The "Compress" Layer (Storage)

Problem: VectorEntry currently supports only JSON metadata. Serializing/deserializing thousands of float vectors through JSON is too slow for sub-millisecond latency goals.
Solution: Introduce dedicated binary storage for "Representation Tensors".

  • Modify ruvector-core/src/entry.rs:
    use serde_json::{Map, Value}; // existing JSON payload types

    pub struct RefragEntry {
        pub id: PointId,
        pub search_vector: Vec<f32>, // Standard HNSW index vector

        // NEW: The "Chunk Embedding" (compressed representation).
        // Must support zero-copy access (via rkyv or raw bytes).
        // Typical shape: [1, 768] (RoBERTa) or [1, 4096] (LLaMA-projected)
        pub representation_tensor: Option<Vec<u8>>,

        // NEW: Projection matrix ID (optional).
        // Identifies which LLM space this tensor is aligned to (e.g., "llama3-8b")
        pub alignment_model_id: Option<String>,

        pub payload: Map<String, Value>, // Fallback raw text
    }
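As a sketch of how the `representation_tensor` bytes could be written and read without JSON overhead (assuming little-endian f32 storage; these helper names are illustrative, not existing ruvector APIs — the real implementation may use rkyv archives instead):

```rust
/// Encode an f32 tensor as little-endian raw bytes for `representation_tensor`.
fn tensor_to_bytes(tensor: &[f32]) -> Vec<u8> {
    tensor.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Decode the stored bytes back into f32 values. Returns None if the
/// byte length is not a multiple of 4 (a corrupted or truncated blob).
fn bytes_to_tensor(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Fixing the endianness at the storage boundary keeps the blob portable across nodes; true zero-copy access would additionally require alignment guarantees, which is what rkyv provides.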

B. The "Sense" Layer (Router & Compute)

Problem: ruvector-router selects nodes but doesn't score content. REFRAG requires a Policy Network (a small classifier) to run on every retrieved chunk.
Solution: Embed a lightweight inference engine in the router.

  • New Module: ruvector-policy
    • Input: representation_tensor + query_vector
    • Model: A lightweight classifier (Linear Layer or small MLP).
    • Inference Engine: Use candle-core (HuggingFace Rust) or pure ndarray for simple dot-product policies.
    • Output: Action::Expand (return Text) or Action::Compress (return Tensor).
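A minimal version of the "Sense" decision, assuming the simplest policy named above (a pure dot product, no candle dependency). The `threshold` parameter mirrors the `refrag_threshold` proposed in Phase 2; the assumption that a high relevance score warrants expansion is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum RefragAction {
    Expand,   // return raw text to the LLM
    Compress, // return the compressed tensor
}

/// Score a chunk against the query with a plain dot product and compare
/// against `threshold`. Assumes both tensors are already L2-normalized
/// and have equal length.
fn decide_action(chunk: &[f32], query: &[f32], threshold: f32) -> RefragAction {
    let score: f32 = chunk.iter().zip(query).map(|(c, q)| c * q).sum();
    if score >= threshold {
        // Highly relevant: the LLM should see the full text.
        RefragAction::Expand
    } else {
        RefragAction::Compress
    }
}
```

A learned MLP policy would slot into the same signature; only the scoring line changes.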

C. The "Expand" Layer (API & Projection)

Problem: LLMs cannot ingest raw 768-dim encoder vectors if they expect 4096-dim inputs.
Solution: Implement an on-the-fly Projection Layer (Adapter) if the stored tensors are not pre-projected.

  • Task: Add Projector struct.
    • If representation_tensor dim != target_llm dim, apply linear transformation $W \cdot x + b$ before returning.
    • Optimization: Ideally, users store pre-projected tensors to skip this step.
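A sketch of the Projector adapter described above, computing $W \cdot x + b$ with plain nested vectors (illustrative struct; in practice the weights would be loaded from a .safetensors file and the matmul delegated to candle or ndarray):

```rust
/// On-the-fly linear adapter mapping a stored encoder tensor
/// (e.g. 768-dim) into the target LLM's hidden size (e.g. 4096).
struct Projector {
    weight: Vec<Vec<f32>>, // shape [out_dim][in_dim]
    bias: Vec<f32>,        // shape [out_dim]
}

impl Projector {
    /// Computes y = W·x + b. Returns None when the input dimension
    /// does not match the projection matrix.
    fn project(&self, x: &[f32]) -> Option<Vec<f32>> {
        if self.weight.first().map(|row| row.len()) != Some(x.len()) {
            return None;
        }
        Some(
            self.weight
                .iter()
                .zip(&self.bias)
                .map(|(row, b)| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>() + b)
                .collect(),
        )
    }
}
```

The dimension check is what lets the router fall back to raw text when a tensor is aligned to the wrong `alignment_model_id`.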

3. Implementation Checklist

Phase 1: Core Data Structures (Breaking Change)

  • Schema Update: Add representation_tensor (binary blob) to the entry schema (the RefragEntry struct above, superseding the current VectorEntry).
  • Serialization: Implement rkyv support for RefragEntry to ensure zero-copy deserialization from disk/memory.
  • Ingestion API: Update POST /insert to accept a tensor field alongside vector and metadata.

Phase 2: The "Sense" Policy

  • Policy Loader: Create a trait PolicyModel that can load simple weights (e.g., .safetensors or .bin).
  • Scoring Logic: Implement the function:
    fn decide_action(chunk: &Tensor, query: &Tensor, policy: &PolicyModel) -> RefragAction;
  • Configuration: Add refrag_threshold (float) to the search parameters.

Phase 3: Hybrid Output Format

  • API Response: The search endpoint must return a Union Type (Pseudo-code):
    {
      "results": [
        {
          "id": "doc_1",
          "score": 0.95,
          "type": "EXPAND",
          "content": "The quick brown fox..."
        },
        {
          "id": "doc_2",
          "score": 0.88,
          "type": "COMPRESS",
          "tensor_b64": "base64_encoded_float32_array..." 
        }
      ]
    }
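On the Rust side, the union type above could be modeled as a tagged enum. This sketch keeps the field and variant names from the pseudo-code but leaves serde serialization (tagging on "type") out of scope:

```rust
/// Rust-side model of the hybrid search result, mirroring the JSON union.
enum RefragContent {
    Expand { content: String },      // "type": "EXPAND" -> raw text
    Compress { tensor_b64: String }, // "type": "COMPRESS" -> base64 tensor
}

struct RefragResult {
    id: String,
    score: f32,
    content: RefragContent,
}

/// The "type" discriminator the API would emit for a result.
fn type_tag(result: &RefragResult) -> &'static str {
    match result.content {
        RefragContent::Expand { .. } => "EXPAND",
        RefragContent::Compress { .. } => "COMPRESS",
    }
}
```

Making the variants carry different payloads means a client cannot accidentally read `content` from a compressed result; the type system enforces the union.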

4. Why this works for RuVector

  • Rust/WASM: Python implementations of REFRAG suffer from "Python overhead" in the loop. RuVector can run the "Sense" policy in compiled Rust (SIMD-optimized), making the decision step negligible (<50µs).
  • Edge Deployment: The ruvector-wasm build can now serve as a "Smart Context Compressor" running entirely in the user's browser, sending only the necessary tokens/tensors to the server LLM.

Acceptance Criteria:

  • Can store a 768-dim binary vector in representation_tensor.
  • ruvector-router successfully loads a dummy linear policy.
  • Search returns mixed Text/Tensor results based on threshold.
