primd

The predictive turn-cache for real-time conversational AI.

primd is a 10 MB Apache-2.0 Rust runtime that hides retrieval latency inside the STT and TTS phases of your voice agent. It starts retrieving while the user is still speaking, predicts the next turn during TTS playback, and serves repeats from a sub-microsecond cache.

   STT partials  ─►  observe_partial   (speculative scan, scoped by predicted events)
                          │
   end of speech ─►  finalize          (1.6 µs cache hit if speculation matched)
                          │
   TTS playback  ─►  warm_next         (predictor primes next turn's scope)

The problem

Voice AI has a dead-air problem. Every other pipeline component has broken its latency wall:

Component                Best-in-class (2026)
STT (Deepgram Nova-3)    <200 ms
TTS (Cartesia Sonic-3)   40 ms
LLM TTFT (Groq)          <100 ms
Retrieval                50–300 ms

A typical vector-DB query eats the entire ~200 ms voice budget before the LLM even starts generating. That's the pause that breaks the illusion of natural conversation.

The wedge

primd is not "faster semantic search." It's a runtime that time-shifts retrieval out of the critical path by exploiting the timing structure of a voice turn:

  • STT emits partial transcripts → start scanning before the user is done speaking
  • TTS playback creates a 1–3 s window → pre-warm the next turn's scope
  • Conversations follow topic chains → topic-continuation queries are zero-scan repeats

The underlying retrieval primitive is 256-bit binary signatures + SIMD Hamming distance (AVX-512 VPOPCNTDQ, AVX2 VPSHUFB nibble-lookup, scalar fallback). 100k docs scanned in ~100 µs. For high-recall workloads primd hands off to your existing vector DB; its job is to make that handoff happen as rarely as possible.
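
For orientation, a minimal Rust sketch of the scalar form of that primitive, assuming a plain [u64; 4] signature type. The crate's hot path uses the SIMD kernels listed above, and its actual types may differ.

// Illustrative only: a 256-bit signature as four u64 words and the scalar
// Hamming scan that the SIMD kernels accelerate. Not the crate's real types.
#[derive(Clone, Copy)]
pub struct Signature(pub [u64; 4]);

/// Hamming distance: XOR the words, popcount each, sum.
pub fn hamming(a: &Signature, b: &Signature) -> u32 {
    a.0.iter().zip(b.0.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Linear scan over a corpus, returning (index, distance) of the best match.
pub fn nearest(query: &Signature, corpus: &[Signature]) -> Option<(usize, u32)> {
    corpus
        .iter()
        .enumerate()
        .map(|(i, s)| (i, hamming(query, s)))
        .min_by_key(|&(_, d)| d)
}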

Benchmarks

Reproducible: cargo bench --bench voice_session. The workload models a Pipecat session: 200 utterances over 20 canonical intents, 4 partial transcripts per turn, and a 100k-doc corpus across 50 events.

Phase            What it does                         primd p50   primd p95   Naive p50
observe_partial  speculative scan during STT          108 µs      199 µs      –
finalize         end-of-speech retrieval              1.6 µs      2.8 µs      157.8 µs
warm_next        predictor + scope union during TTS   222 µs      289 µs      –

98× faster than a naive SIMD scan at the user-visible finalize. 100% speculative-cache hit rate on this workload — every end-of-speech query was already answered before the user finished talking. The 1.6 µs is a cache lookup, not magic; it's the cost of work already done during observe_partial.

For reference: Qdrant's best-in-class managed vector DB reports 4 ms p50 at 1 M vectors. primd's finalize p50 is ~2,500× faster — because most of the retrieval has already happened before finalize is called.

Full bench report (including an in-memory HNSW baseline via instant-distance, hardware notes, and honest framing) is at docs/benchmarks/bench-report.md.

Where primd sits in the 2026 voice-AI stack

┌─────────────────────────────────────┐
│  THE MODEL                          │
│  Pipecat / LiveKit pipelines        │  ← Pipecat, LiveKit, Vapi, Retell
│  TML-Interaction-Small, Moshi,      │     (listens, speaks, decides when to retrieve)
│  GPT-Realtime, Gemini Live          │
└──────────────┬──────────────────────┘
               │ delegates retrieval
               ↓
┌─────────────────────────────────────┐
│  THE INTEGRATION PROTOCOL           │
│  MoshiRAG <ret> token, Pipecat      │  ← Kyutai, Pipecat, LiveKit
│  FrameProcessor, LiveKit agent      │     (knows when knowledge is needed)
│  plugins, TM's "background agent"   │
└──────────────┬──────────────────────┘
               │ sends query, awaits context
               ↓
┌─────────────────────────────────────┐
│  THE RETRIEVAL BACK-END             │
│  primd                              │  ← us
│                                     │     (actually returns docs, fast)
└─────────────────────────────────────┘

Pipecat's FrameProcessor API and Kyutai's MoshiRAG retrieval contract both leave the retrieval back-end open. primd fills exactly that slot: Apache-2.0, Rust-native, and shipping.

Differentiation

                               primd                  Moss / InferEdge          Qdrant / Pinecone   Mem0 / Letta
Layer                          retrieval runtime      semantic search runtime   vector database     chat memory
Latency target                 < 200 µs at finalize   < 10 ms                   4–50 ms             100–500 ms
STT-partial speculation        yes                    no                        no                  no
TTS-phase pre-warming          yes                    no                        no                  no
Repeat-query delta cache       yes                    no                        no                  no
License                        Apache-2.0             PolyForm Shield *         mixed               mixed
Drops into Pipecat / LiveKit   yes                    yes (design-partner)      via wrappers        via wrappers

* PolyForm Shield forbids "production or competing commercial use" of the free tier.

primd is not competing with vector DBs. It reads from them. It's not competing with chat memory either — Mem0 / Letta handle who-this-user-is across sessions; primd handles what-they-need-right-now in the current turn.

The only direct overlap is Moss / InferEdge (YC F25, sub-10 ms semantic search, Pipecat/LiveKit design partners). Moss has lower raw scan latency for its closed-source semantic search; primd has the entire predictive layer Moss doesn't — speculation during STT, pre-warming during TTS, a delta cache for topic continuation — plus an Apache-2.0 license that lets you ship without a lawyer.

Quick start

cargo build --release -p primd-cli

./target/release/primd index \
  --input examples/faq.jsonl \
  --out /tmp/primd-faq \
  --embedder hashed

./target/release/primd serve \
  --index /tmp/primd-faq \
  --bind 127.0.0.1:8080

Stateless query (the slow path — doesn't use any of the predictive layers):

curl -s -X POST http://127.0.0.1:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"text":"is there a free trial","top_k":3}'

OpenAI-compatible call (the drop-in path for MoshiRAG and any OpenAI-shaped client; see docs/integrations/moshirag.md):

curl -s -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"primd",
    "messages":[{"role":"user","content":"is there a free trial"}],
    "user":"session-id",
    "top_k":3
  }'

Session flow (the path that actually beats vector DBs):

# during STT — feed partial transcripts as they arrive
curl -X POST http://127.0.0.1:8080/session/demo/observe \
  -H 'Content-Type: application/json' \
  -d '{"text":"what about pri","top_k":3}'

# end of speech — near-instant if speculation matched
curl -X POST http://127.0.0.1:8080/session/demo/finalize \
  -H 'Content-Type: application/json' \
  -d '{"text":"what about pricing","top_k":3}'

# during TTS — pre-warm the next likely turn
curl -X POST http://127.0.0.1:8080/session/demo/warm \
  -H 'Content-Type: application/json' \
  -d '{}'

Architecture

Four layers compose in QueryContext (primd-core/src/query_context.rs):

  1. Streaming partials — speculative scan during STT (observe_partial)
  2. Binary signature index — 256-bit signatures, SIMD Hamming scan, event-scoped gather + rescan
  3. Markov predictor + prefetch — pre-warms next likely scope during TTS (warm_next)
  4. Predictive-coding delta cache — sub-µs short-circuit for topic-continuation queries
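
To make the composition concrete, here is a minimal Rust sketch of one turn through those four layers. The struct stubs and method signatures are assumptions for illustration; the real QueryContext lives in primd-core/src/query_context.rs and its API may differ.

// Hypothetical shapes for illustration; not the actual primd-core API.
pub struct Doc { pub id: u64, pub distance: u32 }

pub struct QueryContext { /* signature index, predictor, delta cache, ... */ }

impl QueryContext {
    pub fn observe_partial(&mut self, _partial: &str) { /* speculative scoped scan */ }
    pub fn finalize(&mut self, _text: &str, _top_k: usize) -> Vec<Doc> { Vec::new() }
    pub fn warm_next(&mut self) { /* predictor primes next turn's scope */ }
}

/// One voice turn: speculate during STT, answer at end of speech,
/// pre-warm the next turn during TTS playback.
pub fn handle_turn(ctx: &mut QueryContext, partials: &[&str], final_text: &str) -> Vec<Doc> {
    for p in partials {
        ctx.observe_partial(p);            // layers 1 + 2: scoped scan per partial
    }
    let docs = ctx.finalize(final_text, 3); // cache hit if speculation matched
    ctx.warm_next();                        // layers 3 + 4: prime the predicted next scope
    docs
}

The call order mirrors the curl session flow above: observe during STT, finalize at end of speech, warm during TTS playback.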

Per-layer docs:

Status (v0.1.0)

Shipping:

  • SIMD binary signature search (AVX-512 VPOPCNTDQ → AVX2 VPSHUFB → scalar) over event-scoped corpora
  • QueryContext session runtime: observe / finalize / warm / reset
  • Session-aware HTTP API
  • Variable-order Markov predictor with half-life time decay, persistence, and smoothing (decay idea sketched below)
  • Predictive-coding delta cache
  • pipecat-primd Python package (FrameProcessor + async client)
  • Voice-realistic benchmark harness
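
A minimal sketch of the half-life decay idea behind that predictor, written as a first-order variant; the shipped predictor is variable-order with persistence and smoothing, and these names are illustrative, not the crate's API.

use std::collections::HashMap;

// Transition counts between event scopes, decayed with a half-life so
// stale history fades. First-order for brevity.
pub struct DecayingMarkov {
    pub half_life_secs: f64,
    // (from_event, to_event) -> (decayed weight, time of last update)
    counts: HashMap<(u32, u32), (f64, f64)>,
}

impl DecayingMarkov {
    fn decay(&self, weight: f64, last: f64, now: f64) -> f64 {
        weight * 0.5f64.powf((now - last) / self.half_life_secs)
    }

    /// Record a transition, decaying the old weight before bumping it.
    pub fn observe(&mut self, from: u32, to: u32, now: f64) {
        let hl = self.half_life_secs;
        let entry = self.counts.entry((from, to)).or_insert((0.0, now));
        let decayed = entry.0 * 0.5f64.powf((now - entry.1) / hl);
        *entry = (decayed + 1.0, now);
    }

    /// Most likely next event given the current one, by decayed weight.
    pub fn predict(&self, from: u32, now: f64) -> Option<u32> {
        self.counts
            .iter()
            .filter(|((f, _), _)| *f == from)
            .map(|((_, t), &(w, last))| (*t, self.decay(w, last, now)))
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .map(|(t, _)| t)
    }
}

Observed transitions are re-weighted by 0.5^(Δt / half-life) before being bumped, so recent turns dominate the prediction and the scope pre-warmed during TTS tracks the current topic chain.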

Roadmap (v0.2 — in progress):

  • NextTurnPredictor trait (foundation for swappable predictors)
  • MoshiRAG back-end adapter — OpenAI-compatible /v1/chat/completions endpoint so MoshiRAG can swap its 3 s vLLM call for primd's sub-200 µs response with one env var change
  • Successor Representation predictor in new primd-sr crate (tabular variant + Hybrid SR+Markov wrapper). Enable with primd serve --predictor hybrid.
  • Real per-event HNSW shards (currently the event-scoped path is a SIMD gather + subset rescan, not HNSW; see roadmap)
  • Public benchmark vs Moss + Qdrant + Pinecone at the finalize event
  • A/B harness measuring SR vs Markov speculative-cache hit-rate lift

Roadmap (v0.3 — shipped):

  • livekit-primd packaged plugin
  • ✅ Per-event HNSW shards (instant-distance + lazy build)
  • ✅ Per-user predictor persistence (--sr-state-dir)
  • ✅ WASM / browser target (primd-wasm)
  • Hippocampus DWM foundation (primd-dwm) — first-mover Rust port of arXiv:2602.13594's BitVector rank/select + Random Indexing primitives. Full Signature DWM cold tier in v0.4.

Roadmap (v0.4):

  • Signature DWM cold tier with QueryContext integration for multi-day session memory
  • Trust primitives — confidence scores, dataset freshness, refusal-on-uncertainty
  • Public LoCoMo / LongMemEval long-horizon recall benches

What primd isn't

  • Not a vector database. Reads from yours (Qdrant, pgvector, parquet files).
  • Not chat memory. Use Mem0 or Letta for cross-session user memory. primd retrieves knowledge for the current question.
  • Not an agent framework. Lower in the stack than LangGraph or Pipecat itself.
  • Not a high-recall retrieval system at extreme scale. Binary signatures trade some recall for speed; primd targets 10 k–1 M doc corpora where sub-millisecond response matters more than 99th-percentile semantic recall. For larger or recall-sensitive workloads, primd hands off to your vector DB.

Why now

Three signals from the last six months, none of which existed when primd was first prototyped:

  1. Kyutai shipped moshi-rag (April 2026) — open-sourced the full-duplex retrieval-injection protocol and left the back-end slot generic. Their own docs flag retrieval as the latency bottleneck.
  2. Thinking Machines announced interaction models (May 11, 2026) — frontier validation of the background-agent architecture, with explicit acknowledgement that long-session context management is unsolved.
  3. Production voice (Pipecat, LiveKit, Vapi) is mainstream — not research anymore. Pipecat alone has thousands of production deployments. All of them pause on retrieval.

The category is now legible. The back-end slot is empty. primd is the back-end.

Documentation

Citing

If you use primd in research, cite both the underlying ideas (predictive map / hippocampal retrieval, dual-agent voice RAG) and this implementation:

@misc{salesforce2026voiceagentrag,
  title  = {VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures},
  author = {Salesforce AI Research},
  year   = {2026},
  eprint = {2603.02206},
  archivePrefix = {arXiv}
}

@misc{kyutai2026moshirag,
  title  = {MoshiRAG: Real-Time RAG for Full-Duplex Speech Dialogue},
  author = {Kyutai},
  year   = {2026},
  eprint = {2604.12928},
  archivePrefix = {arXiv}
}

License

Apache-2.0


built by rohan. mumbai. github · x
