feat: support remote Ollama embeddings via OLLAMA_EMBED_URL #490
paralizeer wants to merge 1 commit into `tobi:main`
Conversation
When OLLAMA_EMBED_URL is set, all embedding and tokenization operations use the remote Ollama HTTP API instead of node-llama-cpp. This enables QMD on platforms without local GPU/Vulkan support (ARM64 VPS, Docker containers, CI runners) and with remote Ollama instances (Tailscale, LAN, Docker networks).

Changes:
- Add `ollamaEmbed()` and `ollamaEmbedBatch()` helper functions using the Ollama `/api/embed` endpoint
- Patch `getEmbedding()` to bypass node-llama-cpp when OLLAMA_EMBED_URL is set
- Patch `generateEmbeddings()` with a dedicated Ollama fast path that skips `withLLMSessionForLlm` entirely
- Patch `expandQuery()` to skip LLM-based HYDE query expansion (passes the raw query to vector search)
- Patch `chunkDocumentByTokens()` to use char-based estimation instead of the local tokenizer
- Patch the `vsearch` and `query` CLI commands to skip the `withLLMSession` wrapper

Environment variables:
- `OLLAMA_EMBED_URL` - Ollama server URL (e.g. `http://your-ollama:11434`)
- `OLLAMA_EMBED_MODEL` - Model name (default: `nomic-embed-text`)

Tested on an ARM64 Oracle Cloud VPS with `qwen3-embedding:0.6b` on a remote Ollama instance via Tailscale. 7,100+ documents indexed successfully.
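A minimal sketch of the two helpers named in the commit message, assuming the standard Ollama `/api/embed` endpoint (which accepts a string or an array under `input` and returns an `embeddings` array of float vectors). The internals here are illustrative, not the PR's actual code:

```typescript
// Sketch of ollamaEmbed()/ollamaEmbedBatch() against Ollama's /api/embed.
// OLLAMA_EMBED_URL / OLLAMA_EMBED_MODEL follow the PR's env vars.
const OLLAMA_URL = process.env.OLLAMA_EMBED_URL ?? "http://localhost:11434";
const OLLAMA_MODEL = process.env.OLLAMA_EMBED_MODEL ?? "nomic-embed-text";

// Pure helper: build the /api/embed request body.
export function buildEmbedPayload(input: string | string[], model = OLLAMA_MODEL) {
  return { model, input };
}

export async function ollamaEmbedBatch(texts: string[]): Promise<number[][]> {
  const res = await fetch(`${OLLAMA_URL}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbedPayload(texts)),
  });
  if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}

export async function ollamaEmbed(text: string): Promise<number[]> {
  // Single-text case reuses the batch path with a one-element array.
  const [vec] = await ollamaEmbedBatch([text]);
  return vec;
}
```

Sending the whole batch in one `/api/embed` call (rather than one request per chunk) is what makes bulk indexing over Tailscale practical.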
Thanks @alexleach! Just saw your PR #480 and the list — great compilation. PR #116 from @jonesj38 is solid too. Our PR was born from a specific need: an ARM64 VPS where local compilation isn't viable. Happy to close this in favor of #480 or #116 if one of those gets merged — they're more comprehensive. The important thing is that remote embeddings land in QMD. 🙏
Yes, same here. This is the behaviour now in #116, on which my #480 was based.
Add `qmd serve` command that runs a lightweight HTTP server exposing embedding, reranking, and query expansion endpoints. Multiple QMD clients can share a single set of loaded models over the network instead of each loading their own into RAM.

Changes:
- New `src/serve.ts`: HTTP server wrapping LlamaCpp (embed/rerank/expand/tokenize)
- New `src/llm-remote.ts`: RemoteLLM class implementing the LLM interface via HTTP
- Updated LLM interface: added embedBatch, tokenize, intent option
- Updated store.ts: use the LLM interface instead of the concrete LlamaCpp type
- CLI: added `serve` command, `--server` flag, and QMD_SERVER env var
- README: documented remote model server usage and multi-agent setup

Addresses: tobi#489 tobi#490 tobi#502 tobi#480

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
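The `RemoteLLM` idea from that commit message can be sketched as an object satisfying the LLM interface by forwarding each call to the `qmd serve` HTTP server. The endpoint paths, request/response shapes, and `resolveServer` helper below are assumptions for illustration, not the fork's actual API:

```typescript
// Hypothetical subset of the LLM interface mentioned in the commit message.
interface LLMLike {
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
  tokenize(text: string): Promise<number[]>;
}

export class RemoteLLM implements LLMLike {
  constructor(private baseUrl: string) {}

  // Shared POST helper; every method is one JSON round-trip to the server.
  private async post<T>(path: string, body: unknown): Promise<T> {
    const res = await fetch(`${this.baseUrl}${path}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
    return res.json() as Promise<T>;
  }

  embed(text: string) {
    return this.post<{ embedding: number[] }>("/embed", { text }).then(r => r.embedding);
  }
  embedBatch(texts: string[]) {
    return this.post<{ embeddings: number[][] }>("/embed", { texts }).then(r => r.embeddings);
  }
  tokenize(text: string) {
    return this.post<{ tokens: number[] }>("/tokenize", { text }).then(r => r.tokens);
  }
}

// Mirrors the described precedence: --server flag wins over QMD_SERVER.
export function resolveServer(flag?: string): string | undefined {
  return flag ?? process.env.QMD_SERVER;
}
```

Because callers only see the LLM interface, `store.ts` does not need to know whether it is talking to a local LlamaCpp instance or a remote server.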
Related: I've implemented this more broadly in a fork — covers remote embeddings, reranking, and query expansion via a lightweight HTTP server. Details and discussion: #489 (comment)
Closing this in favor of #116 — after reviewing all the related PRs (@alexleach's list in #480 was super helpful), #116 is clearly the most complete solution. It covers everything our PR does (remote embeddings without local `node-llama-cpp`) and more.

We've been running remote Ollama embeddings on an ARM64 VPS (Oracle Cloud Ampere) and can confirm the approach works perfectly. @jonesj38's implementation in #116 is the right foundation for this.

@tobi — would love to see #116 merged. It unblocks QMD on ARM64, Docker, and CI environments where local compilation isn't viable. The performance numbers speak for themselves.
@jaylfc — we've started testing your fork on our infra. Would you be open to opening a PR to upstream? Happy to co-author or help test.
Opened upstream PR: #509. It covers everything discussed here — remote embeddings, reranking, query expansion, batch operations, plus the new index endpoints (`/search`, `/browse`, `/collections`, `/status`) for remote memory access. Thanks @paralizeer for the push to upstream this! 🙏
Summary
When `OLLAMA_EMBED_URL` is set, all embedding and tokenization operations use the remote Ollama HTTP API instead of `node-llama-cpp`. This enables QMD on platforms without local GPU/Vulkan support and with remote Ollama instances.

Problem
QMD 2.0 uses `node-llama-cpp` for all embedding operations, which requires local CMake compilation. This fails on:

- ARM64 VPS instances
- Docker containers
- CI runners

Every `vsearch`, `embed`, and `query` command triggers CMake compilation that either fails or hangs.

Solution
Six targeted patches that check `OLLAMA_EMBED_URL` and bypass `node-llama-cpp` when set:

| Function | Before | After |
|---|---|---|
| `ollamaEmbed()` / `ollamaEmbedBatch()` | (new) | helpers calling the Ollama `/api/embed` endpoint |
| `getEmbedding()` | `getDefaultLlamaCpp()` | `ollamaEmbed()` |
| `generateEmbeddings()` | `withLLMSessionForLlm` | direct Ollama HTTP |
| `expandQuery()` | LLM-based HYDE expansion | raw query passthrough |
| `chunkDocumentByTokens()` | `llm.tokenize()` | char-based estimation |
| `vsearch` / `query` CLI | `withLLMSession()` wrapper | direct execution |

Environment Variables

| Variable | Purpose |
|---|---|
| `OLLAMA_EMBED_URL` | Ollama server URL (e.g. `http://your-ollama:11434`) |
| `OLLAMA_EMBED_MODEL` | Model name (default: `nomic-embed-text`) |
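The opt-in gate shared by all six patches can be sketched as follows. The two callback parameters are hypothetical stand-ins for the existing local path and the new remote path; only the `OLLAMA_EMBED_URL` check itself comes from the PR:

```typescript
// Illustrative shape of the env-var gate: check OLLAMA_EMBED_URL first,
// otherwise fall through to the unchanged node-llama-cpp path.
async function getEmbedding(
  text: string,
  localEmbed: (t: string) => Promise<number[]>,   // existing node-llama-cpp path
  remoteEmbed: (t: string) => Promise<number[]>,  // ollamaEmbed()
): Promise<number[]> {
  if (process.env.OLLAMA_EMBED_URL) {
    return remoteEmbed(text); // bypass local model loading entirely
  }
  return localEmbed(text); // default behavior, identical to current QMD
}
```

Keeping the check at the top of each patched function (rather than behind a config layer) is what makes the change strictly opt-in: with the env var unset, no new code runs.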
Testing
Tested on an ARM64 Oracle Cloud VPS (4 vCPU, 23 GB RAM, no GPU) with `qwen3-embedding:0.6b` running on a remote Ollama instance connected via Tailscale:

- `qmd embed --force` — successfully indexed 7,100+ documents (zero CMake)
- `qmd vsearch "query"` — returns results in <2s (was: hung on CMake indefinitely)
- `qmd search` (BM25) — unaffected, works as before
- `qmd query` — works with Ollama embeddings (no reranking without a local model)

Design Decisions
- Opt-in only: everything is gated behind `OLLAMA_EMBED_URL`. Without the env var, behavior is identical to current.
- No LLM query expansion: `expandQuery()` returns the raw query for vector search instead of using HYDE. A future enhancement could call Ollama `/api/generate` for expansion.
- Char-based tokenization: uses `text.length / 3` as the token estimate (conservative for mixed prose/code). Avoids requiring a local tokenizer.
- Batched embedding: `ollamaEmbedBatch()` sends all texts in a single `/api/embed` call, matching Ollama's native batch support.

Closes #489.
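The char-based estimate above is simple enough to sketch in full. `estimateTokens` follows the stated `text.length / 3` rule; the chunking helper is a hypothetical stand-in for what `chunkDocumentByTokens()` might do with it:

```typescript
// ~3 characters per token, as described in the design decisions.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3);
}

// Illustrative chunker: split a document into chunks of roughly
// `maxTokens` estimated tokens, cutting only on line boundaries.
export function chunkByEstimatedTokens(doc: string, maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of doc.split("\n")) {
    const candidate = current ? `${current}\n${line}` : line;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // current chunk is full; start a new one
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Overestimating slightly (3 chars/token is conservative for English prose, where 4 is more typical) keeps chunks safely under the embedding model's context window without a round-trip to a tokenizer.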