feat: support remote Ollama embeddings via OLLAMA_EMBED_URL #490
paralizeer wants to merge 1 commit into `tobi:main`
Conversation
When OLLAMA_EMBED_URL is set, all embedding and tokenization operations use the remote Ollama HTTP API instead of node-llama-cpp. This enables QMD on platforms without local GPU/Vulkan support (ARM64 VPS, Docker containers, CI runners) and with remote Ollama instances (Tailscale, LAN, Docker networks).

Changes:
- Add `ollamaEmbed()` and `ollamaEmbedBatch()` helper functions using the Ollama `/api/embed` endpoint
- Patch `getEmbedding()` to bypass node-llama-cpp when OLLAMA_EMBED_URL is set
- Patch `generateEmbeddings()` with a dedicated Ollama fast path that skips `withLLMSessionForLlm` entirely
- Patch `expandQuery()` to skip LLM-based HYDE query expansion (passes the raw query to vector search)
- Patch `chunkDocumentByTokens()` to use char-based estimation instead of the local tokenizer
- Patch the `vsearch` and `query` CLI commands to skip the `withLLMSession` wrapper

Environment variables:
- `OLLAMA_EMBED_URL` - Ollama server URL (e.g. `http://your-ollama:11434`)
- `OLLAMA_EMBED_MODEL` - Model name (default: `nomic-embed-text`)

Tested on an ARM64 Oracle Cloud VPS with `qwen3-embedding:0.6b` on a remote Ollama instance via Tailscale. 7,100+ documents indexed successfully.
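A minimal sketch of the two helpers named in the commit message, assuming the standard Ollama `/api/embed` endpoint (which accepts a string or an array under `input` and returns an `embeddings` array of float vectors). The internals here are illustrative, not the PR's actual code:

```typescript
// Sketch of ollamaEmbed()/ollamaEmbedBatch() against Ollama's /api/embed.
// OLLAMA_EMBED_URL / OLLAMA_EMBED_MODEL follow the PR's env vars.
const OLLAMA_URL = process.env.OLLAMA_EMBED_URL ?? "http://localhost:11434";
const OLLAMA_MODEL = process.env.OLLAMA_EMBED_MODEL ?? "nomic-embed-text";

// Pure helper: build the /api/embed request body.
export function buildEmbedPayload(input: string | string[], model = OLLAMA_MODEL) {
  return { model, input };
}

export async function ollamaEmbedBatch(texts: string[]): Promise<number[][]> {
  const res = await fetch(`${OLLAMA_URL}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbedPayload(texts)),
  });
  if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}

export async function ollamaEmbed(text: string): Promise<number[]> {
  // Single-text case reuses the batch path with a one-element array.
  const [vec] = await ollamaEmbedBatch([text]);
  return vec;
}
```

Sending the whole batch in one `/api/embed` call (rather than one request per chunk) is what makes bulk indexing over Tailscale practical.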
Thanks @alexleach! Just saw your PR #480 and the list — great compilation. PR #116 from @jonesj38 is solid too. Our PR was born from a specific need: an ARM64 VPS where local compilation isn't viable. Happy to close this in favor of #480 or #116 if one of those gets merged — they're more comprehensive. The important thing is that remote embeddings land in QMD. 🙏
Yes, same here. This is the behaviour now in #116, on which my #480 was based.
Add `qmd serve` command that runs a lightweight HTTP server exposing embedding, reranking, and query expansion endpoints. Multiple QMD clients can share a single set of loaded models over the network instead of each loading their own into RAM.

Changes:
- New `src/serve.ts`: HTTP server wrapping LlamaCpp (embed/rerank/expand/tokenize)
- New `src/llm-remote.ts`: RemoteLLM class implementing the LLM interface via HTTP
- Updated LLM interface: added embedBatch, tokenize, intent option
- Updated store.ts: use the LLM interface instead of the concrete LlamaCpp type
- CLI: added `serve` command, `--server` flag, and QMD_SERVER env var
- README: documented remote model server usage and multi-agent setup

Addresses: tobi#489 tobi#490 tobi#502 tobi#480

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
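The `RemoteLLM` idea from that commit message can be sketched as an object satisfying the LLM interface by forwarding each call to the `qmd serve` HTTP server. The endpoint paths, request/response shapes, and `resolveServer` helper below are assumptions for illustration, not the fork's actual API:

```typescript
// Hypothetical subset of the LLM interface mentioned in the commit message.
interface LLMLike {
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
  tokenize(text: string): Promise<number[]>;
}

export class RemoteLLM implements LLMLike {
  constructor(private baseUrl: string) {}

  // Shared POST helper; every method is one JSON round-trip to the server.
  private async post<T>(path: string, body: unknown): Promise<T> {
    const res = await fetch(`${this.baseUrl}${path}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
    return res.json() as Promise<T>;
  }

  embed(text: string) {
    return this.post<{ embedding: number[] }>("/embed", { text }).then(r => r.embedding);
  }
  embedBatch(texts: string[]) {
    return this.post<{ embeddings: number[][] }>("/embed", { texts }).then(r => r.embeddings);
  }
  tokenize(text: string) {
    return this.post<{ tokens: number[] }>("/tokenize", { text }).then(r => r.tokens);
  }
}

// Mirrors the described precedence: --server flag wins over QMD_SERVER.
export function resolveServer(flag?: string): string | undefined {
  return flag ?? process.env.QMD_SERVER;
}
```

Because callers only see the LLM interface, `store.ts` does not need to know whether it is talking to a local LlamaCpp instance or a remote server.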
Related: I've implemented this more broadly in a fork — covers remote embeddings, reranking, and query expansion via a lightweight HTTP server. Details and discussion: #489 (comment)
Closing this in favor of #116 — after reviewing all the related PRs (@alexleach's list in #480 was super helpful), #116 is clearly the most complete solution. It covers everything our PR does (remote embeddings without local `node-llama-cpp`) and more.

We've been running remote Ollama embeddings on an ARM64 VPS (Oracle Cloud Ampere) and can confirm the approach works perfectly. @jonesj38's implementation in #116 is the right foundation for this.

@tobi — would love to see #116 merged. It unblocks QMD on ARM64, Docker, and CI environments where local compilation isn't viable. The performance numbers speak for themselves.
@jaylfc — we've started testing your fork on our infra. Would you be open to opening a PR to upstream? Happy to co-author or help test.
Opened upstream PR: #509. It covers everything discussed here — remote embeddings, reranking, query expansion, batch operations, plus the new index endpoints (`/search`, `/browse`, `/collections`, `/status`) for remote memory access. Thanks @paralizeer for the push to upstream this! 🙏
Summary
When `OLLAMA_EMBED_URL` is set, all embedding and tokenization operations use the remote Ollama HTTP API instead of `node-llama-cpp`. This enables QMD on platforms without local GPU/Vulkan support and with remote Ollama instances.

Problem
QMD 2.0 uses `node-llama-cpp` for all embedding operations, which requires local CMake compilation. This fails on:

- ARM64 VPS instances
- Docker containers
- CI runners

Every `vsearch`, `embed`, and `query` command triggers CMake compilation that either fails or hangs.

Solution
Six targeted patches that check `OLLAMA_EMBED_URL` and bypass `node-llama-cpp` when set:

| Function | Before | After |
|---|---|---|
| `ollamaEmbed()` / `ollamaEmbedBatch()` | (new) | helpers calling the Ollama `/api/embed` endpoint |
| `getEmbedding()` | `getDefaultLlamaCpp()` | `ollamaEmbed()` |
| `generateEmbeddings()` | `withLLMSessionForLlm` | direct Ollama HTTP |
| `expandQuery()` | LLM-based HYDE expansion | raw query passthrough |
| `chunkDocumentByTokens()` | `llm.tokenize()` | char-based estimation |
| `vsearch` / `query` CLI | `withLLMSession()` wrapper | direct execution |

Environment Variables

| Variable | Purpose |
|---|---|
| `OLLAMA_EMBED_URL` | Ollama server URL (e.g. `http://your-ollama:11434`) |
| `OLLAMA_EMBED_MODEL` | Model name (default: `nomic-embed-text`) |
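The opt-in gate shared by all six patches can be sketched as follows. The two callback parameters are hypothetical stand-ins for the existing local path and the new remote path; only the `OLLAMA_EMBED_URL` check itself comes from the PR:

```typescript
// Illustrative shape of the env-var gate: check OLLAMA_EMBED_URL first,
// otherwise fall through to the unchanged node-llama-cpp path.
async function getEmbedding(
  text: string,
  localEmbed: (t: string) => Promise<number[]>,   // existing node-llama-cpp path
  remoteEmbed: (t: string) => Promise<number[]>,  // ollamaEmbed()
): Promise<number[]> {
  if (process.env.OLLAMA_EMBED_URL) {
    return remoteEmbed(text); // bypass local model loading entirely
  }
  return localEmbed(text); // default behavior, identical to current QMD
}
```

Keeping the check at the top of each patched function (rather than behind a config layer) is what makes the change strictly opt-in: with the env var unset, no new code runs.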
Testing
Tested on an ARM64 Oracle Cloud VPS (4 vCPU, 23 GB RAM, no GPU) with `qwen3-embedding:0.6b` running on a remote Ollama instance connected via Tailscale:

- `qmd embed --force` — successfully indexed 7,100+ documents (zero CMake)
- `qmd vsearch "query"` — returns results in <2s (was: hung on CMake indefinitely)
- `qmd search` (BM25) — unaffected, works as before
- `qmd query` — works with Ollama embeddings (no reranking without a local model)

Design Decisions
- Opt-in only: everything is gated behind `OLLAMA_EMBED_URL`. Without the env var, behavior is identical to current.
- No LLM query expansion: `expandQuery()` returns the raw query for vector search instead of using HYDE. A future enhancement could call Ollama `/api/generate` for expansion.
- Char-based tokenization: uses `text.length / 3` as the token estimate (conservative for mixed prose/code). Avoids requiring a local tokenizer.
- Batched embedding: `ollamaEmbedBatch()` sends all texts in a single `/api/embed` call, matching Ollama's native batch support.

Closes #489.
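The char-based estimate above is simple enough to sketch in full. `estimateTokens` follows the stated `text.length / 3` rule; the chunking helper is a hypothetical stand-in for what `chunkDocumentByTokens()` might do with it:

```typescript
// ~3 characters per token, as described in the design decisions.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3);
}

// Illustrative chunker: split a document into chunks of roughly
// `maxTokens` estimated tokens, cutting only on line boundaries.
export function chunkByEstimatedTokens(doc: string, maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of doc.split("\n")) {
    const candidate = current ? `${current}\n${line}` : line;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current); // current chunk is full; start a new one
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Overestimating slightly (3 chars/token is conservative for English prose, where 4 is more typical) keeps chunks safely under the embedding model's context window without a round-trip to a tokenizer.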