
feat: support remote Ollama embeddings via OLLAMA_EMBED_URL#490

Closed
paralizeer wants to merge 1 commit into tobi:main from paralizeer:feat/ollama-remote-embeddings

Conversation

@paralizeer

Summary

When OLLAMA_EMBED_URL is set, all embedding and tokenization operations use the remote Ollama HTTP API instead of node-llama-cpp. This enables QMD on platforms without local GPU/Vulkan support and with remote Ollama instances.

Problem

QMD 2.0 uses node-llama-cpp for all embedding operations, which requires local CMake compilation. This fails on:

  • ARM64 VPS (Oracle Cloud, Ampere, Graviton) — no Vulkan SDK
  • Docker containers — no GPU drivers
  • CI runners — headless, no GPU
  • Remote Ollama setups — Ollama on a GPU box, QMD on a different machine (common with Tailscale/Docker networks)

Every vsearch, embed, and query command triggers CMake compilation that either fails or hangs.

Solution

Six targeted patches that check OLLAMA_EMBED_URL and bypass node-llama-cpp when set:

Function                             What it bypasses
ollamaEmbed() / ollamaEmbedBatch()   New helpers using the Ollama /api/embed endpoint
getEmbedding()                       getDefaultLlamaCpp() → ollamaEmbed()
generateEmbeddings()                 withLLMSessionForLlm → direct Ollama HTTP
expandQuery()                        LLM-based HYDE expansion → raw vector passthrough
chunkDocumentByTokens()              llm.tokenize() → char-based estimation
vsearch / query CLI                  withLLMSession() wrapper → direct execution
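For reference, a minimal sketch of what the new ollamaEmbed()/ollamaEmbedBatch() helpers could look like. The request shape follows Ollama's /api/embed endpoint (a model name plus an input array, returning one embedding per input); the function signatures, default-model fallback, and error handling here are illustrative, not the PR's actual code.

```typescript
// Sketch of a batch embedding helper against Ollama's /api/embed endpoint.
// OLLAMA_EMBED_URL / OLLAMA_EMBED_MODEL match the env vars described above.

interface EmbedRequest {
  model: string;
  input: string[];
}

// Build the /api/embed payload; Ollama accepts an array of inputs natively.
function buildEmbedRequest(
  texts: string[],
  model = "nomic-embed-text",
): EmbedRequest {
  return { model, input: texts };
}

async function ollamaEmbedBatch(texts: string[]): Promise<number[][]> {
  const baseUrl = process.env.OLLAMA_EMBED_URL;
  if (!baseUrl) throw new Error("OLLAMA_EMBED_URL is not set");
  const model = process.env.OLLAMA_EMBED_MODEL ?? "nomic-embed-text";
  const res = await fetch(`${baseUrl}/api/embed`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildEmbedRequest(texts, model)),
  });
  if (!res.ok) throw new Error(`Ollama embed failed: HTTP ${res.status}`);
  // Ollama returns { embeddings: number[][] }, one vector per input.
  const data = (await res.json()) as { embeddings: number[][] };
  return data.embeddings;
}
```

Because /api/embed accepts an array natively, the batch variant is a single HTTP round-trip rather than one request per text.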

Environment Variables

export OLLAMA_EMBED_URL=http://your-ollama:11434       # Required
export OLLAMA_EMBED_MODEL=qwen3-embedding:0.6b         # Optional (default: nomic-embed-text)

Testing

Tested on ARM64 Oracle Cloud VPS (4 vCPU, 23GB RAM, no GPU) with qwen3-embedding:0.6b running on a remote Ollama instance connected via Tailscale:

  • qmd embed --force — successfully indexed 7,100+ documents (zero CMake)
  • qmd vsearch "query" — returns results in <2s (was: hung on CMake indefinitely)
  • qmd search (BM25) — unaffected, works as before
  • qmd query — works with Ollama embeddings (no reranking without local model)

Design Decisions

  • No breaking changes — all patches are gated behind OLLAMA_EMBED_URL. Without the env var, behavior is identical to current.
  • Query expansion skipped — expandQuery() returns the raw query as the vector-search input instead of using HYDE. A future enhancement could call Ollama /api/generate for expansion.
  • Char-based chunking — uses text.length / 3 as token estimate (conservative for mixed prose/code). Avoids requiring a local tokenizer.
  • Batch embedding — ollamaEmbedBatch() sends all texts in a single /api/embed call, matching Ollama's native batch support.
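The char-based chunking decision above can be sketched as follows. The divisor of 3 is the ratio quoted in the PR; the fixed-offset splitting loop is a hypothetical simplification (the real chunkDocumentByTokens() presumably respects document boundaries rather than cutting at arbitrary offsets).

```typescript
// Char-based token estimate used in place of llm.tokenize() when
// OLLAMA_EMBED_URL is set: ~3 characters per token, rounded up.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 3);
}

// Illustrative chunker built on that estimate: invert the ratio to get a
// character budget per chunk, then slice at fixed offsets.
function chunkByEstimatedTokens(text: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 3;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

Overestimating token counts (3 chars/token is low for English prose) errs on the side of smaller chunks, which keeps each chunk safely under the embedding model's context window.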

Closes #489.

When OLLAMA_EMBED_URL is set, all embedding and tokenization operations
use the remote Ollama HTTP API instead of node-llama-cpp. This enables
QMD on platforms without local GPU/Vulkan support (ARM64 VPS, Docker
containers, CI runners) and with remote Ollama instances (Tailscale,
LAN, Docker networks).

Changes:
- Add ollamaEmbed() and ollamaEmbedBatch() helper functions using
  Ollama /api/embed endpoint
- Patch getEmbedding() to bypass node-llama-cpp when OLLAMA_EMBED_URL
  is set
- Patch generateEmbeddings() with dedicated Ollama fast-path that skips
  withLLMSessionForLlm entirely
- Patch expandQuery() to skip LLM-based HYDE query expansion (passes
  raw query as vector search)
- Patch chunkDocumentByTokens() to use char-based estimation instead of
  local tokenizer
- Patch vsearch and query CLI commands to skip withLLMSession wrapper

Environment variables:
  OLLAMA_EMBED_URL   - Ollama server URL (e.g. http://your-ollama:11434)
  OLLAMA_EMBED_MODEL - Model name (default: nomic-embed-text)

Tested on ARM64 Oracle Cloud VPS with qwen3-embedding:0.6b on remote
Ollama via Tailscale. 7,100+ documents indexed successfully.
@alexleach

FYI, there are already a number of PRs and Issues that provide this functionality. I collated a list in my own PR: #480

Really hoping @tobi will merge one of them! My favourite is #116

@paralizeer
Author

Thanks @alexleach! Just saw your PR #480 and the list — great compilation. PR #116 from @jonesj38 is solid too.

Our PR was born from a specific need: ARM64 VPS where node-llama-cpp can't compile (no Vulkan, CMake fails). The OLLAMA_EMBED_URL env var approach was the simplest path to get vsearch working without touching the LLM session layer.

Happy to close this in favor of #480 or #116 if one of those gets merged — they're more comprehensive. The important thing is that remote embeddings land in QMD. 🙏

@alexleach

Yes, same here. node-llama-cpp won't compile in docker containers on my macbook (which is also arm64), so qmd fails completely without patches like these to use remote endpoints and skip the compilation.

This is the behaviour now in #116, on which my #480 was based. Stopping node-llama-cpp's compilation was one of the main sticking points I had with #116, but that's since been implemented. I'll probably close #480 in favour of #116, but that list could still be useful until one of them gets merged...

jaylfc pushed a commit to jaylfc/qmd that referenced this pull request Apr 4, 2026
Add `qmd serve` command that runs a lightweight HTTP server exposing
embedding, reranking, and query expansion endpoints. Multiple QMD clients
can share a single set of loaded models over the network instead of each
loading their own into RAM.

Changes:
- New `src/serve.ts`: HTTP server wrapping LlamaCpp (embed/rerank/expand/tokenize)
- New `src/llm-remote.ts`: RemoteLLM class implementing LLM interface via HTTP
- Updated LLM interface: added embedBatch, tokenize, intent option
- Updated store.ts: use LLM interface instead of concrete LlamaCpp type
- CLI: added `serve` command, `--server` flag, and QMD_SERVER env var
- README: documented remote model server usage and multi-agent setup

Addresses: tobi#489 tobi#490 tobi#502 tobi#480

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jaylfc

jaylfc commented Apr 4, 2026

Related: I've implemented this more broadly in a fork — covers remote embeddings, reranking, and query expansion via a qmd serve model server + --server / QMD_SERVER client flag.

Details and discussion: #489 (comment)
Fork: https://github.com/jaylfc/qmd/tree/feat/remote-llm-provider

@paralizeer
Author

Closing this in favor of #116 — after reviewing all the related PRs (@alexleach's list in #480 was super helpful), #116 is clearly the most complete solution.

It covers everything our PR does (remote embeddings without local node-llama-cpp compilation) plus query expansion, reranking, and OpenAI-compatible endpoint support — which means it works with Ollama too via baseUrl override.

We've been running remote Ollama embeddings on ARM64 VPS (Oracle Cloud Ampere) and can confirm the approach works perfectly. @jonesj38's implementation in #116 is the right foundation for this.

@tobi — would love to see #116 merged. It unblocks QMD on ARM64, Docker, and CI environments where local compilation isn't viable. The performance numbers speak for themselves.

@paralizeer
Author

@jaylfc — your qmd serve implementation is exactly what we need. We're running multiple agents (OpenClaw) on an ARM64 VPS sharing a remote Ollama instance, and the model server approach (load once, serve many) is perfect for our setup.

We've started testing your fork on our infra. Would you be open to opening a PR to upstream? Happy to co-author or help test. The RemoteLLM drop-in + QMD_SERVER env var is the cleanest approach of all the remote embedding PRs.

@jaylfc

jaylfc commented Apr 5, 2026

Opened upstream PR: #509

Covers everything discussed here — remote embeddings, reranking, query expansion, batch operations, plus the new index endpoints (/search, /browse, /collections, /status) for remote memory access.

Thanks @paralizeer for the push to upstream this! 🙏

jaylfc added a commit to jaylfc/qmd that referenced this pull request Apr 5, 2026


Development

Successfully merging this pull request may close these issues.

Feature request: Support remote Ollama embeddings via HTTP (OLLAMA_EMBED_URL)

3 participants