A per-model "reverse proxy" which redirects requests to multiple ollama servers.
This is a minimal-effort implementation of a reverse proxy for ollama. It mainly accepts chat and generation requests and, depending on the model requested, transfers the payload to the server that has been specifically assigned to run that model. Refer to the API section below for the list of currently supported endpoints.
```sh
go run ./*.go --level=trace --address 0.0.0.0:11434 \
  --proxy=llama3.2-vision=http://server-02:11434 \
  --proxy=deepseek-r1:14b=http://server-01:11434
```
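Assuming the proxy started above is reachable on localhost:11434, a client only ever talks to the proxy; the `model` field in the request body determines which backend receives it (here `llama3.2-vision`, routed to `http://server-02:11434`). A minimal sketch using ollama's standard chat endpoint:

```sh
# Hypothetical request: gollamas reads the "model" field and forwards the
# payload unchanged to the server configured for llama3.2-vision.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2-vision",
  "messages": [{ "role": "user", "content": "Describe the colours of a sunset." }]
}'
```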
Official images are available on Docker Hub and ghcr.io. You can run the latest image from either:
- Docker Hub:
  ```sh
  docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" slawoc/gollamas:latest
  ```
- ghcr.io:
  ```sh
  docker run -it -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" ghcr.io/slawo/gollamas:latest
  ```
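As a rough sketch (the network name, container names and plain ollama backends are assumptions for illustration, and the proxy container is assumed to listen on ollama's default port 11434), the proxy can be placed in front of two ollama containers on a shared Docker network:

```sh
# Illustrative setup only: two ollama backends plus gollamas on one network.
docker network create llm-net
docker run -d --name server  --network llm-net ollama/ollama
docker run -d --name server2 --network llm-net ollama/ollama
docker run -it --network llm-net -p 11434:11434 \
  -e GOLLAMAS_PROXIES="llama3.2-vision=http://server:11434,deepseek-r1:14b=http://server2:11434" \
  slawoc/gollamas:latest
```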
There are various scenarios this project attempts to address; here is a list of the features currently implemented (see the request sketch after this list for the keep-alive and context-size options):
- Manage models
  - Map model aliases to existing model names (some tools only allow a pre-defined set of models)
  - Set that by default only the configured models are returned when listing models
  - Set a flag to also return models as aliases
  - Set an option to allow requests to currently running models (i.e. the server has an additional model running)
- Keep models in memory
  - Preload models (ensure the model is loaded upon startup)
  - Ping models (keep the model loaded)
  - Add config to enforce model keep-alive globally `"keep_alive": -1` (if it is worth adding functionality for servers without `OLLAMA_KEEP_ALIVE=-1`)
  - Add config to override model keep-alive per model/server `"keep_alive": -1`
- Set a fixed context size `"options": { "num_ctx": 4096 }`
  - Add config to set a default context size (if missing) in each request `"options": { "num_ctx": 4096 }`
  - Add config to set a default context size (if missing) per model/server `"options": { "num_ctx": 4096 }`
  - Add config to enforce the context size in each request `"options": { "num_ctx": 4096 }`
  - Add config to enforce the context size per model/server `"options": { "num_ctx": 4096 }`
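`keep_alive` and `options.num_ctx` in the snippets above are standard fields of ollama's generate/chat requests; since gollamas forwards the payload to the selected backend, a client can already set them per request. A minimal sketch (model and prompt are placeholders):

```sh
# Keep the model loaded indefinitely and pin the context window to 4096
# tokens for this request; gollamas forwards these fields untouched.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Why is the sky blue?",
  "keep_alive": -1,
  "options": { "num_ctx": 4096 }
}'
```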
Not all endpoints are covered. In particular, endpoints which deal with the customisation and creation of models are not supported until there is a clear use case for them.
- Supported endpoints
  - `GET /`
  - `GET /api/tags`
  - `GET /api/ps`
  - `GET /api/version`
  - `GET /v1/models`
  - `GET /v1/models/:model`
  - `HEAD /`
  - `HEAD /api/tags`
  - `HEAD /api/version`
  - `POST /api/chat`
  - `POST /api/embed`
  - `POST /api/embeddings`
  - `POST /api/generate`
  - `POST /api/pull`
  - `POST /api/show`
  - `POST /v1/chat/completions`
  - `POST /v1/completions`
  - `POST /v1/embeddings`
- Not supported
  - `HEAD /api/blobs/:digest`
  - `DELETE /api/delete`
  - `POST /api/blobs/:digest`
  - `POST /api/copy`
  - `POST /api/create`
  - `POST /api/push`
The server relies on existing ollama models and middlewares to speed up the development of the initial implementation.
Only requests which have a `model` (or the deprecated `name`) field are transferred to the right server.
When possible, other endpoints hit all configured servers and either select one answer (i.e. the lowest available version) or combine the results into one response (i.e. lists of models).
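For example, assuming the proxy is reachable on localhost:11434, the two fan-out behaviours can be seen on the version and tags endpoints:

```sh
# /api/version hits all backends and reports the lowest version available;
# /api/tags combines the backends' model lists into a single response.
curl http://localhost:11434/api/version
curl http://localhost:11434/api/tags
```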