High-performance Whisper STT API server with a hybrid "Hot/Cold" worker architecture.
Ideal for locally running installations of agents like OpenClaw or Open-WebUI, where audio must not leave the private local network.
Created and maintained by Hugo L. Espuny. Part of the Uttera voice stack. Licensed under the Apache License 2.0. See NOTICE for third-party attributions.
This repository has been renamed from whisper-stt-local-server to
uttera-stt-hotcold and transferred from its original creator's
personal page (@fakehec) to the Uttera GitHub organization.
GitHub redirects old URLs automatically, so any existing clones, forks,
bookmarks, and links keep working. If you still have
fakehec/whisper-stt-local-server as your origin, consider updating it:

```bash
git remote set-url origin https://github.com/uttera/uttera-stt-hotcold.git
```

| Use case | This repo | Sibling repo |
|---|---|---|
| Home-lab, personal, small/mid GPU (8–16 GB) | ✓ uttera-stt-hotcold | ✗ |
| Cloud, multi-tenant, large GPU (≥24 GB) | ✗ | ✓ uttera-stt-vllm |
Choose uttera-stt-hotcold when:
- You have consumer GPUs (RTX 4070, 4080) and transcribe occasionally.
- Personal or single-user deployment.
- You want to share the GPU with other workloads.
- You have 8–24 GB of VRAM. vLLM does not fit comfortably in this range: at 8–16 GB the KV cache is too small for continuous batching to beat hotcold; at 16–24 GB vLLM works but permanently reserves 11–22 GB, wasting the co-location flexibility that is hotcold's reason to exist on mid-sized GPUs.
Choose uttera-stt-vllm when:
- You transcribe hours of audio per day across many concurrent streams.
- You want continuous batching to maximise GPU utilisation.
- You have large-VRAM GPUs dedicated to inference.
- You have 32 GB+ of VRAM (vLLM reserves ~22–29 GB at startup depending on `gpu_memory_utilization`; below 32 GB total you either run out of headroom or lose the batching advantage that justifies the reservation).
See uttera-benchmarks
for reproducible head-to-head numbers across four load profiles
(latency, burst up to N=1024, sustained) and two corpora (LibriSpeech
test-clean and an internal Spanish WAV corpus).
### Concurrency and engine

- Hybrid hot/cold pool:
  - Hot worker: Whisper resident in VRAM for sub-second (~0.2 s) inference on short clips.
  - Cold workers: on-demand subprocesses spawned on the GPU when the hot lane is busy, so long audio files don't block quick voice commands. Each drains after `COLD_WORKER_IDLE_TIMEOUT` seconds of idleness.
- GPU-accelerated via NVIDIA CUDA. fp16 + fp32-LayerNorm by default (`WHISPER_FP16=1`) halves VRAM with no quality loss.
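The lane decision described above can be sketched in a few lines. This is a toy model, not the server's code: `HotColdRouter`, the `QUEUED` outcome, and the exact order of checks are illustrative assumptions; only the cap and VRAM-floor knobs mirror documented env vars.

```python
from dataclasses import dataclass


@dataclass
class HotColdRouter:
    """Toy model of the hot/cold lane decision (not the real implementation)."""
    cold_pool_cap: int = 10          # mirrors COLD_POOL_SIZE
    min_cold_vram_gb: float = 4.0    # mirrors MIN_COLD_VRAM_GB
    hot_busy: bool = False
    cold_active: int = 0

    def route(self, free_vram_gb: float) -> str:
        # The resident hot worker gives ~0.2 s turnaround, so prefer it when free.
        if not self.hot_busy:
            self.hot_busy = True
            return "HOT"
        # Hot lane busy: spawn a cold subprocess if the cap and free VRAM allow.
        if self.cold_active < self.cold_pool_cap and free_vram_gb >= self.min_cold_vram_gb:
            self.cold_active += 1
            return "COLD-POOL"
        # No capacity anywhere: wait for the hot lane to drain.
        return "QUEUED"
```

Cold workers exit after `COLD_WORKER_IDLE_TIMEOUT` seconds of idleness, so `cold_active` shrinks back to zero between bursts.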
### OpenAI-compatible API

- Standard endpoints: `POST /v1/audio/transcriptions`, `POST /v1/audio/translations`. `GET /v1/models` for client autodiscovery (reports `whisper-1`, `owned_by: uttera`).
- All five OpenAI `response_format` values fully supported (v2.2.0): `json`, `text`, `verbose_json`, `srt`, `vtt`. Previously `srt`/`vtt`/`verbose_json` silently collapsed to the compact JSON form.
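A stdlib-only client sketch for the transcription endpoint. The endpoint path and the `model`, `response_format`, and `file` fields follow the OpenAI-style contract above; the base URL is an assumption about a local deployment.

```python
import io
import urllib.request
import uuid

BASE_URL = "http://localhost:9005"  # assumed local deployment


def build_transcription_request(audio: bytes, filename: str,
                                response_format: str = "json") -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # Plain form fields first.
    for name, value in [("model", "whisper-1"), ("response_format", response_format)]:
        body.write(f'--{boundary}\r\nContent-Disposition: form-data; '
                   f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    # Then the audio file part.
    body.write(f'--{boundary}\r\nContent-Disposition: form-data; '
               f'name="file"; filename="{filename}"\r\n'
               'Content-Type: application/octet-stream\r\n\r\n'.encode())
    body.write(audio)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions", data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"})
```

Send the request with `urllib.request.urlopen(req)`; the `X-Route` header on the response then reports which lane handled the call.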
### Translation

- `POST /v1/audio/translations` with `to_language` (default `en`).
- With `LIBRETRANSLATE_URL` set: Whisper-transcribe → LibreTranslate pipeline to any target language. Without it: Whisper native `translate` (English only; poor on `turbo`-class models).
- `to_language != "en"` without `LIBRETRANSLATE_URL` → HTTP 400 naming the missing env var (previously a silent fallback to English, a contract violation).
- `X-Translation-Mode: libretranslate` response header whenever the LibreTranslate path runs, so clients and observability tooling can tell which engine handled the call.
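The engine selection above condenses to a few lines; `pick_translation_engine` is an illustrative helper mirroring the documented contract, not server code.

```python
def pick_translation_engine(to_language: str, libretranslate_url: str = "") -> str:
    """Return which engine the docs say handles /v1/audio/translations."""
    if libretranslate_url:
        # Whisper-transcribe -> LibreTranslate pipeline: any target language.
        return "libretranslate"
    if to_language == "en":
        # Whisper's native translate task: English only.
        return "whisper-native"
    # Documented contract: HTTP 400 naming the missing env var.
    raise ValueError("to_language != 'en' requires LIBRETRANSLATE_URL")
```

When the first branch is taken, the server also sets `X-Translation-Mode: libretranslate` on the response.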
### Validation and observability

- Strict validation on every knob: out-of-range values return HTTP 422 or HTTP 400 with a useful detail body:
  - `response_format` must be one of `json|text|verbose_json|srt|vtt`.
  - `temperature` ∈ `[0.0, 1.0]` (OpenAI spec).
  - Undecodable / non-audio file bodies → HTTP 400 with the typed decode error (was HTTP 500 before v2.2.0).
  - Unsupported Whisper language codes → HTTP 400 with the message (was a generic HTTP 500 before v2.2.0).
- `X-Route` response header (`HOT` / `COLD-POOL` / `COLD-POOL>HOT`) tells the client which lane handled the request, exposed to browsers via CORS when CORS is enabled.
- Multilingual: 99 languages covered by Whisper. Auto-detects if `language` is omitted.
### Operations

- `GET /health` and `HEAD /health` (v2.2.0) expose version, model, worker status, queue depth, and VRAM in one struct that serves both proxies and Docker healthchecks.
- Opt-in `CORSMiddleware` via the `CORS_ALLOW_ORIGINS` env var (disabled by default; API-first deployments don't need it). Exposes `X-Route` and `X-Translation-Mode` to browser clients.
- Canonical Uttera-stack port `9005` (STT family); the TTS family uses `9004`. Swapping `hotcold ↔ vllm` is a backend change, not a port change.
- Optional Redis self-registration (`REDIS_URL`) for upstream router discovery, using the same protocol as the sibling `uttera-stt-vllm` and the TTS servers.
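For boot scripts or orchestration that must wait on `GET /health`, a polling loop can look like the sketch below. The `engine_ready` JSON key is an assumption (the README lists the health fields only informally), and `fetch` is injectable so the loop is testable offline.

```python
import json
import time
import urllib.request


def wait_until_ready(base_url="http://localhost:9005", timeout_s=60.0, fetch=None):
    """Poll GET /health until the server reports ready or the deadline passes."""
    if fetch is None:
        def fetch():
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as r:
                return json.loads(r.read())
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = fetch()
            if health.get("engine_ready"):  # assumed key name
                return health
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(1.0)
    raise TimeoutError("STT server not ready")
```

`HEAD /health` stays available for cheaper liveness probes where the body is not needed.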
### Privacy

- 100% local execution. Your audio never leaves your infrastructure.
| Model | Params | VRAM (fp16) | Speed | Languages | Best for |
|---|---|---|---|---|---|
| `tiny` | 39M | ~1 GB | Fastest | 99 | Testing, low-resource |
| `tiny.en` | 39M | ~1 GB | Fastest | English only | English-only, low-resource |
| `base` | 74M | ~1 GB | Fast | 99 | Light workloads |
| `base.en` | 74M | ~1 GB | Fast | English only | Light English-only |
| `small` | 244M | ~2 GB | Moderate | 99 | Good accuracy/speed balance |
| `small.en` | 244M | ~2 GB | Moderate | English only | English-only balanced |
| `medium` | 769M | ~5 GB | Slow | 99 | Default. High accuracy |
| `medium.en` | 769M | ~5 GB | Slow | English only | English-only high accuracy |
| `large` | 1550M | ~10 GB | Slowest | 99 | Maximum accuracy (v1) |
| `large-v2` | 1550M | ~10 GB | Slowest | 99 | Improved large |
| `large-v3` | 1550M | ~10 GB | Slowest | 99 | Best accuracy overall |
| `turbo` | 809M | ~6 GB | Fast | 99 | Recommended. large-v3 distilled, best quality/speed |
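Given the VRAM column above, a small helper can pick the largest model that still leaves room for cold workers or co-located workloads. The quality ordering and the 1 GB headroom are assumptions for illustration, not project policy.

```python
# Approximate fp16 VRAM needs from the table above (GB).
MODEL_VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5,
    "turbo": 6, "large-v3": 10,
}
# Rough quality ordering implied by the table, best first (assumption).
QUALITY_ORDER = ["large-v3", "turbo", "medium", "small", "base", "tiny"]


def best_model_for(free_vram_gb, headroom_gb=1.0):
    """Pick the highest-quality model that fits, keeping some headroom free."""
    for name in QUALITY_ORDER:
        if MODEL_VRAM_GB[name] + headroom_gb <= free_vram_gb:
            return name
    raise RuntimeError("Not enough VRAM for any Whisper model")
```

For example, `best_model_for(8.0)` lands on `turbo`, which matches the table's recommendation for mid-sized GPUs.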
Set the model via `WHISPER_MODEL` in `.env`. To download all models at once for offline use:

```bash
source venv/bin/activate
python3 -c "
import whisper
for m in ['tiny','tiny.en','base','base.en','small','small.en',
          'medium','medium.en','large','large-v2','large-v3','turbo']:
    print(f'Downloading {m}...')
    whisper.load_model(m, download_root='assets/models/whisper')
    print(f'  Done: {m}')
"
```

Install the following system dependencies first:

```bash
sudo apt update && sudo apt install -y ffmpeg python3 python3-venv
```

Python version: `setup.sh` uses the system default `python3` (3.12+ recommended). torch is pinned to `>=2.9.0,<2.10.0` to avoid CUDA 13 NPP dependency issues with newer versions.
```bash
git clone https://github.com/uttera/uttera-stt-hotcold.git
cd uttera-stt-hotcold
chmod +x setup.sh
./setup.sh
```

`setup.sh` creates the virtual environment, installs all dependencies, and downloads the configured Whisper model into `assets/models/`. It is safe to re-run.
To run the server without sudo privileges and enable GPU acceleration, the user must belong to the `video` and `render` groups:

```bash
sudo usermod -aG video $USER
sudo usermod -aG render $USER
```

Note: Restart your session for the changes to take effect.
The server listens on port 9005 by default. Ensure the user has permissions to open sockets on this port (standard for ports >1024).
| Method | Path | Description |
|---|---|---|
| `GET` / `HEAD` | `/health` | Server liveness, version, model, worker status, queue, VRAM. |
| `GET` | `/v1/models` | OpenAI-compatible model list (`whisper-1`, `owned_by: uttera`). |
| `POST` | `/v1/audio/transcriptions` | Transcribe audio to text (Hot or Cold Lane). Supports `json` / `text` / `verbose_json` / `srt` / `vtt` response formats. |
| `POST` | `/v1/audio/translations` | Transcribe + translate to `to_language` (default `en`). With `LIBRETRANSLATE_URL`: any target language. Without: English only (Whisper native). |
See API.md for full request/response schemas, validation ranges, `X-Route` / `X-Translation-Mode` semantics, and the error taxonomy (400 / 422 / 502).
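Client-side, that taxonomy suggests a simple dispatch. Treating 502 as the only retryable class is an assumption made here for illustration; the README does not define retry semantics.

```python
def classify_error(status: int) -> str:
    """Map the documented status codes to a coarse client action (assumed policy)."""
    if 200 <= status < 300:
        return "ok"
    if status in (400, 422):
        # Validation / decode errors: the request itself is wrong, retrying won't help.
        return "fix-request"
    if status == 502:
        # Assumed transient upstream (e.g. LibreTranslate) failure: retry with backoff.
        return "retry"
    return "unexpected"
```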
The server uses direct Uvicorn execution for maximum ASGI performance.

```bash
source venv/bin/activate

# Localhost only
uvicorn main_stt:app --host 127.0.0.1 --port 9005

# Expose to local network
uvicorn main_stt:app --host 0.0.0.0 --port 9005
```

Copy `.env.example` to `.env` and adjust as needed. All variables are optional.
| Variable | Default | Description |
|---|---|---|
| `WHISPER_MODEL` | `medium` | Model to load: `tiny`, `base`, `small`, `medium`, `large`. |
| `WHISPER_FP16` | `1` | fp16 + LayerNorm-fp32 (halves VRAM). Set to `0` for fp32. |
| `COLD_POOL_SIZE` | `10` | Max concurrent cold workers (safety cap). |
| `COLD_WORKER_IDLE_TIMEOUT` | `60` | Seconds before an idle cold worker exits. |
| `COLD_WORKER_IDLE_STAGGER` | `10` | Stagger per worker slot to avoid mass die-off. |
| `MIN_COLD_VRAM_GB` | `4.0` | Min free VRAM to spawn a cold worker (`0` = disable). |
| `COLD_LANE_TIMEOUT_SECONDS` | `300` | Max seconds to wait for a Cold Lane subprocess before HTTP 500. |
| `ROUTING_DRAIN_CAP_SECONDS` | `120` | Queue drain time considered 100% load. |
| `REDIS_URL` | (empty) | Redis URL for node self-registration (opt-in). |
| `NODE_HOST` | `localhost` | Host advertised to Redis for Gatekeeper routing. |
| `NODE_PORT` | `9005` | Port advertised to Redis for Gatekeeper routing. |
| `DEBUG` | `false` | Set to `true` to enable worker routing and subprocess traces. |
| `VENV_PYTHON` | (auto-detected) | Path to venv Python. Auto-detected from `venv/bin/python`. |
See .env.example for the full list of variables and their defaults.
- Create the directory if it doesn't exist: `mkdir -p ~/.config/systemd/user`
- Create: `~/.config/systemd/user/uttera-stt.service`
- Configuration (environment variables are loaded from your `.env` file):

```ini
[Unit]
Description=Uttera STT Hot/Cold Server
After=network.target

[Service]
Type=simple
WorkingDirectory=%h/uttera-stt-hotcold
ExecStart=%h/uttera-stt-hotcold/venv/bin/uvicorn main_stt:app --host 127.0.0.1 --port 9005
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
```

- Enable and start:

```bash
systemctl --user daemon-reload
systemctl --user enable --now uttera-stt.service
```

If concurrent requests return HTTP 500 with a path error, the Cold Lane cannot find the Whisper CLI or Python binary. The server auto-detects `venv/bin/python` and `venv/bin/whisper` relative to the project directory. If running from a non-standard location, set the paths explicitly in `.env`:

```bash
VENV_PYTHON=/absolute/path/to/venv/bin/python
WHISPER_SCRIPT=/absolute/path/to/venv/bin/whisper
```

The server defaults to `assets/models/whisper` inside the project directory, so no root is required. If you see a permission error on a path like `/opt/...`, an old `XDG_CACHE_HOME` env var is being inherited from the shell. Either unset it or override it in `.env`:

```bash
XDG_CACHE_HOME=assets/models
```

If transcription of long audio hangs and eventually returns HTTP 500, increase the timeout in `.env`:

```bash
COLD_LANE_TIMEOUT_SECONDS=600
```

Before running `docker compose up` for the first time, the host machine requires a few one-time configuration steps to enable GPU passthrough via the NVIDIA Container Toolkit CDI mode.
These steps are required because Docker's default legacy GPU mode relies on BPF cgroup device filters, which are not available in cgroup v2 environments (Ubuntu 22.04+). CDI solves this cleanly.
1. Add the NVIDIA package repository:

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

2. Install the toolkit:

```bash
sudo apt update && sudo apt install -y nvidia-container-toolkit
```

3. Generate the CDI spec (exposes the GPU to containers via a stable device descriptor):

```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

4. Enable CDI in the Docker daemon:

```bash
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "features": {
    "cdi": true
  }
}
EOF
sudo systemctl restart docker
```

5. Verify it works:

```bash
docker run --rm --device nvidia.com/gpu=all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
```

Note: Step 3 must be re-run whenever the NVIDIA driver is updated (`sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml`).
```bash
# Build and start
docker compose up -d

# Check the server is ready
curl http://localhost:9005/health

# View logs
docker compose logs -f

# Stop
docker compose down
```

The model is persisted in `assets/models/whisper/` (host volume), so it only downloads once.
`GET /metrics` returns Prometheus-format metrics for direct scraping by Prometheus, Telegraf's `inputs.prometheus` plugin, or any other OpenMetrics-compatible consumer. Metrics share the `uttera_stt_*` namespace with the sibling uttera-stt-vllm backend (same names and label shapes for the common series; the `engine` label in `uttera_stt_build_info` differentiates the variant), plus this server's additional hot/cold pool telemetry.

```toml
[[inputs.prometheus]]
  urls = ["http://stt-host:9005/metrics"]
  interval = "15s"
```

Key series:
| Metric | Type | Use |
|---|---|---|
| `uttera_stt_requests_total{endpoint,method,status}` | Counter | Per-endpoint request rate + status mix |
| `uttera_stt_request_duration_seconds{endpoint,method}` | Histogram | HTTP p50/p95/p99 (total RTT) |
| `uttera_stt_inflight_requests` | Gauge | Live load (hot + cold combined) |
| `uttera_stt_requests_by_route_total{route}` | Counter | Lane split: HOT / COLD-POOL / COLD-POOL>HOT |
| `uttera_stt_transcriptions_total{response_format}` | Counter | Traffic mix across the five response formats |
| `uttera_stt_translations_total{mode,response_format}` | Counter | Translation path breakdown |
| `uttera_stt_audio_seconds_total{endpoint,route}` | Counter | Audio processed, lane-tagged (billing / throughput proxy) |
| `uttera_stt_inference_duration_seconds{op}` | Histogram | Lane-tagged model latency: `whisper_transcribe_hot` / `whisper_transcribe_cold` / `libretranslate` |
| `uttera_stt_cold_workers_active` | Gauge | Live cold subprocesses |
| `uttera_stt_cold_workers_loading` | Gauge | Cold subprocesses booting |
| `uttera_stt_cold_worker_pool_size_cap` | Gauge | `COLD_POOL_SIZE` |
| `uttera_stt_cold_workers_spawned_total` | Counter | Monotonic spawn count (for cold-worker-churn dashboards) |
| `uttera_stt_cold_worker_ema_start_seconds` | Gauge | Rolling EMA of cold-worker boot time |
| `uttera_stt_work_queue_depth` | Gauge | Items queued |
| `uttera_stt_work_queue_audio_seconds` | Gauge | Audio queued (for drain-time estimate) |
| `uttera_stt_load_score` | Gauge | Saturation signal [0.0, 1.0] |
| `uttera_stt_hot_ema_sps` | Gauge | Hot-lane throughput EMA |
| `uttera_stt_vram_free_gb` | Gauge | GPU memory headroom |
| `uttera_stt_vram_per_cold_worker_gb` | Gauge | Rolling EMA of VRAM per cold subprocess |
| `uttera_stt_engine_ready` | Gauge | 1 if hot worker loaded |
| `uttera_stt_libretranslate_configured` | Gauge | 1 if `LIBRETRANSLATE_URL` was set at startup |
| `uttera_stt_errors_total{type}` | Counter | Typed errors (decode / validation / model / libretranslate) |
| `uttera_stt_build_info{version,engine,model}` | Gauge | Version + engine + model in the field (value always 1) |
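For ad-hoc checks without a full Prometheus stack, the text exposition format served by `/metrics` can be parsed with a few lines of stdlib Python. This is a deliberately tiny parser for spot checks; use an official client library for real scraping.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {(name, labels_string): float}.

    Limitations: ignores HELP/TYPE lines and assumes no spaces inside
    label values, which holds for the series listed above.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and HELP/TYPE metadata
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        out[(name, labels)] = float(value)
    return out
```

A quick sanity check from a shell could pipe `curl http://localhost:9005/metrics` into this function and assert, for example, that `uttera_stt_engine_ready` is 1.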
By default, the server binds to 127.0.0.1 on port 9005.
- To allow external network access, change `--host` to `0.0.0.0`.
- WARNING: This API has no authentication. Exposing it to the network via `0.0.0.0` is a security risk. Ensure the server sits behind a firewall or operates within a secure VPN/local network.
| Task | Latency |
|---|---|
| Short command (2s audio, Hot Lane) | ~0.2s |
| Long audio (30s, Hot Lane) | ~0.7s |
| 160 concurrent (Hot + Cold Pool) | Target ~21s total, 0 failures |
Server source code: Apache License 2.0. Commercial use permitted.
Whisper model weights (OpenAI): released under the MIT License; commercial use permitted, no restrictions. See NOTICE for full attributions.
Created and maintained by Hugo L. Espuny, with contributions acknowledged in AUTHORS.md.
If you want to follow the project or get involved:
- Star this repo to help discoverability.
- Report issues via the issue tracker.
- Join the conversation in Discussions.
- Technical posts at blog.uttera.ai.
- Uttera Cloud: https://uttera.ai (EU-hosted, solar-powered, flat-rate subscription).
Uttera /ˈʌt.ər.ə/ comes from the English verb "to utter" (to speak aloud, to pronounce, to give audible expression to). Formally, the name is a backronym of Universal Text Transformer Engine for Realtime Audio, reflecting the project's origin as an STT/TTS server and its underlying Transformer architecture.