OpenAI-compatible speech-to-text server with pluggable backends.
Drop-in replacement for faster-whisper-server / Speaches with a cleaner architecture, web UI, and real-time streaming.
- OpenAI API compatible — `POST /v1/audio/transcriptions`, `POST /v1/audio/translations`
- Real-time streaming — `WS /v1/audio/stream` (Deepgram-compatible protocol)
- Web UI — Upload files, record from mic, stream live, or synthesize speech at `/web`
- Text-to-speech — `POST /v1/audio/speech` (OpenAI-compatible, Kokoro-82M backend)
- Voice blending — Mix voices with `af_bella(2)+af_sky(1)` syntax
- Multiple STT backends — faster-whisper (GPU/CPU), Moonshine (fast CPU, English), Vosk (tiny, offline)
- Pluggable backends — select via model name: `faster-whisper-*`, `moonshine/*`, `vosk-*`
- TTS Web UI — Text-to-speech tab with voice selector, blending, speed control, history, and model management
- Model hot-swap — Load/unload models via `/api/ps` and `/v1/audio/models/load`
- GPU + CPU — CUDA float16 or CPU int8
- Self-signed HTTPS — Auto-generated cert, browser mic works out of the box
- Silero VAD — Voice activity detection prevents transcribing silence
- Docker ready — GPU and CPU compose files included
```bash
# GPU (NVIDIA)
docker run -d -p 8100:8100 --gpus all jwindsor1/open-speech:latest

# CPU
docker run -d -p 8100:8100 jwindsor1/open-speech:cpu
```

Open https://localhost:8100/web — accept the self-signed cert warning, then upload audio or use the mic.
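To check the server from a script rather than the browser, the `/health` endpoint can be polled. A minimal sketch using `httpx` (`verify=False` because of the self-signed cert):

```python
import httpx

# 200 means the server is up; the body reports health and loaded model count
resp = httpx.get("https://localhost:8100/health", verify=False)
print(resp.status_code, resp.text)
```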
```bash
git clone https://github.com/will-assistant/open-speech.git
cd open-speech

# GPU
docker compose -f docker-compose.gpu.yml up -d

# CPU
docker compose -f docker-compose.cpu.yml up -d
```

Compose uses persistent volumes for model cache — models survive container rebuilds.
```bash
cp .env.example .env   # edit as needed
docker compose -f docker-compose.gpu.yml up -d
```

All settings work as environment variables, in `.env`, or inline in compose. See Configuration for the full list.
- Install Docker Desktop with the WSL2 backend
- Install the NVIDIA Container Toolkit in WSL2
- Run:

  ```bash
  docker run -d -p 8100:8100 --gpus all jwindsor1/open-speech:latest
  ```

Or clone the repo and use `docker compose -f docker-compose.gpu.yml up -d` for persistent config.
Open Speech supports multiple speech-to-text backends. The backend is selected automatically based on the `model` parameter:

| Backend | Model prefix | Best for | Languages |
|---|---|---|---|
| faster-whisper | `deepdml/faster-whisper-*`, etc. | High accuracy, GPU | 99+ languages |
| Moonshine | `moonshine/tiny`, `moonshine/base` | Fast CPU inference, edge | English only |
| Vosk | `vosk-model-*` | Tiny models, fully offline | Many (per model) |
```bash
pip install 'open-speech[moonshine]'   # Moonshine (moonshine-onnx)
pip install 'open-speech[vosk]'        # Vosk
```

```bash
# faster-whisper (default)
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "model=deepdml/faster-whisper-large-v3-turbo-ct2"

# Moonshine — 5x faster than Whisper on CPU, English only
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "model=moonshine/tiny"

# Vosk — tiny offline model
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "model=vosk-model-small-en-us-0.15"
```

Set a default backend via environment variable:
```bash
STT_DEFAULT_MODEL=moonshine/tiny   # Use Moonshine by default
```

```bash
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=deepdml/faster-whisper-large-v3-turbo-ct2" \
  -F "response_format=json"
```

```python
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://localhost:8100/v1",
    api_key="not-needed",
    http_client=httpx.Client(verify=False),  # self-signed cert
)

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="deepdml/faster-whisper-large-v3-turbo-ct2",
        file=f,
    )

print(result.text)
```
```bash
# Generate speech (saves as MP3)
curl -sk https://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Hello world","voice":"alloy"}' \
  -o output.mp3
```

```python
# OpenAI Python SDK
speech = client.audio.speech.create(
    model="kokoro",
    input="Hello world!",
    voice="alloy",  # or "af_bella", "af_bella(2)+af_sky(1)"
    response_format="mp3",
)
speech.stream_to_file("output.mp3")
```

Voice options:
- OpenAI names: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
- Kokoro voices: `af_heart`, `af_bella`, `am_adam`, etc.
- Blends: `af_bella(2)+af_sky(1)` (weighted mix)

Formats: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`
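A blend can be passed anywhere a voice name is accepted. A minimal sketch reusing the `client` from the Python example above; the weights are relative, so `af_bella(2)+af_sky(1)` mixes two parts Bella to one part Sky:

```python
# Weighted two-voice blend; weights are relative, not percentages
speech = client.audio.speech.create(
    model="kokoro",
    input="Blended voices sound like a mix of their parts.",
    voice="af_bella(2)+af_sky(1)",
    response_format="wav",
)
speech.stream_to_file("blend.wav")
```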
```bash
# Stream audio as it's generated (chunked transfer)
curl -sk "https://localhost:8100/v1/audio/speech?stream=true" \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro","input":"Long text here...","voice":"af_heart"}' \
  --output - | mpv -
```
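The same chunked stream can be consumed from Python. A sketch using `httpx`, with the same endpoint and body as the curl example, written to disk as it arrives:

```python
import httpx

body = {"model": "kokoro", "input": "Long text here...", "voice": "af_heart"}

with httpx.stream(
    "POST",
    "https://localhost:8100/v1/audio/speech?stream=true",
    json=body,
    verify=False,  # self-signed cert
) as resp, open("stream.mp3", "wb") as out:
    for chunk in resp.iter_bytes():
        out.write(chunk)  # chunks arrive as the audio is generated
```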
```bash
# Get SRT subtitles
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "response_format=srt" -o transcript.srt

# Get WebVTT subtitles
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "response_format=vtt" -o transcript.vtt

# Plain text only
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "response_format=text"
```
-F "response_format=text"const ws = new WebSocket("wss://localhost:8100/v1/audio/stream?model=deepdml/faster-whisper-large-v3-turbo-ct2");
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === "transcript") {
console.log(data.is_final ? "FINAL:" : "partial:", data.text);
}
};
// Send PCM16 LE mono 16kHz audio as binary frames
ws.send(audioChunkArrayBuffer);
// Stop gracefully
ws.send(JSON.stringify({ type: "stop" }));| Method | Path | Description |
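The same protocol can be driven from Python. A sketch assuming the third-party `websockets` package and a raw PCM16 LE mono 16 kHz file (`audio.pcm`); the pacing and file name are illustrative, and it assumes the server closes the socket after `stop`:

```python
# Streaming client sketch: pip install websockets
import asyncio
import json
import ssl

import websockets

URL = ("wss://localhost:8100/v1/audio/stream"
       "?model=deepdml/faster-whisper-large-v3-turbo-ct2")

async def stream_file(path: str) -> None:
    ssl_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ssl_ctx.check_hostname = False
    ssl_ctx.verify_mode = ssl.CERT_NONE  # self-signed cert

    async with websockets.connect(URL, ssl=ssl_ctx) as ws:
        async def receive() -> None:
            # Runs until the server closes the connection
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "transcript":
                    tag = "FINAL" if data.get("is_final") else "partial"
                    print(tag, data["text"])

        recv_task = asyncio.create_task(receive())

        # 16 kHz * 2 bytes * 0.1 s = 3200 bytes per ~100 ms chunk
        with open(path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)      # binary PCM16 frame
                await asyncio.sleep(0.1)  # pace roughly in real time

        await ws.send(json.dumps({"type": "stop"}))  # graceful stop
        await recv_task

asyncio.run(stream_file("audio.pcm"))
```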
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Server health + loaded model count |
| `GET` | `/v1/models` | List available models (STT + TTS) |
| `GET` | `/api/ps` | Show loaded models with details |
| `POST` | `/api/ps/{model}` | Load a model |
| `POST` | `/v1/audio/transcriptions` | Transcribe audio file |
| `POST` | `/v1/audio/translations` | Translate audio to English |
| `POST` | `/v1/audio/speech` | Synthesize speech from text (TTS) |
| `GET` | `/v1/audio/voices` | List available TTS voices |
| `WS` | `/v1/audio/stream` | Real-time streaming transcription |
| `GET` | `/web` | Web UI |
| `GET` | `/docs` | Swagger/OpenAPI docs |
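`/v1/audio/translations` is used exactly like transcription but always returns English. A minimal sketch reusing the `client` from the Python example above (the file name is illustrative):

```python
# Translate non-English speech to English text
with open("audio_fr.wav", "rb") as f:
    result = client.audio.translations.create(
        model="deepdml/faster-whisper-large-v3-turbo-ct2",
        file=f,
    )
print(result.text)
```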
All config via environment variables:
| Variable | Default | Description |
|---|---|---|
| `STT_HOST` | `0.0.0.0` | Bind address |
| `STT_PORT` | `8100` | Listen port |
| `STT_DEVICE` | `cuda` | `cuda` or `cpu` |
| `STT_COMPUTE_TYPE` | `float16` | `float16`, `int8`, `int8_float16` |
| `STT_DEFAULT_MODEL` | `deepdml/faster-whisper-large-v3-turbo-ct2` | Default whisper model |
| `STT_PRELOAD_MODELS` | *(empty)* | Comma-separated models to download and load on startup |
| `STT_MODEL_TTL` | `300` | Seconds idle before auto-unload (0 = never). Default model exempt |
| `STT_MAX_LOADED_MODELS` | `0` | Max models in memory (0 = unlimited). LRU eviction, default exempt |
| `STT_STREAM_CHUNK_MS` | `2000` | Streaming chunk size (ms) |
| `STT_STREAM_VAD_THRESHOLD` | `0.5` | VAD speech detection threshold |
| `STT_STREAM_ENDPOINTING_MS` | `300` | Silence (ms) before finalizing utterance |
| `STT_STREAM_MAX_CONNECTIONS` | `10` | Max concurrent WebSocket streams |
| `STT_API_KEY` | *(empty)* | API key for authentication (empty = auth disabled) |
| `STT_RATE_LIMIT` | `0` | Max requests/min per IP (0 = disabled) |
| `STT_RATE_LIMIT_BURST` | `0` | Burst allowance (0 = same as rate limit) |
| `STT_MAX_UPLOAD_MB` | `100` | Maximum upload file size in MB |
| `STT_CORS_ORIGINS` | `*` | Comma-separated allowed CORS origins |
| `STT_TRUST_PROXY` | `false` | Trust `X-Forwarded-For` for rate limiting (set `true` behind a reverse proxy) |
| TTS Settings | | |
| `TTS_ENABLED` | `true` | Enable/disable TTS endpoints |
| `TTS_DEFAULT_MODEL` | `kokoro` | Default TTS model |
| `TTS_DEFAULT_VOICE` | `af_heart` | Default voice |
| `TTS_DEVICE` | (inherits `STT_DEVICE`) | Device for TTS inference (`cuda`/`cpu`) |
| `TTS_MAX_INPUT_LENGTH` | `4096` | Max input text length (chars) |
| `TTS_DEFAULT_FORMAT` | `mp3` | Default output audio format |
| `TTS_DEFAULT_SPEED` | `1.0` | Default speech speed |
| `TTS_PRELOAD_MODELS` | *(empty)* | Comma-separated TTS models to preload |
| `TTS_VOICES_CONFIG` | *(empty)* | Path to custom voice presets YAML |
Open Speech supports Ollama-style automatic model eviction to manage memory:
- TTL eviction — Models idle longer than `STT_MODEL_TTL` seconds are automatically unloaded. The default model is exempt.
- Max models — When `STT_MAX_LOADED_MODELS` is set, the least recently used non-default model is evicted when the limit is exceeded.
- Manual unload — `DELETE /api/ps/{model}` immediately unloads a model (409 for the default model, 404 if not loaded); see the sketch after this list.
- Enriched status — `GET /api/ps` returns `last_used_at`, `is_default`, and `ttl_remaining` per model.
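These endpoints are easy to drive from a script. A sketch using `httpx`; the model name is illustrative, and the success status code is an assumption (the 409/404 cases are documented above):

```python
import httpx

api = httpx.Client(base_url="https://localhost:8100", verify=False)  # self-signed cert

# Loaded models, each with last_used_at / is_default / ttl_remaining
print(api.get("/api/ps").json())

# Load a model explicitly, then unload it
api.post("/api/ps/vosk-model-small-en-us-0.15")
resp = api.delete("/api/ps/vosk-model-small-en-us-0.15")
print(resp.status_code)  # 409 for the default model, 404 if not loaded
```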
```bash
# Keep models for 10 minutes, max 3 loaded
STT_MODEL_TTL=600 STT_MAX_LOADED_MODELS=3 docker compose up -d
```

Set `STT_API_KEY` to require authentication on all API endpoints. Health (`/health`) and the web UI (`/web`) are always exempt.
```bash
# Enable auth
STT_API_KEY=my-secret-key docker compose up -d

# Use with curl
curl -sk https://localhost:8100/v1/audio/transcriptions \
  -H "Authorization: Bearer my-secret-key" \
  -F "file=@audio.wav"
```

```python
# Use with OpenAI SDK
client = OpenAI(base_url="https://localhost:8100/v1", api_key="my-secret-key")
```

```javascript
// WebSocket auth via query param
ws = new WebSocket("wss://localhost:8100/v1/audio/stream?api_key=my-secret-key");
```

Per-IP token bucket rate limiter. Set `STT_RATE_LIMIT` to enable.
```bash
STT_RATE_LIMIT=60        # 60 requests/min per IP
STT_RATE_LIMIT_BURST=10  # Allow bursts up to 10
```

Rate limit info is returned in response headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `Retry-After` (on 429).
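When the limiter is on, the headers can be inspected on any rate-limited response. A small sketch with `httpx` (assumes auth is disabled; add the `Authorization` header otherwise):

```python
import httpx

resp = httpx.get("https://localhost:8100/v1/models", verify=False)
print(resp.headers.get("X-RateLimit-Limit"))      # configured limit
print(resp.headers.get("X-RateLimit-Remaining"))  # requests left this window
```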
`STT_MAX_UPLOAD_MB` (default 100) caps file upload size. Empty files are rejected with 400.
`STT_CORS_ORIGINS` controls allowed origins (default `*`). Set to specific origins for production:

```bash
STT_CORS_ORIGINS=https://myapp.com,https://staging.myapp.com
```

The `response_format` parameter supports: `json`, `text`, `verbose_json`, `srt`, `vtt`.
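`verbose_json` is the one format not shown above. A sketch with `httpx`; the field layout is assumed to mirror OpenAI's verbose transcription response, so check `/docs` for the exact schema:

```python
import httpx

with open("audio.wav", "rb") as f:
    resp = httpx.post(
        "https://localhost:8100/v1/audio/transcriptions",
        files={"file": f},
        data={
            "model": "deepdml/faster-whisper-large-v3-turbo-ct2",
            "response_format": "verbose_json",
        },
        verify=False,  # self-signed cert
    )
print(resp.json())  # text plus per-segment timing details
```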
MIT