Voice runtime (STT + TTS) with OpenAI-compatible API
Quick Start · Core Capabilities · Architecture · API Docs · Demo · Full Documentation
Production voice runtime infrastructure: real-time Speech-to-Text and Text-to-Speech with an OpenAI-compatible API, streaming session control, and an extensible execution architecture.
Macaw OpenVoice is a production-grade runtime for voice systems.
It standardizes and operationalizes the execution of Speech-to-Text (STT) and Text-to-Speech (TTS) models in real environments by providing:
- a unified execution interface for multiple inference engines
- real-time audio streaming with controlled latency
- continuous session management
- bidirectional speech interaction
- operational observability
- production-ready APIs
Macaw acts as the infrastructure layer between voice models and production applications, abstracting complexity related to streaming, synchronization, state management, and execution control.
Macaw OpenVoice plays the same role for voice systems that:
- vLLM plays for LLM serving
- Triton Inference Server plays for GPU inference
- Ollama plays for local model execution
It transforms voice models into operational services.
- OpenAI-compatible Audio API
- Real-time full-duplex WebSocket streaming
- Local runtime CLI
- simultaneous STT and TTS in the same session
- automatic speech detection
- barge-in support (interruptible speech)
- automatic mute during synthesis
- state machine for continuous audio processing
- ring buffer with persistence
- crash recovery without context loss
- cross-segment coherence
- automatic resampling
- DC offset removal
- gain normalization
- voice activity detection
- multiple STT and TTS engines
- subprocess isolation
- declarative model registry
- pluggable architecture
- priority-based scheduler
- dynamic batching
- latency tracking
- Prometheus metrics
Macaw is designed for real-world voice workloads:
- real-time conversational voice agents
- telephony automation (SIP / VoIP)
- live transcription systems
- embedded voice interfaces
- multimodal assistants
- interactive media streaming
- continuous audio processing pipelines
# Install
pip install macaw-openvoice[server,grpc,faster-whisper]
# Pull a model
macaw pull faster-whisper-tiny
# Start the runtime
macaw serve
$ macaw serve
╔══════════════════════════════════════════════╗
║ Macaw OpenVoice v1.0.0 ║
╚══════════════════════════════════════════════╝
INFO Scanning models in ~/.macaw/models
INFO Found 2 model(s): faster-whisper-tiny (STT), kokoro-v1 (TTS)
INFO Spawning STT worker faster-whisper-tiny port=50051 engine=faster-whisper
INFO Spawning TTS worker kokoro-v1 port=50052 engine=kokoro
INFO Scheduler started aging=30.0s batch_ms=75.0 batch_max=8
INFO Uvicorn running on http://127.0.0.1:8000
# Via REST API
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.wav \
-F model=faster-whisper-tiny
# Via CLI
macaw transcribe audio.wav --model faster-whisper-tiny
wscat -c "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
# Send binary audio frames, receive JSON transcript events
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "kokoro-v1", "input": "Hello, how can I help you?", "voice": "default"}' \
--output speech.wav
Clients
CLI / REST / WebSocket (full-duplex)
|
v
+----------------------------------------------------+
| API Server (FastAPI) |
| |
| POST /v1/audio/transcriptions (STT batch) |
| POST /v1/audio/translations (STT translate) |
| POST /v1/audio/speech (TTS) |
| WS /v1/realtime (STT+TTS) |
+----------------------------------------------------+
| Scheduler |
| Priority queue (realtime > batch), cancellation, |
| dynamic batching, latency tracking |
+----------------------------------------------------+
| Model Registry |
| Declarative manifest (macaw.yaml), lifecycle |
+----------+-------------------+---------------------+
| |
+--------+--------+ +------+-------+
| STT Workers | | TTS Workers |
| (subprocess | | (subprocess |
| gRPC) | | gRPC) |
| | | |
| Faster-Whisper | | Kokoro |
| WeNet | | |
+-----------------+ +--------------+
|
+----------+-------------------------------------+
| Audio Preprocessing Pipeline |
| Resample -> DC Remove -> Gain Normalize |
+------------------------------------------------+
| Session Manager (STT only) |
| 6 states, ring buffer, WAL, LocalAgreement, |
| cross-segment context, crash recovery |
+------------------------------------------------+
| VAD (Energy Pre-filter + Silero VAD) |
+------------------------------------------------+
| Post-Processing (ITN via NeMo) |
+------------------------------------------------+
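The preprocessing stages shown in the diagram are standard DSP steps. As a rough illustration only (a generic sketch, not Macaw's actual implementation), DC offset removal and gain normalization of a PCM frame can look like this; resampling is omitted for brevity:

import numpy as np

def preprocess_frame(pcm16: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    # Illustrative sketch of two pipeline stages (not the runtime's own code).
    audio = pcm16.astype(np.float32) / 32768.0  # int16 -> float in [-1, 1]
    audio -= audio.mean()                       # remove DC offset
    peak = np.abs(audio).max()
    if peak > 0:
        audio *= target_peak / peak             # normalize gain to target peak
    return audio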
| Engine | Type | Architecture | Partials | Hot Words | Status |
|---|---|---|---|---|---|
| Faster-Whisper | STT | encoder-decoder | LocalAgreement | via initial_prompt | Supported |
| WeNet | STT | CTC | native | native keyword boosting | Supported |
| Kokoro | TTS | neural | — | — | Supported |
Adding a new engine requires ~400-700 lines of code and zero changes to the runtime core. See the Adding an Engine guide.
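For orientation, an STT adapter skeleton could look like the hypothetical sketch below; the class and method names are illustrative assumptions, not Macaw's actual interface:

# Hypothetical engine adapter sketch; names are illustrative, not the real Macaw API.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed text

class MySTTEngine:
    """Wraps a third-party STT library behind a small, engine-agnostic surface."""

    def load(self, model_path: str) -> None:
        # Initialize the underlying model (weights, device placement, etc.).
        raise NotImplementedError

    def transcribe(self, audio: bytes, language: str | None = None) -> list[Segment]:
        # Run inference on a PCM buffer and return timestamped segments.
        raise NotImplementedError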
Macaw implements the OpenAI Audio API contract, so existing SDKs work without modification:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# Transcription
result = client.audio.transcriptions.create(
model="faster-whisper-tiny",
file=open("audio.wav", "rb"),
)
print(result.text)
# Text-to-Speech
response = client.audio.speech.create(
model="kokoro-v1",
input="Hello, how can I help you?",
voice="default",
)
response.stream_to_file("output.wav")
The /v1/realtime endpoint supports full-duplex STT + TTS:
Client -> Server:

| Message | Description |
|---|---|
| Binary frames | PCM 16-bit audio (any sample rate) |
| session.configure | Configure VAD, language, hot words, TTS model |
| tts.speak | Trigger text-to-speech synthesis |
| tts.cancel | Cancel active TTS |

Server -> Client:

| Message | Description |
|---|---|
| session.created | Session established |
| vad.speech_start | Speech detected |
| transcript.partial | Intermediate hypothesis |
| transcript.final | Confirmed segment (with ITN) |
| vad.speech_end | Speech ended |
| tts.speaking_start | TTS started (STT muted) |
| Binary frames | TTS audio output |
| tts.speaking_end | TTS finished (STT unmuted) |
| error | Error with recoverable flag |
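A minimal realtime client sketch, assuming the third-party websockets Python package and the message names listed above (the payload fields and chunking are assumptions, not the documented schema):

# Minimal realtime client sketch; event payload fields are assumptions.
import asyncio
import json
import websockets

async def stream(path: str = "audio.raw") -> None:
    uri = "ws://localhost:8000/v1/realtime?model=faster-whisper-tiny"
    async with websockets.connect(uri) as ws:
        # Configure the session (fields shown are illustrative).
        await ws.send(json.dumps({"type": "session.configure", "language": "en"}))
        # Send raw PCM 16-bit audio as binary frames.
        with open(path, "rb") as f:
            while chunk := f.read(3200):   # ~100 ms at 16 kHz mono
                await ws.send(chunk)
        # Read transcript events until the server reports end of speech.
        async for message in ws:
            if isinstance(message, bytes):
                continue                   # TTS audio frames, ignored here
            event = json.loads(message)
            print(event.get("type"), event.get("text", ""))
            if event.get("type") == "vad.speech_end":
                break

asyncio.run(stream())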
macaw serve # Start API server
macaw transcribe audio.wav # Transcribe file
macaw transcribe audio.wav --format srt # Generate subtitles
macaw transcribe --stream # Stream from microphone
macaw translate audio.wav # Translate to English
macaw list # List installed models
macaw pull faster-whisper-tiny # Download a model
macaw inspect faster-whisper-tiny # Model details
An interactive demo with a React/Next.js frontend is included:
./demo/start.sh
This starts the FastAPI backend (port 9000) and the Next.js frontend (port 3000) together. The demo includes a dashboard for batch transcriptions, real-time streaming STT with VAD visualization, and a TTS playground. See demo/README.md for details.
# Setup (requires Python 3.11+ and uv)
uv venv --python 3.12
uv sync --all-extras
# Development workflow
make check # format + lint + typecheck
make test-unit # unit tests (preferred during development)
make test # all tests (1686 passing)
make ci # full pipeline: format + lint + typecheck + test
Full documentation is available at usemacaw.github.io/macaw-openvoice.
We welcome contributions! Please read our Contributing Guide before submitting a pull request.
- Website: usemacaw.io
- Email: hello@usemacaw.io
- GitHub: github.com/usemacaw/macaw-openvoice

