High-quality American English voices for Qwen3-TTS, optimized for real-time local inference on Apple Silicon.
360ms to first audio. 2x real-time. 1.7GB RAM. Fully local.
Holler is a fine-tune of Qwen3-TTS-12Hz-0.6B with curated American English voices. It runs entirely on your Mac via Metal. No cloud, no API keys, no internet required after the initial model download.
Two ways to use it:
- HollerKit (Swift) — native Swift package for macOS apps. Stream text in, get audio out.
- Python server — HTTP API with streaming audio. Good for prototyping and non-Swift integrations.
Native Swift library for integrating Holler into macOS apps. Supports streaming text input (LLM integration), automatic sentence buffering, KV cache carryover for natural prosody, silence detection, and retry logic.
Add HollerKit to your Package.swift:
```swift
dependencies: [
    .package(url: "https://github.com/sentiuminc/holler.git", from: "1.0.0"),
],
targets: [
    .target(name: "MyApp", dependencies: [
        .product(name: "HollerKit", package: "holler"),
    ]),
]
```

Generate speech:
```swift
import HollerKit

let model = try await HollerModel.load()

// Simple: text in, audio out
let audio = try await model.synthesize("Hello world", voice: "kit")
// audio.samples: [Float], audio.sampleRate: 24000

// Streaming: get audio chunks as they're generated (~360ms to first audio)
for try await chunk in model.stream("Hello world", voice: "kit") {
    player.scheduleBuffer(chunk.samples)
}

// LLM integration: feed tokens, get audio
let session = model.makeSession(voice: "kit")

// Consumer side — runs concurrently
Task {
    for try await chunk in session.audio {
        player.scheduleBuffer(chunk.samples)
    }
}

// Producer side — feed text as it arrives from the LLM
for await token in llmStream {
    session.feed(token)
}
await session.finish()
```

`SpeechSession` handles everything automatically: sentence boundary detection, streaming generation, silence trimming, KV cache carryover between sentences, and retry on failed generations. Feed it text in any chunk size — single characters, whole sentences, random LLM-sized pieces — it accumulates and splits on sentence boundaries internally.
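The accumulate-and-split idea can be sketched in a few lines. This `SentenceBuffer` class and its regex are illustrative assumptions, not HollerKit's actual implementation (which is internal to `SpeechSession`):

```python
import re

class SentenceBuffer:
    """Accumulate arbitrary text chunks; emit complete sentences as they form.
    Illustrative sketch only — not HollerKit's internal splitter."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> list[str]:
        """Add a chunk of any size; return any sentences completed by it."""
        self._buf += chunk
        sentences = []
        # A sentence ends at ., !, or ? followed by whitespace.
        while (m := re.search(r"[.!?]\s+", self._buf)):
            sentences.append(self._buf[: m.end()].strip())
            self._buf = self._buf[m.end():]
        return sentences

    def finish(self) -> list[str]:
        """Flush whatever trailing text never got a terminator."""
        rest, self._buf = self._buf.strip(), ""
        return [rest] if rest else []

buf = SentenceBuffer()
out = []
for token in ["Sure. ", "Let me ch", "eck that for you. ", "Done"]:
    out.extend(buf.feed(token))
out.extend(buf.finish())
print(out)  # ['Sure.', 'Let me check that for you.', 'Done']
```

The point is that callers never need to align their chunks with sentence boundaries; the buffer does the alignment.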
The package includes a command-line tool for testing:
```sh
# Build once (~3 min first time)
./build.sh

# Speak text through your speakers
./holler --text 'Hello world' --talk

# Save to file
./holler --text 'Hello world' --output hello.wav

# Simulate LLM streaming (token-by-token with sentence buffering)
./holler --session --text 'Sure. Let me check that for you. I think the answer is forty two.'

# Benchmark
./holler --benchmark

# Debug mode — see the full pipeline (sentence splits, chunk RMS, cache state, retries)
./holler --session --debug --text 'Your text here'
```
`build.sh` uses xcodebuild under the hood because MLX requires compiled Metal shaders (`.metallib`), which only Xcode can produce. `swift build` compiles the Swift code but skips the Metal shaders.
```swift
var config = HollerConfiguration()
config.temperature = 0.6   // Sampling temperature
config.codebooks = 12      // Codec books (12 = fast, 16 = max quality)
config.maxRetries = 3      // Retry attempts on failed generation
config.log = { print($0) } // Enable debug logging

let model = try await HollerModel.load(repo: "sentium/holler-0.6b-6bit", configuration: config)
```

| Metric | Value |
|---|---|
| Time to first audio | 360ms avg |
| Real-time factor | 0.49 avg (2x real-time) |
| Metal RAM | 1.7 GB |
| Model on disk | 1.1 GB (6-bit) |
TTFA is measured after trimming the codec's warmup silence — 360ms is the time to first audible speech, not to the first raw audio chunk.
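The real-time factor above is generation time divided by audio duration, so values below 1.0 are faster than real time. A quick illustration of how 0.49 maps to "2x real-time" (the wall-clock and audio durations here are hypothetical):

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """RTF = time spent generating / duration of audio produced.
    RTF < 1.0 means faster than real time; 1/RTF is the speedup."""
    return generation_s / audio_s

# Hypothetical run: 10 s of speech generated in 4.9 s of wall time
rtf = real_time_factor(4.9, 10.0)
print(f"RTF {rtf:.2f}, {1 / rtf:.1f}x real-time")  # RTF 0.49, 2.0x real-time
```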
HTTP API with streaming audio. Good for prototyping and non-Swift integrations.
```sh
git clone https://github.com/sentiuminc/holler.git
cd holler
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Start the server
python3 inference/server.py

# Open http://localhost:8100 in your browser, or:
curl "http://localhost:8100/tts?text=Hello+world" -o hello.wav
```

On first run, the server downloads `sentium/holler-0.6b-6bit` from HuggingFace (~1.1GB, cached for future runs).
Returns audio as it's generated. Float32 PCM at 24kHz, chunked transfer encoding. First audio arrives in ~139ms.
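Because the stream is raw little-endian float32 PCM at a fixed 24kHz rate, a client can decode chunks and compute playable duration directly from byte counts. A minimal sketch (the chunk bytes here are synthetic stand-ins for the HTTP stream):

```python
import struct

SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 4  # float32

def decode_chunk(raw: bytes) -> list[float]:
    """Decode one chunk of little-endian float32 PCM into samples."""
    count = len(raw) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}f", raw[: count * BYTES_PER_SAMPLE]))

def duration_seconds(total_bytes: int) -> float:
    """Playable duration represented by a byte count of the stream."""
    return total_bytes / BYTES_PER_SAMPLE / SAMPLE_RATE

# Synthetic example: 24,000 samples of silence = 1 second of audio
chunk = struct.pack("<24000f", *([0.0] * 24_000))
print(len(decode_chunk(chunk)), duration_seconds(len(chunk)))  # 24000 1.0
```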
```sh
curl -X POST http://localhost:8100/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "The weather looks great today.", "voice": "kit"}'
```

| Field | Type | Default | Description |
|---|---|---|---|
| `text` | string | required | Text to speak |
| `voice` | string | first available | Voice name |
| `temperature` | float | 0.6 | Sampling temperature |
| `top_k` | int | 50 | Top-k sampling |
| `n_codebooks` | int | 12 | Codec books (12 fastest, 16 max quality) |
| `continue` | bool | false | Carry over prosody from previous generation |
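Since every field except `text` has a server-side default, a client only needs to send the fields it overrides. A hypothetical helper for building the request body (not part of the server API):

```python
import json

# Optional fields from the table above; only "text" is required.
OPTIONAL_FIELDS = ("voice", "temperature", "top_k", "n_codebooks", "continue")

def build_request(text: str, **overrides) -> str:
    """Build a /speak JSON body, sending only explicitly set fields
    and letting the server apply its defaults for the rest."""
    body = {"text": text}
    for key in OPTIONAL_FIELDS:
        if key in overrides:
            body[key] = overrides[key]
    return json.dumps(body)

print(build_request("Hello", temperature=0.8))
# {"text": "Hello", "temperature": 0.8}
```

Note that `continue` is a Python keyword, so it has to be passed via a dict, e.g. `build_request("Hi", **{"continue": True})`.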
```sh
curl "http://localhost:8100/tts?text=Hello+world&voice=kit" -o hello.wav
```

| Endpoint | Description |
|---|---|
| `GET /health` | Server status, model name, available voices |
| `GET /benchmark` | 6-sentence benchmark with TTFA/RTF results |
| `GET /` | Browser test UI with real-time playback |
| Metric | Value |
|---|---|
| Time to first audio | 139ms avg |
| Real-time factor | 0.38 avg (2.6x real-time) |
| Metal RAM | 1.7 GB |
Holler v1 ships with curated American English voices (2 trained, 8 more coming):
| Voice | Description | Status |
|---|---|---|
| Kit | Androgynous, clear, warm | Trained |
| Dakota | Male, grounded, natural | Trained |
| + 8 more | From 22 curated candidates | Coming soon |
All voices are created using Qwen3-TTS VoiceDesign, then fine-tuned with 400-500 curated clips per voice.
The full training pipeline is documented in `docs/training-runbook.md`:
- Voice design — create voice identity via VoiceDesign text prompts
- Data generation — 500 clips per voice locally on Mac via mlx-audio
- Enhancement — DeepFilter noise removal, LUFS normalization, de-essing
- Curation — manual listening pass
- Training — lr=1e-7, 2 epochs, text_projection patch, bf16
- Quantization — 6-bit affine g64
Scripts: `training/` for fine-tuning, `tools/` for data generation and curation.
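The enhancement step's loudness normalization can be sketched with a plain RMS version. This is a simplified stand-in: real LUFS normalization (ITU-R BS.1770, as used by tools like pyloudnorm) applies perceptual frequency weighting and gating that plain RMS ignores:

```python
import math

def normalize_rms(samples: list[float], target_dbfs: float = -20.0) -> list[float]:
    """Scale samples so their RMS level hits a target in dBFS.
    Simplified stand-in for the runbook's LUFS normalization."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return samples  # pure silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / rms
    return [s * gain for s in samples]

# A quiet clip at RMS 0.01 (-40 dBFS) gets boosted to 0.1 (-20 dBFS)
quiet = [0.01, -0.01] * 100
leveled = normalize_rms(quiet)
print(round(math.sqrt(sum(s * s for s in leveled) / len(leveled)), 3))  # 0.1
```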
- macOS with Apple Silicon (M1 or later)
- For Swift: Xcode 16+ (full app, not just Command Line Tools — MLX requires Metal shader compilation which only Xcode provides)
- For Python: Python 3.13+
- ~2GB free RAM for inference
Holler is a fine-tune of Qwen3-TTS by the Qwen team at Alibaba Cloud (Apache 2.0). All credit for the underlying architecture goes to them.
Apache 2.0