A Claude Code skill that turns any content (file, URL, raw text, image, PDF) into spoken audio narration. Default backend is local Kokoro: free, no API key, no watermark, fully offline. Pass --gemini to use the Gemini 2.5 Flash TTS API instead.
You say "read this to me" and Claude does. The skill rewrites the source into a spoken-style script first, then streams audio chunk-by-chunk while later chunks synthesize, so audio starts within ~5s.
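The overlap between synthesis and playback can be pictured as a small producer/consumer pipeline. The sketch below is illustrative only and is not the skill's actual code: `stream`, `synthesize`, `play`, and `lookahead` are hypothetical names standing in for whatever the backend and audio player really expose.

```python
import queue
import threading


def stream(chunks, synthesize, play, lookahead=2):
    """Play chunks in order while later chunks synthesize in the background.

    `synthesize` and `play` are hypothetical callables, not the skill's API.
    `lookahead` bounds how far synthesis may run ahead of playback.
    """
    ready = queue.Queue(maxsize=lookahead)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for text in chunks:
                # Blocks once `lookahead` synthesized chunks are waiting.
                ready.put(synthesize(text))
        finally:
            # Always signal the end so the consumer never hangs.
            ready.put(done)

    threading.Thread(target=producer, daemon=True).start()

    played = []
    while (audio := ready.get()) is not done:
        play(audio)
        played.append(audio)
    return played
```

With this shape, playback can begin as soon as the first chunk is ready, which is why the time to first audio depends on the first chunk's length rather than the whole script's.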
User-scoped (recommended):
git clone https://github.com/wickdninja/tts.git ~/.claude/skills/tts
Project-scoped:
git clone https://github.com/wickdninja/tts.git <project>/.claude/skills/tts
Then in Claude Code:
/tts # ask what to read
/tts ./README.md # narrate a file
/tts https://example.com/post # narrate a URL
/tts --gemini ./README.md # use the Gemini API instead of Kokoro
/tts --voices # list Kokoro voices
/tts --voices --gemini # list Gemini voices
- macOS or Linux
- node (any modern version)
- uv
brew install uv node
The first run downloads ~340MB to ~/.cache/uv/ (kokoro + torch wheels) and ~/.cache/huggingface/ (the 82M Kokoro weights). Subsequent runs are fast and fully offline.
Output is 24kHz mono WAV. MIT-licensed model, no watermark, no audible AI disclaimer.
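To confirm that a rendered file matches this format, Python's standard `wave` module can read the header. This is a minimal sketch; `describe_wav` is a hypothetical helper name and is not part of the skill.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration
```

For Kokoro output, the first two values should come back as 24000 and 1.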
- GOOGLE_API_KEY
- network connectivity at synthesis time
export GOOGLE_API_KEY="..."
Pass --gemini to route through Google. Costs are negligible at typical volumes; audio is billed per character.
Set TTS_VOICE=<name> to pick a voice. Voice names are not interchangeable across backends.
| Backend | Default | Popular alternates |
|---|---|---|
| Kokoro | af_heart (American female, warm) | am_adam, bf_emma, bm_george, am_puck, af_bella |
| Gemini | Kore (female, warm) | Orus, Charon, Puck, Leda |
Full catalog (including Japanese, Mandarin, Spanish, French, Hindi, Italian, Brazilian Portuguese for Kokoro): see voices.md.
- Phase 1, extract. Read the file, fetch the URL, describe the image, etc.
- Phase 2, rewrite. Convert the source into a spoken narrative: no markdown, no bullets, no code blocks. Spoken transitions ("here's the thing"), expanded jargon, ~150 wpm.
- Phase 3, synthesize. Stream audio chunk-by-chunk with --play, or batch-render to a single WAV without it.
- Phase 4, replay. The combined WAV is always written to /tmp/tts-output.wav so you can replay with ! afplay /tmp/tts-output.wav (macOS).
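The ~150 wpm target makes narration length easy to estimate from the rewritten script. The sketch below is simple arithmetic; `estimated_minutes` is a hypothetical helper, and it assumes the default speaking rate with no speed adjustment.

```python
def estimated_minutes(script, words_per_minute=150):
    """Estimate narration length from word count at a given speaking pace."""
    return len(script.split()) / words_per_minute
```

At that pace a 1,500-word script comes to about 10 minutes, and a 30-minute narration corresponds to roughly 4,500 words.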
Audio chunking is automatic. Kokoro chunks on sentence boundaries; Gemini pre-chunks at ~1500 chars (streaming) or ~8000 chars (batch).
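One way to picture size-limited chunking is to pack whole sentences into chunks under a character budget. This is a sketch under stated assumptions, not the skill's implementation: `chunk_text` is a hypothetical name, the regex sentence split is deliberately naive, and the 1500 default simply mirrors the streaming figure quoted above.

```python
import re


def chunk_text(text, max_chars=1500):
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence ends keeps each chunk a natural unit of speech, so joins between chunks fall where a pause would occur anyway.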
Reach for --gemini when:
- uv isn't installed and you don't want to install it.
- You want a specific Gemini voice (Kore, Orus, etc.).
- Long-form content (>30 min) where Gemini's faster end-to-end synthesis matters more than privacy or cost.
Otherwise stay on Kokoro: free, fully offline, no watermark.
| Var | Backend | Default | Notes |
|---|---|---|---|
| TTS_VOICE | both | af_heart / Kore | voice selection |
| TTS_LANG | Kokoro | inferred from voice prefix | override (e.g. a, b, j, z) |
| TTS_SPEED | Kokoro | 1.0 | playback rate multiplier |
| TTS_MODEL | Gemini | gemini-2.5-flash-preview-tts | rarely needs changing |
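The defaults in the table can be read as a small resolution step over the environment. The sketch below is hypothetical and is not taken from tts.js or tts_kokoro.py: `resolve_settings` and `DEFAULT_VOICES` are illustrative names, and it assumes "voice prefix" means the first character of the Kokoro voice name (for example `a` from `af_heart`).

```python
import os

# Documented per-backend default voices.
DEFAULT_VOICES = {"kokoro": "af_heart", "gemini": "Kore"}


def resolve_settings(backend, env=None):
    """Resolve TTS settings from environment variables with the documented defaults."""
    env = os.environ if env is None else env
    voice = env.get("TTS_VOICE", DEFAULT_VOICES[backend])
    settings = {"backend": backend, "voice": voice}
    if backend == "kokoro":
        # Assumption: the language code is the first character of the voice name.
        settings["lang"] = env.get("TTS_LANG", voice[0])
        settings["speed"] = float(env.get("TTS_SPEED", "1.0"))
    else:
        settings["model"] = env.get("TTS_MODEL", "gemini-2.5-flash-preview-tts")
    return settings
```

Because voice names are not interchangeable across backends, a TTS_VOICE value chosen for Kokoro should be unset or changed before passing --gemini.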
SKILL.md main skill (read first)
setup.md install + verification steps
voices.md full voice catalog
tts.js dispatcher (Node); handles the --gemini path directly
tts_kokoro.py local Kokoro helper (uv self-bootstrapping)
- --play uses afplay (macOS). On Linux, the WAV is still written; play it with aplay, paplay, mpv, etc.
- Never use .mp3 as the output extension. The file is PCM WAV.
- Kokoro is MIT-licensed and runs entirely on your machine. Gemini routes audio through Google.
MIT. See LICENSE.