A unified, extensible, and future-proof Python toolkit for locally running state-of-the-art LLM-based Text-to-Speech (TTS) synthesis models.
- 🧩 25+ Models — MeloTTS, ChatTTS, CosyVoice, Fish Speech, Parler-TTS, XTTS, GPT-SoVITS, F5-TTS, Qwen3-TTS, GLM-TTS, Index-TTS, MaskGCT, and more
- 🔌 Plugin Architecture — Add new models with the `@register_model` decorator
- 🚀 Hot-Swap — Switch models at runtime without restarting
- 🌍 Multi-Language — Chinese, English, Japanese, Korean, and more
- 🎯 Multi-Task — Speech synthesis, voice cloning, emotion control, style transfer, streaming
- 💻 Local-First — All inference on-device. No APIs. No data leaves your machine.
- 🐍 Modern Python — uv-native packaging, Pydantic configs, rich CLI
- 📦 Zero-Config for select models — GLM-TTS and Index-TTS automatically download their official code repositories on first use
```bash
# Clone the repository
git clone https://github.com/vra/modern-tts.git
cd modern-tts

# Sync all dependencies (recommended)
uv sync --all-extras

# Or install specific extras only
uv sync --extra melotts --extra chattts --extra glm --extra index

# Or just core dependencies
uv sync
```

Python 3.10+ is recommended. Some models (e.g. Index-TTS) require specific PyTorch / transformers versions; see the per-model notes below.
```python
from modern_tts import TTSPipeline

# Synthesize with MeloTTS
pipe = TTSPipeline("melotts-zh")
result = pipe("你好世界,这是语音合成测试。")
result.save("output.wav")

# Switch to ChatTTS for emotional speech
pipe.switch_model("chattts")
result = pipe("这是一个带有情感的语音合成。")
result.save("output_emotion.wav")

# Voice cloning with CosyVoice
pipe.switch_model("cosyvoice-300m")
result = pipe("这是克隆的声音。", task="clone", reference_audio="reference.wav")
result.save("cloned.wav")

# Zero-config voice cloning with GLM-TTS (auto-downloads code)
pipe.switch_model("glm-tts")
result = pipe("你好,这是 GLM-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("glm_cloned.wav")

# Zero-config voice cloning with Index-TTS (auto-downloads code)
pipe.switch_model("index-tts")
result = pipe("你好,这是 Index-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("index_cloned.wav")
```

| Model ID | Type | Languages | Modes | Install Extra | Notes |
|---|---|---|---|---|---|
| `melotts-zh` | TTS | zh, en | speak, emotion | `--extra melotts` | Many text-processing deps (pypinyin, jieba, etc.) |
| `melotts-en` | TTS | zh, en | speak, emotion | `--extra melotts` | English variant |
| `chattts` | TTS | zh, en | speak, clone, emotion | `--extra chattts` | Emotional prosody control |
| `f5-tts` | ZS-VC | zh, en, ja, ko | speak, clone, emotion | `--extra f5` | Requires reference audio for synthesis |
| `glm-tts` | ZS-VC | zh, en | speak, clone | `--extra glm` | Auto-downloads official repo. Heavy deps (transformers, onnxruntime, peft). |
| `index-tts` | ZS-VC | zh, en, ja, ko, yue | speak, clone, emotion, style | `--extra index` | Auto-downloads official repo. Requires Python ≥ 3.10. |
| `moss-tts` | TTS | zh, en, ja, ko | speak, emotion | `--extra moss` | MOSS-TTS-Nano (0.1B), CPU-friendly |
| `piper-tts` | TTS | 15+ | speak | `--extra piper` | ONNX-based, edge-optimized |
| `qwen3-tts-0.6b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Requires `qwen-tts` package |
| `qwen3-tts-1.7b` | ZS-VC | 11+ | speak, clone | `--extra qwen3-tts` | Larger Qwen3-TTS variant |
| `xtts-v1` | ZS-VC | 13+ | speak, clone | `--extra xtts` | Requires `coqui-tts` |
| `xtts-v2` | ZS-VC | 13+ | speak, clone | `--extra xtts` | Adds Chinese support |
| `xtts-v2.1` | ZS-VC | 13+ | speak, clone, streaming | `--extra xtts` | Adds streaming mode |
ZS-VC = Zero-Shot Voice Cloning (requires a `reference_audio` sample).
These models require you to manually clone their official repositories and/or download weights before use. Until that setup is done, calling `load()` raises a `RuntimeError` with setup instructions.
| Model ID | Type | Languages | Modes | Install Extra | Setup Notes |
|---|---|---|---|---|---|
| `bertvits2-zh` | TTS | zh, en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `bertvits2-en` | TTS | en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `bertvits2-jp` | TTS | ja, en | speak, emotion | `--extra bertvits2` | Clone repo + download weights |
| `cosyvoice-300m` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | Clone repo + download weights |
| `cosyvoice-300m-sft` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | SFT variant |
| `cosyvoice-300m-instruct` | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | `--extra cosyvoice` | Instruct variant |
| `fishspeech-1.5` | ZS-VC | zh, en, ja, ko | speak, clone, emotion | `--extra fishspeech` | Clone repo + weights; pyaudio needs system headers |
| `gptsovits` | ZS-VC | zh, en, ja, yue | speak, clone | `--extra gptsovits` | Clone repo + download weights |
| `redfire-tts` | ZS-VC | zh, en, yue | speak, clone, emotion | `--extra redfire` | fairseq needs C++ build headers |
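The failure mode for these models is explicit rather than silent. A minimal sketch of the check (the helper name `load_manual_model` and the message text are hypothetical; the real adapters differ):

```python
from pathlib import Path

def load_manual_model(repo_path: str, model_name: str) -> Path:
    """Sketch: a manual-setup adapter verifies the cloned repo exists
    and raises RuntimeError with setup instructions when it does not."""
    path = Path(repo_path)
    if not path.exists():
        raise RuntimeError(
            f"{model_name} requires manual setup:\n"
            f"  1. clone the official repository to {repo_path}\n"
            "  2. download the pretrained weights into it\n"
            "then retry loading."
        )
    return path
```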
| Model ID | Reason |
|---|---|
| `maskgct` | Custom tokenizer incompatible with generic `TextToAudioLLMModel` loader |
| `parler-tts-mini` | `parler-tts` package incompatible with transformers >= 4.50 |
| `parler-tts-large` | Same compatibility issue as `parler-tts-mini` |
| `pocket-tts` | No public repository or weights found (reserved for future implementation) |
- `glm-tts` — LLM + Flow Matching zero-shot TTS (Zhipu AI). Merged the previous `glm-tts-nano-2512` and `glm-tts-2512` into a single `glm-tts` model ID.
- `index-tts` — Industrial-level multilingual zero-shot voice cloning (IndexTeam).
- GLM-TTS and Index-TTS no longer require manual environment variables (`GLM_TTS_REPO_PATH`, `INDEX_TTS_REPO_PATH`) or `PYTHONPATH` manipulation.
- On first use, the framework automatically:
  - Clones the official repository to `~/.cache/modern-tts/repos/`
  - Injects the path into `sys.path`
  - Proceeds with model loading
- You can still override the auto-download path via `config.extra["glm_tts_repo_path"]` / `config.extra["index_tts_repo_path"]` or the corresponding environment variables.
- `modern_tts.core.hf_hub` — HuggingFace Hub download helpers (`download_hf_model`, `get_hf_model_path`) so custom-code adapters don't re-implement caching logic.
- `modern_tts.core.repo_manager` — Generic git repository auto-downloader (`ensure_repo`, `inject_repo_path`) used by adapters that depend on upstream code not on PyPI.
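The auto-download flow can be sketched as below. The function names mirror `ensure_repo` / `inject_repo_path`, but the bodies are an illustrative reconstruction, not the library's actual implementation:

```python
import subprocess
import sys
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "modern-tts" / "repos"

def ensure_repo(url: str, name: str) -> Path:
    """Clone `url` into the cache directory on first use; no-op afterwards."""
    dest = CACHE_DIR / name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
    return dest

def inject_repo_path(path: Path) -> None:
    """Prepend the repo to sys.path so upstream modules import normally."""
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))
```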
`TextToAudioLLMModel.load()` now raises a clear `NotImplementedError` when a subclass has not set `PROCESSOR_CLS` / `MODEL_CLS`, signaling that the subclass must override `load()` for custom loading logic.
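The guard can be illustrated with a simplified stand-in class (not the real `TextToAudioLLMModel`):

```python
class TextToAudioLLMModel:
    """Simplified stand-in showing the PROCESSOR_CLS/MODEL_CLS guard."""
    PROCESSOR_CLS = None  # e.g. "transformers.AutoTokenizer"
    MODEL_CLS = None      # e.g. "transformers.AutoModelForTextToWaveform"

    def load(self):
        if self.PROCESSOR_CLS is None or self.MODEL_CLS is None:
            raise NotImplementedError(
                f"{type(self).__name__} must set PROCESSOR_CLS and MODEL_CLS, "
                "or override load() with custom loading logic."
            )
        # Generic processor/model loading would follow here.
```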
| Old ID | New ID | Note |
|---|---|---|
| `glm-tts-nano-2512` | `glm-tts` | Merged into unified `glm-tts` |
| `glm-tts-2512` | `glm-tts` | Merged into unified `glm-tts` |
Modern TTS is built on three layers:
- TTSPipeline — Unified user API. Handles text normalization, task dispatch, and model lifecycle.
- TTSModel / TextToAudioLLMModel — Adapter layer. New models often need only ~8 lines of config via `TextToAudioLLMModel`.
- Backends — Transformers, vLLM, ONNX Runtime.
```python
from modern_tts.core.audio_llm import TextToAudioLLMModel
from modern_tts.core.registry import register_model


@register_model("my-tts-1b")
class MyTTS1B(TextToAudioLLMModel):
    HF_PATH = "org/MyTTS-1B"
    PROCESSOR_CLS = "transformers.AutoTokenizer"
    MODEL_CLS = "transformers.AutoModelForTextToWaveform"
    SUPPORTED_LANGUAGES = {"zh", "en"}
    DEFAULT_SAMPLE_RATE = 24000

    @property
    def model_id(self) -> str:
        return "my-tts-1b"
```

That's it. The registry auto-discovers it at runtime.
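Under the hood, a decorator-based registry like this amounts to a dict from model ID to class. An illustrative sketch (the real `modern_tts.core.registry` may differ, and `resolve` is a hypothetical lookup helper):

```python
_REGISTRY: dict[str, type] = {}

def register_model(model_id: str):
    """Record the decorated class under `model_id` so a pipeline
    can look it up and instantiate it by name."""
    def wrap(cls: type) -> type:
        _REGISTRY[model_id] = cls
        return cls
    return wrap

def resolve(model_id: str) -> type:
    """Look up a registered model class, e.g. from TTSPipeline."""
    return _REGISTRY[model_id]
```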
See Contributing Guide for development setup, code style, and PR checklist.
Apache-2.0