
Modern TTS

A unified, extensible, and future-proof Python toolkit for locally running state-of-the-art LLM-based Text-to-Speech (TTS) synthesis models.



✨ Features

  • 🧩 25+ Models — MeloTTS, ChatTTS, CosyVoice, Fish Speech, Parler-TTS, XTTS, GPT-SoVITS, F5-TTS, Qwen3-TTS, GLM-TTS, Index-TTS, MaskGCT, and more
  • 🔌 Plugin Architecture — Add new models with @register_model decorator
  • 🚀 Hot-Swap — Switch models at runtime without restarting
  • 🌍 Multi-Language — Chinese, English, Japanese, Korean, and more
  • 🎯 Multi-Task — Speech synthesis, voice cloning, emotion control, style transfer, streaming
  • 💻 Local-First — All inference on-device. No APIs. No data leaves your machine.
  • 🐍 Modern Python — uv-native packaging, Pydantic configs, rich CLI
  • 📦 Zero-Config for select models — GLM-TTS and Index-TTS automatically download their official code repositories on first use

📦 Installation

# Clone the repository
git clone https://github.com/vra/modern-tts.git
cd modern-tts

# Sync all dependencies (recommended)
uv sync --all-extras

# Or install specific extras only
uv sync --extra melotts --extra chattts --extra glm --extra index

# Or just core dependencies
uv sync

Python 3.10+ is recommended. Some models (e.g. Index-TTS) require specific PyTorch / transformers versions; see the per-model notes below.


🚀 Quick Start

from modern_tts import TTSPipeline

# Synthesize with MeloTTS
pipe = TTSPipeline("melotts-zh")
result = pipe("你好世界,这是语音合成测试。")
result.save("output.wav")

# Switch to ChatTTS for emotional speech
pipe.switch_model("chattts")
result = pipe("这是一个带有情感的语音合成。")
result.save("output_emotion.wav")

# Voice cloning with CosyVoice
pipe.switch_model("cosyvoice-300m")
result = pipe("这是克隆的声音。", task="clone", reference_audio="reference.wav")
result.save("cloned.wav")

# Zero-config voice cloning with GLM-TTS (auto-downloads code)
pipe.switch_model("glm-tts")
result = pipe("你好,这是 GLM-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("glm_cloned.wav")

# Zero-config voice cloning with Index-TTS (auto-downloads code)
pipe.switch_model("index-tts")
result = pipe("你好,这是 Index-TTS 的语音克隆测试。", task="clone", reference_audio="ref.wav")
result.save("index_cloned.wav")

🎙️ Supported Models

✅ Ready to use (loadable out-of-the-box)

| Model ID | Type | Languages | Modes | Install Extra | Notes |
|---|---|---|---|---|---|
| melotts-zh | TTS | zh, en | speak, emotion | --extra melotts | Many text-processing deps (pypinyin, jieba, etc.) |
| melotts-en | TTS | zh, en | speak, emotion | --extra melotts | English variant |
| chattts | TTS | zh, en | speak, clone, emotion | --extra chattts | Emotional prosody control |
| f5-tts | ZS-VC | zh, en, ja, ko | speak, clone, emotion | --extra f5 | Requires reference audio for synthesis |
| glm-tts | ZS-VC | zh, en | speak, clone | --extra glm | Auto-downloads official repo. Heavy deps (transformers, onnxruntime, peft). |
| index-tts | ZS-VC | zh, en, ja, ko, yue | speak, clone, emotion, style | --extra index | Auto-downloads official repo. Requires Python ≥ 3.10. |
| moss-tts | TTS | zh, en, ja, ko | speak, emotion | --extra moss | MOSS-TTS-Nano (0.1B), CPU-friendly |
| piper-tts | TTS | 15+ | speak | --extra piper | ONNX-based, edge-optimized |
| qwen3-tts-0.6b | ZS-VC | 11+ | speak, clone | --extra qwen3-tts | Requires qwen-tts package |
| qwen3-tts-1.7b | ZS-VC | 11+ | speak, clone | --extra qwen3-tts | Larger Qwen3-TTS variant |
| xtts-v1 | ZS-VC | 13+ | speak, clone | --extra xtts | Requires coqui-tts |
| xtts-v2 | ZS-VC | 13+ | speak, clone | --extra xtts | Adds Chinese support |
| xtts-v2.1 | ZS-VC | 13+ | speak, clone, streaming | --extra xtts | Adds streaming mode |

ZS-VC = Zero-Shot Voice Cloning (requires a reference_audio sample).

⚠️ Requires manual setup

These models need you to manually clone their official repositories and/or download weights before use. Calling load() will raise a RuntimeError with setup instructions.

| Model ID | Type | Languages | Modes | Install Extra | Setup Notes |
|---|---|---|---|---|---|
| bertvits2-zh | TTS | zh, en | speak, emotion | --extra bertvits2 | Clone repo + download weights |
| bertvits2-en | TTS | en | speak, emotion | --extra bertvits2 | Clone repo + download weights |
| bertvits2-jp | TTS | ja, en | speak, emotion | --extra bertvits2 | Clone repo + download weights |
| cosyvoice-300m | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | --extra cosyvoice | Clone repo + download weights |
| cosyvoice-300m-sft | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | --extra cosyvoice | SFT variant |
| cosyvoice-300m-instruct | ZS-VC | zh, en, yue, ja, ko | speak, clone, emotion, style | --extra cosyvoice | Instruct variant |
| fishspeech-1.5 | ZS-VC | zh, en, ja, ko | speak, clone, emotion | --extra fishspeech | Clone repo + weights; pyaudio needs system headers |
| gptsovits | ZS-VC | zh, en, ja, yue | speak, clone | --extra gptsovits | Clone repo + download weights |
| redfire-tts | ZS-VC | zh, en, yue | speak, clone, emotion | --extra redfire | fairseq needs C++ build headers |
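
For these models, the load()-time guard behaves roughly as sketched below. The class name, repo URL, and cache directory here are hypothetical illustrations, not the actual modern-tts implementation:

```python
from pathlib import Path

class ManualSetupModel:
    """Minimal sketch: a model whose upstream code must be cloned by hand."""

    REPO_URL = "https://github.com/example/some-tts"   # hypothetical
    REPO_DIR = Path.home() / "repos" / "some-tts"      # hypothetical

    def load(self):
        # Fail fast with actionable setup instructions if the repo is missing.
        if not self.REPO_DIR.exists():
            raise RuntimeError(
                f"Model code not found at {self.REPO_DIR}. "
                f"Run: git clone {self.REPO_URL} {self.REPO_DIR} "
                "and download the weights per the upstream README."
            )
        # ... proceed with actual loading ...
```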

❌ Temporarily unavailable

| Model ID | Reason |
|---|---|
| maskgct | Custom tokenizer incompatible with generic TextToAudioLLMModel loader |
| parler-tts-mini | parler-tts package incompatible with transformers >= 4.50 |
| parler-tts-large | Same compatibility issue as parler-tts-mini |
| pocket-tts | No public repository or weights found (reserved for future implementation) |

📋 Changelog & API Changes

Latest

New Models

  • glm-tts — LLM + Flow Matching zero-shot TTS (Zhipu AI). Merged previous glm-tts-nano-2512 and glm-tts-2512 into a single glm-tts model ID.
  • index-tts — Industrial-level multilingual zero-shot voice cloning (IndexTeam).

Zero-Config Auto-Download

  • GLM-TTS and Index-TTS no longer require manual environment variables (GLM_TTS_REPO_PATH, INDEX_TTS_REPO_PATH) or PYTHONPATH manipulation.
  • On first use, the framework automatically:
    1. Clones the official repository to ~/.cache/modern-tts/repos/
    2. Injects the path into sys.path
    3. Proceeds with model loading
  • You can still override the auto-download path via config.extra["glm_tts_repo_path"] / config.extra["index_tts_repo_path"] or the corresponding environment variables.
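
The override precedence can be sketched as follows. The config key and environment-variable names come from the list above; the resolver function itself is an illustrative sketch, not the actual modern-tts code:

```python
import os
from pathlib import Path

DEFAULT_CACHE = Path.home() / ".cache" / "modern-tts" / "repos"

def resolve_repo_path(extra: dict, config_key: str, env_var: str, repo_name: str) -> Path:
    """Illustrative precedence: config.extra > environment variable > auto-download cache."""
    if config_key in extra:
        return Path(extra[config_key])
    if env_var in os.environ:
        return Path(os.environ[env_var])
    return DEFAULT_CACHE / repo_name

# e.g. resolve_repo_path({}, "glm_tts_repo_path", "GLM_TTS_REPO_PATH", "GLM-TTS")
```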

New Infrastructure Modules

  • modern_tts.core.hf_hub — HuggingFace Hub download helpers (download_hf_model, get_hf_model_path) so custom-code adapters don't re-implement caching logic.
  • modern_tts.core.repo_manager — Generic git repository auto-downloader (ensure_repo, inject_repo_path) used by adapters that depend on upstream code not on PyPI.
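
The names ensure_repo and inject_repo_path come from the module list above; their signatures and behavior below are assumptions, sketched to show the clone-then-inject pattern:

```python
import subprocess
import sys
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "modern-tts" / "repos"

def ensure_repo(url: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Clone `url` into the cache on first use; return the local checkout path."""
    name = url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]
    target = cache_root / name
    if not target.exists():
        cache_root.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
    return target

def inject_repo_path(repo_path: Path) -> None:
    """Prepend the checkout to sys.path so upstream modules become importable."""
    p = str(repo_path)
    if p not in sys.path:
        sys.path.insert(0, p)
```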

Base Class Improvements

  • TextToAudioLLMModel.load() now raises a clear NotImplementedError when a subclass has not set PROCESSOR_CLS / MODEL_CLS, signaling that the subclass must override load() for custom loading logic.
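
In simplified form, that guard looks like the sketch below. The attribute names PROCESSOR_CLS / MODEL_CLS match the base class described above, but the body is an illustration, not the actual implementation:

```python
from typing import Optional

class TextToAudioLLMModelSketch:
    """Simplified sketch of the base-class guard."""

    PROCESSOR_CLS: Optional[str] = None
    MODEL_CLS: Optional[str] = None

    def load(self):
        # Subclasses must either declare both class attributes or override load().
        if self.PROCESSOR_CLS is None or self.MODEL_CLS is None:
            raise NotImplementedError(
                f"{type(self).__name__} sets no PROCESSOR_CLS/MODEL_CLS; "
                "override load() with custom loading logic."
            )
        # ... generic processor/model loading would go here ...
```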

Model ID Changes

| Old ID | New ID | Note |
|---|---|---|
| glm-tts-nano-2512 | glm-tts | Merged into unified glm-tts |
| glm-tts-2512 | glm-tts | Merged into unified glm-tts |

🏗️ Architecture

Modern TTS is built on three layers:

  1. TTSPipeline — Unified user API. Handles text normalization, task dispatch, model lifecycle.
  2. TTSModel / TextToAudioLLMModel — Adapter layer. New models often need only 8 lines of config via TextToAudioLLMModel.
  3. Backends — Transformers, vLLM, ONNX Runtime.
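
The @register_model plugin mechanism in the adapter layer can be illustrated with a minimal registry sketch (the real modern_tts.core.registry may differ in detail):

```python
_MODEL_REGISTRY: dict = {}

def register_model(model_id: str):
    """Class decorator: map a model ID string to its adapter class."""
    def decorator(cls):
        _MODEL_REGISTRY[model_id] = cls
        return cls
    return decorator

def get_model_cls(model_id: str):
    """Look up an adapter class, with a helpful error for unknown IDs."""
    try:
        return _MODEL_REGISTRY[model_id]
    except KeyError:
        raise ValueError(
            f"Unknown model ID: {model_id!r}. Registered: {sorted(_MODEL_REGISTRY)}"
        ) from None

@register_model("demo-tts")
class DemoTTS:
    pass
```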

Adding a New Model

from modern_tts.core.audio_llm import TextToAudioLLMModel
from modern_tts.core.registry import register_model

@register_model("my-tts-1b")
class MyTTS1B(TextToAudioLLMModel):
    HF_PATH = "org/MyTTS-1B"
    PROCESSOR_CLS = "transformers.AutoTokenizer"
    MODEL_CLS = "transformers.AutoModelForTextToWaveform"
    SUPPORTED_LANGUAGES = {"zh", "en"}
    DEFAULT_SAMPLE_RATE = 24000

    @property
    def model_id(self) -> str:
        return "my-tts-1b"

That's it. The registry auto-discovers it at runtime.


🤝 Contributing

See Contributing Guide for development setup, code style, and PR checklist.


📄 License

Apache-2.0
