audio
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Just 1 minute of voice data is enough to train a good TTS model! (few-shot voice cloning)
WhisperFusion builds upon the capabilities of WhisperLive and WhisperSpeech to provide seamless conversations with an AI.
High-quality multilingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
Zero-Shot Speech Editing and Text-to-Speech in the Wild
Instant voice cloning by MIT and MyShell. Audio foundation model.
Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
[ACM MM 2024] This is the official code for "AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding"
A generative speech model for daily dialogue.
A simple local web interface that uses ChatTTS to synthesize text into speech, with support for exposing an external API.
Silero VAD: pre-trained enterprise-grade Voice Activity Detector
[IJCV] FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. An AI Foley artist that adds vivid, synchronized sound effects to your silent videos 😝
Multilingual large voice generation model with full-stack inference, training, and deployment capabilities.
Multilingual Voice Understanding Model
The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud.
Official PyTorch implementation of BigVGAN (ICLR 2023)
Easily train a good VC model with voice data <= 10 mins!
Landing Page for All Things Source Separation
Muzic: Music Understanding and Generation with Artificial Intelligence
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
An Open-Sourced LLM-empowered Foundation TTS System
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
A music repair method that converts lossy MP3-compressed music to lossless music.
High-quality Text-to-Audio Generation with Efficient Diffusion Transformer
