Stars
Awesome speech/audio LLMs, representation learning, and codec models
Ultra-low-bitrate neural audio codec (0.31–1.40 kbps) with richer semantics in the latent space.
The official repo of Qwen2-Audio, the chat and pretrained large audio language models proposed by Alibaba Cloud.
Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization
Code for SpeechTokenizer, presented in "SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models". Samples are presented on the project demo page.
SpeechGPT Series: Speech Large Language Models
Toy reproduction of the Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Facebook Low Resource (FLoRes) MT Benchmark
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
A book for getting started with the Phi family of SLMs. Phi is a family of open-source AI models developed by Microsoft, and Phi models are among the most capable and cost-effective small language models available.
Speech-to-Speech: an effort toward an open-source and modular GPT-4o
Code for the AAAI 2022 paper "Open Vocabulary Electroencephalography-To-Text Decoding and Zero-shot Sentiment Classification"
Foundational Models for State-of-the-Art Speech and Text Translation
Fast and memory-efficient exact attention
Even one minute of voice data can be used to train a good TTS model! (few-shot voice cloning)
Medical o1: towards complex medical reasoning with LLMs
Zero-shot voice conversion & singing voice conversion, with real-time support
[WIP] Resources for AI engineers. Also contains supporting materials for the book AI Engineering (Chip Huyen, 2025)
Audio Codec Speech processing Universal PERformance Benchmark
Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.
PyTorch code for training Vision Transformers with the self-supervised learning method DINO
Code for the ICCV 2019 paper "Attention on Attention for Image Captioning"
[WACV2025 Oral] SUM: Saliency Unification through Mamba for Visual Attention Modeling
Unofficial implementation of "High Fidelity Neural Audio Compression"
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
The official implementation of our paper "Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning".