
Design question for integrating new model to Transformers? #36784

Open
Manalelaidouni opened this issue Mar 18, 2025 · 2 comments

@Manalelaidouni
Contributor

Hey, I’ve been working on adding the new YuE model to Transformers. It’s a lyrics-to-song generation model that takes a lyrics prompt and an optional audio prompt. The audio prompt is passed to a state-of-the-art audio tokenizer called X-codec, which excels at capturing both semantic and acoustic information and thereby significantly improves text-to-audio alignment.

X-codec first encodes the audio with an acoustic model (DAC) and extracts semantic information with a pretrained HuBERT model; the semantic information is further refined by a semantic encoder (based on RepCodec). The combined features are then fed into an RVQ (residual vector quantizer) module that converts them into discrete tokens for the language model to predict.
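
Schematically, the encoding path looks something like this (a minimal sketch with placeholder module names; the concatenation step is my assumption about how the two streams combine, not the actual X-codec code):

```python
import torch
import torch.nn as nn

class XCodecEncodeSketch(nn.Module):
    """Hypothetical sketch of the X-codec encoding path described above."""

    def __init__(self, dac_encoder: nn.Module, hubert: nn.Module,
                 semantic_encoder: nn.Module, rvq: nn.Module):
        super().__init__()
        self.dac_encoder = dac_encoder            # acoustic model (DAC)
        self.hubert = hubert                      # pretrained HuBERT
        self.semantic_encoder = semantic_encoder  # RepCodec-style refiner
        self.rvq = rvq                            # residual vector quantizer

    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        acoustic = self.dac_encoder(waveform)                     # acoustic features
        semantic = self.semantic_encoder(self.hubert(waveform))   # refined semantics
        # assumption: the two feature streams are concatenated before quantization
        fused = torch.cat([acoustic, semantic], dim=-1)
        return self.rvq(fused)  # discrete tokens for the language model
```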

YuE itself uses a two-stage language model based on Llama 2, topped with a vocoder that turns the intermediate sound representation into the song waveform.

My question is:
Can I wrap the X-codec functionality in a feature extractor so that it becomes part of the YuEProcessor for multi-modal input? Or, since it’s a relatively complex model, should I instead add it as an optional part of the model architecture? I’ll be opening a PR soon, any pointers would be greatly appreciated!



The model has been out for a while, but the paper was just released last week.

@zucchini-nlp
Member

@Manalelaidouni hey! This is quite similar to the VQ-VAE-based vision models we have in Transformers, e.g. Chameleon and Emu3 (with a few more PRs still in progress).

The general idea of a processor is that it should be 1) lightweight and fast, and 2) free of dependencies on specific frameworks like torch. So we usually add quantizer modules as part of the model. Feel free to check out how the above-mentioned models are integrated.
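
Concretely, that split means something like the following (a schematic sketch, not the actual Chameleon/Emu3 code; class and key names are made up). The model's forward pass would then run the quantizer on these inputs to get the discrete codes:

```python
import numpy as np

class YueProcessorSketch:
    """Hypothetical processor: only prepares raw inputs, numpy-only, no torch.
    The X-codec quantizer itself would live inside the model instead."""

    def __call__(self, lyrics: str, audio: np.ndarray) -> dict:
        # lightweight preprocessing only, e.g. peak normalization
        audio = audio / (np.abs(audio).max() + 1e-8)
        return {"text": lyrics, "input_values": audio}
```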

cc @eustlb for audio

@eustlb
Contributor

eustlb commented Mar 18, 2025

Hey @Manalelaidouni, it's cool that you're working on the YuE Transformers integration!! 🤗

We want a separate X-codec integration PR so that the modeling can live by itself, as is done for Mimi or DAC, all the more since it will enable other model integrations that rely on it (e.g. Llasa 😊).

Then, the model can hold a reference to this codec model; see Moshi for example:

self.audio_encoder = AutoModel.from_config(config.audio_encoder_config)
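
Expanded a bit, the composition pattern would look roughly like this (a simplified sketch; the YuE class name is hypothetical, only AutoModel.from_config is the real API). This keeps X-codec loadable on its own while letting YuE reuse it:

```python
import torch.nn as nn
from transformers import AutoModel

class YueModelSketch(nn.Module):
    """Hypothetical: the YuE model holds a reference to the codec,
    instantiated from a sub-config, as Moshi does for its audio encoder."""

    def __init__(self, config):
        super().__init__()
        # config.audio_encoder_config would be the (registered) X-codec config,
        # coming from the separate X-codec integration PR
        self.audio_encoder = AutoModel.from_config(config.audio_encoder_config)
        # ... the rest of the YuE stack (two-stage LM, vocoder) goes here
```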
