Hey, I’ve been working on adding the new YuE model to Transformers. It’s a lyrics-to-song generation model that takes a lyrics prompt and an optional audio prompt. The audio prompt is passed to a SOTA audio tokenizer called X-codec that excels at capturing both semantic and acoustic information, which significantly improves text-to-audio alignment.
X-codec first encodes the audio with an acoustic model (DAC) and extracts semantic information with a pretrained HuBERT model; this semantic information is then refined by a semantic encoder (based on RepCodec). The combined features are fed into an RVQ module (residual vector quantizer) that converts them into discrete tokens for the language model to predict.
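Roughly, the encode path looks like this (a simplified sketch with placeholder module names, not the actual X-codec implementation):

```python
import torch
from torch import nn

class XCodecEncoderSketch(nn.Module):
    """Simplified sketch of the X-codec encode path; module names are placeholders."""

    def __init__(self, acoustic_encoder, semantic_model, semantic_encoder, quantizer, hidden_dim):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder  # DAC-style acoustic encoder
        self.semantic_model = semantic_model      # pretrained HuBERT
        self.semantic_encoder = semantic_encoder  # RepCodec-style semantic refiner
        self.quantizer = quantizer                # residual vector quantizer (RVQ)
        self.fusion = nn.Linear(2 * hidden_dim, hidden_dim)

    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        # 1) acoustic features from the DAC encoder
        acoustic = self.acoustic_encoder(waveform)
        # 2) semantic features from HuBERT, refined by the semantic encoder
        semantic = self.semantic_encoder(self.semantic_model(waveform))
        # 3) fuse both streams (assumed frame-aligned) and quantize into discrete token ids
        fused = self.fusion(torch.cat([acoustic, semantic], dim=-1))
        return self.quantizer(fused)  # shape ~ (batch, num_quantizers, frames)
```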
YuE itself uses a two-stage language model based on Llama2, topped with a vocoder to go from the intermediate sound representation to the song waveform.
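To make that concrete, here is roughly how I picture inference end to end (function and argument names below are hypothetical, just to illustrate the pipeline, not YuE’s real API):

```python
import torch

@torch.no_grad()
def generate_song_sketch(lyrics_ids, stage1_lm, stage2_lm, xcodec, vocoder, audio_prompt=None):
    """Illustrative end-to-end flow; all names are hypothetical."""
    # the optional audio prompt is turned into discrete codes by X-codec
    prompt_codes = xcodec.encode(audio_prompt) if audio_prompt is not None else None
    # stage-1 LM generates an initial token sequence from the lyrics (+ prompt codes)
    stage1_codes = stage1_lm.generate(lyrics_ids, audio_prompt_codes=prompt_codes)
    # stage-2 LM completes / refines the intermediate representation
    stage2_codes = stage2_lm.generate(stage1_codes)
    # vocoder maps the intermediate sound representation to the song waveform
    return vocoder(stage2_codes)
```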
My question is:
Can I wrap the X-Codec functionality as a feature extractor so that it becomes part of a `YuEProcessor` handling the multimodal input? Or, since it’s a relatively complex model, should I instead add it as an optional part of the model architecture? I’ll be opening a PR soon, any pointers would be greatly appreciated!
The model has been out for a while, but the paper was just released last week.
@Manalelaidouni hey! This is quite similar to the VQ-VAE-based vision models we have in transformers, e.g. Chameleon and Emu3 (a few more PRs still in progress).
The general idea of a processor is that it should be 1) lightweight and fast, and 2) free of dependencies on specific frameworks like torch. So we usually add quantizer modules as part of the model. Feel free to check out how the above-mentioned models are integrated.
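Concretely, the split usually looks something like this (class and attribute names below are only illustrative, not a proposal for the final YuE API):

```python
# Sketch only -- illustrative names, not the final YuE API.
import torch
from transformers.processing_utils import ProcessorMixin

class YueProcessorSketch(ProcessorMixin):
    # lightweight: just a text tokenizer + an audio feature extractor
    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "AutoFeatureExtractor"
    tokenizer_class = "AutoTokenizer"

    def __call__(self, text=None, audio=None, **kwargs):
        inputs = dict(self.tokenizer(text, **kwargs)) if text is not None else {}
        if audio is not None:
            # only padding / resampling / normalization here -- no torch modules
            inputs.update(self.feature_extractor(audio, **kwargs))
        return inputs

class YueModelSketch(torch.nn.Module):
    def __init__(self, audio_tokenizer, language_model):
        super().__init__()
        self.audio_tokenizer = audio_tokenizer  # X-codec lives inside the model
        self.language_model = language_model

    def forward(self, input_ids, input_values=None, **kwargs):
        audio_codes = None
        if input_values is not None:
            # quantization happens inside the model, not in the processor
            audio_codes = self.audio_tokenizer.encode(input_values)
        return self.language_model(input_ids, audio_codes=audio_codes, **kwargs)
```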
Hey @Manalelaidouni, it's cool you're working on the YuE Transformers integration!! 🤗
We want to have a separate X-Codec integration PR so that the modelling can live by itself, as is the case for Mimi or DAC, all the more so since it would enable other model integrations that rely on it (e.g. Llasa 😊).
Then, the model can hold a reference to this codec model; see Moshi for example:
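Something along these lines (a rough sketch; attribute and config names may differ from the actual Moshi implementation in transformers):

```python
from torch import nn
from transformers import AutoModel

class YueForConditionalGenerationSketch(nn.Module):
    """Rough sketch of the Moshi-style pattern; names may differ from the real code."""

    def __init__(self, config):
        super().__init__()
        # the codec (X-Codec here, Mimi in Moshi's case) is its own standalone
        # transformers model, instantiated from a sub-config and simply held
        # as a submodule by the composite model
        self.audio_encoder = AutoModel.from_config(config.audio_encoder_config)
        self.language_model = AutoModel.from_config(config.text_config)

    def encode_audio_prompt(self, input_values):
        # discrete codes produced by the codec, fed to the LM as audio tokens
        return self.audio_encoder.encode(input_values)
```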