Description
Hey, I’ve been working on adding the new YuE model to Transformers. It’s a lyrics-to-song generation model that takes a lyrics prompt and an optional audio prompt. The audio prompt is passed to a state-of-the-art audio tokenizer called X-Codec, which excels at capturing both semantic and acoustic information and significantly improves text-to-audio alignment.
X-Codec first encodes the audio with an acoustic model (DAC) and extracts semantic information with a pretrained HuBERT model; the semantic information is then refined by a semantic encoder (based on RepCodec). The combined features are fed into an RVQ module (residual vector quantizer) that converts them into discrete tokens for the language model to predict.
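To make the data flow concrete, here is a minimal sketch of that encoding path in PyTorch; the module names and call signatures are placeholders for illustration, not the actual X-Codec implementation:

```python
import torch
from torch import nn

class XCodecEncoderSketch(nn.Module):
    """Illustrative sketch of the X-Codec encoding path (placeholder names)."""

    def __init__(self, acoustic_encoder, hubert, semantic_encoder, rvq):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder  # DAC-style acoustic model
        self.hubert = hubert                      # pretrained HuBERT
        self.semantic_encoder = semantic_encoder  # RepCodec-based refiner
        self.rvq = rvq                            # residual vector quantizer

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # 1) acoustic features from the DAC encoder
        acoustic = self.acoustic_encoder(waveform)
        # 2) semantic features from HuBERT, refined by the semantic encoder
        semantic = self.semantic_encoder(self.hubert(waveform))
        # 3) combine both streams and quantize into discrete LM tokens
        combined = torch.cat([acoustic, semantic], dim=-1)
        return self.rvq(combined)
```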
YuE itself uses a two-stage language model based on Llama2, topped with a vocoder to go from the intermediate audio representation to the song waveform.
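Roughly, the end-to-end generation then looks like this (pseudocode with made-up names, only to illustrate the stages):

```python
def generate_song(lyrics_ids, audio_prompt_codes, stage1_lm, stage2_lm, vocoder):
    """Illustrative two-stage flow; all arguments are hypothetical objects."""
    # Stage 1: lyrics tokens (+ optional X-Codec audio-prompt tokens) -> coarse audio tokens
    coarse_codes = stage1_lm.generate(lyrics_ids, audio_prompt_codes)
    # Stage 2: coarse tokens -> refined intermediate representation
    refined_codes = stage2_lm.generate(coarse_codes)
    # Vocoder: intermediate representation -> song waveform
    return vocoder(refined_codes)
```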
My question is:
Can I wrap the X-Codec functionality as a feature extractor so that it becomes part of the YuEProcessor for multi-modal input (a rough sketch of what I have in mind is below)? Or, since it’s a relatively complex model, should I instead add it as an optional part of the model architecture? I’ll be opening a PR soon; any pointers would be greatly appreciated!
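For context, the first option would look roughly like this; the class and attribute names are just my assumptions following the usual Transformers processor conventions:

```python
from transformers import ProcessorMixin

class YuEProcessor(ProcessorMixin):
    """Rough sketch: X-Codec hidden behind the feature extractor attribute."""

    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "AutoFeatureExtractor"  # would wrap X-Codec
    tokenizer_class = "AutoTokenizer"

    def __call__(self, lyrics=None, audio=None, sampling_rate=None, **kwargs):
        inputs = {}
        if lyrics is not None:
            inputs.update(self.tokenizer(lyrics, **kwargs))
        if audio is not None:
            # If X-Codec lives here, this call already returns discrete codes;
            # otherwise it would only resample/pad and the model itself
            # would run X-Codec on the raw features.
            audio_features = self.feature_extractor(
                audio, sampling_rate=sampling_rate, **kwargs
            )
            inputs["audio_codes"] = audio_features["input_values"]
        return inputs
```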
The model has been out for a while, but the paper was released just last week.