Description
Hey, I’ve been working on adding the new YuE model to Transformers. It’s a lyrics-to-song generation model that takes a lyrics prompt and an optional audio prompt. The audio prompt is passed to a state-of-the-art audio tokenizer called X-Codec, which excels at capturing both semantic and acoustic information and significantly improves text-to-audio alignment.
X-Codec first encodes the audio with an acoustic model (DAC) and extracts semantic information with a pretrained HuBERT model; the semantic information is then refined by a semantic encoder (based on RepCodec). The combined features are fed into an RVQ module (residual vector quantizer) that converts them into discrete tokens for the language model to predict.
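To make the data flow concrete, here is a minimal sketch of that encoding path in PyTorch; the module names and call signatures are placeholders for illustration, not the actual X-Codec implementation:

```python
import torch
from torch import nn

class XCodecEncoderSketch(nn.Module):
    """Illustrative sketch of the X-Codec encoding path (placeholder names)."""

    def __init__(self, acoustic_encoder, hubert, semantic_encoder, rvq):
        super().__init__()
        self.acoustic_encoder = acoustic_encoder  # DAC-style acoustic model
        self.hubert = hubert                      # pretrained HuBERT
        self.semantic_encoder = semantic_encoder  # RepCodec-based refiner
        self.rvq = rvq                            # residual vector quantizer

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # 1) acoustic features from the DAC encoder
        acoustic = self.acoustic_encoder(waveform)
        # 2) semantic features from HuBERT, refined by the semantic encoder
        semantic = self.semantic_encoder(self.hubert(waveform))
        # 3) combine both streams and quantize into discrete LM tokens
        combined = torch.cat([acoustic, semantic], dim=-1)
        return self.rvq(combined)
```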
YuE itself uses a two-stage language model based on Llama2, topped with a vocoder to go from the intermediate audio representation to the song waveform.
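Roughly, the end-to-end generation then looks like this (pseudocode with made-up names, only to illustrate the stages):

```python
def generate_song(lyrics_ids, audio_prompt_codes, stage1_lm, stage2_lm, vocoder):
    """Illustrative two-stage flow; all arguments are hypothetical objects."""
    # Stage 1: lyrics tokens (+ optional X-Codec audio-prompt tokens) -> coarse audio tokens
    coarse_codes = stage1_lm.generate(lyrics_ids, audio_prompt_codes)
    # Stage 2: coarse tokens -> refined intermediate representation
    refined_codes = stage2_lm.generate(coarse_codes)
    # Vocoder: intermediate representation -> song waveform
    return vocoder(refined_codes)
```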
My question is:
Can I wrap the X-Codec functionality as a feature extractor so that it becomes part of the YuEProcessor for multi-modal input (a rough sketch of what I have in mind is below)? Or, since it’s a relatively complex model, should I instead add it as an optional part of the model architecture? I’ll be opening a PR soon; any pointers would be greatly appreciated!
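For context, the first option would look roughly like this; the class and attribute names are just my assumptions following the usual Transformers processor conventions:

```python
from transformers import ProcessorMixin

class YuEProcessor(ProcessorMixin):
    """Rough sketch: X-Codec hidden behind the feature extractor attribute."""

    attributes = ["feature_extractor", "tokenizer"]
    feature_extractor_class = "AutoFeatureExtractor"  # would wrap X-Codec
    tokenizer_class = "AutoTokenizer"

    def __call__(self, lyrics=None, audio=None, sampling_rate=None, **kwargs):
        inputs = {}
        if lyrics is not None:
            inputs.update(self.tokenizer(lyrics, **kwargs))
        if audio is not None:
            # If X-Codec lives here, this call already returns discrete codes;
            # otherwise it would only resample/pad and the model itself
            # would run X-Codec on the raw features.
            audio_features = self.feature_extractor(
                audio, sampling_rate=sampling_rate, **kwargs
            )
            inputs["audio_codes"] = audio_features["input_values"]
        return inputs
```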
The model has been out for a while, but the paper was released just last week.