
Design question for integrating new model to Transformers? #36784

Closed
@Manalelaidouni

Description


Hey, I’ve been working on adding the new YuE model to Transformers. It’s a lyrics-to-song generation model that takes a lyrics prompt and an optional audio prompt. The audio prompt is passed to a state-of-the-art audio tokenizer called X-codec, which excels at capturing both semantic and acoustic information and significantly improves text-to-audio alignment.

X-codec first encodes the audio using an acoustic model (DAC), then extracts semantic information using a pretrained HuBERT model; this semantic information is further refined with a semantic encoder (based on RepCodec). The combined features are then fed into an RVQ (residual vector quantizer) module that converts them into the discrete tokens the language model predicts.
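To make the data flow concrete, here is a minimal sketch of the X-codec pipeline described above. All function names, feature shapes, and the toy math inside each stub are illustrative stand-ins, not the real X-codec API:

```python
# Hypothetical sketch of the X-codec tokenization pipeline.
# Every function below is a toy stand-in for the real component.

def acoustic_encode(waveform):
    # DAC-style acoustic encoder: waveform -> frame-level acoustic features
    return [[s, s * 0.5] for s in waveform]

def semantic_encode(waveform):
    # HuBERT-style semantic features, refined by a RepCodec-like encoder
    hubert_feats = [[s * 2.0] for s in waveform]
    return [[f[0] + 0.1] for f in hubert_feats]

def residual_vector_quantize(features, codebook_size=1024):
    # RVQ: map each combined feature frame to a discrete token id
    return [int(abs(sum(frame)) * 100) % codebook_size for frame in features]

def xcodec_tokenize(waveform):
    acoustic = acoustic_encode(waveform)
    semantic = semantic_encode(waveform)
    # concatenate acoustic and semantic features frame by frame
    combined = [a + s for a, s in zip(acoustic, semantic)]
    return residual_vector_quantize(combined)

tokens = xcodec_tokenize([0.1, -0.3, 0.5])
```

The key point is the shape of the contract: waveform in, a sequence of discrete codebook ids out, which is exactly what a feature extractor is expected to produce.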

YuE itself uses a two-stage language model based on Llama 2, topped with a vocoder that converts the intermediate sound representation into the song waveform.
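The end-to-end generation flow can be sketched as three chained stages. The stubs and their arithmetic below are purely illustrative assumptions about the interfaces, not the real YuE models:

```python
# Hypothetical sketch of YuE's two-stage LM + vocoder flow.
# stage1_lm / stage2_lm / vocoder are illustrative stubs.

def stage1_lm(lyrics_tokens):
    # first-stage LM: coarse token prediction from the lyrics prompt
    return [t % 7 for t in lyrics_tokens]

def stage2_lm(coarse_tokens):
    # second-stage LM: refine coarse tokens into fine-grained acoustic tokens
    return [t * 3 + 1 for t in coarse_tokens]

def vocoder(acoustic_tokens):
    # vocoder: intermediate representation -> waveform samples
    return [t / 10.0 for t in acoustic_tokens]

waveform = vocoder(stage2_lm(stage1_lm([12, 5, 9])))
```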

My question is:
Can I wrap the X-codec model’s functionality as a feature extractor so that it becomes part of the YuEProcessor for multi-modal input? Or, since it’s a relatively complex model, should I instead add it as an optional part of the model architecture? I’ll be opening a PR soon; any pointers would be greatly appreciated!
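For reference, here is a minimal sketch of the first option: a processor that pairs a text tokenizer with an X-codec-based feature extractor. The class and method names mirror the usual Transformers processor pattern, but everything here is a self-contained toy stand-in, not the real API:

```python
# Hypothetical sketch of the processor-composition option.
# Toy stand-ins only; the real classes would subclass the
# Transformers tokenizer / feature-extractor / processor base classes.

class YuETokenizer:
    def __call__(self, text):
        # toy whitespace tokenizer for the lyrics prompt
        return {"input_ids": [hash(w) % 50000 for w in text.split()]}

class XCodecFeatureExtractor:
    def __call__(self, audio):
        # stand-in for the DAC + HuBERT + RVQ pipeline: audio -> token ids
        return {"audio_codes": [int(abs(s) * 1000) % 1024 for s in audio]}

class YuEProcessor:
    def __init__(self, tokenizer, feature_extractor):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(self, text=None, audio=None):
        inputs = {}
        if text is not None:
            inputs.update(self.tokenizer(text))
        if audio is not None:  # the audio prompt is optional
            inputs.update(self.feature_extractor(audio))
        return inputs

processor = YuEProcessor(YuETokenizer(), XCodecFeatureExtractor())
batch = processor(text="verse one", audio=[0.2, -0.4])
```

One trade-off this makes visible: the feature-extractor route keeps X-codec on the preprocessing side (CPU, before the model), whereas folding it into the model architecture would let it live on the same device as the LM and share weights/config with the checkpoint.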



The model has been out for a while, but the paper was just released last week.
