max num tokens supported on inference is ~40, not 256 as it would appear from reading the code #527

Open
xvdp opened this issue Jan 29, 2024 · 2 comments


xvdp commented Jan 29, 2024

The max number of tokens I am able to run through bark's generate_text_semantic() is about 40, i.e. roughly 24 words.

I looked through the code and noticed that generate_text_semantic() clips anything over 256 tokens and pads shorter inputs to 256, then concatenates 256 text tokens + 256 history-prompt tokens + 1 into the model input.

At approximately 1.6 tokens per word, 256 tokens would be ~160 words.
I tried running that and got some really funky gibberish, so clearly something else was mangling the process.

I looked at the semantic, coarse and fine tokens generated at various text lengths: 12, 24, 36 and 160 words.
At 24 words (around 37 tokens) the result is quite good; anything more goes bad. The text I tried was the abstract of the AudioLM paper.

text160= 'Abstract—We introduce AudioLM, a framework for highquality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained'

Logging the shapes of the semantic, coarse and fine tokens, I get:
12 words: semantic tokens (544,), coarse tokens (2, 817), fine tokens (8, 817), audio 10 s
24 words: semantic tokens (736,), coarse tokens (2, 1106), fine tokens (8, 1106), audio 13.28 s
36 words: semantic tokens (748,), coarse tokens (2, 1124), fine tokens (8, 1124), audio 14 s
160 words: semantic tokens (696,), coarse tokens (2, 1046), fine tokens (8, 1046), audio 14 s

So clearly something is clipping the output to ~14 seconds. It is the loop in generation.py's generate_text_semantic(), for n in range(n_tot_steps):, where n_tot_steps = 768.
Changing that loop to allow more steps results in an error.
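
For reference, the back-of-the-envelope arithmetic (a sketch using bark's SEMANTIC_RATE_HZ = 49.9 constant from generation.py) shows why the audio tops out around 14-15 seconds regardless of input length:

# ceiling on new audio imposed by the semantic generation loop
SEMANTIC_RATE_HZ = 49.9  # semantic tokens per second of audio (constant in generation.py)
N_TOT_STEPS = 768        # loop bound in generate_text_semantic()

max_seconds = N_TOT_STEPS / SEMANTIC_RATE_HZ
print(f"semantic-stage ceiling: {max_seconds:.1f} s")  # ~15.4 s, matching the ~14 s observed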

BUT concatenating the output of generate_text_semantic() and feeding it to the subsequent stages DOES generate long audio - so there is something funky specifically with the GPT model that generates the semantic tokens.

# let's say text has 20 words, then concatenate the semantic tokens

import numpy as np
from bark.generation import (
    codec_decode, generate_coarse, generate_fine, generate_text_semantic,
)

text = "..."  # ~20 words of input text
speaker = "v2/en_speaker_6"
semantic = generate_text_semantic(text, history_prompt=speaker, temp=0.7, silent=False)
semantic = np.concatenate([semantic] * 6)  # simulate 120 words

coarse_tokens = generate_coarse(semantic, history_prompt=speaker, temp=0.7, silent=False, use_kv_caching=True)
fine_tokens = generate_fine(coarse_tokens, history_prompt=speaker, temp=0.5)
audio = codec_decode(fine_tokens)
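
This should yield roughly six back-to-back repeats of the same ~13 s clip, so the coarse and fine stages and the codec handle long semantic sequences just fine; the length ceiling lives entirely in the semantic GPT stage.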
@JonathanFly (Contributor) commented

Yeah, it is a bit odd that you can input way, way more tokens in the prompt than it can possibly speak in just 14 seconds. But there are some possible use cases: unspoken text like [laugh] or [descriptions] also uses space, so you can imagine edge cases with lots of tags where the extra room does something. But typically, if you fill it up with words, Bark just hallucinates and random word salad comes out.

The max length of new audio is 756 semantic tokens at 49.9 Hz, which is just over 15 seconds if you max it out. This comes from the maximum context window size of 1024, the same way a text language model might have a 4096-token context. If Bark had used relative positional encoding instead of absolute, there might have been some tricks to stretch it to 2048 or 4096, but as far as I know there aren't such techniques for absolute positioning.

There is no absolute length limit in coarse, because it uses a sliding window. But there is a practical limit - if you concat all the semantic tokens together and then generate everything in a single coarse step, you'll notice the speaker's voice gets lost pretty quickly past 14 seconds.
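
The usual workaround is to split the text into sentences, generate each sentence separately with the same history prompt, and concatenate the audio. A minimal sketch (assuming nltk's punkt tokenizer is installed; long_text is a placeholder for your full script):

import numpy as np
import nltk

from bark import SAMPLE_RATE
from bark.api import semantic_to_waveform
from bark.generation import generate_text_semantic, preload_models

preload_models()

speaker = "v2/en_speaker_6"
sentences = nltk.sent_tokenize(long_text)  # long_text: your full script, any length

pieces = []
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # short pause between sentences
for sentence in sentences:
    # each sentence stays well under the ~15 s semantic ceiling;
    # a lower min_eos_p cuts trailing hallucinations
    semantic_tokens = generate_text_semantic(
        sentence, history_prompt=speaker, temp=0.7, min_eos_p=0.05,
    )
    pieces += [semantic_to_waveform(semantic_tokens, history_prompt=speaker), silence.copy()]

long_audio = np.concatenate(pieces)

Reusing the same history_prompt for every sentence keeps the voice far more stable than one giant coarse pass, though some drift between chunks is still audible.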

@computersrmyfriends commented
Then what is the solution for generating a full-length, 60-minute audiobook?
I even tried Bark Infinity, but it has the same issue: either the voices change or the audio gets muffled.

How do you generate long audio?
