13 second bypass limit demo #84
Conversation
added 13 seconds limit bypass example
added 13 second bypass demo to README
:)
You can also split the input string on the full stop (.) instead of splitting by word count. This makes the generated audio clearer and should remove the audio cutouts in the final output. You can achieve that by setting words = long_string.split(".") and adapting the for loop to iterate over those sentences.
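The approach above can be sketched as follows. `generate_audio` is Bark's real entry point, but the helper names and the loop itself are illustrative rather than the exact code from this PR:

```python
def split_on_full_stop(long_string):
    """Split on '.' and drop empty fragments left by trailing periods."""
    return [s.strip() for s in long_string.split(".") if s.strip()]

def synthesize_chunks(long_string, generate_fn):
    """Run each sentence through a generate function such as bark.generate_audio."""
    pieces = []
    for sentence in split_on_full_stop(long_string):
        # Re-append the period so each chunk reads as a complete sentence.
        pieces.append(generate_fn(sentence + "."))
    return pieces
```

With Bark itself you would pass `generate_audio` in: `from bark import generate_audio` then `synthesize_chunks(long_string, generate_audio)`.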
If you don't mind the extra dependency, I'd recommend using NLTK to split the input. Splitting on just a period will cause issues with abbreviations and such.
Just tried it out. It does split the string beautifully.
added nltk to split long string.
If I remember correctly, NLTK makes mistakes around things like
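If you want to sidestep the NLTK dependency and still avoid splitting on abbreviations, a small regex splitter with a whitelist covers the common cases; the whitelist here is a hypothetical minimal set, not exhaustive:

```python
import re

# Hypothetical minimal whitelist; extend it for your own text.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "e.g", "i.e", "etc", "vs"}

def split_sentences(text):
    """Split on sentence-ending punctuation, but not after known abbreviations."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences, buffer = [], ""
    for part in parts:
        candidate = (buffer + " " + part).strip() if buffer else part
        last_word = candidate.rstrip(".!?").rsplit(" ", 1)[-1].lower()
        if candidate.endswith(".") and last_word in ABBREVIATIONS:
            buffer = candidate  # period belongs to an abbreviation: keep accumulating
        else:
            sentences.append(candidate)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```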
Would love thoughts toward those two goals.
For this part of the problem you can use a for loop at the end of the script. This might not be ideal, though.
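Such a loop at the end of the script would typically concatenate the per-sentence arrays; a sketch with NumPy, inserting a short silence between chunks (the 0.25 s default is my choice, not from this thread; 24 kHz is Bark's actual output sample rate):

```python
import numpy as np

SAMPLE_RATE = 24000  # Bark's output sample rate

def join_audio(pieces, pause_seconds=0.25):
    """Concatenate audio arrays with a short silence between chunks."""
    silence = np.zeros(int(pause_seconds * SAMPLE_RATE), dtype=np.float32)
    joined = []
    for i, piece in enumerate(pieces):
        if i > 0:
            joined.append(silence)
        joined.append(np.asarray(piece, dtype=np.float32))
    return np.concatenate(joined)
```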
I've been tinkering with this and have unlimited tokens now. Coherency is lost between sentences; I was able to fix this by feeding the full output back into generate_audio. Here is my code if it's helpful.
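The feed-the-output-back idea can be sketched with Bark's real `output_full=True` flag, which makes `generate_audio` also return the full generation dict that can serve as the next `history_prompt`. The generate function is injected here so the loop can be exercised without a GPU; the wrapper name is mine, not from this comment:

```python
def generate_coherent(sentences, generate_fn, history_prompt=None):
    """Generate each sentence, feeding the full generation back in as the
    history prompt so the voice stays consistent across chunks.
    generate_fn should behave like bark.generate_audio with output_full=True."""
    pieces = []
    for sentence in sentences:
        full_generation, audio = generate_fn(
            sentence, history_prompt=history_prompt, output_full=True
        )
        history_prompt = full_generation  # carry voice context forward
        pieces.append(audio)
    return pieces
```

With Bark itself: `generate_coherent(sentences, generate_audio, history_prompt="v2/en_speaker_6")`.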
You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings.
I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences. Not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much, 250 seems relatively safe), and got pretty decent results:
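A sketch of such a counter, assuming a pluggable length estimate (characters by default via `len`; swap in a real tokenizer count if you have one). The 250 cap mirrors the figure quoted above, but it is only meaningful for whatever estimate you plug in:

```python
def pack_sentences(sentences, max_len=250, count_fn=len):
    """Join consecutive sentences into chunks whose estimated length
    (count_fn applied to the joined text) stays at or under max_len."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip() if current else sentence
        if current and count_fn(candidate) > max_len:
            chunks.append(current)   # flush the current chunk
            current = sentence
        else:
            current = candidate      # keep accumulating
    if current:
        chunks.append(current)
    return chunks
```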
Why does Bark have this 13-second limit?
Oh, neat! I have been doing essentially the same thing, except I have been using syllables; about 40 syllables seems to be a good spot, I've found. The steps that I take are:
The 'magic trick' in my poking about seems to be segmenting sentences in such a way that a long pause character (ex.
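To make the ~40-syllable budget concrete, here is a sketch using a naive vowel-group heuristic for syllable counting (hypothetical, English-only, and wrong on plenty of edge cases, but close enough for chunking):

```python
import re

def estimate_syllables(word):
    """Rough English syllable estimate: count vowel groups, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def pack_by_syllables(sentences, max_syllables=40):
    """Join sentences until the estimated syllable budget would be exceeded."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = sum(estimate_syllables(w) for w in sentence.split())
        if current and count + n > max_syllables:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```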
instead this:
Can I use this? Will it work the same?
I stitched wsippel's methods together and surprisingly got decent results. You can test it out as-is, or use different methods and compare the two; here is the complete demo code.
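As a hedged sketch (not the actual demo code from this comment), the techniques discussed in this thread can be combined roughly like this. The function name and the 0.25 s pause are my inventions; `generate_audio`'s `history_prompt`/`output_full` parameters and the 24 kHz sample rate are Bark's:

```python
import numpy as np

SAMPLE_RATE = 24000  # Bark's output sample rate

def long_form_tts(text, generate_fn, history_prompt=None):
    """Split on '.', generate per sentence, carry the voice forward via the
    full-generation history prompt, and concatenate with short silences."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    silence = np.zeros(int(0.25 * SAMPLE_RATE), dtype=np.float32)
    pieces = []
    for sentence in sentences:
        full, audio = generate_fn(sentence, history_prompt=history_prompt,
                                  output_full=True)
        history_prompt = full  # keep the same voice across chunks
        pieces.append(np.asarray(audio, dtype=np.float32))
        pieces.append(silence)
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces[:-1])  # drop the trailing silence

# With Bark itself (untested sketch, needs model weights):
# from bark import generate_audio
# audio = long_form_tts(long_string, generate_audio,
#                       history_prompt="v2/en_speaker_6")
```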
Let me ask you guys:
@7gxycn08 Are you aware of this fork: https://github.com/JonathanFly/bark ? It would be nice to fold your implementation into that fork.
Why re-invent and add extra dependencies when something is already used successfully elsewhere? For example, here is a snippet from TortoiseTTS doing this. This snippet is already being used in two forks; one of them is mine, so I might be biased 😉:
We should incorporate the best stuff into the main repo, or we will end up with 15 good but separate things. @gkucsko
Works great. Runs very nicely on a GTX 970; takes about 2 min to generate 30 s of audio on default settings.
What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing in the GUI, to use in batch mode later?
It is Bark, with a web GUI and some additions wrapped around it 😃 You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.
Very nice work putting this together!
But I need to run Bark in batch mode (from a script). It would be great if I could set everything up in the GUI and then have it give me the corresponding command line. I hope that makes sense now?
This is on the list of things I want to add to the fork. It's a bit of a mess, so probably not super soon, but I find myself wanting this quite often too.
I believe that this breaks the non-speech cues you add to the text, like [laughs].
Need to add the following code snippet:
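A guess at what such a snippet might look like (my code, not the author's): fold fragments that consist only of bracketed non-speech tags into the neighboring sentence, so markers like [laughs] are never sent to Bark on their own after splitting:

```python
import re

# Matches fragments that contain nothing but [tag] markers and whitespace.
TAG_ONLY = re.compile(r"^\s*(\[[^\]]+\]\s*)+$")

def split_keeping_tags(text):
    """Split on '.', then fold tag-only fragments (e.g. '[laughs]') into the
    following sentence so Bark still sees them in context."""
    raw = [s.strip() for s in text.split(".") if s.strip()]
    sentences = []
    for frag in raw:
        if sentences and TAG_ONLY.match(sentences[-1]):
            sentences[-1] = sentences[-1] + " " + frag
        else:
            sentences.append(frag)
    return sentences
```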
I tested your version of the bypass and found that prompts such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat-out makes a big noise in place of the whole sentence.
That may not be specific to that code, but to Bark in general. Usually a generic tag like [laugh] works on a decent set of voices, but things like [happy] are much less likely to and can wreck the output of the entire clip. If you need to use text tags that specific, your best bet is to generate random voices using the tags and sift through the outputs to find the few rare ones that work.
@7gxycn08 I'm implementing Bark with Vocos, which means that I'm using the token-level generation functions directly. Here's the code for that function:

```python
from typing import Optional, Union, Dict

import numpy as np

from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens,
        history_prompt=history_prompt,
        temp=temp,
        silent=silent,
        use_kv_caching=True,
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens
```
Added a demo to the README on how to use large prompts and generate >13-second audio files.