
13 second bypass limit demo #84

Closed

wants to merge 4 commits into from

Conversation

7gxycn08

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

added 13 seconds limit bypass example
added 13 second bypass demo to README

@Armand97123 left a comment

:)

@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

You can also split the input string on the full stop (.) instead of splitting by word count. This makes the generated audio clearer and should remove the audio cutouts in the final output.

You can achieve that by setting words = long_string.split(".")
and removing the for loop underneath:

for i in range(0, len(words), 10):
    text_prompt = " ".join(words[i:i+10])
    text_prompts.append(text_prompt)

and replacing it with this for loop instead:

for sentence in words:
    # Append the sentence to the text_prompts list if it's not empty
    if sentence:
        text_prompts.append(sentence + ".")
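
Putting those pieces together, a minimal sketch (assuming long_string and text_prompts come from the README demo this PR adds):

# Split on full stops instead of fixed word counts.
text_prompts = []
for sentence in long_string.split("."):
    sentence = sentence.strip()
    # keep only non-empty fragments (a trailing period yields an empty string)
    if sentence:
        text_prompts.append(sentence + ".")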

@wsippel

wsippel commented Apr 24, 2023

If you don't mind the extra dependency, I'd recommend using NLTK to split the input. Using just a period will cause issues with abbreviations and such.

@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

NLTK

Just tried it out. It splits the string beautifully with words = nltk.sent_tokenize(long_string).
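
(A minimal sketch of that call; the one-time punkt download is an assumption based on NLTK's usual setup, and qyqcswill points it out later in this thread:)

import nltk

nltk.download('punkt')  # one-time download of the sentence tokenizer models
words = nltk.sent_tokenize(long_string)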

added nltk to split long string.
@gkucsko
Contributor

gkucsko commented Apr 24, 2023

if i remember correctly nltk makes mistakes around things like "I spoke with Mr. Smith, and he...".
Thanks for the PR, super open to incorporating longer predictions. atm would love to first learn how to best do this. i believe there are two main things that make it non-trivial beyond the sentence splitting:

  1. speaker/audio coherence. it might either be best to keep the same original history_prompt such as en_speaker_1, or it might be best to use the previously generated output at each step. or maybe there is a magic third option of some clever mixing/concatting of the 2

  2. potentially doing multiple generations per step (maybe even with auto selection of the best one). sometimes a generation goes off the rails or the model invents a new speaker even with the same prompt. especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

would love thoughts toward those 2 goals

@gkucsko gkucsko mentioned this pull request Apr 24, 2023
@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

For this part of the problem, you can use a for loop at the end of the script. This might not be ideal, though.

# Write each chunk to its own file; this generates one wav file
# for every split prompt in the list.
for i, audio_array in enumerate(audio_arrays):
    write_wav(f"audio_{i}.wav", SAMPLE_RATE, audio_array)

@fidofetch

I've been tinkering with this and have unlimited tokens now. Coherency is lost between sentences; I was able to fix this by feeding the full output back into generate_audio at each step. Here is my code, if it's helpful:

import os

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio
from bark.generation import save_as_prompt

cwd = os.getcwd()
# make sure the prompt directory exists before saving into it
os.makedirs(os.path.join(cwd, "bark/assets/userprompts"), exist_ok=True)


def text_to_audio(text, text_temp, waveform_temp, history_prompt):
    if history_prompt == "Unconditional":
        history_prompt = None

    # segment the sentences
    text_prompts_list = nltk.sent_tokenize(text)

    # generate audio from text
    audio_arrays = np.array([])

    for i, prompt in enumerate(text_prompts_list):
        full_generation, audio_array = generate_audio(prompt,
                                                      history_prompt,
                                                      text_temp,
                                                      waveform_temp,
                                                      output_full=True)
        audio_arrays = np.concatenate((audio_arrays, audio_array))

        # save this generation and feed it back in as the next history prompt
        save_as_prompt(os.path.join(cwd, f"bark/assets/userprompts/{i}.npz"), full_generation)
        history_prompt = os.path.join(cwd, f"bark/assets/userprompts/{i}.npz")

    # return audio array as output
    return SAMPLE_RATE, audio_arrays
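
For what it's worth, a hypothetical usage sketch (the prompt text and output file name are my choices, not from fidofetch's code):

from scipy.io.wavfile import write as write_wav

sample_rate, audio = text_to_audio(
    "Hello there. This is a longer piece of text. It spans several sentences.",
    text_temp=0.7,
    waveform_temp=0.7,
    history_prompt="en_speaker_1",
)
write_wav("long_output.wav", sample_rate, audio)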

@7gxycn08
Author

You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.

history_prompt = "en_speaker_1"
audio_arrays = []
prev_audio = None

for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        # compare the overlapping portion of consecutive clips
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array

@wsippel

wsippel commented Apr 25, 2023

I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences, which is not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much; 250 seems relatively safe) and got pretty decent results:

sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    # note: nltk.Text() over a raw string wraps its characters,
    # so this is effectively a character count, not a word-token count
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

@felipelalli

felipelalli commented Apr 26, 2023

Why does Bark have this 13-second limit?

@tarpeyd12

I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences, which is not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much; 250 seems relatively safe) and got pretty decent results:

sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

Oh, neat! I have been doing the same thing essentially, except I have been using syllables. About 40 syllables seems to be a good spot, I've found.

The steps that I take are:

  1. split the text into 'phrases' that end in pause characters. For example "Hello! I'm tired, I need to sleep - uh - soon." will turn into ["Hello! ", "I'm tired, ", "I need to sleep - ", "uh - ", "soon." ]
  2. Estimate the number of syllables in each part. [2, 4, 6, 1, 2]
  3. Recombine the parts into 'sentences' that do not exceed some maximum number of syllables. ["Hello! I'm tired,", "I need to sleep - ", "uh - soon."]
  4. Pass those parts to bark.

The 'magic trick' in my poking about seems to be segmenting sentences in such a way that a long pause character (e.g. . , - ... etc.) is at the end of each segment. This is useful since if segments get cut off or are too long, it's not too noticeable. A rough sketch of these steps follows below.
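
A rough sketch of those steps (the regex split, the vowel-group syllable estimate, and the 40-syllable cap are my illustrative choices, not tarpeyd12's actual code):

import re

def estimate_syllables(text):
    # crude heuristic: one syllable per group of consecutive vowels in each word
    words = re.findall(r"[A-Za-z']+", text)
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)

def split_into_phrases(text):
    # split after pause characters (. ! ? , ; : -) so each phrase ends on a pause
    return [p.strip() for p in re.split(r"(?<=[.!?,;:\-])\s+", text) if p.strip()]

def recombine_phrases(phrases, max_syllables=40):
    chunks, current, count = [], "", 0
    for phrase in phrases:
        s = estimate_syllables(phrase)
        if current and count + s > max_syllables:
            chunks.append(current)
            current, count = phrase, s
        else:
            current = (current + " " + phrase).strip()
            count += s
    if current:
        chunks.append(current)
    return chunks

phrases = split_into_phrases("Hello! I'm tired, I need to sleep - uh - soon.")
print(recombine_phrases(phrases, max_syllables=8))
# e.g. ["Hello! I'm tired,", "I need to sleep - uh - soon."]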

@kun-create

Instead of this:

text_prompts_list = nltk.sent_tokenize(long_string)

can I use this? Will it work the same?

textwrap.wrap(long_string, width=300, replace_whitespace=False, break_long_words=False, break_on_hyphens=False)
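
For comparison, a quick sketch (my example, not from this thread): textwrap.wrap cuts on character width, not sentence boundaries, so chunks can stop mid-sentence, unlike nltk.sent_tokenize:

import textwrap

text = "Mr. Smith went to Washington. He spoke for a long time about standards."
print(textwrap.wrap(text, width=40, replace_whitespace=False,
                    break_long_words=False, break_on_hyphens=False))
# ['Mr. Smith went to Washington. He spoke',
#  'for a long time about standards.']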

@jungrea jungrea mentioned this pull request Apr 26, 2023
@7gxycn08
Author

7gxycn08 commented Apr 26, 2023

I stitched wsippel's method together and surprisingly got decent results.

You can test it out as is, or use different methods to compare the two. Here is the complete demo code:

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""
sentences = nltk.sent_tokenize(long_string)

# Bark outputs 24 kHz audio; use its SAMPLE_RATE rather than hardcoding 22050
HISTORY_PROMPT = "en_speaker_6"

chunks = ['']
token_counter = 0

for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))  # effectively a character count
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each prompt
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio clips
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)

@felipelalli

Let me ask you guys: does nltk.sent_tokenize only work for English, or does it work fine for Portuguese, for example?
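
(If it helps, a small sketch: sent_tokenize takes a language argument, and NLTK's punkt models include Portuguese, though treat the exact model coverage as an assumption:)

import nltk

nltk.download('punkt')
sentences = nltk.sent_tokenize(
    "Olá, tudo bem? Isto é um teste. Funciona em português.",
    language="portuguese",
)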

@felipelalli

@7gxycn08 Are you aware of this fork: https://github.com/JonathanFly/bark ? It would be nice to insert your impl into that fork.

@C0untFloyd

Why re-invent and add extra dependencies if it's something that's already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this:
https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks; one of them is mine, so I might be biased 😉:
https://github.com/C0untFloyd/bark-gui
https://github.com/serp-ai/bark-with-voice-clone
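
A hedged usage sketch of that snippet (the helper name split_and_recombine_text and its default lengths are from memory; verify against the linked file before relying on them):

# Assumes tortoise-tts is installed and the helper keeps the
# signature I remember: split_and_recombine_text(text, desired_length, max_length).
from tortoise.utils.text import split_and_recombine_text

chunks = split_and_recombine_text(long_string, desired_length=200, max_length=300)
for chunk in chunks:
    print(chunk)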

@felipelalli

We should incorporate the best stuff into the main repo, or we will end up with 15 good separate things. @gkucsko

(image: the xkcd "Standards" comic)

@diStyApps

diStyApps commented Apr 29, 2023

Why re-invent and add extra dependencies if it's something that's already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks; one of them is mine, so I might be biased 😉: https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone

Works great; it runs on a GTX 970 very nicely, taking about 2 min to generate 30 sec on default settings.
I will add it to my installer:
https://github.com/diStyApps/seait

@felipelalli

felipelalli commented Apr 29, 2023

What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing in the GUI, to use in batch mode later?

@C0untFloyd

What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing in the GUI, to use in batch mode later?

It is Bark - with a Web GUI & some additions wrapped around it 😃

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

@appatalks

It is Bark - with a Web GUI & some additions wrapped around it 😃

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

Very nice work putting this together!
I just really love this open source community & wanted to say Thank you for sharing :)

@felipelalli

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

But I need to run Bark in batch mode (using a script). It would be great if I could set everything up in the GUI and then have the GUI provide me with the corresponding command line. I hope that makes sense now?

@JonathanFly
Contributor

This is on the list of things I want to add to the fork. It's a bit of a mess, so probably not super soon, but I find myself wanting to do this quite often too.

@Ishino

Ishino commented May 14, 2023

I stitched wsippel's method together and surprisingly got decent results.

You can test it out as is, or use different methods to compare the two. Here is the complete demo code:

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""
sentences = nltk.sent_tokenize(long_string)

# Bark outputs 24 kHz audio; use its SAMPLE_RATE rather than hardcoding 22050
HISTORY_PROMPT = "en_speaker_6"

chunks = ['']
token_counter = 0

for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))  # effectively a character count
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each prompt
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio clips
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)

I believe that this breaks the non-speech cues you add to the text, like [laughs].
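
If the problem is the tokenizer splitting around bracket tags, one workaround I can sketch (mine, not from this thread) is to mask the tags before tokenizing and restore them afterwards:

import re
import nltk

def sent_tokenize_keeping_tags(text):
    # swap [laughs]-style tags for placeholder words so the tokenizer
    # never splits around the brackets, then swap them back
    tags = re.findall(r"\[[^\]]+\]", text)
    for i, tag in enumerate(tags):
        text = text.replace(tag, f"XTAGX{i}", 1)
    sentences = nltk.sent_tokenize(text)
    return [re.sub(r"XTAGX(\d+)", lambda m: tags[int(m.group(1))], s)
            for s in sentences]

print(sent_tokenize_keeping_tags("That is hilarious [laughs]. Anyway, moving on."))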

@qyqcswill

qyqcswill commented Aug 16, 2023

Need to add the following code snippet:

nltk.download('punkt')

@Alnounou-UHCL

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that prompts containing tags such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat-out makes a big noise in place of the whole sentence.

@JonathanFly
Contributor

JonathanFly commented Sep 21, 2023

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that prompts containing tags such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat-out makes a big noise in place of the whole sentence.

That may not be specific to this code, but to Bark in general.

Usually a generic tag like [laugh] works on a decent set of voices, but things like [happy] are much less likely to work and can wreck the output of the entire clip. If you need to make use of text tags that specific, your best bet is to generate random voices using the tags and sift through the outputs to find a few rare ones that do.
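
A minimal sketch of that sift-and-keep loop (the loop count, file names, and prompt are my choices, not from this thread):

from bark import SAMPLE_RATE, generate_audio, preload_models
from bark.generation import save_as_prompt
from scipy.io.wavfile import write as write_wav

preload_models()
prompt = "[happy] What a wonderful day!"

for i in range(8):
    # history_prompt=None samples a random voice each time
    full_generation, audio_array = generate_audio(prompt, history_prompt=None, output_full=True)
    write_wav(f"candidate_{i}.wav", SAMPLE_RATE, audio_array)
    # keep the full generation so a good-sounding voice can be reused later
    save_as_prompt(f"candidate_{i}.npz", full_generation)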

@platform-kit

platform-kit commented Dec 5, 2023

You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.

history_prompt = "en_speaker_1"
audio_arrays = []
prev_audio = None

for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        # compare the overlapping portion of consecutive clips
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array

@7gxycn08 I'm implementing Bark with Vocos
(see: https://github.com/gemelo-ai/vocos/blob/main/notebooks%2FBark%2BVocos.ipynb)

which means that I'm using the semantic_to_audio_tokens function instead of generate_audio. Can you recommend a way to implement your coherency-validating pattern, using the coarse_prompt or fine_prompt instead of audio_array?

Here's the code for that function.

from typing import Optional, Union, Dict

import numpy as np
from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens, history_prompt=history_prompt, temp=temp, silent=silent, use_kv_caching=True
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens
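
One hedged adaptation (my sketch; fine_token_chunks is a hypothetical list of generate_fine outputs, and codec_decode is Bark's built-in EnCodec decoder, used here only to get waveforms to compare):

import numpy as np
from bark.generation import codec_decode

prev_audio = None
for fine_tokens in fine_token_chunks:  # hypothetical: one entry per text chunk
    audio_array = codec_decode(fine_tokens)  # decode fine tokens to a waveform
    if prev_audio is not None:
        n = min(len(audio_array), len(prev_audio))
        diff = np.abs(audio_array[:n] - prev_audio[:n])
        if np.max(diff) > 0.1:
            print(f"WARNING: abrupt voice change, mean diff = {np.mean(diff):.4f}")
    prev_audio = audio_array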

@7gxycn08 7gxycn08 closed this by deleting the head repository Jan 9, 2024