
13 second bypass limit demo #84

Closed

wants to merge 4 commits into from

Conversation

7gxycn08

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

added 13 seconds limit bypass example
added 13 second bypass demo to README

@Armand97123 left a comment

:)

@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

You can also split the input string on the full stop (.) instead of splitting by word count. This makes the generated audio clearer and should remove the audio cutouts in the final output.

You can achieve that by setting words = long_string.split(".")
and removing the for loop underneath:

for i in range(0, len(words), 10):
    text_prompt = " ".join(words[i:i+10])
    text_prompts.append(text_prompt)

and replacing it with this for loop instead:

for sentence in words:
    # Append the sentence to the text_prompts list if it's not empty
    if sentence:
        text_prompts.append(sentence + ".")
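
Putting those pieces together, a minimal sketch (assuming long_string and text_prompts come from the README demo this PR adds):

# Split on full stops instead of fixed word counts.
text_prompts = []
for sentence in long_string.split("."):
    sentence = sentence.strip()
    # keep only non-empty fragments (a trailing period yields an empty string)
    if sentence:
        text_prompts.append(sentence + ".")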

@wsippel

wsippel commented Apr 24, 2023

If you don't mind the extra dependency, I'd recommend using NLTK to split the input. Using just a period will cause issues with abbreviations and such.

@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

NLTK

Just tried it out. It splits the string beautifully with words = nltk.sent_tokenize(long_string).
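
(A minimal sketch of that call; the one-time punkt download is an assumption based on NLTK's usual setup, and qyqcswill points it out later in this thread:)

import nltk

nltk.download('punkt')  # one-time download of the sentence tokenizer models
words = nltk.sent_tokenize(long_string)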

added nltk to split long string.
@gkucsko
Contributor

gkucsko commented Apr 24, 2023

if i remember correctly nltk makes mistakes around things like "I spoke with Mr. Smith, and he...".
Thanks for the PR, super open to incorporating longer predictions. atm would love to first learn how to best do this. i believe there are two main things that make it non-trivial beyond the sentence splitting:

  1. speaker/audio coherence. it might either be best to keep the same original history_prompt such as en_speaker_1, or it might be best to use the previously generated output at each step. or maybe there is a magic third option of some clever mixing/concatting of the 2

  2. potentially doing multiple generations per step (maybe even with auto selection of the best one). sometimes a generation goes off the rails or the model invents a new speaker even with the same prompt. especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

would love thoughts toward those 2 goals

@gkucsko gkucsko mentioned this pull request Apr 24, 2023
@7gxycn08
Author

7gxycn08 commented Apr 24, 2023

especially when you generate many minutes of audio it would be a shame if some of the clips in between are not great. so we should find a good way to keep track of generations

For this part of the problem, you can use a for loop at the end of the script. This might not be ideal, though.

# Write each chunk to its own file; this generates one wav file
# for every split prompt in the list.
for i, audio_array in enumerate(audio_arrays):
    write_wav(f"audio_{i}.wav", SAMPLE_RATE, audio_array)

@fidofetch

I've been tinkering with this and have unlimited tokens now. Coherency is lost between sentences; I was able to fix this by feeding the full output back into generate_audio at each step. Here is my code, if it's helpful:

import os

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio
from bark.generation import save_as_prompt

cwd = os.getcwd()
# make sure the prompt directory exists before saving into it
os.makedirs(os.path.join(cwd, "bark/assets/userprompts"), exist_ok=True)


def text_to_audio(text, text_temp, waveform_temp, history_prompt):
    if history_prompt == "Unconditional":
        history_prompt = None

    # segment the sentences
    text_prompts_list = nltk.sent_tokenize(text)

    # generate audio from text
    audio_arrays = np.array([])

    for i, prompt in enumerate(text_prompts_list):
        full_generation, audio_array = generate_audio(prompt,
                                                      history_prompt,
                                                      text_temp,
                                                      waveform_temp,
                                                      output_full=True)
        audio_arrays = np.concatenate((audio_arrays, audio_array))

        # save this generation and feed it back in as the next history prompt
        save_as_prompt(os.path.join(cwd, f"bark/assets/userprompts/{i}.npz"), full_generation)
        history_prompt = os.path.join(cwd, f"bark/assets/userprompts/{i}.npz")

    # return audio array as output
    return SAMPLE_RATE, audio_arrays
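
For what it's worth, a hypothetical usage sketch (the prompt text and output file name are my choices, not from fidofetch's code):

from scipy.io.wavfile import write as write_wav

sample_rate, audio = text_to_audio(
    "Hello there. This is a longer piece of text. It spans several sentences.",
    text_temp=0.7,
    waveform_temp=0.7,
    history_prompt="en_speaker_1",
)
write_wav("long_output.wav", sample_rate, audio)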

@7gxycn08
Author

You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.

history_prompt = "en_speaker_1"
audio_arrays = []
prev_audio = None

for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        # compare the overlapping portion of consecutive clips
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array

@wsippel

wsippel commented Apr 25, 2023

I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences, which is not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much; 250 seems relatively safe) and got pretty decent results:

sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    # note: nltk.Text() over a raw string wraps its characters,
    # so this is effectively a character count, not a word-token count
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

@felipelalli

felipelalli commented Apr 26, 2023

Why does Bark have this 13-second limit?

@tarpeyd12

I've experimented a bit. Since I'm using Bark as an output for LLMs, I often run into very short sentences, which is not great for stability if we tokenize by sentence using NLTK. I implemented a (probably godawful) counter that joins sentences up to a maximum token length (300 can be too much; 250 seems relatively safe) and got pretty decent results:

sentences = nltk.sent_tokenize(string)
chunks = ['']
token_counter = 0
for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

Oh, neat! I have been doing the same thing essentially, except I have been using syllables. About 40 syllables seems to be a good spot, I've found.

The steps that I take are:

  1. split the text into 'phrases' that end in pause characters. For example "Hello! I'm tired, I need to sleep - uh - soon." will turn into ["Hello! ", "I'm tired, ", "I need to sleep - ", "uh - ", "soon." ]
  2. Estimate the number of syllables in each part. [2, 4, 6, 1, 2]
  3. Recombine the parts into 'sentences' that do not exceed some maximum number of syllables. ["Hello! I'm tired,", "I need to sleep - ", "uh - soon."]
  4. Pass those parts to bark.

The 'magic trick' in my poking about seems to be segmenting sentences in such a way that a long pause character (e.g. . , - ... etc.) is at the end of each segment. This is useful since if segments get cut off or are too long, it's not too noticeable. A rough sketch of these steps follows below.
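
A rough sketch of those steps (the regex split, the vowel-group syllable estimate, and the 40-syllable cap are my illustrative choices, not tarpeyd12's actual code):

import re

def estimate_syllables(text):
    # crude heuristic: one syllable per group of consecutive vowels in each word
    words = re.findall(r"[A-Za-z']+", text)
    return sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)

def split_into_phrases(text):
    # split after pause characters (. ! ? , ; : -) so each phrase ends on a pause
    return [p.strip() for p in re.split(r"(?<=[.!?,;:\-])\s+", text) if p.strip()]

def recombine_phrases(phrases, max_syllables=40):
    chunks, current, count = [], "", 0
    for phrase in phrases:
        s = estimate_syllables(phrase)
        if current and count + s > max_syllables:
            chunks.append(current)
            current, count = phrase, s
        else:
            current = (current + " " + phrase).strip()
            count += s
    if current:
        chunks.append(current)
    return chunks

phrases = split_into_phrases("Hello! I'm tired, I need to sleep - uh - soon.")
print(recombine_phrases(phrases, max_syllables=8))
# e.g. ["Hello! I'm tired,", "I need to sleep - uh - soon."]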

@kun-create

Instead of this:

text_prompts_list = nltk.sent_tokenize(long_string)

can I use this? Will it work the same?

textwrap.wrap(long_string, width=300, replace_whitespace=False, break_long_words=False, break_on_hyphens=False)
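
For comparison, a quick sketch (my example, not from this thread): textwrap.wrap cuts on character width, not sentence boundaries, so chunks can stop mid-sentence, unlike nltk.sent_tokenize:

import textwrap

text = "Mr. Smith went to Washington. He spoke for a long time about standards."
print(textwrap.wrap(text, width=40, replace_whitespace=False,
                    break_long_words=False, break_on_hyphens=False))
# ['Mr. Smith went to Washington. He spoke',
#  'for a long time about standards.']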

@jungrea jungrea mentioned this pull request Apr 26, 2023
@7gxycn08
Author

7gxycn08 commented Apr 26, 2023

I stitched wsippel's method together and surprisingly got decent results.

You can test it out as is, or use different methods to compare the two. Here is the complete demo code:

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""
sentences = nltk.sent_tokenize(long_string)

# Bark outputs 24 kHz audio; use its SAMPLE_RATE rather than hardcoding 22050
HISTORY_PROMPT = "en_speaker_6"

chunks = ['']
token_counter = 0

for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))  # effectively a character count
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each prompt
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio clips
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)

@felipelalli

Let me ask you guys: does nltk.sent_tokenize only work for English, or does it work fine for Portuguese, for example?
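
(If it helps, a small sketch: sent_tokenize takes a language argument, and NLTK's punkt models include Portuguese, though treat the exact model coverage as an assumption:)

import nltk

nltk.download('punkt')
sentences = nltk.sent_tokenize(
    "Olá, tudo bem? Isto é um teste. Funciona em português.",
    language="portuguese",
)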

@felipelalli

@7gxycn08 Are you aware of this fork: https://github.com/JonathanFly/bark ? It would be nice to insert your impl into that fork.

@C0untFloyd

Why re-invent and add extra dependencies if it's something that's already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this:
https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks; one of them is mine, so I might be biased 😉:
https://github.com/C0untFloyd/bark-gui
https://github.com/serp-ai/bark-with-voice-clone
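
A hedged usage sketch of that snippet (the helper name split_and_recombine_text and its default lengths are from memory; verify against the linked file before relying on them):

# Assumes tortoise-tts is installed and the helper keeps the
# signature I remember: split_and_recombine_text(text, desired_length, max_length).
from tortoise.utils.text import split_and_recombine_text

chunks = split_and_recombine_text(long_string, desired_length=200, max_length=300)
for chunk in chunks:
    print(chunk)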

@felipelalli

We should incorporate the best stuff into the main repo, or we will end up with 15 good separate things. @gkucsko

(image: the xkcd "Standards" comic)

@diStyApps

diStyApps commented Apr 29, 2023

Why re-invent and add extra dependencies if it's something that's already used successfully somewhere else? For example, here is a snippet from TortoiseTTS doing this: https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/utils/text.py

This snippet is already being used in 2 forks; one of them is mine, so I might be biased 😉: https://github.com/C0untFloyd/bark-gui https://github.com/serp-ai/bark-with-voice-clone

Works great; it runs on a GTX 970 very nicely, taking about 2 min to generate 30 sec on default settings.
I will add it to my installer:
https://github.com/diStyApps/seait

@felipelalli

felipelalli commented Apr 29, 2023

What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing in the GUI, to use in batch mode later?

@C0untFloyd

What is this "Bark GUI" thing? Is it possible to generate the "final command line" after playing in the GUI, to use in batch mode later?

It is Bark - with a Web GUI & some additions wrapped around it 😃

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

@appatalks

It is Bark - with a Web GUI & some additions wrapped around it 😃

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

Very nice work putting this together!
I just really love this open source community & wanted to say Thank you for sharing :)

@felipelalli

You don't need to type command-line args; just use the GUI. Audio parts of long sentences will be joined together at the end and are saveable/playable from the GUI as well.

But I need to run Bark in batch mode (using a script). It would be great if I could set everything up in the GUI and then have the GUI provide me with the corresponding command line. I hope that makes sense now?

@JonathanFly
Contributor

This is on the list of things I want to add to the fork. It's a bit of a mess, so probably not super soon, but I find myself wanting to do this quite often too.

@Ishino

Ishino commented May 14, 2023

I stitched wsippel's method together and surprisingly got decent results.

You can test it out as is, or use different methods to compare the two. Here is the complete demo code:

import nltk
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

nltk.download('punkt')
preload_models()

long_string = """
Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai/). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
"""
sentences = nltk.sent_tokenize(long_string)

# Bark outputs 24 kHz audio; use its SAMPLE_RATE rather than hardcoding 22050
HISTORY_PROMPT = "en_speaker_6"

chunks = ['']
token_counter = 0

for sentence in sentences:
    current_tokens = len(nltk.Text(sentence))  # effectively a character count
    if token_counter + current_tokens <= 250:
        token_counter = token_counter + current_tokens
        chunks[-1] = chunks[-1] + " " + sentence
    else:
        chunks.append(sentence)
        token_counter = current_tokens

# Generate audio for each prompt
audio_arrays = []
for prompt in chunks:
    audio_array = generate_audio(prompt, history_prompt=HISTORY_PROMPT)
    audio_arrays.append(audio_array)

# Combine the audio clips
combined_audio = np.concatenate(audio_arrays)

# Write the combined audio to a file
write_wav("combined_audio.wav", SAMPLE_RATE, combined_audio)

I believe that this breaks the non-speech cues you add to the text, like [laughs].
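
If the problem is the tokenizer splitting around bracket tags, one workaround I can sketch (mine, not from this thread) is to mask the tags before tokenizing and restore them afterwards:

import re
import nltk

def sent_tokenize_keeping_tags(text):
    # swap [laughs]-style tags for placeholder words so the tokenizer
    # never splits around the brackets, then swap them back
    tags = re.findall(r"\[[^\]]+\]", text)
    for i, tag in enumerate(tags):
        text = text.replace(tag, f"XTAGX{i}", 1)
    sentences = nltk.sent_tokenize(text)
    return [re.sub(r"XTAGX(\d+)", lambda m: tags[int(m.group(1))], s)
            for s in sentences]

print(sent_tokenize_keeping_tags("That is hilarious [laughs]. Anyway, moving on."))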

@qyqcswill

qyqcswill commented Aug 16, 2023

Need to add the following code snippet:

nltk.download('punkt')

@Alnounou-UHCL

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that prompts containing tags such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat-out makes a big noise in place of the whole sentence.

@JonathanFly
Contributor

JonathanFly commented Sep 21, 2023

Added a demo to the README on how to use large prompts and generate > 13 second audio files.

I tested your version of the bypass and found that prompts containing tags such as [happy] and [sad] cause an interpretation issue where the API no longer parses them correctly, or flat-out makes a big noise in place of the whole sentence.

That may not be specific to this code, but to Bark in general.

Usually a generic tag like [laugh] works on a decent set of voices, but things like [happy] are much less likely to work and can wreck the output of the entire clip. If you need to make use of text tags that specific, your best bet is to generate random voices using the tags and sift through the outputs to find a few rare ones that do.
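
A minimal sketch of that sift-and-keep loop (the loop count, file names, and prompt are my choices, not from this thread):

from bark import SAMPLE_RATE, generate_audio, preload_models
from bark.generation import save_as_prompt
from scipy.io.wavfile import write as write_wav

preload_models()
prompt = "[happy] What a wonderful day!"

for i in range(8):
    # history_prompt=None samples a random voice each time
    full_generation, audio_array = generate_audio(prompt, history_prompt=None, output_full=True)
    write_wav(f"candidate_{i}.wav", SAMPLE_RATE, audio_array)
    # keep the full generation so a good-sounding voice can be reused later
    save_as_prompt(f"candidate_{i}.npz", full_generation)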

@platform-kit

platform-kit commented Dec 5, 2023

You can detect changes in the speaker's voice using this for loop, then make adjustments to the model or audio settings. For now it just warns the user that there are voice variations; hopefully this is useful.

history_prompt = "en_speaker_1"
audio_arrays = []
prev_audio = None

for prompt in text_prompts_list:
    audio_array = generate_audio(prompt, history_prompt=history_prompt)
    if prev_audio is not None:
        # compare the overlapping portion of consecutive clips
        min_length = min(len(audio_array), len(prev_audio))
        audio_diff = np.abs(audio_array[:min_length] - prev_audio[:min_length])
        if np.max(audio_diff) > 0.1:
            diff_mean = np.mean(audio_diff)
            diff_std = np.std(audio_diff)
            print(f"WARNING: Abrupt change in speaker's voice detected. Mean diff = {diff_mean:.4f}, Std diff = {diff_std:.4f}")
    audio_arrays.append(audio_array)
    prev_audio = audio_array

@7gxycn08 I'm implementing Bark with Vocos
(see: https://github.com/gemelo-ai/vocos/blob/main/notebooks%2FBark%2BVocos.ipynb)

which means that I'm using the semantic_to_audio_tokens function instead of generate_audio. Can you recommend a way to implement your coherency-validating pattern, using the coarse_prompt or fine_prompt instead of audio_array?

Here's the code for that function.

from typing import Optional, Union, Dict

import numpy as np
from bark.generation import generate_coarse, generate_fine


def semantic_to_audio_tokens(
    semantic_tokens: np.ndarray,
    history_prompt: Optional[Union[Dict, str]] = None,
    temp: float = 0.7,
    silent: bool = False,
    output_full: bool = False,
):
    coarse_tokens = generate_coarse(
        semantic_tokens, history_prompt=history_prompt, temp=temp, silent=silent, use_kv_caching=True
    )
    fine_tokens = generate_fine(coarse_tokens, history_prompt=history_prompt, temp=0.5)

    if output_full:
        full_generation = {
            "semantic_prompt": semantic_tokens,
            "coarse_prompt": coarse_tokens,
            "fine_prompt": fine_tokens,
        }
        return full_generation
    return fine_tokens
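
One hedged adaptation (my sketch; fine_token_chunks is a hypothetical list of generate_fine outputs, and codec_decode is Bark's built-in EnCodec decoder, used here only to get waveforms to compare):

import numpy as np
from bark.generation import codec_decode

prev_audio = None
for fine_tokens in fine_token_chunks:  # hypothetical: one entry per text chunk
    audio_array = codec_decode(fine_tokens)  # decode fine tokens to a waveform
    if prev_audio is not None:
        n = min(len(audio_array), len(prev_audio))
        diff = np.abs(audio_array[:n] - prev_audio[:n])
        if np.max(diff) > 0.1:
            print(f"WARNING: abrupt voice change, mean diff = {np.mean(diff):.4f}")
    prev_audio = audio_array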

@7gxycn08 7gxycn08 closed this by deleting the head repository Jan 9, 2024