Add Stable Audio, please! #319

Closed
mykeehu opened this issue Jun 10, 2024 · 45 comments

Comments

@mykeehu

mykeehu commented Jun 10, 2024

Please add Stable Audio to the options, if you please! Thank you very much in advance!

https://github.com/Stability-AI/stable-audio-tools

And model here:
https://huggingface.co/stabilityai/stable-audio-open-1.0

@rsxdalv
Owner

rsxdalv commented Jun 10, 2024

Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?

Please respond as this is a matter that could really determine whether or not people use it.

@mykeehu
Author

mykeehu commented Jun 10, 2024

I don't have a problem downloading the model this way, maybe you could ask for the login to download it? So those who have it can use it, those who don't can't. I don't know why it's tied to a license, but I've seen a video of it making quite good sound effects, so after the login the model would be downloaded.
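A minimal sketch of what a login-gated download could look like, assuming the `huggingface_hub` package is installed and the user has accepted the model license on the model page. `fetch_model`, `_hf_download`, and the `download` parameter are hypothetical names introduced for illustration; the file names `model.ckpt` and `model_config.json` are the ones mentioned later in this thread. This is not the webui's actual code.

```python
from pathlib import Path
from typing import Callable, List, Optional

REPO_ID = "stabilityai/stable-audio-open-1.0"
# File names as mentioned later in this thread; treat them as assumptions.
FILES = ("model.ckpt", "model_config.json")


def _hf_download(repo_id: str, filename: str, local_dir: str, token: Optional[str]) -> str:
    # Imported lazily so this module still loads without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id, filename=filename, local_dir=local_dir, token=token
    )


def fetch_model(
    target_dir: Path,
    token: Optional[str] = None,
    download: Callable[[str, str, str, Optional[str]], str] = _hf_download,
) -> List[Path]:
    """Download every file in FILES into target_dir and return the local paths."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return [Path(download(REPO_ID, name, str(target_dir), token)) for name in FILES]
```

The `download` argument exists only so the function can be exercised without network access. A real call would pass a user access token (or rely on a previously stored login), and users without access would get an error from the hub instead of a file.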

@chlowden

I'd be interested in trying this out too, please.

@ke1ne

ke1ne commented Jun 20, 2024

> Hi, thanks for requesting this! I have been procrastinating with it actually. One question - such a model would require a huggingface account and a login to be used, since this https://huggingface.co/stabilityai/stable-audio-open-1.0 cannot be automatically downloaded. Would you be ok with that?
>
> Please respond as this is a matter that could really determine whether or not people use it.

For instance, I'm ok with it. Thanks!

@dairydaddy

dairydaddy commented Jun 21, 2024 via email

@chlowden

I've already downloaded the checkpoint. I presume that those who are enjoying your interface are the sort of people who already have a huggingface account.

@rsxdalv
Owner

rsxdalv commented Jun 25, 2024

Stable audio has been added but is causing some problems so it might be added-removed a few times until it's 'stable'.

@rsxdalv
Owner

rsxdalv commented Jun 25, 2024

Also, I just want to clarify - with extensive research - stable audio is not a 'stable diffusion 1.5' moment because it has a restrictive, potentially dangerous license (which might be legally unenforceable or impossible to defend in court; it's the very same infamous SD3 license) and I saw comments about Facebook's (notably similarly non-commercially licensed) AudioGen/MusicGen performing similarly.

My biggest issue so far is that running the 'official' inference code results in ~14 GB of RAM usage, and because of how memory is managed, my system with 24 GB RAM and 24 GB VRAM would often just fail.

That being said, I really appreciate receiving information about what people want to try and see.

@chlowden

I concur on the VRAM issue. I often saturate my RTX 3090 and its 24 GB of VRAM using MusicGen. I have not been able to test MultiBandDiffusion due to VRAM saturation. I have seen that Python will not release the VRAM it takes up, so it blocks the GPU. I have to restart the machine to free the VRAM.
If Stable Audio is even worse than MusicGen, that makes it problematic for me to test.

@rsxdalv
Owner

rsxdalv commented Jun 26, 2024

Restarting the webui should be enough. Additionally, after I fix the bugs arising from adding this new model, I can spend more time on 'unload model' buttons throughout the UI; however, there will always be some leftovers that aren't unloaded.
As for Stable Audio - generating a 47 second or a 1 second clip seems to use the same amount of VRAM, unless they somehow fix that themselves. Honestly, there are multiple improvements to the model itself that are waiting to be done by somebody; perhaps they are hoping the community will do it.
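The 'unload model' idea above can be sketched roughly as follows. This is a minimal illustration, not the webui's code: `ModelSlot` is a hypothetical helper, and the only real library calls are `gc.collect()` and `torch.cuda.empty_cache()`, the latter used only when PyTorch is installed. As noted above, some memory (for example the CUDA context) stays allocated until the process restarts.

```python
import gc
from typing import Any, Callable, Optional


class ModelSlot:
    """Holds at most one loaded model so that it can be dropped on request."""

    def __init__(self) -> None:
        self.model: Optional[Any] = None

    def load(self, factory: Callable[[], Any]) -> Any:
        # Drop the previous model first so two models never coexist in memory.
        self.unload()
        self.model = factory()
        return self.model

    def unload(self) -> None:
        # Remove our reference, then let Python collect the object.
        self.model = None
        gc.collect()
        try:
            import torch
        except ImportError:
            return  # No PyTorch installed, nothing GPU-side to release.
        if torch.cuda.is_available():
            # Returns cached, unused blocks to the driver; live tensors are unaffected.
            torch.cuda.empty_cache()
```

An 'unload' button in a UI would then only need to call `slot.unload()`. Any other object that still references the model keeps it alive, which is one reason leftovers are hard to avoid.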

@chlowden

And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages
https://github.com/DigitalPhonetics/IMS-Toucan

@rsxdalv
Owner

rsxdalv commented Jun 26, 2024

> And as we are talking of other models ... maybe people are interested in ... Toucan TTS with 7000 languages https://github.com/DigitalPhonetics/IMS-Toucan

For this project it seems decent but could be hard to handle if it means everyone has to install espeak.

@rsxdalv
Owner

rsxdalv commented Jun 30, 2024

OK, never mind, Stable Audio is amazing sometimes. If you have the GPU for it, it generates quickly (anything below the 'default size', which I think is 47 seconds, is not going to generate faster, but if you want a full sample it's so quick), and it often generates without needing the extra steering you would expect with MusicGen. That being said, the license is still the way it is.

@rsxdalv
Owner

rsxdalv commented Jul 1, 2024

I will close this issue as Stable Audio has been added. In the future it will be added to the React UI too. I optimized the memory a bit, so while it does spike, the spike is very brief and you can use the remaining VRAM freely; I tested this by running Stable Diffusion alongside Stable Audio. (Edit: by using 'half', the steady memory consumption is only 6 GB, but there is still a spike of 14 GB lasting a few seconds, which could perhaps be modified to allow running on smaller GPUs.)

Finally, I invested some GPU resources to generate Stable Audio samples and test different prompts at https://promptecho.com/stableaudio . The parameters are quite useful:

  • A different sampler can generate different audio; however, the speed is basically the same.
  • CFG scale can make the audio 'fluid' when low, but at 0.5 it becomes nonsense. When it's too high, the audio becomes repetitive and unnatural.
  • Sampling steps can save time for those who want to tweak them. Some genres can generate with as few as 50 steps, while Electro music seems to do better with 100-200 steps. Going up to 500 steps seems to change almost nothing in many cases.
  • Seconds Total does almost nothing: it neither makes generation faster nor seems to change the output much. I think it might be useful if you want to have everything your prompt contains within a short duration, i.e., 'wind chimes' within 3 seconds rather than spread over 47 seconds.
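As a rough sketch, the four parameters above could be gathered like this before being handed to the inference code. `generation_kwargs` and `generate` are hypothetical helpers. The keyword names and the defaults (100 steps, CFG scale 7, sampler `dpmpp-3m-sde`) follow the usage example published with the model as far as I recall it, so treat them as assumptions and check them against the installed `stable-audio-tools` version.

```python
from typing import Any, Dict


def generation_kwargs(
    prompt: str,
    *,
    steps: int = 100,
    cfg_scale: float = 7.0,
    seconds_total: float = 47,
    sampler_type: str = "dpmpp-3m-sde",
) -> Dict[str, Any]:
    """Collect the tunable settings discussed above into one dict."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if seconds_total < 0:
        raise ValueError("seconds_total must not be negative")
    return {
        "steps": steps,
        "cfg_scale": cfg_scale,
        "sampler_type": sampler_type,
        # seconds_total is passed as conditioning, next to the text prompt.
        "conditioning": [
            {"prompt": prompt, "seconds_start": 0, "seconds_total": seconds_total}
        ],
    }


def generate(model: Any, sample_size: int, device: str, **kwargs: Any) -> Any:
    # Lazy import: stable-audio-tools is only needed when audio is actually generated.
    from stable_audio_tools.inference.generation import generate_diffusion_cond

    return generate_diffusion_cond(
        model, sample_size=sample_size, device=device, **kwargs
    )
```

For the 'wind chimes' case above, `generation_kwargs("wind chimes", steps=50, seconds_total=3)` would request a short event with fewer steps, and the resulting dict would be passed on as `generate(model, sample_size, device, **kwargs)`.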

@rsxdalv rsxdalv closed this as completed Jul 1, 2024
@mykeehu
Author

mykeehu commented Jul 1, 2024

Thank you for the fantastic work and this addon! I will use it!

@chlowden

chlowden commented Jul 1, 2024

Well done. Thank you so much. I have downloaded the latest version. I am getting an error in the Stable_Audio tab
`
Failed to load Stable Audio demo. Please check your configuration.

Error: expected an indented block after 'if' statement on line 548 (stable_audio.py, line 550)
`
I've run update.py and pip install -r requirements_stable_audio.txt

Any ideas how I can resolve this?

@rsxdalv
Owner

rsxdalv commented Jul 1, 2024

Thanks for reporting, fixed it, just update normally or do a git pull for a very quick update.

@chlowden

chlowden commented Jul 1, 2024 via email

@chlowden

chlowden commented Jul 1, 2024 via email

@rsxdalv
Owner

rsxdalv commented Jul 1, 2024

Ok I think I figured it out:
There needs to be another folder, since you can have 100s of different Stable Audio models, so in your example moving the files to a new folder like this should work:

data/models/stable-audio/my-first-stable-audio-model/model.ckpt

@chlowden

chlowden commented Jul 1, 2024

Got it working thanks to you.
I created a specific subfolder and put the model.ckpt in it ... but that did not work. The subfolder also needs the model_config.json file from the same Hugging Face page as the ckpt; once I added it, it all worked great. The webui shows the subfolder name, not the ckpt.

Another issue is that the output file is always overwritten in the tts-generation-webui folder. It would be great if it could show up in the outputs folder like the TTS and Musicgen do.

A passing note, the init audio button is not working for me.

Many thanks
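The folder layout described above (one subfolder per model, each holding `model.ckpt` and `model_config.json`) can be checked with a short script. This is a sketch under that assumption only; `find_models` is a hypothetical helper and not part of the webui.

```python
from pathlib import Path
from typing import Dict, List

# Files each model subfolder needs, according to the comments above.
REQUIRED = ("model.ckpt", "model_config.json")


def find_models(root: str) -> Dict[str, List[str]]:
    """Map each model subfolder name to the list of required files it is missing."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    report: Dict[str, List[str]] = {}
    for folder in sorted(p for p in root_path.iterdir() if p.is_dir()):
        report[folder.name] = [
            name for name in REQUIRED if not (folder / name).is_file()
        ]
    return report
```

Running `find_models("data/models/stable-audio")` would return an empty list for every complete model folder and name the missing file otherwise, which covers the case of a subfolder containing only the checkpoint.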

@chlowden

chlowden commented Jul 1, 2024

Concerning GPU memory, it has a very low memory footprint in comparison to MusicGen. My RTX 3090 has no problem with SD audio for the moment.

@rsxdalv
Owner

rsxdalv commented Jul 1, 2024

So about the outputs - I want to avoid spending a huge amount of time on integrating with the old favorites system and move on to a new system.
When you want them saved, what is the main wish - full integration with the favorites and history and collections, or do you just want to have a folder with all of the files and reasonable filenames/metadata?

@rsxdalv
Owner

rsxdalv commented Jul 6, 2024

Now files are being saved to outputs-rvc/stableaudio/...

@rsxdalv
Owner

rsxdalv commented Jul 8, 2024

Commercial use is now OK for most people; this makes Stable Audio one of the best, if not the best, open source models we have! (Many other famous models are not open source, are non-commercial, etc.) https://stability.ai/news/license-update

@chlowden

chlowden commented Jul 9, 2024

Thank you for sharing this update. This is excellent news from SD. I was starting to worry that the SD project would fold to the GAFA pressure ... which is still a possibility ...

@chlowden

> Now files are being saved to outputs-rvc/stableaudio/...

Hello
I have been doing tests. I now get individually named folders in outputs-rvc but they are empty. But the audio file still appears at the folder root and is overwritten each time.
I replaced the stableaudio file in src with the new one but maybe there is something else to swap too?
Many thanks

@rsxdalv
Owner

rsxdalv commented Jul 10, 2024 via email

@chlowden

Here is the error
FileNotFoundError: [Errno 2] No such file or directory: 'outputs-rvc/Stable Audio/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_.wav'
The system creates a folder and sub-folder that use the prompt and it seems that for the export, the system does not find the path with folder names that are not same ...

@chlowden

No problem with TTS or bark

@chlowden

I'm using Rocky Linux 8.10

@chlowden

chlowden commented Jul 10, 2024

Incidentally, with the latest stable diffusion file, all my outputs are now 48secs long, even if the total seconds are above or below 48secs.

@chlowden

seconds_total_slider = gr.Slider( minimum=0, maximum=512, step=1, value=sample_size // sample_rate, label="Seconds total", visible=has_seconds_total,
Maybe the value=sample_size // sample_rate, explains why everything is 48secs long

@rsxdalv
Owner

rsxdalv commented Jul 10, 2024

@chlowden

> Here is the error
> FileNotFoundError: [Errno 2] No such file or directory: 'outputs-rvc/Stable Audio/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_/2024-07-10_16-30-30_((piano_solo))__acoustic__key_E_minor__minimalist_high_energy_4/4__120bpm__320kbps__48.0kHz_Stereo__Studio__sorrow__minimalism_genre__Classical__Avant-Garde__dynamic_rhythm_.wav'
> The system creates a folder and sub-folder that use the prompt and it seems that for the export, the system does not find the path with folder names that are not same ...

So I think it could be the parenthesis and file system names, I will add a fix for removing the parenthesis, but could you try just a simple 'water' and see if that generation gets saved?

> No problem with TTS or bark

Ok, that helps a lot to know.

> Incidentally, with the latest stable diffusion file, all my outputs are now 48secs long, even if the total seconds are above or below 48secs.

Yes, I always saw that behaviour. Have you ever seen it generate a different length? For me, if I set, say, 10s, the audio will be silent but it still outputs 48 seconds. I tried online demos and saw the same, so I was waiting for stable diffusion to fix this.

> seconds_total_slider = gr.Slider( minimum=0, maximum=512, step=1, value=sample_size // sample_rate, label="Seconds total", visible=has_seconds_total,
> Maybe the value=sample_size // sample_rate, explains why everything is 48secs long

I will check this part. value here means the default value, but it could be related.
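One plausible cause of the FileNotFoundError quoted above, not confirmed in this thread, is the `/` inside `4/4` rather than the parentheses: a slash in a file name is read as a directory separator. The sketch below assumes the `<root>/<name>/<name>.wav` layout seen in the error message; `output_path` and the shortened name are hypothetical.

```python
from pathlib import PurePosixPath


def output_path(root: str, name: str) -> PurePosixPath:
    """Build '<root>/<name>/<name>.wav' (layout assumed from the error message)."""
    return PurePosixPath(root) / name / f"{name}.wav"


# Shortened, hypothetical stand-in for the real prompt-based name.
name = "2024-07-10_energy_4/4__120bpm"

broken = output_path("outputs-rvc/Stable Audio", name)
fixed = output_path("outputs-rvc/Stable Audio", name.replace("/", "_"))

# The slash splits both the folder and the file name into two components each,
# so the file's parent folder is not the one that was created beforehand.
print(len(broken.parts), broken.parent.name)
print(len(fixed.parts), fixed.parent.name)
```

With the slash, the path has six components instead of four, and the file's parent directory is a folder that a single `os.makedirs` call on `<root>/<name>` never creates. That would also be consistent with simple prompts such as 'water' saving correctly.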

@chlowden

Or maybe not as the sample_rate = 32000

@rsxdalv
Owner

rsxdalv commented Jul 10, 2024

I checked the source of Stable Audio and it does seem like sample_size as defined within their code could determine the output length, but the gradio API they have made does not allow changing the length. They have a more internal API but it still seems like their model generates the audio equivalent of 512 by 512.
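The relationship between `sample_size`, `sample_rate`, and clip length can be worked through numerically. The constants below are assumptions (the values I believe the published `model_config.json` contains); the thread also mentions a sample rate of 32000, under which the numbers would differ. The helper names are hypothetical.

```python
# Assumed values, believed to match the published model_config.json.
SAMPLE_SIZE = 2_097_152  # number of samples generated per clip
SAMPLE_RATE = 44_100     # samples per second


def clip_seconds(sample_size: int, sample_rate: int) -> float:
    """Length in seconds of a clip with sample_size samples."""
    return sample_size / sample_rate


def slider_default(sample_size: int, sample_rate: int) -> int:
    """The 'value=sample_size // sample_rate' expression from the slider code."""
    return sample_size // sample_rate


def samples_to_keep(seconds_total: float, sample_rate: int, sample_size: int) -> int:
    """How many samples to keep if the fixed-size output were trimmed afterwards."""
    return min(int(seconds_total * sample_rate), sample_size)
```

Under these assumptions the clip is about 47.55 seconds long, so integer division gives a slider default of 47 while a player that rounds shows 48 seconds. Since the model always produces `sample_size` samples, a shorter file would have to be cut after generation, for example by keeping the first `samples_to_keep(10, SAMPLE_RATE, SAMPLE_SIZE)` samples of each channel.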

@chlowden

I've done so many tests that I am probably getting confused as to what I can do with what service.

@rsxdalv
Owner

rsxdalv commented Jul 10, 2024

Got it, I see we are doing a lot of back and forth so it might be useful to go on the new discord server. I will be busy for a while but hopefully can do more from there.

https://discord.com/invite/3JbBrKrH

@chlowden

river works as you expect ..

@mykeehu
Author

mykeehu commented Jul 10, 2024

> seconds_total_slider = gr.Slider( minimum=0, maximum=512, step=1, value=sample_size // sample_rate, label="Seconds total", visible=has_seconds_total,
> Maybe the value=sample_size // sample_rate, explains why everything is 48secs long

The 48 seconds is interesting because the official limit is 47 seconds. I'm sorry that you can't set a custom length, but the website says it's fixed:
https://stability.ai/news/introducing-stable-audio-open
"Stable Audio Open is an open source text-to-audio model for generating up to 47 seconds of samples and sound effects."

The filenames and folder names are really long, with a prompt you can easily reach the Windows 255 character path limit. I'd rather say the date-seed format could be more manageable, since the file you're describing is next to it anyway.
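A sketch of the date-seed naming suggested above, with the full prompt stored in a small JSON file next to the audio. The helper names, the JSON fields, and the idea of a sidecar file are assumptions made for illustration; the timestamp format mirrors the one visible in the paths earlier in this thread.

```python
import json
from datetime import datetime
from pathlib import Path


def date_seed_stem(when: datetime, seed: int) -> str:
    """Short, prompt-independent file stem such as '2024-07-10_16-30-30_1234'."""
    return f"{when:%Y-%m-%d_%H-%M-%S}_{seed}"


def save_metadata(out_dir: Path, stem: str, prompt: str, seed: int) -> Path:
    """Write the prompt and seed to '<stem>.json' so the file name can stay short."""
    out_dir.mkdir(parents=True, exist_ok=True)
    meta_path = out_dir / f"{stem}.json"
    meta_path.write_text(
        json.dumps({"prompt": prompt, "seed": seed}, indent=2), encoding="utf-8"
    )
    return meta_path
```

Because the stem has a fixed length regardless of the prompt, the path length stays predictable, and no character from the prompt can end up in a file or folder name.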

@rsxdalv
Owner

rsxdalv commented Jul 11, 2024

> The filenames and folder names are really long, with a prompt you can easily reach the Windows 255 character path limit. I'd rather say the date-seed format could be more manageable, since the file you're describing is next to it anyway.

Fixed the filenames:

    >>> def get_name(prompt):
    ...     return (
    ...         prompt.replace(" ", "_")
    ...         .replace(":", "_")
    ...         .replace("'", "_")
    ...         .replace('"', "_")
    ...         .replace("\\", "_")
    ...         .replace(",", "_")
    ...         .replace("(", "_")
    ...         .replace(")", "_")
    ...         .replace("?", "_")
    ...         .replace("!", "_")
    ...         .replace("&", "_")
    ...         # only first 15 characters
    ...         .replace("__", "_")[:15]
    ...     )
    ...
    >>> # test get_name
    >>> get_name("bamboo flute, zed, reiki, meditation music")
    'bamboo_flute_ze'
    >>> get_name("funk, disco, R&B, AOR, soft rock, and boogie")
    'funk_disco_R_B_'
    >>> get_name("Electro House, 320kbps")
    'Electro_House_3'
    >>> get_name("minimalism, piano, acoustic key E minor, 120bpm and 108bpm piano, Classical, Avant-Garde, dynamic rhythm")
    'minimalism_pian'

@mykeehu
Author

mykeehu commented Jul 11, 2024

> Fixed the filenames:

Great! Now you can better manage your folders and files, thank you!

@mykeehu
Author

mykeehu commented Jul 11, 2024

Even the file names need a little fix. Stable Audio Generator produced such a prompt, and it is not saved because of the characters it contains:

Genre: Pop
Mood: Romantic, Atmospheric
Style: 90s
Instruments: Lead, Lead-off
Beats per Minute (BPM): 100
Additional Details: Create a dreamy and nostalgic atmosphere with lush synth pads, gentle piano melodies, a prominent lead melody, and a textural lead-off supporting the track. The music should build up to a cathartic moment filled with emotion and passion, capturing the essence of romanticism in the style of 90s pop.

The problem is with the \n:

Traceback (most recent call last):
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "I:\tts-generation-webui\installer_files\env\lib\site-packages\gradio\utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "I:\tts-generation-webui\tts-generation-webui\src\stable_audio\stable_audio.py", line 282, in save_result
    os.makedirs(base_dir, exist_ok=True)
  File "I:\tts-generation-webui\installer_files\env\lib\os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [WinError 123] File name, directory name or volume label syntax is incorrect: 'outputs-rvc\\Stable Audio\\2024-07-11_21-34-59_Format_Band_\nGe'

@rsxdalv
Owner

rsxdalv commented Jul 13, 2024

> Even the file names need a little fix. Stable Audio Generator produced such a prompt, and it is not saved because of the characters it contains:

Should be fixed in the latest update #342
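For reference, a sanitizer that also covers newlines and path separators could look roughly like this. It is an illustrative sketch, not the change made in the update mentioned above; `safe_name` and the `"untitled"` fallback are hypothetical, and the default of 15 characters is taken from the earlier `get_name` example.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters
# such as the newline from the traceback above.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(prompt: str, max_len: int = 15) -> str:
    """Turn an arbitrary prompt into a short, file-system-friendly name."""
    name = _UNSAFE.sub("_", prompt)
    # Whitespace and common punctuation also become underscores.
    name = re.sub(r"[\s,'()!&]+", "_", name)
    # Collapse runs of underscores and trim separators at both ends.
    name = re.sub(r"_+", "_", name).strip("_. ")
    # Truncate, then trim again: Windows disallows a trailing dot or space.
    name = name[:max_len].rstrip("_. ")
    return name or "untitled"
```

With this sketch, a prompt starting with `Format: Band` followed by a newline and `Genre: Pop` becomes `Format_Band_Gen`, and a prompt containing `4/4` no longer introduces an extra directory level.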

@mykeehu
Author

mykeehu commented Jul 14, 2024

Great work! I tested it, the save works perfectly! Thank you!
