<a href="https://colab.research.google.com/github/xxristoskk/stable-audio-sample-generator/blob/main/Stable_Audio_Sample_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stable Audio Sample Generator

This notebook uses Stability AI's stable-audio-tools to generate audio samples. You need a Hugging Face account because an access token is required to download the model.

When the install cell below finishes running, Colab will ask you to restart the runtime. Restart it once; you won't need to run the install cell again afterward.

The instrument name becomes the root folder for your output. The project name becomes both the subfolder inside it and the base filename of the generated audio.

For example, suppose I want samples of various guitar playing styles. I can name the instrument Guitar and name each project metal, punk, jazz, etc. A folder named "Guitar" is created, with a subfolder for each project name.
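
To make the folder layout concrete, here is a minimal sketch of how the output paths are put together for a single prompt. The helper name `planned_paths` is hypothetical and not part of the notebook; it only mirrors the naming pattern used in the Run cell and creates no files.

```python
import os

def planned_paths(instrument, project_name, batch=0):
    """Return the .wav paths that the naming pattern would produce.

    Hypothetical helper for illustration only; nothing is written to disk.
    """
    # Root folder is the instrument, subfolder is the project
    new_dir = os.path.join(instrument, project_name)
    if not batch:
        # One sample: the file is named after the project
        return [f"{new_dir}/{project_name}.wav"]
    # Several samples: an index in square brackets is appended
    return [f"{new_dir}/{project_name}[{x}].wav" for x in range(batch)]

for path in planned_paths("Guitar", "metal", batch=2):
    print(path)
```

Reusing the same instrument with a different project name adds a new subfolder next to the existing ones rather than overwriting them.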

Create as many instruments and projects as you'd like, then zip them for download in the last cell. Alternatively, download individual files from the file browser on the left.

In [1]:
#@markdown #Install Pre-reqs
# Install required libraries
!pip install torch torchaudio einops huggingface_hub stable-audio-tools

Collecting einops
  Downloading einops-0.8.0-py3-none-any.whl (43 kB)
Collecting stable-audio-tools
  Downloading stable_audio_tools-0.0.16-py3-none-any.whl (121 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidi

In [1]:
import os
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
from huggingface_hub import login
from google.colab import files

# Function to read text file
# def read_txt(path: str):
#     with open(path, 'r') as file:
#         prompts = [line.strip() for line in file if line.strip()]
#     return prompts

#@markdown #Setup
# Authenticate with Hugging Face
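huggingface_token = '' #@param {type: 'string'}
# Use the token from the form if one was entered; otherwise fall back to the interactive prompt
if huggingface_token:
    login(token=huggingface_token)
else:
    login()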



# Check CUDA availability
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA is not available. Using CPU.")

# Additional diagnostics
print("PyTorch version:", torch.__version__)
print("CUDA version:", torch.version.cuda)
print("CuDNN version:", torch.backends.cudnn.version())
print("CUDA path:", os.getenv("CUDA_PATH"))
print("CUDA visible devices:", os.getenv("CUDA_VISIBLE_DEVICES"))

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Get the model and define sample rate and size
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)

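# Function to run the model
# Note: cfg_scale is read from the global set in the Settings cell, so run that cell before calling this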
def run_model(prompt: str, seconds: int, steps: int, output_name: str):
    conditioning = [{
        "prompt": prompt,
        "seconds_start": 0,
        "seconds_total": seconds
    }]
    output = generate_diffusion_cond(
        model,
        steps=steps,
        cfg_scale=cfg_scale,
        conditioning=conditioning,
        sample_size=sample_size,
        sigma_min=0.3,
        sigma_max=500,
        sampler_type="dpmpp-3m-sde",
        device=device
    )
    output = rearrange(output, "b d n -> d (b n)")
    output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
    torchaudio.save(output_name, output, sample_rate)


CUDA available: True
Using GPU: Tesla T4
PyTorch version: 2.3.0+cu121
CUDA version: 12.1
CuDNN version: 8902
CUDA path: None
CUDA visible devices: None
No module named 'flash_attn'
flash_attn not installed, disabling Flash Attention




# Settings
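
The `seconds` setting interacts with the model's fixed output window. As a rough sketch, the number of audio samples that cover the requested duration is `seconds * sample_rate`, capped at the model's `sample_size`. The helper name `usable_samples` is hypothetical, and the default values 44100 and 2097152 are assumptions about what `model_config` reports for stable-audio-open-1.0; in the notebook, read the real values from `model_config`.

```python
def usable_samples(seconds, sample_rate=44100, sample_size=2097152):
    """Number of samples covering `seconds`, capped at the model's window.

    The defaults are assumed values for illustration; in the notebook use
    model_config["sample_rate"] and model_config["sample_size"] instead.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Requests longer than the window cannot yield more than sample_size samples
    return min(int(seconds * sample_rate), sample_size)

print(usable_samples(10))  # samples for a 10-second request
print(usable_samples(60))  # a request longer than the assumed window is capped
```

If a generated clip turns out longer than requested, a value like this could be used to trim the tensor (for example `output[..., :n]`) before saving.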

In [2]:
# User input for parameters
steps = 100 #@param {type: 'integer'}
seconds = 47 #@param {type: 'integer'}
cfg_scale = 7 #@param {type: 'integer'}
instrument = 'vox' #@param {type: 'string'}
project_name = 'chants' #@param {type: 'string'}
prompt = 'a bear trying to sing in gregorian chants' #@param {type: 'string'}
multiprompt = False #@param {type: 'boolean'}
batch = 0 #@param {type: 'integer'}


In [None]:
#Place multiple prompts here
prompts = [
    'detroit techno loop',
    'jazz drums at 160bpm',
    'hardcore breakdown drum beat played with chopsticks',
    'the sound of a frustrated bear trying to play progressive rock drums',
]

In [3]:
#@markdown #Run
# Create new directory for outputs
# new_dir = os.path.join(f'{os.getcwd()}/{instrument}', project_name)
new_dir = os.path.join(f'{instrument}', project_name)

os.makedirs(new_dir, exist_ok=True)

# Generate audio samples
if not multiprompt and not batch:
    run_model(prompt, seconds, steps, f'{new_dir}/{project_name}.wav')
elif batch and not multiprompt:
    for x in range(batch):
        run_model(prompt, seconds, steps, f'{new_dir}/{project_name}[{x}].wav')
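else:
    # Multi-prompt mode: generate at least one sample per prompt, even when batch is 0
    for x, current_prompt in enumerate(prompts):
        print(f'Prompt: {current_prompt}')
        for i in range(max(batch, 1)):
            run_model(current_prompt, seconds, steps, f"{new_dir}/{project_name}[{x}]_{i}.wav")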

print('DONE!')


2949926074


  0%|          | 0/100 [00:00<?, ?it/s]



DONE!


# Download
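
The cell below zips only the current `new_dir`. If you have generated several instruments, a sketch like the following could bundle all of them into one archive using only the standard library. The function name `zip_folders` and the folder names in the usage note are hypothetical, and the archive should be written outside the folders being zipped.

```python
import os
import zipfile

def zip_folders(folders, archive_name="output.zip"):
    """Write every file under the given folders into one zip archive."""
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder in folders:
            for root, _dirs, filenames in os.walk(folder):
                for name in filenames:
                    path = os.path.join(root, name)
                    # Store each file under its path relative to the working directory
                    zf.write(path, arcname=os.path.relpath(path))
    return archive_name
```

In Colab, something like `files.download(zip_folders(["Guitar", "vox"]))` would then download the combined archive.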

In [None]:
# Download the generated files
!zip -r output.zip {new_dir}
files.download("output.zip")

updating: vox/bear/ (stored 0%)
updating: vox/bear/bear[0].wav (deflated 41%)
updating: vox/bear/bear[1].wav (deflated 45%)
updating: vox/bear/bear[2].wav (deflated 46%)

