
High fidelity training? #32

Closed
dustyatx opened this issue Mar 31, 2023 · 3 comments

@dustyatx

Would love your thoughts on whether this is a viable experiment to run, or if I'm missing something important about how diffusion models work with audio/spectrograms. I'm really excited to give your project a test, but I'm an ML enthusiast, so my knowledge is limited to a conceptual understanding. Any guidance you can give would be greatly appreciated.

I've noticed that your project, as well as Riffusion, generates low-fidelity audio (<= 10 kHz, about half of hi-fi). I would like to run an experiment to produce short, high-fidelity single-hit sounds.

I've done some experimenting with spectrograms, and I found that if I set the n_fft, sample rate, mels, etc. appropriately, I can create spectrograms that convert from audio to image and back to audio while maintaining high-fidelity sound up to 20,000 Hz. In one experiment I got that hi-fi result with a 512x256 image; not sure if that is useful, but it seemed like a smaller image size would train faster.
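
Here's a minimal sketch of the round trip I mean, using librosa; the parameters and file name are illustrative, not the exact values I settled on:

```python
# Minimal sketch of the audio -> image -> audio round trip with librosa.
# Parameters (n_fft, hop_length, n_mels) and the file name are illustrative.
import numpy as np
import librosa

sr = 44100        # high enough to capture content up to ~20 kHz
n_fft = 2048
hop_length = 512
n_mels = 256      # becomes the y-axis of the image

y, _ = librosa.load("kick.wav", sr=sr, mono=True)

# audio -> log-mel spectrogram (the "image")
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                   hop_length=hop_length, n_mels=n_mels)
log_S = librosa.power_to_db(S, ref=np.max)

# quantize to 8-bit greyscale, as a diffusion model would see it
img = (255 * (log_S - log_S.min()) / (log_S.max() - log_S.min())).astype(np.uint8)

# image -> spectrogram -> audio (lossy: phase is re-estimated by Griffin-Lim)
log_S_rec = img.astype(np.float32) / 255 * (log_S.max() - log_S.min()) + log_S.min()
y_rec = librosa.feature.inverse.mel_to_audio(librosa.db_to_power(log_S_rec),
                                             sr=sr, n_fft=n_fft,
                                             hop_length=hop_length)
```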

I have around 110,000 drum kick sounds that I generated using samples and sound design techniques (FX, blending, pitch shift, stretch, etc.). They range from analog to electronic to experimental kick drum sounds.

I then used sound analysis algorithms to extract features of each sound and mapped those to a variety of natural language phrases. They vary in length from 30-75 tokens, with a nice bell curve peaking around the 50-60 token mark.

I have a 24-core Core i9 CPU and a 4090 with 24 GB of VRAM. I don't mind letting it run for a week or so to get the best results, though I imagine I'd start off with something small like 128x128 and then scale up to 512x512. Any advice on training would be super helpful; it's so hard to get good answers about training on datasets > 100 images.

Here is an example of my descriptions. They are all different; sometimes one will have the scientific measure and its value, other times a phrase that represents that measure. In this case it's an acoustic kick with a good amount of bass, high frequencies, and distortion.

Class dustkick
Very low brightness
Crest 9
Key C1
Medium-high crest factor
Presence 0
Medium-high distortion
Tonal powerful
Brightness dark
Noisiness 3
Harmonicity 7
Types kicks
Distortion intriguing
Length epic
Headroom exciting
RMS 2
Moderate loudness
Duration ultra long
Tags hit acoustic drum
Ultra high harmonicity
Loudness bright

@teticio
Owner

teticio commented Apr 15, 2023

Sorry for the delay - I was on holiday.

The sample rate of the models I uploaded to HF is 22,050 Hz. You can do 44,100 Hz if you like, but you need twice the resolution on the x-axis to get the same sample length in seconds. On your machine you should be able to go to 512x512 easily.
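
To make that concrete: each spectrogram column covers hop_length / sample_rate seconds, so doubling the sample rate halves the seconds per column (hop_length=512 here is just for illustration):

```python
# Seconds of audio covered by a spectrogram of a given width.
# hop_length=512 is an assumption for illustration.
def seconds_covered(width, hop_length=512, sr=22050):
    return width * hop_length / sr

print(seconds_covered(256, sr=22050))  # ~5.9 s at 22,050 Hz
print(seconds_covered(256, sr=44100))  # ~3.0 s: you need a 512-wide image
                                       # at 44,100 Hz for the same ~5.9 s
```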

I found that around 20,000 samples worked well, but it does depend on how homogeneous they are. Also, some genres appear to work better than others.

Regarding conditional training with a text prompt: it can be done with the codebase, but the model expects a vector of numbers (an encoding), which can be a text embedding or whatever. You would just need to provide this as a dictionary as described in the README (i.e., it is not as convenient as a HF pipeline that takes the text as an input). I am not sure that you will get great results with this kind of description. I would suggest things that a pretrained language model might better "understand", like "Fast-paced, exciting, orchestral with drums". Obviously you are not going to label 20,000 samples by hand - the same description can be used for each "slice" of the 30s previews. Perhaps you can find a way to get a meaningful description from the Spotify API / scraping?
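
As a rough illustration of the kind of dictionary I mean, assuming a mapping from audio file name to embedding vector and using a sentence-transformers model as the encoder (check the README for the exact format the training script expects; the file names and "encodings.p" are just examples):

```python
# Sketch: build a dictionary of text encodings keyed by audio file name.
# The format and file names are illustrative; see the README for specifics.
import pickle
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = {
    "track_0001.mp3": "Fast-paced, exciting, orchestral with drums",
    "track_0002.mp3": "Slow, melancholic piano ballad",
}

encodings = {name: encoder.encode(text) for name, text in descriptions.items()}

with open("encodings.p", "wb") as f:
    pickle.dump(encodings, f)
```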

@teticio teticio closed this as completed Apr 22, 2023
@dustyatx
Author

The descriptions are from audio analysis algorithms, so my hope is that they will be a more accurate description of the audio, with a stronger signal than something like genre or a feel. I think if I feed my descriptions into an LLM with some good prompt engineering, it should give me more natural (human-like) descriptions. I have a few installed, and I can use LangChain to batch them, along the lines of the sketch below.
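
Something like this (a sketch against the 2023-era LangChain API, which has since changed; the prompt wording is illustrative):

```python
# Sketch: batch-convert feature lists into natural descriptions with an LLM.
# Uses the 2023-era LangChain API; prompt wording is illustrative.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["features"],
    template=("Rewrite this audio feature list as one natural, human-sounding "
              "description of a kick drum sound:\n{features}"),
)
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)

feature_lists = ["Very low brightness, crest 9, key C1, medium-high distortion"]
natural_descriptions = [chain.run(features=f) for f in feature_lists]
```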

For this experiment I'll only be using single-shot kick drums (0.1-2 secs long, all padded to 2 secs). I have a pretty good set that ranges from natural, real-world kick drums (like a Pearl drum set) to more experimental industrial sounds that are completely synthetic. I have a metadata store that I can use to create a more evenly distributed set, so I don't have an overabundance of 808s or 909s, etc. So it's monolithic in that it's only one class of sound, but with quite a lot of variety.
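
The padding step is just trimming or zero-padding each one-shot to a fixed length, e.g. (the sample rate and file name are assumptions):

```python
# Sketch: trim or zero-pad every one-shot to exactly 2 seconds.
import numpy as np
import librosa

sr = 44100
target_len = 2 * sr

y, _ = librosa.load("kick.wav", sr=sr, mono=True)
y = y[:target_len]                       # trim anything longer than 2 s
y = np.pad(y, (0, target_len - len(y)))  # zero-pad the rest
```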

For the training, could you tell me what type of GPU you used, how long it took to train on 20k samples, and what I can expect for memory usage? I have a 4090 with 24 GB, but I can rent an A100 80GB if necessary.

For vectorizing the text and passing in a dictionary, I can't seem to find anything about that in the README. I'm not sure how I'm missing it. Can you point me to that? Which tokenizer should I use?

Really appreciate the guidance.

@teticio
Owner

teticio commented Apr 23, 2023

I think I mention somewhere in the README that I used a single 2080 Ti, and it took about 40 hours to train on 20,000 samples. So you will be fine with a 4090, at least at the same resolution (256x256). If you go up to 512x512 (which can allow for higher-quality, longer samples), you should still be able to train a VAE (I am doing exactly this right now).

Bear in mind that the LLMs are trained on different texts from the ones you are using, but you can only know by trying.
