# Audio generation with RAVE

RAVE is a Real-time Audio Variational autoEncoder (https://github.com/acids-ircam/RAVE) released by Caillon and Esling (ACIDS IRCAM) in November 2021. You can read the paper here: https://arxiv.org/abs/2111.05011. RAVE is a particularly light model that can generate audio in real time on the CPU, and even on embedded systems with low computational power such as a Raspberry Pi (here is a video: https://youtu.be/jAIRf4nGgYI). Still, training this model is computationally expensive: in the original paper, the authors trained for 3M steps, which took six days on a TITAN V GPU.

In this notebook we will see how we can generate audio with pre-trained RAVE models, including unconditional generation (generating from a latent z with the decoder), timbre transfer (passing existing audio through the encoder and decoder to alter the timbre of the sound), and some simple latent space manipulation.

If you want to train your own model you will need a GPU to do it effectively (hence we are not doing it as a class activity). To train a RAVE model on a cloud-based GPU, see [this colab notebook for training your own RAVE model on a custom audio dataset](https://colab.research.google.com/drive/1ih-gv1iHEZNuGhHPvCHrleLNXvooQMvI?usp=sharing). Otherwise, [the official RAVE github page](https://github.com/acids-ircam/RAVE) has instructions for training a RAVE model locally.

\* If you are interested in using RAVE for performing, the real-time implementation runs in MaxMSP and can be downloaded here: https://github.com/acids-ircam/nn_tilde

This notebook is adapted from a notebook originally created by [Teresa Pelinski](https://teresapelinski.com/), which was based on [this RAVE generation notebook](https://colab.research.google.com/github/hdparmar/AI-Music/blob/main/Latent_Soundings_workshop_RAVE.ipynb).

### Installs and imports

There are a couple of Python packages that you will need to install manually the first time you run this. Uncomment and run the lines in the next cell.
- `wget` for downloading files over the internet
- `acids-rave` for downloading and running RAVE models
  
As well as some additional software:
- `ffmpeg` for transcoding audio and video (and all sorts of other useful things! see https://ffmpeg.org/)

Once you have installed these you can re-comment them so they don't run again in future (You can use the shortcut: CMD+/ (mac) or CTRL+/ (windows) to do this for multiple lines).

In [None]:
# !pip install --quiet wget
# !pip install --quiet acids-rave # --quiet avoids long outputs. if you get any errors, remove --quiet
# !yes|conda install ffmpeg

### Import packages

In [None]:
import os
import sys
import wget
import torch

import numpy as np
import librosa as li
import soundfile as sf
import IPython.display as ipd
import matplotlib.pyplot as plt

from math import floor
from scipy import signal

from src.latent_util import create_latent_interp, clamp

### Hyperparameters

In [None]:
latent_dim = 8 
# sample_rate = 48000 # sample rate of the audio
sample_rate = 44100 # sample rate of the audio

### Download pretrained models
Some info on the pretrained models is available here: https://acids-ircam.github.io/rave_models_download

In [None]:
pt_path = "../rave_models" # folder where pretrained models will be downloaded
if not os.path.exists(pt_path): # create the folder if it doesn't exist
    os.mkdir(pt_path)
    
def bar_progress(current, total, width=80): # progress bar for wget
    progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
    # Don't use print() as it will print in new line every time.
    sys.stdout.write("\r" + progress_message)
    sys.stdout.flush()

pretrained_models = ["vintage", "percussion", "VCTK"] # pretrained models to download from https://acids-ircam.github.io/rave_models_download (select fewer if you want this cell to run faster)

for model in pretrained_models: # download pretrained models and save them in pt_path
    if not os.path.exists(os.path.join(pt_path, f"{model}.ts")): # only download if not already downloaded
        print(f"Downloading {model}.ts...")
        wget.download(f"https://play.forum.ircam.fr/rave-vst-api/get_model/{model}",f"{pt_path}/{model}.ts", bar=bar_progress)
    else:
        print(f"{model}.ts already downloaded")


### Load Model

Let us load in one of the models that we have downloaded:

In [None]:
generated_path = "generated" # folder where generated audio will be saved
if not os.path.exists(generated_path): # create the folder if it doesn't exist
    os.mkdir(generated_path)
    
pretrained_model = "vintage" # select the pretrained model to use

model = torch.jit.load(f"{pt_path}/{pretrained_model}.ts" ).eval() # load model
torch.set_grad_enabled(False) # disable gradients

### Random generation

In the next code cell we will see how to randomly sample different points in the RAVE latent space and concatenate the decoded results into an audio clip:

In [None]:
generated_clips = []
for i in range(100):
    # Randomly sample latent space
    z = torch.randn(1,latent_dim,1)
    
    # Generate audio clip and append to list
    gen_audio_clip = model.decode(z)
    gen_audio_clip = gen_audio_clip.reshape(-1).cpu().numpy()
    generated_clips.append(gen_audio_clip)

# Concatenate list of audio clips into one array
generated_audio = np.concatenate(generated_clips)
ipd.display(ipd.Audio(data=generated_audio, rate=sample_rate)) # display audio widget

### Random interpolation

Let's now create an interpolation between two random points in latent space, and listen to what that sounds like:

In [None]:
latent_interp = create_latent_interp(intervals=100, z_dim=latent_dim)
interpolation_clips = []

for latent in latent_interp:
    # Convert to a float32 tensor and reshape for RAVE input
    latent = torch.as_tensor(latent, dtype=torch.float32)
    # This changes the shape from (8) to (1,8,1)
    latent = latent.unsqueeze(0).unsqueeze(2)
    
    # Generate audio clip from the interpolated latent and append to list
    gen_audio_clip = model.decode(latent)
    gen_audio_clip = gen_audio_clip.reshape(-1).cpu().numpy()
    interpolation_clips.append(gen_audio_clip)

# Concatenate list of audio clips into one array
generated_audio = np.concatenate(interpolation_clips)
print(generated_audio.shape)
ipd.display(ipd.Audio(data=generated_audio, rate=sample_rate)) # display audio widget
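
`create_latent_interp` comes from this repository's `src/latent_util.py`, which is not shown here. As a rough sketch of the idea only, linear interpolation between two random latent points could look like the following. The function name `linear_latent_interp` and its `seed` parameter are hypothetical and are not part of the course code or of RAVE.

```python
import numpy as np

def linear_latent_interp(intervals=100, z_dim=8, seed=None):
    """Hypothetical helper: linearly interpolate between two random latent points."""
    rng = np.random.default_rng(seed)
    z_start = rng.standard_normal(z_dim)  # random start point
    z_end = rng.standard_normal(z_dim)    # random end point
    # alpha goes from 0 (start point) to 1 (end point) in `intervals` steps
    alphas = np.linspace(0.0, 1.0, intervals)[:, None]
    # each row is one interpolated latent vector of length z_dim
    return ((1.0 - alphas) * z_start + alphas * z_end).astype(np.float32)

latents = linear_latent_interp(intervals=5, z_dim=8, seed=0)
print(latents.shape)  # (5, 8)
```

Each row can then be reshaped to `(1, z_dim, 1)` and passed to the decoder, one row per audio clip.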

### Load an audio file and listen to it
We can load an audio file using librosa (`li`). `li.load` returns the audio as an array (together with the sample rate), where each item corresponds to the amplitude at one time sample. You can convert from time in samples to time in seconds using `time = np.arange(0, len(input_data))/sample_rate`

In [None]:
input_file = "../media/sounds/368377__rmeluch__cello-phrase-5sec.wav" 
input_data = li.load(input_file, sr=sample_rate)[0] # load input audio

time = np.arange(0, len(input_data)) / sample_rate # to obtain the time in seconds, we need to divide the sample index by the sample rate
plt.plot(time,input_data)
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
plt.title(input_file.split("/")[-1])
plt.grid()

ipd.display(ipd.Audio(data=input_data, rate=sample_rate)) # display audio widget

### Perform timbre transfer
We can now use the pretrained model we loaded earlier with `torch.jit.load` to encode the input audio into a latent representation. For the vintage model, we will be encoding our input audio into a latent space trained on 80h of "vintage music". We can then decode the latent representation and synthesise audio from it. This will make the original sound as if it were "vintage music" (timbre transfer).

In [None]:
x = torch.from_numpy(input_data).reshape(1, 1, -1) # convert audio to tensor and add batch and channel dimensions
z = model.encode(x) # encode audio into latent representation

# synthesize audio from latent representation
y = model.decode(z).numpy() # decode latent representation and convert tensor to numpy array
y = y[:,0,:].reshape(-1) # remove batch and channel dimensions
y = y[abs(len(input_data)- len(y)):] # trim to match input length --> for some reason the output is a bit longer than the input

# save output audio
output_file =f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_generated.wav").split("/")[-1]}'
sf.write(output_file,y, sample_rate)

ipd.Audio(output_file) # display audio widget
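
The slice used above drops samples from the start of `y` and assumes the output is never shorter than the input. A slightly more defensive variant is sketched below with a hypothetical helper `match_length` (not part of RAVE or librosa). It trims extra samples from the start in the same way, and zero-pads at the end if the output comes back short.

```python
import numpy as np

def match_length(y, target_len):
    """Hypothetical helper: make `y` exactly `target_len` samples long."""
    extra = len(y) - target_len
    if extra > 0:
        # output is longer: drop the extra samples from the start, as in the cell above
        return y[extra:]
    # output is shorter (or equal): zero-pad at the end
    return np.pad(y, (0, -extra))

print(match_length(np.arange(10), 8))  # [2 3 4 5 6 7 8 9]
print(match_length(np.arange(3), 5))   # [0 1 2 0 0]
```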

We can compare the waveforms and spectrograms of the input and output:

In [None]:
f1, t1, Zxx1 = signal.stft(input_data, fs=sample_rate, nperseg=2048, noverlap=512)
f2, t2, Zxx2 = signal.stft(y, fs=sample_rate, nperseg=2048, noverlap=512)

fig, axs = plt.subplots(2, 2,figsize=(10,5), sharex=True)

axs[0,0].plot(time,input_data)
axs[0,0].set_ylabel("Amplitude")
axs[0,0].grid()
axs[0,0].set_title(input_file.split("/")[-1])
axs[1,0].plot(time,y)
axs[1,0].set_ylabel("Amplitude")
axs[1,0].set_xlabel("Time (seconds)")
axs[1,0].grid()
axs[1,0].set_title(output_file.split("/")[-1])

# take the magnitude of the complex STFT before converting to dB
axs[0,1].pcolormesh(t1, f1[:100], li.amplitude_to_db(np.abs(Zxx1[:100,:]),
                                                     ref=np.max))
axs[1,1].pcolormesh(t2, f2[:100], li.amplitude_to_db(np.abs(Zxx2[:100,:]),
                                                     ref=np.max))
axs[1,1].set_xlabel("Time (seconds)")
axs[0,1].set_title("STFT")
axs[0,1].set_ylabel("Frequency (Hz)")
axs[1,1].set_ylabel("Frequency (Hz)")


### Alter latent representation
We can now modify the latent coordinates of the input file to alter the generated audio. We can start by adding a constant bias (a displacement) to the coordinates in the latent space:

In [None]:
print(z.shape) # the second dimension corresponds to the latent dimension; in this case, there are 8 latent dimensions

d0 = 1.09  # change in latent dimension 0
d1 = -3 
d2 = 0.02
d3 = 0.5 
# we leave dimensions 4-7 unchanged

z_modified = torch.clone(z) # copy latent representation
# bias latent dimensions (displace each sample representation by a constant value)
z_modified[:, 0] += d0
z_modified[:, 1] += d1
z_modified[:, 2] += d2
z_modified[:, 3] += d3

y_latent_1 = model.decode(z_modified).numpy() # decode latent representation and convert tensor to numpy array
y_latent_1 = y_latent_1[:,0,:].reshape(-1) # remove batch and channel dimensions
y_latent_1 = y_latent_1[abs(len(input_data)- len(y_latent_1)):] # trim to match input length
output_file = f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_latent_generated_1.wav").split("/")[-1]}'
sf.write(output_file,y_latent_1, sample_rate) # save output audio

ipd.Audio(output_file) # display audio widget
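
Because the latent has shape (batch, latent dimension, frames), a constant bias per dimension amounts to a broadcast addition. The sketch below illustrates this with a NumPy array of zeros standing in for `z`. The names `z_demo` and `bias` and the frame count of 20 are made up for illustration.

```python
import numpy as np

# stand-in for z with shape (batch, latent_dim, frames)
z_demo = np.zeros((1, 8, 20), dtype=np.float32)

# one offset per latent dimension; dimensions 4-7 stay at 0
bias = np.zeros(8, dtype=np.float32)
bias[:4] = [1.09, -3.0, 0.02, 0.5]

# reshape the bias to (1, 8, 1) so it broadcasts over batch and time
z_demo_modified = z_demo + bias[None, :, None]

# every frame of dimension k is shifted by bias[k]
print(z_demo_modified[0, :, 0])
```

The same indexing pattern should work on a torch tensor, which would let all offsets be applied in one line instead of one line per dimension.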

### Sinusoid manipulation to latent

Instead of using a constant (a bias) to displace the representation of every sample in the latent space, we can use a function so that we "navigate" the latent space. For example, we can use a sinusoidal function so that the representation oscillates around the original encoded one:

In [None]:
z_modified = torch.clone(z) # copy original latent representation

# bias latent dimensions with a sinusoidal function
# (t below counts latent frames rather than seconds, so 440 is not a frequency in Hz here)
t = torch.linspace(0, z.shape[-1], z.shape[-1])
for idx in range(0, z.shape[1]): # for each latent dimension
    z_modified[:, idx] += torch.sin(440*2*np.pi*t)

y_latent_2 = model.decode(z_modified).numpy() # decode latent representation and convert tensor to numpy array
y_latent_2 = y_latent_2[:,0,:].reshape(-1) # remove batch and channel dimensions
y_latent_2 = y_latent_2[abs(len(input_data)- len(y_latent_2)):] # trim to match input length
output_file = f'{generated_path}/{input_file.replace(".wav", f"_{pretrained_model}_latent_generated_2.wav").split("/")[-1]}' # use a new name so the previous output is not overwritten
sf.write(output_file,y_latent_2, sample_rate) # save output audio

ipd.Audio(output_file) # display audio widget
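
The latent sequence has far fewer frames than the audio has samples, so a modulation frequency only makes sense relative to the latent frame rate (number of latent frames divided by the clip duration). The sketch below builds a slow sinusoidal offset on a time axis in seconds. `latent_sine`, `freq_hz` and `depth` are hypothetical names, and the example values (107 frames for a 5-second clip) are made up for illustration rather than measured from a RAVE model.

```python
import numpy as np

def latent_sine(n_frames, duration_s, freq_hz=0.5, depth=1.0):
    """Hypothetical helper: sinusoidal offset with one value per latent frame."""
    frame_rate = n_frames / duration_s  # latent frames per second
    if freq_hz >= frame_rate / 2:
        # above half the frame rate the sinusoid would alias
        raise ValueError("freq_hz must be below half the latent frame rate")
    t = np.arange(n_frames) / frame_rate  # time of each latent frame in seconds
    return (depth * np.sin(2 * np.pi * freq_hz * t)).astype(np.float32)

offset = latent_sine(n_frames=107, duration_s=5.0, freq_hz=0.5)
print(offset.shape)  # (107,)
```

An offset like this could be added to one latent dimension with `z_modified[:, idx] += torch.from_numpy(offset)`, using `z.shape[-1]` for `n_frames` and `len(input_data) / sample_rate` for `duration_s`.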

### Tasks

**Task 1:** Run this code with the different pre-trained models to see the differences in the audio generated by the model.

**Task 2:** Load in your own audio track (your favourite song or a recording you have) and do the timbre transfer with it. 

Then move onto `interactive-audio-generation.py` to see how we can use these RAVE models for realtime interactive audio generation. 