# Training Neural Networks to Emulate Audio Effects

I'm about to start a new job at an AI lab, working on RL systems, and I want to do a bit of hands-on NN work to gear up.

Let's figure out how to train a model to emulate various audio effects, like distortion units or reverbs or whatever.

I haven't read any literature on how to do this, so I'll probably do some stupid stuff here, but I find it's more educational to go in blind sometimes. Basically everything I know about NNs is just from discussions with my friend who works on neural compression codecs.

I figure an architecture that might work pretty well is to feed the network the `m` most recent input samples, $[i_{n+1}, i_{n+2}, ..., i_{n+m}]$ (ending at the current sample), and the previous `m-1` output samples, $[o_{n+1}, o_{n+2}, ..., o_{n+m-1}]$, and train it to generate the next output sample, $o_{n+m}$.

```
                     .-------.
[m prev inputs]----->|  The  |
[m-1 prev outputs]-->| Model |-->[next output]
                     '-------'
```

This architecture seems good to me because:

1. It's easy to generate examples of input data
2. It gives the model enough context to (theoretically) recover information like "what is the current phase of this oscillator"
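
To make the feedback part of this concrete, here is a rough sketch of how such a model would be run at inference time: each predicted sample is appended to the output history and fed back in on the next step. `predict_next` is a hypothetical stand-in for the trained network (nothing here is trained), and padding the start with silence is my assumption, not something decided above.

```python
import numpy as np

def run_autoregressive(predict_next, dry: np.ndarray, m: int) -> np.ndarray:
    """Emulate an effect one sample at a time, feeding outputs back in.

    predict_next is a hypothetical callable standing in for the model: it takes
    (m input samples, m-1 previous output samples) and returns one float.
    """
    # Assumption: pad the start with silence so the first sample has full context.
    padded_in = np.concatenate([np.zeros(m - 1, dtype=np.float32),
                                dry.astype(np.float32)])
    out = np.zeros(len(dry) + m - 1, dtype=np.float32)  # first m-1 entries are padding
    for t in range(len(dry)):
        inputs = padded_in[t:t + m]        # m most recent inputs, ending at the current sample
        prev_outputs = out[t:t + m - 1]    # m-1 most recent outputs
        out[t + m - 1] = predict_next(inputs, prev_outputs)
    return out[m - 1:]

# Toy "model": a one-pole lowpass, which needs exactly one previous output.
def toy_model(inputs, prev_outputs):
    last_out = prev_outputs[-1] if len(prev_outputs) else 0.0
    return 0.5 * inputs[-1] + 0.5 * last_out

impulse = np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32)
response = run_autoregressive(toy_model, impulse, m=4)
```

The toy model only looks at the newest input and newest output, but a real network would be free to use the whole window, which is where the "recover the oscillator phase" idea comes from.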


As for how to actually represent the audio, some initial thoughts:

1. You could just pass in and extract straight up $[-1,1]$ float audio into/out of the model.
   1. The model will have to grow hardware to implement a threshold detector ADC or something, which seems kind of wasteful
   2. This fails to accurately capture entropy density in human audio perception. A signal with peak amplitude 0.01 is often just as clear to humans as a signal with peak amplitude 1.0, which brings us to a possible improvement:
2. Pre- and post-process the audio with some sort of compander, like $\mu$-law
   1. This will probably help the model maintain perceptual accuracy across wide volume ranges
3. Encode a binary representation of the audio. This feels to me like it will probably be difficult for the model to reason about.
4. Use some sort of one-hot encoding for the amplitude, probably in combination with a companding algorithm. This might be too expensive on the input side (since we have a lot of inputs), but could work well on the output side. We could have 256 different outputs for 8-bit audio, for example.
   1. We could also possibly allow the network to output a continuous "residual" value for adding additional accuracy on top of the one-hot value.
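
For reference, options 2 and 4 could look something like the sketch below: a standard $\mu$-law compress/expand pair, plus a quantizer that turns a sample into one of 256 class indices (the index you would one-hot encode). The value $\mu = 255$ and the helper names are my assumptions; the continuous "residual" idea is not shown.

```python
import numpy as np

MU = 255.0  # Assumption: the usual constant for 8-bit mu-law

def mu_law_compress(x, mu=MU):
    """Map [-1, 1] audio to [-1, 1], giving quiet samples more of the range."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def to_class_index(x, levels=256):
    """Compand, then quantize to an integer class in [0, levels - 1]."""
    y = mu_law_compress(x)
    idx = np.round((y + 1.0) / 2.0 * (levels - 1))
    return np.clip(idx, 0, levels - 1).astype(np.int64)

def from_class_index(idx, levels=256):
    """Map a class index back to an approximate [-1, 1] sample."""
    y = np.asarray(idx, dtype=np.float64) / (levels - 1) * 2.0 - 1.0
    return mu_law_expand(y)

samples = np.array([-1.0, -0.01, 0.0, 0.01, 0.5, 1.0])
round_trip = mu_law_expand(mu_law_compress(samples))
decoded = from_class_index(to_class_index(samples))
```

Note how a sample at 0.01 lands at roughly 0.23 after compression, so quiet material occupies a much larger share of the 256 classes than it would with linear quantization.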
  
For now, let's just try the first thing and see how it goes. I expect it will work OK for well-normalized (loud) audio and poorly for quiet audio.

As for our loss function: probably some sort of perceptual similarity metric would be best, but let's start with a super simple metric like squared error in $\mu$-law space.
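
A minimal NumPy sketch of that metric, assuming $\mu = 255$ (my choice, not fixed anywhere above). This version is only for illustration: real training would need the same arithmetic written with the training framework's differentiable ops.

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """Standard mu-law companding curve (no clipping, so it stays smooth)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_squared_error(predicted, target, mu=255.0):
    """Mean squared error measured after companding both signals."""
    diff = mu_law_compress(predicted, mu) - mu_law_compress(target, mu)
    return float(np.mean(diff ** 2))

# The same absolute error (0.005) on a quiet sample and on a loud sample.
quiet_loss = mu_law_squared_error([0.015], [0.01])
loud_loss = mu_law_squared_error([0.905], [0.9])
```

The point of measuring error in companded space is visible in the two example values: the quiet-sample error is penalized far more heavily than the loud-sample error, even though plain squared error would treat them identically.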

First up, I need a bunch of training data. I took a selection of songs from my music library and put them in the `examples` folder. (I'm old enough that I have about 75GB of local music files.)

When we want to train a new effect, we'll have a bunch of training runs of the form:

1. Pick a bunch of random song slices from the library
2. Feed the song slices through the reference implementation of the effect (could be software or hardware)
3. Train the model to predict the next sample of the effect across all I/O examples in the batch
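
Here is a small sketch of steps 1 and 2 using a made-up reference effect, so there is something concrete to aim at. `soft_clip_distortion` (a tanh waveshaper), `random_slices`, and the sine wave standing in for a song are all hypothetical placeholders. One deliberate difference from the list above: the sketch runs the effect over the whole signal first and slices afterwards, since an effect with memory (a reverb, say) would otherwise see each slice starting from silence.

```python
import numpy as np

def soft_clip_distortion(dry: np.ndarray, gain: float = 5.0) -> np.ndarray:
    """Hypothetical reference effect: a memoryless tanh waveshaper."""
    return np.tanh(gain * dry).astype(np.float32)

def random_slices(dry, wet, slice_len, count, rng):
    """Pick `count` aligned (dry, wet) slices at random start positions."""
    starts = rng.integers(0, len(dry) - slice_len + 1, size=count)
    return [(dry[s:s + slice_len], wet[s:s + slice_len]) for s in starts]

rng = np.random.default_rng(0)

# Stand-in for a song: one second of a 220 Hz sine at 44.1 kHz.
t = np.arange(44100) / 44100.0
dry = (0.8 * np.sin(2 * np.pi * 220.0 * t)).astype(np.float32)

wet = soft_clip_distortion(dry)                                   # step 2
batch = random_slices(dry, wet, slice_len=64, count=8, rng=rng)   # step 1
```

Any function with the same shape-preserving signature could be swapped in for `soft_clip_distortion`, including a wrapper around audio recorded from a hardware unit.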

In [4]:
import os
import numpy as np
import torchaudio
import random
from pathlib import Path
from typing import List, Tuple

def list_audio_files(examples_dir: str = "examples") -> List[str]:
    """
    List all audio files in the examples directory.
    
    Args:
        examples_dir: Path to the examples directory
        
    Returns:
        List of audio file paths
    """
    audio_extensions = {'.mp3', '.m4a', '.wav', '.flac', '.aac', '.ogg', '.mp4'}
    audio_files = []
    
    examples_path = Path(examples_dir)
    if not examples_path.exists():
        raise FileNotFoundError(f"Examples directory '{examples_dir}' not found")
    
    for file_path in examples_path.rglob('*'):
        if file_path.is_file() and file_path.suffix.lower() in audio_extensions:
            audio_files.append(str(file_path))
    
    return sorted(audio_files)

def read_audio_normalized(file_path: str) -> np.ndarray:
    """
    Read an audio file, convert to mono, and normalize so peak absolute value is 1.0.
    
    Args:
        file_path: Path to the audio file
        
    Returns:
        Normalized mono audio as float32 numpy array with shape (samples,)
    """
    # Load audio using torchaudio (handles many formats).
    # Note: sample_rate is discarded below, so files with different rates are not resampled.
    try:
        waveform, sample_rate = torchaudio.load(file_path)
        
        # Convert to numpy float32
        audio = waveform.numpy().astype(np.float32)
        
        # Convert to mono by averaging channels if stereo/multi-channel
        if audio.shape[0] > 1:
            audio = np.mean(audio, axis=0)
        else:
            audio = audio.squeeze(0)
        
        # Normalize to peak amplitude of 1.0
        peak = np.max(np.abs(audio))
        if peak > 0:  # Avoid division by zero for silent audio
            audio = audio / peak
            
        return audio
        
    except Exception as e:
        # Chain the original exception so the underlying traceback is preserved
        raise RuntimeError(f"Failed to load audio file '{file_path}': {e}") from e


    

In [5]:
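def generate_audio_pairs(audio_effect, file_count: int = 100) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Read random files from the examples folder and run each one through the effect.

    Args:
        audio_effect: Function which takes in the clean audio (f32 mono array) and outputs a
            processed array of the same size
        file_count: Maximum number of files to read

    Returns:
        List of (dry, wet) f32 mono array pairs, one per file read
    """
    files = list_audio_files()
    # Assumption: sample without replacement, capped at the number of files available
    chosen = random.sample(files, min(file_count, len(files)))
    pairs = []
    for file_path in chosen:
        dry = read_audio_normalized(file_path)
        wet = np.asarray(audio_effect(dry), dtype=np.float32)
        if wet.shape != dry.shape:
            raise ValueError(f"Effect changed the audio shape for '{file_path}'")
        pairs.append((dry, wet))
    return pairs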

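def generate_slices_from_pair(audio_pairs: List[Tuple[np.ndarray, np.ndarray]],
                              context_window: int,
                              slices_per_pair: int
                              ) -> Tuple[np.ndarray, np.ndarray]:
    """
    Given reference input/output dry/wet audio pairs,
    create some arrays suitable for training our network.
    Will generate a bunch of randomly-selected slices from each audio pair.

    Args:
        audio_pairs: A list of (dry,wet) mono audio arrays. Entire processed songs
        context_window: The number of historical input samples the neural network gets (`m`)
        slices_per_pair: The number of example slices we want for each input/output pair

    Returns:
        An f32 array of size (slices_per_pair * len(audio_pairs), context_window * 2 - 1)
            The first axis is our "batch size", so to speak
            The second axis is the `m` input samples concatenated with the `m-1` previous output samples
        An f32 array of size (slices_per_pair * len(audio_pairs), 1)
            This is the next output sample corresponding to each input block
    """
    m = context_window
    features, targets = [], []
    for dry, wet in audio_pairs:
        if len(dry) < m:
            raise ValueError("Audio pair is shorter than the context window")
        # `end` is one past the current input sample; wet[end - 1] is the target.
        # Assumption: slice positions are drawn uniformly and may overlap.
        ends = np.random.randint(m, len(dry) + 1, size=slices_per_pair)
        for end in ends:
            # m inputs (including the current one) followed by the m-1 previous outputs
            features.append(np.concatenate([dry[end - m:end], wet[end - m:end - 1]]))
            targets.append(wet[end - 1])
    x = np.asarray(features, dtype=np.float32).reshape(-1, 2 * m - 1)
    y = np.asarray(targets, dtype=np.float32).reshape(-1, 1)
    return x, y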