
Using ni to visualize audio files

I wrote the binary interfacing for ni with the intention of doing signal processing, but until recently I hadn't used it much. This weekend I decided to try some spectral visualizations for audio. Here's Bonobo's Black Sands album, log-frequency FFT at every quarter-second:

image

How to replicate this setup

  1. Install ni. This is just one command, and ni has no dependencies.
  2. Install Python+NumPy (apt install python-numpy on Ubuntu).
  3. Install ffmpeg (apt install ffmpeg on Ubuntu).

I also use youtube-dl to get compressed audio files.

Once you've installed that stuff, you should be able to run ni --js in a terminal, pop open a browser to localhost:8090, and enter commands into the top bar as shown in the screenshots. (You'll have to do some panning/zooming/etc to get the view to line up correctly.)

What's going on here

Let's start with the command line. I've got these basic steps:

  • /home/spencertipping/r/glacial/music/orig/bonobo-black-sands-2.ogg: the compressed audio file
  • sr1[...]: run ... on the r1 server using an SSH connection
    • e[ffmpeg -i - -f wav -]: decode ogg audio to wav
    • bp'r rp"ss"': run perl in a binary context, unpacking two packed short ints (the left and right samples) per record and emitting them as text
    • r-100: drop the first 100 rows, roughly enough to skip past the WAV header
    • pa+b: add left+right channels together to get mono
    • p'r $., pl(44100), rl(10000)': grab and emit windows of 44100 samples, and advance forward by 10000 samples per output row. Each output will be prefixed with the sample offset ($., which is the input line number in Perl).
    • S24[...]: horizontally scale ... by a factor of 24, since r1 has 24 processors
      • NB'x = abs(fft.fft(x * sin(array(range(x.shape[1]))*pi / x.shape[1])**2))': windowed FFT of each row of samples; sin(...)**2 is a Hann window
      • p'r a, $F[$_], log $_ for 1..FM>>1': flatten the row of FFT outputs to a series of rows, each of the form (time offset, amplitude, log(frequency)). Take only the first half, since the second half is a mirror image.
      • p'r a, (1-rand()*rand())*b/1000000, c for 1..b/1000000': a way to get the graphics to look better. We output multiple dots per FFT point, weighted by amplitude; this accentuates spikes.
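
If you'd rather see the whole transformation as a standalone script, here's a rough NumPy sketch of what the pipeline above computes. The filename is a placeholder, and it asks ffmpeg for raw PCM instead of WAV, which sidesteps the header that r-100 skips:

import subprocess
import numpy as np

# Decode to raw 16-bit stereo PCM (no WAV header to skip); filename is a placeholder
raw = subprocess.run(
    ['ffmpeg', '-i', 'bonobo-black-sands-2.ogg', '-f', 's16le', '-ac', '2', '-'],
    capture_output=True).stdout
samples = np.frombuffer(raw, dtype=np.int16).reshape(-1, 2)

mono = samples[:, 0].astype(float) + samples[:, 1].astype(float)   # pa+b: left + right

window, hop = 44100, 10000                      # pl(44100), rl(10000)
hann = np.sin(np.arange(window) * np.pi / window) ** 2

rows = []                                       # (time offset, amplitude, log frequency)
for offset in range(0, len(mono) - window, hop):
    spectrum = np.abs(np.fft.fft(mono[offset:offset + window] * hann))
    for k in range(1, window // 2):             # first half only; the rest is a mirror image
        rows.append((offset, spectrum[k], np.log(k)))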

Here's what it looks like to develop this pipeline:

Step 1: audio samples

If you just have the audio samples and plot them in 2D, you'll see correlation between left/right channels.

image
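
You can put a number on that correlation with NumPy; this reuses the samples array from the sketch above:

import numpy as np

# correlation coefficient between the left and right channels
left, right = samples[:, 0].astype(float), samples[:, 1].astype(float)
print(np.corrcoef(left, right)[0, 1])   # close to 1 when the channels track each other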

You can add the sample offset as the third dimension to see the waveform:

image

The intro also has some channel covariance caused by phase shifting:

image

Step 2: rows of sample windows

I've added some of the later commands to convert the data into something that can be visualized.

At this point we have windows of audio in the time domain. Consecutive windows overlap by about 77% ((44100 - 10000) / 44100), and each one is just over a second long.

image

Step 3: FFT with rectangular window

Each of the above sample windows gets individually Fourier-transformed. Initially it doesn't look like much:

image

To make this easier to parse, let's log-transform the frequency (that's how we perceive pitch), chop off the FFT mirror image, and alias amplitude as color:

image

Step 4: Highlighting peaks

We can't see much here because so many values are zero or near-zero. Let's use rp'b>1000000' to remove small values:

image
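
In terms of the rows list from the earlier sketch, that filter is just:

# rp'b>1000000': keep only rows whose amplitude exceeds the threshold
loud = [(t, amp, logf) for t, amp, logf in rows if amp > 1e6]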

Now we need a way to make the peaks stand out. A simple strategy is to make multiple copies of tall points, each jittered slightly downwards from the top. I'm also going to divide each amplitude by 1000000 so the view axes are scaled more evenly.
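
Continuing from loud above, and using the same 1000000 scale factor as the pipeline, the trick looks roughly like this in Python:

import random

dots = []
for t, amp, logf in loud:
    height = amp / 1e6                           # rescale so the axes are comparable
    for _ in range(int(height)):                 # taller peaks get more copies...
        jitter = 1 - random.random() * random.random()
        dots.append((t, jitter * height, logf))  # ...each jittered slightly down from the top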

image

Step 5: Hann windowing

Windowing helps focus the frequencies we're measuring, and it also narrows the effective time range of each FFT row (which is good because they overlap). We can generate a Hann window in numpy like this:

# here, N is the window width in samples
# (sin, pi, and array are numpy names; they're already in scope in ni's NB context)
hann = sin(array(range(N))*pi / N)**2
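
If you're doing this in a standalone script rather than inside ni, NumPy's built-in np.hanning gives essentially the same curve; it uses N-1 in the denominator (the "symmetric" form), which makes no practical difference at this window size:

import numpy as np

N = 44100
hann_manual  = np.sin(np.arange(N) * np.pi / N) ** 2   # the window above ("periodic" form)
hann_builtin = np.hanning(N)                           # 0.5 - 0.5*cos(2*pi*n/(N-1))
print(np.max(np.abs(hann_manual - hann_builtin)))      # around 1e-4 or smaller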

Here's what that window looks like:

image

And here's the log of the FFT:

image

For comparison, here's the log FFT of a rectangular window:

image

So when we multiply the signal by the Hann function, we'll eliminate a lot of the edge artifact noise we'd have otherwise.

image
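
You can see the same effect on a toy signal: a sine that doesn't land exactly on an FFT bin leaks badly with no window and much less with the Hann window. This is just a standalone illustration, not part of the pipeline:

import numpy as np

N = 44100
t = np.arange(N)
signal = np.sin(2 * np.pi * 440.25 * t / N)     # 440.25 cycles per window: deliberately off-bin

hann = np.sin(t * np.pi / N) ** 2
rect_fft = np.abs(np.fft.fft(signal))[:N // 2]
hann_fft = np.abs(np.fft.fft(signal * hann))[:N // 2]

# leakage: energy far away from the peak near bin 440 is orders of magnitude
# smaller once the window is applied
far = np.abs(np.arange(N // 2) - 440) > 50
print(rect_fft[far].max(), hann_fft[far].max())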

Are different types of music visibly different?

Bach Cello Suite

image

Key change:

image

Large-scale structure:

image

Harmonics during a scale (since the frequencies are log-transformed, they appear to converge):

image

Beethoven's Ninth Symphony, mvmt 2

image

U2: Mysterious Ways

image

Penguin Cafe Orchestra: Perpetuum Mobile

image

Molly Johnson: Must Have Left My Heart

image

This one is kind of interesting because the bass drum occupies the same frequency range as the electric bass line, but you can tell the two apart by their timing:

image

Adele: Cold Shoulder

image

Norah Jones: Sunrise

image

What does compression look like?

I'm going to compress Cold Shoulder because it has frequencies at both extremes.
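
The compressed copies can be made with the ffmpeg installed earlier; something along these lines produces the three MP3s (filenames are placeholders):

import subprocess

# re-encode the original at each bitrate shown below
for kbps in (128, 64, 32):
    subprocess.run(['ffmpeg', '-i', 'cold-shoulder.ogg',
                    '-codec:a', 'libmp3lame', '-b:a', f'{kbps}k',
                    f'cold-shoulder-{kbps}.mp3'],
                   check=True)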

Original

image image

MP3, 128kbps

No noticeable differences:

image image

MP3, 64kbps

No visible differences here either, though the compression at 64kbps is definitely audible:

image image

MP3, 32kbps

Definite differences here. Some high frequencies have disappeared, and the low frequencies are now quantized:

image image