Using ni to visualize audio files
I wrote the binary interfacing for ni with the intention of doing signal processing, but until recently I haven't used it much. This weekend I decided to try some spectral visualizations for audio. Here's Bonobo's Black Sands album, log-frequency FFT at every quarter-second:
How to replicate this setup
- Install ni. This is just one command, and ni has no dependencies.
- Install Python+NumPy (`apt install python-numpy` on Ubuntu).
- Install ffmpeg (`apt install ffmpeg` on Ubuntu).
I also use youtube-dl to get compressed audio files.
Once you've installed that stuff, you should be able to run `ni --js` in a terminal, pop open a browser to `localhost:8090`, and enter commands into the top bar as shown in the screenshots. (You'll have to do some panning/zooming/etc to get the view to line up correctly.)
What's going on here
Let's start with the command line. I've got these basic steps:
- `/home/spencertipping/r/glacial/music/orig/bonobo-black-sands-2.ogg`: the compressed audio file
- `r1`: run on the server `r1` using an SSH connection
- `e[ffmpeg -i - -f wav -]`: decode ogg audio to wav
- `bp'r rp"ss"'`: run Perl in a binary context: extract two packed short ints (left and right samples) per record and emit them as text
- `r-100`: drop the first 100 rows to skip the WAV header, give or take
- `pa+b`: add left+right channels together to get mono
- `p'r $., pl(44100), rl(10000)'`: grab and emit windows of 44100 samples, and advance forward by 10000 samples per output row. Each output will be prefixed with the sample offset (`$.`, which is the input line number in Perl).
- `S24[...]`: horizontally scale `...` by a factor of 24, since `r1` has 24 processors
- `NB'x = abs(fft.fft(x * sin(array(range(x.shape[1]))*pi / x.shape[1])**2))'`: windowed FFT of each row of samples; `sin(...)**2` is a Hann window
- `p'r a, $F[$_], log $_ for 1..FM>>1'`: flatten the row of FFT outputs to a series of rows, each of the form (time offset, amplitude, log(frequency)). Take only the first half, since the second half is a mirror image.
- `p'r a, (1-rand()*rand())*b/1000000, c for 1..b/1000000'`: a way to get the graphics to look better. We output multiple dots per FFT point, weighted by amplitude; this accentuates spikes.
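The DSP core of those steps can be sketched in plain NumPy. This is a rough equivalent, not the ni code itself; the window size, step size, and Hann window are taken from the steps above, and the function name is mine:

```python
import numpy as np

def spectral_rows(mono, window=44100, step=10000):
    """For each overlapping window of samples, take a Hann-windowed FFT
    and emit (time offset, amplitude, log frequency) triples, keeping
    only the first half of the bins (the rest mirror them)."""
    hann = np.sin(np.arange(window) * np.pi / window) ** 2
    rows = []
    for offset in range(0, len(mono) - window + 1, step):
        amps = np.abs(np.fft.fft(mono[offset:offset + window] * hann))
        # skip bin 0, since log(0) is undefined
        rows.extend((offset, amps[k], np.log(k))
                    for k in range(1, window // 2))
    return rows

# tiny demo: two seconds of a 440Hz sine at 44.1kHz
t = np.arange(44100 * 2) / 44100.0
rows = spectral_rows(np.sin(2 * np.pi * 440 * t))
```

With a one-second window at 44.1kHz, bin k corresponds to k Hz, so the demo's peak lands at bin 440.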
Here's what it looks like to develop this pipeline:
Step 1: audio samples
If you just have the audio samples and plot them in 2D, you'll see correlation between left/right channels.
You can add the sample offset as the third dimension to see the waveform:
The intro also has some channel covariance caused by phase shifting:
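The sample extraction itself (the `bp'r rp"ss"'` step) just unpacks interleaved pairs of 16-bit little-endian integers. A Python sketch of the same idea, using the standard `struct` module:

```python
import struct

def stereo_samples(raw):
    """Unpack interleaved little-endian 16-bit stereo PCM bytes into
    (left, right) pairs, like ni's bp'r rp"ss"' step does per record."""
    n = len(raw) // 4                       # 4 bytes per stereo frame
    return list(struct.iter_unpack('<hh', raw[:n * 4]))

# demo: one correlated frame and one anti-correlated frame
raw = struct.pack('<hhhh', 1000, 1000, 2000, -2000)
pairs = stereo_samples(raw)
```

Plotting these pairs in 2D is what produces the left/right correlation patterns described above.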
Step 2: rows of sample windows
I've added some of the later commands to convert the data into something that can be visualized.
At this point we have windows of audio in the time domain. Consecutive windows overlap by about 77%, and each is exactly one second long (44100 samples at 44.1kHz).
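The overlap and timing figures follow directly from the window and step sizes used above:

```python
fs, window, step = 44100, 44100, 10000   # sample rate, window, hop (samples)

window_seconds = window / fs             # 1.0 s per window
hop_seconds = step / fs                  # ~0.227 s between FFT rows
overlap_pct = 100 * (window - step) / window
```

So consecutive windows share 34100 of their 44100 samples, just over 77%.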
Step 3: FFT with rectangular window
Each of the above sample windows gets individually Fourier-transformed. Initially it doesn't look like much:
To make this easier to parse, let's log-transform the frequency (that's how we perceive pitch), chop off the FFT mirror image, and alias amplitude as color:
Step 4: Highlighting peaks
We can't see much here because so many values are zero or near-zero. Let's use `rp'b>1000000'` to remove small values:
Now we need a way to make the peaks stand out. A simple strategy is to make multiple copies of tall points, each jittered slightly downwards from the top. I'm also going to divide each amplitude by 1000000 so the view axes are scaled more evenly.
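The jitter-and-replicate trick from the last pipeline step can be sketched in Python (a loose translation of the Perl one-liner, not the ni code; the function name is mine):

```python
import random

def jittered_points(t, amp, logf, scale=1_000_000):
    """Emit multiple copies of one FFT point, each jittered slightly
    downward from the top, so tall peaks get visually denser. Amplitude
    is divided by `scale` to even out the view axes."""
    copies = int(amp / scale)            # taller peaks get more dots
    return [(t, (1 - random.random() * random.random()) * amp / scale, logf)
            for _ in range(copies)]
```

The `1 - rand()*rand()` factor biases the copies toward the top of the peak rather than spreading them uniformly.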
Step 5: Hann windowing
Windowing helps focus the frequencies we're measuring, and it also narrows the effective time range of each FFT row (which is good because they overlap). We can generate a Hann window in numpy like this:
```
# here, N is the window width in samples
hann = sin(array(range(N))*pi / N)**2
```
Here's what that window looks like:
And here's the log of the FFT:
For comparison, here's the log FFT of a rectangular window:
So when we multiply the signal by the Hann function, we'll eliminate a lot of the edge artifact noise we'd have otherwise.
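The leakage difference is easy to check numerically. This sketch FFTs a sine whose frequency falls between bins (the worst case for leakage) with and without the Hann window, then sums the energy far from the peak; the bin ranges are my own arbitrary choices:

```python
import numpy as np

N = 1024
t = np.arange(N)
signal = np.sin(2 * np.pi * 10.5 * t / N)   # off-bin frequency: max leakage
hann = np.sin(t * np.pi / N) ** 2           # same window as above

rect_spec = np.abs(np.fft.fft(signal))          # rectangular window
hann_spec = np.abs(np.fft.fft(signal * hann))   # Hann window

# energy well away from the peak (bins 30..511) is edge-artifact leakage
rect_leak = rect_spec[30:512].sum()
hann_leak = hann_spec[30:512].sum()
```

The Hann-windowed spectrum's far-from-peak energy comes out orders of magnitude smaller, which is exactly the edge-artifact noise the windowing step eliminates.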
Are different types of music visibly different?
Bach Cello Suite
Harmonics during a scale (since the frequencies are log-transformed, they appear to converge):
Beethoven's Ninth Symphony, mvmt 2
U2: Mysterious Ways
Penguin Cafe Orchestra: Perpetuum Mobile
Molly Johnson: Must have Left My Heart
This one is kind of interesting because the bass drum occupies the same frequency range as the electric bass line, but they appear distinct by looking at timing:
Adele: Cold Shoulder
Norah Jones: Sunrise
What does compression look like?
I'm going to compress Cold Shoulder because it has frequencies at both extremes.
No noticeable differences:
No visible differences here either, though 64kbps is definitely audible:
Definite differences here. Some high frequencies have disappeared, and the low frequencies are now quantized: