Using ni to visualize audio files
I wrote the binary interfacing for ni with the intention of doing signal processing, but until recently I haven't used it much. This weekend I decided to try some spectral visualizations for audio. Here's Bonobo's Black Sands album, log-frequency FFT at every quarter-second:
Things you'll need to install if you want to do this
- ni, which is just one script with no dependencies.
- NumPy (`apt install python-numpy` on Ubuntu)
- ffmpeg (`apt install ffmpeg`)
Taking a look at some audio sample data
Before we get into anything too heavy, let's figure out how to access the sample data from the audio. There are three steps involved:
- Use `ffmpeg` to stream from Vorbis to WAV
- Use ni's `bf` operator to unpack unsigned short stereo samples (`bfss`)
- Seek to the end of silence, about 236000 samples in (see the sketch below)
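None of this is required to run the ni pipeline, but as a rough picture of what those three steps do, here's a minimal Python/NumPy sketch. The filename and the 236000-sample offset come from the commands below; the rest is an assumption about how you'd do it by hand rather than the pipeline itself.

```python
import subprocess
import numpy as np

# Step 1: have ffmpeg decode the Vorbis file to WAV on stdout.
wav = subprocess.run(
    ['ffmpeg', '-i', 'bonobo-black-sands-04.ogg', '-f', 'wav', '-'],
    capture_output=True, check=True).stdout

# Step 2: reinterpret the byte stream as 16-bit stereo samples.
# (Like the ni pipeline, this doesn't bother parsing the WAV header.)
samples = np.frombuffer(wav, dtype='<i2')
samples = samples[:len(samples) // 2 * 2].reshape(-1, 2)

# Step 3: seek past the leading silence, about 236000 samples in.
samples = samples[236000:]
```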
This is almost the right format for `ni --js`. The only thing we need to do is
prepend the sample index using the left-column-juxtapose operator `W` and the
`n` stream (an infinite stream of ascending integers). I'm also truncating the
input data (`r100000`) and limiting it to just the left channel to keep things
manageable (`fAB` -- we've dropped field `C`). Here's the command:
```
bonobo-black-sands-04.ogg e[ffmpeg -i - -f wav -] bfss r-236000 Wn r100000 fAB
```
Note that file paths are relative to wherever
ni --js is running. I started it
from the same terminal I had open to look at the samples.
The data's bounds are larger than
ni --js is configured for by default, so we
increase the view distance to zoom out.
Collecting sample windows
The next order of business is to break the audio into windows, each one second long and spaced 1/4 second apart. The overlap is deliberate; later on we'll multiply by a Hann window to attenuate the edges.
Right now the sample data is a vertical stream:
```
1 left1 right1
2 left2 right2
...
```
Our windowed data structure will look a little different. First, let's stick to just a single channel -- in this case the left one. Second, we want one window per line, each prefixed with its time offset:
```
0    left1     left2     left3     ... left44100
0.25 left11026 left11027 left11028 ... left55125
...
```
We can build this up from the wav binary by changing the `bf` template to unpack
11025 left-samples at a time (`bf'(sx2)11025'`) and then using nondestructive
readahead ("peek lines", `pl(n)`) in a perl mapper context. Perl defines `$.` as
the current line number, so we can prepend the window timestamps in the same
step. I subtract three from it because ni advances perl's internal line counter
when we peek lines.
Note that I skip the first window (
r-1, "drop one row") because it contains
WAV header data. This means our first window begins just before 0.25 seconds.
Here's the code we have so far:
```
$ ni bonobo-black-sands-04.ogg \
     e[ffmpeg -i - -f wav -] \
     bf'(sx2)11025' \
     p'r +($.-3)*0.25, $_, pl(3)' \
     r-1
```
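For reference, the structure this builds is easy to state in plain Python -- a sketch under the assumption that `left` is the left-channel sample array, with the 11025/44100 hop and window sizes from the text:

```python
def windows(left, hop=11025, width=4 * 11025):
    """Yield (time_offset, window) pairs: one-second windows spaced 1/4 second
    apart, so consecutive windows overlap by 3/4 of their length."""
    for i, start in enumerate(range(0, len(left) - width + 1, hop)):
        yield i * 0.25, left[start:start + width]
```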
Visualizing our windows
As it stands, we have too many dimensions for
ni --js to use. We'll need to
convert our windows to a series of
x y z points, one per audio sample. Here's
how we want our coordinates to be mapped:
```
         x=1        x=2        x=3        ...
time1=z  sample1=y  sample2=y  sample3=y  ...
time2=z  sample1=y  sample2=y  ...
...
```
...and here's the output we want:
```
x  y        z
1  sample1  time1
2  sample2  time1
3  sample3  time1
...
1  sample1  time2
2  sample2  time2
...
```
This is a form of dense-to-sparse matrix conversion, which is a fairly common
problem; ni has an operator to do most of the work for us. If we treat each row
as a separate Nx1 matrix, we can use
Y (dense -> sparse) to emit one row per
cell. Those rows will look like this:
```
time1 0 0 sample1
time1 0 1 sample2
time1 0 2 sample3
...
time2 0 0 sample1
time2 0 1 sample2
...
```
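Doing the same unrolling by hand is just a nested loop. Here's a sketch that goes straight to the (x, y, z) layout we want, i.e. it folds in the `fCDA` reordering as well:

```python
def dense_to_sparse(rows):
    """Turn (time, sample1, sample2, ...) rows into (x, y, z) points,
    one per cell, with x = sample index, y = sample value, z = time."""
    for row in rows:
        t, cells = row[0], row[1:]
        for i, value in enumerate(cells):
            yield i, value, t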
From there, we can depth-stack by using
fCDA to set the correct axes. Here's
our command so far:
```
$ ni bonobo-black-sands-04.ogg \
     e[ffmpeg -i - -f wav -] \
     bf'(sx2)11025' \
     p'r +($.-3)*0.25, F_, pl(3)' \
     r-1 YB fCDA
```
...and here's a URL that sets up the view state:
It's also possible to use the sparse representation to vertically stack the
waveforms. Instead of producing a three-column output, we add a scaled copy of
the time offset to each sample value -- the `p'r a, b + c*30'` step you'll see
below -- so each window's waveform is drawn at its own vertical offset.
Applying an FFT using NumPy
This is where we pull in NumPy, which provides an FFT function. ni's `N`
operator runs a series of matrices through a NumPy context. If provided, its
column spec indicates the first column of matrix data (everything to the left is
treated as the matrix partition and is preserved).
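Conceptually, then, `NB'...'` behaves something like the sketch below. This is my reading of the description above rather than ni's actual implementation, so treat the grouping details as assumptions:

```python
import numpy as np
from itertools import groupby

def N(rows, fn, matrix_start=1):
    """Group consecutive rows by their partition columns (everything left of
    matrix_start), hand each group's numeric part to fn as a matrix x, and
    re-emit the partition columns alongside the transformed rows."""
    for partition, group in groupby(rows, key=lambda r: tuple(r[:matrix_start])):
        x = np.array([r[matrix_start:] for r in group], dtype=float)
        for out_row in np.atleast_2d(fn(x)):
            yield (*partition, *out_row)

# e.g. N(window_rows, lambda x: np.abs(np.fft.fft(x)))
```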
We can get a simple FFT by transforming each window like this:
```
NB'x = abs(fft.fft(x))'
```
This is a good opportunity to normalize the window vectors to sum to 1; this way we don't have enormous FFT components.
```
NB'x = abs(fft.fft(x / x.sum()))'
```
We should do a few things to these FFTs (all four steps are sketched together in NumPy after the list):

- Drop the top half; it's just a mirror image (`rp'a < 22050'`)
- Rescale the spectrum so the low-frequency range is larger (log-scale the frequency column; the `A` in `,LAB` below)
- Log-scale the amplitudes (the `B` in `,LAB` below)
- Apply a Hann window to each FFT input (`x * sin(array(range(x.shape))*pi / x.shape)**2`)
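Put together, those four steps look something like this in plain NumPy -- a sketch of the math rather than the ni operators themselves (in particular, the exact form of `,L`'s "unsigned log" below is a guess):

```python
import numpy as np

def spectrum(window):
    """One processed FFT frame: normalize, Hann-window, FFT, keep the
    non-mirrored half, then log-scale both the frequency and amplitude axes."""
    n = len(window)
    x = window / (1 + window.sum())                 # keep the FFT components small
    hann = np.sin(np.arange(n) * np.pi / n) ** 2    # attenuate the window edges
    mag = np.abs(np.fft.fft(x * hann))[:n // 2]     # drop the mirror-image top half
    bins = np.log1p(np.arange(n // 2))              # log-frequency axis (the A of ,LAB)
    return bins, np.log1p(mag)                      # log amplitudes (the B of ,LAB)
```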
,L is a cell-level operator that takes the unsigned log of a number. Here's
what we have in the command bar so far:
```
bonobo-black-sands-04.ogg e[ffmpeg -i - -f wav -] bf'(sx2)11025' p'r +($.-3)*0.25, F_, pl(3)' r-1 NB'x = abs(fft.fft(x / (1 + x.sum()) * sin(array(range(x.shape))*pi / x.shape)**2))' YB fCDA rp'a < 22050' ,LAB p'r a, b + c*30' r5000000
```
r5000000 just cuts the datastream after a while so interaction is
faster. Without it, the backend ni process can outrun Chrome's websocket driver,
which causes the browser to become unusably slow while data is being streamed.
Improving the presentation
TODO: explain this part better
We need two new steps:
rxB to copy each row as many times as the value in
B (amplitude), and
,AB to randomly attenuate the resulting copied
amplitudes by an exponentially-distributed value. This creates a shadow
downwards from the largest values.
This transform has the secondary benefit of removing the zero points, which makes it easier to see the data we care about.
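The same idea in plain Python looks roughly like this -- a sketch only, with the exact distributions treated as assumptions since `rxB` and `,AB` define their own:

```python
import numpy as np

def shadow(points, rng=None):
    """Copy each (x, y, z) point in proportion to its amplitude y, then knock
    each copy's amplitude down by an exponentially-distributed amount, so a
    shadow trails downward from the brightest values."""
    rng = rng or np.random.default_rng()
    out = []
    for x, y, z in points:
        for _ in range(int(y)):                     # more copies for larger amplitudes
            out.append((x, y - rng.exponential(), z))
        # points with y < 1 produce no copies, which also drops the zeros
    return out
```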
Stacking back into 3D and mapping color
Almost done. Let's remove the
p'r a, b + c*30' we had to vertically stack the
FFTs; now we're back to three dimensions. I swapped X and Z so X corresponds to
time and Z to log(frequency), so we have this:
The last thing we want to do is add color, which we can do just by setting the
axes to `ABCB` (copying Y into the chroma channel). I'm also copying Y into the
opacity channel, so Y really drives three output channels at once.
Final notes: scaling it out
Right now everything's running in a single stream and we're making fairly uneven use of our processors. If we want to process an album's worth of sound data, we should parallelize the expensive parts -- in this case, everything downstream of our window assembly.
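This isn't ni's splitting operator, just an illustration of the idea with Python's multiprocessing: the per-window FFT work is independent, so a process pool spreads it across cores (the noise data here is a stand-in):

```python
from multiprocessing import Pool
import numpy as np

def frame(pair):
    """FFT one (time, window) pair; each pair is independent of the others,
    so the expensive part parallelizes trivially."""
    t, window = pair
    return t, np.abs(np.fft.fft(window))

if __name__ == '__main__':
    # stand-in data: 100 one-second windows of noise (44100 samples each)
    rng = np.random.default_rng()
    pairs = [(i * 0.25, rng.standard_normal(44100)) for i in range(100)]
    with Pool() as pool:                  # one worker process per core by default
        frames = pool.map(frame, pairs)
```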
I'm also going to change the exponential distribution for
,AB by appending
8: this produces a much sharper curve.
I have more processors on
r1 than I have locally, so here's my pipeline:
```
bonobo-black-sands-04.ogg                 # send data still compressed
sr1[ e[ffmpeg -i - -f wav -]              # ffmpeg on the server
     bf'(sx2)11025'
     p'r +($.-3)*0.25, F_, pl(3)' r-1
     S24[                                 # everything from here is splittable
       NB'x = abs(fft.fft(x / (1 + x.sum()) * sin(array(range(x.shape))*pi / x.shape)**2))'
       YB fCDA rp'a < 22050' ,LAB
       rxB ,AB fCBA]]
```