This notebook converts all `.npy` samples in the directory `drumData` into STFT features. Unlike our other feature extraction pipelines (MFCC, MIR and WaveNet), parts of the code in this notebook are adapted from Kyle McDonald's original implementation, which applies a series of transformations and edge-case checks so that the STFT features work well with t-SNE. We found these useful when running the features through t-SNE, but they were unnecessary for UMAP. The output dimensions per file are (32, 32).

The original notebook can be found at:
https://github.com/kylemcdonald/AudioNotebooks/blob/master/Samples%20to%20Fingerprints.ipynb

In [1]:
data_root = 'drumData'
n_fft = 1024
hop_length = n_fft // 4 # integer division keeps the hop an int on Python 3 as well
use_logamp = False # boost the brightness of quiet sounds
reduce_rows = 10 # how many frequency bands to average into one
reduce_cols = 1 # how many time steps to average into one
crop_rows = 32 # limit how many frequency bands to use
crop_cols = 32 # limit how many time steps to use
limit = None # set this to 100 to only process 100 samples
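
The settings above determine the (32, 32) output shape. As a rough sketch of the arithmetic (not part of the original notebook), the snippet below repeats the same parameter values and assumes that `block_reduce` pads the last partial block, so the row count is rounded up. The names `n_bins`, `rows_after_reduce`, `bins_covered` and `samples_covered` are illustrative only.

```python
import math

# Same values as the configuration cell above.
n_fft = 1024
hop_length = n_fft // 4
reduce_rows, reduce_cols = 10, 1
crop_rows, crop_cols = 32, 32

# A real-input STFT yields 1 + n_fft/2 frequency bins per frame.
n_bins = 1 + n_fft // 2

# Averaging groups of `reduce_rows` bins; the last partial group still counts.
rows_after_reduce = math.ceil(n_bins / reduce_rows)

# Cropping keeps only the lowest `crop_rows` averaged bands.
rows_out = min(rows_after_reduce, crop_rows)
bins_covered = rows_out * reduce_rows

# Approximate number of audio samples spanned by the kept time steps.
samples_covered = crop_cols * reduce_cols * hop_length

print(n_bins, rows_after_reduce, rows_out, bins_covered, samples_covered)
```

With these values only the lowest 320 of the 513 STFT bins end up in the fingerprint, and each fingerprint spans roughly the first 8192 samples of a file.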

In [2]:
%matplotlib inline
from utils import *
from tqdm import *
from os.path import join
from matplotlib import pyplot as plt
from skimage.measure import block_reduce
from multiprocessing import Pool
import numpy as np
import librosa

# Load audio samples

In [3]:
drumNames = ["kick", "tom", "snare", "clap", "hi.hat", "ride", "crash"]
drumFingerPrints = {}
drumSamples = {}
for d in drumNames:
    %time drumSamples[d] = np.load(join(data_root, d+'_samples.npy'))

CPU times: user 854 µs, sys: 219 ms, total: 220 ms
Wall time: 380 ms
CPU times: user 561 µs, sys: 18.1 ms, total: 18.7 ms
Wall time: 40.4 ms
CPU times: user 551 µs, sys: 98.1 ms, total: 98.6 ms
Wall time: 158 ms
CPU times: user 1.79 ms, sys: 69.1 ms, total: 70.8 ms
Wall time: 148 ms
CPU times: user 562 µs, sys: 7.03 ms, total: 7.59 ms
Wall time: 10.8 ms
CPU times: user 446 µs, sys: 13.1 ms, total: 13.5 ms
Wall time: 22.2 ms
CPU times: user 661 µs, sys: 39.3 ms, total: 40 ms
Wall time: 55.9 ms


# STFT extraction pipeline

In [4]:
window = np.hanning(n_fft)
def job(y):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window=window)
    amp = np.abs(S)
    if reduce_rows > 1 or reduce_cols > 1:
        amp = block_reduce(amp, (reduce_rows, reduce_cols), func=np.mean)
    if amp.shape[1] < crop_cols:
        amp = np.pad(amp, ((0, 0), (0, crop_cols-amp.shape[1])), 'constant')
    amp = amp[:crop_rows, :crop_cols]
    if use_logamp:
        amp = librosa.power_to_db(amp**2) # librosa.logamplitude was removed in librosa 0.6
    amp -= amp.min()
    if amp.max() > 0:
        amp /= amp.max()
    amp = np.flipud(amp) # for visualization, put low frequencies on bottom
    return amp

for d in drumNames:
    pool = Pool()
    %time fingerprints = pool.map(job, drumSamples[d][:limit])
    pool.close() # release the worker processes before the next pool is created
    pool.join()
    fingerprints = np.asarray(fingerprints).astype(np.float32)
    drumFingerPrints[d] = fingerprints
    print("generated finger print for {} {}".format(d, fingerprints.shape))

CPU times: user 283 ms, sys: 230 ms, total: 512 ms
Wall time: 1.81 s
generated finger print for kick (5158, 32, 32)
CPU times: user 26.8 ms, sys: 24.3 ms, total: 51 ms
Wall time: 137 ms
generated finger print for tom (422, 32, 32)
CPU times: user 144 ms, sys: 141 ms, total: 284 ms
Wall time: 732 ms
generated finger print for snare (2546, 32, 32)
CPU times: user 77.6 ms, sys: 75.9 ms, total: 153 ms
Wall time: 435 ms
generated finger print for clap (1324, 32, 32)
CPU times: user 13.8 ms, sys: 10.2 ms, total: 24 ms
Wall time: 66.3 ms
generated finger print for hi.hat (159, 32, 32)
CPU times: user 16.4 ms, sys: 13.6 ms, total: 30.1 ms
Wall time: 90.8 ms
generated finger print for ride (228, 32, 32)
CPU times: user 45.4 ms, sys: 45.5 ms, total: 90.9 ms
Wall time: 255 ms
generated finger print for crash (723, 32, 32)
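
The edge-case checks in `job` (padding short samples, guarding against division by zero on silent input) can be tried in isolation. The following is a minimal NumPy-only sketch under stated assumptions: `postprocess` is a hypothetical helper that mirrors only the pad, crop, normalize and flip steps, and the input arrays are synthetic stand-ins for block-reduced magnitudes, not real drum data.

```python
import numpy as np

def postprocess(amp, crop_rows=32, crop_cols=32):
    """Pad, crop, min-max normalize and flip a magnitude array (hypothetical helper)."""
    amp = np.array(amp, dtype=np.float64)  # work on a copy of the input
    # Short samples have too few time frames: pad with zeros on the right.
    if amp.shape[1] < crop_cols:
        amp = np.pad(amp, ((0, 0), (0, crop_cols - amp.shape[1])), 'constant')
    # Keep the lowest bands and the earliest frames.
    amp = amp[:crop_rows, :crop_cols]
    # Min-max normalize, skipping the division for all-zero (silent) input.
    amp = amp - amp.min()
    peak = amp.max()
    if peak > 0:
        amp = amp / peak
    # Put low frequencies at the bottom for visualization.
    return np.flipud(amp)

rng = np.random.default_rng(0)
short = rng.random((52, 5))    # stand-in for a very short sample (5 frames)
silent = np.zeros((52, 40))    # stand-in for a silent sample

out_short = postprocess(short)
out_silent = postprocess(silent)
print(out_short.shape, out_silent.shape)
```

Both inputs come out as (32, 32) arrays with values in [0, 1], and the silent input stays all zeros rather than producing NaNs.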


# Write features to `.npy` files

In [None]:
for d in drumNames:
    np.save(join(data_root, d+'_stft.npy'), drumFingerPrints[d])
    print("saved " + d + '_stft.npy')
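
The saved arrays have shape (N, 32, 32), while t-SNE and UMAP implementations generally expect a 2-D matrix of shape (n_samples, n_features). The sketch below shows one way to reload a file and flatten each fingerprint into a 1024-dimensional vector. It is an assumption about the downstream step, which is not part of this notebook; the file name `example_stft.npy` and the random stand-in array are hypothetical and used only so the example runs on its own.

```python
import os
import tempfile

import numpy as np

# Stand-in for one of the saved fingerprint arrays (hypothetical data).
fingerprints = np.random.default_rng(0).random((6, 32, 32)).astype(np.float32)

# Write to a temporary directory so no real files are touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example_stft.npy')  # hypothetical file name
np.save(path, fingerprints)

# Reload and flatten each (32, 32) fingerprint into one feature vector.
loaded = np.load(path)
flat = loaded.reshape(len(loaded), -1)
print(loaded.shape, flat.shape, flat.dtype)
```

The flattened matrix keeps the `float32` dtype written above, so no precision is lost in the round trip.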