Description
Hi, I'm currently migrating my torch codebase from librosa to torchaudio for transforms, to take advantage of the (much) faster torch STFT implementation on the GPU. However, I'm running into several cases where Spectrogram vs. librosa.core.spectrum._spectrogram, and MelSpectrogram vs. librosa.feature.melspectrogram, produce different results. Does this repo ensure consistency with another Python audio library for those transformations? I think it would be good to be consistent with another widely used library. I'm currently working out the parameters needed to make the outputs match, and I can open a PR if that sounds useful.
For example:
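import librosa
import numpy as np
import torch
import torchaudio

sound, sample_rate = torchaudio.load('wav_file.wav')
# Assumption: the original snippet used an undefined `soundcuda`; a GPU copy is presumably what was meant.
sound_cuda = sound.cuda()
sound_librosa = sound.cpu().numpy().squeeze().T
sample_rate = 16000
n_mels = 40
window_stride = 0.01
window_size = 0.025
hop_length = int(sample_rate * window_stride)
n_fft = int(sample_rate * window_size)
# Assumption: `window` was not defined in the original snippet; a Hann window matches librosa's default.
window = torch.hann_window(n_fft).cuda()

# librosa reference
stft_librosa = librosa.stft(y=sound_librosa,
                            hop_length=hop_length,
                            n_fft=n_fft)
spectro_librosa, n_fft = librosa.core.spectrum._spectrogram(y=sound_librosa,
                                                            hop_length=hop_length,
                                                            n_fft=n_fft, power=2)
mel_basis = librosa.filters.mel(sr=sample_rate,
                                n_mels=n_mels,
                                n_fft=n_fft,
                                norm=None,  # non-standard
                                htk=True)   # non-standard
check = np.dot(mel_basis, spectro_librosa)

# torch / torchaudio equivalent
stft_torch = torch.stft(sound_cuda,
                        hop_length=hop_length,
                        n_fft=n_fft,
                        window=window).transpose(1, 2)
spectro_torch = stft_torch.pow(2).sum(-1)  # power spectrogram from real/imag parts
melscale = torchaudio.transforms.MelScale(n_mels=n_mels)
check2 = melscale(spectro_torch)  # apply the mel scale to the torch spectrogram, not to `check`
# expectation: check and check2 should match (up to axis order and numerical tolerance)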
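As a sanity check on the parameters above, the expected output shapes can be worked out by hand. This is a minimal sketch, assuming both STFTs run with their default center=True padding (in which case the frame count is 1 + n_samples // hop_length); the helper name n_frames is hypothetical.

```python
# Shape arithmetic for the parameters used in the snippet above.
sample_rate = 16000
window_stride = 0.01   # seconds between frames
window_size = 0.025    # seconds per analysis window

hop_length = int(sample_rate * window_stride)  # samples between frames
n_fft = int(sample_rate * window_size)         # samples per FFT window
n_freq_bins = n_fft // 2 + 1                   # one-sided spectrum size


def n_frames(n_samples, hop):
    """Frame count for a centered STFT (assumes center=True padding)."""
    return 1 + n_samples // hop


# One second of 16 kHz audio
frames_per_second = n_frames(sample_rate, hop_length)
print(hop_length, n_fft, n_freq_bins, frames_per_second)
```

If the two libraries disagree on either dimension, the mismatch is in the framing parameters rather than in the mel or normalization options.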
The torchaudio MelScale corresponds to the non-default librosa options norm=None, htk=True on librosa.filters.mel (https://librosa.github.io/librosa/_modules/librosa/filters.html#mel). For this comparison I also removed the default spectrogram normalization at https://github.com/pytorch/audio/blob/master/torchaudio/transforms.py#L198, which is not a librosa option.
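To illustrate why the htk and norm options matter, here is a rough pure-Python sketch of the two mel scales and of Slaney-style area normalization. The formulas follow my reading of librosa's filters module; the function names (band_edges_hz, slaney_gains) are hypothetical and are not part of either library.

```python
import math

# Constants of the Slaney-style scale: linear below 1 kHz, logarithmic above.
F_SP = 200.0 / 3.0
MIN_LOG_HZ = 1000.0
MIN_LOG_MEL = MIN_LOG_HZ / F_SP
LOGSTEP = math.log(6.4) / 27.0


def hz_to_mel(f, htk):
    if htk:
        return 2595.0 * math.log10(1.0 + f / 700.0)
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOGSTEP


def mel_to_hz(m, htk):
    if htk:
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (m - MIN_LOG_MEL))


def band_edges_hz(n_mels, fmin, fmax, htk):
    """n_mels + 2 frequencies, evenly spaced on the chosen mel scale."""
    lo, hi = hz_to_mel(fmin, htk), hz_to_mel(fmax, htk)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step, htk) for i in range(n_mels + 2)]


def slaney_gains(edges):
    """Per-filter gain that gives each triangle roughly unit area."""
    return [2.0 / (edges[i + 2] - edges[i]) for i in range(len(edges) - 2)]


htk_edges = band_edges_hz(40, 0.0, 8000.0, htk=True)
slaney_edges = band_edges_hz(40, 0.0, 8000.0, htk=False)
gains = slaney_gains(slaney_edges)
print(htk_edges[1], slaney_edges[1])  # first band centers differ between scales
```

The band edges land at different frequencies under the two scales, and the area normalization rescales every filter, so a filterbank built with librosa's defaults cannot match one built with norm=None, htk=True.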
There are also functional inconsistencies between the librosa and torchaudio calls: librosa.feature.melspectrogram returns a power mel spectrogram, whereas torchaudio additionally converts the spectrogram to the dB scale.
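For reference, the dB conversion that separates the two outputs can be sketched in plain Python. This mirrors what I understand librosa.power_to_db to do with its default arguments (ref=1.0, amin=1e-10, top_db=80.0); treat it as an approximation for comparison purposes, not as either library's actual implementation.

```python
import math


def power_to_db(power, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert power values to decibels relative to `ref`.

    Values below `amin` are clamped before the log, and the result is
    limited to at most `top_db` below its own maximum.
    """
    offset = 10.0 * math.log10(max(amin, ref))
    db = [10.0 * math.log10(max(amin, p)) - offset for p in power]
    if top_db is not None:
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db


db_values = power_to_db([1.0, 0.1, 0.01])
clamped = power_to_db([1.0, 1e-12])
print(db_values, clamped)
```

Applying a conversion like this to the librosa output (or skipping it on the torchaudio side) is needed before the two mel spectrograms can be compared element by element.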