
Use librosa for inference.py instead of torchaudio #29

Closed · AlexJian1086 opened this issue Sep 26, 2021 · 4 comments
Labels: question (Further information is requested)

@AlexJian1086 commented Sep 26, 2021

Hi, I was going through the inference pipeline and wanted to know if there is a way to replace the Kaldi fbank implementation with the librosa library. I am hoping to run it on my Jetson device, and Kaldi uses the MKL library, which is not suitable for ARM architectures.

I've tried multiple methods, but the results are not the same as Kaldi's fbank implementation. Any help would be appreciated. Thank you.

@JeffC0628 @YuanGongND

@AlexJian1086 (Author)

With reference to the paper's mel filterbank calculation, I am using the librosa.feature.melspectrogram() function to replace PyTorch's Kaldi implementation used in inference.py, but I am not sure how to replicate parameters such as the '25ms Hamming window every 10ms'. What would hop_length, n_fft, and win_length be for librosa? Please clarify.
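
For context, the Kaldi-compatible torchaudio call being replaced is presumably along these lines (a sketch only; the exact arguments in the repo's inference.py may differ, and 'sample.wav' is a placeholder):

```python
import torchaudio

# Load a mono 16 kHz waveform; shape is [channels, num_samples].
waveform, sr = torchaudio.load('sample.wav')

# Kaldi-compatible log-mel filterbank. frame_length/frame_shift are in ms,
# matching the paper's "25ms window every 10ms"; num_mel_bins=128 matches
# the paper's 128 mel filterbank bins.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sr,
    frame_length=25.0,
    frame_shift=10.0,
    num_mel_bins=128,
    window_type='hanning',
    use_energy=False,
    htk_compat=True,
    dither=0.0,
)
# fbank has shape [num_frames, 128]
```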

@YuanGongND (Owner)

Hi there,

Matching the outputs of librosa and torchaudio is out of the scope of this repo; you should consult the librosa or torchaudio authors. It might be hard to make them exactly the same, but you should be able to get similar output with appropriate parameters. Alternatively, you can train/fine-tune the model using the librosa-generated spectrogram.

Specifically for librosa.feature.melspectrogram(), hop_length should be 10ms, win_length should be 25ms, window should be scipy.signal.windows.hann, sr should be 16,000, n_fft should be 128.

-Yuan
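
At 16 kHz those values convert to hop_length = 160 samples (10 ms) and win_length = 400 samples (25 ms). A minimal sketch of the call follows, with two assumptions flagged in the comments: the '128' above is taken as n_mels (the paper's 128 mel bins), since librosa requires n_fft to be at least win_length in samples; and the log compression and epsilon floor are not specified in the thread:

```python
import librosa
import numpy as np

SR = 16000
HOP_LENGTH = int(0.010 * SR)   # 10 ms hop    -> 160 samples
WIN_LENGTH = int(0.025 * SR)   # 25 ms window -> 400 samples

y, _ = librosa.load('sample.wav', sr=SR)  # 'sample.wav' is a placeholder

# Assumption: n_mels=128 (the paper's 128 mel bins); n_fft=512 is the next
# power of two >= the 400-sample window, since librosa needs n_fft >= win_length.
mel = librosa.feature.melspectrogram(
    y=y,
    sr=SR,
    n_fft=512,
    hop_length=HOP_LENGTH,
    win_length=WIN_LENGTH,
    window='hann',           # string form of scipy.signal.windows.hann
    n_mels=128,
)

# Kaldi fbank outputs log energies; the epsilon floor here is an assumption.
logmel = np.log(mel + 1e-6)
# logmel has shape [128, num_frames] -- transposed relative to torchaudio's
# [num_frames, 128], so transpose before feeding the model.
```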

@AlexJian1086 (Author) commented Sep 28, 2021

Ah okay, thank you for clarification.
Although what exactly should I fine-tune here to achieve the desired results as inference pipeline for audioset, I assume the window size, overlap, mel bin etc would still remain same as provided in paper?

Also, are the fbanks calculated by torchaudio.compliance.kaldi.fbank the same as those from librosa.feature.melspectrogram() and python_speech_features.base.logfbank?

@YuanGongND (Owner)

The best way is to train and test using features extracted by the same toolkit. For audio event classification, you can just reuse our window size, overlap, etc. to save the time of searching for them; if your task is significantly different from audio event classification, you can consider using your own parameters.

The outputs of different toolkits might differ; you need experiments to confirm whether they are the same.

-Yuan
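
One way to run that experiment is to compute both features on the same clip and inspect the numerical difference. A hypothetical sketch (the parameter matching and the log/transpose handling are assumptions, and it assumes a mono 16 kHz 'sample.wav'):

```python
import librosa
import numpy as np
import torchaudio

SR = 16000
waveform, _ = torchaudio.load('sample.wav')  # assumed mono, 16 kHz

# Kaldi-style log-mel fbank via torchaudio: shape [num_frames, 128].
fb_kaldi = torchaudio.compliance.kaldi.fbank(
    waveform, sample_frequency=SR, frame_length=25.0, frame_shift=10.0,
    num_mel_bins=128, window_type='hanning', use_energy=False,
    htk_compat=True, dither=0.0,
).numpy()

# librosa mel spectrogram with matched window/hop, log-compressed and
# transposed to [num_frames, 128] for comparison.
y = waveform.squeeze().numpy()
mel = librosa.feature.melspectrogram(
    y=y, sr=SR, n_fft=512, hop_length=160, win_length=400,
    window='hann', n_mels=128,
)
fb_librosa = np.log(mel + 1e-6).T

# Frame counts can differ due to padding conventions; compare the overlap.
n = min(fb_kaldi.shape[0], fb_librosa.shape[0])
diff = np.abs(fb_kaldi[:n] - fb_librosa[:n])
print(f'mean |diff| = {diff.mean():.4f}, max |diff| = {diff.max():.4f}')
```

Even with matched window, hop, and mel-bin settings, expect residual differences: Kaldi-style fbank applies pre-emphasis and DC-offset removal by default, which librosa's pipeline does not.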

@YuanGongND added the question (Further information is requested) label on Oct 8, 2021