Dithering constant #371

Closed
Oktai15 opened this issue Dec 18, 2019 · 9 comments · Fixed by #453

@Oktai15
Contributor

Oktai15 commented Dec 18, 2019

Why do torchaudio.compliance.kaldi.fbank and torchaudio.compliance.kaldi.spectrogram have such a large default dither parameter (=1.0)? It very often just fills the whole output with noise.

It's common to use a dither close to 0, e.g. 0.00001 in QuartzNet and Jasper, which are near-SOTA ASR models (https://github.com/NVIDIA/NeMo/blob/master/examples/asr/configs/quartznet15x5.yaml).

I want to note that even the torchaudio tutorial uses dither = 0.0: https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html.

Also look at issue #157 and how it was resolved.

@vincentqb
Contributor

> Why do torchaudio.compliance.kaldi.fbank and torchaudio.compliance.kaldi.spectrogram have such a large default dither parameter (=1.0)? It very often just fills the whole output with noise.

Can you provide an example of code that produces noisy output with the default value?

> It's common to use a dither close to 0, e.g. 0.00001 in QuartzNet and Jasper, which are near-SOTA ASR models (https://github.com/NVIDIA/NeMo/blob/master/examples/asr/configs/quartznet15x5.yaml).

Can you provide the default value used by Kaldi or other software?

@vincentqb vincentqb self-assigned this Dec 23, 2019
@Oktai15
Contributor Author

Oktai15 commented Dec 23, 2019

About the example: as I already mentioned, @vincentqb, check your issue #157.

@vincentqb
Contributor

vincentqb commented Dec 26, 2019

Thanks for pointing this out. We should make sure that spectrogram, fbank, and mfcc use the same default.

From #157, it does seem like a value of 1 is large. If dither is set to zero though, the user should specify the energy_floor. Any thoughts on what a good default would be, and what other software does?

@vincentqb
Contributor

Addresses part of #263

@popcornell
Contributor

Seconding this: the default right now makes the whole torchaudio.compliance.kaldi feature set totally unusable out of the box.
I spent an hour looking for possible bugs in my labels, only to find out that my model was basically being fed noise because of the default dither value.

@Oktai15
Contributor Author

Oktai15 commented Mar 1, 2020

@popcornell I know that feel, bro (I had the same problem, which is why I created this issue).

@vincentqb
Contributor

vincentqb commented Mar 2, 2020

@popcornell @Oktai15 -- We're looking at what would be a good value to use. What would you say is a reasonable nonzero value? What values do other packages use, and which do you like?

@cpuhrsch -- I would have used 0. by default, but the implementation explicitly says to specify energy_floor in that case, see here. Do you have more context? Note that all the tests were done with dither=0.: see here for the original testing, and here, where the second parameter (dither) in the test file names is always 0., e.g. spec-XXX-0-....

In the absence of more information, I'd suggest dither=1e-5.

import torchaudio
import matplotlib.pyplot as plt
# In a Jupyter notebook, also run: %matplotlib inline

filename = "steam-train-whistle-daniel_simon.mp3"
s, sr = torchaudio.load(filename)

# Reference: torchaudio's own spectrogram of the first channel, on a log2 scale
spec0 = torchaudio.transforms.Spectrogram()(s)[0]
plt.imshow(spec0.log2().numpy(), cmap='gray')
plt.show()

# Kaldi-compatible spectrogram without dither
spec1 = torchaudio.compliance.kaldi.spectrogram(s, dither=0.)
plt.imshow(spec1.t().numpy(), cmap='gray')
plt.show()

# Kaldi-compatible spectrogram with the suggested small dither
spec2 = torchaudio.compliance.kaldi.spectrogram(s, dither=1e-5)
plt.imshow(spec2.t().numpy(), cmap='gray')
plt.show()

# Mean symmetric relative difference (printed as a fraction, not a percent)
print(2*((spec1 - spec2).abs()/(spec1.abs() + spec2.abs())).mean())
# We see an average absolute percentage difference of 0.25%.
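
For anyone who wants to try the comparison metric from the last line without torchaudio installed, here is a minimal pure-Python sketch of the same formula. The function name and the sample values are made up for illustration. Note that the formula returns a fraction, so it has to be multiplied by 100 to read as a percentage.

```python
def mean_symmetric_relative_difference(a, b):
    """Mean of 2*|x - y| / (|x| + |y|) over paired values.

    Hypothetical helper for illustration. Returns a fraction, not a percent,
    and assumes no pair is (0, 0), which would divide by zero.
    """
    terms = [2 * abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b)]
    return sum(terms) / len(terms)


# Made-up values standing in for two nearly identical spectrograms.
reference = [10.0, 20.0, 40.0]
perturbed = [10.1, 19.9, 40.0]

fraction = mean_symmetric_relative_difference(reference, perturbed)
print(f"{100 * fraction:.3f}%")
```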

@vincentqb
Contributor

Based on this discussion, we'll simply set dither to 0 and energy_floor to 1 by default. This also seems to behave very similarly to a small dither value, see below.

import torchaudio
import matplotlib.pyplot as plt
# In a Jupyter notebook, also run: %matplotlib inline

filename = "/Users/vincentqb/audio/test/assets/steam-train-whistle-daniel_simon.mp3"
s, sr = torchaudio.load(filename)

# Reference: torchaudio's own spectrogram of the first channel, on a log2 scale
spec0 = torchaudio.transforms.Spectrogram()(s)[0]
plt.imshow(spec0.log2().numpy(), cmap='gray')
plt.show()

# Proposed defaults: no dither, energy floor of 1
spec1 = torchaudio.compliance.kaldi.spectrogram(s, dither=0., energy_floor=1.)
plt.imshow(spec1.t().numpy(), cmap='gray')
plt.show()

# Small nonzero dither for comparison
spec2 = torchaudio.compliance.kaldi.spectrogram(s, dither=1e-6)
plt.imshow(spec2.t().numpy(), cmap='gray')
plt.show()

# Mean symmetric relative difference (printed as a fraction, not a percent)
print(2*((spec1 - spec2).abs()/(spec1.abs() + spec2.abs())).mean())
# We see an average absolute percentage difference of 0.16%.
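
A rough sketch of why dither=0 goes together with an energy floor, assuming the energy term is the log of a frame's sum of squares: without dither, an all-zero (silent) frame has zero energy, and its log is minus infinity unless it is clamped. The helper below is hypothetical and is not torchaudio's actual implementation.

```python
import math


def floored_log_energy(frame, energy_floor=0.0):
    """Log of the frame's sum of squares, clamped from below at log(energy_floor).

    Hypothetical helper for illustration only.
    """
    energy = sum(x * x for x in frame)
    # math.log(0.0) raises ValueError, so map zero energy to -inf explicitly.
    log_energy = math.log(energy) if energy > 0.0 else float("-inf")
    if energy_floor > 0.0:
        log_energy = max(log_energy, math.log(energy_floor))
    return log_energy


silent_frame = [0.0] * 400  # 400 samples of digital silence

print(floored_log_energy(silent_frame))                    # no floor
print(floored_log_energy(silent_frame, energy_floor=1.0))  # floor of 1
```

With no floor the silent frame yields minus infinity, which would propagate into any downstream statistics. With a floor of 1 it yields log(1) = 0.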

@csukuangfj
Collaborator

> Why do torchaudio.compliance.kaldi.fbank and torchaudio.compliance.kaldi.spectrogram have so large dither default parameter (=1.0)

Kaldi uses 1 as the default dither value. That is fine for Kaldi because a waveform in Kaldi has the range [-32768, 32767], so 1 is small relative to the maximum value 32767.

However, in torchaudio,

torchaudio.load(filename)

returns a tensor with values in the range [-1, 1]. So if you keep Kaldi's default value of 1, you will distort the audio signal.
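
To put rough numbers on the scale argument, here is a back-of-the-envelope sketch. It assumes dithering adds zero-mean Gaussian noise whose standard deviation equals the dither value, and it uses a full-scale sine wave as the test signal. Both are simplifying assumptions, and the function name is made up.

```python
import math


def dither_snr_db(peak_amplitude, dither):
    """Signal-to-noise ratio in dB of a sine wave against additive dither noise.

    Assumes noise power equals dither**2 (Gaussian noise with std = dither).
    Hypothetical helper for illustration only.
    """
    signal_power = peak_amplitude ** 2 / 2.0  # mean power of a sine wave
    noise_power = dither ** 2
    return 10.0 * math.log10(signal_power / noise_power)


# Kaldi-style 16-bit range with dither=1: noise far below the signal.
print(f"{dither_snr_db(32767.0, 1.0):.1f} dB")
# torchaudio-style [-1, 1] range with dither=1: noise stronger than the signal.
print(f"{dither_snr_db(1.0, 1.0):.1f} dB")
# torchaudio-style range with dither=1e-5, as suggested earlier in the thread.
print(f"{dither_snr_db(1.0, 1e-5):.1f} dB")
```

Under these assumptions the same dither value of 1 is roughly 87 dB below a full-scale signal in Kaldi's range but about 3 dB above it in the [-1, 1] range, which matches the "model was fed noise" reports above.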
