New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable mel_scale option #593
Conversation
torchaudio/functional.py
Outdated
@@ -431,6 +496,7 @@ def create_fb_matrix( | |||
f_max (float): Maximum frequency (Hz) | |||
n_mels (int): Number of mel filterbanks | |||
sample_rate (int): Sample rate of the audio waveform | |||
htk (bool): Use HTK formula instead of Slaney |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this instead be a flag that allows the user to pass a selection (e.g. "Slaney" or "HTK") so that we could add more formulas later on (if that could come up)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The htk
boolean flag is a convention from librosa. Someone had also suggested mel_scale="htk"|"slaney"
, see here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason htk (bool) is preferred over the mel_scale flag (outside of it being a convention from a popular library)? In any case, I, of course, trust your judgement here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see other reasons. I prefer mel_scale
personally as it is more explicit about which convention is being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to mel_scale
torchaudio/functional.py
Outdated
|
||
return mels | ||
else: | ||
raise ValueError('mel_scale should be one of "htk" or "slaney".') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, it is more readable if argument validation is carried out at the beginning of the function.
Also, (this is kind of nit) when one branch of if ~ elif ~ else
clause is much longer than the others, extracting the content to separate function improves readability.
i.e.
if mel_scale not in ['slaney', 'htk']:
raise ValueError('mel_scale should be one of "htk" or "slaney".')
if mel_scale == 'htk':
return 2595.0 * torch.log10(torch.tensor(1.0 + (freq / 700.0), dtype=torch.float64))
else:
return _hz_to_mel_slaney(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with checking at the beginning. For the readability, since this is already a private function, how does the following looks to you?
if mel_scale not in ['slaney', 'htk']:
raise ValueError('mel_scale should be one of "htk" or "slaney".')
if mel_scale == "htk":
return 2595.0 * torch.log10(torch.tensor(1.0 + (freq / 700.0), dtype=torch.float64))
# Fill in the linear part
f_min = 0.0
f_sp = 200.0 / 3
mels = (freq - f_min) / f_sp
# Fill in the log-scale part
min_log_hz = 1000.0
min_log_mel = (min_log_hz - f_min) / f_sp
logstep = math.log(6.4) / 27.0
log_t = freq >= min_log_hz
mels[log_t] = min_log_mel + torch.log(torch.tensor(freq[log_t] / min_log_hz, dtype=torch.float64)) / logstep
return mels
torchaudio/functional.py
Outdated
Returns: | ||
mels (Tensor): Input frequencies in Mels | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea to add a note on why we have to use float64
for log10
.
torchaudio/functional.py
Outdated
def create_fb_matrix( | ||
n_freqs: int, | ||
f_min: float, | ||
f_max: float, | ||
n_mels: int, | ||
sample_rate: int | ||
sample_rate: int, | ||
mel_scale: str = "htk", | ||
) -> Tensor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the new implementation returns float64 Tensor whereas the original implementation returns float32. If so, this is BC breaking.
Either case, I think it's good idea to document what dtype the resulting tensor is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll cast to float32
at the end of the function. We can always add a dtype
parameter to create_fb_matrix
if other types are needed later.
The dtype returned should be documented in all functions indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversions like that should incur a fully copy. I'd double check what the performance implications of using float64 are. We could also set expectations and let the user pass float64 Tensors if they want higher precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the idea, but I would leave that discussion as a separate PR, since this may affect the outcome of #611.
torchaudio/functional.py
Outdated
@@ -446,22 +515,25 @@ def create_fb_matrix( | |||
# Equivalent filterbank construction by Librosa | |||
all_freqs = torch.linspace(0, sample_rate // 2, n_freqs) | |||
|
|||
# torch.log10 with float32 produces different results on different CPUs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note added in code since this is not visible to the end-user
torchaudio/functional.py
Outdated
raise ValueError('mel_scale should be one of "htk" or "slaney".') | ||
|
||
if mel_scale == "htk": | ||
return 2595.0 * torch.log10(1.0 + (freq / 700.0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way of rearranging the calculation to improve numerical stability. We know that all values will be greater than 1, if that helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying torch.log10((700. + freq) / 700.)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are typical ranges on values within freq and do our tests mimic those adequately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
torch.log10((700. + freq) / 700.)
behaves better and doesn't require float conversions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't necessarily just rely on the librosa tests. We should understand how this is used for our own sake and to have something to point to when it fails outside of those domains.
Rebased, and extended to MelScale, MelSpectrogram. |
torchaudio/functional/functional.py
Outdated
raise ValueError('mel_scale should be one of "htk" or "slaney".') | ||
|
||
if mel_scale == "htk": | ||
return 2595.0 * torch.log10((700.0 + freq) / 700.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep math.log10
as before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that break torchscript compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though I rewrote the function with the math library and it is also torchscriptable now
librosa_mel = librosa.feature.melspectrogram( | ||
y=sound_librosa, sr=sample_rate, n_fft=n_fft, | ||
hop_length=hop_length, n_mels=n_mels, htk=True, norm=norm) | ||
hop_length=hop_length, n_mels=n_mels, htk=mel_scale == "htk", norm=norm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
sample_rate = 16000 | ||
sound = common_utils.get_sinusoid(n_channels=1, sample_rate=sample_rate) | ||
sound_librosa = sound.cpu().numpy().squeeze() | ||
melspect_transform = torchaudio.transforms.MelSpectrogram( | ||
sample_rate=sample_rate, window_fn=torch.hann_window, | ||
hop_length=hop_length, n_mels=n_mels, n_fft=n_fft, norm=norm) | ||
hop_length=hop_length, n_mels=n_mels, n_fft=n_fft, norm=norm, mel_scale=mel_scale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 👍
Current error:
|
Tests are now green, but for one unrelated here. No test thresholds needed to be adjusted for the current implementation, and the implementation remains the same as before for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks!
Add
mel_scale
option enabling the htk or slaney mel scale, see here for discussion and librosa.Relates #589, internal, (#608?), (#259 for mel-hz functional)