-
Notifications
You must be signed in to change notification settings - Fork 633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding normalized=True options to MelScale and MelSpectrogram modules #426
Conversation
to allow area normalization
b56daeb
to
e623700
Compare
Fix incoming! |
Is there value in pulling out this normalization approach and making it a separate transform? |
IMHO it wouldn't be necessary; it's still Mel spectrogram (and MFCC) conceptually, and people typically use whichever is the default in the toolkit. Many papers don't bother to report the Mel scale or normalization. Still, It'd be nicer to have numerical interoperability as discussed in #287, hence this PR! |
@@ -229,10 +229,10 @@ def _test_librosa_consistency_helper(n_fft, hop_length, power, n_mels, n_mfcc, s | |||
# test mel spectrogram | |||
melspect_transform = torchaudio.transforms.MelSpectrogram( | |||
sample_rate=sample_rate, window_fn=torch.hann_window, | |||
hop_length=hop_length, n_mels=n_mels, n_fft=n_fft) | |||
hop_length=hop_length, n_mels=n_mels, n_fft=n_fft, normalized=mel_norm == "slaney") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From comment, there seems to be more than one way to normalize, is that the case? If so, wouldn't it be cleaner to just say normalized="slaney"
?
However, if we go this direction, I second @cpuhrsch, it may be valuable to have a SlaneyScale
transform like the MelScale
transform that torchaudio has instead. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion; the comments ("htk" vs "slaney") was about the mel-hz conversion methods, while this PR addresses an orthogonal option where the height of the triangular filters can be adjusted to have the same area. (figure)
This (and librosa's API) is somewhat confusing because there are the Mel scale from Slaney and Slaney normalization, both referring to the auditory toolbox by Malcolm Slaney.
So for this specific issue I'd suggest using normalized
or area_normalize
for the names.
NOT related to the PR but regarding the Mel scale options:
We can later consider supporting other Mel scales (which is good to have for better librosa compliance but not my top priority).
Even in that case, having SlaneyScale
and MelScale
would be quite confusing; both are Mel scales, and there is no "standard" Mel scale used in literature (which is quite unfortunate).
@@ -406,6 +406,7 @@ def create_fb_matrix(n_freqs, f_min, f_max, n_mels, sample_rate): | |||
f_max (float): Maximum frequency (Hz) | |||
n_mels (int): Number of mel filterbanks | |||
sample_rate (int): Sample rate of the audio waveform | |||
normalized (bool, optional): area-normalize the FB, so that each filter has equal area |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "If True, normalize each filter to have equal area. If False, normalize each filter to have equal height. (Default: False
)"
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love it! But before commiting these changes, I just learned about the upcoming chainge in librosa.filters.mel where the area normalizatioin has two variants, so people may choose from:
norm=None
: same as torchaudio's default MelScalenorm="slaney"
: the very new default (since 0.7.2, in Jan 13, 2020); same as the area normalization in this PR, dividing the filter by the width between the adjacent mel frequencies. While this makes the L1 norm of each filter slightly different, this was preferred historically because it is regression tested to be bit identical with the MATLAB counterpart.norm=1
: area normalization that has been the default historically and behaves exactly the same asnorm="slaney"
, but will change behavior in future librosa (likely 0.8), where the components will exactly sum to 1 (i.e. L1-normalized)norm=p
: similar to the above, using Lp-norm, but there's no practical reason to use this.
Do you think we should follow their convention instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good to know! Given how the implementation is just a small division at the end of the computation on the filterbanks, I was still considering something like MelScale
, SlaneyScale
. However, if we add a MelPScale
with a parameter for p-norm, it seems like we are back to having to provide a parameter.
Let' just use their new convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll modify the current PR to have librosa-0.8-style norm=
options instead.
(Again, below not pertaining to this PR)
I still think the naming SlaneyScale
on top of MelScale
would be confusing as commented above, because:
- The Mel scale from the MATLAB auditory toolbox (referred to as "slaney") is also a Mel scale. It might give an impression that it is not, or less legitimate than the HTK Mel scale.
- What the current
MelScale
class represents is actually a Mel filter bank, because the term "Mel scale" simply refer to a series of Mel frequencies mapped from Hz (by a function that is roughly linear below 1kHz and roughly logarithmic above 1kHz). TheMelScale
class contains a set of triangular filters constructed from the Mel scale. SoMelFilterBank
would have been a better naming. - With
MelScale
encapsulating both the scale AND the filterbank, it is not quite clear in the nameSlaneyScale
if "slaney" is referring to the scale or the normalization method, and we too seem to be confused throughout the comments on this PR. - Nobody uses the term "slaney scale"; when googled with quotes, I see only three irrelevant results and none in Google Scholar.
Regardless of the future naming of SlaneyScale
(or mel_scale="htk"|"slaney"
as I'd prefer), it should only apply to the Mel scale, and it is orthogonal to the norm
options this PR is addressing.
I'll re-open the PR once the librosa 0.8+ style |
The change will be around standardizing |
Yes! I didn't mean to block on them but rather getting the code ready on my side 😄 |
This allows area normalization for Mel spectrogram (See #287 ), where the filterbank
In the same issue I also mentioned adding the Slaney mel frequencies in addition to HTK ones, but as the Slaney scale is quite cumbersome (manually connect the linear and logarithmic pieces), I'm not sure if the maintainers thinks it's worth adding it at the moment. So adding the area normalization only for now.
One caveat might be that the namining might be confusing since
normalized
is also used in theSpectrogram
module, but I couldn't come up with better choice. Other sensible choices for naming includenormalization=None | "area"
orarea_normalize=False | True
, and I am happy to change the naming to either.For the test, I've added one more case in the librosa consistency checks. CR appreciated!