-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
torchaudio-contrib: some augmentations #285
Conversation
torchaudio/functional.py
Outdated
|
||
|
||
@torch.jit.script | ||
def stft(waveform, pad, window, n_fft, hop_length, win_length): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stft is already in pytorch core. I know that not having batching is annoying but replicating a function across both libraries is too. We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?
👍
Gotcha, so I can take this out no problem and then have the STFT
layer just use torch.stft
? I think the bigger point for this was to have the non-normalized output of stft as a functional/layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying there's an operation provided by torchaudio that normalizes the output when you would like it not to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so Spectrogram
calls the spectrogram
functional which is basically computing the stft and then getting the power of the complex tensor (saying "normalizing" might have confusing on my part, sorry).
But in order to use the phase vocoder, the TimeStretch
transform has to be applied to the output of the stft
before the complex norm (i.e. before doingspec_f = spec_f.pow(power).sum(-1)
).
So that's why I was kinda wanting torchaudio to have a transform that gives the complex STFT
like we have in contrib, because if someone then wants to work with randomly stretched spectrograms, they can simply do:
nn.Sequential(STFT, TimeStretch, ComplexNorm, ...)
whereas currently STFT
and ComplexNorm
are "coupled" together as Spectrogram
, and we can't put the time stretching in between (i.e. nn.Sequential(Spectrogram, TimeStretch, ...)
won't work).
In this PR, we can focus on single inputs in order to avoid adding STFT here again, as mentioned by @cpuhrsch. For batching, I would prefer waiting on a standardized approach. If that is so important, I'd open a separate PR for that so we don't block the rest :) |
Quick thought: I'm not necessarily suggesting to change something, but maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Eventually, it might be nice to add the snippet as a quick test to ensure all work well together in the future.
Yeah that's actually how we have it in -contrib. The layers are STFT, ApplyFilterbank, TimeStretch, etc. So then if say you want a db melspectrogram, simply chain |
I'd say we write the implementations here as if |
So I've removed the functional and transform complex-and-also-batched |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added some tests for the masking functionals. Are the checks I wrote enough? I can add comments and/or more checks if needed. Should the next steps be writing some preliminary tests for the augmentation layers?
Is there something in SpecAugment we could compare against? Otherwise, the tests and the demo you provided here seems good to me.
I can also put a gist together showing how to chain these transforms.
If you have a demo of some code that could be fun to show, it could be shown as a new torchaudio tutorial, e.g. here, or example. :) These are useful since they also provide an integration test. This can be done as a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. LGTM!
Hmm I'll look into it! I think the check we have to make sure the masked # of columns is < the mask parameter is a good start but there may be other things to compare with.
Yeah absolutely. Happy to do a separate PR with a demo/tutorial once we've sorted out the |
PS: let's not worry about this in this PR, but we try to make the transforms be thin wrappers around functionals as much as possible. :) |
This reverts commit 008791c.
This reverts commit 008791c.
This reverts commit 008791c.
* STFT transform and function from #285 * merge options in existing functionality. * remove dimension 2 check. add test. * using ... * update spectrogram test.
Adding augmentations:
TimeStretch
,TimeMasking
andFrequencyMasking
.@vincentqb @cpuhrsch @keunwoochoi
On complex STFT
We had discussions about this in torchaudio-contrib, about how it's useful to have a transform that outputs the complex stft. So basically the same as the current
Spectrogram
but without normalizing the output. In our previous PR we added thecomplex_norm
to the functionals so then theSpectrogram
transform could just call the complex stft functional (which will basically just wraptorch.stft
plus whatever padding we're currently doing, batching (?)) and then pass the output tocomplex_norm
functional.An example where this can be needed is for example with the introduction of the
TimeStretch
layer. It takes as input a complex spectrogram so if a user wanted to make a module where a spectrogram is computed and stretched it could look something like:I have added a
STFT
transform as well as astft
functional to be able to showcase theTimeStretch
augmentation (we have something like this in torchaudio-contrib), but of course this should be reworked once we get some feedback!On batching
I've made
STFT
be able to handle a batch dimension, and so can the augmentation layers. I can't seem to find it but I remember a discussion on how the layers should handle batch of inputs. Should the layers apply transforms only to single inputs for now? I can make changes if that is the case.On the augmentations
So we talked about what augmentations to do in #259 about doing
TimeStretch
,TimeMasking
,FrequencyMasking
andPitchShift
but I've only included the first three in this PR since pitch shifting I think will require a bit more discussion and this can be a solid first step. I haven't included the tests for them yet since I figured many things might change after feedback is given. A quick summary of what they do:Time Stretching
Simply wrapping and allowing for batching for the
phase_vocoder
functional introduced in the last PR. A fixed rate can be passed when initializing, or a (randomly generated if doing augmentation) rate can be given to the forward method each time. Looks like:Time and Frequency masking
From SpecAugment, apply masks of a desired value to an input spectrogram. I've included a flag in case the input is batched and the user wants the masks to be independent of each other. Examples:
Same mask for batch:
Independent mask for batch:
Some demo code:
gives: