torchaudio-contrib: some augmentations #285

Merged: 11 commits merged into pytorch:master on Sep 20, 2019

Conversation

@ksanjeevan (Contributor) commented Sep 17, 2019

Adding augmentations: TimeStretch, TimeMasking and FrequencyMasking.

@vincentqb @cpuhrsch @keunwoochoi

On complex STFT

We discussed this in torchaudio-contrib: it's useful to have a transform that outputs the complex STFT, basically the same as the current Spectrogram but without normalizing the output. In our previous PR we added complex_norm to the functionals, so the Spectrogram transform could just call a complex STFT functional (which would basically wrap torch.stft plus whatever padding we're currently doing, and possibly batching) and then pass the output to the complex_norm functional.
One example where this is needed is the new TimeStretch layer. It takes a complex spectrogram as input, so if a user wanted to build a module that computes and then stretches a spectrogram, it could look something like:

model = nn.Sequential(
    STFT(n_fft=n_fft, hop_length=hop_length),
    TimeStretch(freq=num_freqs, hop_length=hop_length, fixed_rate=1.3),
    ComplexNorm(power=2.0),
    AmplitudeToDB(),
)

I have added an STFT transform as well as an stft functional to showcase the TimeStretch augmentation (we have something like this in torchaudio-contrib), but of course this should be reworked once we get some feedback! A rough sketch of the idea is below.
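For concreteness, here is a minimal sketch of what such an STFT transform could look like. This is not the code in this PR; the padding and the return_complex/view_as_real handling are assumptions (recent torch.stft returns a complex tensor, whereas at the time of this PR it returned a trailing real/imag dimension):

import torch
from torch import nn

class STFT(nn.Module):
    # Illustrative sketch only: a thin wrapper around torch.stft that returns the
    # complex STFT as a real tensor of shape (..., freq, time, 2).
    def __init__(self, n_fft=400, hop_length=None, pad=0):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length if hop_length is not None else n_fft // 2
        self.pad = pad
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, waveform):
        if self.pad > 0:
            waveform = torch.nn.functional.pad(waveform, (self.pad, self.pad))
        spec = torch.stft(waveform, n_fft=self.n_fft, hop_length=self.hop_length,
                          window=self.window, return_complex=True)
        return torch.view_as_real(spec)  # trailing real/imag dim (the layout used at the time)

A module like this would then slot directly into the nn.Sequential chain above.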

On batching

I've made STFT able to handle a batch dimension, and the augmentation layers can as well. I can't find it now, but I remember a discussion on how the layers should handle batches of inputs. Should the layers apply transforms only to single inputs for now? I can make changes if so.

On the augmentations

In #259 we talked about adding TimeStretch, TimeMasking, FrequencyMasking, and PitchShift, but I've only included the first three in this PR, since I think pitch shifting will require a bit more discussion and these can be a solid first step. I haven't included tests for them yet, since I figured many things might change once feedback comes in. A quick summary of what they do:

Time Stretching

This simply wraps the phase_vocoder functional introduced in the last PR and allows for batching. A fixed rate can be passed at initialization, or a rate (randomly generated when used for augmentation) can be given to the forward method on each call. It looks like:
[plot: pv, time-stretched spectrogram]
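To make the fixed-rate vs. per-call-rate behavior concrete, here is a minimal sketch of such a wrapper. It is not the PR code; the phase_advance formula and the (..., freq, time, 2) layout are assumptions based on how torchaudio.functional.phase_vocoder was used at the time (recent torchaudio versions expect a complex-valued tensor here instead):

import math
import torch
from torch import nn
from torchaudio.functional import phase_vocoder

class TimeStretch(nn.Module):
    # Illustrative sketch: stretch a complex spectrogram in time without changing pitch.
    def __init__(self, hop_length, n_freq, fixed_rate=None):
        super().__init__()
        self.fixed_rate = fixed_rate
        # One phase advance per frequency bin, precomputed once.
        self.register_buffer("phase_advance",
                             torch.linspace(0, math.pi * hop_length, n_freq)[..., None])

    def forward(self, complex_spec, overriding_rate=None):
        rate = overriding_rate if overriding_rate is not None else self.fixed_rate
        if rate is None or rate == 1.0:
            return complex_spec
        return phase_vocoder(complex_spec, rate, self.phase_advance)

For augmentation, the caller can sample a fresh rate for every batch and pass it as overriding_rate.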

Time and Frequency masking

From SpecAugment: apply masks of a desired value to an input spectrogram. I've included a flag for the case where the input is batched and the user wants the masks to be independent of each other. Examples:
Same mask for the whole batch: [plot: no_iid]
Independent masks per batch example: [plot: iid]
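For intuition, here is a pure-PyTorch sketch of the masking idea. It is an illustrative stand-in, not the functional added in this PR, and the (batch, freq, time) shape is an assumption:

import torch

def apply_random_mask(spec, mask_param, mask_value=0.0, axis=2):
    # SpecAugment-style mask: set a random block of at most mask_param consecutive
    # bins along `axis` (1 = frequency, 2 = time) to mask_value, using the same
    # block for every example in the batch.
    length = spec.size(axis)
    width = int(torch.randint(0, mask_param + 1, (1,)))
    start = int(torch.randint(0, max(length - width, 0) + 1, (1,)))
    idx = torch.arange(length, device=spec.device)
    mask = (idx >= start) & (idx < start + width)
    shape = [1] * spec.dim()
    shape[axis] = length
    return spec.masked_fill(mask.view(shape), mask_value)

spec = torch.rand(4, 201, 100)
masked = apply_random_mask(spec, mask_param=30, axis=2)  # same time mask for the whole batch

The iid_masks=True behavior would instead sample a separate (start, width) pair per batch example.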

Some demo code:

import torch
from torch import nn
import librosa

# STFT, TimeStretch, FrequencyMasking and TimeMasking are the transforms from this PR;
# plot_heatmap is a local plotting helper.
num_freqs, hop_length = 400, 512
model = nn.Sequential(
    STFT(n_fft=(num_freqs - 1) * 2, hop_length=hop_length),
    TimeStretch(freq=num_freqs, hop_length=hop_length, fixed_rate=1.3),
    ComplexNorm(power=2.0),
    FrequencyMasking(freq_mask_param=60, iid_masks=False),
    TimeMasking(time_mask_param=30, iid_masks=False),
    AmplitudeToDB(),
)
inp = torch.as_tensor(librosa.load('file.wav')[0]).view(1, -1)
out = model(inp)
plot_heatmap(out)

gives:
[plot: output heatmap]



@torch.jit.script
def stft(waveform, pad, window, n_fft, hop_length, win_length):
Contributor (inline review comment on the stft functional above)
stft is already in pytorch core. I know that not having batching is annoying but replicating a function across both libraries is too. We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?

@ksanjeevan (Contributor Author)

We're working on a more principled abstraction around batching, but we'd like to avoid having a reimplementation of this. Maybe we can do this in a separate PR?

👍

Gotcha, I can take this out, no problem, and have the STFT layer just use torch.stft. I think the bigger point here was to have the non-normalized output of the STFT available as a functional/layer.

Contributor

Are you saying there's an operation provided by torchaudio that normalizes the output when you would like it not to?

@ksanjeevan (Contributor Author)

Yeah, so Spectrogram calls the spectrogram functional, which basically computes the STFT and then takes the power of the complex tensor (saying "normalizing" might have been confusing on my part, sorry).

But in order to use the phase vocoder, the TimeStretch transform has to be applied to the output of the STFT before the complex norm (i.e. before doing spec_f = spec_f.pow(power).sum(-1)).

So that's why I'd like torchaudio to have a transform that gives the complex STFT, like we have in -contrib, because anyone who then wants to work with randomly stretched spectrograms can simply do:

nn.Sequential(STFT, TimeStretch, ComplexNorm, ...)

whereas currently STFT and ComplexNorm are "coupled" together as Spectrogram, and we can't put the time stretching in between (i.e. nn.Sequential(Spectrogram, TimeStretch, ...) won't work).
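In code, the ordering problem looks roughly like this (a sketch with illustrative names, not the torchaudio API; the final line mirrors the spec_f.pow(power).sum(-1) step quoted above, and return_complex/view_as_real stand in for the torch.stft behavior at the time):

import torch

waveform = torch.randn(1, 16000)
window = torch.hann_window(400)

# Decoupled pipeline: the complex STFT is exposed, so a stretch can act on it.
spec_f = torch.view_as_real(torch.stft(waveform, n_fft=400, hop_length=200,
                                        window=window, return_complex=True))
# ... TimeStretch / phase_vocoder would operate on spec_f here ...
power = 2.0
spec = spec_f.pow(power).sum(-1)  # the power step currently fused into Spectrogram

# With the coupled Spectrogram transform, spec_f is never exposed, so there is
# nowhere to insert the stretch between the STFT and the power step.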

@vincentqb (Contributor)

I've made STFT able to handle a batch dimension, and the augmentation layers can as well. I can't find it now, but I remember a discussion on how the layers should handle batches of inputs. Should the layers apply transforms only to single inputs for now? I can make changes if so.

In this PR, we can focus on single inputs in order to avoid adding STFT here again, as mentioned by @cpuhrsch. For batching, I would prefer waiting on a standardized approach. If it's that important, I'd open a separate PR for it so we don't block the rest :)

@vincentqb (Contributor)

We discussed this in torchaudio-contrib: it's useful to have a transform that outputs the complex STFT, basically the same as the current Spectrogram but without normalizing the output. In our previous PR we added complex_norm to the functionals, so the Spectrogram transform could just call a complex STFT functional (which would basically wrap torch.stft plus whatever padding we're currently doing, and possibly batching) and then pass the output to the complex_norm functional.

Quick thought: I'm not necessarily suggesting we change anything, but maybe Spectrogram should have been two operations then :)

@vincentqb (Contributor) left a comment

Thanks for working on this! Eventually, it might be nice to add the snippet as a quick test to ensure everything keeps working well together.

@ksanjeevan (Contributor Author)

Quick thought: I'm not necessarily suggesting we change anything, but maybe Spectrogram should have been two operations then :)

Yeah, that's actually how we have it in -contrib. The layers are STFT, ApplyFilterbank, TimeStretch, etc., so if you want, say, a dB mel spectrogram, you simply chain nn.Sequential(STFT, ComplexNorm, ApplyFilterbank, AmplitudeToDb). We don't have an explicit Spectrogram layer, although I guess having both options could be OK?

@vincentqb (Contributor) commented Sep 17, 2019

Yeah, that's actually how we have it in -contrib. The layers are STFT, ApplyFilterbank, TimeStretch, etc., so if you want, say, a dB mel spectrogram, you simply chain nn.Sequential(STFT, ComplexNorm, ApplyFilterbank, AmplitudeToDb). We don't have an explicit Spectrogram layer, although I guess having both options could be OK?

I'd say we write the implementations here as if Spectrogram does not exist. We can then decide to axe it, keep it, or turn it into a wrapper, keeping in mind backward compatibility.

@ksanjeevan (Contributor Author) commented Sep 18, 2019

So I've removed the complex (and batched) STFT functional and transform, per @cpuhrsch's and @vincentqb's comments. I've also added some tests for the masking functionals. Are the checks I wrote enough? I can add comments and/or more checks if needed. Should the next step be writing some preliminary tests for the augmentation layers? I can also put a gist together showing how to chain these transforms.

@vincentqb (Contributor) left a comment

I've also added some tests for the masking functionals. Are the checks I wrote enough? I can add comments and/or more checks if needed. Should the next step be writing some preliminary tests for the augmentation layers?

Is there something in SpecAugment we could compare against? Otherwise, the tests and the demo you provided here seem good to me.

I can also put a gist together showing how to chain these transforms.

If you have a demo of some code that could be fun to show, it could become a new torchaudio tutorial, e.g. here, or an example. :) These are useful since they also provide an integration test. This can be done as a separate PR.

(Outdated, resolved review thread on test/test_functional.py)
@vincentqb (Contributor) left a comment

Thanks for working on this. LGTM!

@ksanjeevan (Contributor Author)

Is there something in SpecAugment we could compare against? Otherwise, the tests and the demo you provided here seem good to me.

Hmm, I'll look into it! I think the check we have, making sure the number of masked columns is less than the mask parameter, is a good start, but there may be other things to compare against.
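As a concrete illustration of that kind of check, here is a small sketch under assumed shapes; the masking functional itself is left as a commented-out placeholder, since its exact name and signature are what this PR is still settling:

import torch

def count_masked_columns(masked, axis, mask_value=0.0):
    # A column along `axis` counts as masked only if every bin in it equals mask_value.
    is_masked = (masked == mask_value)
    for d in sorted((d for d in range(1, masked.dim()) if d != axis), reverse=True):
        is_masked = is_masked.all(dim=d)
    return is_masked.sum(dim=-1)  # masked columns per batch example

spec = torch.rand(2, 201, 100) + 1.0   # strictly positive, so 0 only appears where masked
mask_param = 30
# masked = time_mask(spec, mask_param=mask_param, mask_value=0.0)   # placeholder functional
# assert (count_masked_columns(masked, axis=2) <= mask_param).all()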

If you have a demo of some code that could be fun to show, it could become a new torchaudio tutorial, e.g. here, or an example. :) These are useful since they also provide an integration test. This can be done as a separate PR.

Yeah absolutely. Happy to do a separate PR with a demo/tutorial once we've sorted out the STFT stuff (just so it's cleaner for the user).

@vincentqb self-assigned this Sep 19, 2019
@vincentqb (Contributor)

PS: let's not worry about this in this PR, but we try to make the transforms thin wrappers around functionals as much as possible. :)

@vincentqb merged commit 5c0773f into pytorch:master Sep 20, 2019
vincentqb added commits to vincentqb/audio that referenced this pull request on Nov 6 and Nov 18, 2019
vincentqb added a commit that referenced this pull request Nov 18, 2019
* STFT transform and function from #285

* merge options in existing functionality.

* remove dimension 2 check. add test.

* using ...

* update spectrogram test.
3 participants