-
Notifications
You must be signed in to change notification settings - Fork 635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complex STFT transform from spectrogram #327
Conversation
dbe4f0c
to
008791c
Compare
@vincentqb Yeah that would work! So I'm not seeing anything on |
# default values are consistent with librosa.core.spectrum._spectrogram | ||
spec_f = _stft( | ||
waveform, n_fft, hop_length, win_length, window, True, "reflect", False, True | ||
) | ||
|
||
# unpack batch | ||
spec_f = spec_f.reshape(shape[:-1] + spec_f.shape[-3:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vincentqb is support for batching going to be done optionally then? 😁 I can add it to the augmentation transforms if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we decided to go ahead and add batching using reshape now :) we'll then update these codes when nested tensor comes out
Right, there should only be a check for none to disable the power computation |
torchaudio/functional.py
Outdated
r""" | ||
spectrogram(waveform, pad, window, n_fft, hop_length, win_length, power, normalized) | ||
|
||
Create a spectrogram from a raw audio signal. | ||
|
||
Args: | ||
waveform (torch.Tensor): Tensor of audio of dimension (channel, time) | ||
waveform (torch.Tensor): Tensor of audio of dimension ([batch,] channel, time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not directly about this PR, but probably the optional last dimension for complex stft should be mentioned as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention was specifying the optional batch dimension as (*, channel, time)
. Maybe it's nicer to follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, thanks for pointing this out!
torchaudio/functional.py
Outdated
@@ -218,42 +218,51 @@ def istft( | |||
def spectrogram( | |||
waveform, pad, window, n_fft, hop_length, win_length, power, normalized | |||
): | |||
# type: (Tensor, int, Tensor, int, int, int, int, bool) -> Tensor | |||
# type: (Tensor, int, Tensor, int, int, int, Optional[int], bool) -> Tensor | |||
r""" | |||
spectrogram(waveform, pad, window, n_fft, hop_length, win_length, power, normalized) | |||
|
|||
Create a spectrogram from a raw audio signal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the {optional / magnitude} should be also mentioned here, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is the optional batch dimension too. So.. something like
Create a spectrogram or a batch of spectrograms from a raw audio signal. The spectrogram can be
either magnitude-only or complex.
test/test_transforms.py
Outdated
waveform, sample_rate = torchaudio.load(self.test_filepath) # (2, 278756), 44100 | ||
|
||
# Single then transform then batch | ||
expected = transforms.Spectrogram()(waveform).unsqueeze(0).repeat(3,1,1,1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linter will fix this anyway, but (3,1,1,1)
--> (3, 1, 1, 1)
test/test_transforms.py
Outdated
expected = transforms.Spectrogram()(waveform).unsqueeze(0).repeat(3,1,1,1) | ||
|
||
# Batch then transform | ||
waveform = waveform.unsqueeze(0).repeat(3,1,1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also (3, 1, 1)
torchaudio/functional.py
Outdated
|
||
Args: | ||
waveform (torch.Tensor): Tensor of audio of dimension (channel, time) | ||
waveform (torch.Tensor): Tensor of audio of dimension (*, channel, time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd explicitly call this 'batch', i.e. ([batch], channel, time) or such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more general than batch. It works with a tensor of any dimension ending in (channel, time). In fact, this could be updated to (*, time) by making channel optional. Saying "(batch, channel, time) or (channel, time) or (time)" is a little heavy. And saying "([batch], [channel], time)" hides the generality. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of anything more compact. Just by following what it is something like ..(*, time), e.g., (time), (channel, time), (batch, channel, time)
? It's kinda verbose but if we can reduce its ambiguity :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with reducing ambiguity :)
Is (..., time)
clearer than (*, time)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm............ maybe? (But if I might have confused you, I meant ..audio of dimensions (*, time), e.g., (time), blah -
.) The current torch.stft
doc uses (*, time, ..)
to indicate an optional batch dimension which is more limited than we want here. Probably (..., time)
makes more sense then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, the better globbing to use would be (**, time)
not (*, time)
for (..., time)
-- but (..., time)
is most transparent :)
cd8b0bc
to
a67ed63
Compare
This reverts commit 008791c.
a67ed63
to
08b268b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We offer a layer that applies complex STFT without the power/normalization done in spectrogram. However, we do not want to overload the name "STFT" existing in pytorch, as done in the original proposal in the proposal from #285.
@ksanjeevan -- thoughts?
CC @ksanjeevan @keunwoochoi @cpuhrsch