Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InverseMelScale Implementation #366

Closed
wants to merge 15 commits into from

Conversation

jaeyeun97
Copy link
Contributor

@jaeyeun97 jaeyeun97 commented Dec 13, 2019

InverseMelScale implementation for #351
Test codes are coming!

Edit: It would be great if someone could chime in on the test code guideline for torchaudio?

@jaeyeun97 jaeyeun97 changed the title Feature InverseMelScale Inverse of MelScale Dec 13, 2019
@vincentqb vincentqb self-requested a review December 13, 2019 19:26
@vincentqb
Copy link
Contributor

If the code is in progress, you can prefix the pull request with [WIP] to indicate it is not yet ready for review/merging.

See also comment.

@jaeyeun97 jaeyeun97 changed the title Inverse of MelScale [WIP] Inverse of MelScale Dec 13, 2019
Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! This needs testing for jittability and correctness.

It would also be interesting to see how far InverseMelScale(MelScale(x)) is from x and MelScale(InverseMelScale(y)) from y.

@vincentqb
Copy link
Contributor

vincentqb commented Dec 14, 2019

The automated testing here picks up flake8 errors.

./torchaudio/transforms.py:206:9: E303 too many blank lines (2)
./torchaudio/transforms.py:206:51: W291 trailing whitespace
./torchaudio/transforms.py:210:1: W293 blank line contains whitespace
./torchaudio/transforms.py:222:37: W291 trailing whitespace

@jaeyeun97
Copy link
Contributor Author

InverseMelScale code changed significantly to be JITable--it now uses SVD to compute the inverse of the Mel basis matrix.

@jaeyeun97
Copy link
Contributor Author

No longer the case; using SGD for optimization yields best results when auditioned, along with the lowest and most stable loss values. JITability will have to suffer, I guess?

@vincentqb
Copy link
Contributor

No longer the case; using SGD for optimization yields best results when auditioned, along with the lowest and most stable loss values. JITability will have to suffer, I guess?

What do you mean "best results"? Can it be quantified? Can we compare the outputs with librosa, SGD and SVD? If the goal is to compute the inverse of a matrix, what other ways have you tried? inverse, pseudo-inverse? (btw this post may be relevant)

In general, I'd say jitability is important, and leave the code snippet with SGD in a comment here, and a TODO note in code with a link to here for when optimizers become jitable too :)

@jaeyeun97
Copy link
Contributor Author

Okay, I'll try to post reasonable test results between original, SVD, and SGD examples.

As a heads up--it's easier when you look at the resulting spectrograms of the tests, which shows a stark difference (mostly by having reasonable data where the missing bases result in random data on the case of SVD or pseudo-inverse methods).

@jaeyeun97
Copy link
Contributor Author

jaeyeun97 commented Dec 25, 2019

Okay, I have some results here, although they aren't quantitative ones. They generally favor solving by SGD, although the difference between SGD and SVD methods are much smaller if n_fft == 2^n.

n_fft == 400 and n_mels == 50:
400

n_fft == 512 and n_mels == 64:
512

n_fft == 2048 and n_mels == 256:
2048

@vincentqb Should I still just progress with SVD methods?

@vincentqb
Copy link
Contributor

Thanks for running those tests.

  1. Can you provide the codes for each version (e.g. using SVD and SGD)? in the comment here is fine.
  2. We need to quantify the differences, though I do agree that we'll need to go with SGD if we measure that SVD is not stable.
  3. Do you have a reference we can compare against? Something like librosa's that we could introduce as a test.

Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs a (quantitative) test against a reference, such as librosa.

@jaeyeun97
Copy link
Contributor Author

  1. You can find the code at https://gist.github.com/jaeyeun97/8651dff509d5b084636ac6c3a7547108, where the SGD version is the one you can currently find in this branch.

  2. Comparing with the Librosa version I'd say SGD is the right way to go. Librosa uses L-BGFS for solving the minimum squared difference as well, so having such a process here seems natural. The only quantifying measure I can think of is comparing the Euclidean distance (torch.dist?) between the proposed method, librosa's method, and the original spectrogram, since torch.allclose does not really work on any of the methods.

@jaeyeun97
Copy link
Contributor Author

The following shows the results including librosa:

n_fft == 400, n_mels == 50:
400

n_fft == 2048, n_mels == 256:
2048

@vincentqb
Copy link
Contributor

vincentqb commented Jan 17, 2020

Qualitatively, it's clear the the SVD implementation wouldn't do, so we'll end up with the SGD implementation even though that prevents the implementation from being jitable.

Let's look at torch.dist with p=0 and p=1 between

  • LibrosaInverse(melspec) and SGDInverse(melspec)
  • LibrosaInverse(melspec) and SVDInverse(melspec)

This will help us see the average difference and max difference. We'll need to add a test to make sure that the p=0 or p=1 distance between librosa and torchaudio is smaller than some threshold.

@jaeyeun97
Copy link
Contributor Author

Sorry, was on a long trip, will be finishing up this feature in the next few days.

@jaeyeun97
Copy link
Contributor Author

jaeyeun97 commented Feb 13, 2020

@vincentqb here are the torch.dist results:

SGDInverse SVDInverse
p=0 522046. 516637.
p=1 14831.9863 1549607.8750
p=2 77.0535 14585.5244

The above was with n_fft == 400 which may be the reason for such large differences, will test on other values and see what happens.

EDIT: I have done the same with other nfft values. While the p=2 values show a much smaller difference for case (i) than case (ii), I see little difference in p=1 or p=0 values. I am not quite sure why p=0 may be significant, since it is very unlikely (or so I presume) for any corresponding "cells" to be exactly the same after a numerical solving process.

It looks like it is much more difficult to show "similarity" of audio than I had previously thought.

@jaeyeun97
Copy link
Contributor Author

jaeyeun97 commented Feb 14, 2020

The p=1 difference above should be enough to justify SGD?

Now for the test I'd say it makes more sense to compare the difference between:

  • torch.dist(LibrosaInverse(MelScale(spec), spec))
  • torch.dist(SGDInverse(MelScale(spec), spec))

if we are going to move ahead with SGDInverse.

@vincentqb
Copy link
Contributor

Yes, based on these results, let's use the SGD variant. Thanks for providing a gist in comment with SVD. There's an alternative version of SVD coming up in pytorch/pytorch#29488, but we can revisit after the latter is merged.

I am not quite sure why p=0 may be significant, since it is very unlikely (or so I presume) for any corresponding "cells" to be exactly the same after a numerical solving process.

Oops, this was a typo :) I meant p=1 and p=2. Thanks for providing them!

@vincentqb
Copy link
Contributor

Let's make the tests green :) Can you rebase your code onto master?

@jaeyeun97
Copy link
Contributor Author

Let's make the tests green :) Can you rebase your code onto master?

yep done!

@vincentqb
Copy link
Contributor

vincentqb commented Feb 18, 2020

Thanks! Can you do the following?

  • We put the computation part in torchaudio.functional and then wrap this in torchaudio.transforms. The documentation also needs to be updated for functionals.
  • We need tests (e.g. test/test_functional.py and test/test_transform.py) for correctness. Can you add in the corresponding test file Inverse(specgram) is close to librosa?
  • We need to show consistency: (1) Inverse(MelSpecgram(waveform)) is close to waveform, and (2) MelSpecgram(Inverse(specgram)) is close to specgram.

@mthrok
Copy link
Collaborator

mthrok commented Feb 26, 2020

Hi 👋

I am working to wrap up this nice PR. (@jaeyeun97 let me know if you have a compelling reason that I should not.)

Following @vincentqb 's suggestion, I extracted the main part of InverseMelScale from transform module to functional module, and then I noticed that functional module does not have an equivalent of MelScale implementation there.

So I had offline discussion with @vincentqb, and we agreed that it's okay to leave both MelScale and InverseMelScale in transform module.

The followings are the points brought up;

  1. MelScale is creating filter banks and applying it with matmul. Filter bank generation is defined in functional module. So it's mostly done in functional module.
  2. The consistency across MelScale and InverseMelScale is important here. (I was writing a numerical compatibility test for functional version of InverseMelScale, but writing it without an equivalent of MelScale was weird and code looks ugly.)

So I will keep the implementation of InverseMelScale transform as is and will add test for numerical compatibility to supplement this PR.

@jaeyeun97
Copy link
Contributor Author

@mthrok Ah sorry, I decided to seriously look for jobs now so that took priority.

I noticed that functional module does not have an equivalent of MelScale implementation there.

I was also hesitant on moving code to functional because of this issue, thanks for sorting it out!

mthrok added a commit to mthrok/audio that referenced this pull request Feb 27, 2020
This PR follows up pytorch#366 and adds test for `InverseMelScale` (and `MelScale`) for librosa compatibility.

For `MelScale` compatibility test;
1. Generate spectrogram
2. Feed the spectrogram to `torchaudio.transforms.MelScale` instance
3. Feed the spectrogram to `librosa.feature.melspectrogram` function.
4. Compare the result from 2 and 3 elementwise.
Element-wise numerical comparison is possible because under the hood their implementations use the same algorith.

For `InverseMelScale` compatibility test, it is more elaborated than that.
1. Generate the original spectrogram
2. Convert the original spectrogram to Mel scale using `torchaudio.transforms.MelScale` instance
3. Reconstruct spectrogram using torchaudio implementation
3.1. Feed the Mel spectrogram to `torchaudio.transforms.InverseMelScale` instance and get reconstructed spectrogram.
3.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 3.1.
4. Reconstruct spectrogram using librosa
4.1. Feed the Mel spectrogram to `librosa.feature.inverse.mel_to_stft` function and get reconstructed spectrogram.
4.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 4.1. (this is the reference.)
5. Check that resulting P1 distance are in a roughly same value range.

Element-wise numerical comparison is not possible due to the difference algorithms used to compute the inverse. The reconstructed spectrograms can have some values vary in magnitude.
Therefore the strategy here is to check that P1 distance (reconstruction loss) is not that different from the value obtained using `librosa`. For this purpose, threshold was empirically chosen

```
print('p1 dist (orig <-> ta):', torch.dist(spec_orig, spec_ta, p=1))
print('p1 dist (orig <-> lr):', torch.dist(spec_orig, spec_lr, p=1))
>>> p1 dist (orig <-> ta): tensor(1482.1917)
>>> p1 dist (orig <-> lr): tensor(1420.7103)
```

This value can vary based on the length and the kind of the signal being processed, so it was handpicked.
@mthrok
Copy link
Collaborator

mthrok commented Feb 28, 2020

@jaeyeun97 I see. Good luck with your job hunting!

@jaeyeun97
Copy link
Contributor Author

@mthrok Thanks! Just reviewed #448 and looks good to me.

@jaeyeun97 jaeyeun97 changed the title [WIP] Inverse of MelScale InverseMelScale Implementation Feb 28, 2020
@jaeyeun97
Copy link
Contributor Author

Some checks failed but they look like curl failures.

mthrok added a commit to mthrok/audio that referenced this pull request Feb 28, 2020
This PR follows up pytorch#366 and adds test for `InverseMelScale` (and `MelScale`) for librosa compatibility.

For `MelScale` compatibility test;
1. Generate spectrogram
2. Feed the spectrogram to `torchaudio.transforms.MelScale` instance
3. Feed the spectrogram to `librosa.feature.melspectrogram` function.
4. Compare the result from 2 and 3 elementwise.
Element-wise numerical comparison is possible because under the hood their implementations use the same algorith.

For `InverseMelScale` compatibility test, it is more elaborated than that.
1. Generate the original spectrogram
2. Convert the original spectrogram to Mel scale using `torchaudio.transforms.MelScale` instance
3. Reconstruct spectrogram using torchaudio implementation
3.1. Feed the Mel spectrogram to `torchaudio.transforms.InverseMelScale` instance and get reconstructed spectrogram.
3.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 3.1.
4. Reconstruct spectrogram using librosa
4.1. Feed the Mel spectrogram to `librosa.feature.inverse.mel_to_stft` function and get reconstructed spectrogram.
4.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 4.1. (this is the reference.)
5. Check that resulting P1 distance are in a roughly same value range.

Element-wise numerical comparison is not possible due to the difference algorithms used to compute the inverse. The reconstructed spectrograms can have some values vary in magnitude.
Therefore the strategy here is to check that P1 distance (reconstruction loss) is not that different from the value obtained using `librosa`. For this purpose, threshold was empirically chosen

```
print('p1 dist (orig <-> ta):', torch.dist(spec_orig, spec_ta, p=1))
print('p1 dist (orig <-> lr):', torch.dist(spec_orig, spec_lr, p=1))
>>> p1 dist (orig <-> ta): tensor(1482.1917)
>>> p1 dist (orig <-> lr): tensor(1420.7103)
```

This value can vary based on the length and the kind of the signal being processed, so it was handpicked.
vincentqb pushed a commit that referenced this pull request Feb 28, 2020
* Inverse Mel Scale Implementation

* Inverse Mel Scale Docs

* Better working version.

* GPU fix

* These shouldn't go on git..

* Even better one, but does not support JITability.

* Remove JITability test

* Flake8

* n_stft is a must

* minor clean up of initialization

* Add librosa consistency test

This PR follows up #366 and adds test for `InverseMelScale` (and `MelScale`) for librosa compatibility.

For `MelScale` compatibility test;
1. Generate spectrogram
2. Feed the spectrogram to `torchaudio.transforms.MelScale` instance
3. Feed the spectrogram to `librosa.feature.melspectrogram` function.
4. Compare the result from 2 and 3 elementwise.
Element-wise numerical comparison is possible because under the hood their implementations use the same algorith.

For `InverseMelScale` compatibility test, it is more elaborated than that.
1. Generate the original spectrogram
2. Convert the original spectrogram to Mel scale using `torchaudio.transforms.MelScale` instance
3. Reconstruct spectrogram using torchaudio implementation
3.1. Feed the Mel spectrogram to `torchaudio.transforms.InverseMelScale` instance and get reconstructed spectrogram.
3.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 3.1.
4. Reconstruct spectrogram using librosa
4.1. Feed the Mel spectrogram to `librosa.feature.inverse.mel_to_stft` function and get reconstructed spectrogram.
4.2. Compute the sum of element-wise P1 distance of the original spectrogram and that from 4.1. (this is the reference.)
5. Check that resulting P1 distance are in a roughly same value range.

Element-wise numerical comparison is not possible due to the difference algorithms used to compute the inverse. The reconstructed spectrograms can have some values vary in magnitude.
Therefore the strategy here is to check that P1 distance (reconstruction loss) is not that different from the value obtained using `librosa`. For this purpose, threshold was empirically chosen

```
print('p1 dist (orig <-> ta):', torch.dist(spec_orig, spec_ta, p=1))
print('p1 dist (orig <-> lr):', torch.dist(spec_orig, spec_lr, p=1))
>>> p1 dist (orig <-> ta): tensor(1482.1917)
>>> p1 dist (orig <-> lr): tensor(1420.7103)
```

This value can vary based on the length and the kind of the signal being processed, so it was handpicked.

* Address review feedbacks

* Support arbitrary batch dimensions.

* Add batch test

* Use view for batch

* fix sgd

* Use negative indices and update docstring

* Update threshold

Co-authored-by: Charles J.Y. Yoon <jaeyeun97@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants