Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opus_compare reflexivity #308

Closed
heshpdx opened this issue Jan 9, 2024 · 7 comments
Closed

opus_compare reflexivity #308

heshpdx opened this issue Jan 9, 2024 · 7 comments

Comments

@heshpdx
Copy link

heshpdx commented Jan 9, 2024

Hi, I have noticed that opus_compare is not a reflexive function. That is, the result of opus_compare is dependent on the order of the arguments. Is this expected?

$ ls -l a/amp_podcast_11v.dec b/amp_podcast_11v.dec 
-rw-rw-r-- 2 mjm mjm 107837440 Jan  8 23:42 a/amp_podcast_11v.dec
-rw-rw-r-- 2 mjm mjm 107837440 Jan  8 23:42 b/amp_podcast_11v.dec

$ cksum a/amp_podcast_11v.dec b/amp_podcast_11v.dec
3149414850 107837440 a/amp_podcast_11v.dec
2441038936 107837440 b/amp_podcast_11v.dec

$ ./opus_compare -s a/amp_podcast_11v.dec b/amp_podcast_11v.dec
Test vector PASSES
Opus quality metric: 19.7 %

$ ./opus_compare -s b/amp_podcast_11v.dec a/amp_podcast_11v.dec
Test vector FAILS
Internal weighted error is 0.366566

One audio file was generated by an opus binary built with gcc-12, and the other from opus built with llvm-15. Many other outputs verify correctly between the two builds but this one is an outlier. I'm happy to post the audio files on dropbox or something if anyone wants to take a look. But I think this is really a question about the inner workings of opus_compare.

This question is motivated by an effort to turn opus into a new benchmark for SPEC CPUv8. Due to its prominence and longevity in the industry, I proposed opus to be included in the next set of marquee benchmarks in SPEC CPU. Part of that effort is to validate the output of the benchmark across a wide variety of architectures and compilers, and so I am using opus_compare to verify the work was done correctly, defined by the bounds of the comparison inherent to opus_compare. That is how I ended up here. Thanks for your time and guidance.

@jmvalin
Copy link
Member

jmvalin commented Jan 9, 2024

Keep in mind that opus_compare is not designed to be a general quality assessment tool. Rather, it is only meant to be used to evaluate the decoder output on the official test vectors (not other samples).

@heshpdx
Copy link
Author

heshpdx commented Feb 29, 2024

Thank you for your insights. I think this is still the best tool for our purposes. Could you share some insights on how I can relax the comparison to be a little more tolerant for my large file comparison? Which variables in opus_compare.c could we play with? I see TEST_WIN_SIZE and TEST_WIN_STEP - can you share intuition on what changing those would do? If you have other suggestions, I welcome them.

@jmvalin
Copy link
Member

jmvalin commented Feb 29, 2024

If your goal is just to make it a bit more lenient, then I think the simplest thing to do would be just to change the threshold to something you're OK with. Near the end of the file, you'll see the following two lines:
err=pow(err/nframes,1.0/16);
Q=100*(1-0.5log(1+err)/log(1.13));
You could simply change the first to be
err=leniency
pow(err/nframes,1.0/16);
or something like that.
Are you trying to test just the decoder or also the encoder. opus_compare was designed to evaluate the decoder, which has an "almost bit exact" definition, i.e. decoders will only differ by rounding error. If you're comparing encoders, then you can expect much larger differences.

@heshpdx
Copy link
Author

heshpdx commented Feb 29, 2024

Thank you! I will play around with this. I tried a couple values of LENIENCY and the failure above turned into:
LENIENCY=0.5:

Test vector PASSES
Opus quality metric: 31.2 % (internal weighted error is 0.183283)

and LENIENCY=0.3

Test vector PASSES
Opus quality metric: 57.3 % (internal weighted error is 0.109970)

This allows me to set a tolerance after listening to the output and figuring out if it is acceptable for our needs.

For your other question, I am performing both encode and decode. I take a .wav file, encode it with opus, then take that encoded bitstream and decode it with opus. The final decoded file is what we run opus_compare on, to compare the audio from two different systems (CPUs, ISA, compiler, OS, whatever). We are looking to ensure that the same work was accomplished. Because system differences and a lossy algorithm can build up for long tests, I wanted to allow for slightly higher tolerance. I have listened to the audio for the ones that do not pass with the RFC/opus standard code, and they sound just fine to my ear. So this leniency idea is very appropriate.

Do you have any guidance on choosing leniency values? We want to keep the bounds tight enough, because SPEC CPU is used by compiler writers to ensure that new code generation flows don't break functionality.

@jmvalin
Copy link
Member

jmvalin commented Feb 29, 2024

If you have an encoder in the loop, then you can get larger but valid differences. As an example, try compiling one encoder with --enable-fixed-point but not the other one and see the difference. It's going to be kinda big. But then again you might want to compare "apples to apples", in which case such a difference may not be something you want to accept. I guess my main question would be what you're trying to catch with that test. Are you trying to detect if someone cheated and used a lower complexity setting to get better speed or are you just trying to check that the build isn't completely broken and producing garbled audio.

@heshpdx
Copy link
Author

heshpdx commented Feb 29, 2024

Everyone will build with the same options, and everyone will run opus with the same flags. So, it is the second one you mention: making sure that the math is correct enough and the audio matches within some bounds so there is no garbled audio (which we already caught once!)

@heshpdx
Copy link
Author

heshpdx commented Jun 1, 2024

Update: changing the threshold worked for us, and allowed benchmark output verification to succeed on a myriad of systems and compilers. I acknowledge that opus_compare is very strict in it's audio quality comparison, magnitudes more than what can be perceived by the human ear. Thank you for your technical contributions!

@heshpdx heshpdx closed this as completed Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants