
quantization in evaluation #1

Closed
sniklaus opened this issue Mar 22, 2021 · 6 comments

@sniklaus

Thanks for sharing your code! I just looked into it a little bit and it seems there is no quantization in the evaluation?

CDFI/test.py

Lines 36 to 47 in d7f79e5

frame_out = model(in0, in1)
lps = lpips(self.gt_list[idx].cuda(), frame_out, net_type='squeeze')
imwrite(frame_out, output_dir + '/' + self.im_list[idx] + '/' + output_name + '.png', range=(0, 1))
frame_out = frame_out.squeeze().detach().cpu().numpy()
gt = self.gt_list[idx].numpy()
psnr = skimage.metrics.peak_signal_noise_ratio(image_true=gt, image_test=frame_out)
ssim = skimage.metrics.structural_similarity(np.transpose(gt, (1, 2, 0)),
                                             np.transpose(frame_out, (1, 2, 0)), multichannel=True)

However, it is common practice to quantize your interpolation estimate before computing any metrics, as shown in the examples below. If you submit results to a benchmark, like the one from Middlebury, you have to quantize the interpolation estimates to save them as an image, so it has been the norm to quantize all results throughout the evaluation.

https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/benchmark.py#L28
https://github.com/hzwer/arXiv2020-RIFE/blob/15cb7f2389ccd93e8b8946546d4665c9b41541a3/benchmark/Vimeo90K.py#L36
https://github.com/baowenbo/DAIN/blob/9d9c0d7b3718dfcda9061c85efec472478a3aa86/demo_MiddleBury.py#L162-L166
https://github.com/laomao0/BIN/blob/b3ec2a27d62df966cc70880bb3d13dcf147f7c39/test.py#L406-L410

The reason why this is important is that the quantization step has a negative impact on the metrics. So if one does not quantize the results of their method before computing the metrics while the results from other methods had the quantization step in place, then the evaluation is slightly biased. Would you hence be able to share the evaluation metrics for CDFI with the quantization? This would greatly benefit future work that compares to CDFI to avoid this bias. And thanks again for sharing your code!
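
For illustration, a quantized version of the metric computation could look roughly like the sketch below. This assumes gt and frame_out are float numpy arrays shaped (C, H, W) with values in [0, 1], as in the excerpt above; the helper names are made up for this sketch and are not part of CDFI.

import numpy as np
import skimage.metrics


def quantize_01(x):
    # Round to the 256 levels an 8-bit PNG can store, then map back to [0, 1].
    return np.round(np.clip(x, 0.0, 1.0) * 255.0) / 255.0


def quantized_metrics(gt, frame_out):
    # The ground truth typically comes from 8-bit PNGs and is therefore already
    # quantized, so rounding it again is effectively a no-op; rounding the
    # prediction mimics what saving it to disk would do.
    gt_q = quantize_01(gt)
    out_q = quantize_01(frame_out)
    psnr = skimage.metrics.peak_signal_noise_ratio(image_true=gt_q, image_test=out_q)
    ssim = skimage.metrics.structural_similarity(np.transpose(gt_q, (1, 2, 0)),
                                                 np.transpose(out_q, (1, 2, 0)),
                                                 multichannel=True)
    return psnr, ssim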

@tding1
Owner

tding1 commented Mar 22, 2021

Thanks for raising this interesting question. My response is as follows:

  1. I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations. For example, the AdaCoF evaluation script:
    https://github.com/HyeongminLEE/AdaCoF-pytorch/blob/f121ee0e8cb403216c7bd5183154dbd1cf6966f4/TestModule.py#L51-L55
    and the CAIN evaluation script:
    https://github.com/myungsub/CAIN/blob/fff8fc321c5a76904ed2a12c9500e055d4c77256/main.py#L161-L175
    are directly comparing the model output and the ground truth without an extra quantization step. In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game (as you mentioned above, the quantization step only results in a very slight difference, which has a negligible influence in practice).

  2. All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

  3. For your information, I did test CDFI again with the quantization step on the three benchmark datasets; see the comparison:
    Dataset          CDFI (w/o quantization)     CDFI (w/ quantization)
                     (PSNR / SSIM / LPIPS)       (PSNR / SSIM / LPIPS)
    Vimeo-90K        35.19 / 0.978 / 0.010       35.17 / 0.978 / 0.010
    Middlebury       37.17 / 0.983 / 0.008       37.14 / 0.983 / 0.007
    UCF101-DVF       35.24 / 0.967 / 0.015       35.21 / 0.967 / 0.015

    In these tests, the extra quantization seems to lead to a slightly worse PSNR (by no more than 0.03 dB), while it has no effect on SSIM and even yields a slightly better LPIPS on Middlebury; a rough sanity check of why the drop is this small is sketched below.
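
    As a back-of-envelope check (an assumption-laden sketch, not part of the paper): if the model's error and the rounding error are roughly independent, rounding a [0, 1] image to 256 levels adds a noise variance of about (1/255)^2 / 12, which barely moves a PSNR that is already around 35 dB.

    import math

    psnr_float = 35.19                        # PSNR without quantization (dB)
    mse_float = 10 ** (-psnr_float / 10)      # corresponding MSE for a [0, 1] range
    quant_var = (1 / 255) ** 2 / 12           # variance of uniform rounding error
    psnr_quant = -10 * math.log10(mse_float + quant_var)
    print(round(psnr_float - psnr_quant, 3))  # ~0.02 dB, in line with the drops in the table above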

To sum up, I really appreciate your comments on the "quantization" issue. Although it is not so consistent in many of the SOTA implementations and only makes a very slight difference, we will keep this in mind in future research.

@tding1 tding1 closed this as completed Mar 22, 2021
@tding1 tding1 changed the title from "quantization missing in evaluation" to "quantization in evaluation" Mar 22, 2021
@sniklaus
Author

sniklaus commented Mar 22, 2021

Thanks for providing the evaluation results with quantization!

I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

How come you didn't re-run the affected evaluations then if you were aware of this issue?

In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

it is not so consistent in many of the SOTA implementations

That doesn't justify not being consistent yourself, just because others haven't been.

@tding1
Owner

tding1 commented Mar 22, 2021

I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

How come you didn't re-run the affected evaluations then if you were aware of this issue?

What I meant here is that I adopted the practice from AdaCoF and CAIN, which happen to compare results without such "quantization"; that is, the "quantization" practice is not adopted everywhere.

In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

it is not so consistent in many of the SOTA implementations

That doesn't justify not being consistent yourself, just because others haven't been.

To be honest, before you came to me with this issue, as a reader I had never noticed such a subtlety in the presentation of existing papers, regardless of whether they do the "quantization" or not. I conjecture that this is partially because the difference is really slight and has no actual effect in practice. In any case, I will make it clear in the future.

@sniklaus
Author

I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing that calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217
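
For context, such a quantize step typically looks roughly like the following sketch; this is not CAIN's exact implementation, and the rgb_range argument is an assumption here:

import torch


def quantize(img: torch.Tensor, rgb_range: float = 1.0) -> torch.Tensor:
    # Scale to [0, 255], clamp, round to whole 8-bit levels, then scale back to
    # the original value range before computing any metrics.
    return img.mul(255.0 / rgb_range).clamp(0, 255).round().mul(rgb_range / 255.0)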

@tding1
Owner

tding1 commented Mar 22, 2021

I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing that calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217

You are right! Thanks!

@hzwer

hzwer commented Mar 25, 2021

Hi, we are working on another VFI method, RIFE. We recently wrote evaluation scripts for several VFI methods:
hzwer/ECCV2022-RIFE#124
We reproduced EDSC, CAIN, DAIN, BMBC, and some other methods, and we tried our best to verify the experimental data. You are welcome to have a look.
