
quantization in evaluation #1

Closed
sniklaus opened this issue Mar 22, 2021 · 6 comments

@sniklaus

Thanks for sharing your code! I just looked into it a little bit and it seems there is no quantization in the evaluation?

CDFI/test.py

Lines 36 to 47 in d7f79e5

frame_out = model(in0, in1)
lps = lpips(self.gt_list[idx].cuda(), frame_out, net_type='squeeze')
imwrite(frame_out, output_dir + '/' + self.im_list[idx] + '/' + output_name + '.png', range=(0, 1))
frame_out = frame_out.squeeze().detach().cpu().numpy()
gt = self.gt_list[idx].numpy()
psnr = skimage.metrics.peak_signal_noise_ratio(image_true=gt, image_test=frame_out)
ssim = skimage.metrics.structural_similarity(np.transpose(gt, (1, 2, 0)),
                                             np.transpose(frame_out, (1, 2, 0)), multichannel=True)

However, it is common practice to quantize your interpolation estimate before computing any metrics, as shown in the examples below. If you submit results to a benchmark, like the one from Middlebury, you have to quantize the interpolation estimates to save them as an image, so it has been the norm to quantize all results throughout the evaluation.

https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/benchmark.py#L28
https://github.com/hzwer/arXiv2020-RIFE/blob/15cb7f2389ccd93e8b8946546d4665c9b41541a3/benchmark/Vimeo90K.py#L36
https://github.com/baowenbo/DAIN/blob/9d9c0d7b3718dfcda9061c85efec472478a3aa86/demo_MiddleBury.py#L162-L166
https://github.com/laomao0/BIN/blob/b3ec2a27d62df966cc70880bb3d13dcf147f7c39/test.py#L406-L410

The reason why this is important is that the quantization step has a negative impact on the metrics. So if one does not quantize the results of their method before computing the metrics while the results from other methods had the quantization step in place, then the evaluation is slightly biased. Would you hence be able to share the evaluation metrics for CDFI with the quantization? This would greatly benefit future work that compares to CDFI to avoid this bias. And thanks again for sharing your code!
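
For illustration, a quantized version of the metric computation could look roughly like the sketch below. This assumes gt and frame_out are float numpy arrays shaped (C, H, W) with values in [0, 1], as in the excerpt above; the helper names are made up for this sketch and are not part of CDFI.

import numpy as np
import skimage.metrics


def quantize_01(x):
    # Round to the 256 levels an 8-bit PNG can store, then map back to [0, 1].
    return np.round(np.clip(x, 0.0, 1.0) * 255.0) / 255.0


def quantized_metrics(gt, frame_out):
    # The ground truth typically comes from 8-bit PNGs and is therefore already
    # quantized, so rounding it again is effectively a no-op; rounding the
    # prediction mimics what saving it to disk would do.
    gt_q = quantize_01(gt)
    out_q = quantize_01(frame_out)
    psnr = skimage.metrics.peak_signal_noise_ratio(image_true=gt_q, image_test=out_q)
    ssim = skimage.metrics.structural_similarity(np.transpose(gt_q, (1, 2, 0)),
                                                 np.transpose(out_q, (1, 2, 0)),
                                                 multichannel=True)
    return psnr, ssim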

@tding1
Owner

tding1 commented Mar 22, 2021

Thanks for raising this interesting question. My response is as follows:

  1. I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations. For example, the AdaCoF evaluation script:
    https://github.com/HyeongminLEE/AdaCoF-pytorch/blob/f121ee0e8cb403216c7bd5183154dbd1cf6966f4/TestModule.py#L51-L55
    and the CAIN evaluation script:
    https://github.com/myungsub/CAIN/blob/fff8fc321c5a76904ed2a12c9500e055d4c77256/main.py#L161-L175
    are directly comparing the model output and the ground truth without an extra quantization step. In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game (as you mentioned above, the quantization step only results in a very slight difference, which has a negligible influence in practice).

  2. All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

  3. For your information, I did test CDFI again with the quantization step on the three benchmark datasets; see the comparison:
    Dataset          CDFI (w/o quantization)     CDFI (w/ quantization)
                     (PSNR / SSIM / LPIPS)       (PSNR / SSIM / LPIPS)
    Vimeo-90K        35.19 / 0.978 / 0.010       35.17 / 0.978 / 0.010
    Middlebury       37.17 / 0.983 / 0.008       37.14 / 0.983 / 0.007
    UCF101-DVF       35.24 / 0.967 / 0.015       35.21 / 0.967 / 0.015

    In these tests, the extra quantization seems to lead to a slightly worse PSNR (by no more than 0.03 dB), while it has no effect on SSIM and even yields a slightly better LPIPS on Middlebury; a rough sanity check of why the drop is this small is sketched below.
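
    As a back-of-envelope check (an assumption-laden sketch, not part of the paper): if the model's error and the rounding error are roughly independent, rounding a [0, 1] image to 256 levels adds a noise variance of about (1/255)^2 / 12, which barely moves a PSNR that is already around 35 dB.

    import math

    psnr_float = 35.19                        # PSNR without quantization (dB)
    mse_float = 10 ** (-psnr_float / 10)      # corresponding MSE for a [0, 1] range
    quant_var = (1 / 255) ** 2 / 12           # variance of uniform rounding error
    psnr_quant = -10 * math.log10(mse_float + quant_var)
    print(round(psnr_float - psnr_quant, 3))  # ~0.02 dB, in line with the drops in the table above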

To sum up, I really appreciate your comments on the "quantization" issue. Although it is not so consistent in many of the SOTA implementations and only makes a very slight difference, we will keep this in mind in future research.

@tding1 tding1 closed this as completed Mar 22, 2021
@tding1 tding1 changed the title from "quantization missing in evaluation" to "quantization in evaluation" Mar 22, 2021
@sniklaus
Author

sniklaus commented Mar 22, 2021

Thanks for providing the evaluation results with quantization!

I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

How come you didn't re-run the affected evaluations then if you were aware of this issue?

In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

it is not so consistent in many of the SOTA implementations

That doesn't justify not being consistent yourself, just because others haven't been.

@tding1
Owner

tding1 commented Mar 22, 2021

I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

How come you didn't re-run the affected evaluations then if you were aware of this issue?

What I meant here is that I adopted the practice from AdaCoF and CAIN, which happen to compare results without such "quantization"; that is, the "quantization" practice is not adopted everywhere.

In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

All the quantitative results (except those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

it is not so consistent in many of the SOTA implementations

That doesn't justify not being consistent yourself, just because others haven't been.

To be honest, before you came to me with this issue, as a reader I had never noticed such a subtlety in the presentation of existing papers, regardless of whether they do the "quantization" or not. I conjecture that this is partially because the difference is really slight and has no actual effect in practice. In any case, I will make it clear in the future.

@sniklaus
Author

I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing that calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217
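
For context, such a quantize step typically looks roughly like the following sketch; this is not CAIN's exact implementation, and the rgb_range argument is an assumption here:

import torch


def quantize(img: torch.Tensor, rgb_range: float = 1.0) -> torch.Tensor:
    # Scale to [0, 255], clamp, round to whole 8-bit levels, then scale back to
    # the original value range before computing any metrics.
    return img.mul(255.0 / rgb_range).clamp(0, 255).round().mul(rgb_range / 255.0)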

@tding1
Owner

tding1 commented Mar 22, 2021

I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing that calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217

You are right! Thanks!

@hzwer

hzwer commented Mar 25, 2021

Hi, we are working on another VFI method, RIFE. We recently wrote evaluation scripts for several VFI methods:
hzwer/ECCV2022-RIFE#124
We reproduced EDSC, CAIN, DAIN, BMBC, and some other methods, and we tried our best to verify the experimental data. You are welcome to have a look.
