
Conversation

Contributor

@YosuaMichael commented Apr 14, 2022

resolve #4730

NOTE:

  • For detection and optical_flow, we don't add a warning when the number of processed samples differs from len(dataset), because this case already seems to be handled (let me know if this is actually wrong!). A sketch of the check the other scripts gain is shown after this list.
  • For similarity, the script does not seem to support distributed mode, so we don't add that warning either (following the discussion in "Evaluation code of references is slightly off" #4559 (comment), the warning is only needed in a distributed setting).
  • We have tested classification, detection, optical-flow, segmentation, similarity, and video classification, running the evaluation twice for each. The resulting variance is small enough (<0.02% difference).
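
For reference, this is a minimal sketch of the kind of check added to the evaluation loops. The helper name maybe_warn_on_sample_mismatch and the way the per-process counter is reduced are illustrative assumptions; the actual code in the reference scripts may differ.

```python
import warnings

import torch
import torch.distributed as dist


def maybe_warn_on_sample_mismatch(num_processed_samples, dataset):
    """Warn when the evaluation did not cover exactly len(dataset) samples."""
    # In distributed mode, sum the per-process counters first so the
    # comparison is made against the full dataset.
    if dist.is_available() and dist.is_initialized():
        device = "cuda" if torch.cuda.is_available() else "cpu"
        counter = torch.tensor(num_processed_samples, device=device)
        dist.all_reduce(counter)
        num_processed_samples = int(counter.item())

    if hasattr(dataset, "__len__") and len(dataset) != num_processed_samples:
        warnings.warn(
            f"It looks like the dataset has {len(dataset)} samples, but "
            f"{num_processed_samples} samples were used for the validation, "
            "which might bias the results. Try adjusting the batch size and / or "
            "the world size. Setting the world size to 1 is always a safe bet."
        )
```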

@YosuaMichael changed the title from "[WIP] Reduce variance of evaluation in reference" to "Reduce variance of evaluation in reference" Apr 21, 2022
@YosuaMichael marked this pull request as ready for review April 21, 2022 09:48
Contributor Author

@YosuaMichael commented Apr 21, 2022

[RESOLVED]
Note: the video_classification evaluation triggers a warning when run with 4 GPUs and batch_size=16:

UserWarning: It looks like the dataset has 2167123 samples, but 94932 samples were used for the validation, which might bias the results. Try adjusting the batch size and / or the world size. Setting the world size to 1 is always a safe bet.

I have run the eval 3 times with 3 different combinations of number of GPUs and batch size:

  • 4 GPU, batch_size=16, result: Clip Acc@1 53.500 Clip Acc@5 75.991
  • 1 GPU, batch_size=16, result: Clip Acc@1 53.502 Clip Acc@5 75.991
  • 1 GPU, batch_size=1, result: Clip Acc@1 53.513 Clip Acc@5 75.995

From this, the variance seems relatively small. We also notice that the warning is slightly different when we use 1 GPU (regardless of batch size):

UserWarning: It looks like the dataset has 2167123 samples, but 94930 samples were used for the validation, which might bias the results. Try adjusting the batch size and / or the world size. Setting the world size to 1 is always a safe bet.

Note that the 94932 from before becomes 94930 here. However, both are significantly different from the 2167123 we get from len(dataloader.dataset). I think this implies that len(dataloader.dataset) might be the wrong reference value. Will investigate more on this.

Update:
It seems the cause is that the dataset has a variable number of clips per video: len(dataloader.dataset) returns the total number of clips in the whole dataset, while the UniformClipSampler we use for testing only takes a fixed number of clips per video (the default is 5). Hence our num_processed_samples ends up being roughly number of videos * clips per video rather than the total number of clips. To fix this, I will take the length of the UniformClipSampler instead of the dataset.
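
A minimal sketch of that fix, assuming a Kinetics-style dataset_test with a video_clips attribute (the variable names are illustrative; the actual change in the reference script may differ):

```python
from torchvision.datasets.samplers import UniformClipSampler

clips_per_video = 5  # default used for evaluation in the references
test_sampler = UniformClipSampler(dataset_test.video_clips, clips_per_video)

# Compare the processed-sample counter against the number of clips the sampler
# will actually yield, not against len(dataset_test), which counts every clip
# in every video.
expected_num_samples = len(test_sampler)
```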

Member

@NicolasHug left a comment


Thanks a lot @YosuaMichael! Nice work. I only have a minor comment regarding a potential simplification, but other than that, LGTM. I'll approve now; feel free to merge once it's addressed.

raise ValueError("The device must be cuda if we want to run in distributed mode using torchrun")
device = torch.device(args.device)

if args.use_deterministic_algorithms:
Member

To avoid duplicating code with the args.test_only branch, I would suggest something like:

Suggested change:
- if args.use_deterministic_algorithms:
+ if args.use_deterministic_algorithms or args.test_only:

and to remove these two lines:

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

I would also suggest doing that for every reference script, as they seem to follow the same pattern.
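
For context, a sketch of what the proposed consolidation would look like (args comes from the script's argparse; as the follow-up below shows, this turned out to be too strict for --test-only):

```python
if args.use_deterministic_algorithms or args.test_only:
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)
else:
    torch.backends.cudnn.benchmark = True
```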

Contributor Author

@YosuaMichael May 3, 2022


Hi @NicolasHug, I have just tried what you suggested and got an error when using --test-only:

  File "/fsx/users/yosuamichael/conda/envs/vision-c113/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

Following the error message, setting the env var CUBLAS_WORKSPACE_CONFIG=:4096:8 makes it work, but I think this would create friction for users.

I think the main issue is that we call torch.use_deterministic_algorithms(True) when args.use_deterministic_algorithms is set, and this is much stricter than the torch.backends.cudnn.deterministic = True we use for args.test_only (see here).

Hence, as of now we can only deduplicate the torch.backends.cudnn.benchmark = False line, and I think having that one line duplicated is still okay for now. What do you think?
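
For clarity, here is a sketch of how the two flags end up being handled separately (argument names follow the reference scripts; the exact final code may differ):

```python
if args.use_deterministic_algorithms:
    torch.backends.cudnn.benchmark = False
    # Strict: fails on ops without a deterministic implementation and, with
    # CUDA >= 10.2, requires CUBLAS_WORKSPACE_CONFIG to be set.
    torch.use_deterministic_algorithms(True)
else:
    torch.backends.cudnn.benchmark = True

# ... later, before evaluation ...
if args.test_only:
    # Looser: only pins cuDNN to deterministic kernels, no CuBLAS env var needed.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
```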

Member

Ah, fair point. I forgot that one is stricter than the other. I think the way you did it is fine then. Sorry for the noise!

@YosuaMichael merged commit e556640 into pytorch:main May 3, 2022
facebook-github-bot pushed a commit that referenced this pull request May 7, 2022
Summary:
* Change code to reduce variance in eval

* Remove unnecessary new line

* Fix missing import warnings

* Fix the warning on video_classification

* Fix bug to get len of UniformClipSampler

Reviewed By: YosuaMichael

Differential Revision: D36204389

fbshipit-source-id: dfa0dbad60cf2a236f61e3c5d4a459731f07557b