Reduce variance of evaluation in reference #5819
[RESOLVED]
I ran the evaluation 3 times with 3 different settings for the number of GPUs and the batch size:
From this, the variance seems relatively small. Note that the warning we get when using 1 GPU (regardless of batch size) is slightly different:
Note that what was previously 94932 becomes 94930. However, both are significantly different from the 2167123 that we got from Update:
…hael/vision into references/reduce-eval-variance
Thanks a lot @YosuaMichael ! Nice work. I only have a minor comment regarding a potential simplification, but other than that, LGTM. I'll approve now, feel free to merge when addressed.
raise ValueError("The device must be cuda if we want to run in distributed mode using torchrun")
device = torch.device(args.device)

if args.use_deterministic_algorithms:
To avoid duplicating code with args.test_only, I would suggest doing something like:
- if args.use_deterministic_algorithms:
+ if args.use_deterministic_algorithms or args.test_only:
and removing these 2 lines:
vision/references/optical_flow/train.py
Lines 228 to 229 in 65238ce
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
I would also suggest doing that for every reference, as they all seem to follow the same pattern.
Hi @NicolasHug, I just tried what you suggested and got an error when I use test-only:
File "/fsx/users/yosuamichael/conda/envs/vision-c113/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
Following the error note, it works when I set the env var CUBLAS_WORKSPACE_CONFIG=:4096:8, but I think this will create friction for users.
I think the main issue is that we use torch.use_deterministic_algorithms(True) when args.use_deterministic_algorithms is set, and this is much stricter than the torch.backends.cudnn.deterministic = True that we use with args.test_only. (see here)
Hence, as of now, the only duplication we can avoid is the line torch.backends.cudnn.benchmark = False, and I think it is still okay to have 1 duplicated line for now. What do you think?
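To make the distinction above concrete, here is a minimal sketch of how the two flags map to different settings. It is not the code from this PR: `determinism_plan` and `apply_plan` are hypothetical helpers, and the dictionary keys are just labels for this illustration. Only the three torch settings named in the comment above are used.

```python
import argparse


def determinism_plan(args):
    """Return the determinism-related settings each flag would request.

    Illustrative only: the keys are labels for this sketch, not torch APIs.
    """
    plan = {}
    if args.use_deterministic_algorithms or args.test_only:
        # The one line that both code paths share.
        plan["cudnn.benchmark"] = False
    if args.use_deterministic_algorithms:
        # Strict, global switch: ops without a deterministic
        # implementation raise a RuntimeError (as in the traceback above).
        plan["use_deterministic_algorithms"] = True
    if args.test_only:
        # Narrower setting: only constrains cuDNN.
        plan["cudnn.deterministic"] = True
    return plan


def apply_plan(plan):
    """Apply a plan to torch (imported lazily so the sketch loads without it)."""
    import torch

    if "cudnn.benchmark" in plan:
        torch.backends.cudnn.benchmark = plan["cudnn.benchmark"]
    if plan.get("use_deterministic_algorithms"):
        # Per the error message above, CUDA >= 10.2 also needs
        # CUBLAS_WORKSPACE_CONFIG to be set in the environment.
        torch.use_deterministic_algorithms(True)
    if plan.get("cudnn.deterministic"):
        torch.backends.cudnn.deterministic = True


# Evaluation-only run: the strict switch is not requested.
eval_args = argparse.Namespace(use_deterministic_algorithms=False, test_only=True)
print(determinism_plan(eval_args))
```

With only `test_only` set, the plan contains the cuDNN settings and never touches `torch.use_deterministic_algorithms`, so evaluation does not require the extra environment variable.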
Ah, fair point. I forgot that one is stricter than the other. I think the way you did it is fine then. Sorry for the noise!
Summary:
* Change code to reduce variance in eval
* Remove unnecessary new line
* Fix missing import warnings
* Fix the warning on video_classification
* Fix bug to get len of UniformClipSampler
Reviewed By: YosuaMichael
Differential Revision: D36204389
fbshipit-source-id: dfa0dbad60cf2a236f61e3c5d4a459731f07557b
resolve #4730
NOTE: