Tests fail on A100 GPUs due to inaccurate/differing float values #52278
Comments
cc @ptrblck, did you see similar failures in your CI?
Nightly CI isn't failing. @Flamefire, I assume you've built PyTorch from source with CUDA 11.1 + cuDNN 8.0.4 + NCCL 2.8.3? EDIT: yes, you have:
FWIW: We are trying to automate the installation of PyTorch on HPC clusters, so the software environment is as reproducible as possible (same versions except for core system components like glibc). See easybuilders/easybuild-easyconfigs#12003 for our "recipe". To make the test pass I had to change the comparison to use larger tolerances.
I'm unsure whether this isn't a more general issue, or whether reproducibility is simply not that easy to achieve for DDP.
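(The exact change made to the test is not quoted above. Purely as an illustrative sketch, with hypothetical values and using torch.allclose rather than the test suite's own assertEqual helper, a loosened comparison could look like this:)

```python
import torch

# Illustrative sketch only -- not the actual edit made in the test suite.
# Compare a reference loss against a DDP loss with explicit tolerances
# instead of the default, tighter ones.
loss = torch.tensor(0.34371814)       # hypothetical reference value
ddp_loss = torch.tensor(0.34371832)   # hypothetical value computed on the A100

assert torch.allclose(loss, ddp_loss, rtol=1e-4, atol=1e-5), \
    f"losses differ: {loss.item()} vs {ddp_loss.item()}"
```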
I did some more experiments with the failing test. Inside it, the values on the node where it fails are:
For another node where it works, the values are:
Printing the raw values shows where they diverge. Next I checked the single loss values (the individual values that go into the mean):
So, comparing the final means:
Hence I conclude that the mean calculation in PyTorch is inexact in this configuration. Example code to reproduce that:
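(The original snippet is not reproduced above; the following is a minimal sketch of the kind of check described, assuming a CUDA device is available, comparing a float32 mean on the GPU against a float64 reference of the same data.)

```python
import torch

# Minimal sketch (not the original snippet): compare a float32 mean computed
# on the GPU against a float64 reference of the same values.
torch.manual_seed(0)
x = torch.randn(10000, device="cuda")   # float32 input
mean_f32 = x.mean()                     # mean accumulated in float32
mean_f64 = x.double().mean()            # reference mean in float64
print(mean_f32.item(), mean_f64.item())
print("abs diff:", abs(mean_f32.item() - mean_f64.item()))
```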
Sorry for spamming, but I wanted to share my progress. For the mean I found that the imprecision is caused by multiplying the sum of the values by the reciprocal of the element count rather than dividing. However, it did not change the results at all; I get literally the same error. To sanity-check that the reason is not my build, I also tested it with the prebuilt Docker container. Running further, I see similar issues in other test files.
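(As a hedged illustration of why multiplying by a precomputed reciprocal can round differently than dividing by the count; this is not the actual kernel code, just a sketch:)

```python
import torch

# Sketch: sum * (1/n) introduces an extra rounding step compared to sum / n,
# because 1/n is itself rounded to float32 before the multiplication.
# The difference is typically in the last bit(s) and can be zero for some inputs.
x = torch.randn(100003, dtype=torch.float32)
n = x.numel()
s = x.sum()

mean_div = s / n                                     # divide by the count
recip = torch.tensor(1.0 / n, dtype=torch.float32)   # rounded reciprocal
mean_mul = s * recip                                 # multiply by reciprocal

print(mean_div.item(), mean_mul.item(), (mean_div - mean_mul).abs().item())
```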
@Flamefire based on the comment, I want to confirm with you: the failures are not limited to the distributed tests, right? test_nn.py also failed? Regarding the distributed tests, how about the tests in "test_distributed_spawn.py"? It runs the same tests as "test_distributed_fork.py"; the only difference is that one spawns subprocesses and the other forks them.
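(For readers unfamiliar with the distinction, here is a small plain-Python sketch of the two start methods; it is not the actual test harness.)

```python
import multiprocessing as mp

def worker(rank):
    print(f"worker {rank} running")

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter per child process; "fork" duplicates
    # the parent process ("fork" is only available on POSIX systems).
    # The two distributed test files exercise the same tests and differ only
    # in which start method they use.
    for method in ("spawn", "fork"):
        ctx = mp.get_context(method)
        procs = [ctx.Process(target=worker, args=(i,)) for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```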
This is correct, but as the tests run in order and stop on the first failed file, this was simply the first one to fail. Spawn fails too, and I have confirmation from other HPC centers where the same test (in this case the distributed one) fails with the exact same values and differences.
@Flamefire I see, it looks like it is a general issue, not specific to the distributed package. In this case, would you please change the title so that the general PyTorch oncall can help you better? Thanks.
I have the same problem, with 8 RTX 3090 GPUs and an AMD EPYC 7F72 processor.
cc @zasdfgbnm, were the accuracies for the tests in question adjusted? |
Any update here? FWIW the following tests from test_nn.py fail on A100:
When running only those, just the first test fails, so the failures are input dependent (there is shared random state between the tests). The first one always fails, even when run standalone. Error message:
When run inside the full test_nn.py, the greatest difference is 0.01662997265166055, which matches what I see with the other TF32 tests.
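(Differences of roughly 1e-2 are in line with TF32's reduced mantissa. A minimal sketch, assuming an Ampere GPU and a PyTorch build that exposes the allow_tf32 flags, to see the effect:)

```python
import torch

# Sketch: compare a matmul computed with TF32 enabled vs. disabled on an A100.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True    # default on Ampere at the time
c_tf32 = a @ b

torch.backends.cuda.matmul.allow_tf32 = False   # force full FP32 matmul
c_fp32 = a @ b

# Typically well above float32 rounding error, on the order of 1e-2 here.
print((c_tf32 - c_fp32).abs().max().item())
```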
🐛 Bug
Testing PyTorch on our new cluster with A100 (CC 8.0) GPUs, I'm seeing multiple failures in the distributed tests.
E.g. from the TestDistBackendWithFork suite, the following tests fail:
The same setup (same versions of CUDA, compiler, NCCL, ...) works on many other systems, e.g. with V100 GPUs.
To Reproduce
Steps to reproduce the behavior:
More verbose log: error.log
Environment
How you installed PyTorch (conda, pip, source): source

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @mruberry @VitalyFedyunin @walterddr @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu