
[nvFuser] failing correctness tests in torchbench.py #76662

Closed
jansel opened this issue May 2, 2022 · 24 comments
Labels
module: nvfuser, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jansel
Contributor

jansel commented May 2, 2022

nvFuser is failing many correctness checks when run with TorchDynamo in torchbench.py using the latest PyTorch master.

If I run:

./torchbench.py --backend=ts --nvfuser --float32 -dcuda -k attention_is_all_you_need_pytorch

I get "INCORRECT"

While if I remove --nvfuser it works.

If you run all models:

./torchbench.py --backend=ts --nvfuser --float32 -dcuda

You will find around a third of them are failing correctness checks.

cc @jjsjann123 @kevinstephano @csarofeen

@csarofeen
Contributor

Thanks for filing an issue.
Can you point me to the logic to check correctness?
Is this checking E2E network correctness in inference?
What's determined to be the "correct" value?
How did you select tolerances?

@jansel
Contributor Author

jansel commented May 2, 2022

It is hitting this line:
https://github.com/facebookresearch/torchdynamo/blob/0d59ce9bfae18337a88de32743777e3a75c4ad5c/torchbench.py#L1110

Which checks correctness with torch.allclose here:
https://github.com/facebookresearch/torchdynamo/blob/0d59ce9bfae18337a88de32743777e3a75c4ad5c/torchdynamo/testing.py#L100

The threshold is atol=rtol=1e-4, which is a lower bar than the allclose defaults of rtol=1e-05 and atol=1e-08.

Also note I am setting --float32 mode, so there shouldn't be float16 issues.
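
For reference, here is a minimal sketch of the kind of check that line performs; the name same and the recursion over nested outputs are illustrative, not the exact torchdynamo helper:

import torch

def same(ref, res, atol=1e-4, rtol=1e-4):
    # Recursively compare nested outputs; tensors are compared with
    # torch.allclose at the relaxed atol=rtol=1e-4 threshold noted above.
    if isinstance(ref, (list, tuple)):
        return len(ref) == len(res) and all(
            same(a, b, atol, rtol) for a, b in zip(ref, res))
    if isinstance(ref, torch.Tensor):
        return torch.allclose(ref, res, atol=atol, rtol=rtol)
    return ref == res

# "INCORRECT" corresponds to same(eager_output, backend_output) returning False.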

@jansel
Contributor Author

jansel commented May 2, 2022

Actually the --float32 option is in a branch. I'll submit a PR for that today. But it just calls model=model.to(torch.float32).

@csarofeen
Contributor

How do you decide on tolerances? I assume the default fuser doesn't trigger this?

@jansel
Contributor Author

jansel commented May 2, 2022

TorchDynamo has many backends and experimentally nearly all backends pass that tolerance level in float32 mode.

@jansel
Contributor Author

jansel commented May 2, 2022

I missed a few questions:

Is this checking E2E network correctness in inference?

Yes

What's determined to be the "correct" value?

Eager mode

@syed-ahmed
Contributor

Hi @jansel, I'm working on reproducing this issue. Could you please tell us what GPU architecture(s) you ran the tests on? Please provide the output of collect_env as well.

@jansel
Contributor Author

jansel commented May 2, 2022

I am running on an RTX 3090

$ python torch/utils/collect_env.py 
Collecting environment information...
PyTorch version: 1.12.0a0+gita3f10ec
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.31

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] bert-pytorch==0.0.1a4
[pip3] functorch==0.2.0a0+00aa660
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] pytorch-transformers==1.2.0
[pip3] torch==1.12.0a0+gita3f10ec
[pip3] torch-struct==0.5
[pip3] torchaudio==0.12.0a0+c4f1252
[pip3] torchdynamo==0.2.0
[pip3] torchfile==0.1.0
[pip3] torchtext==0.13.0a0+4b821f4
[pip3] torchvision==0.13.0a0+d0dede0
[conda] bert-pytorch              0.0.1a4                   dev_0    <develop>
[conda] functorch                 0.2.0a0+00aa660           dev_0    <develop>
[conda] magma-cuda113             2.5.2                         1    pytorch
[conda] mkl                       2022.0.2                 pypi_0    pypi
[conda] mkl-include               2022.0.1           h06a4308_117  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.2           py38hd8d4704_0  
[conda] numpy-base                1.21.2           py38h2b8c604_0  
[conda] pytorch-transformers      1.2.0                    pypi_0    pypi
[conda] torch                     1.12.0a0+gita3f10ec           dev_0    <develop>
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torchaudio                0.12.0a0+c4f1252           dev_0    <develop>
[conda] torchdynamo               0.2.0                     dev_0    <develop>
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.13.0a0+d0dede0           dev_0    <develop>

@jansel
Contributor Author

jansel commented May 2, 2022

The --float32 option is added here:
pytorch/torchdynamo#189

@syed-ahmed
Contributor

Hi @jansel, we believe the issue you are seeing is a matter of selecting an appropriate metric
for comparison. The following is a summary of what I collected while triaging this issue on attention_is_all_you_need_pytorch.
The first two rows are the behavior you reported when nvfuser is turned on/off.

correct_result | new_result | (correct_result - new_result).abs().max().item()
--- | --- | ---
eager, float | nvfuser=true, float | 0.0009196549654006958
eager, float | nvfuser=false, float | 0.0
eager, double | eager, float | 0.0014105449791524394
eager, double | nvfuser=true, float | 0.001397865862659109
eager, double | nvfuser=false, float | 0.0014105449791524394
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::softmax", True) | 0.001333331807427729
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::layer_norm", True) | 0.0013510409274578583
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True) | 0.001374832917094948
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::unsqueeze", True) | 0.001397865862659109
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::layer_norm", True), torch._C._jit_set_nvfuser_skip_node_kind("aten::softmax", True) | 0.001374346343565036
eager, double | nvfuser=true, float, PYTORCH_NVFUSER_DISABLE_FMA=1 | 0.0013143678607354659

As reflected in the summary, when compared to a more accurate reference value (eager, double as correct_result),
the results are consistent between eager and nvFuser. Hence, our recommendation for comparing the backends would be
to use double precision for the reference values and probably custom tolerances per benchmark: as you can see,
with a tolerance of 1e-4, torch.allclose between eager, double and eager, float would also fail.
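
A minimal sketch of that recommendation, assuming a single-tensor output; run_backend and the 1e-3 tolerances are placeholders for illustration, not values from this thread:

import copy
import torch

def check_against_fp64_reference(model, inputs, run_backend, atol=1e-3, rtol=1e-3):
    # Reference: eager mode in double precision.
    ref_model = copy.deepcopy(model).to(torch.float64)
    ref_inputs = [x.to(torch.float64) if x.is_floating_point() else x for x in inputs]
    with torch.no_grad():
        ref = ref_model(*ref_inputs)
        res = run_backend(model, inputs)  # e.g. TorchScript + nvFuser in float32
    # Per-benchmark tolerances would be tuned here.
    return torch.allclose(ref.to(res.dtype), res, atol=atol, rtol=rtol)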

@mruberry added the triaged label May 3, 2022
@jansel
Contributor Author

jansel commented May 5, 2022

Thanks for that! I am convinced that for this case nvFuser is different from eager (and most other frameworks), but still correct.

I reran my test with cosine similarity comparing to a float64 ground truth. Most results seem like they are in an acceptable range except for tacotron2:

$ ./torchbench.py -dcuda --backend=ts -k tacotron2 --cosine
cuda eval  tacotron2                          Similarity score=0.9999997615814209
Similarity score=0.9999995827674866
Similarity score=0.9999999403953552
Similarity score=0.9999986886978149
1.041x p=0.00
speedup gmean=1.04x mean=1.04x
$ ./torchbench.py -dcuda --backend=ts -k tacotron2 --cosine --nvfuser
cuda eval  tacotron2                          Similarity score=0.6445340514183044
INCORRECT

tacotron2 has a cosine similarity score of 64% with nvFuser and >99% without it.

I am running in a branch that is comparing to float64, though I was also able to reproduce these results in the main branch comparing to float32.
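
For context, the --cosine comparison can be approximated like this (a sketch, not the exact torchbench code; assumes single-tensor outputs):

import torch

def cosine_score(ref, res):
    # Flatten both outputs and compute a single cosine similarity score;
    # values near 1.0 mean the two backends produce closely aligned results.
    ref = ref.detach().flatten().to(torch.float64)
    res = res.detach().flatten().to(torch.float64)
    return torch.nn.functional.cosine_similarity(ref, res, dim=0).item()

# e.g. ~0.999999 for the passing runs above vs. ~0.64 for tacotron2 with nvFuser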

@csarofeen
Contributor

Thanks for the update and for continuing to look into accuracy. CC @syed-ahmed again; hopefully he can repro and start turning fusions off one by one to see if we can find a culprit.

@syed-ahmed
Contributor

syed-ahmed commented May 5, 2022

@jansel For tacotron2, this is what's happening. Tacotron2 has dropout enabled in the Prenet layer, even in eval mode (see this issue). As a result, the cosine similarity is expected to vastly differ since dropout is not intended to be bit accurate between the two backends. Our current understanding is this:

  • Your reference value is with dropout in eager.
  • Your new value is with dropout in nvfuser.
  • Since dropout is not bit accurate between eager and nvfuser, the cosine similarity is expected to vastly differ.
  • You can verify that by turning off dropout fusion (torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True)), which gives you dropout from eager, and that will show you a cosine similarity of 99%.

My suggestion is to either turn off dropout somehow in tacotron2 or handle tacotron2 as a special case in your test suite.
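
Two hedged ways to act on that suggestion: the first reuses the skip-node switch quoted above, while the second is an assumption that the dropout in question is module-based, which may not hold if the Prenet calls the functional form directly.

import torch

# Option 1: keep nvFuser on, but let eager execute dropout nodes
# (call before scripting/running the model).
torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True)

# Option 2: neutralize module-based dropout in the model itself.
def disable_dropout(model):
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = 0.0  # no elements dropped, so eager and nvFuser can match
    return model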

@jansel
Contributor Author

jansel commented May 5, 2022

Thanks for the update, the dropout root cause makes sense.

  1. Would it be possible for nvFuser to share the same RNG algorithm as eager? What are the challenges there?
  2. If the nvFuser RNG algorithm is better/faster/etc could we update eager to use that?
  3. In the current implementation if the user manually calls torch.cuda.set_rng_state() do they get deterministic results?

@kevinstephano
Collaborator

It isn't a matter of setting the seeds consistently, unfortunately. The issue is that the exact elements that get dropped out depend on the exact blocking and tiling of the CUDA kernel onto execution units, which dictates the order in which random values are queried. If you change that for any reason, such as performance or hardware of a different size, the order of the dropped-out elements will change.

@jansel
Contributor Author

jansel commented May 5, 2022

@ezyang was describing the RNG algorithm used in eager that pre-allocates random numbers in a way that is not dependent on execution order.

What was the thought process around deciding how the RNG operates? What contracts does it provide to users in terms of reproducibility?

@davidberard98 added this to Bugs in NVFuser May 6, 2022
@csarofeen
Contributor

csarofeen commented May 6, 2022

Would be happy to have @ezyang's thoughts here, but RNG is not deterministic in PyTorch across different GPUs, because of this same exact issue. Eager mode doesn't solve this; it is just consistent with itself, like nvFuser is with itself.

We have had issues filed against us by people expecting different GPUs to produce bitwise-identical RNG; this is definitely not the case in eager mode.

@csarofeen
Contributor

nvFuser's segmentation and heuristics should be run-to-run deterministic (on the same GPU), like eager mode, given the same seed.

@csarofeen
Contributor

csarofeen commented May 6, 2022

This requires the same fusions to be generated, so changing anything in the network topology could invalidate that; but the same constraint can also be invalidated in eager mode if any added layers include RNG.
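
A sketch of the run-to-run check being described (model and inputs are placeholders): with the same seed, the same fusions, and the same GPU, the scripted model should reproduce itself bitwise, even though it will not match eager bitwise where RNG ops are fused.

import torch

def run_with_seed(fn, inputs, seed=0):
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    with torch.no_grad():
        return fn(*inputs)

# scripted = torch.jit.script(model)      # with nvFuser enabled
# out1 = run_with_seed(scripted, inputs)
# out2 = run_with_seed(scripted, inputs)
# assert torch.equal(out1, out2)          # run-to-run deterministic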

@ezyang
Contributor

ezyang commented May 6, 2022

I guess my main thought is that it ought to be possible to use the Philox RNG to get random numbers that are consistent even if you block/tile the kernel differently (because the offset corresponding to any given element should be invariant to blocking/tiling). If you're not using Philox but something else, then of course the RNG is not the same, but to my uninformed eye it seems possible to make them line up (though maybe not easy!).
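
To illustrate that idea with a toy (pure Python, not Philox itself and not how either backend is implemented): if each element's random value is a pure function of (seed, element index), the result is independent of how elements are tiled across blocks or threads.

def philox_like(seed, index):
    # Toy stateless mixer: the value depends only on (seed, index),
    # not on the order in which elements are generated.
    x = (seed * 0x9E3779B97F4A7C15 + index) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 31
    x = (x * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    return (x ^ (x >> 27)) / 2**64

n, seed = 1024, 42
serial = [philox_like(seed, i) for i in range(n)]

# A different "blocking": four chunks of 256 elements, generated out of order.
blocked = [0.0] * n
for start in (768, 256, 0, 512):
    for i in range(start, start + 256):
        blocked[i] = philox_like(seed, i)

assert blocked == serial  # same per-element values regardless of tiling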

@csarofeen
Contributor

Yes, we're using Philox. Please let us know if you have an idea of how to do that. As of today we don't have such a mechanism, but if you can figure out how we can approach making RNG independent of thread order and the number of threads, we'd be open to looking at reimplementing RNG.

@csarofeen
Contributor

@jansel can we close this issue now? I think all discrepancies have been accounted for, or are you still confident there's an issue somewhere?

@jansel
Contributor Author

jansel commented May 6, 2022

Yes. Thanks for looking into these issues!

@jansel closed this as completed May 6, 2022
@csarofeen
Contributor

Thanks for looking at accuracy, please feel free to reopen if anything else comes up.

@davidberard98 moved this from Bugs to Performance in NVFuser May 12, 2022
@davidberard98 moved this from Performance to Done in NVFuser May 12, 2022