
[nvFuser] failing correctness tests in torchbench.py #76662

Closed
jansel opened this issue May 2, 2022 · 24 comments
Labels
module: nvfuser, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@jansel
Contributor

jansel commented May 2, 2022

nvFuser is failing many correctness checks when run with TorchDynamo in torchbench.py using the latest PyTorch master.

If I run:

./torchbench.py --backend=ts --nvfuser --float32 -dcuda -k attention_is_all_you_need_pytorch

I get "INCORRECT"

While if I remove --nvfuser it works.

If you run all models:

./torchbench.py --backend=ts --nvfuser --float32 -dcuda

You will find around a third of them are failing correctness checks.

cc @jjsjann123 @kevinstephano @csarofeen

@csarofeen
Contributor

Thanks for filing an issue.
Can you point me to the logic to check correctness?
Is this checking E2E network correctness in inference?
What's determined to be the "correct" value?
How did you select tolerances?

@jansel
Contributor Author

jansel commented May 2, 2022

It is hitting this line:
https://github.com/facebookresearch/torchdynamo/blob/0d59ce9bfae18337a88de32743777e3a75c4ad5c/torchbench.py#L1110

Which checks correctness with torch.allclose here:
https://github.com/facebookresearch/torchdynamo/blob/0d59ce9bfae18337a88de32743777e3a75c4ad5c/torchdynamo/testing.py#L100

The threshold is atol=rtol=1e-4, which is a lower bar than the allclose defaults of rtol=1e-05 and atol=1e-08.

Also note I am setting --float32 mode, so there shouldn't be float16 issues.
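
For reference, here is a minimal sketch of the kind of check that line performs; the name same and the recursion over nested outputs are illustrative, not the exact torchdynamo helper:

import torch

def same(ref, res, atol=1e-4, rtol=1e-4):
    # Recursively compare nested outputs; tensors are compared with
    # torch.allclose at the relaxed atol=rtol=1e-4 threshold noted above.
    if isinstance(ref, (list, tuple)):
        return len(ref) == len(res) and all(
            same(a, b, atol, rtol) for a, b in zip(ref, res))
    if isinstance(ref, torch.Tensor):
        return torch.allclose(ref, res, atol=atol, rtol=rtol)
    return ref == res

# "INCORRECT" corresponds to same(eager_output, backend_output) returning False.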

@jansel
Contributor Author

jansel commented May 2, 2022

Actually the --float32 option is in a branch. I'll submit a PR for that today. But it just calls model=model.to(torch.float32).

@csarofeen
Contributor

How do you decide on tolerances? I assume the default fuser doesn't trigger this?

@jansel
Contributor Author

jansel commented May 2, 2022

TorchDynamo has many backends and experimentally nearly all backends pass that tolerance level in float32 mode.

@jansel
Contributor Author

jansel commented May 2, 2022

I missed a few questions:

Is this checking E2E network correctness in inference?

Yes

What's determined to be the "correct" value?

Eager mode

@syed-ahmed
Contributor

Hi @jansel, I'm working on reproducing this issue. Could you please tell us what GPU architecture(s) you ran the tests on? Please provide the output of collect_env as well.

@jansel
Contributor Author

jansel commented May 2, 2022

I am running on an RTX 3090

$ python torch/utils/collect_env.py 
Collecting environment information...
PyTorch version: 1.12.0a0+gita3f10ec
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.31

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.13.0-40-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] bert-pytorch==0.0.1a4
[pip3] functorch==0.2.0a0+00aa660
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] pytorch-transformers==1.2.0
[pip3] torch==1.12.0a0+gita3f10ec
[pip3] torch-struct==0.5
[pip3] torchaudio==0.12.0a0+c4f1252
[pip3] torchdynamo==0.2.0
[pip3] torchfile==0.1.0
[pip3] torchtext==0.13.0a0+4b821f4
[pip3] torchvision==0.13.0a0+d0dede0
[conda] bert-pytorch              0.0.1a4                   dev_0    <develop>
[conda] functorch                 0.2.0a0+00aa660           dev_0    <develop>
[conda] magma-cuda113             2.5.2                         1    pytorch
[conda] mkl                       2022.0.2                 pypi_0    pypi
[conda] mkl-include               2022.0.1           h06a4308_117  
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.21.2           py38hd8d4704_0  
[conda] numpy-base                1.21.2           py38h2b8c604_0  
[conda] pytorch-transformers      1.2.0                    pypi_0    pypi
[conda] torch                     1.12.0a0+gita3f10ec           dev_0    <develop>
[conda] torch-struct              0.5                      pypi_0    pypi
[conda] torchaudio                0.12.0a0+c4f1252           dev_0    <develop>
[conda] torchdynamo               0.2.0                     dev_0    <develop>
[conda] torchfile                 0.1.0                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.13.0a0+d0dede0           dev_0    <develop>

@jansel
Contributor Author

jansel commented May 2, 2022

The --float32 option is added here:
pytorch/torchdynamo#189

@syed-ahmed
Contributor

Hi @jansel, we believe the issue you are seeing is a matter of selecting an appropriate metric
for comparison. The following is a summary of what I collected while triaging this issue on attention_is_all_you_need_pytorch.
The first two rows are the behavior you reported when nvfuser is turned on/off.

correct_result | new_result | (correct_result - new_result).abs().max().item()
--- | --- | ---
eager, float | nvfuser=true, float | 0.0009196549654006958
eager, float | nvfuser=false, float | 0.0
eager, double | eager, float | 0.0014105449791524394
eager, double | nvfuser=true, float | 0.001397865862659109
eager, double | nvfuser=false, float | 0.0014105449791524394
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::softmax", True) | 0.001333331807427729
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::layer_norm", True) | 0.0013510409274578583
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True) | 0.001374832917094948
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::unsqueeze", True) | 0.001397865862659109
eager, double | nvfuser=true, float, torch._C._jit_set_nvfuser_skip_node_kind("aten::layer_norm", True), torch._C._jit_set_nvfuser_skip_node_kind("aten::softmax", True) | 0.001374346343565036
eager, double | nvfuser=true, float, PYTORCH_NVFUSER_DISABLE_FMA=1 | 0.0013143678607354659

As reflected in the summary, when compared to a more accurate reference value (eager, double as correct_result),
the results are consistent between eager and nvFuser. Hence, our recommendation for comparing the backends would be
to use double precision for the reference values and probably custom tolerances per benchmark: as you can see,
with a tolerance of 1e-4, torch.allclose between eager, double and eager, float would also fail.
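
A minimal sketch of that recommendation, assuming a single-tensor output; run_backend and the 1e-3 tolerances are placeholders for illustration, not values from this thread:

import copy
import torch

def check_against_fp64_reference(model, inputs, run_backend, atol=1e-3, rtol=1e-3):
    # Reference: eager mode in double precision.
    ref_model = copy.deepcopy(model).to(torch.float64)
    ref_inputs = [x.to(torch.float64) if x.is_floating_point() else x for x in inputs]
    with torch.no_grad():
        ref = ref_model(*ref_inputs)
        res = run_backend(model, inputs)  # e.g. TorchScript + nvFuser in float32
    # Per-benchmark tolerances would be tuned here.
    return torch.allclose(ref.to(res.dtype), res, atol=atol, rtol=rtol)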

@mruberry added the triaged label May 3, 2022
@jansel
Contributor Author

jansel commented May 5, 2022

Thanks for that! I am convinced that for this case nvFuser is different from eager (and most other frameworks), but still correct.

I reran my test with cosine similarity comparing to a float64 ground truth. Most results seem like they are in an acceptable range except for tacotron2:

$ ./torchbench.py -dcuda --backend=ts -k tacotron2 --cosine
cuda eval  tacotron2                          Similarity score=0.9999997615814209
Similarity score=0.9999995827674866
Similarity score=0.9999999403953552
Similarity score=0.9999986886978149
1.041x p=0.00
speedup gmean=1.04x mean=1.04x
$ ./torchbench.py -dcuda --backend=ts -k tacotron2 --cosine --nvfuser
cuda eval  tacotron2                          Similarity score=0.6445340514183044
INCORRECT

tacotron2 has a cosine similarity score of 64% with nvFuser and >99% without it.

I am running in a branch that is comparing to float64, though I was also able to reproduce these results in the main branch comparing to float32.
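
For context, the --cosine comparison can be approximated like this (a sketch, not the exact torchbench code; assumes single-tensor outputs):

import torch

def cosine_score(ref, res):
    # Flatten both outputs and compute a single cosine similarity score;
    # values near 1.0 mean the two backends produce closely aligned results.
    ref = ref.detach().flatten().to(torch.float64)
    res = res.detach().flatten().to(torch.float64)
    return torch.nn.functional.cosine_similarity(ref, res, dim=0).item()

# e.g. ~0.999999 for the passing runs above vs. ~0.64 for tacotron2 with nvFuser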

@csarofeen
Contributor

Thanks for the update and for continuing to look into accuracy. CC @syed-ahmed again; hopefully he can repro and start turning fusions off one by one to see if we can find a culprit.

@syed-ahmed
Contributor

syed-ahmed commented May 5, 2022

@jansel For tacotron2, this is what's happening. Tacotron2 has dropout enabled in the Prenet layer, even in eval mode (see this issue). As a result, the cosine similarity is expected to vastly differ since dropout is not intended to be bit accurate between the two backends. Our current understanding is this:

  • Your reference value is with dropout in eager.
  • Your new value is with dropout in nvfuser.
  • Since dropout is not bit accurate between eager and nvfuser, the cosine similarity is expected to vastly differ.
  • You can verify that by turning off dropout fusion (torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True)), which gives you dropout from eager, and that will show you a cosine similarity of 99%.

My suggestion is to either turn off dropout somehow in tacotron2 or handle tacotron2 as a special case in your test suite.
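
Two hedged ways to act on that suggestion: the first reuses the skip-node switch quoted above, while the second is an assumption that the dropout in question is module-based, which may not hold if the Prenet calls the functional form directly.

import torch

# Option 1: keep nvFuser on, but let eager execute dropout nodes
# (call before scripting/running the model).
torch._C._jit_set_nvfuser_skip_node_kind("aten::dropout", True)

# Option 2: neutralize module-based dropout in the model itself.
def disable_dropout(model):
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = 0.0  # no elements dropped, so eager and nvFuser can match
    return model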

@jansel
Contributor Author

jansel commented May 5, 2022

Thanks for the update, the dropout root cause makes sense.

  1. Would it be possible for nvFuser to share the same RNG algorithm as eager? What are the challenges there?
  2. If the nvFuser RNG algorithm is better/faster/etc could we update eager to use that?
  3. In the current implementation if the user manually calls torch.cuda.set_rng_state() do they get deterministic results?

@kevinstephano
Collaborator

It isn't a matter of setting the seeds consistently, unfortunately. The issue is that the exact elements that get dropped out depend on the exact blocking and tiling of the CUDA kernel onto execution units, which dictates the order in which random values are queried. If you change that for any reason, such as performance or hardware of a different size, the order of the dropped-out elements will change.

@jansel
Contributor Author

jansel commented May 5, 2022

@ezyang was describing the RNG algorithm used in eager that pre-allocates random numbers in a way that is not dependent on execution order.

What was the thought process around deciding how the RNG operates? What contracts does it provide to users in terms of reproducibility?

@davidberard98 added this to Bugs in NVFuser May 6, 2022
@csarofeen
Contributor

csarofeen commented May 6, 2022

Would be happy to have @ezyang's thoughts here, but RNG is not deterministic in PyTorch across different GPUs, because of this same exact issue. Eager mode doesn't solve this; it is just consistent with itself, like nvFuser is with itself.

We have had issues filed against us by people expecting different GPUs to produce bitwise-identical RNG; this is definitely not the case in eager mode.

@csarofeen
Contributor

nvFuser's segmentation and heuristics should be run-to-run deterministic (on the same GPU), like eager mode, given the same seed.

@csarofeen
Contributor

csarofeen commented May 6, 2022

This requires the same fusions to be generated, so changing anything in the network topology could invalidate that; but the same constraint can also be invalidated in eager mode if any added layers include RNG.
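
A sketch of the run-to-run check being described (model and inputs are placeholders): with the same seed, the same fusions, and the same GPU, the scripted model should reproduce itself bitwise, even though it will not match eager bitwise where RNG ops are fused.

import torch

def run_with_seed(fn, inputs, seed=0):
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    with torch.no_grad():
        return fn(*inputs)

# scripted = torch.jit.script(model)      # with nvFuser enabled
# out1 = run_with_seed(scripted, inputs)
# out2 = run_with_seed(scripted, inputs)
# assert torch.equal(out1, out2)          # run-to-run deterministic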

@ezyang
Contributor

ezyang commented May 6, 2022

I guess my main thought is that it ought to be possible to use the Philox RNG to get random numbers that are consistent even if you block/tile the kernel differently (because the offset corresponding to any given element should be invariant to blocking/tiling). If you're not using Philox but something else, then of course the RNG is not the same, but to my uninformed eye it seems possible to make them line up (though maybe not easy!).
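
To illustrate that idea with a toy (pure Python, not Philox itself and not how either backend is implemented): if each element's random value is a pure function of (seed, element index), the result is independent of how elements are tiled across blocks or threads.

def philox_like(seed, index):
    # Toy stateless mixer: the value depends only on (seed, index),
    # not on the order in which elements are generated.
    x = (seed * 0x9E3779B97F4A7C15 + index) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 31
    x = (x * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    return (x ^ (x >> 27)) / 2**64

n, seed = 1024, 42
serial = [philox_like(seed, i) for i in range(n)]

# A different "blocking": four chunks of 256 elements, generated out of order.
blocked = [0.0] * n
for start in (768, 256, 0, 512):
    for i in range(start, start + 256):
        blocked[i] = philox_like(seed, i)

assert blocked == serial  # same per-element values regardless of tiling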

@csarofeen
Contributor

Yes, we're using Philox. Please let us know if you have an idea of how to do that. As of today we don't have such a mechanism, but if you can figure out how we can approach making RNG independent of thread order and the number of threads, we'd be open to looking at reimplementing RNG.

@csarofeen
Contributor

@jansel can we close this issue now? I think all discrepancies have been accounted for, or are you still confident there's an issue somewhere?

@jansel
Contributor Author

jansel commented May 6, 2022

Yes. Thanks for looking into these issues!

@jansel closed this as completed May 6, 2022
@csarofeen
Contributor

Thanks for looking at accuracy, please feel free to reopen if anything else comes up.

@davidberard98 moved this from Bugs to Performance in NVFuser May 12, 2022
@davidberard98 moved this from Performance to Done in NVFuser May 12, 2022