[nvFuser] failing correctness tests in torchbench.py #76662
Comments
Thanks for filing an issue.
It is hitting this line: … which checks correctness with … for threshold … Also note I am setting …
Actually the …
How do you decide on tolerances? I assume the default fuser doesn't trigger this?
TorchDynamo has many backends, and experimentally nearly all of them pass that tolerance level in float32 mode.
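(For illustration, a correctness check of this flavor might look like the minimal sketch below; the `same` helper name and the `atol`/`rtol` values are placeholders, since the actual thresholds used in torchbench.py were elided above.)

```python
import torch

def same(ref: torch.Tensor, res: torch.Tensor) -> bool:
    # Hypothetical thresholds, for illustration only; the real values
    # used by the harness were elided from the thread above.
    return torch.allclose(ref, res, atol=1e-4, rtol=1e-4)
```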
I missed a few questions:
- Yes
- Eager mode
Hi @jansel, I'm working on reproducing this issue. Could you please tell us what GPU architecture(s) you ran the tests on? Please provide the output of `collect_env` as well.
I am running on an RTX 3090.
…
The …
Hi @jansel, we believe the issue you are seeing is a matter of selecting the correct metric.
As reflected in the summary, when compared to a more accurate reference value (…) …
Thanks for that! I am convinced that for this case nvFuser is different-from-eager (and different from most other frameworks) but still correct. I reran my test with cosine similarity, comparing to a float64 ground truth. Most results seem to be in an acceptable range, except for tacotron2:
tacotron2 has a cosine similarity score of 64% with nvFuser and >99% without it. I am running in a branch that compares to float64, though I was also able to reproduce these results in the main branch comparing to float32.
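(A minimal sketch of that comparison, assuming flattened outputs; the helper name is illustrative, not the harness's actual code.)

```python
import torch
import torch.nn.functional as F

def cosine_score(res: torch.Tensor, ref64: torch.Tensor) -> float:
    # Compare a backend's output against a float64 "ground truth",
    # upcasting first so the similarity is computed in double precision.
    return F.cosine_similarity(res.double().flatten(),
                               ref64.flatten(), dim=0).item()
```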
Thanks for the update and for continuing to look into accuracy. CC @syed-ahmed again; hopefully he can repro and start turning fusions off one by one to see if we can find a culprit.
@jansel For tacotron2, this is what's happening. Tacotron2 has dropout enabled in the Prenet layer, even in eval mode (see this issue). As a result, the cosine similarity is expected to differ vastly, since dropout is not intended to be bit-accurate between the two backends. Our current understanding is this: …
My suggestion is to either turn off dropout somehow in tacotron2 or handle tacotron2 as a special case in your test suite.
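(To illustrate why eval mode doesn't help here: a Prenet-style layer hard-codes dropout on. This is a simplified sketch, not tacotron2's actual code.)

```python
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    # Simplified Prenet-style layer: dropout stays active even after .eval()
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # training=True ignores self.training, so eval() has no effect here;
        # the sampled mask then depends on each backend's RNG consumption order.
        return F.dropout(F.relu(self.linear(x)), p=0.5, training=True)
```

Forcing `p=0.0` (or patching `F.dropout` to an identity) in the harness would make the comparison deterministic.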
Thanks for the update, the dropout root cause makes sense.
…
It isn't a matter of setting the seeds consistently, unfortunately. The issue is that the exact elements that get dropped out depend on the exact blocking and tiling of the CUDA kernel onto execution units, which dictates the order of querying for random values. If you change that for any reason, like performance, or you have hardware of a different size, the order of the dropped-out elements will change.
@ezyang was describing the RNG algorithm used in eager, which pre-allocates random numbers in a way that is not dependent on execution order. What was the thought process around deciding how the RNG operates? What contracts does it provide to users in terms of reproducibility?
Would be happy to have @ezyang's thoughts here, but RNG is not deterministic in PyTorch across different GPUs, because of this same exact issue. Eager mode doesn't solve this; it is just consistent with itself, like nvFuser is with itself. We have had issues filed against us by people expecting different GPUs to produce bitwise-identical RNG; this is definitely not the case in eager mode.
nvFuser's segmentation and heuristics should be run-to-run deterministic (on the same GPU), like eager mode, given the same seed.
This requires the same fusions to be generated, so changing anything in the network topology could invalidate that, but such changes can also invalidate that constraint in eager mode if any added layers include RNG.
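(The contract being described is the usual seeded run-to-run determinism, e.g.:)

```python
import torch

torch.manual_seed(0)
a = torch.rand(1024, device="cuda")
torch.manual_seed(0)
b = torch.rand(1024, device="cuda")
assert torch.equal(a, b)  # same GPU, same seed: bitwise identical
# A different GPU (or a different kernel blocking) may legitimately differ.
```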
I guess my main thought is: it seems to me that it ought to be possible to use Philox RNG to get RNG that is consistent even if you block/tile the kernel differently (because the offset corresponding to any given element should be invariant to blocking/tiling). Of course, if you're using something other than Philox, then the RNG is not the same, but to my uninformed eye it seems possible to make them line up (though maybe not easy!).
Yes, we're using Philox. Please let us know if you have an idea of how to do that. As of today we don't have such a mechanism, but if you can figure out how we can approach making RNG independent of thread order and the number of threads, we'd be open to looking at reimplementing RNG.
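(As a sketch of the idea being floated: with a counter-based generator like Philox, the counter can be advanced to a per-element offset so the value for element `i` is a pure function of `(seed, i)`, independent of thread order. Shown here with NumPy's Philox bit generator, not nvFuser's implementation.)

```python
import numpy as np

def element_random(seed: int, index: int) -> float:
    # Counter-based RNG: jump the Philox counter to this element's slot,
    # so the value depends only on (seed, index), not on execution order.
    bg = np.random.Philox(key=seed)
    bg.advance(index)
    return np.random.Generator(bg).random()

# Any assignment of elements to threads yields the same per-element values:
forward  = [element_random(42, i) for i in range(8)]
shuffled = [element_random(42, i) for i in (5, 0, 7, 2, 1, 6, 4, 3)]
assert forward[5] == shuffled[0] and forward[0] == shuffled[1]
```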
@jansel can we close this issue now? I think all discrepancies have been accounted for, or are you still confident there's an issue somewhere?
Yes. Thanks for looking into these issues!
Thanks for looking at accuracy, please feel free to reopen if anything else comes up.
nvFuser is failing many correctness checks when run with TorchDynamo in torchbench.py using the latest PyTorch master.
If I run:
…
I get "INCORRECT", while if I remove `--nvfuser` it works.
If you run all models:
…
you will find around a third of them failing correctness checks.
cc @jjsjann123 @kevinstephano @csarofeen