torch.arange numerics are different after 1.7 update on CPU #47043
Comments
This is a machine-epsilon error, I believe. Not sure why it ran clean on previous versions.
You mean that the difference is just at the level of machine epsilon? Hi-pri to find the root cause.
Yes, but only for this specific case, since ...
OK, that sounds good. I am not sure if there is anything we can do on the PyTorch side here.
This doesn't explain why the behavior changes in 1.7 (I confirmed it does).
Does it happen in the nightly build as well? Perhaps the release build for 1.7 was built with an updated compiler and something changed there.
I confirmed it's a problem on a recent version of master.
Leaving high priority to figure out why the vectorized path gives a different answer.
Possibly changed by #38697.
Confirmed it is caused by #38697.
OK, so #38697 doesn't really vectorize the computation; it unrolls it. As pointed out by @ngimel, the difference basically comes down to the order in which the floating-point operations are performed.
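As a hedged, generic illustration (the values below are assumptions, not the expressions quoted from the PR discussion): merely changing the order in which the same floating-point operations are performed can change the last bits of the result, which is exactly the kind of discrepancy an unrolled loop can introduce.

```python
# Accumulating a step ten times vs. multiplying once: same "math",
# but a different rounding sequence, so different last bits.
step = 0.1

acc = 0.0
for _ in range(10):
    acc += step          # ten additions, each rounded to double

direct = 10 * step       # one multiplication, rounded once

print(acc)               # 0.9999999999999999
print(direct)            # 1.0
print(acc == direct)     # False
```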
Also, numpy produces the "exact" result (that doesn't mean it does for all cases; I didn't check the implementation).
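For anyone who wants to reproduce this kind of comparison, here is a sketch (with assumed start/end/step values, not the ones from the original script) that prints torch.arange and numpy.arange side by side at full precision:

```python
import numpy as np
import torch

# Assumed example values; the original report used its own range.
start, end, step = 0.0, 2.0, 0.4

t = torch.arange(start, end, step, dtype=torch.float64).numpy()
n = np.arange(start, end, step, dtype=np.float64)

for i, (a, b) in enumerate(zip(t, n)):
    marker = "" if a == b else "   <-- differs"
    print(f"{i}: torch={a:.17g}  numpy={b:.17g}{marker}")
```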
I noticed that CUDA has a similar problem, but in that case it is not a new issue.
Running the above script in PyTorch 1.6 shows this:
Running on the current main branch shows this:
Should a separate issue be opened for the difference between CUDA and CPU as well?
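A hedged sketch of the kind of CPU-vs-CUDA comparison being described (assumed values again; it only does something on a machine with a CUDA GPU):

```python
import torch

start, end, step = 0.0, 2.0, 0.4  # assumed values

cpu = torch.arange(start, end, step, dtype=torch.float64)
if torch.cuda.is_available():
    gpu = torch.arange(start, end, step, dtype=torch.float64, device="cuda").cpu()
    for i, (a, b) in enumerate(zip(cpu.tolist(), gpu.tolist())):
        marker = "" if a == b else "   <-- differs"
        print(f"{i}: cpu={a:.17g}  cuda={b:.17g}{marker}")
```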
It seems to me that we can solve this issue by creating a ...
Currently, each call to ...
Well, I implemented that formula ...
Since that formula doesn't fix the issue, I think the problem is slightly more nuanced. There must be an additional numerical problem somewhere. I'm looking into what it might be.
It's baffling, because the GPU kernel is doing pretty much what the formula above suggests, and yet it still has numerical issues.
I believe the issue is AVX2. Consider the following program, which essentially performs a simplified arange (arange_test.cpp):
Build it with and without AVX2 (Makefile):
We get different results with each binary:
The AVX2 version matches today's PyTorch result if vectorization is turned off, and I suspect the non-AVX2 version matches what PyTorch used to give. The most notable difference is element 5: exactly 2 without AVX2, and slightly lower than 2 with AVX2.
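To see why a difference that small matters for the original floor report, note that floor is discontinuous at integers, so a one-ulp difference around 2 flips the answer. A minimal, self-contained illustration (plain Python, no PyTorch):

```python
import math

exactly_two = 2.0
just_below = math.nextafter(2.0, 0.0)   # the largest double below 2.0 (Python 3.9+)

print(just_below)                # 1.9999999999999998
print(math.floor(exactly_two))   # 2
print(math.floor(just_below))    # 1
```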
To summarize, there are two CPU numerical problems:
To confirm, I ran the test in all four configurations. Using the current PyTorch 1.7 release:
Using my change:
So we only get the correct result in one of the above four cases: if we use AVX instead of AVX2 and we use my change. I can create a PR to introduce it.
We can certainly disable AVX2 support for one operation; we already compile kernels multiple times per AVX version, so the simplest thing would be just to wrap the operation in a preprocessor macro that omits the DispatchStub registration when AVX2 is detected. Fixing the problem by disabling AVX2 does seem a bit goofy, though...
I propose we could just revert the PR (or the part of it) that "vectorized" arange. This would leave the CUDA issue, but people don't seem as sensitive to it. We should also add a test case for this behavior, so that if we try to improve the performance of arange again, we detect this.
The problem appears to actually be the compiler flags. The AVX2 kernels are built with these flags:
But AVX is built with these flags:
Indeed, removing that flag changes the result accordingly.

Here's the code for this experiment: https://github.com/kurtamohler/pytorch-perf-test-scripts/tree/master/arange_compiler_flags

Looking at these in godbolt shows that an FMA instruction is in fact used only if that flag is present. I think it could be possible to solve this issue by using AVX2 intrinsics to prevent the compiler from substituting an FMA. Or perhaps there is some macro that would tell the compiler not to use FMA for the specified section of code. I'll look into whether a macro like that exists.
Another possible solution is to separate the code that we don't want to use FMA into a different file, and avoid using the offending flag when compiling it.
I tried using intrinsics to see if they would avoid FMA, and it didn't work at first, but then I found a way to get it working. First, I changed from this:
to this:
Evidently the ...
This worked! Since I'm using an intrinsic only for the multiply and not the add, the two operations don't get combined, and now we get the proper behavior for the example we've been looking at:
This result is exactly the same for ... I'll admit that it's a bit of an odd solution, but it does work. Is it alright if we go ahead with this?
cc @VitalyFedyunin for the AVX build issues
@kurtamohler Thank you for the detailed analysis; in particular, narrowing it down to FMA is really useful. You didn't say so explicitly, but it makes sense that FMA would cause this problem, as it changes the rounding behavior of the operation: a separate multiply and add rounds twice, whereas a fused multiply-add rounds only once. I'm guessing that in this particular case, that's enough extra precision to "witness" the fact that 1.4 isn't actually 1.4 in floating point, and give you the odd rounding behavior. In fact, arguably, the FMA result is "more correct" (because 1.4 is not actually 1.4).

All told, I'm not sure we should fix this, given the flags we compile with. I'm open to other opinions though. Maybe @ngimel has more thoughts.
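To make the one-rounding vs. two-roundings point concrete without depending on a particular compiler or CPU, here is a sketch that emulates a fused multiply-add in exact rational arithmetic and compares it with the ordinary two-step computation. The operands are assumptions chosen to expose the difference, not values taken from the kernel:

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    # Compute a*b + c exactly, then round to double once, as a hardware FMA does.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a, b, c = 0.1, 10.0, -1.0       # assumed demo operands

two_roundings = a * b + c        # the multiply rounds to 1.0, then the add rounds again
one_rounding = fma_emulated(a, b, c)

print(two_roundings)             # 0.0
print(one_rounding)              # 5.551115123125783e-17
```

The nonzero residual is the extra precision a real FMA "sees": it witnesses the fact that 0.1 is not exactly 0.1 in binary floating point, the same effect described above for 1.4.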
Yeah, I think that's a reasonable argument, @ezyang. On the topic of better performance, I think using my ...
I think I agree. @coquelin77, coming back to you: we think this might just be "expected" behavior. Does this leave you in a bad spot, e.g., is it now not easy for you to get the right behavior one way or another?
For posterity, it IS possible to selectively disable FMA: https://twitter.com/colesbury/status/1327331440992849923
It makes things slightly more difficult in a small subset of functions, but I think it would be in Torch's best interest to target speed. Getting exact numeric consistency while making optimization passes is very difficult, if not impossible. I posted the issue primarily so you would be aware of the changes brought on by the update. Thanks for the good work!
Thanks, I'll close this as not a bug. BTW, for the public record, another way to implement the arange as originally quoted is to do arange on integers, and then divide by 10 after the fact; you should be able to avoid precision issues entirely that way.
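A quick sketch of that workaround (the range below is an assumption about what the original snippet looked like; adapt the integer endpoints to the real use case):

```python
import torch

# Build the grid from exact integers, then scale once at the end.
# Each element is then the correctly rounded double nearest to i/10,
# regardless of how arange accumulates or vectorizes internally.
t = torch.arange(0, 15, dtype=torch.float64) / 10.0

print(t)
print(torch.floor(t * 10.0))   # 0. through 14. for this range, with no off-by-one
```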
🐛 Bug
The floor operation is not working properly for the float64 dtype. Other dtypes were not tested.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
See code snippet.
Environment
Additional context
Only tested on CPU.
cc @ezyang @gchanan @zou3519 @bdhirsh @heitorschueroff