Standardized clamp kernels to Numpy-like implementation #43288
Conversation
Force-pushed from 911a843 to 6add085.
@mruberry Here's the change which standardizes clamp behaviour for all kernels to NumPy-like as discussed in the other PR. Let me know if this is enough for you to benchmark, and whether there's anything else you need. Thanks!
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I ran this through our perf experiments and it passed. So that's good news because we can safely make this change, but bad news in that we don't know what was causing the performance degradation on the larger PR. There's a small chance it was benchmark flakiness, but I would like to propose two options for you, @vsimkus:
I kind of prefer the first option, myself, since the original PR is now in need of a rebase. What if we separated the changes into this PR, the non-quantization changes, and the quantization changes? Looking forward to hearing your thoughts.
@mruberry Thanks! I agree with you - it will be easier to land smaller PRs. I'll be away now for a couple of weeks, but once I return I'll add some tests to this PR and update the docstrings where necessary. Then, we can work on the other bits.
Great! Looking forward to it.
Force-pushed from fba4051 to e6c2ee4.
@mruberry As before, I've refactored the clamp tests to be more concise and use NumPy as the reference implementation. Also updated the docstring so that it shows the correct clamping output formula and removed an unnecessary comment about the argument types. I think it should be good for review and benchmarking now.
This is looking pretty good, @vsimkus. Thanks for taking the time to follow up. I'm looking forward to getting these changes in. I have a question about removing the vectorized implementations and made some suggestions for simplifying the tests.
Force-pushed from e6c2ee4 to a7f53d0.
@mruberry I've now updated the PR with the suggested changes, and answered your question above on removing the complex vectorized clamp implementations. Let me know what you think.
Awesome! Thanks @vsimkus. Sorry this review took a while; we just cut our 1.7 branch and that process was very time-consuming.
Thanks for pointing out we're only removing the complex vectorized clamp impls.
There's one last issue in the docs, however, before this gets imported. Take a look and let me know your thoughts.
Force-pushed from a7f53d0 to 0e34ee8.
@mruberry Thanks again for carefully reviewing the change :) I've updated the documentation and added the small change in torch_test (
Codecov Report
@@           Coverage Diff            @@
##           master    #43288   +/-   ##
=========================================
  Coverage   68.19%    68.19%
=========================================
  Files         410       410
  Lines       53232     53232
=========================================
  Hits        36302     36302
  Misses      16930     16930
Continue to review full report at Codecov.
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hey @vsimkus, sorry to bother you. Would you mind rebasing this? I'm seeing some odd failures internally and I'm hoping they're in the base revision.
Force-pushed from 0e34ee8 to 8308899.
@mruberry Done. I hope the failures go away :)
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The rebase appears to have solved the issues. Thanks @vsimkus. Looking forward to the next PR in this series. Do you have a good idea for the next step or would you like to discuss some ideas?
BC-breaking note
For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.
This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:
pytorch/aten/src/ATen/cpu/vec256/vec256_double.h, line 304 (at 78b95b6)
but in other places it clamps differently:
pytorch/aten/src/ATen/cpu/vec256/vec256_base.h, line 624 (at 78b95b6)
pytorch/aten/src/ATen/native/cuda/UnaryOpsKernel.cu, line 160 (at 78b95b6)
These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:
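The code snippet that originally accompanied this note isn't preserved here; a minimal sketch of the kind of call that exposes the divergence (the concrete values, including a_min = 10, are chosen purely for illustration) might look like:

```python
import torch

a = torch.tensor([1., 5., 20.])

# NumPy-style formula min(max(a, a_min), a_max) with a_min=10, a_max=2:
# every element is first raised to at least 10, then capped at 2.
print(torch.clamp(a, min=10, max=2))  # tensor([2., 2., 2.])

# A kernel that instead computes max(min(a, a_max), a_min) returns
# tensor([10., 10., 10.]) for the same call, so before this change the
# result depended on which implementation (vectorized CPU, base, or CUDA)
# handled the tensor.
```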
This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp has undefined behavior when a_min > a_max, but Clang's std::clamp will return 10 in this case (although such a program is, strictly speaking, in error). Python has no standard clamp implementation.
PR Summary
Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp so that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, making torch.clamp equivalent to numpy.clip in all implementations.
The same fix as in #32587 but isolated to the kernel change only, so that the internal team can benchmark.
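A rough sketch of the kind of NumPy-referenced spot check the refactored tests rely on (illustrative only, not the actual test code from this PR), covering the previously divergent min > max case:

```python
import numpy as np
import torch

x = np.linspace(-5, 5, 11, dtype=np.float32)

# Compare torch.clamp against numpy.clip as the reference implementation,
# including a pair of bounds where min > max.
for lo, hi in [(-1.0, 1.0), (10.0, 2.0)]:
    expected = np.clip(x, lo, hi)
    actual = torch.clamp(torch.from_numpy(x), min=lo, max=hi).numpy()
    assert np.array_equal(actual, expected), (lo, hi)
```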