Standardized clamp kernels to Numpy-like implementation #43288
Conversation
Force-pushed from 911a843 to 6add085.
@mruberry Here's the change which standardizes clamp behaviour for all kernels to NumPy-like as discussed in the other PR. Let me know if this is enough for you to benchmark, and whether there's anything else you need. Thanks!
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I ran this through our perf experiments and it passed. So that's good news because we can safely make this change, but bad news in that we don't know what was causing the performance degradation on the larger PR. There's a small chance it was benchmark flakiness, but I would like to propose two options for you, @vsimkus:
I kind of prefer the first option, myself, since the original PR is now in need of a rebase. What if we separated the changes into this PR, the non-quantization changes, and the quantization changes? Looking forward to hearing your thoughts.
@mruberry Thanks! I agree with you - it will be easier to land smaller PRs. I'll be away now for a couple of weeks, but once I return I'll add some tests to this PR and update the docstrings where necessary. Then, we can work on the other bits.
Great! Looking forward to it.
Force-pushed from fba4051 to e6c2ee4.
@mruberry As before, I've refactored the clamp tests to be more concise and use NumPy as the reference implementation. Also updated the docstring so that it shows the correct clamping output formula and removed an unnecessary comment about the argument types. I think it should be good for review and benchmarking now.
This is looking pretty good, @vsimkus. Thanks for taking the time to follow up. I'm looking forward to getting these changes in. I have a question about removing the vectorized implementations and made some suggestions for simplifying the tests.
Force-pushed from e6c2ee4 to a7f53d0.
@mruberry I've now updated the PR with the suggested changes, and answered your question above on removing the complex vectorized clamp implementations. Let me know what you think.
Awesome! Thanks @vsimkus. Sorry this review took a while; we just cut our 1.7 branch and that process was very time-consuming.
Thanks for pointing out we're only removing the complex vectorized clamp impls.
There's one last issue in the docs, however, before this gets imported. Take a look and let me know your thoughts.
Force-pushed from a7f53d0 to 0e34ee8.
@mruberry Thanks again for carefully reviewing the change :) I've updated the documentation and added the small change in torch_test (
Codecov Report
@@           Coverage Diff            @@
##           master    #43288   +/-   ##
=========================================
  Coverage   68.19%    68.19%
=========================================
  Files         410       410
  Lines       53232     53232
=========================================
  Hits        36302     36302
  Misses      16930     16930
Continue to review full report at Codecov.
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hey @vsimkus, sorry to bother you. Would you mind rebasing this? I'm seeing some odd failures internally and I'm hoping they're in the base revision.
Force-pushed from 0e34ee8 to 8308899.
@mruberry Done. I hope the failures go away :)
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The rebase appears to have solved the issues. Thanks @vsimkus. Looking forward to the next PR in this series. Do you have a good idea for the next step or would you like to discuss some ideas?
BC-breaking note
For ease of exposition let a_min be the value of the "min" argument to clamp, and a_max be the value of the "max" argument to clamp.
This PR changes the behavior of torch.clamp to always compute min(max(a, a_min), a_max). torch.clamp currently computes this in its vectorized CPU specializations:
pytorch/aten/src/ATen/cpu/vec256/vec256_double.h, line 304 (at 78b95b6)
but in other places it clamps differently:
pytorch/aten/src/ATen/cpu/vec256/vec256_base.h, line 624 (at 78b95b6)
pytorch/aten/src/ATen/native/cuda/UnaryOpsKernel.cu, line 160 (at 78b95b6)
These implementations are the same when a_min < a_max, but divergent when a_min > a_max. This divergence is easily triggered:
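The code snippet that originally accompanied this note isn't preserved here; a minimal sketch of the kind of call that exposes the divergence (the concrete values, including a_min = 10, are chosen purely for illustration) might look like:

```python
import torch

a = torch.tensor([1., 5., 20.])

# NumPy-style formula min(max(a, a_min), a_max) with a_min=10, a_max=2:
# every element is first raised to at least 10, then capped at 2.
print(torch.clamp(a, min=10, max=2))  # tensor([2., 2., 2.])

# A kernel that instead computes max(min(a, a_max), a_min) returns
# tensor([10., 10., 10.]) for the same call, so before this change the
# result depended on which implementation (vectorized CPU, base, or CUDA)
# handled the tensor.
```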
This PR makes the behavior consistent with NumPy's clip. C++'s std::clamp has undefined behavior when a_min > a_max, but Clang's std::clamp will return 10 in this case (although such a program is, strictly speaking, in error). Python has no standard clamp implementation.
PR Summary
Fixes the discrepancy between the AVX, CUDA, and base vector implementations of clamp so that all implementations are consistent and use the min(max_vec, max(min_vec, x)) formula, making torch.clamp equivalent to numpy.clip in all implementations.
The same fix as in #32587 but isolated to the kernel change only, so that the internal team can benchmark.
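A rough sketch of the kind of NumPy-referenced spot check the refactored tests rely on (illustrative only, not the actual test code from this PR), covering the previously divergent min > max case:

```python
import numpy as np
import torch

x = np.linspace(-5, 5, 11, dtype=np.float32)

# Compare torch.clamp against numpy.clip as the reference implementation,
# including a pair of bounds where min > max.
for lo, hi in [(-1.0, 1.0), (10.0, 2.0)]:
    expected = np.clip(x, lo, hi)
    actual = torch.clamp(torch.from_numpy(x), min=lo, max=hi).numpy()
    assert np.array_equal(actual, expected), (lo, hi)
```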