
Add division overload with rounding_mode selection #50280

Closed
wants to merge 26 commits

Conversation

@peterbell10 (Collaborator) commented Jan 8, 2021

Stack from ghstack:

As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.

I've included dedicated kernels for truncated and floor division. These aren't
strictly necessary for floating-point types, but they perform significantly
better (~2x) than doing true division followed by a separate rounding kernel.

Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.

Differential Revision: D26123271
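
For a concrete picture of the three modes, here is a quick sketch with made-up values ('true' is the default, i.e. what plain torch.div already does):

import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor([2.0, 2.0])

print(torch.div(a, b))                         # true division:     [ 3.5, -3.5]
print(torch.div(a, b, rounding_mode='trunc'))  # round toward 0:    [ 3., -3.]
print(torch.div(a, b, rounding_mode='floor'))  # round toward -inf: [ 3., -4.]

Note the two rounding modes only differ for negative quotients: trunc drops the fraction, floor steps down to the next lower integer.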

@facebook-github-bot (Contributor) commented Jan 8, 2021

💊 CI failures summary and remediations

As of commit 71b0cfe (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .circleci/docker/ubuntu-rocm/Dockerfile
Auto-merging .circleci/docker/ubuntu-rocm/Dockerfile
CONFLICT (add/add): Merge conflict in .circleci/docker/common/install_rocm.sh
Auto-merging .circleci/docker/common/install_rocm.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/dimensions.py
Auto-merging .circleci/cimodel/data/dimensions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

The pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (2/3) and pytorch_linux_xenial_py3_6_gcc5_4_build (3/3) builds failed in the same "(Optional) Merge target branch" step, with merge-conflict logs identical to the one above.



@facebook-github-bot added the oncall: jit label Jan 8, 2021
@@ -1129,6 +1129,12 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) {
BINARY_POINTWISE_VA(rsub, Scalar);
BINARY_POINTWISE(mul);
BINARY_POINTWISE(div);
{
using Binop = Tensor (*)(const Tensor&, const Tensor&, std::string);
Collaborator:

@rzou would you take a look here?

Collaborator Author:

@mruberry I think you got the wrong user. Was that meant to be @zou3519?

Collaborator:

It was, thanks @peterbell10. Darn autocomplete!

cc @zou3519

Contributor:

this lgtm!

} else if (isIntegralType(dtype, /*includeBool*/ false)) {
// There's no SIMD integer division, so don't try to vectorize it.
// TODO: if the divisor is a scalar, rewrite as multiplication by a constant.
AT_DISPATCH_INTEGRAL_TYPES(iter.common_dtype(), "div_floor_cpu", [&]() {
Collaborator:

This is inconsistent between using dtype and iter.common_dtype().

Collaborator Author:

I have removed all uses of iter.dtype(). If instead you meant the variable dtype, then I would note that it's assigned from iter.common_dtype() above. Just a bit less to type.

Collaborator:

I realize the value is the same, just for readability the code might want to stick to either dtype or iter.common_dtype(). No big deal either way.

});
});
} else {
AT_DISPATCH_FLOATING_TYPES_AND2(kBFloat16, kHalf, iter.common_dtype(), "div_floor_cpu", [&]() {
Collaborator:

Same dtype vs iter.common_dtype here, too.
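
(An aside for readers: the integer div_floor kernel quoted above derives floor division from truncating division plus a sign correction, since there is no SIMD integer division to lean on. A Python sketch of that fix-up logic — an illustration only, not the actual C++ kernel:)

def div_floor_int(a: int, b: int) -> int:
    q = abs(a) // abs(b)          # magnitude of the truncated quotient
    if (a < 0) != (b < 0):
        q = -q                    # reapply the sign: this is trunc division
        if a != q * b:
            q -= 1                # inexact and negative: round toward -inf
    return q

assert div_floor_int(-7, 2) == -4   # trunc division would give -3
assert div_floor_int(7, -2) == -4
assert div_floor_int(7, 2) == 3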

@JackCaoG (Collaborator):
Hi @mruberry, I think I can get pt/xla pr ready this week. I will ping you when that is ready.

@peterbell10 (Collaborator, Author):
@mruberry PTAL when you can. Have addressed your comments and rebased.

* ``"true"`` - default behavior. Performs no rounding and, if both :attr:`input` and
:attr:`other` are integer types, promotes the inputs to the default scalar type.
Equivalent to true division in Python (the ``/`` operator) and NumPy's ``np.true_divide``.
* ``"trunc"`` - rounds the results of the division down.
Collaborator:

The descriptions for trunc and floor are identical: "rounds the results of the division down".

For trunc I think we can say it rounds towards zero?

#50280 (comment)

@@ -641,8 +641,10 @@ def freeze_rng_state():
def set_default_dtype(dtype):
saved_dtype = torch.get_default_dtype()
torch.set_default_dtype(dtype)
yield
torch.set_default_dtype(saved_dtype)
try:
Collaborator:

Thank you for fixing this.
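
(For context, the fix being thanked here makes the test helper exception-safe. A minimal sketch of the pattern, on the assumption that the final version wraps the yield in try/finally:)

import contextlib
import torch

@contextlib.contextmanager
def set_default_dtype(dtype):
    saved_dtype = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        # Restore the saved dtype even if the body raises, so one failing
        # test can't leak its default dtype into subsequent tests.
        torch.set_default_dtype(saved_dtype)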

@mruberry (Collaborator) left a comment:

One doc nit, otherwise looks awesome.

@JackCaoG let us know when this is safe to merge.

peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Feb 1, 2021
@mruberry (Collaborator) commented Feb 2, 2021

One of the test failures is real: test_div_rounding_numpy_cuda_bfloat16.

We can skip the test for simplicity to unblock this landing. @JackCaoG's PR is ready to go so I'd like to land this during PST business hours on Tuesday, February 2nd.

@peterbell10 (Collaborator, Author):
In hindsight, the BFloat16 comparison with random test data is unlikely to work perfectly. It's sensitive to exact rounding and since NumPy is rounding to float32 precision, it will occasionally get different answers.

@mruberry (Collaborator) commented Feb 2, 2021

> In hindsight, the BFloat16 comparison with random test data is unlikely to work perfectly. It's sensitive to exact rounding and since NumPy is rounding to float32 precision, it will occasionally get different answers.

We can disable the test for now. In the future we could use a fixture to ensure there are no rounding issues.
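
(To make the failure mode concrete — a hedged sketch, with made-up random data, of the double rounding being described; whether any element actually disagrees depends on the values drawn:)

import torch

# Random operands, generated in float32 and cast down to bfloat16.
a = (torch.rand(1000) * 10 - 5).bfloat16()
b = (torch.rand(1000) * 10 - 5).bfloat16()

native = torch.div(a, b, rounding_mode='floor')        # rounds once, in bfloat16
via_f32 = torch.div(a.float(), b.float(),
                    rounding_mode='floor').bfloat16()  # NumPy-style float32 reference

# The two paths round at different precisions, so quotients landing near a
# rounding boundary can occasionally come out one unit apart.
print((native != via_f32).sum().item(), "mismatches out of 1000")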

}
return floordiv;
},
[](Vec256<scalar_t> a, Vec256<scalar_t> b) -> Vec256<scalar_t>{
Collaborator:

This is triggering some internal build issues. Adding a vectorized function can be a little tricky because we often have to stub them out on some platforms, like Android.

Since we're so close to the branch cut, I propose removing the copysign implementation and this vectorized implementation. We can file an issue and add them back in a later PR where we can take our time and focus on that issue.

Collaborator Author:

@mruberry this should be good now. Removed all Vec256 changes and unvectorized floor_divide.
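
(For reference, the scalar trunc-plus-correction logic that the now-unvectorized float kernel embodies can be sketched with tensor ops — an illustration under my own naming, not the shipped C++ code:)

import torch

def div_floor_float(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    q = torch.trunc(a / b)     # truncated (toward-zero) quotient
    rem = a - q * b
    # When the remainder is nonzero and its sign disagrees with the divisor's,
    # the exact quotient was negative, so step down to round toward -inf.
    return torch.where((rem != 0) & ((rem < 0) != (b < 0)), q - 1, q)

print(div_floor_float(torch.tensor([7., -7.]), torch.tensor([2., 2.])))  # [3., -4.]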

peterbell10 and others added 2 commits February 2, 2021 19:31
@mruberry (Collaborator) commented Feb 3, 2021

FYI there's another internal issue that I'm reviewing now. I'll keep this updated.

@mruberry (Collaborator) commented Feb 4, 2021

Update: still hacking through internal infra issues.

@mruberry (Collaborator) commented Feb 4, 2021

Update: the blocking infra team confirms its issue is fixed. This should land today.

@facebook-github-bot (Contributor):
@mruberry merged this pull request in b150f15.

@mruberry (Collaborator) commented Feb 4, 2021

Landed. Some changes had to be made internally:

  • div and floor_divide had to call the stub directly rather than re-dispatch to the out= variant (it's unclear how necessary this was; there were some performance failures, but they may have been flaky)
  • the kernel name "div_cpu" couldn't change, because mobile manifests list kernel names for export and didn't recognize the new name

@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/36/head branch February 8, 2021 15:21
Labels: cla signed, Merged, oncall: jit, open source