
Migrate renorm to ATen (CPU and CUDA) #59250

Closed
wants to merge 6 commits

Conversation

peterbell10
Collaborator

Resubmit of #59108, closes #24754, closes #24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
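
To make the approach concrete, here is a minimal Python sketch of the idea (illustrative only, not the ATen code from this PR; the `1e-7` epsilon and the exact reduction dims are assumptions about the kernel's details, not taken from the diff):

```python
import torch

def renorm_sketch(t: torch.Tensor, p: float, dim: int, maxnorm: float) -> torch.Tensor:
    # Norm over every dimension except `dim`, keeping dims so the result
    # broadcasts back against the input.
    reduce_dims = [d for d in range(t.dim()) if d != dim]
    norm = torch.linalg.vector_norm(t, p, dim=reduce_dims, keepdim=True)
    # Turn each norm into a scaling factor: sub-tensors already within
    # `maxnorm` are left alone (factor 1), the rest are scaled down.
    factor = torch.where(norm > maxnorm, maxnorm / (norm + 1e-7), torch.ones_like(norm))
    # A single broadcasted multiply performs the normalization.
    return t * factor
```

For a real-valued input this should track `torch.renorm(t, p, dim, maxnorm)` closely; the point of the change is that both the norm reduction and the broadcasted multiply are existing, already-optimized ops.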

Benchmarks (CPU):

|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

Benchmarks (CUDA):

|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

@facebook-github-bot (Contributor) commented Jun 1, 2021

💊 CI failures summary and remediations

As of commit e4b4f8f (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 03 00:57:31 [ FAILED ] AtenXlaTensorTest.TestBitwiseAndPromotion
Jun 03 00:57:31 [----------] 1 test from XlaUtilCacheTest (0 ms total)
Jun 03 00:57:31 
Jun 03 00:57:31 [----------] Global test environment tear-down
Jun 03 00:57:31 [==========] 592 tests from 8 test suites ran. (468723 ms total)
Jun 03 00:57:31 [  PASSED  ] 588 tests.
Jun 03 00:57:31 [  SKIPPED ] 1 test, listed below:
Jun 03 00:57:31 [  SKIPPED ] AtenXlaTensorTest.TestGroupNormBackward
Jun 03 00:57:31 [  FAILED  ] 3 tests, listed below:
Jun 03 00:57:31 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAnd
Jun 03 00:57:31 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAndScalar
Jun 03 00:57:31 [  FAILED  ] AtenXlaTensorTest.TestBitwiseAndPromotion
Jun 03 00:57:31 
Jun 03 00:57:31  3 FAILED TESTS
Jun 03 00:57:31 + cleanup
Jun 03 00:57:31 + retcode=1
Jun 03 00:57:31 + set +x
Jun 03 00:57:31 =================== sccache compilation log ===================
Jun 03 00:57:31 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 03 00:57:31 Compile requests                      0
Jun 03 00:57:31 Compile requests executed             0
Jun 03 00:57:31 Cache hits                            0

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

Comment on lines +652 to +658
if (acc_type != dtype) {
  norm = at::linalg_vector_norm(self, p.toDouble(), reduce_dims,
                                /*keepdim=*/true, /*dtype=*/acc_type);
} else {
  norm = at::linalg_vector_norm(self, p.toDouble(), reduce_dims,
                                /*keepdim=*/true);
}
Collaborator Author

For complex types, the input is cast to dtype and the output has type dtype so it has to be complex. However, if I leave the argument out entirely, the result is real. Seems like an odd way for this to work.

Collaborator

cc @kurtamohler for norm dtype behavior. Should we disallow specifying a non-complex dtype for complex inputs? I believe there was a warning about that at some point. But then, it wouldn't allow users to specify a different precision if the input is complex.

Collaborator

I agree that it's odd behavior. The explanation is that the input is converted to dtype before the norm is calculated, so if we allowed a non-complex dtype arg when the input is complex, we would be discarding the imaginary part. So at the moment we throw an error in that case: https://github.com/pytorch/pytorch/blob/afdfd2288ab/aten/src/ATen/native/LinearAlgebra.cpp#L2271

Collaborator

Maybe the result should always be real though, even if dtype is complex. For complex outputs, the imaginary part is always going to be 0.

Collaborator Author

If the result were always real, then norm(tensor, ..., dtype=dtype) would behave just like norm(tensor.to(dtype), ...). I think that's reasonable.
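
For context, a small Python illustration of the dtype behaviour being discussed (this reflects `linalg_vector_norm` as described in this thread at the time of the PR; later versions may behave differently):

```python
import torch

x = torch.randn(4, dtype=torch.complex64)

# No dtype argument: the norm of a complex input comes back real.
print(torch.linalg.vector_norm(x, 2).dtype)  # expected per the thread: torch.float32

# Complex dtype argument: the output keeps that complex dtype,
# with the imaginary part always 0.
print(torch.linalg.vector_norm(x, 2, dtype=torch.complex128).dtype)

# A non-complex dtype for a complex input is rejected, because the input
# would be cast to dtype first and the imaginary part silently discarded.
try:
    torch.linalg.vector_norm(x, 2, dtype=torch.float64)
except RuntimeError as err:
    print("raises:", err)
```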

@ezyang removed their request for review June 1, 2021 17:36
@ngimel added the triaged label Jun 1, 2021
@albanD removed their request for review June 1, 2021 20:59
@facebook-github-bot (Contributor)

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

  norm = at::linalg_vector_norm(
      self, p, reduce_dims, /*keepdim=*/true);
}
auto grad_output = (self.conj() * grad).sum(
Collaborator

Out of curiosity, is discarding the imaginary part expected here?

Collaborator Author

Yes. The norm is real-valued so its gradient should be as well. In fact the gradcheck tests fail if I don't intentionally cast to real.

Collaborator

But /*dtype=*/c10::toValueType(acc_type) will always throw warnings when the input is converted to real, even though that's what we want. Maybe (self.conj() * grad).real().contiguous().sum(...)? (.contiguous() is optional: on CPU I suspect a discontiguous reduction will be very bad, while on the GPU .contiguous() + reduction will be slower than just the reduction, and in any case people don't seem to care about renorm backward performance.)

Collaborator Author

> Maybe (self.conj() * grad).real().contiguous().sum(...)?

The awkward thing is that at::real isn't a no-op for real types. It throws an error so you can't write a one-liner. But if this way creates a warning then I guess there's no ignoring it.
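
A rough Python sketch of the branching being weighed here (the names, reduction dims, and keepdim choice are illustrative assumptions, not the PR's actual backward code):

```python
import torch

def real_grad_sum(self_t: torch.Tensor, grad: torch.Tensor, reduce_dims):
    # The p-norm is real-valued, so the gradient flowing through it should be
    # real as well; for complex inputs the imaginary part of this product is
    # intentionally discarded.
    prod = self_t.conj() * grad
    if prod.is_complex():
        # Take the real part explicitly rather than passing a real dtype to
        # sum(), which would warn about discarding the imaginary part. The
        # branch is needed because, as noted above, at::real is not a no-op
        # for real tensors on the C++ side.
        prod = prod.real
    return prod.sum(dim=reduce_dims, keepdim=True)
```

Whether to also call .contiguous() before the reduction is the CPU-vs-GPU trade-off mentioned above.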

@facebook-github-bot (Contributor)

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ngimel merged this pull request in 6408cbd.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Resubmit of pytorch#59108, closes pytorch#24754, closes pytorch#24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns  the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: pytorch#59250

Reviewed By: mruberry

Differential Revision: D28820359

Pulled By: ngimel

fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
Labels
cla signed · Merged · open source · triaged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

- Migrate renorm and renorm_ from the TH to Aten (CPU)
- Migrate renorm and renorm_ from the TH to Aten (CUDA)
4 participants