
[XLA/GPU] rsqrt is cheap and should be fused. #40998

Merged
2 commits merged into tensorflow:master on Jul 17, 2020

Conversation

trentlo
Contributor

@trentlo trentlo commented Jul 1, 2020

Please help to review the code. Thanks.

The pattern is observed (at least) in BERT.

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Jul 1, 2020
@google-ml-butler google-ml-butler bot requested a review from joker-eph July 1, 2020 17:17
@gbaned gbaned self-assigned this Jul 2, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 2, 2020
@gbaned gbaned added the comp:xla XLA label Jul 2, 2020

@Lukious Lukious left a comment


Looks fine and it worked in my environment.
It further optimizes the TF module for GPU usage.

@gbaned gbaned added the awaiting review Pull request awaiting review label Jul 6, 2020
@trentlo
Contributor Author

trentlo commented Jul 7, 2020

@sanjoy @thomasjoerg
Could either of you help to take a look or suggest a reviewer? Thanks!

Contributor

@sanjoy sanjoy left a comment


Hi Trent,

Can you share some specific cases where this helps? The last time I wanted to make this change I concluded 379268e was a more principled fix instead.

[edit: To clarify, I'm implying that tuning the heuristic added in https://github.com/tensorflow/tensorflow/commit/379268e9f4cbccfc46827408a0e67896c75af5b4 might be more effective.]

@trentlo
Contributor Author

trentlo commented Jul 8, 2020

The heuristic is good (I like it). However, rsqrt (or div) maps to a single hardware instruction, unlike other intrinsics, which are expanded into many instructions when libdevice is linked in. So I think marking cheap instructions as cheap is orthogonal to the heuristic (which is better suited to genuinely expensive instructions).

I see the case in layer norm.
(screenshot: the layer norm pattern in which rsqrt appears)
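To make the rationale concrete, here is a minimal sketch of the kind of change under review, reconstructed from the diff hunks quoted in the review threads below; the header path, the exact opcode list, and the surrounding structure are assumptions, and the merged code may differ.

```cpp
// Sketch only: reconstructed from the review context, not the merged diff.
#include "tensorflow/compiler/xla/service/gpu/instruction_fusion.h"

namespace xla {
namespace gpu {
namespace {

// Helper from the quoted diff hunk, kept in an anonymous namespace as
// requested in the review below: only F32/F16 element types qualify.
bool ElementIsF32OrF16(const Shape& shape) {
  PrimitiveType type = shape.element_type();
  return type == F32 || type == F16;
}

}  // namespace

bool GpuInstructionFusion::IsExpensive(const HloInstruction& instruction) {
  // Some floating-point math ops (rsqrt, sqrt, divide) lower to a single
  // hardware instruction on the GPU, unlike most libdevice intrinsics that
  // expand into many instructions, so treat them as cheap and let them fuse.
  switch (instruction.opcode()) {
    case HloOpcode::kDivide:
    case HloOpcode::kSqrt:
    case HloOpcode::kRsqrt:
      if (ElementIsF32OrF16(instruction.shape())) {
        return false;
      }
      break;
    default:
      break;
  }
  // Fall back to the generic cost model for everything else.
  return InstructionFusion::IsExpensive(instruction);
}

}  // namespace gpu
}  // namespace xla
```

With rsqrt no longer classified as expensive, the fusion pass is free to pull a step like `rsqrt(variance + epsilon)` in layer norm into the surrounding elementwise fusion instead of leaving it as a separate kernel.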

@@ -29,12 +29,23 @@ limitations under the License.
namespace xla {
namespace gpu {

bool ElementIsF32OrF16(const Shape& shape) {
Contributor


static or put under anonymous namespace.

Contributor Author


My oversight. Thanks for the catch. Will update it soon.

Comment on lines 40 to 41
// We say that some floating-point math ops are cheap on the GPU.
switch (instruction.opcode()) {
Contributor


Please add the rationale you mentioned in the PR (that these lower to single instructions).

Contributor Author


Makes sense. Will add.

Contributor Author

@trentlo trentlo left a comment


Will update the PR soon.


Also, polish comments in instruction_fusion.cc.
@trentlo
Contributor Author

trentlo commented Jul 9, 2020

Updated. Please take a look again. Thanks!

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jul 9, 2020
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jul 9, 2020
@kokoro-team kokoro-team removed kokoro:force-run Tests on submitted change labels Jul 9, 2020
@gbaned gbaned removed the awaiting review Pull request awaiting review label Jul 10, 2020
@sanjoy
Contributor

sanjoy commented Jul 15, 2020

Hi @trentlo ,

This seems to regress a variant of resnet only on V100 by around 10%. Here is the pre-optimization HLO: https://gist.github.com/sanjoy/8161733b3e8f303d2f81b38814661f9a

Can you PTAL? Let me know if you can't reproduce the regression.

@trentlo
Contributor Author

trentlo commented Jul 15, 2020

> Hi @trentlo ,
>
> This seems to regress a variant of resnet only on V100 by around 10%. Here is the pre-optimization HLO: https://gist.github.com/sanjoy/8161733b3e8f303d2f81b38814661f9a
>
> Can you PTAL? Let me know if you can't reproduce the regression.

I'd guess that it interacts with the fusion heuristic and produces a surprising fusion result. I will take a look.

@gbaned gbaned added stat:awaiting response Status - Awaiting response from author and removed ready to pull PR ready for merge process labels Jul 15, 2020
@trentlo
Contributor Author

trentlo commented Jul 16, 2020

> Hi @trentlo ,
>
> This seems to regress a variant of resnet only on V100 by around 10%. Here is the pre-optimization HLO: https://gist.github.com/sanjoy/8161733b3e8f303d2f81b38814661f9a
>
> Can you PTAL? Let me know if you can't reproduce the regression.

@sanjoy, I instead see a 1% speedup with this PR on V100 (according to the perf numbers reported by xla_profile). See the attached log files for more details.
log.rsqrt_expensive.txt
log.rsqrt_cheap.txt

Are you sure the regression is related to this PR? Also, I wonder if you see any perf gain.

@sanjoy
Contributor

sanjoy commented Jul 17, 2020

> Are you sure the regression is related to this PR? Also, I wonder if you see any perf gain.

Could have been operator error, trying again.

@tensorflow-copybara tensorflow-copybara merged commit b1b40ae into tensorflow:master Jul 17, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Jul 17, 2020
@gbaned gbaned removed the stat:awaiting response Status - Awaiting response from author label Jul 20, 2020
Labels: cla: yes, comp:xla XLA, size:S CL Change Size: Small
Projects: PR Queue (Merged)
7 participants