[XLA/GPU] rsqrt is cheap and should be fused. #40998
Conversation
Looks fine and it worked in my env. It further optimizes the TF module for GPU usage.
@sanjoy @thomasjoerg
Hi Trent,
Can you share some specific cases where this helps? The last time I wanted to make this change I concluded 379268e was a more principled fix instead.
[edit: To clarify, I'm implying that tuning the heuristic added in https://github.com/tensorflow/tensorflow/commit/379268e9f4cbccfc46827408a0e67896c75af5b4 might be more effective.]
The heuristic is good (I like it). However, rsqrt (or div) maps to just one hardware instruction, unlike other intrinsics, which get expanded into a bunch of instructions when libdevice is linked in. So I think marking cheap instructions as cheap is orthogonal to the heuristic (which better handles genuinely expensive instructions).
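To make the argument concrete, here is a minimal sketch of the kind of change under discussion, assuming it lands in GpuInstructionFusion::IsExpensive in xla/service/gpu/instruction_fusion.cc; the exact opcode list and helper logic are illustrative, not copied from the PR.

/*static*/ bool GpuInstructionFusion::IsExpensive(
    const HloInstruction& instruction) {
  // Cheap-math treatment only makes sense for f32/f16, where the GPU has
  // fast (often single-instruction) lowerings; f64 goes through slower
  // libdevice code paths.
  PrimitiveType elem = instruction.shape().element_type();
  const bool fast_element_type = (elem == F32 || elem == F16);
  switch (instruction.opcode()) {
    case HloOpcode::kDivide:
    case HloOpcode::kSqrt:
    case HloOpcode::kRsqrt:
    case HloOpcode::kExp:
      if (fast_element_type) {
        return false;  // Cheap: do not let cost block fusion/duplication.
      }
      break;
    default:
      break;
  }
  // Everything else defers to the generic cost model.
  return InstructionFusion::IsExpensive(instruction);
}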
@@ -29,12 +29,23 @@ limitations under the License.
namespace xla {
namespace gpu {

bool ElementIsF32OrF16(const Shape& shape) {
Make it static or put it under an anonymous namespace.
My oversight. Thanks for the catch. Will update it soon.
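For illustration, the fix being agreed on might look like the following (a sketch; the actual follow-up commit may differ):

namespace {

// Anonymous namespace gives the helper internal linkage, so it cannot
// collide with symbols in other translation units (marking it static
// would achieve the same).
bool ElementIsF32OrF16(const Shape& shape) {
  PrimitiveType type = shape.element_type();
  return type == F32 || type == F16;
}

}  // namespace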
// We say that some floating-point math ops are cheap on the GPU.
switch (instruction.opcode()) {
Please add the rationale you mentioned in the PR (that these lower to single instructions).
Makes sense. Will add.
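A rough sketch of the kind of rationale comment being requested (wording is illustrative, not taken from the PR):

// These ops lower to a single (or very few) hardware instruction(s) on the
// GPU, unlike most transcendental intrinsics, which expand into many
// instructions when libdevice is linked in, so treat them as cheap to fuse.
switch (instruction.opcode()) {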
Will update the PR soon.
Also polished the comments in instruction_fusion.cc.
Updated. Please take another look. Thanks!
Hi @trentlo, this seems to regress a variant of ResNet, only on V100, by around 10%. Here is the pre-optimization HLO: https://gist.github.com/sanjoy/8161733b3e8f303d2f81b38814661f9a Can you PTAL? Let me know if you can't reproduce the regression.
I'd guess that it interacts with the fusion heuristic and produces a surprising fusion result. I will take a look.
@sanjoy, I instead see a 1% speedup with this PR on V100 (according to the perf numbers reported by xla_profile). See the attached log file for more details. Are you sure the regression is related to this PR? Also, I wonder whether you see any perf gain.
Could have been operator error, trying again.
Please help review the code. Thanks.
The pattern is observed (at least) in BERT.