1.12.0-rc2 cherry-pick request: Various XLA scatter improvements. #23235

tatatodd · 2018-10-24T23:53:59Z

Various PiperOrigin CLs are cherry-picked into this PR:

PiperOrigin-RevId: 215687800
PiperOrigin-RevId: 216412467
PiperOrigin-RevId: 216437329
PiperOrigin-RevId: 216448063
PiperOrigin-RevId: 216624225
PiperOrigin-RevId: 216798034
PiperOrigin-RevId: 216921512
PiperOrigin-RevId: 216968475

PiperOrigin-RevId: 215687800

This simple implementation has a kernel that runs on every element of the updates tensor, figures out the right indices to perform the update, and applies it with an atomic operation. Currently we emit a CAS for plain (i.e. non-add) updates, which is inefficient. Also, TuplePointsToAnalysis doesn't know that it should alias the operand and output buffers of a scatter, which would avoid a copy.
PiperOrigin-RevId: 216412467
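The per-element scheme described above can be sketched in plain Python. This is only an illustration of the semantics, not the XLA:GPU code: the function name `scatter`, its parameters, and the restriction to a 1-D operand with scalar updates are all assumptions made for the example. The sequential loop stands in for the GPU threads, so the read-modify-write that would need an atomic operation on the device is an ordinary assignment here.

```python
import operator


def scatter(operand, scatter_indices, updates, combine=operator.add):
    """Hypothetical 1-D scatter: one loop iteration per updates element."""
    out = list(operand)  # leave the caller's operand untouched
    for i, update in enumerate(updates):
        # Figure out which output element this update targets.
        dest = scatter_indices[i]
        # On a GPU, several threads may hit the same dest, so this
        # read-modify-write would have to be a single atomic operation.
        out[dest] = combine(out[dest], update)
    return out


# Two updates land on index 1 and are accumulated.
summed = scatter([0, 0, 0], [1, 1, 2], [5, 7, 9])

# A plain (non-add) update ignores the old value; the last write wins in
# this sequential sketch.
overwritten = scatter([0, 0, 0], [1, 1, 2], [5, 7, 9],
                      combine=lambda old, new: new)
```

In the plain-update case the combiner never reads the old value, which is presumably why a compare-and-swap loop is more machinery than that case needs.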

This avoids a copy.
PiperOrigin-RevId: 216437329

We have a 1-element thunk sequence if we're not copying. That's still two thunks, and HLO profiling gets confused if it sees two thunks for the same instruction and one of them claims to be the whole instruction.
PiperOrigin-RevId: 216448063

We fuse everything into the scatter now, and emit two kernels. The first kernel fills the output buffer with the computation fused into the scatter operand. The second kernel is a regular scatter, which also contains the fused operations from the updates and scatter_indices inputs.
PiperOrigin-RevId: 216624225
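The two-kernel structure can be pictured with a small Python sketch. The names `fused_scatter`, `operand_fn`, `index_fn` and `update_fn` are hypothetical; the callables stand in for whatever elementwise computations were fused into the operand, scatter_indices and updates inputs, and the two loops stand in for the two emitted kernels.

```python
import operator


def fused_scatter(size, operand_fn, num_updates, index_fn, update_fn,
                  combine=operator.add):
    """Hypothetical model of a scatter with fused inputs (1-D, scalar updates)."""
    # "Kernel 1": fill the output buffer by evaluating the computation
    # fused into the operand, one output element at a time.
    out = [operand_fn(i) for i in range(size)]
    # "Kernel 2": a regular scatter that evaluates the fused index and
    # update computations on the fly instead of reading them from buffers.
    for k in range(num_updates):
        dest = index_fn(k)
        out[dest] = combine(out[dest], update_fn(k))
    return out


# Operand is i * 10, indices are k + 1, updates are k + 1.
result = fused_scatter(4, lambda i: i * 10, 2,
                       lambda k: k + 1, lambda k: k + 1)
```

Neither the operand nor the updates are materialized as separate buffers in this model; only the output buffer is written.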

PiperOrigin-RevId: 216798034

The bounds check was applied to the index after the window offset had been added to it, and that sum was compared against the window dimension. This means that the bounds check was only correct for the first element of a window. Instead, compare the scatter index, which is the same for all elements of a window.
PiperOrigin-RevId: 216921512
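One way to read this fix is sketched below in plain Python. It assumes that a window is valid when its start index lies in `0 <= start <= operand_size - window_size`; that bound, the function name `scatter_windows`, and the `check_per_element` flag are assumptions made for illustration and are not taken from the XLA sources.

```python
def scatter_windows(operand, scatter_indices, updates, check_per_element=False):
    """Hypothetical 1-D windowed scatter-add that skips out-of-bounds windows."""
    out = list(operand)
    window_size = len(updates[0]) if updates else 0
    # Largest start index at which a whole window still fits (assumed bound).
    max_start = len(out) - window_size
    for start, window in zip(scatter_indices, updates):
        for offset, value in enumerate(window):
            # Old behavior: test the index after the window offset was added.
            # New behavior: test the scatter index itself, which is the same
            # for every element of the window.
            checked = start + offset if check_per_element else start
            if 0 <= checked <= max_start:
                out[start + offset] += value
    return out


# A window of size 2 starting at index 2 fits exactly into 4 elements.
fixed = scatter_windows([0, 0, 0, 0], [2], [[1, 1]])
old = scatter_windows([0, 0, 0, 0], [2], [[1, 1]], check_per_element=True)
```

With the per-element check, only the first element of the valid window is written, because the offset pushes the second element past the bound.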

The tuple buffer is never read, so stop emitting code to fill it. A typical root tuple consists of an H2D memcpy and a host callback, both of which are somewhat slow. This helps tiny models and inference benchmarks, where the host/device syncs can be a significant part of the runtime of the entire computation.
PiperOrigin-RevId: 216968475

tatatodd · 2018-10-25T00:45:34Z

I'm ignoring the clang-format check, since it's suggesting weird formatting changes, and it's not critical anyways.

tensorflower-gardener and others added 8 commits October 24, 2018 14:49

[XLA] Update Tf2Xla bridge to use Scatter HLO.

0584008

PiperOrigin-RevId: 215687800

[XLA] Allow scatter to share the operand buffer with the output

cc3d9f6

This avoids a copy.
PiperOrigin-RevId: 216437329

[XLA:GPU] Adding a test case for Scatter where GPU implementation fails.

7648f35

PiperOrigin-RevId: 216798034

googlebot added the cla: yes label Oct 24, 2018

tatatodd requested review from reedwm, annarev, goldiegadde and d0k October 24, 2018 23:54

tatatodd self-assigned this Oct 24, 2018

Merge branch 'r1.12' into cherrypicks_DUUUS

1c5cb40

annarev approved these changes Oct 25, 2018

reedwm approved these changes Oct 25, 2018

tatatodd merged commit e72c9eb into tensorflow:r1.12 Oct 25, 2018
