Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
This follows a similar approach to the earlier work by Derek Murray
(change ID I9058c55898079cb3ab6011513b4468ddb293f59f) and allows using atomic operations on the pending counts. It extends that work to also apply to graphs that include control/merge nodes (eg the 'wide_deep' model in the tf models repo) The approach here is to note that even in a graph that has control/merge nodes, it's likely that many nodes do not have any merge/control outputs. There's already a fast path for this case with simpler logic, and that simple logic has the advantage of only accessing the pending count in one spot where it decrements the incoming pending values. Only the last such decrementer propagates the output and activates the node. Given this, we can use atomic operations for all such "fast path" nodes and avoid holding an exclusive lock. We fall back to the exclusive lock for any nodes that have slow-path outputs. As a conservative measure, this patch acquires the shared lock for the fast path, though it may not be necessary. The other bit in this patch is the management of the 'outstanding_ops' member. In order to allow concurrent completion of ops without the exclusive lock, this also becomes atomic. Simply making it atomic is sufficient but leaves a lot of performance on the table. This patch batches the updates of this variable so that each op completion only touches it at most once, and no-ops out in the case that the pending op count doesn't need to be modified (eg when an op completion activates exactly one downstream node). I benchmarked this change using tensorflow-models/official/r1/wide_deep and collecting the distribution of "steps/sec" reported using commands like: # switch to new code $ python -u census_main.py -ebe=20 -te 20 -mt wide 2>&1 | tee /tmp/before/wide.txt # switch to new code $ python -u census_main.py -ebe=20 -te 20 -mt wide 2>&1 | tee /tmp/after/wide.txt ... for the 'wide', 'wide_deep', and 'deep' models. I then exported the 'steps/second' metrics to TSV with: $ grep 'tensorflow:global_step/sec' \ /tmp/{before,after}/*.txt | perl -p -e \ 's,/tmp/(.+)/(.+).txt:.*?([\d\.]+)$,$1 $2 $3,g' > /tmp/data.tsv Mean steps/sec are as follows on my AMD Ryzen 9 3900X desktop (12 physical cores, 'performance' CPU governor enabled): Before: model steps/sec ---------------------- deep 876 wide 776 wide_deep 625 After: model steps/sec improvement ----------------------------------- deep 897 (+2.5%) wide 928 (+19.5%) wide_deep 760 (+21.6%) A few notes worth considering: Using atomic operations has some fixed overhead compared to non-atomics, and particularly when the cache line being operated on is in M "modified" state in another core. This increased latency of operating on a cache-line in remote M state (aka "HitM" access) is also true with non-atomic operations, but non-atomics can potentially allow for instruction level parallelism and pipeline multiple such accesses together, whereas atomics do not on current x86 architectures[1]. So, there is possibly a cross-over point on some graph shapes where the mutex-based synchronization with non-atomic operations actually beats the atomic-based implementation here. That said, the current code path for the non-atomic (mutex-protected) case is complex/branchy enough that it's not clear that any significant pipelining of reads can actually occur without some more explicit unrolling, etc. It's also worth noting that, in uncontended cases, the atomics will be slower than the non-atomics. However, this is unlikely an issue in real workloads, considering that these code paths are only uncontended when the actual work of op execution dominates the runtime. In other words, this is only a perf bottleneck when it's under contention, and a small regression for uncontended cases is likely in the noise. [1] https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf PiperOrigin-RevId: 331889409 Change-Id: I8763d09f060f5d550e596d63f037f95b5ff48e64
- Loading branch information
1 parent
a873a3e
commit a97b0c8
Showing
4 changed files
with
365 additions
and
114 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.