Concurrency Semantics For TensorFlow Resource Variables
|Author(s)||Sanjoy Das (email@example.com)|
|Sponsor||Alexandre Passos (firstname.lastname@example.org)|
This document aims to assign semantics to concurrent accesses to resource variables in TensorFlow. Concurrent accesses refer to resource variable loads and stores that are not totally ordered by control or data dependencies.
The semantics presented in this doc will probably be unsurprising to people familiar with TensorFlow. However a concrete written down semantics is a useful foundation for writing transformations and analyses over TensorFlow graphs.
We do not discuss ref variables, which are deprecated.
We define an operation instance as a
(node, frame) tuple, where
node is a node in the TensorFlow graph and
frame is a frame (see Dynamic control flow in large-scale machine learning) in which the node was executed.
The operation instances in an execution of a graph are put in a total order such that:
- The total order is consistent with the partial order imposed by the control and data dependencies within the graph. That is, if
(n1, f1)depends on
(n0, f0)then in the total order
(n1, f1)will be ordered after
- The total order respects critical sections. That is, if two nodes,
n1are in the same critical section, then the following total order is illegal: [(
f1)]. On the other hand, the following total order is fine: [(
- If there are
Ngraphs running in the same TensorFlow session, with source nodes
sub_source_iand sink nodes
[0, N)) then the total order is consistent with a “supergraph” with a source node that has control edges to all
sub_source_iand a sink node that has control edges from all
- The total order is consistent with blocking semantics of TensorFlow ops. For example if an instance of a
QueueDequeueV2can complete only after a matching
QueueEnqueueV2has completed then the total order cannot put the
QueueDequeueV2instance before the
Resource variable reads and writes follow sequential semantics as if they were executed in this total order. The writes and reads themselves are atomic -- a read never sees a partial write, even when it is racing with the write.
This total order is not fixed run to run and can vary depending on various external factors like operating system thread scheduling, hardware behavior etc.
Operations that read and write resource variables at the same time, like
ResourceApplyGradientDescent, are modeled by decomposing them into a read, compute and write sequence which does not execute atomically as a unit. This behavior is weaker than what the executor implements and this weakening simplifies XLA’s auto-clustering.
Examples & Consequences
Simplest Non-Trivial Example
The graph above does two resource writes and two resource reads. Assuming that
0 to start with, control dependencies on the graph guarantee that if
1. That is, the result
r0 == 2 && r1 == 0 is disallowed.
Note that this means stores that are constrained to run in a specific order via control or data dependencies cannot be reordered.
Resource Update Operations
Resource update operations, like
ResourceApplyGradientDescent, are modeled as a read, compute, write sequence. For simplicity we invent a resource update operation,
*X+=1, that increments the tensor stored in the resource variable
1. With this framing, two concurrent
*X+=1 operations can be modeled as the graph below. All total orders that are consistent with the partial order in the graph are valid.
This behavior is weaker than the memory model actually implemented by the graph executor. Since
ResourceApplyGradientDescent et. al. lock out the resource variable being updated for the entire duration of their execution the graph executor guarantees that the two increments happen atomically and
*X is always
2 after the graph execution.
Just like two stores ordered via control dependence cannot be reordered, a load and a store ordered via control edges cannot be reordered either. For instance, given the semantics in this document, when the execution of graph above ends if
r0 must be
5. This invariant breaks if we allow the
r0 = *Y to happen before
*X = 1.
This invariant is relevant for XLA; it means wrapping an arbitrary TensorFlow graph in an
_XlaRun sequence or an
XlaLaunch op is generally incorrect -- these XLA ops hoist all the resource variable reads to the beginning of the computation and sink all the writes to the end of the computation which may violate load/store ordering present in the TensorFlow graph.