Backward pass of broadcasting on GPU is non-deterministic #2652
```python
import tensorflow as tf

def run(on_gpu):
    tf.reset_default_graph()
    tf.set_random_seed(42)
    with tf.device('/gpu:0' if on_gpu else '/cpu:0'):
        a = tf.random_normal([16, 16])
        # b is a scalar, so a*b broadcasts and the gradient w.r.t. b
        # is a full reduction over all of a's elements.
        b = tf.get_variable('b', shape=[],
                            initializer=tf.constant_initializer(value=0.0))
        c = a * b
        grad = tf.gradients(c, [b],
                            gate_gradients=tf.train.Optimizer.GATE_GRAPH)
    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    grad_val = sess.run(grad)
    return grad_val

for i in xrange(20):
    print repr(run(on_gpu=True)),
print ''
for i in xrange(20):
    print repr(run(on_gpu=False)),
```
As you can see, the result is consistent across CPU runs but inconsistent across GPU runs.
No doubt a CUDA reduction-order issue, but it would be really nice if we could have deterministic reduction. I am using TF 0.8.0 (self-compiled against cuDNN v5). The cuDNN version is 5005 (not the RC).
Unfortunately, the reduction ops on GPU use asynchronous atomic adds, and are therefore fundamentally nondeterministic for floating point. Making them deterministic would require either tree-structured reductions or integer math, both significantly slower.
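To see why accumulation order matters at all: floating-point addition is not associative, so any reduction whose accumulation order varies run-to-run (as with asynchronous atomic adds) can produce run-to-run differences. A toy Python illustration (not TensorFlow code), including a minimal sketch of the tree-structured alternative mentioned above:

```python
# Floating-point addition is not associative: regrouping the same
# three terms changes the rounding, hence the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

def pairwise_sum(xs):
    """Deterministic tree-structured reduction: the pairing is fixed,
    so the same inputs always round the same way -- at the cost of a
    fixed schedule instead of a free-for-all atomic accumulation."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])
```

The fixed pairing is what buys determinism; the numerical result may still differ from a left-to-right sum, but it is the *same* every run.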
I can leave this open with contributions welcome if you'd like (with an adjusted title), but it'll be a lot of work if someone tries to take it on, and it's unclear how best to make it happen automatically. Even if one added deterministic reductions as an option (either as a separate op or as an attr on the existing ops), we'd need an unpleasant global flag to turn this on when building the backward pass.
Here's the link: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
The shfl_down results are only useful within a single warp. That technique would itself need a second pass to accumulate the per-block results.
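For context, here is a toy sequential emulation of that two-pass idea (the function name and block size are made up for illustration): pass one reduces each fixed-size chunk independently, pass two combines the per-chunk partials in a fixed order, which keeps the result reproducible:

```python
import numpy as np

def two_pass_sum(x, block=32):
    # Pass 1: reduce each fixed-size block to a single partial sum
    # (on a GPU, each thread block would do this independently).
    partials = [float(np.sum(x[i:i + block]))
                for i in range(0, len(x), block)]
    # Pass 2: combine the per-block partials in a fixed left-to-right
    # order, so repeated runs always round identically.
    total = 0.0
    for p in partials:
        total += p
    return total
```

The point is that neither pass relies on the order in which concurrent updates happen to land, unlike an atomic-add accumulation.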
In general, there is no guarantee of determinism on GPU, so we are not sure how much effort we want to spend on this. Even if we fix this particular kernel, other cuDNN kernels are also non-deterministic.
Which op exactly is non-deterministic here? These are the ops in the graph:
Do you expect that
For reference, I tried to run this (with both TF 1.4.1 and TF 1.12.0), and it seems deterministic to me (GTX 980, CUDA 9.1).