Mention that GPU reductions are nondeterministic in docs #2732

Closed
shiviser opened this Issue Jun 8, 2016 · 47 comments


shiviser commented Jun 8, 2016

The problem

I am trying out the MNIST for experts tutorial and I have inconsistent results on the GPU.

What do I mean by inconsistent?

With exactly the same network parameters (and randomness removed: see below), the accuracy is slightly different every time I run the complete train-then-test process.

What have I done to visualize this problem?

For each iteration, I have calculated the differences between the variables (weights, biases) from two independent but identical runs and computed the L1 norm of those differences -

  • [plot: L1 norm of the differences for the first 1000 iterations, in steps of 20]

In a consistent world, these differences should always be zero!

How did I remove randomness in the code?

  • Removed dropout entirely
  • Added a graph-level seed (tf.set_random_seed(1234)). With this, variable initialization and any other graph-level randomness in the code become deterministic.
  • The MNIST for experts tutorial uses this script to download/load the MNIST data. I added numpy.random.seed(3) in DataSet.__init__(self, images, labels, fake_data=False, one_hot=False, dtype=dtypes.float32) in that script to remove randomness during shuffling (line 154, in DataSet.next_batch(self, batch_size, fake_data=False)).
  • config = tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1), which is passed when creating the session as sess = tf.Session(config=config). (A combined sketch of these settings follows below.)
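Putting these together, here is a minimal sketch of the determinism-related setup (assuming a TF 1.x-style API; the model and input pipeline are omitted):

import numpy as np
import tensorflow as tf

np.random.seed(3)          # fixes NumPy-side randomness, e.g. the batch shuffling above
tf.set_random_seed(1234)   # graph-level seed for variable init and other graph RNG

# ... build the model here, with dropout removed ...

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)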

What system am I using?

  • tensorflow 0.8 gpu version (installed via pip)
  • OpenSUSE LEAP 42.1 (x86_64)
  • Cuda Toolkit 7.5
  • CuDNN 4.0
  • Tesla K20c card with Nvidia driver 352.79

yaroslavvb commented Jun 8, 2016

Do you get deterministic results if you run this on the CPU? Results of optimized GPU computations for NN ops are usually a little bit non-deterministic; I think it's in the nature of how modern GPUs work. @zheng-xq may have more understanding of this.

jiaboli007 commented Jun 8, 2016

I have the same observation on both GPU and CPU. I think whenever parallel computing is used (either a multi-core CPU or a GPU), the results will not be deterministic. This is due to the random order in which the partial results from all threads are collected, which can lead to a very small difference (within machine accuracy) at the beginning; that tiny difference then gets amplified over the iterations.

shiviser commented Jun 8, 2016

@yaroslavvb Yes I do get deterministic results on the CPU.

The problem is that in a bigger network I have designed for another dataset, the inconsistencies are quite large (up to +/-25% around the mean).

shiviser commented Jun 8, 2016

@jiaboli007 I have seen that it is possible to have parallel computation and still get deterministic behaviour (e.g. MATLAB Simulink can optimize models for parallel execution while still assuring deterministic behaviour).

zheng-xq commented Jun 8, 2016

On GPU, a small amount of non-determinism is expected. TensorFlow uses the Eigen library, which uses CUDA atomics to implement reduction operations such as tf.reduce_sum. Those operations are non-deterministic, and each one can introduce a small difference. If your model is not stable, the differences can accumulate into large errors after many steps.

If you see a large difference after only one or two operations, that would be problematic. Otherwise, it is somewhat expected. Regularizers such as dropout help the model tolerate it.
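One way to probe this directly is to evaluate the same GPU reduction several times and compare the raw bits of the results. A minimal sketch, assuming a TF 1.x-style API:

import numpy as np
import tensorflow as tf

data = np.random.RandomState(0).randn(1 << 20).astype(np.float32)

with tf.device('/gpu:0'):
    total = tf.reduce_sum(tf.constant(data))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # More than one unique bit pattern means the reduction was not bitwise reproducible.
    results = {sess.run(total).tobytes() for _ in range(10)}

print('bitwise identical across runs:', len(results) == 1)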

girving commented Jun 8, 2016

This is expected behavior: see #2652 for more discussion.

@girving girving closed this Jun 8, 2016

shiviser commented Jun 9, 2016

@zheng-xq I have already tried dropout; it didn't help. After a bit of research on the internet, I found that Torch had similar non-deterministic behaviour with its SpatialMaxPooling operation, but it has since been fixed. Perhaps something along those lines?

@girving Also, in the same post one can read that Caffe has already accomplished deterministic backward passes. Given that, and that this non-deterministic behaviour is neither fixed nor mentioned in the documentation yet, I would like to reopen the issue.

@girving girving changed the title Tensorflow GPU inconsistent results Mention that GPU reductions are nondeterministic in docs Jun 9, 2016

@girving girving reopened this Jun 9, 2016

girving commented Jun 9, 2016

Reopened: We'd be happy to accept a PR adding a note to this effect.

shiviser commented Jun 9, 2016

@girving Thank you for reopening the issue, but I believe TensorFlow could do better than just documenting what seems to be unintended non-deterministic behaviour. Also, I would have loved to work on the PR, but it would be too much of a digression from my work at the moment.

P.S.: I don't know if you saw my last edit: in the same post one can read that Caffe has already accomplished deterministic backward passes. Perhaps this is interesting to you!

shiviser commented Jun 9, 2016

@girving Also, we are not yet sure whether it is actually the GPU reductions, and not something else, that is causing the non-deterministic behaviour; this needs verification before it goes into the docs.

girving commented Jun 9, 2016

@shiviser Let us know what you find out!

zheng-xq commented Jun 9, 2016

@shiviser, that particular kernel was fixed to be deterministic, and I believe TensorFlow picked up the same fix. However, other cuDNN convolution algorithms are still non-deterministic, since they rely on atomics. Most frameworks can encounter them, depending on the input and kernel shapes.

That being said, if your investigation reveals something else that is causing the problem, a PR addressing the docs and the code is welcome.

jkschin commented May 12, 2017

BUILD

  • TensorFlow GPU 1.1.0
  • Ubuntu 16.04
  • CUDA 8.0.61
  • cuDNN 5.1
  • NVIDIA drivers 375
  • 2 GTX 1080 Ti cards, with one of them used for 2 monitors. I set CUDA_VISIBLE_DEVICES=0 for all experiments, which uses the GPU with nothing attached.

@zheng-xq why is the forward pass deterministic but the very first backward pass not deterministic?

On a simple MNIST example, the logits are 100% deterministic over 100 runs (I did more runs than that but plotted 100).

[plot: logits across 100 runs]

However, when we look at the gradients computed on the last set of variables, we see small errors on the order of 1e-8.

[3D plot: gradient differences across runs]

I proceeded to do a simpler example of simply trying to train a neural network to add.

Inputs: 10x1 (numpy generated with same seed)
Weights: 10x1 (random normal initialized with same seed)
Labels: Sum of the 10x1 input.

Effectively, the neural network is trying to tune the weights such that they all become 1.0.

In fact, it's strange because the randomness has some form of determinism, as seen in the graphs below.

[plots: runs 1, 5, 7, and 11]

Furthermore, the anomalous gradients are consistently the same throughout the runs!

[screenshot: the anomalous gradients across runs]

And they have the exact same error.

[screenshot: the anomalous gradients showing exactly the same error]

@asimshankar mentioned here that mismatches between CUDA and the drivers could be the problem. So I upgraded my driver to 381 and got the results below: 1–7 gradients have errors, whereas previously only 5 had errors, and when errors happened those 5 gradients had exactly the same values.

[plot: gradient errors after upgrading the driver to 381]

I haven't looked deeply into the exact 7 gradients that have errors, but a brief glance showed that they include the same 5 as before, plus more.

Could it be the GPU reduction order? Or could there be errors in the computation of the gradients themselves?
If anyone has insight into what experiments to try next, I'll be glad to run them and post the results here.

yaroslavvb commented May 12, 2017

You could try a simple feed-forward network with squared loss. Training can be implemented with only matmul and reduce_sum; if there's non-determinism there, then pretty much everything is potentially non-deterministic. Addition of floating-point numbers is not associative, so if the order of summing things together in matmul/reduce_sum changes, that can affect the results. Typically this is not considered a bug as long as the relative error of the results stays within machine epsilon (about 1e-7 for float32). Note that multiplication changes the scale of the absolute error, not the relative error: if you have an initial error of 1e-7 in a result and multiply that result by 10^6, the absolute error blows up, but the relative error stays the same.
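The non-associativity point is easy to demonstrate outside TensorFlow; here is a small NumPy sketch (purely illustrative, not tied to any particular op):

import numpy as np

x = np.random.RandomState(0).randn(100000).astype(np.float32)

def accumulate(values):
    # Naive sequential float32 accumulation, mimicking one fixed summation order.
    s = np.float32(0.0)
    for v in values:
        s = np.float32(s + v)
    return s

forward = accumulate(x)
backward = accumulate(x[::-1])
print(forward, backward)  # usually differ in the last few bits
print('relative difference:', abs(forward - backward) / abs(forward))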

jkschin commented May 12, 2017

@yaroslavvb I'll try that out and post some results here.
@girving any other things to try so we can include in the docs? I'd be happy to do a PR for this if it's within my capabilities.

girving commented May 12, 2017

@jkschin Not sure what you mean by other things. If the PR is just adding a note that reductions are nondeterministic, keeping it small seems good.

More broadly, I'd love a determinism push to make everything we can actually deterministic, but that will take more work.

jkschin commented May 15, 2017

@yaroslavvb I realized I'm already doing a simple feed-forward network above.

import tensorflow as tf

# `v` is the batch of input vectors; `vector_size` is their length (10 above).
def add_model(v):
    w = tf.get_variable(name='w', shape=[vector_size],
            dtype=tf.float32,
            initializer=tf.random_normal_initializer)
    # Elementwise product with the weights, then a sum over the vector axis.
    output = tf.reduce_sum(tf.multiply(w, v), axis=1)
    tf.add_to_collection('weights', w)
    return output
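For context, here is a hedged sketch of how one might drive this model to check gradient determinism between two consecutive runs (assuming a TF 1.x API; vector_size, the batch size, and the seeds are illustrative choices, not the exact values from the experiment above):

import numpy as np
import tensorflow as tf

vector_size = 10
np.random.seed(0)
v_data = np.random.normal(size=[32, vector_size]).astype(np.float32)
y_data = v_data.sum(axis=1)  # labels: sum of each input vector

tf.set_random_seed(0)
v = tf.placeholder(tf.float32, [None, vector_size])
y = tf.placeholder(tf.float32, [None])
out = add_model(v)
loss = tf.reduce_mean(tf.square(out - y))
grads = tf.gradients(loss, tf.get_collection('weights'))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    g1 = sess.run(grads, {v: v_data, y: y_data})
    g2 = sess.run(grads, {v: v_data, y: y_data})
    print('gradients bitwise equal:',
          all(np.array_equal(a, b) for a, b in zip(g1, g2)))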

@girving yeah, you mentioned adding a note to this effect. Where should it be added? Agreed on a determinism push!

yaroslavvb commented May 15, 2017

@jkschin as an end user who got tripped up by non-determinism, you might have an idea where this note should go, so that other people like you would see it.

jkschin commented May 16, 2017

Would a post here titled "Non-determinism in TensorFlow" with some example code help?

girving commented May 16, 2017

Yes! @martinwicke: is that a good place for an overview of TensorFlow nondeterminism?

martinwicke commented May 16, 2017

I think it should go into the programmer's guide section, but @wolffg has a better overview of where to put this.

Either way, the content would be the same, and we'd be enthusiastic about it.

benjrossi commented May 27, 2017

Check out a workaround for training a simple fully-connected net with repeatable results on the GPU. It's no panacea, but it seems relevant to the discussion! One caveat is that the reduce_sum replacement in the workaround does not handle partial reduction (i.e. it does not accept an axis argument). This is more of a proof of concept, but it shows it can be done. Perhaps a slower but deterministic reduce_sum backend that does not rely on CUDA atomics could be added to TensorFlow for the GPU, with an option on Session to choose between the deterministic and fast backends (though I haven't studied the TF architecture enough to know whether that is feasible).
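For illustration, one shape such a full-reduction replacement could take is a matmul against a ones vector, which avoids the atomic-based reduction path. This is a sketch under the assumption that a plain matmul is reproducible on the GPU for a fixed shape, and it is not necessarily the approach used in the linked workaround:

import tensorflow as tf

def full_reduce_sum(x):
    # Full reduction only (no axis argument), matching the caveat above.
    flat = tf.reshape(x, [1, -1])
    ones = tf.ones(tf.stack([tf.size(x), 1]), dtype=x.dtype)
    return tf.reshape(tf.matmul(flat, ones), [])

# Example: full_reduce_sum(tf.constant([[1., 2.], [3., 4.]])) evaluates to 10.0.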

rahuljha commented Aug 30, 2017

The link from benjrossi doesn't work anymore; here is an updated link to the same page: https://www.twosigma.com/insights/a-workaround-for-non-determinism-in-tensorflow.

ekelsen commented Sep 14, 2017

Reductions are now deterministic.

@ekelsen ekelsen closed this Sep 14, 2017

huanghoujing commented Sep 16, 2017

Although I was not using TensorFlow, I think my experience may be useful in some cases.

I found that multi-threaded pre-fetching of training samples also introduces randomness: with multiple threads, each run puts the samples into the queue in a different order, determined by the relative speed of the threads. I had to set the number of pre-fetching threads to 1 to solve the problem.

BTW, I implemented the multi-threading in my own way using the threading package (a sketch of the single-thread approach follows below).
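To illustrate the single-thread fix described above, here is a minimal sketch using plain Python threading (no TensorFlow involved; the sample data is a stand-in):

import threading
import queue

def prefetch(samples, q):
    # With a single producer thread, the enqueue order equals the dataset order,
    # so every run sees the samples in the same sequence.
    for s in samples:
        q.put(s)
    q.put(None)  # sentinel marking the end of the data

samples = list(range(10))  # stand-in for real training samples
q = queue.Queue(maxsize=4)
threading.Thread(target=prefetch, args=(samples, q), daemon=True).start()

while True:
    item = q.get()
    if item is None:
        break
    print(item)  # consume `item` for training here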

wwwyn commented Sep 27, 2017

@ekelsen which tensorflow version is deterministic?

ekelsen commented Sep 27, 2017

It's been in head for a few weeks now.

ekelsen commented Sep 27, 2017

So currently only master has deterministic GPU reductions; the next release, 1.4, should have them.

jaekyeom commented Jan 10, 2018

@ekelsen Could you please elaborate on how it went?

ranshadmi commented Feb 12, 2018

As far as I can tell, TF 1.5 still shows non-deterministic results on GPU (and deterministic results on CPU). However, I'm using reduce_mean, not reduce_sum; I assume it is the same problem.
I would love to know if there are concrete plans to improve the reproducibility of training.

lucb commented Feb 19, 2018

From my own experience, non-determinism can be found in the backward pass of the convolution filter and in the computation of the softmax cross-entropy loss function.

The convolution issue can be solved by using a deterministic cuDNN algorithm instead of letting TensorFlow decide the "best" algorithm to use.

yaroslavvb commented Feb 19, 2018

Perhaps export TF_CUDNN_USE_AUTOTUNE=0
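If setting that from Python rather than the shell, a small sketch (assumption: the variable should be set early, before TensorFlow starts using cuDNN; note that this removes autotuning as a source of run-to-run variation but does not by itself guarantee deterministic convolutions):

import os
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'  # set before building/running the graph

import tensorflow as tf
# ... build and run the model as usual ...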

nikonikolov commented Apr 25, 2018

As far as I can tell, TF 1.6 also shows non-deterministic results on GPU. Has this been fixed on the master branch or in a later release? If not, are there any plans to "fix" it? Even if not, in the meantime it seems very necessary to at least note this in the official documentation. Sorry if it is already there, but I was not able to find it (which suggests others might not be able to either).

martinwicke commented Apr 25, 2018

@ekelsen is this still expected?

It's fine for these to be non-deterministic, but we should mention it in the docs if so.

@martinwicke martinwicke reopened this Apr 25, 2018

jkschin commented Apr 25, 2018

Happy to re-open this PR and refine it: #10636. Thoughts @martinwicke?

martinwicke commented May 4, 2018

Sounds good. Thank you!

jkschin commented May 7, 2018

#10636 is outdated so I won't be refining it.

It seems like both tf.reduce_sum and tf.reduce_mean are deterministic now. @nikonikolov do you have a small example to reproduce the non-deterministic behaviour?

nikonikolov commented May 7, 2018

@jkschin Unfortunately I do not have a simple one. Are all operators deterministic on GPU now? If they are supposed to be, I can dig into it and compose an example. I am using many more operators than tf.reduce_sum and tf.reduce_mean, so there is a chance another operator is causing the non-deterministic behavior.

Also, which version exactly has the deterministic behavior?

nikonikolov commented May 14, 2018

@jkschin Assuming that tf.reduce_sum and tf.reduce_mean are now deterministic, are all gradient-based operations deterministic too (assuming the same data is passed to the network)? I am still experiencing the problem in TF 1.8, and the non-determinism appears after a gradient step is taken. I can try to provide an example, but I first wanted to make sure determinism is expected for gradients.

martinwicke commented May 14, 2018

The gradient pass often contains ops that are rare in the forward pass. Is there a way to narrow this down further? There may be a reduction or something else which is non-deterministic.

We probably cannot claim that all of TF is deterministic yet.

nikonikolov commented May 14, 2018

Hey, so here is a relatively simple example to reproduce. On CPU the results were the same across runs; on GPU they were not. When I decreased the number of iterations to 1 or 2, I got identical results every time (although I did not do a lot of runs). When the number of iterations is higher, say >10, I almost always get different results. Also, with only one dense layer the results are almost always reproducible.

import numpy as np
import tensorflow as tf

ITERATIONS=20

tf.set_random_seed(42)
np.random.seed(42)

x_data = np.random.normal(size=[32, 10])
y_data = np.random.normal(size=[32, 1])
x_test = np.random.normal(size=[32, 10])

x_in  = tf.placeholder(tf.float32, [None, 10])
y_in  = tf.placeholder(tf.float32, [None, 1])
x     = x_in
x     = tf.layers.dense(x, 200, tf.nn.relu)
x     = tf.layers.dense(x, 1, tf.nn.relu)
loss  = tf.losses.mean_squared_error(y_in, x)

mvars = tf.get_default_graph().get_collection(tf.GraphKeys.GLOBAL_VARIABLES)

opt   = tf.train.AdamOptimizer(use_locking=True)
train = opt.minimize(loss)
config= tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1)
sess  = tf.Session(config=config)

allvars = tf.get_default_graph().get_collection(tf.GraphKeys.GLOBAL_VARIABLES)

sess.run(tf.global_variables_initializer())
init_vals = sess.run(allvars)

def run():
  for val, v in zip(init_vals, allvars):
    sess.run(tf.assign(v, val))
  
  ivals = sess.run(allvars)
  out = []
  allvals = []

  for i in range(ITERATIONS):
    l, _ = sess.run([loss, train], feed_dict={x_in: x_data, y_in: y_data})
    out.append(sess.run(x, feed_dict={x_in: x_test}))
    allvals.append(sess.run(allvars))

  fvals = sess.run(allvars)
  # return np.asarray(ivals), np.asarray(fvals), np.asarray(out)
  return np.asarray(ivals), np.asarray(fvals), np.asarray(out), allvals

ivals1, fvals1, out1, all1 = run()
ivals2, fvals2, out2, all2 = run()

same_init = [np.all(v1 == v2) for v1, v2 in zip(ivals1, ivals2)] 
same_fin = [np.all(v1 == v2) for v1, v2 in zip(fvals1, fvals2)] 
print("Forward passes were the same: {}".format( np.all(out1 == out2) ))
print("Final value of variables are the same: {}".format( np.all(same_fin) ))
print("Variables initialized to same values: {}".format( np.all(same_init) ))

Unfortunately I am really busy with experiments at the moment and do not have the time to narrow this down further, but I hope it will be a good starting point. I tested with Python 3.6.5, CUDA 9.0, cuDNN 7.1, and TF 1.8.

It would be great if we manage to make all operations deterministic. I am running some reinforcement learning algorithms, and over time I get a huge difference in performance even when I use the same seed.
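If someone picks this up, a small follow-up sketch to the script above can localize the first iteration at which the two runs diverge, using the per-iteration snapshots (all1, all2) it already collects:

# Report the first iteration at which any variable differs between the two runs.
for i, (vars1, vars2) in enumerate(zip(all1, all2)):
    if not all(np.array_equal(v1, v2) for v1, v2 in zip(vars1, vars2)):
        print('first divergence at iteration', i)
        break
else:
    print('no divergence over', ITERATIONS, 'iterations')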

tensorflowbutler commented Aug 31, 2018

Please remove the assignee, as this issue is inviting external contributions. Otherwise, remove the contributions welcome label. Thank you.

1 similar comment from tensorflowbutler on Sep 15, 2018

ekelsen commented Sep 18, 2018

GPU reductions are now deterministic.

albertz commented Nov 29, 2018

@ekelsen Which ops are deterministic exactly? Since when / which TF version? Is this documented?
