Model not deterministic, even though os.environ['TF_DETERMINISTIC_OPS'] = '1' is set #38197
Comments
I don't see how this error is related to the code. Seems to be a Jupyter notebook kernel issue, no?
@duncanriach Any ideas what could be going wrong here?
Will take a look at this, hopefully today. Feel free to assign it to me, @sanjoy.
I'm now actively working on this issue ... |
Hey @Zethson, I repro'd your issue and found a solution. To get determinism, you need to do the following: In both calls to
Also, note that the code given in the original comment is almost the same as what's provided for Custom training with tf.distribute.Strategy except that everything from
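For readers who want a concrete starting point, here is a minimal sketch of a generic TF 2.x seeding/determinism setup; the seed value and the dataset helper are placeholders, and this is not necessarily the exact change being recommended above.

```python
import os
# Must be set before TensorFlow executes any GPU ops.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary placeholder value

# Seed every RNG that can influence training.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# When building input pipelines, give shuffles an explicit seed so the
# example order is reproduced run-to-run.
def make_dataset(images, labels, batch_size=64):
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=len(images), seed=SEED,
                    reshuffle_each_iteration=True)
    return ds.batch(batch_size)
```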
@duncanriach
You're welcome. Yes, with these changes you should see the CPU training become reproducible as well. (Let me know the outcome of that.) The sources of non-determinism that we are addressing here are not related to the ops and therefore not related to which type of processor the ops are running on. Therefore,
@duncanriach Is this to be expected? If yes, what is the reason for this behavior?
To add some numbers:

System 1:
Run 2:
System 2:
Run 2:
Hey @Zethson, from the GPU standpoint, bit-exact reproducibility between two systems is only guaranteed if the hardware-software stack is the same. Any changes in the stack could lead to differences in the way the computation workload is partitioned for (massively) parallel processing. The change in this partitioning will inevitably lead to differences in the accumulation of floating-point rounding errors in the computations. You can learn more about this by watching my GTC talk on the topic.

While a different version of anything in the hardware-software stack (e.g. different CUDA driver versions) could lead to slightly different results, you're most likely to see a difference if the GPU architecture is different, if the cuDNN version is different, or if the TensorFlow version is different. Since you're using the same container, I can infer that you're using the same versions of both cuDNN and TensorFlow on both machines. That leaves the GPU architecture. Does one of these machines contain a Pascal GPU and the other a Volta GPU perhaps? Please share the output from

It's also possible that there are hardware-software differences in the CPU-related stack that are introducing slightly different floating-point rounding errors, but that's less likely due to much less (or no) parallel computation on the CPU.
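For context, one way to capture the relevant parts of the hardware-software stack alongside reported results is something like the following (a sketch; `tf.sysconfig.get_build_info` only exists in newer TensorFlow releases, hence the guard, and `nvidia-smi` must be on the PATH):

```python
import subprocess
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices('GPU'))

# Available in newer TF releases; reports the CUDA/cuDNN versions TF was built against.
build_info = getattr(tf.sysconfig, "get_build_info", None)
if build_info is not None:
    info = build_info()
    print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))

# Driver version and GPU model as seen by the system.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```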
Thank you very much for your detailed response. Yes, the GPU architecture is very likely to be different. I am working on my own laptop (1050M) and a VM (2 K80s).
System 2:
So judging from your answer, I conclude that full reproducibility requires the same GPU architecture. Hence, for an ML model (say, a 'terrorist detection model' or a 'cancer detection model') to be verifiable and reproducible, we would not only need the same code (solved by git) and the same environment (solved by containers), but also the same hardware (solved only as long as that hardware still exists)?

Naive question: would it technically be possible to improve reproducibility (at the cost of training precision) by decreasing the floating-point precision and introducing a more eager rounding procedure?

I am a bioinformatician, and the state of reproducibility of data analysis has improved dramatically with the introduction of https://anaconda.org/bioconda/ and of workflow languages such as Nextflow, which make such packages easy to use. As a result, analyses are not only fully reproducible but also portable, so researchers can easily verify the results, which is very important for the peer-review process. Nevertheless, I am aware of Nvidia's efforts to speed up the very computationally expensive bioinformatics analyses (and support that!), and I fear that we may lose this portability if the very same GPU architecture is required.

If my last two paragraphs are off topic, please tell me and I will remove them and would be happy to move the discussion elsewhere (if you are interested).
I'm happy to discuss this here. I suspect that our discussion may be helpful to others. Thanks for all the additional information.

The GeForce GTX 1050 contains a GPU that is based on the Pascal architecture, and the Tesla K80s contain GPUs that are based on the Kepler architecture. So bit-exact reproducibility is not guaranteed, and in fact is unlikely, between those two machines based solely on the GPU architecture they use. However, a more significant factor is that you're doing multi-GPU training: on your laptop you have only a single GPU (one Pascal), while on the remote machine you have two GPUs (two Keplers). Because of the different number of GPUs, even if all those GPUs were from the same architecture, you would definitely not get bit-exact reproducibility between the laptop and the remote machine.

The reason for this (again) is that the extensive floating-point computations are parallelized by being distributed in different ways on these two systems. This distribution is necessary and inherent in the process of maximally parallelizing (and therefore maximally accelerating) these computations. Computations that involve reducing the partial results from these compute partitions will include slightly different rounding errors depending on the way that the computation was partitioned. In the case of data-parallel multi-GPU (or multi-node) training, there is always going to be a reduction of the partial gradients produced on each of the GPUs (or nodes).
You're on the right track. Theoretically, there are four different possible ways around this that I can think of:

1. Use integers
While floating-point operations are not perfectly associative (rounding errors differ based on the order of operations), integer operations are perfectly associative. Integers (e.g. INT8) can be used for inference, and they often are used because they result in increased performance and a reduced memory footprint. However, integers cannot (currently) easily be used for training because both range and precision are required, especially in the gradients.

2. Use double-precision floating-point (i.e. 64-bit floating-point)
This will reduce the amount of floating-point rounding error that accumulates, but there would still be a difference between GPU architectures and/or numbers of GPUs. This will also reduce performance a lot (at least 4x?) and will at least double the memory footprint. I've never trained a model with 64-bit floats, though, and I don't know whether it's possible in TensorFlow and whether the precision propagates all the way through, including through the back-prop. Based on my experience with TensorFlow's source code, I think it's very unlikely for typical cases.

3. Quantize after training
It's not possible to train all the way through in regular floating-point and then convert to integer or a reduced-precision floating-point format at the end to get (probably reduced-accuracy) between-stack-reproducible training results (i.e. trainable variables), because it's fundamentally not possible to reproducibly quantize away the accumulated error differences. This is a challenging concept to understand or explain in text form, sorry.

4. Final-train on CPU
Another option to think about is to report results from running on a single thread on a CPU. You would do all your development using the massive amount of acceleration provided by GPUs (or other accelerators) and then run once to get values to report. However, it's going to take a long time for that final run. Also, since the exact implementation of the underlying math can change on different CPUs (especially when using MKL), even when only using a single thread, you should still include the CPU architecture that you used along with your results. Someone could run the same container and git repo hashes on a different CPU architecture and theoretically get slightly different results (just as with GPUs).

I imagine that none of the above are feasible for your needs. We now have run-to-run training reproducibility on GPUs in TensorFlow. This is a relatively new achievement, and I, and others, are now working on extending this support. In reality, I think it's going to be totally practical for you to provide bit-exact results for one or more GPU architectures (e.g. Pascal or Volta). In terms of reproducibility, as far as I am aware, this goes way further than the current state of the art in the ML/DL research community. I recommend that you qualify your bit-exact results as being achieved on a given hardware-software version stack, including the type and number of GPUs used. Given the underlying technical constraints, this is a reasonable compromise.

And with all of that said, it's important to remember that in most SGD-DL applications the amount of variance in the final result (e.g. test-set accuracy) is relatively small due to these differences in floating-point rounding-error propagation.
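As an illustration of options 2 and 4 above, the TF 2.x knobs involved look roughly like this (a sketch; as noted above, there is no guarantee that float64 precision propagates through every op in a given model, and the visibility/threading settings must be applied before any ops run):

```python
import tensorflow as tf

# Option 2: request double precision for Keras layers and variables.
tf.keras.backend.set_floatx('float64')

# Option 4: force a single-threaded CPU run for a final "reporting" pass.
tf.config.set_visible_devices([], 'GPU')                  # hide all GPUs from TensorFlow
tf.config.threading.set_intra_op_parallelism_threads(1)   # one thread within each op
tf.config.threading.set_inter_op_parallelism_threads(1)   # no op-level parallelism
```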
A particular git hash running in a particular container image hash will, and should, attain most of the reproducibility demanded by peer review. You can step up the game even further by specifying the GPU architecture that the results were produced on.
@duncanriach
Yes, I was aware. Hence, I restricted the Docker container on my multi-GPU machine to make only one of those two GPUs available. Both do, of course, show up when running

I will now ensure that any of my pipelines will output the CPU and GPU architecture, and I will advocate this whenever appropriate.
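For anyone else who needs to restrict a multi-GPU machine to a single device for a run like this, two common approaches are sketched below (the device index 0 is a placeholder):

```python
import os
# Option A: hide the second GPU from CUDA before TensorFlow starts.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import tensorflow as tf

# Option B: let TensorFlow see all GPUs but only use the first one.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')
```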
Are you aware of any studies related to this? Any 'hard' numbers? It would be nice to be able to have an expected variance between different GPU architectures. Cheers!
Good job.
Great. Thanks.
No, but it's on my roadmap to do this.
Yes. When there is non-determinism, this can result in training randomly and non-reproducibly failing (or not doing as well). Luckily, mini-batch training has the effect of avoiding local minima and finding the global minimum. If there is non-reproducible gradient explosion or disappearance on one of the effectively infinite paths to that global minimum, however, then that can make debugging almost impossible. In training regimes in which there is no negative feedback (or where there is actually positive feedback, as with reinforcement learning), non-determinism will lead to completely different results on every run. Note, and remember, that any system that does not have an end-to-end negative feedback loop can, and often will, amplify small differences in input to produce large differences in output. These concepts apply, of course, to changing the hardware-software stack versioning and thereby potentially changing bit-accurate results, but it's less critical than run-to-run reproducibility (what we call determinism).
Something we plan to do sooner is to characterize the variance due to non-determinism for different model architectures on a given GPU architecture. This variance will have a similar order of magnitude to the variance between GPU architectures for that model architecture (it should be small). The goal of deterministic operation of TensorFlow on GPUs is run-to-run determinism. What this then gives us, as a side-effect, is the ability to characterize the effect of changes to hardware-software stack versioning on model accuracy. Right now, however, there is a lot of work to be done to consolidate and broaden run-to-run reproducibility (in all DL frameworks).
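As a rough illustration of what such a characterization could look like, the sketch below repeats an identical training run and summarizes the spread of the final metric; `train_and_evaluate` is a placeholder for a user-supplied function that builds, trains, and evaluates the model:

```python
import numpy as np

def characterize_variance(train_and_evaluate, n_runs=10):
    """Repeat an identical training run and summarize the spread of the
    final metric (e.g. test accuracy) caused by residual non-determinism."""
    results = np.array([train_and_evaluate() for _ in range(n_runs)])
    return {
        "mean": results.mean(),
        "std": results.std(ddof=1),
        "min": results.min(),
        "max": results.max(),
    }
```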
@duncanriach I am looking forward to reading/hearing about your results on the variance of non-deterministic models. Very interested in this matter.
I consider this issue solved, but we could also keep it open for the bug you mentioned.
You're welcome, @Zethson. It's been a pleasure.
You might want to star or watch https://github.com/NVIDIA/tensorflow-determinism because progress will be reported there first.
Let's keep this current issue open for now. Once I've opened a new issue with minimal repro code for the re-shuffle problem, I'll inform you, and then you can close this current issue.
Update: TensorFlow version 2.3.0 no longer exhibits the non-determinism associated with using
@duncanriach
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Pretty much the MirroredStrategy fmnist example
- TensorFlow installed from (source or binary): tensorflow/tensorflow:2.2.0rc2-gpu-py3
Describe the current behavior
Model is not deterministic/reproducible.
Two runs:
Describe the expected behavior
I expect the model to be reproducible, with the same loss, accuracy, etc.
Standalone code to reproduce the issue
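The reporter's script is not reproduced here; as a stand-in, the following is a minimal sketch of the kind of setup described (a MirroredStrategy Fashion-MNIST run with `TF_DETERMINISTIC_OPS` set and fixed seeds), not the exact code from the report:

```python
import os
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
np.random.seed(0)

# Fashion-MNIST, normalized and with a channel dimension added.
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])

# Two identical runs of this script should report identical losses and
# accuracies when determinism is fully in effect.
model.fit(x_train, y_train, batch_size=64, epochs=2, shuffle=False)
```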
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
I guess this should cover everything?
The code is currently running on a SINGLE GPU, even though I'm planning to run it on several GPUs.