.. _deterministic_training:

======================
Deterministic training
======================
There are a couple of TF operations which have a non-deterministic GPU implementation
(for efficiency reasons), i.e. their result can differ between runs when executed on the GPU.
See also `here <https://www.twosigma.com/insights/article/a-workaround-for-non-determinism-in-tensorflow/>`_.

Non-deterministic ops (see also the sketch after this list):

* ``reduce_mean``, ``reduce_sum`` (`see here <https://github.com/tensorflow/tensorflow/issues/3103>`_);
  or are they deterministic by now? (`see here <https://github.com/tensorflow/tensorflow/issues/2732>`_)
* convolutional ops (via cuDNN) can be non-deterministic (`see here <https://github.com/tensorflow/tensorflow/issues/18096>`_)
* ``BiasAddGrad`` (`see here <https://github.com/tensorflow/tensorflow/issues/22398>`_)
* ...
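
A minimal sketch to check this empirically, assuming a GPU is available and a TF 1.x-style session
(whether the results actually differ depends on the TF version, the op and the hardware):

.. code-block:: python

    import numpy as np
    import tensorflow as tf

    # Run the same reduce_sum over the same data several times
    # and compare the results bit-wise.
    x = np.random.random(10 ** 7).astype("float32")

    with tf.Graph().as_default():
        s = tf.reduce_sum(tf.constant(x))
        with tf.Session() as session:
            results = {float(session.run(s)) for _ in range(10)}

    # With a deterministic implementation, this set has exactly one element.
    print("distinct results:", results)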

``matmul``, on the other hand, is deterministic. From the CUDA documentation:

    By design, all CUBLAS API routines from a given toolkit version
    generate the same bit-wise results at every run
    when executed on GPUs with the same architecture and the same number of SMs.
    However, bit-wise reproducibility is not guaranteed across toolkit versions,
    because the implementation might differ due to some implementation changes.

The config option ``deterministic_train`` controls whether RETURNN should use deterministic ops as far as possible.
So far this means e.g. using ``aggregation_method = tf.AggregationMethod.ADD_N``
instead of ``aggregation_method = tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N`` for the TF optimizer
(see the sketch below).
We plan to extend this by replacing some of the non-deterministic ops by deterministic ones.
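
For illustration, the option is simply set in the config (a Python file), and at the plain TF level the
aggregation-method choice looks roughly as follows. This is a sketch only: the toy loss and optimizer
below are hypothetical placeholders, not RETURNN's actual internals.

.. code-block:: python

    import tensorflow as tf

    # In the RETURNN config, one would set:
    # deterministic_train = True

    # Hypothetical toy model, just to have gradients to aggregate.
    x = tf.Variable(tf.ones((10,)))
    loss = tf.reduce_sum(x * x)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

    # ADD_N sums all gradient contributions in one fixed reduction,
    # while EXPERIMENTAL_ACCUMULATE_N accumulates them incrementally
    # as they become available, which can change the summation order.
    train_op = optimizer.minimize(loss, aggregation_method=tf.AggregationMethod.ADD_N)

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(train_op)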