[determinism] Add non-sparse softmax/xent GPU-determinism #49178
Conversation
Force-pushed e6c4756 to abfd464.
"2-dimensional, or broadcasted to be " | ||
"2-dimensional")); | ||
|
||
if (std::is_same<Device, GPUDevice>::value) { |
Technically, this op is still nondeterministic. It's still likely used in practice with the non-Python APIs (e.g. with the C API or C++ API). So I would keep the error message, but add a comment stating the Python API does not use this op when determinism is enabled.
I don't think we ever have to bother fixing the op though. C++ users can always manually use the nonfused version as well.
Good point. The next commit will put the exception-throwing back (with your suggested comment), as well as the tests (but running on the gen_nn_ops version).
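For context, here is a minimal sketch (editorial illustration, not code from this PR) of the two call paths being discussed: the public nn_ops wrapper versus the generated gen_nn_ops raw op. The tensor values are arbitrary, and the keyword names reflect the usual signatures of these functions.

```python
import tensorflow as tf
from tensorflow.python.ops import gen_nn_ops

labels = tf.constant([[0.0, 1.0], [1.0, 0.0]])
logits = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Public Python API (nn_ops version): under determinism, this path can be
# routed to a nonfused, deterministic composition of ops.
loss_public = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits)

# Generated raw op (gen_nn_ops version): the fused kernel, which returns
# both the per-example loss and the backprop tensor.
loss_raw, backprop = gen_nn_ops.softmax_cross_entropy_with_logits(
    features=logits, labels=labels)
```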
    deterministic_ops = os.getenv('TF_DETERMINISTIC_OPS', '0')
    return deterministic_ops == '1' or deterministic_ops == 'true'

  def _coreOp(self, features, labels):
I think instead of having a _coreOp function, we should just test the nn_ops version instead of the gen_nn_ops version for most tests. This avoids the awkwardness of having to create a Python implementation that has the same interface as the C++ op in XentDeterministicTest.
For test methods that are specific to the gen_nn_ops version, like testRankTooLarge, you can define an XentOpTest class subclassing XentOpTestBase and put them there. That way, you don't have to override methods in XentDeterministicTest to do nothing.
Once we have a flag to enable determinism instead of the environment variable, we can have just a single test file and a testWithDeterminismOnAndOff test decorator, which will simplify things. But we're not quite there yet.
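To make the proposed organization concrete, here is a hypothetical sketch of the class layout. The class names follow the identifiers used in this thread, but the method bodies and the _loss hook are placeholders, not the PR's actual code.

```python
from tensorflow.python.ops import nn_ops
from tensorflow.python.platform import test


class XentOpTestBase(test.TestCase):
  """Shared test cases, written against the public nn_ops version."""

  def _loss(self, labels, logits):
    # Overridable hook (hypothetical); the base class exercises nn_ops.
    return nn_ops.softmax_cross_entropy_with_logits(labels=labels,
                                                    logits=logits)

  def testForward(self):
    pass  # shared test cases go here


class XentOpTest(XentOpTestBase):
  """Adds tests specific to the raw gen_nn_ops op, e.g. testRankTooLarge."""


class XentDeterministicTest(XentOpTestBase):
  """Re-runs the shared cases with determinism enabled."""
```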
> I think instead of having a _coreOp function, we should just test the nn_ops version instead of the gen_nn_ops version for most tests. This avoids the awkwardness of having to create a Python implementation that has the same interface as the C++ op in XentDeterministicTest.
Sounds good. I was trying to minimize changes and be done quicker. What you propose is a good technical debt minimization step.
> For test methods that are specific to the gen_nn_ops version, like testRankTooLarge, you can define an XentOpTest class subclassing XentOpTestBase and put them there. That way, you don't have to override methods in XentDeterministicTest to do nothing.
Yep. Sounds good.
> Once we have a flag to enable determinism instead of the environment variable, we can have just a single test file and a testWithDeterminismOnAndOff test decorator, which will simplify things. But we're not quite there yet.
Absolutely. The (temporary) environment variable implementation makes organization of test files and classes unusually cumbersome. Looking forward to the ensuing tidiness.
  def _testXentWrapper(self, np_features, np_labels, dim=-1, use_gpu=False):
    np_loss, _ = self._npXent(np_features, np_labels, dim=dim)
    with self.cached_session(), test_util.device(use_gpu):
      # Even in eager mode, the above line will be able to pin ops to CPU.
What does this mean? What is wrong with passing use_gpu to cached_session, as was done before?
Maybe I'm mistaken, but from my experience of writing, running, and checking tests (and from looking at the test_util code), my understanding is that if you pass use_gpu=False into self.session() or self.cached_session() and run it on a GPU in eager mode, the ops will not be pinned to the CPU. This is because self.session() and self.cached_session() are only relevant in graph mode (they do nothing in eager mode). When we write test code that can run in either graph or eager mode, there needs to be both a self.session() or self.cached_session() context and a separate, nested test_util.device() context.
The existing code currently runs only in eager mode and, as written, runs each case twice on the GPU when a GPU is present (clearly not the test writer's intention). The change in this PR runs each case on both CPU and GPU when a GPU is present, and does not make the test any less compatible with graph mode (in case we want to run it in graph mode).
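As an illustration of that pattern (a sketch, not the PR's test code), the nested contexts look like this: the session context is effectively a no-op in eager mode, and the inner test_util.device() context does the actual pinning in both modes.

```python
from tensorflow.python.framework import test_util
from tensorflow.python.ops import math_ops
from tensorflow.python.platform import test


class DevicePinningExample(test.TestCase):

  @test_util.run_in_graph_and_eager_modes
  def testRunsOnRequestedDevice(self):
    for use_gpu in [False, True]:
      # cached_session() matters only in graph mode; device() pins ops to
      # CPU or GPU (when present) in both graph and eager mode.
      with self.cached_session(), test_util.device(use_gpu=use_gpu):
        result = self.evaluate(math_ops.add(2.0, 3.0))
        self.assertEqual(result, 5.0)
```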
The next commit will remove the self.cached_session() and the comment on the next line. None of that is needed.
    self.assertAllEqual(result_a, result_b)

  def _testBackward(self, labels_not_logits=False):
    for use_gpu in [False, True]:
Our CI runs tests both with and without a GPU, so explicitly running on the CPU is not necessary.
I'm aware that all the tests get run on machines with and without GPUs and that, therefore, the tests can be written so that each case will run on a GPU if a GPU is present and on a CPU if a GPU is not present. The reason I write tests like this, and (I presume) why others do (including the existing op test), is that it simplifies the development process: I can develop for both CPU and GPU on one machine with one build-test container.
Should I change my development practice to open both a GPU container and a non-GPU container on my machine, with the source code mounted into both, and then execute the build/run in both containers in parallel? The development procedure is already very cumbersome and slow; this would double the amount of build/re-build time and things to keep track of.
The next commit will remove this loop. It makes sense to remove it after development, when it is no longer needed.
The next commit also removes the extra loops in the pre-existing test code that ran each case on both the CPU and GPU on a machine with a GPU (and that ran the same cases twice on machines without GPUs).
You can run with CUDA_VISIBLE_DEVICES= (setting the env var to an empty string) if you want to test without a GPU. Admittedly, this is a pain, since you cannot run both with and without the GPU in a single bazel test command.
/CC @sanjoy: would it be feasible to have the non-GPU versions of tests run with CUDA_VISIBLE_DEVICES= automatically?
    labels, logits = self._generateInputs(dtype, seed=456)
    output_shape = labels.shape[0]

    def gradients(seed=789):
No reason for seed to have a default value here.
Agreed. It will be removed by the next commit.
      seed = 456 + trial
      labels_grad_a, logits_grad_a = gradients(seed=seed)
      labels_grad_b, logits_grad_b = gradients(seed=seed)
      if labels_not_logits:
Instead of having this parameter, I would just assert on both labels and logits.
Right. That's how this test started out. It evolved so that each op-path is covered independently (with code re-use, but less efficient use of compute cycles). For one thing, this made it easy for me to confirm (without modifying the tests) that the tests will catch nondeterminism in each path independently. With these tests, for example, I disabled the determinism solution and confirmed that both testBackwardLogits and testBackwardLabels failed. I believe that tests quit on the first assert, which makes it impossible to fully validate their intention without modifying them (albeit just by adding or removing a #).
I'm considering simplifying this.
The next commit will remove this complexity.
If you want to test multiple things independently (allowing the second to run even if the first fails), consider using self.subTest. I don't think it's necessary in this case, though.
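Presumably this refers to unittest's standard subTest context manager, which TensorFlow test cases inherit. A minimal standalone sketch with made-up values (an editorial illustration, not the PR's code):

```python
import unittest


class SubTestExample(unittest.TestCase):

  def testBothGradients(self):
    labels_grad_a, labels_grad_b = [1.0], [1.0]
    logits_grad_a, logits_grad_b = [2.0], [2.0]
    # Each subTest is reported independently, so a failure on the labels
    # gradient does not prevent the logits gradient from being checked.
    with self.subTest(tensor='labels'):
      self.assertEqual(labels_grad_a, labels_grad_b)
    with self.subTest(tensor='logits'):
      self.assertEqual(logits_grad_a, logits_grad_b)


if __name__ == '__main__':
  unittest.main()
```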
class XentOpDeterminismExceptionsTest(test.TestCase):
  """Test d9m-unimplemented exceptions from SoftmaxCrossEntropyWithLogits.

  Test that tf.errors.UnimplementedError is thrown or not thrown, as
This issue isn't introduced by this PR, but it seems this test only tests that UnimplementedError is thrown. It does not test that UnimplementedError is not thrown, contrary to the comment "...or not thrown...".
Nice catch. This comment will be adjusted by the next commit.
  def _opDeterminismEnabled(self):
    deterministic_ops = os.getenv('TF_DETERMINISTIC_OPS', '0')
    return deterministic_ops == '1' or deterministic_ops == 'true'
Change to deterministic_ops in ('1', 'true')
Lovely simplification. Thank you. This will be included in the next commit.
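For reference, a standalone sketch of the helper with the suggested membership test (the function name mirrors the method in the diff above; the standalone form is just for illustration):

```python
import os


def _opDeterminismEnabled():
  # Suggested simplification: a membership test instead of chained equality.
  deterministic_ops = os.getenv('TF_DETERMINISTIC_OPS', '0')
  return deterministic_ops in ('1', 'true')
```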
  @test_util.run_in_graph_and_eager_modes
  def testForward(self):
    with self.session(), test_util.force_cpu():
    with self.cached_session(), test_util.device(use_gpu=True):
I think test_util.device is unnecessary, since the test will use the GPU by default if available.
I agree, and I've thought about this. use_gpu=True explicitly pins the ops to the GPU, if there is a GPU. I understand that the expected, default behavior is for the ops to auto-pin to the GPU (if there is one), but I've wondered if we should rely on that. Considering again ...
The next commit will remove almost all instances of use_gpu=True, in xent_op_test_base.py, xent_op_test.py, and xent_op_deterministic_test.py.
    TODO(duncanriach): Identify the source of the difference in gradients for
    this case.
    """
    self._testSingleClass(test_backprop=False)
How incorrect are the gradients here and in the next test? I am slightly worried there is a correctness problem, although the single-class case seems fairly unimportant.
Agreed, it is concerning, and the gradients are very different (if I remember correctly). I was already intending to change these tests so that they do check expected values for both deterministic and nondeterministic cases. The next commit will include that, so we'll have documentation in the test of the difference between the gradients for deterministic vs nondeterministic.
After further investigation, it has been revealed that the gradients only mismatch between the nondeterministic and deterministic implementations when the labels vector is not a valid probability distribution, as required (but not enforced) by the API. Below are the docstring notes from xent_op_deterministic_test.py in its current state (after the most recent commit).
testSingleClass:
The most recent commit also adds the fourth minibatch item: labels=[1.], logits=[1.].
The deterministic implementation does not produce the gradients expected by
the original test (for the nondeterministic functionality) when the labels
vector is not a valid probability distribution.
labels: [[-1.], [0.], [1.], [1.]]
logits: [[1.], [-1.], [0.], [1.]]
dloss/dlogits (nondeterministic): [[2.0], [1.0], [0.0], [0.0]]
dloss/dlogits (deterministic):    [[0.0], [0.0], [0.0], [0.0]]
Note that only the second two label vectors are valid probability
distributions (as required by the API) and that the gradient matches for
those cases.
testLabelsBroadcast:
The most recent commit also adds the third minibatch item: labels=[0.25], logits=[1., 2., 3., 4.].
The deterministic implementation does not produce the gradients expected by
the original test (for the nondeterministic functionality) when the labels
vector (after broadcasting) is not a valid probability distribution.
labels: [[0.], [2.], [0.25]]
logits: [[1., 1., 1., 1.],
[1., 2., 3., 4.],
[1., 2., 3., 4.]]
dloss/dlogits (nondeterministic):
[[ 0.25 , 0.25 , 0.25 , 0.25 ],
[-1.968, -1.913, -1.763, -1.355],
[-0.218, -0.163, -0.013, 0.394]]
dloss/dlogits (deterministic):
[[ 0. , 0. , 0. , 0. ],
[-1.743, -1.303, -0.105, 3.150],
[-0.218, -0.163, -0.013, 0.394]]
Note that neither of the first two broadcast label vectors is a valid probability distribution (as required by the API) and that these are the cases that yield different gradients for the nondeterministic vs. deterministic implementations.
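A small NumPy sketch (an editorial illustration, not code from the PR) reproduces the single-class numbers above, under the assumption that the fused kernel's gradient is softmax(logits) - labels, while the nonfused path differentiates -sum(labels * log_softmax(logits)), giving softmax(logits) * sum(labels) - labels. The two agree exactly when each labels vector sums to 1.

```python
import numpy as np


def softmax(x):
  e = np.exp(x - x.max(axis=-1, keepdims=True))
  return e / e.sum(axis=-1, keepdims=True)


labels = np.array([[-1.], [0.], [1.], [1.]])
logits = np.array([[1.], [-1.], [0.], [1.]])

# Assumed gradient of the fused kernel: softmax(logits) - labels.
grad_fused = softmax(logits) - labels

# Gradient of -sum(labels * log_softmax(logits)) with respect to logits:
# softmax(logits) * sum(labels) - labels.
grad_nonfused = (softmax(logits) * labels.sum(axis=-1, keepdims=True)
                 - labels)

print(grad_fused)     # [[2.], [1.], [0.], [0.]]  (nondeterministic column)
print(grad_nonfused)  # [[0.], [0.], [0.], [0.]]  (deterministic column)
```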
I propose that we remove these illegal label cases from the tests, replacing them with legal cases, thereby making the tests run the same way on both the original implementation and the deterministic implementation.
I am slightly worried some people might accidentally rely on the current behavior when given illegal label cases. Also in practice, the labels might not add up exactly to 1 due to rounding errors. I think we should keep it as is for now, with the TODO to investigate this. I'll later try to think if there is a valid interpretation of softmax-cross-entropy when the labels do not add to 1.
Force-pushed 5240ac7 to 05e70a4.
Force-pushed 05e70a4 to 8fb4285.
Sorry for the long periods between reviews; I had taken some time off.
No problem, @reedwm. I hope you had a relaxing and enjoyable break.
This PR adds and tests deterministic forward and backward operation of tf.nn.softmax_cross_entropy_with_logits when running on a GPU.
Note that there are changes and enhancements to the existing tests that may be obscured by the restructuring of the test files.
Thanks to @reedwm for providing support and guidance on this PR, including looking into the arithmetic equivalence of the forward and backward operation of the python-level solution.
Note that a naive implementation of softmax followed by cross-entropy is not as numerically stable as the version implemented here (and in the existing Eigen-based/C-level implementation), in which the log in the cross-entropy function is moved back into the softmax, changing it into a log-softmax and changing the cross-entropy function into a dot product. Log-softmax does not demand as large a dynamic range as softmax.
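To illustrate the point (a NumPy sketch, not the PR's TensorFlow code): the naive formulation exponentiates and then takes a log, which overflows for large-magnitude logits, while the log-softmax formulation only ever adds and subtracts logits and log-sums, so it stays finite. The max-shift below is the usual additional trick for computing log-softmax stably.

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])
labels = np.array([1.0, 0.0, 0.0])

# Naive: softmax followed by cross-entropy. exp(1000.0) overflows to inf,
# so the loss comes out as nan even though the true loss is ~0.
probs = np.exp(logits) / np.exp(logits).sum()
naive_loss = -(labels * np.log(probs)).sum()

# Stable: move the log into the softmax (log-softmax), then take the dot
# product with the labels (the cross-entropy becomes a dot product).
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())
stable_loss = -(labels * log_softmax).sum()

print(naive_loss)   # nan
print(stable_loss)  # ~0.0
```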
Note that the following tests do not pass on this deterministic implementation (and have been disabled):
- testSingleClass
- testLabelsBroadcast
I have not yet been able to determine the reason for this, and I don't know if it's because the existing functionality is incorrect or if the new, deterministic functionality is incorrect. For the single-class case, for example, it seems to me that the correct gradients should all be zero (which is what the new, deterministic implementation provides). It seems as though the above two use cases (single class and broadcast labels) would rarely be used; it's not obvious to me what the applications of these use cases would be, and these functionalities are also not documented. I have added TODO comments for me to look into this more deeply. @reedwm, feel free to explore.
UPDATE: After further investigation, it has been revealed that the gradients only mismatch between the nondeterministic and deterministic implementations when the labels vector is not a valid probability distribution, as required (but not enforced) by the API. See this comment for more information.
This PR is related to RFC: Enabling Determinism in TensorFlow. For status and history of GPU-determinism for this op, see here.
cc @sanjoy @nluehr