[ROCm] Fix for ROCm CSB Breakage - 201022 #44233

deven-amd · 2020-10-22T15:20:54Z

The following commit introduces a failure in the //tensorflow/python/kernel_tests:linalg_grad_test_gpu test on the ROCm platform:

3ce466a

The failure is in the MatrixExponentialGradient subtest, and the errors take the following form:

======================================================================
ERROR: test_MatrixExponentialGradient_float64_5_5 (__main__.MatrixUnaryFunctorGradientTest)
test_MatrixExponentialGradient_float64_5_5 (__main__.MatrixUnaryFunctorGradientTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: No registered 'MatrixSolve' OpKernel for 'GPU' devices compatible with node {{node MatrixSolve}}
	.  Registered:  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

	 [[MatrixSolve]]
	 [[matrix_exponential_1/cond/PartitionedCall/matrix_exponential_1/cond]]
	 [[PartitionedCall/gradients/matrix_exponential_1/cond_grad/StatelessIf/then/_22/gradients/Neg_grad/Neg/_73]]
  (1) Not found: No registered 'MatrixSolve' OpKernel for 'GPU' devices compatible with node {{node MatrixSolve}}
	.  Registered:  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

	 [[MatrixSolve]]
	 [[matrix_exponential_1/cond/PartitionedCall/matrix_exponential_1/cond]]
0 successful operations.
0 derived errors ignored.

The regression was fixed by the subsequent commit, 7d57263, and then reintroduced when parts of the above commits were rolled back by commit fb22dff.

The regression seems to occur because the "MatrixSolve" operator currently has a GPU kernel only on the CUDA platform and not on the ROCm platform, so the node cannot be placed on a GPU device in ROCm builds.
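The "Registered:" list in the error text is what points to this explanation. As a minimal sketch, the snippet below extracts the device names from an excerpt of that list and checks whether a plain 'GPU' entry is present. The helper name `registered_devices` is hypothetical and exists only for illustration.

```python
import re

# Excerpt of the "Registered:" list copied from the error message above.
ERROR_TEXT = """\
Registered:  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]
"""


def registered_devices(error_text):
    """Return the set of device names that have a registered kernel."""
    return set(re.findall(r"device='([^']+)'", error_text))


devices = registered_devices(ERROR_TEXT)
print(sorted(devices))   # devices that do have a MatrixSolve kernel
print("GPU" in devices)  # whether a plain GPU kernel is registered
```

In this excerpt only the CPU and XLA JIT devices appear, which matches the "No registered 'MatrixSolve' OpKernel for 'GPU' devices" message.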

This commit temporarily disables the subtest on the ROCm platform so that the ROCm CSB passes. It can be reverted once the rolled-back changes are reinstated.
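For readers unfamiliar with this kind of workaround, here is a rough sketch of skipping a subtest on ROCm builds. It is not the actual diff in this PR. `tf.test.is_built_with_rocm()` is an existing TensorFlow call, while `ROCM_SKIPPED_SUBTESTS`, `built_with_rocm`, `should_skip_on_rocm`, and the test class are hypothetical names chosen for illustration.

```python
import unittest

# Hypothetical list of subtests that rely on GPU kernels missing on ROCm.
ROCM_SKIPPED_SUBTESTS = ("MatrixExponentialGradient",)


def built_with_rocm():
    """Return True on a ROCm build of TensorFlow, False otherwise.

    Falls back to False when TensorFlow is not installed, so that this
    sketch stays importable outside a TensorFlow environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return False
    return tf.test.is_built_with_rocm()


def should_skip_on_rocm(test_name, rocm_build):
    """Decide whether a generated subtest should be skipped on ROCm."""
    return rocm_build and any(
        name in test_name for name in ROCM_SKIPPED_SUBTESTS)


class MatrixUnaryFunctorGradientSketch(unittest.TestCase):

    def test_MatrixExponentialGradient_float64_5_5(self):
        if should_skip_on_rocm(self.id(), built_with_rocm()):
            self.skipTest("MatrixSolve has no GPU kernel on ROCm")
        # The real gradient check would run here.
```

Keeping the decision in a small pure function makes the skip list easy to shrink again once the MatrixSolve GPU kernel is available on ROCm.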

/cc @cheshire @chsigg @nvining-work


google-ml-butler bot added the size:S CL Change Size: Small label Oct 22, 2020

google-cla bot added the cla: yes label Oct 22, 2020

mihaimaruseac approved these changes Oct 22, 2020


google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Oct 22, 2020

kokoro-team removed the kokoro:force-run Tests on submitted change label Oct 22, 2020

gbaned self-assigned this Oct 23, 2020

gbaned added the comp:gpu GPU related issues label Oct 23, 2020

gbaned added this to Assigned Reviewer in PR Queue via automation Oct 23, 2020

copybara-service bot merged commit 7de0398 into tensorflow:master Oct 23, 2020

PR Queue automation moved this from Assigned Reviewer to Merged Oct 23, 2020

deven-amd deleted the google_upstream_rocm_csb_fix_201022 branch January 4, 2021 14:20
