[ROCm] Fix for ROCm CSB Breakage - 201022 #44233

Merged

deven-amd
Contributor

The following commit introduces a failure in the `//tensorflow/python/kernel_tests:linalg_grad_test_gpu` test on the ROCm platform:

3ce466a

The failure is in the `MatrixExponentialGradient` subtest, and the errors are of the following form:

======================================================================
ERROR: test_MatrixExponentialGradient_float64_5_5 (__main__.MatrixUnaryFunctorGradientTest)
test_MatrixExponentialGradient_float64_5_5 (__main__.MatrixUnaryFunctorGradientTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1360, in _run_fn
    target_list, run_metadata)
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/linalg_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: No registered 'MatrixSolve' OpKernel for 'GPU' devices compatible with node {{node MatrixSolve}}
	.  Registered:  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

	 [[MatrixSolve]]
	 [[matrix_exponential_1/cond/PartitionedCall/matrix_exponential_1/cond]]
	 [[PartitionedCall/gradients/matrix_exponential_1/cond_grad/StatelessIf/then/_22/gradients/Neg_grad/Neg/_73]]
  (1) Not found: No registered 'MatrixSolve' OpKernel for 'GPU' devices compatible with node {{node MatrixSolve}}
	.  Registered:  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_HALF]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

	 [[MatrixSolve]]
	 [[matrix_exponential_1/cond/PartitionedCall/matrix_exponential_1/cond]]
0 successful operations.
0 derived errors ignored.

The regression was fixed by the subsequent commit 7d57263, and then reintroduced when parts of the above commits were rolled back by commit fb22dff.

The regression seems to occur on the ROCm platform because the "MatrixSolve" operator is currently enabled for GPUs only on the CUDA platform, not on ROCm. This matches the error above, which lists no registered "MatrixSolve" kernel for 'GPU' devices.

This commit temporarily disables the subtest on the ROCm platform so that the ROCm CSB passes. It can be reverted once the rolled-back changes are restored.
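
For readers unfamiliar with the pattern, here is a minimal sketch of how a generated subtest can be skipped on one platform only. It is not the actual diff in this PR. It uses plain `unittest` so it runs without TensorFlow, and `is_built_with_rocm()` below is a hypothetical stand-in hard-coded to `True`; in TensorFlow the analogous check would be `tf.test.is_built_with_rocm()`. The test names and the `_make_gradient_test` helper are likewise illustrative assumptions.

```python
import io
import unittest


def is_built_with_rocm():
    """Hypothetical stand-in for a build-time platform check (assumption)."""
    return True


class MatrixUnaryFunctorGradientTest(unittest.TestCase):
    """Test methods are attached dynamically below, one per functor name."""


def _make_gradient_test(functor_name):
    """Build a test method for one functor (illustrative helper)."""

    def test(self):
        # Skip only the subtest that depends on the missing GPU kernel.
        if functor_name == "MatrixExponential" and is_built_with_rocm():
            self.skipTest("MatrixSolve has no GPU kernel on ROCm")
        # Placeholder for the real gradient check.
        self.assertTrue(True)

    return test


for _name in ("MatrixInverse", "MatrixExponential"):
    setattr(
        MatrixUnaryFunctorGradientTest,
        "test_%sGradient" % _name,
        _make_gradient_test(_name),
    )


def run_suite():
    """Run the generated tests quietly and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        MatrixUnaryFunctorGradientTest
    )
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Calling `run_suite()` runs both generated tests and reports the `MatrixExponential` one as skipped rather than failed, which is the behavior wanted for the CSB until the kernel is available on ROCm.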


/cc @cheshire @chsigg @nvining-work

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Oct 22, 2020
@google-cla google-cla bot added the cla: yes label Oct 22, 2020
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Oct 22, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Oct 22, 2020
@gbaned gbaned self-assigned this Oct 23, 2020
@gbaned gbaned added the comp:gpu GPU related issues label Oct 23, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Oct 23, 2020
@copybara-service copybara-service bot merged commit 7de0398 into tensorflow:master Oct 23, 2020
PR Queue automation moved this from Assigned Reviewer to Merged Oct 23, 2020
@deven-amd deven-amd deleted the google_upstream_rocm_csb_fix_201022 branch January 4, 2021 14:20