Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ROCm] Fix for ROCM CSB breakage - 210322 #47980

Merged

Conversation

deven-amd
Copy link
Contributor

1

The following commit adds GPU support for int32/int64 support for the Unique/UniqueWithCounts ops, but breaks ROCm build in the process

02585ac

tensorflow/core/kernels/unique_op_gpu.cu.cc:292:9: error: no matching constructor for initialization of 'gpuprim::TransformInputIterator<int, SegmentIndicatorFunctor<unsigned long, int>, gpuprim::CountingInputIterator<int>>' (aka 'transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::(anonymous namespace)::SegmentIndicatorFunctor<unsigned long, int>, int>')
        segment_indicator_iter(0, {sorted_input_ptr});
        ^                      ~~~~~~~~~~~~~~~~~~~~~
tensorflow/core/kernels/unique_op_gpu.cu.cc:176:12: note: in instantiation of member function 'tensorflow::UniqueOpGPU<unsigned long, int>::ComputeAsync' requested here
  explicit UniqueOpGPU(OpKernelConstruction* context)
           ^
tensorflow/core/kernels/unique_op_gpu.cu.cc:461:27: note: in instantiation of member function 'tensorflow::UniqueOpGPU<unsigned long, int>::UniqueOpGPU' requested here
TF_CALL_REAL_NUMBER_TYPES(REGISTER_UNIQUE_GPU);

This PR/commit disables ROCm support for newly added ops to get the CSB passing again. We are looking into resolving the build errors, and will file a separate PR to re-enable ROCm functionality for the same. This PR commit also adds the no_rocm tag to a couple of unit tests that start failing as a consequence of lack of ROCm support for these ops.

//tensorflow/python/keras/optimizer_v2:adamax_test_gpu                   FAILED in 3 out of 3 in 7.0s
//tensorflow/python/training:adam_test_gpu                               FAILED in 3 out of 3 in 6.3s

2

The following commit adds a new unit-test which is failing on the ROCm platform

50f8897

//tensorflow/python/keras/distribute:dataset_creator_model_fit_test_gpu  FAILED in 3 out of 3 in 112.5s

This commit adds a no_rocm tag to temporarily disable that unit-test on ROCm.


/cc @cheshire @chsigg

@google-ml-butler google-ml-butler bot added the size:XS CL Change Size: Extra Small label Mar 22, 2021
@google-cla google-cla bot added the cla: yes label Mar 22, 2021
@gbaned gbaned self-assigned this Mar 22, 2021
@gbaned gbaned added the comp:gpu GPU related issues label Mar 22, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Mar 22, 2021
@gbaned gbaned requested a review from chsigg March 22, 2021 17:19
@deven-amd
Copy link
Contributor Author

@cheshire @chsigg gentle ping. Need this PR to get ROCm builds passing again.

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Mar 24, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 24, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 24, 2021
@gbaned gbaned added the kokoro:force-run Tests on submitted change label Mar 24, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 24, 2021
@gbaned
Copy link
Contributor

gbaned commented Mar 24, 2021

@deven-amd Can you please check build failures. Thanks!

@gbaned gbaned removed the ready to pull PR ready for merge process label Mar 24, 2021
@deven-amd
Copy link
Contributor Author

@gbaned all the build error point to

ERROR: /tmpfs/bazel_output/_bazel_kbuilder/f2d52ca1f092ccbe254cc98c3dc90790/external/llvm-project/llvm/BUILD

don't think they are related to the change in this PR.

can you trigger the tests again...maybe whatever was causing them, has been fixed.

@gbaned gbaned added the kokoro:force-run Tests on submitted change label Mar 24, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 24, 2021
@deven-amd deven-amd force-pushed the google_upstream_rocm_csb_fix_210322 branch from 11e2175 to f06a8e1 Compare March 25, 2021 02:18
@deven-amd
Copy link
Contributor Author

@gbaned , rebased my branch to resolve the merge conflict

deven-amd referenced this pull request Mar 25, 2021
Doing this to make the ROCm build green.

PiperOrigin-RevId: 364982542
Change-Id: Idf5f94fbcbebf0b6d1da6d468905b86385fb2392
@deven-amd
Copy link
Contributor Author

@gbaned , rebased (again) my branch to resolve the merge conflict

@deven-amd deven-amd force-pushed the google_upstream_rocm_csb_fix_210322 branch from f06a8e1 to 2d96b80 Compare March 25, 2021 11:07
@gbaned gbaned requested a review from cheshire March 25, 2021 15:14
@deven-amd
Copy link
Contributor Author

@cheshire @chsigg gentle ping

@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 30, 2021
@deven-amd deven-amd force-pushed the google_upstream_rocm_csb_fix_210322 branch from 2d96b80 to 6441d6b Compare April 2, 2021 18:23
@deven-amd
Copy link
Contributor Author

rebased PR to resolve merge conflict

@cheshire @chsigg gentle ping

@gbaned gbaned removed the request for review from chsigg April 5, 2021 14:31
@gbaned gbaned requested a review from chsigg April 5, 2021 14:31
@deven-amd deven-amd force-pushed the google_upstream_rocm_csb_fix_210322 branch from 6441d6b to df0dc4a Compare April 6, 2021 03:05
@deven-amd
Copy link
Contributor Author

rebased PR again to resolve merge conflict

@cheshire @chsigg gentle ping

@gbaned gbaned requested review from chsigg and removed request for chsigg April 6, 2021 14:00
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 6, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 6, 2021
@rthadur
Copy link
Contributor

rthadur commented Apr 6, 2021

@deven-amd can you please resolve conflicts

The following commit adds GPU support for int32/int64 support for the Unique/UniqueWithCounts ops, but breaks ROCm build in the process

tensorflow@02585ac

```
tensorflow/core/kernels/unique_op_gpu.cu.cc:292:9: error: no matching constructor for initialization of 'gpuprim::TransformInputIterator<int, SegmentIndicatorFunctor<unsigned long, int>, gpuprim::CountingInputIterator<int>>' (aka 'transform_iterator<rocprim::counting_iterator<int, long>, tensorflow::(anonymous namespace)::SegmentIndicatorFunctor<unsigned long, int>, int>')
        segment_indicator_iter(0, {sorted_input_ptr});
        ^                      ~~~~~~~~~~~~~~~~~~~~~
tensorflow/core/kernels/unique_op_gpu.cu.cc:176:12: note: in instantiation of member function 'tensorflow::UniqueOpGPU<unsigned long, int>::ComputeAsync' requested here
  explicit UniqueOpGPU(OpKernelConstruction* context)
           ^
tensorflow/core/kernels/unique_op_gpu.cu.cc:461:27: note: in instantiation of member function 'tensorflow::UniqueOpGPU<unsigned long, int>::UniqueOpGPU' requested here
TF_CALL_REAL_NUMBER_TYPES(REGISTER_UNIQUE_GPU);
```

This PR/commit disables ROCm support for newly added ops to get the CSB passing again. We are looking into resolving the build errors, and will file a separate PR to re-enable ROCm functionality for the same. This PR commit also adds the `no_rocm` tag to a couple of unit tests that start failing as a consequence of lack of ROCm support for these ops.

```
//tensorflow/python/keras/optimizer_v2:adamax_test_gpu                   FAILED in 3 out of 3 in 7.0s
//tensorflow/python/training:adam_test_gpu                               FAILED in 3 out of 3 in 6.3s
```
…m platform

tensorflow@50f8897

```
//tensorflow/python/keras/distribute:dataset_creator_model_fit_test_gpu  FAILED in 3 out of 3 in 112.5s
```

This commit adds a `no_rocm` tag to temporarily disable that unit-test on ROCm.
@deven-amd deven-amd force-pushed the google_upstream_rocm_csb_fix_210322 branch from df0dc4a to 43124b6 Compare April 6, 2021 22:41
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Apr 6, 2021
@deven-amd
Copy link
Contributor Author

@rthadur rebased the PR again to resolve the merge conflict.

@chsigg will need your approval again

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Apr 6, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 6, 2021
@deven-amd deven-amd added the kokoro:force-run Tests on submitted change label Apr 7, 2021
@gbaned gbaned added kokoro:force-run Tests on submitted change and removed awaiting review Pull request awaiting review kokoro:force-run Tests on submitted change labels Apr 7, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 7, 2021
@copybara-service copybara-service bot merged commit 37c5dfb into tensorflow:master Apr 7, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Apr 7, 2021
@deven-amd deven-amd deleted the google_upstream_rocm_csb_fix_210322 branch April 16, 2021 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:gpu GPU related issues ready to pull PR ready for merge process size:XS CL Change Size: Extra Small
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

6 participants