
[r2.5 port][ROCm] Port PR 48187 to r2.5 #48507

Merged
mihaimaruseac merged 16 commits into tensorflow:r2.5 from ROCm:google_upstream_r25_port_pr_48187
Apr 22, 2021
Conversation

@deven-amd (Contributor)

deven-amd and others added 16 commits April 9, 2021 21:55
For ROCm 4.0 and prior, hipfft was a header-only "library" that shipped along with the rocfft library. The ROCm TF code uses the hipfft APIs declared in the "hipfft.h" header file, which was part of the rocfft library package, and ROCm TF would dynamically load the "rocfft" library at runtime.

From ROCm 4.1 onwards, it seems that hipfft is a separate library from rocfft, and ROCm TF needs to use the "hipfft" library instead of the "rocfft" library. Specifically, ROCm TF needs to
* pick up the "hipfft.h" header from the hipfft install area (instead of the rocfft install area).
* dynamically load the "hipfft" library instead of the "rocfft" library.
This commit makes both of the above changes.

This change is required because, without it, the unit-tests that call the hipfft API (`hipfftSetWorkArea`) fail when run with ROCm 4.1.

ROCm TF continues to build with ROCm 4.1 even without the changes in this commit, because the old "hipfft.h" file in the rocfft install area is still present (it is unclear whether that is by design or by accident).

Unit-tests known to call the `hipfftSetWorkArea` API:

```
//tensorflow/python/kernel_tests/linalg:linear_operator_circulant_test_gpu
//tensorflow/python/kernel_tests/signal:dct_ops_test
//tensorflow/python/kernel_tests/signal:mel_ops_test
//tensorflow/python/kernel_tests/signal:mfcc_ops_test
//tensorflow/python/kernel_tests/signal:spectral_ops_test
//tensorflow/python/ops/parallel_for:control_flow_ops_test_gpu
```
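The runtime switch described above can be sketched as a small version check. The helper below (`select_fft_library`) and its version threshold are purely illustrative assumptions, not TF's actual dynamic-loading code:

```python
def select_fft_library(rocm_version):
    """Pick which FFT library to dynamically load, given a ROCm version tuple.

    For ROCm 4.0 and earlier, the hipfft API shipped inside the rocfft
    package; from ROCm 4.1 onwards, hipfft is a standalone library.
    (Hypothetical helper for illustration only.)
    """
    major, minor = rocm_version[0], rocm_version[1]
    if (major, minor) >= (4, 1):
        return "libhipfft.so"
    return "librocfft.so"
```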
Prior to this commit, the GCN arch name "gfx908:+sramecc" would get mapped to
* { target : "gfx908", feature_string : "+sram-ecc" }

This commit changes the mapping to the following:
* { target : "gfx908", feature_string : "+sramecc" }

(Notice the dash "-" is gone from "+sram-ecc".)

The LLVM version being picked up by TF in this branch (`develop-upstream-QA-rocm42`) seems to have changes within it (as compared to the version being picked up in the `develop-upstream-QA-rocm41` branch) that require this change in the mapping.

This mapping is expected to be in a state of flux until all the AMDGPU target-id related changes have been pushed out to the upstream LLVM repo, and TF's LLVM pointer is updated to pick up those changes.
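The before/after mapping can be illustrated with a toy parser; `parse_gcn_arch_name` is a hypothetical helper for illustration, not the code TF or LLVM actually use:

```python
def parse_gcn_arch_name(arch_name):
    """Split a GCN arch name such as "gfx908:+sramecc" into
    (target, feature_string).

    After this commit, the feature token is passed through verbatim
    ("+sramecc") rather than being rewritten to "+sram-ecc".
    (Hypothetical helper for illustration only.)
    """
    target, sep, features = arch_name.partition(":")
    return target, features if sep else ""
```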
Revert "Adding no_rocm tag to `rocm` unit-tests that regress with ROCm 3.9"

This reverts commit 7301217.
They were disabled in the following commit:

5d3bf87
…e_patches_grad_test_gpu

This unit-test starts to fail when we switch from ROCm 4.0.1 to ROCm 4.1.0.

QA did not detect this regression because the test passes in the branch they used to qualify ROCm 4.1.0 (`develop-upstream-QA-rocm41`).

In addition to other changes (between `develop-upstream-QA-rocm41` and the tip of `develop-upstream`), the testcase itself seems to have changed. We need to root-cause the regression, fix it, and re-enable the test.

Copy-pasting the error log here for reference

```
======================================================================
FAIL: test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 185, in test_AllNone_Gradient
    self._VariableShapeGradient([None, None, None, None])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: nan not less than 0.0001

======================================================================
FAIL: test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 176, in test_BxxC_Gradient
    self._VariableShapeGradient([-1, None, None, -1])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: 0.4175793528556824 not less than 0.0001

======================================================================
FAIL: test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 179, in test_xHWx_Gradient
    self._VariableShapeGradient([None, -1, -1, None])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: 0.2977283000946045 not less than 0.0001

----------------------------------------------------------------------
Ran 7 tests in 406.984s

FAILED (failures=3, skipped=1)
================================================================================
```
these tests

```
//tensorflow/compiler/xla/tests:convolution_test_1d_gpu_alternative_layout_gpu FAILED in 3 out of 3 in 16.0s
//tensorflow/compiler/xla/tests:convolution_test_1d_no_vmodule_gpu       FAILED in 3 out of 3 in 16.1s
```
…et - 263833

The following is a list of unit-tests from which we are removing the `no_rocm` tag in this commit, because we suspect/hope that the issue being fixed by JIRA ticket 263833 will make these tests pass.

All of these tests pass locally when tested with ROCm 4.1rc1, but given the flaky nature of the failure (these tests would only fail on some CI nodes... sometimes), we need to have them pass consistently on all CI nodes over some period of time (a couple of weeks) before we can claim victory.

```
//tensorflow/python/distribute:cross_device_ops_test_gpu
//tensorflow/python/distribute:distribute_utils_test_gpu
//tensorflow/python/distribute:strategy_common_test_gpu
//tensorflow/python/distribute:strategy_gather_test_gpu
//tensorflow/python/distribute:test_util_test_gpu
//tensorflow/python/distribute:values_test_gpu
//tensorflow/python/distribute:vars_test_gpu

//tensorflow/python/keras/distribute:custom_training_loop_metrics_test_gpu
//tensorflow/python/keras/distribute:custom_training_loop_models_test_gpu
//tensorflow/python/keras/distribute:keras_rnn_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_utils_test_gpu
```

The following tests fell into the same category as above, but their `no_rocm` tag is staying put, because flaky failures were observed for them in local testing with ROCm 4.1.

```
//tensorflow/python/keras/distribute:distribute_strategy_test_gpu
//tensorflow/python/keras/distribute:keras_dnn_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_embedding_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_image_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_save_load_test_gpu
```

Also, some of the tests take longer than 900s to complete on the ROCm platform (no test sharding on ROCm!); this commit bumps them up to size `large`.
these tests

```
//tensorflow/python/keras/distribute:saved_model_save_load_test_gpu
//tensorflow/python/kernel_tests:tensordot_op_test_gpu
```

See details here - #1301
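A size bump like the one described would look roughly as follows in a Bazel BUILD file; the rule name, target name, and attributes are illustrative assumptions, not the exact TF BUILD contents:

```
# BUILD fragment (illustrative): raise a slow GPU test's size so it
# gets a longer timeout than the 900s it currently exceeds on ROCm,
# where test sharding is unavailable.
cuda_py_test(
    name = "tensordot_op_test",
    size = "large",  # was a smaller size; runs exceed 900s on ROCm
    srcs = ["tensordot_op_test.py"],
)
```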
…upstream_rocm_switch_to_rocm41

PiperOrigin-RevId: 367580713
Change-Id: I5289492260ea8c6fde64848ba895c1ea736019d8
@google-cla google-cla bot added the cla: yes label Apr 13, 2021
@gbaned gbaned self-assigned this Apr 14, 2021
@gbaned gbaned added the size:L CL Change Size: Large label Apr 14, 2021
@gbaned gbaned assigned mihaimaruseac and unassigned gbaned Apr 14, 2021
@gbaned gbaned requested a review from mihaimaruseac April 14, 2021 03:48
@mihaimaruseac mihaimaruseac merged commit 8e0516e into tensorflow:r2.5 Apr 22, 2021
@deven-amd deven-amd deleted the google_upstream_r25_port_pr_48187 branch May 12, 2021 23:47