[r2.5 port][ROCm] Port PR 48187 to r2.5#48507
Merged
mihaimaruseac merged 16 commits intotensorflow:r2.5from Apr 22, 2021
Merged
[r2.5 port][ROCm] Port PR 48187 to r2.5#48507mihaimaruseac merged 16 commits intotensorflow:r2.5from
mihaimaruseac merged 16 commits intotensorflow:r2.5from
Conversation
For ROCm 4.0 and prior, hipfft was a header-only "library" that was shipped along with the rocfft library. ROCm TF code uses the hipfft APIs defined in the "hipfft.h" header file, which was part of the package for rocfft library. ROCm TF would dynamically load the "rocfft" library at runtime. From ROCm 4.1 onwards, it seems hipfft is a separate library from rocfft, and ROCm TF needs to use the "hipfft" library insteaf of the "rocfft" library. Specifcally, ROCm TF needs to * pick up the "hipfft.h" header from the hipfft install area (instead of the rocfft install area). * dynamically load the the "hipfft" library instead of the "rocfft" library This commit has the two above changes This change is required because without this change, the unit-tests that called the hipfft API ( `hipfftSetWorkArea` ) were failing when run with ROCm 4.1 (and without the changes in this commit). ROCm TF continues to build (with ROCm 4.1) without the changes in this commit, because the old "hipfft.h" file in the rocfft install area is still present (do not know whether that is by design or accident) unit-tests known to call the `hipfftSetWorkArea` API ``` //tensorflow/python/kernel_tests/linalg:linear_operator_circulant_test_gpu //tensorflow/python/kernel_tests/signal:dct_ops_test //tensorflow/python/kernel_tests/signal:mel_ops_test //tensorflow/python/kernel_tests/signal:mfcc_ops_test //tensorflow/python/kernel_tests/signal:spectral_ops_test //tensorflow/python/ops/parallel_for:control_flow_ops_test_gpu ```
Prior to this commit, the GCN Arch Name "gfx908:+sramecc", would get mapped to
* { target : "gfx908", feature_string : "+sram-ecc"}
This commit changes the mapping to the following
* { target : "gfx908", feature_string : "+sramecc"}
(Notice the dash "-" is gone from "+sram-ecc")
The LLVM version being picked up by TF in this branch ( `develop-upstream-QA-rocm42` ) seems have changes withing it (as compared to the version being picked up in the `develop-upstream-QA-rocm41` branch), which require this change in the mapping.
This mapping is expected to be in a state of flux until all the AMDGPU target-id related changes have been pushed out to the upstream LLVM repo, and TF LLVM pointer is updated to pick those changes.
Revert "Adding no_rocm tag to `rocm` unit-tests that regress with ROCm 3.9" This reverts commit 7301217.
They were disabled in the following commit 5d3bf87
…e_patches_grad_test_gpu
This unit-test starts to fail when we switch from ROCm 4.0.1 to ROCm 4.1.0
QA did not detect this regression because this test passes in the branch they were using to qualify ROCm 4.1.0 ( `develop-upstream-QA-rocm41` )
In addition to other changes (between `develop-upstream-QA-rocm41` and tip of `develop-upstream`), the testcase itself seems to have changed. We need to root-cause the regression, fix it and re-enable the test.
Copy-pasting the error log here for reference
```
======================================================================
FAIL: test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 185, in test_AllNone_Gradient
self._VariableShapeGradient([None, None, None, None])
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
self.assertLess(err, 1e-4)
AssertionError: nan not less than 0.0001
======================================================================
FAIL: test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 176, in test_BxxC_Gradient
self._VariableShapeGradient([-1, None, None, -1])
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
self.assertLess(err, 1e-4)
AssertionError: 0.4175793528556824 not less than 0.0001
======================================================================
FAIL: test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 179, in test_xHWx_Gradient
self._VariableShapeGradient([None, -1, -1, None])
File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
self.assertLess(err, 1e-4)
AssertionError: 0.2977283000946045 not less than 0.0001
----------------------------------------------------------------------
Ran 7 tests in 406.984s
FAILED (failures=3, skipped=1)
================================================================================
```
these tests ``` //tensorflow/compiler/xla/tests:convolution_test_1d_gpu_alternative_layout_gpu FAILED in 3 out of 3 in 16.0s //tensorflow/compiler/xla/tests:convolution_test_1d_no_vmodule_gpu FAILED in 3 out of 3 in 16.1s ```
…et - 263833 The following is list of unit-tests from which we are removing the `no_rocm` tag in this commit, because we suspect/hope that the issue being fixed by JIRA ticket 263833, will make these tests pass. All of these tests are passing locally when tested with ROCm 4.1rc1, but given the flaky nature of the failure (these tests would only fail on some CI nodes...sometimes), we need to have them pass consistently on all CI nodes, over some periosd of time (couple of weeks) before we can claim victory ``` //tensorflow/python/distribute:cross_device_ops_test_gpu //tensorflow/python/distribute:distribute_utils_test_gpu //tensorflow/python/distribute:strategy_common_test_gpu //tensorflow/python/distribute:strategy_gather_test_gpu //tensorflow/python/distribute:test_util_test_gpu //tensorflow/python/distribute:values_test_gpu //tensorflow/python/distribute:vars_test_gpu //tensorflow/python/keras/distribute:custom_training_loop_metrics_test_gpu //tensorflow/python/keras/distribute:custom_training_loop_models_test_gpu //tensorflow/python/keras/distribute:keras_rnn_model_correctness_test_gpu //tensorflow/python/keras/distribute:keras_utils_test_gpu ``` The following two tests fell into the same category as above, but their `no_rocm` tag ios staying put, because flaku failures were observed for them in local testing with ROCm 4.1 ``` //tensorflow/python/keras/distribute:distribute_strategy_test_gpu //tensorflow/python/keras/distribute:keras_dnn_correctness_test_gpu //tensorflow/python/keras/distribute:keras_embedding_model_correctness_test_gpu //tensorflow/python/keras/distribute:keras_image_model_correctness_test_gpu //tensorflow/python/keras/distribute:keras_save_load_test_gpu ``` Also some of the tests take longer than 900s to complete on the ROCm platform (no sharding on ROCm!)...bumping them up to size `large` in this commit
these tests ``` //tensorflow/python/keras/distribute:saved_model_save_load_test_gpu //tensorflow/python/kernel_tests:tensordot_op_test_gpu ``` See details here - #1301
…upstream_rocm_switch_to_rocm41 PiperOrigin-RevId: 367580713 Change-Id: I5289492260ea8c6fde64848ba895c1ea736019d8
mihaimaruseac
approved these changes
Apr 20, 2021
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
/cc @mihaimaruseac @angerson
#48187