
[r2.5 port][ROCm] Port PR 48187 to r2.5 #48507

Merged
mihaimaruseac merged 16 commits into tensorflow:r2.5 from ROCm:google_upstream_r25_port_pr_48187
Apr 22, 2021
Conversation

@deven-amd (Contributor)

deven-amd and others added 16 commits April 9, 2021 21:55
For ROCm 4.0 and prior, hipfft was a header-only "library" that shipped along with the rocfft library. The ROCm TF code uses the hipfft APIs declared in the "hipfft.h" header file, which was part of the rocfft library package, and ROCm TF would dynamically load the "rocfft" library at runtime.

From ROCm 4.1 onwards, it seems that hipfft is a separate library from rocfft, and ROCm TF needs to use the "hipfft" library instead of the "rocfft" library. Specifically, ROCm TF needs to
* pick up the "hipfft.h" header from the hipfft install area (instead of the rocfft install area).
* dynamically load the "hipfft" library instead of the "rocfft" library.
This commit makes both of the above changes.

This change is required because, without it, the unit-tests that call the hipfft API (`hipfftSetWorkArea`) fail when run with ROCm 4.1.

ROCm TF continues to build with ROCm 4.1 even without the changes in this commit, because the old "hipfft.h" file in the rocfft install area is still present (it is unclear whether that is by design or by accident).

Unit-tests known to call the `hipfftSetWorkArea` API:

```
//tensorflow/python/kernel_tests/linalg:linear_operator_circulant_test_gpu
//tensorflow/python/kernel_tests/signal:dct_ops_test
//tensorflow/python/kernel_tests/signal:mel_ops_test
//tensorflow/python/kernel_tests/signal:mfcc_ops_test
//tensorflow/python/kernel_tests/signal:spectral_ops_test
//tensorflow/python/ops/parallel_for:control_flow_ops_test_gpu
```
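The runtime switch described above can be sketched as a small version check. The helper below (`select_fft_library`) and its version threshold are purely illustrative assumptions, not TF's actual dynamic-loading code:

```python
def select_fft_library(rocm_version):
    """Pick which FFT library to dynamically load, given a ROCm version tuple.

    For ROCm 4.0 and earlier, the hipfft API shipped inside the rocfft
    package; from ROCm 4.1 onwards, hipfft is a standalone library.
    (Hypothetical helper for illustration only.)
    """
    major, minor = rocm_version[0], rocm_version[1]
    if (major, minor) >= (4, 1):
        return "libhipfft.so"
    return "librocfft.so"
```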
Prior to this commit, the GCN arch name "gfx908:+sramecc" would get mapped to
* { target : "gfx908", feature_string : "+sram-ecc" }

This commit changes the mapping to the following:
* { target : "gfx908", feature_string : "+sramecc" }

(Notice the dash "-" is gone from "+sram-ecc".)

The LLVM version being picked up by TF in this branch (`develop-upstream-QA-rocm42`) seems to have changes within it (as compared to the version being picked up in the `develop-upstream-QA-rocm41` branch) that require this change in the mapping.

This mapping is expected to be in a state of flux until all the AMDGPU target-id related changes have been pushed out to the upstream LLVM repo, and TF's LLVM pointer is updated to pick up those changes.
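The before/after mapping can be illustrated with a toy parser; `parse_gcn_arch_name` is a hypothetical helper for illustration, not the code TF or LLVM actually use:

```python
def parse_gcn_arch_name(arch_name):
    """Split a GCN arch name such as "gfx908:+sramecc" into
    (target, feature_string).

    After this commit, the feature token is passed through verbatim
    ("+sramecc") rather than being rewritten to "+sram-ecc".
    (Hypothetical helper for illustration only.)
    """
    target, sep, features = arch_name.partition(":")
    return target, features if sep else ""
```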
Revert "Adding no_rocm tag to `rocm` unit-tests that regress with ROCm 3.9"

This reverts commit 7301217.
They were disabled in the following commit:

5d3bf87
…e_patches_grad_test_gpu

This unit-test starts to fail when we switch from ROCm 4.0.1 to ROCm 4.1.0.

QA did not detect this regression because the test passes in the branch they used to qualify ROCm 4.1.0 (`develop-upstream-QA-rocm41`).

In addition to other changes (between `develop-upstream-QA-rocm41` and the tip of `develop-upstream`), the testcase itself seems to have changed. We need to root-cause the regression, fix it, and re-enable the test.

Copy-pasting the error log here for reference

```
======================================================================
FAIL: test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
test_AllNone_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 185, in test_AllNone_Gradient
    self._VariableShapeGradient([None, None, None, None])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: nan not less than 0.0001

======================================================================
FAIL: test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
test_BxxC_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 176, in test_BxxC_Gradient
    self._VariableShapeGradient([-1, None, None, -1])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: 0.4175793528556824 not less than 0.0001

======================================================================
FAIL: test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
test_xHWx_Gradient (__main__.ExtractImagePatchesGradTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 179, in test_xHWx_Gradient
    self._VariableShapeGradient([None, -1, -1, None])
  File "/root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/python/kernel_tests/extract_image_patches_grad_test_gpu.runfiles/org_tensorflow/tensorflow/python/kernel_tests/extract_image_patches_grad_test.py", line 173, in _VariableShapeGradient
    self.assertLess(err, 1e-4)
AssertionError: 0.2977283000946045 not less than 0.0001

----------------------------------------------------------------------
Ran 7 tests in 406.984s

FAILED (failures=3, skipped=1)
================================================================================
```
these tests

```
//tensorflow/compiler/xla/tests:convolution_test_1d_gpu_alternative_layout_gpu FAILED in 3 out of 3 in 16.0s
//tensorflow/compiler/xla/tests:convolution_test_1d_no_vmodule_gpu       FAILED in 3 out of 3 in 16.1s
```
…et - 263833

The following is a list of unit-tests from which we are removing the `no_rocm` tag in this commit, because we suspect/hope that the issue being fixed by JIRA ticket 263833 will make these tests pass.

All of these tests pass locally when tested with ROCm 4.1rc1, but given the flaky nature of the failure (these tests would only fail on some CI nodes... sometimes), we need to have them pass consistently on all CI nodes over some period of time (a couple of weeks) before we can claim victory.

```
//tensorflow/python/distribute:cross_device_ops_test_gpu
//tensorflow/python/distribute:distribute_utils_test_gpu
//tensorflow/python/distribute:strategy_common_test_gpu
//tensorflow/python/distribute:strategy_gather_test_gpu
//tensorflow/python/distribute:test_util_test_gpu
//tensorflow/python/distribute:values_test_gpu
//tensorflow/python/distribute:vars_test_gpu

//tensorflow/python/keras/distribute:custom_training_loop_metrics_test_gpu
//tensorflow/python/keras/distribute:custom_training_loop_models_test_gpu
//tensorflow/python/keras/distribute:keras_rnn_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_utils_test_gpu
```

The following tests fell into the same category as above, but their `no_rocm` tag is staying put, because flaky failures were observed for them in local testing with ROCm 4.1.

```
//tensorflow/python/keras/distribute:distribute_strategy_test_gpu
//tensorflow/python/keras/distribute:keras_dnn_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_embedding_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_image_model_correctness_test_gpu
//tensorflow/python/keras/distribute:keras_save_load_test_gpu
```

Also, some of the tests take longer than 900s to complete on the ROCm platform (no test sharding on ROCm!); this commit bumps them up to size `large`.
these tests

```
//tensorflow/python/keras/distribute:saved_model_save_load_test_gpu
//tensorflow/python/kernel_tests:tensordot_op_test_gpu
```

See details here - #1301
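A size bump like the one described would look roughly as follows in a Bazel BUILD file; the rule name, target name, and attributes are illustrative assumptions, not the exact TF BUILD contents:

```
# BUILD fragment (illustrative): raise a slow GPU test's size so it
# gets a longer timeout than the 900s it currently exceeds on ROCm,
# where test sharding is unavailable.
cuda_py_test(
    name = "tensordot_op_test",
    size = "large",  # was a smaller size; runs exceed 900s on ROCm
    srcs = ["tensordot_op_test.py"],
)
```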
…upstream_rocm_switch_to_rocm41

PiperOrigin-RevId: 367580713
Change-Id: I5289492260ea8c6fde64848ba895c1ea736019d8
@google-cla google-cla bot added the cla: yes label Apr 13, 2021
@gbaned gbaned self-assigned this Apr 14, 2021
@gbaned gbaned added the size:L CL Change Size: Large label Apr 14, 2021
@gbaned gbaned assigned mihaimaruseac and unassigned gbaned Apr 14, 2021
@gbaned gbaned requested a review from mihaimaruseac April 14, 2021 03:48
@mihaimaruseac mihaimaruseac merged commit 8e0516e into tensorflow:r2.5 Apr 22, 2021
@deven-amd deven-amd deleted the google_upstream_r25_port_pr_48187 branch May 12, 2021 23:47