
[ROCm] Fix for ROCm CSB Breakage - 200630 #40954

Conversation

deven-amd
Contributor

The following commit (which switched G's internal CI to use ROCm 3.5) breaks the ROCm CSB build (which still uses ROCm 3.3):

22def20

This PR/commit simply puts back a couple of code blocks that were removed by the previous commit, and makes them conditional on ROCm 3.5.

Note that the ROCm CSB build will be switching to ROCm 3.5 or higher in the near future, at which point all code within the true block for #if TENSORFLOW_COMPILER_IS_HIP_CLANG will become the default, and the code in the false / #else block will be removed.
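The guard described above follows the usual versioned-API pattern. A minimal sketch of the shape of the change, with hypothetical stand-in functions (`OccupancyNew`/`OccupancyOld` are illustrative stubs, not the actual TensorFlow or HIP code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two HIP occupancy signatures: ROCm 3.5+
// reports sizes through int*, while ROCm 3.3 used uint32_t*.
void OccupancyNew(int* blocks, int* threads) {
  *blocks = 4;
  *threads = 256;
}
void OccupancyOld(uint32_t* blocks, uint32_t* threads) {
  *blocks = 4;
  *threads = 256;
}

// The #if keeps both call sites side by side; the ROCm 3.3 branch is the
// one this PR restores, and it is expected to disappear once the CSB build
// moves to ROCm 3.5+.
int PickThreadsPerBlock() {
#if TENSORFLOW_COMPILER_IS_HIP_CLANG
  int blocks = 0, threads = 0;
  OccupancyNew(&blocks, &threads);
  return threads;
#else
  uint32_t blocks = 0, threads = 0;
  OccupancyOld(&blocks, &threads);
  return static_cast<int>(threads);
#endif
}
```

When TENSORFLOW_COMPILER_IS_HIP_CLANG is undefined the preprocessor treats it as 0, so the #else (ROCm 3.3) branch is compiled, matching the CSB configuration.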


/cc @chsigg @cheshire @nvining-work

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label Jun 30, 2020
@gbaned gbaned self-assigned this Jul 1, 2020
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 1, 2020
@gbaned gbaned requested a review from chsigg July 1, 2020 05:12
@gbaned gbaned added the comp:gpu GPU related issues label Jul 1, 2020
@deven-amd
Contributor Author

@chsigg gentle ping

@gbaned gbaned added the awaiting review Pull request awaiting review label Jul 6, 2020
@deven-amd
Contributor Author

@chsigg gentle ping

Need this PR to fix the broken ROCm CSB

@gbaned gbaned requested a review from cheshire July 8, 2020 12:09
@deven-amd
Contributor Author

@chsigg gentle ping

uint32_t thread_per_block_uint = 0;
CHECK_GE(block_size_limit, 0);
uint32_t block_size_limit_uint = static_cast<uint32_t>(block_size_limit);
hipOccupancyMaxPotentialBlockSize(&block_count_uint, &thread_per_block_uint,
Contributor

I think it would be cleaner to provide a forward-compatible overload of hipOccupancyMaxPotentialBlockSize.
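One way to read that suggestion is an overload that papers over the signature difference (a sketch with a stubbed ROCm 3.3 entry point; the real hipOccupancyMaxPotentialBlockSize also takes the kernel function and shared-memory/block-size limits, omitted here for brevity):

```cpp
#include <cassert>
#include <cstdint>

using hipError_t = int;
constexpr hipError_t hipSuccess = 0;

// Stub standing in for the ROCm 3.3 API, which reports sizes via uint32_t*.
hipError_t hipOccupancyMaxPotentialBlockSize(uint32_t* min_grid_size,
                                             uint32_t* block_size) {
  *min_grid_size = 8;
  *block_size = 512;
  return hipSuccess;
}

// Forward-compatible overload: accepts the int* signature that newer ROCm
// releases use and delegates to the uint32_t* version, so call sites would
// not need an #if at all.
hipError_t hipOccupancyMaxPotentialBlockSize(int* min_grid_size,
                                             int* block_size) {
  uint32_t grid = 0, block = 0;
  const hipError_t err = hipOccupancyMaxPotentialBlockSize(&grid, &block);
  *min_grid_size = static_cast<int>(grid);
  *block_size = static_cast<int>(block);
  return err;
}
```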

Contributor Author

Just trying to restore the code to its pre-existing state (for ROCm 3.3). The code within the #else block will have a short shelf life: it will be gone with the switch to ROCm 3.5+, which should happen soon (within the next month).

Can we leave it as is for now?

hipError_t err = hipOccupancyMaxActiveBlocksPerMultiprocessor(
&block_count, func, fixed_block_size, dynamic_shared_memory_size);
CHECK_EQ(err, hipSuccess);
#else
// Apply the heuristic in GetGpuLaunchConfig(int, const Eigen::GpuDevice&)
Contributor

Would it be cleaner to provide a forward-compatible implementation of hipOccupancyMaxActiveBlocksPerMultiprocessor?

Contributor Author

Same as above.

inc_dirs = []

# Add HSA headers (needs to match $HSA_PATH)
inc_dirs.append(rocm_config.rocm_toolkit_path + "/hsa/include")
inc_dirs.append(rocm_toolkit_path + "/hsa/include")
Contributor

IIRC, this mix of symlinked and realpath was intentional, but I don't remember the details.
:-(

Contributor Author

Can we kick off the internal G build for ROCm on this change and see if it breaks anything?
If nothing breaks, then no harm no foul!
Otherwise we can revert it back to what it was.

Contributor

Unfortunately, it does break our CI build.

ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:4136:1: Couldn't build file tensorflow/core/kernels/_objs/cwise_op_gpu/cwise_op_gpu_not_equal_to.cu.pic.o: undeclared inclusion(s) in rule '//tensorflow/core/kernels:cwise_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/cwise_op_gpu_not_equal_to.cu.cc':
  '/opt/rocm/hip/include/hip/hip_runtime.h'
  '/opt/rocm/hip/include/hip/hip_version.h'
  '/opt/rocm/hip/include/hip/hip_common.h'
...

I think what's happening is that the cxx_builtin_include_directories don't match the directories where clang finds those headers (realpath vs symlink path).
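The mismatch being described can be reproduced in isolation. A sketch using a throwaway directory standing in for the /opt/rocm -> /opt/rocm-3.5.0 symlink (paths are illustrative):

```python
import os
import tempfile

# Build a miniature version of the ROCm layout: a real install directory
# plus a symlink pointing at it, like /opt/rocm -> /opt/rocm-3.5.0.
base = os.path.realpath(tempfile.mkdtemp())
real_install = os.path.join(base, "rocm-3.5.0")
os.makedirs(os.path.join(real_install, "hip", "include"))
link = os.path.join(base, "rocm")  # plays the role of /opt/rocm
os.symlink(real_install, link)

declared = os.path.join(link, "hip", "include")   # what bazel declares
resolved = os.path.realpath(declared)             # what the compiler reports

# The layering check compares these strings, so if the compiler emits the
# resolved path while bazel declared the symlinked one, the build fails
# with "undeclared inclusion(s)".
assert declared != resolved
```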

Contributor Author

This looks very much like the error I filed the following issue for:
bazelbuild/bazel#11163

The way we worked around it on our end was:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/Dockerfile.rocm#L76-L88

Can you please check whether applying the above workaround works for your build as well?

Contributor

We do use ROCm from the default '/opt/rocm' location. I'm not entirely sure what is happening, but my theory is that in our environment, /opt/rocm is a symlink, and the -isystem paths that are added for hcc are:

  • $HIP_PATH, as-is (potentially symlinked)
  • $ROCM_PATH/lib/clang/x.y/include, real-path'ed

I don't understand how an empty version file changes any of this though.

Contributor Author

/opt/rocm/ is a soft-link in our env too.

ROCm is typically installed in /opt/rocm-x.x.x (e.g. /opt/rocm-3.5.0/), and /opt/rocm is simply a softlink to the ROCm install you want to use.

I don't understand how an empty version file changes any of this though.

That is because of a quirk in the hipcc implementation.

hipcc needs to determine the value of ROCM_PATH. If the env var is not explicitly set (which we suspect happens sometimes due to the bazel issue I mentioned earlier), then it checks for the presence of the <path_to_invoked_hipcc>/../.info/version file, and if that file exists, it will use <path_to_invoked_hipcc>/.. as the ROCM_PATH.

<path_to_invoked_hipcc> will always be $ROCM_PATH/bin/ and is not dependent upon bazel passing the ROCM_PATH env var correctly.
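The lookup described above can be sketched as follows (a simplified Python rendering of the behavior, not hipcc's actual Perl/shell code; `guess_rocm_path` is an illustrative name):

```python
import os

def guess_rocm_path(env, hipcc_path):
    """Sketch of hipcc's ROCM_PATH resolution as described above."""
    # 1) An explicitly set ROCM_PATH env var wins.
    if env.get("ROCM_PATH"):
        return env["ROCM_PATH"]
    # 2) Otherwise fall back to <path_to_invoked_hipcc>/.. , but only if the
    #    marker file .info/version exists there. With an empty/absent marker
    #    file check failing, no fallback is available.
    candidate = os.path.dirname(os.path.dirname(hipcc_path))
    if os.path.exists(os.path.join(candidate, ".info", "version")):
        return candidate
    return None
```

Since hipcc always lives under $ROCM_PATH/bin/, step 2 recovers the install prefix even when bazel fails to propagate the env var.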

Are you setting the value of ROCM_PATH before running the configure script? All of our CI scripts do set that value now (before running the configure script), because it needs to be passed down correctly to rocm_configure.bzl:

https://github.com/tensorflow/tensorflow/blob/master/configure.py#L1441-L1443
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/xla/linux/rocm/run_py3.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_cc_core.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_csb_tests.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_py3_core.sh#L33

Contributor

Our docker sets ROCM_PATH to /opt/rocm, but until this PR ROCM_PATH was not used. We do not run ./configure, but use --config=rbe_linux_rocm_py3.6. Without running ./configure, there is no --action_env=ROCM_PATH either, and rocm_configure.bzl looks (looked) at ROCM_TOOLKIT_PATH.

So what seems to be happening in our internal build is that <path_to_invoked_hipcc> is /opt/rocm/bin/ and hcc will add -isystem /opt/rocm/hip/include, which is a symlink directory. If we change the implementation here to add the realpath to cxx_builtin_include_directories, we will get /opt/rocm-3.5.0/hip/include instead. That violates the bazel layering check.

Shouldn't we support ROCM_PATH being a symlink? I think we just need to keep the existing mix of rocm_config.rocm_toolkit_path and realpath for that, and probably add a comment explaining why it's handled differently.
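Keeping that mix could look roughly like the following (a Python sketch of the repository-rule logic under the assumption that declaring both spellings satisfies the layering check; `rocm_include_dirs` is an illustrative name, not the actual rocm_configure.bzl code):

```python
import os

def rocm_include_dirs(rocm_toolkit_path):
    """Sketch: declare both the as-configured and the resolved include
    paths, so the layering check passes whichever form the compiler emits."""
    inc_dirs = []
    # As configured (possibly a symlink like /opt/rocm): matches include
    # paths that hipcc passes through unchanged.
    inc_dirs.append(rocm_toolkit_path + "/hsa/include")
    # Resolved (e.g. /opt/rocm-3.5.0): matches include paths the compiler
    # reports after realpath'ing them.
    resolved = os.path.realpath(rocm_toolkit_path)
    if resolved != rocm_toolkit_path:
        inc_dirs.append(resolved + "/hsa/include")
    return inc_dirs
```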

Contributor Author

I undid the changes as per your suggestion. I do not fully understand why you are running into the build error that you are, but we can debug that after we get the ROCm CSB up and running again, so I am reverting that part of my change.

Also, I added a new commit to the PR to update Dockerfile.rocm to use Ubuntu 18.04.
When I tested my changes in the (default) Ubuntu 16.04 based container, I got a compile error which seemed to be caused by some newer C++ syntax that the older gcc version in Ubuntu 16.04 does not support.
Our internal CI has already switched to using Ubuntu 18.04 based containers, and I believe you have done the same on your end.

This is the error I was running into with Ubuntu 16.04:

tensorflow/stream_executor/tpu/tpu_executable_interface.cc: In member function 'virtual xla::StatusOr<xla::ExecutionOutput> xla::TpuExecutableInterface::ExecuteAsyncOnStream(const xla::ServiceExecutableRunOptions*, std::vector<xla::ExecutionInput>, xla::HloExecutionProfile*)':
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:22: error: expected unqualified-id before '[' token
     for (const auto& [parameter, index] :
                      ^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:22: error: expected ';' before '[' token
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:23: error: 'parameter' was not declared in this scope
     for (const auto& [parameter, index] :
                       ^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:34: error: capture by copy of incomplete type '<unresolved overloaded function type>'
     for (const auto& [parameter, index] :
                                  ^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc: In lambda function:
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:41: error: expected '{' before ':' token
     for (const auto& [parameter, index] :
                                          ^

the same source code builds fine with gcc on Ubuntu 18.04
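The failing line uses C++17 structured bindings, which the default gcc on Ubuntu 16.04 (gcc 5.x) does not support. A hedged illustration of the two spellings (a standalone example, not the TensorFlow code; the first function requires -std=c++17):

```cpp
#include <cassert>
#include <map>
#include <string>

// C++17 structured bindings: the form that trips up older gcc.
int sum_with_bindings(const std::map<std::string, int>& m) {
  int total = 0;
  for (const auto& [key, value] : m) {  // needs C++17
    (void)key;  // unused in this example
    total += value;
  }
  return total;
}

// Pre-C++17 equivalent: compiles with the older gcc in Ubuntu 16.04.
int sum_pre_cxx17(const std::map<std::string, int>& m) {
  int total = 0;
  for (const auto& kv : m) {
    total += kv.second;
  }
  return total;
}
```

Both loops are equivalent; only the binding syntax differs, which is why upgrading the container (rather than changing the source) resolves the error.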

Contributor Author

@chsigg please review / approve...thanks again

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jul 12, 2020
@gbaned gbaned added the kokoro:force-run Tests on submitted change label Jul 12, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jul 12, 2020
@deven-amd
Contributor Author

@chsigg I do not have visibility into the results for the internal ROCm build at your end. Let me know if it passed and whether we can take this PR.

@deven-amd
Contributor Author

@chsigg gentle ping

@gbaned gbaned added the awaiting review Pull request awaiting review label Jul 17, 2020
@deven-amd
Contributor Author

@chsigg gentle ping

@deven-amd deven-amd force-pushed the google_upstream_rocm_platform_fix_200630 branch from b2aa121 to ddafc33 Compare July 24, 2020 03:06
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jul 24, 2020
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jul 24, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jul 24, 2020
@gbaned gbaned removed the awaiting review Pull request awaiting review label Jul 24, 2020
@tensorflow-copybara tensorflow-copybara merged commit cbc87da into tensorflow:master Jul 24, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Jul 24, 2020
@deven-amd deven-amd deleted the google_upstream_rocm_platform_fix_200630 branch August 26, 2020 14:04