[ROCm] Fix for ROCm CSB Breakage - 200630 #40954
Conversation
@chsigg gentle ping
@chsigg gentle ping Need this PR to fix the broken ROCm CSB
@chsigg gentle ping
uint32_t thread_per_block_uint = 0;
CHECK_GE(block_size_limit, 0);
uint32_t block_size_limit_uint = static_cast<uint32_t>(block_size_limit);
hipOccupancyMaxPotentialBlockSize(&block_count_uint, &thread_per_block_uint,
I think it would be cleaner to provide a forward-compatible overload of hipOccupancyMaxPotentialBlockSize.
Just trying to restore the code to its pre-existing state (for ROCm 3.3). The code within the #else block will have a short shelf life: it will be gone with the switch to ROCm 3.5+, which should happen soon (within the next month). Can we leave it as is for now?
hipError_t err = hipOccupancyMaxActiveBlocksPerMultiprocessor(
    &block_count, func, fixed_block_size, dynamic_shared_memory_size);
CHECK_EQ(err, hipSuccess);
#else
// Apply the heuristic in GetGpuLaunchConfig(int, const Eigen::GpuDevice&)
Would it be cleaner to provide a forward-compatible implementation of hipOccupancyMaxActiveBlocksPerMultiprocessor?
same as above
third_party/gpus/rocm_configure.bzl
inc_dirs = []

# Add HSA headers (needs to match $HSA_PATH)
inc_dirs.append(rocm_config.rocm_toolkit_path + "/hsa/include")
inc_dirs.append(rocm_toolkit_path + "/hsa/include")
IIRC, this mix of symlinked and realpath was intentional, but I don't remember the details. :-(
Can we kick off the internal G build for ROCm on this change and see if it breaks anything? If nothing breaks, then no harm no foul! Otherwise, we can revert it back to what it was.
Unfortunately, it does break our CI build.
ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:4136:1: Couldn't build file tensorflow/core/kernels/_objs/cwise_op_gpu/cwise_op_gpu_not_equal_to.cu.pic.o: undeclared inclusion(s) in rule '//tensorflow/core/kernels:cwise_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/cwise_op_gpu_not_equal_to.cu.cc':
'/opt/rocm/hip/include/hip/hip_runtime.h'
'/opt/rocm/hip/include/hip/hip_version.h'
'/opt/rocm/hip/include/hip/hip_common.h'
...
I think what's happening is that the cxx_builtin_include_directories don't match the directories where clang finds those headers (realpath vs symlink path).
This looks very much like the error I filed the following issue for
bazelbuild/bazel#11163
The way we worked around it on our end was:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/Dockerfile.rocm#L76-L88
Can you please check if applying the above workaround works for your build as well?
We do use ROCm from the default '/opt/rocm' location. I'm not entirely sure what is happening, but my theory is that in our environment, /opt/rocm is a symlink, and the -isystem paths that are added for hcc are:
- $HIP_PATH, as-is (potentially symlinked)
- $ROCM_PATH/lib/clang/x.y/include, real-path'ed

I don't understand how an empty version file changes any of this though.
/opt/rocm/ is a soft-link in our env too. ROCm is typically installed in /opt/rocm-x.x.x (e.g. /opt/rocm-3.5.0/), and /opt/rocm is simply a softlink to the ROCm install you want to use.
I don't understand how an empty version file changes any of this though.

That is because of a quirk in the hipcc implementation. hipcc needs to determine the value of ROCM_PATH. If the env var is not explicitly set (which we suspect happens sometimes due to the bazel issue I mentioned earlier), then it checks for the presence of the <path_to_invoked_hipcc>/../.info/version file, and if it exists, it will use <path_to_invoked_hipcc>/.. as the ROCM_PATH. <path_to_invoked_hipcc> will always be $ROCM_PATH/bin/ and is not dependent upon bazel passing the ROCM_PATH env var correctly.
Are you setting the value of ROCM_PATH before running the configure script? All of our CI scripts do set that value now (before running the configure script), because it needs to be passed down correctly to rocm_configure.bzl:
https://github.com/tensorflow/tensorflow/blob/master/configure.py#L1441-L1443
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/xla/linux/rocm/run_py3.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_cc_core.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_csb_tests.sh#L33
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/linux/rocm/run_py3_core.sh#L33
Our docker sets ROCM_PATH to /opt/rocm, but until this PR, ROCM_PATH was not used. We do not run ./configure, but use --config=rbe_linux_rocm_py3.6. Without running ./configure, there is no --action_env=ROCM_PATH either, and rocm_configure.bzl looks (looked) at ROCM_TOOLKIT_PATH.

So what seems to be happening in our internal build is that <path_to_invoked_hipcc> is /opt/rocm/bin/, and hcc will add -isystem /opt/rocm/hip/include, which is a symlink directory. If we change the implementation here to add the realpath to cxx_builtin_include_directories, we will get /opt/rocm-3.5.0/hip/include instead. That violates the bazel layering check.

Shouldn't we support ROCM_PATH being a symlink? I think we just need to keep the existing mix of rocm_config.rocm_toolkit_path and realpath for that, and probably add a comment explaining why it's handled differently.
Undid the changes involved, as per your suggestion. I do not fully understand why you are running into that build error, but we can debug it after we get the ROCm CSB up and running again, so I am reverting that part of my change.

Also, I added a new commit to the PR to update Dockerfile.rocm to use Ubuntu 18.04. When I tested my changes in the (default) Ubuntu 16.04 based container, I got a compile error which seemed to be caused by some new C++ syntax that the older gcc version in Ubuntu 16.04 does not support. Our internal CI has already switched to using Ubuntu 18.04 based containers, and I believe you have done the same on your end.

This is the error I was running into with Ubuntu 16.04:
tensorflow/stream_executor/tpu/tpu_executable_interface.cc: In member function 'virtual xla::StatusOr<xla::ExecutionOutput> xla::TpuExecutableInterface::ExecuteAsyncOnStream(const xla::ServiceExecutableRunOptions*, std::vector<xla::ExecutionInput>, xla::HloExecutionProfile*)':
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:22: error: expected unqualified-id before '[' token
for (const auto& [parameter, index] :
^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:22: error: expected ';' before '[' token
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:23: error: 'parameter' was not declared in this scope
for (const auto& [parameter, index] :
^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:34: error: capture by copy of incomplete type '<unresolved overloaded function type>'
for (const auto& [parameter, index] :
^
tensorflow/stream_executor/tpu/tpu_executable_interface.cc: In lambda function:
tensorflow/stream_executor/tpu/tpu_executable_interface.cc:179:41: error: expected '{' before ':' token
for (const auto& [parameter, index] :
^
The same source code builds fine with gcc on Ubuntu 18.04.
@chsigg please review / approve... thanks again
@chsigg I do not have visibility into the results for the internal ROCm build at your end. Let me know if it passed and whether we can take this PR
@chsigg gentle ping
@chsigg gentle ping
The following commit (which switched G's internal CI to use ROCm 3.5) breaks the ROCm CSB build (which still uses ROCm 3.3): tensorflow@22def20. This PR/commit simply puts back a couple of code blocks that were removed in that commit, and makes them conditional on ROCm 3.5. Note that the ROCm CSB build will be switching to ROCm 3.5 or higher in the near future, at which point all code in the `true` block for `#if TENSORFLOW_COMPILER_IS_HIP_CLANG` will become the default, and the code in the `false` / `#else` block will be removed.
/cc @chsigg @cheshire @nvining-work