[ROCm] Fixes for broken `--config=rocm` build, post PR #26722 (#28116)
Conversation
tensorflow/core/kernels/BUILD (Outdated)
```diff
@@ -4029,7 +4029,7 @@ tf_kernel_library(
 tf_kernel_library(
     name = "bias_op",
     prefix = "bias_op",
-    deps = NN_DEPS + [":redux_functor"] + if_cuda([
+    deps = NN_DEPS + [":redux_functor"] + if_cuda_is_configured([
```
---
This does not work internally. I am not sure what the difference between `if_cuda` and `if_cuda_is_configured` is. But to be able to merge this, someone will need to take a look into this and maybe even change the macros internally.

CC @chsigg @ezhulenev @jlebar
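For readers following along, the difference between the two macros can be sketched in pure Python. Bazel's `select()` is stood in by a plain dict so the shapes are visible; the config_setting label, the `cuda_configured` parameter, and the simplified bodies are assumptions loosely based on `third_party/gpus/cuda_configure.bzl` of that era, not the exact TF source:

```python
def select(branches):
    # Stand-in for Bazel's select(): in a real build, Bazel picks one branch
    # per configuration at analysis time. Here we just return the dict.
    return branches

def if_cuda(if_true, if_false = []):
    # Build-time choice: the CUDA branch stays visible to Bazel regardless
    # of how the workspace was configured.
    return select({
        "@local_config_cuda//cuda:using_nvcc": if_true,  # label assumed
        "//conditions:default": if_false,
    })

def if_cuda_is_configured(x, cuda_configured = False):
    # Loading-time choice: decided while BUILD files are evaluated, based on
    # what the configure step recorded. A non-CUDA setup never sees the deps.
    return x if cuda_configured else []
```

The key contrast: `if_cuda` always emits a `select()` whose branches Bazel carries around, while `if_cuda_is_configured` collapses to a plain list before Bazel's analysis ever runs.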
---
hi @gunan,

If possible, can you please shed some light on the nature of the problems introduced by this change? We would like to understand the problem(s) associated with it, to see if there is anything we can do on our end to avoid or work around them.

thanks
deven
---
Internally, there is no configure script, so `if_cuda_is_configured` semantically does not make sense for us internally.

Also, I am trying to understand what the differences between `if_cuda` and `if_cuda_is_configured` are. We are trying to move to a state where we do not have to run the configure script, so only `if_cuda` will make sense in such a case.
---
See the discussion between @chsigg and myself here for some info on the differences between `if_cuda` and `if_cuda_is_configured`: #26753 (comment)

> Internally, there is no configure script.

Do you still need to specify the env var `TF_NEED_CUDA=1` to build TF with CUDA support?

- If that is the case, then `if_cuda_is_configured` should still work.
- Else, if there is some different mechanism to specify building TF with CUDA support, then we should update this routine (https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/cuda_configure.bzl#L337) to trigger using that mechanism in addition to the env var. That should make `if_cuda_is_configured` work without running the configure script.

@chsigg, please confirm the accuracy of the above. Thanks.
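The routine deven links to boils down to an environment check. A pure-Python sketch of that check (the real code is Starlark running inside a repository rule and reads `repository_ctx.os.environ`; this approximation is not the exact TF source):

```python
def enable_cuda(environ):
    # Sketch of the TF_NEED_CUDA gate in cuda_configure.bzl: CUDA support is
    # considered "configured" only when the env var is set to "1".
    # In the real repository rule, `environ` is repository_ctx.os.environ.
    return environ.get("TF_NEED_CUDA", "").strip() == "1"
```

The suggestion above amounts to extending this predicate so it also fires on whatever mechanism replaces the env var.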
---
> Do you still need to specify the env var TF_NEED_CUDA=1 to build TF with CUDA support?

No, we do not. Even in open source, we would like to get to a state where all you need to do is `bazel build --config=cuda //tensorflow/...`. Internally, and soon externally, all you need to do to build with CUDA support is add the `--config=cuda` flag. That is all that is available to bazel to decide what to use when building.
---
So for the purpose of this PR, either someone internal will need to look into this to make it work, or we need to find a way to make your code work with `if_cuda`. Unfortunately, I do not have cycles to investigate this.
Due to the nature of the error, it does not seem that this is an issue we can solve by changing things exclusively on the ROCm side of the fence; we are going to need help. Given that, and also that this is one of the blocking issues preventing us from filing further PRs (for upstreaming ROCm support), we need to resolve this PR ASAP. What are our options towards that end?

@chsigg, your suggestion of `if_gpu(cuda_arg, rocm_arg, default_arg)` would be one potential solution here. How would we go about implementing and testing it out?

thanks again
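A minimal sketch of what such an `if_gpu` macro might look like, again in pure Python with a dict standing in for Bazel's `select()` (the config_setting labels are assumptions, since no implementation existed at this point):

```python
def select(branches):
    # Stand-in for Bazel's select(); Bazel would pick exactly one branch
    # per build configuration.
    return branches

def if_gpu(cuda_arg, rocm_arg, default_arg = []):
    # Hypothetical macro per @chsigg's suggestion: one select() covering
    # CUDA, ROCm, and the default case. Labels assumed, not real TF source.
    return select({
        "@local_config_cuda//cuda:using_nvcc": cuda_arg,
        "@local_config_rocm//rocm:using_hipcc": rocm_arg,
        "//conditions:default": default_arg,
    })
```

Because a dependency common to both backends would then live in separate branches of a single `select()`, instead of in two concatenated `select()`s, it could only ever resolve once per configuration.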
---
I think it makes most sense if we clean this up from our end. I agree with Gunhan that `if_cuda_is_configured` (and `.._compat`) should go away. As far as I understand, that macro was introduced to allow linking against CUDA libraries (say, cuDNN) while not invoking nvcc to build CUDA code. I'm not sure this is useful, but it probably won't be trivial to unify them again more than two years later. I will take a look and report back.
---
@chsigg @gunan, let me know if / how I can help develop and test the new solution; I will be more than happy to help out.

In the meantime, I will push out a commit soon that works around this issue by introducing a new `if_cuda_or_rocm` function. This function works the same as `if_cuda` / `if_rocm`, i.e. it will return the `if_true` arg when either CUDA or ROCm is enabled. Using this function to specify the common dependencies works around the bazel error. However, this solution is less than ideal because it needs to duplicate the logic within `if_cuda` / `if_rocm`. Please review the commit and let me know if this is acceptable as a temporary workaround, while we work on the proper fix.
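The workaround deven describes could look roughly like this (pure-Python sketch with a stand-in for Bazel's `select()`; the config_setting labels are assumptions, not the actual commit):

```python
def select(branches):
    # Stand-in for Bazel's select(); Bazel picks one branch per configuration.
    return branches

def if_cuda_or_rocm(if_true, if_false = []):
    # Returns if_true when either CUDA or ROCm is enabled. Note how the
    # branches mirror those of if_cuda and if_rocm -- this is the duplicated
    # logic that makes the workaround less than ideal.
    return select({
        "@local_config_cuda//cuda:using_nvcc": if_true,
        "@local_config_rocm//rocm:using_hipcc": if_true,
        "//conditions:default": if_false,
    })
```

A BUILD rule would then list the shared dependency once, e.g. `deps = COMMON_DEPS + if_cuda_or_rocm([":shared_dep"])` (names hypothetical), instead of naming it inside both `if_cuda` and `if_rocm`.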
---
This looks good as a temporary solution, thanks. I will run our internal tests and try to merge it.
---
@chsigg thank you!
I have pushed out a rebase to keep just the last commit and sync up to the tip.
For cases when we have the same dependency being specified by both CUDA and ROCm, using `if_cuda` / `if_rocm` to specify that dependency leads to a "duplicate dependency" bazel error. Switching `if_cuda` / `if_rocm` to `if_cuda_is_configured` / `if_rocm_is_configured` is not an acceptable solution, because the `*_is_configured` functions are being phased out. The preferred solution here would be a new `if_gpu(cuda_arg, rocm_arg, default_arg)` function. That solution (or something along those lines) is in the works. This workaround is meant to be a bandage that allows progress to be made while we wait for the real solution. It introduces an `if_cuda_or_rocm` function which should be used to specify dependencies that are common to both CUDA and ROCm. While this solution works, it is less than ideal because it needs to duplicate the logic inside the `if_cuda` / `if_rocm` functions.
(force-pushed from 0080a8e to 6fd3452)
…ost PR tensorflow#26722

Imported from GitHub PR tensorflow#28116

This is a follow up to PR tensorflow#26722. There were a couple of changes introduced in that PR that break the `--config=rocm` build, and this PR has fixes for the same.

1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" is introduced, due to a common dependency in the CUDA + ROCm paths
2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to a compile error.

@tatianashp, just FYI.
@gunan, @chsigg, @whchung

Link to discussion regarding `if_cuda` vs `if_cuda_is_configured` from another PR: tensorflow#26753#discussion_r267022934

Copybara import of the project:

- caf27f4 changing if_cuda back to if_cuda_is_configured. by Deven Desai <deven.desai.amd@gmail.com>
- d48421b changing LICENSE.TXT to LICENSE.txt for rocprim_archive by Deven Desai <deven.desai.amd@gmail.com>
- 0080a8e Workaround for the duplicate dependency bazel error. by Deven Desai <deven.desai.amd@gmail.com>
- 5ef1431 Merge 0080a8e into 503da... by Deven Desai <36858332+deven-amd@users.noreply.github.com>

COPYBARA_INTEGRATE_REVIEW=tensorflow#28116 from ROCmSoftwarePlatform:google_upstream_pr_26722_followup 0080a8e
PiperOrigin-RevId: 246469606
@chsigg, thank you for getting the workaround merged. It will help us make progress with upstreaming our code. Let me know if / when you want us to try out the actual solution once it becomes available. There are quite a few uses of … I don't know why this PR is still showing as "Open"; assuming you will take care of that part.

thanks again
deven
Thanks Deven, I'm working on simplifying the `if_cuda`/`rocm`* situation. My current plan:

Not sure why the PR still shows as open; I'm closing it manually.
This is a follow up to PR #26722. There were a couple of changes introduced in that PR that break the `--config=rocm` build, and this PR has fixes for the same.

1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" is introduced, due to a common dependency in the CUDA + ROCm paths
2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to a compile error.

@tatianashp, just FYI.
@gunan, @chsigg, @whchung

Link to discussion regarding `if_cuda` vs `if_cuda_is_configured` from another PR: #26753 (comment)