
[ROCm] Fixes for broken --config=rocm build, post PR #26722 #28116

Closed

Conversation

deven-amd
Contributor

This is a follow-up to PR #26722.

That PR introduced a couple of changes that break the `--config=rocm` build, and this PR fixes them:

  1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" error is introduced by a dependency common to the CUDA and ROCm paths

  2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to a compile error
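For illustration, the failure mode looks roughly like the following sketch (target and variable names here are invented, not the actual TensorFlow rules):

```python
# BUILD sketch -- hypothetical names. When the CUDA and ROCm branches each
# contribute the same common dependency, bazel can reject the deps list
# with a "duplicate dependency" error.
tf_kernel_library(
    name = "some_op",
    prefix = "some_op",
    deps = COMMON_DEPS + if_cuda([
        ":gpu_common_dep",  # also listed by the ROCm branch below
    ]) + if_rocm([
        ":gpu_common_dep",
    ]),
)
```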

@tatianashp , just FYI.

@gunan , @chsigg, @whchung

Link to discussion regarding if_cuda vs if_cuda_is_configured from another PR : #26753 (comment)

```diff
@@ -4029,7 +4029,7 @@ tf_kernel_library(
 tf_kernel_library(
     name = "bias_op",
     prefix = "bias_op",
-    deps = NN_DEPS + [":redux_functor"] + if_cuda([
+    deps = NN_DEPS + [":redux_functor"] + if_cuda_is_configured([
```
Contributor

This does not work internally.
I am not sure what the difference between `if_cuda` and `if_cuda_is_configured` is, but to be able to merge this, someone will need to look into it and maybe even change the macros internally.
CC @chsigg @ezhulenev @jlebar

Contributor Author

hi @gunan,

If possible, can you please shed some light on the nature of the problems introduced by this change? We would like to understand them, to see if there is anything we can do on our end to avoid or work around them.

thanks

deven

Contributor

Internally, there is no configure script, so `if_cuda_is_configured` semantically does not make sense for us.

Also, I am trying to understand what the differences between `if_cuda` and `if_cuda_is_configured` are. We are trying to move to a state where we do not have to run the configure script, so only `if_cuda` will make sense in that case.

Contributor Author

See the discussion between @chsigg and myself here for some info on the differences between `if_cuda` and `if_cuda_is_configured`: #26753 (comment)

> Internally, there is no configure script.

Do you still need to specify the env var `TF_NEED_CUDA=1` to build TF with CUDA support?

@chsigg please confirm the accuracy of the above, thanks.

Contributor

@gunan gunan Apr 25, 2019

> Do you still need to specify the env var `TF_NEED_CUDA=1` to build TF with CUDA support?

No, we do not. Even in open source, we would like to get to a state where all you need to do is `bazel build --config=cuda //tensorflow/...`

Internally, and soon externally, all you need to do to build with CUDA support is add the `--config=cuda` flag. That is all that is available to bazel to decide what to use when building.

Contributor Author

> So for the purpose of this PR, either someone internal will need to take a look into this to make it work, or we need to find a way to make your code work with `if_cuda`. Unfortunately, I do not have cycles to investigate this.

Due to the nature of the error, it does not seem that this is an issue we can solve by changing things exclusively on the ROCm side of the fence; we are going to need help. Given that, and that this is one of the blocking issues preventing us from filing further PRs (for upstreaming ROCm support), we need to resolve this PR ASAP. What are our options toward that end?

@chsigg, your suggestion of `if_gpu(cuda_arg, rocm_arg, default_arg)` would be one potential solution here. How would we go about implementing and testing it?

thanks again

Contributor

I think it makes most sense if we clean this up from our end. I agree with Gunhan that `if_cuda_is_configured` (and the `_compat` variant) should go away. As far as I understand, that macro was introduced to allow linking against CUDA libraries (say, cuDNN) while not invoking nvcc to build CUDA code. I'm not sure this is useful, but it probably won't be trivial to unify them again more than two years later. I will take a look and report back.

Contributor Author

@chsigg @gunan, let me know if / how I can help develop and test the new solution; I will be more than happy to help out.

In the meantime, I will push out a commit soon that works around this issue by introducing a new `if_cuda_or_rocm` function. This function works the same as `if_cuda` / `if_rocm`, i.e. it will return the `if_true` arg when either CUDA or ROCm is enabled. Using this function to specify the common dependencies works around the bazel error. However, this solution is less than ideal because it needs to duplicate the logic within `if_cuda` / `if_rocm`. Please review the commit and let me know if this is acceptable as a temporary workaround while we work on the proper fix.
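A minimal sketch of such a function, assuming hypothetical `config_setting` labels (the real macro would key on TensorFlow's actual CUDA/ROCm settings), might look like:

```python
# gpu_build_defs.bzl -- illustrative sketch only, not the actual TensorFlow code.
def if_cuda_or_rocm(if_true, if_false = []):
    """Returns if_true when either CUDA or ROCm is enabled, else if_false.

    Listing a common dependency once through this macro avoids repeating it
    in both the if_cuda() and if_rocm() branches, which is what triggers the
    duplicate-dependency error.
    """
    return select({
        ":using_cuda": if_true,  # hypothetical config_setting labels
        ":using_rocm": if_true,
        "//conditions:default": if_false,
    })
```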

Contributor

This looks good as a temporary solution, thanks. I will run our internal tests and try to merge it.

Contributor Author

@chsigg thank you!

I have pushed out a rebase to keep just the last commit and sync up to the tip.

@rthadur rthadur requested a review from gunan April 24, 2019 21:05
@chsigg chsigg added the ready to pull PR ready for merge process label May 2, 2019
For cases where the same dependency is specified by both the CUDA and ROCm paths, using `if_cuda` / `if_rocm` to specify that dependency leads to a "duplicate dependency" bazel error.

Switching `if_cuda` / `if_rocm` to `if_cuda_is_configured` / `if_rocm_is_configured` is not an acceptable solution, because the `*_is_configured` functions are being phased out.

The preferred solution here would be a new `if_gpu(cuda_arg, rocm_arg, default_arg)` function. That solution (or something along those lines) is in the works. This workaround is meant to be a bandage that allows progress to be made while we wait for the real solution.

This workaround introduces an `if_cuda_or_rocm` function, which should be used to specify dependencies that are common to both CUDA and ROCm. While this solution works, it is less than ideal because it needs to duplicate the logic inside the `if_cuda` / `if_rocm` functions.
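For illustration (invented target names), a kernel rule would then list the shared dependency once, keeping genuinely CUDA-only or ROCm-only deps in their own branches:

```python
# BUILD sketch -- hypothetical names, shown only to illustrate the usage.
tf_kernel_library(
    name = "some_op",
    prefix = "some_op",
    deps = COMMON_DEPS + if_cuda_or_rocm([
        ":gpu_common_dep",  # shared by both GPU paths, listed once
    ]) + if_cuda([
        ":cuda_only_dep",
    ]) + if_rocm([
        ":rocm_only_dep",
    ]),
)
```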
@deven-amd deven-amd force-pushed the google_upstream_pr_26722_followup branch from 0080a8e to 6fd3452 Compare May 2, 2019 13:41
@rthadur rthadur requested a review from chsigg May 2, 2019 16:33
pull bot pushed a commit to testkevinbonz/tensorflow that referenced this pull request May 3, 2019

[ROCm] Fixes for broken --config=rocm build, post PR tensorflow#26722

Imported from GitHub PR tensorflow#28116

This is a follow up to PR tensorflow#26722 .

There were a couple of changes introduced in that PR that break the `--config=rocm` build, and this PR has fixes for the same.

1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" is introduced, due to a common dependency in the CUDA + ROCm paths

2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to compile error.

@tatianashp , just FYI.

@gunan , @chsigg, @whchung

Link to discussion regarding `if_cuda` vs `if_cuda_is_configured` from another PR : tensorflow#26753#discussion_r267022934

Copybara import of the project:

  - caf27f4 changing if_cuda back to if_cuda_is_configured. by Deven Desai <deven.desai.amd@gmail.com>
  - d48421b changing LICENSE.TXT to LICENSE.txt for rocprim_archive by Deven Desai <deven.desai.amd@gmail.com>
  - 0080a8e Workaround for the duplicate dependency bazel error. by Deven Desai <deven.desai.amd@gmail.com>
  - 5ef1431 Merge 0080a8e into 503da... by Deven Desai <36858332+deven-amd@users.noreply.github.com>

COPYBARA_INTEGRATE_REVIEW=tensorflow#28116 from ROCmSoftwarePlatform:google_upstream_pr_26722_followup 0080a8e
PiperOrigin-RevId: 246469606
@deven-amd
Contributor Author

@chsigg, thank you for getting the workaround merged; it will help us make progress with upstreaming our code. Let me know if / when you want us to try out the actual solution once it becomes available. There are quite a few uses of `if_rocm_is_configured` and I would like to remove them all, given that the `_is_configured` variant will be deprecated sooner or later.

I don't know why this PR is still showing as "Open". Assuming you will take care of that part.

thanks again

deven

@chsigg
Contributor

chsigg commented May 6, 2019

Thanks Deven, I'm working on simplifying the `if_cuda` / `if_rocm` situation. My current plan:

  • `if_*_is_configured()` should refer to whether the GPU libraries are available. Searching for CUDA libraries will be enabled through `--config=using_cuda`, which has the effect of `--define=using_cuda=true` and `--action_env=TF_NEED_CUDA=1`. The former is the standard way to pick among `config_setting`s, and I would like `if_cuda_is_configured()` to return a `select` statement (to match the internal implementation, which returns a `select` distinguishing between platforms where CUDA is available). The latter is the only way to affect a `repository_rule`. ROCm should have an equivalent setup, and I want to add an `if_gpu_is_configured(if_cuda, if_rocm, otherwise)` macro.

  • `if_*()` should refer to whether we build GPU op kernels (this excludes XLA, which should only rely on the mechanics above). CUDA op kernels are enabled through `--config=cuda`, which again triggers a define (`--define=using_cuda_nvcc=true` or `--define=using_cuda_clang=true`), which triggers a `config_setting`, which is used in a `select` statement. This is already the case today. Again, same for ROCm, plus an `if_gpu(if_cuda, if_rocm, otherwise)`.
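As a sketch of the define / config_setting / select chain described above (names are illustrative, not the exact TensorFlow ones):

```python
# BUILD: --config=cuda sets --define=using_cuda_nvcc=true via .bazelrc,
# which makes this config_setting match.
config_setting(
    name = "using_cuda_nvcc",
    define_values = {"using_cuda_nvcc": "true"},
)

# .bzl: the macro returns a select() keyed on that config_setting.
def if_cuda(if_true, if_false = []):
    return select({
        ":using_cuda_nvcc": if_true,
        "//conditions:default": if_false,
    })
```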

Not sure why the PR still shows as open, I'm closing it manually.

@chsigg chsigg closed this May 6, 2019
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected May 6, 2019
@deven-amd deven-amd deleted the google_upstream_pr_26722_followup branch May 6, 2019 13:57
ashwin pushed a commit to ashwin/tensorflow that referenced this pull request May 14, 2019
Labels
cla: yes, ready to pull (PR ready for merge process), size:S (CL Change Size: Small), subtype:bazel (Bazel related Build_Installation issues)
Projects: PR Queue (Closed/Rejected)