
[ROCm] Fixes for broken --config=rocm build, post PR #26722 #28116

Closed

Conversation

deven-amd
Contributor

This is a follow-up to PR #26722.

That PR introduced a couple of changes that break the `--config=rocm` build, and this PR fixes them:

  1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" error is introduced by a dependency common to the CUDA and ROCm paths

  2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to a compile error
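For illustration, the failure mode looks roughly like the following sketch (target and variable names here are invented, not the actual TensorFlow rules):

```python
# BUILD sketch -- hypothetical names. When the CUDA and ROCm branches each
# contribute the same common dependency, bazel can reject the deps list
# with a "duplicate dependency" error.
tf_kernel_library(
    name = "some_op",
    prefix = "some_op",
    deps = COMMON_DEPS + if_cuda([
        ":gpu_common_dep",  # also listed by the ROCm branch below
    ]) + if_rocm([
        ":gpu_common_dep",
    ]),
)
```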

@tatianashp , just FYI.

@gunan , @chsigg, @whchung

Link to discussion regarding if_cuda vs if_cuda_is_configured from another PR : #26753 (comment)

```diff
@@ -4029,7 +4029,7 @@ tf_kernel_library(
 tf_kernel_library(
     name = "bias_op",
     prefix = "bias_op",
-    deps = NN_DEPS + [":redux_functor"] + if_cuda([
+    deps = NN_DEPS + [":redux_functor"] + if_cuda_is_configured([
```
Contributor

This does not work internally.
I am not sure what the difference between `if_cuda` and `if_cuda_is_configured` is, but to be able to merge this, someone will need to look into it and maybe even change the macros internally.
CC @chsigg @ezhulenev @jlebar

Contributor Author

hi @gunan,

If possible, can you please shed some light on the nature of the problems introduced by this change? We would like to understand them, to see if there is anything we can do on our end to avoid or work around them.

thanks

deven

Contributor

Internally, there is no configure script, so `if_cuda_is_configured` semantically does not make sense for us.

Also, I am trying to understand what the differences between `if_cuda` and `if_cuda_is_configured` are. We are trying to move to a state where we do not have to run the configure script, so only `if_cuda` will make sense in that case.

Contributor Author

See the discussion between @chsigg and myself here for some info on the differences between `if_cuda` and `if_cuda_is_configured`: #26753 (comment)

> Internally, there is no configure script.

Do you still need to specify the env var `TF_NEED_CUDA=1` to build TF with CUDA support?

@chsigg please confirm the accuracy of the above, thanks.

Contributor

@gunan gunan Apr 25, 2019

> Do you still need to specify the env var `TF_NEED_CUDA=1` to build TF with CUDA support?

No, we do not. Even in open source, we would like to get to a state where all you need to do is `bazel build --config=cuda //tensorflow/...`

Internally, and soon externally, all you need to do to build with CUDA support is add the `--config=cuda` flag. That is all that is available to bazel to decide what to use when building.

Contributor Author

> So for the purpose of this PR, either someone internal will need to take a look into this to make it work, or we need to find a way to make your code work with `if_cuda`. Unfortunately, I do not have cycles to investigate this.

Due to the nature of the error, it does not seem that this is an issue we can solve by changing things exclusively on the ROCm side of the fence; we are going to need help. Given that, and that this is one of the blocking issues preventing us from filing further PRs (for upstreaming ROCm support), we need to resolve this PR ASAP. What are our options toward that end?

@chsigg, your suggestion of `if_gpu(cuda_arg, rocm_arg, default_arg)` would be one potential solution here. How would we go about implementing and testing it?

thanks again

Contributor

I think it makes most sense if we clean this up from our end. I agree with Gunhan that `if_cuda_is_configured` (and the `_compat` variant) should go away. As far as I understand, that macro was introduced to allow linking against CUDA libraries (say, cuDNN) while not invoking nvcc to build CUDA code. I'm not sure this is useful, but it probably won't be trivial to unify them again more than two years later. I will take a look and report back.

Contributor Author

@chsigg @gunan, let me know if / how I can help develop and test the new solution; I will be more than happy to help out.

In the meantime, I will push out a commit soon that works around this issue by introducing a new `if_cuda_or_rocm` function. This function works the same as `if_cuda` / `if_rocm`, i.e. it will return the `if_true` arg when either CUDA or ROCm is enabled. Using this function to specify the common dependencies works around the bazel error. However, this solution is less than ideal because it needs to duplicate the logic within `if_cuda` / `if_rocm`. Please review the commit and let me know if this is acceptable as a temporary workaround while we work on the proper fix.
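A minimal sketch of such a function, assuming hypothetical `config_setting` labels (the real macro would key on TensorFlow's actual CUDA/ROCm settings), might look like:

```python
# gpu_build_defs.bzl -- illustrative sketch only, not the actual TensorFlow code.
def if_cuda_or_rocm(if_true, if_false = []):
    """Returns if_true when either CUDA or ROCm is enabled, else if_false.

    Listing a common dependency once through this macro avoids repeating it
    in both the if_cuda() and if_rocm() branches, which is what triggers the
    duplicate-dependency error.
    """
    return select({
        ":using_cuda": if_true,  # hypothetical config_setting labels
        ":using_rocm": if_true,
        "//conditions:default": if_false,
    })
```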

Contributor

This looks good as a temporary solution, thanks. I will run our internal tests and try to merge it.

Contributor Author

@chsigg thank you!

I have pushed out a rebase to keep just the last commit and sync up to the tip.

@rthadur rthadur requested a review from gunan April 24, 2019 21:05
@chsigg chsigg added the ready to pull PR ready for merge process label May 2, 2019
For cases where the same dependency is specified by both the CUDA and ROCm paths, using `if_cuda` / `if_rocm` to specify that dependency leads to a "duplicate dependency" bazel error.

Switching `if_cuda` / `if_rocm` to `if_cuda_is_configured` / `if_rocm_is_configured` is not an acceptable solution, because the `*_is_configured` functions are being phased out.

The preferred solution here would be a new `if_gpu(cuda_arg, rocm_arg, default_arg)` function. That solution (or something along those lines) is in the works. This workaround is meant to be a bandage that allows progress to be made while we wait for the real solution.

This workaround introduces an `if_cuda_or_rocm` function, which should be used to specify dependencies that are common to both CUDA and ROCm. While this solution works, it is less than ideal because it needs to duplicate the logic inside the `if_cuda` / `if_rocm` functions.
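For illustration (invented target names), a kernel rule would then list the shared dependency once, keeping genuinely CUDA-only or ROCm-only deps in their own branches:

```python
# BUILD sketch -- hypothetical names, shown only to illustrate the usage.
tf_kernel_library(
    name = "some_op",
    prefix = "some_op",
    deps = COMMON_DEPS + if_cuda_or_rocm([
        ":gpu_common_dep",  # shared by both GPU paths, listed once
    ]) + if_cuda([
        ":cuda_only_dep",
    ]) + if_rocm([
        ":rocm_only_dep",
    ]),
)
```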
@deven-amd deven-amd force-pushed the google_upstream_pr_26722_followup branch from 0080a8e to 6fd3452 Compare May 2, 2019 13:41
@rthadur rthadur requested a review from chsigg May 2, 2019 16:33
pull bot pushed a commit to testkevinbonz/tensorflow that referenced this pull request May 3, 2019

[ROCm] Fixes for broken --config=rocm build, post PR tensorflow#26722

Imported from GitHub PR tensorflow#28116

This is a follow up to PR tensorflow#26722 .

There were a couple of changes introduced in that PR that break the `--config=rocm` build, and this PR has fixes for the same.

1. changing `if_cuda` to `if_cuda_is_configured` in cases where a "duplicate dependency" is introduced, due to a common dependency in the CUDA + ROCm paths

2. correcting a filename typo for `@rocprim_archive//:LICENSE.txt` that was leading to compile error.

@tatianashp , just FYI.

@gunan , @chsigg, @whchung

Link to discussion regarding `if_cuda` vs `if_cuda_is_configured` from another PR : tensorflow#26753#discussion_r267022934

Copybara import of the project:

  - caf27f4 changing if_cuda back to if_cuda_is_configured. by Deven Desai <deven.desai.amd@gmail.com>
  - d48421b changing LICENSE.TXT to LICENSE.txt for rocprim_archive by Deven Desai <deven.desai.amd@gmail.com>
  - 0080a8e Workaround for the duplicate dependency bazel error. by Deven Desai <deven.desai.amd@gmail.com>
  - 5ef1431 Merge 0080a8e into 503da... by Deven Desai <36858332+deven-amd@users.noreply.github.com>

COPYBARA_INTEGRATE_REVIEW=tensorflow#28116 from ROCmSoftwarePlatform:google_upstream_pr_26722_followup 0080a8e
PiperOrigin-RevId: 246469606
@deven-amd
Contributor Author

@chsigg, thank you for getting the workaround merged; it will help us make progress with upstreaming our code. Let me know if / when you want us to try out the actual solution once it becomes available. There are quite a few uses of `if_rocm_is_configured` and I would like to remove them all, given that the `_is_configured` variant will be deprecated sooner or later.

I don't know why this PR is still showing as "Open". Assuming you will take care of that part.

thanks again

deven

@chsigg
Contributor

chsigg commented May 6, 2019

Thanks Deven, I'm working on simplifying the `if_cuda` / `if_rocm` situation. My current plan:

  • `if_*_is_configured()` should refer to whether the GPU libraries are available. Searching for CUDA libraries will be enabled through `--config=using_cuda`, which has the effect of `--define=using_cuda=true` and `--action_env=TF_NEED_CUDA=1`. The former is the standard way to pick among `config_setting`s, and I would like `if_cuda_is_configured()` to return a `select` statement (to match the internal implementation, which returns a `select` distinguishing between platforms where CUDA is available). The latter is the only way to affect a `repository_rule`. ROCm should have an equivalent setup, and I want to add an `if_gpu_is_configured(if_cuda, if_rocm, otherwise)` macro.

  • `if_*()` should refer to whether we build GPU op kernels (this excludes XLA, which should only rely on the mechanics above). CUDA op kernels are enabled through `--config=cuda`, which again triggers a define (`--define=using_cuda_nvcc=true` or `--define=using_cuda_clang=true`), which triggers a `config_setting`, which is used in a `select` statement. This is already the case today. Again, same for ROCm, plus an `if_gpu(if_cuda, if_rocm, otherwise)`.
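As a sketch of the define / config_setting / select chain described above (names are illustrative, not the exact TensorFlow ones):

```python
# BUILD: --config=cuda sets --define=using_cuda_nvcc=true via .bazelrc,
# which makes this config_setting match.
config_setting(
    name = "using_cuda_nvcc",
    define_values = {"using_cuda_nvcc": "true"},
)

# .bzl: the macro returns a select() keyed on that config_setting.
def if_cuda(if_true, if_false = []):
    return select({
        ":using_cuda_nvcc": if_true,
        "//conditions:default": if_false,
    })
```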

Not sure why the PR still shows as open, I'm closing it manually.

@chsigg chsigg closed this May 6, 2019
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected May 6, 2019
@deven-amd deven-amd deleted the google_upstream_pr_26722_followup branch May 6, 2019 13:57
ashwin pushed a commit to ashwin/tensorflow that referenced this pull request May 14, 2019
Labels
cla: yes, ready to pull (PR ready for merge process), size:S (CL Change Size: Small), subtype:bazel (Bazel related Build_Installation issues)
Projects: PR Queue (Closed/Rejected)