Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MPI build issue with bazel #20147

Merged
merged 3 commits into from Jun 29, 2018

Conversation

yongtang
Copy link
Member

While trying to build tensorflow with MPI through:

Do you wish to build TensorFlow with MPI support [y/N] y
...
$ bazel build -s --verbose_failures --config=opt //tensorflow/tools/pip_package:build_pip_package

, the following failure happens:

tensorflow/contrib/mpi_collectives/kernels/mpi_ops.cc:76:18: error: 'se' does not name a type
 using StatusOr = se::port::StatusOr<T>;
                  ^
tensorflow/contrib/mpi_collectives/kernels/mpi_ops.cc:139:28: error: 'StatusOr' was not declared in this scope
 typedef std::function<void(StatusOr<Tensor>)> CommunicationDoneCallback;
...

This fix addresses the build failures for MPI build.

Signed-off-by: Yong Tang yong.tang.github@outlook.com

@qlzh727 qlzh727 requested a review from jlebar June 20, 2018 16:04
@qlzh727 qlzh727 self-assigned this Jun 20, 2018
@qlzh727
Copy link
Member

qlzh727 commented Jun 20, 2018

Assigning to @jlebar since the code was recently migrated to use new namespace.

@qlzh727 qlzh727 added the awaiting review Pull request awaiting review label Jun 20, 2018
@jlebar
Copy link
Contributor

jlebar commented Jun 20, 2018

I won't have my laptop until Monday, so apologies for the somewhat vague comment, but rather than depend on XLA to get StatusOr, I'd recommend depending on the relevant header and BUILD dep within StreamExecutor. That's less of a layering violation.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jun 21, 2018
While trying to build tensorflow with MPI through:
```
...
$ bazel build -s --verbose_failures --config=opt //tensorflow/tools/pip_package:build_pip_package
```
, the following failure happens:
```
tensorflow/contrib/mpi_collectives/kernels/mpi_ops.cc:76:18: error: 'se' does not name a type
 using StatusOr = se::port::StatusOr<T>;
                  ^
tensorflow/contrib/mpi_collectives/kernels/mpi_ops.cc:139:28: error: 'StatusOr' was not declared in this scope
 typedef std::function<void(StatusOr<Tensor>)> CommunicationDoneCallback;
...
...
```

This fix addresses the build failures for MPI build.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Copy link
Member Author

@jlebar Thanks for the review. I updated the PR and used stream_executor_headers_lib instead. Please take a look and let me know if there are any issues.

deps = [
":stream_executor",
],
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What problem is this new build rule solving? I thought this was the purpose of //tensorflow/core:stream_executor_headers_lib.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlebar Without the headers_lib, the following error will show up:

ERROR: /home/ubuntu/tensorflow/tensorflow/contrib/mpi_collectives/BUILD:40:1: undeclared inclusion(s) in rule '//tensorflow/contrib/mpi_collectives:python/ops/_mpi_ops.so':
this rule is missing dependency declarations for the following files included by 'tensorflow/contrib/mpi_collectives/kernels/mpi_ops.cc':
  'tensorflow/stream_executor/lib/statusor.h'
  'tensorflow/stream_executor/lib/status.h'
  'tensorflow/stream_executor/lib/error.h'
  'tensorflow/stream_executor/lib/stringpiece.h'
  'tensorflow/stream_executor/platform/port.h'
  'tensorflow/stream_executor/platform/logging.h'
  'tensorflow/stream_executor/lib/statusor_internals.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

So I added the stream_executor_headers_lib to address the above undeclared inclusion(s) issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the headers_lib, the following error will show up:

That explains why you need to add a new dependency to tensorflow/contrib/mpi_collectives/BUILD.

My question is, can that dependency be //tensorflow/core:stream_executor_headers_lib instead of this new rule? If not, why not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlebar With //tensorflow/core:stream_executor_headers_lib the following error will show up:

ERROR: /home/ubuntu/tensorflow/tensorflow/contrib/mpi_collectives/BUILD:40:1: Label '//tensorflow/core:stream_executor_headers_lib' is duplicated in the 'deps' attribute of rule 'python/ops/_mpi_ops_gpu'
ERROR: /home/ubuntu/tensorflow/tensorflow/contrib/mpi_collectives/BUILD:40:1: Label '//tensorflow/core:stream_executor_headers_lib' is duplicated in the 'deps' attribute of rule 'python/ops/_mpi_ops.so_check_deps'
ERROR: /home/ubuntu/tensorflow/tensorflow/contrib/mpi_collectives/BUILD:40:1: Label '//tensorflow/core:stream_executor_headers_lib' is duplicated in the 'deps' attribute of rule 'python/ops/_mpi_ops.so'

I am not sure about the error, though from _mpi_ops_gpu I guess stream_executor may have gpu and no-gpu version and that might be the conflict?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed, tensorflow/core:stream_executor_headers_lib depends on CUDA stuff. So the new build rule you added is different in this important way.

These build rules are extremely messed up.

I'm happy to take this for now, with the caveat that if and when we clean up the SE build rules, we may break you again. I don't like that, but apparently we have no continuous builds for MPI. :-/

@jlebar
Copy link
Contributor

jlebar commented Jun 29, 2018

@qlzh727, can you help get this landed?

@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jun 29, 2018
@qlzh727 qlzh727 merged commit 5ac1bd7 into tensorflow:master Jun 29, 2018
@yongtang yongtang deleted the 06202018-mpi_collectives branch June 29, 2018 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants