Can't build MacOS GPU pip because of nccl_archive errors #7364

Closed
yaroslavvb opened this issue Feb 8, 2017 · 8 comments
Labels
type:build/install Build and install issues

Comments

@yaroslavvb
Contributor

yaroslavvb commented Feb 8, 2017

Looks like the Google CI build is hitting the same issue:
https://ci.tensorflow.org/view/Nightly/job/nightly-matrix-mac-gpu/TF_BUILD_IS_OPT=OPT,TF_BUILD_IS_PIP=PIP,TF_BUILD_PYTHON_VERSION=PYTHON3,label=gpu-mac/lastFailedBuild/console

90 errors detected in the compilation of "/var/folders/9l/c8y8z62s0kjgnpgx6jwh0g9r0000gn/T//tmpxft_0000aebc_00000000-7_reduce_scatter.cu.cpp1.ii".
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: output 'external/nccl_archive/_objs/nccl/external/nccl_archive/src/all_reduce.cu.pic.o' was not created.
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: output 'external/nccl_archive/_objs/nccl/external/nccl_archive/src/reduce.cu.pic.o' was not created.
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: Couldn't build file external/nccl_archive/_objs/nccl/external/nccl_archive/src/all_reduce.cu.pic.o: not all outputs were created or valid.
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: Couldn't build file external/nccl_archive/_objs/nccl/external/nccl_archive/src/reduce.cu.pic.o: not all outputs were created or valid.
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: output 'external/nccl_archive/_objs/nccl/external/nccl_archive/src/reduce_scatter.cu.pic.o' was not created.
ERROR: /private/var/tmp/_bazel_yaroslav/8430f3ac1504aea2a8d4e6b016af31c5/external/nccl_archive/BUILD.bazel:33:1: Couldn't build file external/nccl_archive/_objs/nccl/external/nccl_archive/src/reduce_scatter.cu.pic.o: not all outputs were created or valid.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.220s, Critical Path: 1.47s
@yaroslavvb
Contributor Author

Fixing this will enable me to add the first macOS XLA wheel to https://github.com/yaroslavvb/tensorflow-community-wheels

@gunan
Contributor

gunan commented Feb 8, 2017

This seems to be an issue in nccl.
We have contributed a fix and are testing the Mac GPU build with it right now:
http://ci.tensorflow.org/job/tensorflow-pull-requests-mac-gpu/7/

@yaroslavvb
Contributor Author

After pulling master, those errors are gone, but now the pip package won't build, failing with a series of errors like the ones below:

2 errors detected in the compilation of "/var/folders/9l/c8y8z62s0kjgnpgx6jwh0g9r0000gn/T//tmpxft_00015b3d_00000000-7_one_hot_op_gpu.cu.cpp1.ii".
ERROR: /Users/yaroslav/git/tensorflow/tensorflow/core/kernels/BUILD:612:1: output 'tensorflow/core/kernels/_objs/one_hot_op_gpu/tensorflow/core/kernels/one_hot_op_gpu.cu.pic.o' was not created.
ERROR: /Users/yaroslav/git/tensorflow/tensorflow/core/kernels/BUILD:612:1: Couldn't build file tensorflow/core/kernels/_objs/one_hot_op_gpu/tensorflow/core/kernels/one_hot_op_gpu.cu.pic.o: not all outputs were created or valid.
INFO: From Compiling tensorflow/core/kernels/reverse_op_gpu.cu.cc:
external/protobuf/src/google/protobuf/stubs/atomicops_internals_generic_c11_atomic.h(52): error: static assertion failed with "incompatible 32-bit atomic layout"

external/protobuf/src/google/protobuf/stubs/atomicops_internals_generic_c11_atomic.h(145): error: static assertion failed with "incompatible 64-bit atomic layout"

2 errors detected in the compilation of "/var/folders/9l/c8y8z62s0kjgnpgx6jwh0g9r0000gn/T//tmpxft_000159ab_00000000-7_reverse_op_gpu.cu.cpp1.ii".
ERROR: /Users/yaroslav/git/tensorflow/tensorflow/core/kernels/BUILD:642:1: output 'tensorflow/core/kernels/_objs/reverse_op_gpu/tensorflow/core/kernels/reverse_op_gpu.cu.pic.o' was not created.
ERROR: /Users/yaroslav/git/tensorflow/tensorflow/core/kernels/BUILD:642:1: Couldn't build file tensorflow/core/kernels/_objs/reverse_op_gpu/tensorflow/core/kernels/reverse_op_gpu.cu.pic.o: not all outputs were created or valid.
INFO: From Compiling tensorflow/core/kernels/slice_op_gpu.cu.cc:
external/protobuf/src/google/protobuf/stubs/atomicops_internals_generic_c11_atomic.h(52): error: static assertion failed with "incompatible 32-bit atomic layout"

external/protobuf/src/google/protobuf/stubs/atomicops_internals_generic_c11_atomic.h(145): error: static assertion failed with "incompatible 64-bit atomic layout"

2 errors detected in the compilation of "/var/folders/9l/c8y8z62s0kjgnpgx6jwh0g9r0000gn/T//tmpxft_0001597f_00000000-7_slice_op_gpu.cu.cpp1.ii".
ERROR: /Users/yaroslav/git/tensorflow/tensorflow/core/kernels/BUILD:660:1: output 'tensorflow/core/kernels/_objs/slice_op_gpu/tensorflow/core/kernels/slice_op_gpu.cu.pic.o' was not created.

@gunan
Contributor

gunan commented Feb 9, 2017

@jhseu Looks like the protobuf update broke the Mac GPU build?
Could you take a look?

@jhseu
Contributor

jhseu commented Feb 9, 2017

Looks like it's protocolbuffers/protobuf#2545 and protocolbuffers/protobuf@ecc460a.

Can't really revert that change because macOS atomics are deprecated in favor of C++11 atomics. Not sure why nvcc is hitting both static_asserts when it should only reach one... Will investigate later today.
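For readers unfamiliar with the failing assertions, here is a minimal sketch of the kind of layout check that produces the "incompatible 32-bit/64-bit atomic layout" messages. It is an illustration only: the type aliases and the `HasPlainLayout` helper are hypothetical names, and the actual code in `atomicops_internals_generic_c11_atomic.h` may be written differently. The idea is that a backend which treats a plain integer as a `std::atomic` of that integer needs the two types to have the same size, and a compiler front end that disagrees trips the `static_assert`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the library's plain-integer word types.
using Atomic32 = std::int32_t;
using Atomic64 = std::int64_t;

// True when std::atomic<T> occupies exactly as many bytes as a plain T.
// (Helper name is an assumption made for this sketch.)
template <typename T>
constexpr bool HasPlainLayout() {
  return sizeof(std::atomic<T>) == sizeof(T);
}

// Compile-time checks mirroring the messages seen in the build log.
// If the compiler's view of std::atomic<T> differs from plain T,
// compilation stops here with the quoted message.
static_assert(HasPlainLayout<Atomic32>(),
              "incompatible 32-bit atomic layout");
static_assert(HasPlainLayout<Atomic64>(),
              "incompatible 64-bit atomic layout");
```

With a typical host compiler on a 64-bit platform both assertions are expected to pass; the log above shows nvcc failing both, which is the behavior still under investigation in this comment.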

@jhseu
Contributor

jhseu commented Feb 9, 2017

Fixed in protocolbuffers/protobuf#2699. Will have to update workspace.bzl after that's merged.

@jhseu
Contributor

jhseu commented Feb 10, 2017

#7425

@gunan
Contributor

gunan commented Feb 11, 2017

I was able to build the pip package successfully. Now running all tests on it, but as the build is successful, closing this issue.
Thanks @jhseu
