TF1.12 building w/ py3.6.4 on HPC #23803

Closed
ClaudioCimarelli opened this issue Nov 16, 2018 · 10 comments
Labels
stat:awaiting response Status - Awaiting response from author TF 1.12 Issues related to TF 1.12


@ClaudioCimarelli

Hello,
I am trying to (re-)build TensorFlow on the HPC cluster I use. I already managed to build TF 1.9 once, but with Python 3.4 and probably a different compiler. Otherwise I tried to keep everything the same, such as using the suggested Bazel version. Unfortunately I was only able to build GCC 5.5.0 instead of 4.8.0, but I don't think that is the problem here, as I am using the flag --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0".

In this case I am trying to build TF 1.12 with Python 3.6.4, but I am experiencing (slightly different) problems with TF 1.9 as well. I am also not sure about the Python library path that I give at the beginning of the configuration.
Thanks.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 3.16.51-3
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.12
  • Python version: 3.6.4
  • Installed using virtualenv? pip? conda?: virtualenv
  • Bazel version (if compiling from source): 0.15.0
  • GCC/Compiler version (if compiling from source): 5.5.0
  • CUDA/cuDNN version: 9.1.85/7.1.2
  • GPU model and memory: K80

Problem Description
ERROR: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/nasm/BUILD.bazel:8:1: undeclared
inclusion(s) in rule '@nasm//:nasm':
this rule is missing dependency declarations for the following files included by 'external/nasm/x86/regflags.c':
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include-fixed/limits.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include-fixed/syslimits.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stddef.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stdarg.h'
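
For anyone triaging this kind of failure: the quoted absolute paths are GCC's own built-in headers, and the error says that the rule has no dependency declaration covering them. A minimal Python sketch follows that pulls the distinct directories out of such a message so they can be compared with the compiler's real include search path. The function name `undeclared_include_dirs` is made up for illustration and is not part of Bazel or TensorFlow.

```python
import os
import re


def undeclared_include_dirs(log_text):
    """Return the sorted, de-duplicated directories of the quoted absolute
    header paths that appear on their own line in a Bazel error message."""
    # Only lines that consist of a single quoted absolute path are matched,
    # so the relative 'external/...' source file in the message is ignored.
    paths = re.findall(r"^[ \t]*'(/[^']+)'[ \t]*$", log_text, flags=re.MULTILINE)
    return sorted({os.path.dirname(p) for p in paths})


# The message reported above, pasted verbatim.
sample = """this rule is missing dependency declarations for the following files included by 'external/nasm/x86/regflags.c':
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include-fixed/limits.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include-fixed/syslimits.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stddef.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stdarg.h'
"""

for directory in undeclared_include_dirs(sample):
    print(directory)
```

The resulting directories can then be checked against the search list that the same compiler prints with `gcc -E -x c++ - -v < /dev/null`. If the two show the same location under different path prefixes, that points to a path-resolution problem rather than a missing header.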

Sequence of commands / steps executed before running into the problem
$ ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.0 installed.
Please specify the location of python. [Default is /home/users/.../virtualenvs/ml-full/bin/python]:
Please input the desired Python library path to use. Default is [/home/users/.../virtualenvs/ml-full/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: n
No Apache Ignite support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 9.1

Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /opt/apps/resif/data/production/v1.1-20180718/default/software/system/CUDA/9.1.85

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 7.1

Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /opt/apps/resif/data/production/v1.1-20180718/default/software/system/CUDA/9.1.85]: /opt/apps/resif/data/production/v1.1-20180718/default/software/numlib/cuDNN/7.1.2-CUDA-9.1.85

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may
have worse performance with multiple GPUs. [Default is 2.2]: 1.3

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.7,3.7,3.7,3.7,3.7,3.7,3.7,3.7]: 3.5,3.7,7.0

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /home/users/ccimarelli/opt/gcc/GCC-5.5.0/bin/gcc]:

$ bazel build --config=opt --config=cuda --spawn_strategy=standalone --action_env=TMP=/home/users/ccimarelli/tmp --verbose_failures --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package

@Harshini-Gadige Harshini-Gadige added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 21, 2018
@gunan
Contributor

gunan commented Nov 21, 2018

It looks like you configured with a different GCC version before, so the Bazel workspace is corrupted.
Could you try this:

bazel clean --expunge_async

Then reconfigure and rebuild?
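
The clean, reconfigure, rebuild sequence can be written down as a small dry-run script. This is only a sketch: it prints the commands rather than running them, the helper names are hypothetical, and the build flags are the ones already used in this thread, not a recommended set.

```python
import shlex

# Build target taken from the command in the original report.
PIP_TARGET = "//tensorflow/tools/pip_package:build_pip_package"


def rebuild_commands(extra_build_flags=()):
    """Return the three steps as argv lists: clean, configure, build."""
    return [
        ["bazel", "clean", "--expunge_async"],
        ["./configure"],
        ["bazel", "build", "--config=opt", "--config=cuda",
         *extra_build_flags, PIP_TARGET],
    ]


def render(commands):
    """Quote each argv list so it can be pasted into a shell."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


for line in render(rebuild_commands(["--verbose_failures"])):
    print(line)
```

Keeping the steps as argv lists makes it easy to pass each one to `subprocess.run` from the TensorFlow source directory once the printed commands look right.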

@gunan gunan added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Nov 21, 2018
@ClaudioCimarelli
Author

Yes, sorry, I have had no time these two weeks to test the solution. I will try ASAP. Thanks for your help!!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 8, 2018
@ClaudioCimarelli
Author

I tried again using the command to clean before reconfiguring.
I tried with gcc 6.4.0, 5.5.0 and 4.8.0.
With gcc 5.5.0 I got this error:

ERROR: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/double_conversion/BUILD.bazel:12:1: undeclared inclusion(s) in rule '@double_conversion//:double-conversion':
this rule is missing dependency declarations for the following files included by 'external/double_conversion/double-conversion/diy-fp.cc':
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stddef.h'
'/home/users/ccimarelli/opt/gcc/GCC-5.5.0/lib/gcc/x86_64-unknown-linux-gnu/5.5.0/include/stdint.h'

with gcc 6.4.0:

ERROR:
/mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/nasm/BUILD.bazel:8:1: undeclared inclusion(s) in rule '@nasm//:nasm':
this rule is missing dependency declarations for the following files included by 'external/nasm/x86/insnsn.c':
'/opt/apps/resif/data/production/v1.1-20180718/default/software/compiler/GCCcore/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include-fixed/limits.h'
'/opt/apps/resif/data/production/v1.1-20180718/default/software/compiler/GCCcore/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include-fixed/syslimits.h'
'/opt/apps/resif/data/production/v1.1-20180718/default/software/compiler/GCCcore/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include/stddef.h'
'/opt/apps/resif/data/production/v1.1-20180718/default/software/compiler/GCCcore/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0/include/stdarg.h'

and with gcc 4.8.0:

gcc-4.8: error trying to exec 'cc1plus': execvp: No such file or directory
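
The gcc 4.8.0 message usually means the `gcc` driver cannot find its C++ front end, `cc1plus`, for example because that GCC was built or installed without C++ support. This is an assumption about the cause, since the thread does not confirm it. GCC's `-print-prog-name=cc1plus` option prints a full path when the program is found and, as far as I know, only the bare name otherwise. A small Python sketch of that check, with hypothetical helper names and a made-up example path:

```python
import os
import subprocess


def parse_prog_name_output(printed):
    """Return the absolute path gcc reported, or None if it only echoed
    the bare program name (taken here to mean 'not found')."""
    candidate = printed.strip()
    return candidate if os.path.isabs(candidate) else None


def locate_cc1plus(gcc="gcc"):
    """Ask the given gcc driver where its cc1plus lives."""
    result = subprocess.run(
        [gcc, "-print-prog-name=cc1plus"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return parse_prog_name_output(result.stdout)


# Pure-string examples, so this runs even where gcc is not installed.
# The path below is illustrative only.
print(parse_prog_name_output(
    "/opt/gcc/libexec/gcc/x86_64-unknown-linux-gnu/4.8.0/cc1plus\n"))
print(parse_prog_name_output("cc1plus\n"))
```

Calling `locate_cc1plus("gcc-4.8")` on the cluster would show whether that particular driver can see a C++ front end before starting a long Bazel build.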

I always use this command to compile, for compiler compatibility:
bazel build --config=opt --config=cuda --action_env=TMP=/home/users/ccimarelli/tmp --verbose_failures --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --spawn_strategy=standalone //tensorflow/tools/pip_package:build_pip_package

always using python 3.6.4 and bazel 0.15.0...

One last thing: I always receive these warnings:

WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
WARNING: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/BUILD:1992:1: in srcs attribute of cc_library rule @grpc//:grpc_nanopb: please do not import '@grpc//third_party/nanopb:pb_common.c' directly. You should either move the file to this package or depend on an appropriate rule there. Since this rule was created by the macro 'grpc_generate_one_off_targets', the error might have been caused by the macro implementation in /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/bazel/grpc_build_system.bzl:172:12
WARNING: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/BUILD:1992:1: in srcs attribute of cc_library rule @grpc//:grpc_nanopb: please do not import '@grpc//third_party/nanopb:pb_decode.c' directly. You should either move the file to this package or depend on an appropriate rule there. Since this rule was created by the macro 'grpc_generate_one_off_targets', the error might have been caused by the macro implementation in /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/bazel/grpc_build_system.bzl:172:12
WARNING: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/BUILD:1992:1: in srcs attribute of cc_library rule @grpc//:grpc_nanopb: please do not import '@grpc//third_party/nanopb:pb_encode.c' directly. You should either move the file to this package or depend on an appropriate rule there. Since this rule was created by the macro 'grpc_generate_one_off_targets', the error might have been caused by the macro implementation in /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/grpc/bazel/grpc_build_system.bzl:172:12
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/learn/BUILD:17:1: in py_library rule //tensorflow/contrib/learn:learn: target '//tensorflow/contrib/learn:learn' depends on deprecated target '//tensorflow/contrib/session_bundle:exporter': No longer supported. Switch to SavedModel immediately.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/learn/BUILD:17:1: in py_library rule //tensorflow/contrib/learn:learn: target '//tensorflow/contrib/learn:learn' depends on deprecated target '//tensorflow/contrib/session_bundle:gc': No longer supported. Switch to SavedModel immediately.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/timeseries/python/timeseries/BUILD:354:1: in py_library rule //tensorflow/contrib/timeseries/python/timeseries:ar_model: target '//tensorflow/contrib/timeseries/python/timeseries:ar_model' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/timeseries/python/timeseries/state_space_models/BUILD:230:1: in py_library rule //tensorflow/contrib/timeseries/python/timeseries/state_space_models:filtering_postprocessor: target '//tensorflow/contrib/timeseries/python/timeseries/state_space_models:filtering_postprocessor' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/timeseries/python/timeseries/state_space_models/BUILD:73:1: in py_library rule //tensorflow/contrib/timeseries/python/timeseries/state_space_models:kalman_filter: target '//tensorflow/contrib/timeseries/python/timeseries/state_space_models:kalman_filter' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/bayesflow/BUILD:17:1: in py_library rule //tensorflow/contrib/bayesflow:bayesflow_py: target '//tensorflow/contrib/bayesflow:bayesflow_py' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/seq2seq/BUILD:23:1: in py_library rule //tensorflow/contrib/seq2seq:seq2seq_py: target '//tensorflow/contrib/seq2seq:seq2seq_py' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
WARNING: /mnt/gaiagpfs/users/workdirs/ccimarelli/git/tensorflow/tensorflow/contrib/BUILD:13:1: in py_library rule //tensorflow/contrib:contrib_py: target '//tensorflow/contrib:contrib_py' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.

I really hope you can help. Thanks in advance.

@ClaudioCimarelli
Author

I tried with gcc 4.9.0 as well:

ERROR: /mnt/gaiagpfs/users/homedirs/ccimarelli/.cache/bazel/_bazel_ccimarelli/9669602cb65c53050fa19def8832a149/external/gif_archive/BUILD.bazel:8:1: undeclared inclusion(s) in rule '@gif_archive//:gif':
this rule is missing dependency declarations for the following files included by 'external/gif_archive/lib/gif_font.c':
'/home/users/ccimarelli/bin/gcc-4.9/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include/stddef.h'
'/home/users/ccimarelli/bin/gcc-4.9/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include/stdbool.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@ClaudioCimarelli
Author

OK, I am finally making progress.
I discovered that the path I was using on the HPC did not let Bazel find its dependencies. Hence, I used the full path (starting with /mnt/gaiagpfs in my case) instead of the mounted directory (I found this by luck, trying things to see if anything changed).
After that, I had to follow this guide http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html because the linker that bazel was using was not the correct version.
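
To illustrate the path fix: the earlier errors list headers under /home/users/..., while Bazel's cache lives under /mnt/gaiagpfs/..., so the same files were being named through two different prefixes. The thread does not say whether the home directory is a symlink or a bind mount. The sketch below assumes a symlink, which `os.path.realpath` can resolve (a bind mount would not show up this way). The helper name and the temporary directory layout are invented for the demonstration.

```python
import os
import tempfile


def resolve_tool_path(path):
    """Return (real_path, differs): the symlink-free form of a tool path
    and whether it differs from the absolute path that was given."""
    absolute = os.path.abspath(path)
    real = os.path.realpath(absolute)
    return real, real != absolute


# Self-contained demonstration with a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    storage = os.path.join(root, "storage")   # stands in for the /mnt/... tree
    os.makedirs(os.path.join(storage, "bin"))
    open(os.path.join(storage, "bin", "gcc"), "w").close()

    alias = os.path.join(root, "alias")       # stands in for the /home/... view
    os.symlink(storage, alias)

    demo_expected = os.path.join(storage, "bin", "gcc")
    demo_real, demo_differs = resolve_tool_path(os.path.join(alias, "bin", "gcc"))
    print(demo_real == demo_expected, demo_differs)
```

Under that assumption, passing the resolved path to `./configure` (for the Python binary and the host gcc) keeps every path Bazel sees under one prefix.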
Now I am experiencing a different problem, similar to issue #20592 and related to the Google Cloud integration:

external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/bigtable/internal/rowreaderiterator.h:60:44: error: ambiguous overload for 'operator*' (operand type is 'std::remove_reference<const google::cloud::v0::internal::optional<google::cloud::bigtable::v0::Row>&>::type {aka const google::cloud::v0::internal::optional<google::cloud::bigtable::v0::Row>}')
Row const&& operator*() const&& { return std::move(row_); }
^
external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/bigtable/internal/rowreaderiterator.h:60:44: note: candidates are:
In file included from external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/bigtable/internal/rowreaderiterator.h:19:0,
from external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/bigtable/internal/rowreaderiterator.cc:15:
external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/internal/optional.h:142:22: note: constexpr const T& google::cloud::v0::internal::optional<T>::operator*() const & [with T = google::cloud::bigtable::v0::Row]
constexpr T const& operator*() const& {
^
external/com_github_googlecloudplatform_google_cloud_cpp/google/cloud/internal/optional.h:147:23: note: constexpr const T&& google::cloud::v0::internal::optional<T>::operator*() const && [with T = google::cloud::bigtable::v0::Row]
constexpr T const&& operator*() const&& {
^

@ClaudioCimarelli
Author

@gunan, I applied the fix you suggested in issue #23771 with commit 3437098, changing all the files, but I am still not able to avoid compiling GCP :/ . Do you have any suggestions?

@gunan
Contributor

gunan commented Nov 8, 2019

Completely missed this issue under the mountain of emails.
It looks like the initial issue was due to Bazel not being able to auto-detect the toolchain.

For the latter problem, I remember other reports of the same issue.
I think they were resolved, but I may be mistaken. If not, @mihaimaruseac is working to move all cloud filesystem support out of the main package, which should help with this.

@mihaimaruseac could you close this issue once GCS support is modularized and migrated out of TF?

@mihaimaruseac mihaimaruseac assigned mihaimaruseac and unassigned gunan Nov 11, 2019
@mihaimaruseac
Collaborator

Taking ownership of this; I will get to it in time.

@mohantym mohantym self-assigned this Mar 10, 2022
@mohantym
Contributor

Hi @ClaudioCimarelli!
You are using an older version (1.x) of TensorFlow, which is no longer supported. Have you checked this thread on using TensorFlow on an HPC cluster, though?

@mohantym mohantym added TF 1.12 Issues related to TF 1.12 stat:awaiting response Status - Awaiting response from author labels Mar 10, 2022
@ClaudioCimarelli
Author

Hey @mohantym, thank you for your reply. I moved on from this nightmare and found an alternative solution at the time.
