Installation issue: py-tensorflow #14488

Closed
pat-s opened this issue Jan 13, 2020 · 32 comments · Fixed by #16077

Comments

@pat-s
Contributor

pat-s commented Jan 13, 2020

Steps to reproduce the issue

$ spack install py-tensorflow@1.14.0 ^python@3.6.8 ^bazel@0.24.1 

5144    ERROR: /tmp/spack/tf/ae19ea65d53fbc582b50c017da194bb3/external/nasm/BUILD.bazel:8:1: C++ compilation of rule '@@nasm//:nasm' failed (Exit 1): gcc failed: error executing command

6121    /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.6.8-knusttxlspjruvcbrfgyuyrlyfvdspk2/include/python3.6m/eval.h:10:12: error: unknown type name 'PyObject'

Platform and user environment

Please report your OS here:

$ uname -a 
Linux edi 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ CentOS 7.7

Compiler: gcc@9.2.0

Additional information

cc @adamjstewart

@adamjstewart
Member

You can try opening an issue on the TensorFlow repo, but they've been pretty unresponsive to my issues so far...

@pat-s
Contributor Author

pat-s commented Jan 14, 2020

Thanks. I'll see how far I can get and will report back.

TF only lists build instructions for Ubuntu; other systems are not supported. Searching for the error there did not help much.
I'll keep trying some other configs and see what I get.

@s-sajid-ali
Contributor

This looks very similar to the issue I ran into: bazelbuild/bazel#10437.

@pat-s
Contributor Author

pat-s commented Jan 14, 2020

Thanks for sharing. Looks complicated. Also 28 days without response. Does not look promising to get this building on CentOS soon.

@Sinan81
Contributor

Sinan81 commented Jan 16, 2020

I am able to build v1.14 for both Python 2 and 3 on CentOS 7 using an old version of the tf package. While this issue is being sorted out, you might try the old tf package file located in the branch https://github.com/Sinan81/spack/tree/old_tensorflow_v1.14_builds

I am also able to build tf@2.1 with python3 using this.

The only requirement for building these versions is to use jdk@1.8 as the Java provider.

@pat-s
Contributor Author

pat-s commented Jan 16, 2020

Thanks. I tried it with

  • @1.14.0 ^python3.6.8
  • plain
==> Installing py-tensorflow
==> Searching for binary cache of py-tensorflow
==> No binary for py-tensorflow found: installing from source
==> Error: PermissionError: [Errno 13] Permission denied: '/cache'

/opt/spack/var/spack/repos/builtin/packages/py-tensorflow/package.py:189, in setup_build_environment:
        186        #       stay at least also OSX compatible
        187        tmp_path = '/cache/spack/tf'
        188#        tmp_path = env['SPACK_TMPDIR'] '/tmp/spack') + '/tf' #TODO
  >>    189        mkdirp(tmp_path)
        190        env.set('TEST_TMPDIR', tmp_path)
        191        env.set('HOME', tmp_path)

@Sinan81
Contributor

Sinan81 commented Jan 16, 2020

That is just a temporary path used by bazel when building TF. Just set it to a path where you have write permission and sufficient space (up to 10GB?), and make sure that path is not an NFS share. If you are building this on your laptop, then any path under your home directory should work.
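
The three conditions named above (write permission, enough free space, not an NFS share) can be checked before starting a long build. A minimal Python sketch, assuming Linux and `/proc/mounts`; the helper names `is_on_nfs`, `usable_bazel_tmpdir`, and `default_candidate` are hypothetical, the 10 GB threshold is only the rough estimate from the comment, and `SPACK_TMPDIR` is taken from the commented-out TODO line in the traceback rather than from any documented Spack setting:

```python
import os
import shutil
import tempfile


def is_on_nfs(path, mounts_text):
    """Return True if the longest mount point containing `path` is NFS.

    `mounts_text` is the content of /proc/mounts; passing it in as a
    string keeps the function easy to test.
    """
    path = os.path.abspath(path)
    best_len, best_type = -1, ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) > best_len:
            best_len, best_type = len(mount_point), fs_type
    return best_type.startswith("nfs")


def usable_bazel_tmpdir(path, min_free_bytes=10 * 1024**3, mounts_text=None):
    """Check that `path` is writable, has enough free space, and is not on NFS."""
    try:
        os.makedirs(path, exist_ok=True)
    except OSError:
        # e.g. PermissionError when the parent is not writable
        return False
    if not os.access(path, os.W_OK):
        return False
    if shutil.disk_usage(path).free < min_free_bytes:
        return False
    if mounts_text is None and os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            mounts_text = f.read()
    return not is_on_nfs(path, mounts_text or "")


def default_candidate():
    """Assumed fallback: $SPACK_TMPDIR if set, else the system temp dir, plus '/tf'."""
    base = os.environ.get("SPACK_TMPDIR", tempfile.gettempdir())
    return os.path.join(base, "tf")
```

Calling `usable_bazel_tmpdir(default_candidate())` before `spack install` would surface a `/cache`-style permission problem in seconds rather than partway into the build.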

@pat-s
Contributor Author

pat-s commented Jan 17, 2020

@Sinan81 thanks, that was stupid of me.

Unfortunately I arrive at the same errors again :/

Did you use any specific python version?

3 errors found in build log:
     65    WARNING: /tmp/patrick/spack-stage-py-tensorflow-1.14.0-y5h6dpw45cndcpi2wn5uppnuhc2fd4pg/spack-src/tensorflow/contrib/BUILD:12:1: in py_library rule //tensorflow/contrib:contrib_py: target '//tensorflow/contrib:contrib_py' depends on deprecated target '//tensorflow/contrib/distributions:distributions_py': TensorFlow Distributions has migrated to TensorFlow Probability (https://github.com/tensorflow/probability). Deprecated copies remaining in tf.contrib.distributions are unmaintained, unsupported, and will be removed by late 2018. You should update all usage of `tf.contrib.distributions` to `tfp.distributions`.
     66    INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (374 packages loaded, 18256 targets configured).
     67    INFO: Found 1 target...
     68    [0 / 6] [-----] Expanding template tensorflow/tools/pip_package/simple_console
     69    ERROR: /home/patrick/tf/_bazel_patrick/593a39a5b039aa2dbd755c47f5736f44/external/nasm/BUILD.bazel:8:1: C++ compilation of rule '@nasm//:nasm' failed (Exit 1)
     70    In file included from external/nasm/output/outcoff.c:52:
  >> 71    /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.6.8-knusttxlspjruvcbrfgyuyrlyfvdspk2/include/python3.6m/eval.h:10:12: error: unknown type name 'PyObject'
     72       10 | PyAPI_FUNC(PyObject *) PyEval_EvalCode(PyObject *, PyObject *, PyObject *);
     73          |            ^~~~~~~~
  >> 74    /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.6.8-knusttxlspjruvcbrfgyuyrlyfvdspk2/include/python3.6m/eval.h:12:12: error: unknown type name 'PyObject'
     75       12 | PyAPI_FUNC(PyObject *) PyEval_EvalCodeEx(PyObject *co,
     76          |            ^~~~~~~~
  >> 77    /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.6.8-knusttxlspjruvcbrfgyuyrlyfvdspk2/include/python3.6m/eval.h:21:12: error: unknown type name 'PyObject'
     78       21 | PyAPI_FUNC(PyObject *) _PyEval_CallTracing(PyObject *func, PyObject *args);
     79          |            ^~~~~~~~
     80    Target //tensorflow/tools/pip_package:build_pip_package failed to build
     81    Use --verbose_failures to see the command lines of failed build steps.
     82    INFO: Elapsed time: 418.588s, Critical Path: 2.54s
     83    INFO: 13 processes: 13 local.

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

I used python@3.7.4 and 2.7.16 (and also gcc@7.4.0, bazel@0.25.2, jdk@1.8.0_202_b08).

Oddly enough, when I do spack spec py-tensorflow/<build hash> I am not seeing any bazel dependency. I am using a spack dev branch checkout somewhere between 0.13.0 and 0.13.1, I think. I will double check whether I am still able to build tensorflow.

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

I can now confirm that I can still build v1.14, though I ended up disabling a build option (mkl_dnn). I also had to use cuda@10.0 instead of 10.1 (which yielded a header-not-found error). The updated tensorflow package file is now pushed to the Sinan81/spack/old_tensorflow_v1.14_builds branch.

P.S. For some reason, it seems spack spec stopped showing build only dependencies for the dev branch checkout I am using.

Just in case you are wondering about the specifics of the installation, here is the spec output:

$ spack find -l tensorflow
==> 4 installed packages
-- linux-centos7-x86_64 / gcc@7.4.0 -----------------------------
r4xyzrp tensorflow@1.14.0  uxgzcdj tensorflow@1.14.0  zm35imj tensorflow@1.14.0  t3ok3ag tensorflow@2.1.0-rc0
sbulut@ws-067 ~ 
$ spack spec tensorflow/r4xyzrp
Input spec
--------------------------------
tensorflow@1.14.0%gcc@7.4.0+cuda~gcp+nccl arch=linux-centos7-x86_64
    ^cuda@10.0.130%gcc@7.4.0 arch=linux-centos7-x86_64
    ^cudnn@7.5.1-10.0-x86_64%gcc@7.4.0 arch=linux-centos7-x86_64
    ^nccl@2.4.8-1%gcc@7.4.0 patches=42778c78eb9875dacddf5eca20f7f6a077773fcbee41e51174f81b3143684b6d arch=linux-centos7-x86_64
    ^py-absl-py@0.1.6%gcc@7.4.0 arch=linux-centos7-x86_64
        ^py-six@1.12.0%gcc@7.4.0 arch=linux-centos7-x86_64
            ^python@3.7.4%gcc@7.4.0+bz2+ctypes+dbm+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib arch=linux-centos7-x86_64
                ^bzip2@1.0.8%gcc@7.4.0+shared arch=linux-centos7-x86_64
                ^expat@2.2.9%gcc@7.4.0+libbsd arch=linux-centos7-x86_64
                    ^libbsd@0.9.1%gcc@7.4.0 arch=linux-centos7-x86_64
                ^gdbm@1.18.1%gcc@7.4.0 arch=linux-centos7-x86_64
                    ^readline@8.0%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^ncurses@6.1%gcc@7.4.0~symlinks~termlib arch=linux-centos7-x86_64
                ^gettext@0.20.1%gcc@7.4.0+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-centos7-x86_64
                    ^libxml2@2.9.9%gcc@7.4.0~python arch=linux-centos7-x86_64
                        ^libiconv@1.16%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^xz@5.2.4%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^zlib@1.2.11%gcc@7.4.0+optimize+pic+shared arch=linux-centos7-x86_64
                    ^tar@1.32%gcc@7.4.0 arch=linux-centos7-x86_64
                ^libffi@3.2.1%gcc@7.4.0 arch=linux-centos7-x86_64
                ^openssl@1.1.1b%gcc@7.4.0+systemcerts arch=linux-centos7-x86_64
                ^sqlite@3.30.0%gcc@7.4.0+column_metadata+fts~functions+rtree arch=linux-centos7-x86_64
    ^py-astor@0.8.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-future@0.17.1%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-gast@0.3.2%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-grpcio@1.25.0%gcc@7.4.0 arch=linux-centos7-x86_64
        ^c-ares@1.15.0%gcc@7.4.0 build_type=RelWithDebInfo arch=linux-centos7-x86_64
    ^py-h5py@2.9.0%gcc@7.4.0~mpi arch=linux-centos7-x86_64
        ^hdf5@1.10.5%gcc@7.4.0+cxx~debug~fortran+hl~mpi+pic+shared~szip~threadsafe arch=linux-centos7-x86_64
        ^py-numpy@1.17.3%gcc@7.4.0+blas+lapack arch=linux-centos7-x86_64
            ^openblas@0.3.6%gcc@7.4.0+avx2~avx512 cpu_target=auto ~ilp64+pic+shared threads=none ~virtual_machine arch=linux-centos7-x86_64
    ^py-keras-applications@1.0.8%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-keras-preprocessing@1.1.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-mock@3.0.5%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-protobuf@3.6.0%gcc@7.4.0~cpp arch=linux-centos7-x86_64
        ^py-setuptools@41.0.1%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-termcolor@1.1.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-wheel@0.33.1%gcc@7.4.0 arch=linux-centos7-x86_64

Concretized
--------------------------------
tensorflow@1.14.0%gcc@7.4.0+cuda~gcp+nccl arch=linux-centos7-x86_64
    ^cuda@10.0.130%gcc@7.4.0 arch=linux-centos7-x86_64
    ^cudnn@7.5.1-10.0-x86_64%gcc@7.4.0 arch=linux-centos7-x86_64
    ^nccl@2.4.8-1%gcc@7.4.0 patches=42778c78eb9875dacddf5eca20f7f6a077773fcbee41e51174f81b3143684b6d arch=linux-centos7-x86_64
    ^py-absl-py@0.1.6%gcc@7.4.0 arch=linux-centos7-x86_64
        ^py-six@1.12.0%gcc@7.4.0 arch=linux-centos7-x86_64
            ^python@3.7.4%gcc@7.4.0+bz2+ctypes+dbm+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4~uuid+zlib arch=linux-centos7-x86_64
                ^bzip2@1.0.8%gcc@7.4.0+shared arch=linux-centos7-x86_64
                ^expat@2.2.9%gcc@7.4.0+libbsd arch=linux-centos7-x86_64
                    ^libbsd@0.9.1%gcc@7.4.0 arch=linux-centos7-x86_64
                ^gdbm@1.18.1%gcc@7.4.0 arch=linux-centos7-x86_64
                    ^readline@8.0%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^ncurses@6.1%gcc@7.4.0~symlinks~termlib arch=linux-centos7-x86_64
                ^gettext@0.20.1%gcc@7.4.0+bzip2+curses+git~libunistring+libxml2+tar+xz arch=linux-centos7-x86_64
                    ^libxml2@2.9.9%gcc@7.4.0~python arch=linux-centos7-x86_64
                        ^libiconv@1.16%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^xz@5.2.4%gcc@7.4.0 arch=linux-centos7-x86_64
                        ^zlib@1.2.11%gcc@7.4.0+optimize+pic+shared arch=linux-centos7-x86_64
                    ^tar@1.32%gcc@7.4.0 arch=linux-centos7-x86_64
                ^libffi@3.2.1%gcc@7.4.0 arch=linux-centos7-x86_64
                ^openssl@1.1.1b%gcc@7.4.0+systemcerts arch=linux-centos7-x86_64
                ^sqlite@3.30.0%gcc@7.4.0+column_metadata+fts~functions+rtree arch=linux-centos7-x86_64
    ^py-astor@0.8.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-future@0.17.1%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-gast@0.3.2%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-grpcio@1.25.0%gcc@7.4.0 arch=linux-centos7-x86_64
        ^c-ares@1.15.0%gcc@7.4.0 build_type=RelWithDebInfo arch=linux-centos7-x86_64
    ^py-h5py@2.9.0%gcc@7.4.0~mpi arch=linux-centos7-x86_64
        ^hdf5@1.10.5%gcc@7.4.0+cxx~debug~fortran+hl~mpi+pic+shared~szip~threadsafe arch=linux-centos7-x86_64
        ^py-numpy@1.17.3%gcc@7.4.0+blas+lapack arch=linux-centos7-x86_64
            ^openblas@0.3.6%gcc@7.4.0+avx2~avx512 cpu_target=auto ~ilp64+pic+shared threads=none ~virtual_machine arch=linux-centos7-x86_64
    ^py-keras-applications@1.0.8%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-keras-preprocessing@1.1.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-mock@3.0.5%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-protobuf@3.6.0%gcc@7.4.0~cpp arch=linux-centos7-x86_64
        ^py-setuptools@41.0.1%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-termcolor@1.1.0%gcc@7.4.0 arch=linux-centos7-x86_64
    ^py-wheel@0.33.1%gcc@7.4.0 arch=linux-centos7-x86_64

sbulut@ws-067 ~ 
$ 

@s-sajid-ali
Contributor

I made a minor hacky edit to run on a workstation and the build went past the original error but the final build (took almost 2 hours!) had some strange errors.

[sajid@xrm spack]$  spack install py-tensorflow@2.1.0-rc0 +cuda +nccl ^py-h5py~mpi ^hdf5~mpi ^/cn7 ^python@3.7.4 %gcc@7.5.0
==> py-tensorflow is already installed in /raid/home/sajid/packages/spack/opt/spack/linux-rhel7-ivybridge/gcc-7.5.0/py-tensorflow-2.1.0-rc0-lkhwvxbbykmps5ztjwbqx6yrpj5h5tp4

Strange link errors in command.log, for example:

2020-01-17 14:44:00.881366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO: From Executing genrule //tensorflow/python/keras/api:keras_python_api_gen:
2020-01-17 14:44:00.887667: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-01-17 14:44:00.887734: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:

And finally, I have no clue why this happened:

[sajid@xrm temp]$ python models/tutorials/image/mnist/convolutional.py
2020-01-17 14:57:50.953836: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX2 instructions, but these aren't available on your machine.
Aborted (core dumped)
[sajid@xrm temp]$ lscpu | grep avx
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

@s-sajid-ali
Contributor

s-sajid-ali commented Jan 17, 2020

For reference I've added my build spec and the output of grep -a1b1 avx from command.log here.

I see that --copt=-mavx2 was hard coded and somehow tf's build system never bothered to verify that. I'll try this again later on an avx2 capable machine and report back. But the dlerror with libcudart.so.10.1 is more troubling.
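
To catch this kind of mismatch before running anything, one can compare the hard-coded `-m...` options against the flags the CPU reports. A rough Python sketch; the function names and the small option-to-flag table are illustrative assumptions, not part of TensorFlow or Spack:

```python
# Assumed mapping from a few GCC -m options to the flag names that
# `lscpu` and /proc/cpuinfo report. Deliberately small; extend as needed.
COPT_TO_CPU_FLAG = {
    "-mavx": "avx",
    "-mavx2": "avx2",
    "-mfma": "fma",
    "-mavx512f": "avx512f",
}


def cpu_flags(cpuinfo_text):
    """Collect CPU feature flags from `lscpu` or /proc/cpuinfo style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            flags.update(value.split())
    return flags


def unsupported_copts(copts, flags):
    """Return the known -m options whose CPU flag is missing from `flags`."""
    missing = []
    for opt in copts:
        needed = COPT_TO_CPU_FLAG.get(opt)
        if needed is not None and needed not in flags:
            missing.append(opt)
    return missing
```

Fed the `lscpu` output shown earlier (which lists `avx` but not `avx2`) and the option `-mavx2`, `unsupported_copts` would report `-mavx2` as unsupported on that machine.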

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

> For reference I've added my build spec and the output of grep -a1b1 avx from command.log here.
>
> I see that --copt=-mavx2 was hard coded and somehow tf's build system never bothered to verify that. I'll try this again later on an avx2 capable machine and report back. But the dlerror with libcudart.so.10.1 is more troubling.

In the latest version of the tensorflow package, hardware optimizations are handled perfectly. For now you might want to disable all optional flags, just to confirm first that the package builds. I would suggest you also try building with cuda@10.0 (which would yield libcudart.so.10.0).

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

@s-sajid-ali Do you have an avx512 capable cpu? If so, a patch might be needed.

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

@s-sajid-ali When I tried to compile v1.15.0, I got the same error as yours even though I was using cuda@10.0. The problem is that tf somehow can't locate libcudart, although it's provided by the Spack cuda package. Let me look into that.

@s-sajid-ali
Contributor

s-sajid-ali commented Jan 17, 2020

> @s-sajid-ali Do you have an avx512 capable cpu? If so, a patch might be needed.

Yes, I have access to KNL (and skylake) nodes on which I plan to compile tf from source. Would it not be possible to achieve this by just changing the copt flags? I'm wondering why a patch file is needed.

Neither of these systems has GPUs, so I'm not particularly bothered by the libcudart error for now.

@Sinan81
Contributor

Sinan81 commented Jan 17, 2020

@s-sajid-ali
Contributor

Thanks for the pointer! It looks like this was fixed in r2.1 so I'll try building r2.1 for now.

@pat-s
Contributor Author

pat-s commented Jan 19, 2020

@Sinan81 Thanks for looking again!

I am still unable to build with the same error.

Just FYI, I am building with variants: ~cuda ~gcp ~nccl ^python@3.6.8

@s-sajid-ali
Contributor

s-sajid-ali commented Jan 25, 2020

@Sinan81 : I tried your build recipe on two different workstations with +cuda+nccl and I was able to install tf. But on the same workstation (and on a third one with no gpu), the build again fails with the same error as before.

If bazel creates a crosstool compiler for the +cuda version there is no issue with include errors. However if bazel does not compile for +cuda variant, there are errors. I'm not sure why that happens though.

PS: Trying a simple cpu-only benchmark (on a broadwell cpu) showed a ~230% speedup for spack built tf when compared to the intel-conda build (which was built for avx systems)!

@Sinan81
Contributor

Sinan81 commented Jan 26, 2020

> @Sinan81 : I tried your build recipe on two different workstations with +cuda+nccl and I was able to install tf.

The million dollar question is what is done differently in my version of TF package vs the latest one so that +cuda build is failing in the latter. By the way, specifically which version of TF did you build successfully?

> If bazel creates a crosstool compiler for the +cuda version there is no issue with include errors. However if bazel does not compile for +cuda variant, there are errors. I'm not sure why that happens though.

I don't remember the last time I tried building the ~cuda case. I would guess that some improvements made for +cuda weren't generalized in this version of the TF package. However, this should not be the case for the latest version in the official Spack repo. I will look into this. It might be a quick fix.

> PS: Trying a simple cpu-only benchmark (on a broadwell cpu) showed a ~230% speedup for spack built tf when compared to the intel-conda build (which was built for avx systems)!

That's a big difference. In a sense, I would expect it, since conda wouldn't use the latest vectorization and SIMD instructions so that the package stays usable for a wider audience. The difference should be even bigger for the official Spack TF package, since it has built-in micro-arch optimizations.

@s-sajid-ali
Contributor

s-sajid-ali commented Jan 27, 2020

> The million dollar question is what is done differently in my version of TF package vs the latest one so that +cuda build is failing in the latter. By the way, specifically which version of TF did you build successfully?

I built tf@2.1.0+cuda+nccl with no issues (on a slightly modified version of your old_tensorflow_v1.14_builds branch). Building it on broadwell or skylake takes ~2 hours, and on KNL it has been running for ~10 hours with no end in sight, so I haven't been able to find the commit that broke things with git bisect.

All I can say is that, for some reason, the crosstool compiler toolchain works better than a host-only compiler toolchain. The exact command that fails (both at develop with +cuda and on the modified old_tensorflow_v1.14_builds branch with ~cuda) is here. I've looked at the corresponding statement of the +cuda verbose log and I see no difference. For reference, I'm posting both the failing command (host-only toolchain) and a corresponding one that succeeds (crosstool toolchain) in a gist here. I should add that both systems have no GPUs, but one can always build tf+cuda and run with --device=cpu!
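
Long compiler command lines are hard to compare by eye, and the order of include directories matters to the compiler. One way to compare two such commands mechanically is to extract only their include-related flags, in order, and report the first position where they differ. A small Python sketch; the function names are hypothetical and the commands in the usage note are made up, not the ones from the gist:

```python
import shlex

# Include-related flags to extract; each takes a directory argument.
INCLUDE_FLAGS = ("-isystem", "-iquote", "-I")


def include_flags(command):
    """Return the (flag, directory) pairs of a compiler command, in order."""
    tokens = shlex.split(command)
    found = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        for flag in INCLUDE_FLAGS:
            if tok == flag and i + 1 < len(tokens):
                # separated form: "-isystem dir"
                found.append((flag, tokens[i + 1]))
                i += 1
                break
            if tok.startswith(flag) and len(tok) > len(flag):
                # joined form: "-Idir"
                found.append((flag, tok[len(flag):]))
                break
        i += 1
    return found


def first_difference(cmd_a, cmd_b):
    """Return (index, entry_a, entry_b) for the first differing include flag, or None."""
    a, b = include_flags(cmd_a), include_flags(cmd_b)
    for idx, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return idx, x, y
    if len(a) != len(b):
        idx = min(len(a), len(b))
        return (idx,
                a[idx] if idx < len(a) else None,
                b[idx] if idx < len(b) else None)
    return None
```

For example, `first_difference("gcc -I/opt/py/include -isystem external/nasm -c x.c", "gcc -isystem external/nasm -isystem /opt/py/include -c x.c")` points at index 0, where one command uses `-I` and the other `-isystem`.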

> Don't remember the last time I tried building ~cuda case. I would guess some improvements that were done for +cuda weren't generalized in this version of TF package. However, this should not be the case for the latest version in official Spack repo. I will look into this. It might be a quick fix.

Thanks for looking into this! It would be great if this is an easy fix.

EDIT: I've also tried not injecting the miniconda python path into the list of builtin_include_directory_paths, but that didn't help either. I carefully cut down on the number of external dependencies by using a system OpenSSL and packages from the miniconda3-env where possible (zlib, setuptools, etc.). I did this hoping that a reduced set of builtin_include_directory_paths (at one point only the compiler includes and the system-level /usr/include) would somehow prevent bazel from picking up the wrong headers, but to no avail.

Let me know if you want me to share any build logs or config files, I'd be happy to post them here.

@pramodk
Contributor

pramodk commented Jan 28, 2020

I also saw a number of similar issues while building py-tensorflow with the latest recipes (which have already been reported here).

Just out of curiosity, I took the latest recipe of py-tensorflow and used it in our quite old Spack fork, and I got the following:

image

Honestly, I am surprised by the fact that it successfully installed py-tensorflow@2.1.0. We are preparing a merge with the latest develop (upstream), and I have to dig into why it doesn't work there.

By the way, we have been using this tensorflow package in production for ~9 months. It was cherry-picked from one of the older PRs. Now we are trying to sync with upstream and use the new py-tensorflow recipe.

@aweits
Contributor

aweits commented Apr 10, 2020

If you look at #15698, specifically the changes to lib/spack/env/cc - that should make this problem (well, the PyObject thing at least) go away as the wrong eval.h is being used here due to jumbling up of the ordering of include paths. Can you let me know if changing that around leads to a successful build for you?
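
Why the include order matters here: GCC searches every `-I` directory before any `-isystem` directory, regardless of where they appear on the command line, so a header name that exists in two places resolves to the `-I` copy. A toy Python model of that lookup; the directory names are hypothetical, and the `tree` dictionary stands in for the real filesystem:

```python
def search_order(args):
    """Return include directories in GCC's lookup order:
    all -I directories (left to right), then all -isystem directories."""
    i_dirs, isystem_dirs = [], []
    it = iter(args)
    for arg in it:
        if arg == "-I":
            d = next(it, None)
            if d is not None:
                i_dirs.append(d)
        elif arg == "-isystem":
            d = next(it, None)
            if d is not None:
                isystem_dirs.append(d)
        elif arg.startswith("-I"):
            i_dirs.append(arg[2:])
    return i_dirs + isystem_dirs


def resolve(header, args, tree):
    """Return the first directory in search order that contains `header`.

    `tree` maps a directory name to the set of header names inside it.
    """
    for d in search_order(args):
        if header in tree.get(d, ()):
            return d
    return None


# Hypothetical layout: both directories ship a file called eval.h.
TREE = {
    "/spack/python/include/python3.6m": {"eval.h", "Python.h"},
    "external/nasm/include": {"eval.h"},
}
```

With `["-isystem", "external/nasm/include", "-I/spack/python/include/python3.6m"]`, `resolve("eval.h", ...)` returns the Python directory even though the nasm directory comes first on the command line; with both passed as `-isystem`, the nasm copy wins.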

@ajw1980
Contributor

ajw1980 commented Apr 15, 2020

Just as a point of reference, I was able to build tensorflow without cuda and nccl with the patch to lib/spack/env/cc. I also needed to add the attached -lrt patch (lrt.txt) because I am building in a centos6 container. It looks like tensorflow 2.2 may include this: tensorflow/tensorflow#37754

I built with gcc 8.3, bazel 0.29.1, and python 3.8.2.

aweits added a commit to aweits/spack that referenced this issue Apr 15, 2020
Spack attempts to inject its paths just before the actual
system include directories. Currently, if a build utilizes
-isystem, Spack's headers will be injected using -I, and that
effectively inserts them before any specified -isystem headers.
This leads to unexpected failures. Here, we assume that if a build
attempts to use -isystem (implying that the underlying
compiler supports the flag) - we change to injecting the paths
into the end of the -isystem paths. There is a potential
concern if the build includes system paths in -I *and*
uses -isystem, but I think that's probably an unlikely usage.

See:

https://gcc.gnu.org/onlinedocs/gcc/Directory-Options.html

Fixes spack#14488, spack#14234
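
The policy described in this commit message can be modelled in a few lines. This is only a Python illustration of the idea under simplifying assumptions (the function name is hypothetical and the injected flags are simply appended); it is not the actual wrapper code in lib/spack/env/cc:

```python
def inject_spack_includes(args, spack_dirs):
    """Append Spack's include directories to a compiler argument list.

    If the build already uses -isystem, add Spack's directories as
    trailing -isystem entries so they are searched after the build's
    own -isystem directories. Otherwise fall back to -I.
    """
    uses_isystem = any(a.startswith("-isystem") for a in args)
    flag = "-isystem" if uses_isystem else "-I"
    injected = []
    for d in spack_dirs:
        injected += [flag, d]
    return list(args) + injected
```

Under this model, a Bazel-style command that passes its bundled headers with `-isystem` keeps them ahead of Spack's Python headers, while builds that never use `-isystem` see the previous `-I` behaviour.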
@samcom12

samcom12 commented Jan 8, 2021

> That is just a temporary path used by bazel in building TF. just set it to a path where you have write permission and sufficient space (up to 10GB?) and make sure that path is not an NFS share. If you are building this on your laptop, then any path under your home directory should work.

Which variable does Bazel use to set its temporary directory?

@s-sajid-ali
Contributor

s-sajid-ali commented Jan 8, 2021

The logic for setting the temporary directory can be seen at the following location in py-tensorflow's recipe:

https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/py-tensorflow/package.py#L517

@adamjstewart
Member

I wish there was some way we could use the same stage directory that Spack uses for building, but it won't work if that directory uses NFS.

@samcom12

Thanks, @adamjstewart. I was able to change the tmp dir in the TensorFlow recipe, but now I'm seeing errors like "Spack compiler must be run from Spack! Input 'SPACK_ENV_PATH' is missing."

@adamjstewart
Member

Are you using a non-Spack-installed bazel? Spack adds a patch to bazel to allow our compiler wrappers to work. Apparently there's a better way to do this by making some kind of Spack toolchain for bazel, but I haven't had the time to investigate.

@samcom12

@adamjstewart I have tried both ways, with a non-Spack Bazel and with a Spack-installed Bazel, but nothing has worked for me yet.

@adamjstewart
Member

Hmm, the bazelruleclassprovider-0.25.patch patch should solve the error message you're seeing, not sure why you would see that error with Spack-installed Bazel.
