
Compiling 1.14 with MPI support #30703

Closed
berniekirby opened this issue Jul 15, 2019 · 10 comments
Labels
stat:contribution welcome · subtype:centos (CentOS build/installation issues) · type:build/install (build and install issues)

Comments

@berniekirby

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS: CentOS 6.9
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.14
  • Python version: 3.6.5
  • Installed using virtualenv? pip? conda?: n/a
  • Bazel version (if compiling from source): 0.25.2
  • GCC/Compiler version (if compiling from source): 4.9.3
  • CUDA/cuDNN version: 10.0.130 / 7
  • GPU model and memory: cuda

Compiling with MPI support gives the following build errors:
INFO: From Compiling tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc:
external/com_google_absl/absl/strings/string_view.h(495): warning: expression has no effect
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(109): error: identifier "CudaLaunchKernel" is undefined
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(110): error: identifier "CudaLaunchKernel" is undefined
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(111): error: identifier "CudaLaunchKernel" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_00038d5b_00000000-6_ring.cu.cpp1.ii".

Standard ./configure, but answering yes to MPI support.

Compiles fine without MPI. I have tried both openmpi/3.1.3 and a CUDA-enabled openmpi/3.1.3.
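
For anyone reproducing this: the failing lines call a launch wrapper by name rather than using the raw <<<...>>> launch syntax, so nvcc reports the identifier as undefined whenever no declaration is in scope in that translation unit. Here is a minimal, self-contained sketch of the pattern; the kernel and the wrapper below are hypothetical stand-ins for illustration, not TensorFlow's actual code:

// launch_sketch.cu -- illustrative stand-in, not TensorFlow code.
// Build: nvcc launch_sketch.cu -o launch_sketch
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ScaleKernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Hypothetical wrapper playing the role of CudaLaunchKernel. If this
// template is removed (or never #included), nvcc fails with exactly the
// "identifier ... is undefined" error shown in the build log above.
template <typename... Ts>
cudaError_t CudaLaunchKernel(void (*kernel)(Ts...), dim3 grid, dim3 block,
                             size_t shared_bytes, cudaStream_t stream,
                             Ts... args) {
  kernel<<<grid, block, shared_bytes, stream>>>(args...);
  return cudaGetLastError();
}

int main() {
  const int n = 1024;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  cudaError_t err =
      CudaLaunchKernel(ScaleKernel, dim3((n + 255) / 256), dim3(256),
                       /*shared_bytes=*/0, /*stream=*/nullptr, d, 2.0f, n);
  printf("launch status: %s\n", cudaGetErrorString(err));
  cudaFree(d);
  return 0;
}

Since the sketch only breaks when the wrapper's declaration is missing, a missing #include in ring.cu.cc is the likely culprit rather than the CUDA toolchain itself.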

gadagashwini-zz self-assigned this Jul 16, 2019
gadagashwini-zz added the subtype:centos and type:build/install labels Jul 16, 2019
@gadagashwini-zz
Contributor

@berniekirby Please provide the exact sequence of commands / steps that you executed before running into the problem. Thanks!

gadagashwini-zz added the stat:awaiting response label Jul 16, 2019
@berniekirby
Author

berniekirby commented Jul 16, 2019

Well, it's slightly complicated, as it's a cluster system that is near the end of its life (CentOS 6).
Essentially we 'module load' the versions of software we need:
module load cuda/10.0.130 java gcc/4.9.3 python/3.6.5 bazel/0.25.2 binutils openmpi-gcc/3.1.3

Then just run ./configure
./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.25.2- (@non-git) installed.
Please specify the location of python. [Default is /usr/local/python/3.6.5/bin/python]:

Found possible Python library paths:
/usr/local/python/3.6.5/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/usr/local/python/3.6.5/lib/python3.6/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Could not find any cuda.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/lib'
'/lib/i686/nosegneg'
'/lib64'
'/opt/ibutils/lib64'
'/opt/illumina/bsfs'
'/opt/mellanox/fca/lib'
'/opt/mellanox/libibprof/lib'
'/opt/mellanox/mxm/lib'
'/usr'
'/usr/lib'
'/usr/lib64'
'/usr/lib64/R/lib'
'/usr/lib64/atlas'
'/usr/lib64/llvm'
'/usr/lib64/mysql'
'/usr/lib64/nx'
'/usr/lib64/qt-3.3/lib'
'/usr/lib64/tcl8.5'
'/usr/lib64/xulrunner'
'/usr/local/cuda'
Asking for detailed CUDA configuration...

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]:

Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.3

Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/local/cuda/10.0.130

Found CUDA 10.0 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include
Found cuDNN 7 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include
Found NCCL 2 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]:

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/local/gcc/4.9.3/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]: y
MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

Now build:
bazel build --linkopt=-lrt --verbose_failures --jobs=8 -c opt --copt=-fabi-version=6 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda //tensorflow/tools/pip_package:build_pip_package

After much output, the build fails with the errors given above.

tensorflowbutler removed the stat:awaiting response label Jul 17, 2019
gadagashwini-zz added the stat:awaiting tensorflower label Jul 17, 2019
@gunan
Contributor

gunan commented Jul 17, 2019

I think this is a related issue:
#26610

As far as I know, MPI with TF is community supported.

tensorflowbutler removed the stat:awaiting tensorflower label Jul 18, 2019
@berniekirby
Author

It looks to me as though tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc needs to pull in the definition of CudaLaunchKernel via an #include from somewhere.
It's apparently defined in tensorflow/core/util/gpu_kernel_helper.h.
It's too convoluted for me to figure all that out, so I'll just leave it.
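
For reference, the launcher that header provides is a variadic template of roughly this shape (paraphrased rather than copied, so treat the exact signature as an approximation):

// Paraphrased from tensorflow/core/util/gpu_kernel_helper.h in TF 1.14;
// names and argument order are approximate, not verbatim.
template <typename... Ts, typename... Args>
Status CudaLaunchKernel(void (*function)(Ts...), dim3 grid_dim,
                        dim3 block_dim, size_t shared_memory_size_bytes,
                        cudaStream_t stream, Args... arguments);

Without that declaration in scope, nvcc has never seen the name by the time it reaches lines 109-111 of ring.cu.cc, which is why all three errors are identical.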

If it's community supported, then I suppose we'll just have to wait.

Thank you for your time.

@byronyi
Contributor

byronyi commented Jul 18, 2019

I will work on a fix.

jvishnuvardhan added the stat:contribution welcome label Aug 9, 2019
@boegel

boegel commented Sep 20, 2019

@byronyi Any updates on this? I'm hitting the same problem...

@johnbensnyder

I got this working by adding
#include "tensorflow/core/util/gpu_kernel_helper.h"
to line 23 of
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc
Still testing everything, but it seems to be working so far.
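
Concretely, the change is a one-line include near the top of the file. A sketch of the patched header block (the neighboring include is approximate context; only the added line matters):

// tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc (top of file)
#include "tensorflow/contrib/mpi_collectives/kernels/ring.h"  // existing (approximate context)
#include "tensorflow/core/util/gpu_kernel_helper.h"  // added: brings CudaLaunchKernel into scope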

@boegel

boegel commented Sep 21, 2019

I can confirm that adding an include statement for tensorflow/core/util/gpu_kernel_helper.h fixed the reported issue, thanks for sharing @johnbensnyder!

ontheklaud pushed a commit to ontheklaud/tensorflow that referenced this issue Oct 2, 2019
@Flamefire
Contributor

Fixed by #29673

gunan closed this as completed Nov 25, 2019