
Compiling 1.14 with MPI support #30703

Closed
berniekirby opened this issue Jul 15, 2019 · 10 comments
Labels
stat:contribution welcome · subtype:centos (CentOS build/installation issues) · type:build/install (build and install issues)

Comments

@berniekirby

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS: CentOS 6.9
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 1.14
  • Python version: 3.6.5
  • Installed using virtualenv? pip? conda?: n/a
  • Bazel version (if compiling from source): 0.25.2
  • GCC/Compiler version (if compiling from source): 4.9.3
  • CUDA/cuDNN version: 10.0.130 / 7
  • GPU model and memory: cuda

Compiling with MPI support gives the following build errors:
INFO: From Compiling tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc:
external/com_google_absl/absl/strings/string_view.h(495): warning: expression has no effect
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(109): error: identifier "CudaLaunchKernel" is undefined
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(110): error: identifier "CudaLaunchKernel" is undefined
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc(111): error: identifier "CudaLaunchKernel" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_00038d5b_00000000-6_ring.cu.cpp1.ii".

Standard ./configure, but answering yes to MPI support.

Compiles fine without MPI. I have tried both openmpi/3.1.3 and a CUDA-enabled openmpi/3.1.3.
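
For anyone reproducing this: the failing lines call a launch wrapper by name rather than using the raw <<<...>>> launch syntax, so nvcc reports the identifier as undefined whenever no declaration is in scope in that translation unit. Here is a minimal, self-contained sketch of the pattern; the kernel and the wrapper below are hypothetical stand-ins for illustration, not TensorFlow's actual code:

// launch_sketch.cu -- illustrative stand-in, not TensorFlow code.
// Build: nvcc launch_sketch.cu -o launch_sketch
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ScaleKernel(float* data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

// Hypothetical wrapper playing the role of CudaLaunchKernel. If this
// template is removed (or never #included), nvcc fails with exactly the
// "identifier ... is undefined" error shown in the build log above.
template <typename... Ts>
cudaError_t CudaLaunchKernel(void (*kernel)(Ts...), dim3 grid, dim3 block,
                             size_t shared_bytes, cudaStream_t stream,
                             Ts... args) {
  kernel<<<grid, block, shared_bytes, stream>>>(args...);
  return cudaGetLastError();
}

int main() {
  const int n = 1024;
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  cudaError_t err =
      CudaLaunchKernel(ScaleKernel, dim3((n + 255) / 256), dim3(256),
                       /*shared_bytes=*/0, /*stream=*/nullptr, d, 2.0f, n);
  printf("launch status: %s\n", cudaGetErrorString(err));
  cudaFree(d);
  return 0;
}

Since the sketch only breaks when the wrapper's declaration is missing, a missing #include in ring.cu.cc is the likely culprit rather than the CUDA toolchain itself.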

gadagashwini-zz self-assigned this Jul 16, 2019
gadagashwini-zz added the subtype:centos and type:build/install labels Jul 16, 2019
@gadagashwini-zz
Contributor

@berniekirby Please provide the exact sequence of commands / steps that you executed before running into the problem. Thanks!

gadagashwini-zz added the stat:awaiting response label Jul 16, 2019
@berniekirby
Author

berniekirby commented Jul 16, 2019

Well, it's slightly complicated, as it's a cluster system that is near the end of its life (CentOS 6).
Essentially we 'module load' the versions of software we need:
module load cuda/10.0.130 java gcc/4.9.3 python/3.6.5 bazel/0.25.2 binutils openmpi-gcc/3.1.3

Then just run ./configure
./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.25.2- (@non-git) installed.
Please specify the location of python. [Default is /usr/local/python/3.6.5/bin/python]:

Found possible Python library paths:
/usr/local/python/3.6.5/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/usr/local/python/3.6.5/lib/python3.6/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Could not find any cuda.h matching version '' in any subdirectory:
''
'include'
'include/cuda'
'include/*-linux-gnu'
'extras/CUPTI/include'
'include/cuda/CUPTI'
of:
'/lib'
'/lib/i686/nosegneg'
'/lib64'
'/opt/ibutils/lib64'
'/opt/illumina/bsfs'
'/opt/mellanox/fca/lib'
'/opt/mellanox/libibprof/lib'
'/opt/mellanox/mxm/lib'
'/usr'
'/usr/lib'
'/usr/lib64'
'/usr/lib64/R/lib'
'/usr/lib64/atlas'
'/usr/lib64/llvm'
'/usr/lib64/mysql'
'/usr/lib64/nx'
'/usr/lib64/qt-3.3/lib'
'/usr/lib64/tcl8.5'
'/usr/lib64/xulrunner'
'/usr/local/cuda'
Asking for detailed CUDA configuration...

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]:

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]:

Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]: 2.3

Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: /usr/local/cuda/10.0.130

Found CUDA 10.0 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include
Found cuDNN 7 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include
Found NCCL 2 in:
/usr/local/cuda/10.0.130/lib64
/usr/local/cuda/10.0.130/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]:

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/local/gcc/4.9.3/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]: y
MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

Now build:
bazel build --linkopt=-lrt --verbose_failures --jobs=8 -c opt --copt=-fabi-version=6 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda //tensorflow/tools/pip_package:build_pip_package

After much output, the build fails with the errors given above.

tensorflowbutler removed the stat:awaiting response label Jul 17, 2019
gadagashwini-zz added the stat:awaiting tensorflower label Jul 17, 2019
@gunan
Contributor

gunan commented Jul 17, 2019

I think this is a related issue:
#26610

As far as I know, MPI with TF is community supported.

tensorflowbutler removed the stat:awaiting tensorflower label Jul 18, 2019
@berniekirby
Author

It looks to me as though tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc needs to pull in the definition of CudaLaunchKernel via an #include from somewhere.
It's apparently defined in tensorflow/core/util/gpu_kernel_helper.h.
It's too convoluted for me to figure all that out, so I'll just leave it.
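
For reference, the launcher that header provides is a variadic template of roughly this shape (paraphrased rather than copied, so treat the exact signature as an approximation):

// Paraphrased from tensorflow/core/util/gpu_kernel_helper.h in TF 1.14;
// names and argument order are approximate, not verbatim.
template <typename... Ts, typename... Args>
Status CudaLaunchKernel(void (*function)(Ts...), dim3 grid_dim,
                        dim3 block_dim, size_t shared_memory_size_bytes,
                        cudaStream_t stream, Args... arguments);

Without that declaration in scope, nvcc has never seen the name by the time it reaches lines 109-111 of ring.cu.cc, which is why all three errors are identical.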

If it's community supported, then I suppose we'll just have to wait.

Thank you for your time.

@byronyi
Contributor

byronyi commented Jul 18, 2019

I will work on a fix.

jvishnuvardhan added the stat:contribution welcome label Aug 9, 2019
@boegel

boegel commented Sep 20, 2019

@byronyi Any updates on this? I'm hitting the same problem...

@johnbensnyder

I got this working by adding
#include "tensorflow/core/util/gpu_kernel_helper.h"
to line 23 of
tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc
Still testing everything, but it seems to be working so far.
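
Concretely, the change is a one-line include near the top of the file. A sketch of the patched header block (the neighboring include is approximate context; only the added line matters):

// tensorflow/contrib/mpi_collectives/kernels/ring.cu.cc (top of file)
#include "tensorflow/contrib/mpi_collectives/kernels/ring.h"  // existing (approximate context)
#include "tensorflow/core/util/gpu_kernel_helper.h"  // added: brings CudaLaunchKernel into scope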

@boegel

boegel commented Sep 21, 2019

I can confirm that adding an include statement for tensorflow/core/util/gpu_kernel_helper.h fixed the reported issue, thanks for sharing @johnbensnyder!

ontheklaud pushed a commit to ontheklaud/tensorflow that referenced this issue Oct 2, 2019
@Flamefire
Contributor

Fixed by #29673

gunan closed this as completed Nov 25, 2019