Kokkos: Multiple definition link errors #10830

Closed
vqd8a opened this issue Aug 3, 2022 · 8 comments
Labels
client: Gemma (ATDM code Gemma)
type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@vqd8a
Contributor

vqd8a commented Aug 3, 2022

Bug Report

@trilinos/kokkos
@trilinos/kokkos-kernels
@bartlettroscoe

Description

I am compiling my application code against Adelus and get these errors:

nvlink error   : Multiple definition of '_ZN6Kokkos4Impl79_GLOBAL__N__55_tmpxft_00008180_00000000_6_Kokkos_Cuda_Instance_cpp1_ii_f3693ba322query_cuda_kernel_archEPi' in '/ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a:Kokkos_Cuda_Instance.cpp.o', first defined in '/ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a:Kokkos_Cuda_Instance.cpp.o'
nvlink error   : Multiple definition of 'kokkos_impl_cuda_constant_memory_buffer' in '/ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a:Kokkos_Cuda_Instance.cpp.o', first defined in '/ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoskernels.a:Sparse_spgemm_numeric_eti_DOUBLE_ORDINAL_INT_OFFSET_INT_LAYOUTLEFT_EXECSPACE_CUDA_MEMSPACE_CUDASPACE_MEMSPACE_CUDASPACE.cpp.o'
nvlink fatal   : merge_elf failed

This is the link line printed by make -j VERBOSE=1:

/home/projects/ppc64le-pwr9-nvidia/openmpi/4.0.5/gcc/7.2.0/cuda/10.2.2/bin/mpicxx -O3 -g -Wall -Wno-unknown-pragmas -Wno-unused-but-set-variable -Wno-inline -Wshadow -mcpu=power9 -mtune=power9 --relocatable-device-code=true  -Xcompiler -fopenmp -expt-extended-lambda -arch=sm_70  -O3 -DNDEBUG  -rdynamic -L/home/projects/ppc64le-pwr9-nvidia/cuda/10.2.2/lib64 CMakeFiles/adelus_driver.dir/adelus_driver.cpp.o -o adelus_driver  /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libzadelus.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoskernels.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkosalgorithms.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscontainers.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a -L/home/projects/ppc64le-pwr9-nvidia/cuda/10.2.2/lib64 -lcudart -lcuda -lcublas /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoskernels.a /home/projects/ppc64le-pwr9/spack-installs/externals/openblas/0.3.16/gcc/7.2.0/lib64/libopenblas.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkosalgorithms.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscontainers.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a /usr/lib64/libdl.so -lcudart -lcuda -lcublas

Looking at the link line, I see the same static libraries (e.g. libkokkoskernels.a, libkokkoscore.a, ...) appear on the link line multiple times. If I remove the duplicate *.a files in the above make line, like below, and manually run it, it works:

/home/projects/ppc64le-pwr9-nvidia/openmpi/4.0.5/gcc/7.2.0/cuda/10.2.2/bin/mpicxx -O3 -g -Wall -Wno-unknown-pragmas -Wno-unused-but-set-variable -Wno-inline -Wshadow -mcpu=power9 -mtune=power9 --relocatable-device-code=true  -Xcompiler -fopenmp -expt-extended-lambda -arch=sm_70  -O3 -DNDEBUG  -rdynamic -L/home/projects/ppc64le-pwr9-nvidia/cuda/10.2.2/lib64 CMakeFiles/adelus_driver.dir/adelus_driver.cpp.o -o adelus_driver  /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libzadelus.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoskernels.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkosalgorithms.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscontainers.a /ascldap/users/vqdang/Trilinos/install-trilinos/lib/libkokkoscore.a -L/home/projects/ppc64le-pwr9-nvidia/cuda/10.2.2/lib64 -lcudart -lcuda -lcublas /home/projects/ppc64le-pwr9/spack-installs/externals/openblas/0.3.16/gcc/7.2.0/lib64/libopenblas.a /usr/lib64/libdl.so

I am not sure what causes these errors or what I am missing when building Trilinos.

Steps to Reproduce

Use the Trilinos develop branch on Weaver
Trilinos configure script (should be changed to .sh): build-trilinos.txt
Application code (should be changed to .cpp) : adelus_driver.txt
Application code's CMakeLists.txt: CMakeLists.txt

Steps:

  1. Run the Trilinos configure script
  2. make -j install
  3. Go to the application dir
  4. cmake -DTrilinos_DIR=~/Trilinos/install/include .
  5. make -j VERBOSE=1

vqd8a added the type: bug and client: Gemma labels Aug 3, 2022
bartlettroscoe added this to ToDo in Trilinos TriBITS Refactor via automation Aug 3, 2022
@bartlettroscoe
Member

bartlettroscoe commented Aug 3, 2022

> If I remove the duplicate *.a files in the above make line, like below, and manually run it, it works

@vqd8a that is interesting. That suggests that you can't list the same *.a files more than once with this compiler/linker. I have never seen a system like this before. But I actually really like this, because it will help squash the arguments, made over the years, that TriBITS should support circular dependencies between Trilinos packages (which would require listing the libraries on the link line multiple times). If you tried that on this system, it would fail and you would be up the creek. So all we need to do is figure out why the same libraries are being listed more than once and fix that.

bartlettroscoe moved this from ToDo to In Progress in Trilinos TriBITS Refactor Aug 3, 2022
@bartlettroscoe
Member

NOTE: I added this Issue to the Trilinos TriBITS Refactor Project just in case this is related to the recent merge of updated TriBITS from PR #10614 and will be addressed as part of #10774.

@bartlettroscoe
Member

@vqd8a, added this to the task list in #10774.

Just as an experiment, can you try upgrading your CMakeLists.txt file to use the new target Trilinos::all_selected_libs as per the refactoring in commit 44855fd as part of #10813 and see if that fixes the problem?

@vbrunini
Contributor

vbrunini commented Aug 3, 2022

There is a --remove-duplicate-link-files flag for nvcc_wrapper that we've been using for Sierra to work around this issue on NVIDIA platforms for several years.

@vqd8a
Contributor Author

vqd8a commented Aug 3, 2022

@bartlettroscoe
Thanks. Changing my CMakeLists.txt from #TARGET_LINK_LIBRARIES(adelus_driver ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES} ${Trilinos_EXTRA_LD_FLAGS}) to TARGET_LINK_LIBRARIES(adelus_driver Trilinos::all_selected_libs) works.

There are no duplicates on the link line anymore:

/home/projects/ppc64le-pwr9-nvidia/openmpi/4.0.5/gcc/7.2.0/cuda/10.2.2/bin/mpicxx -O3 -g -Wall -Wno-unknown-pragmas -Wno-unused-but-set-variable -Wno-inline -Wshadow -mcpu=power9 -mtune=power9 --relocatable-device-code=true  -Xcompiler -fopenmp -expt-extended-lambda -arch=sm_70  -O3 -DNDEBUG  -rdynamic -L/home/projects/ppc64le-pwr9-nvidia/cuda/10.2.2/lib64 CMakeFiles/adelus_driver.dir/adelus_driver.cpp.o -o adelus_driver  /ascldap/users/vqdang/Trilinos/install/lib/libzadelus.a /ascldap/users/vqdang/Trilinos/install/lib/libkokkoskernels.a /home/projects/ppc64le-pwr9/spack-installs/externals/openblas/0.3.16/gcc/7.2.0/lib64/libopenblas.a /ascldap/users/vqdang/Trilinos/install/lib/libkokkosalgorithms.a /ascldap/users/vqdang/Trilinos/install/lib/libkokkoscontainers.a /ascldap/users/vqdang/Trilinos/install/lib/libkokkoscore.a /usr/lib64/libdl.so -lcudart -lcuda -lcublas

But I don't understand why using ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES} ${Trilinos_EXTRA_LD_FLAGS} does not work?
Would you suggest that I use Trilinos::all_selected_libs from now on?

@bartlettroscoe
Member

> There is a --remove-duplicate-link-files flag for nvcc_wrapper that we've been using for Sierra to work around this issue on NVIDIA platforms for several years.

Now that you mention it, I remember that option is getting set in the CUDA Trilinos builds.

> But I don't understand why using ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES} ${Trilinos_EXTRA_LD_FLAGS} does not work?

Not sure why that would be generating duplicate libraries. That will take some investigation.
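
As a starting point for that investigation, the following is a minimal diagnostic sketch for the application's CMakeLists.txt. It assumes find_package(Trilinos) has already run and populated the legacy variables; the names all_libs, unique_libs, n_all, and n_unique are hypothetical helpers introduced only for this illustration.

```cmake
# Place after find_package(Trilinos REQUIRED).
# Collect the legacy link variables into one list (helper names are hypothetical).
set(all_libs ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES})
list(LENGTH all_libs n_all)

if(n_all GREATER 0)
  # Copy the list and drop repeated entries, then compare the two lengths.
  set(unique_libs ${all_libs})
  list(REMOVE_DUPLICATES unique_libs)
  list(LENGTH unique_libs n_unique)
  message(STATUS "Trilinos link entries: ${n_all} total, ${n_unique} unique")
endif()
```

If both counts match, the variables themselves are free of duplicates. In that case the repeats on the link line may come from CMake, which can list a static library again when it expands the dependencies recorded for the libraries being linked.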

> Would you suggest that I use Trilinos::all_selected_libs from now on?

If you will only be building against Trilinos versions after the merge of #10614, then yes. However, if you also want to build against older versions of Trilinos, then keep using the same set of variables you use now.
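
One possible way to support both cases in a single CMakeLists.txt is to test for the new target and fall back to the legacy variables. This is a sketch under assumptions: the project name and the minimum CMake version are hypothetical, and it assumes find_package(Trilinos) defines Trilinos::all_selected_libs only in the newer versions.

```cmake
cmake_minimum_required(VERSION 3.13)  # assumed; use whatever your project requires
project(AdelusDriver CXX)             # hypothetical project name

# Trilinos_DIR or CMAKE_PREFIX_PATH must point at the Trilinos installation.
find_package(Trilinos REQUIRED)

add_executable(adelus_driver adelus_driver.cpp)

if(TARGET Trilinos::all_selected_libs)
  # Newer Trilinos: a single imported target carries the link information.
  target_link_libraries(adelus_driver PRIVATE Trilinos::all_selected_libs)
else()
  # Older Trilinos: fall back to the legacy variables.
  target_include_directories(adelus_driver PRIVATE
    ${Trilinos_INCLUDE_DIRS} ${Trilinos_TPL_INCLUDE_DIRS})
  target_link_libraries(adelus_driver PRIVATE
    ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES} ${Trilinos_EXTRA_LD_FLAGS})
endif()
```

The if(TARGET ...) check keeps the choice automatic, so the same file configures against either kind of installation without a manual switch.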

In any case, set --remove-duplicate-link-files in your link flags or just add it to CMAKE_CXX_FLAGS.
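
Both variants of that suggestion can be sketched in CMake as follows. The sketch assumes the C++ compiler driver is, or forwards its arguments to, Kokkos' nvcc_wrapper, since the flag is an nvcc_wrapper option and other compilers are likely to reject it. It also assumes a target named adelus_driver already exists. Pick one option, not both.

```cmake
# Option A: append the flag to the global C++ flags.
# CMake passes CMAKE_CXX_FLAGS to both the compile and the link steps.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --remove-duplicate-link-files")

# Option B: pass the flag only when linking one target (requires CMake 3.13+).
target_link_options(adelus_driver PRIVATE --remove-duplicate-link-files)
```

Option A can also be supplied at configure time with -DCMAKE_CXX_FLAGS=..., which avoids editing CMakeLists.txt at all.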

@bartlettroscoe
Member

@vqd8a, if you are okay with refactoring to use Trilinos::all_selected_libs or with adding --remove-duplicate-link-files to your CMAKE_CXX_FLAGS, then we can close this issue. There is not much motivation to remove duplicates from the list ${Trilinos_LIBRARIES} ${Trilinos_TPL_LIBRARIES}, since that is the deprecated way to link against Trilinos libraries. Going forward, we are trying to move the CMake ecosystem to modern CMake.

@vqd8a
Contributor Author

vqd8a commented Aug 3, 2022

I will switch to using Trilinos::all_selected_libs. I am going to close this issue. Thanks @bartlettroscoe @vbrunini

vqd8a closed this as completed Aug 3, 2022
Trilinos TriBITS Refactor automation moved this from Selected to Done Aug 3, 2022