Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ATLAS: multiple definition of `ATL_SetAtomicCount' #15045

Closed
jdemeyer opened this issue Aug 14, 2013 · 39 comments
Closed

ATLAS: multiple definition of `ATL_SetAtomicCount' #15045

jdemeyer opened this issue Aug 14, 2013 · 39 comments

Comments

@jdemeyer
Copy link

This non-reproducible problem which can occur during the ATLAS build (with atlas-3.10.1.p3.spkg from #14754) was supposed to be fixed but it actually is not fixed:

make[5]: Entering directory `/home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/spkg/build/atlas-3.10.1.p3/ATLAS-build/lib'
gcc -fPIC -m64 -shared -o libsatlas.so  \
           -Wl,"rpath-link /home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/local/lib" \
           -Wl,--whole-archive liblapack.a libf77blas.a libcblas.a libatlas.a -Wl,--no-whole-archive -L/usr/local/gcc-4.7.0/x86_64-Linux-k10-fc/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../lib64 -lgfortran  -lc -lpthread -lm -lgcc
/usr/local/binutils-2.22/x86_64-Linux-k10-fc-gcc-4.6.2-rh/bin/ld: cannot find rpath-link /home/buildbot/build/sage/lena-1/lena_full/build/sage-5.12.beta1/local/lib: No such file or directory
libatlas.a(ATL_SetAtomicCount_mut.o): In function `ATL_SetAtomicCount':
ATL_SetAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_SetAtomicCount'
libatlas.a(ATL_SetAtomicCount_arch.o):ATL_SetAtomicCount_arch.c:(.text+0x0): first defined here
libatlas.a(ATL_ResetAtomicCount_mut.o): In function `ATL_ResetAtomicCount':
ATL_ResetAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_ResetAtomicCount'
libatlas.a(ATL_ResetAtomicCount_amd64.o):(.text+0x0): first defined here
libatlas.a(ATL_DecAtomicCount_mut.o): In function `ATL_DecAtomicCount':
ATL_DecAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_DecAtomicCount'
libatlas.a(ATL_DecAtomicCount_amd64.o):(.text+0x0): first defined here
libatlas.a(ATL_FreeAtomicCount_mut.o): In function `ATL_FreeAtomicCount':
ATL_FreeAtomicCount_mut.c:(.text+0x0): multiple definition of `ATL_FreeAtomicCount'
libatlas.a(ATL_FreeAtomicCount_arch.o):ATL_FreeAtomicCount_arch.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
make[5]: *** [GCCTRY] Error 1

full ATLAS build log

Upstream: http://sourceforge.net/p/math-atlas/support-requests/907/

New spkg: http://boxen.math.washington.edu/home/vbraun/spkg/atlas-3.10.1.p5.spkg

Upstream: Reported upstream. No feedback yet.

CC: @jpflori

Component: packages: standard

Author: Volker Braun

Reviewer: Nils Bruin

Merged: sage-5.12.beta5

Issue created by migration from https://trac.sagemath.org/ticket/15045

@vbraun
Copy link
Member

vbraun commented Aug 14, 2013

comment:1

I'm pretty sure this is http://sourceforge.net/p/math-atlas/support-requests/907/

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member

vbraun commented Aug 14, 2013

comment:2

Its probably not something that is reproducable on modern hardware, so progress upstream has been slow.

For now, I propose we just plow ahead and fall back to static libraries.

@vbraun
Copy link
Member

vbraun commented Aug 14, 2013

Author: Volker Braun

@vbraun
Copy link
Member

vbraun commented Aug 14, 2013

Attachment: atlas-p3-p4.diff.gz

diff for review only

@jdemeyer

This comment has been minimized.

@jdemeyer
Copy link
Author

Upstream: Reported upstream. No feedback yet.

@jdemeyer
Copy link
Author

comment:4

Replying to @vbraun:

Its probably not something that is reproducable on modern hardware

2009 might not meet your definition of "modern", but it's far from ancient either.

@jdemeyer
Copy link
Author

comment:5

Replying to @vbraun:

For now, I propose we just plow ahead and fall back to static libraries.

If static libraries always work, why don't we always only use static libraries?

Conversely, if static libraries don't always work, wouldn't this "fall back" cause other problems?

@vbraun
Copy link
Member

vbraun commented Aug 19, 2013

comment:6

Because static libraries suck.

It won't cause other problems for now. If we want to use the upstream shared libraries (with a different naming convention than the static libraries) then we'll run into troubles. But by then we hopefully have found a way to not hardcode atlas/blas/lapack library names in the Sage build system.

@jdemeyer
Copy link
Author

comment:7

It still feels bad to build the shared libraries "sometimes" depending on some non-deterministic condition. If static libraries work (even if they don't work so well), I think it's better to stick with them in all cases.

@vbraun
Copy link
Member

vbraun commented Aug 20, 2013

comment:8

I would agree with you if we would want to stick with a hard-coded ATLAS in Sage until the end of time. But IMHO we should work towards a more configurable solution that will just fall back to reference / openblas as appropriate if atlas doesn't build. And perhaps not build atlas by default if it takes too long. And then static libraries would be a huge pain in the butt.

@jdemeyer
Copy link
Author

comment:9

Replying to @vbraun:

But IMHO we should work towards a more configurable solution that will just fall back to reference / openblas as appropriate if atlas doesn't build. And perhaps not build atlas by default if it takes too long. And then static libraries would be a huge pain in the butt.

If you want to work towards that goal, then the problem on this ticket must be fixed in a proper way anyway. I don't see how using shared libraries "sometimes" is closer to the stated goal than "never" using shared libraries.

I am particularly bothered that the choice is non-deterministic and difficult to predict, so Sage builds on the same machine will be substantially different (one with shared ATLAS and one with static ATLAS).

@vbraun
Copy link
Member

vbraun commented Aug 20, 2013

comment:10

Once we have an alternative to ATLAS that we can use, we disable the static library fallback for ATLAS. Its easy. Until then, we have this crutch. I agree that its ugly as sin but not shipping the new ATLAS would be a mistake imho.

I just don't want to work on changing the blas build system in Sage until we have switched to git. It'll be much easier if I don't have to copy tarballs around just for a change to the build script.

@jdemeyer
Copy link
Author

comment:11

Replying to @vbraun:

Once we have an alternative to ATLAS that we can use, we disable the static library fallback for ATLAS. Its easy. Until then, we have this crutch. I agree that its ugly as sin but not shipping the new ATLAS would be a mistake imho.

I am not saying that we shouldn't ship the new ATLAS. I am proposing to always use the static ATLAS libraries always instead of only when the linker problem appears.

@vbraun
Copy link
Member

vbraun commented Aug 20, 2013

comment:12

Replying to @jdemeyer:

I am not saying that we shouldn't ship the new ATLAS. I am proposing to always use the static ATLAS libraries always instead of only when the linker problem appears.

I think that just means more changes to undo later. And mind you the shared libraries build apparently fine on most systems, its just some AMD platforms where presumably the performance impact of holding mutexes is different.

@jdemeyer
Copy link
Author

comment:13

Replying to @vbraun:

I think that just means more changes to undo later.

Really? If you comment out the current approach, then it's essentially just removing # characters. That together with hg or git should work easily.

@jdemeyer
Copy link
Author

comment:14

Replying to @vbraun:

And mind you the shared libraries build apparently fine on most systems, its just some AMD platforms where presumably the performance impact of holding mutexes is different.

So far, Sage has worked equally well for all CPUs of a given architecture, let's keep it that way.

@vbraun
Copy link
Member

vbraun commented Aug 20, 2013

comment:15

As long as the library names are the same (which is currently the case) it does not make a difference for the build system if the library is shared or static. Deliberately disabling shared libraries would just mean to make it suck for everybody, and not just for the small percentage where its the only option. In other words, I'm against it ;-)

@jdemeyer
Copy link
Author

comment:16

Replying to @vbraun:

In other words, I'm against it ;-)

And I'm against more special cases (especially non-deterministic) in the Sage build system ;-)

@vbraun
Copy link
Member

vbraun commented Aug 20, 2013

comment:17

Noted, but the ATLAS spkg has always been trying different ways to build and falling back if they fail. Which always sucked, but the problem is that we don't have a deterministic alternative to deploy.

@jdemeyer
Copy link
Author

comment:18

Replying to @vbraun:

Noted, but the ATLAS spkg has always been trying different ways to build and falling back if they fail.

Those "different ways" only were about tuning parameters and CPU instruction sets, right? Which is far less fundamental than static vs. dynamic libraries.

Imagine we go with your solution and in the future somebody decides to change the Sage build system or some package such that it only works for a dynamic ATLAS library. Initial testing might reveal that everything works, even though it doesn't work in case your ATLAS spkg decides to use static libraries. This is what I want to avoid.

@vbraun
Copy link
Member

vbraun commented Aug 22, 2013

comment:19

I disagree. Pretty much the only way to achieve your scenario would be to hardcode the shared library file name somewhere. But those already differ on Linux vs. OSX.

What I'm mainly objecting to is the basic premise that we should keep this monstrosity alive for any longer than necessary. Things have to be fixed either in ATLAS or by fixing the Sage BLAS build system. As soon as that is done, we can switch off the static library fallback again.

@jdemeyer
Copy link
Author

comment:20

Replying to @vbraun:

Pretty much the only way to achieve your scenario would be to hardcode the shared library file name somewhere.

If there is no observable difference between shared and static libraries, then why do you care so much about shared libraries?

What I'm mainly objecting to is the basic premise that we should keep this monstrosity alive for any longer than necessary.

The problem is that it needs to be kept alive in any case precisely because of this ticket.

@vbraun
Copy link
Member

vbraun commented Aug 22, 2013

comment:21

There is no difference at the linker command line. There is of course a price to be paid later in that you can never benefit from blas/atlas upgrades if you have linked statically.

This ticket is just a workaround until we have a better fix. Unless you think we don't need a workaround for now, in which case I'm happy to wait for either upstream to fix it or we have switched to git...

@jdemeyer
Copy link
Author

comment:22

I asked sage-devel about what to do, since we're clearly not going to agree...

@nbruin
Copy link
Contributor

nbruin commented Aug 24, 2013

comment:23

For atlas issue 907, the relevant code would be:

ATLAS/tune/threads/tune_count.c:242

      if (tmut < tldec*1.02)
      {
         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
         ATL_assert(!system("make iForceUseMutex"));
      }

Looking at
ATLAS/makes/Make.ttune:159

iFind_atomic_arch :
        if $(MAKE) xprobe_atomic_amd64 ; then \
           cp $(myTHRdir)/ATL_*AtomicCount_arch.c . ; \
           cp $(myTHRdir)/ATL_ResetAtomicCount_amd64.S \
              ATL_ResetAtomicCount_arch.S ; \
           cp $(myTHRdir)/ATL_DecAtomicCount_amd64.S \
              ATL_DecAtomicCount_arch.S ; \
           rm $(BLDdir)/src/threads/atomic.inc ; \
           echo "aobj = ATL_SetAtomicCount_arch.o ATL_ResetAtomicCount_amd64.o ATL_DecAtomicCount_amd64.o ATL_FreeAtomicCount_arch.o" > $(BLDdir)/src/threads/atomic.inc ;
           .... lines elided ....
        fi
        - rm -f $(BLDdir)/src/threads/lib.grd
        $(MAKE) tlib

ATLAS/makes/Make.ttune:205

iForceUseMutex:
        rm $(BLDdir)/src/threads/atomic.inc
        echo "aobj = ATL_SetAtomicCount_mut.o ATL_ResetAtomicCount_mut.o ATL_DecAtomicCount_mut.o ATL_FreeAtomicCount_mut.o" > $(BLDdir)/src/threads/atomic.inc

The file atomic.inc is only touched in Make.ttune, always in the style above. The file is only used in ATLAS/makes/Make.thr, where it is included to get aobj defined, which is then included in obj to build a library.

I suspect the problem is that make iForceUseMutex doesn't actually rebuild the library, which does happen in make iFind_atomic_arch. So, when we do

iTune_atomic :
        if $(MAKE) mutcnt ; then \
           if $(MAKE) iFind_atomic_arch ; then \
              $(MAKE) xtune_count ; ./xtune_count -r 1000000 -o yes; \
           fi ; \
        else \
           $(MAKE) iFind_atomic_arch ; \
        fi

hich presumably runs the code in tune_count.c, we end up with a built library that has the architecture-specific bits in it, but an aobj that reflects otherwise. Probably, the make system runs into these things another time and then tries to include the same definitions another time in the library.

Perhaps just patch the system?

ATLAS/tune/threads/tune_count.c:244
-         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
-         ATL_assert(!system("make iForceUseMutex"));
+         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, BUT WE LEAVE IT IN ANYWAY\n");

I think there is some hope that this will eliminate the (rare) build fails.

@jpflori
Copy link

jpflori commented Aug 25, 2013

comment:24

I definitely cannot work on that at the moment but this is reallyt reminiscent of #10508 comment:455

Ok, I just read the ticket description, and our zorkaround was working because of our custoñ way of building shared libraries.

Now that we also build the upstream shared libraries, our workaround is not enough anymore....

@jpflori
Copy link

jpflori commented Aug 25, 2013

comment:25

From what I remember the main problem is that ATLAS builds a (static) lib for tuning, and then just update it to get the final (static) library so it ends up including two objects files with the same symbols (because of the atoñic.inc ñagic which does notinclude the same pieces of code in tyhe tuning and in the final phases, and because old object files of the tuning phase are still lurking around when the final phase is going on).

It presumably does so because updating the static archive is faster than rebuilding everything from scratch.

So another solution would be to erase the static lib built for tuning (and the corresponding object files) when the tuning phase is finished and let everything be built again when the real building phase starts.

@jpflori
Copy link

jpflori commented Aug 25, 2013

comment:26

See #10508 comment:446 which might be clearer than the above :)

@nbruin
Copy link
Contributor

nbruin commented Aug 26, 2013

comment:27

Replying to @nbruin:

ATLAS/tune/threads/tune_count.c:242

      if (tmut < tldec*1.02)
      {
         printf("\nNO REAL ADVANTAGE TO ASSEMBLY, FORCING USE OF MUTEX\n");
         ATL_assert(!system("make iForceUseMutex"));
      }

So would it be a plan to

  • check that this check is the culprit of (at least some) failure by patching it to be tmut < tldec*1000 and verifying that on relevant hardware ALTAS now reliably fails to build
  • patch atlas to NOT run make iForceUseMutex at this point and ship with that. Perhaps patch it as:
printf("\nASSEMBLY/MUTEX ratio is %d, but we'll stick with assembly anyway\n",tldec/tmut);

so that our logfiles at least tells us what the performance penalty is (well, there's none because ATLAS doesn't build dynamic libs without).

@vbraun
Copy link
Member

vbraun commented Aug 30, 2013

comment:28

I've made the change to force iForceUseMutex and that indeed reproduces the bug reliably. I made the change that Nils suggested and always use the assembly version (never iForceUseMutex), this fixes it for me.

@vbraun

This comment has been minimized.

@vbraun
Copy link
Member

vbraun commented Aug 30, 2013

Attachment: atlas-p4-p5.diff.gz

diff for review only

@nbruin
Copy link
Contributor

nbruin commented Aug 30, 2013

comment:29

Congratulations on getting this fixed.

Just for the record: I think the mutex code can still get built on platforms where no assembly alternative is available. In that case, I don't think we should particularly expect conflicting symbols either, because there are no other implementations available that the mutex code can conflict with in that case.

This is why I figured it was best to patch the code that takes action depending on the tuning result and not the target in the makefile.

The comment in the SPKG.txt doesn't strictly contradict this, but I had to read the comment a second time to convince myself of that.

@jdemeyer
Copy link
Author

jdemeyer commented Sep 2, 2013

comment:30

Nils: do you want to formally review this ticket?

@nbruin
Copy link
Contributor

nbruin commented Sep 2, 2013

Reviewer: Nils Bruin

@nbruin
Copy link
Contributor

nbruin commented Sep 2, 2013

comment:31

I have no idea what p4 is about but the change from p4 to p5 as posted on this ticket looks pretty reasonable. I haven't built or downloaded this new package, but other people have and it's a blocker, so a positive review helps things along.

@jdemeyer
Copy link
Author

jdemeyer commented Sep 4, 2013

Merged: sage-5.12.beta5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants