Wanted feature: global privatization through dynamic linker tricks #137

Closed
mquinson opened this Issue Feb 16, 2017 · 16 comments

Member

mquinson commented Feb 16, 2017

This bug is there to keep track of an ongoing discussion about a nice feature we are currently implementing. This text was edited to incorporate the input that others wrote as a comment below. Thanks to @sthibaul and @randomstuff for their valuable input (as usual).

The context is the automatic privatization of globals when folding the MPI ranks as threads of the simulated processor. For now, right after the Unix process is loaded and set up, we make N copies of the process global segments, one copy per MPI rank. Then, on each context switch, we remap the area where the code searches its data onto the copy belonging to the rank we want to execute.

The idea is to avoid this segment remapping (for the sake of performance). Instead, we want the system to load the binary N times.

Writing our own library loader seems way too complicated, but we could probably do it with a standard dlopen(), provided that the binary was compiled with -fPIE or -fPIC (Position Independent Executable/Code). -fPIC means that you recompile your code as a dynamic library, which is not optimal but could constitute a safe fallback implementation for Windows. -fPIE was created to allow the loader to change the position and order of the executable segments on each run, making it harder for an attacker to guess where things are. Nowadays, that option is enabled by default on Debian, so this is definitely something we can safely rely on: only applications doing horrible things with symbols have problems with it.

The main difficulty of this approach is that ld.so is too clever: it won't dlopen the same file several times. Only the first request actually loads the file; later requests just get the handle of the object that is already loaded. dlmopen seems appealing to remove that limitation, but only 256 namespaces can be created with this call. Plus, libsimgrid would also be duplicated, which is obviously not something we want. Instead, we need to work around this ld.so sanity check in order to dlopen() the same code N times.

Patching ld.so could be doable, since we only want to deactivate the inode test that decides to skip loading the file. But even for such a simple change, modifying ld.so remains rather complex and hard to maintain. In addition, it is tough to force our users to install an alternative ld.so. Interposing on ld.so with LD_PRELOAD is not possible, since ld.so embeds everything it needs as static objects, but you could still use ptrace. This really slows down the application, but you can disable it once the application is properly loaded. Still, it sounds rather difficult to achieve properly.

We decided to rely on a much easier trick: actually copy the file on the disk. And it just works, as demonstrated in f161538. Of course, this consumes a lot of space both on disk and in memory. Duplicating the data space is exactly what we wanted, but also duplicating the code and constants is not exactly optimal.

A really good approach to solve this would be to use cp --reflink=auto. That would do the Right Thing when executed on a btrfs partition. In this case, the copy is copy-on-write, i.e. duplicated neither on disk nor in memory (FIXME: is this really true for memory too? I guess so).
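To make the reflink idea concrete without calling an external cp, here is a minimal sketch that tries the Linux FICLONE ioctl (the one described in ioctl_ficlone(2)) and falls back to a plain copy when the filesystem refuses. The function name `clone_or_copy` is made up for this illustration, and the constant is the x86/ARM value of FICLONE from <linux/fs.h>; this is not the code from the SimGrid patch.

```python
import fcntl
import shutil

# Value of FICLONE from <linux/fs.h> on common Linux architectures (assumption).
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Copy src_path to dst_path, sharing disk blocks when the filesystem can.

    Returns "reflink" when the copy-on-write clone succeeded and "copy" when
    we had to fall back to duplicating the bytes.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to make dst share all of src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # Typical on ext4 or tmpfs (EOPNOTSUPP) or across mounts (EXDEV).
            shutil.copyfileobj(src, dst)
            return "copy"
```

On btrfs the first branch should be taken; everywhere else the fallback gives the same result as the plain copy discussed above, just without the space savings.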

A more complex approach to reach the same effect would be to use fallocate and FALLOC_FL_PUNCH_HOLE to remove the unwanted content of the file as soon as possible (code, constants, and DWARF information). Then, you want to remap the corresponding memory areas onto the corresponding segment of the original. libelf could be used to read the sectioning information in the ELF file. Currently, we only require libelf when activating MC, but changing this is not a big deal.
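As a rough illustration of the hole-punching step, the sketch below calls the Linux fallocate(2) through ctypes. It assumes a 64-bit Linux system (so that off_t is 64 bits wide); the helper name `punch_hole` is hypothetical, and the offsets would in practice come from the ELF section table rather than being hard-coded.

```python
import ctypes

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# dlopen(NULL): gives access to the libc already loaded in this process.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate the byte range [offset, offset + length) of an open file.

    The file size is preserved and the range reads back as zeros afterwards.
    Returns True on success, False if the kernel or filesystem refused.
    """
    # PUNCH_HOLE is only accepted together with KEEP_SIZE.
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    return _libc.fallocate(fd, mode, offset, length) == 0
```

A caller would treat a False return as "keep the full copy", since not every filesystem supports this operation.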

But in the first iteration, having N versions of your binary on disk is something we can live with. Even more so since the file is unlinked right after being loaded in memory (before the application actually starts). The patch is only missing some polish before being integrated.
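The copy, load, unlink sequence described above can be sketched as follows. This is only an illustration, not the actual patch: `load_private_copies` and the temporary file naming are invented here, and ctypes.CDLL stands in for the dlopen() call that the C code would make.

```python
import ctypes
import os
import shutil
import tempfile


def load_private_copies(binary_path, nranks):
    """dlopen() nranks private copies of binary_path, one per simulated rank.

    Each copy is a distinct file, so the dynamic loader maps it separately
    and every returned handle gets its own instance of the global variables.
    """
    handles = []
    for rank in range(nranks):
        # Hypothetical naming scheme for the temporary copies.
        fd, copy_path = tempfile.mkstemp(prefix=f"rank{rank}-", suffix=".so")
        os.close(fd)
        try:
            shutil.copyfile(binary_path, copy_path)
            handles.append(ctypes.CDLL(copy_path))  # ctypes calls dlopen()
        finally:
            # The mapping stays valid after the unlink; removing the name
            # simply avoids leaving stray files behind.
            os.unlink(copy_path)
    return handles
```

Looking up the entry point of a given rank is then an attribute access on its handle, for instance `getattr(handles[rank], "smpi_simulated_main_")`, which corresponds to a dlsym() on that copy. Note that the temporary directory must not be mounted noexec, otherwise the loader cannot map the code.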

This approach is very interesting for the following reasons:

  • Only standard tools are used (even if not for their original purpose), with no hard patch to maintain
  • Each MPI rank has its own copy of the global data, so ranks can even run in parallel (which is not possible with the current remapping madness), and context switches become really cheap again
  • Users can easily choose which libraries they want to share and which libraries they want to keep private to each MPI rank: they just have to statically link against the private libraries. Note however that linking statically against a given library may be difficult for a binary compiled with -fPIE. Doable, but not part of the easy options of any existing build system.
  • It should be portable to all our current target architectures. smpicc and smpirun may need to collaborate for that (e.g. adding -fPIE for you), but no major difficulty is foreseen.

It presents the following drawbacks:

  • It uses a lot of space on disk. It'd be good if tmpfs implemented copy-on-write, but I'm not willing to modify the kernel to speed SimGrid up :)
  • I think that updating the MC to benefit from this is not a reasonable task. Adding yet another level of indirection to the MC may only result in chaos and madness.
  • Someone should finish and test the implementation. Missing bits before integration:
    • Do not require a recompile to switch between privatization methods. Keeping -Dmain=smpi_simulated_main_ on the compilation line in smpicc and loading that symbol dynamically instead of main should be sufficient.
    • Test for the features we need (in particular PUNCH_HOLE when we start using it)
    • Add a command line option controlling which privatization the user wants
    • Document this

Contributor

sthibaul commented Feb 16, 2017

As a side comment: -fPIE is currently used by default for building official Debian packages, so it's quite a safe option to enable :) Only applications which do horrible things with symbols have problems.

Another idea that came up in the brainstorm was to trick ld: to determine whether it's the same library loaded several times, it uses stat() on the binary. Ptrace can be used to change the result of stat during the load. That would allow us to avoid the copy on systems where we can use ptrace.
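The stat()-based check can be observed from user space. The small sketch below (helper names and file names are made up for the illustration) compares the (device, inode) pair of a file with that of a second name for it and with that of a real copy: only the copy looks like a different file, which is why a loader relying on this pair would load it again.

```python
import os
import shutil
import tempfile


def file_identity(path):
    """Return the (device, inode) pair that stat() reports for path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def demo_identities():
    """Compare a symlink and a real copy against the original file.

    Returns a pair of booleans: (symlink has the same identity as the
    original, copy has the same identity as the original).
    """
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "app.bin")  # placeholder, not a real binary
        with open(original, "wb") as f:
            f.write(b"\x7fELF placeholder")
        alias = os.path.join(tmp, "app-alias.bin")
        os.symlink(original, alias)  # stat() follows the link
        copy = os.path.join(tmp, "app-copy.bin")
        shutil.copyfile(original, copy)
        return (file_identity(alias) == file_identity(original),
                file_identity(copy) == file_identity(original))
```

Changing the result of stat() with ptrace would amount to faking the second element of that pair during the load.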

Collaborator

randomstuff commented Feb 17, 2017

s/ld/ld.so/

to trick ld: to determine whether it's the same library loaded several times, it uses stat() on the binary

Apparently, this happens in ld.so so we can't LD_PRELOAD. We could either patch ld.so or use syscall interposition (strace) but neither option is super pleasant.

I kind of like the idea of relying on reflink (filesystem-level COW copy, see man cp, man ioctl_ficlone) but this is supported by only a few filesystems such as btrfs (and not ext), which means a lot of people would not benefit from it without some extra setup. It'd be really nice to have support for this in tmpfs.

Collaborator

randomstuff commented Feb 17, 2017

I find it kind of disappointing that nobody's considering implementing our own dynamic linker :)

Member

mquinson commented Feb 17, 2017

Collaborator

randomstuff commented Feb 17, 2017

Reflinking if possible and falling back to (plain copy + mmap) looks like a reasonable plan for a first iteration.

Collaborator

randomstuff commented Feb 17, 2017

they just have to statically link against the private libraries

As a side note, they must first compile a (static + PIC) version of the libraries, which might involve hacking the build system of the library: .a files are (usually) compiled without PIC.

Collaborator

randomstuff commented Feb 17, 2017

So I wrote a quick-and-dirty POC (smpi-privatization-dlopen branch) based on cp+dlopen which seems to behave quite nicely:

91% tests passed, 22 tests failed out of 235

Total Test time (real) =  20.07 sec

The following tests FAILED:
	167 - smpi-energy-thread (Failed)
	256 - tesh-smpi-macro-shared-thread (Failed)
	259 - tesh-smpi-coll-allgather-thread (Failed)
	262 - tesh-smpi-coll-allgatherv-thread (Failed)
	265 - tesh-smpi-coll-allreduce-thread (Failed)
	268 - tesh-smpi-coll-alltoall-thread (Failed)
	271 - tesh-smpi-coll-alltoallv-thread (Failed)
	274 - tesh-smpi-coll-barrier-thread (Failed)
	277 - tesh-smpi-coll-bcast-thread (Failed)
	280 - tesh-smpi-coll-gather-thread (Failed)
	283 - tesh-smpi-coll-reduce-thread (Failed)
	286 - tesh-smpi-coll-reduce-scatter-thread (Failed)
	289 - tesh-smpi-coll-scatter-thread (Failed)
	292 - tesh-smpi-macro-sample-thread (Failed)
	295 - tesh-smpi-pt2pt-dsend-thread (Failed)
	298 - tesh-smpi-pt2pt-pingpong-thread (Failed)
	301 - tesh-smpi-type-hvector-thread (Failed)
	304 - tesh-smpi-type-indexed-thread (Failed)
	307 - tesh-smpi-type-struct-thread (Failed)
	310 - tesh-smpi-type-vector-thread (Failed)
	313 - tesh-smpi-bug-17132-thread (Failed)
	316 - tesh-smpi-timers-thread (Failed)

The threads are not (yet) so happy however.

Member

mquinson commented Feb 17, 2017

Wooot! that's incredible, you definitely rulez.

Who cares about the slow threads when you have a breathtaking privatization scheme?

Collaborator

randomstuff commented Feb 17, 2017

After re-mmap-ing the text segment of the original executable file we could punch holes into the copied file with fallocate() + FALLOC_FL_PUNCH_HOLE.

^ This is probably useless.

Member

mquinson commented Feb 17, 2017

Ok, so we have a working version. Now we could try to optimize for space.

Some ideas for disk space:

  • add '--reflink=auto' to the cp so that the happy users of btrfs automatically benefit from it
  • use fallocate() with FALLOC_FL_PUNCH_HOLE to remove the useless parts of the files (both the code at the beginning and the DWARF at the end) so that the corresponding disk blocks are freed as soon as possible. More complex, but it also works on ext4

Idea for memory space:

  • we have to unmap the code areas, and remap them onto the original copy (as discussed earlier)

In both cases, we need to read the ELF sections to know what to punch_hole on disk (resp. to remap in memory). libelf is currently mandatory only if we activate MC, but adding it as a dependency of that privatization is not such a big deal, I'd say.
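To give an idea of what "reading the ELF sections" involves, here is a small sketch that parses the section header table by hand with the struct module instead of libelf (a deliberate substitution to keep the example self-contained). It only handles little-endian ELF64 files without extended section numbering, and the function names as well as the choice of which sections count as shareable are assumptions made for this illustration.

```python
import struct

# Layouts of Elf64_Ehdr and Elf64_Shdr for little-endian files.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")


def elf64_sections(path):
    """Return {section name: (file offset, size)} for an ELF64 file."""
    with open(path, "rb") as f:
        data = f.read()
    # Magic number, 64-bit class (2) and little-endian encoding (1).
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    fields = EHDR.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = (fields[6], fields[11],
                                         fields[12], fields[13])
    if shnum == 0:
        return {}
    headers = [SHDR.unpack_from(data, shoff + i * shentsize)
               for i in range(shnum)]
    # Section names live in the string table designated by e_shstrndx.
    strtab_offset = headers[shstrndx][4]
    sections = {}
    for name_index, _type, _flags, _addr, offset, size, *_rest in headers:
        start = strtab_offset + name_index
        end = data.index(b"\0", start)
        sections[data[start:end].decode()] = (offset, size)
    return sections


def shareable_ranges(sections):
    """Pick the ranges that need no private copy: code, constants, DWARF."""
    return {name: rng for name, rng in sections.items()
            if name in (".text", ".rodata") or name.startswith(".debug_")}
```

The (offset, size) pairs returned by `shareable_ranges` are the candidates for hole punching on disk, and their in-memory counterparts are the areas to remap onto the original.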

Collaborator

randomstuff commented Feb 24, 2017

Updated the code in order to have privatization disabled by default.

Things to do for a first version (optimization will come later):

  • Repair classic privatization mode
  • Windows support
  • MacOS X support
  • Fix smpi_main handling (PATH, installation, etc.)
  • Cleanup
  • Don't rely on external cp command

Collaborator

randomstuff commented Feb 24, 2017

Opened a pull request to discuss the code.

Member

mquinson commented May 5, 2017

I think that even if punch_hole()ing is not done yet, this issue can now be closed. The feature is implemented and put in production on all platforms we have. The only incompatibility is with threads on Mac and BSD, so we switch to raw contexts on these platforms. But what is broken on FreeBSD is the backup implementation used instead of mmap. So this is very mature to me.

As for the PUNCH_HOLE thing, it is not actually needed in practice. Other parts are hurting our performance.

So. What do you guys think of this bug?

Collaborator

randomstuff commented May 8, 2017

Punch-holing the file is probably not that useful (and would not play nicely with debuggers/profilers/etc.). However, sharing the text segment is not implemented and might help. I'm not sure if debuggers/profilers would be so happy about it however.

Member

mquinson commented May 8, 2017

Contributor

sthibaul commented May 10, 2017

I guess punching the hole before remapping only leads to a SIGBUS if you access the corresponding page before remapping, so not a big deal.

@mquinson mquinson referenced this issue May 31, 2017

Closed

SMPI privatization based on dlopen #140

7 of 13 tasks complete

@mquinson mquinson closed this in d12368e Dec 26, 2017
