This repository has been archived by the owner on Nov 24, 2022. It is now read-only.

[WIP] Add some parallelization to ahc-ld and ahc-link #622

Draft
wants to merge 1 commit into master

Conversation


@gkaracha gkaracha commented Apr 27, 2020

This is WIP. TODOs as described in #621:

  • When loading archives and object files in ahc-ld, we can parallelize the deserialization of each object file. All object files are converted to ByteStrings first, either via direct reading or ArchiveEntry, then the deserialization can be performed in parallel.
  • After the gc-sections pass is run, the shrunk AsteriusModule should be fully evaluated, and this can be done in parallel as well.
  • In the binaryen backend, we can parallelize the marshaling of different data segments and functions. binaryen will transparently switch to a new allocator when it notices it's allocating an IR node on a different thread, so we should ensure each Haskell worker thread is pinned using forkOn.
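
As a rough illustration of the pinning mentioned in the last bullet (a hypothetical helper sketch, not code from this PR):

```haskell
import Control.Concurrent
import Control.Monad

-- Hypothetical sketch of capability pinning: forkOn fixes each worker
-- to one capability, so binaryen keeps seeing the same thread (and
-- thus the same allocator) for a given worker index.
pinnedWorkers :: Int -> (Int -> IO ()) -> IO ()
pinnedWorkers n body = do
  dones <- replicateM n newEmptyMVar
  forM_ (zip [0 ..] dones) $ \(i, done) ->
    forkOn i (body i >> putMVar done ())
  mapM_ takeMVar dones -- wait for all pinned workers to finish
```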

@gkaracha gkaracha linked an issue Apr 28, 2020 that may be closed by this pull request
asterius/app/ahc-ld.hs (outdated review thread, resolved)
asterius/src/Asterius/Internals/Parallel.hs (outdated review thread, resolved)
@@ -45,9 +46,13 @@ data LinkTask
 loadTheWorld :: LinkTask -> IO AsteriusCachedModule
 loadTheWorld LinkTask {..} = do
   ncu <- newNameCacheUpdater
-  lib <- mconcat <$> for linkLibs (loadAr ncu)
   objs <- rights <$> for linkObjs (tryGetFile ncu)
+  lib <- parallelFor 1 linkLibs (loadAr ncu) -- TODO: Parameterize
Member

loadAr is already parallelized, and here we use parallelFor again, which creates nested parallelism. This will still run, but overall efficiency will be reduced.

We need to either:

  • Refactor loadAr, so that it only returns the ArchiveEntrys, and after we collect all the ArchiveEntrys, parallelize their deserialization (along with objs)
  • Improve our parallelization framework so that it supports nested parallelism; each job can spawn new jobs to be evaluated somewhere else.
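
The first option is essentially a flat two-phase scheme; as a rough illustration (hypothetical names and types, not the actual asterius code):

```haskell
import Control.Concurrent
import Control.Monad

-- Phase 1 (done by the caller): gather all raw entries sequentially,
-- e.g. ArchiveEntry payloads plus directly-read object files.
-- Phase 2 (below): deserialize everything in one flat parallel pass,
-- so parallelFor is never invoked from inside another parallelFor.
flatDecode :: Int -> [raw] -> (raw -> a) -> IO [a]
flatDecode n entries decode = do
  boxes <- forM entries (const newEmptyMVar)
  forM_ (zip3 [0 ..] entries boxes) $ \(i, e, box) ->
    forkOn (i `mod` n) (putMVar box $! decode e) -- force on the worker
  mapM takeMVar boxes                            -- results in input order
```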

Member Author

I went with the first approach:

Refactor loadAr, so that it only returns the ArchiveEntrys, and after we collect all the ArchiveEntrys, parallelize their deserialization (along with objs)

but I don't completely understand what you mean by the second bullet:

Improve our parallelization framework so that it supports nested parallelism; each job can spawn new jobs to be evaluated somewhere else.

Nested parallelism is already allowed, right? I understand that nested calls would pin threads to the same capabilities (hence slower), but it'd still work. What is the alternative solution you mean here?

Member
The reason will be displayed to describe this comment to others. Learn more.

Ah, the second approach elaborated is:

  • We have a P monad which you can lift IO actions into. It carries the context (thread pool, task queue) needed for all parallelism
  • There's a fork :: P a -> P (Future a) method which spawns a P a action for parallel execution. The P a action itself can fork new jobs, which is what gives us nested parallelism.
  • There's a join :: Future a -> P a method which blocks until the result is available.

Going through all the trouble of creating the P monad abstraction has an advantage: we can enforce the number of worker threads to match CPU core count all the time. Directly calling parallelFor inside parallelFor would be much less efficient.
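
For the record, a minimal sketch of such a P monad could look like this (entirely hypothetical API, not code from this PR; a real version would also have to handle the case where every worker blocks in join on jobs still sitting in the queue):

```haskell
import Control.Concurrent
import Control.Monad

-- A hypothetical P monad: a function from the shared task queue to IO.
newtype P a = P { unP :: Chan (IO ()) -> IO a }

instance Functor P where
  fmap f (P g) = P (fmap f . g)

instance Applicative P where
  pure x = P (const (pure x))
  P f <*> P x = P (\q -> f q <*> x q)

instance Monad P where
  P x >>= k = P (\q -> x q >>= \a -> unP (k a) q)

-- Lift an IO action into P.
liftP :: IO a -> P a
liftP = P . const

newtype Future a = Future (MVar a)

-- Enqueue a job; some pinned worker will run it and fill the MVar.
-- The job receives the queue itself, so it can fork further jobs:
-- this is the nested parallelism described above.
fork :: P a -> P (Future a)
fork (P job) = P $ \q -> do
  box <- newEmptyMVar
  writeChan q (job q >>= putMVar box)
  pure (Future box)

-- Block until a forked job's result is available.
joinF :: Future a -> P a
joinF (Future box) = P (const (takeMVar box))

-- Run with exactly n worker threads, pinned one per capability, so the
-- worker count can be held at the CPU core count at all times.
runP :: Int -> P a -> IO a
runP n (P act) = do
  q <- newChan
  forM_ [0 .. n - 1] $ \i -> forkOn i (forever (join (readChan q)))
  act q
```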

asterius/src/Asterius/Internals/Parallel.hs (outdated review thread, resolved)
asterius/src/Asterius/Internals/Parallel.hs (outdated review thread, resolved)
@gkaracha
Member Author

@TerrorJack, I think it is ready for reviewing now. I skipped parallelization of data segments in the binaryen backend as we discussed so that it doesn't backfire on us. I have also left a couple of TODOs in the comments I'd like you to have a look at if you can.

@gkaracha gkaracha marked this pull request as ready for review April 30, 2020 10:47
asterius/src/Asterius/Types.hs (outdated review thread, resolved)
@@ -103,6 +103,11 @@ parseTask args = case err_msgs of
in if i >= 0 && i <= 2
then t {shrinkLevel = i}
else error "Shrink level must be [0..2]",
str_opt "thread-pool-size" $ \s t ->
Member

Don't forget to pass the --thread-pool-size option to ahc-ld as well. ahc-ld is called as the linker executable by ahc, so search for --optl in this module and see how other ahc-ld options are being passed.
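
For illustration, the forwarding could be as simple as prefixing each linker-bound flag (hypothetical helper; the exact spelling in the module may differ):

```haskell
-- ahc drives GHC, and GHC invokes ahc-ld as the linker, so flags meant
-- for ahc-ld must be wrapped in --optl for GHC to pass them through.
mkLinkerOpt :: Int -> [String]
mkLinkerOpt poolSize =
  map ("--optl" <>) ["--thread-pool-size=" <> show poolSize]
```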

Member Author

Good catch! Should be ok now.

@TerrorJack
Member

We have to hold this one for a bit longer; I gave this a try on my build server, and the overall clock time of linking pandoc has actually increased with --thread-pool-size=8 compared to 1. So we need to investigate further:

  • Are we picking the right places for parallelization?
  • Are we implementing parallelFor correctly?

Merely reading the flamegraph doesn't seem to help much either. I think we also need to investigate the eventlogs with tools like ThreadScope.

@gkaracha
Member Author

gkaracha commented May 5, 2020

@TerrorJack I took your advice and tried out threadscope:

  • Are we implementing parallelFor correctly?

The implementation of parallelFor was actually not OK; laziness got in the way. Hopefully this is now fixed (see the latest commit 06efdb9).
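
For illustration, the kind of fix involved (a hypothetical reconstruction, not the actual commit): a naive parallelFor can write unevaluated thunks into the result MVars, deferring all the real work to the single consuming thread; deep-forcing each result on its worker thread avoids that.

```haskell
import Control.Concurrent
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Control.Monad

-- parallelFor sketch: n pinned workers, each deep-forcing its result
-- *before* putMVar, so the work really happens on the worker thread.
parallelFor :: NFData b => Int -> [a] -> (a -> IO b) -> IO [b]
parallelFor n xs f = do
  boxes <- forM xs (const newEmptyMVar)
  forM_ (zip3 [0 ..] xs boxes) $ \(i, x, box) ->
    forkOn (i `mod` n) (f x >>= evaluate . force >>= putMVar box)
  mapM takeMVar boxes
```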

  • Are we picking the right places for parallelization?

Not sure about that, but if you could have another go at linking pandoc we might find out more about this.

@TerrorJack
Member

Nice finding about the missed laziness in parallelFor. Do you have updated numbers for the Cabal test case locally, for thread pool sizes 1, 2, 4, 8?

@gkaracha
Member Author

@TerrorJack some updates on this one:

Inspired by your fix for #663 (0085dcf), I tried giving up on parallelization in loadTheWorld (56a5aeb), which basically combines all the object files into one eagerly. Indeed, this took the OOM errors away, but it also showed no benefit whatsoever from parallelizing the rest. For

import Distribution.Simple
main = defaultMain

wall clock time only seems to be getting worse:

$ time stack build --exec "ahc-link --thread-pool-size=1 --input-hs perf-test/Setup.hs"
real    0m46,915s
user    3m56,870s
sys     0m2,098s

$ time stack build --exec "ahc-link --thread-pool-size=2 --input-hs perf-test/Setup.hs"
real    0m53,966s
user    4m2,502s
sys     0m2,822s

$ time stack build --exec "ahc-link --thread-pool-size=4 --input-hs perf-test/Setup.hs"
real    0m48,266s
user    4m2,818s
sys     0m2,243s

$ time stack build --exec "ahc-link --thread-pool-size=8 --input-hs perf-test/Setup.hs"
real    0m48,725s
user    4m8,747s
sys     0m3,086s

@gkaracha gkaracha marked this pull request as draft May 27, 2020 09:19
@TerrorJack
Member

@gkaracha Thanks for the update! Indeed, it seems that adding parallelization will only pay off after we further cut down the memory usage of the single-threaded case.

@gkaracha gkaracha changed the title Add some parallelization to ahc-ld and ahc-link [WIP] Add some parallelization to ahc-ld and ahc-link Jul 13, 2020
@gkaracha gkaracha self-assigned this Aug 11, 2020
Successfully merging this pull request may close these issues.

Implement parallel ahc-ld/ahc-link
2 participants