
Parallel map2 optimization #3428

Merged
merged 12 commits into typelevel:series/3.4.x on Apr 20, 2023
Conversation

durban
Contributor

@durban durban commented Feb 17, 2023

Context:

The "old" parallel `map2` used `both`; #2159 optimized it to avoid tuples, but that implementation was incorrect; #2239 fixed it by starting 2 extra fibers.

What is this:

  • Wrote some extra tests to detect possible changes in behavior.
  • In 9e9d015 restored the "old" implementation with `both` (as allocating a few tuples might be cheaper than starting 2 extra fibers).
  • In 2c4bc68 created a new implementation, which avoids both tuples and starting extra fibers. It's unclear if this actually helps (benchmark results pending).
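For illustration, the `both`-based strategy can be sketched against the public `IO` API (a minimal sketch of the idea only; the real implementation works on `IO.Par` internals, not on a standalone helper like this):

```scala
import cats.effect.IO

// Sketch of the "old" strategy: run both effects concurrently via IO.both,
// which yields a Tuple2, then map the combining function over that tuple.
// The allocation being discussed is the Tuple2 created per call.
def parMap2ViaBoth[A, B, C](fa: IO[A], fb: IO[B])(f: (A, B) => C): IO[C] =
  IO.both(fa, fb).map { case (a, b) => f(a, b) }
```

The tuple-free alternatives trade this allocation for extra machinery (extra fibers in #2239, or the nesting trick in 2c4bc68).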

@durban durban changed the title Parallel map2 test Parallel map2 optimization Feb 20, 2023
@durban
Contributor Author

durban commented Feb 20, 2023

Benchmark results:

4995c298c (~series/3.4.x):

[info] Benchmark                      (cpuTokens)  (size)   Mode  Cnt    Score   Error  Units
[info] ParallelBenchmark.parTraverse        10000    1000  thrpt   25  377.757 ± 4.005  ops/s


9e9d01551 (old `map2` implementation with `both`):

[info] Benchmark                      (cpuTokens)  (size)   Mode  Cnt    Score   Error  Units
[info] ParallelBenchmark.parTraverse        10000    1000  thrpt   25  422.737 ± 1.574  ops/s


5ffdb35bb (new `map2` implementation without tuples):

[info] Benchmark                      (cpuTokens)  (size)   Mode  Cnt    Score   Error  Units
[info] ParallelBenchmark.parTraverse        10000    1000  thrpt   25  410.672 ± 2.215  ops/s

So yeah, the old (and simplest) implementation (with `both`) is the fastest. My implementation is faster than the current one, but still slower than the old one.

I propose going back to the old implementation.

@djspiewak
Member

First off, this is a clever and subtle bit of fiber ordering. :-) The supervisor fibers weren't meant to remove the tuple, but rather meant to remove the racePair without requiring Deferred. I didn't think of ordering the nesting in this fashion to eliminate the asymmetry in error cancelation. This is much nicer.

Regarding the tuple-vs-not-tuple, we might want to hold off on that until we adjust traverseViaChain (in Cats). I really would be pretty surprised if tupling were faster, and there might be some confounding factors there.

Finally, parTraverse is a better benchmark than what we had at the time iirc. Overall, I think this is a win. Thanks for looking at this!

@durban
Contributor Author

durban commented Feb 20, 2023

Yeah, it was a nice puzzle to figure out :-)

Avoiding racePair: I assumed the reason for avoiding racePair was to avoid the tuple allocation(s)?

Optimizing traverseViaChain in Cats: sure, sounds good. Should this PR be kept open until then, or should it be merged now?

@djspiewak
Member

> Avoiding racePair: I assumed the reason for avoiding racePair was to avoid the tuple allocation(s)?

Also the `Either`. Also the complexity. Etc., etc.
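To make the complexity concrete (a hypothetical sketch, not code from this PR): a `map2` built directly on `racePair` has to pattern-match an `Either` whose branches each carry an `(Outcome, Fiber)` pair, embed the winner's outcome, and join the loser:

```scala
import cats.effect.IO

// Hypothetical sketch of map2 over racePair, showing the Either and the
// nested (Outcome, Fiber) tuples that must be unpacked. Canceling the
// loser when the winner errors is omitted for brevity; handling that
// correctly is part of the subtlety being avoided.
def map2ViaRacePair[A, B, C](fa: IO[A], fb: IO[B])(f: (A, B) => C): IO[C] =
  IO.racePair(fa, fb).flatMap {
    case Left((outA, fiberB)) =>
      outA.embedNever.flatMap(a => fiberB.joinWithNever.map(b => f(a, b)))
    case Right((fiberA, outB)) =>
      outB.embedNever.flatMap(b => fiberA.joinWithNever.map(a => f(a, b)))
  }
```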

> Optimizing traverseViaChain in Cats: sure, sounds good. Should this PR be kept open until then, or should it be merged now?

I'm going to just knock off the traverseViaChain optimization as soon as I can nail the bug I'm working on, so let's sit on this for a few days and see where we net out. This PR may also make it harder for me to isolate the specific bug in question. If I don't get to traverseViaChain by this time next week, let's go ahead with this PR.

@armanbilge
Member

> If I don't get to traverseViaChain by this time next week, let's go ahead with this PR.

Since you are so keen on it, I can look at it too :) Is the idea basically to revive this PR: typelevel/cats#3993?

@djspiewak
Member

djspiewak commented Feb 20, 2023

> Since you are so keen on it, I can look at it too :) is the idea basically to revive this PR

No, simpler than that. Just dynamically typecase on the `Applicative` in the two `traverseViaChain` methods. If it extends `Defer` (or `StackSafeMonad`; both are options here, but maybe the former is more correct), then we should skip the `Eval`s and use `map2` instead of `map2Eval`. We also need to double-check that this runtime type is actually present on the `Parallel` applicative presented by Cats Effect.

Should be accompanied by some benchmarks to demonstrate that it's worth it. We can benchmark on Eval in Cats, though obviously IO is the big winner here.
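The dispatch described above might look roughly like this (a hypothetical sketch; the helper name `combineStep` is illustrative and not part of the actual Cats internals):

```scala
import cats.{Apply, Defer, Eval}

// Hypothetical sketch of the suggested typecase: when the Apply instance
// is known to be stack-safe (detected here via Defer), force the second
// effect eagerly and combine with map2, skipping the Eval wrapping;
// otherwise fall back to the lazy map2Eval path that traverseViaChain
// uses today.
def combineStep[F[_], A, B, C](F: Apply[F])(fa: F[A], fb: Eval[F[B]])(
    f: (A, B) => C): Eval[F[C]] =
  F match {
    case _: Defer[_] =>
      Eval.now(F.map2(fa, fb.value)(f)) // stack-safe: no Eval needed
    case _ =>
      F.map2Eval(fa, fb)(f) // default: preserve laziness via Eval
  }
```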

@djspiewak
Member

I think we shouldn't hold this PR for the traverse changes. :-) I'm pretty convinced they'll be very complementary.

@durban
Contributor Author

durban commented Mar 8, 2023

Okay. Should this be merged as is, or should I add a commit to go back to the old implementation (with both), which currently seems faster (see the benchmark results above)?

@djspiewak
Member

Sorry, forgot to reply. Let's just go back to the old implementation. I suspect we can still do better, but for now, let's do that.

@durban durban marked this pull request as ready for review April 19, 2023 20:08
@djspiewak djspiewak merged commit 300469b into typelevel:series/3.4.x Apr 20, 2023
26 checks passed