
Optimizations for FiberRuntime runloop #8800

Merged

4 commits merged into zio:series/2.x from the fiber-runtime-optimizations branch on May 1, 2024

Conversation

@kyri-petrou (Contributor) commented Apr 28, 2024

This PR contains a few optimizations and micro-optimizations for the FiberRuntime runloop that I noticed while previously working on #8745, but wanted to tackle separately. Let's dig into what has been optimized.

1. Checking for messages while running

This PR introduces a wrapper over ConcurrentLinkedQueue which has a weakly consistent unsafeIsEmpty method for checking if the inbox contains messages. Since we check for new messages before evaluating every single effect in the runloop, the overhead of calling poll() just to check whether the queue is empty is non-negligible.

Since adding messages from outside the FiberRuntime is extremely rare (in most cases it's for interruption), I think we can get away with this weakly consistent check in the runLoop itself. Note that we should never rely on the unsafeIsEmpty method at any other point.
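
Roughly, a minimal sketch of such a wrapper (the element type is left generic here; the real implementation wraps FiberMessage, and method names besides poll and unsafeIsEmpty are illustrative):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Sketch: a queue wrapper whose emptiness flag is deliberately weakly
// consistent, so the hot runloop path can avoid touching the queue itself.
private final class Inbox[A] {
  private[this] val queue    = new ConcurrentLinkedQueue[A]()
  private[this] var _isEmpty = true // plain (non-volatile) field

  def add(message: A): Unit = {
    queue.add(message)
    _isEmpty = false // the runloop thread may observe this late; acceptable here
  }

  def poll(): A = {
    val message = queue.poll()
    _isEmpty = queue.isEmpty
    message
  }

  // Weakly consistent: only valid as a fast-path hint inside the runloop.
  def unsafeIsEmpty: Boolean = _isEmpty
}
```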

2. Stack optimizations

The main optimization here is that we avoid setting entries to null whenever we "pop" an entry from the stack while we are in the "shallow" part of the stack (idx < 128). The main reason for this is that entries in the shallow part of the stack are likely to be overwritten automatically as the stack pointer moves up and down, so we don't need to manually GC them; see the sketch below.
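
A sketch of the pop path, assuming a threshold of 128 and the field and method names from the diff discussed further down:

```scala
// Sketch: skip the null-out write for shallow stack entries; they will be
// overwritten soon anyway as the stack pointer moves up and down.
@inline
private def popStackFrame(nextStackIndex: Int): Unit =
  if (nextStackIndex >= 128) _stack(nextStackIndex) = null // GC only the deep part
```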

The other (micro-)optimization regarding the stack is that we initialize it when we first start evaluating the effect, which avoids repeatedly checking whether it's null during the runloop. Since the _stack would be initialized by any kind of effect other than a Sync or Exit anyway, the only drawback of this is a very small overhead when we fork things like ZIO.unit. However, since realistically all forked effects will contain at least one effect that needs to initialize the stack, I think it's better not to initialize it dynamically.
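
A sketch of the eager initialization (the initial capacity of 128 and the helper name are assumptions for illustration):

```scala
// Sketch: allocate the stack once, before entering the runloop, so the hot
// loop never needs a per-iteration `_stack eq null` check.
private[this] var _stack: Array[AnyRef] = _

private def initStack(): Unit =
  if (_stack eq null) _stack = new Array[AnyRef](128)
```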

3. Updating _lastTrace

Currently, we update _lastTrace whenever the current trace is neither null nor empty. However, since ZIO's methods take an implicit trace which is propagated to all methods they call, it's very common for _lastTrace to be updated with the same value multiple times. Since reading the variable is much cheaper than writing to it, we now first check whether the current trace differs from the old one.
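
A sketch of the guard (updateLastTrace and _lastTrace appear in the diff below; the reference comparisons and the Trace.empty check are assumptions about how traces are compared):

```scala
// Sketch: a read is much cheaper than a write, so only write the field
// when the incoming trace is meaningful and actually different.
private def updateLastTrace(newTrace: Trace): Unit =
  if ((newTrace ne null) && (newTrace ne Trace.empty) && (_lastTrace ne newTrace))
    _lastTrace = newTrace
```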

I've also added some comments in the PR below with questions / remarks.

Benchmarking results

I only ran the NarrowFlatMap and BroadFlatMap benchmarks using 1 thread; let me know if you think we need to run other benchmarks as well.

TLDR:

  • ~25% increase in throughput for "narrow" flatmaps (i.e., the stack doesn't need to resize)
  • ~10% increase in throughput for "broad" flatmaps

series/2.x:

| Benchmark                                | size | Mode  | Cnt |     Score   |   Error  | Units |
|------------------------------------------|------|-------|-----|-------------|----------|-------|
| BroadFlatMapBenchmark.zioBroadFlatMap    | 20   | thrpt | 5   | 2838.735    | 19.254   | ops/s |
| NarrowFlatMapBenchmark.zioNarrowFlatMap  | 1000 | thrpt | 5   | 66646.224   | 829.889  | ops/s |

PR:

| Benchmark                               | size | Mode  | Cnt |     Score   |   Error  | Units |
|-----------------------------------------|------|-------|-----|-------------|----------|-------|
| BroadFlatMapBenchmark.zioBroadFlatMap   | 20   | thrpt | 5   | 3154.488    | 53.074   | ops/s |
| NarrowFlatMapBenchmark.zioNarrowFlatMap | 1000 | thrpt | 5   | 84474.125   | 866.596  | ops/s |


private final class Inbox {
  private[this] val queue    = new java.util.concurrent.ConcurrentLinkedQueue[FiberMessage]()
  private[this] var _isEmpty = true
@kyri-petrou (Contributor, Author) commented:

Should this be made volatile? The only scenario that kind of worries me is the following:

  1. Thread 1 (the runLoop thread) empties the queue (this implies that a message was also added in the previous iteration)
  2. Thread 1: calls queue.isEmpty and sets _isEmpty to true
  3. Thread 2 (external) adds a message to the queue immediately after
  4. Thread 2: sets _isEmpty = false, but the delayed write from step (2) overrides it with _isEmpty = true

The chances of this exact sequence of events happening are astronomically small to begin with, but is it something we need to cater for? 🤔

@jdegoes (Member) commented:

This can happen and therefore will happen (Murphy's Law of Concurrency), and making it volatile won't help; you'd have to use an atomic integer to track emptiness if you really wanted to fix it (which would probably defeat the optimization).

I am torn on whether to deal with this now, or bite the bullet and do another ticket (that I have yet to write) on creating a highly optimized concurrent mailbox just for fiber runloop.
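
For reference, a sketch of what the atomic-integer alternative mentioned above might look like (illustrative only; as noted, the extra contention on the counter would likely defeat the optimization):

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Sketch: an exact emptiness check, at the cost of a CAS per add/poll.
private final class CountingInbox[A] {
  private[this] val queue = new ConcurrentLinkedQueue[A]()
  private[this] val size  = new AtomicInteger(0)

  def add(message: A): Unit = {
    size.incrementAndGet() // increment first so isEmpty never falsely reports true
    queue.add(message)
  }

  def poll(): A = {
    val message = queue.poll()
    if (message ne null) size.decrementAndGet()
    message
  }

  def isEmpty: Boolean = size.get() == 0
}
```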

@kyri-petrou (Contributor, Author) commented:

@jdegoes I see your point! I'll revert this change in the PR since it's now tracked by #8807, and have a go at tackling it separately.

var message = inbox.poll()

// Unfortunately we can't avoid the virtual call to `trace` here
if (message ne null) updateLastTrace(cur.trace)
@kyri-petrou (Contributor, Author) commented:

It seems this was missing (I must have missed it when I worked on #8671) and was causing a test to be flaky. This shouldn't add any performance overhead since we very rarely process messages from the inbox.

@jdegoes (Member) commented:

👍

@kyri-petrou marked this pull request as ready for review on April 28, 2024, 01:18
@kyri-petrou (Contributor, Author) commented:

@jdegoes Would you be able to review this PR? You might be able to spot some flaws in the optimizations that I might have missed

@@ -836,11 +844,6 @@ final class FiberRuntime[E, A](fiberId: FiberId.Runtime, fiberRefs0: FiberRefs,
startStackIndex: Int,
currentDepth: Int
): Exit[Any, Any] = {
assert(running.get)
@kyri-petrou (Contributor, Author) commented:

I removed this since we already check that we're running prior to calling this method; that way we avoid calling it repeatedly.

Or should it be added back as a safeguard, in case we introduce a bug that sets the flag to false while the fiber is still running?

@jdegoes (Member) commented:

I think it only has overhead if assertions are enabled for the JVM (albeit that might be all the time). It's mainly designed for bug detection.

@kyri-petrou (Contributor, Author) commented:

I'll add it back 👍

@kyri-petrou (Contributor, Author) commented:

By the way, I just had a look at assert; it seems that assertion elision is controlled at compile time, not at the JVM level. From the assert scaladoc:

> A set of assert functions are provided for use as a way to document
> and dynamically check invariants in code. Invocations of assert can be elided
> at compile time by providing the command line option -Xdisable-assertions,
> which raises -Xelide-below above elidable.ASSERTION, to the scalac command.

We should probably use the -Xdisable-assertions compiler flag when we generate the published artifacts, but that's probably better done in a separate PR.
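
For instance, a sketch of how that might look in sbt (the release/snapshot split is an illustrative assumption, not ZIO's actual build settings):

```scala
// build.sbt (sketch): elide assert(...) calls from published release
// artifacts while keeping them active for development and tests.
scalacOptions ++= {
  if (isSnapshot.value) Seq.empty[String]
  else Seq("-Xdisable-assertions")
}
```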

@jdegoes (Member) commented:

Great idea!

@@ -771,9 +764,24 @@ final class FiberRuntime[E, A](fiberId: FiberId.Runtime, fiberRefs0: FiberRefs,
}

@inline
private def popStackFrame(nextStackIndex: Int): Unit = {
  _stack(nextStackIndex) = null // GC
@kyri-petrou (Contributor, Author) commented:

The more I think about this, the more I realise that it's a pretty dangerous thing.

What if instead we didn't GC when nextStackIndex is below X (perhaps 300 to coincide with trampolining)?

@kyri-petrou (Contributor, Author) commented:

I decided to go with an auto-GC threshold (of 128) and updated the PR description to match the new approach.

@jdegoes (Member) commented:

Even though I love the performance improvement, I think people will complain if we are holding onto (unnecessary) memory for arbitrarily long periods of time.

One possibility is to clear out the entries when the run loop begins, basically by starting at stack index, and nulling until the first null.

This opens the door for making a null array, e.g. val NullData = Array.fill[AnyRef](...)(null) and then using the faster arraycopy to null out the extra entries.

However, we'd still have the problem of holding onto memory for an indeterminate amount of time.

How much does this single change contribute to the performance improvements?

@kyri-petrou (Contributor, Author) commented:

> One possibility is to clear out the entries when the run loop begins, basically by starting at stack index, and nulling until the first null.

If I'm understanding your recommendation correctly, I believe this is similar to what's currently implemented, except that instead of clearing out the entries when the runloop starts, we do it when it exits (which also means on every yield / async operation or termination).

Effectively we're holding on to objects unnecessarily only while evaluating synchronous effects, which in almost all cases should be a very small period of time, and only for those objects above the _stackIndex until the first null entry. Once the runloop finishes, the amount of memory we're holding on to will be the same as previously.

> This opens the door for making a null array, e.g. val NullData = Array.fill[AnyRef](...)(null) and then using the faster arraycopy to null out the extra entries.

It's currently done iteratively but I like this recommendation better 👍

> How much does this single change contribute to the performance improvements?

Between 5 and 10% increase in throughput, depending on the benchmark.
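
A sketch of the arraycopy-based clearing being discussed (the helper name, the NullData size, and stopping at the first null slot are illustrative assumptions):

```scala
// Sketch: an all-null source array lets us clear stack slots in bulk with
// System.arraycopy instead of one write per slot.
private[this] val NullData: Array[AnyRef] = new Array[AnyRef](128)

private def clearDeadEntries(stack: Array[AnyRef], from: Int): Unit = {
  // Find the end of the dead region: entries from `from` up to the first
  // slot that is already null.
  var end = from
  while (end < stack.length && (stack(end) ne null)) end += 1
  // Null them out in chunks of NullData.length.
  var i = from
  while (i < end) {
    val chunk = math.min(NullData.length, end - i)
    System.arraycopy(NullData, 0, stack, i, chunk)
    i += chunk
  }
}
```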

@jdegoes (Member) commented Apr 28, 2024

@kyri-petrou Will do a detailed review tomorrow!

@kyri-petrou force-pushed the fiber-runtime-optimizations branch from 4b80b81 to 184bc68 on May 1, 2024, 09:24
@jdegoes (Member) left a review comment:

Excellent work!

@jdegoes merged commit 9eb1270 into zio:series/2.x on May 1, 2024
21 checks passed
@kyri-petrou deleted the fiber-runtime-optimizations branch on May 1, 2024, 14:02