Optimizations for FiberRuntime runloop #8800
Conversation
private final class Inbox {
  private[this] val queue = new java.util.concurrent.ConcurrentLinkedQueue[FiberMessage]()
  private[this] var _isEmpty = true
Should this be made volatile? The only scenario that kind of worries me is the following:
1. Thread 1 (runLoop thread) empties the queue (this implies that a message was added right in the previous iteration as well)
2. Thread 1: calls `queue.isEmpty` and sets `_isEmpty` to `true`
3. Thread 2 (external) adds a message to the queue immediately after
4. Thread 2: sets `_isEmpty = false`, but somehow (2) overrides it with `_isEmpty = true`

The chances of this exact sequence of events happening are astronomically small to begin with, but is it something we need to cater for? 🤔
This can happen and therefore will happen (Murphy's Law of Concurrency), and making it volatile won't help; you'd have to use an atomic integer to track emptiness if you really wanted to fix it (which would probably defeat the optimization).
I am torn on whether to deal with this now, or bite the bullet and do another ticket (that I have yet to write) on creating a highly optimized concurrent mailbox just for fiber runloop.
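For illustration, a minimal sketch of what the atomic-integer approach mentioned above could look like. All names here (`CountingInbox`, `pending`) are hypothetical, not from the PR; the point is that the counter is only ever an over-estimate of the queue size, so a stale read can report "non-empty" spuriously but never lose a message:

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: track emptiness with an atomic counter instead of a
// plain (or volatile) boolean, closing the race described above at the cost
// of an atomic RMW on every add/poll.
final class CountingInbox[A <: AnyRef] {
  private[this] val queue   = new ConcurrentLinkedQueue[A]()
  private[this] val pending = new AtomicInteger(0)

  def add(message: A): Unit = {
    pending.incrementAndGet() // increment BEFORE enqueue: count may over-report, never under-report
    queue.offer(message)
    ()
  }

  def poll(): A = {
    val message = queue.poll()
    if (message ne null) pending.decrementAndGet() // decrement AFTER dequeue
    message
  }

  // Because the counter is always >= the real queue size between operations,
  // `isEmpty == true` implies the queue really is empty; a concurrent add can
  // only make this return a stale `false`, never mask a pending message.
  def isEmpty: Boolean = pending.get() == 0
}
```

The per-message `incrementAndGet`/`decrementAndGet` is exactly the contention the boolean flag was meant to avoid, which is why this trade-off is questionable for the runloop hot path.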
var message = inbox.poll()

// Unfortunately we can't avoid the virtual call to `trace` here
if (message ne null) updateLastTrace(cur.trace)
Seems that this was missing (I must have missed it when I worked on #8671) and was causing a test to be flaky. This shouldn't add any performance overhead since we very rarely process messages from the inbox.
👍
@jdegoes Would you be able to review this PR? You might be able to spot some flaws in the optimizations that I might have missed
@@ -836,11 +844,6 @@ final class FiberRuntime[E, A](fiberId: FiberId.Runtime, fiberRefs0: FiberRefs,
    startStackIndex: Int,
    currentDepth: Int
  ): Exit[Any, Any] = {
    assert(running.get)
I removed this since we're already checking that we're running prior to calling this method; that way we avoid calling it repeatedly.
Or should it be added back as a safeguard, in case we introduce a bug that sets the flag to `false` while the fiber is still running?
I think it only has overhead if assertions are enabled for the JVM (albeit that might be all the time). It's mainly designed for bug detection.
I'll add it back 👍
By the way, I just had a look at `assert`; it seems that assertion generation is controlled at compile time, not at the JVM level. From the `assert` scaladoc:

> A set of `assert` functions are provided for use as a way to document and dynamically check invariants in code. Invocations of `assert` can be elided at compile time by providing the command line option `-Xdisable-assertions`, which raises `-Xelide-below` above `elidable.ASSERTION`, to the `scalac` command.

We should probably use the `-Xdisable-assertions` compiler flag when we generate the published artifacts, but that's probably better done in a separate PR.
Great idea!
@@ -771,9 +764,24 @@ final class FiberRuntime[E, A](fiberId: FiberId.Runtime, fiberRefs0: FiberRefs,
  }

  @inline
  private def popStackFrame(nextStackIndex: Int): Unit = {
    _stack(nextStackIndex) = null // GC
The more I think about this, the more I realise that it's a pretty dangerous thing.
What if instead we didn't GC when `nextStackIndex` is below X (perhaps 300, to coincide with trampolining)?
I decided to go with an auto-gc threshold (of 128) and updated the PR description to match the new approach.
Even though I love the performance improvement, I think people will complain if we are holding onto (unnecessary) memory for arbitrarily long periods of time.

One possibility is to clear out the entries when the run loop begins, basically by starting at the stack index and nulling until the first null. This opens the door to making a null array, e.g. `val NullData = Array.fill[AnyRef](...)(null)`, and then using the faster `arraycopy` to null out the extra entries.

However, we'd still have the problem of holding onto memory for an indeterminate amount of time.

How much does this single change contribute to the performance improvements?
> One possibility is to clear out the entries when the run loop begins, basically by starting at stack index, and nulling until the first null.

If I'm understanding your recommendation correctly, I believe this is similar to what's currently implemented, but instead of clearing out the entries when the runloop starts, we do it when it exits (this also means on every yield / async operation or termination). Effectively we're holding on to objects unnecessarily only while evaluating synchronous effects, which in almost all cases should be a very small period of time, and only for those objects above the `_stackIndex` until the first `null` entry. Once the runloop finishes, the amount of memory we're holding on to will be the same as previously.

> This opens the door for making a null array, e.g. `val NullData = Array.fill[AnyRef](...)(null)`, and then using the faster `arraycopy` to null out the extra entries.

It's currently done iteratively, but I like this recommendation better 👍

> How much does this single change contribute to the performance improvements?

Between 5-10% increase in throughput, depending on the benchmark.
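Putting the ideas from this thread together, the threshold plus `arraycopy` clearing could be sketched roughly like this. All names, the standalone `stack` parameter, and the assumption that live entries are contiguous (non-null entries followed by nulls) are simplifications for the sketch, not the PR's actual code:

```scala
// Simplified sketch of threshold-based stack clearing.
// `stack` stands in for FiberRuntime's `_stack`; everything else is illustrative.
object StackSketch {
  final val AutoGcThreshold = 128

  // Preallocated array of nulls, enabling bulk `System.arraycopy` clears
  // instead of a per-slot loop.
  private val NullData: Array[AnyRef] = new Array[AnyRef](AutoGcThreshold)

  // Pop: below the threshold we skip nulling, assuming shallow entries will
  // be overwritten anyway as the stack pointer moves up and down.
  def popStackFrame(stack: Array[AnyRef], nextStackIndex: Int): Unit =
    if (nextStackIndex >= AutoGcThreshold) stack(nextStackIndex) = null

  // On runloop exit: null out entries from `stackIndex` up to the first null,
  // in chunks, so garbage isn't retained between runs.
  def clearFrom(stack: Array[AnyRef], stackIndex: Int): Unit = {
    var i = stackIndex
    while (i < stack.length && stack(i) != null) {
      val len = math.min(NullData.length, stack.length - i)
      System.arraycopy(NullData, 0, stack, i, len) // bulk-null one chunk
      i += len
    }
  }
}
```

Chunked copying may null a few already-null slots past the first null, which is harmless under the contiguity assumption.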
@kyri-petrou Will do a detailed review tomorrow!
Force-pushed from 4b80b81 to 184bc68
Excellent work!
This PR contains a few optimizations and micro-optimizations for the FiberRuntime runloop that I've noticed while previously working on #8745 but wanted to optimize them separately. Let's dig into the different things that have been optimized.
1. Checking for messages while running
This PR introduces a wrapper over `ConcurrentLinkedQueue` which has a weakly consistent `unsafeIsEmpty` method for checking if the inbox contains messages. Since we check for new messages before evaluating every single effect in the runloop, the overhead of calling `poll()` to check whether the queue is empty is non-negligible.

Since adding messages from outside the FiberRuntime is extremely rare (in most cases it's for interruption), I think we can get away with this weakly consistent check in the runLoop itself. Note that we should never rely on the `unsafeIsEmpty` method at any other point.

2. Stack optimizations
The main optimization here is that we avoid setting entries to `null` whenever we "pop" an entry from the stack while we are in the "shallow" part of the stack (idx < 128). The main reason for this is that we assume entries in the shallow part of the stack are likely to be replaced automatically as the pointer moves up and down, so we don't need to manually GC them.

The other (micro)optimization regarding the stack is having it initialized when we first start evaluating the effect, avoiding repeated null checks during the runloop. Since the `_stack` will be initialized on any kind of effect other than a `Sync` or `Exit`, the only drawback of this is a very small overhead when we fork things like `ZIO.unit`. However, since realistically all forked effects will have at least 1 effect that needs to initialize the stack, I think it's better not to initialize it dynamically.

3. Updating `_lastTrace`
Currently, we update `_lastTrace` whenever the current trace is not `null` or empty. However, since ZIO's methods require an implicit trace which is propagated to all methods that are called, it's very common for `_lastTrace` to be updated with the same value multiple times. Since reading the variable is much cheaper than writing it, we first check whether the current trace is different from the old one.

I also added some comments in the PR below with some questions / remarks.
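The `_lastTrace` check in (3) amounts to something like the following toy sketch. The `String` alias for the trace type and the `writes` instrumentation counter are stand-ins for this example only, not the real FiberRuntime internals:

```scala
object TraceSketch {
  type Trace = String // stand-in for ZIO's Trace type

  private var _lastTrace: Trace = null
  var writes: Int = 0 // instrumentation for the sketch, not in the real code

  // Reading a field is much cheaper than writing it, so we only store the new
  // trace when it actually differs from the previous one. Reference equality
  // (`ne`) suffices under the assumption that the same implicit trace instance
  // is threaded through nested calls.
  def updateLastTrace(trace: Trace): Unit =
    if ((trace ne null) && (trace ne _lastTrace)) {
      _lastTrace = trace
      writes += 1
    }

  def lastTrace: Trace = _lastTrace
}
```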
Benchmarking results
I only ran the `NarrowFlatMap` and `BroadFlatMap` benchmarks using 1 thread; let me know if you think we need to run other benchmarks as well.

TLDR:
series/2.x:
PR: