Timer stealing without a concurrent data structure #3781
Benchmarks after latest changes.
series/3.5.x
This PR
This is an interesting approach. I agree that the "fiddly things" should be solved for this to work. Specifically, cancelling timers seems critical. I'm not sure that cleaning them up during other inserts will be enough. (But it might be, as it's not a skiplist any more...) But actually freeing cancelled timers seems important. Another question: are we okay with timer stealing being a no-op? Because (if I'm reading it correctly), on certain platforms it definitely can be...
Thanks! I've pushed solutions for all of these in the latest commits.
Based on the first round of benchmarks it did seem like this was not enough. The latest commits I pushed changed it so that if the canceling thread owns the timer heap (i.e. it was the same thread that scheduled it) then it is removed immediately. This appears to have made a difference in the second benchmarks.

I just realized my justification for this approach depends a lot on the I/O-integrated runtime ... so my thinking is, timeouts are typically associated with some I/O op, e.g. when the read/write succeeds the timeout can be canceled. In the new I/O-integrated runtime, both the timer and the I/O op will be specifically associated with that thread. So when the I/O completes, it will continue on the same thread that started it and also started the timer. However, this is not currently the case where I/O relies on an external selector thread ... in that case, we'd be going through the external queue and we'll end up on some random worker thread that probably doesn't own our timer. Hmm.

tl;dr this approach heavily optimizes for the happy path where in practice we are staying local to a single worker thread.
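For readers following along, here is a rough sketch of the owner-thread cancelation idea described above. It is not the PR's `TimerHeap`: `Entry`, `SketchHeap`, and `purge` are hypothetical names, an `ArrayBuffer` stands in for the binary heap, and the callback field is `@volatile` for simplicity, whereas the discussion in this thread is about weaker plain accesses. It assumes Scala 2.13+ for `filterInPlace`.

```scala
import scala.collection.mutable.ArrayBuffer

object OwnerCancel {
  // Hypothetical timer entry; a null callback marks the timer as canceled.
  final class Entry(val triggerAt: Long, cb: Runnable) {
    @volatile var callback: Runnable = cb
  }

  // The buffer is only ever mutated by the owner thread.
  final class SketchHeap(owner: Thread) {
    private val timers = ArrayBuffer.empty[Entry]

    // Must be called on the owner thread. Returns a canceler.
    def insert(triggerAt: Long, cb: Runnable): Runnable = {
      val entry = new Entry(triggerAt, cb)
      timers += entry
      new Runnable { def run(): Unit = cancel(entry) }
    }

    private def cancel(entry: Entry): Unit = {
      // The callback is released no matter which thread cancels.
      entry.callback = null
      // Only the owner may touch the buffer, so only it removes eagerly.
      if (Thread.currentThread() eq owner) timers -= entry
    }

    // Owner-only O(n) sweep that drops entries canceled by other threads.
    def purge(): Unit = timers.filterInPlace(_.callback ne null)

    def size: Int = timers.size
  }

  def demo(): (Int, Int, Int) = {
    val heap = new SketchHeap(Thread.currentThread())
    val noop = new Runnable { def run(): Unit = () }
    val cancel1 = heap.insert(10L, noop)
    val cancel2 = heap.insert(20L, noop)

    cancel1.run() // canceled by the owner: removed immediately
    val afterOwnerCancel = heap.size

    val other = new Thread(cancel2) // canceled elsewhere: entry lingers
    other.start()
    other.join()
    val afterForeignCancel = heap.size

    heap.purge()
    (afterOwnerCancel, afterForeignCancel, heap.size)
  }
}
```

The second number in `demo()` is the case the thread worries about: a cancelation that originates on another thread frees the callback but leaves a dead entry behind until the owner sweeps it.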
Freeing what specifically, the callback? This is definitely freed regardless of which thread does the canceling, by However, the heap datastructure itself could grow quite large holding all these |
Oh, forgot to respond to this. I'm not sure what you mean? Even in the current implementation it can be a no-op, if there are no timers to steal 😉 I guess the relevant question is, are we okay with timer stealing missing some currently stealable timers? Still, in the current approach this is a possibility, since it is just choosing threads at random, you could keep getting very unlucky. But the idea is eventually someone should steal from that thread. Which I think is the same idea here: eventually the data should be published to the other thread ... right? I'm not enough of an expert particularly when it comes to ARM. Maybe there are some pathological situations where the timers would never be published, but I feel like because of GC and stuff it has to happen eventually ... |
Sorry, I probably didn't look at the newer commits. Yeah, actually freeing "memory". So not just the callback, but also other memory which was allocated when the callback was inserted. (So yeah, I'm thinking exactly about the "datastructure itself could grow quite large".) About stealing being no-op:
Exactly. I don't think this can happen currently. Specifically: if there is at least one stealable timer, it will not finish stealing without actually stealing something. (Currently it doesn't steal everything it can, but it could.) Eventual memory visibility: as far as I know, plain reads (which this PR seems to use) absolutely don't guarantee that. (That would be a property of opaque, if I remember correctly, which is JVM 9+ I think.) |
Yeah, this is a legitimate danger when too much cancelation is originating on other threads. I think the owner thread is just going to have to do some periodic self-cleanup ... this is annoying b/c scanning the heap for canceled timers is O(n) ...
Sure, generally speaking plain reads definitely don't guarantee that. But what I'm wondering is if this is effectively guaranteed in practice by virtue of the fact that JVM is a GC runtime. So all allocations need to become visible to other threads eventually, in order for GC to work correctly. Right?
Interesting. I read a bit about it and I'm not entirely sure:
If there's no assurance of ordering wrt other threads, then isn't it lacking the same guarantees? What is "program order"? |
I'm not sure (the JMM doesn't seem to mention the GC very much). I think it's enough if they become visible to the GC; other threads don't have to care. Opaque: I think this is where I've read that opaque accesses eventually happen: https://gee.cs.oswego.edu/dl/html/j9mm.html#opaquesec (this is not the spec, but unlike the spec, it actually explains some things).
I mean, a plain read is not guaranteed to happen at all... so an opaque read still has more guarantees.
Here you go: https://docs.oracle.com/javase/specs/jls/se17/html/jls-17.html#jls-17.4.3 (I'm not saying I understand it, but it is very clearly defined :-). |
Oh yeah, I am not trying to make a formal argument based on JMM or specifications or anything. I'm just saying in practice, because of how things work in real programs on real JVMs on real CPUs, timers will eventually be visible to other threads, probably sooner rather than later. And in the grand scheme of things I think that's good enough. IMO programs should not rely on stealing to work correctly; it should really only be most useful in pathological scenarios, which are already violating several other "happy-path" assumptions in CE. Finally, we do have a test for stealing, so we will see for ourselves if that ever fails, particularly on ARM. Aha, thanks for those links about opacity! That's a much more helpful explanation :) "Writes are eventually visible." Well maybe One Day:tm: we can use that /shrug |
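As a small illustration of the opaque mode mentioned above, the sketch below uses `AtomicBoolean.getOpaque` and `setOpaque`, which exist on JDK 9+. This is not what the PR does (it uses plain reads); the object and method names are made up for the example. The reader spins until the opaque write becomes visible, which is the progress property the linked memory-order guide describes for opaque mode. Opaque mode gives no ordering guarantees for other variables.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object OpaqueFlag {
  // Returns true if the reader thread observed the write and terminated.
  def demo(): Boolean = {
    val published = new AtomicBoolean(false)

    val reader = new Thread(new Runnable {
      def run(): Unit =
        // An opaque read must actually be performed on every iteration,
        // so it cannot be hoisted out of the loop the way a plain read might be.
        while (!published.getOpaque()) Thread.onSpinWait()
    })
    reader.start()

    // Opaque write: eventually visible to the spinning reader.
    published.setOpaque(true)

    reader.join(5000L) // generous timeout so the demo cannot hang
    !reader.isAlive
  }
}
```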
Okay, so you're basically saying that we are okay with timer stealing missing some currently stealable timers. I also think that's probably fine (at least I can't really think of a non-pathological case which would be hurt by it). |
Yes. We are okay with this, and also my hypothesis is that in practice we won't be detectably missing currently stealable timers anyway e.g. they will be fortuitously published by some other mechanism. |
I think this might have a problem. It's important, that a timer (which is not cancelled) is actually triggered (its callback is called). I don't think this is guaranteed here:
This might be fixed with an |
That's a good point, we should probably just do that. Thanks for bringing this up. I had thought through this exact scenario as well but I convinced myself it was okay based on a Discord thread. https://discord.com/channels/632277896739946517/839263556754472990/1087781250284130375 |
Relevant reading: https://shipilev.net/blog/2014/all-fields-are-final/ |
More relevant reading: https://shipilev.net/blog/2014/safe-public-construction/#_safe_initialization It describes the implementation "quirk" we are currently relying on.
So essentially we are using the cats-effect/core/jvm/src/main/scala/cats/effect/unsafe/TimerHeap.scala Lines 327 to 331 in 76b9e83
So at least on HotSpot I don't think it's broken. Whether we want to rely on this quirk however is a much different question. Especially since in this case it's not too hard to fix. |
Hm... Yeah, I've remembered something like that (except I've misremembered all the important details :-). So yeah, it seems it'll work on HotSpot (for now).
How? Now that I've actually read what the spec says, I'm not sure how to fix it... |
Really? Your idea above seemed reasonable, does it have some flaw?
Also we could make the |
Ah, yes, the
Hm... freeing memory on cancel seems important. Although, the |
From a performance standpoint, this is nearly a wash. It does make Before
After
|
I did my own x86 benchmarks and they are comparable to the ARM benchmarks. Both were 8-core Linux machines.
This PR
series/3.5.x
|
Daniel S responds:
@durban since you previously approved for merging into 3.5.x, I'm curious if you have any opinion. |
@armanbilge I don't have a strong opinion regarding 3.5 or 3.6. What I think is important though, is to figure out if we're really okay with this "best effort" stealing. Basically the discussion in #3873. |
Based on discussion in #3781 (comment) and #3873 (reply in thread) I'm going to re-target to |
I've left a few minor comments.
NOTICE.txt
@@ -0,0 +1,8 @@
cats-effect
Copyright 2020-2023 Typelevel
2024
} else false
} else false

val heap = this.heap // local copy
If I understand it correctly, the fact that we never read `null` here depends on the WSTP initializing and starting threads in a particular order (first init everything, then start everything). This seems a little bit fragile. Would it be worth doing a null check here to be sure?
I agree. Good catch.
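A minimal sketch of the kind of guard being suggested, under stated assumptions: `Worker`, its `heap` field, and `peekRoot` are hypothetical stand-ins, not the PR's actual code, and the root is assumed to sit at index 0. The point is only that the field is read once into a local, and the local is checked before use, so a reader that races with initialization sees "nothing to steal" rather than a `NullPointerException`.

```scala
object NullGuard {
  final class Worker {
    // Plain field, assigned some time after construction (assumption for this sketch).
    var heap: Array[String] = null

    def peekRoot(): Option[String] = {
      val heap = this.heap // local copy: read the field exactly once
      if ((heap eq null) || heap.length == 0) None // not initialized yet, or empty
      else Option(heap(0)) // Option(...) also maps a null slot to None
    }
  }

  def demo(): (Option[String], Option[String]) = {
    val worker = new Worker
    val beforeInit = worker.peekRoot() // field is still null
    worker.heap = Array("timer-1")
    (beforeInit, worker.peekRoot())
  }
}
```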
In #3499 we introduced a custom `ConcurrentSkipListMap`-like data structure. Notably this enabled timers to be stolen between threads (and also for canceled timers to be removed on cancelation).

Here I explore if we can support timer stealing without a specialized concurrent data structure. The observations are that:

- `IO.sleep` is implemented in terms of `IO.async`
- `IO.async` is already a synchronization point: it can be invoked many times, but only the first will win

So if Thread B steals and invokes a timer from Thread A, it actually doesn't need to notify Thread A of this at all. When Thread A later tries to invoke that timer's callback it will safely no-op. (We can do even better if Thread B lazily nulls out that callback and this is fortuitously published to Thread A, then it can avoid invoking the callback at all without any explicit synchronization.)

So this PR introduces a `TimersHeap` that is intended to be written only by its owner thread. However, it may still be read by other threads (guarding against inconsistent state due to race conditions). This makes progress towards #3544 since accessing the sleepers queue in the worker loop no longer crosses any memory barriers.

Daniel summarized the potential benefits of this strategy on Discord.
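To make the "only the first invocation wins" property concrete, here is a small sketch. It is not how `IO.async` is implemented; `OnceRunnable` is a hypothetical wrapper that gives an arbitrary `Runnable` the same property using an `AtomicBoolean`, which is the sort of guard a callback without built-in protection would need.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}

object FirstWins {
  // Wraps `task` so that only the first call to run() executes it.
  final class OnceRunnable(task: Runnable) extends Runnable {
    private val done = new AtomicBoolean(false)

    def run(): Unit =
      // compareAndSet succeeds for exactly one caller, even under contention.
      if (done.compareAndSet(false, true)) task.run()
  }

  def demo(): Int = {
    val fired = new AtomicInteger(0)
    val timer = new OnceRunnable(new Runnable {
      def run(): Unit = { fired.incrementAndGet(); () }
    })

    // "Thread B" steals the timer and fires it ...
    val thief = new Thread(timer)
    thief.start()
    thief.join()

    // ... and the owner, unaware of the theft, fires it again later.
    timer.run()

    fired.get() // number of times the underlying task actually ran
  }
}
```

Because the second invocation is a harmless no-op, the stealing thread never has to tell the owner that the timer has already fired.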
Fiddly things

- timer cancelation. Without thread-safety we can't immediately remove it from the data structure, so they have to be cleaned up during other traversals for triggering/inserting timers. I wonder if we could be clever and detect if the cancelation request came from the worker thread that owns the `TimerHeap`, in which case it would be okay to clean it up immediately.
- handling non-`IO.sleep` timers i.e. scheduled via the `Scheduler` interface. Presumably the contract is that we will not invoke these more than once, so we need to set up an `AtomicBoolean` to guarantee this.
- scheduling timers from external threads. Currently this involves submitting a `Runnable` via the external queue that then schedules the timer.

Benchmarks

Somehow `sleepRace` got worse which suggests we need smarter cancelation. Also there is a lot of error in the measurement.

series/3.5.x
This PR