Timer stealing without a concurrent data structure #3781

Conversation

armanbilge (Member)

In #3499 we introduced a custom ConcurrentSkipListMap-like data structure. Notably this enabled timers to be stolen between threads (and also for canceled timers to be removed on cancelation).

Here I explore whether we can support timer stealing without a specialized concurrent data structure. The observations are that:

  1. in practice, most if not all timers will be some kind of IO.sleep
  2. IO.sleep is implemented in terms of IO.async
  3. the callback in IO.async is already a synchronization point: it can be invoked many times, but only the first will win

So if Thread B steals and invokes a timer from Thread A, it actually doesn't need to notify Thread A of this at all. When Thread A later tries to invoke that timer's callback it will safely no-op. (We can do even better if Thread B lazily nulls out that callback and this is fortuitously published to Thread A, then it can avoid invoking the callback at all without any explicit synchronization.)
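The "only the first invocation wins" behavior described above can be sketched with an atomic callback slot. This is a hypothetical illustration (`CallbackSlot` is not a real cats-effect class; the actual mechanism lives inside `IO.async`):

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch of a first-caller-wins callback slot, in the spirit
// of the IO.async synchronization point described above. Any thread
// (owner or stealer) may attempt to fire the timer; only the first
// succeeds, so a stealer never needs to notify the owner.
final class CallbackSlot(cb0: Right[Nothing, Unit] => Unit) {
  private[this] val ref = new AtomicReference(cb0)

  def fire(): Boolean = {
    val cb = ref.getAndSet(null) // atomically claim the callback
    if (cb ne null) { cb(Right(())); true } // we won the race
    else false                              // someone already fired it
  }
}
```

A second `fire()` from the losing thread safely no-ops, which is exactly why the stealer does not need any explicit handshake with the owner.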

So this PR introduces a TimerHeap that is intended to be written only by its owner thread. However, it may still be read by other threads (guarding against inconsistent state due to race conditions). This makes progress towards #3544, since accessing the sleepers queue in the worker loop no longer crosses any memory barriers.

Daniel summarized the potential benefits of this strategy on Discord.

In the case where you don't have any theft at all, it's strictly faster. When you have theft, those timers effectively impose an amortized burden which is a bit higher than necessary on the "steal-ee" worker, but it won't be actively incorrect.

Fiddly things

  • timer cancelation. Without thread-safety we can't immediately remove a canceled timer from the data structure, so canceled timers have to be cleaned up during other traversals (when triggering or inserting timers). I wonder if we could be clever and detect whether the cancelation request came from the worker thread that owns the TimerHeap, in which case it would be okay to clean it up immediately.

  • handling non-IO.sleep timers, i.e. timers scheduled via the Scheduler interface. Presumably the contract is that we will not invoke these more than once, so we need to set up an AtomicBoolean to guarantee this.

  • scheduling timers from external threads. Currently this involves submitting a Runnable via the external queue that then schedules the timer.
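The AtomicBoolean guard from the second bullet might look like the following sketch (`OnceTask` is a hypothetical name, not the PR's actual implementation):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch of the at-most-once guard for non-IO.sleep timers
// scheduled via the Scheduler interface. Unlike IO.async callbacks,
// these tasks carry no built-in synchronization point, so the guard
// must be supplied explicitly.
final class OnceTask(task: Runnable) {
  private[this] val invoked = new AtomicBoolean(false)

  def run(): Unit =
    if (invoked.compareAndSet(false, true)) task.run()
    // else: already ran (possibly on a stealing thread); no-op
}
```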

Benchmarks

Somehow sleepRace got worse, which suggests we need smarter cancelation. Also there is a lot of error in the measurement.

series/3.5.x

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  200.754 ± 57.792  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   60.868 ± 18.988  ops/min

[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.09038918301666676
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.8053270754333333
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.08938118594166666
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 1.5771005562083333

This PR

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  251.271 ± 94.770  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   38.113 ± 11.084  ops/min


[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.11180908160833325
[info] Measuring heavily loaded...
[info] Loaded overhead: 1.0175761078166667
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.09586685694166674
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 1.444556097975

@armanbilge (Member Author)

Benchmarks after latest changes.

series/3.5.x

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  314.807 ± 49.493  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   74.017 ±  8.415  ops/min

[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.08945369533333336
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.4166134899416667
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.0899092131999999
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 0.40755935104166663

This PR

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  324.368 ± 47.451  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   89.657 ±  3.758  ops/min

[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.0938145233583334
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.43363560199166673
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.08244931874166661
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 0.3595659251250001

@durban (Contributor)

durban commented Sep 5, 2023

This is an interesting approach. I agree that the "fiddly things" should be solved for this to work.

Specifically, cancelling timers seems critical. I'm not sure that cleaning them up during other inserts will be enough. (But it might be, as it's not a skiplist any more...) But actually freeing cancelled timers seems important.

Another question: are we okay with timer stealing being a no-op? Because (if I'm reading it correctly), on certain platforms it definitely can be...

@armanbilge (Member Author)

This is an interesting approach. I agree, that the "fiddly things" should be solved for this to work.

Thanks! I've pushed solutions for all of these in the latest commits.

Specifically, cancelling timers seems critical. I'm not sure, that cleaning them up during other inserts will be enough.

Based on the first round of benchmarks it did seem like this was not enough. The latest commits I pushed changed it so that if the canceling thread owns the timer heap (i.e. it was the same thread that scheduled it) then it is removed immediately. This appears to have made a difference in the second benchmarks.

I just realized my justification for this approach depends a lot on the I/O-integrated runtime ... so my thinking is, timeouts are typically associated with some I/O op e.g. when the read/write succeeds the timeout can be canceled. In the new I/O integrated runtime, both the timer and the I/O op will be specifically associated with that thread. So when the I/O completes, it will continue on the same thread that started it and also started the timer.

However this is not currently the case where I/O relies on an external selector thread ... in that case, we'd be going through the external queue and we'd end up on some random worker thread that probably doesn't own our timer. Hmm.

tl;dr this approach heavily optimizes for the happy path where in practice we are staying local to a single worker thread.

But actually freeing cancelled timers seems important.

Freeing what specifically, the callback? This is definitely freed regardless of which thread does the canceling, by nulling out the callback. What is non-deterministic is when this is visible/published (since the update is not synchronized/volatile). But it should happen eventually ...

However, the heap data structure itself could grow quite large holding all these nulled entries. 🤔

@armanbilge (Member Author)

Another question: are we okay with timer stealing being a no-op? Because (if I'm reading it correctly), on certain platforms it definitely can be...

Oh, forgot to respond to this. I'm not sure what you mean? Even in the current implementation it can be a no-op, if there are no timers to steal 😉

I guess the relevant question is, are we okay with timer stealing missing some currently stealable timers? Still, in the current approach this is a possibility, since it is just choosing threads at random, you could keep getting very unlucky. But the idea is eventually someone should steal from that thread.

Which I think is the same idea here: eventually the data should be published to the other thread ... right? I'm not enough of an expert particularly when it comes to ARM. Maybe there are some pathological situations where the timers would never be published, but I feel like because of GC and stuff it has to happen eventually ...

@durban (Contributor)

durban commented Sep 5, 2023

Sorry, I probably didn't look at the newer commits.

Yeah, actually freeing "memory". So not just the callback, but also other memory which was allocated when the callback was inserted. (So yeah, I'm thinking exactly about the "datastructure itself could grow quite large".)

About stealing being no-op:

are we okay with timer stealing missing some currently stealable timers?

Exactly. I don't think this can happen currently. Specifically: if there is at least one stealable timer, it will not finish stealing without actually stealing something. (Currently it doesn't steal everything it can, but it could.)

Eventual memory visibility: as far as I know, plain reads (which this PR seems to use) absolutely don't guarantee that. (That would be a property of opaque, if I remember correctly, which is JVM 9+ I think.)

@armanbilge (Member Author)

So yeah, I'm thinking exactly about the "datastructure itself could grow quite large".

Yeah, this is a legitimate danger when too much cancelation is originating on other threads. I think the owner thread is just going to have to do some periodic self-cleanup ... this is annoying b/c scanning the heap for canceled timers is O(n) ...
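One plausible shape for that periodic self-cleanup is to count cancelations that originated on other threads and trigger the O(n) pack only once the count crosses a threshold, amortizing the scan. This is a hypothetical sketch (`PackCounter` and the half-the-heap threshold are assumptions; the PR's actual `packIfNeeded` heuristic is not reproduced here):

```scala
// Hypothetical sketch: the owner thread tracks how many timers were
// canceled from other threads (and thus could not be removed eagerly),
// and compacts the heap once that count exceeds half the heap's size.
// This bounds the fraction of dead entries while keeping the O(n) scan
// amortized rather than per-cancelation.
final class PackCounter(heapSize: () => Int, pack: () => Unit) {
  private[this] var canceled = 0

  def onExternalCancel(): Unit = {
    canceled += 1
    if (canceled > heapSize() / 2) {
      pack()       // O(n): drop tombstoned nodes and re-heapify
      canceled = 0
    }
  }
}
```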


Eventual memory visibility: as far as I know, plain reads (which this PR seems to use) absolutely don't guarantee that.

Sure, generally speaking plain reads definitely don't guarantee that. But what I'm wondering is if this is effectively guaranteed in practice by virtue of the fact that JVM is a GC runtime. So all allocations need to become visible to other threads eventually, in order for GC to work correctly. Right?

(That would be a property of opaque, if I remember correctly, which is JVM 9+ I think.)

Interesting. I read a bit about it and I'm not entirely sure:

Returns the value of a variable, accessed in program order, but with no assurance of memory ordering effects with respect to other threads.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#getOpaque(java.lang.Object...)

If there's no assurance of ordering wrt other threads, then isn't it lacking the same guarantees? What is "program order"?

@durban (Contributor)

durban commented Sep 6, 2023

So all allocations need to become visible to other threads eventually, in order for GC to work correctly. Right?

I'm not sure (the JMM doesn't seem to mention the GC very much). I think it's enough if they become visible to the GC; other threads don't have to care.

Opaque: I think this is where I've read that opaque accesses eventually happen: https://gee.cs.oswego.edu/dl/html/j9mm.html#opaquesec (this is not the spec, but unlike the spec, it actually explains some things).

If there's no assurance of ordering wrt other threads, then isn't it lacking the same guarantees?

I mean, a plain read is not guaranteed to happen at all... so an opaque read still has more guarantees.

What is "program order"?

Here you go: https://docs.oracle.com/javase/specs/jls/se17/html/jls-17.html#jls-17.4.3 (I'm not saying I understand it, but it is very clearly defined :-).

@armanbilge (Member Author)

I'm not sure (the JMM doesn't seem to mention the GC very much). I think it's enough if they become visible to the GC; other threads don't have to care.

Oh yeah, I am not trying to make a formal argument based on JMM or specifications or anything. I'm just saying in practice, because of how things work in real programs on real JVMs on real CPUs, timers will eventually be visible to other threads, probably sooner rather than later.

And in the grand scheme of things I think that's good enough. IMO programs should not rely on stealing to work correctly; it should really only be most useful in pathological scenarios, which are already violating several other "happy-path" assumptions in CE.

Finally, we do have a test for stealing, so we will see for ourselves if that ever fails, particularly on ARM.


Aha, thanks for those links about opacity! That's a much more helpful explanation :)

"Writes are eventually visible." Well, maybe One Day™ we can use that /shrug

@durban (Contributor)

durban commented Sep 7, 2023

Okay, so you're basically saying that we are okay with timer stealing missing some currently stealable timers. I also think that's probably fine (at least I can't really think of a non-pathological case which would be hurt by it).

@armanbilge (Member Author)

we are okay with timer stealing missing some currently stealable timers

Yes. We are okay with this, and also my hypothesis is that in practice we won't be detectably missing currently stealable timers anyway e.g. they will be fortuitously published by some other mechanism.

@durban (Contributor)

durban commented Sep 10, 2023

I think this might have a problem.

It's important, that a timer (which is not cancelled) is actually triggered (its callback is called). I don't think this is guaranteed here:

  • Let's look at a case where the owner thread calls pollFirstIfTriggered, and concurrently another thread calls steal.
  • Both of these will call getAndClear, which is essentially:
    • plain read of callback
    • plain write of callback (null)
  • Previously the owner thread initialized callback to non-null, also with a plain write.
  • It might happen that the non-owner thread does not "see" the initializing write, and also that the owner thread "sees" the null the other thread wrote. Thus, nobody will call the callback.

This might be fixed with an if in getAndClear, but I think a better thing would be to safely initialize Node, so everybody can "see" the initializing write.
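The "if in getAndClear" fix might look like the following sketch (`TimerNode` is a hypothetical name; the fields are plain, non-volatile accesses, exactly as in the scenario above):

```scala
// Hypothetical sketch of the proposed `if` fix: only clear the callback
// when a non-null value was actually observed. A stealer that fails to
// see the initializing write reads null and leaves the field untouched,
// so it cannot clobber the value; the owner always sees its own writes
// (program order) and can still claim the callback.
final class TimerNode(
    private[this] var callback: Right[Nothing, Unit] => Unit
) {
  def getAndClear(): Right[Nothing, Unit] => Unit = {
    val cb = callback    // plain read
    if (cb ne null)      // the `if`: never write null over a stale null
      callback = null    // plain write
    cb
  }
}
```

(The cross-thread race itself cannot be demonstrated deterministically; this shows only the single-threaded shape of the fix.)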

@armanbilge (Member Author)

but I think a better thing would be to safely initialize Node, so everybody can "see" the initializing write.

That's a good point, we should probably just do that.

Thanks for bringing this up. I had thought through this exact scenario as well but I convinced myself it was okay based on a Discord thread.

[screenshot of the Discord thread]

https://discord.com/channels/632277896739946517/839263556754472990/1087781250284130375

@armanbilge (Member Author)

More relevant reading: https://shipilev.net/blog/2014/safe-public-construction/#_safe_initialization

It describes the implementation "quirk" we are currently relying on.

A final field was written. Notice we do not care about what field was actually written, we unconditionally emit the barrier before exiting the (initializer) method. That means if you have at least one final field write, the final fields semantics extend to every other field written in constructor.

So essentially we are using the finality of triggerTime to publish the callback.

private final class Node(
    val triggerTime: Long,
    private[this] var callback: Right[Nothing, Unit] => Unit,
    var index: Int
) extends Function0[Unit]

So at least on HotSpot I don't think it's broken. Whether we want to rely on this quirk however is a much different question. Especially since in this case it's not too hard to fix.

@durban (Contributor)

durban commented Sep 14, 2023

Hm...

Yeah, I've remembered something like that (except I've misremembered all the important details :-). So yeah, it seems it'll work on HotSpot (for now).

Especially since in this case it's not too hard to fix.

How? Now that I've actually read what the spec says, I'm not sure how to fix it...

@armanbilge (Member Author)

How? Now that I've actually read what the spec says, I'm not sure how to fix it...

Really? Your idea above seemed reasonable, does it have some flaw?

This might be fixed with an if in getAndClear

Also, we could make the callback a val so it's a final field, and use a boolean var to track whether it's been invoked. We can't null it out in that case, but it's not the end of the world.
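That val-plus-boolean alternative could be sketched as follows (`FinalNode` is a hypothetical name; this shows only the single-threaded shape, not the publication behavior):

```scala
// Hypothetical sketch: `callback` is a val, i.e. a final field, so its
// write is published by final-field (safe-initialization) semantics; a
// plain boolean var tracks whether it has been claimed. The trade-off
// noted above: the callback can no longer be nulled out for GC.
final class FinalNode(val callback: Right[Nothing, Unit] => Unit) {
  private[this] var invoked: Boolean = false

  def claim(): Right[Nothing, Unit] => Unit =
    if (!invoked) { invoked = true; callback }
    else null // already claimed; caller must no-op
}
```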

@durban (Contributor)

durban commented Sep 15, 2023

Ah, yes, the if. That should fix this specific issue. (I was thinking about fixing the whole "var might not be visible to other threads" thing, and forgot about the specific problem. In fact, we should make sure there are no other similar problems.)

Also we could make the callback a val so it's a final field and you use a boolean var to track whether its been invoked. We can't null it in that case, but it's not the end of the world.

Hm... freeing memory on cancel seems important. Although the packIfNeeded might already ensure that.

@djspiewak (Member)

From a performance standpoint, this is nearly a wash. It does make sleep a bit faster, but it also reduces timer granularity under load by about 31%, which definitely isn't awesome (though not the end of the world). I'm not entirely sure why timer granularity is made worse by this change; I would have expected the opposite?

Before

[info] Benchmark                 (size)   Mode  Cnt    Score   Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  958.362 ± 0.636  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20  198.568 ± 0.277  ops/min

Warming up...
Measuring unloaded...
Unloaded overhead: 0.07872363459999998
Measuring heavily loaded...
Loaded overhead: 0.28008827346666676
Measuring unloaded 100x...
Unloaded overhead 100x: 0.07937270549166664
Measuring heavily loaded 100x...
Loaded overhead: 0.2589938195999999

After

[info] Benchmark                 (size)   Mode  Cnt     Score   Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  1046.615 ± 1.359  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   201.767 ± 0.359  ops/min

[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.07740299239999993
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.25821024226666656
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.07822806469166665
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 0.33923368308333335

@armanbilge (Member Author)

I did my own x86 benchmarks and they are comparable to the ARM benchmarks. Both were 8-core Linux machines.

this PR

[info] Benchmark                 (size)   Mode  Cnt    Score   Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  397.111 ± 2.954  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   88.611 ± 2.623  ops/min

[info] running (fork) cats.effect.benchmarks.SleepDrift 
[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.10907944324999996
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.4034487338499999
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.11182439151666657
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 0.4696231997083333

series/3.5.x

[info] Benchmark                 (size)   Mode  Cnt    Score    Error    Units
[info] SleepBenchmark.sleep       10000  thrpt   20  266.710 ± 14.059  ops/min
[info] SleepBenchmark.sleepRace   10000  thrpt   20   75.928 ±  1.305  ops/min

[info] Warming up...
[info] Measuring unloaded...
[info] Unloaded overhead: 0.11620415939166673
[info] Measuring heavily loaded...
[info] Loaded overhead: 0.40317077278333335
[info] Measuring unloaded 100x...
[info] Unloaded overhead 100x: 0.11014420795833324
[info] Measuring heavily loaded 100x...
[info] Loaded overhead: 0.4731014630083332

@armanbilge (Member Author)

Quoting myself from Discord:

I'd love to see it merged and it seems good, but I'm willing to admit it's risky for 3.5.x
the risks are that it's a complete change of implementation, and that the implementation works best with happy paths that polling promises
e.g. my canonical example is: schedule timeout, start I/O, complete I/O, cancel timeout
in 3.5 the I/O comes back in through the external pool, so the cancel is from a random thread
in 3.6 the I/O comes back on the same thread, so the cancel is also on the same thread

Daniel S responds:

since it's optimizing for a happy path that is very improbable in 3.5, and since it's relatively risky… I think we're better off punting it to 3.6 where the risk has a payoff attached to it
IMO, let's retarget it and merge it

@durban since you previously approved for merging into 3.5.x, I'm curious if you have any opinion.

@durban (Contributor)

durban commented Jan 14, 2024

@armanbilge I don't have a strong opinion regarding 3.5 or 3.6. What I think is important though, is to figure out if we're really okay with this "best effort" stealing. Basically the discussion in #3873.

@armanbilge (Member Author)

armanbilge commented Feb 15, 2024

Based on discussion in #3781 (comment) and #3873 (reply in thread) I'm going to re-target to series/3.x and then it should be good to merge.

@armanbilge armanbilge changed the base branch from series/3.5.x to series/3.x February 15, 2024 19:26
@djspiewak djspiewak closed this Feb 15, 2024
@djspiewak djspiewak reopened this Feb 15, 2024
durban
durban previously approved these changes Feb 18, 2024
@durban durban left a comment

I've left a few minor comments.

NOTICE.txt Outdated
@@ -0,0 +1,8 @@
cats-effect
Copyright 2020-2023 Typelevel

2024

} else false
} else false

val heap = this.heap // local copy

If I understand it correctly, the fact that we never read null here depends on the WSTP initializing and starting threads in a particular order (first init everything, then start everything). This seems a little bit fragile. Would it be worth doing a null check here to be sure?

@armanbilge (Member Author)

I agree. Good catch.

@djspiewak djspiewak merged commit dddaae1 into typelevel:series/3.x Feb 19, 2024
35 of 36 checks passed