Skip to content


Choose a tag to compare
@djspiewak djspiewak released this 28 Nov 05:32
· 2752 commits to series/3.x since this release

This is the fifteenth major release in the Cats Effect 3.x lineage. It is fully binary compatible with every 3.x release, and fully source-compatible with every 3.3.x release. Note that source compatibility has been broken with 3.2.x in some minor areas (detailed below). Scalafixes are available and should be automatically applied by Scala Steward if relevant.

The theme of this release has been improving observability, testability, and debuggability of all applications using Cats Effect 3. This has resulted in a massive set of new functionality, tweaks, and improvements which make 3.3.0 the most significant Cats Effect release ever apart from 3.0.0 itself. The developer experience has been significantly improved, particularly in tricky areas such as diagnosing deadlocks and deterministically testing functionality involving timers and clocks. Additionally, new functionality has been brought to Scala.js, including full support for tracing!

Finally, we took the opportunity to continue to build on IO's best-in-class performance with significant improvements to the fiber runtime (including dynamic workload fairness self-tuning) and continued dramatic slimming of the fiber memory footprint. Embarrassingly, we even discovered that the build was using an unnecessary Scala compiler flag which inhibited performance in some scenarios (particularly involving large numbers of fibers) by up to 15%! All of these improvements should add up into a very noticeable leap forward for application metrics in nearly all real-world scenarios.

Notable Changes

Thread Fiber Dumps

One of the most annoying and difficult problems to resolve in any asynchronous application is the asynchronous deadlock. This scenario happens when you have a classic deadlock of some variety, where one fiber is waiting for a second fiber which is in turn waiting for the first (in the simplest case). Due to the asynchronous nature of the runtime, whenever a fiber blocks, all references to it are removed from the internal runtime, meaning that a deadlock generally leaves absolutely no residue whatsoever, and the only recourse as a developer is to just start sprinkling IO.println expressions around the code to see if you can figure out where it's getting stuck.

This is very much in contrast to a conventional deadlock in a synchronous runtime, where we have JVM-level tools such as thread dumps to suss out where things are stuck. In particular, thread dumps are a commonly-applied low level tool offered by the JVM which can serve to inform users of what threads are active at that point in time and what each of their call stacks are. This tool is generally quite useful, but it becomes even more useful when the threads are deadlocked: the call stacks show exactly where each thread is blocked, making it relatively simple to reconstruct what they are blocked on and thus how to untie the knot.

Fiber dumps are a similar construct for Cats Effect applications. Even better, you don't need to change anything in order to take advantage of this functionality. As a simple example, here is an application which trivially deadlocks:

import cats.effect.{IO, IOApp}

object Deadlock extends IOApp.Simple {
  val run =
    for {
      latch <- IO.deferred[Unit]

      body = latch.get
      fiber <- body.start
      _ <- fiber.join

      _ <- latch.complete(())
    } yield ()

The main fiber is waiting on fiber.join, which will only be completed once latch is released, which in turn will only happen on the main fiber after the child fiber completes. Thus, both fibers are deadlocked on each other. Prior to fiber dumps, this situation would be entirely invisible. Manually traversing the IO internals via a heap dump would be the only mechanism for gathering clues as to the problem, which is far from user-friendly and also generally fruitless.

As of Cats Effect 3.3.0, users can now simply trigger a fiber dump to get the following diagnostic output printed to standard error:

cats.effect.IOFiber@56824a14 WAITING
 ├ flatMap @ Deadlock$.$anonfun$run$2(Deadlock.scala:26)
 ├ flatMap @ Deadlock$.$anonfun$run$1(Deadlock.scala:25)
 ├ deferred @ Deadlock$.<clinit>(Deadlock.scala:22)
 ├ flatMap @ Deadlock$.<clinit>(Deadlock.scala:22)
 ╰ run$ @ Deadlock$.run(Deadlock.scala:19)
cats.effect.IOFiber@6194c61c WAITING
 ├ get @ Deadlock$.$anonfun$run$1(Deadlock.scala:24)
 ╰ get @ Deadlock$.$anonfun$run$1(Deadlock.scala:24)
Thread[io-compute-14,5,run-main-group-6] (#14): 0 enqueued
Thread[io-compute-12,5,run-main-group-6] (#12): 0 enqueued
Thread[io-compute-6,5,run-main-group-6] (#6): 0 enqueued
Thread[io-compute-5,5,run-main-group-6] (#5): 0 enqueued
Thread[io-compute-8,5,run-main-group-6] (#8): 0 enqueued
Thread[io-compute-9,5,run-main-group-6] (#9): 0 enqueued
Thread[io-compute-11,5,run-main-group-6] (#11): 0 enqueued
Thread[io-compute-7,5,run-main-group-6] (#7): 0 enqueued
Thread[io-compute-10,5,run-main-group-6] (#10): 0 enqueued
Thread[io-compute-4,5,run-main-group-6] (#4): 0 enqueued
Thread[io-compute-13,5,run-main-group-6] (#13): 0 enqueued
Thread[io-compute-0,5,run-main-group-6] (#0): 0 enqueued
Thread[io-compute-2,5,run-main-group-6] (#2): 0 enqueued
Thread[io-compute-3,5,run-main-group-6] (#3): 0 enqueued
Thread[io-compute-1,5,run-main-group-6] (#1): 0 enqueued
Thread[io-compute-15,5,run-main-group-6] (#15): 0 enqueued
Global: enqueued 0, foreign 0, waiting 2

A fiber dump prints every fiber known to the runtime, regardless of whether they are suspended, blocked, yielding, active on some foreign runtime (via evalOn), or actively running on a worker thread. You can see an example of a larger dump in this gist. Each fiber given a stable unique hexadecimal ID and paired with its status as well as its current trace, making it extremely easy to identify problems such as our earlier deadlock: the first fiber is suspended at line 26 (fiber.join) while the second fiber is suspended at line 24 (latch.get). This gives us a very good idea of what's happening and how to fix it.

Note that most production applications have a lot of fibers at any point in time (millions and even tens of millions are possible even on consumer hardware), so the dump may be quite large. It's also worth noting that this is a statistical snapshot mechanism. The data it is aggregating is spread across multiple threads which may or may not have all published into main memory at a given point in time. Thus, it isn't necessarily an instantaneously consistent view of the runtime. Under some circumstances, trace information for a given fiber may be behind its actual position, or a fiber may be reported as being in one state (e.g. YIELDING) when in fact it is in a different one (e.g. WAITING). Under rare circumstances, newly-spawned fibers may be missed. These circumstances are considerably more common on ARM architectures than they are under x86 due to store order semantics.

Summary statistics for the global fiber runtime are printed following the fiber traces. In the above example, these statistics are relatively trivial, but in a real-world application this can give you an idea of where your fibers are being scheduled.

Triggering the above fiber dump is a matter of sending a POSIX signal to the process using the kill command. The exact signal is dependent on the JVM (and version thereof) and operating system under which your application is running. Rather than attempting to hard-code all possible compatible signal configurations, Cats Effect simply attempts to register both INFO and USR1 (for JVM applications) or USR2 (for Node.js applications). In practice, INFO will most commonly be used on macOS and BSD, while USR1 is more common on Linux. Thus, kill -INFO <pid> on macOS and kill -USR1 <pid> on Linux (or USR2 for Node.js applications). POSIX signals do not exist on Windows (except under WSL, which behaves exactly like a normal Linux), and thus the mechanism is disabled.

Since INFO is the signal used on macOS and BSD, this combined with a quirk of Apple's TTY implementation means that anyone running a Cats Effect application on macOS can simply hit Ctrl-T within the active application to trigger a fiber dump, similar to how you can use Ctrl-\ to trigger a thread dump. Note that this trick only works on macOS, since that is the only platform which maps a particular keybind to either the INFO or USR1 signals.

In the event that you're either running on a platform which doesn't support POSIX signals, or the signal registration failed for whatever reason, Cats Effect on the JVM will also automatically register an MBean under cats.effect.unsafe.metrics.LiveFiberSnapshotTriggerMBean which can produce a string representation of the fiber dump when its only method is invoked.

This entire mechanism has no performance impact (well, it probably would if you kept printing the dump in a loop, but don't do that). It is controlled by the same configuration as tracing.

And in case you were wondering, yes, it does work on Node.js applications!

cats.effect.IOFiber@d WAITING
cats.effect.IOFiber@9 WAITING
 ╰ deferred @ <jscode>.null.$c_LDeadlock$(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:22)
cats.effect.IOFiber@a WAITING
 ├ flatMap @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:26)
 ├ flatMap @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:25)
 ├ deferred @ <jscode>.null.$c_LDeadlock$(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:22)
 ╰ flatMap @ <jscode>.null.$c_LDeadlock$(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:22)
cats.effect.IOFiber@b WAITING
Global: enqueued 0, waiting 4

You'll notice that there are two extra fibers in this example. These are actually internal to Cats Effect itself, since IOApp on JavaScript maintains a keep-alive fiber and races it against the main fiber. The two fibers in the middle line up pretty closely with those from the JVM dump of the same program, though some of the tracing information is a little different. This is happening because Scala.js itself is very aggressive about inlining across call-sites, and Cats Effect is written in such a way that much of it does get inlined away. This in turn confuses the tracing slightly, since there's simply less information to work with. In practice, this isn't too much of an issue, since most practical fibers will have more than enough tracing information to pin down what's going on, but it's something to keep in mind.

Note that this functionality is currently only available on applications running under Node.js, since it isn't entirely clear how to trigger a fiber dump within a browser runtime. If you have thoughts or inspirations on this point, feel free to open an issue or discuss in Discord!

Better Fiber toString

println-based debugging is the staple of a lot of tricky program diagnosis, particularly in the realm of highly-concurrent programs. Unfortunately, prior to this release, printing out a Fiber would more often than not simply print false (and more rarely, true) due to some subtle implementation details. This was actually worse than the default inherited from AnyRef which would at least produce an instance-identifying numeric value which could be used to correlate fibers in different places.

This property has been restored, along with several other pieces of useful debugging information, on the new toString for the default Fiber implementation in IO. For example:

import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object ToString extends IOApp.Simple {
  val run = {
    lazy val loop: IO[Unit] = IO.cede >> loop
    (loop.start <* IO.sleep(100.millis)).flatMap(IO.println(_))

In this example, we're starting a fiber which loops infinitely, sleeping for 100 milliseconds, and then simply printing the toString of that fiber and exiting. The results will be something like the following:

cats.effect.IOFiber@217df74f RUNNING: >> @ ToString$.loop$lzycompute$1(ToString.scala:22)

The hexadecimal value following the @ is stable and unique to the instance (though it will not be consistent between JVM executions), and will in fact be the same value printed during fiber dumps. The status will either be RUNNING or SUSPENDED, since fibers themselves have access to less status context than the fiber dump mechanism does.

The final component of the toString is the top-most frame from the fiber trace, if available. Note that, much as with fiber dumps, we're leveraging some tricky memory barriers to retrieve this information without performance cost. This in turn means that it may be slightly out of date, depending on the thread which is calling toString, and depending on the underlying CPU architecture (again, x86 will produce more accurate results here than ARM).

Since this value is the toString for the fiber and not an additional debug function, it should be of significant aid to IDE debuggers (such as IntelliJ or BSP), which generally rely on the toString of an object when rendering. If tracing is disabled, only the first half of the string will be produced (everything prior to the :).

Fiber Runtime Observability

In keeping with this release's focus on developer usability and productionalization, we've significantly improved the observability of all Cats Effect applications out of the box by adding a number of JMX MBeans which provide metrics on the underlying runtime. These metrics are implemented without any impact to performance and thus can be safely incorporated into automated reporting, alerting, and monitoring systems. They can also be a useful local diagnostic tool when viewed with Mission Control:

All MBeans are automatically registered and enabled by default. This functionality can be disabled using the same configuration that disables tracing support. In the event that JMX is unsupported by the underlying runtime (e.g. Graal native-image), Cats Effect will silently continue without active monitoring. Also please note that MBean observability is only supported for the global IORuntime, as well as any runtime managed by IOApp.

It's worth keeping in mind that nearly all metrics are only periodically published to main memory, meaning that samples are not guaranteed to be internally self-consistent nor are they guaranteed to represent an instantaneously-accurate snapshot of the runtime. All metrics will generally exhibit higher accuracy on x86 than they will on ARM for architectural reasons.

A number of metrics are available right now, but just to provide a brief taste:

  • Total number of blocked workers
  • Number of fibers currently suspended
  • Total number of workers
  • Total spilled fibers for each worker
  • Total number of stolen fibers for each worker
  • Current fiber dump represented as an Array[String]
  • ...and much more!

Please note that these metrics should be considered a first draft and we will likely continue to iterate and change them over time. We may remove or rename some existing metrics within the 3.x release lineage, though if this happens, it will be reserved for a minor version bump (rather than a patch version).

Be sure to let us know if the new observability proves useful, and also if you have any additional ideas for metrics which can be safely exposed by the runtime.

Mock Time and Improved Support for Testing IO

One of the many changes brought about by Cats Effect 3.0.0 has been the consolidation of numerous formerly-ancillary functions into the core typeclass hierarchy. Deferred and Ref are two obvious examples of this, but the now-deleted ContextShift and the heavily reworked Clock typeclasses are also quite significant. The former was superseded by enriched functionality within the Async typeclass – in particular, evalOn and stronger laws around async – while the latter still exists but has become explicitly part of the hierarchy, rather than serving as an additional element on the side.

By and large, this change simplifies usage of Cats Effect in that it is no longer necessary to manually thread a Clock instance through layers and layers of call sites: you can simply invoke the realTime or monotonic functions on either IO or whatever typeclass constraint you happen to have (e.g. Temporal, Sync, or Async). Unfortunately, this integration and canonicalization also brought about an unintended consequence: mocking time for the purposes of testing became effectively impossible.

To an extent, some of this regression was to be expected. After all, one of Cats Effect 3's major focus areas was in improving the safety and uniformity of the runtime semantics. Mocking time in Cats Effect 2 by creating alternative Clock instances was always fraught with caveats even when you could be sure that the underlying code didn't just do something hacky like Sync[F].delay(System.currentTimeMillis()). Among other things, timeouts and races behaved inconsistently with respect to this source of time, meaning that code might have wildly different semantics under test.

However, none of this changes the fact that these kinds of use-cases are unavoidable. Testing code involving timeouts in particular is a tricky scenario which arises often. Prior to Cats Effect 3.3.0, this could only really be done by using wall clock time, which usually results in long running and unreliable unit tests. The Ember project (a purely functional HTTP network layer built on Fs2 and Cats Effect) has suffered particularly acutely from this limitation.

The solution to all of these problems is to provide a new mechanism which overrides the runtime evaluation semantics of IO such that all time-related functionality (including realTime, monotonic, and sleep) behave consistently, deterministically, and in a fashion which is ultimately under external user control for the purposes of testing. This mechanism should be otherwise as consistent as possible with the production IO runtime, ensuring that code under test can be relied upon to behave as close as possible to how it would in a production environment, except that its view of time as a whole would be redefined.

In Cats Effect 3.3.0, this mechanism is now available in the form of TestControl, a new type in the cats-effect-testkit module. This wraps and revises the functionality previously available in a less ergonomic form within TestContext, a utility that has been a part of Cats Effect since before 1.0.0. TestControl is much more opinionated about how you interact with the runtime, and this in turn allows it to provide a more deterministic and granular set of controls to users.

Note that this tool is only intended for use in test suites! It is not a production runtime.

You can find an intricate example of applying TestControl to a complex scenario on the project website. As a very quick example though using MUnit syntax, consider testing the timeout function:

import cats.effect.testkit.TestControl

test("timeout should take the winner") {
  val program = IO.sleep(5.seconds).timeout(10.seconds)

test("timeout should halt the loser") {
  val program = IO.sleep(10.seconds).timeout(5.seconds)

Both of these tests will deterministically pass within milliseconds, even on CI. You can get even finer grain control over time passage by using the TestControl.execute function, though this is beyond the scope of these notes.

This functionality fully subsumes the old Clock mocking technique in Cats Effect 2, as well as all prior use of TestContext (which has now been changed subtly to support TestControl and is not recommended for direct use).

Improved interruptible

Judging by questions in Discord, most people seem to still be unaware of the interruptible function and its implications. Considering that this function was also, until this release, almost entirely undocumented, this is perhaps not a surprising turn of events!

In 3.3.0, we've slightly reworked the IO.interruptible and Sync#interruptible APIs (jointly). In particular, interruptible previously took a Boolean parameter which governed whether or not the state machine would repeatedly attempt to call Thread#interrupt or if it would just send a single signal and await termination. Far and away, the most common use-case here would be to simply pass false, but this led to a rather clunky and unintuitive syntax: IO.interruptible(false) { ... }.

In an attempt to fix this, interruptible has been split into two methods: interruptible and interruptibleMany, with the former behaving exactly like interruptible(false) previously behaved. This change was carefully encoded to ensure binary compatibility with previous Cats Effect 3.x versions, which unfortunately means that the old interruptible signature remains available on the IO companion, albeit marked with @deprecated.

As an aside, interruptible is very similar to blocking in that each can be used to wrap a side-effect which hard-blocks a Thread (think: file I/O, use of java.util.concurrent primitives, etc). However, unlike blocking and delay, interruptible side-effects can be canceled just like any other fiber stage! This mechanism works by invoking the Thread#interrupt method on the underlying carrier thread which evaluates the side-effect. Note that not all JVM code respects thread interruption (notably, most file I/O libraries do not), and only thread suspension points are generally interruptible. Thus, the effectiveness of this mechanism is somewhat limited, but for cases where it is helpful it can be employed:

import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object InterruptExample extends IOApp.Simple {
  val run = {
    val sleepy = IO interruptible {

    sleepy.timeoutTo(100.millis, IO.println("abort abort abort"))

The above program will terminate within about 100 milliseconds printing abort abort abort. Replacing interruptible with blocking would be semantically correct, but in that case the program would run for the full 10 seconds before printing slept! followed by abort abort abort. The latter semantic happens because blocking (like delay) is effectively uncancelable, meaning that the timeoutTo cancelation is unable to terminate the action early.

interruptible is not a safe drop-in replacement for blocking in all cases, since it is extremely common for JVM code, whether written in Scala or in Java, to be not interruption-safe. You should be careful to make sure that the side-effect in question will respond properly to interruption without leaking resources before taking advantage of this functionality. There are also some minor performance implications (blocking is slightly faster), though they are not expected to have a meaningful impact in most applications.

Tracing on Scala.js

Asynchronous fiber traces have been available on Cats Effect on the JVM since 2.3.0 (with a brief hiatus during the 3.0 era), but it's always been known that it should be possible (at least in theory) to bring the same technology to Scala.js. Tracing works by constructing a Throwable at IO call sites and walking the stack trace to determine the most semantically reasonable caller for a given combinator. That stack frame is then pushed into a ring buffer and carried along with the fiber. Of course, constructing Throwables in this fashion is quite slow, so we recover performance by caching this process based on the java.lang.Class instance of whatever thunk or Function value was passed to the combinator. This cache converges quite quickly on most applications (since it is only proportional to the size of the calling code), which in turn means that tracing in practice has a very minimal performance cost. (roughly 30% on worst-case synthetic benchmarks, closer to 5-10% in more realistic compute-bound tests, and entirely unmeasurable in I/O-bound applications)

These implementation details are relevant because, at least in theory, all of these same mechanisms exist on Scala.js! We can construct exceptions and walk their call stacks on that platform just as well as we can on the JVM, and while it isn't meaningful to ask for the Class of an AnonFunction, we can get access to its source code, which gives us at least something we can hang our hats on.

Thanks to the hard work of Arman Bilge, we have now done exactly this! Cats Effect applications on Scala.js are tracing-enabled and support all of the same features as the JVM tracing, including enhanced exceptions and fiber dumps. Unfortunately, JS tracing is a little less "set and forget" than JVM tracing. In practice, enabling tracing in a Node.js application imposes a nearly 100% performance penalty. The exact reasons for this are less than clear, but it seems to have something to do with the way that most JavaScript runtimes (but particularly V8) are optimized to assume that exceptions are very rare. Constructing an exception seems to cause the runtime to actively deoptimize itself, which has follow-on ripple effects across the application.

Additionally, while traces are very cool, they're not particularly helpful if they cannot be linked back to the original source code. This is relatively easy on the JVM since debug symbols are embedded in class files and parsed into Throwable stack traces by the runtime, but it's less straightforward on Scala.js where the actual exception source references would normally be to the generated JavaScript and not to the original Scala.

This problem is resolved through the use of source maps, a bit of metadata which allows the generated JavaScript to maintain a map which links back into the original source files which produced it. Cats Effect tracing is fully aware of this mechanism, and if it is enabled, will automatically use it as part of its trace output.

All of this comes together to give us the following!

java.lang.Throwable: Boom!
  at null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:35:55)
  at apply @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:35)
  at ifM @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:35)
  at apply @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:34)
  at flatMap @ <jscode>.null.<anonymous>(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:34)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at apply @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)
  at flatMap @ Example$.fib(/workspace/cats-effect/example/js/src/main/scala/cats/effect/example/Example.scala:24)

This aligns very nicely with the output produced by the same program on the JVM:

java.lang.Throwable: Boom!
        at Example$.$anonfun$program$5(Example.scala:35)
        at apply @ Example$.$anonfun$program$3(Example.scala:35)
        at ifM @ Example$.$anonfun$program$3(Example.scala:35)
        at apply @ Example$.$anonfun$program$1(Example.scala:34)
        at flatMap @ Example$.$anonfun$program$1(Example.scala:34)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)
        at apply @ Example$.fib(Example.scala:24)
        at flatMap @ Example$.fib(Example.scala:24)

Due to the performance costs associated with tracing on JavaScript, it is only enabled under fastOptJS. It is assumed that any production deployment of an application will use fullOptJS, which automatically disables tracing and restores full performance (all tracing call sites will be inlined away by the JavaScript runtime).

Dynamically Self-Tuning Fairness

While the Cats Effect fiber scheduler was originally derived from algorithms and data structures pioneered by the Tokio project, it has since evolved in several significant areas which result in better performance and tighter optimization with respect to both the JVM in general and conventional Cats Effect usage patterns specifically. One of these areas is the batched queue.

In any work-stealing scheduler, there are at least n + 1 work queues: one shared "external" queue, and one unshared local queue for each worker. The whole system is optimized around using the local queues for as much as possible, and keeping work in the local queues whenever there is an option to do so. Workers steal from each other's local queues when they run out of work, and periodically pull from the shared external queue, which in turn is only used for new incoming work which came from entirely outside of the thread pool (such as async continuations).

One of the key insights coming from Tokio is the encoding of a local queue which is efficient to steal from, efficient to locally poll/push, and removes contention points between all three operations. In order to do this, the local queues are all fixed to a total capacity of 256 fibers, and they can never contain any more than this. Of course, at some point, a local queue may overflow, meaning that a worker has more than 256 fibers which it needs to manage. This can happen either due to internal forking (e.g. use of the start operation) or polling from the external queue (which can bring a fiber into an already-full local queue). When this happens, we need a place to send those fibers.

Tokio solves this by taking 128 fibers (half of the local queue) and pushing each one individually onto the external queue. The halving property means that this kind of overflow happens relatively infrequently, and while it does sacrifice fairness for that back half of the queue, it at least avoids starving other sources which are attempting to enqueue work. Cats Effect improves upon this process to increase its efficiency and reduce the impact to fairness.

In particular, when this overflow happens, Cats Effect's fiber runtime places the overflowed fibers onto a batched queue, which maintains the fibers as a large batch in a shared space. Just like the external queue, all fibers can pull from this batched queue, and when they do they will receive a large set of fibers rather than pulling just one at a time from the external queue. In order to maintain fairness with the external queue, workers alternate between pulling from the batched queue or the external queue once every 64 steps.

This optimization works quite well on average, but it risks over-fitting to artificial benchmarks. In particular, it makes a number of assumptions. For one thing, it assumes that internal start operations (which also occur as part of par operators and in other circumstances) are dramatically more common than external async continuations. This is often true in Cats Effect applications, but it is not a guarantee, and in any application with a lot of async and very little start, the async continuations will take longer to complete than they could since workers will be wasting time polling the batched queue unnecessarily often. Conversely, if an application has even more starting than expected, but very little async, the external queue would be the one taking unnecessary attention away from the batched queue.

The goal should be to allow the pull to dynamically self-tune between these extremes depending on workload. An application with a lot of starting should see more time devoted to batch processing, while the inverse should be true for an application with a lot of async. The solution here is to build a virtuous cycle into the data structures by collapsing the external and batched queues into a single polymorphic queue which can contain either Fiber or Array[Fiber] elements, with the latter representing batches that had previously been spilled. This queue is polled once every 64 steps. When workers spill to it, they take 128 of their local fibers and split them into four batches of 32, enqueuing each of the four batches onto the polymorphic external queue.

This approach maintains the performance benefits of batching (optimizing for start-heavy workloads, which seem more common in Cats Effect than in Tokio applications), while maintaining the flexibility needed to dynamically optimize for async-heavy workloads when needed. Additionally, monotonicity, fairness, and amortization arguments are all preserved by the careful choice of coefficients (4 batches, 32 fibers, 64 steps, 128 fibers spilled, 256 fibers in local queue). This in turn ensures that while a fiber may have its execution slot delayed considerably in pathological cases, it is always guaranteed to come to the front of a worker eventually, preserving progress and performance simultaneously.

IO Run Loop No Longer Nests During Finalization

This is a rather strange bug that has been around since the earliest builds of Cats Effect 3 that included IO itself. Specifically, it has to do with the way in which IO rebuilds the run loop for a fiber when that fiber is being canceled. What used to happen is a brand new run loop would be created, clearing out and reusing the data structures which managed the old one, and then this process would repeat for each finalizer (evaluated inside-out). This meant that any fiber which was canceled within an arbitrarily long stack of finalizers (a situation which is relatively easy to construct, albeit with some contrivance) would in turn result in arbitrarily many copies of the run loop, each nested within each other.

As if this weren't bad enough, this process of recreating the run loop also would reset the counters which are used to insert fairness boundaries, meaning that a fiber could be constructed such that its cancelation could hog a worker thread for some arbitrary length of time, starving the pool.

We're very confident that no one has observed this in production, but it's still certainly a severe enough bug that it merits special attention. The fix should also have the pleasing side-effect of improving both throughput and fairness in cancelation-heavy workloads.

User-Facing Pull Requests

Special thanks to each and every one of you! You are all awesome.