
Cats Effect 3.0 Interruption Model Proposal #681

Open
SystemFw opened this issue Oct 23, 2019 · 9 comments

@SystemFw (Collaborator) commented Oct 23, 2019

I'm writing this proposal to describe the interruption model I'd like to see for cats-effect 3.0.
I'm only going to be focusing on semantics, things like the place in the hierarchy should be discussed elsewhere.
The main goals of this proposal are to lower the skill level needed to write interruption- and finalisation-safe code, and to preserve as much compositionality as possible in a problem that at some level defies it.
I'm going to try setting enough context to justify the decisions in this proposal; if you'd like to skip to the core of it, go to the Combinators and Semantics section.

The problem

The notion of safety in concurrent code is not univocal, and often revolves around two separate concepts: finalisation and interruption.
Finalisation is needed to ensure that certain sections of the code will run in any scenario: for example to close resources, or restore some data to a valid state.
Interruption is needed to avoid using resources once the calling code has decided they are no longer needed, to prevent deadlock by offering timeouts, and to compositionally express several patterns (e.g. do this for 10 seconds, then start doing that).
Unfortunately, these two concepts are at odds at a fundamental level:

  • Interruption means "No matter what you are doing, stop"
  • Finalisation means "No matter what you say, I must keep going"

From a theoretical point of view, there is no perfect compromise: by prioritising resource safety at all costs, you lose all interruption (this is the current uncancelable), and by prioritising interruption at all costs, you lose resource safety (this is your standard flatMap).
However, I will posit that for our users, loss of resource safety is more dangerous than loss of interruption, and therefore we shall keep things as interruptible as possible, but prioritise resource safety when breaking a tie. You will hopefully see exactly what I mean by this, further down in the proposal.

Current situation

I think it's safe to say that currently writing resource safe code that preserves interruption is a matter for experts, for two reasons:

  • The first is that users need to be familiar with guaranteeCase, bracketCase, continual, uncancelable, and the interaction of those with join and cancel. Each operation has subtly different semantics when it comes to interruption and they are all absolutely necessary in some scenarios.

  • The second is more fundamental: writing interruption-aware code means basically writing code that restores some properties when interruption happens. How and what needs to be restored generally depends on where interruption happened, and there is no way of knowing that.
    In practice this often means having to do complicated reasoning over a piece of state and figure out how to write code which can always restore it to a "steady" state, which can get very complex.

Principles

  1. As a default, interruption can happen "anywhere" (at any yield point, i.e. between any two flatMaps), but in critical sections the writer of the code needs to limit and control where interruption can happen. The ability to do this is what makes the model easier to use.
  2. The writer of the code, not the caller, is ultimately in control of the resource safety of the code they write. It shouldn't be possible for someone to call code that was designed to be resource safe and somehow make it resource unsafe.
  3. If we accept principle 2, it turns out that interruption behaves in exactly the opposite way: the caller of the code is ultimately in control of whether the code they are calling is interruptible or not. The writer can allow interruption in limited places, but cannot guarantee that the final code will be interruptible; it's up to the caller.
    In other words, it's possible for someone to call code that was designed to be interruptible and make it uninterruptible. This is really unavoidable (if we want principle 2), but I'll argue it's the correct default. Also, preserving interruptibility is fairly easy, unlike today.

Combinators and Semantics

Safe regions

The model can be described by two functions (names can be changed):

def onCancel[A](fa: F[A])(finalizer: F[Unit]): F[A]
def mask[A](region: Poll => F[A]): F[A]

class Poll {
  def apply[B](fa: F[B]): F[B]
}


// or, in dotty to show the higher rank
def mask[A](region: ([B] => F[B] => F[B]) => F[A]): F[A]

onCancel attaches a finalizer to an F[A]. Upon interruption, all registered finalizers are run and the rest of the computation is short-circuited. The code triggering interruption backpressures on
said finalizers.
mask makes it possible to control interruption in critical sections by defining a region in which interruption cannot happen unless one explicitly encloses the code that can be interrupted in a call to Poll.apply.
Because the writer of the code now knows exactly at which points interruption can happen, it's easy to add the correct logic via onCancel. We will soon see that the semantics of Poll.apply need to be refined a bit, but this is a good first intuition.
Example:

val p1 = mask { poll =>
  for {
    r <- openResource
    c <- poll(waitOnCondition).onCancel(closeResource)
    _ <- modifySomeStateDependingOn(c, r).attempt
    _ <- closeResource
  } yield ()
}

Which means:

val p1 = mask { poll =>
  for {
    r <- openResource // cannot be interrupted
    // flatMap cannot be interrupted
    c <- poll(waitOnCondition).onCancel(closeResource) // waiting can be interrupted
    // flatMap cannot be interrupted
    _ <- modifySomeStateDependingOn(c, r).attempt // cannot be interrupted
    // flatMap cannot be interrupted
    _ <- closeResource // cannot be interrupted
  } yield ()
}

Note that p1 is almost the same as

openResource.bracket { r =>
  waitOnCondition.flatMap { c =>
    modifySomeStateDependingOn(c, r).attempt
  }
}{closeResource}

but there is a crucial difference: in the bracket version, .flatMap and modifySomeState... can be interrupted, which means the code now has to worry about restoring SomeState without knowing whether interruption happened on waitOnCondition, or in between, or in any of the flatMaps that might make up modifySomeState..., requiring complex reasoning to figure out what to restore SomeState to.
I hope this illustrates how much nicer this model is to work with, as you essentially get to build a safe region where you need one.

Note 1:
In the snippet above I simply attempted any synchronous errors for brevity, but all sorts of error handling are possible of course.

Note 2:
In cats-effect 2.0, you have to combine bracket and continual to write p1 correctly.

Nested regions and restore semantics

So, in the section above we defined Poll.apply, which is called in an uninterruptible region created by mask, as making its argument interruptible again.
We will now see how these semantics are slightly wrong and require some refinement in order to comply with the principles above.
Imagine p1 is being called by p2, which is also safety-aware:

val p2 = mask { poll =>
  for {
    _ <- openResource2
    _ <- p1
    _ <- closeResource2
  } yield ()
}

Now, p2 thinks it's safe, but it's unaware that inside p1 there is an interruptible region defined by a call to poll, and so p2 can now be interrupted and fail to execute closeResource2.
In other words, writing resource safe code requires inspecting all downstream code to know whether a flatMap is safe, which in my opinion is not an acceptable solution.

We can solve this problem by changing the semantics of Poll.apply: a call to Poll.apply makes its argument inherit the interruption status of the outer scope.
For example, if we have def main = p1, the outer scope's interruption status is interruptible (by default things are interruptible), and inside p1 waitOnCondition inherits this status, i.e. it's interruptible, and we have recovered the previous semantics.
However, when p1 is nested inside p2, the outer scope is uninterruptible since we are inside mask, so the call to poll inside p1 inherits that, and things are safe again.
If p2 wants to allow interruption of p1, it can call poll again:

val p2 = mask { poll =>
  for {
    _ <- openResource2
    _ <- poll(p1).onCancel(closeResource2)
    _ <- closeResource2
  } yield ()
}

def main = p2

p1's poll inherits the status of p2's poll, which in turn inherits from main, meaning that waitOnCondition is now interruptible again, but none of the other parts are, which is the desired semantics. This extends to arbitrary nesting.
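To make the inheritance chain concrete, here is a sketch of one more level of nesting; p3, openResource3, and closeResource3 are hypothetical names following the pattern above:

```scala
val p3 = mask { poll =>
  for {
    _ <- openResource3
    // p2's internal polls inherit p3's masking status: since p3 never
    // calls poll on p2, nothing inside p2 (or p1) can be interrupted here
    _ <- p2
    _ <- closeResource3
  } yield ()
}

// To re-allow interruption of waitOnCondition three levels down,
// every caller in the chain has to opt in with poll:
val p3Interruptible = mask { poll =>
  for {
    _ <- openResource3
    _ <- poll(p2).onCancel(closeResource3)
    _ <- closeResource3
  } yield ()
}
```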

Let's highlight where the tradeoff lies: the caller of code written in mask can take code that is safely interruptible, and (perhaps accidentally) make it uninterruptible by not calling poll on it.
Or, to look at it from the other side, the writer of code in mask cannot guarantee that their code is interruptible, it can only allow for it by calling poll, but the caller will have the final say.

I think this is the correct tradeoff: as the caller, you know whether you care about interrupting one of the things you call, so the failure mode of forgetting to call poll is far less dangerous than the initial semantics, where the code you write can become interruptible because something far down the call stack has used an interruptible region.

I think this is as compositional as we can get; hopefully it explains principles 2 and 3, and clarifies what I mean by saying that, on balance, resource safety is favoured.

Relationship with bracket etc.

We don't have to rewrite bracket and co., and actually we could also forego onCancel and use guaranteeCase, but I'll point out for the sake of argument how our native combinators can be easily expressed in this model.
This should also serve as a small example section:

bracketCase = mask { poll =>
  acquire.flatMap { r =>
    poll(use(r))
      .onCancel(close(ExitCase.Canceled))
      .handleErrorWith(e => close(ExitCase.Error(e)) >> F.raiseError(e))
      .flatMap(result => close(ExitCase.Completed).as(result))
  }
}

interruptibleBracketCase = mask { poll =>
  poll(acquire).flatMap { r =>
    poll(use(r))
      .onCancel(close(ExitCase.Canceled))
      .handleErrorWith(e => close(ExitCase.Error(e)) >> F.raiseError(e))
      .flatMap(result => close(ExitCase.Completed).as(result))
  }
}

uncancelable = mask(_ => fa)

continual = mask { poll =>
  poll(fa).flatMap(f)
}
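For completeness, guarantee (run a finalizer on completion, error, and cancelation alike) can be sketched in the same style; this encoding is my own illustration rather than part of the proposal:

```scala
guarantee = mask { poll =>
  poll(fa)
    .onCancel(finalizer)
    .handleErrorWith(e => finalizer >> F.raiseError(e))
    .flatMap(a => finalizer.as(a))
}
```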

Differences with Haskell

This model is obviously inspired by Haskell's async exceptions, but there is a huge difference: in Haskell some blessed operations are interruptible even inside mask, for example MVar.take.
As argued above, I think this is not a great default in general, and it's even worse in cats-effect because things like MVar aren't built in and users can create their own blocking primitives.
Another obvious difference is that onCancel/guaranteeCase aren't the same as handleErrorWith, so you cannot catch interruption, whereas in Haskell you can, and you need to remember to rethrow.
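A small sketch of that last difference, assuming a hypothetical fa: F[Unit] and a log helper:

```scala
val p = fa
  .handleErrorWith(e => log(s"error: $e")) // runs on errors, but never on interruption
  .onCancel(log("canceled"))               // runs on interruption; unlike Haskell,
                                           // the computation still short-circuits afterwards
```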

Open questions on start

There is an open question about start: Haskell's equivalent (forkIO) has the spawned fiber inherit the interruption status when forkIO is called inside mask, with another primitive, forkIOWithUnmask, to achieve independence.
I still haven't made up my mind as to which semantics are necessary or desirable here. The other questions are around what happens if users pass a Poll instance around, but I think those shouldn't constitute a blocker to the main drive of the proposal.

Thanks

Sorry for the wall of text! I did my best to condense this immensely tricky problem to its essence, but it still turned out super long.

@djspiewak djspiewak added the CE 3 label Oct 23, 2019
@JD557 commented Oct 25, 2019

I have some questions regarding the poll/onCancel semantics, which might not be obvious since poll does not change the return type:

  1. What happens if you call foo.onCancel(bar) instead of poll(foo).onCancel(bar)? Is that onCancel a no-op?
  2. Can you chain onCancel calls like poll(foo).onCancel(bar).onCancel(baz)?

One possible way to avoid ambiguity would be to remove onCancel and move the finalizer definition to Poll:

class Poll {
  def apply[B](fa: F[B], finalizer: F[Unit]): F[B]
}
@SystemFw (Collaborator, Author) commented Oct 25, 2019

  1. It depends where you are. onCancel can be used outside of a mask section just fine, i.e. a.flatMap(f).flatMap(g).onCancel(doSomething). If you are inside a mask section and you don't say poll(foo), then foo cannot be canceled, and onCancel is a no-op in that sense
  2. They all get run, in order

One possible way to avoid ambiguity would be to remove onCancel and move the finalizer definition to Poll:

You can't do that because onCancel can be used outside of mask as well. Another way of looking at it is that onCancel is just guaranteeCase, and in fact we could just keep guaranteeCase, I chose onCancel because it's the simplest operation that can describe the model, which I was trying to distill to its essence. Good question though, I considered attaching the finaliser to poll as well, at some point
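To illustrate both answers, with hypothetical foo, bar, baz, and cleanup:

```scala
// onCancel outside of mask: attaches a finalizer to interruptible code
val outside = a.flatMap(f).flatMap(g).onCancel(cleanup)

// inside mask without poll: foo cannot be canceled, so bar never fires
val inert = mask { _ => foo.onCancel(bar) }

// chained finalizers: if foo is canceled, both bar and baz get run
val chained = mask { poll => poll(foo).onCancel(bar).onCancel(baz) }
```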

@neko-kai (Contributor) commented Oct 25, 2019

@SystemFw
How would this code behave under the proposed model?

mask { poll =>
  poll(F.never).guarantee {
    F.delay(println("Waiting 1s")) >>
    F.sleep(1.second) >>
    poll(F.unit) >>
    F.delay(println("Waited 1s"))
  }
}.start.flatMap { fiber =>
  F.sleep(1.second) *> fiber.cancel
}

Will it ever print Waited 1s, or will it be interrupted during the poll before, i.e. resulting in 2 interruptions per 1 cancel?

@SystemFw (Collaborator, Author) commented Oct 25, 2019

So, that code is using the two things whose semantics are still in the works, i.e:

  • start within mask
  • sharing poll outside of a mask scope

I'm still trying to define what's best there; in Haskell the Fiber would inherit the masking status unless you use forkIOWithUnmask. I'm thinking we don't do that and just have start spawn a Fiber in a fresh status (so, interruptible), so you wouldn't need poll, and the whole code acts as if mask didn't exist. There are some implications to this choice, but I'm still pondering them.

What are your ideas here?

@SystemFw (Collaborator, Author) commented Oct 25, 2019

I think we can extract at least one clear question from there, which is:

what if you call poll inside onCancel?

My first thought is that this is a no-op: things inside onCancel should be flat-out uncancelable, but I'm obviously open to hearing counterpoints. These are all great points though :)
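Concretely, under that semantics (cleanup is a hypothetical finalizer):

```scala
mask { poll =>
  poll(F.never).onCancel {
    // this inner poll would be a no-op: code inside onCancel is
    // flat-out uncancelable, so a second cancelation cannot be
    // observed while the finalizer runs
    poll(cleanup)
  }
}
```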

@neko-kai (Contributor) commented Oct 25, 2019

@SystemFw

So, that code is using the two things whose semantics are still in the works, i.e:
start within mask
sharing poll outside of a mask scope

If you look closely, the start is outside of mask here – the entire mask {} expression is started.
Also, the poll is only used within the mask expression, if you accept that in mask(a guarantee b) both a and b are "within" mask.

So actually, my snippet isn't about inheritance of masking status, it's about "stickiness" of interruption – can a single cancel be observed repeatedly from multiple polls during execution of finalizers?

@SystemFw (Collaborator, Author) commented Oct 25, 2019

Yeah, sorry about that :)
Then I guess as I said the question reduces to:

what if you call poll inside onCancel?

And I'd say that that's a no-op and things inside onCancel are uninterruptible. What do you think?

@neko-kai (Contributor) commented Oct 25, 2019

@SystemFw
I'm not sure of the implications, but it sounds reasonable, especially if we dictate that a Fiber can only be canceled once – i.e. multiple cancels have no effect and are nonsensical, unlike Haskell.

Re: usage of poll out of scope, while Scala doesn't make it possible to prevent poll from leaking, it's possible to determine its scope and invalidate all out-of-scope uses by (1) invalidating poll in the finalizer of mask, to prevent sequential leaking, and (2) adding a FiberId and invalidating poll for any fiber but the one that spawned it, to prevent leaking to different fibers.

@djspiewak (Member) commented Oct 25, 2019

Re: usage of poll out of scope, while Scala doesn't make it possible to prevent poll from leaking, it's possible to determine its scope and invalidate all out of scope uses by 1. invalidating poll in finalizer of mask to prevent sequential leaking, 2. add FiberId and invalidate poll for any fiber but the one that spawned it, to prevent leaking to different fibers.

We could actually make it possible by encoding a functional dependency within Concurrent and using the same trick that the ST monad does, but on a transformer type around F in poll. I have an example of that in Gitter, I think. But… it's horrifying. I think it's safer to say "it's not possible" and leave it at that. :-)

Anyway, I think your proposal is more realistic: just detect leaked polls and turn them into no-ops.
