RCHAIN-4008: fix errors from parallel execution, remove global ErrorLog #2853
3df63f1
to
b1b03b0
Compare
I think this duplicates #2848 (review)
Isn't this essentially a duplicate of @JosDenmark's PR #2848?
rholang/src/main/scala/coop/rchain/rholang/interpreter/Interpreter.scala
@JosDenmark it's similar to your PR, but it completely removes the global error state from Runtime.
@tgrospic You're still using |
Cancellation is non-deterministic for |
It also looks like errors don't cross concurrent boundaries in your implementation. So, given |
Pawel's comment was not about global state; it was about the this.synchronized calls.
b71a0e3
to
f698943
Compare
@JosDenmark the main question is: what is the result of execution of …? The mechanism we have for cancellation is throwing OutOfPhlogistonsError. More generally, though, cancellation can be any exit from the task. Because all errors are handled, there are no uncaught errors. This is a very thin wrapper that handles and collects errors; it doesn't try to dictate how errors are related. Our shared state (Cost) is in a sense a cancellation token: every step of execution checks it to see whether the task should die. If we want to cancel all tasks on the first error, we can do it the same way. I think you already suggested something in that direction, which seems very easy to add.
Do you think global state for errors in Runtime has any benefits? I have the impression from your and @marcin-rzeznicki's comments on your PR that you want to get rid of it. I was happy to see @EncodePanda's comments too. Maybe he will have another déjà vu. ;)
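The "Cost as a cancellation token" idea can be sketched roughly as follows. This is a simplified illustration using only the Scala standard library, not the interpreter's actual code: `CostToken`, `charge`, `CostDemo`, and `runSteps` are hypothetical names, and the error object only borrows the name OutOfPhlogistonsError from the snippet under review.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for the interpreter's out-of-phlo error.
case object OutOfPhlogistonsError extends Exception("out of phlogistons")

// Hypothetical shared budget: every branch charges against the same counter.
final class CostToken(initial: Long) {
  private val remaining = new AtomicLong(initial)

  // Deduct `amount`. Once the budget is exhausted, every later call fails too,
  // so each branch sharing the token stops at its next step.
  def charge(amount: Long): Either[Throwable, Unit] =
    if (remaining.addAndGet(-amount) < 0) Left(OutOfPhlogistonsError)
    else Right(())
}

object CostDemo {
  // Run up to `steps` unit-cost steps, checking the token before each one.
  // Returns how many steps completed and the error that stopped the run, if any.
  @annotation.tailrec
  def runSteps(token: CostToken, steps: Int, done: Int = 0): (Int, Option[Throwable]) =
    if (done >= steps) (done, None)
    else
      token.charge(1) match {
        case Left(err) => (done, Some(err))
        case Right(_)  => runSteps(token, steps, done + 1)
      }
}
```

Under this assumption, a "cancel on first error" flag could be checked at the same point as the cost, which is the direction suggested in the comment above.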
By the way, if you look at the last commit of the PR I issued, you'll see why the casper tests are failing. It's a bug in |
// Out Of Phlogiston error is always single
// - if one execution path is out of phlo, the whole evaluation is also
case errList if errList.contains(OutOfPhlogistonsError) =>
  OutOfPhlogistonsError.raiseError[M, Unit]
These raiseError calls are the problem. The semantics of parTraverse when raiseError is called in an execution branch is not "wait for all threads running in parallel with this one to cancel before returning an error." It is, "run a cancellation process (Task) concurrently with the currently executing branches and return an error immediately." In other words, parTraverse can return with an error before all threads are cancelled. Since it can return early, we can move past the point in the call sequence where we're able to handle errors.
The reason that we were getting an OOPE that would cause node to hang is that, given two branches P | Q, if P ran out of gas, it would throw an OOPE, but the result of parTraverse would return before Q was cancelled. As we exited the interpreter (and all opportunities for handling interpreter errors had passed), Q was still uncancelled. Then, Q would make another call to charge and then throw another OOPE. However, by the time the second OOPE was thrown, we'd already exited the interpreter, so instead of being caught and handled, it became an uncaught exception, and was reported to the uncaught exception handler all the way at the top of node. We'd then try to reprocess the block, and the cycle continued.
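The early-return behaviour described above can be reproduced in miniature. The sketch below is an analogy only: it uses `scala.concurrent.Future` and `Future.sequence` from the standard library rather than Monix `Task` and `parTraverse`, and standard-library futures have no cancellation at all. `EarlyReturnDemo` and its members are hypothetical names. What it shows is the core hazard: the combined result fails while the sibling branch Q is still running.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object EarlyReturnDemo {
  // Returns (combined result failed, Q had finished when the failure was observed).
  def run(): (Boolean, Boolean) = {
    val releaseQ  = Promise[Unit]()
    val qFinished = new AtomicBoolean(false)

    // P fails right away (think: ran out of phlo).
    val p: Future[Unit] = Future.failed(new RuntimeException("P failed"))
    // Q stays "in flight" until we explicitly release it.
    val q: Future[Unit] = releaseQ.future.map(_ => qFinished.set(true))

    // The combined future fails as soon as P's failure is seen,
    // without waiting for Q.
    val combined = Future.sequence(List(p, q))
    val outcome  = Try(Await.result(combined, 5.seconds))
    val qDoneAtFailure = qFinished.get()

    // Clean up: let Q finish so nothing is left dangling.
    releaseQ.success(())
    Await.ready(q, 5.seconds)

    (outcome.isFailure, qDoneAtFailure)
  }
}
```

Any error Q raises after `combined` has already failed has nowhere to go in the caller's error handling, which mirrors the uncaught second OOPE described in the comment.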
You'll be able to see this if you run the first test in InterpreterSpec with a large number of terms in parallel with many COMM events and a single error-throwing process on at least the second level (inside a continuation). I suggest you do that before you continue work on this PR, and before you even do that, please review the PR I opened so that it can be merged. If it turns out you're right, you can always replace what I've written with what you've written.
@JosDenmark I completely agree with you. If failed Tasks created by parTraverse are not handled, many undesirable scenarios can happen.
But parTraverse is not the culprit; it's just the higher-level abstraction, as you already pointed out in the comment on the eval method where parTraverse is used.
What I suggest is that we cut it off at the source and not care about it anywhere else in our code base. Here is all that is needed for a fix: the result of each Task spawned for a call to eval cannot escape with an unhandled error.
// Collect errors from all parallel execution paths (pars)
terms.zipWithIndex.toVector
  .parTraverse {
    case (term, index) =>
      eval(term)(env, split(index))
        .map(_ => none[Throwable])
        // This prevents the Task from finishing with an error,
        // so a failed Task cannot occur as a result of parTraverse
        .handleError(_.some)
  }
  .map(_.flattenOption)
  .flatMap(aggregateEvaluatorErrors)
There is no need to guard any other function separately if we know the problem can happen only when a new Task is spawned or exits. We could also implement this logic in the task scheduler, which would do this for us, so that even callers of parTraverse would be shielded from this inconvenience.
Examples of this pattern exist in JS AggregateError and C# AggregateException.
In the test you are proposing, with many terms and COMM events, all parallel Tasks will use this code to handle all failed branches at any nesting level, and execution will continue after the last Task produces a result: None for success and Some(Throwable) for failure. So a child Task cannot exit with a failure, and all errors are aggregated and propagated to the parent.
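The collect-then-aggregate pattern can be tried in isolation with a minimal sketch. Several things here are assumptions for illustration: it uses standard-library `Future` instead of Monix `Task`, `CollectErrorsDemo`, `guarded`, `aggregate`, `evalAll`, and `runSync` are hypothetical names, and the rule "a single error is returned as-is, several are wrapped in AggregateError" is inferred from the PR description rather than copied from `aggregateEvaluatorErrors`.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object CollectErrorsDemo {
  // Hypothetical stand-ins for the interpreter's error types.
  case object OutOfPhlogistonsError extends Exception("out of phlogistons")
  final case class AggregateError(errors: Vector[Throwable])
      extends Exception(s"${errors.size} errors in parallel branches")

  // A guarded branch never fails: None means success, Some(e) means it failed with e.
  def guarded(branch: Future[Unit])(implicit ec: ExecutionContext): Future[Option[Throwable]] =
    branch.map(_ => Option.empty[Throwable]).recover { case NonFatal(e) => Some(e) }

  // Assumed aggregation rule: out-of-phlo wins, one error stays as-is,
  // several errors are wrapped together.
  def aggregate(errors: Vector[Throwable]): Either[Throwable, Unit] = errors match {
    case Vector()                                     => Right(())
    case errs if errs.contains(OutOfPhlogistonsError) => Left(OutOfPhlogistonsError)
    case Vector(single)                               => Left(single)
    case errs                                         => Left(AggregateError(errs))
  }

  // Waits for every branch, because no guarded branch can short-circuit the traversal.
  def evalAll(branches: Vector[Future[Unit]])(
      implicit ec: ExecutionContext
  ): Future[Either[Throwable, Unit]] =
    Future.traverse(branches)(b => guarded(b)).map(results => aggregate(results.flatten))

  // Convenience for trying the sketch out synchronously.
  def runSync(branches: Vector[Future[Unit]]): Either[Throwable, Unit] = {
    import ExecutionContext.Implicits.global
    Await.result(evalAll(branches), 5.seconds)
  }
}
```

Because every branch is converted to a successful `Option[Throwable]`, the traversal completes only after all branches have produced a result, so no error can surface after the parent has moved on.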
In support of this, I've found a bug, still undetected in your PR, where an error in user code does not consume all phlos in RuntimeManagerTest.
086b05c
to
bac9de9
Compare
@9rb What justification do you have for approving this PR? |
bac9de9
to
b8c8d2e
Compare
@JosDenmark I'm thinking about cancellation on the first error. It looks like that state is needed to store a flag that needs to be checked inside interpreter in the same way you call |
This PR fixes the bug with unhandled errors and removes ErrorLog, the global state in Runtime that holds all interpreter errors. The problem with Runtime state is that errors must be reset separately after each call to eval, and evaluations cannot run in parallel. @JosDenmark do you agree that we merge this PR and implement cancellation in a separate PR with the state I suggested? I will explore your idea of using the Parallel instance to hold state and reduce the number of places in the interpreter where it needs to be supplied.
bors r+ |
bors r+
Now that @JosephDenman agrees that Tomislav can implement the 'first error' piece separately, I'm assuming we can go ahead and merge this PR? Thanks
👎 Rejected by code reviews |
49fa285
to
a7d03a4
Compare
Based on Joe's thumbs-up on Tomislav's proposal to create the 'first error' piece separately, continuing with the merge of this PR.
bors r+ |
e35b1e5
to
bfd08d8
Compare
bors r+ |
bors r- |
bors ping |
pong |
bors r+ |
…f phlo (RCHAIN-3790)
bfd08d8
to
9a93cea
Compare
bors r+ |
Build succeeded
Ticket to implement cancellation on the first error.
Overview
The main purpose of this PR is to fix errors from parallel execution in the Rholang interpreter.
It also eliminates ErrorLog from Runtime, because errors are collected directly as execution results. Execution in each branch is gracefully handled, and errors are rethrown on the parent thread as AggregateError (if multiple errors exist).
JIRA ticket:
https://rchain.atlassian.net/browse/RCHAIN-4008
https://rchain.atlassian.net/browse/RCHAIN-736
Also enables the test disabled in this ticket:
https://rchain.atlassian.net/browse/RCHAIN-3790
Notes
Although this solution is satisfactory, it is not final. We should move this logic into the thread scheduler, which is responsible for spawning new threads (Tasks).
The motivation for removing global state from Runtime is to enable better concurrent execution.
Bors cheat-sheet:
bors r+ runs integration tests and merges the PR (if it's approved),
bors try runs integration tests for the PR,
bors delegate+ enables non-maintainer PR authors to run the above.