
RCHAIN-4008: fix errors from parallel execution, remove global ErrorLog #2853

Merged
merged 9 commits into dev on Jan 25, 2020

Conversation

@tgrospic (Collaborator) commented Jan 21, 2020

Overview

The main purpose of this PR is to fix errors from parallel execution in the Rholang interpreter.
It also eliminates ErrorLog from Runtime, because errors are now collected directly as execution results. Errors in each branch are gracefully handled and rethrown on the parent thread, as an AggregateError if multiple errors exist.

JIRA tickets:

https://rchain.atlassian.net/browse/RCHAIN-4008
https://rchain.atlassian.net/browse/RCHAIN-736
Also enables a test that was disabled in this ticket:
https://rchain.atlassian.net/browse/RCHAIN-3790

Notes

Although this solution is satisfactory, it is not final. We should move this logic into the thread scheduler, which is responsible for spawning new threads (Tasks).
The motivation for removing global state from Runtime is to enable better concurrent execution.

Bors cheat-sheet:

  • bors r+ runs integration tests and merges the PR (if it's approved),
  • bors try runs integration tests for the PR,
  • bors delegate+ enables non-maintainer PR authors to run the above.

@tgrospic force-pushed the RCHAIN-4008-error-log-remove branch 3 times, most recently from 3df63f1 to b1b03b0 on January 21, 2020 17:41
@JosephDenman (Collaborator)

I think this duplicates #2848 (review)

@marcin-rzeznicki (Contributor)

Isn't it essentially a duplication of @JosDenmark's PR #2848?

@tgrospic (Collaborator, Author)

@JosDenmark it's similar to your PR, but it removes the global error state from Runtime completely.

@JosephDenman (Collaborator)

@tgrospic You're still using raiseError to throw OutOfPhlogistonError.

@JosephDenman (Collaborator)

Cancellation is non-deterministic for parTraverse, so you don't know when a given process will be cancelled. This is a problem, since it leads to uncaught errors being thrown at unexpected times.

@JosephDenman (Collaborator)

It also looks like errors don't cross concurrent boundaries in your implementation. So, given P|Q, if P throws an error and is cancelled, Q is not cancelled.

@JosephDenman (Collaborator)

Pawel's comment was not in regard to global state, it was in regard to the this.synchronized calls.

@tgrospic force-pushed the RCHAIN-4008-error-log-remove branch 2 times, most recently from b71a0e3 to f698943 on January 21, 2020 22:20
@tgrospic (Collaborator, Author) commented Jan 21, 2020

@JosDenmark the main question is what the result of executing the Tasks spawned by the interpreter is. If execution finishes with an error (a thrown exception or an applicative error), it will go unhandled and the Task will blow up.
So all errors resulting from parTraverse execution must be handled so that each Task exits gracefully. After all child Tasks have finished, the collected errors are rethrown on the parent Task, and so on recursively, all flattened into an AggregateError.

The mechanism we have for cancellation is throwing OutOfPhlogistonsError, but more generally cancellation can be any exit from a Task. Because all errors are handled, there are no uncaught errors.
It would be even more correct to implement this in the task scheduler and spare the interpreter any inconvenience caused by the underlying execution model. It should then be easy to raiseError anywhere we need.

This is a very thin wrapper to handle and collect errors, and it doesn't try to dictate how errors are related. Our shared state (Cost) is in a sense a cancellation token, checked on every step of execution to see whether the branch should die. If we want to cancel all Tasks on the first error, we can do it the same way. I think you already suggested something in that direction, which seems very easy to add.
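The "cancellation token" role that Cost plays can be sketched in plain Scala. Everything below is illustrative: CostBudget, charge, and the error object are assumed names for this sketch, not the interpreter's actual API.

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative stand-in for the interpreter's out-of-phlo signal.
case object OutOfPhlogistonsError extends RuntimeException("out of phlo")

// A shared phlogiston budget: every evaluation step charges it, and once
// it is exhausted, every parallel branch sees the error on its next step.
// The shared counter thus doubles as a cancellation token.
final class CostBudget(initial: Long) {
  private val remaining = new AtomicLong(initial)

  def charge(amount: Long): Either[Throwable, Unit] =
    if (remaining.addAndGet(-amount) < 0) Left(OutOfPhlogistonsError)
    else Right(())
}

val budget = new CostBudget(10)
println(budget.charge(6)) // Right(()) -- budget still has phlo left
println(budget.charge(6)) // Left(...) -- budget is now exhausted
```

Once one branch drives the shared budget negative, every other branch gets Left(OutOfPhlogistonsError) on its next charge, which is exactly the checked-on-every-step cancellation described above.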

Do you think we benefit from global error state in Runtime? I have the impression from your and @marcin-rzeznicki's comments on your PR that you want to get rid of it. I was happy to see @EncodePanda's comments as well. Maybe he will have another déjà vu. ;)

@JosephDenman (Collaborator)

By the way, if you look at the last commit of the PR I issued, you'll see why the casper tests are failing. It's a bug in RhoSpecContract.

// Out Of Phlogiston error is always single
// - if one execution path is out of phlo, the whole evaluation is also
case errList if errList.contains(OutOfPhlogistonsError) =>
  OutOfPhlogistonsError.raiseError[M, Unit]
Review comment from a Collaborator:

These raiseError calls are the problem. The semantics of parTraverse when raiseError is called in an execution branch is not "wait for all threads running in parallel with this one to cancel before returning an error." It is, "run a cancellation process (Task) concurrently with the currently executing branches and return error immediately." In other words, parTraverse can return with an error before all threads are cancelled. Since it can return early, we can move past the point in the call sequence where we're able to handle errors.

The reason that we were getting an OOPE that would cause node to hang is because, given two branches P | Q, if P ran out of gas, it would throw an OOPE, but the result of parTraverse would return before Q was cancelled. As we exited the interpreter (and all opportunities for handling interpreter errors had passed), Q was still uncancelled. Then, Q would make another call to charge and then throw another OOPE. However, by the time the second OOPE was thrown, we'd already exited the interpreter, so instead of being caught and handled, it became an uncaught exception, and was reported to the uncaught exception handler all the way at the top of node. We'd then try to reprocess the block, and the cycle continued.
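The early-return behaviour described above can be reproduced with plain Scala Futures which, like the branches in this scenario, are not cancelled when a sibling fails. This is a self-contained sketch, not the PR's code: the real interpreter uses Monix Task and cats' parTraverse.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Did the slow sibling branch keep running after the combined
// computation had already failed?
val siblingStillRan = new AtomicBoolean(false)

// P fails immediately (standing in for an OutOfPhlogistonsError) ...
val p = Future[Unit] { throw new RuntimeException("simulated OOPE") }
// ... while Q, its parallel sibling, is still busy.
val q = Future { Thread.sleep(300); siblingStillRan.set(true) }

// Future.sequence fails as soon as P fails; Q is neither cancelled
// nor awaited, mirroring how parTraverse can return an error early.
val failedEarly =
  try { Await.result(Future.sequence(List(p, q)), 2.seconds); false }
  catch { case _: RuntimeException => true }

Thread.sleep(500) // give Q time to finish, showing it was never cancelled
println(s"failedEarly=$failedEarly, siblingStillRan=${siblingStillRan.get}")
```

By the time the combined result has failed, Q is still running; in the node, that uncancelled Q is what later threw the second, uncaught OOPE.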

Review comment from a Collaborator:

You'll be able to see this if you run the first test in InterpreterSpec with a large number of terms in parallel with many COMM events and a single error-throwing process on at least the second level (inside a continuation). I suggest you do that before you continue work on this PR, and before you even do that, please review the PR I opened so that it can be merged. If it turns out you're right, you can always replace what I've written with what you've written.

Review reply from the author (Collaborator):

@JosDenmark I completely agree with you. If failed Tasks created by parTraverse are not handled, many undesirable scenarios can occur.
But parTraverse is not the culprit; it's just the higher-level abstraction, as you already pointed out in the comment on the eval method where parTraverse is used.

What I suggest is that we cut this off at the source and not worry about it anywhere else in our code base. Here is all that is needed for the fix: the result of each Task spawned for each call to eval cannot escape with an unhandled error.

// Collect errors from all parallel execution paths (pars)
terms.zipWithIndex.toVector
  .parTraverse {
    case (term, index) =>
      eval(term)(env, split(index))
        .map(_ => none[Throwable])
        // This prevents the Task from finishing with an error,
        // so a failed Task cannot occur as a result of parTraverse
        .handleError(_.some)
  }
  .map(_.flattenOption)
  .flatMap(aggregateEvaluatorErrors)

There is no need to guard any other function separately if we know the problem can only occur when a new Task is spawned or exits. We could also implement this logic in the task scheduler, which would handle it for us, making even the use of parTraverse opaque to this inconvenience.
Examples of this pattern exist in JS's AggregateError and C#'s AggregateException.
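For illustration, the aggregateEvaluatorErrors step referenced in the snippet above might look roughly like the following. This is a hypothetical sketch: the PR's actual AggregateError definition differs, and Either stands in here for the effect type M's error channel.

```scala
// Hypothetical shape of the aggregate error; the PR's real
// AggregateError lives in the rchain codebase and may differ.
final case class AggregateError(errors: Vector[Throwable])
    extends RuntimeException(s"${errors.size} parallel branches failed")

// Collapse the errors collected from all parallel branches:
// none -> success, one -> rethrow as-is, many -> aggregate.
def aggregateEvaluatorErrors(errors: Vector[Throwable]): Either[Throwable, Unit] =
  errors match {
    case Vector()       => Right(())
    case Vector(single) => Left(single)
    case multiple       => Left(AggregateError(multiple))
  }
```

Keeping the single-error case unwrapped preserves existing error matching (e.g. on OutOfPhlogistonsError), while multiple failures surface together, as with JS's AggregateError.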

In the test you are proposing, with many terms and COMM events, all parallel Tasks will use this code to handle failed branches at any nesting level, and execution will continue after the last Task produces its result: None for success and Some(Throwable) for failure. So a child Task cannot exit with a failure, and all errors are aggregated and propagated to the parent.

In support of this, I've found a bug in RuntimeManagerTest that is still undetected in your PR, where an error in user code does not consume all phlos.

@JosephDenman (Collaborator)

@9rb What justification do you have for approving this PR?

@tgrospic (Collaborator, Author)

@JosDenmark I'm thinking about cancellation on the first error. It looks like state is needed to store a flag that is checked inside the interpreter, the same way you call runIfNoErrorsAndReport. What I'm suggesting is that we take the raiseError handling from my PR and the cancellation logic from your PR. You don't need to store the error reporter together with an error, only a flag indicating whether execution should be cancelled. Does this sound reasonable to you?

@tgrospic (Collaborator, Author)

This PR fixes the bug with unhandled errors and removes ErrorLog, the global state in Runtime that held all interpreter errors. The problem with the Runtime state is that errors must be reset separately after each call to eval, and calls cannot run in parallel.
But we do need some shared state, because we want to stop the execution of all parallel tasks after the first error. In his PR #2848, Joe uses modified Runtime state to record the first error and notify the other tasks. I would like to include this, but without shared Runtime state. If the state lives at the interpreter level, each eval call can handle cancellation by itself and also run independently.

@JosDenmark do you agree that we merge this PR and implement cancellation in a separate PR with the state I suggested? I will explore your idea of using the Parallel instance to hold state, to reduce the number of places in the interpreter where it needs to be supplied.

@9rb (Collaborator) commented Jan 24, 2020

bors r+

@9rb (Collaborator) commented Jan 24, 2020

bors r+

@9rb What justification do you have for approving this PR?

Now that @JosephDenman agrees that Tomislav can implement the 'first error' piece separately, I'm assuming we can go ahead and merge this PR? Thanks

bors bot (Contributor) commented Jan 24, 2020

👎 Rejected by code reviews

@9rb 9rb dismissed JosephDenman’s stale review January 24, 2020 20:30

Based on Joe's ThumbsUp to Tomislav's proposal to create the 'first error' piece separately, continuing with merge of this PR

@9rb (Collaborator) commented Jan 24, 2020

bors r+

2 similar comments
@tgrospic (Collaborator, Author)

bors r+

@9rb (Collaborator) commented Jan 25, 2020

bors r+

@tgrospic (Collaborator, Author)

bors r+

1 similar comment
@woky (Contributor) commented Jan 25, 2020

bors r+

@woky (Contributor) commented Jan 25, 2020

bors r-

@woky (Contributor) commented Jan 25, 2020

bors ping

bors bot (Contributor) commented Jan 25, 2020

pong

@woky (Contributor) commented Jan 25, 2020

bors r+

@tgrospic (Collaborator, Author)

bors r+

bors bot added a commit that referenced this pull request Jan 25, 2020
2853: RCHAIN-4008: fix errors from parallel execution, remove global ErrorLog r=tgrospic a=tgrospic



Co-authored-by: Tomislav Grospic <grospic@gmail.com>
bors bot (Contributor) commented Jan 25, 2020

Build succeeded

  • bors build finished

@bors bors bot merged commit 9a93cea into rchain:dev Jan 25, 2020
@tgrospic (Collaborator, Author)

Ticket for implementing cancellation on the first error:
https://rchain.atlassian.net/browse/RCHAIN-4011
