
RCHAIN-4008: fix errors from parallel execution, remove global ErrorLog #2853

Merged
merged 9 commits into dev on Jan 25, 2020

Conversation

@tgrospic (Collaborator) commented Jan 21, 2020

Overview

The main purpose of this PR is to fix errors from parallel execution in the Rholang interpreter.
It also eliminates ErrorLog from Runtime, because errors are now collected directly as execution results. Errors in each branch are gracefully handled and rethrown on the parent thread, as an AggregateError if multiple errors exist.

JIRA tickets:

https://rchain.atlassian.net/browse/RCHAIN-4008
https://rchain.atlassian.net/browse/RCHAIN-736
Also enables a test that was disabled in this ticket:
https://rchain.atlassian.net/browse/RCHAIN-3790

Notes

Although this solution is satisfactory, it is not final. We should move this logic into the thread scheduler, which is responsible for spawning new threads (Tasks).
The motivation for removing global state from Runtime is to enable better concurrent execution.

Bors cheat-sheet:

  • bors r+ runs integration tests and merges the PR (if it's approved),
  • bors try runs integration tests for the PR,
  • bors delegate+ enables non-maintainer PR authors to run the above.

@tgrospic force-pushed the RCHAIN-4008-error-log-remove branch 3 times, most recently from 3df63f1 to b1b03b0 on January 21, 2020 17:41
@JosephDenman (Collaborator)

I think this duplicates #2848 (review)

@marcin-rzeznicki (Contributor)

Isn't it essentially a duplication of @JosDenmark's PR #2848?

@tgrospic (Collaborator, Author)

@JosDenmark it's similar to your PR, but it removes the global error state from Runtime completely.

@JosephDenman (Collaborator)

@tgrospic You're still using raiseError to throw OutOfPhlogistonError.

@JosephDenman (Collaborator)

Cancellation is non-deterministic for parTraverse, so you don't know when a given process will be cancelled. This is a problem, since it leads to uncaught errors being thrown at unexpected times.

@JosephDenman (Collaborator)

It also looks like errors don't cross concurrent boundaries in your implementation. So, given P|Q, if P throws an error and is cancelled, Q is not cancelled.

@JosephDenman (Collaborator)

Pawel's comment was not in regard to global state, it was in regard to the this.synchronized calls.

@tgrospic force-pushed the RCHAIN-4008-error-log-remove branch 2 times, most recently from b71a0e3 to f698943 on January 21, 2020 22:20
@tgrospic (Collaborator, Author) commented Jan 21, 2020

@JosDenmark the main question is what the result of executing the Tasks spawned by the interpreter is. If execution finishes with an error (a thrown exception or an applicative error), it will go unhandled and the Task will blow up.
So all errors resulting from parTraverse execution must be handled so that each Task exits gracefully. After all child Tasks have finished, the collected errors are rethrown on the parent Task, and so on recursively, all flattened into an AggregateError.

The mechanism we have for cancellation is throwing OutOfPhlogistonsError, but more generally cancellation can be any exit from a Task. Because all errors are handled, there are no uncaught errors.
It would be even more correct to implement this in the task scheduler and spare the interpreter any inconvenience caused by the underlying execution model. It should then be easy to raiseError anywhere we need.

This is a very thin wrapper to handle and collect errors, and it doesn't try to dictate how errors are related. Our shared state (Cost) is in a sense a cancellation token, checked on every step of execution to see whether the branch should die. If we want to cancel all Tasks on the first error, we can do it the same way. I think you already suggested something in that direction, which seems very easy to add.
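The "cancellation token" role that Cost plays can be sketched in plain Scala. Everything below is illustrative: CostBudget, charge, and the error object are assumed names for this sketch, not the interpreter's actual API.

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative stand-in for the interpreter's out-of-phlo signal.
case object OutOfPhlogistonsError extends RuntimeException("out of phlo")

// A shared phlogiston budget: every evaluation step charges it, and once
// it is exhausted, every parallel branch sees the error on its next step.
// The shared counter thus doubles as a cancellation token.
final class CostBudget(initial: Long) {
  private val remaining = new AtomicLong(initial)

  def charge(amount: Long): Either[Throwable, Unit] =
    if (remaining.addAndGet(-amount) < 0) Left(OutOfPhlogistonsError)
    else Right(())
}

val budget = new CostBudget(10)
println(budget.charge(6)) // Right(()) -- budget still has phlo left
println(budget.charge(6)) // Left(...) -- budget is now exhausted
```

Once one branch drives the shared budget negative, every other branch gets Left(OutOfPhlogistonsError) on its next charge, which is exactly the checked-on-every-step cancellation described above.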

Do you think we benefit from global error state in Runtime? I have the impression from your and @marcin-rzeznicki's comments on your PR that you want to get rid of it. I was happy to see @EncodePanda's comments as well. Maybe he will have another déjà vu. ;)

@JosephDenman (Collaborator)

By the way, if you look at the last commit of the PR I issued, you'll see why the casper tests are failing. It's a bug in RhoSpecContract.

// Out Of Phlogiston error is always single
// - if one execution path is out of phlo, the whole evaluation is also
case errList if errList.contains(OutOfPhlogistonsError) =>
  OutOfPhlogistonsError.raiseError[M, Unit]
Review comment from a Collaborator:

These raiseError calls are the problem. The semantics of parTraverse when raiseError is called in an execution branch is not "wait for all threads running in parallel with this one to cancel before returning an error." It is, "run a cancellation process (Task) concurrently with the currently executing branches and return error immediately." In other words, parTraverse can return with an error before all threads are cancelled. Since it can return early, we can move past the point in the call sequence where we're able to handle errors.

The reason that we were getting an OOPE that would cause node to hang is because, given two branches P | Q, if P ran out of gas, it would throw an OOPE, but the result of parTraverse would return before Q was cancelled. As we exited the interpreter (and all opportunities for handling interpreter errors had passed), Q was still uncancelled. Then, Q would make another call to charge and then throw another OOPE. However, by the time the second OOPE was thrown, we'd already exited the interpreter, so instead of being caught and handled, it became an uncaught exception, and was reported to the uncaught exception handler all the way at the top of node. We'd then try to reprocess the block, and the cycle continued.
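The early-return behaviour described above can be reproduced with plain Scala Futures which, like the branches in this scenario, are not cancelled when a sibling fails. This is a self-contained sketch, not the PR's code: the real interpreter uses Monix Task and cats' parTraverse.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Did the slow sibling branch keep running after the combined
// computation had already failed?
val siblingStillRan = new AtomicBoolean(false)

// P fails immediately (standing in for an OutOfPhlogistonsError) ...
val p = Future[Unit] { throw new RuntimeException("simulated OOPE") }
// ... while Q, its parallel sibling, is still busy.
val q = Future { Thread.sleep(300); siblingStillRan.set(true) }

// Future.sequence fails as soon as P fails; Q is neither cancelled
// nor awaited, mirroring how parTraverse can return an error early.
val failedEarly =
  try { Await.result(Future.sequence(List(p, q)), 2.seconds); false }
  catch { case _: RuntimeException => true }

Thread.sleep(500) // give Q time to finish, showing it was never cancelled
println(s"failedEarly=$failedEarly, siblingStillRan=${siblingStillRan.get}")
```

By the time the combined result has failed, Q is still running; in the node, that uncancelled Q is what later threw the second, uncaught OOPE.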

Review comment from a Collaborator:

You'll be able to see this if you run the first test in InterpreterSpec with a large number of terms in parallel with many COMM events and a single error-throwing process on at least the second level (inside a continuation). I suggest you do that before you continue work on this PR, and before you even do that, please review the PR I opened so that it can be merged. If it turns out you're right, you can always replace what I've written with what you've written.

Review reply from the author (Collaborator):

@JosDenmark I completely agree with you. If failed Tasks created by parTraverse are not handled, many undesirable scenarios can occur.
But parTraverse is not the culprit; it's just the higher-level abstraction, as you already pointed out in the comment on the eval method where parTraverse is used.

What I suggest is that we cut this off at the source and not worry about it anywhere else in our code base. Here is all that is needed for the fix: the result of each Task spawned for each call to eval cannot escape with an unhandled error.

// Collect errors from all parallel execution paths (pars)
terms.zipWithIndex.toVector
  .parTraverse {
    case (term, index) =>
      eval(term)(env, split(index))
        .map(_ => none[Throwable])
        // This prevents the Task from finishing with an error,
        // so a failed Task cannot occur as a result of parTraverse
        .handleError(_.some)
  }
  .map(_.flattenOption)
  .flatMap(aggregateEvaluatorErrors)

There is no need to guard any other function separately if we know the problem can only occur when a new Task is spawned or exits. We could also implement this logic in the task scheduler, which would handle it for us, making even the use of parTraverse opaque to this inconvenience.
Examples of this pattern exist in JS's AggregateError and C#'s AggregateException.
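For illustration, the aggregateEvaluatorErrors step referenced in the snippet above might look roughly like the following. This is a hypothetical sketch: the PR's actual AggregateError definition differs, and Either stands in here for the effect type M's error channel.

```scala
// Hypothetical shape of the aggregate error; the PR's real
// AggregateError lives in the rchain codebase and may differ.
final case class AggregateError(errors: Vector[Throwable])
    extends RuntimeException(s"${errors.size} parallel branches failed")

// Collapse the errors collected from all parallel branches:
// none -> success, one -> rethrow as-is, many -> aggregate.
def aggregateEvaluatorErrors(errors: Vector[Throwable]): Either[Throwable, Unit] =
  errors match {
    case Vector()       => Right(())
    case Vector(single) => Left(single)
    case multiple       => Left(AggregateError(multiple))
  }
```

Keeping the single-error case unwrapped preserves existing error matching (e.g. on OutOfPhlogistonsError), while multiple failures surface together, as with JS's AggregateError.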

In the test you are proposing, with many terms and COMM events, all parallel Tasks will use this code to handle failed branches at any nesting level, and execution will continue after the last Task produces its result: None for success and Some(Throwable) for failure. So a child Task cannot exit with a failure, and all errors are aggregated and propagated to the parent.

In support of this, I've found a bug in RuntimeManagerTest that is still undetected in your PR, where an error in user code does not consume all phlos.

@JosephDenman (Collaborator)

@9rb What justification do you have for approving this PR?

@tgrospic (Collaborator, Author)

@JosDenmark I'm thinking about cancellation on the first error. It looks like state is needed to store a flag that is checked inside the interpreter, the same way you call runIfNoErrorsAndReport. What I'm suggesting is that we take the raiseError handling from my PR and the cancellation logic from your PR. You don't need to store the error reporter together with an error, only a flag indicating whether execution should be cancelled. Does this sound reasonable to you?

@tgrospic (Collaborator, Author)

This PR fixes the bug with unhandled errors and removes ErrorLog, the global state in Runtime that held all interpreter errors. The problem with the Runtime state is that errors must be reset separately after each call to eval, and calls cannot run in parallel.
But we do need some shared state, because we want to stop the execution of all parallel tasks after the first error. In his PR #2848, Joe uses modified Runtime state to record the first error and notify the other tasks. I would like to include this, but without shared Runtime state. If the state lives at the interpreter level, each eval call can handle cancellation by itself and also run independently.

@JosDenmark do you agree that we merge this PR and implement cancellation in a separate PR with the state I suggested? I will explore your idea of using the Parallel instance to hold state, to reduce the number of places in the interpreter where it needs to be supplied.

@9rb (Collaborator) commented Jan 24, 2020

bors r+

@9rb (Collaborator) commented Jan 24, 2020

bors r+

@9rb What justification do you have for approving this PR?

Now that @JosephDenman agrees that Tomislav can implement the 'first error' piece separately, I'm assuming we can go ahead and merge this PR? Thanks

bors bot (Contributor) commented Jan 24, 2020

👎 Rejected by code reviews

@9rb 9rb dismissed JosephDenman’s stale review January 24, 2020 20:30

Based on Joe's ThumbsUp to Tomislav's proposal to create the 'first error' piece separately, continuing with merge of this PR

@9rb (Collaborator) commented Jan 24, 2020

bors r+

2 similar comments
@tgrospic (Collaborator, Author)

bors r+

@9rb (Collaborator) commented Jan 25, 2020

bors r+

@tgrospic (Collaborator, Author)

bors r+

1 similar comment
@woky (Contributor) commented Jan 25, 2020

bors r+

@woky (Contributor) commented Jan 25, 2020

bors r-

@woky (Contributor) commented Jan 25, 2020

bors ping

bors bot (Contributor) commented Jan 25, 2020

pong

@woky (Contributor) commented Jan 25, 2020

bors r+

@tgrospic (Collaborator, Author)

bors r+

bors bot added a commit that referenced this pull request Jan 25, 2020
2853: RCHAIN-4008: fix errors from parallel execution, remove global ErrorLog r=tgrospic a=tgrospic



Co-authored-by: Tomislav Grospic <grospic@gmail.com>
bors bot (Contributor) commented Jan 25, 2020

Build succeeded

  • bors build finished

@bors bors bot merged commit 9a93cea into rchain:dev Jan 25, 2020
@tgrospic (Collaborator, Author)

Ticket for implementing cancellation on the first error:
https://rchain.atlassian.net/browse/RCHAIN-4011
