Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include REEF container exceptions in Myria query response #899

Merged
merged 2 commits into from
Jul 12, 2017

Conversation

senderista
Copy link
Contributor

This change ensures that all REEF Evaluator/Context/Task failure exceptions will be properly rethrown by the coordinator and show up in the query response. Unfortunately, OOM failures that result from the container being forcibly terminated by the YARN NodeManager rather than a Java OOM exception are not attributed clearly, but this is still an improvement.

Example of an evaluator that was forcibly terminated for exceeding its memory quota:

edu.washington.escience.myria.DbException: Query #9.0 failed: org.apache.reef.exception.EvaluatorException: Evaluator [container_1497062636339_0001_01_000004] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [FAILED]. This means that the Evaluator failed but wasn't able to send an error message back to the driver. Task [1] was running when the Evaluator crashed.
	at edu.washington.escience.myria.parallel.MasterSubQuery$WorkerExecutionInfo$2.operationComplete(MasterSubQuery.java:119)

Example of an evaluator that triggered a Java OOM exception:

Suppressed: edu.washington.escience.myria.DbException: Worker #2 failed: java.lang.OutOfMemoryError: Java heap space
		... 20 more
	Caused by: edu.washington.escience.myria.parallel.WorkerFailedException: java.lang.OutOfMemoryError: Java heap space
		... 20 more
	Caused by: java.lang.OutOfMemoryError: Java heap space
		at java.lang.String.toCharArray(String.java:2899)

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 26.708% when pulling 761d849 on error_reporting into 07ceb18 on master.

@senderista senderista self-assigned this Jun 12, 2017
Copy link
Contributor

@jingjingwang jingjingwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only trivial comments.

*/
protected synchronized void workerDied(final int workerId) {
protected synchronized void workerDied(final int workerId, final Throwable cause) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add @nullable as in other places?

/**
* @param msg the message describing this Exception.
* */
public WorkerFailedException(final String msg) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's never used.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 26.736% when pulling 84e0554 on error_reporting into 07ceb18 on master.

@senderista senderista merged commit d4fdb71 into master Jul 12, 2017
@senderista senderista deleted the error_reporting branch July 12, 2017 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants