Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to report failure #21

Closed
jordaaash opened this issue Mar 24, 2017 · 13 comments
Closed

How to report failure #21

jordaaash opened this issue Mar 24, 2017 · 13 comments

Comments

@jordaaash
Copy link
Contributor

If a job callback fails asynchronously, done() doesn't appear to take an argument in typical node style. I read through #3 and my question is the same as OP's first:

(1) How does a job consumer report failure, such as to requeue the job for another attempt later?

I'm not necessarily looking to retry the failed job, but I'd like to tell Boss that the job failed as soon as it does fail, rather than waiting for the timeout. Is this possible?

If not, would you be open to a PR accepting an error argument to the done() call? I would think this would be non-breaking.

@timgit
Copy link
Owner

timgit commented Mar 24, 2017

tell Boss that the job failed as soon as it does fail, rather than waiting for the timeout.

Can you elaborate on what problem this would solve? Is it a concern over pg-boss "waiting"?

@jordaaash
Copy link
Contributor Author

jordaaash commented Mar 24, 2017

If the job has already failed and stopped running, several things. If the job is a singleton, a new job can't be run until the job is known to have failed. Additionally, since the process running the jobs may be interested in logging errors, and optionally retrying certain jobs, with or without delay/backoff, it's useful to know accurately when (and why) they've failed. The reality is simply that the job didn't time out after 30 seconds if it failed and stopped running after 1 second. It also makes debugging simpler (as failed jobs right now could have failed for any reason at any time prior to their timeout expiring).

@jordaaash
Copy link
Contributor Author

I've only read through about half the source and tests so far, so if there's a way that I'm not seeing to do this through events or something else, please let me know!

@timgit
Copy link
Owner

timgit commented Mar 24, 2017

If the job is a singleton, a new job can't be run until the job is known to have failed.

Good point on the singleton jobs. Agreed.

since the process running the jobs may be interested in logging errors

Would you clarify the intent here? Which process? Do you want a done(error) callback to then round trip back into the same process via the 'error' event? I would think that if you have the error with which to pass to done(), you'd just log/report it right there.

@jordaaash
Copy link
Contributor Author

So in my case, I largely treat error logging as a cross-cutting concern that the script running in node is responsible for at a high level, rather than having error logging handled closer to where the error is generated.

This means that the runner that's calling done() when a job completes could pass an error without needing to log that error itself (or be aware if/how the error is being logged).

I realize this may not be a general use case, but it seems that there isn't a way to do it currently, so I have logging located at a lower level now.

On the plus side, using done(error) would also allow me to have my jobs (which return Bluebird promises) just call .asCallback(done) :)

@timgit
Copy link
Owner

timgit commented Mar 24, 2017

Thanks for the clarification. In your case, would you want pg-boss to emit 'error' if you pass done(error)?

@jordaaash
Copy link
Contributor Author

I think I'd want to mark the job as failed, rather than expired, and emit the error on the manager (and thus on the boss instance). Does this sound right?

@timgit
Copy link
Owner

timgit commented Mar 25, 2017

Yes, that sounds right. I am considering the emitting of 'error' on the instance via a done(error) should be an opt-in config in the constructor options, though. That coolio?

@jordaaash
Copy link
Contributor Author

Yeah, I don't see any practical problem with it being opt-in. Would you mind sharing your thought process with me on that?

@timgit
Copy link
Owner

timgit commented Mar 26, 2017

The current behavior of 'error' is unexpected or unhandled errors and should rarely occur. Leaving this behavior as-is maintains the existing semantics of 'error' with node EventEmitters (they will crash hard if not handled).

If an EventEmitter does not have at least one listener registered for the 'error' event, and an 'error' event is emitted, the error is thrown, a stack trace is printed, and the Node.js process exits.

Even though I'm pushing for peeps to handle 'error', it's quite possible that some are not going to do it. This leaves open a normative and common use case of an error in a handler that could crash the entire process. In fact, the more I think about this, the more I'm convinced that manager should emit a distinct event instead of 'error' to maintain the semantic difference between "oh no something unexpected occurred that is worthy of crashing your process" vs. "this happened and I'd like you to log it". And this would even apply to the current error handling I recently shipped with 1.0 which I should migrate off of 'error'.

By the way, using done() in this fashion is a convenience batching feature. Since it needs to update the database then emit an event, it could throw if the database becomes unreachable for any reason. You can always add a catch() on done() to mitigate this, but just keep that in mind since you may lose your original error that you wanted logged.

I was thinking of naming this other event 'warn' along the lines of popular error severity levels but I'm open to other names since I feel like I can't ever name things properly the first time around. 😱

@jordaaash
Copy link
Contributor Author

Got it, I think that makes a lot of sense. This way, error is a problem with the library, or something else serious, and the new event is for something perhaps more ordinary or recoverable.

So, what about a reference to the job itself? A failed event strikes me as the most natural option given the current event API (and could have a matching job state). It also matches the semantics of several other popular Node job queues (kue, bull, bee-queue).

@timgit
Copy link
Owner

timgit commented Mar 27, 2017

Added this in 1.1.0

@timgit timgit closed this as completed Mar 27, 2017
@jordaaash
Copy link
Contributor Author

Woah, very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants