
Error Handling Rethink #398

Open
mcdonnelldean opened this issue Apr 12, 2016 · 52 comments

@mcdonnelldean
Contributor

Error handling needs a rethink, my thoughts here,

  • err is not always respected; many types of errors instead cause timeouts
  • Lots of options that do sort of the same thing: errhandler, .error, seneca.on('error') (see the sketch after this list)
  • Cryptic error messages, fed by logging
  • Greedy handling; testing with an exception-assertion-based lib requires weird hacks
  • Exception trapping: are we being too cautious?
  • Cryptic gating: errors from the gate executor need to be clearer
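
To make the overlap concrete, here is a rough sketch of how the hooks named above can end up wired side by side in a single service. The option and event names come from the list; the exact signatures (and the behaviour of returning from errhandler) are assumptions, not a reference:

const Seneca = require('seneca')

// 1. option-level handler passed at instantiation
const seneca = Seneca({
  errhandler: function (err) {
    console.error('errhandler saw:', err.message)
    return true // assumed: a truthy return marks the error as handled
  }
})

// 2. emitter-style handler
seneca.on('error', function (err) {
  console.error("on('error') saw:", err.message)
})

// 3. per-call callback err -- as noted above, not always respected;
//    some failures surface only as gate-executor timeouts instead
seneca.act('role:example,cmd:missing', function (err, out) {
  if (err) console.error('callback saw:', err.message)
})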

A lot of why error handling is bad is down to cryptic messages. We need to rethink the sort of messages we are returning. Gitter is littered with questions about gate-executor timeouts because the messages are basically unreadable if you don't know the code.

We are also way too safe in my opinion. We shouldn't be trapping errors; the system ends up in very peculiar states. I'm a firm believer in fail fast and restart. From Gitter and the issues it does seem people's expectation is fail fast; we really need to consider this.

Bear in mind, our own tests are in the state they are in due to weird error handling. If we had untrapped exceptions and proper err flow we could cut our test suite in half and end up with more coverage.

Contributors and users, please feel free to add pain points and constructive feedback.

@AdrianRossouw
Contributor

@rjrodger what release did you want to tackle error handling/logging for?

@mcdonnelldean
Contributor Author

@AdrianRossouw It'll be major for sure.

@Wardormeur
Member

We're not on the latest because of, well, time and priority. Also, some of my remarks may be related to how our stack is built, so they may not be relevant.
After two months of dev, being able to understand from WHERE a call was made before it failed (when calls cross microservices) is crucial for debugging.
Also, yes, the truncated messages are... not helpful for reproducing or understanding the call context.
Or timeout issues that are triggered when the real problem happened two minutes earlier, but the whole call stack (async.each or equivalent) fails afterwards by "timing out".
Gateway errors as you report them @mcdonnelldean are often a misimplementation of an act from what I've been experiencing, but they could certainly be more descriptive.
On the production side, our concern is not having to dig through the logs ourselves, but having Logentries warn us, which is another concern: will Logentries be able to parse our logs? Or can we use vidi to build a visualisation of errors? ;)

@mcollina
Contributor

I would like a major bump to remove the timeout on messages, and possibly add it back as a "filter" or plugin for some very specific microservices. I think the timeout is awesome for entry-point microservices, and maybe for Chairo.

@mcdonnelldean
Contributor Author

@mcollina Do you want seneca to hang in the case of a block at the other side? I'm trying to work out how we could do this and want to be as clear as possible.

@mcdonnelldean
Contributor Author

Just a heads up @Wardormeur manages CoderDojo's Zen platform.

@mcdonnelldean
Contributor Author

Added related issues.

@AdrianRossouw
Contributor

The timeouts were responsible for cascading failures at higher loads on a system that I worked on before.

Poor design decisions around how entities were used meant that each entity load / list would fan out into multiple actions loading more entities, until eventually the node callback queue would get so long that the time it takes to respond to an action would surpass the hardcoded seneca message timeout.

So instead of things failing reasonably (as in, one can reason about it) in one place as Matteo suggests, it would fail all over the place in a way that had no useful debugging info.

mcollina added a commit that referenced this issue Apr 21, 2016
@rjrodger
Collaborator

Please also note: http://senecajs.org/contribute/principles.html - Continuity

Error handling rethink != major breakage is acceptable.

The challenge here is not just to rethink the error handling, but also to ensure you have a backwards compatibility path. That means code that depends on the old way still needs to work, by using an option, or a plugin.

The rule for major releases is: one line of code (typically a new plugin) is all you need to make existing code work. It is also acceptable to have an option flag to revert to previous behaviour.

It is an absolutely core value of the Seneca community that existing users are respected and that we do not break existing apps.

This does make the work of fixing error handling harder, I know :)

/cc @mcdonnelldean @AdrianRossouw @mcollina @davidmarkclements @pelger @geek

@davidmarkclements
Contributor

davidmarkclements commented Apr 22, 2016

@rjrodger - I wonder if, for each area, can we have a two step approach?

Step 1 - unencumbered reimagining

here thoughts can run free, protocols/implementations do not consider backwards compatibility

this would be "seneca-next-next" ... postmodern seneca

Step 2 - shimming

Once we have nailed down "the best way", we then create an adapter that delivers backwards compatibility.

If necessary, tweaks may then be made to the pristine implementation, but only in support of the shim.

this would be "seneca-next"

@mcollina
Contributor

I tend to agree with @davidmarkclements, with a catch.
I think we should try to implement the new features without breaking anything, but not be afraid to break the current API if we see no alternative. If such an event happens, we go and seek a solution (maybe one more module). So this becomes a two-step problem, which is easier to solve.

@mcdonnelldean
Contributor Author

I agree with the principle, but we don't have the time or resources for the above. Realistically what is going to happen is we bludgeon our way through this until we have something smaller and more composable, and around version 6 we can start blue-sky thinking.

We need to be a little more careful here. For instance @Wardormeur has a fairly big system (CoderDojo) that we will need to run some tests against to ensure we break as little as possible.

Basically it must be reasonably backwards compatible within a version. We cannot expect a user to have to spend two days changing code to make v.next work; that's just not going to fly with customers who have built big systems.

@davidmarkclements
Contributor

davidmarkclements commented Apr 26, 2016

To be clear, I'm saying we supply our shims wholesale - v.next is v.next.next + shim

The judgement call is on whether it will save more time and resources in the long run to follow the ideal implementation and work backwards to a shimmed version, or to deal with the backwards-compat overhead and strive towards the eventual rainbow of perfection.

@StarpTech
Contributor

@mcdonnelldean in my opinion error handling is a big piece of the picture and needs good improvements. Are you planning to take it on in the next roadmap?

@mcdonnelldean
Contributor Author

@StarpTech I don't think there is anyone here who doesn't agree that error handling needs to be sorted, hence this issue. Seneca 3.0 is just out, and we have an obligation to folks to ensure current modules are in working order; standard practice for any toolkit this big.

Once this work is done we will re-examine the roadmap.

@StarpTech
Contributor

Hi @mcdonnelldean, I only asked because I can't find it in 4.0, 5.0...

@mcdonnelldean
Contributor Author

That's because it hasn't been planned yet. As mentioned already, we are still bedding down 3.0 and making sure everything is OK there. After that we will look at what's next for further versions.

Kindest Regards,

Dean


@StarpTech
Contributor

Then I don't understand the prioritisation, but thanks for the clarification.

@StarpTech
Contributor

Any updates? No progress in two months.

@StarpTech
Contributor

@mcdonnelldean @rjrodger anything?

@StarpTech
Contributor

This is not the way to go. There are lots of people who have issues with this and they don't know where things are going. No comments, no roadmap, and too little documentation to work with.
I created https://github.com/hemerajs/hemera and it solves all my problems.

@JoshDobbin

Thanks for making this amazing framework. I found it an odd choice to make errors passed as the first argument of the callback fatal, but I didn't take the time to write and share an awesome framework, so I can't complain.

My solution was to create a wrapper function for Seneca that can be used everywhere, and then include this line inside the wrapper function somewhere after Seneca has been instantiated:

seneca.fixedargs.fatal$ = false;
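
A rough sketch of what such a wrapper might look like; the module layout and factory name are hypothetical, and only the fixedargs line comes from this comment:

// hypothetical wrapper module, e.g. lib/create-seneca.js
const Seneca = require('seneca')

module.exports = function createSeneca (options) {
  const seneca = Seneca(options)
  // stop errors passed to action callbacks from being treated as fatal
  seneca.fixedargs.fatal$ = false
  return seneca
}

Every service then requires the wrapper instead of seneca directly, so the flag is applied consistently.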

It still leaves me with the problem of determining how to handle fatal errors, but this way my whole team doesn't have to be retrained on a new way to utilize callbacks that doesn't follow the norms for other libraries.

Again thanks. Great framework.

@tswaters
Member

tswaters commented Feb 5, 2017

We also have seneca.fixedargs.fatal$ = false set on all microservices.

Going forward, it would be nice if fatal-type errors that seneca throws itself (such as act_not_found) had a meta key attached, fatal$: true, whereas if you pass an error to the callback of an action it wouldn't be assumed fatal and would just be returned to the caller as an err.

On that note about returning err to the caller: it would be really nice if the error passed back looked identical to the one you provided. Right now it creates a new error object with a bunch of seneca-specific stuff on it like plugin, pin, args, etc., and it mangles the message to include a bunch of extra stuff... it's not ideal for actually dealing with business-specific errors throughout an application.

I find I need to do a lot of checks right now for err.orig$ to get back my original error and figure out why the action failed, which isn't ideal... it would be nice if there were an err.seneca$ with all the additional information such as plugin, pin, args, etc. If you want to see it, it's there, but the error you receive back still looks like the error you provided.

Here would be our ideal pattern --

seneca.add('role:entity,cmd:get', (args, done) => {
  someEntity.load(args.id, (err, entity) => {
    if (err) { return done(new DatabaseError(err)) }
    if (!entity) { return done(new NotFoundError(args.id)) }
    done(null, entity)
  })
})

And in the calling function, typically a seneca-web action using seneca-web-adapter-express - which now passes errors to next (yay):

seneca.add('role:web,cmd:get', (msg, done) => {
  if (msg.args.user == null) { return done(new UnauthorizedError()) }
  if (msg.args.user.role !== 'specific-role') { return done(new ForbiddenError()) }
  const id = msg.args.query.id
  if (!id) { return done(new ValidationError('id must be provided')) }
  seneca.act('role:entity,cmd:get', {id}, done)
})

And in our express app, we handle all the errors from anywhere along the action calls:

app.use((err, req, res, next) => {
  if (req.headersSent) { return next(err) }
  if (err instanceof DatabaseError) { return res.status(500).send('db error') }
  if (err instanceof NotFoundError) { return res.status(404).send('couldnt find ' + err.message) }
  if (err instanceof ForbiddenError) { return res.status(403).send('you cant do that') }
  if (err instanceof UnauthorizedError) { return res.status(401).send('please login') }
  if (err instanceof ValidationError) { return res.status(400).send(err.message) }
  res.status(500).send('unexpected')
})

This is a moderately simple example - omitted are the localization of errors into different languages, performing replacements on specific keys in the error context, using parambulator to return the validation errors... and a bunch of other stuff I've probably missed.

For this pattern to work, we need to do a bunch of extra stuff to get around err.orig$ being the actual error, and instanceof not working properly because everything gets recreated after passing the transport boundary... all solvable in userland (a rough sketch of that workaround follows below). But to get any of it to work without services dying, we need seneca.fixedargs.fatal$ = false, which doesn't seem right.
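
For reference, a rough userland sketch of the kind of extra work described above: unwrapping err.orig$ and restoring the error class after the transport boundary. The helper, the class registry, and the reliance on the error's name surviving serialisation are all assumptions, not Seneca API:

// hypothetical userland helper -- not part of seneca itself
class DatabaseError extends Error {}
class NotFoundError extends Error {}
class ForbiddenError extends Error {}
class UnauthorizedError extends Error {}
class ValidationError extends Error {}

const errorClasses = {
  DatabaseError, NotFoundError, ForbiddenError, UnauthorizedError, ValidationError
}

function unwrapSenecaError (err) {
  // prefer the original error if seneca attached it as orig$
  const original = (err && err.orig$) || err
  // after the transport boundary the prototype is lost, so rebuild the
  // class from a name we serialised onto the error ourselves (assumption)
  const Ctor = errorClasses[original.name]
  if (Ctor && !(original instanceof Ctor)) {
    const rebuilt = new Ctor(original.message)
    Object.assign(rebuilt, original)
    return rebuilt
  }
  return original
}

// usage in the express error handler above (sketch):
// app.use((err, req, res, next) => { err = unwrapSenecaError(err); ... })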

@agarbund

agarbund commented Sep 7, 2017

Hello, all. In the first place, I'd like to thank all involved in creating this great framework.

I found this topic while fighting with SENECA TERMINATED error messages after plugging seneca-amqp-transport into my project (we sometimes return business-logic errors, and before adding the transport plugin it was working fine). I read about the seneca.fixedargs.fatal$ = false workaround but can't get it to work in my case.

I created a simple repo demonstrating my problem (https://github.com/agarbund/seneca-error-issue). Could someone please explain what I am doing wrong here?

@Glathrop

Glathrop commented Apr 4, 2019

Has there been any more on this in the past year and a half?

Intensive Googling is leading me in circles. The FAQs reference fatal$ on plugin initialization, but I see no reference to error propagation. The issue is still open and I'm not seeing a consensus in any error-handling thread.

Are the workarounds here "working" so the thread is dead? We love many parts of Seneca, but the lack of clarity in how best to even think about errors causes some major issues in our program.

The explanation of existing error handling reads as a very literal interpretation of microservices, where each individual plugin would also be an individually hosted container that dies and restarts independently of other containers. That can't be correct, as hosting would be nightmarish. Am I missing some fundamental Node.js process understanding? Correct me if I'm wrong, but a FATAL = SIGTERM and is going to crash the container along with all the plugins in the container, not just the individual Node process?

Our specific problems arise in a hosted K8 environment. When one pod dies due to a fatal error (typically TIMEOUT), all pods of that same type eventually fail as the message broker attempts to get a satisfactory response back. Yes, they restart and heal, but nothing is timely and the app is down for the time it takes all pods to restart. The net effect is the same as if we had a more monolithic setup. The app is unusable.

Intuitively this feels like an incorrect configuration/setup/pattern in our code. That said, there are quite a few moving pieces in a modern hosted environment, so who knows! It's challenging for a small team to have experience with the issues that arise going from one to many plugins, from local to a small distribution, etc. We can't be the only team feeling this way.

What would be very helpful to relative newbies like myself is a written explanation of how to think about this issue and why the workarounds "work". I'm hesitant to just start throwing fatal$ = false into our codebase without truly understanding the larger ramifications, or how to handle errors that are actually fatal and should lead to a SIGTERM.

Thanks to all who have contributed their time and energy to this. I know it's easy to just ask questions, but for the sake of future readers, it does seem time to stir the pot. I've admitted my lack of understanding so I'm happy to answer any clarifying questions if it can add to the thread.

@Glathrop

Crickets...

@s9tpepper

s9tpepper commented May 12, 2019 via email

@Glathrop

@s9tpepper timeouts are common.

@s9tpepper

s9tpepper commented May 13, 2019 via email

@s9tpepper

s9tpepper commented May 13, 2019 via email

@s9tpepper

s9tpepper commented May 13, 2019 via email
