
Error Handling Rethink #398

Open
mcdonnelldean opened this issue Apr 12, 2016 · 52 comments

@mcdonnelldean
Contributor

Error handling needs a rethink, my thoughts here,

  • err is not always respected; many types of errors instead cause timeouts
  • Lots of options that do sort of the same thing: errhandler, .error, seneca.on('error') (see the sketch after this list)
  • Cryptic error messages, fed by logging
  • Greedy handling; testing with an exception-assertion-based lib requires weird hacks
  • Exception trapping: are we being too cautious?
  • Cryptic gating: errors from the gate executor need to be clearer
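
To make the overlap concrete, here is a rough sketch of how the hooks named above can end up wired side by side in a single service. The option and event names come from the list; the exact signatures (and the behaviour of returning from errhandler) are assumptions, not a reference:

const Seneca = require('seneca')

// 1. option-level handler passed at instantiation
const seneca = Seneca({
  errhandler: function (err) {
    console.error('errhandler saw:', err.message)
    return true // assumed: a truthy return marks the error as handled
  }
})

// 2. emitter-style handler
seneca.on('error', function (err) {
  console.error("on('error') saw:", err.message)
})

// 3. per-call callback err -- as noted above, not always respected;
//    some failures surface only as gate-executor timeouts instead
seneca.act('role:example,cmd:missing', function (err, out) {
  if (err) console.error('callback saw:', err.message)
})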

A lot of why error handling is bad is down to cryptic messages. We need to rethink the sort of messages we are returning. Gitter is littered with questions about gate-executor timeouts because the messages are basically unreadable if you don't know the code.

We are also way too safe in my opinion. We shouldn't be trapping errors; the system ends up in very peculiar states. I'm a firm believer in fail fast and restart. From Gitter and the issues it does seem people's expectation is fail fast; we really need to consider this.

Bear in mind, our own tests are in the state they are in due to weird error handling. If we had untrapped exceptions and proper err flow we could cut our test suite in half and end up with more coverage.

Contributors and users, please feel free to add pain points and constructive feedback.

@AdrianRossouw
Contributor

@rjrodger what release did you want to tackle error handling/logging for?

@mcdonnelldean
Contributor Author

@AdrianRossouw It'll be major for sure.

@Wardormeur
Member

We're not on the latest because of, well, time and priority. Also, some of my remarks may be related to how our stack is built, so they may not be relevant.
After two months of dev, being able to understand from WHERE a call was made before it failed (when calls cross microservices) is crucial for debugging.
Also, yes, the truncated messages are... not helpful for reproducing or understanding the call context.
Or timeout issues that are triggered when the real problem happened two minutes earlier, but the whole call stack (async.each or equivalent) fails afterwards by "timing out".
Gateway errors as you report them @mcdonnelldean are often a misimplementation of an act from what I've been experiencing, but they could certainly be more descriptive.
On the production side, our concern is not having to dig through the logs ourselves, but having Logentries warn us, which is another concern: will Logentries be able to parse our logs? Or can we use vidi to build a visualisation of errors? ;)

@mcollina
Contributor

I would like a major bump to remove the timeout on messages, and possibly add it back as a "filter" or plugin for some very specific microservices. I think the timeout is awesome for entry-point microservices, and maybe for Chairo.

@mcdonnelldean
Contributor Author

@mcollina Do you want seneca to hang in the case of a block at the other side? I'm trying to work out how we could do this and want to be as clear as possible.

@mcdonnelldean
Contributor Author

Just a heads up @Wardormeur manages CoderDojo's Zen platform.

@mcdonnelldean
Contributor Author

Added related issues.

@AdrianRossouw
Contributor

The timeouts were responsible for cascading failures at higher loads on a system that I worked on before.

Poor design decisions around how entities were used meant that each entity load / list would fan out into multiple actions loading more entities, until eventually the node callback queue would get so long that the time it takes to respond to an action would surpass the hardcoded seneca message timeout.

So instead of things failing reasonably (as in, one can reason about it) in one place as Matteo suggests, it would fail all over the place in a way that had no useful debugging info.

mcollina added a commit that referenced this issue Apr 21, 2016
@rjrodger
Collaborator

Please also note: http://senecajs.org/contribute/principles.html - Continuity

Error handling rethink != major breakage is acceptable.

The challenge here is not just to rethink the error handling, but also to ensure you have a backwards compatibility path. That means code that depends on the old way still needs to work, by using an option, or a plugin.

The rule for major releases is: one line of code (typically a new plugin) is all you need to make existing code work. It is also acceptable to have an option flag to revert to previous behaviour.

It is an absolutely core value of the Seneca community that existing users are respected and that we do not break existing apps.

This does make the work of fixing error handling harder, I know :)

/cc @mcdonnelldean @AdrianRossouw @mcollina @davidmarkclements @pelger @geek

@davidmarkclements
Contributor

davidmarkclements commented Apr 22, 2016

@rjrodger - I wonder if, for each area, can we have a two step approach?

Step 1 - unencumbered reimagining

here thoughts can run free, protocols/implementations do not consider backwards compatibility

this would be "seneca-next-next" ... postmodern seneca

Step 2 - shimming

Once we have nailed down "the best way", we then create an adapter that delivers backwards compatibility.

If necessary, tweaks may then be made to the pristine implementation, but only in support of the shim.

this would be "seneca-next"

@mcollina
Contributor

I tend to agree with @davidmarkclements, with a catch.
I think we should try to implement the new features without breaking anything, but not be afraid to break the current API if we see no alternative. If such an event happens, we go and seek a solution (maybe one more module). So this becomes a two-step problem, which is easier to solve.

@mcdonnelldean
Contributor Author

I agree with the principle, but we don't have the time or resources for the above. Realistically what is going to happen is we bludgeon our way through this until we have something smaller and more composable, and around version 6 we can start blue-sky thinking.

We need to be a little more careful here. For instance @Wardormeur has a fairly big system (CoderDojo) that we will need to run some tests against to ensure we break as little as possible.

Basically it must be reasonably backwards compatible within a version. We cannot expect a user to have to spend two days changing code to make v.next work; that's just not going to fly with customers who have built big systems.

@davidmarkclements
Contributor

davidmarkclements commented Apr 26, 2016

To be clear, I'm saying we supply our shims wholesale - v.next is v.next.next + shim

The judgement call is on whether it will save more time and resources in the long run to follow the ideal implementation and work backwards to a shimmed version, or to deal with the backwards-compat overhead and strive towards the eventual rainbow of perfection.

@StarpTech
Contributor

@mcdonnelldean in my opinion error handling is a big piece of the picture and needs good improvements. Are you planning to take it on in the next roadmap?

@mcdonnelldean
Contributor Author

@StarpTech I don't think there is anyone here who doesn't agree that error handling needs to be sorted, hence this issue. Seneca 3.0 is just out, and we have an obligation to folks to ensure current modules are in working order; standard practice for any toolkit this big.

Once this work is done we will re-examine the roadmap.

@StarpTech
Contributor

Hi @mcdonnelldean, I only asked because I can't find it in 4.0, 5.0...

@mcdonnelldean
Contributor Author

That's because it hasn't been planned yet. As mentioned already, we are still bedding down 3.0 and making sure everything is OK there. After that we will look at what's next for further versions.

Kindest Regards,

Dean


@StarpTech
Contributor

Then I don't understand the prioritisation, but thanks for the clarification.

@StarpTech
Contributor

Any updates? No progress in two months.

@StarpTech
Contributor

@mcdonnelldean @rjrodger anything?

@StarpTech
Contributor

This is not the way to go. There are lots of people who have issues with this and they don't know where things are going. No comments, no roadmap, and too little documentation to work with.
I created https://github.com/hemerajs/hemera and it solves all my problems.

@JoshDobbin

Thanks for making this amazing framework. I found it an odd choice to make errors passed as the first argument of the callback fatal, but I didn't take the time to write and share an awesome framework, so I can't complain.

My solution was to create a wrapper function for Seneca that can be used everywhere, and then include this line inside the wrapper function somewhere after Seneca has been instantiated:

seneca.fixedargs.fatal$ = false;
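
A rough sketch of what such a wrapper might look like; the module layout and factory name are hypothetical, and only the fixedargs line comes from this comment:

// hypothetical wrapper module, e.g. lib/create-seneca.js
const Seneca = require('seneca')

module.exports = function createSeneca (options) {
  const seneca = Seneca(options)
  // stop errors passed to action callbacks from being treated as fatal
  seneca.fixedargs.fatal$ = false
  return seneca
}

Every service then requires the wrapper instead of seneca directly, so the flag is applied consistently.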

It still leaves me with the problem of determining how to handle fatal errors, but this way my whole team doesn't have to be retrained on a new way to utilize callbacks that doesn't follow the norms for other libraries.

Again thanks. Great framework.

@tswaters
Member

tswaters commented Feb 5, 2017

We also have seneca.fixedargs.fatal$ = false set on all microservices.

Going forward, it would be nice if fatal-type errors that seneca throws itself (such as act_not_found) had a meta key attached, fatal$: true, whereas if you pass an error to the callback of an action it wouldn't be assumed fatal and would just be returned to the caller as an err.

On that note about returning err to the caller: it would be really nice if the error passed back looked identical to the one you provided. Right now it creates a new error object with a bunch of seneca-specific stuff on it like plugin, pin, args, etc., and it mangles the message to include a bunch of extra stuff... it's not ideal for actually dealing with business-specific errors throughout an application.

I find I need to do a lot of checks right now for err.orig$ to get back my original error and figure out why the action failed, which isn't ideal... it would be nice if there were an err.seneca$ with all the additional information such as plugin, pin, args, etc. If you want to see it, it's there, but the error you receive back still looks like the error you provided.

Here would be our ideal pattern --

seneca.add('role:entity,cmd:get', (args, done) => {
  someEntity.load(args.id, (err, entity) => {
    if (err) { return done(new DatabaseError(err)) }
    if (!entity) { return done(new NotFoundError(args.id)) }
    done(null, entity)
  })
})

And in the calling function, typically a seneca-web action using seneca-web-adapter-express - which now passes errors to next (yay):

seneca.add('role:web,cmd:get', (msg, done) => {
  if (msg.args.user == null) { return done(new UnauthorizedError()) }
  if (msg.args.user.role !== 'specific-role') { return done(new ForbiddenError()) }
  const id = msg.args.query.id
  if (!id) { return done(new ValidationError('id must be provided')) }
  seneca.act('role:entity,cmd:get', {id}, done)
})

And in our express app, we handle all the errors from anywhere along the action calls:

app.use((err, req, res, next) => {
  if (req.headersSent) { return next(err) }
  if (err instanceof DatabaseError) { return res.status(500).send('db error') }
  if (err instanceof NotFoundError) { return res.status(404).send('couldnt find ' + err.message) }
  if (err instanceof ForbiddenError) { return res.status(403).send('you cant do that') }
  if (err instanceof UnauthorizedError) { return res.status(401).send('please login') }
  if (err instanceof ValidationError) { return res.status(400).send(err.message) }
  res.status(500).send('unexpected')
})

This is a moderately simple example - omitted are the localization of errors into different languages, performing replacements on specific keys in the error context, using parambulator to return the validation errors... and a bunch of other stuff I've probably missed.

For this pattern to work, we need to do a bunch of extra stuff to get around err.orig$ being the actual error, and instanceof not working properly because everything gets recreated after passing the transport boundary... all solvable in userland (a rough sketch of that workaround follows below). But to get any of it to work without services dying, we need seneca.fixedargs.fatal$ = false, which doesn't seem right.
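
For reference, a rough userland sketch of the kind of extra work described above: unwrapping err.orig$ and restoring the error class after the transport boundary. The helper, the class registry, and the reliance on the error's name surviving serialisation are all assumptions, not Seneca API:

// hypothetical userland helper -- not part of seneca itself
class DatabaseError extends Error {}
class NotFoundError extends Error {}
class ForbiddenError extends Error {}
class UnauthorizedError extends Error {}
class ValidationError extends Error {}

const errorClasses = {
  DatabaseError, NotFoundError, ForbiddenError, UnauthorizedError, ValidationError
}

function unwrapSenecaError (err) {
  // prefer the original error if seneca attached it as orig$
  const original = (err && err.orig$) || err
  // after the transport boundary the prototype is lost, so rebuild the
  // class from a name we serialised onto the error ourselves (assumption)
  const Ctor = errorClasses[original.name]
  if (Ctor && !(original instanceof Ctor)) {
    const rebuilt = new Ctor(original.message)
    Object.assign(rebuilt, original)
    return rebuilt
  }
  return original
}

// usage in the express error handler above (sketch):
// app.use((err, req, res, next) => { err = unwrapSenecaError(err); ... })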

@agarbund

agarbund commented Sep 7, 2017

Hello, all. In the first place, I'd like to thank all involved in creating this great framework.

I found this topic while fighting with SENECA TERMINATED error messages after plugging seneca-amqp-transport into my project (we sometimes return business-logic errors, and before adding the transport plugin it was working fine). I read about the seneca.fixedargs.fatal$ = false workaround but can't get it to work in my case.

I created a simple repo demonstrating my problem (https://github.com/agarbund/seneca-error-issue). Could someone please explain what I am doing wrong here?

@Glathrop

Glathrop commented Apr 4, 2019

Has there been any more on this in the past year and a half?

Intensive Googling is leading me in circles. The FAQs reference fatal$ on plugin initialization, but I see no reference to error propagation. The issue is still open and I'm not seeing a consensus in any error-handling thread.

Are the workarounds here "working" so the thread is dead? We love many parts of Seneca, but the lack of clarity in how best to even think about errors causes some major issues in our program.

The explanation of existing error handling reads as a very literal interpretation of microservices, where each individual plugin would also be an individually hosted container that dies and restarts independently of other containers. That can't be correct, as hosting would be nightmarish. Am I missing some fundamental Node.js process understanding? Correct me if I'm wrong, but a FATAL = SIGTERM and is going to crash the container along with all the plugins in the container, not just the individual Node process?

Our specific problems arise in a hosted K8 environment. When one pod dies due to a fatal error (typically TIMEOUT), all pods of that same type eventually fail as the message broker attempts to get a satisfactory response back. Yes, they restart and heal, but nothing is timely and the app is down for the time it takes all pods to restart. The net effect is the same as if we had a more monolithic setup. The app is unusable.

Intuitively this feels like an incorrect configuration/setup/pattern in our code. That said, there are quite a few moving pieces in a modern hosted environment, so who knows! It's challenging for a small team to have experience with the issues that arise going from one to many plugins, from local to a small distribution, etc. We can't be the only team feeling this way.

What would be very helpful to relative newbies like myself is a written explanation of how to think about this issue and why the workarounds "work". I'm hesitant to just start throwing fatal$ = false into our codebase without truly understanding the larger ramifications, or how to handle errors that are actually fatal and should lead to a SIGTERM.

Thanks to all who have contributed their time and energy to this. I know it's easy to just ask questions, but for the sake of future readers, it does seem time to stir the pot. I've admitted my lack of understanding so I'm happy to answer any clarifying questions if it can add to the thread.

@Glathrop

Crickets...

@s9tpepper

s9tpepper commented May 12, 2019 via email

@Glathrop

@s9tpepper timeouts are common.

@s9tpepper

s9tpepper commented May 13, 2019 via email

@s9tpepper

s9tpepper commented May 13, 2019 via email

@s9tpepper

s9tpepper commented May 13, 2019 via email
