Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to handle consumer retry failure at the user level #643

Merged
merged 5 commits into from
Mar 2, 2020

Conversation

goriunov
Copy link
Contributor

@goriunov goriunov commented Feb 17, 2020

Initial implementation for:
#69

Allow to handle consumer retry failure at the user level.

If you have any suggesting/improvement/changes let me know. Would be good to have this, or something similar soon :)

@goriunov
Copy link
Contributor Author

I don't think test failure related to this change.

@Nevon
Copy link
Collaborator

Nevon commented Feb 28, 2020

Could you provide an example use-case for this change? I don't exactly see what would be the purpose of this, especially since as of the current implementation, there's no way for the user to actually affect the consumer (stopping it from restarting, for example). As it is currently, it's simply a hook that allows you to run some arbitrary code on crashes, which we already have in the form of the consumer.events.CRASH instrumentation event

@goriunov
Copy link
Contributor Author

@Nevon from what i understand this change will allow for users to stop auto restart on system fail of the consumer if they wish. As setTimeout(() => restart(onCrash), retryTime) is responsible for consumer restart and i believe user should have a way to stop autoRestart (which is impossible at the moment) .

We had few case cases which were caused by network issues (hard to constantly reproduce) where we want to stop consumer, wait while system recovers and then create new consumer. The behaviour was we trigger disconnect which does not resolve then when we are back we get old consumer reconnected but runner thinks that consumer is disconnected so basically bad state of the consumer all of that due to above setTimeout which is impossible to cancel. I have a fork which we currently use in our prod where this setTiemout has been disabled and we can correctly fail the consumer and create new one without causing bad state and not working consumers connected to out CGs.

I think that this setTimeout at least should be under some kinda flag/option.

@Nevon
Copy link
Collaborator

Nevon commented Feb 29, 2020

I understand. So the goal is to allow users to bail out of the restart cycle. I can see how that would be useful, and I think it's a reasonable thing to allow for.

In that case, we should change the implementation a bit though. Currently it's just a synchronous function that receives no arguments and returns nothing, which means that pretty much the only thing you can do in there is process.exit - no way to shut down resources cleanly or anything.

I would say that we should change it to be an async function that receives the error as an argument and then communicates back to the caller whether or not it should restart. Something like:

if (e.name === 'KafkaJSNumberOfRetriesExceeded' || e.retriable === true) {
  const shouldRestart = !retry.restartOnFailure || (await retry.restartOnFailure(e)).catch(error => {
    logger.error('Caught error when invoking user-provided "restartOnFailure" callback. Defaulting to restarting.', {
      error: error.message || error,
      originalError: e.message || e,
      groupId,
    })

    return true
  })

  if (shouldRestart) {
    const retryTime = e.retryTime || retry.initialRetryTime || initialRetryTime
    logger.error(`Restarting the consumer in ${retryTime}ms`, {
      retryCount: e.retryCount,
      retryTime,
      groupId,
    })

    setTimeout(() => restart(onCrash), retryTime)
  }
}

This would allow you to stop the restarting depending on the error, as well as clean up any resources you might have (database connections etc.) or wait for ongoing processes to finish.

@goriunov
Copy link
Contributor Author

goriunov commented Mar 1, 2020

@Nevon you approach looks better :) I have updated PR, let me know if anything else should be added

@Nevon
Copy link
Collaborator

Nevon commented Mar 2, 2020

I fixed a syntax error in the code that I proposed, and also pushed some tests to verify that it actually does what it says on the tin.

@moreiravictor
Copy link

Was this option released? I really needed it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants