Sensu server will throw away unprocessed in-flight results when restarting #1165

Closed
xyntrix opened this issue Feb 9, 2016 · 5 comments

@xyntrix commented Feb 9, 2016

We're testing how Sensu handles different failure scenarios. In one case, we have a simple handler extension that calls out to a remote service via REST, and we simulated how a network isolation event would impact that handler (the handler has a 5 second connect timeout).

We blocked all traffic to the target REST service, then fired off 50 alarms for just-in-time clients to trigger callouts. Once we could see the 25th host being processed, we issued a restart to the Sensu server and removed the manual network block.

The server restarted, and we expected Sensu to pick up where it had left off and continue processing the queue of work. Instead, we found that Sensu had dropped all of the unprocessed results on the floor during the restart.

Would it make sense to explore adjusting how Sensu picks up and acknowledges work off the transport, so that results aren't thrown away when the process is halted mid-handling? We are moving Sensu into a highly ephemeral Docker environment, so there is some concern that if we are spinning up and tearing down Sensu server instances more often, we will introduce more cases where pending work is thrown away.
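
As an illustration of the idea (not Sensu's code; the "results" queue name, connection details, and handle_result() helper are all hypothetical), a consumer that uses manual acknowledgements and only acks a check result after it has been fully handled will have unacked messages redelivered by the broker after a restart. A minimal sketch with the Python pika client against a RabbitMQ transport:

# Sketch only: consume check results with manual acks so that anything
# pulled off the queue but not yet handled is redelivered after a restart.
# Assumes a local RabbitMQ broker and a hypothetical "results" queue;
# handle_result() stands in for whatever processing Sensu would perform.
import json

import pika

def handle_result(result):
    # placeholder for turning the check result into an event and handling it
    print("handled", result.get("source"), result.get("name"))

def on_message(channel, method, properties, body):
    handle_result(json.loads(body))
    # ack only after processing completes; if the process is restarted
    # before this point, the broker re-queues the message
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="results", durable=True)
channel.basic_consume(queue="results", on_message_callback=on_message, auto_ack=False)
channel.start_consuming()

The tradeoff is at-least-once delivery: a restart mid-handling means the same result can be handled twice on redelivery, so handlers would need to tolerate duplicates.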

@xyntrix (Author) commented Feb 9, 2016

A cheap way to generate example alarms for a handler in bulk, on any host with a sensu-client local socket:

function bad () {
  # args: { short_hostname, check_name, handler_name }
  # defaults: hostname -s, keepalive, some_handler
  [ -z "${1}" ] && ahost="`hostname -s`" || ahost="${1}"
  [ -z "${2}" ] && check="keepalive" || check="${2}"
  [ -z "${3}" ] && handler="some_handler" || handler="${3}"
  echo "`date`: badding ${ahost}, ${check}"
  # push a critical (status 2) check result into the sensu-client local socket
  echo '{ "source": "'${ahost}'", "handlers": [ "'${handler}'" ], "name": "'${check}'", "issued": "'`date +%s`'", "output": "CRITICAL: '${ahost}' keepalive alarm '`date`'", "status": 2 }' | nc -w1 127.0.0.1 3030
}

function good () {
  # args: { short_hostname, check_name, handler_name }
  # defaults: hostname -s, keepalive, some_handler
  [ -z "${1}" ] && ahost="`hostname -s`" || ahost="${1}"
  [ -z "${2}" ] && check="keepalive" || check="${2}"
  [ -z "${3}" ] && handler="some_handler" || handler="${3}"
  echo "`date`: gooding ${ahost}, ${check}"
  # push an OK (status 0) check result to resolve the alarm
  echo '{ "source": "'${ahost}'", "handlers": [ "'${handler}'" ], "name": "'${check}'", "issued": "'`date +%s`'", "output": "OK: '${ahost}' keepalive clear '`date`'", "status": 0 }' | nc -w1 127.0.0.1 3030
}

Drop these into goodbad.sh and source the file (e.g. . ./goodbad.sh).

Create 50 alarms:

for a in `seq 1 50`; do bad testhost_${a} testalarm callout_handler; done

Resolve 50 alarms:

for a in `seq 1 50`; do good testhost_${a} testalarm callout_handler; done

@xyntrix (Author) commented Feb 9, 2016

{"timestamp":"2016-02-09T12:03:09.217173-0700","level":"warn","message":"unsubscribing from keepalive and result queues"}
{"timestamp":"2016-02-09T12:03:09.217552-0700","level":"info","message":"completing event handling in progress","handling_event_count":0}
{"timestamp":"2016-02-09T12:03:09.718622-0700","level":"warn","message":"stopping reactor"}
--

I'm wondering whether using an extension instead of a regular handler disrupts the "handling_event_count". It seems like that counter is intended to keep pending work from being dropped.

@calebhailey (Member) commented Apr 6, 2016

@xyntrix thank you for this submission! I'll answer your last question first: the handling_event_count counter does include handler extensions. The loss you're seeing corresponds to check results that were pulled from the transport but hadn't yet been processed and converted into events.

We'll discuss this internally and see what solutions might make sense. It would be easy to add a second counter for in-flight check results which would stem the "loss" of check results in the case of a graceful restart. @portertech thinks we can implement this in 0.24 and he'll provide some additional details when he sits down to implement this.
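
For what it's worth, a rough sketch of the pattern being described (hypothetical, not Sensu's implementation; all names below are made up): keep a counter of in-flight check results alongside handling_event_count, and have a graceful shutdown wait for that counter to drain before tearing the process down.

# Sketch of an "in-flight check results" counter (hypothetical, not Sensu's code):
# increment when a result is pulled off the transport, decrement once it has
# been converted into an event, and drain the counter during graceful shutdown.
import threading
import time

class InFlightResults:
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def started(self):
        with self._lock:
            self._count += 1

    def finished(self):
        with self._lock:
            self._count -= 1

    def drain(self, poll_interval=0.1):
        # block until every in-flight result has finished processing
        while True:
            with self._lock:
                if self._count == 0:
                    return
            time.sleep(poll_interval)

in_flight = InFlightResults()

def process_result(result):
    in_flight.started()
    try:
        pass  # convert the check result into an event and hand it off
    finally:
        in_flight.finished()

# on a graceful restart: unsubscribe from the result queue first, then call
# in_flight.drain() before stopping the reactor, mirroring how
# handling_event_count already gates "completing event handling in progress"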

@portertech (Member) commented May 9, 2016

@xyntrix I believe #1257 addresses your issue 👍

@portertech (Member) commented May 17, 2016

Closing this as #1257 was merged and does address these issues/concerns.
