einhorn should monitor errors on upgrade #2

Open
ebroder opened this Issue Jun 14, 2012 · 6 comments

Comments

Projects
None yet
2 participants
Contributor

ebroder commented Jun 14, 2012

Ran into a problem when upgrading its children process to a new version, if the new children error out immediately (due to a bug). Because they error pre-ack, the old children stick around and handle requests (yay!)

However, einhorn continuously tries to start the new children. It should possibly back off, and almost certainly notify somehow. If I hadn't happened to be doing maintenance on the server in question, who knows how long we would have gone without noticing.

Contributor

ebroder commented Jun 17, 2012

@gdb: At least from a monitoring perspective, what do you think about adding a last_upgraded element to einhorn's state? That way you can have a monitoring script that warns if any child isn't the current version once (Time.now - last_upgraded) is greater than some threshold.

Doesn't solve the underlying problem of getting in this stuck state, but at least gives you an easy way to automatically detect when you should be worried about it.

Contributor

gdb commented Jun 17, 2012

Yeah, I like the idea last_upgraded.

I think adding some sort of flapping detection would also be good, though it's unclear to me exactly what the semantics should be and how it should be reported. I guess you could do something along the lines of watching for workers dying pre-ACK and then putting the result in your state. Depends on exactly what we want to defend against, I guess.

Contributor

ebroder commented Jun 17, 2012

What's the right way for me to get the state in a machine-inspectable way over the control socket? The 'state' command gives me a string from .inspect - am I expected to..eval that or something? (That would sketch me out, but it doesn't matter because it also doesn't seem to work because, e.g., Time objects aren't eval'able)

Contributor

gdb commented Jun 17, 2012

I can see a few possible answers:

  • Have a separate machine_readable_state command, which returns a JSON dump of the state, or
  • Return a YAML dump rather than .inspect for state, or
  • Return a JSON dump for state

I think the third one is superior, though it has the disadvantage of not exposing type information (e.g. symbol keys vs string keys) as well as .inspect does. That's mostly useful for debugging, and so maybe there should be an inspect_state command or something.

Contributor

ebroder commented Jun 17, 2012

Hmm, JSON on its own doesn't seem to be sufficient here. In
particular, Time objects dumped to JSON generate their .inspect
representation, which is still not useful for easy parsing:

[3] pry(main)> JSON.load(JSON.dump(Time.now))
=> "Sun Jun 17 12:53:53 -0700 2012"

YAML.load(YAML.dump(Time.now)) does return a Time object, although
YAML can also do things like serialize arbitrary types, which is a bit
sketch:

[5] pry(main)> puts Set.new.to_yaml
--- !ruby/object:Set
hash: {}

We could pre- and post-process the state dict to coerce timestamps to
seconds-since-epoch and back, but you can't really do that without
encoding details about the keys in the state dict, which I'm not wild
about.

I think I'd lean towards using YAML, and probably hide it under a
separate command. Thoughts?

Contributor

gdb commented Jun 18, 2012

I've actually been meaning to get rid of JSON from the Einhorn protocol. It's the only gem dependency. If YAML allows arbitrary code execution, I'd rather not use it. If however it's safe, then we could switch to it.

gdb added a commit that referenced this issue Sep 27, 2012

command.rb: Add an exponential backoff when spinning up new workers f…
…ails

Helps with #2, though it doesn't do much besides stop trying to spin
up tons of processes.

gdb added a commit that referenced this issue Sep 27, 2012

Add a last_upgraded field to state
This makes it possible to write external monitoring (cf #2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment