
Documentation: timing out requests without exiting is inherently unsafe #39

Closed
aripollak opened this issue Dec 22, 2013 · 20 comments

@aripollak commented Dec 22, 2013

I ran into this bug rather frequently on a high-traffic site with a somewhat high incidence of timeouts, using Timeout.timeout very similarly to how rack-timeout does. It was happening almost daily: a request would time out and raise an exception at a place in ActiveRecord that left a model class in an unclean state; all subsequent requests in the same Unicorn worker that used the corrupted class would crash until the worker was restarted, and who knows what else was corrupted in there but wasn't raising an error.

I've taken to doing the following in an around_filter instead. This might be a good thing to do in rack-timeout, or at least worth a note in the README about this possibility, since with enough timeouts you're bound to run into something like it sooner or later. Note that this code does not rescue Timeout::Error directly, because if the timeout happens inside a Rails view, it will be caught and re-raised as an ActionView::Template::Error.

  def exit_process_on_timeout(timeout, &block)
    # Ruby's Timeout can leave the process in a bad state since libraries tend
    # not to anticipate having an exception raised arbitrarily.
    # So we assume this process is corrupted and exit it after letting the request finish.
    Timeout.timeout timeout, Timeout::Error, &block
  rescue => e
    # Match on the message rather than rescuing Timeout::Error directly,
    # because ActionView re-wraps exceptions raised inside views (see note above).
    if e.message == 'execution expired'
      ::NewRelic::Agent.notice_error e, request: request
      ::NewRelic::Agent::Transaction.stop
      ::NewRelic::Agent.shutdown
      # Ask unicorn to gracefully restart this worker once the response is sent.
      Process.kill 'QUIT', Process.pid
    end

    raise e
  end
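
A minimal sketch of how this might be wired into a controller via around_filter (my sketch, not code from the report; the exit_on_timeout filter name and the 15-second budget are illustrative):

# Hypothetical wiring, assuming the helper above lives in ApplicationController.
class ApplicationController < ActionController::Base
  around_filter :exit_on_timeout

  private

  # Rails passes the wrapped action to an around filter as a block we can forward.
  def exit_on_timeout(&block)
    exit_process_on_timeout(15, &block) # the 15-second budget is just an example
  end
end
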
@kch (Contributor) commented Dec 23, 2013

What I hear is: Rails is such a shit show. Also what FP'ers have been saying since forever: state is the root of all evil.

I wonder how much you can get around activerecord's bullshit by just using transactions. Are the issues you see inherent to AR, or more about your model implementations?

As for AV rewrapping errors, leaving aside the misguidedness therein, I guess it just goes to show you should do as little logic as possible inside views?

In the end there's not much rack-timeout can do here without turning it into a boil-the-ocean, rewrite-Rails project. You embrace a set of trade-offs by using it. I agree that a note in the readme would be warranted.

Also the usual reminder that rack-timeout is a stopgap solution and that you should try to weed out chronic sources of long requests and fix the root of the problem.

@kch (Contributor) commented Dec 23, 2013

As for the nuclear option, exiting on timeout, I've been considering it, because of #30, mostly. Turns out you can't raise in an IO-blocked thread reliably anyway.

But then, why not just use unicorn's timeout mechanism?

And what happens when the main thread exits with other servers? Rack-timeout is intended to be a generic solution for rack apps, and you can't expect every server to resurrect its processes. Or can you?

@kch (Contributor) commented Dec 23, 2013

Anyways. I hear what you're saying, and thank you for the report. It just turns out timing things out reliably is a hard problem.

The purpose of this ongoing rewrite of rack-timeout is more about gathering extended information around requests that time out, than providing a thorough solution to all timeout-related woes.

@aripollak (Author)

> I wonder how much you can get around activerecord's bullshit by just using transactions. Are the issues you see inherent to AR, or more about your model implementations?

Hm, I'm not sure how transactions work in Rails, and I haven't heard that idea before. I don't think this particular problem is related to anything bad happening in the model; it's just using Rails' regular scope mechanism.

> As for AV rewrapping errors, leaving aside the misguidedness therein, I guess it just goes to show you should do as little logic as possible inside views?

This particular view doesn't have much logic; it just so happens that the view calls out to potentially expensive operations on models. I did notice after posting the original message that Rails' wrapped errors contain a .original_exception, so I've changed the logic slightly:

  def timed_out?(exception)
    exception.class == Timeout::Error || wrapped_exception_is_timeout?(exception)
  end

  def wrapped_exception_is_timeout?(exception)
    # ActionView::Template::Error exposes the underlying error via #original_exception.
    if exception.respond_to?(:original_exception)
      exception.original_exception.class == Timeout::Error
    end
  end
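
A sketch of how the rescue clause could then use these helpers (a simplified adaptation of the original exit_process_on_timeout, with the New Relic calls omitted):

def exit_process_on_timeout(timeout, &block)
  Timeout.timeout timeout, Timeout::Error, &block
rescue => e
  # timed_out? also detects a Timeout::Error that ActionView has wrapped.
  Process.kill 'QUIT', Process.pid if timed_out?(e)
  raise e
end
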

> But then, why not just use unicorn's timeout mechanism?

Sure, we also use unicorn's timeout as a last resort (like if a request is stuck in IO for some reason), but it kills the worker hard, with no cleanup and no debugging information to inform future improvements. It would certainly be a much simpler solution if we could rely on that one alone.

> And what happens when the main thread exits with other servers? Rack-timeout is intended to be a generic solution for rack apps, and you can't expect every server to resurrect its processes. Or can you?

Well, this exact solution may not be correct for every rack server, but I'd be pretty shocked if someone was running a webserver in production that didn't have some mechanism for restarting Ruby workers automatically. Certainly they may not be as fast at restarting as Unicorn with preload_app true, but they should at least recover eventually.

Thanks for your replies. I'm mostly trying to prevent other people from having to spend months futilely tracking down this kind of problem or just wondering in vain what's happening and then switching to some other stack.

@aripollak (Author)

One modification to this, just for posterity: Ruby 2.1.0's Timeout prevents catching the exception inside the block you pass to timeout if you don't pass an explicit class, so this seems to work correctly without having to explicitly look for a re-raised Rails error:

begin
  Timeout.timeout timeout, &block
rescue => e
  Process.kill [...]
end

UPDATE: This doesn't actually work correctly; if a timeout like that happens inside a nested ActiveRecord transaction, the transaction could be committed instead of aborted or rolled back, which is really bad. So don't rely on the default Ruby 2.1 Timeout class. Instead, pass an explicit timeout error class, and you can use 2.1's new Exception#cause to find the original exception if Rails has wrapped it.
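
A minimal sketch of that revised approach, assuming Ruby 2.1+ (RequestTimeoutError, the 15-second budget, and process_the_request are illustrative placeholders, not rack-timeout's API):

require 'timeout'

class RequestTimeoutError < StandardError; end

# Walks Exception#cause so a timeout is still detected when Rails wraps it.
def timed_out?(exception)
  while exception
    return true if exception.is_a?(RequestTimeoutError)
    exception = exception.cause
  end
  false
end

begin
  Timeout.timeout(15, RequestTimeoutError) { process_the_request } # placeholder work
rescue => e
  Process.kill('QUIT', Process.pid) if timed_out?(e)
  raise
end
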

@kch (Contributor) commented Sep 10, 2014

The sample code for exit_process_on_timeout in the original description is good stuff and might also be welcome in the documentation (in a less specific form, i.e. without the New Relic stuff, etc.).

Related to #49.

@kch (Contributor) commented Sep 10, 2014

I'm leaving this open strictly as a documentation issue. There's no general fix to the problem of state corruption. If state is modified serially with the expectation that all steps will complete, aborting in the middle of that may cause corruption.

If such corruption can be detected (such as in the original description because of constant exceptions), then we can provide a way to just die. There's still a chance of silent corruption tho…

As far as rails goes, silent corruption is out of our hands. As for application logic, implementors can make sure sets of related state changes happen atomically or transactionally.
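
For illustration, a sketch of the transactional option, assuming ActiveRecord (Order and Payment are hypothetical models):

ActiveRecord::Base.transaction do
  order.update!(status: "paid")
  Payment.create!(order: order, amount: order.total)
  # If an exception (including a timeout) propagates out of this block,
  # ActiveRecord rolls both writes back instead of leaving partial state.
end

As noted earlier in the thread, this only helps when the timeout exception actually propagates out of the block; the Ruby 2.1 default-Timeout-plus-nested-transactions edge case still applies.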

@kch changed the title from "Timing out requests without exiting is inherently unsafe" to "Documentation: timing out requests without exiting is inherently unsafe" on Sep 10, 2014
@kch added this to the release-0.1 milestone on Sep 10, 2014
@kch (Contributor) commented Sep 30, 2014

Thread#handle_interrupt is somewhat annoying to use and can be a bit tough to grasp, but it could help in a lot of state corruption situations.
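
A minimal sketch of that idea (assumed usage, not rack-timeout's internals; save_order_and_payment is a placeholder for a critical section):

require 'timeout'

Timeout.timeout(15, Timeout::Error) do
  Thread.handle_interrupt(Timeout::Error => :never) do
    # The asynchronous timeout exception is deferred while this block runs,
    # so a related group of state changes either completes or never starts.
    save_order_and_payment # placeholder work
  end
  # Any deferred timeout is raised here, outside the critical section.
end
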

@kch (Contributor) commented Oct 20, 2014

This is mostly done as part of 238f0b1 / #53, but I'd like to review this ticket more closely before closing it out.

@ankane commented Oct 22, 2014

Thanks for this awesome gem! FWIW, here's how I do the timeout safely for unicorn, based on @aripollak's method:

Rack::Timeout.register_state_change_observer(:slowpoke) do |env|
  case env[Rack::Timeout::ENV_INFO_KEY].state
  when :timed_out
    # Flag the request so we know it timed out once it eventually completes.
    env[Slowpoke::ENV_KEY] = true
  when :completed
    # After a timed-out request finishes, ask unicorn for a graceful restart.
    Process.kill("QUIT", Process.pid) if env[Slowpoke::ENV_KEY]
  end
end

@kch (Contributor) commented Oct 22, 2014

@ankane can you explain exactly how that relates to unicorn? Thanks

@ankane commented Oct 22, 2014

Yeah, the QUIT signal tells the worker to exit gracefully once it finishes the current request, and the unicorn master then spawns a fresh worker. This automatically cleans up the state. It should work with other servers that exhibit this same behavior.

@kch (Contributor) commented Oct 22, 2014

So if you're restarting after every timeout why not just rely on unicorn's timeout?

@ankane commented Oct 22, 2014

Unicorn's timeout is a hard kill, which unicorn discourages overusing. QUIT is a soft kill.

@aripollak (Author)

Well, the discouragement there applies to SIGQUIT too, since the worker process is still killed and has to respawn. But if the timeout is caught in the Ruby code, the exception can be re-raised and actually show up in your logs and in services like New Relic or other error trackers.

@ankane commented Oct 22, 2014

@aripollak Makes sense. The main reason is actually what you've stated - error tracking.

@fourseven

@ankane I like what you've done; could you put it into a gist, please? I'm curious whether you're swallowing the Rack::Timeout::RequestTimeoutError exception, or if there's still an around_filter in place as well as your observer.

@ankane commented Oct 31, 2014

@fourseven Thanks, I've packaged it into the Slowpoke gem.

@kch (Contributor) commented Aug 14, 2015

I guess I'm happy with the existing docs.

@dentarg commented Dec 12, 2019

The process kill/restart that was discussed here was implemented in #157.
