
Documentation: timing out requests without exiting is inherently unsafe #39

Closed
aripollak opened this issue Dec 22, 2013 · 20 comments

@aripollak commented Dec 22, 2013

I ran into this bug rather frequently on a high-traffic site with a somewhat high incidence of timeouts, using Timeout.timeout very similarly to how rack-timeout does. It was happening almost daily: a request would time out and raise an exception at a place in ActiveRecord that left a model class in an unclean state; all subsequent requests in the same Unicorn worker that used the corrupted class would crash until the worker was restarted, and who knows what else was corrupted in there but wasn't raising an error.

I've taken to doing the following in an around_filter instead. This might be a good thing to do in rack-timeout, or at least worth a note in the README about this possibility, since with enough timeouts you're bound to run into something like it sooner or later. Note that this code does not rescue Timeout::Error directly, because if the timeout happens inside a Rails view, it will be caught and re-raised as an ActionView::Template::Error.

  def exit_process_on_timeout(timeout, &block)
    # Ruby's Timeout can leave the process in a bad state since libraries tend
    # not to anticipate having an exception raised arbitrarily.
    # So we assume this process is corrupted and exit it after letting the request finish.
    Timeout.timeout timeout, Timeout::Error, &block
  rescue => e
    # Match on the message rather than rescuing Timeout::Error directly,
    # because ActionView re-wraps exceptions raised inside views (see note above).
    if e.message == 'execution expired'
      ::NewRelic::Agent.notice_error e, request: request
      ::NewRelic::Agent::Transaction.stop
      ::NewRelic::Agent.shutdown
      # Ask unicorn to gracefully restart this worker once the response is sent.
      Process.kill 'QUIT', Process.pid
    end

    raise e
  end
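
A minimal sketch of how this might be wired into a controller via around_filter (my sketch, not code from the report; the exit_on_timeout filter name and the 15-second budget are illustrative):

# Hypothetical wiring, assuming the helper above lives in ApplicationController.
class ApplicationController < ActionController::Base
  around_filter :exit_on_timeout

  private

  # Rails passes the wrapped action to an around filter as a block we can forward.
  def exit_on_timeout(&block)
    exit_process_on_timeout(15, &block) # the 15-second budget is just an example
  end
end
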
@kch (Contributor) commented Dec 23, 2013

What I hear is: Rails is such a shit show. Also what FP'ers have been saying since forever: state is the root of all evil.

I wonder how much you can get around activerecord's bullshit by just using transactions. Are the issues you see inherent to AR, or more about your model implementations?

As for AV rewrapping errors, leaving aside the misguidedness therein, I guess it just goes to show you should do as little logic as possible inside views?

In the end there's not much rack-timeout can do here without turning it into a boil-the-ocean, rewrite-Rails project. You embrace a set of trade-offs by using it. I agree that a note in the readme would be warranted.

Also the usual reminder that rack-timeout is a stopgap solution and that you should try to weed out chronic sources of long requests and fix the root of the problem.

@kch (Contributor) commented Dec 23, 2013

As for the nuclear option, exiting on timeout, I've been considering it, because of #30, mostly. Turns out you can't raise in an IO-blocked thread reliably anyway.

But then, why not just use unicorn's timeout mechanism?

And what happens when the main thread exits with other servers? Rack-timeout is intended to be a generic solution for rack apps, and you can't expect every server to resurrect its processes. Or can you?

@kch (Contributor) commented Dec 23, 2013

Anyways. I hear what you're saying, and thank you for the report. It just turns out timing things out reliably is a hard problem.

The purpose of this ongoing rewrite of rack-timeout is more about gathering extended information around requests that time out, than providing a thorough solution to all timeout-related woes.

@aripollak (Author)

> I wonder how much you can get around activerecord's bullshit by just using transactions. Are the issues you see inherent to AR, or more about your model implementations?

Hm, I'm not sure how transactions work in Rails, and I haven't heard that idea before. I don't think this particular problem is related to anything bad happening in the model; it's just using Rails' regular scope mechanism.

> As for AV rewrapping errors, leaving aside the misguidedness therein, I guess it just goes to show you should do as little logic as possible inside views?

This particular view doesn't have much logic; it just so happens that the view calls out to potentially expensive operations on models. I did notice after posting the original message that Rails' wrapped errors contain a .original_exception, so I've changed the logic slightly:

  def timed_out?(exception)
    exception.class == Timeout::Error || wrapped_exception_is_timeout?(exception)
  end

  def wrapped_exception_is_timeout?(exception)
    # ActionView::Template::Error exposes the underlying error via #original_exception.
    if exception.respond_to?(:original_exception)
      exception.original_exception.class == Timeout::Error
    end
  end
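
A sketch of how the rescue clause could then use these helpers (a simplified adaptation of the original exit_process_on_timeout, with the New Relic calls omitted):

def exit_process_on_timeout(timeout, &block)
  Timeout.timeout timeout, Timeout::Error, &block
rescue => e
  # timed_out? also detects a Timeout::Error that ActionView has wrapped.
  Process.kill 'QUIT', Process.pid if timed_out?(e)
  raise e
end
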

> But then, why not just use unicorn's timeout mechanism?

Sure, we also use unicorn's timeout as a last resort (like if a request is stuck in IO for some reason), but it kills the worker hard, with no cleanup and no debugging information to inform future improvements. It would certainly be a much simpler solution if we could rely on that one alone.

> And what happens when the main thread exits with other servers? Rack-timeout is intended to be a generic solution for rack apps, and you can't expect every server to resurrect its processes. Or can you?

Well, this exact solution may not be correct for every rack server, but I'd be pretty shocked if someone was running a webserver in production that didn't have some mechanism for restarting Ruby workers automatically. Certainly they may not be as fast at restarting as Unicorn with preload_app true, but they should at least recover eventually.

Thanks for your replies. I'm mostly trying to prevent other people from having to spend months futilely tracking down this kind of problem or just wondering in vain what's happening and then switching to some other stack.

@aripollak (Author)

One modification to this, just for posterity: Ruby 2.1.0's Timeout prevents catching the exception inside the block you pass to timeout if you don't pass an explicit class, so this seems to work correctly without having to explicitly look for a re-raised Rails error:

begin
  Timeout.timeout timeout, &block
rescue => e
  Process.kill [...]
end

UPDATE: This doesn't actually work correctly; if a timeout like that happens inside a nested ActiveRecord transaction, the transaction could be committed instead of aborted or rolled back, which is really bad. So don't rely on the default Ruby 2.1 Timeout class. Instead, pass an explicit timeout error class, and you can use 2.1's new Exception#cause to find the original exception if Rails has wrapped it.
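
A minimal sketch of that revised approach, assuming Ruby 2.1+ (RequestTimeoutError, the 15-second budget, and process_the_request are illustrative placeholders, not rack-timeout's API):

require 'timeout'

class RequestTimeoutError < StandardError; end

# Walks Exception#cause so a timeout is still detected when Rails wraps it.
def timed_out?(exception)
  while exception
    return true if exception.is_a?(RequestTimeoutError)
    exception = exception.cause
  end
  false
end

begin
  Timeout.timeout(15, RequestTimeoutError) { process_the_request } # placeholder work
rescue => e
  Process.kill('QUIT', Process.pid) if timed_out?(e)
  raise
end
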

@kch (Contributor) commented Sep 10, 2014

The sample code for exit_process_on_timeout in the original description is good stuff and might also be welcome in the documentation (in a less specific form, i.e. without the New Relic stuff, etc.).

Related to #49.

@kch (Contributor) commented Sep 10, 2014

I'm leaving this open strictly as a documentation issue. There's no general fix to the problem of state corruption. If state is modified serially with the expectation that all steps will complete, aborting in the middle of that may cause corruption.

If such corruption can be detected (such as in the original description because of constant exceptions), then we can provide a way to just die. There's still a chance of silent corruption tho…

As far as rails goes, silent corruption is out of our hands. As for application logic, implementors can make sure sets of related state changes happen atomically or transactionally.
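
For illustration, a sketch of the transactional option, assuming ActiveRecord (Order and Payment are hypothetical models):

ActiveRecord::Base.transaction do
  order.update!(status: "paid")
  Payment.create!(order: order, amount: order.total)
  # If an exception (including a timeout) propagates out of this block,
  # ActiveRecord rolls both writes back instead of leaving partial state.
end

As noted earlier in the thread, this only helps when the timeout exception actually propagates out of the block; the Ruby 2.1 default-Timeout-plus-nested-transactions edge case still applies.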

@kch changed the title from "Timing out requests without exiting is inherently unsafe" to "Documentation: timing out requests without exiting is inherently unsafe" on Sep 10, 2014
@kch added this to the release-0.1 milestone on Sep 10, 2014
@kch (Contributor) commented Sep 30, 2014

Thread#handle_interrupt is somewhat annoying to use and can be a bit tough to grasp, but it could help in a lot of state corruption situations.
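
A minimal sketch of that idea (assumed usage, not rack-timeout's internals; save_order_and_payment is a placeholder for a critical section):

require 'timeout'

Timeout.timeout(15, Timeout::Error) do
  Thread.handle_interrupt(Timeout::Error => :never) do
    # The asynchronous timeout exception is deferred while this block runs,
    # so a related group of state changes either completes or never starts.
    save_order_and_payment # placeholder work
  end
  # Any deferred timeout is raised here, outside the critical section.
end
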

@kch (Contributor) commented Oct 20, 2014

This is mostly done as part of 238f0b1 / #53, but I'd like to review this ticket more closely before closing it out.

@ankane commented Oct 22, 2014

Thanks for this awesome gem! FWIW, here's how I do the timeout safely for unicorn, based on @aripollak's method:

Rack::Timeout.register_state_change_observer(:slowpoke) do |env|
  case env[Rack::Timeout::ENV_INFO_KEY].state
  when :timed_out
    # Flag the request so we know it timed out once it eventually completes.
    env[Slowpoke::ENV_KEY] = true
  when :completed
    # After a timed-out request finishes, ask unicorn for a graceful restart.
    Process.kill("QUIT", Process.pid) if env[Slowpoke::ENV_KEY]
  end
end

@kch (Contributor) commented Oct 22, 2014

@ankane can you explain exactly how that relates to unicorn? Thanks

@ankane commented Oct 22, 2014

Yeah, the QUIT signal tells the worker to exit gracefully once it finishes the current request, and the unicorn master then spawns a fresh worker. This automatically cleans up the state. It should work with other servers that exhibit this same behavior.

@kch (Contributor) commented Oct 22, 2014

So if you're restarting after every timeout why not just rely on unicorn's timeout?

@ankane commented Oct 22, 2014

Unicorn's timeout is a hard kill, which unicorn discourages overusing. QUIT is a soft kill.

@aripollak (Author)

Well, the discouragement there applies to SIGQUIT too, since the worker process is still killed and has to respawn. But if the timeout is caught in the Ruby code, the exception can be re-raised and actually show up in your logs and in services like New Relic or other error trackers.

@ankane commented Oct 22, 2014

@aripollak Makes sense. The main reason is actually what you've stated - error tracking.

@fourseven

@ankane I like what you've done; could you put it into a gist, please? I'm curious whether you're swallowing the Rack::Timeout::RequestTimeoutError exception, or if there's still an around_filter in place as well as your observer.

@ankane commented Oct 31, 2014

@fourseven Thanks, I've packaged it into the Slowpoke gem.

@kch (Contributor) commented Aug 14, 2015

I guess I'm happy with the existing docs.

@dentarg commented Dec 12, 2019

The process kill/restart that was discussed here was implemented in #157.
