Documentation: timing out requests without exiting is inherently unsafe #39
What I hear is: Rails is such a shit show. Also what FP'ers have been saying since forever: state is the root of all evil. I wonder how much you can get around ActiveRecord's bullshit by just using transactions. Are the issues you see inherent to AR, or more about your model implementations?

As for AV rewrapping errors, leaving aside the misguidedness therein, I guess it just goes to show you should do as little logic as possible inside views?

In the end there's not much rack-timeout can do here without turning into a boil-the-ocean, rewrite-Rails project. You embrace a set of trade-offs by using it. I agree that a note in the readme would be warranted. Also the usual reminder that rack-timeout is a stopgap solution and that you should try to weed out chronic sources of long requests and fix the root of the problem.
As for the nuclear option, exiting on timeout: I've been considering it, mostly because of #30. Turns out you can't reliably raise in an IO-blocked thread anyway. But then, why not just use unicorn's timeout mechanism? And what happens when the main thread exits with other servers? rack-timeout is intended to be a generic solution for rack apps, and you can't expect every server to resurrect its processes. Or can you?
Anyways. I hear what you're saying, and thank you for the report. It just turns out timing things out reliably is a hard problem. The purpose of this ongoing rewrite of rack-timeout is more about gathering extended information around requests that time out than providing a thorough solution to all timeout-related woes.
Hm, I'm not sure how transactions work in Rails; I haven't heard that idea before. I don't think this particular problem is related to anything bad happening in the model, it's just using Rails' regular scope mechanism.
This particular view doesn't have much logic; it just so happens that the view calls out to potentially expensive operations on models. I did also notice after posting the original message that Rails' wrapped errors expose the underlying error via `original_exception`, so the check can look like this:

```ruby
def timed_out?(exception)
  exception.class == Timeout::Error || wrapped_exception_is_timeout?(exception)
end

def wrapped_exception_is_timeout?(exception)
  if exception.respond_to?(:original_exception)
    exception.original_exception.class == Timeout::Error
  end
end
```
Sure, we also use unicorn's timeout as a last resort (e.g. if it's stuck in IO for some reason), but it kills the worker hard, without any cleanup and with no debugging information to inform future improvements. It would certainly be a much simpler solution if we could use only that one.
Well, this exact solution may not be correct for every rack server, but I'd be pretty shocked if someone was running a webserver in production that didn't have some mechanism for restarting Ruby workers automatically. Certainly they may not be as fast at restarting as Unicorn with […].

Thanks for your replies. I'm mostly trying to prevent other people from having to spend months futilely tracking down this kind of problem, or just wondering in vain what's happening and then switching to some other stack.
One modification to this, just for posterity: Ruby 2.1.0's Timeout prevents catching the exception inside the block you pass to `Timeout.timeout`:

```ruby
begin
  Timeout.timeout timeout, &block
rescue => e
  Process.kill [...]
end
```

UPDATE: This doesn't actually work correctly; if a timeout like that happens inside a nested ActiveRecord transaction, the transaction could be committed instead of aborted or rolled back, which is really bad. So don't rely on the default Ruby 2.1 Timeout class. Instead, pass an explicit timeout error class, and you can use 2.1's new […]
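A small self-contained sketch of the difference being described (the behavior is as of Ruby >= 2.1; `MyTimeout` is a made-up class for illustration). Without an explicit error class, the interrupt only becomes a `Timeout::Error` at the `Timeout.timeout` boundary, so it can't be rescued as such inside the block; with an explicit class, the error is raised directly in the block and can be handled exactly where you choose:

```ruby
require 'timeout'

# Default form: rescuing Timeout::Error *inside* the block does not work;
# the exception is only converted to Timeout::Error at the boundary.
inside = begin
  Timeout.timeout(0.05) do
    begin
      sleep 0.2
    rescue Timeout::Error
      :caught_inside          # not reached on Ruby >= 2.1
    end
  end
rescue Timeout::Error
  :caught_outside
end

# Explicit error class: MyTimeout is raised directly in the block,
# so it can be rescued right there.
class MyTimeout < StandardError; end

with_class = begin
  Timeout.timeout(0.05, MyTimeout) do
    begin
      sleep 0.2
    rescue MyTimeout
      :caught_inside
    end
  end
rescue MyTimeout
  :caught_outside
end

inside      # => :caught_outside
with_class  # => :caught_inside
```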
The sample code for […]

Related to #49.
I'm leaving this open strictly as a documentation issue. There's no general fix to the problem of state corruption. If state is modified serially with the expectation that all steps will complete, aborting in the middle of that may cause corruption. If such corruption can be detected (such as in the original description, because of constant exceptions), then we can provide a way to just die. There's still a chance of silent corruption tho… As far as Rails goes, silent corruption is out of our hands. As for application logic, implementors can make sure sets of related state changes happen atomically or transactionally.
Thread#handle_interrupt is somewhat annoying to use and can be a bit tough to grasp, but it could help in a lot of state-corruption situations.
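For the curious, here's a minimal sketch of how `Thread.handle_interrupt` can keep a pair of related state changes atomic with respect to the timeout exception (`RequestTimeout` is a made-up error class for the example; with an explicit class, `Timeout.timeout` delivers it via `Thread#raise`, which `handle_interrupt` can defer):

```ruby
require 'timeout'

class RequestTimeout < StandardError; end

state = []
begin
  Timeout.timeout(0.05, RequestTimeout) do
    Thread.handle_interrupt(RequestTimeout => :never) do
      state << :step1
      sleep 0.2          # the timeout fires here, but delivery is deferred
      state << :step2    # so the second write still happens
    end
    # the pending RequestTimeout is delivered once the :never block exits
  end
rescue RequestTimeout
  # state is consistent: either both steps ran or neither did
end

state  # => [:step1, :step2]
```

The trade-off, of course, is that a `:never` region extends the effective timeout by however long the region takes, so it should wrap only the short critical sections.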
Thanks for this awesome gem! FWIW, here's how I do the timeout safely for unicorn, based off @aripollak's method:

```ruby
Rack::Timeout.register_state_change_observer(:slowpoke) do |env|
  case env[Rack::Timeout::ENV_INFO_KEY].state
  when :timed_out
    env[Slowpoke::ENV_KEY] = true
  when :completed
    Process.kill("QUIT", Process.pid) if env[Slowpoke::ENV_KEY]
  end
end
```
@ankane can you explain exactly how that relates to unicorn? Thanks
Yeah, the kill signal tells the worker to gracefully restart once it finishes the current request. This automatically cleans up the state. Should work with other servers that exhibit this same behavior.
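The mechanism can be sketched in a few lines of plain Ruby. This is an illustration of the graceful-restart idea, not unicorn's actual implementation: a `QUIT` handler flips a flag, and the worker loop checks it after finishing the current request.

```ruby
# The worker installs a handler: QUIT means "stop after the current request".
stopping = false
Signal.trap("QUIT") { stopping = true }

# This is what the observer above does when a request had timed out.
Process.kill("QUIT", Process.pid)
sleep 0.1  # give the signal handler a chance to run

# The worker would now finish its in-flight request, then exit,
# and the master process would spawn a fresh worker in its place.
stopping  # => true
```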
So if you're restarting after every timeout, why not just rely on unicorn's timeout?
Unicorn's timeout is a hard kill, whose overuse unicorn discourages. The […]
Well, the discouragement there applies to SIGQUIT too, since the worker process is still killed and has to respawn. But if the timeout is caught in the Ruby code, the exception can be re-raised and actually show up in your logs, plus services like New Relic or other error trackers.
@aripollak Makes sense. The main reason is actually what you've stated: error tracking.
@ankane I like what you've done; could you put it into a gist please? I'm curious whether you're swallowing the […]
@fourseven Thanks, I've packaged it into the Slowpoke gem.
I guess I'm happy with the existing docs.
The process kill/restart discussed here was implemented in #157.
I ran into this bug rather frequently on a high-traffic site with a somewhat high incidence of timeouts, using `Timeout.timeout` very similarly to rack-timeout. It was happening almost daily: a request would time out and raise an exception at a place in ActiveRecord that left a model class in an unclean state; all subsequent requests in the same Unicorn worker that used the corrupted class would crash until the worker was restarted, and who knows what else was corrupted in there but wasn't raising an error.

I've taken to doing the following in an around_filter instead. This might be a good thing to do in rack-timeout, or at least have a note in the README about this possibility, since with enough timeouts you're bound to run into something like it sooner or later. Note that this code is not rescuing from `Timeout::Error`, because if the timeout happens inside a Rails view, it will be caught and re-raised as an `ActionView::Template::Error`.
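A framework-free sketch of the around_filter approach described above. `WrappedError` stands in for `ActionView::Template::Error`, and the `restart_worker` lambda stands in for something like `Process.kill("QUIT", Process.pid)`; both names are made up for the example, since the original filter code isn't reproduced here.

```ruby
require 'timeout'

# Stand-in for ActionView::Template::Error, which wraps the real error
# and exposes it via #original_exception.
class WrappedError < StandardError
  attr_reader :original_exception
  def initialize(original)
    @original_exception = original
    super("wrapped: #{original.class}")
  end
end

# True for a raw Timeout::Error or for a wrapper around one.
def timed_out?(exception)
  exception.is_a?(Timeout::Error) ||
    (exception.respond_to?(:original_exception) &&
     exception.original_exception.is_a?(Timeout::Error))
end

# The around-filter shape: run the action; if the failure is (or wraps)
# a timeout, flag the worker for restart, then re-raise so the error
# still reaches logs and error trackers.
def with_timeout_guard(restart_worker)
  yield
rescue => e
  restart_worker.call if timed_out?(e)
  raise
end

restarted = false
begin
  with_timeout_guard(-> { restarted = true }) do
    raise WrappedError.new(Timeout::Error.new)
  end
rescue WrappedError
end

restarted  # => true
```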