New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify that the default retry strategy uses polynomial backoff instead of exponential backoff #49292
Clarify that the default retry strategy uses polynomial backoff instead of exponential backoff #49292
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great description! Thank you for it and all investigation.
The intent is really to be a polynomial backoff.
activejob/CHANGELOG.md
Outdated
@@ -205,4 +205,10 @@ | |||
|
|||
*Jacopo Beschi* | |||
|
|||
* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the top of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a555ecb
@@ -156,7 +156,8 @@ def determine_delay(seconds_or_duration_or_algorithm:, executions:, jitter: JITT | |||
jitter = jitter == JITTER_DEFAULT ? self.class.retry_jitter : (jitter || 0.0) | |||
|
|||
case seconds_or_duration_or_algorithm | |||
when :exponentially_longer | |||
when :exponentially_longer, :with_polynomial_backoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should deprecate the old name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! Done in 9246a9f
Testing for the deprecation made me realized that I was using the wrong exception name in that test 🙈
@@ -24,7 +24,7 @@ module ClassMethods | |||
# ==== Options | |||
# * <tt>:wait</tt> - Re-enqueues the job with a delay specified either in seconds (default: 3 seconds), | |||
# as a computing proc that takes the number of executions so far as an argument, or as a symbol reference of | |||
# <tt>:exponentially_longer</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt> | |||
# <tt>:with_polynomial_backoff</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:polynomially_longer
would be a better name for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 3240060
assert_deprecated(ActiveJob.deprecator) do | ||
class ::RetryJob | ||
retry_on LegacyExponentialNamingError, wait: :exponentially_longer, attempts: 10, jitter: nil | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, reopening the class here to call retry_on
means that this test has a side effect, but it is a convenient way of using assert_deprecated
, and I think it's fairly low-risk since this callback is really specific to this test.
I'm almost hesitant to move the definition of LegacyExponentialNamingError
here so that all the setup for this test is local to this file, but I might be overthinking it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you subclass the job instead with an anonymous class? (I haven't tried it so this may not work but I believe it should)
klass = Class.new(RetryJob) do
retry_on ...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, subclassing is a good idea. An anonymous class is a bit tricky just to get the klass
variable outside of the assert_deprecated
bock's scope for calling klass.perform_later
, so I ended up just naming the subclass.
I also moved the error definition into this file while I was at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah you're right... one more suggestion:
klass = Class.new(RetryJob)
assert_deprecated do
klass.retry_on ...
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's clever, and it almost works, but ActiveJob doesn't allow anonymous classes: I get this error when trying this implementation:
NoMethodError: undefined method `constantize' for nil:NilClass
/Users/victormours/dev/rails/activejob/lib/active_job/core.rb:63:in `deserialize'
The line raising the error is job = job_data["job_class"].constantize.new
, so the anonymous class can't be deserialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for trying it out, sorry it didn't quite work 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! it was fun to tinker around a bit 😄
@@ -1,5 +1,11 @@ | |||
## Rails 7.1.0.beta1 (September 13, 2023) ## | |||
|
|||
* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7.1.0.beta1
has already been released, this should go above it (with a newline in between)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I think I finally got it right. Thanks for the review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last thing, can you squash your commits into one? Then a committer can merge 👍
…ot exponential backoff
Sure thing! Thanks for the review! |
f85f82a
to
1bdabc1
Compare
Motivation / Background
This Pull Request has been created because the naming of the
wait: :exponentially_longer
option for retrying jobs is misleading as to how often the job will be retried.The backoff strategy used by ActiveJob is polynomial backoff (which in this case is sometimes also called quartic backoff) rather than exponential backoof.
Part of the app my team works on (an appointment management system for French social services) has had a partial outage recently, because of background jobs that would timeout and retry a lot, overflowing our queues.
When doing a post-mortem for this incident, we looked at how quickly our jobs were rescheduled, and we were surprised to see that after the first few tries, it was much faster than expected with an exponential backoff. This is what led us into investigating this.
Detail
This pull request changes the name of the option to clearly indicate that jobs will be retried with polynomial backoff, and keeps the legacy name working for backwards compatibility.
The difference between polynomial backoff and exponential backoff
The delays as implemented in ActiveJob with
retry_count**4
, increases polynomially, while a classic exponential delay, usually implemented as2**retry_count
increases exponentially (see for example this Google Cloud documentation page).An exponential backoff could also be implemented as
4**retry_count
to better match the initial values of the first few tries (this is what we were expecting to happen).To put this in perspective, this is how long these strategies wait between retries (not including jitter):
Interestingly, ActionCable’s retry strategy uses true exponential backoff (see
actioncable/app/javascript/action_cable/connection_monitor.js:77
and the corresponding testactioncable/test/javascript/src/unit/connection_monitor_test.js:17
).Is this just a naming error or is it actually a bug ?
I was a bit unsure as to whether this should be considered a bug or just a naming error. In other words, should this mismatch between expectation and implementation be fixed by changing the code to
2 ** retry_count
(or maybe4 ** retry_count
?) or by changing the name of the option?After a bit of research, I realized that polynomial backoff using
retry_count**4
is the de facto standard in most major Ruby libraries (more about this in the next section if you’re interested).Moreover, it seems like a genuinely hard problem to decide whether polynomial backoff is better or worse than exponential backoff. I don’t believe there can be a clear cut answer that works for the general case of Rails applications.
However, it does seems reasonable to manage expectations, and make it clear what kind of backoff we're using.
Therefore, in the spirit of “if it ain’t broken, don’t fix it”, it seems to me that the sanest option is to clarify the naming, but keep the same implementation (and keep backwards compatibility).
Maybe ActiveJob should provide a new option for a built-in true exponential backoff strategy, that topic should probably be addressed in a different issue.
Additional information
The confusion between polynomial and exponential backoff seems very pervasive in the Ruby community. I was surprised to find that many other Ruby libraries use a
retry_count ** 4 backoff
, and call it exponential backoff. For an interesting tidbit of history, I tried to find the origin of this, and found this commit in delayed job where this strategy is first introduced (but not named) back in 2008, and then this commit in 2009 when it is mistakenly documented as using an “exponential scale”.A similar implementation later appears in sidekiq, sneaker handler , and resque-retry, before being added to ActiveJob in 2016.
I plan on opening similar pull requests on the various projects where there is this confusion.
Math enthusiasts may also point out that it is a genuinely hard problem to figure out if computations that are solved by algorithms with exponential time complexity can also be solved in polynomial time.
Checklist
Before submitting the PR make sure the following are checked:
[Fix #issue-number]
This is my first contribution to Rails so I'm not sure I got the changelog format right, or if I should break this down into more commits. Of course, I'll happily edit any of this to make it compatible with the codebase, and I'm looking forward to any feedback!
Thanks for reading this lengthy description!