
Clarify that the default retry strategy uses polynomial backoff instead of exponential backoff #49292

Merged

Conversation

victormours (Contributor)

Motivation / Background

This Pull Request has been created because the naming of the `wait: :exponentially_longer` option for retrying jobs is misleading as to how quickly the job will be retried.

The backoff strategy used by ActiveJob is polynomial backoff (which in this case is sometimes also called quartic backoff) rather than exponential backoff.

Part of the app my team works on (an appointment management system for French social services) has had a partial outage recently because of background jobs that would time out and retry a lot, overflowing our queues.

When doing a post-mortem for this incident, we looked at how quickly our jobs were rescheduled, and we were surprised to see that after the first few tries, retries came much faster than we would have expected with an exponential backoff. This is what led us to investigate.

Detail

This pull request changes the name of the option to clearly indicate that jobs will be retried with polynomial backoff, and keeps the legacy name working for backwards compatibility.

The difference between polynomial backoff and exponential backoff

The delay as implemented in ActiveJob, `retry_count**4`, increases polynomially, while a classic exponential delay, usually implemented as `2**retry_count`, increases exponentially (see for example this Google Cloud documentation page).

An exponential backoff could also be implemented as `4**retry_count` to better match the values of the first few retries (this is what we were expecting to happen).

To put this in perspective, this is how long these strategies wait between retries (not including jitter):

retry_count:           1,   2,   3,   4,   5,   6,   7,   8,   9,   10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20
retry_count**4 delay:  1s,  16s, 1m,  4m,  10m, 22m, 40m, 1h,  2h,  3h,  4h,  6h,  8h,  11h, 14h, 18h, 23h, 1d,  2d,  2d
4**retry_count delay:  4s,  16s, 1m,  4m,  17m, 1h,  5h,  18h, 3d,  12d, 49d, 194d
2**retry_count delay:  2s,  4s,  8s,  16s, 32s, 1m,  2m,  4m,  9m,  17m, 34m, 1h,  2h,  5h,  9h,  18h, 2d,  3d,  6d,  12d
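
For anyone who wants to reproduce these numbers, a short standalone Ruby script along the following lines generates the same comparison (the humanize helper is mine, just for display, and jitter is ignored):

# Rough sketch comparing the three strategies over the first 20 retries.
def humanize(seconds)
  return "#{seconds.round}s" if seconds < 60
  return "#{(seconds / 60.0).round}m" if seconds < 3600
  return "#{(seconds / 3600.0).round}h" if seconds < 86_400
  "#{(seconds / 86_400.0).round}d"
end

strategies = {
  "retry_count**4 delay" => ->(n) { n**4 },
  "4**retry_count delay" => ->(n) { 4**n },
  "2**retry_count delay" => ->(n) { 2**n }
}

strategies.each do |name, delay|
  puts "#{name}: " + (1..20).map { |n| humanize(delay.call(n)) }.join(", ")
end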

Interestingly, ActionCable’s retry strategy uses true exponential backoff (see actioncable/app/javascript/action_cable/connection_monitor.js:77 and the corresponding test actioncable/test/javascript/src/unit/connection_monitor_test.js:17).

Is this just a naming error, or is it actually a bug?

I was a bit unsure as to whether this should be considered a bug or just a naming error. In other words, should this mismatch between expectation and implementation be fixed by changing the code to `2 ** retry_count` (or maybe `4 ** retry_count`?) or by changing the name of the option?

After a bit of research, I realized that polynomial backoff using `retry_count**4` is the de facto standard in most major Ruby libraries (more about this in the next section if you’re interested).

Moreover, it seems like a genuinely hard problem to decide whether polynomial backoff is better or worse than exponential backoff. I don’t believe there can be a clear-cut answer that works for the general case of Rails applications.

However, it does seem reasonable to manage expectations and make it clear what kind of backoff we're using.

Therefore, in the spirit of “if it ain’t broken, don’t fix it”, it seems to me that the sanest option is to clarify the naming, but keep the same implementation (and keep backwards compatibility).

Maybe ActiveJob should provide a new option for a built-in true exponential backoff strategy, but that topic should probably be addressed in a separate issue.
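
In the meantime, a true exponential backoff is already possible today by passing a proc to `:wait`, since `retry_on` accepts one that receives the number of executions so far. A minimal sketch, where the job class, the error class and the 12-hour cap are purely illustrative:

class SyncAppointmentsJob < ApplicationJob
  # True exponential backoff with a cap, using the existing proc form of :wait.
  retry_on ApiTimeoutError, attempts: 10,
           wait: ->(executions) { [2**executions, 12.hours.to_i].min }
end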

Additional information

The confusion between polynomial and exponential backoff seems very pervasive in the Ruby community. I was surprised to find that many other Ruby libraries use a `retry_count ** 4` backoff and call it exponential backoff. For an interesting tidbit of history, I tried to find the origin of this, and found this commit in delayed_job where the strategy is first introduced (but not named) back in 2008, and then this commit from 2009 where it is mistakenly documented as using an “exponential scale”.

A similar implementation later appears in Sidekiq, the Sneakers handler, and resque-retry, before being added to ActiveJob in 2016.

I plan on opening similar pull requests on the various projects where there is this confusion.

Math enthusiasts may also point out that it is a genuinely hard problem to figure out if computations that are solved by algorithms with exponential time complexity can also be solved in polynomial time.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

This is my first contribution to Rails, so I'm not sure I got the changelog format right, or if I should break this down into more commits. Of course, I'll happily edit any of this to make it compatible with the codebase, and I'm looking forward to any feedback!

Thanks for reading this lengthy description!

@rafaelfranca (Member) left a comment

Great description! Thank you for it and all the investigation.

The intent is really to be a polynomial backoff.

@@ -205,4 +205,10 @@

*Jacopo Beschi*

* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs
Member:

This should be at the top of the file.

Contributor Author:

Done in a555ecb

@@ -156,7 +156,8 @@ def determine_delay(seconds_or_duration_or_algorithm:, executions:, jitter: JITT
jitter = jitter == JITTER_DEFAULT ? self.class.retry_jitter : (jitter || 0.0)

case seconds_or_duration_or_algorithm
- when :exponentially_longer
+ when :exponentially_longer, :with_polynomial_backoff
Member:

We should deprecate the old name

Contributor Author:

Yep! Done in 9246a9f
Testing for the deprecation made me realize that I was using the wrong exception name in that test 🙈

@@ -24,7 +24,7 @@ module ClassMethods
# ==== Options
# * <tt>:wait</tt> - Re-enqueues the job with a delay specified either in seconds (default: 3 seconds),
# as a computing proc that takes the number of executions so far as an argument, or as a symbol reference of
- # <tt>:exponentially_longer</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt>
+ # <tt>:with_polynomial_backoff</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt>
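
As a side note for readers, the documented wait algorithm can be tried in isolation; a standalone sketch, assuming a jitter factor of 0.15 purely for illustration:

jitter = 0.15 # assumed value, only for illustration

(1..5).each do |executions|
  delay = ((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2
  puts "attempt #{executions}: ~#{delay.round}s"
end
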
Member:

`:polynomially_longer` would be a better name for this.

Contributor Author:

Done in 3240060

assert_deprecated(ActiveJob.deprecator) do
  class ::RetryJob
    retry_on LegacyExponentialNamingError, wait: :exponentially_longer, attempts: 10, jitter: nil
  end
end
Contributor Author:

Strictly speaking, reopening the class here to call `retry_on` means that this test has a side effect, but it's a convenient way of using `assert_deprecated`, and I think it's fairly low-risk since this callback is really specific to this test.
I'm half-tempted to move the definition of `LegacyExponentialNamingError` here so that all the setup for this test is local to this file, but I might be overthinking it.

Member:

Could you subclass the job instead with an anonymous class? (I haven't tried it so this may not work but I believe it should)

klass = Class.new(RetryJob) do
  retry_on ...
end

Contributor Author:

Yeah, subclassing is a good idea. An anonymous class is a bit tricky just to get the `klass` variable outside of the `assert_deprecated` block's scope for calling `klass.perform_later`, so I ended up just naming the subclass.
I also moved the error definition into this file while I was at it.
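
For readers following along, the named-subclass approach described above might look roughly like this inside the test file (class and error names here are illustrative, not necessarily the exact ones committed):

class LegacyExponentialNamingError < StandardError; end
class LegacyNamingRetryJob < RetryJob; end

test "wait: :exponentially_longer is deprecated" do
  # The deprecation is emitted when retry_on is called with the legacy option name.
  assert_deprecated(ActiveJob.deprecator) do
    LegacyNamingRetryJob.retry_on LegacyExponentialNamingError,
      wait: :exponentially_longer, attempts: 10, jitter: nil
  end

  # A named subclass can still be enqueued and deserialized, unlike an anonymous one.
  LegacyNamingRetryJob.perform_later
end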

Member:

Ah, you're right... one more suggestion:

klass = Class.new(RetryJob)

assert_deprecated do
  klass.retry_on ...
end

Contributor Author:

That's clever, and it almost works, but ActiveJob doesn't allow anonymous classes: I get this error when trying this implementation:

NoMethodError: undefined method `constantize' for nil:NilClass
    /Users/victormours/dev/rails/activejob/lib/active_job/core.rb:63:in `deserialize'

The line raising the error is `job = job_data["job_class"].constantize.new`, so the anonymous class can't be deserialized.

Member:

Thanks for trying it out, sorry it didn't quite work 😅

Contributor Author:

No worries! It was fun to tinker around a bit 😄

@@ -1,5 +1,11 @@
## Rails 7.1.0.beta1 (September 13, 2023) ##

* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs
Member:

7.1.0.beta1 has already been released, so this should go above it (with a newline in between).

Contributor Author:

Okay, I think I finally got it right. Thanks for the review!

@skipkayhil (Member) left a comment

Last thing, can you squash your commits into one? Then a committer can merge 👍

@victormours (Contributor Author)

> Last thing, can you squash your commits into one? Then a committer can merge 👍

Sure thing! Thanks for the review!
