
Clarify that the default retry strategy uses polynomial backoff instead of exponential backoff #49292

Merged

Conversation

victormours (Contributor)

Motivation / Background

This Pull Request has been created because the naming of the `wait: :exponentially_longer` option for retrying jobs is misleading as to how quickly the job will be retried.

The backoff strategy used by ActiveJob is polynomial backoff (which in this case is sometimes also called quartic backoff) rather than exponential backoff.

Part of the app my team works on (an appointment management system for French social services) has had a partial outage recently because of background jobs that would time out and retry a lot, overflowing our queues.

When doing a post-mortem for this incident, we looked at how quickly our jobs were rescheduled, and we were surprised to see that after the first few tries, retries came much faster than we would have expected with an exponential backoff. This is what led us to investigate.

Detail

This pull request changes the name of the option to clearly indicate that jobs will be retried with polynomial backoff, and keeps the legacy name working for backwards compatibility.

The difference between polynomial backoff and exponential backoff

The delay as implemented in ActiveJob, `retry_count**4`, increases polynomially, while a classic exponential delay, usually implemented as `2**retry_count`, increases exponentially (see for example this Google Cloud documentation page).

An exponential backoff could also be implemented as `4**retry_count` to better match the values of the first few retries (this is what we were expecting to happen).

To put this in perspective, this is how long these strategies wait between retries (not including jitter):

retry_count:           1,   2,   3,   4,   5,   6,   7,   8,   9,   10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20
retry_count**4 delay:  1s,  16s, 1m,  4m,  10m, 22m, 40m, 1h,  2h,  3h,  4h,  6h,  8h,  11h, 14h, 18h, 23h, 1d,  2d,  2d
4**retry_count delay:  4s,  16s, 1m,  4m,  17m, 1h,  5h,  18h, 3d,  12d, 49d, 194d
2**retry_count delay:  2s,  4s,  8s,  16s, 32s, 1m,  2m,  4m,  9m,  17m, 34m, 1h,  2h,  5h,  9h,  18h, 2d,  3d,  6d,  12d
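
For anyone who wants to reproduce these numbers, a short standalone Ruby script along the following lines generates the same comparison (the humanize helper is mine, just for display, and jitter is ignored):

# Rough sketch comparing the three strategies over the first 20 retries.
def humanize(seconds)
  return "#{seconds.round}s" if seconds < 60
  return "#{(seconds / 60.0).round}m" if seconds < 3600
  return "#{(seconds / 3600.0).round}h" if seconds < 86_400
  "#{(seconds / 86_400.0).round}d"
end

strategies = {
  "retry_count**4 delay" => ->(n) { n**4 },
  "4**retry_count delay" => ->(n) { 4**n },
  "2**retry_count delay" => ->(n) { 2**n }
}

strategies.each do |name, delay|
  puts "#{name}: " + (1..20).map { |n| humanize(delay.call(n)) }.join(", ")
end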

Interestingly, ActionCable’s retry strategy uses true exponential backoff (see actioncable/app/javascript/action_cable/connection_monitor.js:77 and the corresponding test actioncable/test/javascript/src/unit/connection_monitor_test.js:17).

Is this just a naming error, or is it actually a bug?

I was a bit unsure as to whether this should be considered a bug or just a naming error. In other words, should this mismatch between expectation and implementation be fixed by changing the code to `2 ** retry_count` (or maybe `4 ** retry_count`?) or by changing the name of the option?

After a bit of research, I realized that polynomial backoff using `retry_count**4` is the de facto standard in most major Ruby libraries (more about this in the next section if you’re interested).

Moreover, it seems like a genuinely hard problem to decide whether polynomial backoff is better or worse than exponential backoff. I don’t believe there can be a clear-cut answer that works for the general case of Rails applications.

However, it does seem reasonable to manage expectations and make it clear what kind of backoff we're using.

Therefore, in the spirit of “if it ain’t broken, don’t fix it”, it seems to me that the sanest option is to clarify the naming, but keep the same implementation (and keep backwards compatibility).

Maybe ActiveJob should provide a new option for a built-in true exponential backoff strategy, but that topic should probably be addressed in a separate issue.
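
In the meantime, a true exponential backoff is already possible today by passing a proc to `:wait`, since `retry_on` accepts one that receives the number of executions so far. A minimal sketch, where the job class, the error class and the 12-hour cap are purely illustrative:

class SyncAppointmentsJob < ApplicationJob
  # True exponential backoff with a cap, using the existing proc form of :wait.
  retry_on ApiTimeoutError, attempts: 10,
           wait: ->(executions) { [2**executions, 12.hours.to_i].min }
end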

Additional information

The confusion between polynomial and exponential backoff seems very pervasive in the Ruby community. I was surprised to find that many other Ruby libraries use a `retry_count ** 4` backoff and call it exponential backoff. For an interesting tidbit of history, I tried to find the origin of this, and found this commit in delayed_job where the strategy is first introduced (but not named) back in 2008, and then this commit from 2009 where it is mistakenly documented as using an “exponential scale”.

A similar implementation later appears in Sidekiq, the Sneakers handler, and resque-retry, before being added to ActiveJob in 2016.

I plan on opening similar pull requests on the various projects where there is this confusion.

Math enthusiasts may also point out that it is a genuinely hard problem to figure out if computations that are solved by algorithms with exponential time complexity can also be solved in polynomial time.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

This is my first contribution to Rails, so I'm not sure I got the changelog format right, or if I should break this down into more commits. Of course, I'll happily edit any of this to make it compatible with the codebase, and I'm looking forward to any feedback!

Thanks for reading this lengthy description!

@rafaelfranca (Member) left a comment

Great description! Thank you for it and all the investigation.

The intent is really to be a polynomial backoff.

@@ -205,4 +205,10 @@

*Jacopo Beschi*

* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs
Member:

This should be at the top of the file.

Contributor Author:

Done in a555ecb

@@ -156,7 +156,8 @@ def determine_delay(seconds_or_duration_or_algorithm:, executions:, jitter: JITT
jitter = jitter == JITTER_DEFAULT ? self.class.retry_jitter : (jitter || 0.0)

case seconds_or_duration_or_algorithm
- when :exponentially_longer
+ when :exponentially_longer, :with_polynomial_backoff
Member:

We should deprecate the old name

Contributor Author:

Yep! Done in 9246a9f
Testing for the deprecation made me realize that I was using the wrong exception name in that test 🙈

@@ -24,7 +24,7 @@ module ClassMethods
# ==== Options
# * <tt>:wait</tt> - Re-enqueues the job with a delay specified either in seconds (default: 3 seconds),
# as a computing proc that takes the number of executions so far as an argument, or as a symbol reference of
- # <tt>:exponentially_longer</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt>
+ # <tt>:with_polynomial_backoff</tt>, which applies the wait algorithm of <tt>((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2</tt>
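
As a side note for readers, the documented wait algorithm can be tried in isolation; a standalone sketch, assuming a jitter factor of 0.15 purely for illustration:

jitter = 0.15 # assumed value, only for illustration

(1..5).each do |executions|
  delay = ((executions**4) + (Kernel.rand * (executions**4) * jitter)) + 2
  puts "attempt #{executions}: ~#{delay.round}s"
end
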
Member:

`:polynomially_longer` would be a better name for this.

Contributor Author:

Done in 3240060

assert_deprecated(ActiveJob.deprecator) do
  class ::RetryJob
    retry_on LegacyExponentialNamingError, wait: :exponentially_longer, attempts: 10, jitter: nil
  end
end
Contributor Author:

Strictly speaking, reopening the class here to call `retry_on` means that this test has a side effect, but it's a convenient way of using `assert_deprecated`, and I think it's fairly low-risk since this callback is really specific to this test.
I'm half-tempted to move the definition of `LegacyExponentialNamingError` here so that all the setup for this test is local to this file, but I might be overthinking it.

Member:

Could you subclass the job instead with an anonymous class? (I haven't tried it so this may not work but I believe it should)

klass = Class.new(RetryJob) do
  retry_on ...
end

Contributor Author:

Yeah, subclassing is a good idea. An anonymous class is a bit tricky just to get the `klass` variable outside of the `assert_deprecated` block's scope for calling `klass.perform_later`, so I ended up just naming the subclass.
I also moved the error definition into this file while I was at it.
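
For readers following along, the named-subclass approach described above might look roughly like this inside the test file (class and error names here are illustrative, not necessarily the exact ones committed):

class LegacyExponentialNamingError < StandardError; end
class LegacyNamingRetryJob < RetryJob; end

test "wait: :exponentially_longer is deprecated" do
  # The deprecation is emitted when retry_on is called with the legacy option name.
  assert_deprecated(ActiveJob.deprecator) do
    LegacyNamingRetryJob.retry_on LegacyExponentialNamingError,
      wait: :exponentially_longer, attempts: 10, jitter: nil
  end

  # A named subclass can still be enqueued and deserialized, unlike an anonymous one.
  LegacyNamingRetryJob.perform_later
end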

Member:

Ah, you're right... one more suggestion:

klass = Class.new(RetryJob)

assert_deprecated do
  klass.retry_on ...
end

Contributor Author:

That's clever, and it almost works, but ActiveJob doesn't allow anonymous classes: I get this error when trying this implementation:

NoMethodError: undefined method `constantize' for nil:NilClass
    /Users/victormours/dev/rails/activejob/lib/active_job/core.rb:63:in `deserialize'

The line raising the error is `job = job_data["job_class"].constantize.new`, so the anonymous class can't be deserialized.

Member:

Thanks for trying it out, sorry it didn't quite work 😅

Contributor Author:

No worries! It was fun to tinker around a bit 😄

@@ -1,5 +1,11 @@
## Rails 7.1.0.beta1 (September 13, 2023) ##

* Clarify the backoff strategy for the recommended `:wait` option when retrying jobs
Member:

7.1.0.beta1 has already been released, so this should go above it (with a newline in between).

Contributor Author:

Okay, I think I finally got it right. Thanks for the review!

@skipkayhil (Member) left a comment

Last thing, can you squash your commits into one? Then a committer can merge 👍

@victormours (Contributor Author)

> Last thing, can you squash your commits into one? Then a committer can merge 👍

Sure thing! Thanks for the review!
