Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add :jitter to :exponentially_longer #31872

Merged
merged 1 commit into from Nov 28, 2019

Conversation

@allcentury
Copy link

allcentury commented Feb 2, 2018

Summary

We're seeing issues using :exponentially_longer where failures in downstream systems are causing a thundering herd effect. A great many jobs fail, they all get enqueued to try again in a static interval and bring down the downstream system yet again. This repeats until the retry limit is exhausted.

We've gotten around this using a custom wait block but since exponential backoff with randomness is builtin to some adapters, like Sidekiq, I'm proposing to have it baked into to :exponentially_longer.

@rails-bot

This comment has been minimized.

Copy link

rails-bot commented Feb 2, 2018

Thanks for the pull request, and welcome! The Rails team is excited to review your changes, and you should hear from @sgrif (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

This repository is being automatically checked for code quality issues using Code Climate. You can see results for this analysis in the PR status below. Newly introduced issues should be fixed before a Pull Request is considered ready to review.

Please see the contribution instructions for more information.

@akowalz
akowalz approved these changes Feb 2, 2018
activejob/lib/active_job/exceptions.rb Outdated Show resolved Hide resolved
Copy link
Member

jeremy left a comment

Good call. Desirable for standard :wait as well. Consider introducing a :jitter option that captures both cases, injecting X% jitter into the execution count, with a sensible small default value.

activejob/CHANGELOG.md Outdated Show resolved Hide resolved
activejob/lib/active_job/exceptions.rb Outdated Show resolved Hide resolved
@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Feb 8, 2018

@jeremy I pushed up 5d900e9. Do you mind taking a look and letting me know if this is the calculation/API changes you were suggesting?

Desirable for standard :wait as well

I'm trying to translate this to the code, are you suggesting we add jitter when a user passes in an Integer or ActiveSupport::Duration to retry_on?

I still need to update the changelog and comments for the new :jitter option but I wanted your 👀 on this before I did that.

@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Mar 10, 2018

@jeremy - apologies for the ping but just wanted to get some eyes on this when you have time. Thanks!

@jeremy

This comment has been minimized.

Copy link
Member

jeremy commented Mar 13, 2018

I'm trying to translate this to the code, are you suggesting we add jitter when a user passes in an Integer or ActiveSupport::Duration to retry_on?

Yep!

@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch to 235ca8d Apr 27, 2018
@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Apr 27, 2018

@jeremy Thanks for your patience, I've pushed up a few commits that I think addresses your feedback but feel free to let me know if you have any other suggestions. I certainly appreciate all your input thus far

* Add randomness to :exponentially_longer

Using ActiveJob::Exceptions.retry_on with :exponentially_longer now
uses a random amount of jitter in order to prevent the [thundering herd effect.](https://en.wikipedia.org/wiki/Thundering_herd_problem). Default jitter amount of 0.15 but overridable via the `:jitter` option

This comment has been minimized.

Copy link
@jeremy

jeremy Sep 6, 2018

Member

This suggests that jitter is a duration, or some number with units.

Can we explain how jitter is applied to the wait time?

# * <tt>:attempts</tt> - Re-enqueues the job the specified number of times (default: 5 attempts)
# * <tt>:queue</tt> - Re-enqueues the job on a different queue
# * <tt>:priority</tt> - Re-enqueues the job with a different priority
# * <tt>:jitter</tt> - An offset delay amount so retries don't all happen at the same time (default: 0.15)

This comment has been minimized.

Copy link
@jeremy

jeremy Sep 6, 2018

Member

Ditto. "Delay amount" reads as "duration in seconds."

@@ -19,11 +19,12 @@ module ClassMethods
# ==== Options
# * <tt>:wait</tt> - Re-enqueues the job with a delay specified either in seconds (default: 3 seconds),
# as a computing proc that the number of executions so far as an argument, or as a symbol reference of
# <tt>:exponentially_longer</tt>, which applies the wait algorithm of <tt>(executions ** 4) + 2</tt>
# (first wait 3s, then 18s, then 83s, etc)
# <tt>:exponentially_longer</tt>, which applies the wait algorithm of <tt>(executions ** 4) + 2 + jitter</tt>

This comment has been minimized.

Copy link
@jeremy

jeremy Sep 6, 2018

Member

This no longer reflects the implementation.

This comment has been minimized.

Copy link
@allcentury

allcentury Nov 9, 2018

Author

Thanks @jeremy - I corrected the formula listed, do you think it's valuable to correct the amounts? I currently listed them with a ~ prefix to signify a non-exact figure but I'm not sure if that's how it will be conveyed. Depending on what you think here, I'll mirror that in the comment in the example below too.

@jeremy

This comment has been minimized.

Copy link
Member

jeremy commented Sep 6, 2018

Ran into this ourselves today and thought of your PR, @allcentury! Thanks again.

@jeremy

This comment has been minimized.

Copy link
Member

jeremy commented Sep 6, 2018

One way to express jitter: as a random delay of up to 15% of the wait time. The 15% is the max jitter, then, and it's unitless—max jitter is expressed as a percentage of wait time.

@gmcgibbon gmcgibbon added the activejob label Sep 30, 2018
@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch from 850448c to 3c23594 Nov 9, 2018
@allcentury allcentury changed the title Add Randomness to :exponentially_longer Add :jitter to :exponentially_longer Nov 9, 2018
@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch 4 times, most recently from 26fcd21 to c5c41c3 Nov 9, 2018
@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch from c5c41c3 to 7dd27d6 Jul 7, 2019
@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch from 7dd27d6 to ca16528 Jul 7, 2019
@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Jul 7, 2019

@jeremy - I've rebased and pushed up some better documentation, do you mind reviewing when you have some time? I'd like to get this merged if all looks ok to you.

@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Sep 23, 2019

@jeremy - Bump/ping/👋

@jeremy

This comment has been minimized.

Copy link
Member

jeremy commented Nov 27, 2019

Thanks for sticking with this @allcentury! We've been using an inlined version of this ourselves for quite some time now and it's done the job very nicely.

Bit of feedback on the changelog remains: https://github.com/rails/rails/pull/31872/files#r215762717. Let's clarify that jitter: 0.15 is a percentage of the wait time. It reads like it's a time duration that's added to wait time.

Anthony Ross
Prior to this change, exponentially_longer had adverse consequences
during system-wide downstream failures.  This change adds a random value to the
back off calculation in order to prevent the thundering herd
problem, whereby all retry jobs would retry at the same time.

Specifically this change adds a jitter option to retry_on to enable users of it to
scope the randomness calculation to a reasonable amount.  The default is
15% of the exponential back off calculation.
@allcentury allcentury force-pushed the allcentury:exponential_backoff_with_randomness branch from ca16528 to ac6fb0e Nov 27, 2019
@allcentury

This comment has been minimized.

Copy link
Author

allcentury commented Nov 27, 2019

@jeremy - perfect, thanks for noticing that. I rephrased the changelog, I think it's clearer now.

@jeremy
jeremy approved these changes Nov 28, 2019
@jeremy jeremy added this to the 6.1.0 milestone Nov 28, 2019
@jeremy jeremy merged commit 5f76218 into rails:master Nov 28, 2019
2 checks passed
2 checks passed
build
Details
buildkite/rails Build #65188 passed (40 minutes, 52 seconds)
Details
@jeremy

This comment has been minimized.

Copy link
Member

jeremy commented Nov 28, 2019

Thanks again @allcentury!

tjschuck added a commit to tjschuck/rails that referenced this pull request Dec 4, 2019
Follow up to rails#31872

Particularly, the closing `</tt>` tag of the documented wait algorithm had a misplaced `<`.  Threw in a space between `15% (0.15)` for good measure.

[ci skip]
@aditya-vector aditya-vector mentioned this pull request Feb 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

7 participants
You can’t perform that action at this time.