
Add retry_on/discard_on for better exception handling #25991

Merged: 17 commits merged into master from retry-and-discard-jobs on Aug 2, 2016

Conversation

@dhh (Member) commented Jul 29, 2016

Declarative exception handling of the most common kind: retrying and discarding.

class RemoteServiceJob < ActiveJob::Base
  retry_on Net::OpenTimeout, wait: 30.seconds, attempts: 10

  def perform(*args)
    # Might raise Net::OpenTimeout when the remote service is down
  end
end

class SearchIndexingJob < ActiveJob::Base
  discard_on ActiveJob::DeserializationError

  def perform(record)
    # Will raise ActiveJob::DeserializationError if the record can't be deserialized
  end
end

@kaspth added this to the 5.1.0 milestone Jul 29, 2016
#
# ==== Options
# * <tt>:wait</tt> - Re-enqueues the job with the specified delay in seconds
# * <tt>:attempts</tt> - Re-enqueues the job the specified number of times
Contributor:
Might want to state the defaults here.
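
For reference, the defaults are visible in the method signature further down in the diff (wait: 3.seconds, attempts: 5); a sketch of how the option docs could state them (illustrative wording, not the merged text):

# ==== Options
# * <tt>:wait</tt> - Re-enqueues the job with the specified delay in seconds (defaults to 3 seconds)
# * <tt>:attempts</tt> - Re-enqueues the job the specified number of times (defaults to 5 attempts)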

# Discard the job with no attempts to retry, if the exception is raised. This is useful when the subject of the job,
# like an Active Record, is no longer available, and the job is thus no longer relevant.
#
# ==== Example
Contributor:
Nitpick: This says "Example" while the other docs say "Examples" (plural).

Member Author (dhh):
If there's only 1 example, it doesn't make sense to me to pluralize.

Contributor:
The other cases only have one example as well.

Member Author (dhh):
I'd rather fix those, then. Use plural when there are multiple examples and singular when just one.

Contributor:
That's what I meant to say, just seems I forgot to actually put the words down 👍

@rafaelfranca (Member):
:shipit:

# # Might raise Net::OpenTimeout when the remote service is down
# end
# end
def retry_on(exception, wait: 3.seconds, attempts: 5, queue: nil, priority: nil)
Contributor:
Personally, I'd rearrange these arguments such that wait, queue and priority are together with attempts being last. That gives me the nice pleasure of seeing the same grouping as they have in the method body and brings them closer to their retry_job origin.

I'd do the same in the docs just above. 😁

Member Author (dhh):
I can see that general point, but I think it's trumped by grouping default-value parameters together and option-value parameters together. wait/attempts are overwhelmingly the most likely to be used; queue/priority much less so.

@kaspth (Contributor) commented Jul 29, 2016

Luuuv the API, :shipit:

# end
def retry_on(exception, wait: 3.seconds, attempts: 5, queue: nil, priority: nil)
rescue_from exception do |error|
logger.error "Retrying #{self.class} in #{wait} seconds, due to a #{exception}. The original exception was #{error.cause.inspect}."
Member:
This claims to be retrying when it should actually be saying it's giving up.

@matthewd (Member):
Alternative API:

discard_on Net::OpenTimeout, retries: 10, wait: 30.seconds
rescue_from Net::OpenTimeout, retries: 10, wait: 30.seconds do |error|
  # We failed to connect ten times; give up and send someone an email or something
end

Mostly, I like the fact this clearly states what happens when we run out of retries... retry_on feels a bit subtle for ".. and then almost-silently drop it on the floor".

@dhh (Member, Author) commented Jul 29, 2016

Ah, I see what you mean @matthewd. Actually, what should happen after we try to retry a bunch of times and fail is that we should reraise and let the queue deal with it. Will fix that!

(Though I guess there's still an argument for ALSO allowing custom logic at that point, but).

@matthewd (Member):
Yeah, I guess I'm positing that (unlike "discard", say,) "retry" isn't a distinct error handling strategy, but an intrinsic.. step? attribute? of any overall error-handling plan. It just happens that the default behaviour is to make zero retries.

@dhh (Member, Author) commented Jul 29, 2016

Definitely. Failing to retry should absolutely not result in dropping the job. Just fixed that in the latest commit 👍
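
A minimal sketch of that fix, assuming the exhausted-retries branch ends by re-raising the rescued error so the queue backend's own failure handling takes over (logging omitted for brevity):

rescue_from exception do |error|
  if executions < attempts
    # Attempts remain: push the job back onto the queue with the configured delay.
    retry_job wait: wait, queue: queue, priority: priority
  else
    # Out of attempts: re-raise rather than dropping the job on the floor,
    # so the queue backend's failed-job handling can deal with it.
    raise error
  end
end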

logger.error "Retrying #{self.class} in #{wait} seconds, due to a #{exception}. The original exception was #{error.cause.inspect}."
retry_job wait: wait, queue: queue, priority: priority
else
logger.error "Stopped retrying #{self.class} due to a #{exception}, which reoccurred on #{executions} attempts. The original exception was #{error.cause.inspect}."
Member:
Super nitpick technicality: this particular exception may not have occurred during the previous executions. Maybe worth rephrasing to something like "Stopping #{self.class} after #{executions} retries due to .."... or maybe not worth bothering with.

Member Author (dhh):
You mean because of inheritance? Or how would that particular exception not have been the cause?

Member:
If you have more than one retry_on (i.e., handling more than one possible exception), executions keeps counting up. The exception we're naming definitely caused this failure, but if we've done five attempts, they could've hit: Resolv::ResolvTimeout, Resolv::ResolvTimeout, Resolv::ResolvTimeout, Resolv::ResolvTimeout, Net::OpenTimeout. It's true we're giving up after 5 tries, but it's not strictly true to blame Net::OpenTimeout for reoccurring 5 times.
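
A sketch of the scenario described here, with a hypothetical job declaring two retry_on handlers; executions is a single per-job counter, so the attempts limit is reached across both exception types:

class DnsDependentJob < ActiveJob::Base   # hypothetical job name
  retry_on Resolv::ResolvTimeout, wait: 5.seconds, attempts: 5
  retry_on Net::OpenTimeout, wait: 5.seconds, attempts: 5

  def perform(*args)
    # Four runs may fail with Resolv::ResolvTimeout and the fifth with
    # Net::OpenTimeout; executions reaches 5 and retrying stops, even though
    # Net::OpenTimeout itself occurred only once.
  end
end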

@matthewd (Member):
(Though I guess there's still an argument for ALSO allowing custom logic at that point, but).

FWIW, I think that's the thing I was arguing: if I write a custom rescue_from, I don't want to then have to implement my own retry mechanism -- that feels orthogonal to my decision on how I want to handle the "give up" step.

.. including on the new discard_on -- it seems right that the default after-retrying behaviour is to re-raise, but "try a few times then just forget it" seems just as likely as "try once then forget it".

end

# Reschedules the job to be re-executed. This is useful in combination
# with the +rescue_from+ option. When you rescue an exception from your job
Member:
I think you meant "the +rescue_from+ method" here, not option, no?

Member Author (dhh):
It's not meant as a programmatic option, but rather as "an option for dealing with exceptions". Other options include discard_on, retry_on.
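
For context, a hedged sketch of that combination: a hand-rolled rescue_from handler that calls retry_job directly (the job and exception names are hypothetical):

class SiteScraperJob < ActiveJob::Base    # hypothetical
  rescue_from(ErrorLoadingSite) do        # ErrorLoadingSite is a made-up app exception
    retry_job queue: :low_priority        # reschedule the same job on another queue
  end

  def perform(url)
    # scrape the site; raise ErrorLoadingSite if it can't be loaded
  end
end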

Let’s do it when we actually execute instead. Then the tests dealing
with comparable serializations won’t fail either!
@dhh (Member, Author) commented Aug 1, 2016

@matthewd Added the power to provide a custom handler if the retry attempts are unsuccessful.
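
A hedged usage sketch of that custom handler; the block arguments (job, error) and the ExceptionNotifier call are assumptions for illustration, not necessarily the merged signature:

class RemoteServiceJob < ActiveJob::Base
  # Assumed block arguments: run custom logic once the retry attempts run out,
  # instead of the default log-and-reraise.
  retry_on Net::OpenTimeout, wait: 30.seconds, attempts: 10 do |job, error|
    ExceptionNotifier.caught(error, job: job)   # hypothetical notifier
  end

  def perform(*args)
    # Might raise Net::OpenTimeout when the remote service is down
  end
end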

@dhh (Member, Author) commented Aug 1, 2016

This is now ready to merge from my perspective, unless anyone has any last objections.

@dhh (Member, Author) commented Aug 2, 2016

Did not. Removed. Thanks.

case seconds_or_algorithm
when :exponentially_longer
(executions ** 4) + 2
when Integer
Contributor:
How does this fare with fixnums, bignums and rationals to name a few on Ruby 2.3?

Does it catch durations as well, if I passed in 1.hour?

Member Author (dhh):
Integer is the parent class of those. So should be good. But no, doesn't catch duration. Could add that as a separate clause and .to_i it.

Contributor:
👍, since our documentation uses 3.seconds, I'd expect full duration support as a doc reader.
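
A sketch of the separate clause dhh mentions above; the :exponentially_longer branch follows the hunk quoted earlier, while the ActiveSupport::Duration clause and the Integer branch body are assumed for illustration:

def determine_delay(seconds_or_duration_or_algorithm)
  case seconds_or_duration_or_algorithm
  when :exponentially_longer
    (executions ** 4) + 2
  when ActiveSupport::Duration
    # Assumed addition: accept durations like 1.hour or 3.seconds and
    # convert them to a plain number of seconds.
    seconds_or_duration_or_algorithm.to_i
  when Integer
    seconds_or_duration_or_algorithm
  end
end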

@kaspth (Contributor) commented Aug 2, 2016

Left some questions, but otherwise good to me 😁

CodeClimate, on the other hand, seems to wish for some climate change 😋

@@ -45,7 +45,7 @@ def retry_on(exception, wait: 3.seconds, attempts: 5, queue: nil, priority: nil)
if executions < attempts
logger.error "Retrying #{self.class} in #{wait} seconds, due to a #{exception}. The original exception was #{error.cause.inspect}."
retry_job wait: determine_delay(wait), queue: queue, priority: priority
else
Contributor:
I've never seen this much whitespace sacrificed for the glory of RuboCop! We must be saved now! Praise be our automaton savior 🙏🤖

@dhh merged commit d46d61e into master on Aug 2, 2016
@dhh deleted the retry-and-discard-jobs branch on August 2, 2016 22:05