
Retry strategy #32

Closed
nicholasjhenry opened this issue Sep 26, 2018 · 29 comments
Labels
enhancement New feature or request

Comments

@nicholasjhenry

The second way is to implement the Rihanna.Job behaviour and define the perform/1 function. Implementing this behaviour allows you to define retry strategies, show custom error messages on failure etc. See the docs for more details.

Could you please describe how to handle retries using the Rihanna.Job behaviour? Do you schedule another job with the same params within the current job? If I wanted a global retry, is there any reason why I couldn't use a process to monitor the jobs table for failed jobs and retry them using Rihanna.Job.retry/1? Thank you for your help and this library.
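
To make that concrete, this is roughly the kind of monitor process I have in mind (MyApp.Jobs.failed_job_ids/0 is a placeholder for a query against the jobs table, and I'm assuming Rihanna.Job.retry/1 takes a job id):

defmodule MyApp.RetryMonitor do
  use GenServer

  @poll_interval :timer.minutes(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    schedule_poll()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:poll, state) do
    # re-enqueue everything that has failed since the last poll
    for job_id <- MyApp.Jobs.failed_job_ids() do
      Rihanna.Job.retry(job_id)
    end

    schedule_poll()
    {:noreply, state}
  end

  defp schedule_poll, do: Process.send_after(self(), :poll, @poll_interval)
end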

@samsondav
Owner

Congratulations nicholas, you have found a feature that doesn't actually exist yet :).

I meant to write this but never quite got around to it. It should be a fairly easy pull request if you fancy contributing?

@samsondav added the enhancement (New feature or request) label on Sep 27, 2018
@nicholasjhenry
Author

nicholasjhenry commented Sep 28, 2018

Yes, I would definitely be interested in doing this (or someone on my team). Is there a specific direction you would like to take with this feature?

@nicholasjhenry changed the title from "Example retry strategy" to "Retry strategy" on Sep 28, 2018
@samsondav
Owner

I have a lot of ideas. It might help to know what your specific requirement is, and base it around that. Can you elaborate at all on that?

@nicholasjhenry
Author

Our requirements are relatively simple. We just need to retry failed jobs in the queue. Right now we don't even need exponential backoff or max attempts, but I guess those would be considered the basics. We're just after something you would get out of the box with a library like exq.

@samsondav
Owner

samsondav commented Sep 28, 2018

How about Rihanna.retry/1?

EDIT: Ah I see, you already looked at that. This is not a question with an obvious answer - you want to retry failed jobs... but under what rules? Automatic retry of everything forever? Sounds dangerous. I'm gonna need more info about your requirements :)

@nicholasjhenry
Author

Automatic retry of everything forever? Sounds dangerous.

Sure, that's fair. How about this:

Example:

  • a job fetching today's currency exchange rate for USD/CAD fails an HTTP request
  • the job is retried 10 times before it gives up (e.g. the HTTP endpoint is permanently unavailable)
  • the attempts are repeated after the first failed attempt at 2s, 4s, 8s, 16s...

Rules:

  • retry failed job for n number of attempts
  • the amount of time between each attempt increases (i.e. truncated binary exponential backoff)

Based on the example above, I think this is a standard approach you would get out of the box with other packages.
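
For illustration, here's roughly what that schedule works out to (plain Elixir, nothing Rihanna-specific):

# truncated binary exponential backoff: 2s, 4s, 8s, ... for up to 10 attempts
max_attempts = 10
delays_in_seconds = for attempt <- 1..max_attempts, do: round(:math.pow(2, attempt))
# => [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]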

How does this sound?

@samsondav
Owner

Actually, it sounds like you would be better off using something like ElixirRetry.

Wrap your function in its retry macro inside your perform/1 function. I think this third-party library might serve you better than building the feature into Rihanna.
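
Roughly something like this inside the job (the fetch call is a placeholder, and the backoff options are from memory, so double-check them against the ElixirRetry docs):

defmodule MyApp.FetchRatesJob do
  @behaviour Rihanna.Job

  use Retry

  def perform(currency_pair) do
    # retries on raised exceptions, :error, or {:error, _} returns
    retry with: exponential_backoff() |> randomize() |> expiry(60_000) do
      MyApp.RatesClient.fetch(currency_pair)
    after
      result -> result
    else
      error -> error
    end
  end
end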

@nicholasjhenry
Author

Yes, ElixirRetry could be part of a solution, but since it's ephemeral, a mechanism to persist retry information between deployments would still be required. I guess I forgot to add that to the requirements listed above, so I do see that an integration component with Rihanna is still needed. Another example of a retry mechanism provided "out of the box" is in the que library you reference in the README:

https://github.com/chanks/que/blob/master/docs/error_handling.md

Thoughts?

@samsondav
Owner

I'd be reluctant to enable retries by default, since jobs can have side-effects.

You are right in that it should be optionally enabled though, and handled in the database. A simple exponential backoff approach with a configurable number of retries would work. We could save the retry number on the job row, and use the existing due_at column to set the time at which it will retry. What do you think? Would you be up for having a crack at implementing something like that?
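
For the sake of discussion, something like this is roughly what I have in mind (the retry_count column and the schedule_retry function are hypothetical, not existing Rihanna code):

# hypothetical: reschedule a failed job with exponential backoff via due_at
def schedule_retry(pg, job_id, retry_count) do
  delay_seconds = round(:math.pow(2, retry_count + 1))

  Postgrex.query!(
    pg,
    """
    UPDATE rihanna_jobs
    SET retry_count = retry_count + 1,
        due_at = now() + ($2 * interval '1 second')
    WHERE id = $1
    """,
    [job_id, delay_seconds]
  )
end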

@nicholasjhenry
Author

That sounds great, @samphilipd. We need this for our current project so my team will definitely be up for implementing this.

@samsondav
Owner

How did you get on @nicholasjhenry ?

@nicholasjhenry
Author

@samphilipd We added Rihanna to our project last week, so we will need to address this within the next two weeks.

@samsondav
Owner

@nicholasjhenry That's great! Keep me posted, I'd love to get this into Rihanna soon. Happy to help with anything you need.

@samsondav
Owner

@nicholasjhenry did you get anywhere with this?

@nicholasjhenry
Author

@samphilipd This is on the back burner at the moment. Our project is in a pre-release phase, so this hasn't been a priority. No current ETA.

@onomated
Contributor

onomated commented Feb 24, 2019

@samphilipd @nicholasjhenry Any word on this yet?
I can take a stab at implementing something along the lines of the following.

First, expand the Rihanna.Job behaviour:

defmodule Rihanna.Job do
  require Logger

  @type result :: any
  @type reason :: any
  @type arg :: any
  @type t :: %__MODULE__{}

  @callback perform(arg :: any) :: :ok | {:ok, result} | :error | {:error, reason}
  @callback after_error({:error, reason} | :error | Exception.t(), arg) :: any()
  @callback retry({:error, reason} | :error | Exception.t(), arg, pos_integer()) :: {:ok, DateTime.t()} | any() # <-- Add retry(error, job_args, attempt_count)
  @optional_callbacks after_error: 2, retry: 3

...

On job_raised|on_failure, after invoking Rihanna.Job.after_error, invoke Rihanna.Job.retry, defined as:

def retry(job_module, reason, arg, attempt_count) do
  if :erlang.function_exported(job_module, :retry, 3) do
    try do
      case job_module.retry(reason, arg, attempt_count) do
        {:ok, %DateTime{} = due_at} -> update_job_due(due_at)
        _ -> :noop
      end
    rescue
      exception ->
        Logger.warn("""
        [Rihanna] retry callback failed
        ...
        """)

        :noop
    end
  end
end
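
For illustration, a job module using this callback to do capped exponential backoff might look like the following (the module name and backoff numbers are just placeholders, not part of the proposal):

defmodule MyApp.FetchExchangeRateJob do
  @behaviour Rihanna.Job

  @max_attempts 10

  @impl true
  def perform(currency_pair) do
    # placeholder for the real work; returns :ok | {:error, reason}
    MyApp.RatesClient.fetch(currency_pair)
  end

  @impl true
  def retry(_error, _arg, attempt_count) when attempt_count < @max_attempts do
    # truncated binary exponential backoff: 2s, 4s, 8s, ... capped at 5 minutes
    delay_seconds = min(round(:math.pow(2, attempt_count)), 300)
    {:ok, DateTime.add(DateTime.utc_now(), delay_seconds, :second)}
  end

  def retry(_error, _arg, _attempt_count), do: :noop
end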

Would like to get your thoughts on:

  1. Retry attempt counts cached in rihanna_internal_meta jsonb column
  • Whether due_at updates for retry attempts should happen before releasing the lock (i.e. in mark_failed), or whether it is fine to do them after the job lock is released.

This structure allows clients to implement their own custom strategies, such as max retries, linear backoff, exponential backoff, etc. Job status in the Rihanna UI for failed jobs can indicate that a retry has been dispatched when due_at > failed_at.
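
For point 1, the update might look roughly like this (the retried_count key and the query shape are my assumptions, not existing Rihanna code):

# hypothetical: bump the attempt count in rihanna_internal_meta and reschedule
def mark_retried(pg, job_id, %DateTime{} = due_at) do
  Postgrex.query!(
    pg,
    """
    UPDATE rihanna_jobs
    SET rihanna_internal_meta = jsonb_set(
          rihanna_internal_meta,
          '{retried_count}',
          to_jsonb(COALESCE((rihanna_internal_meta->>'retried_count')::int, 0) + 1)
        ),
        due_at = $2
    WHERE id = $1
    """,
    [job_id, due_at]
  )
end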

@lpil
Collaborator

lpil commented Feb 24, 2019

Looks sensible to me.

What is the any() success return type in the retry/3 callback? I think I would prefer an explicit return value that signifies the job is not to be retried (i.e. :noop).

@onomated
Contributor

Looks sensible to me.

What is the any() success return type in the retry/3 callback? I think I would prefer an explicit return value that signifies the job is not to be retried (i.e. :noop).

With an explicit value of :noop, we would still have to deal with the third case of "anything else": another service or context may be called and return :error or {:error, _}, which as far as I can see should result in a no-op as well. :noop crossed my mind too, but since it's not standard like :ok and :error, I felt it would be treated as a third case anyway. That said, it's an easy change to make.

@lpil
Collaborator

lpil commented Feb 25, 2019

I think that the retry function should never fail, and if it does, those are exceptional circumstances that should result in a crash rather than be part of the typespec.

If the error case is part of the success type then what should Rihanna do in the case the implementation returns {:error, _}?

@onomated
Contributor

I think that the retry function should never fail, and if it does, those are exceptional circumstances that should result in a crash rather than be part of the typespec.

If the error case is part of the success type then what should Rihanna do in the case the implementation returns {:error, _}?

Ah yes, any() should be removed from the typespec. Also, I thought your original comment had a typo in saying the any() success return type; it's actually meant to indicate the no-op return type. In other words, anything returned other than {:ok, DateTime} will result in a no-op.

@onomated
Contributor

@lpil Sounds like you're ok with keeping track of the retry counts in the internal meta jsonb column? Also what are your thoughts on holding the lock while assessing retries?

@lpil
Collaborator

lpil commented Feb 25, 2019

I think the lock should be held until the retry has been either committed to the database or determined to be a :noop, though I'm not familiar with how this part of the system works so I would like to get @samphilipd's thoughts on the design too.

@KushalP
Contributor

KushalP commented Mar 25, 2019

Curious to know what the status of this is? I'd like to use it to build an exponential backoff behaviour into my jobs using "retry". Ideally the metadata required to understand backoff would be stored as part of the job metadata, to handle any worker node failures.

@lpil
Collaborator

lpil commented Mar 25, 2019

I think we're happy with the design, but implementation of this feature has not started. I will not be implementing the feature but will be reviewing any pull requests.

@samsondav
Owner

Fine with retry metadata being put in "rihanna_internal_meta".

Can we re-use the due_at functionality to handle backoffs? This might limit our resolution though.

@lpil
Collaborator

lpil commented Apr 3, 2019

Using due_at sounds good to me. Currently it uses timestamp with time zone, which has 1 microsecond resolution according to the PostgreSQL docs:
https://www.postgresql.org/docs/9.1/datatype-datetime.html

I didn't realise it was so precise!

@tpitale
Collaborator

tpitale commented Jun 5, 2019

I've started work on this @lpil @KushalP @nicholasjhenry based on the work and suggestions of @onomated. Please take a look at #66 and let me know what you think. Thanks!

@tpitale
Collaborator

tpitale commented Oct 4, 2019

@samphilipd I think this is resolved now.

@samsondav
Owner

Yup I think so.
