Solid Queue is not retrying job #110

Closed
zainonrails opened this issue Jan 2, 2024 · 10 comments

Comments

@zainonrails

I am trying to retry a failed job by using retry_on in the job class but it is not working consistently.

class RequestTestimonialJob < ApplicationJob
  queue_as :default
  retry_on StandardError, attempts: 3, priority: 0
  
  def perform(user_id)
    raise StandardError
  end
end

I have also overridden

config.solid_queue.on_thread_error = ->(exception) { Bugsnag.notify(exception) }

but nothing happens. I have tried with and without it. The behaviour seems unpredictable or it is just me.

It also sometimes throws this error:

No live threads left. Deadlock? (fatal)
04:53:43 solid.1 | 6 threads, 6 sleeps current:0x0000000113428570 main thread:0x0000000124004080
04:53:43 solid.1 | * #<Thread:0x0000000122834140 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x0000000124004080 native:0x00000001f4bd1e00 int:0
04:53:43 solid.1 |    
04:53:43 solid.1 | * #<Thread:0x00000001234f93d0@DEBUGGER__::SESSION@server /Users/zain/.rvm/gems/ruby-3.0.0@devtree/gems/debug-1.7.2/lib/debug/session.rb:179 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x00000001118b04c0 native:0x000000016e587000 int:0
04:53:43 solid.1 |    
04:53:43 solid.1 | * #<Thread:0x00000001234e8968@worker-1 /Users/zain/.rvm/gems/ruby-3.0.0@devtree/gems/concurrent-ruby-1.2.2/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:332 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x00000001118b0db0 native:0x000000016e793000 int:0
04:53:43 solid.1 |    
04:53:43 solid.1 | * #<Thread:0x0000000123543d68 /Users/zain/.rvm/gems/ruby-3.0.0@devtree/gems/activerecord-7.0.4.3/lib/active_record/connection_adapters/abstract/connection_pool/reaper.rb:40 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x0000000113428570 native:0x000000016e99f000 int:0
04:53:43 solid.1 |    
04:53:43 solid.1 | * #<Thread:0x0000000123b86dd8@worker-1 /Users/zain/.rvm/gems/ruby-3.0.0@devtree/gems/concurrent-ruby-1.2.2/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:332 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x0000000113580970 native:0x000000016ebab000 int:0 mutex:0x0000000124004500 cond:1
04:53:43 solid.1 |    
04:53:43 solid.1 | * #<Thread:0x0000000123b61b28@io-worker-1 /Users/zain/.rvm/gems/ruby-3.0.0@devtree/gems/concurrent-ruby-1.2.2/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:332 sleep_forever>
04:53:43 solid.1 |    rb_thread_t:0x0000000114c25ae0 native:0x000000016edb7000 int:0 mutex:0x0000000113580bf0 cond:1

How can I make sure the job is retried after any exception? Any help would be appreciated.

Thanks

@rosa
Member

rosa commented Jan 2, 2024

Hey @zainonrails, what's your workers and dispatchers configuration? How are you enqueuing the job?

@zainonrails
Author

zainonrails commented Jan 2, 2024

@rosa I am enqueuing the job simply with MyJob.perform_later(id).

After adding records to the solid_queue_jobs and solid_queue_ready_executions tables, it shows the logs below.

[SolidQueue] Enqueued job {:queue_name=>"development_default", :active_job_id=>"b720f935-3277-403f-85a3-782794d79937", :priority=>nil, :scheduled_at=>Tue, 02 Jan 2024 00:59:10.827070000 UTC +00:00, :class_name=>"SubmitTestimonialJob", :arguments=>{"job_class"=>"SubmitTestimonialJob", "job_id"=>"b720f935-3277-403f-85a3-782794d79937", "provider_job_id"=>nil, "queue_name"=>"development_default", "priority"=>nil, "arguments"=>[49], "executions"=>0, "exception_executions"=>{}, "locale"=>"en", "timezone"=>"UTC", "enqueued_at"=>"2024-01-02T00:59:10Z"}, :concurrency_key=>nil}
05:59:10 web.1   | [ActiveJob] Enqueued SubmitTestimonialJob (Job ID: b720f935-3277-403f-85a3-782794d79937) to SolidQueue(development_default) with arguments: 49

And this is my workers and dispatchers configuration:

production:
  dispatchers:
    - polling_interval: 5
      batch_size: 100
  workers:
    - queues: '*'
      threads: 2
      processes: 1
      polling_interval: 5

development:
  dispatchers:
    - polling_interval: 1
      batch_size: 10
  workers:
    - queues: "*"
      threads: 3
      processes: 1
      polling_interval: 1

@rosa
Member

rosa commented Jan 2, 2024

Cool, and when you say "it is not working consistently" and "The behaviour seems unpredictable", what do you mean? Could you be more specific?

@zainonrails
Author

So, I am raising an exception in my job to see whether it picks up the retry settings.

After the job raises the error, I quickly update the job code to remove the exception to see if it gets retried, but unfortunately that doesn't happen. Upon restarting the server, it picks up all the jobs.

Maybe I am doing something wrong in how I test this behaviour. Can you briefly explain whether the retry settings would be picked up if a worker receives an error in the job?

Should I override on_thread_error with something that would make sure the job is retried?

@rosa
Member

rosa commented Jan 2, 2024

> After the job raises the error, I quickly update the job code to remove the exception to see if it gets retried, but unfortunately that doesn't happen. Upon restarting the server, it picks up all the jobs.

Hmm... my guess is that the 3 retries happen before you get a chance to update the job:

retry_on StandardError, attempts: 3, priority: 0

This would be using the default wait time between retries, which is 3 seconds, so after roughly 9 seconds, the 3 attempts would have been done and your job would fail permanently. You should see it in the solid_queue_failed_executions table, and you can check its arguments by doing something like:

$ bin/rails c
>> SolidQueue::FailedExecution.last.job
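
To get a feel for how short that window is, here is a rough, framework-free sketch of when each execution could start at the earliest. The helper name earliest_attempt_offsets is hypothetical, attempts is treated as the total number of executions, and jitter and the queue polling intervals are ignored, so real times will be somewhat later.

```ruby
# Earliest start time (in seconds after the first run) of each execution.
# Assumptions: `attempts` is the TOTAL number of executions, the wait between
# them is fixed, and jitter and polling delays are ignored.
def earliest_attempt_offsets(attempts:, wait:)
  (0...attempts).map { |n| n * wait }
end

p earliest_attempt_offsets(attempts: 3, wait: 3)
# => [0, 3, 6]
```

With wait: 60 the same call gives [0, 60, 120], which leaves enough time to watch each retry happen.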

@zainonrails
Author

I did try it with longer wait times, like 5 and 10 seconds, but let me observe with even longer wait times, then confirm and report back.

Thanks @rosa

@virolea

virolea commented Jan 6, 2024

I too wanted to check if jobs were properly retried in my setup. It turns out the first retries happen quite quickly, as suggested by Rosa.

You can also check the failed job's arguments to see if any retry was performed. Active Job increments the executions value.

job_id = YOUR_JOB_ID
@job = SolidQueue::Job.find(job_id)
@job.arguments["executions"] # This returns the number of times this job was executed
@job.arguments["exception_executions"] # This returns the breakdown of executions per exceptions

Here's an example for one of my jobs:

{
  "executions"=>12, 
  "exception_executions"=>{"[Exception]"=>8, "[ZeroDivisionError]"=>4}
}
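
If you check this often, the lookup can be wrapped in a small helper. This is only a sketch: retry_summary is a hypothetical name, it works on a plain Hash shaped like the arguments above, and it assumes a stored executions value above 0 means the job was re-enqueued at least once.

```ruby
# Hypothetical helper: summarizes the retry counters found in a job's
# serialized arguments (a plain Hash with string keys, as shown above).
def retry_summary(arguments)
  executions   = arguments.fetch("executions", 0)
  by_exception = arguments.fetch("exception_executions", {})

  {
    executions: executions,
    retried: executions > 0,                   # assumption: above 0 means it was re-enqueued
    failures_by_exception: by_exception,
    counted_failures: by_exception.values.sum  # total across all exception keys
  }
end

sample = {
  "executions" => 12,
  "exception_executions" => { "[Exception]" => 8, "[ZeroDivisionError]" => 4 }
}

summary = retry_summary(sample)
puts summary[:retried]           # true
puts summary[:counted_failures]  # 12
```

In a Rails console the same helper could be fed @job.arguments from the snippet above.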

@zainonrails
Author

@rosa @virolea

I experimented with it again, this time without changing the code in between, and with a 1 minute wait between retries.

The job was retried as expected and the number of executions was also updated. Just one question though: does attempts: 3 mean the job would run a total of 3 times, or be retried 3 times?

What I observed is that it runs a total of the number of times we set, not that many retries, which is why executions is attempts - 1.

@rosa
Member

rosa commented Jan 9, 2024

Thanks @virolea, @zainonrails!

> Just one question though: does attempts: 3 mean the job would run a total of 3 times, or be retried 3 times?

A total of 3 times according to Active Job's attempts parameter.
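
For anyone who wants to see that counting spelled out, below is a simplified, framework-free model of it. It is not Active Job's implementation: RetryModel and its attribute names are hypothetical, waits are skipped, and the only assumption carried over from this thread is that the executions value is saved each time the job is re-enqueued for a retry.

```ruby
# Simplified model of `retry_on ..., attempts: N` (hypothetical, not Active Job code).
class RetryModel
  attr_reader :runs, :serialized_executions

  def initialize(attempts:)
    @attempts = attempts
    @runs = 0                   # how many times the job body actually ran
    @serialized_executions = 0  # "executions" as saved at the last enqueue
  end

  # Runs the block until it succeeds or the attempts budget is used up.
  # Returns :succeeded or :failed.
  def run
    executions = @serialized_executions
    loop do
      executions += 1
      @runs += 1
      begin
        yield
        return :succeeded
      rescue StandardError
        # Budget exhausted: fail permanently without re-enqueuing.
        return :failed if executions >= @attempts
        # Otherwise re-enqueue, saving the current counter.
        @serialized_executions = executions
      end
    end
  end
end

model = RetryModel.new(attempts: 3)
result = model.run { raise StandardError }
puts result                       # failed
puts model.runs                   # 3
puts model.serialized_executions  # 2
```

The job body runs 3 times in total, while the last saved counter is 2, which matches the "executions is attempts - 1" observation above.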

@zainonrails
Author

Thanks for the help everyone, closing this issue.
