Reliability

There are three aspects of reliability with Sidekiq and Redis:

pushing jobs to Redis with the client
fetching jobs from Redis with the server
scheduling jobs

Setup

TL;DR To use the Reliability features in Sidekiq Pro, add this to your initializer:

Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.reliable_fetch!
  config.reliable_scheduler!
end

You will also need to add -i to your sidekiq command line. Read on for more detail.

Server

Sidekiq uses BRPOP to pop a job off the queue in Redis. This is simple and very efficient, but it has one drawback: the job is now removed from Redis. If Sidekiq crashes while processing that job, it is lost forever. This is not a problem for many applications, but some businesses need absolute reliability when processing jobs.

Sidekiq does its best to never lose jobs but it can't guarantee it; the only way to guarantee job durability is to not remove a job from Redis until it is complete. For instance, if Sidekiq is restarted mid-job, it will try to push the unfinished jobs back to Redis, but networking issues can prevent this.

Sidekiq Pro offers an alternative strategy for job processing using Redis' RPOPLPUSH command which keeps jobs in Redis. To enable "reliable fetch" you must tag each process on a machine with a unique index and enable the strategy:

Start Sidekiq with a unique index for each process on the machine:

sidekiq -e production -i 0
sidekiq -e production -i 1
sidekiq -e production -i 2

Enable reliable fetch in your initializer:

Sidekiq.configure_server do |config|
  # This needs to be within the configure_server block
  config.reliable_fetch!
end

When Sidekiq starts, you should see ReliableFetch activated:

INFO: Booting Sidekiq 2.6.2 with Redis at redis://localhost:6379/0
INFO: Running in ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-darwin11.4.2]
INFO: Sidekiq Pro 0.9.0, commercially licensed.  Thanks for your support!
INFO: ReliableFetch activated
INFO: Starting processing, hit Ctrl-C to stop

Any jobs which are not fully processed (e.g. due to a segfault or network failure) are restarted upon process restart.
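The difference between the two fetch strategies can be sketched without a real Redis server. The example below is a minimal in-memory illustration, not Sidekiq Pro's implementation: `FakeRedis` is a hypothetical stand-in that mimics only the list semantics of RPOP, RPOPLPUSH and LREM, and the private queue name is an assumed example.

```ruby
# In-memory stand-in for Redis lists; no real Redis calls are made.
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  def lpush(key, value)
    @lists[key].unshift(value)
  end

  # Destructive pop: the job leaves the store entirely.
  def rpop(key)
    @lists[key].pop
  end

  # Moves the tail of `source` to the head of `destination` in one step.
  def rpoplpush(source, destination)
    value = @lists[source].pop
    @lists[destination].unshift(value) unless value.nil?
    value
  end

  # Removes the first occurrence of `value` from the list.
  def lrem(key, value)
    index = @lists[key].index(value)
    @lists[key].delete_at(index) if index
  end

  def lrange(key)
    @lists[key].dup
  end
end

redis = FakeRedis.new

# Basic fetch: if the process dies after the pop, nothing is left to recover.
redis.lpush("queue:default", "job-a")
redis.rpop("queue:default")
basic_leftover = redis.lrange("queue:default")

# Reliable fetch: the job sits in a per-process private list while it runs.
private_queue = "queue:default_host1_0" # hypothetical name for process index 0
redis.lpush("queue:default", "job-b")
in_flight = redis.rpoplpush("queue:default", private_queue)

# A crash at this point leaves the job visible for recovery on restart.
recoverable = redis.lrange(private_queue)

# After successful processing, the job is acknowledged by removing it.
redis.lrem(private_queue, in_flight)
after_ack = redis.lrange(private_queue)
```

In this sketch `basic_leftover` is empty, `recoverable` still holds the unfinished job, and `after_ack` is empty once the job has been acknowledged.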

Limitations

Sidekiq Pro's reliable fetch does not work with Amazon's Elastic Beanstalk and Container Services. These services provide no way to pass a unique index to each process.

Redis does not provide a blocking reliable queue operation that works for multiple queues. This means that the Reliability feature needs to poll each queue. In practice that means if you have a lot of queues, your reliable Sidekiq processes will call Redis with up to (#queues * #processes) operations per second. If you have 100 queues and 5 processes, they can hit Redis up to 500 times per second! This is why I recommend you minimize the number of queues your application uses.
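The upper bound above is simple arithmetic. A tiny sketch, with a hypothetical helper name, for estimating it for your own deployment:

```ruby
# Upper bound on Redis polling operations per second, following the
# (#queues * #processes) estimate above. The method name is illustrative only.
def max_redis_polls_per_second(queue_count, process_count)
  queue_count * process_count
end

max_redis_polls_per_second(100, 5) # => 500
```

Halving the number of queues halves this bound, which is the reasoning behind keeping the queue count small.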

You must ensure that any old process using reliable fetch is shut down before starting up a new process to replace it during deploy. If old and new processes are running at the same time, it's possible for jobs to be processed twice.

Heroku

Use $DYNO and some bash trickery to set a unique index for each worker process in your Procfile:

# DYNO will be set to worker.1, worker.2, etc for each worker process
worker: bundle exec sidekiq -e production -i ${DYNO:-1}

Cloud66

Cloud66 has implemented an option similar to Heroku's. Use {{UNIQUE_INT}} to assign a unique integer to the process. This integer is unique across processes on a server, so multiple processes won't clash, but it is not guaranteed to be unique across servers.

worker: bundle exec sidekiq -e production -i {{UNIQUE_INT}}

Autoscaling

Autoscaling down can be dangerous: any jobs left in the private queue for a process won't be executed. Jon Hyman explains how they do it at Appboy:

When we autoscale down, we first run a script which gracefully stops all the Sidekiq workers, and returns an error if they do not stop after the timeout period, and that aborts the auto-scale-down.

As for pending jobs in internal queues, we also wrote our own monitoring that does a scan_each with the pattern queue:*_*_* , saves everything it finds, and diffs every half hour to see if we have any jobs that are "stuck" (where stuck could be due to a frozen process or a machine that died that had jobs in an internal queue). We then manually reconcile, as it is an uncommon occurrence.
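The snapshot-and-diff idea from the quote can be sketched as follows. This is not Appboy's script, which is not shown here; it assumes each snapshot has already been collected into a Hash mapping a private queue name to its job payloads, and it treats a job as "stuck" when it appears in the same private queue in two consecutive snapshots. The method name and sample data are hypothetical.

```ruby
# Returns the jobs that were present in the same private queue in both
# snapshots. Each snapshot is a Hash of queue name => Array of job payloads.
def stuck_jobs(previous_snapshot, current_snapshot)
  current_snapshot.each_with_object({}) do |(queue, jobs), stuck|
    lingering = jobs & previous_snapshot.fetch(queue, [])
    stuck[queue] = lingering unless lingering.empty?
  end
end

# Hypothetical snapshots taken half an hour apart.
previous = { "queue:default_app1_0" => ["job-1", "job-2"] }
current  = {
  "queue:default_app1_0" => ["job-2"],
  "queue:bulk_app2_1"    => ["job-9"]
}

stuck_jobs(previous, current)
# => {"queue:default_app1_0"=>["job-2"]}
```

Jobs that only appear in the newer snapshot are not reported, since they may simply be in progress.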

Fetch algorithms

Reliable fetch supports the same two fetch algorithms as Sidekiq's basic fetch: strict priority and weighted random.

Strict queue ordering algorithm

sidekiq -e production -i 0 -q critical -q default -q bulk

Beware that strict prioritization can lead to starvation: bulk jobs will only be processed once the critical and default queues are empty. You can switch priorities for different processes to ensure everyone gets processed:

sidekiq -e production -i 0 -q critical -q default -q bulk
sidekiq -e production -i 1 -q bulk -q default -q critical
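A minimal sketch of why reversing the order on a second process avoids starvation. Queues are modelled as plain in-memory arrays and `next_job_strict` is a hypothetical helper, not part of Sidekiq:

```ruby
# Returns [queue_name, job] from the first non-empty queue in priority order,
# or nil when every queue is empty. `queues` maps queue name => Array of jobs.
def next_job_strict(queues, ordered_names)
  ordered_names.each do |name|
    job = queues.fetch(name, []).shift
    return [name, job] if job
  end
  nil
end

queues = { "critical" => ["c1", "c2"], "default" => [], "bulk" => ["b1"] }

# Process 0 keeps serving critical while it has work.
next_job_strict(queues, %w[critical default bulk]) # => ["critical", "c1"]

# Process 1 checks bulk first, so bulk jobs still make progress.
next_job_strict(queues, %w[bulk default critical]) # => ["bulk", "b1"]
```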

Weighted random algorithm

sidekiq -e production -i 0 -q critical,3 -q default,2 -q bulk,1

When using weighted queues, Sidekiq will randomly choose a queue to check, without blocking, using weighted random choice. For example, in the command given above, Sidekiq will sample from the array ["critical", "critical", "critical", "default", "default", "bulk"].
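The sampling array can be derived from the weights directly. A short sketch, using a hypothetical helper name and assuming the weights are given as a Hash in command-line order:

```ruby
# Expands queue weights into an array in which each queue name appears
# `weight` times, so a uniform sample is a weighted choice.
def weighted_queue_pool(weights)
  weights.flat_map { |name, weight| [name] * weight }
end

pool = weighted_queue_pool({ "critical" => 3, "default" => 2, "bulk" => 1 })
# pool is ["critical", "critical", "critical", "default", "default", "bulk"]

queue_to_check = pool.sample # "critical" is picked 3 times out of 6 on average
```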

Client

When the Sidekiq client pushes a job to Redis, it just assumes the network call will work. There's no error handling so any exception will trickle up into your app and cause a 500 error. The Sidekiq Pro client offers additional reliability by locally enqueueing the job for delivery once the network connection is successfully re-established.

There are a few limitations:

the local queue is per-process and in-memory so if the client process is restarted, the jobs are lost.
only the last 10,000 pushes are saved, to prevent a long-lasting outage from filling all memory.
the local queue doesn't work with Batches so any Redis network issues when creating a batch will still cause an exception and fail.
the local queue is drained the next time a job is pushed. If a push fails and then the process is idle for hours, that job will be unexpectedly delayed. Ideally your production system has enough traffic to ensure timely drainage.
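The behaviour described in this list can be illustrated with a small in-memory sketch. `BufferedPusher` is a hypothetical class written for this page, not the Sidekiq Pro client; it assumes delivery is a block that raises when Redis is unreachable, and it simplifies by routing every job through the local buffer.

```ruby
# Buffers jobs in memory when delivery fails and retries them on the next push.
class BufferedPusher
  attr_reader :pending

  def initialize(limit: 10_000, &deliver)
    @limit = limit
    @deliver = deliver # block that raises when the network call fails
    @pending = []
  end

  # Returns true when the buffer is empty after this push.
  def push(job)
    @pending << job
    drain
    # Keep only the most recent jobs so a long outage cannot exhaust memory.
    @pending.shift while @pending.size > @limit
    @pending.empty?
  end

  private

  # Delivers buffered jobs oldest first; stops at the first failure.
  def drain
    until @pending.empty?
      @deliver.call(@pending.first)
      @pending.shift
    end
  rescue StandardError
    # Delivery failed: leave the remaining jobs buffered until the next push.
    nil
  end
end

redis_up = false
delivered = []
pusher = BufferedPusher.new(limit: 2) do |job|
  raise IOError, "connection refused" unless redis_up
  delivered << job
end

%w[a b c].each { |job| pusher.push(job) } # outage: only "b" and "c" are kept
redis_up = true
pusher.push("d")                          # next push drains the buffer
# delivered is now ["b", "c", "d"] and pusher.pending is empty
```

Note that nothing is delivered between the failed pushes and the push of "d", which mirrors the delay described in the last limitation.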

You can activate "reliable push" in your sidekiq initializer:

# This should not go in a Sidekiq.configure_{client,server} block.
Sidekiq::Client.reliable_push! unless Rails.env.test?

Leave reliable push disabled during testing; otherwise it can swallow unexpected errors and let your test suite pass despite problems.

Scheduler

Sidekiq's default scheduler is not atomic: it pops jobs off the scheduled queue and enqueues them with two network round trips. Sidekiq Pro offers a reliable scheduler which uses Lua to perform the same task atomically:

Sidekiq.configure_server do |config|
  config.reliable_scheduler!
end

This feature is optional but highly recommended. It is not safe to enable if you are running Redis Cluster.
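To see why two round trips are risky, consider this in-memory sketch. It does not use Redis or Lua and is not Sidekiq's scheduler code; the method names and the simulated crash are hypothetical, and plain arrays stand in for the scheduled set and the target queue.

```ruby
class CrashBetweenSteps < StandardError; end

# Two separate steps: a failure between them loses the job.
def promote_in_two_steps(scheduled, queue, crash_after_pop: false)
  job = scheduled.shift # round trip 1: remove from the scheduled set
  return if job.nil?
  raise CrashBetweenSteps if crash_after_pop
  queue << job          # round trip 2: enqueue for processing
end

# Stands in for a server-side script: callers never observe a state in which
# the job is in neither collection.
def promote_atomically(scheduled, queue)
  job = scheduled.first
  return if job.nil?
  queue << job
  scheduled.shift
end

scheduled = ["job-1"]
queue = []
begin
  promote_in_two_steps(scheduled, queue, crash_after_pop: true)
rescue CrashBetweenSteps
  # scheduled and queue are both empty here: the job is gone
end

scheduled = ["job-2"]
queue = []
promote_atomically(scheduled, queue)
# queue is ["job-2"] and scheduled is empty
```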
