Summary
When a worker is force-killed during shutdown (shutdown timeout exceeded), jobs protected by limits_concurrency can run concurrently after restart.
Steps to reproduce
class SlowJob < ActiveJob::Base
  limits_concurrency key: "slow_job", to: 1, duration: 5.minutes

  def perform
    sleep 1.hour
  end
end
- Enqueue 3 SlowJob instances
- Start SolidQueue supervisor in fork mode (worker thread_pool_size: 3)
- Wait for the first job to start (semaphore acquired, others blocked)
- Send SIGTERM to the supervisor
- SolidQueue.shutdown_timeout (default 5s) expires — supervisor force-kills the worker
- Start a new supervisor
- Two or more jobs start Performing concurrently, violating the concurrency limit of 1
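For context, SolidQueue enforces limits_concurrency at enqueue time: the first job acquires a row in solid_queue_semaphores and later jobs with the same key become blocked executions. A minimal stdlib-only sketch of that invariant for the repro above (the semaphore table is modeled as a hash; none of these names are SolidQueue's actual API):

```ruby
# Toy model of enqueue-time concurrency control with limit 1.
# A semaphore entry per key holds remaining capacity; jobs that cannot
# acquire it are parked as blocked executions instead of going ready.
semaphores = Hash.new { |h, k| h[k] = 1 }  # key => remaining slots
ready, blocked = [], []

3.times do |i|
  job = "SlowJob-#{i}"
  if semaphores["slow_job"] > 0
    semaphores["slow_job"] -= 1  # acquire the semaphore
    ready << job                 # goes to the ready queue
  else
    blocked << job               # waits for release or expiry
  end
end

puts "ready: #{ready.size}, blocked: #{blocked.size}"  # => ready: 1, blocked: 2
```

This is the state the shutdown interrupts: one job holds the semaphore mid-perform, two are blocked.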
Expected behavior
Only one SlowJob runs at a time after restart, same as before the shutdown.
Actual behavior
Multiple jobs with the same concurrency key run simultaneously after restart.
Root cause
Supervisor#start calls start_processes (line 39), which starts the dispatcher and workers concurrently. The dispatcher's ConcurrencyMaintenance is initialized with Concurrent::TimerTask.new(run_now: true), so it does run expire_semaphores and unblock_blocked_executions at boot, but on a background thread. Meanwhile, the workers start polling immediately and can claim ready jobs before that first maintenance pass completes.
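The key point is that run_now: true only controls when the first tick is scheduled; the tick still executes on the timer's own thread, so control returns to the boot path before the maintenance pass finishes. A stdlib sketch of that timing, using a plain Thread to stand in for Concurrent::TimerTask:

```ruby
maintenance_done = false

# Stand-in for Concurrent::TimerTask.new(run_now: true).execute: the first
# run is scheduled immediately, but it happens on a background thread.
maintenance = Thread.new do
  sleep 0.2                 # simulate the expire/unblock queries
  maintenance_done = true
end

# Starting the timer returns right away, so worker boot proceeds here
# while the maintenance pass is still in flight.
worker_saw_clean_state = maintenance_done

maintenance.join
puts worker_saw_clean_state  # => false: the worker raced ahead of maintenance
```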
The sequence:
- Old worker is force-killed mid-job, leaving a stale semaphore in solid_queue_semaphores
- Release claimed jobs runs, putting the interrupted job back in the ready queue
- New supervisor starts — dispatcher and workers boot concurrently
- Dispatcher's maintenance starts in a background thread (Concurrent::TimerTask)
- Worker starts polling (every 0.1s), claims multiple ready jobs before maintenance has expired the stale semaphore and unblocked blocked executions
- Concurrency limit is violated
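The sequence above can be reproduced with a stdlib-only simulation. Hashes and arrays stand in for the solid_queue_semaphores and execution tables, and a background thread plays the maintenance TimerTask; none of this is SolidQueue's actual API, but the interleaving is the one described:

```ruby
# State left behind after the force-kill: a stale, already-expired semaphore,
# the interrupted job back in ready, and one execution still blocked.
semaphores = { "slow_job" => { value: 0, expires_at: Time.now - 60 } }
ready     = ["interrupted-job"]
blocked   = ["blocked-job"]
running   = Queue.new

# Dispatcher boot: maintenance runs on a background thread, as with
# Concurrent::TimerTask.new(run_now: true).
maintenance = Thread.new do
  sleep 0.1  # thread spin-up / query latency
  semaphores.delete_if { |_, s| s[:expires_at] < Time.now }  # expire_semaphores
  ready.concat(blocked.slice!(0..))                          # unblock_blocked_executions
end

# Worker boot: polls immediately; claiming a ready job does not recheck
# the semaphore, because the check happened at enqueue time.
worker = Thread.new do
  6.times do
    running << ready.shift until ready.empty?
    sleep 0.05
  end
end

[maintenance, worker].each(&:join)
puts running.size  # => 2: both jobs are performing despite a limit of 1
```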
Observed in production logs
14:38:39 Supervisor wasn't terminated gracefully - shutdown timeout exceeded (5018.5ms)
14:38:39 Release claimed jobs (90.1ms) size: 1
...
14:51:47 ==> Your service is live
14:51:50 [Job ff2291c7] Performing RefreshDataJob (az4n-8mr2)
14:51:50 [Job b1ddfa0c] Performing RefreshDataJob (6sqe-dvqs)
Both jobs use limits_concurrency key: self (limit 1) but started in the same second after a deploy that triggered a non-graceful shutdown.
Possible fix
Run ConcurrencyMaintenance#expire_semaphores and #unblock_blocked_executions synchronously during dispatcher boot, before workers start polling. This would ensure stale semaphores from dead processes are cleaned up before any jobs are claimed.
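In the same toy model, the proposed ordering would look like this. This is a sketch, not a patch: the real change would live in the dispatcher's boot path, and the "unblock only up to capacity" step is a simplification of how SolidQueue's blocked executions re-acquire semaphores:

```ruby
# Toy boot sequence: maintenance completes inline, before any worker polls.
semaphores = { "slow_job" => { value: 0, expires_at: Time.now - 60 } }
ready, blocked, running = ["interrupted-job"], ["blocked-job"], []

# 1. Synchronous maintenance pass (the proposed change):
semaphores.delete_if { |_, s| s[:expires_at] < Time.now }  # expire_semaphores

# 2. unblock_blocked_executions, simplified: move blocked jobs to ready
#    only while there is capacity under the concurrency limit.
limit = 1
until blocked.empty?
  break unless ready.size + running.size < limit
  ready << blocked.shift
end

# 3. Workers start only now and claim whatever is ready.
running << ready.shift while ready.any?

puts "#{running.size} running, #{blocked.size} still blocked"
# => 1 running, 1 still blocked: the limit holds after restart
```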
Environment
- solid_queue 1.4.0
- Rails 8.1
- Ruby 3.4.7
- PostgreSQL 16
- Fork mode supervisor