Ruby 3.3 and stuck workers. #1895
We've been experiencing similar issues with frozen workers ever since upgrading to Ruby 3.3.0.

Versions:
Small update: historically we have had some workers die, where they freeze and also lose their heartbeat. With Ruby 3.3, however, these workers never lose their heartbeat, and thus are never detected as dead. @rexmadden, @leboshi, have you had a similar experience? It seems there is a historical issue here that has mutated with Ruby 3.3.
We didn't get a chance to dive into a diagnosis as deeply as you did, but the symptoms are consistent with yours. Workers would run a few jobs and then just hang forever.
We've been having this exact issue. Since upgrading Ruby from 3.1.2 to 3.3.0 about a month ago, Resque workers have been dropping to ~0% CPU and emitting almost no logs, but not dying. They stopped working one by one over the course of a couple of days. Bouncing the service restored them for a while, but they eventually stop working again. In every pod's logs I've checked, the last two real things logged are similar, and the output always ends the same way. Normally we'd see the next job's logging after that, but it never appears.

Versions:
Update: We are no longer seeing lazy (0% CPU) workers after downgrading to Ruby 3.2.3. Obviously this is not a long-term solution though.
Just wanted to put an update here since I have been looking into this in the background, but I'm also not running Ruby 3.3 anywhere in production yet, so I haven't been able to triage at scale. I could use a little help in nailing down the exact behavior here, though, along with any logs or other info that people can provide. Specifically, are people seeing the actual forked workers die off/exit? Or are they actually running but just no longer receiving jobs? Lastly, what versions of the Redis server and gem are you running?
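For anyone trying to answer the first question (did the forked workers exit, or are they alive but idle?), one low-tech check is to compare each registered worker's claimed pid against the processes actually running on the host. This is a hedged sketch, not Resque code: `pid_alive?` is a hypothetical helper, and the commented usage assumes the resque gem plus a live Redis connection.

```ruby
# Hypothetical helper: signal 0 performs an existence check on a pid without
# actually delivering a signal.
def pid_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true # process exists but belongs to another user
end

# Usage against a live Resque install (requires Redis and the resque gem):
#
#   require 'resque'
#   Resque.workers.each do |w|
#     _host, pid, _queues = w.id.split(':')
#     puts "#{w.id} state=#{w.state} pid_alive=#{pid_alive?(pid.to_i)}"
#   end
```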
Thanks for taking a look! I just added some more version details to my comment above. We did not see the resque worker service pods exit; they kept running but stopped processing jobs. If I remember correctly, we did continue to see some random logging (unrelated to processing jobs). Not sure if that answers your question.
I accidentally closed this, and reopened. @PatrickTulskie What we noticed was that jobs that should take a few seconds were taking 45+ minutes, and the worker simply never stopped working that job. We also saw no logs from those workers, and we have extensive logging around our jobs; the only logs we got were from the enqueuer. We often get workers similar to this, but those workers are missing a heartbeat; we have an outside task that detects these and cleans them up. In this case, however, these workers still had active heartbeats and their state was working. When we unregistered these workers (by getting their ID and unregistering them) or restarted our worker dynos, we noticed that the failed jobs vanished; they did not end up in the failed queue as expected.
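A cleanup task along the lines described above can be sketched like this. It is an assumption-laden sketch, not anyone's production code: `stuck?` is a hypothetical helper, the 45-minute threshold comes from the runtimes mentioned above, and the commented usage requires the resque gem and a live Redis connection.

```ruby
require 'time'

STUCK_THRESHOLD = 45 * 60 # seconds; matches the 45+ minute runtimes above

# Returns true when the hash returned by Resque::Worker#job shows the job has
# been running longer than the threshold. Heartbeat-based checks will not
# catch these workers because their heartbeat thread is still beating.
def stuck?(job, now: Time.now)
  return false if job.nil? || job.empty? # an empty hash means the worker is idle
  now - Time.parse(job['run_at']) > STUCK_THRESHOLD
end

# Usage against a live Resque install (requires Redis):
#
#   require 'resque'
#   Resque.workers.each do |worker|
#     next unless stuck?(worker.job)
#     puts "unregistering #{worker.id}: #{worker.job['payload'].inspect}"
#     worker.unregister_worker
#   end
```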
This. The workers will not be marked as having expired heartbeats and will claim to still be working, but they'll be idle. We can also see via the logs that a particular job for that worker has finished, but the worker stays in this idle state anyway, not picking anything else up and still "working".

```ruby
irb(main)> Resque::Worker.all_workers_with_expired_heartbeats
[]
irb(main)> some_stalled_worker.working?
true
```

Unfortunately we've resorted to migrating to Sidekiq for some business-critical cron jobs.
I also haven't upgraded anything significant to Ruby 3.3 in production yet (I've been waiting for the 3.3.1 release, which will fix some bugs that affect my codebases), so I can't really offer anything but wild guesses at this point, but...

Unfortunately, (if I remember correctly) the heartbeat runs in a separate thread in the worker process, so it's only good for knowing if the worker process itself has frozen. It can't tell you if the code in the main thread has hung indefinitely or entered an infinite loop. And (again, IIRC) the heartbeat only runs in the parent worker process, and not in the forked child process that works each job; the parent simply waits on the child.

If (hypothetically) in Ruby 3.2 some system calls or other VM code could hang without a timeout _while holding the global VM lock_, and if Ruby 3.3 improved that code to no longer hold the global VM lock but it still hangs without a timeout, then that might explain your symptoms. This is just a wild guess of a hypothesis... but a quick …

Have you tried running with …?

If you have the ability to point …
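The limitation described above (a live heartbeat proving only that the process exists, not that it is making progress) can be shown with a toy script. This is plain Ruby for illustration, not Resque's actual heartbeat code:

```ruby
# Toy illustration: a background "heartbeat" thread keeps updating a
# timestamp even while the main thread is blocked, so a fresh heartbeat
# cannot tell you that the worker's main thread is hung.
last_beat = Time.now
lock = Mutex.new

heartbeat = Thread.new do
  5.times do
    sleep 0.05
    lock.synchronize { last_beat = Time.now } # beats regardless of progress
  end
end

sleep 0.3 # stand-in for the main thread hanging on a blocked call
heartbeat.join

age = Time.now - lock.synchronize { last_beat }
puts format('heartbeat age: %.2fs despite the main thread doing no work', age)
```

A monitor watching only `last_beat` would consider this process perfectly healthy the whole time.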
We recently updated to Ruby 3.3 and noticed that workers were becoming stuck processing a job. Sometimes a few workers became stuck right after a deploy; usually Resque would clear these. As Resque processed jobs, more and more workers would get stuck, and the jobs they were processing should take seconds, not an hour or more. Upon investigating, we added logging to the before/after hooks; none of these logs were ever emitted by a worker that got stuck.
Versions:

It is also possible this is a resque-pool gem issue.

Downgrading back to Ruby 3.2 seems to be the answer; we have not noticed anything stuck since.
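The before/after hook logging described above can be sketched as follows. `MyJob` is a hypothetical job class; the hook names follow Resque's plugin convention that class methods beginning with `before_perform`/`after_perform` are invoked around each `perform` call, so a stuck worker that never emits the "starting" line never even reached the hook.

```ruby
require 'logger'

LOGGER = Logger.new($stdout)

# Hypothetical job class for illustration. Resque calls any class methods
# named before_perform_* / after_perform_* around each perform invocation.
class MyJob
  @queue = :default

  def self.before_perform_log(*args)
    LOGGER.info("#{self} starting: #{args.inspect}")
  end

  def self.perform(payload)
    # real work goes here
  end

  def self.after_perform_log(*args)
    LOGGER.info("#{self} finished: #{args.inspect}")
  end
end
```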