
Ruby 3.3 and stuck workers. #1895

Open

ChrisCPO opened this issue Jan 22, 2024 · 9 comments

Comments

@ChrisCPO

ChrisCPO commented Jan 22, 2024

We recently updated to Ruby 3.3 and noticed that workers were becoming stuck while processing a job. Sometimes a few workers became stuck right after a deploy, and Resque would usually clear these. As Resque processed more jobs, more and more workers would get stuck, even though the jobs they were processing should take seconds, not an hour or more. While investigating we added logging to the before/after hooks; none of those log lines were ever written by a worker that got stuck.
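
Roughly, the hook logging looked something like this (a minimal sketch, not our exact code; the messages and the use of Rails.logger are illustrative):

# Sketch: log on both sides of the fork so we can see which side goes quiet.
# Resque passes the job being worked to both hooks.
Resque.before_fork = proc do |job|
  Rails.logger.info("[resque] before_fork pid=#{Process.pid} job=#{job.inspect}")
end

Resque.after_fork = proc do |job|
  Rails.logger.info("[resque] after_fork pid=#{Process.pid} job=#{job.inspect}")
end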

Versions

  • ruby: 3.3.0
  • resque: 2.6.0
  • rails: 7.1.2

It is also possible this is a resque-pool gem issue.

Downgrading back to Ruby 3.2 seems to be the answer; we have not noticed anything stuck since.

@ChrisCPO ChrisCPO changed the title all_workers_with_expired_heartbeats is not returning expired workers. Ruby 3.2 and stuck workers. Jan 23, 2024
@ChrisCPO ChrisCPO changed the title Ruby 3.2 and stuck workers. Ruby 3.3 and stuck workers. Jan 23, 2024
@tylerCaineRhodes

We've been experiencing similar issues with frozen workers, ever since upgrading to ruby 3.3.0:

Versions

  • ruby: 3.3.0
  • resque: 2.6.0
  • resque-pool: 0.7.1
  • rails: 7.1.3

@ChrisCPO
Author

ChrisCPO commented Feb 5, 2024

Small Update

Historically we have had some workers die: they freeze and also lose their heartbeat, which then expires. We check for expired workers every few minutes and kill them. Thanks to resque-retry those jobs are failed and a new worker performs them. Not perfect, but it works (a sketch of that reaper is just below).
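
A minimal sketch of that reaper, assuming it runs as a periodic task (cron, clockwork, etc.) outside the workers themselves:

# Prune workers whose heartbeat has expired. unregister_worker fails the
# worker's in-flight job, which lets resque-retry re-enqueue it.
Resque::Worker.all_workers_with_expired_heartbeats.each do |worker|
  worker.unregister_worker
end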

However, with Ruby 3.3 these workers never lose their heartbeat and thus never appear dead; they seem to "run" until their dyno/server is restarted. Let's call these ghost workers: they are neither alive nor dead. We also noticed what might be an increased number of ghost workers, though at the time we were running a large number of background workers.

@rexmadden, @leboshi, have you had a similar experience?

Seems like there is a historical issue here that has mutated with Ruby 3.3.

@leboshi

leboshi commented Feb 7, 2024

We didn't get a chance to dive into a diagnosis as deeply as you did, but the symptoms are consistent with yours. Workers would run a few jobs then just hang forever. In the /jobs page, we'd see the workers reporting that they were still processing the previous job they picked up, even hours later. We got tired of manually rebooting workers and just downgraded to Ruby 3.2.3. Obviously not tenable long-term.

@alexandrachakeres-wk

alexandrachakeres-wk commented Feb 19, 2024

We've been having this exact issue. Since upgrading Ruby from 3.1.2 to 3.3.0 about a month ago, resque workers are going to ~0% CPU / stopping most logs, but not dying. They stopped working one by one over the course of a couple days. Bouncing the service restored them for a while, but they eventually stop working again.

In every pod's logs I've checked, the last 2 real things logged are similar to the following (always ends with Running before_fork hooks with... but different jobs/args):

got: (Job{schedule} | ScheduleJob | [195293])
Running before_fork hooks with [(Job{schedule} | ScheduleJob | [195293])]

Normally the next thing we'd see is something like Running after_fork hooks with [(Job{schedule} | ScheduleJob | [195293])], but it never gets there. FYI, our code doesn't have any Resque.before_fork or Resque.after_fork blocks (as documented in the resque docs). So it sounds like something could be going wrong in the default before_fork hook, considering the docs say "The before_fork hook will be run in the parent process. So, be careful - any changes you make will be permanent for the lifespan of the worker." Or maybe (more likely) it's going wrong during the fork itself. I would be curious whether others also saw Running before_fork hooks... as the final real log before things went basically silent. It could help narrow in on a fix.

Versions:

  • ruby: 3.3.0
  • resque: 2.6.0
  • rails: 6.1.7.6
  • redis gem: 4.6.0
  • redis: 5.0.6
  • We do not use the resque-pool gem

Update: We are no longer seeing lazy (0% CPU) workers after downgrading to Ruby 3.2.3. Obviously this is not a long-term solution though.

@PatrickTulskie
Contributor

Just wanted to put an update here since I have been looking into this in the background, but also I'm not running Ruby 3.3 anywhere in production yet so I haven't been able to triage at scale.

I could use a little help nailing down the exact behavior here, though, along with any logs or other info people can provide. Specifically, are people seeing the actual forked workers die off/exit? Or are they actually running but just no longer receiving jobs? Lastly, what versions of the redis server and gem are you running?

@alexandrachakeres-wk

Thanks for taking a look! I just added some more version details to my comment above. We did not see the resqueworker service pods exit. They kept running but stopped processing jobs. If I remember correctly, we did continue to see some random logging (unrelated to processing jobs). Not sure if that answers your question.

@ChrisCPO ChrisCPO reopened this Mar 19, 2024
@ChrisCPO
Author

ChrisCPO commented Mar 19, 2024

I accidentally closed this and reopened it.

@PatrickTulskie What we noticed was that jobs that should take a few seconds were taking 45+ minutes, and the worker simply never stopped working that job. We also saw no logs from those workers, and we have extensive logging around our jobs. The only logs we got were from the enqueuer, before_enqueue and after_enqueue, which I think happen in the enqueueing code, Resque.enqueue(...) etc.

We often get workers similar to this, but those workers are missing a heartbeat. We have an outside task that detects them and runs Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }, which kills the worker and re-enqueues the job for us.

However, in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.
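
For reference, a formatted version of that console check (with one small addition: guarding against a nil heartbeat, since Time.now - nil would raise):

# For each registered worker, show who it is, its state, its last heartbeat,
# and how stale that heartbeat is.
Resque::Worker.all.map do |w|
  age = w.heartbeat && (Time.now - w.heartbeat)
  [w.to_s, w.state, w.heartbeat, age, w.started]
end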

When we unregistered these workers (by getting their IDs and unregistering them, or by restarting the worker dynos), we noticed that the failed jobs vanished; they did not end up in the failed queue as expected.

@tylerCaineRhodes

> However, in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.

This. The workers will not be marked as having expired heartbeats and will claim to still be working, but they'll be idle... We can also see via the logs that a particular job for that worker has finished, yet the worker stays in this idle state anyway, not picking anything else up and still "working".

irb (main) > Resque::Worker.all_workers_with_expired_heartbeats
[]

irb (main) > some_stalled_worker.working?
true
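
One hypothetical way to surface these ghosts from a console (a sketch, assuming worker.job returns the stock Resque payload hash with a "run_at" timestamp; the one-hour threshold is arbitrary):

require "time"

# Workers whose heartbeat is fine but that have "worked" the same job for
# over an hour.
ghosts = Resque::Worker.all.select do |w|
  job = w.job
  w.working? && job["run_at"] && (Time.now - Time.parse(job["run_at"])) > 3600
end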

Unfortunately we've resorted to migrating to Sidekiq for some business critical cron jobs.

@nevans
Contributor

nevans commented Apr 5, 2024

I also haven't upgraded anything significant to ruby 3.3 in production yet (I've been waiting for the 3.3.1 release, which will fix some bugs that affect my codebases). So I can't really offer anything but wild guesses at this point, but...

Unfortunately, (if I remember correctly) the heartbeat runs in a separate thread in the worker process, so it's only good for knowing if the worker process itself has frozen. It can't tell you if the code in the main thread has hung indefinitely or entered an infinite loop. And (again, IIRC) the heartbeat only runs in the parent worker process, and not in the forked child process that works each job. The parent simply waits on the child with Process.waitpid(@child).
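
A minimal, self-contained sketch of that pattern (not Resque's actual code) to illustrate why the heartbeat can stay fresh while the job hangs:

# The parent runs a heartbeat thread, then blocks in Process.waitpid on the
# forked child. If the child hung forever, the parent would still look "alive"
# because the heartbeat thread keeps beating.
heartbeat = Thread.new do
  loop do
    puts "heartbeat from parent #{Process.pid} at #{Time.now}" # stand-in for the Redis write
    sleep 1
  end
end

child = fork do
  sleep 5 # stands in for working the job; imagine this never returning
end

Process.waitpid(child) # parent blocks here until the child exits
heartbeat.kill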

If (hypothetically) in Ruby 3.2 some system calls or other VM code could hang without a timeout while holding the global VM lock, and Ruby 3.3 improved that code to no longer hold the global VM lock but it still hangs without a timeout, then that might explain your symptoms. This is just a wild guess of a hypothesis... but a quick git log -G waitpid in the ruby repo does indicate that changes were made to Process.waitpid between 3.2 and 3.3, and it's probably not related to this bug. But... maybe?

Have you tried running with ENV["FORK_PER_JOB"] = "false" on a percentage of your workers?
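
For example, something like this could flag a canary subset (a sketch; FORK_PER_JOB is read at worker boot, and RESQUE_CANARY here is a made-up flag you would set only on those dynos/pods):

# Disable per-job forking only on canary workers.
ENV["FORK_PER_JOB"] = "false" if ENV["RESQUE_CANARY"] == "1"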

If you have the ability to point rbspy, rbtrace, and/or strace at your stuck processes, that could be very informative. Even simpler, sometimes just looking at the proclines (via something like ps faux | grep resque | grep -v grep) can be useful or revealing.
