Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite connection reaper test with timeout #51038

Merged
merged 1 commit into from
Feb 17, 2024
Merged

Conversation

fatkodima
Copy link
Member

Trying to debug reaper_test.rb.

@fatkodima
Copy link
Member Author

fatkodima commented Feb 10, 2024

After the second attempt (#50037), I was finally be able to debug that flaky test that is cancelled after the 30m timeout time!

See https://buildkite.com/rails/rails/builds/104764#018d9373-57da-4bb8-a019-e1e2a9c12f41/1156-1587

We fork a process in

but it never gets run and the runtime is blocked waiting for its termination
Process.waitpid(pid)

We can wrap that waitpid with Timeout.timeout, but it will be better to understand why the process did not run? Do buildkite have a limit on some number of processes and we had leaked processes from somewhere else? Or what can be the reason?

cc @byroot Do you have any ideas?

@byroot
Copy link
Member

byroot commented Feb 11, 2024

but it never gets run and the runtime is blocked waiting for its termination

If the parent process is blocked on wait, it means the child process is still alive. So in a way, it "gets run".

It likely run into some sort of deadlock.

@fatkodima
Copy link
Member Author

@matthewd
Copy link
Member

$stdout.buffer?

@byroot
Copy link
Member

byroot commented Feb 12, 2024

What can be the reason?

This test is about restarting the reaper thread in the child, which we do from an after_fork callback that is executed before everything else. So it probably is deadlocked in there.

def after_fork_callback
new_pid = Process.pid
if @pid != new_pid
@callbacks.each(&:call)
@pid = new_pid
end
end

@fatkodima
Copy link
Member Author

@byroot Do you have an idea on why that can be and how to solve it? This test is so annoying ... If this is not easily solvable/debuggable, maybe we can wrap it into some timeout to reduce the damage, until it gets solved?

@byroot
Copy link
Member

byroot commented Feb 12, 2024

Do you have an idea on why that can be and how to solve it?

No, I'd need to dig into it. I barely had a look.

maybe we can wrap it into some timeout to reduce the damage, until it gets solved?

It's probably a good idea in general when forking to then invoke code. Might be worth extracting some sort of test helper.

The proper way to timeout a forked process being to use a pipe, allowing to do reader_pipe.wait_readable(timeout).

@fatkodima fatkodima changed the title Debug CI flaky test Rewrite connection reaper test with timeout Feb 17, 2024
@fatkodima
Copy link
Member Author

@byroot Rewrote that test with your suggestion. Please take a look.

CI is red, but seems unrelated.

Process.waitpid(pid)
assert_predicate $?, :success?
else
Process.kill("KILL", pid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should call Process.wait even after a SIGKILL otherwise the child will linger around as a zombie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe excessive, but rather than SIGKILL, you could send SIGABRT and cause the child process to print a Ruby crash report, which would include its main thread backtrace, giving more information on where it's stuck.

Comment on lines 155 to 160
if completed
Process.waitpid(pid)
assert_predicate $?, :success?
else
Process.kill("KILL", pid)
flunk("Process timed out")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if completed
Process.waitpid(pid)
assert_predicate $?, :success?
else
Process.kill("KILL", pid)
flunk("Process timed out")
unless completed
Process.kill("KILL", pid)
end
_, status = Process.wait2(pid)
assert_predicate status, :success?

@fatkodima
Copy link
Member Author

Thanks! Applied all the suggestions.

@byroot byroot merged commit 057563a into rails:main Feb 17, 2024
3 of 4 checks passed
@fatkodima fatkodima deleted the debug-ci branch February 17, 2024 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants