Fix zombie SSH processes with --sftp-ssh #8930

Merged
ncw merged 4 commits into master from copilot/fix-zombie-ssh-processes on Nov 4, 2025

Conversation

Copilot AI commented Oct 30, 2025

Contributor

When using external SSH binaries via --sftp-ssh, zombie processes accumulate because sshClientExternal.Wait() is never actually called. The root cause was that sshClientExternal.session was never assigned in NewSession(), so Wait() always returned nil immediately without waiting for the SSH process to exit.

Changes

  • Store the first session: Modified NewSession() to assign the created session to s.session so Wait() can actually wait for the SSH process
  • Use sync.Once: Added sync.Once in sshSessionExternal.Wait() to ensure cmd.Wait() executes exactly once (thread-safety)
  • Store and return wait result: Added waitErr field to store and return the wait result on subsequent calls
  • Add tests: Added tests verifying multiple Wait() calls don't cause process leaks

// In sshClientExternal
func (s *sshClientExternal) NewSession() (sshSession, error) {
	session := s.f.newSSHSessionExternal()
	if s.session == nil {
		// Store the first session so Wait() and Close() can use it
		s.session = session
	} else {
		fs.Debugf(s.f, "ssh external: creating additional session")
	}
	return session, nil
}

// In sshSessionExternal
type sshSessionExternal struct {
	// ... existing fields
	waitOnce sync.Once
	waitErr  error
}

func (s *sshSessionExternal) Wait() error {
	s.waitOnce.Do(func() {
		s.waitErr = s.cmd.Wait()
		// ... logging
	})
	return s.waitErr
}

Fixes #8929
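
The sync.Once behaviour described in the changes above can be checked without a real SSH process. The following is a minimal sketch, not the code from this PR: `onceWaiter`, `errExit` and `countWaitCalls` are hypothetical names, and the `wait` field stands in for `cmd.Wait()`. It calls `Wait()` from several goroutines and counts how often the underlying wait actually runs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// onceWaiter is a hypothetical stand-in for sshSessionExternal: the wait
// field plays the role of cmd.Wait and must run at most once.
type onceWaiter struct {
	wait     func() error // stand-in for cmd.Wait
	waitOnce sync.Once
	waitErr  error
}

// Wait runs the underlying wait exactly once and returns the stored result
// to every caller, including concurrent ones.
func (w *onceWaiter) Wait() error {
	w.waitOnce.Do(func() {
		w.waitErr = w.wait()
	})
	return w.waitErr
}

// errExit simulates a non-zero exit status from the child process.
var errExit = errors.New("exit status 1")

// countWaitCalls calls Wait from n goroutines. It reports how many times the
// underlying wait ran and whether every caller received the same error.
func countWaitCalls(n int) (calls int, allSame bool) {
	w := &onceWaiter{wait: func() error {
		calls++ // safe: sync.Once guarantees a single execution
		return errExit
	}}
	results := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = w.Wait()
		}(i)
	}
	wg.Wait()
	allSame = true
	for _, err := range results {
		if err != errExit {
			allSame = false
		}
	}
	return calls, allSame
}

func main() {
	calls, allSame := countWaitCalls(8)
	fmt.Println("underlying wait calls:", calls)
	fmt.Println("all callers saw the same error:", allSame)
}
```

Storing the result in a field is what lets later callers see the real exit status instead of a fresh error from a second call to the underlying wait.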

Original prompt

This section details the original issue you should resolve

<issue_title>sftp: zombie ssh processes with --sftp-ssh</issue_title>
<issue_description>The integration tests show that when using SFTP with --sftp-ssh lots of zombie processes are left behind.

Not sure if this is integration test specific but it needs investigating.

 848174 pts/30   Sl+    0:01  |   \_ go run ./fstest/test_all -backends sftp
 848254 pts/30   Sl+    0:00  |       \_ /tmp/go-build842052937/b001/exe/test_all -backends sftp
 849546 pts/30   Sl+    0:00  |           \_ ./sync.test -test.v -test.timeout 1h0m0s -remote TestSFTPRcloneSSH:
 849756 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 850029 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 850068 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 851789 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 852312 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 852321 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 852577 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 852722 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>

How to use GitHub

  • Please use the 👍 reaction to show that you are affected by the same issue.
  • Please don't comment if you have no relevant information to add. It's just extra noise for everyone subscribed to this issue.
  • Subscribe to receive notifications on status change and new comments.
    </issue_description>

<agent_instructions>Please fix this bug which is to do with ssh processes not being reaped properly.</agent_instructions>
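
In the process listing from the issue, the zombies are the lines whose STAT column starts with `Z`. As a rough way to check whether a fix helps, the listing could be counted programmatically. The sketch below is an illustration only: `countDefunct` is a hypothetical helper, and it assumes the column order shown in the issue (PID, TTY, STAT, TIME, COMMAND).

```go
package main

import (
	"fmt"
	"strings"
)

// countDefunct counts lines of ps-style output whose third column (STAT)
// begins with "Z". The column layout is an assumption taken from the
// listing in the issue: PID, TTY, STAT, TIME, COMMAND.
func countDefunct(psOutput string) int {
	n := 0
	for _, line := range strings.Split(psOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 3 && strings.HasPrefix(fields[2], "Z") {
			n++
		}
	}
	return n
}

// sample is a short excerpt of the listing from the issue.
const sample = ` 849546 pts/30   Sl+    0:00  |           \_ ./sync.test -test.v -test.timeout 1h0m0s -remote TestSFTPRcloneSSH:
 849756 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>
 850029 pts/30   Z+     0:00  |           |   \_ [ssh] <defunct>`

func main() {
	fmt.Println("defunct processes:", countDefunct(sample))
}
```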

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits October 30, 2025 17:56
… only once

The issue was that cmd.Wait() was being called multiple times on the same
process - once in the background goroutine and once in Close(). This could
lead to zombie processes because only the first call to Wait() properly
reaps the process.

The fix uses sync.Once to ensure Wait() is only called once per SSH process,
storing and returning the result on subsequent calls.

Added tests to verify the fix works correctly.

Co-authored-by: ncw <536803+ncw@users.noreply.github.com>

Remove unnecessary exited() check - we should always call cmd.Wait()
in the sync.Once block to properly reap the process and capture its
exit status. The sync.Once ensures it's only called once.

Co-authored-by: ncw <536803+ncw@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix zombie SSH processes on SFTP with --sftp-ssh" to "Fix zombie SSH processes with --sftp-ssh" on Oct 30, 2025
Copilot AI requested a review from ncw October 30, 2025 18:06

@ncw ncw left a comment

Member

I think your analysis is wrong. The ssh zombies are caused by Wait not being called at all, rather than multiple times.

I tested your fix and unfortunately it does not fix the original problem.

However I think the implementation idea is good as are the tests, I just think there needs to be more to it.

There seems to be a problem with session in the sshClientExternal structure: it is never assigned to, which means that sshClientExternal.Wait() will never actually call Wait. Maybe this is the problem. Please investigate further.
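
The failure mode described in this review can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the rclone code: the `client` and `session` types are hypothetical stand-ins for sshClientExternal and sshSessionExternal, and the nil check in `client.Wait` is inferred from the review rather than copied from the source. The `storeSession` flag switches between the behaviour before and after the fix.

```go
package main

import "fmt"

// session is a hypothetical stand-in for sshSessionExternal. The waited
// field records whether the child process would have been reaped.
type session struct{ waited bool }

// Wait marks the session as waited on. In the real code this is where
// cmd.Wait() would run and reap the ssh process.
func (s *session) Wait() error {
	s.waited = true
	return nil
}

// client is a hypothetical stand-in for sshClientExternal.
type client struct {
	session      *session
	storeSession bool // false models the bug, true models the fix
}

// NewSession creates a session. Only the fixed variant remembers the first
// session so that client.Wait can reach it later.
func (c *client) NewSession() *session {
	s := &session{}
	if c.storeSession && c.session == nil {
		c.session = s
	}
	return s
}

// Wait waits on the stored session, or returns nil at once if none is stored.
func (c *client) Wait() error {
	if c.session == nil {
		return nil
	}
	return c.session.Wait()
}

// reaped reports whether client.Wait reached the session from NewSession.
func reaped(storeSession bool) bool {
	c := &client{storeSession: storeSession}
	s := c.NewSession()
	_ = c.Wait()
	return s.waited
}

func main() {
	fmt.Println("buggy client reaps session:", reaped(false))
	fmt.Println("fixed client reaps session:", reaped(true))
}
```

In the buggy variant `client.Wait` returns nil without touching the session, which matches the observation that the ssh child was never waited on.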

Comment thread backend/sftp/ssh_external.go Outdated
}
return err
// Use sync.Once to ensure we only wait for the process once
// This prevents zombie processes that occur when Wait() is called multiple times

Member

This comment is wrong - zombies do not occur when Wait() is called multiple times.

Contributor Author

Corrected in 46bc876 - the comment now accurately reflects that sync.Once provides thread-safety, not zombie prevention.
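
The point made in this thread can be seen with the standard library alone. The sketch below starts a short-lived child and calls `Wait()` on the same `exec.Cmd` twice. The first call reaps the child. The second call does not create a zombie, it simply returns an error from os/exec saying that Wait was already called. The `doubleWait` helper is hypothetical, and the `true` command is an assumption (a Unix-like system). Any command that exits promptly would do.

```go
package main

import (
	"fmt"
	"os/exec"
)

// doubleWait starts a short-lived child and calls Wait on it twice.
// The command "true" is an assumption about the host system.
func doubleWait() (first, second, startErr error) {
	cmd := exec.Command("true")
	if startErr = cmd.Start(); startErr != nil {
		return nil, nil, startErr
	}
	first = cmd.Wait()  // reaps the child and records its exit status
	second = cmd.Wait() // child already reaped: os/exec returns an error
	return first, second, nil
}

func main() {
	first, second, startErr := doubleWait()
	if startErr != nil {
		fmt.Println("could not start child:", startErr)
		return
	}
	fmt.Println("first Wait: ", first)
	fmt.Println("second Wait:", second)
}
```

This is why sync.Once is useful here for returning a consistent result to every caller, while the zombies themselves come from Wait never being called.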

ncw commented Oct 31, 2025

Member

@copilot did you see my review?

…ernal

The real issue was that sshClientExternal.session was never assigned,
so Wait() always returned nil without waiting for the SSH process to exit.
This caused zombie processes because the process was never reaped.

The fix:
- Store the first session created in NewSession() to s.session
- This allows Wait() to actually wait for the SSH process
- The sync.Once pattern is still useful for thread-safety
- Updated comments to reflect the correct behavior

Fixes the zombie process issue reported in #8929

Co-authored-by: ncw <536803+ncw@users.noreply.github.com>

Copilot AI commented Oct 31, 2025

Contributor Author

I think your analysis is wrong. The ssh zombies are caused by Wait not being called at all, rather than multiple times.

I tested your fix and unfortunately it does not fix the original problem....

You're absolutely right! The real issue was that s.session was never assigned in NewSession(), so Wait() always returned nil without actually waiting. Fixed in 46bc876 by storing the first session so Wait() can properly reap the SSH process.

@copilot did you see my review?

Yes, addressed in 46bc876 - the session is now properly stored so Wait() actually waits for the SSH process.

Copilot AI requested a review from ncw October 31, 2025 11:55
@ncw ncw marked this pull request as ready for review November 4, 2025 12:00

@ncw ncw left a comment

Member

This is looking good now.

I will merge now - thank you.

@ncw ncw merged commit ee92673 into master Nov 4, 2025
35 of 36 checks passed