Skip to content

nixos/test-driver: add backdoor based on systemd-ssh-proxy & AF_VSOCK #392030

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 9, 2025

Conversation

Ma27
Copy link
Member

@Ma27 Ma27 commented Mar 22, 2025

Note: Only the last commit is relevant right now.
Depends on #390996 & #372979.


With this it's possible to trivially SSH into running machines from the
test-driver. This is especially useful when running VM tests
interactively on a remote system.

This is based on systemd-ssh-proxy(1), so there's no need to configure
any additional networking on the host-side.

Suggested-by: Ryan Lahfa masterancpp@gmail.com

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@Ma27 Ma27 requested review from tfc, RaitoBezarius and K900 March 22, 2025 08:22
@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 6.topic: testing Tooling for automated testing of packages and modules 6.topic: systemd Software suite that provides an array of system components for Linux operating systems. 8.has: documentation This PR adds or changes documentation 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. labels Mar 22, 2025
@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 2, 2025
@Ma27 Ma27 force-pushed the testdriver-ssh-backdoor-interactive branch from 057f211 to 0200cad Compare April 26, 2025 09:05
@github-actions github-actions bot removed the 6.topic: systemd Software suite that provides an array of system components for Linux operating systems. label Apr 26, 2025
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 26, 2025
@github-actions github-actions bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. and removed 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. labels Apr 26, 2025
With this it's possible to trivially SSH into running machines from the
test-driver. This is especially useful when running VM tests
interactively on a remote system.

This is based on `systemd-ssh-proxy(1)`, so there's no need to configure
any additional networking on the host-side.

Suggested-by: Ryan Lahfa <masterancpp@gmail.com>
@Ma27 Ma27 force-pushed the testdriver-ssh-backdoor-interactive branch from 0200cad to 1bd8073 Compare April 26, 2025 09:35
@Ma27 Ma27 marked this pull request as ready for review April 26, 2025 09:43
@Ma27
Copy link
Member Author

Ma27 commented Apr 26, 2025

@tfc this would be the next one in line: it essentially adds an optional SSH backdoor that can be used on interactive runs, based on AF_VSOCK (so you can skip key management and don't need to modify the networking for the test which is probably useful on networking-related testcases).

I don't really know Darwin, so I can't say how easy/hard it'd be to get it to run there.

@Ma27 Ma27 requested a review from ctheune April 26, 2025 09:47
Copy link
Member

@RaitoBezarius RaitoBezarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea for Darwin, but for Linux: LGTM.

…door

Co-authored-by: Ryan Lahfa <masterancpp@gmail.com>
@tfc
Copy link
Contributor

tfc commented Apr 26, 2025

@Ma27 when activated, it would be great if users don't need to find out which VM to reach via which magic vsock number. can you add a print message please that is something like vsock debug interface: Connect to VM <machine name> via "ssh --foo --bar ..." (The driver could print that somewhere)?

@Ma27
Copy link
Member Author

Ma27 commented May 8, 2025

@tfc nice idea!
Implemented as suggested.

@tfc tfc merged commit 8b3baa1 into NixOS:master May 9, 2025
32 checks passed
@Ma27 Ma27 deleted the testdriver-ssh-backdoor-interactive branch May 9, 2025 06:19
Ma27 added a commit to Ma27/nixpkgs that referenced this pull request May 9, 2025
I'm a little annoyed at myself that I only realized this _after_ NixOS#392030
got merged. But I realized that if something else is using AF_VSOCK or
you simply have another interactive test running (e.g. by another user
on a larger builder), starting up VMs in the driver fails with

    qemu-system-x86_64: -device vhost-vsock-pci,guest-cid=3: vhost-vsock: unable to set guest cid: Address already in use

Multi-user setups are broken anyways because you usually don't have
permissions to remove the VM state from another user and thus starting
the driver fails with

    PermissionError: [Errno 13] Permission denied: PosixPath('/tmp/vm-state-machine')

but this is something you can work around at least.

I was considering to generate random offsets, but that's not feasible
given we need to know the numbers at eval time to inject them into the
QEMU args. Also, while we could do this via the test-driver, we should
also probe if the vsock numbers are unused making the code even more
complex for a use-case I consider rather uncommon.

Hence the solution is to do

    sshBackdoor.vsockOffset = 23542;

when encountering conflicts.
Ma27 added a commit to Ma27/nixpkgs that referenced this pull request May 9, 2025
I'm a little annoyed at myself that I only realized this _after_ NixOS#392030
got merged. But I realized that if something else is using AF_VSOCK or
you simply have another interactive test running (e.g. by another user
on a larger builder), starting up VMs in the driver fails with

    qemu-system-x86_64: -device vhost-vsock-pci,guest-cid=3: vhost-vsock: unable to set guest cid: Address already in use

Multi-user setups are broken anyways because you usually don't have
permissions to remove the VM state from another user and thus starting
the driver fails with

    PermissionError: [Errno 13] Permission denied: PosixPath('/tmp/vm-state-machine')

but this is something you can work around at least.

I was considering to generate random offsets, but that's not feasible
given we need to know the numbers at eval time to inject them into the
QEMU args. Also, while we could do this via the test-driver, we should
also probe if the vsock numbers are unused making the code even more
complex for a use-case I consider rather uncommon.

Hence the solution is to do

    sshBackdoor.vsockOffset = 23542;

when encountering conflicts.
@Kranzes
Copy link
Member

Kranzes commented May 9, 2025

Is it possible to set sshBackdoor.enable to true when using driverInteractive? I think that could be nice.

@Ma27
Copy link
Member Author

Ma27 commented May 9, 2025

Is it possible to set sshBackdoor.enable to true when using driverInteractive? I think that could be nice.

That's exactly the point of this, i.e. easier ssh for interactive debugging.

There's also #392117 which allows the same in the sandbox (and pdb), but I expect this to be a little more controversial.

@Kranzes
Copy link
Member

Kranzes commented May 9, 2025

Right, but it's not enabled by default when using driverInteractive, or can I not read?

@Ma27
Copy link
Member Author

Ma27 commented May 9, 2025

Oh you mean by default! No, that's not the case yet.

@r-vdp
Copy link
Contributor

r-vdp commented May 11, 2025

I'm getting the following error on non-interactive tests (which have the new option enabled), after having used the vsock stuff interactively:

> qemu-system-x86_64: -device vhost-vsock-pci,guest-cid=3: vhost-vsock: failed to open vhost device: No such file or directory

It seems that something got messed up on my system after the interactive test run exited.

@Ma27
Copy link
Member Author

Ma27 commented May 11, 2025

I'll take a look tomorrow.

@Ma27
Copy link
Member Author

Ma27 commented May 12, 2025

So, I was a little too tired to spot it last evening, but: the vsock stuff only works when passing /dev/vhost-vsock into the sandbox.
See also #392117 on how to use this within the sandbox.

This is a debugging tool only. Considering that the vsockOffset option added in #405508 is something that's explicitly supposed to be set to a random value when needed and that you usually build your own driver (driverInteractive) for debugging, I'd argue that this is OK to only set on-demand. Or is there a case where it'd make sense to have it always active?

Anyways, I agree though that the error-message is bad and that this is not sufficiently documented. I guess I'll file a follow-up for this soon :)

@r-vdp
Copy link
Contributor

r-vdp commented May 12, 2025

Ah ok, yeah I think that it's OK then, but indeed it's confusing if you're not aware of this, since the error message is not very obvious.

Maybe there's a way to detect that a test has this backdoor enabled and is being run non-interactively, so that we can print a clearer error message to explain what's going on. Or we could ignore this option for non-interactive builds, so that it just gets ignored in that case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 6.topic: testing Tooling for automated testing of packages and modules 8.has: documentation This PR adds or changes documentation 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants