Improve robustness of SSH connection to SUT #2890

Open
The-Mule opened this issue Apr 25, 2024 · 3 comments
Comments

@The-Mule
Contributor

tmt uses an SSH session to watch the SUT (the guest machine running the test). When anything happens to this session, the run is aborted completely. It should be acceptable for a test to break the connection temporarily (e.g. a test might affect networking for a short time, or might mess with libraries that ssh relies on). I would like to propose:

  1. Make the SSH session robust enough to handle these situations: when the session breaks, retry the connection and resume the test (it is very likely that all disruptive test actions will be completed and reverted by that point).

  2. In the rare situations when the SSH session cannot be resumed even after several attempts, don't abort the run. Attempt to salvage the results preceding the problem and run the remaining tests again from scratch. Or perhaps disable the problematic test and create a new run without it.
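A minimal sketch of the retry idea in item 1, assuming a hypothetical `connect` callable that raises `ConnectionError` while the guest is unreachable. This is not tmt's actual API, and the attempt count, delay and backoff values are arbitrary placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call ``connect`` until it succeeds or ``attempts`` is exhausted.

    ``connect`` is a hypothetical stand-in for whatever opens the SSH
    session; it is expected to raise ConnectionError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                # Out of attempts: let the caller decide what to do next.
                raise
            sleep(delay)
            delay *= backoff  # wait longer before each further attempt
    raise AssertionError("unreachable")
```

Reconnecting this way only restores the channel to the guest. Whether the interrupted test can continue afterwards is a separate question.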

@happz
Collaborator

happz commented Apr 25, 2024

> tmt uses an SSH session to watch the SUT (the guest machine running the test). When anything happens to this session, the run is aborted completely. It should be acceptable for a test to break the connection temporarily (e.g. a test might affect networking for a short time, or might mess with libraries that ssh relies on). I would like to propose:
>
>   1. Make the SSH session robust enough to handle these situations: when the session breaks, retry the connection and resume the test

Please check #2696, which seems to be related to your situation. If it does not help, it would be very useful if you could share with us what would need to be included to cover your case.

> (it is very likely that all disruptive test actions will be completed and reverted at that point).

Yeah, in your case, when you and your test are causing the changes on purpose. But telling the difference between "expected" and "the lab in the US is burning down" is the hard part, so I would very much dispute the "likely" bit :)

>   2. In rare situations when ssh session cannot be resumed even after several attempts don't abort the run. Attempt to salvage the run results preceding the problem and run the remaining ones from scratch again. Or perhaps disable the problematic test and create a new run without it.

Re-running "from scratch" would be a possible solution, with or without dropping the test. That would mean restarting the plan, including provisioning a brand new guest to avoid running tests in a tainted environment. This is well beyond what tmt can do now, though.
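The salvage-and-rerun idea could be sketched roughly as follows. The function name `plan_rerun` and the `results` mapping are hypothetical and do not reflect how tmt stores results. The sketch only shows how a test list might be split into results worth keeping and tests to run again on a fresh guest.

```python
from typing import Dict, List, Optional, Tuple


def plan_rerun(
    tests: List[str],
    results: Dict[str, str],
    interrupted: Optional[str] = None,
    drop_interrupted: bool = False,
) -> Tuple[Dict[str, str], List[str]]:
    """Split ``tests`` into salvaged results and tests still to be run.

    ``results`` maps tests that finished before the connection was lost
    to their outcomes. ``interrupted`` names the test that was running
    when the session died; its result, if any, is never salvaged.
    """
    salvaged = {
        name: results[name]
        for name in tests
        if name in results and name != interrupted
    }
    remaining = [
        name
        for name in tests
        if name not in salvaged
        and not (drop_interrupted and name == interrupted)
    ]
    return salvaged, remaining
```

With `drop_interrupted=True` the problematic test is left out of the new run. Otherwise it is scheduled again together with everything that had not run yet.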

@The-Mule
Contributor Author

The-Mule commented May 6, 2024

IMO #2696 solves (2) for me.

Ad (1): I am starting to realize that this might not be possible. What I am looking for is a way to "resurrect" a closed SSH session. By design, SSH cannot do this unless some other tool underneath the SSH connection handles it (e.g. something like screen or tmux), so that once the connection is closed and SSH reconnects, the session can continue where it left off. To be more specific, I have the following plan:

❯ cat crasher.fmf
prepare:
    - name: Enable FIPS mode
      how: ansible
      order: 99
      playbook:
        - /playbooks/enable-fips.yaml

discover:
    - how: shell
      tests:
        - name: crasher
          test: |
            set -x
            # Backup.
            cp /usr/lib64/ossl-modules/fips.so .

            # Configure (break openssl).
            dd if=/dev/zero of=fips_hmac bs=8 count=1 conv=notrunc
            objcopy --update-section .rodata1=fips_hmac fips.so fips_bad_hmac.so
            cp fips_bad_hmac.so /usr/lib64/ossl-modules/fips.so

            # Trigger.
            openssl dgst -sha256 <<<'some text'

            # Restore.
            cp fips.so /usr/lib64/ossl-modules/fips.so
        - name: second test
          test: |
            openssl dgst -sha256 <<<'some text'

execute:
    - how: tmt

It breaks the openssl library, which causes the SSH connection to drop and the run to be aborted. In theory, ssh is able to reconnect even while the openssl library is still broken. But tmt won't do that because, if I understand it correctly, it would not be able to resume the test anyway; it just does not work that way. So (1) is basically not possible unless the test can resume itself (then it would work thanks to #2696).

So all in all, it seems that the aforementioned plan is simply tmt-incompatible.
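For illustration only, the "some other tool underneath" idea (screen or tmux) could look roughly like the sketch below. It builds the commands that would start a test script in a detached tmux session and later poll for its exit code after reconnecting. This is not how tmt runs tests; the helper names, the file names `output.log` and `exit-code`, and the host and paths in the usage lines are all assumptions.

```python
import shlex
from typing import List


def detached_start_command(session: str, script: str, workdir: str) -> List[str]:
    """Build the remote argv that starts ``script`` in a detached tmux session.

    Output and exit code are written to files in ``workdir`` so they
    survive a dropped SSH connection.
    """
    log = shlex.quote(f"{workdir}/output.log")
    code = shlex.quote(f"{workdir}/exit-code")
    inner = (
        f"cd {shlex.quote(workdir)} && sh {shlex.quote(script)} > {log} 2>&1; "
        f"echo $? > {code}"
    )
    return ["tmux", "new-session", "-d", "-s", session, inner]


def poll_command(workdir: str) -> List[str]:
    """Build the remote argv that prints the exit code once it exists."""
    return ["cat", f"{workdir}/exit-code"]


def over_ssh(host: str, remote: List[str]) -> List[str]:
    """Wrap a remote argv into a local ``ssh`` invocation."""
    return ["ssh", host, shlex.join(remote)]


# Hypothetical usage: start once, then poll after every reconnect.
start = over_ssh("guest", detached_start_command("crasher", "test.sh", "/tmp/run"))
poll = over_ssh("guest", poll_command("/tmp/run"))
```

The test would keep running on the guest while the connection is down, and the caller would only need to reconnect and read the exit code file once it appears.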

@The-Mule
Contributor Author

The-Mule commented May 6, 2024

I would have to modify the 'crasher' test to detect the reboot and attempt to restore the library (and then use the options added in #2696) to get to 'second test'.
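A rough sketch of that recovery logic, written in Python rather than the shell used in the plan. It assumes the leftover backup file itself is the signal that an earlier attempt was interrupted, so it does not rely on any tmt-provided variable. The function name and paths are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_or_recover(library: Path, backup: Path) -> str:
    """Restore ``library`` if an earlier attempt left a backup behind.

    Returns "recovered" when a leftover backup was copied back over the
    library, and "fresh" when a new backup was created for a first attempt.
    """
    if backup.exists():
        # A previous attempt died before its own restore step ran.
        shutil.copy2(backup, library)
        backup.unlink()
        return "recovered"
    shutil.copy2(library, backup)
    return "fresh"
```

On "fresh" the test would go on to break the library, trigger the failure, restore the library and remove the backup. On "recovered" it would skip the disruptive part and finish, so the run can reach 'second test'.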
