Slurm adaptor getJobStatus fails in the wrong way when SSH connection is lost #668

Closed
jmaassen opened this issue Mar 3, 2020 · 2 comments

@jmaassen
Member

jmaassen commented Mar 3, 2020

When getJobStatus is called on the Slurm adaptor, it executes up to three Slurm commands in an attempt to find the job: squeue, sinfo and sacct. It sends these commands using an interactive job on a subscheduler such as SSH. If a command does not produce a result, it tries the next one, and so on.

However, if the SSH connection is down, the first command produces an exception instead of a result. The Slurm adaptor then prints a debug message (which is ignored by default) and goes on to try the next command, which again produces an exception, and so on.

When all commands are tried and there is no result, a NoSuchJobException is thrown, regardless of whether the slurm commands executed correctly (but without finding the job) or incorrectly.

As a result, client applications such as xenon-flow cannot tell the difference between a job that cannot be found and a complete loss of the underlying SSH connection.

This is incorrect behavior. Instead, the NoSuchJobException should only be thrown if the Slurm commands were executed successfully but the job could not be found. When the commands fail to run, a XenonException should be thrown.
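
A minimal sketch of the behavior proposed above, not the actual adaptor code. The exception classes here are simplified stand-ins for Xenon's own, and `StatusCommand`, `JobStatusSketch` and the decision to treat any failed command as an error are assumptions made for illustration. Each command either returns a status, returns empty (it ran but did not list the job), or throws (it could not be run at all).

```java
import java.util.List;
import java.util.Optional;

class JobStatusSketch {

    /** Simplified stand-in for Xenon's base exception (assumption for this sketch). */
    static class XenonException extends Exception {
        XenonException(String message) { super(message); }
        XenonException(String message, Throwable cause) { super(message, cause); }
    }

    /** Simplified stand-in: the commands ran, but none of them knows the job. */
    static class NoSuchJobException extends XenonException {
        NoSuchJobException(String message) { super(message); }
    }

    /**
     * Hypothetical wrapper around one Slurm query (e.g. squeue or sacct).
     * Empty result: the command ran but did not list the job.
     * Exception: the command could not be run, e.g. the SSH connection is down.
     */
    interface StatusCommand {
        Optional<String> query(String jobId) throws XenonException;
    }

    static String getJobStatus(String jobId, List<StatusCommand> commands) throws XenonException {
        XenonException lastFailure = null;

        for (StatusCommand command : commands) {
            try {
                Optional<String> status = command.query(jobId);
                if (status.isPresent()) {
                    return status.get();
                }
                // Ran successfully without finding the job: try the next command.
            } catch (XenonException e) {
                // Remember the failure instead of silently dropping it.
                System.err.println("WARN: status command failed for job " + jobId + ": " + e.getMessage());
                lastFailure = e;
            }
        }

        if (lastFailure != null) {
            // At least one command could not be run, so "not found" cannot be concluded.
            throw new XenonException("Failed to retrieve status of job " + jobId, lastFailure);
        }
        // Every command ran successfully and none of them listed the job.
        throw new NoSuchJobException("Job " + jobId + " not found");
    }
}
```

With this split, a client can catch NoSuchJobException first to handle a missing job and treat any other XenonException as a failure of the underlying connection or command.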

In addition, perhaps the messages explaining why the commands failed should be logged as warnings instead of at debug level?

@jmaassen
Member Author

jmaassen commented Mar 3, 2020

We should also check whether the other scripting adaptors show the same incorrect behavior.

@jmaassen
Member Author

Fixed in 3.1.0 release
