Description
I think I found a bug in the SLURM scheduler implementation of ReFrame. The situation is the following: I'm writing a test that must run on a specific node, i.e. it passes --nodelist=<somenode> to SLURM, and I have a couple of variants of this test. If you submit more than two jobs to the same node using --nodelist, you see something like this in the queue:
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES PRIORITY START_TIME NODELIST(REASON)
8724936 gpu_titanrtx sleep <myuser> PENDING 0:00 5:00 1 689130 2022-02-01T18:32:40 (ReqNodeNotAvail, UnavailableNodes:r34n5)
8724937 gpu_titanrtx sleep <myuser> PENDING 0:00 5:00 1 689129 2022-02-01T18:32:40 (ReqNodeNotAvail, May be reserved for other job)
8724938 gpu_titanrtx sleep <myuser> PENDING 0:00 5:00 1 689129 2022-02-01T18:32:40 (ReqNodeNotAvail, May be reserved for other job)
8724935 gpu_titanrtx sleep <myuser> RUNNING 1:22 5:00 1 689100 2022-02-01T18:27:40 r34n5
I.e. one job is running, the next eligible job reports REASON: ReqNodeNotAvail, UnavailableNodes:r34n5, and every job after that reports REASON: ReqNodeNotAvail, May be reserved for other job.
The issue is that ReFrame cancels every job that reports ReqNodeNotAvail, May be reserved for other job, because it ends up in this part of the logic. As a result, I see a test failure with:
* Reason: job blocked error: [jobid=8724916] job cancelled because it was blocked due to a perhaps non-recoverable reason: ReqNodeNotAvail, May be reserved for other job
In my opinion, this test should not fail. The job is fully eligible, the node is not down, and if the job remains in the queue long enough, it will be scheduled once the other jobs submitted to that node finish.
My suggested fix would be to handle this REASON the same way ReqNodeNotAvail, UnavailableNodes:r34n5 is handled here: check whether the requested node is offline. If it is not, we can assume the job will be scheduled at some point, and ReFrame should not cancel it.
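To make the suggestion concrete, here is a rough, standalone sketch of the check I have in mind. It is not ReFrame's actual code: the function name `can_still_be_scheduled`, its parameters, and the `is_node_down` callback are all hypothetical and only illustrate the idea. It also assumes node names arrive as a plain comma-separated list, so bracketed ranges such as `r34n[5-6]` would need to be expanded first.

```python
import re
from typing import Callable, Iterable

# Hypothetical names throughout; this only illustrates the proposed logic.
_UNAVAILABLE_RE = re.compile(r'UnavailableNodes:(?P<nodes>\S+)')
_RESERVED_DETAIL = 'May be reserved for other job'


def can_still_be_scheduled(reason_field: str,
                           requested_nodes: Iterable[str],
                           is_node_down: Callable[[str], bool]) -> bool:
    """Return True if a job blocked with ReqNodeNotAvail should keep waiting.

    reason_field    -- the NODELIST(REASON) text, with or without parentheses
    requested_nodes -- the nodes the job asked for via --nodelist
    is_node_down    -- callback reporting whether a given node is down
    """
    text = reason_field.strip().strip('()')
    reason, _, details = text.partition(',')
    if reason.strip() != 'ReqNodeNotAvail':
        # Other blocking reasons are out of scope for this sketch.
        return False

    details = details.strip()
    match = _UNAVAILABLE_RE.match(details)
    if match:
        # SLURM names the unavailable nodes explicitly.
        # Assumption: a plain comma-separated list, no bracket ranges.
        nodes = match.group('nodes').split(',')
    elif details == _RESERVED_DETAIL:
        # No nodes are named, so fall back to the requested node list.
        nodes = list(requested_nodes)
    else:
        return False

    if not nodes:
        return False

    # Keep the job only if none of the relevant nodes is down.
    return not any(is_node_down(n) for n in nodes)


if __name__ == '__main__':
    down_nodes = {'r34n7'}  # made-up example state
    print(can_still_be_scheduled(
        '(ReqNodeNotAvail, May be reserved for other job)',
        ['r34n5'],
        lambda n: n in down_nodes,
    ))
```

With something like this, the scheduler would cancel the job only when the function returns False, so both variants of ReqNodeNotAvail would be treated the same way as long as the requested node is up.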
I can probably come up with a PR for this if that's helpful, but I probably won't have time in the coming days. The fix should be pretty quick and straightforward, though.