
Job that gets cancelled prematurely by ReFrame #2411

Description

@casparvl

I think I found a bug in the SLURM scheduler implementation of ReFrame. The situation is the following: I'm writing a test that runs on a specific node, i.e. one that passes --nodelist=<somenode> to SLURM. I have a couple of variants of this test. Now, if you submit more than two such jobs to the same node using --nodelist, you see something like this in the queue:

             JOBID            PARTITION                 NAME     USER    STATE     TIME TIME_LIMI  NODES   PRIORITY           START_TIME                                   NODELIST(REASON)
           8724936         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689130  2022-02-01T18:32:40          (ReqNodeNotAvail, UnavailableNodes:r34n5)
           8724937         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689129  2022-02-01T18:32:40   (ReqNodeNotAvail, May be reserved for other job)
           8724938         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689129  2022-02-01T18:32:40   (ReqNodeNotAvail, May be reserved for other job)
           8724935         gpu_titanrtx                sleep  <myuser> RUNNING     1:22      5:00      1     689100  2022-02-01T18:27:40                                              r34n5

I.e. one job is running, the next eligible job gets REASON ReqNodeNotAvail, UnavailableNodes:r34n5, and any job after that gets REASON ReqNodeNotAvail, May be reserved for other job.
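For context, the tests simply pin the job to one node by passing an extra option to the scheduler. A minimal sketch of what such a test could look like (the class name, node name and trivial sanity check are placeholders, not my actual test; it assumes a reasonably recent ReFrame where run_before and sanity_function are available as builtins in the class body):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NodePinnedSleepTest(rfm.RunOnlyRegressionTest):
    '''Toy stand-in for the real test: a sleep job pinned to one node.'''
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'sleep'
    executable_opts = ['120']

    @run_before('run')
    def pin_to_node(self):
        # Extra options are forwarded to the generated job script,
        # so this ends up as an extra '#SBATCH --nodelist=r34n5' directive
        self.job.options = ['--nodelist=r34n5']

    @sanity_function
    def assert_finished(self):
        # The real test checks actual output; this is just a placeholder
        return sn.assert_true(True)
```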

The issue is that ReFrame will cancel all jobs that get the ReqNodeNotAvail, May be reserved for other job reason, because they end up in this part of the logic. I.e. I see a test failure with:

 * Reason: job blocked error: [jobid=8724916] job cancelled because it was blocked due to a perhaps non-recoverable reason: ReqNodeNotAvail,  May be reserved for other job

In my opinion, this test should not fail. The test is fully eligible, the node is not down, and if the job remains in the queue long enough it will get scheduled once the other jobs submitted to that specific node finish.

My suggested fix would be to catch this REASON in a similar way to how ReqNodeNotAvail, UnavailableNodes:r34n5 is caught here, and check whether the node is offline. If it is not, we can assume the job will be scheduled at some point, and it should not be cancelled by the ReFrame framework.
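Roughly the kind of check I have in mind (purely illustrative, not the actual scheduler backend code; the function names and the sinfo-based node-state check are just one way to do it):

```python
import subprocess


def nodes_are_down(nodelist):
    '''Return True only if every requested node is in a down/drained/failed
    state according to sinfo (illustrative helper, not ReFrame code).'''
    out = subprocess.run(
        ['sinfo', '--noheader', '-n', nodelist, '-o', '%T'],
        capture_output=True, text=True, check=True
    ).stdout
    # Strip state-suffix flags such as '*' (not responding) or '~' (power save)
    states = {s.strip('*~#!%$@+') for s in out.split()}
    bad = {'down', 'drained', 'draining', 'fail', 'failing', 'maint', 'unknown'}
    return bool(states) and states <= bad


def should_cancel_blocked_job(reason, nodelist):
    '''Decide whether a pending job with this squeue REASON is truly stuck.'''
    if reason.startswith('ReqNodeNotAvail'):
        # 'May be reserved for other job' just means another job currently
        # holds the requested node; only give up if the node is really offline.
        return nodes_are_down(nodelist)

    # ...keep treating the other known non-recoverable reasons as before...
    return True
```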

I can probably come up with a PR for this if that's helpful, but I probably won't have time in the coming days. The fix should be pretty quick and straightforward though...
