Job that gets cancelled prematurely by ReFrame #2411

I think I found a bug in ReFrame's SLURM scheduler implementation. I'm writing a test that runs on a specific node, i.e. it passes `--nodelist=<somenode>` to SLURM, and I have a couple of variants of this test. If you submit _more_ than two jobs to the same node using `--nodelist`, the queue looks something like this:

```
             JOBID            PARTITION                 NAME     USER    STATE     TIME TIME_LIMI  NODES   PRIORITY           START_TIME                                   NODELIST(REASON)
           8724936         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689130  2022-02-01T18:32:40          (ReqNodeNotAvail, UnavailableNodes:r34n5)
           8724937         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689129  2022-02-01T18:32:40   (ReqNodeNotAvail, May be reserved for other job)
           8724938         gpu_titanrtx                sleep  <myuser> PENDING     0:00      5:00      1     689129  2022-02-01T18:32:40   (ReqNodeNotAvail, May be reserved for other job)
           8724935         gpu_titanrtx                sleep  <myuser> RUNNING     1:22      5:00      1     689100  2022-02-01T18:27:40                                              r34n5
```

That is, one job is running, the next eligible one reports `REASON: ReqNodeNotAvail, UnavailableNodes:r34n5`, and every job after that reports `REASON: ReqNodeNotAvail, May be reserved for other job`.

The issue is that ReFrame cancels every job that gets the reason `ReqNodeNotAvail, May be reserved for other job`, because it ends up in [this](https://github.com/eth-cscs/reframe/blob/535d694480ba653f14ea2a8b8713203671014e84/reframe/core/schedulers/slurm.py#L518) part of the logic. As a result, I see a test failure with:
```
 * Reason: job blocked error: [jobid=8724916] job cancelled because it was blocked due to a perhaps non-recoverable reason: ReqNodeNotAvail,  May be reserved for other job
```

In my opinion, this test should not fail. The job is fully eligible, the node is not down, and if the job stays in the queue long enough it will be scheduled once the other jobs submitted to that node finish.

My suggested fix would be to catch this `REASON` the same way `ReqNodeNotAvail, UnavailableNodes:r34n5` is caught [here](https://github.com/eth-cscs/reframe/blob/535d694480ba653f14ea2a8b8713203671014e84/reframe/core/schedulers/slurm.py#L502) and check whether the node is offline. If it is not, we can assume the job will be scheduled _at some point_, and ReFrame should not cancel it.
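To make the idea concrete, here is a rough standalone sketch of the decision I have in mind. It is not ReFrame's actual code: `should_cancel`, `DOWN_STATES`, and the `node_states` mapping are hypothetical names I made up for illustration, the set of states that count as "offline" is my own assumption, and the sketch assumes plain comma-separated node names rather than SLURM's bracketed ranges.

```python
# Hypothetical set of node states treated as "offline" (an assumption,
# not taken from ReFrame or the SLURM documentation).
DOWN_STATES = {'DOWN', 'DRAIN', 'MAINT', 'NO_RESPOND'}

RESERVED_DETAIL = 'May be reserved for other job'


def parse_reason(reason):
    """Split a SLURM reason string into (reason, details)."""
    head, _, details = reason.partition(',')
    return head.strip(), details.strip()


def should_cancel(reason, requested_nodes, node_states):
    """Decide whether a pending job blocked with `reason` should be cancelled.

    requested_nodes: node names passed via --nodelist (hypothetical input).
    node_states: dict mapping a node name to a set of state strings
                 (hypothetical input, e.g. collected from scontrol).
    """
    head, details = parse_reason(reason)
    if head != 'ReqNodeNotAvail':
        # Other reasons are outside the scope of this sketch.
        return False

    if details.startswith('UnavailableNodes:'):
        # Assumes plain comma-separated names, no bracketed ranges.
        names = details[len('UnavailableNodes:'):]
        nodes = [n for n in names.split(',') if n]
    elif details == RESERVED_DETAIL:
        # The reason names no nodes, so check the ones the job asked for.
        nodes = list(requested_nodes)
    else:
        nodes = []

    # Cancel only if we know the nodes and every one of them is offline.
    return bool(nodes) and all(
        node_states.get(n, set()) & DOWN_STATES for n in nodes
    )
```

With something like this, a job pending with `ReqNodeNotAvail, May be reserved for other job` on a node that is merely busy would be left in the queue, and it would only be cancelled if the requested node is really offline.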

I can probably put together a PR for this if that's helpful, but I probably won't have time in the coming days. The fix should be pretty quick and straightforward, though.
