ReqNodeNotAvail reason code causes ReFrame to incorrectly kill jobs on Cori at NERSC #1049

@bcfriesen

Description

The Slurm scheduler on the Cori system at NERSC takes a long time to compute backfill, up to 15 minutes per backfill cycle, because of the huge number of enqueued jobs (several tens of thousands). During this process, Slurm often temporarily changes the reason code of an enqueued job to ReqNodeNotAvail, UnavailableNodes:nid[...], where the list of nodes is essentially the entire system, around 11,000 nodes. Although ReFrame attempts to check whether the listed nodes are healthy (cf. #303 and the resulting PR), this check seems to fail on Cori while the backfill cycle is running. I don't understand why it fails, but I wonder whether Slurm refuses to service the scontrol show node command with such a large list of nodes.
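To give a sense of scale: the UnavailableNodes field is a compact Slurm node-list expression, and on Cori it expands to roughly the whole machine. A minimal sketch of parsing such a reason string follows; the node names and the exact range syntax handled here are illustrative assumptions (a single bracketed group of comma-separated ranges), not the full Slurm hostlist grammar.

```python
import re

def expand_nodelist(nodelist):
    """Expand a compact Slurm node list such as 'nid[00001-00003,00010]'
    into individual node names. Sketch only: handles one bracketed group
    of comma-separated numeric ranges, not nested or multi-group lists."""
    m = re.fullmatch(r'(\w+)\[([\d,\-]+)\]', nodelist)
    if not m:
        return [nodelist]  # bare node name, nothing to expand
    prefix, spec = m.groups()
    nodes = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            width = len(lo)  # preserve zero-padding of the node numbers
            nodes.extend(f'{prefix}{i:0{width}d}'
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

def parse_reason(reason):
    """Split a Slurm reason string like
    'ReqNodeNotAvail, UnavailableNodes:nid[00001-00003]' into the
    reason code and the expanded list of unavailable nodes."""
    code, _, rest = reason.partition(',')
    m = re.search(r'UnavailableNodes:(\S+)', rest)
    nodes = expand_nodelist(m.group(1)) if m else []
    return code.strip(), nodes
```

With a reason string covering the whole system, the expanded list has on the order of 11,000 entries, which is what the subsequent node-health check would have to query.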

I have worked around this problem by simply removing ReqNodeNotAvail from the _cancel_reasons list in reframe/core/schedulers/slurm.py in NERSC's fork of ReFrame, but this is probably not the best long-term solution. If it is true that Slurm is refusing to service the large scontrol command, can you think of a way to work around this problem?
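If the failure really does come from the size of a single scontrol invocation, one possible direction (untested, a sketch only) would be to batch the health check instead of issuing one enormous command. The batch size of 500 below is an arbitrary assumption, not a known Slurm limit:

```python
def chunked_node_queries(nodes, max_nodes_per_call=500):
    """Yield 'scontrol show node' command lines, each querying at most
    max_nodes_per_call nodes, rather than one command naming ~11,000.
    Hypothetical workaround: smaller invocations may be serviced even
    while a backfill cycle is running."""
    for i in range(0, len(nodes), max_nodes_per_call):
        batch = nodes[i:i + max_nodes_per_call]
        yield ['scontrol', 'show', 'node', ','.join(batch)]
```

Each yielded list could then be passed to the same command-runner ReFrame already uses, aggregating the per-batch output before deciding whether the job should be cancelled.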
