ReqNodeNotAvail reason code causes ReFrame to incorrectly kill jobs on Cori at NERSC #1049

@bcfriesen

Description

The Slurm scheduler on the Cori system at NERSC takes a long time to compute backfill, up to 15 minutes per backfill cycle, because of the huge number of enqueued jobs (several tens of thousands). During this process, Slurm often temporarily changes the reason code of an enqueued job to ReqNodeNotAvail, UnavailableNodes:nid[...], where the list of nodes is essentially the entire system, around 11,000 nodes. Although ReFrame attempts to check whether the listed nodes are healthy (cf. #303 and the resulting PR), this check seems to fail on Cori while the backfill cycle is running. I don't understand why it fails, but I wonder whether Slurm refuses to service the scontrol show node command with such a large list of nodes.
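To give a sense of scale: the UnavailableNodes field is a compact Slurm node-list expression, and on Cori it expands to roughly the whole machine. A minimal sketch of parsing such a reason string follows; the node names and the exact range syntax handled here are illustrative assumptions (a single bracketed group of comma-separated ranges), not the full Slurm hostlist grammar.

```python
import re

def expand_nodelist(nodelist):
    """Expand a compact Slurm node list such as 'nid[00001-00003,00010]'
    into individual node names. Sketch only: handles one bracketed group
    of comma-separated numeric ranges, not nested or multi-group lists."""
    m = re.fullmatch(r'(\w+)\[([\d,\-]+)\]', nodelist)
    if not m:
        return [nodelist]  # bare node name, nothing to expand
    prefix, spec = m.groups()
    nodes = []
    for part in spec.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            width = len(lo)  # preserve zero-padding of the node numbers
            nodes.extend(f'{prefix}{i:0{width}d}'
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

def parse_reason(reason):
    """Split a Slurm reason string like
    'ReqNodeNotAvail, UnavailableNodes:nid[00001-00003]' into the
    reason code and the expanded list of unavailable nodes."""
    code, _, rest = reason.partition(',')
    m = re.search(r'UnavailableNodes:(\S+)', rest)
    nodes = expand_nodelist(m.group(1)) if m else []
    return code.strip(), nodes
```

With a reason string covering the whole system, the expanded list has on the order of 11,000 entries, which is what the subsequent node-health check would have to query.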

I have worked around this problem by simply removing ReqNodeNotAvail from the _cancel_reasons list in reframe/core/schedulers/slurm.py in NERSC's fork of ReFrame, but this is probably not the best long-term solution. If it is true that Slurm is refusing to service the large scontrol command, can you think of a way to work around this problem?
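If the failure really does come from the size of a single scontrol invocation, one possible direction (untested, a sketch only) would be to batch the health check instead of issuing one enormous command. The batch size of 500 below is an arbitrary assumption, not a known Slurm limit:

```python
def chunked_node_queries(nodes, max_nodes_per_call=500):
    """Yield 'scontrol show node' command lines, each querying at most
    max_nodes_per_call nodes, rather than one command naming ~11,000.
    Hypothetical workaround: smaller invocations may be serviced even
    while a backfill cycle is running."""
    for i in range(0, len(nodes), max_nodes_per_call):
        batch = nodes[i:i + max_nodes_per_call]
        yield ['scontrol', 'show', 'node', ','.join(batch)]
```

Each yielded list could then be passed to the same command-runner ReFrame already uses, aggregating the per-batch output before deciding whether the job should be cancelled.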
