The Slurm scheduler on the Cori system at NERSC takes a long time to compute backfill, up to 15 minutes per backfill cycle, because of the huge number of enqueued jobs (several tens of thousands). During this process, Slurm will often temporarily change the reason code of an enqueued job to ReqNodeNotAvail, UnavailableNodes:nid[...], where the list of nodes is essentially the entire system, around 11,000 nodes. Although ReFrame attempts to check whether the nodes are healthy (cf. #303 and the resulting PR), this check seems to fail on Cori while the backfill cycle is running. I don't understand why it fails, but I wonder whether Slurm refuses to service the scontrol show node command with such a large list of nodes.
I have worked around this problem by simply removing ReqNodeNotAvail from the _cancel_reasons list in reframe/core/schedulers/slurm.py in NERSC's fork of ReFrame, but this is probably not the best long-term solution. If it is true that Slurm refuses to service the large scontrol command, can you think of a way to work around this problem?
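If the problem really is the sheer size of a single scontrol invocation, one conceivable workaround would be to split the node list into smaller batches and issue several scontrol show node commands instead of one. Here is a minimal sketch of the batching logic; the function name, the batch size of 500, and the overall approach are my own suggestions, not anything in ReFrame's current code:

```python
def batched_scontrol_commands(nodes, batch_size=500):
    """Yield `scontrol show node` command lines, each covering at
    most ``batch_size`` nodes from the full node list.

    This is a hypothetical helper, not part of ReFrame's API; the
    idea is that each batched command should be small enough for
    Slurm to service it even during a long backfill cycle.
    """
    for start in range(0, len(nodes), batch_size):
        chunk = nodes[start:start + batch_size]
        yield 'scontrol show node %s' % ','.join(chunk)


# Example: a full ~11,000-node system split into 22 commands of
# 500 nodes each.
nodes = ['nid%05d' % i for i in range(11000)]
commands = list(batched_scontrol_commands(nodes))
```

The scheduler backend could then run each batched command in turn and merge the results, treating a failure of one batch as affecting only those nodes rather than aborting the whole health check.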