slurm is_down check doesn't look like it's handling MAINTENANCE mode properly #3141

@lagerhardt

Description

At NERSC we (unfortunately) have a non-negligible pool of nodes awaiting hardware servicing (aka "debug nodes"). We put these nodes into a special debug reservation tagged with the MAINTENANCE flag so that our on-site staff can investigate issues while still keeping user jobs away from them. During system maintenances we use ReFrame to run full-system tests in a reservation that contains all the nodes, including the debug nodes, and we rely on the MAINTENANCE tag to keep jobs off the nodes awaiting service. The problem is that these nodes report a state of IDLE+MAINTENANCE+RESERVED. This isn't working in 4.3.2 because the is_down function was looking for MAINT instead of MAINTENANCE, but we can make a local patch for that. I came to file a bug, but first wanted to check whether this was fixed in more recent code. However, looking at the current head of the code (https://github.com/reframe-hpc/reframe/blob/8e9036fa2f2ccbf56e4c608c5104de390f7bf131/reframe/core/schedulers/slurm.py#L662C5-L670C35),
it looks like our debug nodes would still be counted as viable when they aren't, because the states are split on the "+" symbol and would match IDLE.

    def in_statex(self, state):
        return self._states == set(state.upper().split('+'))

    def is_avail(self):
        return any(self.in_statex(s)
                   for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

    def is_down(self):
        return not self.is_avail()

Am I missing something? Is there a way to exclude nodes tagged with MAINTENANCE when doing runs with num_tasks = 0?
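For illustration, here is the kind of local patch we have in mind: make is_avail treat any node carrying a maintenance flag as unavailable, whatever other states accompany it. SlurmNode below is a simplified stand-in for ReFrame's internal node class (the constructor and class name are ours, not ReFrame's), so this is only a sketch of the idea, not a drop-in fix:

```python
class SlurmNode:
    def __init__(self, state_str):
        # Slurm reports compound states joined with '+',
        # e.g. 'IDLE+MAINTENANCE+RESERVED'
        self._states = set(state_str.upper().split('+'))

    def in_statex(self, state):
        # Exact match against the full state set, as in the linked code
        return self._states == set(state.upper().split('+'))

    def is_avail(self):
        # Exclude nodes with a maintenance flag (either spelling)
        # before checking the usual "available" states
        if self._states & {'MAINT', 'MAINTENANCE'}:
            return False
        return any(self.in_statex(s)
                   for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

    def is_down(self):
        return not self.is_avail()
```

With this, a node in IDLE+MAINTENANCE+RESERVED is reported down, while a plain IDLE node is still available.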

Status: Done