Description
At NERSC we (unfortunately) have a non-negligible pool of nodes awaiting hardware servicing (aka "debug nodes"). We put these nodes into a special debug reservation tagged with the MAINTENANCE flag so that our on-site staff can investigate issues while user jobs are kept away from them. During system maintenance windows we use ReFrame to run full-system tests in a reservation that contains all the nodes, including the debug nodes, and we rely on the MAINTENANCE flag to keep jobs off the debug nodes awaiting service. However, we run into a problem because these nodes have a state of IDLE+MAINTENANCE+RESERVED. This isn't working in 4.3.2 because the is_down function looks for MAINT instead of MAINTENANCE, but we can make a local patch for that.
I came to file a bug, but first wanted to check whether this was fixed in more recent code. Looking at the current head of the code (https://github.com/reframe-hpc/reframe/blob/8e9036fa2f2ccbf56e4c608c5104de390f7bf131/reframe/core/schedulers/slurm.py#L662C5-L670C35), it looks like our debug nodes would still be treated as viable when they aren't: because the states are split on the "+" symbol, they would match IDLE.
def in_statex(self, state):
    return self._states == set(state.upper().split('+'))

def is_avail(self):
    return any(self.in_statex(s)
               for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

def is_down(self):
    return not self.is_avail()
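One way to check the concern is to run the quoted logic in isolation. The sketch below is not ReFrame code: `NodeSketch` is a hypothetical stand-in, and it assumes `_states` holds the set obtained by splitting the Slurm state string on "+". `has_states` is likewise a hypothetical subset-match helper, included only for contrast. Under those assumptions, the exact comparison in `in_statex` would not treat an IDLE+MAINTENANCE+RESERVED node as IDLE, whereas a subset comparison would.

```python
class NodeSketch:
    """Hypothetical stand-in for the scheduler's node object (not ReFrame code)."""

    def __init__(self, state_string):
        # Assumption: node states are stored as a set, split on '+'
        self._states = set(state_string.upper().split('+'))

    def in_statex(self, state):
        # Exact match, copied from the quoted snippet
        return self._states == set(state.upper().split('+'))

    def has_states(self, state):
        # Hypothetical subset match, shown only for contrast
        return self._states >= set(state.upper().split('+'))

    def is_avail(self):
        return any(self.in_statex(s)
                   for s in ('ALLOCATED', 'COMPLETING', 'IDLE'))

    def is_down(self):
        return not self.is_avail()


debug_node = NodeSketch('IDLE+MAINTENANCE+RESERVED')
idle_node = NodeSketch('IDLE')

print(debug_node.is_avail())          # exact match against IDLE alone
print(debug_node.has_states('IDLE'))  # subset match against IDLE
print(idle_node.is_avail())
```

If the real node object populates `_states` differently (for example, keeping suffix characters that Slurm appends to state names), the result could differ, so this only checks the snippet as quoted.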
Am I missing something? Is there a way to exclude nodes tagged with MAINTENANCE when doing runs with num_tasks = 0?