-
Notifications
You must be signed in to change notification settings - Fork 117
Add a knob for controlling the 'sacct'/'squeue' ratio in Slurm 'sacct' backend #289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Reduce the `squeue` rate by introducing the constant `SACCT_SQUEUE_RATIO`. * Fix some warnings regarding regexes.
victorusu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
| 'ReqNodeNotAvail', # Inaccurate SLURM doc | ||
| 'QOSUsageThreshold'] | ||
| self._is_cancelling = False | ||
| self._update_state_count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it state or status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorusu I think 'state' is fine.
reframe/core/schedulers/slurm.py
Outdated
| self._state = SlurmJobState(state_match.group('state')) | ||
| self._cancel_if_blocked() | ||
|
|
||
| if self._update_state_count == SACCT_SQUEUE_RATIO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would simply do a modulo operation here:
if not self._update_state_count % SACCT_SQUEUE_RATIO:
self._cancel_if_blocked()
reframe/core/schedulers/slurm.py
Outdated
| SLURM_JOB_TIMEOUT = SlurmJobState('TIMEOUT') | ||
|
|
||
| # Number of _update_state calls per which _cancel_if_blocked is called | ||
| SACCT_SQUEUE_RATIO = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this constant should be defined inside the SlurmJob class as class attribute, since it is very specific to this implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorusu Should we start with 10 and see if ops are happy?
reframe/core/schedulers/slurm.py
Outdated
| SLURM_JOB_SUSPENDED = SlurmJobState('SUSPENDED') | ||
| SLURM_JOB_TIMEOUT = SlurmJobState('TIMEOUT') | ||
|
|
||
| # Number of _update_state calls per which _cancel_if_blocked is called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this comment very much; it contradicts with the name of the variable. If we use the SACCT_SQUEUE_RATIO as a name, I'd like to see a comment describing the rationale behind this instead, because what this variable controls is obvious from its name.
Reduce the
squeuerate by introducing the constantSACCT_SQUEUE_RATIO.Fix some warnings regarding regexes.
Fixes #283