Merged
166 changes: 83 additions & 83 deletions docs/config_reference.rst
Original file line number Diff line number Diff line change
@@ -57,13 +57,6 @@ It consists of the following properties:
A list of `logging configuration objects <#logging-configuration>`__.


.. py:attribute:: .schedulers

:required: No

A list of `scheduler configuration objects <#scheduler-configuration>`__.


.. py:attribute:: .modes

:required: No
@@ -77,6 +70,12 @@ It consists of the following properties:
A list of `general configuration objects <#general-configuration>`__.


.. warning::
.. versionchanged:: 4.0.0
The ``schedulers`` section is removed.
Scheduler options should be set per partition using the ``sched_options`` attribute.
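
To see what this migration looks like in practice, here is a minimal sketch of a ReFrame 4.0 configuration in which options that previously lived in the removed top-level ``schedulers`` section are set per partition via ``sched_options``. The system and partition names below are purely illustrative, not part of any real site configuration:

```python
# Hypothetical ReFrame 4.0 site configuration: scheduler options formerly in
# the removed top-level `schedulers` section now live inside each partition.
site_configuration = {
    'systems': [
        {
            'name': 'mycluster',           # illustrative system name
            'hostnames': ['mycluster'],
            'partitions': [
                {
                    'name': 'compute',     # illustrative partition name
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'sched_options': {
                        'job_submit_timeout': 120,
                        'use_nodes_option': True
                    }
                }
            ]
        }
    ]
}
```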


System Configuration
--------------------

@@ -205,6 +204,18 @@ System Configuration
This list must have at least one element.


.. js:attribute:: .systems[].sched_options

:required: No
:default: ``{}``

Scheduler options for the local scheduler associated with ReFrame's execution context.
To understand the difference between the various execution contexts, please refer to ":ref:`execution-contexts`".
For the available scheduler options, see :obj:`sched_options` in the partition configuration below.

.. versionadded:: 4.0.0
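
A hedged sketch of a system entry carrying this attribute (names are illustrative); at this level, ``sched_options`` configures the local scheduler tied to ReFrame's execution context rather than any partition scheduler:

```python
# Hypothetical system entry: `sched_options` at the system level applies to
# the local scheduler of ReFrame's execution context, not to the partitions.
system = {
    'name': 'mysystem',               # illustrative
    'hostnames': ['login1'],
    'sched_options': {
        'job_submit_timeout': 30      # seconds allowed for local submission
    },
    'partitions': []                  # partition entries omitted for brevity
}
```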


------------------------------
System Partition Configuration
------------------------------
@@ -294,6 +305,71 @@ System Partition Configuration
the :py:attr:`extra_resources` will be simply ignored in this case and the scheduler backend will interpret the different test fields in the appropriate way.


.. js:attribute:: .systems[].partitions[].sched_options

:required: No
:default: ``{}``

Scheduler-specific options for this partition.
See below for the available options.

.. versionadded:: 4.0.0


.. js:attribute:: .systems[].partitions[].sched_options.ignore_reqnodenotavail

:required: No
:default: ``false``

Ignore the ``ReqNodeNotAvail`` Slurm state.

If a job associated with a test is in pending state with the Slurm reason ``ReqNodeNotAvail`` and a list of unavailable nodes is also specified, ReFrame will check the status of the nodes and, if all of them are indeed down, it will cancel the job.
Sometimes, however, when Slurm's backfill algorithm takes too long to compute, Slurm will set the pending reason to ``ReqNodeNotAvail`` and mark all system nodes as unavailable, causing ReFrame to kill the job.
In such cases, you may set this parameter to ``true`` to avoid this.

This option is relevant for the Slurm backends only.

.. js:attribute:: .systems[].partitions[].sched_options.job_submit_timeout

:required: No
:default: ``60``

Timeout in seconds for the job submission command.

If the timeout is reached, the test issuing that command will be marked as a failure.


.. js:attribute:: .systems[].partitions[].sched_options.resubmit_on_errors

:required: No
:default: ``[]``

If any of the listed errors occur, ReFrame will try to resubmit the job after a few seconds.

For example, you can have ReFrame resubmit a job when the maximum per-user submission limit is reached by setting this field to ``["QOSMaxSubmitJobPerUserLimit"]``.
To handle multiple errors at once, add more error strings to the list.

This option is relevant for the Slurm backends only.

.. versionadded:: 3.4.1

.. warning::
Job submission is a synchronous operation in ReFrame.
If this option is set, ReFrame's execution will block until the error conditions specified in this list are resolved.
No other test will be able to proceed.
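
As a concrete sketch, a partition that retries on the QOS submit limit and also ignores ``ReqNodeNotAvail`` could combine the options described above as follows (partition fields abbreviated; names are illustrative):

```python
# Hypothetical partition fragment combining the Slurm-specific options above.
partition = {
    'name': 'debug',                  # illustrative partition name
    'scheduler': 'slurm',
    'launcher': 'srun',
    'environs': ['builtin'],
    'sched_options': {
        'ignore_reqnodenotavail': True,
        'resubmit_on_errors': ['QOSMaxSubmitJobPerUserLimit']
    }
}
```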


.. js:attribute:: .systems[].partitions[].sched_options.use_nodes_option

:required: No
:default: ``false``

Always emit the ``--nodes`` Slurm option in the preamble of the job script.

This option is relevant for the Slurm backends only.


.. js:attribute:: .systems[].partitions[].launcher

:required: Yes
@@ -1287,82 +1363,6 @@ An example configuration of this handler for performance logging is shown here:
This handler transmits the whole log record, meaning that all the information will be available and indexable at the remote end.


Scheduler Configuration
-----------------------

A scheduler configuration object contains configuration options specific to the scheduler's behavior.


------------------------
Common scheduler options
------------------------


.. js:attribute:: .schedulers[].name

:required: Yes

The name of the scheduler that these options refer to.
It can be any of the supported job scheduler `backends <#.systems[].partitions[].scheduler>`__.


.. js:attribute:: .schedulers[].job_submit_timeout

:required: No
:default: 60

Timeout in seconds for the job submission command.
If timeout is reached, the regression test issuing that command will be marked as a failure.


.. js:attribute:: .schedulers[].target_systems

:required: No
:default: ``["*"]``

A list of systems or system/partitions combinations that this scheduler configuration is valid for.
For a detailed description of this property, you may refer `here <#.environments[].target_systems>`__.

.. js:attribute:: .schedulers[].use_nodes_option

:required: No
:default: ``false``

Always emit the ``--nodes`` Slurm option in the preamble of the job script.
This option is relevant to Slurm backends only.


.. js:attribute:: .schedulers[].ignore_reqnodenotavail

:required: No
:default: ``false``

This option is relevant to the Slurm backends only.

If a job associated to a test is in pending state with the Slurm reason ``ReqNodeNotAvail`` and a list of unavailable nodes is also specified, ReFrame will check the status of the nodes and, if all of them are indeed down, it will cancel the job.
Sometimes, however, when Slurm's backfill algorithm takes too long to compute, Slurm will set the pending reason to ``ReqNodeNotAvail`` and mark all system nodes as unavailable, causing ReFrame to kill the job.
In such cases, you may set this parameter to ``true`` to avoid this.


.. js:attribute:: .schedulers[].resubmit_on_errors

:required: No
:default: ``[]``

This option is relevant to the Slurm backends only.

If any of the listed errors occur, ReFrame will try to resubmit the job after some seconds.
As an example, you could have ReFrame trying to resubmit a job in case that the maximum submission limit per user is reached by setting this field to ``["QOSMaxSubmitJobPerUserLimit"]``.
You can ignore multiple errors at the same time if you add more error strings in the list.

.. versionadded:: 3.4.1

.. warning::
Job submission is a synchronous operation in ReFrame.
If this option is set, ReFrame's execution will block until the error conditions specified in this list are resolved.
No other test would be able to proceed.


Execution Mode Configuration
----------------------------

8 changes: 3 additions & 5 deletions docs/configure.rst
@@ -216,12 +216,10 @@ However, there are several options that can go into this section, but the reader
Other configuration options
---------------------------

There are finally two more optional configuration sections that are not discussed here:

1. The ``schedulers`` section holds configuration variables specific to the different scheduler backends and
2. the ``modes`` section defines different execution modes for the framework.
Execution modes are discussed in the :doc:`pipeline` page.
There is finally one additional optional configuration section that is not discussed here:

The ``modes`` section defines different execution modes for the framework.
Execution modes are discussed in the :doc:`pipeline` page.


Building the Final Configuration
2 changes: 2 additions & 0 deletions docs/pipeline.rst
@@ -161,6 +161,8 @@ There are a number of things to notice in this diagram:
The ``compile`` stage is now also executed asynchronously.


.. _execution-contexts:

--------------------------------------
Where is each pipeline stage executed?
--------------------------------------
35 changes: 34 additions & 1 deletion reframe/core/schedulers/__init__.py
@@ -25,12 +25,45 @@ class JobMeta(RegressionTestMeta, abc.ABCMeta):
'''Job metaclass.'''


class JobScheduler(abc.ABC):
class JobSchedulerMeta(abc.ABCMeta):
'''Metaclass for JobSchedulers.

The purpose of this metaclass is to intercept the constructor call and
consume the `part_name` argument for setting up the configuration prefix
without requiring the users to call `super().__init__()` in their
constructors. This allows the base class to have the look and feel of a
pure interface.

:meta private:

'''
def __call__(cls, *args, **kwargs):
part_name = kwargs.pop('part_name', None)
obj = cls.__new__(cls, *args, **kwargs)
if part_name:
obj._config_prefix = (
f'systems/0/partitions/@{part_name}/sched_options'
)
else:
obj._config_prefix = 'systems/0/sched_options'

obj.__init__(*args, **kwargs)
return obj


class JobScheduler(abc.ABC, metaclass=JobSchedulerMeta):
'''Abstract base class for job scheduler backends.

:meta private:
'''

def get_option(self, name):
'''Get scheduler-specific option.

:meta private:
'''
return runtime.runtime().get_option(f'{self._config_prefix}/{name}')

@abc.abstractmethod
def make_job(self, *args, **kwargs):
'''Create a new job to be managed by this scheduler.
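The constructor interception performed by the metaclass above can be illustrated with a standalone sketch. The class names here are invented for illustration and are not part of ReFrame's API; the point is that the metaclass's ``__call__`` consumes a keyword argument and sets an attribute before ``__init__`` runs, so subclasses never need to call ``super().__init__()``:

```python
import abc


class InterceptingMeta(abc.ABCMeta):
    '''Pop a keyword argument in __call__, mirroring the JobSchedulerMeta
    idea above (names here are illustrative, not ReFrame's API).'''

    def __call__(cls, *args, **kwargs):
        part_name = kwargs.pop('part_name', None)
        obj = cls.__new__(cls, *args, **kwargs)
        # The attribute is set before __init__ runs, transparently to
        # subclasses, which keep the look and feel of a pure interface.
        obj._config_prefix = (
            f'systems/0/partitions/@{part_name}/sched_options'
            if part_name else 'systems/0/sched_options'
        )
        obj.__init__(*args, **kwargs)
        return obj


class Scheduler(abc.ABC, metaclass=InterceptingMeta):
    pass


class LocalScheduler(Scheduler):
    def __init__(self):
        self.name = 'local'   # note: no super().__init__() needed


s = LocalScheduler(part_name='gpu')
print(s._config_prefix)   # -> systems/0/partitions/@gpu/sched_options
```

Because the ``part_name`` keyword is popped before ``__init__`` is invoked, subclass constructors never see it and need no special handling.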
5 changes: 1 addition & 4 deletions reframe/core/schedulers/flux.py
@@ -14,7 +14,6 @@
import os
import time

import reframe.core.runtime as rt
from reframe.core.backends import register_scheduler
from reframe.core.exceptions import JobError
from reframe.core.schedulers import JobScheduler, Job
@@ -65,9 +64,7 @@ def completed(self):
class FluxJobScheduler(JobScheduler):
def __init__(self):
self._fexecutor = flux.job.FluxExecutor()
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._submit_timeout = self.get_option('job_submit_timeout')

def emit_preamble(self, job):
# We don't need to submit with a file, so we don't need a preamble.
5 changes: 1 addition & 4 deletions reframe/core/schedulers/lsf.py
@@ -14,7 +14,6 @@
import re
import time

import reframe.core.runtime as rt
import reframe.utility.osext as osext
from reframe.core.backends import register_scheduler
from reframe.core.exceptions import JobSchedulerError
@@ -27,9 +26,7 @@
class LsfJobScheduler(PbsJobScheduler):
def __init__(self):
self._prefix = '#BSUB'
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._submit_timeout = self.get_option('job_submit_timeout')

def _format_option(self, var, option):
if var is not None:
5 changes: 1 addition & 4 deletions reframe/core/schedulers/oar.py
@@ -14,7 +14,6 @@
import re
import time

import reframe.core.runtime as rt
import reframe.utility.osext as osext
from reframe.core.backends import register_scheduler
from reframe.core.exceptions import JobError, JobSchedulerError
@@ -60,9 +59,7 @@ def oar_state_pending(state):
class OarJobScheduler(PbsJobScheduler):
def __init__(self):
self._prefix = '#OAR'
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._submit_timeout = self.get_option('job_submit_timeout')

def emit_preamble(self, job):
# host is de-facto nodes and core is number of cores requested per node
5 changes: 1 addition & 4 deletions reframe/core/schedulers/pbs.py
@@ -15,7 +15,6 @@
import re
import time

import reframe.core.runtime as rt
import reframe.core.schedulers as sched
import reframe.utility.osext as osext
from reframe.core.backends import register_scheduler
@@ -76,9 +75,7 @@ class PbsJobScheduler(sched.JobScheduler):

def __init__(self):
self._prefix = '#PBS'
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._submit_timeout = self.get_option('job_submit_timeout')

def _emit_lselect_option(self, job):
num_tasks_per_node = job.num_tasks_per_node or 1
5 changes: 1 addition & 4 deletions reframe/core/schedulers/sge.py
@@ -14,7 +14,6 @@
import time
import xml.etree.ElementTree as ET

import reframe.core.runtime as rt
import reframe.utility.osext as osext
from reframe.core.backends import register_scheduler
from reframe.core.exceptions import JobSchedulerError
@@ -28,9 +27,7 @@
class SgeJobScheduler(PbsJobScheduler):
def __init__(self):
self._prefix = '#$'
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._submit_timeout = self.get_option('job_submit_timeout')

def emit_preamble(self, job):
preamble = [
16 changes: 4 additions & 12 deletions reframe/core/schedulers/slurm.py
@@ -132,22 +132,14 @@ def __init__(self):
'QOSJobLimit',
'QOSResourceLimit',
'QOSUsageThreshold']
ignore_reqnodenotavail = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/ignore_reqnodenotavail'
)
ignore_reqnodenotavail = self.get_option('ignore_reqnodenotavail')
if not ignore_reqnodenotavail:
self._cancel_reasons.append('ReqNodeNotAvail')

self._update_state_count = 0
self._submit_timeout = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/job_submit_timeout'
)
self._use_nodes_opt = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/use_nodes_option'
)
self._resubmit_on_errors = rt.runtime().get_option(
f'schedulers/@{self.registered_name}/resubmit_on_errors'
)
self._submit_timeout = self.get_option('job_submit_timeout')
self._use_nodes_opt = self.get_option('use_nodes_option')
self._resubmit_on_errors = self.get_option('resubmit_on_errors')

def make_job(self, *args, **kwargs):
return _SlurmJob(*args, **kwargs)