PBS scheduler back-end support #277
Conversation
Can I test this patch?

@jenkins-cscs retry all
vkarak left a comment
Thanks a lot Rafa for your PR. I was quite busy in the previous weeks, so I couldn't review it earlier.
I have some comments that could improve the implementation a bit. Another thing is to add some unit tests. This isn't difficult for the scheduler backends, because we already have an infrastructure in place. You may have a look at TestSlurmJob in unittests/test_schedulers.py. Since these unit tests are configuration specific, in order to trigger them you will have to use your site configuration file that uses PBS:

    RFM_CONFIG_FILE=your-site-config.py ./test_reframe.py unittests/test_schedulers.py
    import os
    import re
    import time
    from datetime import datetime
This import is not needed.
    import reframe.core.schedulers as sched
    import reframe.utility.os_ext as os_ext
    from reframe.core.exceptions import (SpawnedProcessError,
                                         JobBlockedError, JobError)
You don't need the JobBlockedError here. This is thrown by the SLURM backend when a job is indefinitely blocked due to reasons that require sysadmin intervention to be resolved.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._prefix = '#PBS'
        self._is_cancelling = False
You don't need this in your implementation. We use this in the SLURM backend to support the automatic cancellation of the job if it's blocked in an unrecoverable state.
    # fix for regression tests with compile
    if os.path.dirname(self.command) is '.':
        self._command = os.path.join(
            self.workdir, os.path.basename(self.command))
The best way is to emit a cd workdir at the beginning of your job script, right after the preamble. Right now you are only fixing the case where the script contains just the executable, but what if it also defines commands in pre_run and post_run? The user expects them to run inside the stage directory of the test.
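To illustrate the reviewer's suggestion, here is a minimal, hypothetical sketch (the build_script helper and the directive strings are made up for illustration; ReFrame's real script-builder API differs):

```python
def build_script(workdir, preamble, commands):
    # Hypothetical sketch: emit 'cd <workdir>' right after the PBS
    # preamble, so that *all* commands (pre_run, the executable and
    # post_run) run inside the stage directory of the test.
    lines = list(preamble)
    lines.append('cd %s' % workdir)
    lines.extend(commands)
    return '\n'.join(lines)

script = build_script('/scratch/stage/mytest',
                      ['#!/bin/bash', '#PBS -N "mytest"'],
                      ['./pre_run.sh', './exe', './post_run.sh'])
print(script)
```

This way no command needs to be rewritten with an absolute path; they all inherit the working directory from the single cd.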
    self._emit_job_option(self.name, '-N "{0}"', builder)

    extra_options = ''
    if len(self.options):
You can simply check with if self.options:.
    extra_options = ':' + ':'.join(self.options)

    self._emit_job_option((int(self._num_tasks/self._num_tasks_per_node), self._num_tasks_per_node, self._num_tasks_per_node, extra_options),
                          '-lselect={0}:ncpus={1}:mpiprocs={2}{3}', builder)
A couple of comments here:
- This line is too long. Could you wrap it at 79 columns?
- There is no need to use int() to take the integer part of the division. You may simply use // instead of /.
- You need not pass self._num_tasks_per_node twice. You could simply repeat the corresponding placeholder in your format string: '-lselect={0}:ncpus={1}:mpiprocs={1}{2}'
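Putting the reviewer's three points together, the directive could be rendered like this (a standalone sketch; pbs_select_directive is a hypothetical helper, not part of ReFrame):

```python
def pbs_select_directive(num_tasks, num_tasks_per_node, options):
    # Hypothetical standalone rendering of the suggested fix:
    # '//' for integer division, and a repeated '{1}' placeholder
    # instead of passing num_tasks_per_node twice.
    extra = ':' + ':'.join(options) if options else ''
    num_nodes = num_tasks // num_tasks_per_node
    return '#PBS -lselect={0}:ncpus={1}:mpiprocs={1}{2}'.format(
        num_nodes, num_tasks_per_node, extra)

print(pbs_select_directive(1024, 16, []))
# → #PBS -lselect=64:ncpus=16:mpiprocs=16
```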
    cmd = 'qsub %s' % self.script_filename
    completed = self._run_command(cmd, settings.job_submit_timeout)
    jobid_match = re.search('^(?P<jobid>\d+)',
                            completed.stdout)
I think this can fit nicely in a single line.
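For instance, the search fits on one line (using a raw string for the regex is also good practice; the stdout value below is a hypothetical example of qsub output, which typically has the form '<jobid>.<server>'):

```python
import re

stdout = '123456.pbsserver\n'  # hypothetical qsub output
jobid_match = re.search(r'^(?P<jobid>\d+)', stdout)
print(jobid_match.group('jobid'))  # → 123456
```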
    super().cancel()
    getlogger().debug('cancelling job (id=%s)' % self._jobid)
    self._is_cancelling = True
    jobid, server = self._fulljobid.split(".")
We use single quotes for strings.
        self._run_command('qdel %s@pbs11' % self._fulljobid,
                          settings.job_submit_timeout)
    else:
        raise JobError('Did not recognize server', server)
The second argument is not going to be printed. You may first format the message and then pass it to JobError. I'd also suggest using "could not" instead of "did not".
    if os.path.isfile(self.stdout):
        return True
    else:
        return False
You may simply write return os.path.isfile(self.stdout) here.
                          settings.job_submit_timeout)
    elif server == 'pbs11':
        self._run_command('qdel %s@pbs11' % self._fulljobid,
                          settings.job_submit_timeout)
@rafa-esco I am adapting your PR in order to prepare it for merging. I am using the Torque wrappers of Slurm to test it, and I was wondering why you need this special treatment here. Couldn't qdel JOBID just work?
Hi! Yep, you are totally right... To be honest, this is quite particular to our installation. We have two PBS servers working side by side: pbspro (the default name) and pbs11. So yes, in a normal installation with only one server, a plain qdel JOBID should work. Removing the if/elif/else and just going with 'qdel %s' % self._fulljobid will be general enough for anybody.
@rafa-esco Since I didn't want to bother you more with Git stuff, I've submitted a new PR from my fork with an almost final implementation, so I will close this one.
Adding a PBS backend to ReFrame. This implementation should be mostly general.

It uses the fairly common qsub resource option

    -lselect=<N/n>:ncpus=<n>:mpiprocs=<n>[:extra options]

e.g. to run a regression test with a total of 1024 cores using 16 cores per node, we will generate the following line in the PBS script:

    #PBS -lselect=64:ncpus=16:mpiprocs=16

It will optionally add any extra resource option you define in the settings.py file via the "access" array of each partition, e.g.

    'access': [ 'cpu_type=ivy_bridge', 'mem=120GB' ],

will produce:

    #PBS -lselect=64:ncpus=16:mpiprocs=16:cpu_type=ivy_bridge:mem=120GB

The PBS queue (analogous to SLURM partitions) can be passed to the job scheduler using the --partition=<queue> option of ReFrame, e.g. running

    reframe -n openfoam_motorbike -r --partition=htc

will add this line to the batch script:

    #PBS -q htc

The time limit for jobs is taken into account by adding a line like:

    #PBS -l walltime=0:10:0
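The directives described above can be sketched as a small standalone function (pbs_preamble is a hypothetical helper for illustration, not the PR's actual code):

```python
def pbs_preamble(num_tasks, tasks_per_node, access=(), queue=None,
                 walltime='0:10:0'):
    # Hypothetical sketch of the PBS directives described above.
    select = '%d:ncpus=%d:mpiprocs=%d' % (
        num_tasks // tasks_per_node, tasks_per_node, tasks_per_node)
    if access:
        # extra resource options from the partition's 'access' list
        select += ':' + ':'.join(access)
    lines = ['#PBS -lselect=' + select]
    if queue:
        lines.append('#PBS -q ' + queue)
    lines.append('#PBS -l walltime=' + walltime)
    return lines

preamble = pbs_preamble(1024, 16,
                        access=('cpu_type=ivy_bridge', 'mem=120GB'),
                        queue='htc')
print('\n'.join(preamble))
```

With the arguments above this prints exactly the three directives from the examples: the -lselect line with the extra options appended, the -q htc line, and the walltime line.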