
Conversation

@rafa-esco

Adding PBS backend to ReFrame. This implementation should be mostly general.

It uses the fairly common qsub resource option "-lselect=<N/n>:ncpus=<n>:mpiprocs=<n>[:extra options]". For example, to run a regression test with a total of 1024 cores using 16 cores per node, we will generate the following line in the PBS script:

#PBS -lselect=64:ncpus=16:mpiprocs=16

It will optionally add any extra resource options you specify in the settings.py file via the "access" array of each partition, e.g.

'access': [ 'cpu_type=ivy_bridge', 'mem=120GB' ],

will produce:

#PBS -lselect=64:ncpus=16:mpiprocs=16:cpu_type=ivy_bridge:mem=120GB
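The assembly of that directive can be sketched as follows. This is an illustrative sketch only; the helper name and signature are hypothetical, not the actual ReFrame backend code.

```python
# Hypothetical sketch of how the -lselect directive could be assembled;
# names are illustrative, not the actual ReFrame PBS backend code.
def build_select_option(num_tasks, num_tasks_per_node, access_options):
    """Assemble a PBS -lselect directive from the task layout."""
    num_nodes = num_tasks // num_tasks_per_node
    select = '-lselect=%d:ncpus=%d:mpiprocs=%d' % (
        num_nodes, num_tasks_per_node, num_tasks_per_node)
    if access_options:
        # Extra per-partition resources are appended colon-separated.
        select += ':' + ':'.join(access_options)
    return '#PBS ' + select

print(build_select_option(1024, 16, ['cpu_type=ivy_bridge', 'mem=120GB']))
# #PBS -lselect=64:ncpus=16:mpiprocs=16:cpu_type=ivy_bridge:mem=120GB
```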

The PBS queue (the analogue of a SLURM partition) can be passed to the job scheduler using ReFrame's --partition=<queue> option, e.g. running

reframe -n openfoam_motorbike -r --partition=htc

will add this line to the batch script:

#PBS -q htc

The time limit for jobs is taken into account by adding a directive such as:

#PBS -l walltime=0:10:0
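A minimal sketch of how such a directive could be rendered from a time limit tuple; the helper name is hypothetical and not part of the backend:

```python
# Hypothetical helper (not the actual backend code) rendering an
# (hours, minutes, seconds) time limit as a PBS walltime directive.
def walltime_directive(hours, minutes, seconds):
    return '#PBS -l walltime=%d:%d:%d' % (hours, minutes, seconds)

print(walltime_directive(0, 10, 0))  # #PBS -l walltime=0:10:0
```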

@jenkins-cscs
Collaborator

Can I test this patch?

@vkarak
Contributor

vkarak commented May 2, 2018

@jenkins-cscs retry all

@vkarak vkarak requested a review from victorusu May 2, 2018 15:46
@vkarak vkarak added this to the Upcoming sprint milestone May 16, 2018
Contributor

@vkarak vkarak left a comment


Thanks a lot Rafa for your PR. I was quite busy the weeks before, so I couldn't review it.

I have some comments that could improve the implementation a bit. Another thing is to add some unit tests. This isn't difficult for the scheduler backends, because we already have an infrastructure in place; you may have a look at TestSlurmJob in unittests/test_schedulers.py. Since these unit tests are configuration specific, in order to trigger them you will have to use your site configuration file that uses PBS:

RFM_CONFIG_FILE=your-site-config.py ./test_reframe.py unittests/test_schedulers.py

import os
import re
import time
from datetime import datetime
Contributor


This import is not needed.

import reframe.core.schedulers as sched
import reframe.utility.os_ext as os_ext
from reframe.core.exceptions import (SpawnedProcessError,
                                     JobBlockedError, JobError)
Contributor


You don't need the JobBlockedError here. This is thrown by the SLURM backend when a job is indefinitely blocked due to reasons that require sysadmin intervention to be resolved.

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._prefix = '#PBS'
    self._is_cancelling = False
Contributor


You don't need this in your implementation. We use this in the SLURM backend to support the automatic cancellation of the job if it's blocked in an unrecoverable state.

# fix for regression tests with compile
if os.path.dirname(self.command) is '.':
    self._command = os.path.join(
        self.workdir, os.path.basename(self.command))
Contributor


The best way is to emit a cd workdir at the beginning of your job script, right after the preamble. Right now you are only fixing the case where the script contains just the executable, but what if it defines commands in pre_run and post_run? The user expects them to run inside the stage directory of the test.

self._emit_job_option(self.name, '-N "{0}"', builder)

extra_options = ''
if len(self.options):
Contributor


You can simply check with if self.options:.
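For reference, an empty list is falsy in Python, so the explicit len() call is redundant; a quick illustration:

```python
# `if options:` behaves exactly like `if len(options):` because an
# empty list is falsy and a non-empty one is truthy.
options = []
print(bool(options))             # False
options = ['cpu_type=ivy_bridge']
print(bool(options))             # True
```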

extra_options = ':' + ':'.join(self.options)

self._emit_job_option((int(self._num_tasks/self._num_tasks_per_node), self._num_tasks_per_node, self._num_tasks_per_node, extra_options),
                      '-lselect={0}:ncpus={1}:mpiprocs={2}{3}', builder)
Contributor


A couple of comments here:

  1. This line is too long. Could you wrap it at 79 columns?
  2. There is no need to use int() to take the integer part of the division. You may simply use // instead of /.
  3. You need not use self._num_tasks_per_node twice. You could simply repeat the corresponding placeholder in your format string: '-lselect={0}:ncpus={1}:mpiprocs={1}{2}'
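Point 3 above works because str.format() positional placeholders may be repeated; a quick demonstration:

```python
# Repeating {1} feeds the same argument into both ncpus and mpiprocs,
# so the value only needs to be passed once.
fmt = '-lselect={0}:ncpus={1}:mpiprocs={1}{2}'
print(fmt.format(64, 16, ':mem=120GB'))
# -lselect=64:ncpus=16:mpiprocs=16:mem=120GB
```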

cmd = 'qsub %s' % self.script_filename
completed = self._run_command(cmd, settings.job_submit_timeout)
jobid_match = re.search('^(?P<jobid>\d+)',
                        completed.stdout)
Contributor


I think this can fit nicely in a single line.

super().cancel()
getlogger().debug('cancelling job (id=%s)' % self._jobid)
self._is_cancelling = True
jobid, server = self._fulljobid.split(".")
Contributor


We use single quotes for strings.

    self._run_command('qdel %s@pbs11' % self._fulljobid,
                      settings.job_submit_timeout)
else:
    raise JobError('Did not recognize server', server)
Contributor


The second argument is not going to be printed. You may first format the message and then pass it to JobError. I'd also suggest using "could not" instead of "did not".
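The suggested fix can be sketched like this; JobError is stubbed here in place of reframe.core.exceptions.JobError, purely for illustration:

```python
# Sketch of the suggested fix: format the message first, so the server
# name actually appears in the error text. JobError is a local stub
# standing in for reframe.core.exceptions.JobError.
class JobError(Exception):
    pass

def reject_unknown_server(server):
    raise JobError('could not recognize server: %s' % server)

try:
    reject_unknown_server('pbs42')
except JobError as err:
    print(err)  # could not recognize server: pbs42
```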

if os.path.isfile(self.stdout):
    return True
else:
    return False
Contributor


You may simply write return os.path.isfile(self.stdout) here.

@vkarak vkarak modified the milestones: Upcoming sprint, ReFrame sprint 2018w20 May 16, 2018
                      settings.job_submit_timeout)
elif server == 'pbs11':
    self._run_command('qdel %s@pbs11' % self._fulljobid,
                      settings.job_submit_timeout)
Contributor


@rafa-esco I am adapting your PR in order to prepare it for merging. I am using the Torque wrappers of Slurm to test it, and I was wondering why you need this special treatment here. Couldn't qdel JOBID just work?

Author


Hi! Yep, you are totally right. To be honest, this is quite particular to our installation: we have two PBS servers working side by side, pbspro (the default name) and pbs11. So yes, in a normal installation with only one server, a plain qdel JOBID should work. Removing the if/elif/else and just going with 'qdel %s' % self._fulljobid will be general enough for anybody.
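The simplified cancel logic agreed on above could be sketched as follows; the helper and the stub runner are illustrative, with run_command standing in for the backend's _run_command:

```python
# Sketch of the simplified cancel: no per-server branching, just qdel
# with the full job id. run_command stands in for the backend's
# _run_command and is stubbed below to record the issued command.
def cancel_job(fulljobid, run_command, timeout=30):
    return run_command('qdel %s' % fulljobid, timeout)

issued = []
cancel_job('12345.pbspro', lambda cmd, t: issued.append(cmd))
print(issued)  # ['qdel 12345.pbspro']
```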

@vkarak vkarak mentioned this pull request Jun 6, 2018
2 tasks
@vkarak
Contributor

vkarak commented Jun 6, 2018

@rafa-esco Since I didn't want you to bother with more Git stuff, I've submitted a new PR from my fork with an almost-final implementation, so I will close this one.

@vkarak vkarak closed this Jun 6, 2018
@vkarak vkarak removed this from the Upcoming sprint milestone Jun 6, 2018