
Conversation

@rafa-esco

Adding PBS backend to ReFrame. This implementation should be mostly general.

It uses the fairly common qsub resource option "-lselect=<N/n>:ncpus=<n>:mpiprocs=<n>[:extra options]". For example, to run a regression test with a total of 1024 cores using 16 cores per node, we will generate the following line in the PBS script:

#PBS -lselect=64:ncpus=16:mpiprocs=16

It will optionally add any extra resource options you specify in the settings.py file via the "access" array of each partition, e.g.

'access': [ 'cpu_type=ivy_bridge', 'mem=120GB' ],

will produce:

#PBS -lselect=64:ncpus=16:mpiprocs=16:cpu_type=ivy_bridge:mem=120GB
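The assembly of that directive can be sketched as follows. This is an illustrative sketch only; the helper name and signature are hypothetical, not the actual ReFrame backend code.

```python
# Hypothetical sketch of how the -lselect directive could be assembled;
# names are illustrative, not the actual ReFrame PBS backend code.
def build_select_option(num_tasks, num_tasks_per_node, access_options):
    """Assemble a PBS -lselect directive from the task layout."""
    num_nodes = num_tasks // num_tasks_per_node
    select = '-lselect=%d:ncpus=%d:mpiprocs=%d' % (
        num_nodes, num_tasks_per_node, num_tasks_per_node)
    if access_options:
        # Extra per-partition resources are appended colon-separated.
        select += ':' + ':'.join(access_options)
    return '#PBS ' + select

print(build_select_option(1024, 16, ['cpu_type=ivy_bridge', 'mem=120GB']))
# #PBS -lselect=64:ncpus=16:mpiprocs=16:cpu_type=ivy_bridge:mem=120GB
```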

The PBS queue (the analogue of a SLURM partition) can be passed to the job scheduler using ReFrame's --partition=<queue> option, e.g. running

reframe -n openfoam_motorbike -r --partition=htc

will add this line to the batch script:

#PBS -q htc

The time limit for jobs is taken into account by adding a directive such as:

#PBS -l walltime=0:10:0
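A minimal sketch of how such a directive could be rendered from a time limit tuple; the helper name is hypothetical and not part of the backend:

```python
# Hypothetical helper (not the actual backend code) rendering an
# (hours, minutes, seconds) time limit as a PBS walltime directive.
def walltime_directive(hours, minutes, seconds):
    return '#PBS -l walltime=%d:%d:%d' % (hours, minutes, seconds)

print(walltime_directive(0, 10, 0))  # #PBS -l walltime=0:10:0
```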

@jenkins-cscs
Collaborator

Can I test this patch?

@vkarak
Contributor

vkarak commented May 2, 2018

@jenkins-cscs retry all

@vkarak vkarak requested a review from victorusu May 2, 2018 15:46
@vkarak vkarak added this to the Upcoming sprint milestone May 16, 2018
Contributor

@vkarak vkarak left a comment


Thanks a lot Rafa for your PR. I was quite busy the weeks before, so I couldn't review it.

I have some comments that could improve the implementation a bit. Another thing is to add some unit tests. This isn't difficult for the scheduler backends, because we already have an infrastructure in place; you may have a look at TestSlurmJob in unittests/test_schedulers.py. Since these unit tests are configuration specific, in order to trigger them you will have to use your site configuration file that uses PBS:

RFM_CONFIG_FILE=your-site-config.py ./test_reframe.py unittests/test_schedulers.py

import os
import re
import time
from datetime import datetime
Contributor


This import is not needed.

import reframe.core.schedulers as sched
import reframe.utility.os_ext as os_ext
from reframe.core.exceptions import (SpawnedProcessError,
                                     JobBlockedError, JobError)
Contributor


You don't need the JobBlockedError here. This is thrown by the SLURM backend when a job is indefinitely blocked due to reasons that require sysadmin intervention to be resolved.

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._prefix = '#PBS'
    self._is_cancelling = False
Contributor


You don't need this in your implementation. We use this in the SLURM backend to support the automatic cancellation of the job if it's blocked in an unrecoverable state.

# fix for regression tests with compile
if os.path.dirname(self.command) is '.':
    self._command = os.path.join(
        self.workdir, os.path.basename(self.command))
Contributor


The best way is to emit a cd workdir at the beginning of your job script, right after the preamble. Right now you are only fixing the case where the script contains just the executable, but what if it defines commands in pre_run and post_run? The user expects them to run inside the stage directory of the test.

self._emit_job_option(self.name, '-N "{0}"', builder)

extra_options = ''
if len(self.options):
Contributor


You can simply check with if self.options:.
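For reference, an empty list is falsy in Python, so the explicit len() call is redundant; a quick illustration:

```python
# `if options:` behaves exactly like `if len(options):` because an
# empty list is falsy and a non-empty one is truthy.
options = []
print(bool(options))             # False
options = ['cpu_type=ivy_bridge']
print(bool(options))             # True
```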

extra_options = ':' + ':'.join(self.options)

self._emit_job_option((int(self._num_tasks/self._num_tasks_per_node), self._num_tasks_per_node, self._num_tasks_per_node, extra_options),
                      '-lselect={0}:ncpus={1}:mpiprocs={2}{3}', builder)
Contributor


A couple of comments here:

  1. This line is too long. Could you wrap it at 79 columns?
  2. There is no need to use int() to take the integer part of the division. You may simply use // instead of /.
  3. You need not use self._num_tasks_per_node twice. You could simply repeat the corresponding placeholder in your format string: '-lselect={0}:ncpus={1}:mpiprocs={1}{2}'
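Point 3 above works because str.format() positional placeholders may be repeated; a quick demonstration:

```python
# Repeating {1} feeds the same argument into both ncpus and mpiprocs,
# so the value only needs to be passed once.
fmt = '-lselect={0}:ncpus={1}:mpiprocs={1}{2}'
print(fmt.format(64, 16, ':mem=120GB'))
# -lselect=64:ncpus=16:mpiprocs=16:mem=120GB
```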

cmd = 'qsub %s' % self.script_filename
completed = self._run_command(cmd, settings.job_submit_timeout)
jobid_match = re.search('^(?P<jobid>\d+)',
                        completed.stdout)
Contributor


I think this can fit nicely in a single line.

super().cancel()
getlogger().debug('cancelling job (id=%s)' % self._jobid)
self._is_cancelling = True
jobid, server = self._fulljobid.split(".")
Contributor


We use single quotes for strings.

    self._run_command('qdel %s@pbs11' % self._fulljobid,
                      settings.job_submit_timeout)
else:
    raise JobError('Did not recognize server', server)
Contributor


The second argument is not going to be printed. You may first format the message and then pass it to JobError. I'd also suggest using "could not" instead of "did not".
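The suggested fix can be sketched like this; JobError is stubbed here in place of reframe.core.exceptions.JobError, purely for illustration:

```python
# Sketch of the suggested fix: format the message first, so the server
# name actually appears in the error text. JobError is a local stub
# standing in for reframe.core.exceptions.JobError.
class JobError(Exception):
    pass

def reject_unknown_server(server):
    raise JobError('could not recognize server: %s' % server)

try:
    reject_unknown_server('pbs42')
except JobError as err:
    print(err)  # could not recognize server: pbs42
```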

if os.path.isfile(self.stdout):
    return True
else:
    return False
Contributor


You may simply write return os.path.isfile(self.stdout) here.

@vkarak vkarak modified the milestones: Upcoming sprint, ReFrame sprint 2018w20 May 16, 2018
                      settings.job_submit_timeout)
elif server == 'pbs11':
    self._run_command('qdel %s@pbs11' % self._fulljobid,
                      settings.job_submit_timeout)
Contributor


@rafa-esco I am adapting your PR in order to prepare it for merging. I am using the Torque wrappers of Slurm to test it, and I was wondering why you need this special treatment here. Couldn't qdel JOBID just work?

Author


Hi! Yep, you are totally right. To be honest, this is quite particular to our installation: we have two PBS servers working side by side, pbspro (the default name) and pbs11. So yes, in a normal installation with only one server, a plain qdel JOBID should work. Removing the if/elif/else and just going with 'qdel %s' % self._fulljobid will be general enough for anybody.
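The simplified cancel logic agreed on above could be sketched as follows; the helper and the stub runner are illustrative, with run_command standing in for the backend's _run_command:

```python
# Sketch of the simplified cancel: no per-server branching, just qdel
# with the full job id. run_command stands in for the backend's
# _run_command and is stubbed below to record the issued command.
def cancel_job(fulljobid, run_command, timeout=30):
    return run_command('qdel %s' % fulljobid, timeout)

issued = []
cancel_job('12345.pbspro', lambda cmd, t: issued.append(cmd))
print(issued)  # ['qdel 12345.pbspro']
```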

@vkarak vkarak mentioned this pull request Jun 6, 2018
2 tasks
@vkarak
Contributor

vkarak commented Jun 6, 2018

@rafa-esco Since I didn't want you to bother with more Git stuff, I've submitted a new PR from my fork with an almost-final implementation, so I will close this one.

@vkarak vkarak closed this Jun 6, 2018
@vkarak vkarak removed this from the Upcoming sprint milestone Jun 6, 2018