Conversation

@vkarak
Contributor

@vkarak vkarak commented Jun 6, 2018

Fixes #46 and replaces #277.

To test on Dom, you should do the following:

export PATH=/apps/dom/UES/karakasv/slurm-wrappers/bin:$PATH
./test_reframe.py --rfm-user-config=config/cscs-pbs.py

This special configuration is only valid for Dom and uses pbs+mpiexec.

Still todo:

  • Update user documentation
  • Enable CI testing

@vkarak
Contributor Author

vkarak commented Jun 6, 2018

@rafa-esco Would you mind testing this version of the backend? Downloading the pbs.py file and replacing yours should be enough.

self.sourcesdir = os.path.join(self.current_system.resourcesdir,
                               'CUDA', 'essentials')
self.modules = ['cudatoolkit']
self.modules = ['craype-accel-nvidia60']
Contributor Author

That's irrelevant here. I should remove it.

@rafa-esco

rafa-esco commented Jun 6, 2018

@vkarak No problemo, I just had to download a zip of the entire repository (because the import of reframe.core.config was failing). It runs, but there is a problem when cancelling jobs... you have to change:

if self._pbs_server:
    jobid += ':' + self._pbs_server

to

if self._pbs_server:
    jobid += '.' + self._pbs_server

Apart from that, everything (even the async execution policy) seems to work nicely :).
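For context, PBS job identifiers take the form `<sequence>.<server>`, which is why the suffix must be joined with `.` rather than `:`. A minimal standalone sketch of the fixed logic (hypothetical helper name, not part of the actual backend):

```python
def qdel_target(jobid, pbs_server=None):
    """Build the job identifier passed to qdel.

    PBS job IDs have the form '<sequence>.<server>', so the server
    name is appended with a '.' separator when it is known.
    """
    jobid = str(jobid)
    if pbs_server:
        jobid += '.' + pbs_server

    return jobid
```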

@vkarak
Contributor Author

vkarak commented Jun 6, 2018

@rafa-esco Thanks for testing! Good catch, that was a typo; I meant `.` actually!

@rafa-esco

@vkarak I just noticed a small bug in _emit_lselect_option... If I do not define _num_cpus_per_task, the "ncpus=" value is not emitted, so my job gets only 1 CPU per node. The default value for _num_cpus_per_task should be 1, so adding num_cpus_per_task = self._num_cpus_per_task or 1 and using that variable downstream will always generate a correct ncpus option:

def _emit_lselect_option(self, builder):
    num_tasks_per_node = self._num_tasks_per_node or 1
    num_cpus_per_task = self._num_cpus_per_task or 1
    num_nodes = self._num_tasks // num_tasks_per_node
    ret = '-l select=%s:mpiprocs=%s' % (num_nodes, num_tasks_per_node)
    num_cpus_per_node = num_tasks_per_node * num_cpus_per_task
    ret += ':ncpus=%s' % num_cpus_per_node

    if self.options:
        ret += ':' + ':'.join(self.options)

    self._emit_job_option(ret, builder)
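To see what the suggested logic produces, here is a self-contained sketch with the same arithmetic but plain arguments instead of ReFrame's internal attributes (the function name and signature are illustrative, not the actual API):

```python
def lselect_option(num_tasks, num_tasks_per_node=None,
                   num_cpus_per_task=None, options=None):
    # Default both per-node quantities to 1, as suggested above,
    # so ncpus is always emitted with a sane value.
    num_tasks_per_node = num_tasks_per_node or 1
    num_cpus_per_task = num_cpus_per_task or 1
    num_nodes = num_tasks // num_tasks_per_node
    num_cpus_per_node = num_tasks_per_node * num_cpus_per_task
    ret = '-l select=%s:mpiprocs=%s:ncpus=%s' % (
        num_nodes, num_tasks_per_node, num_cpus_per_node)
    if options:
        ret += ':' + ':'.join(options)

    return ret
```

For example, 8 tasks with 4 tasks per node and 2 CPUs per task yield `-l select=2:mpiprocs=4:ncpus=8`, i.e. 2 nodes with 8 CPUs requested on each.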

@vkarak
Contributor Author

vkarak commented Jun 6, 2018

@rafa-esco So, in your setup, must ncpus always be present? In our case, omitting it entirely wasn't a problem.

The general idea for the backends is to generate the minimum required fields based on the user input in their regression test. So if the user does not set self.num_cpus_per_task and ncpus is not required by the scheduler, it is omitted.
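That "emit only what the user set" principle can be sketched as follows (hypothetical names, not ReFrame's actual API); contrast this with the always-emit variant suggested earlier in the thread:

```python
def emit_resources(num_tasks, num_tasks_per_node=None, num_cpus_per_task=None):
    # Always emit the node/task layout, but include ncpus only when
    # the test explicitly set num_cpus_per_task; schedulers that do
    # not require ncpus then never see it.
    ntpn = num_tasks_per_node or 1
    ret = '-l select=%s:mpiprocs=%s' % (num_tasks // ntpn, ntpn)
    if num_cpus_per_task is not None:
        ret += ':ncpus=%s' % (ntpn * num_cpus_per_task)

    return ret
```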

@rafa-esco

@vkarak ncpus does not need to be present, but it then defaults to 1 core per node, so PBS allows me to use only one CPU on each node (we have cgroups here, which makes things even harder ;)), even if mpiprocs is larger.

@vkarak
Contributor Author

vkarak commented Jun 6, 2018

@rafa-esco OK, now I get it. You are right. I'll fix that.

@vkarak
Contributor Author

vkarak commented Jun 6, 2018

@rafa-esco I've pushed the fixes you suggested. Could you test once more?

@codecov-io

codecov-io commented Jun 7, 2018

Codecov Report

Merging #316 into master will decrease coverage by 0.21%.
The diff coverage is 80.64%.


@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
- Coverage   91.45%   91.24%   -0.22%     
==========================================
  Files          66       67       +1     
  Lines        7807     7935     +128     
==========================================
+ Hits         7140     7240     +100     
- Misses        667      695      +28
Impacted Files                          Coverage Δ
reframe/core/pipeline.py                94.69% <ø> (ø) ⬆️
reframe/__init__.py                     80% <100%> (ø) ⬆️
reframe/core/schedulers/registry.py     85% <100%> (+0.78%) ⬆️
reframe/core/schedulers/pbs.py          67.85% <67.85%> (ø)
unittests/test_schedulers.py            97.68% <95.58%> (-0.34%) ⬇️
reframe/core/schedulers/__init__.py     94.53% <0%> (+0.78%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3bd86b2...a62f63f. Read the comment docs.

@rafa-esco

@vkarak Yep, all my tests passed :)

Vasileios Karakasis added 3 commits June 7, 2018 10:32
- This test checks if the `ncpus` resource option is emitted correctly
  when `num_cpus_per_task` is not defined.
- Also some cosmetic changes in `reframe/__init__.py`.
PBS fixes:
- Treat separately options and `-lselect` resources.
- Treat custom prefixes. Though I don't know of cases that need that, it
  adds more flexibility and is in accordance with the Slurm backend.
@vkarak vkarak changed the title WIP: PBS scheduler backend PBS scheduler backend Jun 7, 2018
Contributor

@teojgo teojgo left a comment

lgtm

@vkarak
Contributor Author

vkarak commented Jun 8, 2018

@victorusu I will merge this one. Is this ok?

@vkarak vkarak merged commit a9e6692 into reframe-hpc:master Jun 8, 2018
@vkarak vkarak deleted the feature/pbs-scheduler-backend branch June 8, 2018 14:44

4 participants