Conversation

@ethanjjjjjjj commented Nov 26, 2022

I believe that being able to explicitly ask for a number of nodes would be a useful feature, especially where the same set of tests is reused across clusters and partitions. It would make it possible to request full nodes from the Slurm scheduler without being explicit about how many tasks to run or how many cores each node has.

Helps solve some of the issues mentioned in #2093

@jenkins-cscs (Collaborator)

Can I test this patch?

@ethanjjjjjjj marked this pull request as ready for review November 26, 2022 16:32
@ethanjjjjjjj (Author)

I'd appreciate some guidance on how to cleanly introduce this feature into ReFrame. As it stands, my needs are met by this small change, but I'm happy to keep working on it until it satisfies other users as well.

Fix my poor formatting
@vkarak (Contributor) commented Nov 27, 2022

Wouldn't setting num_tasks = N and num_tasks_per_node = 1 have the same effect? Also, you can instruct reframe to always emit the -N option by setting the use_nodes_option configuration option.

If you use that option, note that in 4.0 all the scheduler-specific options will be moved inside the partition definition (see also issue #2669).
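
For reference, a minimal sketch of the first suggestion (the test name and the node count of 4 are purely illustrative):

import reframe as rfm


class four_node_variant(rfm.RegressionTest):
    # With one task per node, asking for N tasks spreads the job
    # over exactly N nodes
    num_tasks = 4
    num_tasks_per_node = 1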

@ethanjjjjjjj (Author) commented Nov 27, 2022

> Wouldn't setting num_tasks = N and num_tasks_per_node = 1 have the same effect? Also, you can instruct reframe to always emit the -N option by setting the use_nodes_option configuration option.

This is very close to what I want: it adds the --nodes line to the job script, but it still only lets me allocate a fixed number of MPI tasks per node.

I'd like that number to be filled in automatically by the scheduler, which Slurm will do as long as you leave --ntasks out of your script. Letting Slurm come up with the number of tasks means that I can run the test on a new cluster without having to specify how many tasks are required to fill a node (assuming one CPU per task).

For example with this PR:

class hpcg(rfm.RegressionTest):
    sourcepath = 'hpcg-3.1'
    num_cpus_per_task = 1
    num_tasks = None
    exclusive_access = True
    num_nodes = 4

Generates:

#!/bin/bash
#SBATCH --job-name="rfm_job"
#SBATCH --cpus-per-task=1
#SBATCH --nodes=4
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=1:0:0
#SBATCH --exclusive
srun --cpus-per-task=1 hpcg-3.1/bin/xhpcg

Which in turn allocates 28 tasks per node for a total of 112 tasks on my system.

@vkarak (Contributor) commented Nov 28, 2022

You are right; I wasn't aware of this capability of --cpus-per-task:

> If -c is specified without -n, as many tasks will be allocated per node as possible while satisfying the -c restriction.

My concern regarding the implementation is that simply introducing a num_nodes variable in parallel with num_tasks will require changes in many places and might have strange side effects if both are specified. It's not only the Slurm backend that would have to be updated: all the other backends would have to handle that option and do something sensible if it is specified along with num_tasks and num_tasks_per_node. Slurm gives much more meaning to these options than other schedulers do, and I would rather avoid reimplementing Slurm's behaviour in all the other backends. That was also the rationale for keeping what a test specifies by default minimal and well defined. Adding a num_nodes variable would require us to introduce logic, and possibly reframe-specific interpretations, for what to do when a job spec is over-specified.

There are also the flexible node allocation and the --distribute option, which would have to take care of a test that specifies num_nodes, num_tasks and num_tasks_per_node.

On the other hand, what you propose here is a valid request, and reframe should have a way to support this behaviour for the Slurm backend. I'm thinking about what an alternative implementation could look like that would be Slurm-specific and, if possible, would not be exposed to the test API.

@vkarak changed the title from "Add ability to specify number of nodes for slurm scheduler" to "[feat] Add ability to specify number of nodes for slurm scheduler" Nov 28, 2022
@ethanjjjjjjj (Author) commented Nov 28, 2022

It seems to me that Slurm already gets special treatment for many options, as shown by the table here. While adding another one is definitely not the cleanest way to deal with this in the long run, it certainly wouldn't be out of place given the current state of things. That said, I agree, and I'm on board with finding a longer-term solution for all those other cases too.

In this case, my requirement depends on omitting --ntasks from the job script, so I believe there should still be a way to set num_tasks=None, and that requires at least a small modification to the test API.

I also think that many people would want the option of being explicit in the frontend about exactly how the scheduler runs the job behind the scenes. They might prefer to set num_nodes and num_tasks explicitly, as you would in a job script, which is how this PR currently handles assigning both of these options. Perhaps a dictionary of options within something like RegressionTest.scheduler_opts.slurm could satisfy this, along the lines of the sketch below?
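
Purely as a hypothetical illustration of that idea (neither scheduler_opts nor these keys exist in ReFrame today; all names are made up):

import reframe as rfm


class hpcg(rfm.RegressionTest):
    # Hypothetical API: per-scheduler options that map directly to
    # scheduler directives and are ignored by backends that don't know them
    scheduler_opts = {
        'slurm': {
            'nodes': 4,       # would emit '#SBATCH --nodes=4'
            'ntasks': None,   # None would omit '--ntasks' so Slurm fills it in
        }
    }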

I think --distribute still launches one test per node that matches the filter, and those can still be multi-node jobs, so in this case specifying num_nodes explicitly doesn't change the behaviour much. Possibly, if --distribute filters down to a list of 20 nodes and you request num_nodes=4, then ReFrame should spawn 5 jobs of 4 nodes each from the 20. What do you think?

@vkarak (Contributor) commented Nov 29, 2022

What about the following? If num_tasks could be set to None, which I agree is unavoidable for achieving this scenario, then you could write your test as follows:

class my_test(...):
    num_tasks = None

    @run_before('run')    # or anywhere after setup
    def setup_job(self):
        self.job.options = [f'--nodes={N}']  # N: the desired node count

Setting num_tasks to None, which is the only scheduler-specific variable that is not allowed to be None at the moment, would be the equivalent of telling the framework "I know what scheduler I'm using and I know what I'm doing with my job options." That's fine, I think, and then you can pass any additional options as shown above. Regarding the implementation, we would also have to handle this case for the other backends, but I believe that is straightforward: we simply don't emit anything derived from num_tasks and we process the user-specified job options (which is already done).
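
To make this concrete, here is a fuller sketch under the same assumption, i.e. that num_tasks may be set to None (the test name, executable and node count of 4 are purely illustrative):

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class four_node_check(rfm.RegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'hostname'
    exclusive_access = True
    num_tasks = None          # assumes the proposed change: no '--ntasks' emitted

    @run_before('run')
    def set_node_count(self):
        # Request whole nodes and let Slurm derive the task count
        self.job.options += ['--nodes=4']

    @sanity_function
    def assert_output(self):
        # Pass if the job produced any output at all
        return sn.assert_found(r'\S', self.stdout)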

@vkarak modified the milestone: ReFrame sprint 22.12.1 Nov 30, 2022
@ethanjjjjjjj (Author)

Yep, that makes sense to me. It would definitely give me the control I need over the job script.

@vkarak added this to the ReFrame Sprint 23.01 milestone Jan 12, 2023
@vkarak self-assigned this Jan 12, 2023
@vkarak (Contributor) commented Jan 26, 2023

@ethanjjjjjjj I will close and reopen this against the develop branch.
