Getting 2 nodes on what was supposed to be an intranode test #3580

@casparvl

Description

I have an OSU test that is supposed to test point-to-point GPU communication. Essentially, it sets num_tasks=2 and num_tasks_per_node=2. The job script produced is:

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_OSU_pt2pt_GPU_87fbf5ce"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=0:30:0
#SBATCH -p gpu_h100
#SBATCH --export=None
#SBATCH --mem=737280M
#SBATCH --gpus-per-node=4
source /cvmfs/software.eessi.io/versions/2023.06/init/bash
module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
...
mpirun -np 2 osu_bw -m 4194304 -x 5 -i 10 -c -d cuda D D
...
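To make the problem concrete, here is an illustrative sketch (not ReFrame's actual internals; the function name is made up) of how a test's num_tasks and num_tasks_per_node could be turned into the SBATCH directives seen above:

```python
# Illustrative sketch (not ReFrame's actual code) of how a test's
# num_tasks and num_tasks_per_node settings map to SBATCH directives.
def emit_directives(num_tasks, num_tasks_per_node):
    directives = [
        f'#SBATCH --ntasks={num_tasks}',
        f'#SBATCH --ntasks-per-node={num_tasks_per_node}',
    ]
    # Note: no '--nodes' directive is emitted, which is what allows Slurm
    # to treat --ntasks-per-node as a mere upper bound.
    return directives

for d in emit_directives(2, 2):
    print(d)
```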

I saw strongly varying performance: either 25 GB/s or 120 GB/s. Based on our interconnect and the connectivity between GPUs, 25 GB/s matches our internode GPU-to-GPU performance, whereas 120 GB/s matches the intranode GPU-to-GPU performance. Checking the run report, I saw:

          "outputdir": "/home/jenkins/EESSI/reframe_CI_runs/output/snellius/gpu_H100/default/EESSI_OSU_pt2pt_GPU_87fbf5ce",
...
          "job_nodelist": [
            "gcn114",
            "gcn149"
          ],

I.e. this particular test was being scheduled across two nodes. I was a bit surprised by this behavior, but reading the SLURM documentation carefully, it becomes clear why:

--ntasks-per-node=
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option. This is related to --cpus-per-task=ncpus, but does not require knowledge of the actual number of cpus on each node. In some cases, it is more convenient to be able to request that no more than a specific number of tasks be invoked on each node. Examples of this include submitting a hybrid MPI/OpenMP app where only one MPI "task/rank" should be assigned to each node while allowing the OpenMP portion to utilize all of the parallelism present in the node, or submitting a single setup/cleanup/monitoring job to each node of a pre-existing allocation as one step in a larger job script.

Note in particular

If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.

I.e. they basically say: you should use it with --nodes, and if you use it with --ntasks instead, it's considered a maximum count of tasks per node. That gives SLURM the liberty of actually scheduling 2 nodes with 1 task per node each, which is what happened in my case. From a regression testing perspective, this is clearly undesirable, as it leads to unexpected changes in performance from one run to the next. Actually, I'd consider it a bug, because the ReFrame docs specify:

num_tasks_per_node= None
Number of tasks per node required by this test.

Which suggests that's exactly the number of tasks per node you'll get (and not a maximum, as it is for SLURM). But by specifying --ntasks and --ntasks-per-node (and not --nodes), ReFrame doesn't give the SLURM backend the right instructions to trigger the promised behavior.
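To see why treating --ntasks-per-node as a maximum permits the two-node placement, here is a toy enumeration of the placements Slurm could legally pick; this is an illustrative model, not Slurm's actual scheduling algorithm:

```python
# Toy model of Slurm's freedom when given only --ntasks=N and
# --ntasks-per-node=M (a maximum): enumerate every legal distribution of
# N tasks over at most `max_nodes` nodes. Not Slurm's actual algorithm.
def valid_placements(num_tasks, max_tasks_per_node, max_nodes):
    results = []

    def distribute(remaining, placement):
        if remaining == 0:
            results.append(tuple(placement))
            return
        if len(placement) == max_nodes:
            return
        for n in range(1, min(remaining, max_tasks_per_node) + 1):
            distribute(remaining - n, placement + [n])

    distribute(num_tasks, [])
    return results

# With --ntasks=2 and --ntasks-per-node=2, both a two-node and a
# single-node placement satisfy the request:
print(valid_placements(2, 2, 2))  # → [(1, 1), (2,)]
```

Both placements are legal, so which one you get depends on what the scheduler happens to find free, which is exactly the run-to-run variability observed above.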

Now, I know the use_nodes_option exists, and it does resolve the issue, but its default value is False. I'd consider it preferable to change the default to True, so that the behavior of num_tasks_per_node as documented in the ReFrame docs matches the behavior it triggers on the SLURM side.
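In the meantime, the workaround can be enabled per partition in the site configuration. A minimal sketch, assuming a recent ReFrame where use_nodes_option lives under sched_options; the system and partition names are placeholders, and the exact nesting may differ between ReFrame versions:

```python
# Sketch of a ReFrame site-configuration fragment enabling the workaround
# per partition. System/partition names are placeholders, and the exact
# location of 'use_nodes_option' may differ between ReFrame versions;
# check the configuration reference for your release.
site_configuration = {
    'systems': [
        {
            'name': 'example_system',   # placeholder name
            'hostnames': ['login.*'],   # placeholder pattern
            'partitions': [
                {
                    'name': 'gpu',      # placeholder partition
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'sched_options': {
                        # Make ReFrame emit an explicit --nodes directive,
                        # so num_tasks_per_node behaves as documented.
                        'use_nodes_option': True,
                    },
                },
            ],
        },
    ],
}
```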
