
ReFrame fails to detect job status of Slurm job arrays #839

@bcfriesen

When a test submits a Slurm job array using #SBATCH --array=<array_specs>, ReFrame fails to detect the job's state when it polls sacct. The debug log includes entries like:

[2019-06-20T12:18:49] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: entering stage: poll
[2019-06-20T12:18:49] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: executing OS command: sacct -S 2019-06-20 -P -j 22423095 -o jobid,state,exitcode,nodelist
[2019-06-20T12:18:49] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: job state not matched (stdout follows)
JobID|State|ExitCode|NodeList
22423095_1|RUNNING|0:0|nid0[2544-2545]
22423095_1.extern|RUNNING|0:0|nid0[2544-2545]
22423095_0|RUNNING|0:0|nid0[2517-2518]
22423095_0.extern|RUNNING|0:0|nid0[2517-2518]

and

[2019-06-20T12:19:04] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: entering stage: poll
[2019-06-20T12:19:04] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: executing OS command: sacct -S 2019-06-20 -P -j 22423095 -o jobid,state,exitcode,nodelist
[2019-06-20T12:19:04] debug: slurm_job_arrays on cori:knl using PrgEnv-intel: job state not matched (stdout follows)
JobID|State|ExitCode|NodeList
22423095_1|COMPLETED|0:0|nid0[2544-2545]
22423095_1.batch|COMPLETED|0:0|nid02544
22423095_1.extern|COMPLETED|0:0|nid0[2544-2545]
22423095_1.0|COMPLETED|0:0|nid0[2544-2545]
22423095_0|COMPLETED|0:0|nid0[2517-2518]
22423095_0.batch|COMPLETED|0:0|nid02517
22423095_0.extern|COMPLETED|0:0|nid0[2517-2518]
22423095_0.0|COMPLETED|0:0|nid0[2517-2518]

I wonder if ReFrame gets confused because the JobID column has the form <job_id>_<array_task_num> rather than just <job_id>?
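To illustrate the suspicion, here is a minimal sketch of a parser that accepts both plain job IDs and array task IDs in the `sacct -P` output shown above. This is not ReFrame's actual code: the names `STATE_LINE`, `array_task_states` and `all_completed` are hypothetical, and it assumes the four pipe-separated columns requested in the logs (jobid, state, exitcode, nodelist).

```python
import re

# Hypothetical pattern, not ReFrame's own: matches "<job_id>" or
# "<job_id>_<array_task_num>" in the JobID column and skips job steps
# such as ".batch", ".extern" and ".0", as well as the header line.
STATE_LINE = re.compile(
    r'^(?P<jobid>\d+)(?:_(?P<task>\d+))?\|(?P<state>[^|\n]+)\|'
    r'(?P<exitcode>\d+:\d+)\|(?P<nodelist>.*)$',
    re.MULTILINE,
)


def array_task_states(sacct_stdout, jobid):
    """Map each array task number (None for a non-array job) to its state."""
    states = {}
    for m in STATE_LINE.finditer(sacct_stdout):
        if m.group('jobid') != str(jobid):
            continue
        task = m.group('task')
        key = int(task) if task is not None else None
        states[key] = m.group('state')
    return states


def all_completed(states):
    """True only if at least one entry was found and every one is COMPLETED."""
    return bool(states) and all(s == 'COMPLETED' for s in states.values())


if __name__ == '__main__':
    # Excerpt of the sacct output from the second log above
    sample = (
        'JobID|State|ExitCode|NodeList\n'
        '22423095_1|COMPLETED|0:0|nid0[2544-2545]\n'
        '22423095_1.batch|COMPLETED|0:0|nid02544\n'
        '22423095_0|COMPLETED|0:0|nid0[2517-2518]\n'
        '22423095_0.batch|COMPLETED|0:0|nid02517\n'
    )
    states = array_task_states(sample, 22423095)
    print(states, all_completed(states))
```

With a pattern like this, the array as a whole would only count as finished once every task reports a terminal state; how ReFrame should aggregate mixed states (for example one FAILED task) is a separate design decision.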
