Problems with parsing arrayjobs on DAS5 #672

Open
jmaassen opened this issue Aug 20, 2020 · 4 comments

@jmaassen
Member
As reported in #671

@jmaassen jmaassen self-assigned this Aug 20, 2020
@jmaassen
Member Author

On DAS5 we have a bunch of array jobs running at the moment.

Output of squeue --format="%i %P %j %u %T %M %l %D %R %k" is:

JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON) COMMENT  
2651316 proq pretrain_small_reduced_nop3.sh mda420 RUNNING 7-15:37:35 8-08:00:00 1 node078 (null)  
2651437 proq real_nvp ama228 RUNNING 6-01:32:43 20-20:00:00 1 node069 (null)  
2653318 defq prun-job syg340 RUNNING 3:41:00 13:54:00 1 node024 (null)  
2653320 defq bash sghiassi RUNNING 3:28:07 1-00:00:00 1 node007 (null)  
2653325_1 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node053 (null)  
2653325_2 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node054 (null)  
2653325_3 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node055 (null)  
2653325_4 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node056 (null)  
2653329 defq pytorch-ib.job mprovokl RUNNING 2:43 20:00 2 node[001-002] (null)  

Output of scontrol show job 2653325_1 is:

JobId=2653326 ArrayJobId=2653325 ArrayTaskId=1 JobName=neat5
   UserId=zgo600(2422) GroupId=zgo600(2422) MCS_label=N/A
   Priority=4294039317 Nice=0 Account=zgo600 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:59:02 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2020-08-20T11:16:15 EligibleTime=2020-08-20T11:16:15
   StartTime=2020-08-20T11:16:15 EndTime=2020-08-20T21:16:15 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=fs0:25488
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node053
   BatchHost=node053
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/zgo600/digits/standard/runstandard.slurm
   WorkDir=/home/zgo600/digits/standard
   StdErr=/home/zgo600/digits/standard/error.out
   StdIn=/dev/null
   StdOut=/home/zgo600/digits/standard/neat5.2653326.out
   Power=

So interestingly, the job is shown by squeue as having ID 2653325_1, but scontrol shows the ID as 2653326.
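As an aside, a minimal sketch (hypothetical class name and plain key=value splitting, not Xenon's actual parser) of how the two identifiers relate: the scontrol record carries the raw JobId together with ArrayJobId and ArrayTaskId, and the squeue-style ID is simply ArrayJobId_ArrayTaskId:

import java.util.HashMap;
import java.util.Map;

public class ScontrolIdExample {

    // Split a line of `scontrol show job` output into its key=value fields.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // First line of the scontrol output shown above.
        Map<String, String> f = parse("JobId=2653326 ArrayJobId=2653325 ArrayTaskId=1 JobName=neat5");

        // squeue reports array tasks as <ArrayJobId>_<ArrayTaskId>, while
        // scontrol's JobId is the task's own (raw) job ID.
        System.out.println(f.get("ArrayJobId") + "_" + f.get("ArrayTaskId")
                + " -> raw JobId " + f.get("JobId"));
        // prints: 2653325_1 -> raw JobId 2653326
    }
}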

@jmaassen
Member Author

jmaassen commented Aug 20, 2020

I wrote the following test:

package you;

import java.util.Arrays;

import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

public class Main {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        String[] jobIDs = scheduler.getJobs("defq");

        System.out.println("Got jobs: " + Arrays.toString(jobIDs));

        JobStatus[] s = scheduler.getJobStatuses(jobIDs);

        for (JobStatus j : s) {
            System.out.println(j.toString());
        }

        scheduler.close();
    }
}

This produces the expected output:

Got jobs: [2653318, 2653320, 2653325_1, 2653325_2, 2653325_3, 2653325_4, 2653331]
JobStatus [jobIdentifier=2653318, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=4:13:16, PARTITION=defq, TIME_LIMIT=13:54:00, USER=syg340, NODELIST(REASON)=node024, COMMENT=(null), JOBID=2653318, NAME=prun-job}]
JobStatus [jobIdentifier=2653320, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=4:00:24, PARTITION=defq, TIME_LIMIT=1-00:00:00, USER=sghiassi, NODELIST(REASON)=node007, COMMENT=(null), JOBID=2653320, NAME=bash}]
JobStatus [jobIdentifier=2653325_1, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:42, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node053, COMMENT=(null), JOBID=2653325_1, NAME=neat5}]
JobStatus [jobIdentifier=2653325_2, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:43, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node054, COMMENT=(null), JOBID=2653325_2, NAME=neat5}]
JobStatus [jobIdentifier=2653325_3, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:43, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node055, COMMENT=(null), JOBID=2653325_3, NAME=neat5}]
JobStatus [jobIdentifier=2653325_4, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:44, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node056, COMMENT=(null), JOBID=2653325_4, NAME=neat5}]
JobStatus [jobIdentifier=2653331, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=2, STATE=RUNNING, TIME=15:01, PARTITION=defq, TIME_LIMIT=30:00, USER=mprovokl, NODELIST(REASON)=node[001-002], COMMENT=(null), JOBID=2653331, NAME=pytorch-ib.job}]

@yh882317

I tried to reproduce the problem. It turns out to be caused by specifying a maximum number of simultaneously running tasks for the array. My test code is essentially the same as above:
import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

import java.util.Arrays;

public class test {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        String[] jobIDs = scheduler.getJobs("defq");

        System.out.println("Got jobs: " + Arrays.toString(jobIDs));

        for (String j : jobIDs) {
            JobStatus s = scheduler.getJobStatus(j);
            System.out.println(s.toString());
        }

        scheduler.close();
    }
}

When I submit a job array from the command line (the %2 limits the array to at most 2 tasks running simultaneously):

sbatch --array=1-4%2 -J array ./scripts/sleapme 86400

The output from squeue shows:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2653343_[3-4%2] defq array yhu310 PD 0:00 1 (JobArrayTaskLimit)
2651316 proq pretrain mda420 R 7-16:35:19 1 node078
2651437 proq real_nvp ama228 R 6-02:30:27 1 node069
2653320 defq bash sghiassi R 4:25:51 1 node007
2653325_1 defq neat5 zgo600 R 2:57:09 1 node053
2653325_2 defq neat5 zgo600 R 2:57:09 1 node054
2653325_3 defq neat5 zgo600 R 2:57:09 1 node055
2653325_4 defq neat5 zgo600 R 2:57:09 1 node056
2653330 proq prun-job mao540 R 56:22 1 node070
2653335 defq pytorch- mprovokl R 5:04 2 node[024-025]
2653343_1 defq array yhu310 R 0:01 1 node057
2653343_2 defq array yhu310 R 0:01 1 node058

An error is then raised due to the 2653343_[3-4%2] entry:

[yhu310@fs0 Test_Xenon]$ java -classpath ".:$XENON_HOME/lib/*:CLASSPATH" -Dlog4j.configuration=file:log4j.properties test
Got jobs: [2653343_[3-4%2], 2653320, 2653325_1, 2653325_2, 2653325_3, 2653325_4, 2653335, 2653343_1, 2653343_2]
14:13:32.058 [main] WARN n.e.x.a.s.slurm.SlurmScheduler - Sacct produced error output
nl.esciencecenter.xenon.XenonException: slurm adaptor: Error in getting sacct job status: CommandRunner[exitCode=1,output=,error=sacct: fatal: Bad job array element specified: 2653343]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getSacctInfo(SlurmScheduler.java:341) ~[xenon-3.1.0.jar:na]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getJobStatus(SlurmScheduler.java:380) ~[xenon-3.1.0.jar:na]
    at test.main(test.java:18) ~[Test_Xenon-1.0-SNAPSHOT.jar:na]
Exception in thread "main" nl.esciencecenter.xenon.schedulers.NoSuchJobException: slurm adaptor: Unknown Job: 2653343_[3-4%2]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getJobStatus(SlurmScheduler.java:407)
    at test.main(test.java:18)

If we instead query job 2653343 rather than 2653343_[3-4%2], by:

    for (String j : jobIDs) {
        if (j.contains("[")) {
            j = j.substring(0, j.indexOf("_"));
        }
        JobStatus s = scheduler.getJobStatus(j);
        System.out.println(s.toString());
    }

there is still an "Unknown Job" exception.
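A possible direction (just a sketch with hypothetical names, not what the Slurm adaptor currently does) would be to recognise the bracketed placeholder that squeue prints for pending array tasks and expand the range into individual task IDs, instead of truncating at the underscore:

import java.util.ArrayList;
import java.util.List;

public class ExpandArrayId {

    // Expand e.g. "2653343_[3-4%2]" into ["2653343_3", "2653343_4"].
    // Non-bracketed IDs are returned unchanged. This only handles the
    // simple "from-to" range form seen in this issue, not comma lists.
    static List<String> expand(String id) {
        List<String> result = new ArrayList<>();
        int bracket = id.indexOf("_[");
        if (bracket < 0) {
            result.add(id);
            return result;
        }
        String base = id.substring(0, bracket);
        String spec = id.substring(bracket + 2, id.indexOf(']'));
        int percent = spec.indexOf('%');   // drop the "%2" running-task limit
        if (percent >= 0) {
            spec = spec.substring(0, percent);
        }
        String[] range = spec.split("-");
        int from = Integer.parseInt(range[0]);
        int to = range.length > 1 ? Integer.parseInt(range[1]) : from;
        for (int i = from; i <= to; i++) {
            result.add(base + "_" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("2653343_[3-4%2]")); // [2653343_3, 2653343_4]
        System.out.println(expand("2653325_1"));       // [2653325_1]
    }
}

Whether sacct/scontrol on DAS5 then accept those not-yet-started task IDs would still need to be verified.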

@yh882317

For my own project, I can simply ignore the pending job 2653343_[3-4%2] and move on.
However, scontrol show job -dd 2653343 does return the information for this job and its sub-tasks.
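For completeness, a minimal sketch of that "ignore and move on" approach on the caller's side (hypothetical class name; this sidesteps the pending placeholder rather than fixing the adaptor's parsing):

import java.util.Arrays;

import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

public class SkipPendingArrays {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        // Drop the bracketed placeholders squeue prints for not-yet-started
        // array tasks (e.g. 2653343_[3-4%2]) and only query concrete IDs.
        String[] concrete = Arrays.stream(scheduler.getJobs("defq"))
                .filter(id -> !id.contains("["))
                .toArray(String[]::new);

        for (JobStatus s : scheduler.getJobStatuses(concrete)) {
            System.out.println(s);
        }

        scheduler.close();
    }
}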
