Problems with parsing arrayjobs on DAS5 #672

Open
jmaassen opened this issue Aug 20, 2020 · 4 comments

@jmaassen
Member
As reported in #671

@jmaassen jmaassen self-assigned this Aug 20, 2020
@jmaassen
Member Author

On DAS5 we have a bunch of array jobs running at the moment.

Output of squeue --format="%i %P %j %u %T %M %l %D %R %k" is:

JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON) COMMENT  
2651316 proq pretrain_small_reduced_nop3.sh mda420 RUNNING 7-15:37:35 8-08:00:00 1 node078 (null)  
2651437 proq real_nvp ama228 RUNNING 6-01:32:43 20-20:00:00 1 node069 (null)  
2653318 defq prun-job syg340 RUNNING 3:41:00 13:54:00 1 node024 (null)  
2653320 defq bash sghiassi RUNNING 3:28:07 1-00:00:00 1 node007 (null)  
2653325_1 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node053 (null)  
2653325_2 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node054 (null)  
2653325_3 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node055 (null)  
2653325_4 defq neat5 zgo600 RUNNING 1:59:25 10:00:00 1 node056 (null)  
2653329 defq pytorch-ib.job mprovokl RUNNING 2:43 20:00 2 node[001-002] (null)  

Output of scontrol show job 2653325_1 is:

JobId=2653326 ArrayJobId=2653325 ArrayTaskId=1 JobName=neat5
   UserId=zgo600(2422) GroupId=zgo600(2422) MCS_label=N/A
   Priority=4294039317 Nice=0 Account=zgo600 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:59:02 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2020-08-20T11:16:15 EligibleTime=2020-08-20T11:16:15
   StartTime=2020-08-20T11:16:15 EndTime=2020-08-20T21:16:15 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=fs0:25488
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node053
   BatchHost=node053
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/zgo600/digits/standard/runstandard.slurm
   WorkDir=/home/zgo600/digits/standard
   StdErr=/home/zgo600/digits/standard/error.out
   StdIn=/dev/null
   StdOut=/home/zgo600/digits/standard/neat5.2653326.out
   Power=

So interestingly, the job is shown by squeue as having ID 2653325_1, but scontrol shows the ID as 2653326.
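As an aside, a minimal sketch (hypothetical class name and plain key=value splitting, not Xenon's actual parser) of how the two identifiers relate: the scontrol record carries the raw JobId together with ArrayJobId and ArrayTaskId, and the squeue-style ID is simply ArrayJobId_ArrayTaskId:

import java.util.HashMap;
import java.util.Map;

public class ScontrolIdExample {

    // Split a line of `scontrol show job` output into its key=value fields.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // First line of the scontrol output shown above.
        Map<String, String> f = parse("JobId=2653326 ArrayJobId=2653325 ArrayTaskId=1 JobName=neat5");

        // squeue reports array tasks as <ArrayJobId>_<ArrayTaskId>, while
        // scontrol's JobId is the task's own (raw) job ID.
        System.out.println(f.get("ArrayJobId") + "_" + f.get("ArrayTaskId")
                + " -> raw JobId " + f.get("JobId"));
        // prints: 2653325_1 -> raw JobId 2653326
    }
}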

@jmaassen
Member Author

jmaassen commented Aug 20, 2020

I wrote the following test:

package you;

import java.util.Arrays;

import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

public class Main {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        String[] jobIDs = scheduler.getJobs("defq");

        System.out.println("Got jobs: " + Arrays.toString(jobIDs));

        JobStatus[] s = scheduler.getJobStatuses(jobIDs);

        for (JobStatus j : s) {
            System.out.println(j.toString());
        }

        scheduler.close();
    }
}

This produces the expected output:

Got jobs: [2653318, 2653320, 2653325_1, 2653325_2, 2653325_3, 2653325_4, 2653331]
JobStatus [jobIdentifier=2653318, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=4:13:16, PARTITION=defq, TIME_LIMIT=13:54:00, USER=syg340, NODELIST(REASON)=node024, COMMENT=(null), JOBID=2653318, NAME=prun-job}]
JobStatus [jobIdentifier=2653320, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=4:00:24, PARTITION=defq, TIME_LIMIT=1-00:00:00, USER=sghiassi, NODELIST(REASON)=node007, COMMENT=(null), JOBID=2653320, NAME=bash}]
JobStatus [jobIdentifier=2653325_1, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:42, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node053, COMMENT=(null), JOBID=2653325_1, NAME=neat5}]
JobStatus [jobIdentifier=2653325_2, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:43, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node054, COMMENT=(null), JOBID=2653325_2, NAME=neat5}]
JobStatus [jobIdentifier=2653325_3, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:43, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node055, COMMENT=(null), JOBID=2653325_3, NAME=neat5}]
JobStatus [jobIdentifier=2653325_4, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=1, STATE=RUNNING, TIME=2:31:44, PARTITION=defq, TIME_LIMIT=10:00:00, USER=zgo600, NODELIST(REASON)=node056, COMMENT=(null), JOBID=2653325_4, NAME=neat5}]
JobStatus [jobIdentifier=2653331, state=RUNNING, exitCode=null, exception=null, running=true, done=false, schedulerSpecificInformation={NODES=2, STATE=RUNNING, TIME=15:01, PARTITION=defq, TIME_LIMIT=30:00, USER=mprovokl, NODELIST(REASON)=node[001-002], COMMENT=(null), JOBID=2653331, NAME=pytorch-ib.job}]

@yh882317

I tried to reproduce the problem. It turns out to be caused by specifying a maximum number of simultaneously running tasks for the array. My test code is essentially the same as above:
import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

import java.util.Arrays;

public class test {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        String[] jobIDs = scheduler.getJobs("defq");

        System.out.println("Got jobs: " + Arrays.toString(jobIDs));

        for (String j : jobIDs) {
            JobStatus s = scheduler.getJobStatus(j);
            System.out.println(s.toString());
        }

        scheduler.close();
    }
}

When I submit a job array from the command line (the %2 limits the array to at most 2 tasks running simultaneously):

sbatch --array=1-4%2 -J array ./scripts/sleapme 86400

The output from squeue shows:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2653343_[3-4%2] defq array yhu310 PD 0:00 1 (JobArrayTaskLimit)
2651316 proq pretrain mda420 R 7-16:35:19 1 node078
2651437 proq real_nvp ama228 R 6-02:30:27 1 node069
2653320 defq bash sghiassi R 4:25:51 1 node007
2653325_1 defq neat5 zgo600 R 2:57:09 1 node053
2653325_2 defq neat5 zgo600 R 2:57:09 1 node054
2653325_3 defq neat5 zgo600 R 2:57:09 1 node055
2653325_4 defq neat5 zgo600 R 2:57:09 1 node056
2653330 proq prun-job mao540 R 56:22 1 node070
2653335 defq pytorch- mprovokl R 5:04 2 node[024-025]
2653343_1 defq array yhu310 R 0:01 1 node057
2653343_2 defq array yhu310 R 0:01 1 node058

An error is then raised due to the 2653343_[3-4%2] entry:

[yhu310@fs0 Test_Xenon]$ java -classpath ".:$XENON_HOME/lib/*:CLASSPATH" -Dlog4j.configuration=file:log4j.properties test
Got jobs: [2653343_[3-4%2], 2653320, 2653325_1, 2653325_2, 2653325_3, 2653325_4, 2653335, 2653343_1, 2653343_2]
14:13:32.058 [main] WARN n.e.x.a.s.slurm.SlurmScheduler - Sacct produced error output
nl.esciencecenter.xenon.XenonException: slurm adaptor: Error in getting sacct job status: CommandRunner[exitCode=1,output=,error=sacct: fatal: Bad job array element specified: 2653343]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getSacctInfo(SlurmScheduler.java:341) ~[xenon-3.1.0.jar:na]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getJobStatus(SlurmScheduler.java:380) ~[xenon-3.1.0.jar:na]
    at test.main(test.java:18) ~[Test_Xenon-1.0-SNAPSHOT.jar:na]
Exception in thread "main" nl.esciencecenter.xenon.schedulers.NoSuchJobException: slurm adaptor: Unknown Job: 2653343_[3-4%2]
    at nl.esciencecenter.xenon.adaptors.schedulers.slurm.SlurmScheduler.getJobStatus(SlurmScheduler.java:407)
    at test.main(test.java:18)

If we instead query job 2653343 rather than 2653343_[3-4%2], by:

    for (String j : jobIDs) {
        if (j.contains("[")) {
            j = j.substring(0, j.indexOf("_"));
        }
        JobStatus s = scheduler.getJobStatus(j);
        System.out.println(s.toString());
    }

there is still an "Unknown Job" exception.
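A possible direction (just a sketch with hypothetical names, not what the Slurm adaptor currently does) would be to recognise the bracketed placeholder that squeue prints for pending array tasks and expand the range into individual task IDs, instead of truncating at the underscore:

import java.util.ArrayList;
import java.util.List;

public class ExpandArrayId {

    // Expand e.g. "2653343_[3-4%2]" into ["2653343_3", "2653343_4"].
    // Non-bracketed IDs are returned unchanged. This only handles the
    // simple "from-to" range form seen in this issue, not comma lists.
    static List<String> expand(String id) {
        List<String> result = new ArrayList<>();
        int bracket = id.indexOf("_[");
        if (bracket < 0) {
            result.add(id);
            return result;
        }
        String base = id.substring(0, bracket);
        String spec = id.substring(bracket + 2, id.indexOf(']'));
        int percent = spec.indexOf('%');   // drop the "%2" running-task limit
        if (percent >= 0) {
            spec = spec.substring(0, percent);
        }
        String[] range = spec.split("-");
        int from = Integer.parseInt(range[0]);
        int to = range.length > 1 ? Integer.parseInt(range[1]) : from;
        for (int i = from; i <= to; i++) {
            result.add(base + "_" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("2653343_[3-4%2]")); // [2653343_3, 2653343_4]
        System.out.println(expand("2653325_1"));       // [2653325_1]
    }
}

Whether sacct/scontrol on DAS5 then accept those not-yet-started task IDs would still need to be verified.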

@yh882317

For my own project, I can simply ignore the pending job 2653343_[3-4%2] and move on.
However, scontrol show job -dd 2653343 does return the information for this job and its sub-tasks.
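For completeness, a minimal sketch of that "ignore and move on" approach on the caller's side (hypothetical class name; this sidesteps the pending placeholder rather than fixing the adaptor's parsing):

import java.util.Arrays;

import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

public class SkipPendingArrays {

    public static void main(String[] args) throws Exception {

        Scheduler scheduler = Scheduler.create("slurm", "ssh://fs0.das5.cs.vu.nl");

        // Drop the bracketed placeholders squeue prints for not-yet-started
        // array tasks (e.g. 2653343_[3-4%2]) and only query concrete IDs.
        String[] concrete = Arrays.stream(scheduler.getJobs("defq"))
                .filter(id -> !id.contains("["))
                .toArray(String[]::new);

        for (JobStatus s : scheduler.getJobStatuses(concrete)) {
            System.out.println(s);
        }

        scheduler.close();
    }
}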
