Support GPU information and job array query job status for SLURM #671
Comments
How GPUs or other 'special' node types are indicated is very scheduler- and site-specific. There is no standard way of doing this, so Xenon cannot know how it is done at each site; it is up to the sysadmin. Some sites use separate queues for GPU nodes (Cartesius, for example), while others use node properties to mark the GPU nodes (DAS5 does this). Both run SLURM but chose different ways to expose their GPU nodes, so some site-specific knowledge is needed in the application; there is no way around that. For DAS5, there is extra GPU-related information on each job. Note, however, that since the GPU nodes on DAS5 are not in a separate queue, normal CPU jobs can also be scheduled onto a GPU node. The flag therefore only says something about the job, not necessarily about the node it ends up running on.
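Since the GPU indication ends up in scheduler-specific job information rather than in a standard field, a client has to inspect it itself. Below is a minimal sketch of such a check; the key names (`gres`, `tres_per_node`) are assumptions for illustration — the actual keys depend on the SLURM version and the site configuration, not on Xenon.

```java
import java.util.Map;

public class GpuJobInfo {

    // Hypothetical key names: the real entries in the
    // scheduler-specific information map vary per SLURM
    // version and site (e.g. a GRES or TRES field).
    static boolean requestsGpu(Map<String, String> info) {
        String gres = info.getOrDefault("gres", "");
        String tres = info.getOrDefault("tres_per_node", "");
        return gres.contains("gpu") || tres.contains("gpu");
    }
}
```

A job whose info map contains an entry like `gres=gpu:2` would be flagged; a plain CPU job with neither key would not.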
For the second question, about job arrays: this seems to be an issue with the SLURM parser not properly recognizing the individual jobs produced by array jobs. I'll see if I can reproduce the problem.
I cannot seem to reproduce the issue. See #672 for my test.
Context
I am working on a project that aims to provide a user-side solution for higher resource utilization on a SLURM cluster.
It requires information on pending jobs in the queue as well as on running jobs.
Problem
The method
JobQueueScheduler.getJobStatus(jobIdentifier)
returns the JobStatus of a job. However, it only contains basic information such as the start time, the time limit, and the required number of nodes. Jobs with a GPU requirement cannot be recognized from it.
There is also a problem with querying jobs generated by a job array. The job array and its running jobs can be found with
String[] jobIDs = scheduler.getJobs(PartitionName);
However, when I try to get the status of those jobs, an error is raised saying no such jobs exist. The pending job array has an id like 1080_[5-1024], while the running jobs have ids like 1080_2. When
JobQueueScheduler.getJobStatus(jobIdentifier)
is invoked, the error is raised.
Question
Is it possible to provide information about GPUs and job arrays via the job status? After all, the JobStatus implementation maintains a map,
schedulerSpecificInformation
, so perhaps the related information could be added to that map. The job array queries also need to be fixed.