
Support GPU information and job array query job status for SLURM #671

Open

yh882317 opened this issue Aug 19, 2020 · 3 comments

@yh882317

Context

I am working on a project which aims to provide a user side solution for higher resource utilization on SLURM cluster.
It requires information on pending jobs in the queue and running jobs.

Problem

The interface method JobQueueScheduler.getJobStatus(jobIdentifier) returns the JobStatus of a job.
However, it only contains basic information such as the start time, the time limit, and the required number of nodes. Jobs with a GPU requirement cannot be recognized as such.
There is also a problem with querying jobs generated by a job array. The job array and the running jobs can be found with String[] jobIDs = scheduler.getJobs(PartitionName);. However, when I try to get the status of those jobs, an error is raised saying there is no such job. The pending job array has an ID like 1080_[5-1024], while the running jobs have IDs like 1080_2.
When JobQueueScheduler.getJobStatus(jobIdentifier) is invoked on these identifiers, the error is raised.
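
A minimal sketch of the call pattern described above, assuming a Xenon SLURM Scheduler has already been created. The package and class names (nl.esciencecenter.xenon.schedulers.Scheduler, JobStatus, XenonException) follow Xenon 3.x and are an assumption on my part, not taken from this report:

```java
import nl.esciencecenter.xenon.XenonException;
import nl.esciencecenter.xenon.schedulers.JobStatus;
import nl.esciencecenter.xenon.schedulers.Scheduler;

public class ArrayJobStatusExample {

    // Lists the jobs in a partition and queries the status of each one.
    public static void printStatuses(Scheduler scheduler, String partitionName) throws XenonException {
        // Returns both the running array tasks (e.g. "1080_2") and the
        // pending array entry (e.g. "1080_[5-1024]").
        String[] jobIDs = scheduler.getJobs(partitionName);

        for (String jobID : jobIDs) {
            try {
                JobStatus status = scheduler.getJobStatus(jobID);
                System.out.println(jobID + " -> " + status.getState());
            } catch (XenonException e) {
                // For the pending array entry this reportedly fails with a
                // "no such job" error.
                System.out.println(jobID + " -> status query failed: " + e.getMessage());
            }
        }
    }
}
```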

Question

Is it possible to provide information about GPUs and job arrays via the job status? After all, the implementation of JobStatus maintains a map, schedulerSpecificInformation; perhaps the related information can be added to this map. The job array queries also need to be fixed.

@jmaassen
Member

How GPUs or other 'special' types of nodes are indicated is very scheduler- and site-configuration specific. There is no way for Xenon to know how this is done on each site, as there is no standard way of doing it; it is up to the sysadmin.

Some sites use separate queues for GPU nodes (Cartesius, for example), while others use node properties to indicate the GPU nodes (DAS5 does this). Both use SLURM, but chose different ways to expose their GPU nodes.

So some knowledge from the user is needed in the application to solve this problem; there is no way around that. For DAS5, there is extra information on each job in JobStatus.getSchedulerSpecificInformation(), which may contain the extra scheduler-specific options for the job. These may also include the GPU-specific flags, such as -C gpunode or -C GTX980 (there are many different flags for GPU nodes).

Note, however, that since the GPU nodes are not in a special queue on DAS5, normal CPU jobs can also be scheduled onto a GPU node. So the flag only says something about the job, not necessarily about the node it is running on.
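
As an illustration of how site-specific this is, here is a hedged sketch of scanning JobStatus.getSchedulerSpecificInformation() for GPU-related hints. It assumes the method returns a Map<String, String> (as in Xenon 3.x) and that a "gpu" substring in the stored options is a useful heuristic, which only holds for setups like DAS5:

```java
import java.util.Map;

import nl.esciencecenter.xenon.schedulers.JobStatus;

public class GpuHintExample {

    // Heuristic check for GPU-related flags in the scheduler-specific
    // information of a job; keys and values are adaptor- and site-specific.
    public static boolean looksLikeGpuJob(JobStatus status) {
        Map<String, String> info = status.getSchedulerSpecificInformation();
        if (info == null) {
            return false;
        }
        for (String value : info.values()) {
            // Catches flags containing "gpu", such as "-C gpunode"; a
            // site-specific constraint like "-C GTX980" would need its own check.
            if (value != null && value.toLowerCase().contains("gpu")) {
                return true;
            }
        }
        return false;
    }
}
```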

@jmaassen
Member

For the second question about job arrays: this seems to be an issue with the SLURM parser not properly recognizing the individual jobs produced by array jobs. I'll see if I can reproduce this problem.
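
To make the two identifier forms concrete, here is a hypothetical check (not Xenon's actual SLURM parser) distinguishing a pending array range from an individual array task:

```java
public class SlurmArrayIdExample {

    // Matches a pending array entry such as "1080_[5-1024]".
    public static boolean isPendingArrayRange(String jobId) {
        return jobId.matches("\\d+_\\[\\d+-\\d+\\]");
    }

    // Matches an individual array task such as "1080_2".
    public static boolean isArrayTask(String jobId) {
        return jobId.matches("\\d+_\\d+");
    }

    public static void main(String[] args) {
        System.out.println(isPendingArrayRange("1080_[5-1024]")); // true
        System.out.println(isArrayTask("1080_2"));                // true
    }
}
```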

@jmaassen
Member

I cannot seem to reproduce the issue. See #672 for my test.

@jmaassen jmaassen changed the title Support GPU information and job array query job status for SLRUM Support GPU information and job array query job status for SLURM Jan 18, 2021