any way to help when sbatch/srun exit immediately because of invalid submission #2

Open
paciorek opened this issue Oct 15, 2020 · 7 comments

@paciorek
Collaborator

Here's a standard use case that is not amenable to sq, because the job exits immediately.

@nicolaschan any ideas of how we might help users in such cases?

[paciorek@ln002 ~]$ srun --pty -A ac_scsguest -p savio2_gpu --gres=gpu:1 -t 00:00:30  bash -i
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

This case is missing -c 2.

@nicolaschan
Collaborator

Incorrect job submission parameters are definitely an issue in general, but sq won't be able to catch the ones that don't make it into the queue. In the example you provided, I don't think the job gets assigned a job ID so there's no way for sq to see it by checking Slurm.

That being said, I have a couple of ideas:

  • Somehow read the last few commands a user ran and identify issues
  • Create a wrapper script for srun

For the srun wrapper, we would have to put it on the PATH in place of the normal srun. To read the last commands, we'd probably have to add a hook to each user's ~/.bashrc. Both of these would be significant changes, but could be worth it if they help a lot of users. I can bring this up with Krishna next week and see what he thinks.
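
As a rough illustration of the wrapper idea, here is a minimal Python sketch. It is not an existing tool: `HINTS`, `find_hints`, and `run_with_hints` are hypothetical names, and the only error pattern comes from the srun message quoted earlier in this thread. The advice text assumes the "-c 2" fix mentioned above applies to savio2_gpu.

```python
import subprocess
import sys

# Hypothetical hint table: substring of Slurm's stderr -> advice.
# The pattern is the srun error quoted in this issue; the advice is an
# assumption based on the "-c 2" fix mentioned for savio2_gpu.
HINTS = {
    "Invalid generic resource (gres) specification":
        "Check --gres; on savio2_gpu a GPU request may also need -c 2.",
}


def find_hints(stderr_text, hints=HINTS):
    """Return the advice for every known error pattern found in stderr_text."""
    return [advice for pattern, advice in hints.items() if pattern in stderr_text]


def run_with_hints(argv, hints=HINTS):
    """Run argv, replay its stderr, and print hints if it exited non-zero.

    stderr is captured and replayed after the command ends, which is fine
    for submissions that fail immediately but delays error output for
    long-running interactive jobs.
    """
    result = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    sys.stderr.write(result.stderr)
    if result.returncode != 0:
        for advice in find_hints(result.stderr, hints):
            print(f"hint: {advice}", file=sys.stderr)
    return result.returncode
```

A script named srun placed earlier on the PATH could call `run_with_hints([real_srun] + sys.argv[1:])`, where `real_srun` is the site-specific path to the actual binary.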

@paciorek
Collaborator Author

I used srun only as an example; the same question arises for sbatch.

@nicolaschan
Collaborator

Right, so there could be an sbatch wrapper, or some other way to analyze commands for common mistakes. I think the most general solution might be to hook some sort of "mistake analyzer" program into the ~/.bashrc. If it identifies common mistakes like the ones you mentioned, it can give advice.

@paciorek
Collaborator Author

Yeah, I like the "mistake analyzer" idea, though not sure what the interface would be exactly.

@nicolaschan
Collaborator

nicolaschan commented Oct 16, 2020

I'm imagining it just hooks into the shell so then immediately after srun fails it outputs some helpful text. Maybe it suggests the correct command (adding -c 2) and then you just press y/n to accept or decline.
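
A minimal Python sketch of that suggest-and-confirm step, under stated assumptions: `suggest_fix`, `has_cpus_per_task`, and `confirm` are hypothetical helpers, and the only rule encoded is the one from this thread (a `--gres=gpu` request without `-c`). Whether `-c 2` is the right value depends on the partition.

```python
import shlex


def has_cpus_per_task(args):
    """True if args already sets -c / --cpus-per-task.

    Simplification: this does not separate srun's own options from those
    of the command being launched (for example `bash -c ...`).
    """
    return any(
        a == "-c" or (a.startswith("-c") and a[2:].isdigit())
        or a.startswith("--cpus-per-task")
        for a in args
    )


def suggest_fix(command, cpus=2):
    """Return a corrected command line, or None if no known mistake is found.

    Only one rule is checked (an assumption taken from this issue):
    a --gres=gpu request that does not also set the CPU count.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ("srun", "sbatch"):
        return None
    wants_gpu = any(t.startswith("--gres=gpu") for t in tokens)
    if wants_gpu and not has_cpus_per_task(tokens[1:]):
        return shlex.join([tokens[0], "-c", str(cpus)] + tokens[1:])
    return None


def confirm(prompt, ask=input):
    """Ask a y/n question; `ask` is injectable so the prompt can be tested."""
    return ask(prompt + " [y/n] ").strip().lower() in ("y", "yes")
```

The shell hook would call `suggest_fix` on the failed command line and, if it returns a suggestion, run it only when `confirm` returns True.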

@paciorek
Collaborator Author

Even for jobs that get queued, if they violate a constraint, ideally we would warn the user upon submission rather than wait for them to wonder why the job isn't starting.

Here are some examples from running squeue today of jobs that are just sitting there because of too many cores for savio_long, too long a time limit, and too many CPUs for a given condo.

    6673582  savio2_ht  end188a              kvegesna  PD   7-00:00:00  0:00   24  QOSMaxCp   2  (QOSMaxCpuPerJobLimit)           0.00023329630500  savio_long
    6746618  savio      negpos2free-100-200  mpeyro    PD   3-18:00:00  0:00    1  QOSMaxWa   1  (QOSMaxWallDurationPerJobLimit)  0.00023329630500  savio_normal
    6746622  savio      negpos2free-100-200  mpeyro    PD   3-08:00:00  0:00    1  QOSMaxWa   1  (QOSMaxWallDurationPerJobLimit)  0.00023329630500  savio_normal
    6920018  savio      0.85_5050_30.6nlte   kenshen   PD  41-16:00:00  0:00  320  QOSGrpNo  16  (QOSGrpNodeLimit)                0.00023286743095  astro_savio_normal

Related to your mistake analyzer idea but would presumably need to be triggered in a different way.
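
One possible shape for such a check, as a Python sketch rather than a description of how sq works: ask squeue for pending jobs with their reason codes and map the reasons listed above to plain-language advice. The names `ADVICE`, `parse_pending`, `warnings_for`, and `pending_text` are hypothetical, and the advice wording is my own paraphrase of the reason codes.

```python
import subprocess

# Reason codes taken from the squeue output above; the wording is a paraphrase.
ADVICE = {
    "QOSMaxCpuPerJobLimit":
        "requests more CPUs than the QOS allows for a single job",
    "QOSMaxWallDurationPerJobLimit":
        "time limit is longer than the QOS allows for a single job",
    "QOSGrpNodeLimit":
        "is held by the group node limit of its QOS",
}


def parse_pending(squeue_text):
    """Parse 'jobid|user|reason' lines into (job_id, user, reason) tuples."""
    jobs = []
    for line in squeue_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3:
            jobs.append(tuple(parts))
    return jobs


def warnings_for(squeue_text, advice=ADVICE):
    """Return one warning per pending job whose reason has known advice."""
    return [
        f"job {job_id} ({user}): {advice[reason]}"
        for job_id, user, reason in parse_pending(squeue_text)
        if reason in advice
    ]


def pending_text(user):
    """Query squeue for a user's pending jobs: no header, id|user|reason."""
    return subprocess.run(
        ["squeue", "-h", "-t", "PD", "-u", user, "-o", "%i|%u|%r"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `warnings_for(pending_text(user))` right after submission, or from a periodic check, would surface these cases without the user having to read the raw squeue reason column.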

nicolaschan added a commit that referenced this issue Oct 20, 2020
@nicolaschan
Collaborator

Great suggestion. sq now identifies and warns about these issues:

  • QOSMaxWallDurationPerJobLimit
  • QOSMaxCpuPerJobLimit
  • QOSMaxNodePerJobLimit

[Image: Screenshot_20201020_154544]
