
Snakemake hangs with --slurm and doesn't recognize either failed or completed jobs #2496

Closed
chrarnold opened this issue Oct 26, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@chrarnold
Contributor

chrarnold commented Oct 26, 2023

Snakemake version
7.32.3

Describe the bug
Snakemake hangs in --slurm mode when used with a profile and some fairly standard options (which used to work before). This affects ALL jobs, regardless of return status (both failed and successfully completed ones).

slurm: True
jobs: 100
max-jobs-per-second: 10
max-status-checks-per-second: 10
show-failed-logs: True

The only thing I've really changed is adding the slurm_extra option, but why would this cause a hang?

set-resources:
  fastqc_RNA:
    mem_mb: 10000
    runtime: "2h"
    slurm_extra: "--job-name=RNA.fastqc_RNA"

What could be the reason for the hang?

Here is the SLURM information for one of the jobs after it finished without error:

JobId=41546021 JobName=RNA.STARsolo
   UserId=bla(22069) GroupId=bla(668) MCS_label=N/A
   Priority=3275 Nice=0 Account=bla QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:14:18 TimeLimit=05:00:00 TimeMin=N/A
   SubmitTime=2023-10-26T09:14:35 EligibleTime=2023-10-26T09:14:35
   AccrueTime=2023-10-26T09:14:35
   StartTime=2023-10-26T09:18:04 EndTime=2023-10-26T09:32:22 Deadline=N/A
   PreemptEligibleTime=2023-10-26T09:18:04 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-10-26T09:18:04 Scheduler=Backfill
   Partition=htc-el8 AllocNode:Sid=0.0.0.0:2614267
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=smem06-3
   BatchHost=smem06-3
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=40000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=40000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=bla//src/workflow
   StdErr=bla//src/workflow/.snakemake/slurm_logs/rule_STARsolo/41546021.log
   StdIn=/dev/null
   StdOut=bla/src/workflow/.snakemake/slurm_logs/rule_STARsolo/41546021.log
   Power=

I don't see anything unusual here.

@chrarnold chrarnold added the bug Something isn't working label Oct 26, 2023
@chrarnold
Contributor Author

I can now confirm that adding slurm_extra: "--job-name=kneeplot_cell_identification" to the set-resources section of the profile reproducibly causes Snakemake to hang: job status is no longer recognized, neither for failing nor for successful jobs. No idea why, though. This should be addressed @johanneskoester

@conchoecia
Contributor

Possibly related to: #2739 ^^

@cmeesters
Contributor

cmeesters commented May 8, 2024

@conchoecia thanks for bringing this to my attention. I am sorry, but there are too many issues, and the code for the executor plugin has moved to its own repo.

As noted in its documentation, the plugin relies on the job name to query the status of a workflow's entire job group in one call, to minimize the query overhead. Job comments are used to make the rule name (and soon wildcards) visible to users during execution.

So, overwriting the job name defeats this mechanism and causes the plugin to fail. I will add a safeguard as soon as possible.
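To illustrate the mechanism described above: the idea is that every job of a run shares one job name, so a single accounting query (roughly `sacct -X --noheader --parsable2 --format=JobID,State --name <run_name>`) returns the state of the whole group. The sketch below is a hypothetical illustration, not the plugin's actual code; the function name and the testing hook are made up for this example. If `--job-name` is overridden via `slurm_extra`, such a query matches nothing, and the jobs appear to never finish.

```python
import subprocess

def group_job_states(run_name, sacct_output=None):
    """Return {job_id: state} for all SLURM jobs named run_name.

    Hypothetical sketch of a grouped status query. If sacct_output is
    given (for testing), it is parsed instead of calling sacct.
    """
    if sacct_output is None:
        # One accounting call covers every job of the run, because they
        # all share the same job name.
        sacct_output = subprocess.run(
            ["sacct", "-X", "--noheader", "--parsable2",
             "--format=JobID,State", "--name", run_name],
            capture_output=True, text=True, check=True,
        ).stdout
    states = {}
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        job_id, state = line.split("|", 1)
        # Keep only the first word, e.g. "CANCELLED by 22069" -> "CANCELLED"
        states[job_id] = state.split()[0]
    return states
```

With a job name overridden by `slurm_extra`, the equivalent query would return an empty mapping, which matches the hang reported here.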
