walltime option not effective for resubmitted job #1455

Closed

hsun3163 opened this issue Feb 16, 2022 · 5 comments

@hsun3163
hsun3163 commented Feb 16, 2022

After the old job was killed for exceeding the walltime, the PEER command was rerun with the option --walltime 24h.

However, this job was killed again after five hours. Checking qacct shows that the 24h walltime was not registered. This may be due to a caching problem; however, removing ~/.sos every time something needs to be rerun is unrealistic, as it would interrupt all jobs that are currently running.
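(A narrower workaround than wiping all of ~/.sos might be to purge only the cached copy of the affected task before resubmitting, assuming the sos purge subcommand in this version accepts task IDs — worth confirming with sos purge -h first:)

# assumption: purge only the stale cached task so it is regenerated with the new walltime
sos purge t1070a54c8a5bf7d7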

hs3163@csglogin:~$ qacct -j 2253236
==============================================================
qname        csg.q
hostname     node55
group        hs3163
owner        hs3163
project      NONE
department   defaultdepartment
jobname      job_t1070a54c8a5bf7d7
jobnumber    2253236
taskid       undefined
account      sge
priority     0
qsub_time    Wed Feb 16 09:37:18 2022
start_time   Wed Feb 16 09:37:33 2022
end_time     Wed Feb 16 14:37:34 2022
granted_pe   orte
slots        8
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status  137                  (Killed)
ru_wallclock 18001s
ru_utime     0.080s
ru_stime     0.020s
ru_maxrss    5.738KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    5691
ru_majflt    0
ru_nswap     0
ru_inblock   56
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     3005
ru_nivcsw    18
cpu          17933.000s
mem          38.852TBs
io           131.046MB
iow          0.000s
maxvmem      3.118GB
arid         undefined
ar_sub_time  undefined
category     -U statg-users -u hs3163 -q csg.q -l h_rt=18000,h_vmem=16G -pe orte 8
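The category line shows the limit that was actually enforced: h_rt=18000 seconds, i.e. 5 hours, not the requested 24 hours. A quick way to pull that field out of the accounting record (plain grep, nothing SoS-specific):

qacct -j 2253236 | grep -o 'h_rt=[0-9]*'
# prints h_rt=18000  (18000 s = 5 h, so the --walltime 24h request never reached the scheduler)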

The log of the job failure and resubmission is:

INFO: Running PEER:
INFO: t1070a54c8a5bf7d7 already runnng
INFO: Waiting for the completion of 1 task.
WARNING: Task t1070a54c8a5bf7d7 inactive for more than 33440 seconds, might have been killed.
INFO: t1070a54c8a5bf7d7 submitted to csg with job id 2253236
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.

The log of the second job failure is:

INFO: Waiting for the completion of 1 task.
WARNING: Task t1070a54c8a5bf7d7 inactive for more than 85 seconds, might have been killed.
ERROR: [PEER]: [t1070a54c8a5bf7d7]: No result is received for task t1070a54c8a5bf7d7

The command I used is:

nohup sos run pipeline/PEER_factor.ipynb PEER     --cwd Ast/covariate/     --phenoFile data/phenotype_data/Ast.log2cpm.bed.gz     --covFile  /mnt/mfs/statgen/snuc_pseudo_bulk/Ast/covariate/Ast.covariate.pca.cov.gz --name "Ast"     --container ./containers/xqtl_pipeline_sif/PEER.sif -J 1 -c csg.yml -q csg --walltime 24h

The only difference between the commands of the two submissions is --walltime 24h.

@BoPeng
Contributor

BoPeng commented Feb 16, 2022

Yes, the problem is likely caused by caching, because changing only the walltime does not change the content of the task, and therefore not the ID of the task, so the locally re-generated task is not sent to the cluster.
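In other words (a minimal illustrative sketch, not SoS's actual code — the script and parameter names below are made up): if the task ID is a hash of the task content only, runtime options such as walltime are not part of it, so the resubmission maps to the same ID and the cached task with the old h_rt is reused.

# minimal sketch, not SoS's implementation: an ID derived from task content only
# ignores runtime options, so changing --walltime alone cannot change the ID
import hashlib

def task_id(script, params):
    # hash the script text and its parameters; walltime is deliberately absent
    key = script + repr(sorted(params.items()))
    return "t" + hashlib.md5(key.encode()).hexdigest()[:16]

first = task_id("peer_factor.R", {"phenoFile": "Ast.log2cpm.bed.gz"})   # run killed at 5 h
second = task_id("peer_factor.R", {"phenoFile": "Ast.log2cpm.bed.gz"})  # rerun with --walltime 24h
assert first == second  # same ID, so the cached task (with the old h_rt) is picked up again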

I thought this problem had been fixed, but maybe it was not, or the problem is caused by something else. Before I try to reproduce it, could you check your versions of sos and sos-pbs and make sure that you are using the most recent releases?

BoPeng self-assigned this Feb 16, 2022
@hsun3163
Author

> Yes, the problem is likely caused by caching, because changing only the walltime does not change the content of the task, and therefore not the ID of the task, so the locally re-generated task is not sent to the cluster.
>
> I thought this problem had been fixed, but maybe it was not, or the problem is caused by something else. Before I try to reproduce it, could you check your versions of sos and sos-pbs and make sure that you are using the most recent releases?

The sos version is 0.22.5; I am not sure how to check the version of sos-pbs, though.

@BoPeng
Contributor

BoPeng commented Feb 16, 2022

pip list | grep sos, or conda list.
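As runnable commands (the grep on the conda listing is only added here to narrow the output):

pip list | grep sos
# or, for a conda-managed environment
conda list | grep sos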

@hsun3163
Author

> pip list | grep sos, or conda list.

The sos-pbs version is 0.20.8.

@BoPeng
Contributor

BoPeng commented Feb 17, 2022

It was fixed in 2af3927, but somehow I never made a new release in a year.

I have just released sos 0.22.6. Please re-open the ticket if the issue persists.
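(To pick up the fix, upgrading both packages should be enough, assuming a pip-based install:)

pip install -U sos sos-pbs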
