test: adding batch-cos #29
I have been able to add batch COS as suggested to run a hello world workflow, but now the original workflows are no longer running, and the log gives no error detail beyond WorkflowError to explain what is happening.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Will try to look into this again on Monday.
Thank you!🙏
Took a look at this, and was able to get logs from the container by adding |
That’s great! Heads up we are having 80-100 mph winds and they shut off power across the county so I won’t be around until maybe tomorrow evening if I’m lucky. I’ll take a look at everything then at the earliest, more likely next week.
Stay safe! No rush at all on this.
I'm back! Are you planning to rebase / do you want a review? I just saved this one notification so let me know what you need from me.
Hey! I opened #46 with my changes, rebased from main. Never actually used rebase before lol 🙃 - let me know if it looks good!
Let's try this again :) Checked out `testing-cos` locally, made my changes, and created this PR. I think this can be merged into `testing-cos` which can then be merged into main? Hope I got this right!
okay I'm trying these from scratch - first cos then the older ones (that weren't working) and fingers crossed your fix @cademirch adds more verbose error output!
@johanneskoester do you see logs now? I'm seeing an error from upstream snakemake about resources: This is with the hello-world-cos example.
hello-world was successful! That's a start :)
For hello-world-intel-mpi it seems to succeed in batch but some issue locally:
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
[Tue Apr 9 12:26:35 2024]
Error in rule compile:
message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f' exceeded deadline. For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 2
input: s3://my-snakemake-testing/pi_MPI.c (retrieve from storage)
output: s3://my-snakemake-testing/pi_MPI (send to storage)
log: s3://my-snakemake-testing/logs/compile.log (send to storage), .snakemake/googlebatch_logs/compile.log (check log file(s) for error details)
shell:
mpicc -o .snakemake/storage/s3/my-snakemake-testing/pi_MPI .snakemake/storage/s3/my-snakemake-testing/pi_MPI.c &> .snakemake/storage/s3/my-snakemake-testing/logs/compile.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f
cannot access local variable 'response' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
/home/vanessa/Desktop/Code/snek/env/lib/python3.11/site-packages/snakemake/dag.py:413: RuntimeWarning: coroutine '_IOFile.remove' was never awaited
f.remove(only_local=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Complete log: .snakemake/log/2024-04-09T122253.419016.snakemake.log
WorkflowError:
At least one job did not complete successfully.
okay I see the issue there (we need to return) will try fixing it.
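For anyone reading along, the `cannot access local variable 'response'` line in the log above is Python's `UnboundLocalError`: a variable is only assigned on some branches and then read on all of them. A minimal sketch of that pattern and the "we need to return" fix follows. `FakeJob`, `check_job_buggy`, and `check_job_fixed` are hypothetical names made up for illustration and are not the plugin's actual code.

```python
class FakeJob:
    """Hypothetical stand-in for a Batch job status object (not the plugin's type)."""

    def __init__(self, state):
        self.state = state


def check_job_buggy(job):
    # 'response' is bound only on the success branch.
    if job.state == "SUCCEEDED":
        response = "success"
    elif job.state == "FAILED":
        # The failure is noted here, but nothing is assigned and nothing returns,
        # so execution falls through to the read below.
        pass
    return response  # raises UnboundLocalError when state is FAILED


def check_job_fixed(job):
    # Returning as soon as the failure is handled avoids touching an unbound name.
    if job.state == "FAILED":
        return "error"
    if job.state == "SUCCEEDED":
        return "success"
    return "running"
```

With this shape, a failed job surfaces only the real error (here, the exceeded deadline) rather than a second, unrelated exception from the status check.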
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
okay that one worked too - going to wait for @johanneskoester on the first bug with upstream snakemake before next step.
Sounds good. I tested hello-world with batch-cos and it seems to be all good here. I can see all of Snakemake's output in the batch logs. Which upstream bug are you referring to?
I got the error about resources, this one: #29 (comment)
Ah I see. Which workflow/example did this come from?
The hello-world-cos one.
Oops you said that in the comment. Weird I'm not hitting that.
Looking at my entrypoint.sh I do have |
huh, but if it works for you that's great! Let's get @johanneskoester to try it out for another test.
I can confirm that this PR works in CI with the true API tests.
Problem: the COS container and credentials should be exposed in the executor settings. Solution: add them there.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
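As a rough illustration of what "exposed in the executor settings" could mean, here is a minimal sketch that assumes the settings are a dataclass whose fields carry help text. The class name, field names, and metadata keys below are all hypothetical and are not taken from the plugin.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class CosSettingsSketch:
    # All field names are hypothetical placeholders, not the plugin's real options.
    container: Optional[str] = field(
        default=None,
        metadata={"help": "Container image to run on the COS instance."},
    )
    registry_username: Optional[str] = field(
        default=None,
        metadata={"help": "Username for a private container registry."},
    )
    registry_password: Optional[str] = field(
        default=None,
        metadata={"help": "Password for a private container registry."},
    )


def describe(settings_cls):
    """Map each setting name to its help text, e.g. to render CLI help."""
    return {f.name: f.metadata["help"] for f in fields(settings_cls)}
```

Keeping the container and credentials as optional fields with `None` defaults means existing workflows that do not use COS would not need to set anything new.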
Ping @johanneskoester can you review again?
I have started a final run with the true API CI: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/8796288376
Great work!
@johanneskoester I'm going to bed, but if you see some hint about the error in the cloud logs that would help me to debug. Goodnight!
Works again! That was a bug in Snakemake that I fixed yesterday.