test: adding batch-cos #29

Merged
merged 5 commits into from Apr 25, 2024
vsoch
Collaborator

@vsoch vsoch commented Feb 29, 2024

I was able to add batch COS as suggested and run a hello world workflow, but the original workflows no longer run, and the log shows nothing beyond WorkflowError, which is not enough to understand what is happening.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@johanneskoester
Collaborator

Will try to look into this again on Monday.

@vsoch
Collaborator Author

vsoch commented Mar 10, 2024

Thank you!🙏

@cademirch
Contributor

Took a look at this and was able to get logs from the container by adding `-e PYTHONUNBUFFERED=1` to the runnable container options. I can open a new PR from my fork with this if preferred.
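
For readers following along, here is a minimal sketch of what that environment variable changes. It is not the plugin's code: the helper name `unbuffered_flag` is made up for illustration, and it only shows that a child Python process started with `PYTHONUNBUFFERED=1` reports `sys.flags.unbuffered` as 1, which is the mode in which output is written straight through instead of sitting in a buffer.

```python
import os
import subprocess
import sys


def unbuffered_flag(set_env: bool) -> int:
    """Start a child interpreter and return its sys.flags.unbuffered value.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    # Start from a copy of the environment without the variable,
    # so the result does not depend on how the parent was launched.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    if set_env:
        env["PYTHONUNBUFFERED"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.unbuffered)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("with PYTHONUNBUFFERED=1:", unbuffered_flag(True))
    print("without it:", unbuffered_flag(False))
```

In a container, the `-e` option presumably sets the same variable for the Snakemake process, so its output reaches the batch logs as it is produced.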

@vsoch
Collaborator Author

vsoch commented Apr 6, 2024

That’s great! Heads up: we are having 80-100 mph winds and they shut off power across the county, so I won’t be around until maybe tomorrow evening if I’m lucky. I’ll take a look at everything then at the earliest, more likely next week.

@cademirch
Contributor

Stay safe! No rush at all on this.

@vsoch
Collaborator Author

vsoch commented Apr 8, 2024

I'm back! Are you planning to rebase / do you want a review? I just saved this one notification so let me know what you need from me.

@cademirch
Contributor

Hey! I opened #46 with my changes, rebased onto main. Never actually used rebase before lol 🙃 - let me know if it looks good!

cademirch and others added 2 commits April 9, 2024 11:30
Let's try this again :)

Checked out `testing-cos` locally, made my changes, and created this PR.
I think this can be merged into `testing-cos` which can then be merged
into main? Hope I got this right!
@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

okay I'm trying these from scratch - first cos then the older ones (that weren't working) and fingers crossed your fix @cademirch adds more verbose error output!

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

@johanneskoester do you see logs now? I'm seeing an error from upstream snakemake about resources:

[image: screenshot of the resources error]

This is with the hello-world-cos example.

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

hello-world was successful! That's a start :)

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

For hello-world-intel-mpi it seems to succeed in batch, but there is some issue locally:

Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
[Tue Apr  9 12:26:35 2024]
Error in rule compile:
    message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f' exceeded deadline. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2
    input: s3://my-snakemake-testing/pi_MPI.c (retrieve from storage)
    output: s3://my-snakemake-testing/pi_MPI (send to storage)
    log: s3://my-snakemake-testing/logs/compile.log (send to storage), .snakemake/googlebatch_logs/compile.log (check log file(s) for error details)
    shell:
        mpicc -o .snakemake/storage/s3/my-snakemake-testing/pi_MPI .snakemake/storage/s3/my-snakemake-testing/pi_MPI.c &> .snakemake/storage/s3/my-snakemake-testing/logs/compile.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f

cannot access local variable 'response' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
/home/vanessa/Desktop/Code/snek/env/lib/python3.11/site-packages/snakemake/dag.py:413: RuntimeWarning: coroutine '_IOFile.remove' was never awaited
  f.remove(only_local=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Complete log: .snakemake/log/2024-04-09T122253.419016.snakemake.log
WorkflowError:
At least one job did not complete successfully.

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

Okay, I see the issue there (we need to return); will try fixing it.
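
For readers: the log line "cannot access local variable 'response' where it is not associated with a value" is the message Python 3.11 gives for an `UnboundLocalError`. The sketch below is a hypothetical reduction of that pattern, not the plugin's actual code (the function names and the `state` strings are assumptions). It shows a variable assigned in only one branch and then used afterwards, and how an early return avoids the problem.

```python
from typing import Optional


def final_state_buggy(state: str) -> dict:
    """Hypothetical reduction of the bug; not the plugin's code."""
    if state == "SUCCEEDED":
        response = {"state": state, "ok": True}
    elif state == "FAILED":
        # The failure is reported, but control falls through
        # to the return below without assigning `response`.
        print(f"job ended in state {state}")
    return response  # raises UnboundLocalError when state == "FAILED"


def final_state_fixed(state: str) -> Optional[dict]:
    """Same logic, but each branch returns on its own."""
    if state == "SUCCEEDED":
        return {"state": state, "ok": True}
    print(f"job ended in state {state}")
    return None  # explicit return; `response` is never read unassigned


if __name__ == "__main__":
    try:
        final_state_buggy("FAILED")
    except UnboundLocalError as err:
        print("buggy version raised:", err)
    print("fixed version returned:", final_state_fixed("FAILED"))
```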

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

okay that one worked too - going to wait for @johanneskoester on the first bug with upstream snakemake before next step.

@cademirch
Contributor

Sounds good. I tested hello-world with batch-cos and it seems to be all good here. I can see all of Snakemake's output in the batch logs. Which upstream bug are you referring to?

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

I got the error about resources, this one: #29 (comment)

@cademirch
Contributor

Ah I see. Which workflow/example did this come from?

@vsoch
Collaborator Author

vsoch commented Apr 9, 2024

The hello-world-cos one.

@cademirch
Contributor

Oops you said that in the comment. Weird I'm not hitting that.

@cademirch
Contributor

Looking at my entrypoint.sh, I do have `--default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI=` in the snakemake command, which is what your error seems to be complaining about.
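
For readers wondering what that argument contains: the part after the `base64//` prefix looks like ordinary base64, so it can be decoded with the standard library. This is a small sketch, assuming the prefix is just a marker to strip; the helper name `decode_resource_arg` is made up for illustration.

```python
import base64

PREFIX = "base64//"  # marker seen in the command above; assumed to be a plain prefix


def decode_resource_arg(arg: str) -> str:
    """Strip the prefix (if present) and base64-decode the rest as UTF-8."""
    payload = arg[len(PREFIX):] if arg.startswith(PREFIX) else arg
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    print(decode_resource_arg("base64//dG1wZGlyPXN5c3RlbV90bXBkaXI="))
    # prints: tmpdir=system_tmpdir
```

So the flag in that entrypoint.sh carries the default resource `tmpdir=system_tmpdir` in encoded form.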

@vsoch
Collaborator Author

vsoch commented Apr 10, 2024

huh, but if it works for you that's great! Let's get @johanneskoester to try it out for another test.

@johanneskoester
Collaborator

I can confirm that this PR works in CI with the true API tests.

Resolved review threads: docs/further.md (two, outdated), example/hello-world-cos/README.md
Problem: the COS container and credentials should be exposed
in the executor settings.
Solution: add them there.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch
Collaborator Author

vsoch commented Apr 17, 2024

Ping @johanneskoester can you review again?

@johanneskoester
Collaborator

I have started a final run with the true api CI: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/8796288376

Collaborator

@johanneskoester johanneskoester left a comment

Great work!

@vsoch
Collaborator Author

vsoch commented Apr 23, 2024

@johanneskoester I'm going to bed, but if you see some hint about the error in the cloud logs that would help me to debug. Goodnight!

@johanneskoester
Collaborator

Works again! That was a bug in Snakemake that I fixed yesterday.

@johanneskoester johanneskoester merged commit 3dcfc8c into main Apr 25, 2024
5 of 6 checks passed
@johanneskoester johanneskoester deleted the testing-cos branch April 25, 2024 06:59
3 participants