
Failed GLS job log files are not uploaded to bucket AND --show-failed-logs does not work on instance #1992

Open
cademirch opened this issue Dec 6, 2022 · 9 comments
Labels
bug Something isn't working

Comments

@cademirch
Contributor

cademirch commented Dec 6, 2022

Snakemake version

7.18.2
Describe the bug

When a job fails on GLS the stdout/stderr from the instance is captured and uploaded to the specified bucket. However, if the stderr/stdout is captured in a log file, that file is not uploaded to the bucket. Looking at the executor code, it doesn't seem like this is a feature - so I guess this isn't really a bug - but it would be a nice feature.

I attempted a quick work around by passing the option --show-failed-logs, expecting that would cat the log file to stdout (on the instance), which would be captured. However, it seems that --show-failed-logs is not passed to the snakemake command on the instance, rather it tries to display the log file locally, which of course does not exist.
Logs

  • Execute the minimal example (below) with --show-failed-logs:
    snakemake --google-lifesciences --default-remote-prefix gls-log-test -j1 --show-failed-logs
  • It fails as expected and tries to display the log file (which does not exist locally):
Error in rule hi:
   jobid: 1
   output: gls-log-test/hi.txt
   log: gls-log-test/logs/hi/log.txt (check log file(s) for error message)
   shell:
       hi 2> gls-log-test/logs/hi/log.txt
       (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
   jobid: 6527137249714480570
Logfile gls-log-test/logs/hi/log.txt not found.
  • Check the command executed on the cloud:
> gcloud beta lifesciences operations describe projects/snaketest/locations/us-central1/operations/4249048113650117362

/tmp/workdir.tar.gz && tar -xzvf /tmp/workdir.tar.gz && python -m snakemake
       --snakefile 'Snakefile' --target-jobs 'hi:' --allowed-rules 'hi' --cores 'all'
       --attempt 1 --force-use-threads  --resources 'mem_mb=1000' 'disk_mb=1000'  --force
       --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp
       --no-hooks --nolock --ignore-incomplete --rerun-triggers 'params' 'mtime'
       'software-env' 'code' 'input' --skip-script-cleanup  --conda-frontend 'mamba'
       --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait
       5 --scheduler 'ilp' --default-remote-prefix 'gls-log-test' --default-remote-provider
       'GS' --default-resources 'mem_mb=1000' 'disk_mb=1000' 'tmpdir=system_tmpdir'
       --mode 2

Minimal example

rule all:
    input: "hi.txt"

rule hi:
    output: "hi.txt"
    log: "logs/hi/log.txt"
    shell: "hi 2> {log}" # /bin/bash: hi: command not found

Additional context

Would appreciate your insight @vsoch. It seems to me that uploading the job log file could be handled by gls_helper.py, but it would probably take me a bit to implement that. It would probably be easier to pass --show-failed-logs through to the instance, similar to how --use-conda is passed, though I'm not sure where/how that is handled.
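A minimal sketch of what that upload step might look like (hypothetical helper, not the actual gls_helper.py code; assumes the google-cloud-storage client is available and that the rule's log paths are known):

# Hypothetical sketch: after a failed GLS job, upload any rule log files that
# exist on the instance to the workflow bucket so they can be inspected later.
from pathlib import Path

from google.cloud import storage  # assumes google-cloud-storage is installed


def upload_failed_logs(bucket_name: str, log_paths, prefix: str = "logs") -> None:
    """Upload each existing log file to gs://<bucket_name>/<prefix>/<filename>."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for log_path in log_paths:
        path = Path(log_path)
        if not path.exists():
            continue  # the job may have failed before the log was created
        blob = bucket.blob(f"{prefix}/{path.name}")
        blob.upload_from_filename(str(path))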

@CowanCS1 Interested if/how you handle this too!

@cademirch added the bug label Dec 6, 2022
@vsoch
Contributor

vsoch commented Dec 6, 2022

It's been a while since I worked on this, but at least when I used this API, I relied on the command you described above:

https://github.com/snakemake/snakemake/blob/main/snakemake/executors/google_lifesciences.py#L862-L876

but that would only work if you put more verbose printing in your job. Maybe it would be a matter of adding --verbose to the command there for Snakemake debugging output, and if you still can't see anything even with extra prints, we could minimally have the script that runs everything catch the failure and print the log. I'm also not sure the user needs to clutter their storage with logs just to debug - it's likely better to get them interactively from the client command as you've done above. There are definitely many options - let's discuss with the folks here and @johanneskoester.
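One rule-level version of the "catch the failure and print the log" idea, as a sketch (this changes the user's rule, not the executor, and assumes bash strict mode on the instance):

rule hi:
    output: "hi.txt"
    log: "logs/hi/log.txt"
    # On failure, cat the log to stderr so it ends up in the instance
    # stdout/stderr that GLS already captures and uploads; a workaround only.
    shell: "(hi 2> {log}) || (cat {log} >&2; exit 1)"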

@cademirch
Contributor Author

cademirch commented Dec 6, 2022

Yeah, that only works if you don't redirect the stderr/stdout of your job to a log file, which defeats the purpose of the log directive when defining a rule. I think a suitable workaround is allowing --show-failed-logs to be passed to the cloud instance, because we then retain our defined log file for that rule.

I just tried adding w2a("show_failed_logs") in the return here:

def general_args(self):

However, show_failed_logs isn't an attribute of the workflow so that doesn't work :/

Edit:

I guess we could just hardcode --show-failed-logs... though I'm not sure of the consequences of that.
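For illustration, a minimal sketch of the hardcode idea (hypothetical; not the actual executor code, which builds the argument string differently):

# Hypothetical sketch: append the flag to whatever argument string the executor
# already builds for the snakemake command run on the instance.
def with_show_failed_logs(general_args: str) -> str:
    """Hardcode --show-failed-logs onto the remote snakemake invocation."""
    return f"{general_args} --show-failed-logs"


# e.g. "--nocolor --notemp ..." becomes "--nocolor --notemp ... --show-failed-logs"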

@vsoch
Contributor

vsoch commented Dec 6, 2022

@cademirch remember that the container the life sciences worker is using is the latest snakemake - so if you want to try changing something you'd need to build a container, push to a registry, and then provide the container URI to the executor.

@cademirch
Contributor Author

cademirch commented Dec 6, 2022

Right, but what I'm proposing is just adding --show-failed-logs to the snakemake command line being run in the container, which is handled in the function linked above. Fwiw, I've added it on my fork and it is working as I expected. However, it does hardcode the option into the command executed in the container, which I'm not sure is a good thing.

See:
cademirch@3b90fa3

@vsoch
Contributor

vsoch commented Dec 6, 2022

Ok great! Glad you found a solution.

@cademirch
Contributor Author

cademirch commented Dec 6, 2022

Opened a PR but further discussion is probably needed. Thanks for your help @vsoch! :)

@vsoch
Contributor

vsoch commented Dec 6, 2022

Haha I didn’t do anything - all you @cademirch! 🙌

@oxenit

oxenit commented Aug 22, 2023

+1 on this feature. It was very frustrating to have a job fail all afternoon because a file referenced in a params directive could not be found, without being able to find the exact reason 😝 (in either the pipeline logs or the bucket itself).

What is surprising is that, just like for the OP and contrary to what is written here, a rule's logs are actually not written to the bucket if the rule fails. I can only see rule logs when the rule passes (which is a bit sad).
We can find the GLS pipeline logs in another folder of the bucket, but they are not as verbose as the rule logs.

@cademirch
Contributor Author

@oxenit, I haven't had a chance to work on this unfortunately, but my workaround above has been sufficient for our use. I don't expect this behavior to change anytime soon, since we will eventually have to migrate to Google Batch; hopefully we can solve this issue when that happens.
