Existing intermediate outputs are mishandled when reading from Google Cloud Storage #1460
Comments
Related to #1405. I've experienced this quite a bit. Thank you for taking the time to reproduce it and make an MRE. An observation I'll add is that I've experienced this without having to delete any files. For example, the pipeline will be running fine and one job will hit this issue on its own, with no intervention. Files are present in GCS but snakemake thinks they are missing. If I have specified `--retry-times`, the workflow will enter the loop described in the OP. When I then rerun the workflow, those files are marked as 'incomplete', though they are not: the logs show that the job that produced them finished just fine.
@CowanCS1 @cademirch please have a look at PR #1531. I am trying to reproduce your issue there.
@johanneskoester Looks correct from my side.
I tested a fix that tries to update the cache when a new file is uploaded (#1541), but got the same error. I'm not sure how I can debug this just by looking at it; I'd need to be able to see and interact with the project to understand what is going on.
@CowanCS1 if you have a Google project to test (I do not), can you give me more insight as to where you think the issue is? I tried adding a call to update the cache after upload, but we get the same error. I feel kind of helpless to reproduce and debug this. Stanford had easy access to GCP; my new employer doesn't even use it 😭
@vsoch I can also give you access to the snakemake-testing project on GCP. Would that help? We run on the free budget, but the workflows are minimal, so it should not be a big issue, right?
Yes, hugely! I would just interactively step through the test you created, nothing big.
Can you send me your Google account email address via email?
Ok, fixed! #1541
* chore: add testcase for issue #1460
* fix expected results
* try force update of inventory after upload
* run gls tests first
* pushing test fix: the issue is that the blob object held by a GS remote object can go sort of stale, returning False for blob.exists() when the file clearly exists! To fix, we need to do an additional self.update_blob() and then return the exists check again. I am not sure if this can be made more efficient by only checking under certain conditions, but since it seems likely we cannot perfectly know when the blob has gone stale, the sure way is to always update.
* wrong version of black
* remove unused remote inventory, was just testing!

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
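The retry-on-stale-handle idea from the commit message can be sketched as follows. This is a simplified illustration, not the actual snakemake code: the class and the fake bucket/blob stand-ins are hypothetical, mimicking only the relevant bits of the google-cloud-storage API (`bucket.blob()` builds a handle locally without an API call; `bucket.get_blob()` does a lookup and returns `None` for a missing object).

```python
class RemoteObject:
    """Sketch of the fix: if the cached blob handle reports the object
    as missing, refresh the handle and check one more time."""

    def __init__(self, bucket, key):
        self.bucket = bucket
        self.key = key
        # bucket.blob() creates a handle without contacting GCS,
        # so it can go stale relative to the remote state.
        self.blob = bucket.blob(key)

    def update_blob(self):
        # bucket.get_blob() performs an actual lookup; fall back to a
        # plain handle if the object does not exist remotely.
        self.blob = self.bucket.get_blob(self.key) or self.bucket.blob(self.key)

    def exists(self):
        if self.blob.exists():
            return True
        # The handle may be stale (e.g. the file was just uploaded);
        # refresh it and re-check before reporting the file as missing.
        self.update_blob()
        return self.blob.exists()


# Stand-ins for google-cloud-storage objects, for illustration only.
class FakeBlob:
    def __init__(self, alive):
        self.alive = alive

    def exists(self):
        return self.alive


class FakeBucket:
    def blob(self, key):
        return FakeBlob(False)   # simulate a stale handle

    def get_blob(self, key):
        return FakeBlob(True)    # the object really is there
```

With these stand-ins, a `RemoteObject` whose initial handle is stale still reports the file as present, which is the behavior the fix restores.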
Fixed by @vsoch, our hero!
Snakemake version
7.1, 7.0.1
Describe the bug
When an existing, intermediate output is present in the GS remote location, it is not recognized as being present, possibly on purpose due to versioning considerations. When the rule that produces the file is run, it correctly creates the files and uploads them to the remote, but then returns a `MissingOutputException`, not recognizing that the files are present. This minimal example is a partial reproduction of the issue. I've seen cases where the pipeline is stuck in an infinite loop or otherwise enters a state where it can be run repeatedly, correctly completing and uploading the outputs each time, without the outputs ever being recognized as present. This minimal example is partial in that it generates the bug on the first run, but then correctly recognizes the intermediates on the second run. I believe the error produced here also underlies the more severe cases.
The error appears to be restricted to intermediate rules which have a list of outputs. In all cases I've seen, the error can be circumvented by manually deleting the outputs.
Logs
Minimal example
Instructions
Delete `test.txt`, `blob.txt`, and `pretest.txt` from the remote, leaving the file `preblob.txt`. Rerunning the workflow then fails with a `MissingOutputException`. `preblob.txt` has been overwritten, so it is not missing in any sense, and despite the script reporting that `pretest.txt` was removed as an output of a failed job, it is actually still present. So there are two cases in which operations performed on GS files cause the system to lose track of the files.
Snakefile
Command to execute
Output of the rule which produces the exception
Additional info
I suspect the issue here is related to the `blob` handles first being created in a way that doesn't check whether the file exists or not, and then not being correctly updated within the list of output objects. Here's an illustration of how the `blob` object is changed by certain actions. Note in particular that 1) `get_blob` provides `None` only if the file actually doesn't exist, whereas the directly specified `blob` is `None` in both cases, and 2) the handles are updated even in the event of a rewrite, which is correctly handled here, but I think the observed behavior could be caused by accidentally passing and updating a copy of the blob object.
Code
Output
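The distinction drawn above, that `bucket.get_blob()` returns `None` only when the object truly doesn't exist while `bucket.blob()` always hands back a local handle with no existence check, can be illustrated with a small stand-in for the bucket. The real behavior lives in the google-cloud-storage library; the `FakeBucket` class and file names here are purely hypothetical:

```python
class FakeBucket:
    """Stand-in mimicking the two lookup styles of a GCS bucket,
    for illustration only."""

    def __init__(self, existing_keys):
        self.existing = set(existing_keys)

    def blob(self, key):
        # bucket.blob(): builds a local handle without any API call,
        # regardless of whether the object exists remotely.
        return {"name": key}

    def get_blob(self, key):
        # bucket.get_blob(): performs a lookup and returns None
        # only when the object truly does not exist.
        return {"name": key} if key in self.existing else None
```

A handle obtained via `blob()` therefore carries no information about remote state, which is exactly why an exists check against a stale handle can disagree with reality.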