MissingOutputException with directory and GS remote provider #576
Is this the answer I'm looking for?

```python
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS.allows_directories
# False
```
hey @eric-czech just a quick question - if the *.zarr output is a directory, did you try appending a slash to the end?
Hm, I had not, but I just tried a few variations on that using a simpler rule:

```python
rule test:
    input:
        f"tmp/input.txt"
    output:
        # Attempts:
        # "tmp/output.zarr"
        # "tmp/output.zarr/"
        # directory("tmp/output.zarr")
        directory("tmp/output.zarr/")
    conda:
        "envs/spark.yaml"
    shell:
        "mkdir -p {output} && "
        "echo '1' > {output}/f1.txt && "
        "echo '2' > {output}/f2.txt"
```

I got the same result each time -- here's the log from one of the runs:

```
> snakemake --use-conda --cores=1 --local-cores=1 --default-remote-provider GS --default-remote-prefix rs-ukb rs-ukb/tmp/output.zarr
Building DAG of jobs...
[Wed Aug 19 16:16:20 2020]
Skipped removing non-empty directory rs-ukb/tmp/output.zarr
```
I thought an extension on the directory name could be an issue, so I tried with and without the extension.
I think the core issue here (and the reason we have `allows_directories` returning False) is that Google Storage has no real directory objects:

```python
bucket = client.bucket('snakemake-testing')
# <Bucket: snakemake-testing>
bucket.blob("1/").exists()
# False
bucket.blob("1").exists()
# False
```

So the key would be finding a way to get support for that. Another alternative is that snakemake could (when it's creating the inventory) keep a set of separate directory names that are deemed to exist based on filenames. And of course the third option is to do what you already did, and use a file in the directory as a proxy for the directory's existence. @johanneskoester what are your thoughts?
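To illustrate the inventory alternative mentioned above, here is a minimal sketch (not Snakemake code; the bucket name is hypothetical) of deriving a set of "existing" directories from the object names that actually exist:

```python
from google.cloud import storage


def implied_directories(blob_names):
    """Collect every directory prefix implied by a list of object keys."""
    dirs = set()
    for name in blob_names:
        parts = name.split("/")[:-1]  # drop the file component
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]) + "/")
    return dirs


client = storage.Client()
names = [b.name for b in client.list_blobs("snakemake-testing")]
# e.g. {'data/', 'data/test/'} if the bucket holds data/test/1.txt
print(implied_directories(names))
```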
Hi all, I'll just chime in to say that I've also been having this problem, and my vote is for creating a set of separate directory names based on filenames, since that is what it seems like others have done.
I have also hit this issue when attempting to move an existing workflow onto Google's Cloud Life Sciences.
It would be very useful to hear what parts of this are expected behavior. At present the only method I can think of to make this particular rule work on Cloud Life Sciences is to list all of the required outputs explicitly, and attempt to remove the is_directory flag where it incorrectly persists by having the remote shell copy them into a second location after they are created.
hey @CowanCS1 I just wanted to let you know that I hear you! There is one thing I want to try that might allow for a folder, and I'll try to make some time this weekend. I can't make promises because Google Storage isn't really designed to have "folders" (a folder is just part of the path to a file in an object store) but I at least want to test something out. I'll let you know if it gets anywhere. In the meantime, could you provide a simple Snakefile that reproduces your particular issue? That will be very helpful for me to test.
Hi @vsoch - thanks! I very much appreciate the work you and others have done to make it easier to use Snakemake on GCP. I wrote a minimal-ish example to both demonstrate the issue and identify a work-around for uploading arbitrary directory contents, aiming to keep the solution somewhat within the existing framework. The work-around I have settled on is to use the gls.py helper file and explicitly start a "save to remote" call for the relevant sub-directories. As mentioned above, a single file in the sub-directory is referenced as the snakemake output. Here is the example with the work-around in place:

Snakemake version:

Command used to execute workflow:

The code is always run with the most recent version of all files committed into a GitHub repository.

And the resulting log file:

Errata:
@CowanCS1 I can confirm the bug with the log files not getting overwritten. We've experienced this ourselves, too.
hey @CowanCS1! So it looks like you are copying the code that the worker runs for the step (downloading gls.py). Can you provide a basic example of generating some number of files with a directory prefix for the output that fails? We need to work on one thing at a time, and there are many moving pieces in the Snakefile above.
Hi @vsoch! Absolutely.

Snakemake version:

Rule:

```python
# Return all files created in the sub-directory /data/test/
rule results:
    input:
        directory('data/test/')

# Run a command which creates an arbitrary set of files in /data/test/
rule minimal_test:
    resources:
        machine_type = 'n1-standard-4',
        disk_mb = 10000
    output:
        directory('data/test/')
    shell:
        '''
        mkdir -p {output}
        for i in $(seq 1 5); do echo $i > {output}/$i.txt; done
        '''
```

Output when executed with:

Purpose of rule: Intended to mimic any rule whose output is a collection of files stored within a new sub-directory.

Expected outcome: For all 5 files created within the output directory to be uploaded to the remote.

Hopefully this is what you were requesting - there is a degree of uncertainty on my part, but I'm happy to help further as needed.
In case it is helpful, here is a function I wrote to try to test for the existence of directories on GCS. It still fails in the case that the directory exists as a placeholder only, containing neither files nor sub-directories containing files. That seems acceptable to me if documented.

```python
def is_dir(fname):
    blobs = list(bucket.list_blobs(prefix=fname, delimiter='/'))
    if len(blobs) == 1:
        print('%s is a file, not a directory' % fname)
        return False
    elif len(blobs) > 1:
        print('%s is a directory that contains files' % fname)
        return True
    else:
        blobs = list(bucket.list_blobs(prefix=fname))
        if len(blobs) == 0:
            print('%s contains no files and may not exist' % fname)
            return False
        else:
            print('%s is a directory containing subdirectories which contain files' % fname)
            return True
```

Test commands using the same data structure from my minimal example immediately preceding this:

```python
print('%s\n' % is_dir('data/'))
print('%s\n' % is_dir('data/test/'))
print('%s\n' % is_dir('data/test/1.txt'))
print('%s\n' % is_dir('data/test/qwerty'))
```

And the output:
okay, so I'm on a worker node and reproducing the issue. The error is thrown by this function so I'm walking through it to see what happens. Here is our job:

```python
job
# minimal_test
```

We define expanded output as a list:

```python
expanded_output = []
for path in job.expanded_output:
    expanded_output.append(job.shadowed_path(path))
expanded_output
# ['snakemake-testing/testing-output/data/test']
```

That looks about right. I can also list that directory to confirm that it exists, and that there are files inside.

```
$ ls snakemake-testing/testing-output/data/test/
1.txt  2.txt  3.txt  4.txt  5.txt
```

Can verify the first entry:

```python
f = expanded_output[0]
# 'snakemake-testing/testing-output/data/test'
```

This seems kind of off:

```python
os.path.isdir(f)
# True
f.is_directory
# False
```

So the condition that triggers the exception is:

I'm not sure if this is intentional, but let's explore how that is derived. It's a Snakemake IOFile:

```python
type(f)
# snakemake.io._IOFile
```

Ah, so I think I know this bug - I fixed it last week but it hasn't been merged yet. When a remote is added, it strips away previous tags. To step back: each file object has flags, which are basically functions that you'd add to some path in a Snakefile (e.g., touch, directory, etc.). But when an AnnotatedString (with the correct flags) gets parsed into the path of a remote object, it strips away the previous tags (the PR I linked fixes this). The property returns `is_flagged(self._file, "directory")`, but the flag isn't there.

So to emulate how it would be with the fix, I'm going to artificially add the tag back, and then is_directory should be true.

```python
self._file.flags['directory'] = True
self.is_directory
# True
```

That should get us through that particular error (note that the string below is not printed):

```python
if (f.is_directory and not os.path.isdir(f)) or (
    os.path.isdir(f) and not f.is_directory
):
    print("ERROR IS TRIGGERED")
# not printed
```

So I can apply the fixes from that PR to see if we get further. One quick note: in your first rule, I think "input" should be "output"? I changed the Snakefile to be:

It still errors on the remote file upload, so let's look at that. I think here is obviously the issue: it only respects an upload given that it's a local file (which test is not).

```python
>>> self.local_file()
'snakemake-testing/testing-output/data/test'
```

Here is explicitly what that looks like:

```python
>>> self.blob.upload_from_filename(self.local_file())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 2335, in upload_from_filename
    with open(filename, "rb") as file_obj:
IsADirectoryError: [Errno 21] Is a directory: 'snakemake-testing/testing-output/data/test'
```

So - now we probably need to think about design. I suspect this is done intentionally, because otherwise you could provide some random output directory and then get access to everything (that you maybe should not have access to). I am looking at other remotes and I don't see immediate support for walking a directory; maybe this is intentional - it wouldn't be relevant when running locally because there isn't any content to upload. So what I can do is put in a PR (note that it would be dependent on the other one being merged to work fully) that walks over the directories in the path and uploads them to storage. When I add that, there is one more remaining issue:

```
Uploading to remote: snakemake-testing/testing-output/data/test
Finished upload.
WorkflowError:
The file does not seem to exist remotely: snakemake-testing/testing-output/data/test
  File "/workdir/snakemake/executors/__init__.py", line 566, in handle_job_success
  File "/workdir/snakemake/executors/__init__.py", line 243, in handle_job_success
  File "/workdir/snakemake/remote/GS.py", line 172, in mtime
```

okay, so now we're getting into hairy details. I can also update the exists function to use a file (any file found) as a proxy for existing, but then all the functions that use self.exists() as a boolean break - for example, here we can only get a timestamp of a blob given that the blob exists. Since the blob itself is a directory, and doesn't have any kind of representation in storage, this throws an error. Any function in GS that requires metadata from the blob also throws an error. E.g.:

```python
if self.exists():
    self.update_blob()
    t = self.blob.updated
    return t.timestamp()
else:
    raise WorkflowError(
        "The file does not seem to exist remotely: %s" % self.local_file()
    )
```

So here is the issue - what we would essentially be doing is hard-coding the remote to use the first file found as a proxy for the directory existing. That's exactly what the user can just do on the command line: instead of defining an output directory, define the files! I am starting to agree that directories should not be supported for object storage, and likely this was a decision already thought about and made. @johanneskoester it would be good to have your feedback here. Unless this is supported by other object stores (and there is a clear method to do it), my thinking is that directory support doesn't make sense for object storage.
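To make the trade-off concrete, here is a hedged sketch (not the GS.py implementation; bucket and prefix names are hypothetical) of what mtime-by-proxy would have to look like - the "directory" has no blob of its own, so its timestamp would have to come from the newest object under the prefix:

```python
from google.cloud import storage


def directory_mtime(bucket, prefix):
    """Timestamp of the most recently updated blob under prefix, or None."""
    blobs = list(bucket.list_blobs(prefix=prefix))
    if not blobs:
        return None  # nothing under this prefix: the "directory" does not exist
    return max(b.updated for b in blobs).timestamp()


bucket = storage.Client().bucket("snakemake-testing")
print(directory_mtime(bucket, "testing-output/data/test/"))
```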
Thanks @vsoch - that is very useful. I can certainly appreciate that there are cascading implications for this issue. The first rule in my minimal example could actually be removed - just my habit to put a placeholder rule at the top to define the final output of the workflow. General thoughts:
Briefly, I wanted to mention that there is a relatively simple fix to make your example above work as expected. There is an undocumented way to make GCS objects which function like directories and are compatible with API queries.

```python
blob = bucket.blob('data/test/from_api/')
print(blob.exists())
blob.upload_from_string('', content_type='application/x-www-form-urlencoded;charset=UTF-8')
print(blob.exists())
```

Output:

Directories created in this fashion are list-able and can be browsed in the cloud console as directories. If these directories are created when recursively uploading a directory tree, I think it would fix the last problem you listed above. For other remotes, this may not be possible - I can imagine how this could be a general issue with cloud storage. I'm not pushing for this to be actually applied, since it is undocumented and might still break if the user mixes in directories created by other methods, but it is an option I wanted to make you aware of.
For some rules, it can be hard to know what the files will be called (or how many there will be) ahead of time. I think this was one of the original reasons for creating the directory() flag. Couldn't the hidden .snakemake_timestamp file that Snakemake writes into directory outputs be used as a proxy for the directory's existence?
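In the meantime, here is a sketch of the file-as-proxy workaround discussed throughout this thread (the rule and sentinel names are hypothetical): declare one known file inside the directory instead of the directory itself, so the remote only ever has to check for a real object.

```python
# Hypothetical workaround: the declared output is a sentinel file rather
# than the directory, so the GS remote only ever checks a real object.
rule minimal_test_sentinel:
    output:
        'data/test/.done'
    shell:
        '''
        outdir=$(dirname {output})
        mkdir -p $outdir
        for i in $(seq 1 5); do echo $i > $outdir/$i.txt; done
        touch {output}
        '''
```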
@CowanCS1 that's a neat try! I'll try it out now.
Actually I'm not sure I can use the snakemake cloud account, my instance is gone and I'll need to verify and set it up again. :(
okay, so without permission to do that the most I can do is put in a PR with changes that might work, and then get reviewers. But it depends on the PR that I previously opened to fix the remote object flags, so I'm not sure you'd be able to test without pulling that one too.
I made an attempt at this, but I'm new to git repositories so let me document what I did and you can judge.
Then the test:

Minimal example rule, further simplified by removing the unnecessary first rule:
Workflow executed with:

Unfortunately, we hit the error:
Looks like the error you had expected would be resolved by #665 - I haven't walked through to confirm that the tags were in the same state. I also tested the code added within #681 in isolation, and that appears to work as expected.
The worker on GLS is not going to have access to the updated code; you'd need to build and push a container to a registry and use it with the --container-image flag.
Makes sense 👍 Thanks! If someone else doesn't get to it first, I will try at my next available opportunity.
Tested! The code completed without error, but the result was not as expected. A "test" object appeared in the directory which looked like this:

```python
f = 'data/te'
[i.name for i in bucket.list_blobs(prefix=f)]
```

I confirmed that the output of my rule is

An object can be created with a trailing slash:

```python
f = 'data/test/'
blob = bucket.blob(f)
blob.upload_from_string(
    "", content_type="application/x-www-form-urlencoded;charset=UTF-8"
)
blob.exists()
```

The object appears with a trailing slash in the blob listing, and as a directory in the cloud console. I retained the Snakemake-generated object:

```python
f = 'data/te'
[i.name for i in bucket.list_blobs(prefix=f)]
```

Since we would want to cover the case where the user applies the directory wrapper but doesn't include the trailing slash, we can cover both cases by appending it here. If you can update the PR, I can rebuild to test tomorrow. I am getting a crash course in git and containers this week, but haven't covered how to submit pull requests yet.
@CowanCS1 that should be easy to do! And that's an interesting finding, because I expected the trailing slash to not really matter given that some files are produced in the folder. But perhaps not having a trailing slash produces a file, and then it's not possible to treat it further as a directory. Of course now I'm concerned that adding the trailing slash will invalidate being able to create it as what looks like an empty object. We will have to test!
okay so I updated the PR to ensure we upload with a trailing slash: 40f5395.
Great! Many thanks. In blob storage, the trailing slash seems to be interpreted as a directory by the console. It even works if you put data into it - I was a little afraid to try, but sure enough you can read data from the directory blobs just like the rest. That opens up some weird opportunities for hidden directory annotation. I confirm there isn't a problem with creating one of these placeholder directories and then writing files with the same directory prefix.
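To show what reading data from a directory blob means, a small sketch (bucket name hypothetical) - a trailing-slash key behaves like any other object, so data can even be stored in it:

```python
from google.cloud import storage

bucket = storage.Client().bucket("snakemake-testing")  # hypothetical bucket
blob = bucket.blob("data/test/")          # key ends with "/"
blob.upload_from_string("hidden note")    # write data "into" the directory
print(blob.download_as_bytes())           # b'hidden note'
```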
@vsoch a basic question - this edit was a commit without a branch, whereas the last two were pull requests with a branch. Any tips on how I can download this, similar to what I did before?
The commit is now represented in the same PR branch, so the easiest thing is to do the exact same set of steps as before. If you want to update the current combined branch that you put together (harder), then you'd want to pull again from the PR branch to grab the commit. You can use git log to peek and see what commits you've added, and of course visually check the code for a sanity check. Git is weird and hard, and there are always weird operations like this one. You are doing great! :)
Progress! Same test setup as yesterday, just rebuilding the dockerfile with the updated PR #681.

Snakemake fails with the following message:
I can see a simple explanation for this in GCS. For clarity I'll describe what I see using the URI style

The directory and all of its containing files were uploaded to

Here is a check of the GCS contents with the API:

```python
[i.name for i in bucket.list_blobs(prefix='bucket/data/test/')]
```

Appears to be perfect, other than

Thanks @vsoch! It has been a good experience. I need this for my own workflows, and I can attest to the lack of a simple work-around for certain workflows.

Edit 1: I admit to being a bit confused about why I saw that blob in a path which didn't have the

Edit 2: I'm also seeing issues with file outputs, possibly related to #665 - I will confirm it works inside the container with the master branch, then test the PRs incrementally.
After some more testing, I confirm that uploading individual files works exactly the same in the master branch and the master + #665 + #681 version. Both place files in

As stated in my previous post, the directories are handled differently in the master + #665 + #681 version. Files are uploaded to

In order to test files and directories with the same rule, I modified it to no longer rely on the

```python
# Run a command which creates an arbitrary set of files in data/test/
rule minimal_test:
    resources:
        machine_type = 'n1-standard-4',
        disk_mb = 10000
    output:
        directory('data/test/')
        #'data/test/1.txt'
    shell:
        '''
        mkdir -p 'bucket/data/test/'
        for i in $(seq 1 5); do echo $i > bucket/data/test/$i.txt; done
        '''
```

So the extra prepend is specific to the handling for directories; possibly this

```python
upload_blob = (
    self.bucket.blob(f)
    if f.endswith("/")
    else self.bucket.blob(f + "/")
)
```

should be this

```python
upload_blob = (
    self.blob(f)
    if f.endswith("/")
    else self.blob(f + "/")
)
```

since the definition of blob is already

```python
@lazy_property
def blob(self):
    return self.bucket.blob(self.key)
```
Thanks for this detailed testing @CowanCS1! I really appreciate it, because I would have tested it before opening the PR but don't have an account to use at the moment. I'll update the "upload_blob" as you've specified.
Tested and found a few different issues.

```python
def update_blob(self):
    self._blob = self.bucket.get_blob(self.key)
```

I tried a quick fix - moving the

I thought this would be a good point to stop and get your feedback @vsoch. It seems like the best option would be to directly detect whether the user supplied the

I'm not sure what your situation is, but would you like me to explore whether it is possible to get you a GCP account?
I've restored the PR back to use self.bucket.blob - I was fairly confident about this, but thought you had tested what you proposed and verified that it worked. Could you please clarify in the future whether you've done this or not? I think to help I can explain what I do: you can launch the branch with the changed PR first, and then inspect the stdout to get the command that is run to create the instance. Then you can create a dummy instance, install the proper version of snakemake, and follow the command to download and extract the working directory from storage. I like to have the working directory be the snakemake version I want to test so it's included and I can extract it. Then I can run the step manually on the worker, and insert

to interactively test. I usually do that a bunch of times in different spots until I figure it out. I've updated the PR with upstream master (with the merged touch fix) so it should at least be easier to test.
I think @johanneskoester is working on it :)
Thanks for the explanation - I'll try to be extremely explicit about what testing I have done. I have a version of GS.py which is tested and has improved, but incomplete, support for directories. This code is only for illustrative purposes and should not be included in a PR without further improvements and testing.

Selected code from my modified GS.py with improved directory support:

```python
@retry.Retry(predicate=google_cloud_retry_predicate)
def upload(self):
    try:
        if not self.bucket.exists():
            self.bucket.create()
            self.update_blob()
        # Distinguish between single file, and folder
        f = self.local_file()
        if os.path.isdir(f):
            # Ensure the "directory" exists
            self.blob.upload_from_string(
                "", content_type="application/x-www-form-urlencoded;charset=UTF-8"
            )
            for root, _, files in os.walk(f):  # walk over local files
                for filename in files:
                    filename = os.path.join(root, filename)
                    # strip the bucket part from the local filename
                    p = pathlib.Path(filename).parts[1:]
                    # create a key pointing to the remote
                    filekey = str(pathlib.Path(*p))
                    self.bucket.blob(filekey).upload_from_filename(filename)
        else:
            self.blob.upload_from_filename(f)
    except google.cloud.exceptions.Forbidden as e:
        raise WorkflowError(
            e,
            "When running locally, make sure that you are authenticated "
            "via gcloud (see Snakemake documentation). When running in a "
            "kubernetes cluster, make sure that storage-rw is added to "
            "--scopes (see Snakemake documentation).",
        )

@lazy_property
def key(self):
    """
    Placeholder for illustrative purposes, do not implement in production.
    Currently: assumes anything with an extension is a file and anything without is a directory.
    Better: test is_directory by checking whether the directory() wrapper was applied to the output.
    Best: append the trailing slash to the filename at input if the directory wrapper is applied. May break other code.
    """
    _, file_extension = os.path.splitext(self.local_file())
    return self.parse().group("key") if file_extension else self.parse().group("key") + "/"
```

Output when running this is that the files are created in the correct place on the remote. We still get an error, but it now appears to be outside of this rule.

I have checked the

I did a quick test to verify that the workflow outside of that rule is not seeing

The key points from my testing thus far are

Seems like a good place to stop and get your advice @vsoch.
hey @CowanCS1! We just merged a PR into master to address directories on GLS - would you care to test it out?
Ping @CowanCS1 - have you had a chance to test? I was going to close the issue, but want to make sure you are good.
@vsoch I'm having a similar error when using the VEP wrappers to download plugins or cache (see below). I can see that there is a 0-byte object with the same name as the folder created in the bucket. This may explain the error above. In my case I don't see that error, but a later rule which uses that folder as input fails, as it copies the 0-byte object instead of the folder.

rule get_vep_cache:

rule get_vep_plugins:
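As a quick check for this situation, a small diagnostic sketch (bucket name and prefix are hypothetical) that lists 0-byte objects under a prefix - the folder-shadowing placeholders show up this way:

```python
from google.cloud import storage

bucket = storage.Client().bucket("my-vep-bucket")  # hypothetical bucket

# A 0-byte object whose key collides with a "folder" path can shadow the
# real directory contents when a later rule copies it as an input.
for blob in bucket.list_blobs(prefix="resources/"):  # hypothetical prefix
    if blob.size == 0:
        print("possible placeholder:", blob.name)
```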
I am trying to run a workflow rule that creates a directory in GS, but Snakemake continually fails to recognize that the directory exists. The error message recommends using the directory() flag, which I am already doing. This appears to be related to #396.
Snakemake version
5.22.1
Describe the bug
Output directories are flagged as missing when using GS remote provider.
Logs
The most salient part:
Full log: error_log.txt
Minimal example
Here is the offending rule, and I apologize that this isn't fully reproducible but it's difficult to share some of the details:
Invocation:
I also get the same error when running on a cluster, i.e. using:
Additional context
I am able to work around this by using an individual checkpoint/sentinel file of some kind, but it's unclear to me if directories are even supported for Google Storage. Is that in the docs somewhere? Am I just trying to use some feature that doesn't exist?