
MissingOutputException with directory and GS remote provider #576

Open

eric-czech opened this issue Aug 19, 2020 · 38 comments
Labels
bug Something isn't working

Comments

@eric-czech

eric-czech commented Aug 19, 2020

I am trying to run a workflow rule that creates a directory in GS, but Snakemake continually fails to recognize that the directory exists. The error message recommends using the directory() flag, which I am already doing.

This appears to be related to #396.

Snakemake version

5.22.1

Describe the bug

Output directories are flagged as missing when using GS remote provider.

Logs

The most salient part:

Uploading to remote: rs-ukb/logs/bgen_to_zarr.XY.txt
Finished upload.
ImproperOutputException in line 57 of /workdir/Snakefile:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule bgen_to_zarr:
rs-ukb/prep-data/gt-imputation/ukb_chrXY.zarr
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in handle_job_success

Full log: error_log.txt

Minimal example

Here is the offending rule. I apologize that this isn't fully reproducible, but it's difficult to share some of the details:

def bgen_samples_path(wc):
    n_samples = bgen_contigs.loc[wc.bgen_contig]['n_consent_samples']
    return [f"raw-data/gt-imputation/ukb59384_imp_chr{wc.bgen_contig}_v3_s{n_samples}.sample"]

rule bgen_to_zarr:
    input:
        bgen_path="raw-data/gt-imputation/ukb_imp_chr{bgen_contig}_v3.bgen",
        variants_path="raw-data/gt-imputation/ukb_mfi_chr{bgen_contig}_v3.txt",
        samples_path=bgen_samples_path
    output:
        directory("prep-data/gt-imputation/ukb_chr{bgen_contig}.zarr")
    params:
        contig_index=lambda wc: bgen_contigs.loc[str(wc.bgen_contig)]['index']
    conda:
        "envs/gwas.yaml"
    log:
        "logs/bgen_to_zarr.{bgen_contig}.txt"
    shell:
        # This will write to the local {output} path
        "python scripts/convert.py bgen_to_zarr "
        "--input-path-bgen={input.bgen_path} "
        "--input-path-variants={input.variants_path} "
        "--input-path-samples={input.samples_path} "
        "--output-path={output} "
        "--contig-name={wildcards.bgen_contig} "
        "--contig-index={params.contig_index} "
        "--remote=False 2> {log} "

Invocation:

snakemake --use-conda --cores 1 \
--default-remote-provider GS --default-remote-prefix $GS_BUCKET \
$GS_BUCKET/prep-data/gt-imputation/ukb_chrXY.zarr

I also get the same error when running on a cluster, i.e. using:

snakemake --use-conda --kubernetes \
--default-remote-provider GS --default-remote-prefix $GS_BUCKET \
$GS_BUCKET/prep-data/gt-imputation/ukb_chrXY.zarr

Additional context

I am able to work around this by using an individual checkpoint/sentinel file of some kind, but it's unclear to me if directories are even supported for Google Storage. Is that in the docs somewhere? Am I just trying to use some feature that doesn't exist?
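
For reference, a minimal sketch of what that sentinel-file workaround can look like (the sentinel name and the abbreviated shell command here are illustrative, not the actual rule):

rule bgen_to_zarr:
    output:
        sentinel="prep-data/gt-imputation/ukb_chr{bgen_contig}.zarr.done"
    shell:
        # convert.py writes the .zarr directory itself; Snakemake only tracks the sentinel
        "python scripts/convert.py bgen_to_zarr ... "
        "--output-path=prep-data/gt-imputation/ukb_chr{wildcards.bgen_contig}.zarr "
        "&& touch {output.sentinel}"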

@eric-czech eric-czech added the bug Something isn't working label Aug 19, 2020
@eric-czech
Author

Is this the answer I'm looking for?

from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS.allows_directories
> False

@vsoch
Contributor

vsoch commented Aug 20, 2020

hey @eric-czech just a quick question - if the *.zarr file is a directory, did you try appending a slash to the end? E.g., prep-data/gt-imputation/ukb_chr{bgen_contig}.zarr/.

@eric-czech
Author

Hm, I had not, but I just tried a few variations on that using a simpler rule:

rule test:
    input:
        f"tmp/input.txt"
    output:
        # Attempts:
        # "tmp/output.zarr"
        # "tmp/output.zarr/"
        # directory("tmp/output.zarr")
        directory("tmp/output.zarr/")
    conda:
        "envs/spark.yaml"
    shell:
        "mkdir -p {output} && "
        "echo '1' > {output}/f1.txt && "
        "echo '2' > {output}/f2.txt"

I got the same result each time -- here's the log from one of the runs:

Log > snakemake --use-conda --cores=1 --local-cores=1 --default-remote-provider GS --default-remote-prefix rs-ukb rs-ukb/tmp/output.zarr

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 test
1

[Wed Aug 19 16:16:20 2020]
rule test:
input: rs-ukb/tmp/input.txt
output: rs-ukb/tmp/output.zarr
jobid: 0

Skipped removing non-empty directory rs-ukb/tmp/output.zarr
Downloading from remote: rs-ukb/tmp/input.txt
Finished download.
Activating conda environment: /home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/1a7806c2
ImproperOutputException in line 109 of /home/eczech/repos/ukb-gwas-pipeline-nealelab/Snakefile:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule test:
rs-ukb/tmp/output.zarr
File "/home/eczech/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/init.py", line 544, in handle_job_success
File "/home/eczech/miniconda3/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/init.py", line 225, in handle_job_success
Removing output files of failed job test since they might be corrupted:
rs-ukb/tmp/input.txt, rs-ukb/tmp/output.zarr
Skipped removing non-empty directory rs-ukb/tmp/output.zarr
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
An error occurred
Complete log: /home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/log/2020-08-19T161617.862561.snakemake.log

@eric-czech
Author

I thought an extension on the directory name could be an issue so I tried with and without the .zarr (and with/without trailing slashes) but still got the same error.

@vsoch
Contributor

vsoch commented Aug 20, 2020

I think the core issue here (and the reason we have allows_directories set to False) is that we would need a way to determine whether a "subfolder" in Google Storage exists. Here is a dummy test for one of these paths that I know to exist:

bucket = client.bucket('snakemake-testing')
<Bucket: snakemake-testing>

bucket.blob("1/").exists()
False

bucket.blob("1").exists()
False

So the key would be finding a way to get support for that. An alternative is that snakemake could (when it's creating the inventory) keep a separate set of directory names that are deemed to exist based on the filenames. And of course the third option is to do what you already did, and use a file in the directory as a proxy for the directory's existence. @johanneskoester what are your thoughts?
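
To illustrate the prefix-based idea, here is a rough sketch of what such a check could look like with the google-cloud-storage client (the helper name and bucket are just for illustration):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("snakemake-testing")

def prefix_exists(bucket, prefix):
    # treat a "directory" as existing if at least one object shares its prefix
    if not prefix.endswith("/"):
        prefix += "/"
    return any(True for _ in bucket.list_blobs(prefix=prefix, max_results=1))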

@aryarm
Member

aryarm commented Sep 2, 2020

Hi all, I'll just chime in to say that I've also been having this problem and my vote is for creating a set of separate directory names based on filenames, since that is what it seems like others have done.

@CowanCS1
Contributor

CowanCS1 commented Oct 9, 2020

I have also hit this issue when attempting to move an existing workflow onto Google's Cloud Life Sciences.

  • As expected from the above discussion, any output with the directory() wrapper gets removed, causing a MissingOutputException

  • If I instead set the output to a file within the created sub-directory, I still get the same error. A possible cause is that the tool I am using (the STAR aligner) first creates directories for most of its outputs (e.g. path/Log.out/.../...) and then aggregates results into a file with the same name (path/Log.out). The original designation of the path as a directory appears to persist, causing it to be removed, as shown by the log proceeding to state Skipped removing non-empty directory path/Log.out. I looked through the code to check whether this is what happens, but I am not clear on when each function is called.

  • If I create a file as a proxy to indicate the directory's existence, the workflow completes without error, but the only file that gets transferred to remote (GCS) is the proxy file, not the rest of the outputs in the created sub-directory. From the comments of @vsoch and @eric-czech I would have expected the entire sub-directory to be tracked and all outputs saved to remote upon completion. For example, if I create the tracking file path/status.txt as my sole output, then in remote storage I will see bucket/path/status.txt but not other files such as bucket/path/Log.out.

It would be very useful to hear what parts of this are expected behavior. At present the only method I can think of to make this particular rule work on Cloud Life Sciences is to list all of the required outputs explicitly, and attempt to remove the is_directory flag where it incorrectly persists by having the remote shell copy them into a second location after they are created.

@vsoch
Contributor

vsoch commented Oct 9, 2020

hey @CowanCS1 I just wanted to let you know that I hear you! There is one thing I want to try that might allow for a folder, and I'll try to make some time this weekend. I can't make promises because Google Storage isn't really designed to have "folders" (a folder is just part of the path to a file in an object store) but I at least want to test something out. I'll let you know if it gets anywhere. In the meantime, could you provide a simple Snakefile that reproduces your particular issue? That will be very helpful for me to test.

@CowanCS1
Contributor

Hi @vsoch - thanks! I very much appreciate the work you and others have done to make it easier to use Snakemake on GCP.

I wrote a minimal-ish example to both demonstrate the issue and identify a work-around for uploading arbitrary directory contents, aiming to keep the solution somewhat within the existing framework. The work-around I have settled on is to use the gls.py helper file and explicitly start a "save to remote" call for the relevant sub-directories. As mentioned above, a single file in the sub-directory is referenced as the snakemake output.

Here is the example with work-around in place:

Snakemake version: 5.26.1

Command used to execute workflow:

snakemake --google-lifesciences --use-conda --default-remote-prefix bucket --google-lifesciences-region europe-west1 -p

The code is always run with the most recent version of all files committed into a github repository.

rule results:
    input:
        'data/test/1.txt'
    resources:
        machine_type = 'n1-standard-16', # vCPUs: 16, RAM: 60GB
        disk_mb      = 10000             # 10 GB


rule test_save:
    input:
        gtf               = expand("data/ref/{gtf}.gtf", gtf=config['gtf']),      # ignore the specifics, holdover from the original rule I modified
    output:
        first             = 'data/test/1.txt'
    resources:
        machine_type      = 'n1-standard-16', # vCPUs: 16, RAM: 60GB
        disk_mb           = 10000             # 10 GB
    log:
        'log/test_save.log',
    shell:
        """
        /bin/bash
        echo "Creating files" >&{log} 
        mkdir -p bucket/data/test/ && for i in $(seq 1 5); do echo $i > bucket/data/test/$i.txt; done && ls -l bucket_workflow/data/test >>{log} 2>&1
        echo "Downloading gls.py" >>{log} 2>&1
        wget -O /gls.py https://gist.githubusercontent.com/vsoch/f5a6a6d1894be1e67aa4156c5b40c8e9/raw/a4e9ddbeba20996ca62745fcd4d9ecd7bfa3b311/gls.py >>{log} 2>&1
        echo "Setting attributes of gls.py" >>{log} 2>&1
        chmod +x /gls.py >>{log} 2>&1
        echo "Activating snakemake virtual environment" >>{log} 2>&1
        echo "Done" >>{log} 2>&1
        set +eu
        source activate snakemake >>{log} 2>&1
        set -eu
        echo "Executing gls.py to save bucket/data/test" >>{log} 2>&1
        python /gls.py save bucket bucket/data/test data/test >>{log} 2>&1
        """

And the resulting log file:

Creating files
total 20
-rw-r--r-- 1 root root 2 Oct  9 13:24 1.txt
-rw-r--r-- 1 root root 2 Oct  9 13:24 2.txt
-rw-r--r-- 1 root root 2 Oct  9 13:24 3.txt
-rw-r--r-- 1 root root 2 Oct  9 13:24 4.txt
-rw-r--r-- 1 root root 2 Oct  9 13:24 5.txt
Downloading gls.py
--2020-10-09 13:24:51--  https://gist.githubusercontent.com/vsoch/f5a6a6d1894be1e67aa4156c5b40c8e9/raw/a4e9ddbeba20996ca62745fcd4d9ecd7bfa3b311/gls.py
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4071 (4.0K) [text/plain]
Saving to: ‘/gls.py’

     0K ...                                                   100% 49.9M=0s

2020-10-09 13:24:51 (49.9 MB/s) - ‘/gls.py’ saved [4071/4071]

Setting attributes of gls.py
Activating snakemake virtual environment
Executing gls.py to save data/test

The following files will be uploaded: data/test/3.txt
data/test/4.txt
data/test/5.txt
data/test/2.txt
data/test/1.txt
data/test/3.txt -> bucket/data/test/3.txt
Uploading data/test/3.txt to bucket/data/test/3.txt
data/test/4.txt -> bucket/data/test/4.txt
Uploading data/test/4.txt to bucket/data/test/4.txt
data/test/5.txt -> bucket/data/test/5.txt
Uploading data/test/5.txt to bucket/data/test/5.txt
data/test/2.txt -> bucket/data/test/2.txt
Uploading data/test/2.txt to bucket/data/test/2.txt
data/test/1.txt -> bucket/data/test/1.txt
Uploading data/test/1.txt to bucket/data/test/1.txt
Done
  • Changing the output in the above code to data/test reproduces the findings of @eric-czech
  • Without python /gls.py save bucket bucket/data/test data/test the workflow will complete successfully, but only the file bucket/data/test/1.txt will be present in GCS, not the other 4 files. This is possibly the expected behavior (I am unclear based on the discussion above), but one would want a way to track and save created files within certain directories without explicitly listing each output file.
  • I haven't completed tests with the hypothesized "create file as directory, remove, rename" issue, as it is circumvented by this work-around.

Errata:

  • This, and all other workflows I have tested, refuse to overwrite the Snakemake log files at bucket/bucket/google-lifesciences-logs/. They will write them the first time, or replace them if they are deleted, but will not overwrite existing files. Expected behavior is for the log files to be overwritten on subsequent runs. Log files for individual rules work as expected.
  • I am aware the entire command would normally be inside a /bin/bash -c "" but it was more legible this way and didn't appear to break anything downstream.
  • The set +eu command was to avoid this issue: [conda 4.6] activate fails due to unbound PS1 in bash strict mode conda/conda#8186

@aryarm
Member

aryarm commented Oct 10, 2020

@CowanCS1 I can confirm the bug with the log files not getting overwritten. We've experienced this ourselves, too.
But I didn't have the time to come up with a minimally reproducible example, so I never submitted a proper issue for it.
Also, I wasn't sure if it was a bug or just intended behavior.

@vsoch
Contributor

vsoch commented Oct 10, 2020

hey @CowanCS1! So it looks like you are copying the code that the worker runs for the step (downloading gls.py). Can you provide a basic example that generates some number of files with a directory prefix for the output and that fails? We need to work on one thing at a time, and there are many moving pieces in the Snakefile above.

@CowanCS1
Contributor

CowanCS1 commented Oct 11, 2020

Hi @vsoch ! Absolutely.

Snakemake version: 5.26.1

Rule:

# Return all files created in the sub-directory /data/test/
rule results:
    input:
        directory('data/test/')

# Run a command which creates an arbitrary set of files in /data/test/
rule minimal_test:
    resources:
        machine_type      = 'n1-standard-4', 
        disk_mb           = 10000           
    output:
        directory('data/test/')
    shell:
        '''
        mkdir -p {output}
        for i in $(seq 1 5); do echo $i > {output}/$i.txt; done
        '''

Output when executed with snakemake --google-lifesciences --use-conda --default-remote-prefix bucket --google-lifesciences-region europe-west1 -p:

ImproperOutputException in line 7 of /workdir/Snakefile:
      Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule minimal_test:

Purpose of rule: Intended to mimic any rule whose output is a collection of files stored within a new sub-directory.

Expected outcome: For all 5 files created within the bucket/data/test subdirectory within the GCP VM instance to be saved to bucket/data/test/ on GCS. When this rule is run locally with snakemake --cores 8 there are no errors, and the files are present.

Hopefully this is what you were requesting - there is a degree of uncertainty on my part, but I'm happy to help further as needed.

@CowanCS1
Contributor

CowanCS1 commented Oct 11, 2020

In case it is helpful, here is a function I wrote to try and test for the existence of directories on GCS. It still fails in the case that the directory exists as a placeholder only, containing neither files nor sub-directories containing files. That seems acceptable to me if documented.

def is_dir( fname ):
    blobs = list( bucket.list_blobs( prefix = fname, delimiter='/') )
    if len(blobs) == 1:
        print('%s is a file, not a directory'%fname)
        return False
    elif len(blobs) > 1:
        print('%s is a directory that contains files'%fname)
        return True
    else:
        blobs = list( bucket.list_blobs( prefix = fname, ) )
        if len(blobs) == 0:
            print('%s contains no files and may not exist'%fname)
            return False
        else:
            print('%s is a directory containing subdirectories which contain files'%fname)
            return True

Test commands using the same data structure from my minimal example immediately preceding this:

print( '%s\n'%is_dir( 'data/' ) );
print( '%s\n'%is_dir( 'data/test/' ) );
print( '%s\n'%is_dir( 'data/test/1.txt' ) );
print( '%s\n'%is_dir( 'data/test/qwerty' ) );

And the output:

data/ is a directory containing subdirectories which contain files
True

data/test/ is a directory that contains files
True

data/test/1.txt is a file, not a directory
False

data/test/qwerty contains no files and may not exist
False

@vsoch
Contributor

vsoch commented Oct 11, 2020

okay, so I'm on a worker node and reproducing the issue. The error is thrown by this function so I'm walking through it to see what happens.

Here is our job

job
# minimal_test

We define expanded output as a list:

expanded_output = []
for path in job.expanded_output:
    expanded_output.append(job.shadowed_path(path))
expanded_output
['snakemake-testing/testing-output/data/test']

That looks about right. I can also list that directory to confirm that it exists, and that there are files inside.

ls snakemake-testing/testing-output/data/test/
1.txt  2.txt  3.txt  4.txt  5.txt

I can verify that ignore_missing_output is not set, so we jump into the try/except (all is well, no error from wait_for_files).
So we should be getting to where the error is thrown; let's reproduce that.

f = expanded_output[0]
#  'snakemake-testing/testing-output/data/test'

This seems kind of off:

os.path.isdir(f)
True

f.is_directory
False

So the condition that triggers the exception is:

os.path.isdir(f) and not f.is_directory

I'm not sure if this is intentional, but let's explore how that is derived. It's a Snakemake IOFile:

type(f)
snakemake.io._IOFile

Ah, I think I know this bug - I fixed it last week but the fix hasn't been merged yet. When a remote is added, it strips away previous flags. To step back: each file object has flags, which are basically the wrappers you'd apply to some path in a Snakefile (e.g., touch, directory, etc.). But when an AnnotatedString (with the correct flags) gets parsed into the path of a remote object, the previous flags are stripped away (the PR I linked fixes this). The is_directory property
is just looking for the associated flag:

return is_flagged(self._file, "directory")

but it's not there.

self._file.flags
{'remote_object': <snakemake.remote.GS.RemoteObject at 0x7f35ec2c6be0>}

So to emulate how it would be with the fix, I'm going to artificially add the flag back, and then is_directory should be true.

self._file.flags['directory'] = True
self.is_directory
True

That should get us through that particular error (note the string below is not printed)

if (f.is_directory and not os.path.isdir(f)) or (
    ...:                 os.path.isdir(f) and not f.is_directory
    ...:             ):
    ...:     print("ERROR IS TRIGGERED")
# not printed

So I can apply the fixes from that PR to see if we get further. One quick note: in your first rule, I think "input" should be "output"? I changed the Snakefile to be:

# Return all files created in the sub-directory /data/test/
rule results:
    output:
        directory('data/test/')

# Run a command which creates an arbitrary set of files in /data/test/
rule minimal_test:
    resources:
        machine_type      = 'n1-standard-4', 
        disk_mb           = 10000           
    output:
        directory('data/test/')
    shell:
        '''
        mkdir -p {output}
        for i in $(seq 1 5); do echo $i > {output}/$i.txt; done
        '''

It still errors on the remote file upload, so let's look at that. I think the issue here is obvious: the upload is only handled for a regular local file (which test is not):

>>> self.local_file()
'snakemake-testing/testing-output/data/test'

Here is explicitly what that looks like:

>>> self.blob.upload_from_filename(self.local_file())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/google/cloud/storage/blob.py", line 2335, in upload_from_filename
    with open(filename, "rb") as file_obj:
IsADirectoryError: [Errno 21] Is a directory: 'snakemake-testing/testing-output/data/test'

So - now we probably need to think about design. I suspect this is done intentionally, because otherwise you could provide some random output directory and then get access to everything in it (which you maybe should not have access to). I am looking at other remotes and I don't see immediate support for walking a directory; maybe this is intentional too, since it wouldn't be relevant when running locally because there isn't any content to upload. What I can do is put in a PR (note that it would depend on the other one being merged to work fully) that walks over the directories in the path and uploads their contents to storage. When I add that, there is one more remaining issue:

Uploading to remote: snakemake-testing/testing-output/data/test
Finished upload.
WorkflowError:
The file does not seem to exist remotely: snakemake-testing/testing-output/data/test
  File "/workdir/snakemake/executors/__init__.py", line 566, in handle_job_success
  File "/workdir/snakemake/executors/__init__.py", line 243, in handle_job_success
  File "/workdir/snakemake/remote/GS.py", line 172, in mtime

okay, so now we're getting into hairy details. I could also update the exists function to use a file (any file found) as a proxy for existence, but then all the functions that use self.exists() as a boolean break - for example, here we can only get the timestamp of a blob if the blob exists. Since the blob itself is a directory, and doesn't have any kind of representation in storage, this throws an error. Any function in GS that requires metadata from the blob also throws an error. E.g.,

        if self.exists():
            self.update_blob()
            t = self.blob.updated
            return t.timestamp()
        else:
            raise WorkflowError(
                "The file does not seem to exist remotely: %s" % self.local_file()
            )

So here is the issue - what we would essentially be doing is hard-coding the remote to use the first file found as a proxy for the directory existing. That's exactly what the user can already do themselves: instead of defining an output directory, define the files! I am starting to agree that directory outputs should not be supported for object storage, and likely this was a decision already thought about and made. @johanneskoester it would be good to have your feedback here. Unless this is supported by other object stores (and there is a clear method to do it), my thinking is that directory support doesn't make sense for object storage.

@CowanCS1
Contributor

Thanks @vsoch - that is very useful. I can certainly appreciate that there are cascading implications for this issue.

The first rule in my minimal example could actually be removed - just my habit to put a placeholder rule at the top to define the final output of the workflow.

General thoughts:

  • I presume that the central aim is that Snakemake users should be able to point their workflow to different executors and have it "just work"
  • If it is decided that the directory() wrapper is not supported at all on remote, it would be useful to have clear error messages and a list in the docs of incompatible features and good alternatives, just to clearly communicate the ways in which workflows are not cross-compatible across executors.
  • While listing the outputs explicitly is a good alternative for a small number of files, there are cases where thousands of objects would be created. A more broadly applicable option would be to have the user tarball any directory tree they want to back up, make that tarball the output, and reverse the process if it is used as an input in a downstream rule (see the sketch after this list).
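
As an illustration of that tarball option, here is a minimal hypothetical sketch (rule names and paths are invented, not taken from the workflow above):

rule archive_test_dir:
    output:
        "data/test.tar.gz"
    shell:
        # archive whatever the upstream tool wrote into data/test/
        "tar -czf {output} data/test/"

rule use_test_dir:
    input:
        "data/test.tar.gz"
    output:
        "data/test_file_list.txt"
    shell:
        # reverse the process downstream: unpack, then work with the files
        "tar -xzf {input} && ls data/test/ > {output}"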

Briefly, I wanted to mention that there is a relatively simple fix to make your example above work as expected. There is an undocumented way to create GCS objects that function like directories and are compatible with API queries.

blob = bucket.blob('data/test/from_api/')
print(blob.exists())

blob.upload_from_string('', content_type='application/x-www-form-urlencoded;charset=UTF-8')
print(blob.exists())

Output:

False
True

Directories created in this fashion are list-able and can be browsed in cloud console as directories. If these directories are created when recursively uploading a directory tree, I think it would fix the last problem you listed above. For other remotes, this may not be possible - I can imagine how this could be a general issue with cloud storage. I'm not pushing for this to be actually applied, since it is undocumented and might still break if the user mixes in directories created by other methods, but it is an option I wanted to make you aware of.

@aryarm
Member

aryarm commented Oct 12, 2020

For some rules, it can be hard to know what the files will be called (or how many there will be) ahead of time. I think this was one of the original reasons for creating the directory() feature.

Couldn't the hidden .snakemake_timestamp file described in the documentation be used to deduce the existence and timestamp of the directory, instead?
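
For illustration, a rough sketch of how such a marker could be used on the GS side (this assumes the directory is uploaded together with its .snakemake_timestamp marker; the helper names are hypothetical, not the actual remote API):

def directory_exists(bucket, dirname):
    # treat the hidden marker blob as a proxy for the directory itself
    marker = dirname.rstrip("/") + "/.snakemake_timestamp"
    return bucket.blob(marker).exists()

def directory_mtime(bucket, dirname):
    # reuse the marker's update time as the directory's timestamp
    blob = bucket.get_blob(dirname.rstrip("/") + "/.snakemake_timestamp")
    return blob.updated.timestamp() if blob else None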

@vsoch
Contributor

vsoch commented Oct 12, 2020

@CowanCS1 that's a neat try! I'll try it out now.

@vsoch
Contributor

vsoch commented Oct 12, 2020

Actually I'm not sure I can use the snakemake cloud account, my instance is gone and I'll need to verify and set it up again. :(

@vsoch
Contributor

vsoch commented Oct 12, 2020

okay, so without permission to do that the most I can do is put in a PR with changes that might work, and then get reviewers. But it depends on the PR that I previously opened to fix the remote object flags so I'm not sure you'd be able to test without pulling that one too.

@vsoch
Contributor

vsoch commented Oct 12, 2020

okay, here is at least a start on what we talked about: #681. If you are able to test this, you first need #665. I don't have a current account I can use, so if anyone has that and the bandwidth to test, it would be greatly appreciated!

@CowanCS1
Contributor

I made an attempt at this, but I'm new to git repositories so let me document what I did and you can judge.

  • Installed the gh command line tool
  • gh repo clone snakemake/snakemake
  • cd into snakemake and gh pr checkout 665 then gh pr checkout 681
  • switched back to master git checkout master
  • made a new branch git checkout -b merge-665-681 and merged the PRs with git merge fix/touch | cat and git merge test/gls-directory-output | cat
  • executed mamba env create -f test-environment.yml -n test-sm-gcp2
  • activated test-sm-gcp2 and ran python setup.py develop
  • confirmed the relevant sections of code in snakemake/snakemake are modified per "fixing bug with using touch alongside remote prefix" #665 and "a suggested fix for directory support in Snakefile for GLS" #681

Then the test:
Snakemake version: 5.26.1+11.g2f63684a.dirty

Minimal example rule, further simplified by removing unnecessary first rule:

# Run a command which creates an arbitrary set of files in data/test/
rule minimal_test:
    resources:
        machine_type      = 'n1-standard-4', 
        disk_mb           = 10000           
    output:
        directory('data/test/')
    shell:
        '''
        mkdir -p {output}
        for i in $(seq 1 5); do echo $i > {output}/$i.txt; done
        '''

Workflow executed with: snakemake --google-lifesciences --use-conda --default-remote-prefix bucket --google-lifesciences-region europe-west1 -p

Unfortunately, we hit the error:

ImproperOutputException in line 2 of /workdir/Snakefile:
      Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule minimal_test:
      bucket/data/test
        File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 566, in handle_job_success
        File "/opt/conda/envs/snakemake/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 243, in handle_job_success
      Removing output files of failed job minimal_test since they might be corrupted:
      bucket/data/test
      Skipped removing non-empty directory bucket/data/test
      Shutting down, this might take some time.
      Exiting because a job execution failed. Look above for error message

Looks like the error you had expected to be resolved by #665 - I haven't walked through to confirm that the tags were in the same state.

I also tested the code added within #681 in isolation, and that appears to work as expected.

@vsoch
Contributor

vsoch commented Oct 13, 2020

The worker on GLS is not going to have access to the updated code; you'd need to build and push a container to a registry and use it with the --container-image flag.

@CowanCS1
Contributor

Makes sense 👍 Thanks!

If someone else doesn't get to it first, I will try at my next available opportunity.

@CowanCS1
Contributor

Tested!

The code completed without error, but the result was not as expected.

A "test" object appeared in the directory which looked like this:

f = 'data/te'
[i.name for i in bucket.list_blobs(prefix=f)]
['data/test']

I confirmed that the output of my rule is 'data/test/' so it looks like the trailing forward slash is being dropped. I also tested that the code you're using works when supplied with a trailing slash.

Object can be created with trailing slash

f = 'data/test/'
blob = bucket.blob(f)
blob.upload_from_string(
        "", content_type="application/x-www-form-urlencoded;charset=UTF-8"
    )
blob.exists()
True

The object appears with a trailing slash in the blob listing and as a directory in the cloud console. I retained the Snakemake-generated object 'data/test' for contrast.

f = 'data/te'
[i.name for i in bucket.list_blobs(prefix=f)]
['data/test', 'data/test/']

Since we want to cover the case where the user applies the directory() wrapper but doesn't include the trailing slash, we can handle both cases by appending it here.
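
Something along these lines, just for illustration (the helper name is made up):

def as_directory_key(key):
    # ensure a GCS "directory" placeholder key always ends with a slash
    return key if key.endswith("/") else key + "/"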

If you can update the PR, I can rebuild to test tomorrow. I am getting a crash course in git and containers this week, but haven't covered how to submit pull requests yet.

@vsoch
Contributor

vsoch commented Oct 13, 2020

@CowanCS1 that should be easy to do! And that's an interesting finding, because I expected the trailing slash not to really matter given that some files are produced in the folder. But perhaps not having a trailing slash produces a file, and then it's not possible to treat it further as a directory. Of course, now I'm concerned that adding the trailing slash will prevent creating it as what looks like an empty object. We will have to test!

@vsoch
Contributor

vsoch commented Oct 13, 2020

okay, so I updated the PR to ensure we upload with a trailing slash: 40f5395.

@CowanCS1
Contributor

Great! Many thanks.

In blob storage, the trailing slash seems to be interpreted as a directory by the console. It even works if you put data into it - I was a little afraid to try, but sure enough you can read data from the directory blobs just like the rest. That opens up some weird opportunities for hidden directory annotation.

I confirm there isn't a problem with creating one of these placeholder directories and writing files with the same directory prefix.

@CowanCS1
Contributor

@vsoch a basic question - this edit was a commit without a branch, whereas the last two were pull requests with a branch. Any tips on how I can download this similar to gh pr checkout 665 ?

@vsoch
Contributor

vsoch commented Oct 13, 2020

The commit is now represented in the same PR branch, so the easiest thing is to do the exact same set of steps as before. If you want to update the current combined branch that you put together (harder) then you'd want to pull again from the PR branch to grab the commit. You can use git log to peek and see what commits you've added, and of course visually check the code for a sanity check.

Git is weird and hard, and there are always weird operations like this one. You are doing great! :)

@CowanCS1
Contributor

CowanCS1 commented Oct 14, 2020

Progress!

Same test setup as yesterday, just rebuilding the Dockerfile with the updated PR #681.

Snakemake fails with the following message:

rule minimal_test:
            output: bucket/data/test
            jobid: 0
            resources: mem_mb=1000, disk_mb=10000, machine_type=n1-standard-4

        Uploading to remote: bucket/data/test
        Finished upload.
        WorkflowError:
        The file does not seem to exist remotely: bucket/data/test
          File "/opt/conda/envs/snakemake/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 566, in handle_job_success
          File "/opt/conda/envs/snakemake/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 249, in handle_job_success
          File "/opt/conda/envs/snakemake/lib/python3.6/site-packages/snakemake/remote/GS.py", line 173, in mtime
        Removing output files of failed job minimal_test since they might be corrupted:
        bucket/data/test
        Shutting down, this might take some time.

I can see a simple explanation for this in GCS. For clarity I'll describe what I see using the URI style gs:// notation.

The directory and all of the files it contains were uploaded to gs://bucket/bucket/data/test/ rather than gs://bucket/data/test/. The existence check then appears to be happening on gs://bucket/data/test/ and subsequently failing.

Here is a check of the GCS contents with the API:

[i.name for i in bucket.list_blobs( prefix = 'bucket/data/test/') ]
['bucket/data/test/',
 'bucket/data/test/.snakemake_timestamp',
 'bucket/data/test/1.txt',
 'bucket/data/test/2.txt',
 'bucket/data/test/3.txt',
 'bucket/data/test/4.txt',
 'bucket/data/test/5.txt']

Appears to be perfect, other than bucket being prepended.

Thanks @vsoch ! It has been a good experience. I need this for my own workflows, and I can attest to the lack of a simple work-around for certain workflows.

Edit 1: I admit to being a bit confused about why I saw that blob in a path which didn't have the bucket/ prepended to it yesterday - but I tested today that setting the output to either data/test/ or data/test works, with the same result described above.

Edit 2: I'm also seeing issues with file outputs, possibly related to #665 - I will confirm it works inside the container with the master branch, then test the PRs incrementally.

@CowanCS1
Contributor

After some more testing, I confirm that uploading individual files works exactly the same in the master branch and the master + #665 + #681 version. Both place files in gs://bucket/data/test/... and correctly validate that the files exist remotely at that location.

As stated in my previous post, the directories are handled differently for the master + #665 + #681 version. Files are uploaded to gs://bucket/bucket/data/test/... but the directory check appears to occur for gs://bucket/data/test/ which fails.

In order to test files and directories with the same rule, I modified it to no longer rely on the {output} but rather manually appended the local bucket prefix to the shell commands. Here is the modified rule:

# Run a command which creates an arbitrary set of files in data/test/
rule minimal_test:
    resources:
        machine_type      = 'n1-standard-4',
        disk_mb           = 10000
    output:
        directory('data/test/')
        #'data/test/1.txt'
    shell:
        '''
        mkdir -p 'bucket/data/test/'
        for i in $(seq 1 5); do echo $i > bucket/data/test/$i.txt; done
        '''

So the extra prepended bucket is specific to the handling of directories; possibly this

upload_blob = (
                    self.bucket.blob(f)
                    if f.endswith("/")
                    else self.bucket.blob(f + "/")
                )

should be this

upload_blob = (
                    self.blob(f)
                    if f.endswith("/")
                    else self.blob(f + "/")
                )

since the definition of blob is already

    @lazy_property
    def blob(self):
        return self.bucket.blob(self.key)

@vsoch
Contributor

vsoch commented Oct 14, 2020

Thanks for this detailed testing @CowanCS1 ! I really appreciate it because I would have tested it before opening the PR, but don't have an account to use at the moment. I'll update the "upload_blob" as you've specified.

@CowanCS1
Contributor

CowanCS1 commented Oct 15, 2020

Tested and found a few different issues.

  • First, I see now why you had been using self.bucket.blob(f) instead of self.blob(f). As shown in the definition I linked yesterday, blob(self) returns self.bucket.blob(self.key) - so the f filename argument was not being used in my version, in favor of the stored self.key value. Referencing self.bucket directly circumvents that, going through the storage client's bucket object, whose blob() method actually does take a filename parameter. It is confusing that blob behaves differently between the RemoteObject and the client classes - but there may be reasons for this related to other remote storage types. Summary: my fix didn't help.

  • Second, the above implementation of blob is actually what was causing the other problems of directories not being uploaded, or being uploaded to the wrong location. The wrong location for the created directories arises because we were setting f = self.local_file() and defining self.bucket.blob(f), which evaluated to client.bucket.blob(f). That is in contrast to when we wrote a file, where self.blob.upload_from_filename(f) was used; there self.blob evaluates to self.bucket.blob(self.key), and the buried self.key is where the prepended bucket name gets stripped before being attached to the blob.

  • Third, while I could fix the upload method based on these observations, it seems that the fundamental issue is actually with the definition of self.key, since it is also used during the check for file existence: update_blob also populates itself directly from self.key

def update_blob(self):
        self._blob = self.bucket.get_blob(self.key)

I tried a quick fix - moving the os.path.isdir(self.local_file()) conditional directly into the def key(self) method - but it turns out that this is called and cached very early, before the local directory exists. I am not yet familiar enough with this implementation of @lazy_property to know if a recompute of the cached value is possible.
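
For what it's worth, here is a generic sketch of the usual cache-in-__dict__ lazy property pattern (this is an assumption about how Snakemake's lazy_property might behave, not its actual code):

class lazy_property:
    """Non-data descriptor: computes the value once, then caches it on the instance."""

    def __init__(self, method):
        self.method = method
        self.name = method.__name__

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        value = self.method(obj)
        obj.__dict__[self.name] = value  # the instance attribute now shadows the descriptor
        return value

# If it works like this, a recompute could be forced by dropping the cached value:
# obj.__dict__.pop("key", None)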

I thought this would be a good point to stop and get your feedback, @vsoch. It seems like the best option would be to directly detect whether the user supplied the directory() wrapper for this output; the second-best option would be removing the lazy_property decorator from the key method.

I'm not sure what your situation is, but would you like me to explore whether it is possible to get you a GCP account?

@vsoch
Contributor

vsoch commented Oct 15, 2020

I've restored the PR back to use self.bucket.blob - I was fairly confident about this but thought you had tested what you proposed and verified that it worked. Could you please clarify in the future if you've done this or not? I think to help I can explain what I do - you can launch the branch with the changed PR first, and then inspect the stdout to get the command that is run to create the instance. Then you can create a dummy instance, install the proper version of snakemake, and follow the command to download and extract the working directory from storage. I like to have the working directory be the snakemake version I want to test so it's included and I can extract. Then I can run the step manually on the worker, and insert

import IPython
IPython.embed()

to interactively test. I usually do that a bunch of times in different spots until I figure it out.

I've updated the PR with upstream master (with the merged touch fix) so it should at least be easier to test.

I'm not sure what your situation is, but would you like me to explore whether it is possible to get you a GCP account?

I think @johanneskoester is working on it :)

@CowanCS1
Contributor

CowanCS1 commented Oct 16, 2020

Thanks for the explanation - I'll try to be extremely explicit about what testing I have done.

I have a version of GS.py which is tested and has improved, but incomplete, support for directories. This code is only for illustrative purposes and should not be included in a PR without further improvements and testing.

Selected code from my modified GS.py with improved directory support

    @retry.Retry(predicate=google_cloud_retry_predicate)
    def upload(self):
        try:
            if not self.bucket.exists():
                self.bucket.create()
                self.update_blob()

            # Distinguish between single file, and folder
            f = self.local_file()
            if os.path.isdir(f):

                # Ensure the "directory" exists
                self.blob.upload_from_string(
                    "", content_type="application/x-www-form-urlencoded;charset=UTF-8"
                )
                for root, _, files in os.walk(f):                  # walk over local files
                    for filename in files:
                        filename = os.path.join(root, filename)
                        p = pathlib.Path(filename).parts[1:]       # strips the bucket part of the filename from the local
                        filekey = str( pathlib.Path(*p) )          # creates a key pointing to the remote
                        
                        self.bucket.blob(filekey).upload_from_filename(filename)
            else:
                self.blob.upload_from_filename(f)
        except google.cloud.exceptions.Forbidden as e:
            raise WorkflowError(
                e,
                "When running locally, make sure that you are authenticated "
                "via gcloud (see Snakemake documentation). When running in a "
                "kubernetes cluster, make sure that storage-rw is added to "
                "--scopes (see Snakemake documentation).",
            )

    @lazy_property
    def key(self):
        """
        Placeholder for illustrative purposes, do not implement in production
        Currently: assumes anything with an extension is a file and anything without is a directory
        Better: test is_directory by checking whether the directory() wrapper was applied to the output
        Best: append the trailing slash to the filename at input if directory wrapper is applied. May break other code.
        """
        
        _, file_extension = os.path.splitext(self.local_file())
        return self.parse().group("key") if file_extension else self.parse().group("key") + "/"

When running this, the files are created in the correct place on remote. We still get an error, but it now appears to come from outside this rule.

Waiting at most 5 seconds for missing files.
MissingOutputException

I have checked the RemoteObject.exists() method and it is returning the correct value every time in every circumstance I tested. So seemingly there is a reference to the remote file which doesn't rely on the key property in the RemoteObject class.

I did a quick test to verify that the workflow outside of that rule is not seeing data/test/ because it is still looking for data/test on remote. I created a blob at data/test as a second "directory placeholder". The output switched to an IncompleteFilesException error so clearly it is trying to read from it, and by running snakemake once with --rerun-incomplete it now reports correctly "Nothing to be done". When I then deleted the data/test blob, it reverted to not seeing the output files.

The key points from my testing thus far are:

  • Walking the local directory to upload its contents needs to be accompanied by stripping the bucket name when creating the blob.
  • The cached key and blob lazy_properties are generated before the local file exists, so we cannot rely on its presence to append a /.
  • There is a check somewhere in the workflow which doesn't appear to rely on the RemoteObject.exists() or RemoteObject.key to check the output's existence on remote.

Seems like a good place to stop and get your advice @vsoch .

@vsoch
Contributor

vsoch commented Oct 27, 2020

hey @CowanCS1 ! We just merged a PR into master to address directories on GLS - would you care to test it out?

@vsoch
Contributor

vsoch commented Dec 21, 2020

Ping @CowanCS1 - have you had a chance to test? I was going to close the issue, but want to make sure you are good.

@hnawar
Contributor

hnawar commented May 8, 2023

@vsoch I'm having a similar error when using the VEP wrappers to download plugins or the cache (see below). I can see that there is a 0-byte object with the same name as the folder created in the bucket, which may explain the error above. In my case I don't see that error, but a later rule that uses the folder as input fails, since it copies the 0-byte object instead of the folder's contents.

rule get_vep_cache:
    output:
        directory("resources/vep/cache"),
    params:
        species=config["ref"]["species"],
        build=config["ref"]["build"],
        release=config["ref"]["release"],
    log:
        "logs/vep/cache.log",
    wrapper:
        "v1.29.0/bio/vep/cache"

rule get_vep_plugins:
    output:
        directory("resources/vep/plugins"),
    log:
        "logs/vep/plugins.log",
    params:
        release=config["ref"]["release"],
    wrapper:
        "v1.29.0/bio/vep/plugins"
