fix: remote GS builds too many inventories; io:collect_mtime always uses uncached mtime (#2266)

### Description
This PR includes two small changes that drastically reduce inventory
building time when using Google Storage (GS) remote files and avoid what
appear to be unnecessary extra mtime checks.

The `IOCache` class uses
[`ExistsDict`](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#LL106C1-L122C16)
to quickly look up whether a file is in the cache. Specifically, in
https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L119-L122
`ExistsDict` checks whether a `parent` of the queried file has already been
inventoried.
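
As a rough sketch (simplified; the exact code in `snakemake/io.py` may differ, and `FakePath` is a made-up stand-in for illustration), the parent check behaves roughly like this:
```
# Simplified sketch of the ExistsDict parent check; not the verbatim snakemake
# code. has_inventory and get_inventory_parent are the names referenced in the
# PR text; FakePath is a hypothetical stand-in for an IOFile.
class ExistsDict(dict):
    def __init__(self):
        super().__init__()
        # Parents (e.g. "bucket/subfolder") for which a full inventory was built.
        self.has_inventory = set()

    def __contains__(self, path):
        # If a parent of the path has already been inventoried, answer the
        # existence query from the cache instead of touching storage again.
        parent = path.get_inventory_parent()
        return parent in self.has_inventory or super().__contains__(path)

class FakePath:
    def __init__(self, parent):
        self._parent = parent

    def get_inventory_parent(self):
        return self._parent

cache = ExistsDict()
cache.has_inventory.add("bucket/subfolder")
print(FakePath("bucket/subfolder") in cache)  # True: parent already inventoried
print(FakePath("bucket") in cache)            # False: this parent was never registered
```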

This is problematic because, when we build the inventory for a remote GS
file, we add `bucket/subfolder` to `has_inventory`:
https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/remote/GS.py#L195
but `get_inventory_parent` only returns `self.bucket_name`:
https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/remote/GS.py#L199-L200

This mismatch leads to many extra inventories being built. We fix it by
returning `bucket/subfolder` (i.e. `f"{self.bucket_name}/{os.path.dirname(self.blob.name)}"`) instead.
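
To make the failure mode concrete, here is a toy illustration in plain Python (not snakemake internals; the keys and file names are made up): before the fix, every lookup asks for the parent `bucket` while the inventory was registered under `bucket/subfolder`, so the check never hits and an inventory is rebuilt for every file.
```
# Toy illustration of the mismatch (plain Python, not snakemake code).
has_inventory = set()
inventory_builds = 0

def build_inventory(parent):
    # In the real code this is an expensive remote listing of the bucket prefix.
    global inventory_builds
    has_inventory.add(parent)
    inventory_builds += 1

files = [f"bucket/subfolder/results/{i}.txt" for i in range(100)]

# Before the fix: inventory() registers "bucket/subfolder", but every lookup
# asks for the parent "bucket", so the check never hits and we rebuild per file.
for _ in files:
    if "bucket" not in has_inventory:
        build_inventory("bucket/subfolder")
print(inventory_builds)  # 100

# After the fix: get_inventory_parent() returns "bucket/subfolder" as well,
# so only the first lookup triggers a build.
has_inventory.clear()
inventory_builds = 0
for _ in files:
    if "bucket/subfolder" not in has_inventory:
        build_inventory("bucket/subfolder")
print(inventory_builds)  # 1
```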


I am less sure about the second commit in this PR. In
[`IOCache`](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L125),
we build an [`mtime_inventory` asynchronously](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L135-L172).
However, for reasons I am not sure about, each async worker uses
https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L174-L175
to get `mtime` info for a file. This is confusing to me because it explicitly
uses `mtime_uncached`, which checks the `mtime` of a file directly, and for
remote files this means expensive API calls. I am not sure why
`mtime_uncached` is used here, but it would be good to know whether this is
intentional.
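
For context, below is a self-contained sketch of the pattern described above; `mtime_uncached` is the name from the PR text, while `FakeFile` and the exact shape of the worker loop are assumptions for illustration only. It simply shows that one uncached lookup, and hence one remote API call, happens per file.
```
# Sketch of the async mtime collection; not the verbatim snakemake code.
import asyncio

class FakeFile:
    """Stand-in for an IOFile; counts how often the expensive lookup happens."""
    api_calls = 0

    def __init__(self, mtime):
        self._mtime = mtime

    @property
    def mtime_uncached(self):
        # In snakemake this would stat the file or, for remote files, call the
        # remote storage API.
        FakeFile.api_calls += 1
        return self._mtime

async def collect_mtime(path):
    # Mirrors the behavior described above: always use the uncached lookup,
    # which for remote files means one API call per file.
    return path.mtime_uncached

async def mtime_inventory(files):
    # One async worker per file, roughly like the linked gather-style loop.
    return await asyncio.gather(*(collect_mtime(f) for f in files))

files = [FakeFile(mtime=i) for i in range(100)]
asyncio.run(mtime_inventory(files))
print(FakeFile.api_calls)  # 100 expensive lookups, one per file
```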

Benchmarking these two changes shows some pretty impressive improvements.

Here is a simple workflow:
```
rule all:
    input: expand("results/{i}.txt", i=range(100))

rule make_file:
    output: "results/{i}.txt"
    shell: "echo {wildcards.i} > {output}"
```
Each configuration was timed three times with GNU `time`.

Timing (no fix):
```
0:31.33 elapsed
0:35.06 elapsed
0:31.10 elapsed
```

Timing (with fix):
```
0:06.13 elapsed
0:06.12 elapsed
0:03.15 elapsed
```

The improvements are even more significant when testing with some of my
larger workflows.

Would appreciate discussion/thoughts on this, @johanneskoester and @vsoch,
and of course anyone else!


### QC

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes, or
this is not necessary (e.g. if the change neither modifies the language
nor the behavior or functionalities of Snakemake).

---------

Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
3 people committed Jun 30, 2023
1 parent e2d64fa commit bad9115
1 changed file: `snakemake/remote/GS.py` (1 addition, 1 deletion)
```
@@ -194,7 +194,7 @@ async def inventory(self, cache: snakemake.io.IOCache):
     # === Implementations of abstract class members ===

     def get_inventory_parent(self):
-        return self.bucket_name
+        return f"{self.bucket_name}/{os.path.dirname(self.blob.name)}"

     @retry.Retry(predicate=google_cloud_retry_predicate)
     def exists(self):
```
