Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: remote GS builds too many inventories; io:collect_mtime always u…
…ses uncached mtime (#2266) ### Description This PR includes two small changes that drastically reduce inventory building time when using Google remote files, as well as perhaps unnecessary extra mtime checks. The `IOCache` class uses [`ExistsDicts`](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#LL106C1-L122C16) to quickly lookup if a file is in the cache or not. Specifically, in https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L119-L122 `ExistsDict` checks if a `parent` of the query file has been inventoried. This is problematic, because when we build the inventory for a remote GS file, we add `bucket/subfolder` to `has_inventory` https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/remote/GS.py#L195 But only return `self.bucket` from `get_inventory_parent` https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/remote/GS.py#L199-L200 Which leads to many extra inventories being built. We fix this by returning `self.bucket/subfolder`. The second commit in this PR I am less sure about. In [`IOCache`](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L125) , we build an [`mtime_inventory` asynchronously ](https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L135-L172). However, for reasons I'm not sure about, each async worker uses https://github.com/snakemake/snakemake/blob/6f602a3cceba81b1c2281db3156f9a696bc50b6a/snakemake/io.py#L174-L175 to get `mtime` info for a file. This is confusing to me as it explicitly uses `mtime_uncached`, which directly checks the `mtime` for a file, and for remote files this means expensive API calls. I'm not sure why `mtime_uncached` is used here, but it would be good to know if this is intentional. Benchmarking these two changes shows some pretty impressive improvement. Here is a simple workflow: ``` rule all: input: expand("results/{i}.txt", i=range(100)) rule make_file: output: "results/{i}.txt" shell: "echo {wildcards.i} > {output}" ``` Timed with `GNU time` three times. Timing (no fix): ``` 0:31.33 elapsed 0:35.06 elapsed 0:31.10 elapsed ``` Timing (with fix): ``` 0:06.13 elapsed 0:06.12 elapsed 0:03.15 elapsed ``` The improvements are even more significant when testing with some of my larger workflows. Would appreciate discussion/thoughts on this @johanneskoester and @vsoch and of course anyone else! <!--Add a description of your PR here--> ### QC <!-- Make sure that you can tick the boxes below. --> * [x] The PR contains a test case for the changes or the changes are already covered by an existing test case. * [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake). --------- Co-authored-by: Johannes Köster <johannes.koester@uni-due.de> Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
- Loading branch information