Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remote GS builds too many inventories; io:collect_mtime always uses uncached mtime #2266

Merged
merged 10 commits into from Jun 30, 2023
Merged

Conversation

cademirch
Copy link
Contributor

@cademirch cademirch commented May 18, 2023

Description

This PR includes two small changes that drastically reduce inventory building time when using Google remote files, as well as perhaps unnecessary extra mtime checks.

The IOCache class uses ExistsDicts to quickly lookup if a file is in the cache or not. Specifically, in

snakemake/snakemake/io.py

Lines 119 to 122 in 6f602a3

def __contains__(self, path):
# if already in inventory, always return True.
parent = path.get_inventory_parent()
return parent in self.has_inventory or super().__contains__(path)
ExistsDict checks if a parent of the query file has been inventoried.

This is problematic, because when we build the inventory for a remote GS file, we add bucket/subfolder to has_inventory

cache.exists_remote.has_inventory.add("%s/%s" % (self.bucket_name, subfolder))

But only return self.bucket from get_inventory_parent
def get_inventory_parent(self):
return self.bucket_name

Which leads to many extra inventories being built. We fix this by returning self.bucket/subfolder.

The second commit in this PR I am less sure about. In IOCache , we build an mtime_inventory asynchronously . However, for reasons I'm not sure about, each async worker uses

snakemake/snakemake/io.py

Lines 174 to 175 in 6f602a3

async def collect_mtime(self, path):
return path.mtime_uncached
to get mtime info for a file. This is confusing to me as it explicitly uses mtime_uncached, which directly checks the mtime for a file, and for remote files this means expensive API calls. I'm not sure why mtime_uncached is used here, but it would be good to know if this is intentional.

Benchmarking these two changes shows some pretty impressive improvement.

Here is a simple workflow:

rule all:
    input: expand("results/{i}.txt", i=range(100))

rule make_file:
    output: "results/{i}.txt"
    shell: "echo {wildcards.i} > {output}"

Timed with GNU time three times.

Timing (no fix):

0:31.33 elapsed
0:35.06 elapsed
0:31.10 elapsed

Timing (with fix):

0:06.13 elapsed
0:06.12 elapsed
0:03.15 elapsed

The improvements are even more significant when testing with some of my larger workflows.

Would appreciate discussion/thoughts on this @johanneskoester and @vsoch and of course anyone else!

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

GS.py builds inventory and adds "self.bucket/subfolder" to cache.has_inventory. However, get_inventory_parent only returns self.bucket which leads to unnecessary extra inventory builds, because get_inventory_parent is called when checking if a inventory needs to be build for a file.
@vsoch
Copy link
Contributor

vsoch commented May 18, 2023

I would guess the timing difference comes down to using a cached value vs. requesting that one is not used. The second fix definitely looks like a bugfix (to include the dirname of the blob in the parent path). Did you test workflows without updating the mtime but with the second fix?

@cademirch
Copy link
Contributor Author

I would guess the timing difference comes down to using a cached value vs. requesting that one is not used. The second fix definitely looks like a bugfix (to include the dirname of the blob in the parent path). Did you test workflows without updating the mtime but with the second fix?

I did but I didn't record it, I can do that now. I'm curious why we would directly use the uncached mtime when building the mtime inventory?

@vsoch
Copy link
Contributor

vsoch commented May 18, 2023

I did but I didn't record it, I can do that now. I'm curious why we would directly use the uncached mtime when building the mtime inventory?

Mostly because a cached value may not be the most recent time it was modified. The question for snakemake is where / why would that matter? E.g., if we knew it was modified within the last day, is that enough, vs knowing it was an hour ago vs. three?

@cademirch
Copy link
Contributor Author

I did but I didn't record it, I can do that now. I'm curious why we would directly use the uncached mtime when building the mtime inventory?

Mostly because a cached value may not be the most recent time it was modified. The question for snakemake is where / why would that matter? E.g., if we knew it was modified within the last day, is that enough, vs knowing it was an hour ago vs. three?

Hmm I guess I'm confused about how IOCache is used then. I thought it's built every execution of snakemake so it should be the most up to date information.

@vsoch
Copy link
Contributor

vsoch commented May 18, 2023

Hmm I guess I'm confused about how IOCache is used then. I thought it's built every execution of snakemake so it should be the most up to date information.

Yeah - so we probably would not want a cached modified time, but I'll defer to @johanneskoester maybe that specificity doesn't matter as long as we can determine it was modified.

@cademirch
Copy link
Contributor Author

Hmm I guess I'm confused about how IOCache is used then. I thought it's built every execution of snakemake so it should be the most up to date information.

Yeah - so we probably would not want a cached modified time, but I'll defer to @johanneskoester maybe that specificity doesn't matter as long as we can determine it was modified.

I'm not sure I agree. The iocache is created in workflow.py then passed to the DAG. Where it's used and updated by target files to check existence and mtime of needed files. So the cache is always up to date relative to the start of the workflow.

@cademirch
Copy link
Contributor Author

@johanneskoester Any thoughts?

snakemake/remote/GS.py Outdated Show resolved Hide resolved
snakemake/io.py Outdated Show resolved Hide resolved
cademirch and others added 2 commits June 5, 2023 10:39
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
See #2284

Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
@cademirch
Copy link
Contributor Author

@johanneskoester Thanks, applied your suggestions. Should be ready to merge.

@sonarcloud
Copy link

sonarcloud bot commented Jun 30, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
1.2% 1.2% Duplication

@johanneskoester johanneskoester merged commit bad9115 into snakemake:main Jun 30, 2023
8 checks passed
johanneskoester pushed a commit that referenced this pull request Jul 21, 2023
🤖 I have created a release *beep* *boop*
---


##
[7.30.2](v7.30.1...v7.30.2)
(2023-07-20)


### Bug Fixes

* do not allow setting benchmark and between-workflow caching for the
same rule. The reason is that when the result is taken from cache, there
is no way to fill the benchmark file with any reasonable values.
([#2335](#2335))
([e2d64fa](e2d64fa))
* ensure lazy evaluation of resource functions/callables (this also
entails, for now, a removal of the thread statistics in the yellow job
stats table); further, added some clarifying sentences about resource
function evaluation to the docs
([#2356](#2356))
([4c591b7](4c591b7))
* handle non-PEP440 versions of apptainer/singulariy
([#2337](#2337))
([dea6ba8](dea6ba8))
* remote GS builds too many inventories; io:collect_mtime always uses
uncached mtime
([#2266](#2266))
([bad9115](bad9115))
* Solve apptainer version issue
([#2333](#2333))
([a876e0f](a876e0f))
* SyntaxWarnings due to non-raw regex pattern strings
([#2359](#2359))
([a08c0b0](a08c0b0))


### Documentation

* clarify minimum Snakemake version for profiles
([86dc277](86dc277))
* clarify the channel priority in environment definition deployment.rst
([#2352](#2352))
([76aa964](76aa964))
* fix typo (stackoverflow issue)
([#2365](#2365))
([f770984](f770984))
* note on using checkpoint mechanism only for input function, not for
params or resources.
([#2353](#2353))
([4be2f9d](4be2f9d))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants