New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: remote GS builds too many inventories; io:collect_mtime always uses uncached mtime #2266
Conversation
This reverts commit 3b90fa3.
GS.py builds inventory and adds "self.bucket/subfolder" to cache.has_inventory. However, get_inventory_parent only returns self.bucket which leads to unnecessary extra inventory builds, because get_inventory_parent is called when checking if a inventory needs to be build for a file.
I would guess the timing difference comes down to using a cached value vs. requesting that one is not used. The second fix definitely looks like a bugfix (to include the dirname of the blob in the parent path). Did you test workflows without updating the mtime but with the second fix? |
I did but I didn't record it, I can do that now. I'm curious why we would directly use the uncached mtime when building the mtime inventory? |
Mostly because a cached value may not be the most recent time it was modified. The question for snakemake is where / why would that matter? E.g., if we knew it was modified within the last day, is that enough, vs knowing it was an hour ago vs. three? |
Hmm I guess I'm confused about how IOCache is used then. I thought it's built every execution of snakemake so it should be the most up to date information. |
Yeah - so we probably would not want a cached modified time, but I'll defer to @johanneskoester maybe that specificity doesn't matter as long as we can determine it was modified. |
I'm not sure I agree. The iocache is created in workflow.py then passed to the DAG. Where it's used and updated by target files to check existence and mtime of needed files. So the cache is always up to date relative to the start of the workflow. |
@johanneskoester Any thoughts? |
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
See #2284 Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
@johanneskoester Thanks, applied your suggestions. Should be ready to merge. |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
🤖 I have created a release *beep* *boop* --- ## [7.30.2](v7.30.1...v7.30.2) (2023-07-20) ### Bug Fixes * do not allow setting benchmark and between-workflow caching for the same rule. The reason is that when the result is taken from cache, there is no way to fill the benchmark file with any reasonable values. ([#2335](#2335)) ([e2d64fa](e2d64fa)) * ensure lazy evaluation of resource functions/callables (this also entails, for now, a removal of the thread statistics in the yellow job stats table); further, added some clarifying sentences about resource function evaluation to the docs ([#2356](#2356)) ([4c591b7](4c591b7)) * handle non-PEP440 versions of apptainer/singulariy ([#2337](#2337)) ([dea6ba8](dea6ba8)) * remote GS builds too many inventories; io:collect_mtime always uses uncached mtime ([#2266](#2266)) ([bad9115](bad9115)) * Solve apptainer version issue ([#2333](#2333)) ([a876e0f](a876e0f)) * SyntaxWarnings due to non-raw regex pattern strings ([#2359](#2359)) ([a08c0b0](a08c0b0)) ### Documentation * clarify minimum Snakemake version for profiles ([86dc277](86dc277)) * clarify the channel priority in environment definition deployment.rst ([#2352](#2352)) ([76aa964](76aa964)) * fix typo (stackoverflow issue) ([#2365](#2365)) ([f770984](f770984)) * note on using checkpoint mechanism only for input function, not for params or resources. ([#2353](#2353)) ([4be2f9d](4be2f9d)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
This PR includes two small changes that drastically reduce inventory building time when using Google remote files, as well as perhaps unnecessary extra mtime checks.
The
IOCache
class usesExistsDicts
to quickly lookup if a file is in the cache or not. Specifically, insnakemake/snakemake/io.py
Lines 119 to 122 in 6f602a3
ExistsDict
checks if aparent
of the query file has been inventoried.This is problematic, because when we build the inventory for a remote GS file, we add
bucket/subfolder
tohas_inventory
snakemake/snakemake/remote/GS.py
Line 195 in 6f602a3
But only return
self.bucket
fromget_inventory_parent
snakemake/snakemake/remote/GS.py
Lines 199 to 200 in 6f602a3
Which leads to many extra inventories being built. We fix this by returning
self.bucket/subfolder
.The second commit in this PR I am less sure about. In
IOCache
, we build anmtime_inventory
asynchronously . However, for reasons I'm not sure about, each async worker usessnakemake/snakemake/io.py
Lines 174 to 175 in 6f602a3
mtime
info for a file. This is confusing to me as it explicitly usesmtime_uncached
, which directly checks themtime
for a file, and for remote files this means expensive API calls. I'm not sure whymtime_uncached
is used here, but it would be good to know if this is intentional.Benchmarking these two changes shows some pretty impressive improvement.
Here is a simple workflow:
Timed with
GNU time
three times.Timing (no fix):
Timing (with fix):
The improvements are even more significant when testing with some of my larger workflows.
Would appreciate discussion/thoughts on this @johanneskoester and @vsoch and of course anyone else!
QC
docs/
) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).