Optimize reference fetch #1371
Conversation
…me filename across multiple Pipelines or Steps in the same process. Removed conditional code in crds_client.py for supporting old versions of CRDS that did not include cache locking. Commented out explicit garbage collection in crds_client.py.
Added a first crack at init_multiprocessing() for CRDS, which may accelerate some multiprocessing use cases if called at the beginning of a session. These changes assume that dataset headers relevant to CRDS matching do not change from call to call.
…mize-reference-fetch
I think this achieves at least a 2:1 practical improvement in file I/O and CRDS performance within a single Step like assign_wcs. If name morphing between Steps were handled for the caching, the model work done solely for CRDS could potentially be reduced greatly, down to one pre-fetch per Pipeline rather than Pipeline + n Steps + m references. As it stands, I think this optimization reduces CRDS-related work to n Steps.
Exploited the evolution of step.py toward consistent use of data models as inputs to crds_client to simplify crds_client.get_multiple_reference_paths(). Added caching for core getreferences() results; this is effectively the same as calling crds_client less as a consequence of the Step/Pipeline mods, and it even eliminates theoretically fast calls to the CRDS bestrefs core in addition to data model header reads. Both data model opens / metadata reads and CRDS bestrefs calls are now reasonably minimized. Removed the duplicate _prefetch_reference_files() method from pipeline.Pipeline; it's equivalent to the Step superclass method.
…peline pre-fetch can support prefetches of smaller combinations of the same types.
Dropped init_multiprocessing() since it did not seem to have a big impact, isn't directly used, and contained its first refactoring error on day 2. For now, two nice changes are dropping "import gc" and the non-test instances of "from jwst import datamodels"... good dependencies to ditch.
This is a response to Jeb's performance issues when attempting to multi-process 5 or so MIRI associations concurrently. I believe the remaining CRDS performance issues were in data model I/O and garbage collection; this addresses both. It changes only the crds_client glue code.
The primary optimization is to cache the header parameters CRDS extracts from incoming data models and uses for bestrefs matching. Since the cache is keyed by filename, this scheme can be defeated by similar files with different names (pipeline progression?) or by datasets that change in ways that affect bestrefs matching (only the original header counts...). Nevertheless, it can potentially reduce CRDS-related data model file I/O by 3:1, maybe 4:1. It does not reduce Step file I/O, and it does not eliminate redundant but fast calls to CRDS bestrefs; it just caches parameters.
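The filename-keyed scheme described above can be sketched roughly as below; `cached_matching_header` and the `extract_header` callable are hypothetical names for illustration, not the actual jwst crds_client API.

```python
# Minimal sketch of a filename-keyed parameter cache (names are
# illustrative, not the real crds_client functions).
_header_cache = {}

def cached_matching_header(filename, extract_header):
    """Return the CRDS matching parameters for `filename`.

    The data model is opened and its header extracted only on the first
    call for a given filename.  Because the key is the name alone, a file
    whose contents change under the same name keeps returning the stale
    first header, and identical data under different names is re-read --
    the trade-offs noted above.
    """
    if filename not in _header_cache:
        _header_cache[filename] = extract_header(filename)
    return _header_cache[filename]
```

Repeat lookups for the same filename then cost a dict probe instead of a data model open.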
A secondary optimization was to remove two calls to gc.collect() for every bestrefs call. memory_profiler.profile() showed zero memory growth without them, and I've seen gc be surprisingly expensive, seconds at a time.
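A quick way to see why those collects mattered is to time a full collection in a process holding many tracked objects; this is a standalone measurement sketch, not code from the PR, and the object count is arbitrary.

```python
import gc
import time

def gc_collect_cost(n_objects=200_000):
    """Time one full gc.collect() after allocating many container objects.

    Collection time scales with the number of tracked containers, which
    is why two collects per bestrefs call can add up in a large pipeline
    process.
    """
    junk = [{"i": i} for i in range(n_objects)]  # gc-tracked containers
    start = time.perf_counter()
    gc.collect()
    elapsed = time.perf_counter() - start
    del junk
    return elapsed
```

In a short script the cost is small, but it grows with live-object count, so per-call collects in a long-running multi-association run are the pathological case.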
Removes handling for older versions of CRDS that don't have the crds_cache_locking module.
Additional caching avoids repeated equivalent calls to crds.getreferences(), strategically good even though those calls are currently cheap for most CRDS configurations.