Optimize reference fetch #1371
Conversation
…me filename across multiple Pipelines or Steps in the same process. Removed conditional code in crds_client.py for supporting old versions of CRDS that did not include cache locking. Commented out explicit garbage collection in crds_client.py.
Added a first crack at init_multiprocessing() for CRDS, which may accelerate some multiprocessing use cases if called at the beginning of a session. These changes assume that dataset headers relevant to CRDS matching do not change from call to call.
…mize-reference-fetch
I think this achieves at least a 2:1 practical improvement in file I/O and CRDS performance within a single Step like assign_wcs. If name morphing between Steps were handled for the caching, the model work done solely for CRDS could potentially be reduced greatly, down to one pre-fetch per Pipeline rather than Pipeline + n Steps + m references. As it stands, I think this optimization reduces CRDS-related work to n Steps.
Exploited the evolution of step.py toward consistent use of data models as inputs to crds_client to simplify crds_client.get_multiple_reference_paths(). Added caching for core getreferences() results; this is effectively the same as calling crds_client less as a consequence of the Step/Pipeline mods, and it even eliminates theoretically fast calls to the CRDS bestrefs core in addition to data model header reads. Both data model opens / metadata reads and CRDS bestrefs calls are now reasonably minimized. Removed the duplicate _prefetch_reference_files() method from pipeline.Pipeline; it's equivalent to the Step superclass method.
…peline pre-fetch can support prefetches of smaller combinations of the same types.
Dropped init_multiprocessing() since it did not seem to have a big impact, isn't directly used, and contained its first refactoring error on day 2. For now, two nice changes are dropping "import gc" and the non-test instances of "from jwst import datamodels"... good dependencies to ditch.
This is a response to Jeb's performance issues when attempting to multi-process 5 or so MIRI associations concurrently. I believe the remaining CRDS performance issues were in data model I/O and garbage collection; this addresses both. It changes only the crds_client glue code.
The primary optimization is to cache the header parameters CRDS extracts from incoming data models and uses for bestrefs matching. Since the cache is keyed by filename, this scheme can be defeated by similar files with different names (pipeline progression?) or by datasets that change in ways that affect bestrefs matching (only the original header counts...). Nevertheless, it can potentially reduce CRDS-related data model file I/O by 3:1, maybe 4:1. It does not reduce Step file I/O, and it does not eliminate redundant but fast calls to CRDS bestrefs; it just caches parameters.
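The filename-keyed scheme described above can be sketched roughly as below; `cached_matching_header` and the `extract_header` callable are hypothetical names for illustration, not the actual jwst crds_client API.

```python
# Minimal sketch of a filename-keyed parameter cache (names are
# illustrative, not the real crds_client functions).
_header_cache = {}

def cached_matching_header(filename, extract_header):
    """Return the CRDS matching parameters for `filename`.

    The data model is opened and its header extracted only on the first
    call for a given filename.  Because the key is the name alone, a file
    whose contents change under the same name keeps returning the stale
    first header, and identical data under different names is re-read --
    the trade-offs noted above.
    """
    if filename not in _header_cache:
        _header_cache[filename] = extract_header(filename)
    return _header_cache[filename]
```

Repeat lookups for the same filename then cost a dict probe instead of a data model open.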
A secondary optimization was to remove two calls to gc.collect() for every bestrefs call. memory_profiler.profile() showed zero memory growth without them, and I've seen gc be surprisingly expensive, seconds at a time.
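A quick way to see why those collects mattered is to time a full collection in a process holding many tracked objects; this is a standalone measurement sketch, not code from the PR, and the object count is arbitrary.

```python
import gc
import time

def gc_collect_cost(n_objects=200_000):
    """Time one full gc.collect() after allocating many container objects.

    Collection time scales with the number of tracked containers, which
    is why two collects per bestrefs call can add up in a large pipeline
    process.
    """
    junk = [{"i": i} for i in range(n_objects)]  # gc-tracked containers
    start = time.perf_counter()
    gc.collect()
    elapsed = time.perf_counter() - start
    del junk
    return elapsed
```

In a short script the cost is small, but it grows with live-object count, so per-call collects in a long-running multi-association run are the pathological case.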
Removes handling for older versions of CRDS that don't have the crds_cache_locking module.
Additional caching avoids repeated equivalent calls to crds.getreferences(), strategically good even though those calls are currently cheap for most CRDS configurations.