
Optimize reference fetch #1371

Merged — 8 commits merged into spacetelescope:master on Oct 30, 2017

Conversation

jaytmiller (Contributor) commented Oct 25, 2017

This is a response to Jeb's performance issues when multi-processing 5 or so MIRI associations concurrently. I believe any remaining CRDS performance issues are in data model i/o or garbage collection, and this PR addresses both. Only the crds_client glue code changes.

The primary optimization is to cache the header parameters that CRDS extracts from incoming data models and uses for bestrefs matching. Since the cache is keyed by filename, the scheme can be defeated by similar files with different names (pipeline progression?) or by datasets whose headers change in ways that affect bestrefs matching (only the original header counts...). Nevertheless, it can potentially reduce CRDS data-model-related file i/o by 3:1, maybe 4:1. It does not reduce Step file i/o, and it does not eliminate redundant but fast calls to CRDS bestrefs; it just caches parameters.
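A minimal sketch of the filename-keyed scheme; the names here (`_HEADER_CACHE`, `_read_matching_parameters`, `get_cached_header`) are illustrative, not the actual crds_client internals:

```python
_HEADER_CACHE = {}

def _read_matching_parameters(filename):
    """Stand-in for the expensive data model open + header read that
    CRDS performs to obtain its bestrefs matching parameters."""
    raise NotImplementedError("illustrative placeholder")

def get_cached_header(filename):
    """Return CRDS matching parameters for `filename`, reading the data
    model only on the first call per distinct name.  Note the weakness
    described above: if the file's contents change under the same name,
    the stale cached header is returned."""
    if filename not in _HEADER_CACHE:
        _HEADER_CACHE[filename] = _read_matching_parameters(filename)
    return _HEADER_CACHE[filename]
```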

A secondary optimization was to remove two calls to gc.collect() for every bestrefs call. memory_profiler.profile() showed zero memory growth without them, and I've seen gc be surprisingly expensive, on the order of seconds.
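For reference, roughly how the zero-growth claim can be checked; `fetch_references()` is an illustrative stand-in for the crds_client code that used to bracket bestrefs with gc.collect():

```python
from memory_profiler import profile
import crds

@profile
def fetch_references(parameters, reftypes):
    # Formerly preceded and followed by gc.collect(); the line-by-line
    # memory report showed zero growth across this call, so the
    # explicit collections were dropped.
    return crds.getreferences(parameters, reftypes=reftypes)
```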

Removes handling for older versions of CRDS that don't have the crds_cache_locking module.

Adds caching to avoid repeated equivalent calls to crds.getreferences(), strategically worthwhile even though those calls are currently cheap for most CRDS configurations.
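A sketch of what memoizing getreferences() might look like; the cache structure and helper name are illustrative, not the exact crds_client implementation:

```python
import crds

_GETREFS_CACHE = {}

def cached_getreferences(parameters, reftypes, context=None):
    """Memoize crds.getreferences() on (parameters, reftypes, context).
    Freezing the parameters dict makes it hashable, and sorting the
    reftypes lets equivalent requests share one cache entry."""
    key = (frozenset(parameters.items()), tuple(sorted(reftypes)), context)
    if key not in _GETREFS_CACHE:
        _GETREFS_CACHE[key] = crds.getreferences(
            parameters, reftypes=list(reftypes), context=context)
    return _GETREFS_CACHE[key]
```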

…me filename

across multiple Pipelines or Steps in the same process.
Removed conditional code in crds_client.py for supporting old versions of CRDS
that did not include cache locking.
Commented out explicit garbage collection in crds_client.py.
code and highlight structure.
Added first crack at init_multiprocessing() for CRDS which may
accelerate some multiprocessing use cases if called at the beginning
of a session.
These changes assume that dataset headers relevant to CRDS matching
do not change from call-to-call.
jaytmiller (Contributor, Author) commented Oct 25, 2017

I think this does achieve at least a 2:1 practical improvement in file i/o and CRDS performance just within a single Step like assign_wcs. If name morphing between Steps were handled by the caching, the model work done solely for CRDS could potentially be reduced greatly, down to one pre-fetch per Pipeline rather than Pipeline + n-Steps + m-references. As it stands, I think this optimization reduces CRDS-related work to n-Steps.

Exploited the evolution of step.py towards consistent use of data models as inputs to
crds_client to simplify crds_client.get_multiple_reference_paths().

Added caching for core getreferences() results.  This is effectively the same
as calling crds_client less as a consequence of the Step/Pipeline mods; it even
eliminates theoretically fast calls to the CRDS bestrefs core in addition to
data model header reads.

Both data model opens / metadata reads and CRDS bestrefs calls are now reasonably
minimized.

Removed the duplicate _prefetch_reference_files() method from pipeline.Pipeline; it's
equivalent to the Step superclass method.
…peline pre-fetch

can support prefetches of smaller combinations of the same types.
Dropped init_multiprocessing() since it did not seem to have a big impact,
isn't directly used, and contained its first refactoring error on day 2.

For now, two nice changes are dropping "import gc" and the non-test
instances of "from jwst import datamodels"... good dependencies to ditch.
@jaytmiller jaytmiller merged commit 89a2424 into spacetelescope:master Oct 30, 2017
@jaytmiller jaytmiller deleted the optimize-reference-fetch branch January 12, 2018 16:03