CRDS wait causes hung pipeline #1301

Closed
baileyji opened this issue Oct 12, 2017 · 46 comments

@baileyji

In a fresh conda environment, CRDS will not continue and the reduction hangs.
2017-10-12 08:46:26,891 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-12 08:46:26,893 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
^C---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
in ()
----> 1 runCubePipe(arglist[0])

in runCubePipe((f, args))
181
182 def runCubePipe((f, args)):
--> 183 MRSCubePipeline(steps={'cube_build': args}).run(f)
184
185

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/step.pyc in run(self, *args)
343 if (len(args) and len(self.reference_file_types) and not self.skip
344 and self.prefetch_references):
--> 345 self._precache_reference_files(args[0])
346
347 self.log.info(

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/pipeline.pyc in _precache_reference_files(self, input_file)
168 try:
169 with datamodels.open(input_file) as model:
--> 170 super(Pipeline, self)._precache_reference_files_opened(model)
171 except (ValueError, TypeError, IOError):
172 self.log.info(

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/step.pyc in _precache_reference_files_opened(self, model_or_container)
590 # recurse on each contained model
591 for contained_model in model_or_container:
--> 592 self._precache_reference_files_opened(contained_model)
593 else:
594 # precache a single model object

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/step.pyc in _precache_reference_files_opened(self, model_or_container)
593 else:
594 # precache a single model object
--> 595 self._precache_reference_files_impl(model_or_container)
596
597 def _precache_reference_files_impl(self, model):

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/step.pyc in _precache_reference_files_impl(self, model)
611 fetch_types = sorted(set(self.reference_file_types) - set(ovr_refs.keys()))
612
--> 613 crds_refs = crds_client.get_multiple_reference_paths(model, fetch_types)
614
615 ref_path_map = dict(list(crds_refs.items()) + list(ovr_refs.items()))

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/jwst-0.7.8rc3.dev10-py2.7-macosx-10.7-x86_64.egg/jwst/stpipe/crds_client.pyc in get_multiple_reference_paths(input_file, reference_file_types)
99 try:
100 if crds_cache_locking is not None:
--> 101 with crds_cache_locking.get_cache_lock():
102 bestrefs = crds.getreferences(data_dict, reftypes=reference_file_types, observatory="jwst")
103 else:

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/lockfile/__init__.pyc in __enter__(self)
195 Context manager support.
196 """
--> 197 self.acquire()
198 return self
199

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/crds/core/crds_cache_locking.pyc in acquire(self, *args, **keys)
66 def acquire(self, *args, **keys):
67 log.verbose("Acquiring lock", repr(self))
---> 68 result = super(CrdsLockFile, self).acquire(*args, **keys)
69 log.verbose("Lock acquired", repr(self))
70 return result

/Users/one/anaconda2/envs/jwst/lib/python2.7/site-packages/lockfile/linklockfile.pyc in acquire(self, timeout)
48 raise AlreadyLocked("%s is already locked" %
49 self.path)
---> 50 time.sleep(timeout is not None and timeout / 10 or 0.1)
51 else:
52 # Link creation succeeded. We're good to go.

KeyboardInterrupt:

@stscieisenhamer
Collaborator

One thing to check for: In /tmp, look for anything that has crds in the name and remove it. There may also be a set of files with the system name in it; remove those also.
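
A minimal Python sketch of that cleanup, assuming the locks live in /tmp and have "crds" in their names (adjust the glob if yours are named differently):

import glob
import os

# Remove anything in /tmp that looks like a stale CRDS lock file.
for path in glob.glob("/tmp/*crds*"):
    print("removing", path)
    os.remove(path)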

@baileyji
Author

Yes, I tried removing the lock file (I didn't see any others), but the problem persists.

@stscieisenhamer
Collaborator

@jaytmiller thoughts?

@jaytmiller
Contributor

Locking should not be used at all in the pipeline so I'm not sure why it's being used. So it's either a bug in locking or a missing env definition, almost definitely the latter.

In whatever Python environment this is running, I need to see:

$ printenv

A pointer to the log file might also help, both a hostname and the filepath.

@baileyji
Author

The pipeline didn't save a log.
printenv:
this is in a fresh env from "conda create -n jwst -c http://ssb.stsci.edu/conda-dev jwst"

TMPDIR=/var/folders/3k/nfshy8wx3gzds_86lyr9w3jw0000gn/T/
XPC_FLAGS=0x0
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.EZYvmHn42b/Render
TERM=xterm-256color
LANG=en_US.UTF-8
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.psaMgASpuR/Listeners
DISPLAY=/private/tmp/com.apple.launchd.krBkKOQQRq/org.macosforge.xquartz:0
SECURITYSESSIONID=188ec
XPC_SERVICE_NAME=0
TERM_PROGRAM=Apple_Terminal
TERM_PROGRAM_VERSION=400
TERM_SESSION_ID=E89B8626-D1AD-40CA-ADAC-8D2411279C63
SHELL=/bin/zsh
HOME=/Users/one
LOGNAME=one
USER=one
PATH=/Users/one/anaconda2/envs/jwst/bin:/Users/one/anaconda2/bin:/Users/one/CloudStation/miri/DHAS/MIRI_DHAS-9.4.2/MPipeline/DAT:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin
SHLVL=1
PWD=/Users/one/Cloudstation/miri
OLDPWD=/Users/one
PAGER=less
LESS=-R
LC_CTYPE=en_US.UTF-8
LSCOLORS=Gxfxcxdxbxegedabagacad
ANACONDA_PATH=/Users/one/anaconda2/bin
YSASPYTHON=/Users/one/YSAS
PYTHONPATH=:/Users/one/YSAS
MIRI_DIR=/Users/one/CloudStation/miri/DHAS/MIRI_DHAS-9.4.2/
CDP_DIR=/Users/one/MIRICLE/cdp/
IDL_PATH=/Applications/exelis/idl84/bin
PYSYN_CDBS=/Users/one/MIRICLE/cdbs/
CRDS_SERVER_URL=https://jwst-crds.stsci.edu
CRDS_PATH=/Users/one/MIRICLE/crds/
CRDS_CONTEXT=jwst_0401.pmap
CONDA_PATH_BACKUP=/Users/one/anaconda2/bin:/Users/one/CloudStation/miri/DHAS/MIRI_DHAS-9.4.2/MPipeline/DAT:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/opt/X11/bin
CONDA_PS1_BACKUP=${ret_status} %{$fg[cyan]%}%c%{$reset_color%} $(git_prompt_info)
CONDA_DEFAULT_ENV=jwst
CONDA_PREFIX=/Users/one/anaconda2/envs/jwst
PS1=(jwst) ${ret_status} %{$fg[cyan]%}%c%{$reset_color%} $(git_prompt_info)
_=/usr/bin/printenv

@jaytmiller
Contributor

jaytmiller commented Oct 12, 2017

crds_jwst_serverless_pipeline_env.csh.txt

In the pipeline environment, I was assuming the attached script is sourced as part of setup to configure CRDS prior to a calibration run. Some of the needed settings are not being defined, so CRDS locking is occurring when it should not, and transparent CRDS cache updates are occurring when they should not.

This is a little tricky because CRDS needs two configurations:

  1. When the pipeline is running massively parallel jobs doing calibrations
  2. When an operator is syncing the pipeline's CRDS cache to the server

(1) has to have a READONLY cache or multiple pipeline calibrations will update the CRDS cache concurrently and cache updates will be a point of multiprocessing contention. In the pipeline, CRDS assumes it has a "complete" cache and no automatic syncing will be occurring.

(2) has to have a writable cache; it's when the "complete" cache is updated to stay "complete".

I think the necessary environment settings for (2) are built into the CRDS sync wrapper script which installs CRDS in the pipeline. (It is an SDP script used by operators).

The attached script is intended to handle (1).

@baileyji
Author

So I typically spawn 3-6 processes and run using multiprocessing. Does this script also work with rc1 and before?

@jaytmiller
Contributor

Yeah, this is pretty old. I'm not sure when we codified it as this script, but these settings have really been needed since at least B5, which I think was the first time we did careful/formal I&T for CRDS. The script itself may not be used, since it's an integration issue, but what's in it needs to happen somehow.

@baileyji
Author

So I should be good to do multiprocessing if I just change my environment activation script to have:

export PYSYN_CDBS=/Users/one/MIRICLE/cdbs/
export CDP_DIR=/Users/one/MIRICLE/cdp/
export CRDS_PATH=/Users/one/MIRICLE/crds/
export CRDS_CONTEXT='jwst_0401.pmap'
export CRDS_SERVER_URL="https://jwst-serverless-mode.stsci.edu"
export CRDS_OBSERVATORY='jwst'
export CRDS_READONLY_CACHE=1
export CRDS_LOG_TIME=1
export CRDS_CLIENT_RETRY_COUNT=1
export CRDS_CLIENT_RETRY_DELAY_SECONDS=0

@jaytmiller
Contributor

I may have misunderstood; I'm in the middle of a DMS telecon, so I assumed you were DMS and this was B7.1 I&T. For an end user processing a few datasets or calibrations with multiprocessing in multiple windows, there may be issues with locking. The configuration I just gave you is for the archive pipeline. As an end user, using a readonly cache would block transparent download of reference files.

@jaytmiller
Contributor

Are you a developer/end-user or operating the pipeline?

@baileyji
Author

I'm in a weird intermediate position, working on development and testing on the MIRI end of things in Europe. My workflow uses the MIRI instrument simulator to create fake data and then the STScI pipeline to try to reduce that data, to test both the simulator and the pipeline.

Typically I run in a terminal window.

@jaytmiller
Contributor

As part of "running the pipeline", are you also using a "complete" CRDS cache, or are you trying to do demand based reference file downloads? (So, a complete cache would mean running the CRDS sync tool or perhaps some kind of web-suck before calibrating, and pre-downloading all the rules and references (400-500G). Then running readonly as I described.)

If you're using demand-based / dynamic caching, are you sure massive downloads are not occurring while your process is blocked? (It may/would help us debug this if you set CRDS_VERBOSITY=50 before your run, that will add extra logging.)

@baileyji
Author

I'm doing it demand-based. But the behavior persists on rerunning the pipeline on the same dataset. What I mean is that my local cache is totally complete (it should be, anyway), because I can do things like run the analysis, delete the output files, then run again and get the same behavior. Right now the only workaround we've got is to download all the relevant MIRI files, which, while not 0.5 TB, is still too much for my poor old laptop.

@jaytmiller
Contributor

Forgot to ask: what OS is this?

@baileyji
Author

macOS Sierra & High Sierra

@jaytmiller
Contributor

As a possible workaround, there is a tool from HST that may help here. With your dynamic configuration, try something like this:

$ crds bestrefs --files ./*.fits --sync-references=1 --new-context jwst-operational

as a way of serially pre-caching references for *.fits.

Hopefully once they are pre-cached, the cache aspect for further pre-cache attempts will take over and reduce them to a no-op.
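
If staying in Python is easier, roughly the same serial pre-cache can be done with the API call the pipeline itself uses (crds.getreferences, visible in the traceback above). A hedged sketch; the header dict is copied from the Bestrefs header debug output that appears later in this thread:

import crds

# Matching parameters for one MIRI MRS exposure.
header = {
    "META.INSTRUMENT.NAME": "MIRI",
    "META.INSTRUMENT.DETECTOR": "MIRIFUSHORT",
    "META.INSTRUMENT.BAND": "SHORT",
    "META.INSTRUMENT.CHANNEL": "12.0",
    "META.INSTRUMENT.FILTER": "UNDEFINED",
    "META.EXPOSURE.TYPE": "MIR_MRS",
    "META.EXPOSURE.READPATT": "FAST",
    "META.OBSERVATION.DATE": "2019-01-01",
    "META.OBSERVATION.TIME": "17:00:00",
    "META.SUBARRAY.NAME": "FULL",
}

# Downloads (if not already cached) and returns local cache paths for these types.
refs = crds.getreferences(header, reftypes=["cubepar", "resol"], observatory="jwst")
print(refs)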

@jaytmiller
Contributor

Do you know what your RAM utilization is during this run?

@jaytmiller
Contributor

How many gigabytes of non-reference data are involved with this test? Any chance you can share that so we can try to reproduce this here?

Are you actually running a copy of the DMS pipeline or just CAL code from SCSB?

Is the above program output all there is or are you running 3-6 of those in parallel?

@baileyji
Author

How does the bestrefs command work? It's come up in an email between us before, and I think in the end we came to the conclusion it wouldn't solve the issue, because everything was cached and because I'm working from an interactive Python terminal.

My RAM pressure is negligible; I can easily spawn 6 instances of the reduction on my mid-2015 MacBook Pro (16 GB) and have enough headroom for other work.

Here is a gist of the code: https://gist.github.com/baileyji/fad112e42e91dade21678aba4e3eed8a
e.g. line 228, or executing the non-main portion and then snippets of the main bit as necessary and calling runCubePipe(arglist[0]).

I get the slow behavior regardless of whether I use a pool or do a serial for x in arglist: runCubePipe(x). The email I sent Howard is a log excerpt from doing it in a for loop.

@jaytmiller
Contributor

crds bestrefs is used by HST to assign reference files to dataset FITS headers prior to running calibrations as an extra independent step. The calibrations just refer to the headers. It also supports the --sync-references=1 switch to download and cache the references it assigns as a bonus feature not used by the pipeline.

So... crds bestrefs can be run on all your JWST datasets and will serially download your references outside the scope of your multi-processing runs. When your code is finally run, your references should already be cached, and hopefully that will ameliorate the situation. This may take a little work but it should take about 5 minutes to get started and will only require the minimum space.

Another quick cheat, a temporary workaround, would be to set Pool(1). It'd be interesting to know if that does complete. And, if it does, afterward are you able to work with your original multiprocessing plan and Pool(N > 1)?
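
An untested sketch of that experiment, borrowing the runCubePipe and arglist names from the gist above:

from multiprocessing import Pool

# Single-worker pool: same code path as Pool(N), but no concurrency.
pool = Pool(1)
pool.map(runCubePipe, arglist)
pool.close()
pool.join()

# If that completes, retry the original plan with Pool(N > 1), e.g. Pool(4).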

@baileyji
Author

So how is bestrefs different from:
for x in arglist: runCubePipe(x)
exit
delete output
run for x in arglist: runCubePipe(x) again?
Everything has already been downloaded, and no pool is involved.

@jaytmiller
Contributor

OK, it hasn't been sinking in that you've already cached everything; sorry about that. If you have the files already, you're right, it's no different; all crds bestrefs would do is get them. (Howard forwarded a discussion where your files were taking a long time to download/pre-cache, so I may have conflated that with this.)

Since you've got the files already, this (as the archive pipeline does) may help:

export CRDS_READONLY_CACHE=1

That should prevent cache modifications and short circuit the locking which may be failing for some reason.

@baileyji
Author

baileyji commented Oct 13, 2017

I don't think this is a good solution (or even a good workaround). This issue with CRDS has been going on for some time and we really need to get to the root of it, especially as the various context changes will mean that we need to refresh files periodically.

What can I do to generate more useful debugging info?

@jaytmiller
Contributor

There is CRDS_USE_LOCKING=0, which has more flexibility for switching back and forth between syncing serially and calibrating, since the cache can remain writable. This can also be tried now to get you going.

You should be able to set CRDS_VERBOSITY=50 and get more info, particularly for locking. That should provide some trace of acquire and release. With this kind of thing, potentially it could even make it "work" by changing the timing and sequence of blocking kernel calls.
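
Since you're working from an interactive Python session, a sketch of setting both before the pipeline first touches CRDS (assuming the variables are read at call time):

import os

# Keep the cache writable but skip locking, and turn on verbose CRDS logging.
os.environ["CRDS_USE_LOCKING"] = "0"
os.environ["CRDS_VERBOSITY"] = "50"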

However, my gut sense is that the issue here is somewhere with the filelock module running on OS-X using /tmp to store locks. As such, I don't want to use hypothetical test cases... because I already did that with different test cases from Jonathan and they passed. Can you share a dumbed down version of this:

from mirisim.config_parser import SimConfig, SimulatorConfig, SceneConfig
from mirisim import MiriSimulation

Or.. more generally, just the closest possible version of your code that you can share, or the smallest case you can create that demonstrates the problem on your system?

If you're willing to try, you can also set CRDS_LOCK_PATH=<some_existing_directory>. That governs the file system on which the lock should be created. I'd suggest something under $HOME or in your cache, on the most canonical local file system possible (e.g. not a network file system, a local disk), with the most direct path possible (e.g. no sym-links, etc). It must preexist in order to avoid a race condition creating it.
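
A sketch of that setup from Python; the directory name is hypothetical, and the key point is creating it before any CRDS call:

import os

lock_dir = os.path.join(os.environ["HOME"], "crds_locks")  # hypothetical path
if not os.path.isdir(lock_dir):
    os.makedirs(lock_dir)  # must preexist to avoid a creation race
os.environ["CRDS_LOCK_PATH"] = lock_dir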

I will keep looking on my end, but until I can easily reproduce this the chances of solving it are diminished.

@jaytmiller
Contributor

I made modifications to the CRDS package (not JWST cal code) to enable the use of multiprocessing locks as well as the filelock package (vs. lockfile). In the process of adding more unit tests, I discovered that lockfile does seem to be incompatible with multiprocessing.

There is a new environment variable CRDS_LOCKING_MODE which defaults to "multiprocessing" but can also be set to filelock or lockfile. Multiprocessing locks require no external dependencies but probably won't be compatible with multiple terminal windows or archive-pipeline-class parallelism.

Our Institute archive pipelines will continue to operate with a readonly cache for now and synchronize with an independent operation for every new batch of references. Should be available as an update on conda-dev for crds tomorrow.

@baileyji
Author

I ran conda update -n jwst -c http://ssb.stsci.edu/conda-dev crds and updated from
crds: 7.1.5.dev82-py27_0 http://ssb.stsci.edu/conda-dev --> 7.1.5.dev93-py27_0
and with CRDS_VERBOSITY=50 I'm getting:

2017-10-18 14:57:53,797 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-18 14:57:53,797 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-18 14:57:53,797 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-18 14:57:53,798 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-18 14:57:53,801 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
2017-10-18 14:57:53,801 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
2017-10-18 14:57:53,801 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
2017-10-18 14:57:53,801 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - Lock acquired CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - ========================================================================================================================
CRDS - DEBUG - getreferences() CRDS version: 7.1.6, bump-version, 25ec561
CRDS - DEBUG - getreferences() server: https://jwst-crds.stsci.edu
CRDS - DEBUG - getreferences() observatory: jwst
CRDS - DEBUG - getreferences() reftypes: ['cubepar', 'resol']
CRDS - DEBUG - getreferences() context: None
CRDS - DEBUG - getreferences() ignore_cache: False
CRDS - DEBUG - CRDS_PATH = /Users/one/MIRICLE/crds/
CRDS - DEBUG - CRDS_SERVER_URL = https://jwst-crds.stsci.edu
CRDS - DEBUG - CRDS JSON RPC get_server_info () -->
CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')
2017-10-18 14:58:02,383 - stpipe.MRSCubePipeline - INFO - MRSCubePipeline instance created.
2017-10-18 14:58:02,387 - stpipe.MRSCubePipeline.cube_build - INFO - CubeBuildStep instance created.
CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')

Without multiprocessing I get:

CRDS - DEBUG - Acquiring lock CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - Lock acquired CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - ========================================================================================================================
CRDS - DEBUG - getreferences() CRDS version: 7.1.6, bump-version, 25ec561
CRDS - DEBUG - getreferences() server: https://jwst-crds.stsci.edu
CRDS - DEBUG - getreferences() observatory: jwst
CRDS - DEBUG - getreferences() reftypes: ['cubepar', 'resol']
CRDS - DEBUG - getreferences() context: None
CRDS - DEBUG - getreferences() ignore_cache: False
CRDS - DEBUG - CRDS_PATH = /Users/one/MIRICLE/crds/
CRDS - DEBUG - CRDS_SERVER_URL = https://jwst-crds.stsci.edu
CRDS - DEBUG - CRDS JSON RPC get_server_info () -->
CRDS - DEBUG - RPC OK
CRDS - DEBUG - Connected to server at 'https://jwst-crds.stsci.edu'
CRDS - DEBUG - Using reference file selection rules 'jwst_0401.pmap' defined by environment CRDS_CONTEXT.
CRDS - DEBUG - Final effective context is 'jwst_0401.pmap'
CRDS - DEBUG - Computing best references locally.
CRDS - DEBUG - Bestrefs header:
{'META.EXPOSURE.READPATT': 'FAST',
'META.EXPOSURE.TYPE': 'MIR_MRS',
'META.INSTRUMENT.BAND': 'SHORT',
'META.INSTRUMENT.CHANNEL': '12.0',
'META.INSTRUMENT.DETECTOR': 'MIRIFUSHORT',
'META.INSTRUMENT.FILTER': 'UNDEFINED',
'META.INSTRUMENT.NAME': 'MIRI',
'META.OBSERVATION.DATE': '2019-01-01',
'META.OBSERVATION.TIME': '17:00:00',
'META.SUBARRAY.NAME': 'FULL',
'REFTYPE': 'UNDEFINED'}
CRDS - DEBUG - Connected to server and computing locally, updating CRDS cache config and operational context.
CRDS - DEBUG - CACHE updating: '/Users/one/MIRICLE/crds/config/jwst/server_config'
CRDS - DEBUG - CACHE updating: '/Users/one/MIRICLE/crds/config/jwst/bad_files.txt'
CRDS - DEBUG - Releasing lock CrdsMultiprocessingLock('crds.cache')
CRDS - DEBUG - Lock released CrdsMultiprocessingLock('crds.cache')

@jaytmiller
Contributor

OK, looks like you're running the right CRDS. Please try this:

$ conda install filelock
$ export CRDS_LOCKING_MODE=filelock
$ ... run multiprocessing script

and send me the output if it does not work, including the STPIPE output.

@baileyji
Author

I'll work on that. One thing to note: it sometimes works if I first run the pipeline on one input, e.g.
runPipe(arglist[0])
pool.map(runPipe, arglist[1:])

@baileyji
Author

I've attached four log files corresponding to the combinations multiprocessing+filelock, single+filelock, multiprocessing+default, and single+default. I got tired of waiting on multiprocessing+filelock and multiprocessing+default. single+filelock completed, but CRDS is unacceptably slow, as was single+default.

The fifth log file is a default+single run to completion, followed by default+multiprocessing without exiting the interpreter. Grep for PHASETWO. This one runs (I think to completion) but then hangs, and no data is written to disk. I've filed a bug about that (#1323). I've not tested filelock+single followed by filelock+multiprocessing.

default multiprocess.txt
default single.txt
filelock multiprocess.txt
filelock single.txt
default single then default multiprocess.txt

@baileyji
Author

Here is a partial log from running a batch of cubes. Note the extreme wait times often occurring after "stpipe.MRSCubePipeline.cube_build - INFO - Prefetch for RESOL reference file is '/Users/one/MIRICLE/crds/references/jwst/miri/jwst_miri_resol_0003.fits'."

slow.txt

@jaytmiller
Contributor

What's the extreme time? I'm seeing 7-10 seconds. Do you see any worse than that?

Historically, one expensive operation CRDS/CAL performs is extracting matching parameters from datasets prior to calling CRDS-proper. CAL opens a model. Along those lines:

  1. What's the size/structure of your input file(s)?
  2. Does your laptop have an SSD?

Finally, is it your current assessment that locking works (in any mode) for your multiprocessing configuration? Optimization may be a completely independent problem. Or there are two problems.

@baileyji
Author

The times between successive lines with the string "Prefetch for RESOL reference" can be upwards of 20 minutes, though I don't know if this log captures that. Even when it isn't that extreme, whether it is 20 seconds or the more common 1-2 minutes, that has to be multiplied by an order of magnitude. By my count, it is doing the prefetch 12*12 times in total: once for each file in the association, and then again each time I call cube_build as I iterate through building a cube out of each sub-channel in the association. At the end of the log you can see where I finally got sick of waiting after >10 minutes. I'm on a MacBook Pro with 16 GB of RAM and a 1 TB SSD on the PCIe bus.

Maybe this isn't a CRDS server issue per se.

Take the slow.txt log on the run for 2Medium (2017-10-18 23:51:32,635 to 2017-10-19 00:21:13). Each fetch takes 0.75-1.5 minutes, every time for the exact same file! They are first fetched sometime after the pipeline is created and then fetched all over again after the pipeline is running. So something that might plausibly take under a minute took ~25 minutes!

The queries are at:
2017-10-18 23:36:13,970 Pipeline Instance created
52.19
52.52
53.22
54.45
55.38
56.30
58.07
58.56
59.32
1.26
3.06
3.49
2017-10-18 23:41:11,892 Pipeline running with args
5.24
6.21
7.00
7.45
8.47
9.32
10.20
11.16
12.14
13.42
14.40
15.55
2017-10-19 00:16:03,575 Actual work starts
2017-10-19 00:21:13,855 Pipeline done

@baileyji
Author

The input is an association of 12 MIRI MRS images that have been through calspec2 (.photm.fits): 12 sub-bands (6 images), 2-pt dither = 12 files. Each file is ~34 MB.

One other thing to note is that the time interval grows with each pipeline run. Killing the interpreter and starting over will get me through a few in somewhat manageable time but then the problem recurs.

When I ctrl+c the process while it is hung on the "Prefetch for RESOL", it doesn't respond immediately; it waits for a macroscopic time, several seconds at least and maybe tens of seconds. When it finally terminates, it seems it is always at the ---> 97 gc.collect point that was captured in slow.txt above.

@jaytmiller
Contributor

OK, thanks for the info, I now agree completely about "unacceptable".

Rather than speculate, I'm going to go do some homework and profiling. It's going to be a while.

@jaytmiller
Contributor

What's the size of your input file for single+filelock or single+default?

Does your Mac have an SSD?

Other than file transfer, one of the primary costs of CRDS bestrefs is loading lookup parameters from the dataset.

@jaytmiller
Contributor

Eck, sorry, posted a stale comment by refreshing the page. Disregard that repeat message.

@baileyji
Author

I think there was at least some Python 2 vs. 3 issue here. I'm now using
conda create -n jwstb7.1rc3 -c http://ssb.stsci.edu/conda-dev python=3 jwst to install, and that has helped significantly. CRDS remains the dominant reduction time: 4 processes * 24 queries each for the same file.

@jaytmiller
Contributor

I created a patch to jwst.stpipe.crds_client intended to optimize data models I/O and garbage collection. I did a pull request (#1371) but I'm not sure if/when it will be accepted, since other approaches are possible. What ballpark is your performance in now? My optimization improves reffile-related file I/O by as much as 2:1 or 3:1.

@baileyji
Author

baileyji commented Oct 26, 2017

3:1 may put this in the area of livable, and eventually fine if any of the optimizations to datamodels.open end up helping here too. I'm not presently set up to install the pipeline in any way other than conda create -n foo -c http://ssb.stsci.edu/conda-dev python=3 jwst, so I guess I'll just need to wait and see if it is accepted for b7.1.

Looking at your changes, I suspect gc.collect() will be the dominant time savings based on what I was seeing. I didn't see anything mentioned, but did you note that CRDS was being hit by the pipeline twice for the same file, for every file in the association?

@jaytmiller
Contributor

Pull request #1371 was accepted so with a conda update you may be able to see some improvement now.

The original design presumption of Step was that repeat CRDS reference fetches would be cheap and consequently "fetch" is a full repeat of "prefetch". I think what happened though, at least over time, is that the dominant cost of CRDS became reading parameters out of the dataset models / FITS files. This optimization addresses doing that read twice so there may be a 2x speed up for that reason.

@baileyji
Author

Is there any chance of a regression of sorts on this in the 0.9.x release? I don't think I'm seeing the hangs, but I'm seeing 5+ minute extreme CUBEPAR and RESOL prefetch repeats again. I can generate new test data, but I'd first hope a quick poke around might show the problem, since it looks the same on my end.

@jaytmiller
Contributor

I don't think there have been any commits directly related to CRDS or STPIPE since I added CRDS header caching. @stscieisenhamer also mentioned a significant performance issue he encountered but I don't know if any related update was in broad use or applies to this use case.

As an aside, I've already worked up some changes which will mothball CRDS model header caching (the current optimization), because caching can produce incorrect results if models are modified without changing file names. To compensate, I'm planning to implement an idea from @stscieisenhamer: lift the prefetch completely out of Steps and run it at the Pipeline level only. That's much less of a kludge than header caching but not quite ready.

If there's a way to pass on a demo of the problem without too much effort, it's probably time.

As another suggestion, it is/was possible to export JWST_PROFILE=1 and run under the Python profiler without changing anything else. I haven't done it in a while and never for anything this complex or already performing so poorly, but if you felt like setting-and-forgetting it over the weekend it might provide more information. I'm not sure how well it will work in the context of multiprocessing so it may be a non-starter.
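
If JWST_PROFILE doesn't cooperate with multiprocessing, a serial profile of a single run may still localize the cost. An untested sketch using only the standard library and the gist's runCubePipe/arglist names (output filename is arbitrary):

import cProfile
import pstats

# Profile one serial pipeline invocation and dump the stats to a file.
cProfile.run("runCubePipe(arglist[0])", "cubepipe.prof")

# Show the 25 most expensive calls by cumulative time.
pstats.Stats("cubepipe.prof").sort_stats("cumulative").print_stats(25)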

@jaytmiller
Contributor

In the interim there have been a lot of changes for this in CRDS and CAL.

#1, the default CRDS_LOCKING_MODE is now "multiprocessing", usable within a single process tree / terminal window only I believe, based on built-in Python multiprocessing locks. The secondary CRDS_LOCKING_MODE=lockfile should work from multiple terminal windows, based on the conda lockfile package, but I think you're better off not using it. The original CRDS_LOCKING_MODE=filelock is highly discouraged; it fails unit tests I wrote on OS-X to force locking issues, so I consider filelock broken on OS-X. It nevertheless remains an option.

#2, CAL data models header caching was implemented as an optimization and then removed. Caching in general assumes a data model won't change unless the dataset filename changes, but interactive users do violate that assumption, producing incorrect CRDS results based on a stale header cache. Instead, Jonathan's idea of pre-fetching only at the Pipeline level was implemented, which saves duplicate header reads proportional to the Step count.

@jaytmiller
Contributor

@stscieisenhamer made a pretty major find related to CRDS and multiprocessing. There's a bug related to IPython, Tkinter, and urllib which can cause subprocesses to silently die and leave CRDS hung waiting for process completion. There was a simple workaround (use urllib before launching subprocesses) which was added to CRDS; in theory (and this was tested interactively) CRDS is no longer vulnerable to that particular Python bug.
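
For anyone stuck on an older CRDS, the manual form of that workaround looks roughly like this (untested sketch; any reachable URL should do the initialization):

import urllib.request
from multiprocessing import Pool

# Touch urllib once in the parent process, before forking any workers,
# so the problematic lazy initialization happens up front.
urllib.request.urlopen("https://jwst-crds.stsci.edu").close()

pool = Pool(4)  # workers are forked after urllib has been initialized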

@baileyji have you noticed any change in your multiprocessing performance? If not, can you give an example script which is exhibiting the pathological behavior?

@jaytmiller
Contributor

This ticket spawned a number of effective and ineffective changes.

I am closing this based on the addition of CRDS cache locking to solve issues related to concurrent cache syncs, and based on Jonathan's discovery of the Tkinter, IPython, urllib bug which could cause sub-processes to fail and CRDS to silently hang.
