RuntimeError: epics failed to respond #273

Open
dpdutcher opened this issue Aug 15, 2022 · 9 comments

@dpdutcher
Collaborator

dpdutcher commented Aug 15, 2022

Got this error when running uxm_setup, during the estimate_phase_delay portion. The full traceback is below, though I know users often encounter this error in various places, so this can be a catch-all thread.

In this particular instance, there was no associated error in the smurf-streamer docker logs, I could still communicate with the board via the pysmurf-ipython session, and I could just restart the uxm_setup script with no hammering required.

[ 2022-08-15 14:55:17 ]  Running find_freq
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 234, in 'calling callback function'
  File "/usr/local/lib/python3.6/dist-packages/epics/ca.py", line 730, in _onGetEvent
    result = memcopy(dbr.cast_args(args))
  File "/usr/local/lib/python3.6/dist-packages/epics/dbr.py", line 308, in cast_args
    ntype = native_type(ftype)
  File "/usr/local/lib/python3.6/dist-packages/epics/dbr.py", line 255, in native_type
    if ftype > CTRL_STRING:
TypeError: '>' not supported between instances of '_ctypes.PyCSimpleType' and 'int'
[ 2022-08-15 14:55:22 ]  Command failed: smurf_server_s6:AMCc:FpgaTopLevel:AppTop:AppCore:SysgenCryo:Base[5]:bandCenterMHz
[ 2022-08-15 14:55:22 ]  Retry attempt 1 of 5
[ 2022-08-15 14:55:27 ]  Retry attempt 2 of 5
[ 2022-08-15 14:55:32 ]  Retry attempt 3 of 5
[ 2022-08-15 14:55:37 ]  Retry attempt 4 of 5
[ 2022-08-15 14:55:42 ]  Retry attempt 5 of 5
Traceback (most recent call last):
  File "/devel/scripts/uxm_setup.py", line 18, in <module>
    uxm_setup.uxm_setup(S, cfg, bands=args.bands)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/sodetlib/sodetlib/operations/uxm_setup.py", line 472, in uxm_setup
    S, cfg, bands, update_cfg=update_cfg)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/sodetlib/sodetlib/operations/uxm_setup.py", line 209, in setup_phase_delay
    band_delay_us, _ = S.estimate_phase_delay(b, make_plot=True, show_plot=False)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/smurf_util.py", line 310, in estimate_phase_delay
    freq_dsp,resp_dsp=self.find_freq(band,subband=dsp_subbands)
  File "/usr/local/src/pysmurf/python/pysmurf/client/util/pub.py", line 50, in wrapper
    rv = func(S, *args, **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/tune/smurf_tune.py", line 3459, in find_freq
    band_center = self.get_band_center_mhz(band)
  File "/usr/local/src/pysmurf/python/pysmurf/client/command/smurf_command.py", line 2345, in get_band_center_mhz
    **kwargs)
  File "/usr/local/src/pysmurf/python/pysmurf/client/command/smurf_command.py", line 186, in _caget
    raise RuntimeError("epics failed to respond")
RuntimeError: epics failed to respond
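
For context, the "Command failed" / "Retry attempt N of 5" lines come from pysmurf's EPICS read wrapper, which retries the caget several times before giving up with the RuntimeError above. A minimal sketch of that retry-then-raise pattern using pyepics directly (the function name, retry count, and wait time are illustrative, not pysmurf's actual implementation):

import time
import epics  # pyepics

def caget_with_retries(pvname, retries=5, wait=5.0):
    # Try the read, then retry a few times before giving up, mirroring
    # the "Retry attempt N of 5" messages in the log above.
    value = epics.caget(pvname, timeout=wait)
    if value is not None:
        return value
    print(f"Command failed: {pvname}")
    for attempt in range(1, retries + 1):
        print(f"Retry attempt {attempt} of {retries}")
        time.sleep(wait)
        value = epics.caget(pvname, timeout=wait)
        if value is not None:
            return value
    raise RuntimeError("epics failed to respond")
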
@dpdutcher
Collaborator Author

Currently, this seems very similar to slaclab/pysmurf#713, which was "solved" by an update to the smurf-streamer, but perhaps I marked it as closed prematurely.
Anecdotally, I've once again seen epics crashing when operating three slots for an overnight dataset, but running two slots works fine.

@jlashner
Collaborator

Just for reference, it seems like this generated a core dump (core_1660699519_python3_11091_11979_1001_1000) with the following backtrace:

#0  0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
   from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
[Current thread is 1 (Thread 0x7f443f4a5700 (LWP 107))]
(gdb) bt
#0  0x00007f446004f270 in SmurfBuilder::FrameFromSamples(std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>, std::_Deque_iterator<boost::shared_ptr<SmurfSample const>, boost::shared_ptr<SmurfSample const>&, boost::shared_ptr<SmurfSample const>*>) ()
   from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#1  0x00007f4460050973 in SmurfBuilder::FlushStash() () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#2  0x00007f4460050e1d in SmurfBuilder::ProcessStashThread(SmurfBuilder*) () from /usr/local/src/smurf-streamer/lib/sosmurfcore.so
#3  0x00007f44849926ef in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4487cfc6db in start_thread (arg=0x7f443f4a5700) at pthread_create.c:463
#5  0x00007f448803588f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I will look into it more, but if you find a way to semi-reliably reproduce it, that would be very helpful.

@jlashner
Collaborator

But it definitely seems like an issue with the streamer, possibly a race condition or something similar.

@jlashner
Collaborator

I sent this to Daniel in Slack, but if you run this command it will enable more debugging logs in the smurf-streamer docker containers...

S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 1)

which could provide some useful info about what's going wrong.
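
The slot number in the PV name should match the server you're debugging, and writing a 0 to the same PV should turn the extra logging back off when you're done (assuming the flag is a simple on/off register):

# Enable extra SmurfBuilder debug logging for the slot 2 server.
S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 1)

# ... reproduce the problem, then check the smurf-streamer docker logs ...

# Assumed: writing 0 disables the extra logging again.
S._caput('smurf_server_s2:AMCc:SmurfProcessor:SOStream:DebugBuilder', 0)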

@jlashner
Collaborator

Hi Daniel, I was able to debug this a bit after seeing it in some SAT1 tests. I added some queue limits in this smurf-streamer PR, which seems to have fixed a lot of the issues we were seeing when operating multiple slots on the SAT. If you want to upgrade, you can use the docker tag simonsobs/smurf-streamer:v0.4.1-2-g142bddf.

Since this has fixed things on SAT1 I'm going to close this issue, but feel free to re-open if you upgrade and still see crashes.

@dpdutcher dpdutcher reopened this Nov 12, 2022
@dpdutcher
Collaborator Author

dpdutcher commented Nov 12, 2022

This happened with smurf-streamer version v0.4.1-3-g728183a. I was only operating one slot at the time. I don't see anything out of the ordinary in the smurf-streamer log or in the core dumps. I can't communicate with the board now; I just get "epics failed to respond" errors.

Original crash message:

RuntimeError: epics failed to respond
During handling of the above exception, another exception occurred:
...
epics.ca.ChannelAccessGetFailure: Get failed; status code: 192

@jlashner
Collaborator

What were you doing when it crashed?

@dpdutcher
Collaborator Author

I was running https://github.com/simonsobs/readout-script-dev/blob/master/ddutcher/ufm_biasstep_sodetlib.py ; it should have been taking bias steps at the time it crashed. The last messages in stdout before the timeout were:

[ 2022-11-12 05:29:53 ]  Waiting 3 sec after switching to hcm
[ 2022-11-12 05:29:56 ]  Input downsample factor is None. Using value already in pyrogue: 1
[ 2022-11-12 05:29:56 ]  FLUX RAMP IS DC COUPLED.
[ 2022-11-12 05:30:00 ]  caput smurf_server_s2:AMCc:SmurfProcessor:Unwrapper:reset 1
[ 2022-11-12 05:30:00 ]  caput smurf_server_s2:AMCc:SmurfProcessor:Filter:reset 1
[ 2022-11-12 05:30:02 ]  Writing to file : /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003.dat
[ 2022-11-12 05:30:02 ]  /data/smurf_data/20221112/crate1slot2/1668227025/outputs/1668231003_mask.txt
[ 2022-11-12 05:30:02 ]  Writing frequency mask.
[ 2022-11-12 05:30:10 ]  Command failed: smurf_server_s2:AMCc:FpgaTopLevel:AppTop:AppCore:SysgenCryo:Base[2]:CryoChannels:centerFrequencyArray
[ 2022-11-12 05:30:10 ]  Retry attempt 1 of 5

@jlashner
Collaborator

Interesting... this could be the same issue, but I don't see a core-dump file on your system.

It seems like your smurf-server, being one of the first ones issued, is also under-spec'ed compared to the ones we're using on the SAT, so it kind of makes sense that you're seeing this most often. We were seeing it more frequently on our system that was having RAM issues. Replacing your server with an official one might alleviate this issue...

Apart from replacing your server, there are a few things we can probably try that might help:

  • Lower the max queue size for your system. I would have to change the smurf-streamer to make this a configurable parameter, but that kind of makes sense to me.
  • Have the bias-step function optionally take downsampled data. Right now bias steps always disable downsampling, but that is not really necessary unless you care about calculating tau_eff. I think the combination of high data rates and a lot of epics calls is what's causing this issue in the first place, so lowering the sampling rate when possible would help (a rough sketch follows this list).
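
As a rough sketch of that second idea: the helper below is hypothetical, not the current sodetlib API; `acquire` stands in for the real bias-step acquisition, and pysmurf's get/set_downsample_factor calls are what I'd expect to use. The point is just to leave the downsampling factor alone unless full-rate data is actually needed, and to restore it afterwards:

def take_bias_steps_downsampled(S, cfg, acquire, keep_downsampling=True, **kwargs):
    # Hypothetical wrapper: only drop to full-rate data when it's actually
    # needed (e.g. for tau_eff fits), and restore the original factor after.
    orig_factor = S.get_downsample_factor()
    try:
        if not keep_downsampling:
            # Current behavior: disable downsampling (factor 1 = full rate).
            S.set_downsample_factor(1)
        # `acquire` is a stand-in for the real bias-step acquisition/analysis.
        return acquire(S, cfg, **kwargs)
    finally:
        # Always put the downsampling factor back how we found it.
        S.set_downsample_factor(orig_factor)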
