socket failures that take hours to heal

I use a Wolfram session to compute the integrand in the Vegas algorithm in Python.

I use MPI to call a session in each core on a high-performance cluster.

Before I start a session, I want to kill any floating Mathematica processes, so I use the kernelcontroller as follows:

controller = kernelcontroller.WolframKernelController(kernel='path', kernel_loglevel=1)
  
controller._kernel_stop()

Now, if I wait 10 minutes after this clean-up, my actual code

with WolframLanguageSession('path') as session:...

works fine most of the time. 

But at seemingly random times, I get socket failures when I run the two-step process (cleanup, then run session), with multiple instances of the following error message:

Socket exception: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
Failed to start.
Traceback (most recent call last):
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 435, in _kernel_start
    response = self.kernel_socket_in.recv_abortable(
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/zmqsocket.py", line 53, in recv_abortable
    raise SocketOperationTimeout(
wolframclient.evaluation.kernel.zmqsocket.SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.

Now, to be able to run my code again, I find that I have to wait around 3 hours and run my routine. Otherwise, this socket failure persists.

So my questions are i) is there a better way to kill stray processes than what I have used, ii) why am I getting the socket failures, and is there a way to heal the socket failures faster?



   





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

socket failures that take hours to heal #33

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

socket failures that take hours to heal #33

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions