socket failures that take hours to heal #33

Open
@stopatz

I use a Wolfram session to compute the integrand in the Vegas algorithm in Python.

I use MPI to start one session per core on a high-performance cluster.

Before I start a session, I want to kill any stray Mathematica processes, so I use the kernelcontroller as follows:

controller = kernelcontroller.WolframKernelController(kernel='path', kernel_loglevel=1)

controller._kernel_stop()

Now, if I wait 10 minutes after this clean-up, my actual code

with WolframLanguageSession('path') as session:...

works fine most of the time.

But at seemingly random times, I get socket failures when I run the two-step process (cleanup, then run session), with multiple instances of the following error message:

Socket exception: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
Failed to start.
Traceback (most recent call last):
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 435, in _kernel_start
    response = self.kernel_socket_in.recv_abortable(
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/zmqsocket.py", line 53, in recv_abortable
    raise SocketOperationTimeout(
wolframclient.evaluation.kernel.zmqsocket.SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.

Once this happens, I have to wait around 3 hours before my routine runs successfully again. If I retry sooner, the socket failure persists.

So my questions are: i) is there a better way to kill stray processes than the one I have used, ii) why am I getting the socket failures, and iii) is there a way to recover from the socket failures faster?
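
For reference, here is a minimal sketch of a retry wrapper around session startup, which is one way the recovery in iii) might be attempted. It is not a confirmed fix for the failure above. The names start_with_retry, make_session, attempts and delay are hypothetical, and the sketch assumes that the session object exposes start() and terminate() and that start() raises when the kernel fails to come up.

```python
import time


def start_with_retry(make_session, attempts=3, delay=5.0):
    """Start a session, discarding it and retrying if startup fails.

    make_session is a hypothetical zero-argument factory that returns an
    object with start() and terminate() methods.
    """
    last_error = None
    for attempt in range(attempts):
        session = make_session()
        try:
            session.start()
            return session
        except Exception as exc:  # e.g. a socket timeout during kernel startup
            last_error = exc
            try:
                # Best effort: do not leave a half-started kernel behind.
                session.terminate()
            except Exception:
                pass
            if attempt < attempts - 1:
                # Linear back-off before trying again with a fresh session.
                time.sleep(delay * (attempt + 1))
    raise RuntimeError(
        f"session failed to start after {attempts} attempts"
    ) from last_error
```

With wolframclient, the factory could be something like lambda: WolframLanguageSession('path'). The returned session is already started, so the caller is responsible for calling terminate() on it when the work is done. Each attempt uses a fresh session object rather than reusing one whose startup has already failed.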
