Description
I use a Wolfram Language session from Python to evaluate the integrand for the Vegas algorithm.
I use MPI to start one session on each core of a high-performance cluster.
Before starting a session, I want to kill any stray Mathematica processes, so I use the kernel controller as follows:
from wolframclient.evaluation.kernel import kernelcontroller
controller = kernelcontroller.WolframKernelController(kernel='path', kernel_loglevel=1)
controller._kernel_stop()
If I wait 10 minutes after this clean-up, my actual code
with WolframLanguageSession('path') as session: ...
works fine most of the time.
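The two-step routine described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the exact code that runs on the cluster: the function names cleanup_kernel, evaluate_once and two_step are hypothetical, 'path' still stands for the kernel path, and the 600-second default simply mirrors the 10-minute wait. The wolframclient imports are done inside the functions so the waiting logic can be exercised without a kernel installed.

```python
import time


def cleanup_kernel(kernel_path):
    """Step 1: the clean-up call from the post, wrapped in a function."""
    # Imported lazily so this module loads even where wolframclient is absent.
    from wolframclient.evaluation.kernel import kernelcontroller

    controller = kernelcontroller.WolframKernelController(
        kernel=kernel_path, kernel_loglevel=1
    )
    controller._kernel_stop()  # private method, used exactly as in the post


def evaluate_once(kernel_path, expression):
    """Step 2: open a session, evaluate one expression, close the session."""
    from wolframclient.evaluation import WolframLanguageSession
    from wolframclient.language import wlexpr

    with WolframLanguageSession(kernel_path) as session:
        return session.evaluate(wlexpr(expression))


def two_step(cleanup, run, wait_seconds=600, sleep=time.sleep):
    """Run the clean-up, wait, then run the session step and return its result.

    cleanup and run are zero-argument callables; sleep is injectable so the
    wait can be replaced when trying the routine out.
    """
    cleanup()
    sleep(wait_seconds)
    return run()
```

With a real kernel this would be called as two_step(lambda: cleanup_kernel('path'), lambda: evaluate_once('path', '1 + 1')), once per MPI rank.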
At seemingly random times, however, the two-step process (clean-up, then run the session) fails with socket errors, and the following message appears multiple times:
Socket exception: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
Failed to start.
Traceback (most recent call last):
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 435, in _kernel_start
    response = self.kernel_socket_in.recv_abortable(
  File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/zmqsocket.py", line 53, in recv_abortable
    raise SocketOperationTimeout(
wolframclient.evaluation.kernel.zmqsocket.SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
Once this happens, I have to wait around 3 hours before I can run my routine again. If I try sooner, the socket failure persists.
So my questions are: (i) is there a better way to kill stray processes than the one I use above, (ii) why am I getting these socket failures, and (iii) is there a way to recover from them faster?