pypestworker #628

cnicol-gwlogic · 2025-10-30T04:50:45Z

cnicol-gwlogic
Oct 30, 2025

I've been using pypestworker a lot (for pastas models with some non-stock stuff to slow things down a bit). I find that for larger pest problems (with longer forward run times) I run into issues. I have played with timeout (and socket_timeout) a lot, and sometimes this can resolve my issue. But this is what I tend to see (from pest.rmr):

10/30/25 15:38:44-> run_id:0 received from unexpected group_id:0 should be group: 2 from:localhost$/mnt/f/xxx ...ignoring
10/30/25 15:38:44->Sending run_id:42 to:localhost$/mnt/f/xxx group_id:2  da_cycle:-9999   iteration:0  realization:42 concurrent:1)
10/30/25 15:38:45-> run_id:0 received from unexpected group_id:0 should be group: 2 from:localhost$/mnt/f/xxx ...ignoring

It seems like ppw / netpack has killed/restarted the ppw instance while the model is still running/post-processing, so the master is being sent results from a killed worker (would that make sense?). If that is the case, I wonder if we need a check for process return codes before deciding to kill the worker, e.g. multiprocessing.Process().exitcode (None = still running, 0 = success, >0 = error), but I can't quite see where the kill/reconnect process happens (maybe it doesn't...).
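For reference, a minimal sketch of the exitcode check described above, using only the standard library. The `forward_run` function and the `run_state` helper are hypothetical stand-ins for illustration and are not part of pypestworker; note also that a negative exitcode means the process was terminated by a signal.

```python
import multiprocessing as mp
import time


def forward_run(seconds, fail=False):
    """Hypothetical stand-in for a slow forward model run."""
    time.sleep(seconds)
    if fail:
        raise SystemExit(3)  # a non-zero exit code signals an error


def run_state(proc):
    """Classify a process by its exitcode attribute."""
    code = proc.exitcode
    if code is None:
        return "running"  # not finished yet (also true before start())
    if code == 0:
        return "success"
    if code < 0:
        return "terminated"  # killed by signal number -code
    return "error"


if __name__ == "__main__":
    ok = mp.Process(target=forward_run, args=(0.5,))
    bad = mp.Process(target=forward_run, args=(0.1, True))
    ok.start()
    bad.start()
    print(run_state(ok))  # checked while the run is still sleeping
    ok.join()
    bad.join()
    print(run_state(ok), run_state(bad))
```

A supervisor could call something like `run_state` before deciding whether a worker is safe to restart, rather than relying on a timeout alone.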

Any thoughts / tips on this? Thanks a lot.

rhugman · 2025-11-03T11:32:17Z

rhugman
Nov 3, 2025

hey @cnicol-gwlogic - any chance you could send me a test model?

1 reply

cnicol-gwlogic Nov 4, 2025
Author

@rhugman-intera - thanks for the offer, I'll package one up in the next day or so and send it (to your work email if that's ok). I am running this in WSL on Windows; I'll test it on a straight-up Linux machine tomorrow before hassling you. You never know.

jtwhite79 · 2025-11-03T19:45:34Z

jtwhite79
Nov 3, 2025
Maintainer

@cnicol-gwlogic can you check the rmr file to see if the master is requesting to kill runs? The pypestworker does not implement the kill request, so I'm wondering if the master is sending a kill request to the worker and the worker is ignoring it, and then when the run that should have been killed finishes, the worker sends it to the master and that's where things get mixed up?

3 replies

cnicol-gwlogic Nov 4, 2025
Author

@jtwhite79 - thanks - no kill commands in the rmr (example attached); that was happening originally, but a while ago I saw it and upped the ies overdue factor to 100. So it seems this must be some ppw timeout or parallel proc issue, I think; it seems like a ppw is running a model, and the netpack (group_id etc.) has been reset while it's doing the forward run work. Or I wonder if there could be an issue with those locks somewhere.
pest.rmr.zip

jtwhite79 Nov 4, 2025
Maintainer

Bummer. One other thing, and this is also a wild guess... there looks to be a hidden non-ASCII char in the rmr file at the end of your working path:

Maybe that is just a reporting thing, but others have reported that when the path has those kinds of chars, strange things can happen... a red herring?
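One way to check for hidden characters like this is to scan the raw bytes of the file. The sketch below is a generic helper (the name `find_suspect_bytes` is made up for illustration); it flags anything outside printable ASCII, on the assumption that the rmr file is meant to be plain ASCII text.

```python
def find_suspect_bytes(data):
    """Return (line, column, byte_value) for each byte outside printable ASCII.

    Tabs are allowed; line endings are stripped by splitlines() first.
    """
    hits = []
    for lineno, line in enumerate(data.splitlines(), start=1):
        for col, byte in enumerate(line, start=1):
            is_control = byte < 0x20 and byte != 0x09  # control char other than tab
            if is_control or byte >= 0x7F:  # DEL and every non-ASCII byte
                hits.append((lineno, col, byte))
    return hits


if __name__ == "__main__":
    # Made-up sample: a NUL byte and a 0xFF byte hidden on the second line.
    sample = b"abc\nde\x00f\xff\n"
    print(find_suspect_bytes(sample))  # [(2, 3, 0), (2, 5, 255)]
```

To run it on a real file, pass the raw bytes, e.g. `find_suspect_bytes(pathlib.Path("pest.rmr").read_bytes())`.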

cnicol-gwlogic Nov 4, 2025
Author

Ah, hadn't noticed that, but I had noticed that my text editor opens that file in hex mode... I'll try non-WSL shortly and report back. Could be on to something here.

cnicol-gwlogic · 2025-11-05T01:29:13Z

cnicol-gwlogic
Nov 5, 2025
Author

fyi - same problem on straight-up ubuntu as via WSL.

3 replies

jtwhite79 Nov 5, 2025
Maintainer

Got it. I just opened a PR that adds a log file to ppw - this should help us debug what's happening...

cnicol-gwlogic Nov 5, 2025
Author

Awesome, ta. I'll try to zip this case up too, it's just gonna take a day or two for me to get off my butt on that. Thanks heaps.

cnicol-gwlogic Nov 6, 2025
Author

Ok, thanks for that commit - installed and tested.

I ran the exact same pest workflow/models as when I was having the issue previously mentioned; you can see that it starts the run with correct group/run_id, but then does a ping with group/run_id of zero:

2025-11-06 15:46:33.393254 PyPestWorker starting with timeout:4.0 and socket_timeout:400.0
2025-11-06 15:46:33.393420 processing control file
2025-11-06 15:46:44.798145 trying to connect to localhost:11194...
2025-11-06 15:46:56.807468 connected to localhost:11194
2025-11-06 15:49:09.986132 recv'd message type:REQ_RUNDIR, group:0, run_id:0, desc:
2025-11-06 15:49:09.987045 sent message type:RUNDIR, group: 0, run_id:0, desc:sending cwd
2025-11-06 15:49:13.997349 recv'd message type:PAR_NAMES, group:0, run_id:0, desc:
2025-11-06 15:49:18.323700 recv'd message type:OBS_NAMES, group:0, run_id:0, desc:
2025-11-06 15:49:22.582951 recv'd message type:REQ_LINPACK, group:0, run_id:0, desc:
2025-11-06 15:49:22.583337 sent message type:LINPACK, group: 0, run_id:0, desc:fake linpack result

all normal, then the bad bit(s):

2025-11-06 15:49:37.020880 recv'd message type:START_RUN, group:2, run_id:15, desc: da_cycle:-9999   iteration:0  realization:15
2025-11-06 15:50:47.371852 recv'd message type:PING, group:0, run_id:0, desc:
2025-11-06 15:50:47.373609 sent message type:PING, group: 0, run_id:0, desc:ping back
2025-11-06 15:51:20.499209 sent message type:RUN_FINISHED, group: 0, run_id:0, desc:
2025-11-06 15:51:20.499563 sent message type:READY, group: 0, run_id:0, desc:ready for next run
2025-11-06 15:51:38.843239 recv'd message type:START_RUN, group:2, run_id:44, desc: da_cycle:-9999   iteration:0  realization:44
2025-11-06 15:52:41.188099 recv'd message type:PING, group:0, run_id:0, desc:
2025-11-06 15:52:41.192026 sent message type:PING, group: 0, run_id:0, desc:ping back
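The pattern in this log (a START_RUN received with group 2 / run_id 15, then a RUN_FINISHED sent with group 0 / run_id 0) can be picked out of a longer log file automatically. The following is a rough sketch that assumes the line layout shown in the excerpt above; the regex and the `find_mismatched_finishes` helper are illustrative only and are not part of pyemu.

```python
import re

# Matches e.g. "recv'd message type:START_RUN, group:2, run_id:15, ..."
# and "sent message type:RUN_FINISHED, group: 0, run_id:0, ..."
MSG = re.compile(
    r"(?P<direction>recv'd|sent) message type:(?P<type>\w+),"
    r"\s*group:\s*(?P<group>-?\d+),\s*run_id:\s*(?P<run_id>-?\d+)"
)


def find_mismatched_finishes(lines):
    """Return (started, finished) id pairs where RUN_FINISHED does not echo START_RUN."""
    current = None  # (group, run_id) of the run in progress, if any
    mismatches = []
    for line in lines:
        m = MSG.search(line)
        if m is None:
            continue
        ids = (int(m["group"]), int(m["run_id"]))
        if m["direction"] == "recv'd" and m["type"] == "START_RUN":
            current = ids
        elif m["direction"] == "sent" and m["type"] == "RUN_FINISHED":
            if current is not None and ids != current:
                mismatches.append((current, ids))
            current = None
    return mismatches


if __name__ == "__main__":
    log = [
        "2025-11-06 15:49:37.020880 recv'd message type:START_RUN, group:2, run_id:15, desc: da_cycle:-9999   iteration:0  realization:15",
        "2025-11-06 15:50:47.371852 recv'd message type:PING, group:0, run_id:0, desc:",
        "2025-11-06 15:50:47.373609 sent message type:PING, group: 0, run_id:0, desc:ping back",
        "2025-11-06 15:51:20.499209 sent message type:RUN_FINISHED, group: 0, run_id:0, desc:",
    ]
    print(find_mismatched_finishes(log))  # [((2, 15), (0, 0))]
```

This only reports the mismatch; it says nothing about why the ids were reset between START_RUN and RUN_FINISHED.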

A similar-size case that works fine: same number of models/workers, but a somewhat different setup, including fewer pars (5k vs 9k), the same number of obs (270k), and a mix of pastas stressmodels/wellmodels (the latter with several wells) vs one stressmodel per well, so probably less of a busy pastas run here:

2025-11-06 15:34:20.844107 starting: opening pypestworker_2025-11-06-15-34-20-842674.txt for logging
2025-11-06 15:34:20.844402 PyPestWorker starting with timeout:4.0 and socket_timeout:400.0
2025-11-06 15:34:20.844550 processing control file
2025-11-06 15:34:28.445486 trying to connect to localhost:11194...
2025-11-06 15:34:40.457756 connected to localhost:11194
2025-11-06 15:38:09.971960 recv'd message type:REQ_RUNDIR, group:0, run_id:0, desc:
2025-11-06 15:38:09.972548 sent message type:RUNDIR, group: 0, run_id:0, desc:sending cwd
2025-11-06 15:38:13.980396 recv'd message type:PAR_NAMES, group:0, run_id:0, desc:
2025-11-06 15:38:18.350615 recv'd message type:OBS_NAMES, group:0, run_id:0, desc:
2025-11-06 15:38:25.375913 recv'd message type:REQ_LINPACK, group:0, run_id:0, desc:
2025-11-06 15:38:25.376343 sent message type:LINPACK, group: 0, run_id:0, desc:fake linpack result
2025-11-06 15:38:36.957614 recv'd message type:START_RUN, group:2, run_id:68, desc: da_cycle:-9999   iteration:0  realization:68
2025-11-06 15:39:02.970800 sent message type:RUN_FINISHED, group: 2, run_id:68, desc: da_cycle:-9999   iteration:0  realization:68
2025-11-06 15:39:02.971324 sent message type:READY, group: 0, run_id:0, desc:ready for next run
2025-11-06 15:39:20.172688 recv'd message type:START_RUN, group:2, run_id:94, desc: da_cycle:-9999   iteration:0  realization:94
2025-11-06 15:39:48.820197 sent message type:RUN_FINISHED, group: 2, run_id:94, desc: da_cycle:-9999   iteration:0  realization:94
2025-11-06 15:39:48.820548 sent message type:READY, group: 0, run_id:0, desc:ready for next run
2025-11-06 15:40:20.377848 recv'd message type:PING, group:0, run_id:0, desc:
2025-11-06 15:40:20.378128 sent message type:PING, group: 0, run_id:0, desc:ping back

Then it gets weird, which makes me think it is a user error (not surprising!). I rebuilt case 1 (the failing case above) using the exact same Jupyter workflow as case 2 (the working case), in which I had made some mods to allow for the case 2 features. This now works fine... So I reckon there was (is) an error of mine somewhere in the original workflow I was using for case 1. Very odd, as it worked fine without ppw.

I'll keep digging and report back if I find anything of use. But thanks heaps for that logging stuff - I think it will be handy anyway.
