bomb out with lots of complaints if I/O worker dies #439

Open
o-smirnov opened this issue Mar 4, 2021 · 8 comments

@o-smirnov
Collaborator

If the I/O worker dies, this is a little hard for the end user to diagnose, as the solver workers carry on and fill up the log with messages. The error message is then buried somewhere mid-log and the whole process hangs waiting on I/O, instead of exiting with an error.

Surely a subprocess error is catchable at the main process level. #319 is related.
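
A minimal sketch of how that might be caught at the top level (not CubiCal's actual code; solve_chunk and run_chunks are hypothetical stand-ins), assuming the workers are driven through a concurrent.futures.ProcessPoolExecutor as the tracebacks in this thread suggest:

import sys
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_EXCEPTION
from concurrent.futures.process import BrokenProcessPool

def solve_chunk(chunk):
    # Placeholder for the real per-chunk solver work (hypothetical).
    return chunk

def run_chunks(chunks, nworkers):
    # Submit one future per chunk; if a worker process is killed (e.g. by the
    # OOM killer), the pool raises BrokenProcessPool when results are collected.
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(solve_chunk, c) for c in chunks]
        done, pending = wait(futures, return_when=FIRST_EXCEPTION)
        try:
            return [f.result() for f in done]
        except BrokenProcessPool as exc:
            # Cancel outstanding work and exit loudly instead of hanging on I/O.
            for f in pending:
                f.cancel()
            print("FATAL: a worker process died: {}".format(exc), file=sys.stderr)
            sys.exit(1)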

@Mulan-94
Contributor

@o-smirnov Any fix/workaround for this yet? It's gotten me twice this weekend. I tried reducing --dist-ncpu and --dist-min-chunks from 7 to 4, to no avail.

INFO      19:42:07 - main               [4.0/85.0 18.2/131.8 247.6Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
  File "/home/CubiCal/cubical/main.py", line 582, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 226, in run_process_loop
    return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 312, in _run_multi_process_loop
    stats = future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

@bennahugo
Collaborator

bennahugo commented Sep 20, 2021 via email

@kendak333

I'm running into this BrokenProcessPool error with an oom-kill notice at the end of the log file - I take that to mean the system thinks I'll run out of RAM at some point, so it kills the job. What I can't understand is that earlier in the log, when it's calculating all the memory requirements, it says my maximum memory requirement will be ~57 GB - the system I'm running on has at most 62 GB available, so I don't know why things are being killed.

I'm using --data-freq-chunk=256 (reduced from 1024), --data-time-chunk=36, --dist-max-chunks=2, and ncpus=20 (the maximum available on the node). What other memory-related knobs can I twiddle to try to solve this? It's only 2 hours of data, and I'm running into the same issue with even smaller MSs as well.

@JSKenyon
Collaborator

JSKenyon commented Mar 2, 2022

The memory estimation is just that - a guess based on some empirical experiments I did. So take it with a pinch of salt. If it is an option, I would really suggest taking a look at QuartiCal. It is much less memory hungry, and has fewer knobs to boot. I am only too happy to help you out on that front.

That said, could you please post your log and config? That will help identify what is going wrong.

@kendak333

kendak333 commented Mar 2, 2022

@JSKenyon I'm running it as part of oxkat - I guess we can have a chat about incorporating QuartiCal on an ad hoc basis. I'll take a look at it. But for now, here's the log and the parset:
CL2GC912_cubical.zip

and the command run was:
gocubical /data/knowles/mkatot/reruns/data/cubical/2GC_delaycal.parset --data-ms=1563148862_sdp_l0_1024ch_J0046.4-3912.ms --out-dir /data/knowles/mkatot/reruns/GAINTABLES/delaycal_J0046.4-3912_2022-03-01-10-17-13.cc/ --out-name delaycal_J0046.4-3912_2022-03-01-10-17-13 --k-save-to delaycal_J0046.4-3912.parmdb --data-freq-chunk=256

@JSKenyon
Collaborator

JSKenyon commented Mar 2, 2022

OK, in this instance I suspect it is just that the memory footprint is underestimated. I think the easiest solution here is to set --dist-ncpu=3. Simply put, the memory footprint of each worker is too large to use all the cores (or even the 5 + 1 for I/O in the log you sent). This is unfortunate and will make things slower. On a positive note, hopefully people will start onboarding QuartiCal, which does much better in this regard. Apologies for not having a better solution for you.
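
For example (illustrative only - this is just the command from the earlier comment with the extra flag appended):

gocubical /data/knowles/mkatot/reruns/data/cubical/2GC_delaycal.parset --data-ms=1563148862_sdp_l0_1024ch_J0046.4-3912.ms --out-dir /data/knowles/mkatot/reruns/GAINTABLES/delaycal_J0046.4-3912_2022-03-01-10-17-13.cc/ --out-name delaycal_J0046.4-3912_2022-03-01-10-17-13 --k-save-to delaycal_J0046.4-3912.parmdb --data-freq-chunk=256 --dist-ncpu=3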

@kendak333

Ok thanks, I'll give that a go.

@IanHeywood
Collaborator

The oxkat defaults are tuned so that they work on standard worker nodes at IDIA and CHPC for standard MeerKAT continuum processing (assuming 1024-channel data). The settings should actually leave a fair bit of overhead to account for things like differing numbers of antennas, and the Slurm / PBS controllers being quite trigger-happy when jobs step out of line in terms of memory usage. But if you have a node with 64 GB of RAM then the defaults will certainly be too ambitious.

Is this running on hippo?

Also, I'm not sure whether moving from a single solution for the entire band (--data-freq-chunk=1024) to four solutions across the band (--data-freq-chunk=256) will reduce the quality of your delay solutions, particularly for those quarter-band chunks that have high RFI occupancy. You might want to check whether reverting to a 1024-channel solution gives better results. You could drop --dist-ncpu further, and/or reduce --data-time-chunk, to accommodate this. Note that the latter is 36 by default, but that encompasses 9 individual solution intervals (--k-time-int 4).
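
To make that arithmetic concrete (a quick illustration using the values above, not CubiCal code):

# Solution intervals per chunk for the delay (K) term, using the values above.
data_time_chunk = 36  # timeslots loaded per chunk (--data-time-chunk)
k_time_int = 4        # timeslots per delay solution interval (--k-time-int)
print(data_time_chunk // k_time_int)  # -> 9 solution intervals per chunk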

Cheers.

PS: @JSKenyon swapping to QuartiCal remains on my to-do list!
