
fixes-459 #460

Merged: 6 commits merged into ratt-ru:master on Mar 23, 2022
Conversation

bennahugo (Collaborator)

This fixes #459.

As near as I can tell, the issue is in the use of threads inside numba. TBB fails to import in the new numba, even if I follow their instructions and install it from pip (@JSKenyon).

I still can't track down this super annoying error. That part of the code is a bit cryptic; maybe @o-smirnov can be of some assistance here:

```
INFO      19:19:39 - data_handler       [x01] [0.3/2.2 2.6/23.0 0.9Gb] reading BITFLAG
INFO      19:19:39 - main               [0.2/2.9 2.1/25.8 0.9Gb] WARNING: unrecognized worker process name 'Process-6'. Please inform the developers.
```

At least it runs through again now.

Running with:

```
gocubical --sol-jones g,dd --data-ms msdir/1491291289.1ghz.1.1ghz.4hrs.ms --data-column CORRECTED_DATA --data-time-chunk 8 --data-freq-chunk 0 --model-list "MODEL_DATA+-output/deep2.DicoModel@msdir/tag.reg:output/deep2.DicoModel@msdir/tag.reg" --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50,90,50,90 --sol-min-bl 110.0 --sol-max-bl 0 --dist-max-chunks 4 --out-name output/deep2cal --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-subtract-dirs 1:  --g-time-int 8 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --dd-dd-term 1 --dd-time-int 8 --dd-freq-int 32 --dd-clip-low 0 --dd-clip-high 0 --dd-type complex-diag --dd-fix-dirs 0 --dd-max-prior-error 0.35 --dd-max-post-error 0.35 --dd-max-iter 200 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1 --dist-nthread 4 --dist-nworker 4 --dist-ncpu 4
```

@JSKenyon (Collaborator) left a comment:

Just one log message to fix, otherwise looks good to me.

```python
            return nthread
        except:
            numba.config.THREADING_LAYER = "default"
            print("Cannot use TDD threading (check your installation). Dropping the number of solver threads to 1", file=log(0, "red"))
```
Comment on the snippet above:

Should be TBB (thread building blocks).

@bennahugo (Collaborator, Author)

Yup. As discussed: https://numba.pydata.org/numba-doc/latest/user/threading-layer.html says TBB can be enabled by installing the Python package from pip. That is not my experience, though; it seems to be picking up the system version. This is therefore only a workaround.

This used to work with earlier versions of numba, though, so I'm not sure what has changed to make things execute "unsafely". Perhaps they switched over their default threading model.

@JSKenyon (Collaborator)

Just putting in my two cents here. Note these issues on Numba: numba/numba#6108 and numba/numba#7148. It seems to have something to do with discovery of the .so files. It is possible to work around this by setting LD_LIBRARY_PATH to wherever pip put the file; in my case this was something like path/to/venv/lib.
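A minimal sketch of that workaround (the path `~/venvddf/lib` is a placeholder for wherever pip actually placed `libtbb.so` in your virtualenv):

```shell
# Prepend the venv's lib directory so the dynamic linker resolves
# pip's libtbb.so before the (older) system copy. Path is hypothetical.
export LD_LIBRARY_PATH="$HOME/venvddf/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Note this must be exported before Python starts; changing it inside a running process has no effect on libraries already loaded.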

@bennahugo (Collaborator, Author)

OK, I tried exporting LD_LIBRARY_PATH (technically it should not be needed when the virtualenv is expressly activated).

`numba -s` reports a successful import:

```
__Threading Layer Information__
TBB Threading Layer Available                 : True
+-->TBB imported successfully.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.
```

I do have:

```
Requirement already satisfied: tbb in ./venvddf/lib/python3.6/site-packages (2021.5.1)
```

However, when I try to execute the basic probe function:

```python
def set_numba_threading(nthread):
    try:
        numba.config.THREADING_LAYER = "safe"
        @numba.njit(parallel=True)
        def foo(a, b):
            return a + b
        foo(np.arange(5), np.arange(5))
        return nthread
    except:
        numba.config.THREADING_LAYER = "default"
        print("Cannot use TBB threading (check your installation). Dropping the number of solver threads to 1", file=log(0, "red"))
        return 1
```

I get

```
/home/hugo/workspace/venvddf/lib/python3.6/site-packages/numba/np/ufunc/parallel.py:365: NumbaWarning: The TBB threading layer requires TBB version 2019.5 or later i.e., TBB_INTERFACE_VERSION >= 11005. Found TBB_INTERFACE_VERSION = 9107. The TBB threading layer is disabled.
  warnings.warn(problem)
INFO      10:59:42 - main               [0.2 2.0 1.0Gb] Cannot use TBB threading (check your installation). Dropping the number of solver threads to 1
```

So it is still picking up the older system library version. Unfortunately I can't uninstall that version; it would break several system packages.
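A quick diagnostic for this situation (not from the PR itself) is to ask the dynamic linker which TBB copies it knows about; the first hit is the one numba will load:

```shell
# List every libtbb in the linker cache; falls back to a message
# on systems without ldconfig or without any libtbb installed.
{ ldconfig -p 2>/dev/null || true; } | grep -i libtbb || echo "no libtbb in the linker cache"
```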

@bennahugo (Collaborator, Author)

OK, hang on... found another place where OMP is being used: the degridder.

It may run into trouble when forks and threads are mixed. I suggest we switch to workqueue if TBB fails to load, and on top of that set the environment variables accordingly: if workers.py sets nthread > 1, then the degridder needs to run with OMP_NUM_THREADS == 1.
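A sketch of that suggestion. The helper name `configure_worker_threading` is hypothetical, not from the PR; `NUMBA_THREADING_LAYER` and `OMP_NUM_THREADS` are the real environment variables consulted by numba and OpenMP respectively, and both must be set before any thread pools spin up:

```python
import os

def configure_worker_threading(nthread):
    """Hypothetical helper: fall back to numba's fork-safe workqueue
    layer, and pin the OpenMP-based degridder to one thread whenever
    the solver itself is multi-threaded."""
    # Read by numba at import/compile time to select a threading layer.
    os.environ["NUMBA_THREADING_LAYER"] = "workqueue"
    if nthread > 1:
        # Keep the degridder's OpenMP pool to a single thread so that
        # forked workers don't contend over an inherited thread pool.
        os.environ["OMP_NUM_THREADS"] = "1"
    return nthread

if __name__ == "__main__":
    configure_worker_threading(4)
    print(os.environ["NUMBA_THREADING_LAYER"], os.environ["OMP_NUM_THREADS"])
```

The key design point is ordering: these variables only take effect if exported before numba and numpy initialise their pools, which is why the PR places the logic ahead of the fork in workers.py.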

@bennahugo (Collaborator, Author)

I'm not sure how this did not cause issues before. My best guess is that numpy now invokes OMP before we fork, after which it becomes unsafe to use.

@bennahugo (Collaborator, Author)

bennahugo commented Mar 22, 2022

Nope, workqueue doesn't completely solve the issue; things still go to pot if threads > 1 is used on workqueue, and it is not a memory issue. I'm testing this on com08.

Edit: at least on 18.04, it looks like one would need to compile packages at the system level (possibly just to include headers for TBB?). I have no idea how to get things working in a venv. I've traced the problem down to the numba code with these changes.

@bennahugo (Collaborator, Author)

Alright, I'm happy this now works as advertised. I can't believe nobody picked up the issue with the previously used small-angle approximation for getting the RA/Dec of the facet centre for beam application.

Predicted flux far off axis E, evaluated almost at the source:

```
[278:293,1833:1850] min -0.01585, max 0.05009, mean 0.002003, std 0.01133, sum 0.5107, np 255
```

Original convolved model flux (subject to a slightly different beam evaluation due to the regular facets used in DDF, so a small difference is to be expected):

```
[270:297,1831:1857] min -1.119e-09, max 0.04783, mean 0.00147, std 0.005706, sum 1.032, np 702
```

Apparent peak convolved flux of the source:

```
[270:297,1832:1856] min -0.0004496, max 0.02002, mean 0.0006666, std 0.002545, sum 0.4319, np 648
```

@viralp please take note: you have previously tried using the beam within cubical without getting decent subtraction when peeling sources from intrinsic models.

Example use (the pointing centre may be set to the phase centre if you did not mosaic via 'DataPhaseDir'):

```
gocubical --sol-jones g,dd --data-ms msdir/1491291289.1ghz.1.1ghz.4hrs.ms --data-column CORRECTED_DATA --data-time-chunk 8 --data-freq-chunk 0 --model-list "output/deep2.DicoModel@msdir/tag2.reg" --model-ddes auto --weight-column WEIGHT --flags-apply FLAG --flags-auto-init legacy --madmax-enable 0 --madmax-global-threshold 0,0 --madmax-threshold 0,0,10 --sol-stall-quorum 0.95 --sol-term-iters 50 --sol-min-bl 110.0 --sol-max-bl 0 --dist-max-chunks 4 --out-name output/deep2cal --out-overwrite 1 --out-mode sr --out-column DE_DATA --out-subtract-dirs 0  --g-time-int 8 --g-freq-int 0 --g-clip-low 0 --g-clip-high 0 --g-type complex-diag --g-update-type phase-diag --g-max-prior-error 0.35 --g-max-post-error 0.35 --g-max-iter 100 --degridding-OverS 11 --degridding-Support 7 --degridding-Nw 100 --degridding-wmax 0 --degridding-Padding 1.7 --degridding-NDegridBand 15 --degridding-MaxFacetSize 0.15 --degridding-MinNFacetPerAxis 1 --dist-nthread 1 --dist-nworker 16 --dist-ncpu 4 --degridding-NProcess 8 --degridding-BeamModel FITS --degridding-FITSFile 'input/meerkat_pb_jones_cube_95channels_$(corr)_$(reim).fits' --out-model-column MODEL_OUT --sel-field 2 --degridding-PointingCenterAt j2000,4h13m26.40,-80d00m00s
```

This work is done in preparation for SKA-MID. I will next port heterogeneous beams to this package.

@bennahugo (Collaborator, Author)

@JSKenyon please review

@JSKenyon (Collaborator) left a comment:

Looks fine to me, bar my single comment.

```python
                return a + b
            foo(np.arange(5), np.arange(5))
            return nthread
        except:
```
Comment on the `except:` above:

Unqualified excepts are generally frowned upon. Does this not raise a consistent exception?

bennahugo (Collaborator, Author) replied:

It raises a massively complex numba exception which I'm not sure how to properly catch; let me see if I can reproduce it. I'm just worried this exception interface will change (numba seems to be constantly evolving).
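One way to narrow the bare `except` without pinning numba's unstable exception types is to catch `Exception` (leaving `KeyboardInterrupt` and friends alone) and log the concrete class, so the precise numba error can be identified and caught explicitly later. This is a sketch, not the merged code; `run_probe` stands in for the njit-compiled test function and `log` for CubiCal's logger:

```python
def set_numba_threading_checked(run_probe, nthread, log=print):
    """Sketch: same fallback as set_numba_threading, but the handler
    reports the actual exception class instead of swallowing it."""
    try:
        run_probe()
        return nthread
    except Exception as exc:  # numba's exact exception type varies
        log("Cannot use TBB threading (%s: %s). "
            "Dropping the number of solver threads to 1"
            % (type(exc).__name__, exc))
        return 1

# Example: a probe that fails the way a broken threading layer might.
def broken_probe():
    raise RuntimeError("TBB version too old")

assert set_numba_threading_checked(lambda: None, 4) == 4
assert set_numba_threading_checked(broken_probe, 4) == 1
```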

@bennahugo bennahugo merged commit 06ea755 into ratt-ru:master Mar 23, 2022
@bennahugo bennahugo deleted the issue-459 branch March 23, 2022 13:15
Linked issue closed by this pull request: Racecondition in threading (#459)