
[BLD] Segmentation fault when tested on some platforms #3676

Closed
olebole opened this issue Nov 17, 2021 · 10 comments · Fixed by #3688
Labels
bug, build (related to the build process)

Comments

@olebole
Contributor

olebole commented Nov 17, 2021

Bug report

Bug summary

When running the tests, a segmentation fault appears in test_gadget_binary on some platforms.

The platforms are MIPS (32+64 bit; official Debian architectures), HP-PA, RiscV64, Sparc64 (unofficial architectures).
(sorry for flooding you with exotic bugs)

Actual outcome

yt/frontends/gadget/tests/test_outputs.py::test_gadget_binary yt : [INFO     ] 2021-11-17 09:03:10,652 Omega Lambda is 0.0, so we are turning off Cosmology.
[…]
yt : [WARNING  ] 2021-11-17 09:06:04,770 Non-standard header size is detected! Gadget-2 standard header is 256 bytes, but yours is [288]. Make sure a non-standard header is actually expected. Otherwise something is wrong, and you might want to check how the dataset is loaded. Futher information about header specification can be found in https://yt-project.org/docs/dev/examining/loading_data.html#header-specification.
yt : [INFO     ] 2021-11-17 09:06:04,774 Omega Lambda is 0.0, so we are turning off Cosmology.
yt : [INFO     ] 2021-11-17 09:06:04,775 Assuming length units are in kpc (physical)
yt : [INFO     ] 2021-11-17 09:06:05,313 Parameters: current_time              = 0.0
yt : [INFO     ] 2021-11-17 09:06:05,314 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-11-17 09:06:05,315 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-11-17 09:06:05,317 Parameters: domain_right_edge         = [1. 1. 1.]
yt : [INFO     ] 2021-11-17 09:06:05,318 Parameters: cosmological_simulation   = 0
yt : [INFO     ] 2021-11-17 09:06:05,332 Allocating for 4.000e+02 particles
yt : [WARNING  ] 2021-11-17 09:06:34,877 Non-standard header size is detected! Gadget-2 standard header is 256 bytes, but yours is [288]. Make sure a non-standard header is actually expected. Otherwise something is wrong, and you might want to check how the dataset is loaded. Futher information about header specification can be found in https://yt-project.org/docs/dev/examining/loading_data.html#header-specification.
yt : [INFO     ] 2021-11-17 09:06:34,882 Omega Lambda is 0.0, so we are turning off Cosmology.
yt : [INFO     ] 2021-11-17 09:06:34,883 Assuming length units are in kpc (physical)
yt : [INFO     ] 2021-11-17 09:06:35,435 Parameters: current_time              = 0.0
yt : [INFO     ] 2021-11-17 09:06:35,436 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2021-11-17 09:06:35,437 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2021-11-17 09:06:35,438 Parameters: domain_right_edge         = [1. 1. 1.]
yt : [INFO     ] 2021-11-17 09:06:35,440 Parameters: cosmological_simulation   = 0
yt : [INFO     ] 2021-11-17 09:06:35,455 Allocating for 4.000e+02 particles
Fatal Python error: Segmentation fault

Thread 0x0000003fee14f150 (most recent call first):
  File "/usr/lib/python3.9/threading.py", line 316 in wait
  File "/usr/lib/python3.9/threading.py", line 574 in wait
  File "/usr/lib/python3/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/usr/lib/python3.9/threading.py", line 930 in _bootstrap

Current thread 0x0000003ff43a3010 (most recent call first):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/geometry/particle_geometry_handler.py", line 217 in _initialize_coarse_index
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/geometry/particle_geometry_handler.py", line 188 in _initialize_index
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/frontends/sph/data_structures.py", line 91 in _initialize_index
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/frontends/gadget/data_structures.py", line 203 in _initialize_index
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/geometry/particle_geometry_handler.py", line 25 in __init__
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/frontends/gadget/data_structures.py", line 194 in __init__
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/data_objects/static_output.py", line 528 in index
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/data_objects/static_output.py", line 573 in field_list
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.9_yt/build/yt/frontends/gadget/tests/test_outputs.py", line 54 in test_gadget_binary
[…]
Segmentation fault

Full build log for RiscV64

This seems to happen in yt/geometry/particle_oct_container.pyx

Version Information

  • Operating System: Debian unstable
  • Python Version: 3.9.9
  • yt version: 4.0.1
@neutrinoceros
Member

(sorry for flooding you with exotic bugs)

It's fine, or anyway it's much better than keeping them to yourself :)

@cphyc
Member

cphyc commented Nov 18, 2021

@olebole do you have access to a machine with any of the architectures where the problem arises? If so, it would be useful to run the test through gdb to find where in particle_oct_container.pyx the segfault happens.
This may be an endianness problem, as we mostly test stuff on little-endian architectures.
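For reference, here is a minimal reproduction sketch (the dataset path and script name are hypothetical) that could be run under gdb instead of the whole test suite, e.g. with gdb --args python reproduce.py:

# Hypothetical minimal reproducer: loading the Gadget binary test snapshot and
# accessing ds.index triggers ParticleBitmap._coarse_index_data_file via
# _initialize_index, which is where the crash occurs.
import yt

ds = yt.load("path/to/snap_004")  # hypothetical path to the Gadget-2 binary test snapshot
ds.index                          # builds the particle index (segfaults on affected platforms)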

@matthewturk
Member

@cphyc that's a good point -- and we might be particularly susceptible due to the bit twiddling we do in the coarse index initialization.

@olebole
Contributor Author

olebole commented Nov 18, 2021

It is not endianness -- all failing architectures are little-endian (like x86), and the only big-endian arch we have (s390x) builds fine (all tests passing).
I would bet it is an alignment problem. I have access to a mips64el machine, so I will try to get a proper stack trace from there. It may take a few days, however.

@matthewturk
Member

@olebole Not to sound too amateurish, but do you think we could conceivably reproduce the error with an emulator or something, even if it's very slow?

@olebole
Contributor Author

olebole commented Nov 18, 2021

I am not sure whether an emulator is as picky about alignment errors as the hardware. You could try it, and if you are lucky, it reproduces the problem.

@olebole
Contributor Author

olebole commented Nov 20, 2021

I was able to create a stack trace on mips64el:

Program received signal SIGBUS, Bus error.
0x719e27a8 in __pyx_fuse_0__pyx_f_2yt_8geometry_22particle_oct_container_14ParticleBitmap___coarse_index_data_file (__pyx_v_self=0x70dcf8e8, 
    __pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0) at yt/geometry/particle_oct_container.cpp:12618
12618	    *((__pyx_t_5numpy_uint8_t *) ( /* dim=0 */ (__pyx_v_mask.data + __pyx_t_13 * __pyx_v_mask.strides[0]) )) = 1;
(gdb) bt
#0  0x719e27a8 in __pyx_fuse_0__pyx_f_2yt_8geometry_22particle_oct_container_14ParticleBitmap___coarse_index_data_file (
    __pyx_v_self=0x70dcf8e8, __pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0)
    at yt/geometry/particle_oct_container.cpp:12618
#1  0x719e0d04 in __pyx_pf_2yt_8geometry_22particle_oct_container_14ParticleBitmap_80_coarse_index_data_file (__pyx_v_self=0x70dcf8e8, 
    __pyx_v_pos=0x70be06b0, __pyx_v_hsml=0x91e05c <_Py_NoneStruct>, __pyx_v_file_id=0) at yt/geometry/particle_oct_container.cpp:12078
#2  0x719e09dc in __pyx_fuse_0__pyx_pw_2yt_8geometry_22particle_oct_container_14ParticleBitmap_81_coarse_index_data_file (
    __pyx_v_self=0x70dcf8e8, __pyx_args=0x7076a3e8, __pyx_kwds=0x0) at yt/geometry/particle_oct_container.cpp:12028
#3  0x73010f54 in __Pyx_CyFunction_CallMethod (func=0x71de4c68, self=0x70dcf8e8, arg=0x7076a3e8, kw=0x0)
    at yt/geometry/selection_routines.c:83554
#4  0x7301134c in __Pyx_CyFunction_CallAsMethod (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83617
#5  0x73012308 in __pyx_FusedFunction_callfunction (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83898
#6  0x73012820 in __pyx_FusedFunction_call (func=0x71de4c68, args=0x709b53c0, kw=0x0) at yt/geometry/selection_routines.c:83986
#7  0x00432220 in _PyObject_MakeTpCall ()
#8  0x00421b34 in _PyEval_EvalFrameDefault ()
#9  0x004173d4 in _ftext ()
#10 0x00465aec in PyDict_GetItemWithError ()
Backtrace stopped: frame did not save the PC

These lines look like this:

    /* "yt/geometry/particle_oct_container.pyx":597
 *             mi = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
 *                                           dds, mi_split)
 *             mask[mi] = 1             # <<<<<<<<<<<<<<
 *             particle_counts[mi] += 1
 *             # Expand mask by softening
 */
    __pyx_t_13 = __pyx_v_mi;
    *((__pyx_t_5numpy_uint8_t *) ( /* dim=0 */ (__pyx_v_mask.data + __pyx_t_13 * __pyx_v_mask.strides[0]) )) = 1;

Note that the particle_oct_container.cpp file was generated from the pyx file. It points to line 597 in particle_oct_container.pyx:

mi = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
                              dds, mi_split)
mask[mi] = 1

One observation: this does not seem to be 100% reproducible; I actually needed two attempts to get it (on the first attempt, the test passed).

@olebole
Contributor Author

olebole commented Nov 20, 2021

To complete this, here is an info locals dump:

__pyx_v_i = 2
__pyx_v_p = 48
__pyx_v_mi = 2635249153387078802
__pyx_v_miex = 7
__pyx_v_mi_split = {0, 9223372034707292159, 0}
__pyx_v_ppos = {1.7913391018714546e-38, nan(0x7ffffffffffff), 2.2360368128959752e-21}
__pyx_v_s_ppos = {1.0609624037829015e-314, 3.5740411942693072e+265, 2.9908604455725968e-307}
__pyx_v_clip_pos_l = {0.40239225998491079, 0.40239225998491079, 0.40239214951068192}
__pyx_v_clip_pos_r = {2.7175623358972077e+234, 2.71756163009643e+234, 2.3547636263540591e+234}
__pyx_v_skip = 0
__pyx_v_bounds = {{9223068829353259392, 23001201026473008, 9712000}, {23586070613733760, 42027869908775296, 20100601573710868}}
__pyx_v_xex = 9712000
__pyx_v_yex = 1
__pyx_v_zex = 1
__pyx_v_LE = {0, 0, 0}
__pyx_v_RE = {1, 1, 1}
__pyx_v_DW = {1, 1, 1}
__pyx_v_PER = "\001\001\001"
__pyx_v_dds = {0.5, 0.5, 0.5}
__pyx_v_radius = 0.59760785102844238
__pyx_v_mask = {memview = 0x70e7e298, data = 0x1d40908 "\001\001\001\001\001\001\001\001", shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {1, 
    0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0, 0, 0, 0}}
__pyx_v_particle_counts = {memview = 0x70e7eb20, data = 0x1c43638 "\020", shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {8, 0, 0, 0, 0, 0, 0, 
    0}, suboffsets = {-1, 0, 0, 0, 0, 0, 0, 0}}
__pyx_v_msize = 8
__pyx_v_axiter = {{0, 999}, {0, 999}, {0, 999}}
__pyx_v_axiterv = {{0, 8.096809699390939e+233}, {0, 2.7175616335247829e+234}, {0, 4.1519961721059194e+235}}
__pyx_v_xi = 1
__pyx_v_yi = 1
__pyx_v_zi = 1
__pyx_pybuffernd_hsml = {rcbuffer = 0x7ffee6c8, data = 0x0, diminfo = {{shape = 0, strides = 0, suboffsets = 2013024304}, {
      shape = 1073741824, strides = 4, suboffsets = 5213996}, {shape = 9743480, strides = 2147483647, suboffsets = 2008105888}, {
      shape = 4969628, strides = 9712000, suboffsets = 0}, {shape = 9743408, strides = 9781416, suboffsets = 9712000}, {shape = 2013189992, 
      strides = 2001061624, suboffsets = 1}, {shape = 9712000, strides = 2008105912, suboffsets = 9712000}, {shape = 5100, 
      strides = 9712000, suboffsets = 0}}}
__pyx_pybuffer_hsml = {refcount = 0, pybuffer = {buf = 0x0, obj = 0x0, len = 0, itemsize = 2013227512, readonly = 1891301036, ndim = 4, 
    format = 0x5 <error: Cannot access memory at address 0x5>, shape = 0x71ad4434 <__Pyx_zeros>, strides = 0x71ad4434 <__Pyx_zeros>, 
    suboffsets = 0x71ad0000 <__Pyx_minusones>, internal = 0x0}}
__pyx_pybuffernd_pos = {rcbuffer = 0x7ffee6f8, data = 0x0, diminfo = {{shape = 100, strides = 12, suboffsets = 1891501744}, {shape = 3, 
      strides = 4, suboffsets = 1902433016}, {shape = 2147412648, strides = 1887721480, suboffsets = 9712000}, {shape = 0, 
      strides = 1888950784, suboffsets = 4514316}, {shape = 9712000, strides = 4513340, suboffsets = 9712000}, {shape = 1891501744, 
      strides = 9712000, suboffsets = 0}, {shape = 33513416, strides = 1887721480, suboffsets = 1907209456}, {shape = 2013227512, 
      strides = 2147412504, suboffsets = 1906789492}}}
__pyx_pybuffer_pos = {refcount = 0, pybuffer = {buf = 0x1e08220, obj = 0x70be06b0, len = 1200, itemsize = 4, readonly = 0, ndim = 2, 
    format = 0x1ff5f40 "f", shape = 0x1ff5fe0, strides = 0x1ff5fe8, suboffsets = 0x71ad0000 <__Pyx_minusones>, internal = 0x0}}
__pyx_t_1 = 0x0
__pyx_t_2 = 0x0
__pyx_t_3 = {memview = 0x0, data = 0x0, shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {1, 0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0, 
    0, 0, 0}}
__pyx_t_4 = {memview = 0x0, data = 0x0, shape = {8, 0, 0, 0, 0, 0, 0, 0}, strides = {8, 0, 0, 0, 0, 0, 0, 0}, suboffsets = {-1, 0, 0, 0, 0, 
    0, 0, 0}}
__pyx_t_5 = 48
__pyx_t_6 = 100
__pyx_t_7 = 100
__pyx_t_8 = 3
__pyx_t_9 = 0
__pyx_t_10 = 48
__pyx_t_11 = 2
__pyx_t_12 = 0
__pyx_t_13 = 2635249153387078802
__pyx_t_14 = 0
__pyx_t_15 = 0
__pyx_t_16 = 0x0
__pyx_t_17 = 2
__pyx_t_18 = 2
__pyx_t_19 = 2
__pyx_t_20 = 2
__pyx_t_21 = 2
__pyx_t_22 = 2
__pyx_t_23 = 2
__pyx_t_24 = 2
__pyx_t_25 = 2
__pyx_t_26 = 2
__pyx_t_27 = 2
__pyx_t_28 = 7
__pyx_lineno = 0
__pyx_filename = 0x0
__pyx_clineno = 0

EDIT: I updated this and the previous post after running a build without optimization (-O0), to show all variables and to make sure it is not an optimization problem.

The problem may be this: __pyx_v_mi = 2635249153387078802, which might be caused by the NaN in the ppos array, __pyx_v_ppos = {1.7913391018714546e-38, nan(0x7ffffffffffff), 2.2360368128959752e-21}.
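
For illustration, here is a minimal sketch (a simplified stand-in, not the actual bounded_morton_split_dds implementation) of why a NaN coordinate yields a garbage index: the C-level float-to-integer cast of NaN is undefined behaviour, so the resulting value differs between architectures and can easily exceed the mask size of 8.

import numpy as np

# Simplified stand-in for the coarse index computation: rescale the position
# into cells of width dds and cast to an unsigned integer.
pos = np.array([0.25, np.nan, 0.75])
LE, dds = 0.0, 0.5
with np.errstate(invalid="ignore"):
    idx = ((pos - LE) / dds).astype(np.uint64)
print(idx)  # the middle entry is unspecified; using it as mask[idx] reads/writes out of bounds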

@cphyc
Member

cphyc commented Nov 20, 2021

Thanks so much for the detailed log. I think I may have tracked the issue down to the fact that ppos is expected to contain normal (non-NaN) values. I realize this may be too much to ask, but if you happen to have a chance to try the following patch, that would be very helpful:

--- a/yt/geometry/particle_oct_container.pyx
+++ b/yt/geometry/particle_oct_container.pyx
@@ -587,11 +587,11 @@ cdef class ParticleBitmap:
             for i in range(3):
                 axiter[i][1] = 999
                 # Skip particles outside the domain
-                if pos[p,i] >= RE[i] or pos[p,i] < LE[i]:
+                if not (LE[i] <= pos[p, i] < RE[i]):
                     skip = 1
                     break
                 ppos[i] = pos[p,i]
-            if skip==1: continue
+            if skip == 1: continue
             mi = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
                                           dds, mi_split)
             mask[mi] = 1
@@ -756,11 +756,11 @@ cdef class ParticleBitmap:
             skip = 0
             for i in range(3):
                 axiter[i][1] = 999
-                if pos[p,i] >= RE[i] or pos[p,i] < LE[i]:
+                if not (LE[i] <= pos[p, i] < RE[i]):
                     skip = 1
                     break
                 ppos[i] = pos[p,i]
-            if skip==1: continue
+            if skip == 1: continue
             # Only look if collision at coarse index
             mi1 = bounded_morton_split_dds(ppos[0], ppos[1], ppos[2], LE,
                                            dds1, mi_split1)

Essentially, the issue seems to be that bounded_morton_split_dds computes an integer index from spatial coordinates, which are expected to be normal values. This is fenced by checking that the position lies between the left edge (LE) and the right edge (RE). The logic of that test, however, fails with NaN, because every comparison with NaN is False: a NaN position is neither >= RE nor < LE, so it slips through the fence.

[EDIT: easier fix!]
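
For reference, a minimal sketch in plain Python (not the Cython code itself) of why the original fence lets NaN through while the patched one skips it:

LE, RE = 0.0, 1.0
x = float("nan")

# original fence: both comparisons with NaN are False, so the particle is NOT skipped
skip_old = (x >= RE) or (x < LE)   # False -> the NaN reaches bounded_morton_split_dds

# patched fence: the chained comparison is also False, but it is negated, so we DO skip
skip_new = not (LE <= x < RE)      # True -> the NaN particle is skipped

print(skip_old, skip_new)          # False True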

@cphyc
Member

cphyc commented Nov 20, 2021

Small update @olebole: I have tested the proposed bugfix (#3688) locally and it seems to pass. I have left it as a draft so that this issue isn't marked as fixed if it doesn't actually solve it.
Would you be in a position to confirm that it solves the issue on your side? Note that the PR contains exactly the patch from my previous comment, so there is no need to test both.

@neutrinoceros added the bug and build (related to the build process) labels on Jul 31, 2023