Segmentation fault on non-x86 platforms #237

Closed
avalentino opened this issue Sep 7, 2019 · 10 comments · Fixed by #307
Comments

@avalentino
Contributor

While preparing the Debian package for PySPH v1.0-a6 (Python 3 only) we found that there are segmentation faults on several non-x86 platforms:

pysph/base/tests/test_nnps.py .......ssssssssss......................... [ 25%]
...Segmentation fault

See also [1] and [2].

[1] https://buildd.debian.org/status/fetch.php?pkg=pysph&arch=arm64&ver=1.0%7Ea6-3&stamp=1567840824&raw=0
[2] https://buildd.debian.org/status/package.php?p=pysph

@avalentino
Contributor Author

This issue still persists after updating PySPH to the latest git revision (e3d5e10).

@AdrianBunk

The problem seems to be specific to Python 3.8; the latest Debian package builds successfully for me on arm64 when building only for Python 3.7.

@prabhuramachandran
Contributor

@avalentino, @AdrianBunk -- I do not have access to any of these platforms. Is this issue resolved, or are there PySPH issues that need to be addressed? Could you please run the following:

pytest pysph/base/tests/test_nnps.py -v

so that we have some idea of exactly which test is failing? Once we find that, we will need to see where it is failing.

@AdrianBunk

With the sources at commit f599a52:

pysph/base/tests/test_nnps.py::MultipleLevelsStratifiedHashNNPSTestCase::test_neighbors_aa Segmentation fault (core dumped)

Backtrace:

#0  0x0000ffffaba5c2b8 in __pyx_f_5pysph_4base_20stratified_hash_nnps_18StratifiedHashNNPS__h_mask_exact (__pyx_v_H=2147483647, __pyx_v_z=0x0, __pyx_v_y=0x0, 
    __pyx_v_x=0x0, __pyx_v_self=0xffffab8db040)
    at pysph/base/stratified_hash_nnps.cpp:5465
#1  __pyx_f_5pysph_4base_20stratified_hash_nnps_18StratifiedHashNNPS__neighbor_boxes (__pyx_v_H=2147483647, __pyx_v_z=0x0, __pyx_v_y=0x0, __pyx_v_x=0x0, 
    __pyx_v_k=<optimized out>, __pyx_v_j=<optimized out>, 
    __pyx_v_i=<optimized out>, __pyx_v_self=0xffffab8db040)
    at pysph/base/stratified_hash_nnps.cpp:5592
#2  __pyx_f_5pysph_4base_20stratified_hash_nnps_18StratifiedHashNNPS_find_nearest_neighbors (__pyx_v_self=0xffffab8db040, __pyx_v_d_idx=<optimized out>, 
    __pyx_v_nbrs=0xffffa9cb0160) at pysph/base/stratified_hash_nnps.cpp:5123
#3  0x0000ffffabdabdc4 in __pyx_f_5pysph_4base_9nnps_base_8NNPSBase_get_nearest_particles_no_cache (__pyx_v_self=0xffffab8db040, __pyx_v_src_index=0, 
    __pyx_v_dst_index=<optimized out>, __pyx_v_d_idx=0, 
    __pyx_v_nbrs=0xffffa9cb0160, __pyx_v_prealloc=0, 
    __pyx_skip_dispatch=<optimized out>) at pysph/base/nnps_base.cpp:23914
#4  0x0000ffffabdbc0f8 in __pyx_f_5pysph_4base_9nnps_base_8NNPSBase_get_nearest_particles (__pyx_v_self=__pyx_v_self@entry=0xffffab8db040, 
    __pyx_v_src_index=__pyx_v_src_index@entry=0, 
    __pyx_v_dst_index=__pyx_v_dst_index@entry=0, 
    __pyx_v_d_idx=__pyx_v_d_idx@entry=0, __pyx_v_nbrs=0xffffa9cb0160, 
    __pyx_skip_dispatch=__pyx_skip_dispatch@entry=1)
    at pysph/base/nnps_base.cpp:24377
#5  0x0000ffffabdca754 in __pyx_pf_5pysph_4base_9nnps_base_8NNPSBase_6get_nearest_particles (__pyx_v_nbrs=<optimized out>, __pyx_v_d_idx=0, 
    __pyx_v_dst_index=0, __pyx_v_src_index=0, __pyx_v_self=0xffffab8db040)
    at pysph/base/nnps_base.cpp:24505
#6  __pyx_pw_5pysph_4base_9nnps_base_8NNPSBase_7get_nearest_particles (
    __pyx_v_self=0xffffab8db040, __pyx_args=<optimized out>, 
    __pyx_kwds=<optimized out>) at pysph/base/nnps_base.cpp:24488

@mwhudson

It seems what is happening is that self._get_h_max(self.current_cells, i) in the line

            H = <int> ceil(h_max*self.H/self._get_h_max(self.current_cells, i))

from stratified_hash_nnps.pyx is 0.0, so this ends up casting a NaN to an int, which is undefined behaviour. On arm64 (the architecture I was trying), the cast yields 2147483647, i.e. INT_MAX; mask_len then ends up being -1, which is enormous when cast to size_t, and everything fails to work at all. I guess on amd64 casting a NaN to an int does something more benign?

I don't understand the code nearly well enough to see where this is going wrong. I certainly don't see any particular reason why it is justified to assume that self._get_h_max(self.current_cells, i) is non-zero...
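
As an illustration of the conversion described above, here is a minimal, self-contained C sketch. The numerator and denominator values are made up, and the 2*H + 1 expression is only a guess consistent with the mask_len of -1 reported above; none of this is taken from the PySPH sources. Compilers typically lower the cast to cvttsd2si on x86-64, which returns INT_MIN for NaN or out-of-range inputs, and to fcvtzs on AArch64, which saturates, so an infinite or huge ratio becomes exactly the 2147483647 seen in the backtrace:

    #include <limits.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Made-up values standing in for h_max*self.H and
           self._get_h_max(self.current_cells, i); only the zero
           denominator matters for the demonstration.  volatile keeps
           the compiler from folding the conversion at compile time. */
        volatile double numerator = 3.0;
        volatile double denominator = 0.0;

        double ratio = ceil(numerator / denominator); /* +inf (0.0/0.0 gives NaN) */
        int H = (int) ratio;                          /* undefined behaviour      */

        /* On AArch64 the fcvtzs instruction saturates, so H becomes INT_MAX
           (2147483647); on x86-64 cvttsd2si returns INT_MIN instead.  An int
           expression like 2*H + 1 on top of that wraps to -1 in practice,
           which is enormous once converted to size_t. */
        printf("ratio = %f, H = %d (INT_MAX = %d)\n", ratio, H, INT_MAX);
        return 0;
    }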

@avalentino
Contributor Author

Hi @prabhuramachandran, it seems that @mwhudson has spotted the issue.
I was able to reproduce it on my laptop using the s390x Docker image and qemu-static.
Could you please provide a hint on how to solve the problem in the case in which self._get_h_max(self.current_cells, i) is zero?
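
A minimal sketch, in C for illustration, of the kind of guard being asked about. The function name, the parameters, and the fallback value are assumptions made up for this example; this is not the fix that later landed in PR #307:

    #include <limits.h>
    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper mirroring the Cython expression
           H = <int> ceil(h_max*self.H/self._get_h_max(self.current_cells, i))
       but guarding against a zero or negative per-cell h_max, so the division
       never produces inf/NaN and the conversion to int stays well defined. */
    static int compute_level(double h_max, int num_levels, double cell_h_max)
    {
        if (cell_h_max <= 0.0)
            return num_levels;           /* assumed fallback for an empty cell     */

        double ratio = ceil(h_max * (double) num_levels / cell_h_max);
        if (ratio > (double) INT_MAX)    /* clamp: casting a value > INT_MAX is UB */
            return INT_MAX;
        return (int) ratio;
    }

    int main(void)
    {
        printf("%d %d\n", compute_level(1.5, 2, 0.5),   /* normal case: 6         */
                          compute_level(1.5, 2, 0.0));  /* guarded fallback: 2    */
        return 0;
    }

Whatever fallback is chosen, the point is that the division and the float-to-int conversion only happen when the denominator is strictly positive.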

@avalentino
Contributor Author

Any feedback on this issue would be really appreciated.
There are only a few days remaining before the Debian freeze; after that it will not be possible to have PySPH in the official Debian stable distribution.
Having a fix would be ideal, but a workaround could also be enough.
According to [1], how would you classify this bug: critical, serious, minor, ...?

[1] https://www.debian.org/Bugs/Developer

@prabhuramachandran
Contributor

prabhuramachandran commented Feb 5, 2021

@avalentino -- sorry for the slow response on this. @adityapb -- could you please take a quick look at this? This is clearly a bug.

@prabhuramachandran
Contributor

@avalentino -- PR #307 should fix the issue; is it enough if this is merged into PySPH? I can merge it sooner if that will help.

@avalentino
Contributor Author

Thanks @prabhuramachandran, thanks @adityapb, I can confirm that the fix works on s390x platforms.
I will try to push it into Debian ASAP.
