hackcode1 segfaults intermittently #98

Open
teuben opened this issue May 14, 2022 · 15 comments

Comments

@teuben
Owner

teuben commented May 14, 2022

hackcode1 now segfaults intermittently. This was already the case on Ubuntu 20 and persists on Ubuntu 22.
Slight correction: it's actually a bus error.

@teuben
Owner Author

teuben commented May 17, 2022

It seems that on those machines where it crashes, the default build gives a core-dumping program, but the recompile

         mknemo -t -T hackcode1

gives a working code.

@teuben teuben changed the title hackcode1 segfaults intermittenly hackcode1 segfaults intermittendly Jul 31, 2022
@teuben
Owner Author

teuben commented Oct 13, 2022

The branch issue98 has a script to simplify triggering the bug. Making some progress there, but this sure is a hard nut to crack.

@jcldc
Collaborator

jcldc commented Oct 13, 2022

With the latest NEMO master version from GitHub, on a macOS platform with clang, hackcode1 segfaults, which makes io_nemo_test fail (make check).

@teuben
Owner Author

teuben commented Oct 13, 2022

So far I have not been able to crash it on an AMD machine, but I agree, I could crash it on a Mac as well. A compiler bug on Intel? I was able to run it in gdb and see the structure with members pointing to random 64-bit values, where it then segfaults. I need a rainy day.

@jcldc
Collaborator

jcldc commented Oct 13, 2022

I too need a rainy day to dig into it.

@teuben teuben changed the title hackcode1 segfaults intermittendly hackcode1 segfaults intermittently Oct 13, 2022
@teuben
Owner Author

teuben commented Oct 13, 2022

For the record: the script crash100 in src/nbody/evolve/hackcode/hackcode1 is what I've been using to trigger a crash.

I also just realized zeno's treecode is essentially the same code as hackcode1. Need a snowy day for that.

@jcldc
Collaborator

jcldc commented Oct 13, 2022

"snowy day " :D

@teuben
Owner Author

teuben commented Oct 14, 2022

On an amusing note, this past weekend it was raining a lot, and I installed macOS in a virtual (QEMU) box via the sosumi tool. It also fails in this environment. No surprise, since it also died on a native Mac.

On the other hand, I also tried the zeno 'treecode', and it never crashed. Also ran another 100 compilations of NEMO on an AMD. It did not crash.

@teuben
Owner Author

teuben commented Mar 6, 2023

Ran into a case where the bug was also triggered in hackforce; replacing it with hackforce_qp solved it.

Note added: the crash100 script will also make hackcode1_qp fail eventually.

@teuben
Owner Author

teuben commented Mar 14, 2023

Using

         typedef long atype;

instead of a short did not resolve the bug.
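
A minimal sketch of why widening the tag may leave the layout untouched, assuming `atype` is the type of the node's `type` tag and that the tag is followed by a `double`. The struct names `node_short` and `node_long` are hypothetical and not taken from defs.h. On typical 64-bit Linux and macOS ABIs the padding after a `short` already pushes the next `double` to the same offset a `long` would, so both variants end up with identical offsets and sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the node header with two tag widths. */
typedef struct { short type; double mass; double pos[3]; } node_short;
typedef struct { long  type; double mass; double pos[3]; } node_long;

/* Offset of the first field after the tag, for each variant. */
size_t mass_offset_short(void) { return offsetof(node_short, mass); }
size_t mass_offset_long(void)  { return offsetof(node_long, mass); }

/* Returns 1 if switching the tag from short to long moves any field
 * or changes the struct size on this platform, 0 otherwise. */
int tag_width_changes_layout(void)
{
    return mass_offset_short() != mass_offset_long()
        || sizeof(node_short) != sizeof(node_long);
}

/* Holds on any conforming compiler: the first member sits at offset 0. */
void check_tag_first(void)
{
    assert(offsetof(node_short, type) == 0);
    assert(offsetof(node_long, type) == 0);
}
```

If `tag_width_changes_layout()` returns 0 on the affected machines, changing the tag type alone would not be expected to change the behaviour, which is consistent with the result reported here.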

@jcldc
Collaborator

jcldc commented Mar 14, 2023

Actually, after a fresh install of NEMO, I again got the segmentation fault (core dumped) from hackforce (during the io_nemo test suite).

I was able to find the faulty line:
line 125 in src/nbody/evolve/hackcode/hackcode1/grav.c

121 
122 local bool subdivp(nodeptr p,      /* body/cell to be tested */
123                    real dsq)       /* size of cell squared */
124 {
125     if (Type(p) == BODY)                        /* at tip of tree?          */
126         return (FALSE);                         /*   then cant subdivide    */

The debugger says that the pointer p is not null, but it is probably not pointing into allocated memory, which is why it crashes.

Then I recompiled hackforce with the "-O2" option removed from $NEMOLIB/makedefs, and there were no errors (no core dump) when running hackforce.

Finally I put the "-O2" option back in the makedefs file, and the error/core dump vanished! No more core dumps when running hackforce.

That's really, really weird.

@teuben
Owner Author

teuben commented Mar 14, 2023

Was yours a segfault or a bus error? A bus error pointed to an alignment error, and the body/node has a "short type", which made me suspicious. I made it a long; this didn't fix it. I also tried single precision NEMO, which also didn't solve it. It has something to do with casting between body and node, and overlaying those structs (see defs.h).

I've documented some more cases I tested in the crash100 script. Everything is just bizarre. As I said, the mother of all bugs.

And on single precision NEMO the error was a segfault, not a bus error as in the default double precision.
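
For readers unfamiliar with the overlay technique, here is a minimal sketch of what "overlaying" body and cell structs on a common node header can look like. The field names, tag values, and `NSUB` are assumptions for illustration; the real definitions are in defs.h and may differ. The idea is that every body and cell starts with the same leading fields, so code can inspect the tag without knowing which kind of object it has.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tag values and sizes, for illustration only. */
#define BODY 1
#define CELL 2
#define NSUB 8

/* Common header: every body and cell starts with these fields. */
typedef struct {
    short type;        /* BODY or CELL */
    double mass;
    double pos[3];
} node;

typedef struct {
    short type;
    double mass;
    double pos[3];
    double vel[3];     /* body-only data follows the shared header */
} body;

typedef struct {
    short type;
    double mass;
    double pos[3];
    node *subp[NSUB];  /* cell-only data follows the shared header */
} cell;

/* Returns 1 if the shared fields sit at identical offsets in all three structs. */
int headers_overlay(void)
{
    return offsetof(node, type) == offsetof(body, type)
        && offsetof(node, type) == offsetof(cell, type)
        && offsetof(node, mass) == offsetof(body, mass)
        && offsetof(node, mass) == offsetof(cell, mass)
        && offsetof(node, pos)  == offsetof(body, pos)
        && offsetof(node, pos)  == offsetof(cell, pos);
}

/* Read the tag from any body or cell without knowing which it is.
 * memcpy is used instead of a pointer cast to stay clear of
 * strict-aliasing concerns in this sketch. */
short tag_of(const void *p)
{
    short t;
    memcpy(&t, p, sizeof t);
    return t;
}

/* The overlay only works while all code agrees on these layouts. */
void check_overlay(void)
{
    body b = { BODY, 1.0, {0}, {0} };
    cell c = { CELL, 2.0, {0}, {0} };
    assert(headers_overlay());
    assert(tag_of(&b) == BODY);
    assert(tag_of(&c) == CELL);
}
```

The scheme depends on every object file being compiled against the same struct definitions. If two files disagree about the layout, a tag or pointer read through the shared header can land on unrelated bytes.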

@jcldc
Collaborator

jcldc commented Mar 14, 2023

In my case it was a bus error (core dumped), but it vanished once I recompiled the code....

@teuben
Owner Author

teuben commented Mar 30, 2023

Robert Zhang noted that flipping the quad and subp[] in the cell typedef made it work. This hinted that hackcode1 was not including the right .o file, which pointed to a Makefile that was not strict enough.

Thus, consider this bug fixed, a pull request will follow.
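
One plausible way to picture the failure, as a sketch under stated assumptions rather than a reconstruction of the actual build: suppose one object file was compiled with a cell layout that has no quad member and another with quad placed before subp[]. The struct names below (`cell_plain`, `cell_quad_first`, `cell_quad_last`) and the field types are hypothetical. Code built against the plain layout then looks for subp[] where the other layout stores quad, so it reads floating-point bit patterns as pointers. That would be consistent with the non-null but invalid pointers seen in the debugger, and with the observation that placing subp[] before quad hides the mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NSUB 8

typedef struct { short type; double mass; double pos[3]; } node;

/* Layout as seen by code compiled without quadrupole moments. */
typedef struct {
    short type; double mass; double pos[3];
    node *subp[NSUB];
} cell_plain;

/* Layout with quad placed before subp[]. */
typedef struct {
    short type; double mass; double pos[3];
    double quad[3][3];
    node *subp[NSUB];
} cell_quad_first;

/* Same fields with the order flipped: subp[] before quad. */
typedef struct {
    short type; double mass; double pos[3];
    node *subp[NSUB];
    double quad[3][3];
} cell_quad_last;

/* How far subp[] moves relative to the plain layout, in bytes. */
size_t subp_shift_quad_first(void)
{
    return offsetof(cell_quad_first, subp) - offsetof(cell_plain, subp);
}

size_t subp_shift_quad_last(void)
{
    return offsetof(cell_quad_last, subp) - offsetof(cell_plain, subp);
}

/* Read subp[0] the way code built against cell_plain would,
 * from an object that was actually laid out as cell_quad_first. */
uintptr_t stale_read_subp0(const cell_quad_first *c)
{
    uintptr_t bits;
    memcpy(&bits, (const unsigned char *)c + offsetof(cell_plain, subp),
           sizeof bits);
    return bits;
}

void check_stale_read(void)
{
    cell_quad_first c;
    memset(&c, 0, sizeof c);
    c.quad[0][0] = 0.1;  /* any non-zero double */
    c.subp[0] = NULL;    /* the real first child pointer is null */
    assert(subp_shift_quad_first() > 0);
    assert(stale_read_subp0(&c) != 0);  /* stale code sees a bogus non-null "pointer" */
}
```

With subp[] first, `subp_shift_quad_last()` is 0, so mismatched object files would still agree on where the child pointers live and only the struct size would differ.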

@jcldc
Collaborator

jcldc commented Mar 30, 2023

> Robert Zhang noted that flipping the quad and subp[] in the cell typedef made it work. This hinted that hackcode1 was not including the right .o file, which pointed to a Makefile that was not strict enough.
>
> Thus, consider this bug fixed, a pull request will follow.

weird...
