hackcode1 segfaults intermittently #98
It seems that on the machines where it crashes, the default build produces a core-dumping program, but a recompile produces working code. |
The branch issue98 has a script to simplify triggering the bug. Making some progress there, but this sure is a hard nut to crack. |
With the latest NEMO master version from GitHub, on macOS with clang, hackcode1 segfaults, which makes io_nemo_test fail (make check). |
So far I have not been able to crash it on an AMD machine, but I agree: I could crash it on a Mac as well. A compiler bug on Intel? I was able to run it under gdb and see the structure with members pointing to random 64-bit values, at which point it segfaults. I need a rainy day. |
I too need a rainy day to dig into it. |
For the record: the script crash100 in src/nbody/evolve/hackcode/hackcode1 is what I've been using to trigger a crash. I also just realized that zeno's treecode is essentially the same code as hackcode1. I need a snowy day for that. |
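A minimal sketch of the kind of harness that helps with an intermittent crash like this one: run the same command many times and tally how each run ended, keeping segfaults and bus errors apart. This is not the script from the repository; `run_stats` and `repeat_command` are hypothetical names, and the code assumes a POSIX system.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Tally of how the repeated runs ended (hypothetical helper type). */
typedef struct {
    int ok;     /* exited with status 0 */
    int segv;   /* killed by SIGSEGV */
    int bus;    /* killed by SIGBUS */
    int other;  /* anything else, including fork/exec failures */
} run_stats;

/* Run argv[0] with arguments argv `runs` times and classify each outcome. */
static run_stats repeat_command(char *const argv[], int runs)
{
    run_stats st = {0, 0, 0, 0};

    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0) {              /* could not start a child */
            st.other++;
            continue;
        }
        if (pid == 0) {             /* child: become the program under test */
            execvp(argv[0], argv);
            _exit(127);             /* reached only if exec failed */
        }

        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            st.other++;
            continue;
        }
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            st.segv++;
        else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGBUS)
            st.bus++;
        else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            st.ok++;
        else
            st.other++;
    }
    return st;
}
```

Called with the command line of the failing program and, say, 100 runs, the `segv` and `bus` counters show both how often the crash occurs and which signal it raises.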
"snowy day " :D |
On an amusing note, this past weekend it rained a lot, and I installed macOS in a virtual (QEMU) machine via the sosumi tool. It fails in this environment too, which is no surprise since it also died on a native Mac. On the other hand, I also tried the zeno 'treecode', and it never crashed. I also ran another 100 compilations of NEMO on an AMD machine; it did not crash. |
Ran into a case where the bug was also triggered in hackforce; replacing it with hackforce_qp solved it. Note added: the crash100 script will also make hackcode1_qp fail eventually. |
using |
Actually, after a fresh install of NEMO, I again got "segmentation fault (core dumped)" from hackforce (during the io_nemo test suite). I was able to track down the faulty line: 121
122 local bool subdivp(nodeptr p, /* body/cell to be tested */
123                    real dsq)  /* size of cell squared */
124 {
125     if (Type(p) == BODY)      /* at tip of tree? */
126         return (FALSE);       /* then cant subdivide */
The debugger says that the pointer p is not null, but it is probably not pointing to an allocated part of memory, which is why it crashes. I then recompiled hackforce with the "-O2" option turned off in $NEMOLIB/makedefs, and there were no errors (no core dump) when running hackforce. Finally I put the "-O2" option back in the makedefs file, and the core dump vanished! hackforce now runs without dumping core. That's really, really weird. |
Was yours a segfault or a bus error? A bus error points to an alignment problem, and the body/node has a "short type" member, which made me suspicious. I changed it to a long, but this didn't fix it. I also tried single-precision NEMO, which didn't solve it either; there the error was a segfault, not a bus error as in the default double precision. It has something to do with casting between body and node, and overlaying those structs (see defs.h). I've documented some more cases I tested in the crash100 script. Everything is just bizarre. As I said, the mother of all bugs. |
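For readers unfamiliar with the body/node overlay mentioned above, here is a minimal sketch of the idea. It is not copied from defs.h: the member lists, the tag values and `can_subdivide` are illustrative assumptions, and the shared header is embedded as the first member (a choice made here so that the cast is well-defined in standard C; the real structs may instead repeat the header members).

```c
#include <stddef.h>

typedef double real;            /* assumed: double-precision build */
#define NDIM 3
#define NSUB 8
enum { BODY = 1, CELL = 2 };    /* illustrative tag values */

/* Header shared by every tree node; `type` says what the node really is. */
typedef struct node {
    short type;
    real  mass;
    real  pos[NDIM];
} node, *nodeptr;

/* A body: the shared header followed by body-only members. */
typedef struct {
    node hdr;
    real vel[NDIM];
} body;

/* A cell: the shared header followed by cell-only members. */
typedef struct {
    node    hdr;
    nodeptr subp[NSUB];
} cell;

/* Read the tag through a generic pointer, whatever the node really is. */
#define Type(p) (((nodeptr) (p))->type)

/* The cast in Type() is only valid if the header sits at offset 0. */
_Static_assert(offsetof(body, hdr) == 0, "body must start with the header");
_Static_assert(offsetof(cell, hdr) == 0, "cell must start with the header");

/* Hypothetical helper: only cells can be opened further. */
static int can_subdivide(void *p)
{
    return Type(p) == CELL;
}
```

If `p` does not point at a live body or cell, the read inside `Type(p)` is the first place the program touches the bad address, which is consistent with a crash being reported on a line like `if (Type(p) == BODY)`.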
In my case it was |
Robert Zhang noted that flipping the quad and subp[] members in the cell typedef made it work. This hinted that hackcode1 was not being built with the right .o file, which pointed at a Makefile that was not strict enough. Thus, consider this bug fixed; a pull request will follow. |
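A sketch of why member order can hide or expose a stale object file, under the assumption that the two builds differ only in whether the cell carries a quad member. The struct definitions and helper names are hypothetical, not the ones in defs.h. If one .o was compiled with a cell that has quad before subp[] and another without quad at all, the two disagree on where subp[] lives, so one side reads pointers from the wrong bytes. With subp[] placed first, both layouts agree on its offset.

```c
#include <stddef.h>

typedef double real;            /* assumed: double-precision build */
#define NDIM 3
#define NSUB 8

typedef struct node {
    short type;
    real  mass;
    real  pos[NDIM];
} node;

/* Cell as seen by a build without the quad member. */
typedef struct {
    node  hdr;
    node *subp[NSUB];
} cell_plain;

/* Cell with quad placed before subp[]: subp[] is pushed further back. */
typedef struct {
    node  hdr;
    real  quad[NDIM][NDIM];
    node *subp[NSUB];
} cell_quad_first;

/* Cell with the two members flipped: subp[] stays where cell_plain has it. */
typedef struct {
    node  hdr;
    node *subp[NSUB];
    real  quad[NDIM][NDIM];
} cell_subp_first;

/* 1 if code compiled against either layout finds subp[] at the same offset. */
static int quad_first_agrees_with_plain(void)
{
    return offsetof(cell_quad_first, subp) == offsetof(cell_plain, subp);
}

static int subp_first_agrees_with_plain(void)
{
    return offsetof(cell_subp_first, subp) == offsetof(cell_plain, subp);
}
```

Flipping the members therefore only masks a mixed build; the underlying fix is to make sure each program links objects compiled with the same struct definition.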
weird... |
hackcode1 now intermittently segfaults. This was already the case on Ubuntu 20 and persists on Ubuntu 22.
Slight correction: it's actually a bus error