Draft: Add color.h test #524

Open: wants to merge 2 commits into base: master

@jodavies (Collaborator) commented May 21, 2024

This test seems trickier than expected. Currently:

mpirun -np {1,2} parform: OK
mpirun -np {3,4} parform: hang?
mpirun -np {5,6} parform: crash

valgrind vorm: OK
valgrind tvorm -w2: mostly appears to hang (takes 100-200s), sometimes finishes in 2s
valgrind tvorm -w4: OK, finishes in 2s

The CI sees valgrind errors in vorm and tvorm that I can't reproduce locally. Edit: I can reproduce them on Ubuntu 20.04 (which the CI runners use) but not on 22.04.

@jodavies (Collaborator, Author) commented

I added some print statements to PF_UnpackRedefinedPreVars to try to work out what happens. There are many successful redefines, and then we have:

0 i = 0
0 trying to redefine ik1 (35) to 1
0    loop j = 0; j < 2
0       AC.pfirstnum[0] (ik1c) (37), index 35
0       AC.pfirstnum[1] (adj) (38), index 35

So it fails to find the variable it is trying to redefine in AC.pfirstnum: neither of the two entries matches index 35, and the loop exits with j = 2. It then evaluates

if ( AC.inputnumbers[j] < inputnumber ) {

for j=2, which reads an entry that was never initialised and causes Valgrind's "Conditional jump or move depends on uninitialised value(s)" error.

So, why is ik1 no longer in the AC.pfirstnum array? Earlier in the program it was there (and had index 35).
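To make the failure mode concrete, here is a minimal sketch of a guarded lookup. It is not the actual PF_UnpackRedefinedPreVars code: the function names and the plain int arrays are hypothetical stand-ins for AC.pfirstnum and AC.inputnumbers, and only the comparison shown above is modelled. The point is that a "not found" result has to be checked before the index is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the search over AC.pfirstnum: returns the
   position of `index` in firstnum[0..count-1], or -1 if it is absent. */
int find_prevar(const int *firstnum, int count, int index)
{
    assert(firstnum != NULL || count == 0);
    for (int j = 0; j < count; j++) {
        if (firstnum[j] == index) return j;
    }
    return -1; /* the loop ran to j == count without a match */
}

/* Models only the quoted test `inputnumbers[j] < inputnumber`.
   Returns 0 when the variable is missing, so inputnumbers[count]
   (an uninitialised entry) is never read. */
int input_is_newer(const int *firstnum, const int *inputnumbers,
                   int count, int index, int inputnumber)
{
    int j = find_prevar(firstnum, count, index);
    if (j < 0) return 0;
    return inputnumbers[j] < inputnumber;
}
```

With the values from the log (entries 37 and 38, searched index 35), find_prevar returns -1 and the comparison is skipped. This only avoids the uninitialised read; it does not explain why ik1 is missing from the array.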

@tueda (Collaborator) commented May 22, 2024

Thanks for the investigation. I will look into the ParFORM issue. (You know, in programming, the person you were a month ago is a stranger. Then, the person you were more than 10 years ago is...)

By the way, maybe this (the code that gives the Valgrind error) should be broken up into small unit tests.

@jodavies (Collaborator, Author) commented Jun 12, 2024

It seems the problem with valgrind and tvorm -w2 is due to the load balancing. The same issue happens with -w3. Running this test under callgrind shows that ThreadsProcessor makes hundreds of millions of calls to LoadReadjusted (which steals terms from the working thread and distributes them among the idle threads), and these calls also involve locks.

With -w4, there are only ~400 calls to LoadReadjusted.

Edit: it seems to be some kind of race condition, though: if I add a MesPrint to LoadReadjusted, it prints only ~400 times, even when running under valgrind.
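Since printing changes the timing, one possible alternative is to count the calls with an atomic counter and read it once at the end. The following is only a sketch, assuming a C11 compiler with <stdatomic.h>; the function names are hypothetical and not part of FORM, and whether a counter disturbs the race less than MesPrint is itself an assumption.

```c
#include <stdatomic.h>

/* Hypothetical call counter; shared by all threads. */
static atomic_ulong load_readjusted_calls = 0;

/* Would be called once at the top of the function being counted.
   Relaxed ordering is enough: only the final total matters. */
void count_load_readjusted_call(void)
{
    atomic_fetch_add_explicit(&load_readjusted_calls, 1,
                              memory_order_relaxed);
}

/* Read the total, e.g. once after all worker threads have finished. */
unsigned long load_readjusted_call_count(void)
{
    return atomic_load_explicit(&load_readjusted_calls,
                                memory_order_relaxed);
}
```

The total could then be printed a single time at the end of the run instead of once per call.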

The easiest solution is to just disable valgrind for this test...

Don't run this test with tform and fewer than 4 threads; it goes extremely slowly.
See discussion in vermaseren#524.

@jodavies (Collaborator, Author) commented

Once #525 is merged, we can rebase this on top of it, and the parform tests will also run successfully.
