Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Uninitialized memory in Athena++ selection #3619

Closed
neutrinoceros opened this issue Oct 22, 2021 · 12 comments
Closed

BUG: Uninitialized memory in Athena++ selection #3619

neutrinoceros opened this issue Oct 22, 2021 · 12 comments
Assignees
Labels
bug parallelism MPI-based parallelism tests: running tests Issues with the test setup

Comments

@neutrinoceros
Copy link
Member

Bug report

Bug summary

So there currently seems to be an issue with a single test that I've seen failing repeatedly these days.
Namely, it yt/frontends/athenapp/tests/test_outputs.py::test_disk and the failing assertion is:

assert_allclose(dd.quantities.total_quantity(("gas", "cell_volume")), vol)

So we can see from the recent history that the call to total_quantity returns random results, including

Note that all the examples above are from running #3489, which is clearly unrelated to the failing code, but I've seen it for other PRs as well, it's hard to find another example within the recent history because most jobs we still have logs for were failed for different reasons or cancelled.

I assume this is a parellelism bug given that total_quantity is TotalQuantity object, derived from DerivedQuantity, and workers synchronisation issues seem like a plausible explanation to getting random results to an otherwise deterministic function

see

for sto, ds in parallel_objects(chunks, -1, storage=storage):

I think the breaking test isn't worth blocking PRs and consuming many hours of CI from resubmissions, so I'm going to open a with the problematic line deactivated, but it won't close this issue.

@matthewturk
Copy link
Member

This should not ever actually be in parallel, however.

@neutrinoceros
Copy link
Member Author

Ah. Then I have no clue what's happening.

@matthewturk
Copy link
Member

I'm going to close #3620 but I'd ask that you not delete the branch, as we may want to re-open it. I'll follow up here momentarily.

@neutrinoceros
Copy link
Member Author

ok, works for me !

@matthewturk
Copy link
Member

OK, so what I'm concerned about with disabling this test is that we're going to see it crop up somewhere else. What we've seen in the past is that sometimes an error has come in, and then it will show up in one test, and then by disabling that test, another one breaks.

I'm kind of skeptical that this is the problematic test. I'll start out by saying that a full-on valgrind-level audit is probably not workable, but, we can likely run the tests with the Cython optimizations all off, which could turn up errors that are currently being swallowed.

I think what we should do is to first check if the failing test (which I know doesn't fail every time) is sensitive to the ordering of the tests, then run with all Cython optimizations off and look for runtime errors. Or maybe in reverse order.

@Xarthisius
Copy link
Member

Xarthisius commented Oct 22, 2021

I think what we should do is to first check if the failing test (which I know doesn't fail every time) is sensitive to the ordering of the tests,

AFAICT athena_pp runs in a separate thread every time, i.e. there are no tests that could influence the result except for yt/frontends/athena_pp/tests/test_outputs.py:test_AM06

@matthewturk
Copy link
Member

matthewturk commented Oct 22, 2021 via email

@Xarthisius
Copy link
Member

Any chance we could switch up the order in that?

I'm 99% sure it's already the first test in that thread. They should be processed in order they're written in tests.yaml.

@matthewturk
Copy link
Member

With @Xarthisius 's help, this repro's the problem:

import yt
yt.set_log_level(50)
import numpy as np
from yt.testing import (
    assert_allclose,
    assert_equal,
)
from yt.utilities.answer_testing.framework import (
    data_dir_load,
)

disk = "KeplerianDisk/disk.out1.00000.athdf"

ds = yt.load_sample("KeplerianDisk")
assert_equal(str(ds), "disk.out1.00000")
dd = ds.all_data()
#vol = (ds.domain_right_edge[0] ** 3 - ds.domain_left_edge[0] ** 3) / 3.0
#vol *= np.cos(ds.domain_left_edge[1]) - np.cos(ds.domain_right_edge[1])
#vol *= ds.domain_right_edge[2].v - ds.domain_left_edge[2].v
#assert_allclose(dd.quantities.total_quantity(("gas", "cell_volume")), vol)
a = dd.quantities.total_quantity(("gas", "cell_volume"))
print(a, a.v == 43.23719530295646)

I'll note that this always passes on mine, but, running via:

PYTHONMALLOC=malloc valgrind --tool=memcheck python test_athenapp.py

and ignoring everything before "Constructing meshes" results in the first error being some uninitialized memory used in pow. My guess is that this is related to np.empty being used somewhere, and then a count going awry, and this propagating. I'm going to dig into this further and I'll assign to myself.

@matthewturk matthewturk self-assigned this Oct 22, 2021
@matthewturk
Copy link
Member

Replacing np.empty in geometry_handler.py with np.zeros for the various fwidth etc functions avoids the errors here, but that raises the question of why the counts are getting mismatched. I'll insert a guard check on the total count of the values and see where the mismatch arises, and why it's happening with Athena++.

@neutrinoceros
Copy link
Member Author

@matthewturk @Xarthisius don't hesitate to edit this issue's title or labels as you make progress. If we can rule out parallelism issues then it should be triaged again.

@matthewturk matthewturk changed the title BUG: randomly failing test on Jenkins (parallelism bug?) BUG: Uninitialized memory in Athena++ selection Oct 24, 2021
@jzuhone
Copy link
Contributor

jzuhone commented Jul 5, 2023

This has likely been resolved by #4562.

@jzuhone jzuhone closed this as completed Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parallelism MPI-based parallelism tests: running tests Issues with the test setup
Projects
None yet
Development

No branches or pull requests

4 participants