BUG: Uninitialized memory in Athena++ selection #3619

neutrinoceros · 2021-10-22T15:51:25Z

Bug report

Bug summary

There currently seems to be an issue with a single test that I've seen failing repeatedly these days.
Namely, it is yt/frontends/athena_pp/tests/test_outputs.py::test_disk, and the failing assertion is:

yt/yt/frontends/athena_pp/tests/test_outputs.py

Line 31 in 9a6def9

assert_allclose(dd.quantities.total_quantity(("gas", "cell_volume")), vol)
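For context, the expected value vol in this assertion is the analytic volume of the spherical-coordinate domain, as spelled out in the commented-out lines of the reproduction script further down the thread. The sketch below is a standalone illustration, not yt code: the function names and the domain bounds are hypothetical, and it only shows that summing exact per-cell volumes is a deterministic operation that should always match the closed form.

```python
import numpy as np


def spherical_domain_volume(r0, r1, theta0, theta1, phi0, phi1):
    """Closed-form volume of r0<=r<=r1, theta0<=theta<=theta1, phi0<=phi<=phi1."""
    return (r1**3 - r0**3) / 3.0 * (np.cos(theta0) - np.cos(theta1)) * (phi1 - phi0)


def summed_cell_volume(r0, r1, theta0, theta1, phi0, phi1, n=16):
    """Sum the exact volume of every cell on a uniform n x n x n grid."""
    r = np.linspace(r0, r1, n + 1)
    theta = np.linspace(theta0, theta1, n + 1)
    phi = np.linspace(phi0, phi1, n + 1)
    dr3 = (r[1:] ** 3 - r[:-1] ** 3) / 3.0
    dcos = np.cos(theta[:-1]) - np.cos(theta[1:])
    dphi = np.diff(phi)
    # Outer product of the three 1D factors gives one volume per cell.
    cell_volume = dr3[:, None, None] * dcos[None, :, None] * dphi[None, None, :]
    return cell_volume.sum()


# Hypothetical bounds, chosen only for illustration (not the KeplerianDisk domain).
bounds = (1.0, 2.0, 0.5, 2.5, 0.0, 2.0 * np.pi)
analytic = spherical_domain_volume(*bounds)
summed = summed_cell_volume(*bounds)
print(analytic, summed)
```

Because nothing in this computation is random, a result that changes from run to run points at the inputs (for example, the cell widths) rather than at the summation itself.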

So we can see from the recent history that the call to total_quantity returns random results, including

a random float https://tests.yt-project.org/job/yt_py38_git/4106/
another random float https://tests.yt-project.org/job/yt_py38_git/4107/
a nan https://tests.yt-project.org/job/yt_py38_git/4109/

Note that all the examples above are from running #3489, which is clearly unrelated to the failing code. I've seen it for other PRs as well, but it's hard to find another example within the recent history because most jobs we still have logs for failed for different reasons or were cancelled.

I assume this is a parallelism bug, given that total_quantity is a TotalQuantity object, derived from DerivedQuantity, and worker synchronisation issues seem like a plausible explanation for getting random results from an otherwise deterministic function.

see

yt/yt/data_objects/derived_quantities.py

Line 55 in 69ada92

for sto, ds in parallel_objects(chunks, -1, storage=storage):

I think the failing test isn't worth blocking PRs and consuming many hours of CI from resubmissions, so I'm going to open a PR with the problematic line deactivated, but it won't close this issue.


matthewturk · 2021-10-22T17:16:40Z

This should not ever actually be in parallel, however.

neutrinoceros · 2021-10-22T17:32:14Z

Ah. Then I have no clue what's happening.

matthewturk · 2021-10-22T17:48:50Z

I'm going to close #3620 but I'd ask that you not delete the branch, as we may want to re-open it. I'll follow up here momentarily.

neutrinoceros · 2021-10-22T17:52:55Z

ok, works for me !

matthewturk · 2021-10-22T18:01:07Z

OK, so what I'm concerned about with disabling this test is that we're going to see it crop up somewhere else. What we've seen in the past is that sometimes an error has come in, and then it will show up in one test, and then by disabling that test, another one breaks.

I'm kind of skeptical that this is the problematic test. I'll start out by saying that a full-on valgrind-level audit is probably not workable, but, we can likely run the tests with the Cython optimizations all off, which could turn up errors that are currently being swallowed.

I think what we should do is to first check if the failing test (which I know doesn't fail every time) is sensitive to the ordering of the tests, then run with all Cython optimizations off and look for runtime errors. Or maybe in reverse order.

Xarthisius · 2021-10-22T18:06:27Z

I think what we should do is to first check if the failing test (which I know doesn't fail every time) is sensitive to the ordering of the tests,

AFAICT athena_pp runs in a separate thread every time, i.e. there are no tests that could influence the result except for yt/frontends/athena_pp/tests/test_outputs.py:test_AM06

matthewturk · 2021-10-22T18:09:11Z

Any chance we could switch up the order in that?


Xarthisius · 2021-10-22T18:23:03Z

Any chance we could switch up the order in that?

I'm 99% sure it's already the first test in that thread. They should be processed in the order they're written in tests.yaml.

matthewturk · 2021-10-22T21:38:19Z

With @Xarthisius's help, this reproduces the problem:

import numpy as np

import yt
from yt.testing import assert_allclose, assert_equal

# Silence yt's logging (50 is the CRITICAL level).
yt.set_log_level(50)

ds = yt.load_sample("KeplerianDisk")
assert_equal(str(ds), "disk.out1.00000")
dd = ds.all_data()

# Original assertion from test_disk, kept commented out for reference:
# vol = (ds.domain_right_edge[0] ** 3 - ds.domain_left_edge[0] ** 3) / 3.0
# vol *= np.cos(ds.domain_left_edge[1]) - np.cos(ds.domain_right_edge[1])
# vol *= ds.domain_right_edge[2].v - ds.domain_left_edge[2].v
# assert_allclose(dd.quantities.total_quantity(("gas", "cell_volume")), vol)

# Print the total volume and whether it matches the value seen on passing runs.
a = dd.quantities.total_quantity(("gas", "cell_volume"))
print(a, a.v == 43.23719530295646)

I'll note that this always passes on my machine, but running it via:

PYTHONMALLOC=malloc valgrind --tool=memcheck python test_athenapp.py

and ignoring everything before "Constructing meshes" results in the first error being some uninitialized memory used in pow. My guess is that this is related to np.empty being used somewhere, and then a count going awry, and this propagating. I'm going to dig into this further and I'll assign to myself.
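A minimal sketch of the suspected failure mode, under the assumption that a buffer is sized from an announced count and then filled chunk by chunk. The gather helper and the chunk sizes are hypothetical and are not taken from yt. Since the contents of an np.empty buffer are arbitrary, a NaN-filled allocator is used as a deterministic stand-in for "whatever was in memory".

```python
import numpy as np


def gather(chunks, announced_count, allocate):
    """Copy chunks into a buffer sized from announced_count.

    Returns the buffer and the number of entries actually written.
    """
    out = allocate(announced_count)
    ind = 0
    for chunk in chunks:
        out[ind : ind + chunk.size] = chunk
        ind += chunk.size
    return out, ind


# Hypothetical mismatch: 8 values are announced, but only 5 are delivered.
chunks = [np.ones(3), np.ones(2)]
announced = 8

# np.empty: the last 3 entries are never written, so their values are arbitrary.
empty_buf, filled = gather(chunks, announced, np.empty)

# np.zeros: the unwritten tail is 0.0, which hides the mismatch in a sum.
zero_buf, _ = gather(chunks, announced, np.zeros)

# Deterministic stand-in for garbage memory: an unwritten tail of NaN.
poison_buf, _ = gather(chunks, announced, lambda n: np.full(n, np.nan))

print("entries written:", filled, "of", announced)
print("sum with np.zeros:", zero_buf.sum())
print("sum with NaN tail:", poison_buf.sum())
```

With np.empty, the sum depends on whatever the allocator hands back, which would be consistent with the test returning a different float, or a nan, on different runs.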

matthewturk · 2021-10-22T21:50:51Z

Replacing np.empty in geometry_handler.py with np.zeros for the various fwidth, etc. functions avoids the errors here, but that raises the question of why the counts are getting mismatched. I'll insert a guard check on the total count of the values and see where the mismatch arises, and why it's happening with Athena++.
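One possible shape for such a guard, as a sketch only: gather_checked is a hypothetical helper, not the code in geometry_handler.py. It keeps np.empty for allocation but refuses to return the buffer unless every entry was written.

```python
import numpy as np


def gather_checked(chunks, expected_count):
    """Fill a buffer from chunks and fail loudly on any count mismatch."""
    out = np.empty(expected_count, dtype="float64")
    ind = 0
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype="float64").ravel()
        if ind + chunk.size > expected_count:
            raise ValueError(
                f"chunks provide more than the expected {expected_count} values"
            )
        out[ind : ind + chunk.size] = chunk
        ind += chunk.size
    if ind != expected_count:
        # Part of the buffer was never written, so it holds uninitialized memory.
        raise ValueError(f"filled {ind} of {expected_count} values")
    return out


# Matching counts: the buffer is returned as usual.
widths = gather_checked([np.ones(3), np.ones(2)], 5)

# Mismatched counts: the guard reports the problem instead of returning garbage.
try:
    gather_checked([np.ones(3), np.ones(2)], 8)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
print(mismatch_message)
```

Compared with switching to np.zeros, a check like this surfaces the mismatch at the point where it happens instead of silently producing a plausible-looking but wrong total.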

neutrinoceros · 2021-10-24T10:50:06Z

@matthewturk @Xarthisius don't hesitate to edit this issue's title or labels as you make progress. If we can rule out parallelism issues then it should be triaged again.

jzuhone · 2023-07-05T19:15:08Z

This has likely been resolved by #4562.

neutrinoceros added the bug, tests: running tests, and parallelism labels Oct 22, 2021

neutrinoceros mentioned this issue Oct 22, 2021

TST: deactivate a randomly failing test #3620

Closed

neutrinoceros mentioned this issue Oct 22, 2021

ENH: Add accumulator objects for integrating along paths #2490

Closed

3 tasks

matthewturk self-assigned this Oct 22, 2021

matthewturk changed the title ~~BUG: randomly failing test on Jenkins (parallelism bug?)~~ BUG: Uninitialized memory in Athena++ selection Oct 24, 2021

neutrinoceros mentioned this issue Nov 29, 2021

Backport #3410 + #3687 to yt-4.0.x (image filename validation fixes) #3698

Merged

neutrinoceros mentioned this issue Feb 9, 2022

Fix electromagnetic field unit conversions #3787

Closed

2 tasks

jzuhone mentioned this issue Jul 4, 2023

Athena++ stretched grids support #4562

Merged

2 tasks

jzuhone closed this as completed Jul 5, 2023
