Fix intermittent EWA test failures #482

djhoese · 2022-11-27T02:50:22Z

Hopefully ends up resolving #481 or at least parts of it.

djhoese · 2022-12-08T19:39:02Z

Wow I didn't even notice that I finally got it to fail. Here's the bottom of the test log:

 FAILED pyresample/test/test_dask_ewa.py::TestDaskEWAResampler::test_xarray_basic_ewa[10-True-float32-input_shape0-input_dims0-LegacyDaskEWAResampler-pyresample.ewa._legacy_dask_ewa] - AssertionError: assert 100 == (51 + 50)
 +  where 100 = <MagicMock name='ll2cr' id='1459486493952'>.call_count
==== 1 failed, 753 passed, 22 skipped, 1004 warnings in 257.70s (0:04:17) =====

And a slightly larger version of that traceback which is further up in the log:

     
            # ll2cr will be called once more because of the computation
>           assert ll2cr.call_count == ll2cr_calls + num_chunks
E           AssertionError: assert 100 == (51 + 50)
E            +  where 100 = <MagicMock name='ll2cr' id='1459486493952'>.call_count

djhoese · 2022-12-10T21:03:04Z

I added some print statements and here's what I get:

 ll2cr call count before resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=1
ll2cr call count after first compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count before second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count after second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count after second compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=100

One thing to note is this is the other shape variant from the failure I posted above (the 2D array case). I guess that makes sense the failure is in the ll2cr portion which is based on the 2D lon/lat arrays, not the 2D or 3D data.

The other thing to note here is that the number of calls from the second time the compute call is done there should be 50 more calls but there are only 49. This is very very odd. This makes me think either dask is deciding that two of the calls are the same (equivalent inputs so just reuse the outputs). For example ll2cr on chunks (0, 0) and chunks (2, 2) are hashing to the same inputs...wait dask shouldn't care per-chunk as it organizes/hashes things for the entire array and then assigns chunk indexes (ex. (my-example-func-ABCDEF12345, 0, 0)).

My other guess is some weird thing like a race condition if mock isn't thread-safe, especially on Windows apparently, and we're missing an increment of the mock object because two executions are trying to increment at the same time.

djhoese · 2022-12-11T20:18:04Z

I think I'm at 5 or so restarts so far and it hasn't failed yet. I'll try for 10 or 15 total.

djhoese · 2022-12-12T16:25:13Z

Great...still failing:

 FAILED pyresample/test/test_dask_ewa.py::TestDaskEWAResampler::test_xarray_basic_ewa[100-True-float64-input_shape1-input_dims1-LegacyDaskEWAResampler-pyresample.ewa._legacy_dask_ewa] - AssertionError: assert 99 == (50 + 50)
 +  where 99 = <MagicMock name='ll2cr' id='2646658360336'>.call_count

Debug prints:

 ll2cr call count before resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after first compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count before second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count after second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count after second compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=99

So for some reason, again mostly on Windows, one of the chunks is not being called.

djhoese · 2022-12-16T17:10:02Z

It took almost 50 restarts to get it to fail again:

 >           assert ll2cr.call_count == ll2cr_calls + num_chunks
E           AssertionError: assert 99 == (50 + 50)
E            +  where 99 = <MagicMock name='ll2cr' id='2367904042528'>.call_count

I added some print statements so lets see if I can decipher some more information from them.

djhoese · 2022-12-16T17:35:06Z

Ok so in my most recent commit I had the ll2cr map_blocks function print out chunk information. I think took that output from the failed test, put it in python, matched a regular expression against the pattern chunk_location': (\d, \d, \d), and got 50 results during the final (second) compute of the test. That tells me the function is being called the proper number of times but something about the multi-threaded nature is making mock unhappy.

djhoese · 2022-12-16T17:44:09Z

Ok so mocks are not technically thread safe. The way we use them is generally OK for basic "replace this functionality with this functionality", but in these EWA-specific tests I'm checking call count and that is not thread-safe. I'm changed to a synchronous dask scheduler for these tests which should fix this. I'll rerun the tests 10 or so times and try to break it then remove any remaining changes/prints in this PR and merge it.

codecov · 2022-12-17T03:39:22Z

Codecov Report

Merging #482 (0745ddd) into main (53ae8a9) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   94.32%   94.32%   -0.01%     
==========================================
  Files          74       74              
  Lines       12890    12889       -1     
==========================================
- Hits        12159    12158       -1     
  Misses        731      731

Flag	Coverage Δ
unittests	`94.32% <100.00%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pyresample/ewa/_legacy_dask_ewa.py	`84.69% <100.00%> (ø)`
pyresample/ewa/dask_ewa.py	`90.08% <100.00%> (-0.09%)`	⬇️
pyresample/test/test_dask_ewa.py	`98.95% <100.00%> (+<0.01%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

coveralls · 2022-12-17T03:54:43Z

Coverage decreased (-0.0005%) to 93.85% when pulling 0745ddd on djhoese:debug-ewa-ci-failure into 53ae8a9 on pytroll:main.

Start setting up CI for debugging EWA random failures

42b11ec

Add a ton of debug print statements to dask EWA tests

49f52c9

Add explicit dask "meta" information to EWA to avoid additional calls

f28246e

Add block_info debug information to legacy ll2cr map_blocks call

666d3f3

Use synchronous dask scheduler for EWA tests using thread unsafe mocks

2204fe5

Revert shortcuts in CI config

0745ddd

djhoese changed the title ~~Debug random EWA failures~~ Fix intermittent EWA test failures Dec 17, 2022

djhoese added the bug label Dec 17, 2022

djhoese self-assigned this Dec 17, 2022

djhoese merged commit 6a55f31 into pytroll:main Dec 17, 2022

djhoese deleted the debug-ewa-ci-failure branch December 17, 2022 15:41

djhoese mentioned this pull request Dec 17, 2022

Intermittent failures on 32bit architectures #481

Closed

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix intermittent EWA test failures #482

Fix intermittent EWA test failures #482

djhoese commented Nov 27, 2022

djhoese commented Dec 8, 2022

djhoese commented Dec 10, 2022

djhoese commented Dec 11, 2022

djhoese commented Dec 12, 2022

djhoese commented Dec 16, 2022

djhoese commented Dec 16, 2022

djhoese commented Dec 16, 2022

codecov bot commented Dec 17, 2022 •

edited

Loading

coveralls commented Dec 17, 2022

Fix intermittent EWA test failures #482

Fix intermittent EWA test failures #482

Conversation

djhoese commented Nov 27, 2022

djhoese commented Dec 8, 2022

djhoese commented Dec 10, 2022

djhoese commented Dec 11, 2022

djhoese commented Dec 12, 2022

djhoese commented Dec 16, 2022

djhoese commented Dec 16, 2022

djhoese commented Dec 16, 2022

codecov bot commented Dec 17, 2022 • edited Loading

Codecov Report

coveralls commented Dec 17, 2022

codecov bot commented Dec 17, 2022 •

edited

Loading