Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent EWA test failures #482

Merged
merged 6 commits into from
Dec 17, 2022

Conversation

djhoese
Copy link
Member

@djhoese djhoese commented Nov 27, 2022

Hopefully ends up resolving #481 or at least parts of it.

@djhoese
Copy link
Member Author

djhoese commented Dec 8, 2022

Wow I didn't even notice that I finally got it to fail. Here's the bottom of the test log:

 FAILED pyresample/test/test_dask_ewa.py::TestDaskEWAResampler::test_xarray_basic_ewa[10-True-float32-input_shape0-input_dims0-LegacyDaskEWAResampler-pyresample.ewa._legacy_dask_ewa] - AssertionError: assert 100 == (51 + 50)
 +  where 100 = <MagicMock name='ll2cr' id='1459486493952'>.call_count
==== 1 failed, 753 passed, 22 skipped, 1004 warnings in 257.70s (0:04:17) =====

And a slightly larger version of that traceback which is further up in the log:

     
            # ll2cr will be called once more because of the computation
>           assert ll2cr.call_count == ll2cr_calls + num_chunks
E           AssertionError: assert 100 == (51 + 50)
E            +  where 100 = <MagicMock name='ll2cr' id='1459486493952'>.call_count

@djhoese
Copy link
Member Author

djhoese commented Dec 10, 2022

I added some print statements and here's what I get:

 ll2cr call count before resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=1
ll2cr call count after first compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count before second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count after second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=51
ll2cr call count after second compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=100

One thing to note is this is the other shape variant from the failure I posted above (the 2D array case). I guess that makes sense the failure is in the ll2cr portion which is based on the 2D lon/lat arrays, not the 2D or 3D data.

The other thing to note here is that the number of calls from the second time the compute call is done there should be 50 more calls but there are only 49. This is very very odd. This makes me think either dask is deciding that two of the calls are the same (equivalent inputs so just reuse the outputs). For example ll2cr on chunks (0, 0) and chunks (2, 2) are hashing to the same inputs...wait dask shouldn't care per-chunk as it organizes/hashes things for the entire array and then assigns chunk indexes (ex. (my-example-func-ABCDEF12345, 0, 0)).

My other guess is some weird thing like a race condition if mock isn't thread-safe, especially on Windows apparently, and we're missing an increment of the mock object because two executions are trying to increment at the same time.

@djhoese
Copy link
Member Author

djhoese commented Dec 11, 2022

I think I'm at 5 or so restarts so far and it hasn't failed yet. I'll try for 10 or 15 total.

@djhoese
Copy link
Member Author

djhoese commented Dec 12, 2022

Great...still failing:

 FAILED pyresample/test/test_dask_ewa.py::TestDaskEWAResampler::test_xarray_basic_ewa[100-True-float64-input_shape1-input_dims1-LegacyDaskEWAResampler-pyresample.ewa._legacy_dask_ewa] - AssertionError: assert 99 == (50 + 50)
 +  where 99 = <MagicMock name='ll2cr' id='2646658360336'>.call_count

Debug prints:

 ll2cr call count before resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=0
ll2cr call count after first compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count before second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count after second resample call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=50
ll2cr call count after second compute call (<class 'pyresample.ewa._legacy_dask_ewa.LegacyDaskEWAResampler'> | input_shape=(3, 100, 50)): ll2cr.call_count=99

So for some reason, again mostly on Windows, one of the chunks is not being called.

@djhoese
Copy link
Member Author

djhoese commented Dec 16, 2022

It took almost 50 restarts to get it to fail again:

 >           assert ll2cr.call_count == ll2cr_calls + num_chunks
E           AssertionError: assert 99 == (50 + 50)
E            +  where 99 = <MagicMock name='ll2cr' id='2367904042528'>.call_count

I added some print statements so lets see if I can decipher some more information from them.

@djhoese
Copy link
Member Author

djhoese commented Dec 16, 2022

Ok so in my most recent commit I had the ll2cr map_blocks function print out chunk information. I think took that output from the failed test, put it in python, matched a regular expression against the pattern chunk_location': (\d, \d, \d), and got 50 results during the final (second) compute of the test. That tells me the function is being called the proper number of times but something about the multi-threaded nature is making mock unhappy.

@djhoese
Copy link
Member Author

djhoese commented Dec 16, 2022

Ok so mocks are not technically thread safe. The way we use them is generally OK for basic "replace this functionality with this functionality", but in these EWA-specific tests I'm checking call count and that is not thread-safe. I'm changed to a synchronous dask scheduler for these tests which should fix this. I'll rerun the tests 10 or so times and try to break it then remove any remaining changes/prints in this PR and merge it.

@codecov
Copy link

codecov bot commented Dec 17, 2022

Codecov Report

Merging #482 (0745ddd) into main (53ae8a9) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   94.32%   94.32%   -0.01%     
==========================================
  Files          74       74              
  Lines       12890    12889       -1     
==========================================
- Hits        12159    12158       -1     
  Misses        731      731              
Flag Coverage Δ
unittests 94.32% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pyresample/ewa/_legacy_dask_ewa.py 84.69% <100.00%> (ø)
pyresample/ewa/dask_ewa.py 90.08% <100.00%> (-0.09%) ⬇️
pyresample/test/test_dask_ewa.py 98.95% <100.00%> (+<0.01%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.0005%) to 93.85% when pulling 0745ddd on djhoese:debug-ewa-ci-failure into 53ae8a9 on pytroll:main.

@djhoese djhoese changed the title Debug random EWA failures Fix intermittent EWA test failures Dec 17, 2022
@djhoese djhoese added the bug label Dec 17, 2022
@djhoese djhoese self-assigned this Dec 17, 2022
@djhoese djhoese merged commit 6a55f31 into pytroll:main Dec 17, 2022
@djhoese djhoese deleted the debug-ewa-ci-failure branch December 17, 2022 15:41
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants