Fix intermittent EWA test failures #482
Wow I didn't even notice that I finally got it to fail. Here's the bottom of the test log:
And a slightly larger version of that traceback which is further up in the log:
I added some print statements and here's what I get:
One thing to note is that this is the other shape variant from the failure I posted above (the 2D array case). I guess that makes sense: the failure is in the ll2cr portion, which is based on the 2D lon/lat arrays, not the 2D or 3D data. The other thing to note is that the second compute call should add 50 more calls, but there are only 49. This is very, very odd. One guess is that dask is deciding that two of the calls are the same (equivalent inputs, so it just reuses the outputs). For example, ll2cr on chunks (0, 0) and chunks (2, 2) are hashing to the same inputs...wait, dask shouldn't care per-chunk, as it organizes/hashes things for the entire array and then assigns chunk indexes (ex. My other guess is some weird thing like a race condition if mock isn't thread-safe, especially on Windows apparently, and we're missing an increment of the mock object because two executions are trying to increment at the same time.
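To make the race-condition guess concrete, here is a minimal stdlib-only sketch of counting calls across threads with the increment guarded by a lock. It is not what the tests in this PR do; `LockedCallCounter`, `fake_chunk_work`, and `run_chunks` are hypothetical names invented for illustration, and a thread pool stands in for dask's threaded scheduler.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedCallCounter:
    """Wrap a callable and count its invocations under a lock (hypothetical helper)."""

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()
        self.call_count = 0

    def __call__(self, *args, **kwargs):
        # The read-modify-write on call_count happens while holding the lock,
        # so two threads cannot overwrite each other's increment.
        with self._lock:
            self.call_count += 1
        return self._func(*args, **kwargs)


def fake_chunk_work(chunk_index):
    """Stand-in for per-chunk work such as ll2cr (assumption, not the real function)."""
    return chunk_index * 2


def run_chunks(num_chunks=50, num_threads=8):
    """Run the counted function once per chunk on a thread pool."""
    counted = LockedCallCounter(fake_chunk_work)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(counted, range(num_chunks)))
    return counted.call_count, results
```

If a counter like this reported 50 while a plain mock reported 49 for the same run, that would point at the mock's bookkeeping rather than at dask skipping a chunk.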
I think I'm at 5 or so restarts so far and it hasn't failed yet. I'll try for 10 or 15 total.
Great...still failing:
Debug prints:
So for some reason, again mostly on Windows, one of the chunks is not being called.
It took almost 50 restarts to get it to fail again:
I added some print statements, so let's see if I can decipher some more information from them.
Ok so in my most recent commit I had the ll2cr map_blocks function print out chunk information. I then took that output from the failed test, put it in python, matched a regular expression against the pattern
Ok so mocks are technically not thread-safe. The way we use them is generally OK for basic "replace this functionality with that functionality", but in these EWA-specific tests I'm checking the call count, and that is not thread-safe. I've changed to a synchronous dask scheduler for these tests, which should fix this. I'll rerun the tests 10 or so times to try to break it, then remove any remaining changes/prints in this PR and merge it.
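For anyone hitting the same thing, here is a rough sketch of the idea, not the actual test code in this PR. It assumes dask and numpy are installed; `fake_ll2cr_block` and `count_block_calls` are hypothetical names, and the array shape and chunking are made up. The compute runs under `dask.config.set(scheduler="synchronous")`, so every block function call happens in one thread and the mock's call count can be trusted.

```python
from unittest import mock

import dask
import dask.array as da
import numpy as np


def count_block_calls():
    """Count per-chunk calls with a mock while computing on a single thread."""
    call_recorder = mock.Mock()

    def fake_ll2cr_block(block):
        # Hypothetical stand-in for the real per-chunk function.
        call_recorder()
        return block + 1

    # 50x50 array split into 10x10 chunks: 5 * 5 = 25 chunks.
    lons = da.zeros((50, 50), chunks=(10, 10))
    result = lons.map_blocks(
        fake_ll2cr_block,
        dtype=lons.dtype,
        meta=np.empty((0, 0), dtype=lons.dtype),
    )

    # Discard any calls made while building the graph so only compute-time
    # calls are counted.
    call_recorder.reset_mock()

    # The synchronous scheduler runs tasks one at a time in the calling
    # thread, so increments of call_count cannot race.
    with dask.config.set(scheduler="synchronous"):
        computed = result.compute()

    return call_recorder.call_count, computed
```

The trade-off is that these tests no longer exercise the threaded scheduler, which seems acceptable when the assertion is about how many times a function was called rather than about parallel behavior.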
Codecov Report
@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
- Coverage   94.32%   94.32%   -0.01%
==========================================
  Files          74       74
  Lines       12890    12889       -1
==========================================
- Hits        12159    12158       -1
  Misses        731      731
Hopefully ends up resolving #481 or at least parts of it.