-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
racetest
improvement
#380
racetest
improvement
#380
Conversation
Dieter, thanks a lot for this PR. My first quick impression is positive. I will try to read the code again afresh tonight thoroughly. |
- Factor common logic to spawn and run group of test threads into TestGroup. This way the extra checks and robustness improvements, that Dieter just added to check_race_external_invalidate_vs_disconnect, become available to all tests in racetest module. Rework moved code so that nwork is not fixed beforehand and test threads can be added dynamically. - fix waiting logic in Finished: * on py2 Condition.wait does not return True/False as it does on py3 - we need to manually inspect the state. * fix race for when wait is called with already met condition: previously in such case it was waiting indefinitely and reporting failure on timeout - rename Finished to well-established concept of WaitGroup and adjust its interface accordingly (see https://pkg.go.dev/sync#WaitGroup, https://lab.nexedi.com/nexedi/pygolang/blob/master/golang/sync.py and https://lab.nexedi.com/nexedi/pygolang/blob/39dde7eb/golang/sync.cpp#L153-200) - no need to wrap try/except with additional try/finally, as try/except/finally works out of the box.
Hello Dieter. I have reviewed this PR by way of amending it via f2335127. Could you please have a look? I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications. |
Kirill Smelkov wrote at 2023-4-14 03:02 -0700:
...
I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.
You know that I do not mind when you improve my PRs in place.
I will check your modifications in the next few days
and report back.
|
Thanks, Dieter. P.S. there are two failures with "test did not finish within 60 seconds" on Windows: https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142450, https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142166. Those look to be due to overload of the machines maybe. But I'm not sure 100%. |
…t `self` Add the thread name to failure reports due to exceptions. First update `TestGroup.threadv` before the thread is started.
@navytux I (mostly) like your changes. Nevertheless, I made a few modifications - would you like to check them? |
Dieter, thanks. I'm ok with your amendments - thanks for making them and correcting things on my side.
I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them? |
Kirill Smelkov wrote at 2023-4-17 01:00 -0700:
...
I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?
Indeed, I forgot the `git add`.
Fixed now.
|
Thanks, Dieter. |
- Rename `TestGroup` -> `TestWorkGroup`. I originally named it as just WorkGroup similarly to sync.WorkGroup in pygolang, but later added "Test" prefix to highlight the difference that this class manages working group of threads that take part in tests, instead of group of arbitrary threads. However in that process I dropped the "Work" part which turned the name to be somewhat ambiguous: it is not clear whether "TestGroup" is a group of threads serving one test, or a group of separate tests without any relation to threading. To remove the ambiguity let's restore the "Work" in the name so that both "Test" and "WorkGroup" are there. - Review test_racetest.py a bit: * make it a bit more robust by increasing "ok" timeout from 0.1s to 10s. in my experience 0.1s is too little and will regularly hit "timeout" when CI machines are overloaded. I'm still somewhat uncomfortable with 0.1s timeout we left in tests that exercise timeout handling, but let's leave it as is for now and see how it goes. * add test for cases when only one thread fails out of many, and when several threads fail too. * minor cosmetics. - spellcheck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Dieter.
Thanks once again for adding the tests. I went through the code and reviewed it a bit once again. It looks good overall and I added only relatively minor tweaks in 2d45646. For me it looks close to be ready to be merged.
Would you please now take a look on your side? Hope we are converging.
Kirill
A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237. we need to either increase the wait time from 60 to something, or to see what's up with our windows runners. |
Kirill Smelkov wrote at 2023-4-18 02:22 -0700:
A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237.
we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.
If the only failure messages are "test did not end within ..." and
maybe some few "thread did not finish", then the problem
is only that we give the test insufficient time.
Should we see a lot "thread did not finish" or exceptions, then
the problem may be deeper.
I will look at the failures soon (tomorrow at the latest).
I assume that the occasional `zodbmaster` ZEO failures also
are caused by insufficient time for the `...race...disconnect` test.
|
Thanks, Dieter. As far as I see it was only "AssertionError: test did not finish within 60 seconds" this time. |
Kirill Smelkov wrote at 2023-4-18 04:08 -0700:
Thanks, Dieter. As far as I see it was only _"AssertionError: test did not finish within 60 seconds"_ this time.
I increased the allowed time from 60 t0 120 seconds.
Tests both in this PR as well as the ZEO PR have passed with this value.
Thank you for your cooperation.
|
Thanks, Dieter as well. I appreciate that it is possible to do kind of pair programming we did here for good. |
This PR tries to facilitate the failure analysis for
check_race_external_invalidate_vs_disconnect
."zopefoundation/ZEO#228" suffers occasionally from a failure of the test above. Unfortunately, the failure output is extremely difficult to analyze:
The PR protects the output of exception/traceback information from the concurrent threads with a lock and thereby prevents that the corresponding lines are mixed: exception output from one thread remains contiguous.
The PR also checks explicitly that the test finishes within the expected time. If this fails, a speaking failure is generated. Formerly, a too slow execution was recognized by one of the threads not stopping.
Finally, all threads are given a chance to stop gracefully in case of a problem -- formlerly the first thread which did not stop within the expected time cause the test to be terminated; the resulting server shutdown cause a huge amount of irrelevant
ClientDisconnected
exceptions.