`racetest` improvement #380

d-maurer · 2023-04-13T08:34:07Z

This PR tries to facilitate the failure analysis for check_race_external_invalidate_vs_disconnect.

"zopefoundation/ZEO#228" suffers occasionally from a failure of the test above. Unfortunately, the failure output is extremely difficult to analyze:

exception and traceback lines from concurrent threads are mixed; it is almost impossible to reconstruct the exception/traceback information belonging to an individual thread
if a thread fails to stop, information about potential other problems is lost

The PR protects the output of exception/traceback information from the concurrent threads with a lock and thereby prevents that the corresponding lines are mixed: exception output from one thread remains contiguous.
The PR also checks explicitly that the test finishes within the expected time. If this fails, a speaking failure is generated. Formerly, a too slow execution was recognized by one of the threads not stopping.
Finally, all threads are given a chance to stop gracefully in case of a problem -- formlerly the first thread which did not stop within the expected time cause the test to be terminated; the resulting server shutdown cause a huge amount of irrelevant ClientDisconnected exceptions.

…idate_vs_disconnect`

navytux · 2023-04-13T11:33:06Z

Dieter, thanks a lot for this PR. My first quick impression is positive. I will try to read the code again afresh tonight thoroughly.

- Factor common logic to spawn and run group of test threads into TestGroup. This way the extra checks and robustness improvements, that Dieter just added to check_race_external_invalidate_vs_disconnect, become available to all tests in racetest module. Rework moved code so that nwork is not fixed beforehand and test threads can be added dynamically. - fix waiting logic in Finished: * on py2 Condition.wait does not return True/False as it does on py3 - we need to manually inspect the state. * fix race for when wait is called with already met condition: previously in such case it was waiting indefinitely and reporting failure on timeout - rename Finished to well-established concept of WaitGroup and adjust its interface accordingly (see https://pkg.go.dev/sync#WaitGroup, https://lab.nexedi.com/nexedi/pygolang/blob/master/golang/sync.py and https://lab.nexedi.com/nexedi/pygolang/blob/39dde7eb/golang/sync.cpp#L153-200) - no need to wrap try/except with additional try/finally, as try/except/finally works out of the box.

navytux · 2023-04-14T10:02:07Z

Hello Dieter. I have reviewed this PR by way of amending it via f2335127. Could you please have a look?

I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.

d-maurer · 2023-04-14T10:11:55Z

Kirill Smelkov wrote at 2023-4-14 03:02 -0700:

... I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.

You know that I do not mind when you improve my PRs in place. I will check your modifications in the next few days and report back.

navytux · 2023-04-14T10:16:33Z

Thanks, Dieter.

P.S. there are two failures with "test did not finish within 60 seconds" on Windows: https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142450, https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142166. Those look to be due to overload of the machines maybe. But I'm not sure 100%.

…t `self` Add the thread name to failure reports due to exceptions. First update `TestGroup.threadv` before the thread is started.

d-maurer · 2023-04-15T08:40:13Z

@navytux I (mostly) like your changes. Nevertheless, I made a few modifications - would you like to check them?
We likely should add tests to verify that the new TestGroup behaves as expected in all situations.

navytux · 2023-04-17T08:00:48Z

Dieter, thanks. I'm ok with your amendments - thanks for making them and correcting things on my side.

add TestGroup tests; minor cleanup

I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?

d-maurer · 2023-04-17T08:58:09Z

Kirill Smelkov wrote at 2023-4-17 01:00 -0700:

... I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?

Indeed, I forgot the `git add`. Fixed now.

navytux · 2023-04-17T10:01:34Z

Thanks, Dieter.

- Rename `TestGroup` -> `TestWorkGroup`. I originally named it as just WorkGroup similarly to sync.WorkGroup in pygolang, but later added "Test" prefix to highlight the difference that this class manages working group of threads that take part in tests, instead of group of arbitrary threads. However in that process I dropped the "Work" part which turned the name to be somewhat ambiguous: it is not clear whether "TestGroup" is a group of threads serving one test, or a group of separate tests without any relation to threading. To remove the ambiguity let's restore the "Work" in the name so that both "Test" and "WorkGroup" are there. - Review test_racetest.py a bit: * make it a bit more robust by increasing "ok" timeout from 0.1s to 10s. in my experience 0.1s is too little and will regularly hit "timeout" when CI machines are overloaded. I'm still somewhat uncomfortable with 0.1s timeout we left in tests that exercise timeout handling, but let's leave it as is for now and see how it goes. * add test for cases when only one thread fails out of many, and when several threads fail too. * minor cosmetics. - spellcheck.

navytux

Hello Dieter.

Thanks once again for adding the tests. I went through the code and reviewed it a bit once again. It looks good overall and I added only relatively minor tweaks in 2d45646. For me it looks close to be ready to be merged.

Would you please now take a look on your side? Hope we are converging.

Kirill

navytux · 2023-04-18T09:22:20Z

A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237.

we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.

d-maurer · 2023-04-18T10:15:29Z

Kirill Smelkov wrote at 2023-4-18 02:22 -0700:

A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237. we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.

If the only failure messages are "test did not end within ..." and maybe some few "thread did not finish", then the problem is only that we give the test insufficient time. Should we see a lot "thread did not finish" or exceptions, then the problem may be deeper. I will look at the failures soon (tomorrow at the latest). I assume that the occasional `zodbmaster` ZEO failures also are caused by insufficient time for the `...race...disconnect` test.

navytux · 2023-04-18T11:08:11Z

Thanks, Dieter. As far as I see it was only "AssertionError: test did not finish within 60 seconds" this time.

d-maurer · 2023-04-18T16:01:12Z

Kirill Smelkov wrote at 2023-4-18 04:08 -0700:

Thanks, Dieter. As far as I see it was only _"AssertionError: test did not finish within 60 seconds"_ this time.

I increased the allowed time from 60 t0 120 seconds. Tests both in this PR as well as the ZEO PR have passed with this value. Thank you for your cooperation.

navytux · 2023-04-18T18:16:45Z

Thanks, Dieter as well. I appreciate that it is possible to do kind of pair programming we did here for good.

d-maurer added 4 commits April 13, 2023 10:13

try to facilitate the failure analysis for `check_race_external_inval…

006bb22

…idate_vs_disconnect`

minor improvement

5957607

fix test failures

b81442a

make flake8 happy

017928c

d-maurer mentioned this pull request Apr 14, 2023

Fix #226 using a standard asyncio.Task zopefoundation/ZEO#228

Merged

Adhere in TestGroup to the Python standard to name the method targe…

b90d90c

…t `self` Add the thread name to failure reports due to exceptions. First update `TestGroup.threadv` before the thread is started.

add TestGroup tests; minor cleanup

a0867aa

d-maurer marked this pull request as ready for review April 17, 2023 06:58

d-maurer requested a review from navytux April 17, 2023 06:58

add forgotten tests

48315c6

make flake8 happy

2de5ec1

navytux requested changes Apr 18, 2023

View reviewed changes

navytux approved these changes Apr 18, 2023

View reviewed changes

increase time for the racetests from 60 to 120 seconds

ca41ccd

d-maurer merged commit 30c861d into master Apr 18, 2023

d-maurer deleted the racetest branch April 18, 2023 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`racetest` improvement #380

`racetest` improvement #380

d-maurer commented Apr 13, 2023

navytux commented Apr 13, 2023

navytux commented Apr 14, 2023

d-maurer commented Apr 14, 2023 via email

navytux commented Apr 14, 2023

d-maurer commented Apr 15, 2023

navytux commented Apr 17, 2023

d-maurer commented Apr 17, 2023 via email

navytux commented Apr 17, 2023

navytux left a comment

navytux commented Apr 18, 2023

d-maurer commented Apr 18, 2023 via email

navytux commented Apr 18, 2023

d-maurer commented Apr 18, 2023 via email

navytux commented Apr 18, 2023

racetest improvement #380

racetest improvement #380

Conversation

d-maurer commented Apr 13, 2023

navytux commented Apr 13, 2023

navytux commented Apr 14, 2023

d-maurer commented Apr 14, 2023 via email

navytux commented Apr 14, 2023

d-maurer commented Apr 15, 2023

navytux commented Apr 17, 2023

d-maurer commented Apr 17, 2023 via email

navytux commented Apr 17, 2023

navytux left a comment

Choose a reason for hiding this comment

navytux commented Apr 18, 2023

d-maurer commented Apr 18, 2023 via email

navytux commented Apr 18, 2023

d-maurer commented Apr 18, 2023 via email

navytux commented Apr 18, 2023

`racetest` improvement #380

`racetest` improvement #380