Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

racetest improvement #380

Merged
merged 11 commits into from
Apr 18, 2023
Merged

racetest improvement #380

merged 11 commits into from
Apr 18, 2023

Conversation

d-maurer
Copy link
Contributor

This PR tries to facilitate the failure analysis for check_race_external_invalidate_vs_disconnect.

"zopefoundation/ZEO#228" suffers occasionally from a failure of the test above. Unfortunately, the failure output is extremely difficult to analyze:

  • exception and traceback lines from concurrent threads are mixed; it is almost impossible to reconstruct the exception/traceback information belonging to an individual thread
  • if a thread fails to stop, information about potential other problems is lost

The PR protects the output of exception/traceback information from the concurrent threads with a lock and thereby prevents that the corresponding lines are mixed: exception output from one thread remains contiguous.
The PR also checks explicitly that the test finishes within the expected time. If this fails, a speaking failure is generated. Formerly, a too slow execution was recognized by one of the threads not stopping.
Finally, all threads are given a chance to stop gracefully in case of a problem -- formlerly the first thread which did not stop within the expected time cause the test to be terminated; the resulting server shutdown cause a huge amount of irrelevant ClientDisconnected exceptions.

@navytux
Copy link
Contributor

navytux commented Apr 13, 2023

Dieter, thanks a lot for this PR. My first quick impression is positive. I will try to read the code again afresh tonight thoroughly.

- Factor common logic to spawn and run group of test threads into
  TestGroup. This way the extra checks and robustness improvements, that
  Dieter just added to check_race_external_invalidate_vs_disconnect,
  become available to all tests in racetest module.

  Rework moved code so that nwork is not fixed beforehand and test threads
  can be added dynamically.

- fix waiting logic in Finished:
  * on py2 Condition.wait does not return True/False as it does on py3 -
    we need to manually inspect the state.
  * fix race for when wait is called with already met condition:
    previously in such case it was waiting indefinitely and reporting
    failure on timeout

- rename Finished to well-established concept of WaitGroup and adjust
  its interface accordingly (see
  https://pkg.go.dev/sync#WaitGroup,
  https://lab.nexedi.com/nexedi/pygolang/blob/master/golang/sync.py and
  https://lab.nexedi.com/nexedi/pygolang/blob/39dde7eb/golang/sync.cpp#L153-200)

- no need to wrap try/except with additional try/finally, as
  try/except/finally works out of the box.
@navytux
Copy link
Contributor

navytux commented Apr 14, 2023

Hello Dieter. I have reviewed this PR by way of amending it via f2335127. Could you please have a look?

I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.

@d-maurer
Copy link
Contributor Author

d-maurer commented Apr 14, 2023 via email

@navytux
Copy link
Contributor

navytux commented Apr 14, 2023

Thanks, Dieter.

P.S. there are two failures with "test did not finish within 60 seconds" on Windows: https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142450, https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142166. Those look to be due to overload of the machines maybe. But I'm not sure 100%.

…t `self`

Add the thread name to failure reports due to exceptions.
First update `TestGroup.threadv` before the thread is started.
@d-maurer
Copy link
Contributor Author

@navytux I (mostly) like your changes. Nevertheless, I made a few modifications - would you like to check them?
We likely should add tests to verify that the new TestGroup behaves as expected in all situations.

@d-maurer d-maurer marked this pull request as ready for review April 17, 2023 06:58
@d-maurer d-maurer requested a review from navytux April 17, 2023 06:58
@navytux
Copy link
Contributor

navytux commented Apr 17, 2023

Dieter, thanks. I'm ok with your amendments - thanks for making them and correcting things on my side.

add TestGroup tests; minor cleanup

I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?

@d-maurer
Copy link
Contributor Author

d-maurer commented Apr 17, 2023 via email

@navytux
Copy link
Contributor

navytux commented Apr 17, 2023

Thanks, Dieter.

- Rename `TestGroup` -> `TestWorkGroup`. I originally named it as just
  WorkGroup similarly to sync.WorkGroup in pygolang, but later added
  "Test" prefix to highlight the difference that this class manages
  working group of threads that take part in tests, instead of group of
  arbitrary threads. However in that process I dropped the "Work" part
  which turned the name to be somewhat ambiguous: it is not clear whether
  "TestGroup" is a group of threads serving one test, or a group of
  separate tests without any relation to threading. To remove the
  ambiguity let's restore the "Work" in the name so that both "Test" and
  "WorkGroup" are there.

- Review test_racetest.py a bit:

  * make it a bit more robust by increasing "ok" timeout from 0.1s to 10s.
    in my experience 0.1s is too little and will regularly hit "timeout"
    when CI machines are overloaded. I'm still somewhat uncomfortable
    with 0.1s timeout we left in tests that exercise timeout handling,
    but let's leave it as is for now and see how it goes.

  * add test for cases when only one thread fails out of many, and when
    several threads fail too.

  * minor cosmetics.

- spellcheck.
Copy link
Contributor

@navytux navytux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Dieter.

Thanks once again for adding the tests. I went through the code and reviewed it a bit once again. It looks good overall and I added only relatively minor tweaks in 2d45646. For me it looks close to be ready to be merged.

Would you please now take a look on your side? Hope we are converging.

Kirill

@navytux
Copy link
Contributor

navytux commented Apr 18, 2023

A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237.

we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.

@d-maurer
Copy link
Contributor Author

d-maurer commented Apr 18, 2023 via email

@navytux
Copy link
Contributor

navytux commented Apr 18, 2023

Thanks, Dieter. As far as I see it was only "AssertionError: test did not finish within 60 seconds" this time.

@d-maurer d-maurer merged commit 30c861d into master Apr 18, 2023
@d-maurer d-maurer deleted the racetest branch April 18, 2023 15:21
@d-maurer
Copy link
Contributor Author

d-maurer commented Apr 18, 2023 via email

@navytux
Copy link
Contributor

navytux commented Apr 18, 2023

Thanks, Dieter as well. I appreciate that it is possible to do kind of pair programming we did here for good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants