Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix @stresstest failure #5838

Merged

Conversation

edwintorok
Copy link
Contributor

Noticed by @psafont in https://github.com/xapi-project/xen-api/actions/runs/9901545401/job/27354091939#step:7:149.

The bug is in the test code, I misinterpreted what the meaning of the timeout parameter in Buf_io.really_input means, it is not the overall timeout, but the timeout for each individual read call.
So if the other end decides to send 1 byte at a time, then the overall timeout will be number of bytes * timeout.

This may be an unintentional bug in really_input, however it is unknown whether any callers rely on the current semantics or not, so fix the test instead to check for the actual behaviour of really_input.

There is also a bug in the use of setsockopt. It gets converted to a struct timeval, and the microseconds value truncated.
So if we end up asking for a too small timeout (which as a result of the above change we might!) then that would get truncated to 0 and instead we'd hang indefinetely, because 0 means "no timeout".

@edwintorok edwintorok marked this pull request as draft July 16, 2024 09:08
@edwintorok
Copy link
Contributor Author

Marking as draft there is another failure in the CI (which didn't happen locally):

  --- Failure --------------------------------------------------------------------
  
  Test Dune__exe__Bufio_test.test_buf_io failed (2 shrink steps):
  
  (delay_read: duration: 152ns;every_bytes: 1; delay_write: duration: 152ns;every_bytes: 1; size: 655363; file_kind: "socket", 0.001)
  
  +++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  
  Messages for test Dune__exe__Bufio_test.test_buf_io:
  
  Timed out earlier than requested: 289μs < 1ms
  ================================================================================
  failure (1 tests failed, 0 tests errored, ran 1 tests)

@edwintorok
Copy link
Contributor Author

Looks like the generated delay is too small, we can't perform delays <1us.
I'll update the generator and add a constraint.

@edwintorok edwintorok force-pushed the private/edvint/fix-timeouts branch 3 times, most recently from d2d16b4 to 636dea6 Compare July 16, 2024 15:16
The float argument gets converted to microseconds when making the system call.
If it is <1us then it gets converted to 0, which has special meaning
* for select/epoll it means timeout immediately
* for setsockopt SO_RCVTIMEO it means never timeout

So do not generate tests with such small timeout, instead replace them with no delays.

We cannot call `QCheck2.assume` here (it can only be called from the body of a test),
 otherwise sometimes we might get a failure like this:
```
Test Dune__exe__Unixext_test.test_proxy failed:

ERROR: uncaught exception in generator for test Dune__exe__Unixext_test.test_proxy after 100 steps:
Exception: QCheck2.Failed_precondition
Backtrace: Raised at QCheck2.assume in file "src/core/QCheck2.ml" (inlined), line 57, characters 29-54
Called from Xapi_fd_test__Generate.delay_of_size in file "ocaml/libs/xapi-stdext/lib/xapi-fd-test/generate.ml", line 66, characters 2-34
Called from QCheck2.Tree.bind in file "src/core/QCheck2.ml", line 207, characters 28-31
Called from QCheck2.Gen.bind in file "src/core/QCheck2.ml", line 271, characters 72-80
Called from QCheck2.Tree.bind in file "src/core/QCheck2.ml", line 207, characters 28-31
Called from QCheck2.Test.check_state in file "src/core/QCheck2.ml", line 1749, characters 12-32
```

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
… must be at least 1

Reject <1us timeouts, they are likely the result of a rounding down bug.

If we try to use a timeout that is >0, but < 1e-6 it'd be truncated to 0.
0 has a special meaning of 'no timeout', rather than immediately timeout.

The generator got already changed to avoid generating such short timeouts.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
@edwintorok
Copy link
Contributor Author

edwintorok commented Jul 16, 2024

--- Failure --------------------------------------------------------------------

Test Dune__exe__Bufio_test.test_buf_io failed (17 shrink steps):

(delay_read: duration: 4.28μs;every_bytes: 7; delay_write: duration: 4.28μs;every_bytes: 7; size: 655363; file_kind: "socket", 0.1)

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test Dune__exe__Bufio_test.test_buf_io:

Timed out earlier than requested: 414μs < 100ms

Still fails, I wonder whether this is because this is the old code using gettimeofday and select.
But why would it time out earlier than expected?

Ah nevermind, I'm still testing the overall timeout, but even though the overall timeout wasn't hit, the per operation timeout might've been.

…n call

really_input takes a timeout parameter, however this is the timeout
for each individual read call, not the entire function.

So if we have a sender that sends a 64KiB buffer 1 byte at a time,
and we set a timeout of 100ms then the operation can take up to 109 minutes.

The timeout generated by quick-check is meant to be the total timeout, so divide it by the
number of read/write calls that we are expected to make, to ensure that the overall timeout is not excessive.

If the timeout ends up being too small then we tell quickcheck to generate another configuration
with `assume`.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
@edwintorok
Copy link
Contributor Author

10 iterations of stresstest in ocaml/libs has passed now:

for i in $(seq 1 10); do dune build @stresstest --profile=release --no-buffer -j 8 --force; done

@edwintorok edwintorok marked this pull request as ready for review July 16, 2024 15:46
@edwintorok edwintorok merged commit b8b5fd0 into xapi-project:master Jul 16, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants