rptest: fix test_exceed_broker_limit flake #17932
Conversation
makes sense. lgtm
# producer, since otherwise the consumers and the producer race and the producer
# may win in which case it would be one of the consumers that fail to connect
self.redpanda.wait_until(
    lambda: connection_count() == 6, 60, 1,
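The waiting logic above can be illustrated with a small self-contained sketch. `wait_until` here is a simplified stand-in for the test framework's polling helper, and the ramping `samples` iterator is a hypothetical substitute for the broker's real connection metric, mimicking the two consumers connecting over successive polls:

```python
import time

def wait_until(condition, timeout_sec, backoff_sec, err_msg=""):
    # Poll `condition` until it returns True, or raise once timeout_sec elapses.
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() > deadline:
            raise TimeoutError(err_msg or "condition not met in time")
        time.sleep(backoff_sec)

# Hypothetical stand-in for the broker-side connection metric: the count
# ramps up over successive polls as the consumers establish connections.
samples = iter([2, 4, 6])
def connection_count():
    return next(samples)

# Blocks until the metric reports all 6 expected consumer connections.
wait_until(lambda: connection_count() == 6, timeout_sec=5, backoff_sec=0.01)
```

Only once this wait returns does the test go on to start the producer, which removes the startup race.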
any chance this becomes flaky? for example a once-in-a-while connection gets dropped and re-opened by franz-go/rpk but the metric doesn't update so quickly?
I did consider the possible flakiness here.
For one thing, this just wouldn't work on a system with any unaccounted-for connections: e.g., in a cloud test with other clients connected (say, kminion) this condition might simply never hold (connections may never hit 6, may be above 6 from the start, or may jump from 1 to 4 to 7, or something like that). However, this test is already written in a way that expects a clean system, since it counts connections exactly.
The metric itself updates instantly, but it's based on RP's view of the connections, so if a connection was silently dropped then it could continue to reflect a non-existent connection for a while as you suggest. This doesn't seem that likely as we are making fresh connections and waiting for the number to hit the expected count which generally happens almost instantly.
However, I think it would reduce the chance of future flakiness if I made this >= 6, rather than == 6, at the cost of not being informed about unexpected changes in the connection behavior. I think that's probably closer to the original intent of this test. I'll make that change unless anyone disagrees.
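The relaxed predicate can be sketched as a tiny self-contained example (`consumers_connected` and the lambda counters are hypothetical illustrations, not the test's actual helpers):

```python
def consumers_connected(connection_count, expected=6):
    # Relaxed check: at least `expected` connections must be open before
    # the producer starts; extra, unaccounted-for connections (e.g. a
    # monitoring client) no longer fail the wait, unlike a strict == check.
    return connection_count() >= expected

print(consumers_connected(lambda: 6))  # True: exact count still passes
print(consumers_connected(lambda: 7))  # True: an extra connection is tolerated
print(consumers_connected(lambda: 5))  # False: consumers not fully connected yet
```

The trade-off, as noted above, is that an unexpected rise in connection count would now go unnoticed rather than surfacing as a failure.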
Changed to >= 6
in cb22f0a
gtest_raft_rpunit failed. this pr touches no .cc file or files associated with unit testing.

/backport v23.3.x
ConnectionLimitsTest.test_exceed_broker_limit had a spurious failure in CI. This test starts 2 consumers which (should) consume all 6 available connections, then checks that a producer started after that fails to produce (due to connection limit being hit).
However the consumer & producer starts are all async, so the producer can race ahead of one of the consumers and grab the connections for itself, failing the test.
Change the test to wait for the consumers to connect, by waiting until the connection metric hits 6, before starting the producer.
Fixes #17897.